← All digests
📡

AI Developer Digest

Fri, May 1, 2026

Model Releases

Mistral Medium 3.5 — 128B Dense, 256k Context, Open Weights

Source: Mistral AI | Date: April 29, 2026 (HN thread April 30) | Link: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5 TL;DR: Mistral releases a 128B dense open-weight model with 256k context window, 77.6% on SWE-Bench Verified, configurable per-request reasoning effort, and EAGLE speculative decoding variant. Dev signal: API pricing $1.50/1M input, $7.50/1M output. Open weights on HuggingFace (huggingface.co/mistralai/Mistral-Medium-3.5-128B) under modified MIT — self-hostable on 4× GPUs. A separate Mistral-Medium-3.5-128B-EAGLE variant enables speculative inference for lower latency. Reasoning effort is configurable per request (quick chat vs. full agentic runs from the same weights). Competes with Gemini 3.1 Pro Preview (78.8% SWE-Bench) at a fraction of the API cost. Primary source: https://huggingface.co/mistralai/Mistral-Medium-3.5-128B | https://news.ycombinator.com/item?id=47949642


API & SDK Changes

ACTION REQUIRED: gemini-robotics-er-1.5-preview Shut Down Today

Source: Google AI for Developers | Date: April 30, 2026, 9AM PST | Link: https://ai.google.dev/gemini-api/docs/deprecations TL;DR: The gemini-robotics-er-1.5-preview model endpoint was decommissioned today at 9AM PST — all calls to it now return errors. Dev signal: Migrate all code referencing gemini-robotics-er-1.5-preview to gemini-robotics-er-1.6-preview, which added instrument reading, improved spatial reasoning, and better physical reasoning. Grep your configs and environment variables now. Primary source: https://ai.google.dev/gemini-api/docs/deprecations


Research Papers

Nothing today that cleared the quality bar (no May 1 submissions from recognized labs with code + benchmarks confirmed).


Tooling Updates

llama.cpp — 32-bit WASM >2GB Models + Vulkan Mixed-Quant Flash Attention + Fast Int MatMul

Source: ggml-org/llama.cpp | Date: April 30–May 1, 2026 (b8984, b8992, b8995) | Link: https://github.com/ggml-org/llama.cpp/releases

TL;DR: Three back-to-back releases shipped: 32-bit WebAssembly now supports models larger than 2GB; Vulkan coopmat2 path gains asymmetric flash attention with mixed quantization types (incl. Q1_0); fast matrix multiplication landed for integer quantization types.

Breakdown:

  • b8992 (Apr 30, 23:33 UTC): Switches memory mapping to ftello/fseeko, removing the 2GB model ceiling in 32-bit WASM. Previously, 32-bit WebAssembly environments couldn't address model files >2GB due to 32-bit file offset limits. PR #22497.
  • b8995 (May 1, 16:10 UTC): Vulkan coopmat2 path now supports asymmetric flash attention with mixed quantization types including Q1_0. Completes the originally intended design of the cm2 FA shader.
  • b8984 (Apr 30, 10:35 UTC): Fast matrix multiplication for integer quantization types — no benchmark numbers in release notes, but targets throughput for quantized model inference.

Dev signal: If you serve llama.cpp in a 32-bit WASM environment (e.g., browser or embedded runtime) and were hard-capped at <2GB models, upgrade to b8992+. Vulkan users running coopmat2 with mixed quantization can now use Q1_0 in flash attention — check VRAM headroom. The fast int matmul in b8984 may benefit quantized model throughput but needs your own benchmarking to confirm. Primary source: https://github.com/ggml-org/llama.cpp/releases/tag/b8992 | https://github.com/ggml-org/llama.cpp/releases/tag/b8995 | https://github.com/ggml-org/llama.cpp/releases/tag/b8984


Technical Discussions

Nothing today that cleared the quality bar.


Quality gate excluded ~32 items. Several significant releases from the past week fell outside the 24-48h window: GPT-5.5 (Apr 24, API pricing $5/$30/1M, Terminal-Bench 82.7%), Claude Opus 4.7 (Apr 16, SWE-bench 87.6%), Gemma 4 (Apr 2, 31B Apache 2.0, AIME 2026 89.2%), Ollama v0.22.0 (Apr 28, Nemotron 3 Omni + Poolside Laguna XS.2). Check prior digests for those.

← All digestspersonal/digests/ai-2026-05-01.md