AI Developer Digest

Fri, May 1, 2026

3 signals that cleared the gate | 3 min read

Model Releases

Mistral Medium 3.5 — 128B Dense, 256k Context, Open Weights

TL;DR

Mistral releases a 128B dense open-weight model with a 256k context window, 77.6% on SWE-Bench Verified, per-request configurable reasoning effort, and an EAGLE speculative-decoding variant.

Developer signal

API pricing: $1.50/1M input tokens, $7.50/1M output tokens. Open weights are on HuggingFace (huggingface.co/mistralai/Mistral-Medium-3.5-128B) under a modified MIT license and are self-hostable on 4× GPUs. A separate Mistral-Medium-3.5-128B-EAGLE variant enables speculative decoding for lower latency. Reasoning effort is configurable per request (quick chat vs. full agentic runs from the same weights). Competes with Gemini 3.1 Pro Preview (78.8% SWE-Bench) at a fraction of the API cost.
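To make the pricing concrete, here is a minimal sketch that estimates per-request cost from the two rates quoted above. The function name and the example token counts are hypothetical and chosen only for illustration; check Mistral's pricing page before budgeting against these numbers.

```python
# Rates quoted in this digest, in USD per 1M tokens.
INPUT_USD_PER_MTOK = 1.50
OUTPUT_USD_PER_MTOK = 7.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost in USD for one request (illustrative helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical agentic run: 200k tokens in, 20k tokens out.
print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $0.45
```

Output tokens cost five times as much as input tokens at these rates, so long reasoning traces at high effort settings are likely to dominate the bill.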

Mistral AI | Date: April 29, 2026 (HN thread April 30) | Link: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5 | https://huggingface.co/mistralai/Mistral-Medium-3.5-128B | https://news.ycombinator.com/item?id=47949642

API & SDK Changes

ACTION REQUIRED: gemini-robotics-er-1.5-preview Shut Down Today

TL;DR

The gemini-robotics-er-1.5-preview model endpoint was decommissioned today at 9AM PST — all calls to it now return errors.

Developer signal

Migrate all code referencing gemini-robotics-er-1.5-preview to gemini-robotics-er-1.6-preview, which added instrument reading, improved spatial reasoning, and better physical reasoning. Grep your configs and environment variables now.
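If a plain grep is awkward in your setup, the scan can be sketched in Python as follows. This is a minimal sketch under assumptions: the list of file suffixes is a guess at where model IDs usually live, and the helper names are hypothetical. It only reports matches and does not rewrite anything.

```python
import os
from pathlib import Path

OLD_MODEL = "gemini-robotics-er-1.5-preview"
NEW_MODEL = "gemini-robotics-er-1.6-preview"

# Assumed set of file types worth scanning; adjust for your repository.
SCAN_SUFFIXES = {".py", ".ts", ".js", ".json", ".yaml", ".yml", ".toml", ".sh"}


def find_in_files(root: str) -> list[tuple[str, int]]:
    """Return (path, line_number) pairs for lines that mention OLD_MODEL."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Dotenv files such as ".env" or ".env.local" have no useful suffix.
        if path.suffix not in SCAN_SUFFIXES and not path.name.startswith(".env"):
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if OLD_MODEL in line:
                hits.append((str(path), lineno))
    return hits


def find_in_env(environ=None) -> list[str]:
    """Return names of environment variables whose value mentions OLD_MODEL."""
    environ = os.environ if environ is None else environ
    return sorted(name for name, value in environ.items() if OLD_MODEL in value)
```

Calling `find_in_files(".")` from the repository root and `find_in_env()` in the deployed process covers the two places the item above names; each hit is a spot where `OLD_MODEL` should become `NEW_MODEL`.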

Google AI for Developers | Date: April 30, 2026, 9AM PST | Link: https://ai.google.dev/gemini-api/docs/deprecations

Research Papers

Nothing today that cleared the quality bar (no May 1 submissions from recognized labs with code + benchmarks confirmed).

Tooling Updates

llama.cpp — 32-bit WASM >2GB Models + Vulkan Mixed-Quant Flash Attention + Fast Int MatMul

TL;DR

Three back-to-back releases shipped: 32-bit WebAssembly now supports models larger than 2GB; Vulkan coopmat2 path gains asymmetric flash attention with mixed quantization types (incl. Q1_0); fast matrix multiplication landed for integer quantization types.

b8992 (Apr 30, 23:33 UTC): Switches memory mapping to ftello/fseeko, removing the 2GB model ceiling in 32-bit WASM. Previously, 32-bit WebAssembly environments couldn't address model files >2GB due to 32-bit file offset limits. PR #22497.

b8995 (May 1, 16:10 UTC): Vulkan coopmat2 path now supports asymmetric flash attention with mixed quantization types including Q1_0. Completes the originally intended design of the cm2 FA shader.

b8984 (Apr 30, 10:35 UTC): Fast matrix multiplication for integer quantization types — no benchmark numbers in release notes, but targets throughput for quantized model inference.
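The 2GB ceiling fixed in b8992 follows from integer width. The sketch below assumes the old code path held file offsets in a signed 32-bit integer; the release note only says "32-bit file offset limits", so the signedness is an assumption. It illustrates the arithmetic only and is not llama.cpp code.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(offset: int) -> bool:
    """True if a file offset is representable as a signed 32-bit integer."""
    return 0 <= offset <= INT32_MAX


def as_int32(offset: int) -> int:
    """Reinterpret an offset the way a signed 32-bit variable would store it."""
    return ctypes.c_int32(offset).value


three_gib = 3 * 1024**3
print(fits_in_int32(three_gib))  # prints False
print(as_int32(three_gib))       # prints -1073741824 (wrapped to a negative offset)
```

An offset past 2**31 - 1 cannot be represented, so seeking into the tail of a 3 GiB file fails or lands at the wrong position. Interfaces that take a 64-bit offset type avoid the wraparound.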

Developer signal

If you serve llama.cpp in a 32-bit WASM environment (e.g., browser or embedded runtime) and were hard-capped at <2GB models, upgrade to b8992+. Vulkan users running coopmat2 with mixed quantization can now use Q1_0 in flash attention — check VRAM headroom. The fast int matmul in b8984 may benefit quantized model throughput but needs your own benchmarking to confirm.
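To check which of these changes a given build includes, here is a minimal sketch that compares release tags. It assumes tags have the form `b<number>` and that a higher number means a later build; the feature keys are hypothetical labels for the three releases in this item, not names used by llama.cpp.

```python
import re

_TAG_RE = re.compile(r"^b(\d+)$")

# Minimum build per change, taken from the release tags in this digest.
MIN_BUILD = {
    "fast_int_matmul": "b8984",
    "wasm32_models_over_2gb": "b8992",
    "vulkan_cm2_mixed_quant_fa": "b8995",
}


def build_number(tag: str) -> int:
    """Parse a tag such as 'b8992' into the integer 8992."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def features_available(current_tag: str) -> dict[str, bool]:
    """Map each change to whether the given build is at or past its minimum."""
    current = build_number(current_tag)
    return {name: current >= build_number(tag) for name, tag in MIN_BUILD.items()}


print(features_available("b8992"))
```

A build at b8992 reports the WASM and integer matmul changes as present and the Vulkan flash attention change as missing, which matches the release order above.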

ggml-org/llama.cpp | Date: April 30–May 1, 2026 (b8984, b8992, b8995) | Link: https://github.com/ggml-org/llama.cpp/releases | https://github.com/ggml-org/llama.cpp/releases/tag/b8992 | https://github.com/ggml-org/llama.cpp/releases/tag/b8995 | https://github.com/ggml-org/llama.cpp/releases/tag/b8984

Technical Discussions

Nothing today that cleared the quality bar.

Quality gate excluded ~32 items. Several significant releases from the past week fell outside the 24-48h window: GPT-5.5 (Apr 24, API pricing $5/$30/1M, Terminal-Bench 82.7%), Claude Opus 4.7 (Apr 16, SWE-bench 87.6%), Gemma 4 (Apr 2, 31B Apache 2.0, AIME 2026 89.2%), Ollama v0.22.0 (Apr 28, Nemotron 3 Omni + Poolside Laguna XS.2). Check prior digests for those.

Filtered from 30+ primary sources against a published quality rubric. No press releases, no fluff — only what changes what you build.