AI Developer Digest

Tue, May 5, 2026

7 signals that cleared the gate | 20 scanned | 5 min read

Model Releases

Nothing today — no new model GA announcements within the last 24 hours. (GPT-5.5: April 24; Claude Opus 4.7: April 16; Mistral Medium 3.5: May 2 — all outside window.)

API & SDK Changes

Gemini API File Search is Now Multimodal

TL;DR

The Gemini API's File Search tool now indexes and retrieves images alongside text, powered by Gemini Embedding 2, with custom metadata filters and page-level citations for RAG pipelines.

Developer signal

If you're building RAG on Gemini, you can now store charts, product images, and diagrams in the same File Search corpus as your text docs and query across all of them in a single API call. Page citations are returned for grounding. Update your integration and check the new metadata_filters param.

Google Developers Blog | Date: 2026-05-05 | Link: https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/ | https://ai.google.dev/gemini-api/docs/changelog

Anthropic Python SDK v0.98.0 + v0.99.0

TL;DR

v0.98.0 adds Workload Identity Federation, interactive OAuth, and named auth profiles to the client; v0.99.0 adds OIDC federation token exchange scoped to a specific workspace.

Developer signal

Enterprise deployments using service accounts or SSO can now authenticate without static API keys — pass auth_profile="my-profile" or use the new workload_identity_federation client option. The streaming bug where stop_details was missing from accumulated Message objects (issue #1725) is also fixed. Run pip install anthropic==0.99.0.

anthropics/anthropic-sdk-python GitHub | Date: 2026-05-04 (v0.98.0), 2026-05-05 (v0.99.0) | Link: https://github.com/anthropics/anthropic-sdk-python/releases | https://github.com/anthropics/anthropic-sdk-python/compare/v0.97.0...v0.99.0

Claude Agent SDK Python v0.1.73 — Eager Session Store Flushing

TL;DR

New session_store_flush option ("batched" | "eager") on ClaudeAgentOptions delivers session frames to SessionStore.append() in near-real-time instead of only at end-of-turn.

Developer signal

eager mode enables three previously difficult patterns: (1) live-tailing UIs that stream agent progress, (2) cross-process agent resume from another worker, (3) crash-durability — agent state is already persisted before a crash. Set session_store_flush="eager" if you're building any of these. Default remains "batched". Run pip install claude-agent-sdk==0.1.73.
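To make the batched/eager difference concrete, here is a toy model in plain Python. It does not use the Claude Agent SDK: InMemoryStore, TurnRecorder, and frames_persisted_before_crash are hypothetical names invented for illustration, and the real SessionStore.append() signature may differ. It only shows why eager flushing helps with crash durability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a session store; records every append call."""
    calls: List[List[str]] = field(default_factory=list)

    def append(self, frames: List[str]) -> None:
        self.calls.append(list(frames))


@dataclass
class TurnRecorder:
    """Toy recorder that forwards frames to the store according to a flush mode."""
    store: InMemoryStore
    flush: str = "batched"  # "batched" or "eager", mirroring the option values
    _pending: List[str] = field(default_factory=list)

    def emit(self, frame: str) -> None:
        if self.flush == "eager":
            # Eager: persist each frame as soon as it is produced.
            self.store.append([frame])
        else:
            # Batched: hold frames in memory until the turn ends.
            self._pending.append(frame)

    def end_turn(self) -> None:
        # Flush whatever is still buffered (only non-empty in batched mode).
        if self._pending:
            self.store.append(self._pending)
            self._pending = []


def frames_persisted_before_crash(flush: str, frames: List[str]) -> int:
    """Emit frames, then 'crash' before end_turn; count frames that reached the store."""
    store = InMemoryStore()
    recorder = TurnRecorder(store=store, flush=flush)
    for frame in frames:
        recorder.emit(frame)
    # end_turn() is deliberately never called: the process died mid-turn.
    return sum(len(call) for call in store.calls)
```

In this model, a mid-turn crash loses every frame under "batched" and none under "eager". The trade-off is one store write per frame instead of one per turn, which is presumably why "batched" stays the default.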

anthropics/claude-agent-sdk-python GitHub | Date: 2026-05-04 | Link: https://github.com/anthropics/claude-agent-sdk-python/releases/tag/v0.1.73

LiteLLM v1.83.14-stable — GPT-5.5, Memory API, LLM-as-Judge Guardrail

TL;DR

Adds GPT-5.5 and GPT-5.5 Pro support, GLM-5 and Minimax M2.5 on Bedrock, new /v1/memory CRUD endpoints, an LLM-as-a-Judge guardrail option, and a route_all_chat_openai_to_responses global flag.

Developer signal

If you're using LiteLLM proxy for unified access: (1) gpt-5.5 and gpt-5.5-pro model strings are now valid; (2) the new /v1/memory endpoints let you persist and retrieve structured memory across sessions through the proxy; (3) the LLM-as-a-Judge guardrail lets you define a judge prompt that runs on every completion — useful for policy enforcement without a custom layer. Run pip install litellm==1.83.14.
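For readers new to the LLM-as-a-Judge pattern, the idea can be sketched in a few lines of plain Python. This is not LiteLLM's guardrail configuration or API: JUDGE_PROMPT, judge_guardrail, and fake_judge are hypothetical names, and the judge here is a stub rather than a real model call. It only illustrates the shape of the check the proxy would run on each completion.

```python
from typing import Callable

# Hypothetical judge prompt; a real policy would be written for your use case.
JUDGE_PROMPT = (
    "You are a policy reviewer. Reply with exactly PASS or FAIL.\n"
    "Policy: responses must not include email addresses.\n"
    "Response to review:\n{completion}"
)


def judge_guardrail(completion: str, judge: Callable[[str], str]) -> bool:
    """Return True only if the judge explicitly approves the completion."""
    verdict = judge(JUDGE_PROMPT.format(completion=completion))
    # Fail closed: anything other than an explicit PASS blocks the response.
    return verdict.strip().upper() == "PASS"


def fake_judge(prompt: str) -> str:
    """Stub standing in for a model call; flags any '@' in the reviewed text."""
    reviewed = prompt.split("Response to review:\n", 1)[1]
    return "FAIL" if "@" in reviewed else "PASS"


if __name__ == "__main__":
    print(judge_guardrail("Your order has shipped.", fake_judge))
    print(judge_guardrail("Contact me at a@b.com", fake_judge))
```

Failing closed on an unparseable verdict is a design choice in this sketch; a production guardrail may prefer to log and allow instead, depending on the policy.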

BerriAI/litellm GitHub | Date: 2026-05-04 | Link: https://github.com/BerriAI/litellm/releases/tag/v1.83.14-stable

Research Papers

Nothing today — no papers with verified May 5 submission dates, strong benchmark numbers, and code repos surfaced from cs.CL/cs.AI/cs.LG scans.

Tooling Updates

vLLM v0.20.1 — DeepSeek V4 Stability Patch

TL;DR

Patch release that fixes CUDA graph memory capture and BailingMoE linear layer bugs, and adds multi-stream pre-attention GEMM plus BF16/MXFP8 all-to-all support for DeepSeek V4.

Developer signal

If you're serving DeepSeek V4 on vLLM, upgrade — the CUDA graph memory capture bug caused silent OOM failures in multi-GPU setups. The new tile kernels for "optimized head computation" also improve throughput on H100 configurations. Run pip install vllm==0.20.1.

vllm-project/vllm GitHub | Date: 2026-05-04 | Link: https://github.com/vllm-project/vllm/releases/tag/v0.20.1

Ollama v0.23.1-rc0 — Gemma 4 MTP Speculative Decoding (>2x speed on coding)

TL;DR

Release candidate adds Gemma 4 MTP (multi-token prediction) speculative decoding, claiming a throughput increase of over 2x for Gemma 4 31B on coding tasks, plus MLX threading fixes and a Go 1.26 bump.

Developer signal

If you're running Gemma 4 31B locally for code tasks, this RC is worth testing: a 2x gain in tokens/second, if it holds on your hardware, is significant for interactive use. Note that it's a release candidate; wait for v0.23.1 stable before using it in production. The MLX threading fix also resolves an Apple Silicon hang under concurrent requests.
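To check the speedup claim on your own machine, you can compare generation throughput before and after the upgrade. The sketch below assumes you have saved the final JSON response from a non-streaming Ollama generate call for each build; it relies on the eval_count field (generated tokens) and the eval_duration field (nanoseconds) in that response. The helper names and sample numbers are illustrative, not measurements.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from an Ollama response (eval_duration is in ns)."""
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speedup(baseline: dict, candidate: dict) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return tokens_per_second(candidate) / tokens_per_second(baseline)


if __name__ == "__main__":
    # Made-up sample responses, trimmed to the two fields used here.
    before = {"eval_count": 200, "eval_duration": 10_000_000_000}
    after = {"eval_count": 200, "eval_duration": 4_000_000_000}
    print(f"{tokens_per_second(before):.1f} tok/s before")
    print(f"{tokens_per_second(after):.1f} tok/s after")
    print(f"{speedup(before, after):.2f}x speedup")
```

Use the same prompt, the same sampling settings, and several runs per build; speculative decoding gains depend on how predictable the output is, so coding prompts and prose prompts can differ.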

ollama/ollama GitHub | Date: 2026-05-05 | Link: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0

llama.cpp b9028–b9033 — KV Rotation Fast Path, Memory Savings, Lazy Backend Loading

TL;DR

Five builds today: fast Walsh-Hadamard transform for KV rotation (perf), device buffer memory saving option, backend lazy-loading (only initializes GPU/CPU backends on demand), and a new /models?reload=1 server API endpoint.

Developer signal

Three things worth acting on: (1) --memory-save-device-buffers flag reduces VRAM overhead for multi-model servers; (2) /models?reload=1 lets you hot-reload model configs without restarting the server process; (3) the Walsh-Hadamard KV rotation is a drop-in perf improvement for models that use RoPE variants — no config change needed, just pull latest build.
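For context on why a fast Walsh-Hadamard transform is a performance win: multiplying a length-n vector by a dense Hadamard matrix costs O(n^2) operations, while the butterfly form below needs only O(n log n) additions and subtractions. This is the textbook unnormalized algorithm in plain Python, shown as a reference sketch; it is not llama.cpp's kernel, and the function name fwht is chosen here for illustration.

```python
from typing import List


def fwht(values: List[float]) -> List[float]:
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)  # work on a copy so the input is left untouched
    h = 1
    while h < n:
        # Combine blocks of size h into blocks of size 2h with butterflies.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    print(fwht(x))        # [4, 2, 0, -2, 0, 2, 0, 2]
    print(fwht(fwht(x)))  # applying it twice returns n * x
```

Because the unnormalized transform is its own inverse up to a factor of n, a rotation applied this way can be undone with the same routine plus a scale.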

ggml-org/llama.cpp GitHub | Date: 2026-05-05 | Link: https://github.com/ggml-org/llama.cpp/releases

Technical Discussions

Nothing today passed the quality gate. The Anthropic CEO cybersecurity remarks (Dario Amodei warning of "tens of thousands of vulnerabilities" found by Opus 4.7) are newsworthy but scored below 3 on the quality rubric: no code, no API link, no reproducible data.

Quality gate excluded 13+ items: 4 strong releases outside the 24h window (Claude Opus 4.7, Managed Agents, GPT-5.5, Mistral Medium 3.5), Devstral 2 (Dec 2025), Ollama v0.23.0 (May 3), arXiv papers without confirmed same-day submission dates and benchmark links, plus several opinion/roundup pieces. Light day on papers, but an honest digest beats a padded one.

Filtered from 30+ primary sources against a published quality rubric. No press releases, no fluff — only what changes what you build.