AI Developer Digest

Fri, May 8, 2026

6 signals that cleared the gate | 31 scanned | 5 min read

Model Releases

Nothing today.

API & SDK Changes

Anthropic Doubles Claude Code Limits; Opus API Raised 16x

TL;DR

Anthropic doubled Claude Code's 5-hour rolling limits for all paid tiers, removed peak-hour throttling for Pro/Max, and raised Claude Opus API tier-1 input tokens/min from 30K to 500K — crediting a new compute deal with SpaceX's Colossus 1 (220K+ NVIDIA GPUs, 300 MW).

Developer signal

If you're hitting Claude Code session limits or being rate-limited on Opus in production agents, re-check your tier limits now. No code change is needed; the new limits are live. The Opus API increase (30K to 500K input tokens/min for tier 1, about 16.7x) is especially relevant for batch pipelines that were previously throttled.
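To make the tier-1 change concrete, here is a minimal arithmetic sketch. The 30K and 500K figures come from the item above; the batch size is a hypothetical number chosen only for illustration, and the estimate ignores output tokens, request-count limits, and retries.

```python
import math


def minutes_to_send(total_input_tokens: int, tokens_per_minute: int) -> int:
    """Whole minutes needed to push a batch through an input tokens/min cap."""
    if tokens_per_minute <= 0:
        raise ValueError("tokens_per_minute must be positive")
    return math.ceil(total_input_tokens / tokens_per_minute)


OLD_LIMIT = 30_000   # tier-1 Opus input tokens/min before the change (from the item)
NEW_LIMIT = 500_000  # tier-1 Opus input tokens/min after the change (from the item)

batch_tokens = 6_000_000  # hypothetical batch size, for illustration only

old_minutes = minutes_to_send(batch_tokens, OLD_LIMIT)
new_minutes = minutes_to_send(batch_tokens, NEW_LIMIT)
speedup = NEW_LIMIT / OLD_LIMIT

print(old_minutes, new_minutes, round(speedup, 1))  # prints: 200 12 16.7
```

Under these assumptions, a batch that was input-bound for over three hours would clear in about twelve minutes.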

Anthropic | Date: 2026-05-06 | Link: https://www.anthropic.com/news/higher-limits-spacex

anthropic-sdk-python v0.100.0 — Managed Agents Webhooks + Vault

TL;DR

v0.100.0 adds support for Managed Agents multiagent workflows, outcome tracking, inbound webhooks, and vault validation; v0.99.0 (May 5) added workspace-scoped OIDC federation; v0.98.0 (May 4) shipped Workload Identity Federation, interactive OAuth, and auth profiles.

Developer signal

If you're building on the Managed Agents API (public beta since April 8, beta header anthropic-beta: managed-agents-2026-04-01, endpoints under /v1/agents), upgrade to ≥0.100.0 to get webhook event support and vault credential validation. The OIDC + WIF additions in 0.98–0.99 are production-ready for enterprise deployments needing federated identity.
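A small sketch of the upgrade check and the beta header mentioned above. The helper names are hypothetical, the version parser only handles plain dotted releases such as 0.100.0, and pairing the beta header with anthropic-version 2023-06-01 is an assumption rather than something the release notes state.

```python
def version_tuple(version: str) -> tuple:
    """Turn '0.100.0' into (0, 100, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supports_webhooks(installed: str, minimum: str = "0.100.0") -> bool:
    """True if the installed SDK version is at least the minimum named in the item."""
    return version_tuple(installed) >= version_tuple(minimum)


def managed_agents_headers(api_key: str) -> dict:
    """Headers for a raw HTTP call to the Managed Agents beta (hypothetical helper)."""
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # assumption: usual API version header value
        "anthropic-beta": "managed-agents-2026-04-01",  # beta header from the item
    }


print(supports_webhooks("0.99.0"), supports_webhooks("0.100.0"))  # prints: False True
```

The tuple comparison matters here: compared as plain strings, "0.99.0" sorts after "0.100.0", so a naive string check would wrongly treat 0.99.0 as new enough.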

Anthropic SDK | Date: 2026-05-06 | Link: https://github.com/anthropics/anthropic-sdk-python/releases/tag/v0.100.0 | Compare: https://github.com/anthropics/anthropic-sdk-python/compare/v0.99.0...v0.100.0

llm-gemini 0.31 — gemini-3.1-flash-lite Exits Preview

TL;DR

gemini-3.1-flash-lite is now generally available (no longer a preview model) per Google Cloud; llm-gemini 0.31 reflects this status change in the plugin.

Developer signal

gemini-3.1-flash-lite is Google's cheapest, lowest-latency Gemini model — GA status means stable pricing and SLAs. If you were avoiding it because of preview limitations, it's now production-safe. Update via llm install -U llm-gemini.

Simon Willison (plugin author) | Date: 2026-05-07 | Link: https://simonwillison.net/2026/May/7/llm-gemini/ | Release: https://github.com/simonw/llm-gemini/releases/tag/0.31

Research Papers

LCM: Lossless Context Management — Beats Claude Code on Long-Context Evals

TL;DR

LCM is a deterministic context-management architecture that decomposes symbolic recursion into (1) hierarchical DAG-based context compression and (2) engine-managed parallel task partitioning (LLM-Map); their Volt agent scores 74.8 vs Claude Code's 70.3 on OOLONG long-context eval (Opus 4.6 backbone), with Volt's advantage growing at longer contexts up to 1M tokens.

Developer signal

The key claim is engine-managed memory beats model-managed memory for coding agents at scale. Code is at https://github.com/Martian-Engineering/volt — worth benchmarking against your own long-context workloads. The LLM-Map primitive (parallel map over LLM calls with structured context passing) is directly usable in agent frameworks.
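For readers unfamiliar with the idea, the parallel-map pattern described above can be sketched as follows. This is not Volt's API: Task, llm_map, and fake_llm are hypothetical names, and the LLM call is replaced by a deterministic stub so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    """Structured context handed to each call: position, shared instructions, one chunk."""
    index: int
    shared_context: str
    chunk: str


def llm_map(chunks: List[str], shared_context: str,
            call_llm: Callable[[Task], str], max_workers: int = 4) -> List[str]:
    """Run call_llm over every chunk in parallel; results keep the input order."""
    tasks = [Task(i, shared_context, chunk) for i, chunk in enumerate(chunks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the tasks were submitted.
        return list(pool.map(call_llm, tasks))


def fake_llm(task: Task) -> str:
    """Stand-in for a real model call (hypothetical): counts words in the chunk."""
    return f"[{task.index}] {task.shared_context}: {len(task.chunk.split())} words"


results = llm_map(["alpha beta", "gamma", "delta epsilon zeta"], "count words", fake_llm)
print(results)
```

The point of the design is that the engine, not the model, decides how the work is partitioned and what context each call sees; swapping fake_llm for a real client call would leave llm_map unchanged.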

Voltropy PBC (Clint Ehrlich, Theodore Blackman) | Date: 2026-05 | Link: https://arxiv.org/abs/2605.04050 | Code: https://github.com/Martian-Engineering/volt

Tooling Updates

llama.cpp: 8 Builds Shipped Today (b9070–b9077); Top 3 Below

b9077 — Vertex AI Compatible Server API

TL;DR

llama-server now supports the Vertex AI-compatible API surface; activated when AIP_MODE env var is set (standard Google Cloud AI Platform convention), otherwise a no-op.

Developer signal

You can now point tools that target Vertex AI's predict endpoint at a local llama-server instance. Useful for testing Google Cloud integrations locally or running self-hosted inference behind a Vertex-compatible proxy. No code change needed if AIP_MODE is unset.
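As a rough illustration, Vertex AI's predict convention wraps inputs in an "instances" list with optional "parameters" and returns a "predictions" list. The sketch below only builds and parses those envelopes. The instance fields that llama-server expects are not stated in the release note, so the "prompt" key is a placeholder assumption, and both helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def build_predict_body(instances: List[Dict[str, Any]],
                       parameters: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a Vertex-style predict request envelope."""
    if not instances:
        raise ValueError("at least one instance is required")
    body: Dict[str, Any] = {"instances": instances}
    if parameters:
        body["parameters"] = parameters
    return json.dumps(body)


def extract_predictions(response_text: str) -> List[Any]:
    """Pull the 'predictions' list out of a Vertex-style predict response."""
    return json.loads(response_text).get("predictions", [])


# "prompt" is a placeholder field name, not a documented llama-server schema.
body = build_predict_body([{"prompt": "Hello"}], {"temperature": 0.2})
print(body)

# Hand-written sample response, used only to exercise the parser.
sample_response = '{"predictions": ["Hi there"]}'
print(extract_predictions(sample_response))  # prints: ['Hi there']
```

Check the b9077 release notes for the actual route and instance schema before pointing a real client at the server.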

b9075 — CUDA Snake Activation Fusion (5 ops → 1 kernel)

TL;DR

The CUDA graph optimizer now fuses the 5-op snake activation decomposition (x + sin(a*x)² * inv_b) used by audio decoders (BigVGAN, Vocos) into a single elementwise kernel.

Developer signal

If you're running BigVGAN or Vocos audio models via llama.cpp on CUDA, expect a throughput improvement on the decoder step — no configuration change required.
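The formula in the TL;DR can be written out as a plain-Python reference to show what fusion changes. The five-step split below is one plausible decomposition (multiply, sin, square, multiply, add), not necessarily the exact ggml op sequence, and a and inv_b are treated as scalars for simplicity.

```python
import math
from typing import List


def snake_unfused(xs: List[float], a: float, inv_b: float) -> List[float]:
    """Five separate elementwise passes, each producing an intermediate buffer."""
    scaled = [a * x for x in xs]                  # op 1: multiply by a
    sines = [math.sin(s) for s in scaled]         # op 2: sin
    squares = [s * s for s in sines]              # op 3: square
    weighted = [q * inv_b for q in squares]       # op 4: multiply by inv_b
    return [x + w for x, w in zip(xs, weighted)]  # op 5: add x


def snake_fused(xs: List[float], a: float, inv_b: float) -> List[float]:
    """One pass: each element is read once and written once, with no intermediates."""
    out = []
    for x in xs:
        s = math.sin(a * x)
        out.append(x + s * s * inv_b)
    return out


xs = [0.0, 0.5, math.pi / 2, -1.25]
unfused = snake_unfused(xs, a=1.0, inv_b=0.5)
fused = snake_fused(xs, a=1.0, inv_b=0.5)
print(all(math.isclose(u, f) for u, f in zip(unfused, fused)))  # prints: True
```

On a GPU the unfused version costs five kernel launches and four intermediate tensors written to and read from memory, which is the overhead a single fused kernel avoids.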

b9070 — Q4_0 MoE GEMM for Adreno GPUs via OpenCL

TL;DR

Q4_0 quantized Mixture-of-Experts GEMM is now accelerated on Qualcomm Adreno GPUs via the OpenCL backend.

Developer signal

MoE models (Mixtral, DeepSeek, Gemma 4 MoE variants) running Q4_0 on Snapdragon devices get native GPU acceleration — significant for on-device inference on Android flagship hardware. Update your llama.cpp build; no other changes needed.

ggml-org/llama.cpp | Date: 2026-05-08 | Link: https://github.com/ggml-org/llama.cpp/releases

Ollama v0.23.1 — Gemma 4 MTP Speculative Decoding on Mac (2x Coding Speed)

TL;DR

Adds Gemma 4 MTP (Multi-Token Prediction) speculative decoding support on Apple Silicon via MLX, delivering "over 2x speed increase for the Gemma 4 31B model on coding tasks"; also bumps MLX/MLX-C for threading fixes and Go 1.26.

Developer signal

If you're running Gemma 4 31B locally on a Mac for code generation, ollama pull gemma4:31b and update to v0.23.1 — the speedup is from speculative decoding via the MLX backend and requires no config change. The threading fixes also resolve intermittent stalls in the MLX runner.

ollama/ollama | Date: 2026-05-05 | Link: https://github.com/ollama/ollama/releases/tag/v0.23.1

Technical Discussions

Nothing today that cleared the quality gate.

Quality gate excluded 23 items: business/funding announcements (Anthropic Series G, SpaceX deal narrative coverage, India expansion), model releases outside window (GPT-5.4, Llama 4 recap, Gemini 3.1 Flash TTS), stale SDK releases (openai-python v1.103.0 from Sept 2025, transformers v5.3.0 from Mar 2025), vLLM v0.20.1 (May 3, outside window), minor llama.cpp builds (b9071–b9074, b9076 — routine maintenance), LiteLLM 1.83.14 patch (patch-level stable, no changelog specifics found), DeepMind publications (no primary-source technical detail reachable), opinion/prediction pieces, and paraphrase-heavy third-party summaries.

Light day for model releases. Solid day for tooling and SDK infrastructure.

Filtered from 30+ primary sources against a published quality rubric. No press releases, no fluff — only what changes what you build.