📡

AI Developer Digest

Sun, Jun 7, 2026

4 items passed quality gate | ~45 scanned | ~41 excluded | Sources checked: 27 | Scan window: June 6–7, 2026 (24h)

Prior digest covered: Claude Code v2.1.166/167; Claude Opus 4.1 deprecation (Aug 5); anthropic-sdk-python v0.106.0/v0.107.0; LiteLLM v1.88.0-rc.3; llama.cpp b9537–b9543 (Qwen3.5 video/frame merge, OpenCL ops); CL-Bench (arXiv 2606.05661); DeployBench (arXiv 2606.05238); LMArena Agent Arena launch; OpenAI Lockdown Mode for personal accounts.


This Week's Signal

A genuinely light 24-hour window. One urgent action item dominates: the Gemini Interactions API opt-out header (Api-Revision: 2026-05-07) stops being accepted at the start of June 8, tomorrow, so developers who haven't migrated from response.outputs to response.steps have hours, not days. Beyond that, the only new technical development worth noting is llama.cpp b9549 landing Gemma4 Multi-Token Prediction in the official upstream repo; community testing shows a ~40% throughput gain on consumer hardware (97 → 138 tokens/s on an M5 Max) with no measurable quality loss. The rest of the period is bug fixes and minor SDK patches across the usual repos.

Must-reads this digest:

  • Gemini Interactions API — LAST DAY before June 8 removal — if your code reads response.outputs or sends Api-Revision: 2026-05-07, it breaks tomorrow; migrate to response.steps now
  • llama.cpp b9549: Gemma4 MTP — 40% throughput gain (97 → 138 tokens/s on M5 Max) for local Gemma4 inference, no quality tradeoff

[BREAKING] Breaking Changes

[BREAKING] Gemini Interactions API — Legacy Schema Removal June 8, Less Than 24 Hours

Source: Google AI for Developers | Date: Removal scheduled June 8, 2026 | Link: https://ai.google.dev/gemini-api/docs/interactions-breaking-changes-may-2026

What changed: The Api-Revision: 2026-05-07 opt-out header that preserved the legacy response.outputs schema stops being accepted on June 8. No grandfathering, no extended opt-out. After today, all Interactions API traffic uses only the new response.steps schema, unconditionally.

TL;DR: The Gemini Interactions API legacy response.outputs schema is removed June 8 (fewer than 24 hours away); any code sending Api-Revision: 2026-05-07 or reading response.outputs stops working at the start of June 8.

Developer signal: You have today. Three things to check right now:

(1) Search your codebase for response.outputs, Api-Revision: 2026-05-07, and response_mime_type. Each stops working on June 8. Replace response.outputs with response.steps; the steps array provides a structured timeline of each interaction turn with polymorphic entry types.

(2) If you use the Python SDK (google-generativeai ≥2.0.0) or JavaScript SDK (@google/generative-ai ≥2.0.0), the SDK already uses the new schema automatically. You only need to update how your application code reads the response structure.

(3) response_mime_type is gone; use the new polymorphic response_format field instead.

Note: any Gemini features shipped after May 7, including new Gemini 3.5 Flash capabilities, are available only in the new schema. Staying on the opt-out was already costing you new capabilities. The full migration guide, with before/after code examples, is at the primary source link.

Affects you if: You call the Gemini Interactions API directly or via SDK and your code accesses response.outputs, sends the Api-Revision: 2026-05-07 header, or references response_mime_type

Adoption effort: Moderate (response-reading code must be updated; if using a current SDK version, the SDK transport layer is already migrated, so only application code needs changing)

Primary source: https://ai.google.dev/gemini-api/docs/interactions-breaking-changes-may-2026

Quality gate score: 9 (official Google AI documentation +3, specific field names and migration path with code examples +2, primary migration guide link +2, removal within 24 hours +1, technical audience +1)
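The codebase search in check (1) can be scripted. The following is a minimal sketch, assuming the three legacy identifiers appear literally in source files. The function names (scan_text, scan_tree), the regular expressions, and the default file suffixes are illustrative choices, not part of Google's migration guide, and the script only locates candidates; it does not rewrite anything.

```python
import re
from pathlib import Path

# Legacy identifiers named in the migration note. The regexes are an
# illustrative guess at how they typically appear in source code.
LEGACY_PATTERNS = {
    "response.outputs": re.compile(r"response\.outputs\b"),
    "Api-Revision: 2026-05-07": re.compile(
        r"Api-Revision[\"']?\s*[:=,]\s*[\"']?2026-05-07"
    ),
    "response_mime_type": re.compile(r"\bresponse_mime_type\b"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for each legacy identifier found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


def scan_tree(root: str, suffixes=(".py", ".js", ".ts")) -> dict[str, list[tuple[int, str]]]:
    """Scan files under root whose suffix is in suffixes (hypothetical default)."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = scan_text(path.read_text(encoding="utf-8", errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report
```

Calling scan_tree(".") from a repository root returns a mapping of file path to hits, which can be printed as file:line entries for review. A regex scan will miss identifiers built dynamically (for example, header names assembled from constants), so treat an empty report as a starting point rather than proof of a clean migration.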


Model Releases

No new model releases in this 24h period.


API & SDK Changes

No new API or SDK changes requiring full entries in this 24h period beyond the Gemini breaking change above. See Quick Hits for anthropic-sdk-python v0.107.1 Foundry bug fix.


Research

Nothing cleared the quality bar in this 24h period. No new arXiv papers from recognized labs with measurable benchmark numbers and associated code found in the June 6–7 window.


Tooling

[NOTABLE] llama.cpp b9549: Gemma4 Multi-Token Prediction Lands in Official Upstream

Source: ggml-org/llama.cpp | Date: June 7, 2026, 13:38 UTC | Link: https://github.com/ggml-org/llama.cpp/releases/tag/b9549

What changed: PR #23398 merges Gemma4 MTP (Multi-Token Prediction) support into the official llama.cpp repo. Previously, Gemma4 MTP was available only in the ik_llama.cpp performance fork (PR #1744, merged May 10, 2026, with a verified 2.6–2.98x lossless speedup); b9549 brings it to the main project used by most downstream integrations.

TL;DR: llama.cpp b9549 adds official Gemma4 Multi-Token Prediction support, enabling community-verified ~40% throughput gains (97 → 138 tokens/s on MacBook Pro M5 Max) with no measurable quality degradation.

Developer signal: If you're running Gemma4 locally via llama.cpp, update to b9549 and enable MTP to get the throughput improvement without changing your model. To enable: add --draft-model <gemma4-mtp-draft-gguf> to your llama-cli or llama-server invocation (llama.cpp's existing speculative-decoding option is spelled --model-draft, short form -md, so confirm the exact flag name against the PR or the --help output). The draft GGUF is generated from the Gemma4AssistantForCausalLM model class via the standard convert_hf_to_gguf.py conversion path. The 40% figure comes from community benchmarks on Apple Silicon (M5 Max); no CUDA numbers are cited, but comparable or better gains are plausible since MTP overhead is typically lower on GPU. Before this build, using Gemma4 MTP required building from the ik_llama.cpp fork or maintaining a custom build; b9549 removes that barrier and makes MTP a standard option in any llama.cpp distribution. The ik_llama.cpp fork results (2.6–2.98x at larger batch sizes) are the upper bound; 40% is the conservative single-stream figure. Check the PR for updated configuration examples.

Affects you if: You run Gemma4 models locally using llama.cpp and care about inference throughput on consumer or prosumer hardware (Apple Silicon, CUDA, or CPU)

Adoption effort: Quick (update to b9549, generate or download a Gemma4 MTP draft GGUF, add one CLI flag; no architecture or code changes required)

Primary source: https://github.com/ggml-org/llama.cpp/pull/23398

Quality gate score: 8 (official project GitHub source +3, concrete throughput numbers from community benchmarks with hardware context +2, primary PR link +2, within scan window +1)
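To put the single-stream figure and the fork's multiplier range on the same scale, the arithmetic can be sketched as below. The only inputs are the numbers quoted in this entry (97 and 138 tokens/s, and the 2.6–2.98x range); the helper name throughput_gain is illustrative.

```python
def throughput_gain(baseline_tps: float, mtp_tps: float) -> tuple[float, float]:
    """Return (speedup ratio, percent gain) for two tokens/s measurements."""
    ratio = mtp_tps / baseline_tps
    return ratio, (ratio - 1.0) * 100.0


# Community single-stream numbers quoted above (M5 Max).
ratio, pct = throughput_gain(97.0, 138.0)
print(f"b9549 single-stream: {ratio:.2f}x ({pct:.1f}% gain)")

# The fork's reported range, converted to percent gain for comparison.
for fork_ratio in (2.6, 2.98):
    fork_pct = (fork_ratio - 1.0) * 100.0
    print(f"ik_llama.cpp reported: {fork_ratio:.2f}x ({fork_pct:.0f}% gain)")
```

The 97 → 138 measurement works out to roughly 1.42x, so "~40%" slightly understates it, while the fork's 2.6–2.98x corresponds to a 160–198% gain. The two figures describe different batch conditions, so the gap between them is not a regression.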


Benchmarks & Leaderboards

No new leaderboard movements for June 6–7. The LMArena text leaderboard top cluster and SWE-bench Verified rankings are unchanged from the prior digest. Last significant leaderboard entry was June 5 (mistral-medium-3.5 added to Code Arena WebDev leaderboard).


Trends & Emerging Tech

Multi-Token Prediction Crossing the Mainstream Threshold in Local Inference

Source: ggml-org/llama.cpp | Date: June 7, 2026 | Link: https://github.com/ggml-org/llama.cpp/releases/tag/b9549

What's happening: llama.cpp b9549 is the latest in a run of open inference runtimes adding MTP support through Q2 2026, following ik_llama.cpp (May 10), a Gemma4 Transformers-native MTP implementation by the Hugging Face team (April 2026), and experimental DeepSeek V4 MTP in vLLM's roadmap. The common pattern: 1.5x–3x throughput gains without quality loss, by predicting 2–4 tokens per forward pass using a small draft head trained alongside the main model. The ik_llama.cpp fork verified a 2.6–2.98x lossless speedup for Gemma4 at launch prior to upstream merge; b9549 makes this a standard option in mainstream llama.cpp distributions.

Why watch this: MTP is following the same adoption arc as speculative decoding: experimental fork proves it works → upstream integration → standard config option developers set once and forget. If that trajectory continues, the practical impact for builders is twofold. First, throughput benchmark comparisons for local inference need a new disclosure: tokens/second without stating whether MTP was active and what draft model was used is no longer a comparable number. Second, watch whether model card releases start shipping companion draft-model GGUF files as a standard artifact alongside base-model quantizations. That would signal the ecosystem treating MTP as a first-class configuration rather than a power-user optimization.
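One way to make the disclosure point concrete is a benchmark record that refuses to omit MTP status. This is a sketch only: the ThroughputResult type, its field names, and the draft-model file name are hypothetical, not a format used by llama.cpp or any leaderboard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThroughputResult:
    """A tokens/s measurement that carries its MTP configuration with it."""
    model: str
    hardware: str
    tokens_per_second: float
    mtp_enabled: bool
    draft_model: Optional[str] = None  # must be named when MTP is active

    def __post_init__(self) -> None:
        if self.mtp_enabled and not self.draft_model:
            raise ValueError("an MTP run must name its draft model")


def comparable(a: ThroughputResult, b: ThroughputResult) -> bool:
    """Two results are like-for-like only if their whole setup matches."""
    return (a.model, a.hardware, a.mtp_enabled, a.draft_model) == (
        b.model, b.hardware, b.mtp_enabled, b.draft_model
    )


# Numbers from this digest; the draft file name is a placeholder.
baseline = ThroughputResult("Gemma4", "M5 Max", 97.0, mtp_enabled=False)
with_mtp = ThroughputResult("Gemma4", "M5 Max", 138.0, mtp_enabled=True,
                            draft_model="gemma4-mtp-draft.gguf")
print(comparable(baseline, with_mtp))
```

Here the two results are flagged as not comparable even though model and hardware match, which is the point: the 138 tokens/s figure only means something next to another run with the same draft model active.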


Technical Discussions

Nothing cleared the quality bar this period. No HN threads with score >200 and concrete technical depth found for June 6–7. No qualifying posts from Nathan Lambert (last: June 1), Eugene Yan, or Sebastian Raschka in the scan window.


Quick Hits


Worth Watching (Announced, Not Yet Shipped)

⚠️⚠️⚠️ Gemini API Legacy Schema (Interactions) — Hard Removal TOMORROW June 8 — ACT TODAY

(Elevated to [BREAKING] Breaking Changes above — same item; see that section for full migration checklist) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/interactions-breaking-changes-may-2026

⚠️⚠️ Windows Local AI Runtime — KB5039239 June 9 (2 days)

(Countdown updated) Source: Windows Developer Blog | Link: https://blogs.windows.com/windowsdeveloper/2026/06/02/build-2026-furthering-windows-as-the-trusted-platform-for-development/

Windows Update KB5039239 delivers the expanded on-device AI stack (Aion 1.0 runtime, CPU/GPU/NPU support) on June 9. Required for production use of Aion 1.0 Instruct and Aion 1.0 Plan on end-user devices. Aion 1.0 open weights land on Hugging Face in July.

⚠️⚠️⚠️ Claude Sonnet 4 + Opus 4 — Retirement June 15 (8 days)

(Countdown updated) Source: Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

claude-sonnet-4-20250514 and claude-opus-4-20250514 return errors starting June 15. Migrate to claude-sonnet-4-6-20260217 and claude-opus-4-8 respectively. Review the Opus 4.8 migration guide before upgrading: adaptive thinking replaces budget_tokens, and setting temperature, top_p, or top_k to non-default values returns a 400 error.

⚠️⚠️ Gemini CLI Hard Stop — June 18 (11 days)

(Countdown updated) Source: Google Developers Blog | Link: https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/

gemini CLI and Gemini Code Assist IDE extensions stop serving requests for Google AI Pro, Ultra, and free personal users on June 18. Replacement is Antigravity CLI (agy). Audit CLI scripts and CI pipeline steps now; Antigravity CLI does not have 1:1 feature parity.

⚠️⚠️ Gemini API Unrestricted Key Deadline — June 19 (12 days)

(Countdown updated) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/api-key

All unrestricted Gemini API keys are blocked starting June 19. Restrict via AI Studio → API Keys → "Restrict to Gemini API." Takes 2 minutes; no code changes required.

⚠️ Gemini Image Models Shutdown — June 25 (18 days)

(Countdown updated) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/deprecations

gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shut down on June 25, 2026. Migrate to the stable image model equivalents before the shutdown date.

⚠️ GPT-4.5 Retirement from ChatGPT — June 27 (20 days)

(Countdown updated) Source: OpenAI Platform Changelog | Link: https://platform.openai.com/docs/changelog

GPT-4.5 is being retired from the ChatGPT product surface on June 27; retirement of the direct API route is unconfirmed. Audit gpt-4.5 model identifiers in code.

⚠️ Claude Opus 4.1 Retirement — August 5 (59 days)

(Countdown updated) Source: Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

claude-opus-4-1-20250805 retires August 5. Migrate to claude-opus-4-8. Migration effort is significant if you are coming from a pre-4.7 model; see the June 6, 2026 digest for the full migration checklist, including breaking changes around adaptive thinking, sampling parameters, and tokenizer differences.

⚠️ OpenAI Reusable Prompts (v1/prompts) Shutdown — November 30 (176 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations

Deprecated June 3, shutdown November 30, 2026. Move prompt content to application code.

⚠️ OpenAI Evals Platform Shutdown — November 30 (176 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations

Read-only October 31, shutdown November 30, 2026. Export eval configs before October 31.

⚠️ OpenAI Agent Builder Shutdown — November 30 (176 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations

Shutdown November 30, 2026. Migrate to Agents SDK (openai.agents) or ChatGPT Workspace Agents.

Claude Mythos — Public Release "Once Stronger Safeguards Ready"

(Carried — status unchanged) Source: Anthropic | Link: https://www.anthropic.com/news/expanding-project-glasswing No timeline given. Currently: no public API, no claude.ai access at any tier. Leads SWE-bench Verified at 93.9% (internal benchmark as of June 2, 2026).

Gemini 3.5 Pro — Expected July 2026

(Carried — no official date) Sundar Pichai stated "give us until next month" at Google I/O 2026 (May 19). No official announcement, pricing, model ID, or benchmark numbers.
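The countdowns in this section are plain date arithmetic from the digest date, so they can be regenerated rather than hand-updated. The sketch below uses only the dates listed above; the dictionary labels and the helper names (days_remaining, due_within) are illustrative.

```python
from datetime import date

DIGEST_DATE = date(2026, 6, 7)

# Dated items from this section; labels are shortened for readability.
DEADLINES = {
    "Gemini Interactions API legacy schema removal": date(2026, 6, 8),
    "Windows KB5039239 (Aion 1.0 runtime)": date(2026, 6, 9),
    "Claude Sonnet 4 / Opus 4 retirement": date(2026, 6, 15),
    "Gemini CLI hard stop": date(2026, 6, 18),
    "Gemini unrestricted API keys blocked": date(2026, 6, 19),
    "Gemini image preview models shutdown": date(2026, 6, 25),
    "GPT-4.5 retirement from ChatGPT": date(2026, 6, 27),
    "Claude Opus 4.1 retirement": date(2026, 8, 5),
    "OpenAI Reusable Prompts / Evals / Agent Builder shutdown": date(2026, 11, 30),
}


def days_remaining(deadline: date, today: date = DIGEST_DATE) -> int:
    """Whole days from today until the deadline (negative once it has passed)."""
    return (deadline - today).days


def due_within(days: int, today: date = DIGEST_DATE) -> list[tuple[int, str]]:
    """Deadlines that fall within the next `days` days, soonest first."""
    upcoming = [
        (days_remaining(d, today), name)
        for name, d in DEADLINES.items()
        if 0 <= days_remaining(d, today) <= days
    ]
    return sorted(upcoming)


for remaining, name in due_within(14):
    print(f"{remaining:>3} days: {name}")
```

Passing a different today value re-derives every countdown for a later digest, and due_within(14) gives a two-week action list without touching the entries themselves.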


<details> <summary>🔭 Horizon — Open Questions, Emerging Patterns & Grounded Speculation</summary>

This section operates under different rules than the digest above. Evidence-grounded speculation is allowed. Pure prediction is not. Every claim here must cite a source from this digest or a real paper/benchmark. Label each entry by type so the reader knows what kind of thinking they're engaging with.

[PATTERN] MTP is following the speculative decoding trajectory: experimental fork → official upstream → standard config option

ik_llama.cpp added Gemma4 MTP on May 10, 2026, and verified a 2.6–2.98x lossless speedup at larger batch sizes. The official llama.cpp upstream merged it on June 7 (b9549), making it available to all downstream distributions without a custom build. Speculative decoding followed an almost identical adoption arc: performance forks proved it worked and characterized the gains, official llama.cpp integrated it several months later, and it became a standard toggle that developers set once in their deployment config. If MTP follows the same arc, watch for: first-party model releases starting to ship companion draft-model GGUF files as standard artifacts alongside base quantizations; inference framework dashboards adding an "MTP enabled" column alongside quantization and GPU type; and throughput benchmarks needing to disclose MTP status to remain comparable. The Gemma4 team's next move, whether they publish a recommended draft model size/architecture guide, is the leading indicator of whether MTP becomes first-class or stays a power-user optimization.

Grounded in: llama.cpp b9549 (this digest, Tooling); ik_llama.cpp PR #1744 (May 10, 2026, cited in community benchmarks referenced in this digest)

[TENSION] Gemini's 32-day breaking-change window vs. the migration time required by complex agentic systems

The Interactions API opt-out (Api-Revision: 2026-05-07) was available from May 7. The legacy schema is removed June 8, 32 days later. For simple request-response code, 32 days is adequate. For complex agentic systems that parse the steps/outputs structure deeply (multi-turn session state, streaming tool-call flows, structured output parsing, production monitoring dashboards), 32 days is marginal: discovery takes a week, integration testing takes two, and coordination across teams and review cycles takes the rest. Google has historically given 60–90 days for API breaking changes on stable features. The 32-day window reflects pressure to keep pace with a faster feature cadence, but it creates a mismatch: the developers most affected by deep schema dependency are also the ones who need the most migration runway. Compare: Anthropic's Opus 4.1 deprecation carries a 60-day window; Sonnet/Opus 4 retirement had 62 days. The direction of travel is shorter windows across the industry. It is worth watching whether Google's next major API breaking change comes with a longer or shorter opt-out period, and whether it ships with automated migration tooling.

Grounded in: Gemini Interactions API removal (this digest, Breaking Changes); Anthropic 60-day Opus 4.1 window and 62-day Sonnet/Opus 4 window (June 6, 2026 digest, for contrast)

[IF THIS CONTINUES] At the current multi-provider deprecation cadence, mandatory model and API migration is becoming a standing 10–20% engineering tax on AI-integrated teams

Tallying the past 60 days across Anthropic and Google alone: Gemini Interactions API schema removal (32-day window, code changes required), Gemini CLI shutdown (30-day window, pipeline audit required), Gemini image model shutdowns (code changes required), Gemini unrestricted key deadline (2 minutes, but a deployment step), Claude Sonnet 4 / Opus 4 retirement (62-day window, code changes required for pre-4.7 users), and Claude Opus 4.1 deprecation (60-day window, significant migration effort). That is six separate forced migration events in 60 days across two providers, four of them requiring meaningful code changes or significant migration effort. If both Anthropic and Google sustain roughly one forced migration per 30 days each, a team with production integrations to both providers now manages ~2 forced migrations per month as a baseline overhead, in addition to feature work. At moderate integration depth, that's a standing 10–20% of a backend developer's sprint capacity. The mitigation that doesn't yet exist as a first-class product: an official multi-provider deprecation monitoring API or dashboard that sends calendar-aware alerts before each migration window closes. The tools that do exist (email lists, manual changelog polling) don't scale to the current cadence.

Grounded in: Gemini June 8 breaking change (this digest); Opus 4.1 deprecation, Sonnet/Opus 4 retirement (June 6 digest); Gemini CLI, image model, API key deadlines (Worth Watching, this digest)
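As a sanity check on the 10–20% figure, the per-migration effort it implies can be worked backwards. This is a back-of-envelope sketch: the 20 working days per month is an assumption introduced here, not a number from this digest, and the ~2 migrations per month is the baseline stated above.

```python
WORKING_DAYS_PER_MONTH = 20  # assumption for illustration, not from the digest
MIGRATIONS_PER_MONTH = 2     # the two-provider baseline stated above


def implied_days_per_migration(capacity_share: float) -> float:
    """Developer-days each migration would consume at a given capacity share."""
    monthly_days = capacity_share * WORKING_DAYS_PER_MONTH
    return monthly_days / MIGRATIONS_PER_MONTH


for share in (0.10, 0.20):
    days = implied_days_per_migration(share)
    print(f"{share:.0%} of capacity -> {days:.1f} developer-days per migration")
```

Under these assumptions the claim amounts to one to two developer-days per forced migration. Teams can substitute their own working-day count and observed migration effort to see whether the estimate holds for them.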

[OPEN QUESTION] Will Gemma4's MTP support in llama.cpp accelerate a multi-token output feature on the Gemini API itself?

Gemma4 MTP works by training a small draft head alongside the main model, a technique the Google Gemma team published and now ships as an open-weight inference optimization that any llama.cpp user can activate. The same team builds Gemini. The question is whether this internal capability eventually surfaces as a Gemini API parameter: a "draft acceleration" flag analogous to how Anthropic's fast mode shipped first as a research preview. Google hasn't announced any Gemini-side MTP feature. But the internal engineering exists (Gemma4 MTP weights are a Google artifact, and the ik_llama.cpp fork demonstrated 2.6–2.98x speedups at the model level). An inference-level MTP optimization on the Gemini API could reduce latency and cost for agentic workloads, particularly tool-call-heavy sessions where many short-turn responses dominate, without requiring a model swap. Watch the Gemini API changelog for any entry mentioning draft heads, multi-token generation, or inference acceleration toggles in Q3 2026.

Grounded in: llama.cpp b9549 Gemma4 MTP (this digest, Tooling); Gemini 3.5 Flash release (prior digests, for context on Google's fast-model trajectory)

</details>

Excluded: ~41 items below quality gate threshold, outside scan window, or duplicate coverage. Near-misses: Gemini video-to-image generation on gemini-3.1-flash-image (May 28 — outside window); NVIDIA Nemotron 3 Ultra on Fireworks AI blog (June 4 — outside window); OpenAI container billing granularity change, per-minute billing with 5-minute minimum (June 2 — outside window); OpenAI deprecations for Reusable Prompts / Evals Platform / Agent Builder (June 3 — outside window); ChatGPT Enterprise Codex shared local plugins (June 6 — product surface only, no API impact); anthropic-sdk-python v0.107.1 Foundry x-api-key fix (promoted to Quick Hit — minor bug fix, below full-entry threshold); Claude Code v2.1.168 (promoted to Quick Hit — "bug fixes and reliability improvements", no specific changes disclosed); arXiv June 7 (no cs.AI/cs.CL papers from recognized labs with benchmark numbers and associated code found for this window — AXIOM and PersonaTree submissions noted but primary sources not fetchable); LMArena text ELO (no movement June 6–7); LMArena Agent Arena (no new model entries June 6–7, last entry June 5); SWE-bench Verified (no new entries); Nathan Lambert, Eugene Yan, Sebastian Raschka (no qualifying posts in scan window); vLLM (GitHub releases page not returning June 2026 entries — second consecutive scan with this display issue, no confirmed release); Mistral 3 (December 2025 — outside window); GLM-5.1 (April 2026 — outside window); AWS ML Blog (no June 6–7 items); Azure AI (no June 6–7 items); NVIDIA TensorRT-LLM (no confirmed June 6–7 release); Together AI / Fireworks AI / Modal (no June 6–7 items); Groq (no technical release in window); xAI (403 on direct fetch, no confirmed June 6–7 technical release via search); Meta AI (no June 6–7 items); Cohere (no June 6–7 items).
