AI Developer Digest

Sat, Jun 13, 2026

18 signals that cleared the gate | 24 min read

The Signal — start here

The headline today is two things landing at once. vLLM v0.23.0 is the biggest open-source inference release in months: Model Runner V2 is now the default for Llama and Mistral, Transformers v4 support is fully removed, and multi-tier KV cache offloading is GA. Anthropic's June 15 Agent SDK billing split, a change not covered in any prior digest, takes effect in 48 hours and will reject automated requests, with no automatic fallback, for anyone who hasn't claimed their credit and configured overflow billing. If you run claude -p, Claude Code GitHub Actions, or any third-party Agent SDK integration, check your billing settings today. The vLLM breaking change (Transformers v4 removal) requires a migration step but is well-documented; the Anthropic billing change requires a claim action before June 15 or your pipelines stop.

Must-reads today

Anthropic Agent SDK billing split (June 15, in 48 hours): claude -p, Claude Code GitHub Actions, and third-party agents move to a separate per-user credit ($20–$200); unclaimed credits mean rejected requests starting Monday.

vLLM v0.23.0 — Model Runner V2 default for Llama+Mistral, Transformers v4 fully removed, multi-tier KV cache offloading GA. Biggest vLLM release since v0.21.0.

Kimi K2.7 Code — Moonshot AI drops a 1T-parameter open-weight coding model (32B active) under Modified MIT. No independent benchmarks yet, but the weights are on Hugging Face today.

Breaking Changes

Breaking

Anthropic Agent SDK Billing Split — Effective June 15 (48 Hours)

What changed

Claude Agent SDK, claude -p, Claude Code GitHub Actions, and all third-party apps authenticating through the Agent SDK are removed from subscription usage limits and moved to a new, separately billed "Agent SDK Credit" pool effective June 15. Previously, all Claude usage (interactive and programmatic) counted against a single subscription limit. After June 15, the credit pool is per-user, non-pooled, non-rollover: $20 (Pro) / $100 (Max 5x) / $200 (Max 20x) billed at standard API list rates. Once exhausted, automated requests are rejected outright — there is no automatic fallback. Interactive Claude Code terminal sessions and Claude.ai chat are not affected.
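For a rough sense of how far a credit goes when it is billed at API list rates, the arithmetic can be sketched as below. The per-million-token prices and the per-run token counts are placeholder assumptions, not Anthropic's actual rates or anyone's measured usage; substitute figures from the current price list and your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunProfile:
    """Average token usage of one automated run (hypothetical numbers)."""
    input_tokens: int
    output_tokens: int


def cost_per_run(profile: RunProfile, input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Dollar cost of one run, given prices per million tokens."""
    return (profile.input_tokens * input_price_per_mtok
            + profile.output_tokens * output_price_per_mtok) / 1_000_000


def runs_covered(credit_usd: float, run_cost_usd: float) -> int:
    """Whole runs the monthly credit pays for before requests are rejected."""
    if run_cost_usd <= 0:
        raise ValueError("run cost must be positive")
    return int(credit_usd // run_cost_usd)


# Placeholder prices in USD per million tokens; NOT real list rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Hypothetical CI job: 40K tokens in, 4K tokens out per run.
ci_run = RunProfile(input_tokens=40_000, output_tokens=4_000)
run_cost = cost_per_run(ci_run, INPUT_PRICE, OUTPUT_PRICE)

# Credit sizes are the ones quoted in the digest.
for plan, credit in {"Pro": 20, "Max 5x": 100, "Max 20x": 200}.items():
    print(f"{plan}: ~{runs_covered(credit, run_cost)} runs per month")
```

Because the credit is per-user and does not roll over, the estimate applies to each account separately rather than to a team total.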

TL;DR

Starting June 15, programmatic Claude use through the Agent SDK draws on a separate $20–$200 monthly credit billed at API rates. Once the credit runs out, requests are rejected unless overflow billing is already enabled.

Developer signal

Two actions are required, both before June 15: (1) Watch for Anthropic's credit-claim email (sent around June 8) and click the claim link. Credits are not allocated automatically; you must claim once per account. (2) Decide whether to enable "usage credits" (Anthropic's overflow toggle), which lets Agent SDK usage beyond the monthly credit bill at standard API rates rather than being rejected. To check or enable it: Anthropic Console → Settings → Billing → Usage Credits. If you run claude -p in CI/CD, Claude Code GitHub Actions (claude-code-action), or any third-party agent integration (AutoGen, CrewAI, LangChain agents calling Claude), test your billing configuration before Monday. Important: the credit is per-user, not per-organization. If multiple developers use the Agent SDK on a shared team plan, each user needs to claim separately. The change was announced May 14 via the Anthropic Help Center rather than an API changelog, so teams that only track changelogs may have missed it.

Affects you if: You use claude -p in scripts or pipelines; you have Claude Code GitHub Actions (uses: anthropics/claude-code-action@v*) in your CI/CD; you build products using the Anthropic Agent SDK or Managed Agents; you use third-party tools (AutoGen, CrewAI, etc.) that authenticate through the Agent SDK

Effort: Moderate (claim the credit email + configure overflow billing toggle in Console; no code changes required, but pipelines will silently fail if you miss the deadline)

Anthropic Help Center | Date: Announced May 14, 2026; effective June 15, 2026 | Link: https://support.anthropic.com (Help Center announcement, referenced in https://github.com/anthropics/claude-code/issues/59823)

Model Releases

Medium

Kimi K2.7 Code — Moonshot AI's 1T-Parameter Open-Weight Coding Model

What changed

Fifth major release in the K2 series in under a year. K2.7 Code is a new Mixture-of-Experts coding model with 1 trillion total parameters and 32 billion active parameters, targeting improved reasoning efficiency and agentic coding tasks. The key change from K2.6 is a 30% reduction in reasoning-token usage with claimed improvements on Moonshot's proprietary coding benchmarks. Weights available on Hugging Face under Modified MIT License.

TL;DR

Kimi K2.7 Code lands June 12 with 1T params / 32B active, Modified MIT license, 256K context window, and +21.8% on Kimi Code Bench v2 over K2.6 — but all published benchmarks are Moonshot-proprietary; no SWE-bench Verified, LiveCodeBench, or GPQA Diamond scores exist yet.

Developer signal

The weights are live on Hugging Face (MoonshotAI/Kimi-K2.7-Code) under Modified MIT, making this deployable for commercial use. The 32B active-parameter count sets per-token compute, not memory footprint: all 1T parameters must be resident or offloaded, so the weights alone are roughly 500 GB even at 4-bit quantization. Plan for a multi-GPU node rather than a single A100 80GB. Context window is 256K (down from the longer windows of some prior K2.x variants; verify for your use case). The 30% reasoning-token reduction is the most concrete developer signal: if you're running K2.6-class models in agentic loops with extended thinking, K2.7 should produce equivalent output with fewer output tokens billed, improving cost efficiency. Critical caveat: Every published benchmark for K2.7 Code is Moonshot's own proprietary suite (Kimi Code Bench v2, Program Bench, MLS Bench Lite). There are no independent SWE-bench Verified, LiveCodeBench, or AIME 2025 numbers as of June 13. VentureBeat reported "practitioners say the benchmarks don't check out" in early community testing. Wait for third-party evaluations on standard public benchmarks before deploying in place of a model with verified scores.
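As a sizing check before downloading, the weight footprint of a Mixture-of-Experts model can be estimated from its total parameter count, since every expert has to be loadable even though only a fraction is active per token. The sketch below covers weights only and ignores KV cache, activations, and runtime overhead; the 80 GB per-GPU figure is an illustrative assumption.

```python
import math


def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def min_gpus(weights_gb: float, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count if weights were the only thing in memory."""
    return math.ceil(weights_gb / gpu_memory_gb)


TOTAL_PARAMS = 1e12   # 1T total parameters, from the release notes
GPU_MEMORY_GB = 80    # assumed accelerator size, for illustration

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{label}: {gb:,.0f} GB of weights, at least {min_gpus(gb, GPU_MEMORY_GB)} GPUs")
```

The real requirement will be higher once KV cache and framework overhead are included, so treat the GPU counts as a floor.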

Affects you if: You are building or fine-tuning on open-weight coding models; you have hardware that can hold a 1T-parameter MoE (32B active); you use K2.6 in agentic loops and want to reduce reasoning token costs

Effort: Moderate (download weights from Hugging Face, serve via vLLM or similar; no fine-tuning migration needed if using base weights; verify context window and tokenizer compatibility)

Moonshot AI / Hugging Face | Date: June 12, 2026 | Link: https://huggingface.co/MoonshotAI/Kimi-K2.7-Code

API & SDK Changes

No API or SDK breaking changes other than the Anthropic Agent SDK billing split (see [BREAKING] above). No new releases of anthropic-sdk-python (last: v0.109.1, June 9), openai-python (last: v2.41.1, June 10), or other primary SDK clients in the June 12–13 window.

Research

Nothing cleared the quality bar this period. arXiv cs.AI and cs.CL listing pages returned 403 on direct fetch. Hugging Face Papers Daily also returned 403. No qualifying papers were confirmed via search within the June 12–13 window. See near-misses.

Tooling

High

vLLM v0.23.0 — Model Runner V2 Default, Transformers v4 Removed, Multi-tier KV Cache GA

What changed

Model Runner V2 (MRv2) is now the default for Llama and Mistral dense models (was only default for Qwen3 in v0.22.x). Transformers v4 support is fully removed — vLLM now requires Transformers v5. Multi-tier KV cache offloading with object-store secondary tier now GA with HMA enabled by default. New model support: Gemma 4 Unified (encoder-free), DeepSeek-V4 with TRTLLM attention kernel, Step-3.7-Flash, Cosmos3 Reasoner, MiMo-V2.5, Cohere Mini Code, Granite Speech Plus. Rust frontend expanded with streaming, dynamic LoRA, and tool parsers for InternLM2, Phi-4-mini, Gemma4. AMD ROCm upgraded to v7.2.3 with native W4A16 kernels. NVIDIA CUTLASS FP8 padding bypass delivers +20% throughput improvement. Anthropic Messages API structured output now supported.

TL;DR

vLLM v0.23.0 (June 13, 408 commits from 200 contributors) makes Model Runner V2 the default for Llama+Mistral, removes Transformers v4 (breaking), and brings multi-tier KV cache offloading to GA — the biggest release since v0.21.0.

Developer signal

Three immediate actions: (1) Check your Transformers version. Transformers v5 is now required: pip install "transformers>=5" (quoted so the shell does not treat >= as a redirect). If your environment pins transformers<5 (common in older Dockerfile setups), vLLM v0.23.0 will fail to import. Run pip show transformers and upgrade before deploying. (2) Test Model Runner V2 for your Llama/Mistral workloads. MRv2 is the default now but can be disabled with --no-enable-mrv2 if you hit regressions. MRv2 delivers breakable CUDA graphs, pipeline-parallel bubble elimination, and FlashInfer sampling. (3) Evaluate multi-tier KV cache offloading. If you're memory-constrained on long-context workloads, --kv-cache-offloading-policy now supports object-store (S3-compatible) as a secondary tier with HMA enabled by default. The JAISLMHeadModel class is also removed; if you serve JAIS models, you need a workaround. The Anthropic Messages API structured output support means vLLM servers can now speak the Anthropic wire format directly for structured generation, without an adapter.
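The Transformers check in step (1) can be run as a preflight in a deploy script. This is a minimal sketch using only the standard library; the function names are illustrative, and it inspects just the leading major version number, which is an assumption about what "v5 required" means in practice.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Leading integer of a version string such as '5.1.0' or '5.0.0rc1'."""
    return int(version.split(".")[0])


def transformers_ok(required_major: int = 5) -> bool:
    """True if an installed transformers package meets the required major version."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False  # not installed at all
    return major_version(installed) >= required_major


if __name__ == "__main__":
    if transformers_ok():
        print("transformers is v5 or newer")
    else:
        print('transformers is missing or older than v5; run: pip install "transformers>=5"')
```

Running this in the image build, before vLLM is imported, surfaces a stale pin as a readable message instead of an import failure at serve time.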

Affects you if: You serve Llama, Mistral, DeepSeek, or Gemma models via vLLM; you have environments pinned to Transformers v4; you use JAISLMHeadModel; you are memory-constrained on long-context inference; you use AMD ROCm for inference

Effort: Moderate (upgrade Transformers to v5, test MRv2 defaults on your model/workload, find a workaround for JAIS models if applicable)

vllm-project/vllm GitHub | Date: June 13, 2026 | Link: https://github.com/vllm-project/vllm/releases/tag/v0.23.0

Medium

Claude Code v2.1.176 — Language-Aware Session Titles, Bedrock Credential Fix, Security Patch

What changed

Session titles are now generated in the language of your conversation (configurable via language setting). Added footerLinksRegexes setting for regex-matched link badges in footer. Bedrock credential caching from awsCredentialExport now respects the actual Expiration field rather than a fixed 1-hour ceiling. Security fix: availableModels enforcement now prevents alias model picks from being redirected to a blocked model via ANTHROPIC_DEFAULT_*_MODEL env vars, and /fast refuses to toggle when it would switch to a model outside the allowlist. Auto mode Fable 5 fallback fixed for organizations without Opus 4.8 enabled. Hook if conditions for Read/Edit/Write tool paths (Edit(src/**), Read(~/.ssh/**), Read(.env)) now match correctly. Linux sandbox fixed for symlinked settings.json. /copy and mouse-selection copy fixed inside tmux over SSH. Remote Control and /cd worktree branch-tracking fixes.

TL;DR

Claude Code v2.1.176 ships a security fix (blocked model alias bypass closed), Bedrock credential caching correctness, hook if path matching, and multilingual session titles — primarily a hardening release with multiple reliability fixes.

Developer signal

Three items require attention: (1) Hook if conditions. If you have hooks configured with patterns like Edit(src/**), Read(~/.ssh/**), or Read(.env), these were silently not matching before v2.1.176. After updating, these patterns fire correctly, so audit your .claude/settings.json hooks if you rely on selective hook triggering: you may see hook executions that weren't firing before. (2) Bedrock credential caching. If you use awsCredentialExport for AWS Bedrock and had a credential rotation window under 1 hour, Claude Code was previously over-caching credentials (fixed 1-hour ceiling). After v2.1.176, it caches until the actual Expiration. Verify your STS token rotation timing hasn't changed in a way that affects your workflow. (3) Model allowlist security. The ANTHROPIC_DEFAULT_*_MODEL bypass is closed: if your org uses availableModels to restrict which models can be used, env var overrides can no longer route to blocked models. Check that any CI/CD pipelines relying on ANTHROPIC_DEFAULT_*_MODEL env vars still point to allowed models. Run claude update to get v2.1.176.
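The allowlist check in item (3) can be scripted for CI. The sketch below flags any environment variable matching the ANTHROPIC_DEFAULT_*_MODEL pattern whose value is not in an allowlist. The allowlist contents, the sample environment, and the exact-string comparison are all assumptions for illustration; Claude Code's own enforcement may resolve aliases differently.

```python
import re

# Variable names of the form ANTHROPIC_DEFAULT_<something>_MODEL.
OVERRIDE_PATTERN = re.compile(r"^ANTHROPIC_DEFAULT_.+_MODEL$")


def blocked_overrides(env: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Return the override variables whose value is not in the allowlist."""
    return {
        name: value
        for name, value in env.items()
        if OVERRIDE_PATTERN.match(name) and value not in allowed
    }


# Hypothetical allowlist and environment, for illustration only.
allowed_models = {"claude-sonnet-4-6-20260217", "claude-opus-4-8"}
sample_env = {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6-20260217",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-1-20250805",
    "PATH": "/usr/bin",
}

print(blocked_overrides(sample_env, allowed_models))
```

In a pipeline, pass dict(os.environ) in place of sample_env and fail the job when the returned mapping is non-empty.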

Affects you if: You configure hooks with if path conditions on Read/Edit/Write; you use AWS Bedrock with awsCredentialExport; you use availableModels model restrictions in your org; you run Claude Code inside tmux over SSH; you use Claude Code on Linux with a symlinked settings.json

Effort: Quick (run claude update; review hook if conditions if you rely on selective firing)

anthropics/claude-code GitHub | Date: June 12, 2026 (21:53 UTC) | Link: https://github.com/anthropics/claude-code/releases/tag/v2.1.176

Benchmarks & Leaderboards

No new leaderboard movements confirmed in the June 12–13 window. The most recent LMArena leaderboard update remains June 10 (Fable 5 additions across five Arena categories, covered in the June 11 digest). SWE-bench Verified and SWE-bench Pro: no new independent submissions confirmed in window. Kimi K2.7 Code benchmarks are proprietary only — no SWE-bench Verified entry placed.

Trends & Emerging Tech

The Inference Tooling Stack Is Consolidating Around Two Primitives: Speculative Decoding and Multi-Tier KV Cache

What's happening

Two releases in this digest's window, vLLM v0.23.0 (multi-tier KV cache offloading to object stores GA) and llama.cpp b9626 (Cohere2-MoE architecture support), continue a pattern visible across the last 30 days: every major inference engine release is now shipping at least one of speculative decoding or KV cache offloading as a table-stakes feature. Yesterday's digest covered EAGLE3 in llama.cpp (2–3× throughput). Today's vLLM release brings multi-tier KV cache offloading to GA with HMA enabled by default. These two primitives together address the two dominant inference bottlenecks: throughput (speculative decoding) and memory (KV cache offloading). The combination is particularly powerful for long-context deployments: Model Runner V2's pipeline-parallel bubble elimination, together with HMA-backed KV offloading, addresses what was previously a throughput cliff at >200K context.
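The idea behind tiered KV cache offloading can be shown with a toy model: a small fast tier holds the most recently used blocks, and blocks that fall out of it are moved to a larger slow tier instead of being discarded and recomputed. This is an illustration of the concept only, not vLLM's implementation; the class name, the LRU eviction policy, and the plain dictionaries standing in for GPU memory and an object store are all assumptions.

```python
from collections import OrderedDict
from typing import Optional


class TwoTierCache:
    """Toy two-tier cache: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for GPU memory, kept in LRU order
        self.slow = {}             # stands in for host memory or an object store

    def put(self, block_id: str, data: bytes) -> None:
        """Insert a block into the fast tier, offloading the oldest if full."""
        self.slow.pop(block_id, None)  # a block lives in exactly one tier
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        while len(self.fast) > self.fast_capacity:
            evicted_id, evicted = self.fast.popitem(last=False)  # least recently used
            self.slow[evicted_id] = evicted  # offload instead of discarding

    def get(self, block_id: str) -> Optional[bytes]:
        """Fetch a block, promoting it to the fast tier if it was offloaded."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        if block_id in self.slow:
            data = self.slow.pop(block_id)
            self.put(block_id, data)
            return data
        return None  # a real engine would recompute the block here


cache = TwoTierCache(fast_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, name.upper().encode())
print(sorted(cache.slow))  # blocks currently offloaded
```

The trade-off the toy makes visible is the one that matters in production: a hit in the slow tier costs a transfer, but a miss costs recomputing the prefix.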

Why watch this

The convergence of these primitives across vLLM, llama.cpp, and (via Unsloth yesterday) fine-tuning toolchains suggests the self-hosted inference stack is entering a period of rapid capability catch-up with managed cloud APIs. Developers evaluating build-vs-buy for inference should revisit their benchmarks — the performance gap at 70B scale has materially narrowed in the last 60 days. The next 30 days will determine whether Ollama and llama.cpp absorb multi-tier KV cache offloading as standard configurations or whether vLLM's enterprise-first architecture maintains a clear advantage at high-concurrency production scale.

vllm-project/vllm GitHub + ggml-org/llama.cpp GitHub | Date: June 12–13, 2026 | Link: https://github.com/vllm-project/vllm/releases/tag/v0.23.0

Technical Discussions

Nothing cleared the quality bar this period. The most relevant community signal is a VentureBeat article ("Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out") published June 12, but the article could not be fetched directly and does not itself link to a primary benchmark source — could not verify concrete data for inclusion.

Quick Hits

Ollama v0.30.8 (June 12) — Fixed ollama launch provider selection; decoupled prompt caching from context shift for better KV cache reuse; hardened MLX linear and embedding layers for more stable inference on Apple Silicon; MLX runner now creates snapshots during prompt processing and speculative decoding; improved recurrent model support. [https://github.com/ollama/ollama/releases/tag/v0.30.8]
llama.cpp b9626 (June 13, 18:19) — Adds Cohere2-MoE architecture support (PR #24260), including North-Mini-Code-1.0 (30B-A3B) in BF16, Q4_K, and NVFP4 quantizations. Expert routing and tokenizer handling improved. [https://github.com/ggml-org/llama.cpp/releases/tag/b9626]
llama.cpp b9622 (June 13, 14:15) — Vulkan backend: non-contiguous unary and GLU ops now supported with stride. Improves quantized inference throughput on Vulkan-capable hardware (including consumer GPUs on Linux without CUDA). [https://github.com/ggml-org/llama.cpp/releases/tag/b9622]
LiteLLM v1.84.8 (June 13, 01:45) — Bug fix backport to stable/1.84.x branch. If you are pinned to stable/1.84.x, upgrade to 1.84.8 for accumulated fixes. [https://github.com/BerriAI/litellm/releases]
llama.cpp b9620/b9621/b9623/b9624/b9625 (June 12–13) — Server static asset cleanup and file naming simplification (b9620); UI file path preservation, nocache fix (b9621); build-time gzip compression for UI bundle (b9624); Jinja template fixes for split/replace empty args and negative-step slices (b9623, b9625). Maintenance releases, no model or inference changes. [https://github.com/ggml-org/llama.cpp/releases]

Worth Watching (Announced, Not Yet Shipped)

⚠️⚠️⚠️⚠️ JUNE 15 DOUBLE HIT — Claude Sonnet 4 / Opus 4 Retirement + Agent SDK Billing Split (2 DAYS)

Two separate Anthropic changes take effect June 15 simultaneously:

claude-sonnet-4-20250514 and claude-opus-4-20250514 return errors from June 15. If not migrated: move to claude-sonnet-4-6-20260217 and claude-opus-4-8. The Opus 4.8 migration requires removing non-default temperature, top_p, or top_k parameters (returns 400 otherwise) and updating budget_tokens → adaptive thinking.

Agent SDK credit split activates June 15. Claim your credit and enable overflow billing in Anthropic Console before Monday or automated requests fail.

(Full details on the billing split in the [BREAKING] section above.)

Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

Anthropic Help Center (May 14, 2026) | Link: referenced via https://github.com/anthropics/claude-code/issues/59823
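The model-retirement half of this item can be handled with a small request-rewriting helper. The sketch below operates on a plain dictionary of request parameters and makes no SDK calls. The model ID mapping comes from the text above; the helper name, the choice to drop all three sampling parameters rather than only non-default ones, and the sample request are assumptions. Since the target shape for adaptive thinking is not specified here, the helper only flags budget_tokens for manual follow-up.

```python
# Retired model IDs and their replacements, as listed in this digest.
OLD_TO_NEW = {
    "claude-sonnet-4-20250514": "claude-sonnet-4-6-20260217",
    "claude-opus-4-20250514": "claude-opus-4-8",
}
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def migrate_request(params: dict) -> tuple[dict, list[str]]:
    """Return a migrated copy of request params plus notes for manual follow-up."""
    migrated = dict(params)  # shallow copy; the caller's dict is left untouched
    notes = []

    old_model = migrated.get("model")
    if old_model in OLD_TO_NEW:
        migrated["model"] = OLD_TO_NEW[old_model]
        notes.append(f"model: {old_model} -> {migrated['model']}")

    if migrated.get("model") == "claude-opus-4-8":
        # Conservative: drop the sampling parameters entirely so no
        # non-default value can trigger a 400.
        for key in SAMPLING_KEYS:
            if key in migrated:
                del migrated[key]
                notes.append(f"removed {key}")
        thinking = migrated.get("thinking")
        if isinstance(thinking, dict) and "budget_tokens" in thinking:
            notes.append("thinking.budget_tokens present: switch to adaptive thinking manually")

    return migrated, notes


# Illustrative request, not taken from any real pipeline.
request = {
    "model": "claude-opus-4-20250514",
    "temperature": 0.2,
    "max_tokens": 1024,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
}
new_request, follow_ups = migrate_request(request)
for note in follow_ups:
    print(note)
```

Logging the notes in CI gives a reviewable list of what changed per call site before the old IDs start returning errors.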

⚠️⚠️⚠️ Gemini CLI Hard Stop — June 18 (5 days)

(Countdown updated from yesterday's digest)

gemini CLI and Gemini Code Assist IDE extensions stop serving requests June 18. Replacement: Antigravity CLI (agy). Antigravity CLI does not have 1:1 feature parity — audit CI pipeline steps before the cutoff.

Google Developers Blog | Link: https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/
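One way to start the CI audit is to list every workflow line that appears to invoke the gemini command. The sketch below is a heuristic text scan, not a parser: the regular expression, the default .github/workflows location, and the function names are assumptions, and it will miss invocations hidden behind wrapper scripts or aliases.

```python
import re
from pathlib import Path

# `gemini` as a standalone command word (not part of a path, model ID,
# or longer name), or any mention of `gemini-cli`.
GEMINI_CMD = re.compile(r"(?<![\w./-])gemini(?![\w.-])|gemini-cli")


def find_gemini_calls(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that appear to invoke the gemini CLI."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if GEMINI_CMD.search(line)
    ]


def scan_workflows(root: str = ".github/workflows") -> dict[str, list[tuple[int, str]]]:
    """Scan YAML workflow files under root; a missing directory yields no hits."""
    hits = {}
    base = Path(root)
    if not base.is_dir():
        return hits
    for path in sorted(base.glob("*.y*ml")):
        found = find_gemini_calls(path.read_text(encoding="utf-8"))
        if found:
            hits[str(path)] = found
    return hits


if __name__ == "__main__":
    for filename, lines in scan_workflows().items():
        for number, line in lines:
            print(f"{filename}:{number}: {line}")
```

Each hit still needs a manual look, since the replacement CLI does not have 1:1 feature parity and a straight rename of the command may not be enough.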

⚠️⚠️⚠️ Gemini API Unrestricted Key Deadline — June 19 (6 days)

(Countdown updated)

All unrestricted Gemini API keys blocked June 19. Restrict via AI Studio → API Keys → "Restrict to Gemini API." ~2 minutes; no code changes.

Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/api-key

⚠️⚠️ Gemini Image Models Shutdown — June 25 (12 days)

(Countdown updated)

gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shutting down June 25. Migrate to stable image model equivalents.

Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/deprecations

⚠️⚠️ GPT-4.5 Retirement from ChatGPT — June 27 (14 days)

(Countdown updated)

GPT-4.5 removed from ChatGPT product surface June 27. API route retirement unconfirmed. Audit gpt-4.5 model identifiers.

OpenAI Platform Changelog | Link: https://platform.openai.com/docs/changelog

⚠️⚠️ Grok V9-Medium — Still Pending (est. mid-June, any day)

(Status unchanged — no confirmed launch as of June 13)

Training completed late May; SFT and RL underway. Mid-June release window still open as of June 13. 1.5 trillion parameters, Cursor-data training, coding-focused. No API pricing, model ID, or benchmark numbers confirmed. No launch detected in June 12–13 scan window.

xAI / Elon Musk announcement, May 25, 2026 | Link: https://x.ai/news

⚠️⚠️ Gemini 3.5 Pro — Still Pending, June 2026

(Status unchanged)

Still in limited Vertex enterprise preview. Expected: 2M token context, Deep Think reasoning mode. Watch ai.google.dev.

Google I/O 2026 / Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/models

⚠️ Kimi K2.7 Code — Third-Party Benchmarks — Expected ~June 22

Weights landed June 12. Third-party SWE-bench Verified and LiveCodeBench evaluations typically appear 7–14 days after open-weight release. The gap between vendor-claimed benchmarks and independent evaluation matters here given community skepticism. Watch paperswithcode.com and the SWE-bench Verified leaderboard around June 20–25 for independent scores.

⚠️ Claude Opus 4.1 Retirement — August 5 (53 days)

(Countdown updated)

claude-opus-4-1-20250805 retires August 5. Migrate to claude-opus-4-8.

Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

⚠️ OpenAI Reusable Prompts / Evals Platform / Agent Builder Shutdown — November 30 (170 days)

Three products deprecated June 3. Export eval configs before October 31 (read-only from that date). Migrate Agent Builder to Agents SDK or ChatGPT Workspace Agents. Move prompt content from v1/prompts to application code.

OpenAI | Link: https://platform.openai.com/docs/deprecations

⚠️ Aion 1.0 Open Weights — July 2026 (~3 weeks)

(Carried — status unchanged)

Aion 1.0 Instruct open weights land on Hugging Face in July 2026. No confirmed specific date.

Windows Developer Blog | Link: https://blogs.windows.com/windowsdeveloper/2026/06/02/build-2026-furthering-windows-as-the-trusted-platform-for-development/

Apple iOS 27 / macOS Golden Gate / Core AI GA — Fall 2026 (September)

(Carried — status unchanged)

iOS 27, iPadOS 27, macOS Golden Gate ship September 2026. Includes: Siri Extensions API, Core AI (replaces Core ML), Foundation Models multi-provider support. Developer Beta 1 available now.

Apple Developer / WWDC 2026 | Link: https://developer.apple.com/ios/

Claude Mythos 5 General Availability — No Timeline

(Carried — status unchanged)

Currently only for vetted Project Glasswing participants.

Anthropic | Link: https://www.anthropic.com/news/expanding-project-glasswing

Filtered from 30+ primary sources against a published quality rubric. No press releases, no fluff — only what changes what you build.
