📡

AI Developer Digest

Sat, Jun 6, 2026

9 items passed quality gate | ~65 scanned | ~56 excluded | Sources checked: 26 | Scan window: June 5–6, 2026 (24h). Prior digest covered: Claude Code v2.1.163/165; llama.cpp b9522–b9535 (Vulkan Intel FWHT, KleidiAI scheduling); Ollama v0.30.5/0.30.6; LiteLLM v1.87.1; ChatGPT Dreaming V3.


This Week's Signal

Two threads worth holding together: Anthropic's deprecation of Claude Opus 4.1 (60-day window, retirement August 5) carries a sting that only shows on closer inspection — if you're on Opus 4.1, the path to Opus 4.8 runs through 4.7's breaking changes, not just a model ID swap. Simultaneously, Claude Code v2.1.166 ships fallbackModel, which is exactly the kind of infrastructure you'd want when a primary model retires. The research side tells a harsher story: two new agent benchmarks (CL-Bench and DeployBench) surfaced within hours of each other, both finding that current frontier models perform worse than most builders assume when the task requires real adaptation or deployment from scratch.

Must-reads this digest:

  • Anthropic Claude Opus 4.1 deprecation (Aug 5): not a simple model swap; migrating to Opus 4.8 requires addressing all Opus 4.7 breaking changes first
  • Claude Code v2.1.166: fallbackModel, --thinking disabled, glob deny rules, and 15+ bug fixes including the JetBrains 2026.1 flickering issue
  • CL-Bench / DeployBench: back-to-back papers showing frontier agents barely improve with experience and deploy research artifacts at pass rates of 7.8%–51.0%

[BREAKING] Breaking Changes

No breaking changes this period.


Model Releases

No new model releases in this 24h period.


API & SDK Changes

[MEDIUM] Claude Opus 4.1 Deprecated — Retirement August 5, 2026

Source: Anthropic | Date: June 5, 2026 | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

What changed: claude-opus-4-1-20250805 moved from Active to Deprecated on June 5, with API retirement scheduled for August 5, 2026. The recommended replacement is claude-opus-4-8.

TL;DR: Anthropic deprecated Claude Opus 4.1 on June 5, giving roughly 60 days until API retirement on August 5. It is a straightforward ID swap in name only: migrating to Opus 4.8 from a pre-4.7 model requires addressing all of the Opus 4.7 breaking changes.

Developer signal: Retirement is August 5, about 60 days out, so there is no hard urgency yet, but the migration is more involved than it looks. If your application runs on Opus 4.1 (released August 5, 2025), you are below the Opus 4.7 threshold, which means the full 4.7 breaking-change set applies when you migrate to 4.8. Specifically:

  • (1) thinking: {type: "enabled", budget_tokens: N} is removed in 4.7+. Switch to thinking: {type: "adaptive"} and set effort via output_config.effort.
  • (2) Setting temperature, top_p, or top_k to non-default values returns a 400 error on 4.7+.
  • (3) Thinking content display defaults to "omitted" rather than returning summarized reasoning. Set thinking.display: "summarized" if your product streams reasoning to users.
  • (4) The new tokenizer may use up to 35% more tokens on the same prompts, so adjust max_tokens budgets.
  • (5) Prefilling assistant messages returns a 400 error.

The fastest migration path: in Claude Code, run /claude-api migrate this project to claude-opus-4-8. The skill applies the model ID swap and all required parameter changes, with your confirmation. For Managed Agents callers, only the model name change is needed. Check your usage CSV in Claude Console (Usage → Export) to find all active Opus 4.1 deployments now rather than in late July.

Affects you if: You are calling claude-opus-4-1-20250805 directly on the Claude API or on Claude Platform on AWS. Partner platforms (Bedrock, Vertex AI) set their own retirement schedules separately.

Adoption effort: Significant (migration from a pre-4.7 model requires removing extended thinking syntax, sampling params, and prefills, not just a model name change)

Primary source: https://platform.claude.com/docs/en/about-claude/model-deprecations

Quality gate score: 9 (official Anthropic source +3, concrete model IDs and retirement dates +2, primary deprecation page link +2, within scan window +1, technical audience +1)
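The five parameter changes listed above can be sketched as a pure transformation on a request payload. This is a minimal illustration, assuming the field names exactly as this digest reports them (thinking.type, thinking.display, output_config.effort). The helper name migrate_request and the 35% max_tokens headroom rule are assumptions for illustration, not part of any Anthropic SDK or of the /claude-api skill.

```python
import copy

SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def migrate_request(old: dict, effort: str, target_model: str = "claude-opus-4-8",
                    stream_reasoning: bool = False) -> dict:
    """Return a copy of an Opus 4.1-style request reshaped for a 4.7+ model.

    Hypothetical helper: field names follow the digest, not a verified API schema.
    """
    req = copy.deepcopy(old)
    req["model"] = target_model

    # (1) budget_tokens-style thinking becomes adaptive thinking plus an effort setting.
    thinking = req.get("thinking")
    if thinking and thinking.get("type") == "enabled":
        req["thinking"] = {"type": "adaptive"}
        req.setdefault("output_config", {})["effort"] = effort
        # (3) Opt back in to summarized reasoning only if the product displays it.
        if stream_reasoning:
            req["thinking"]["display"] = "summarized"

    # (2) Non-default sampling parameters are rejected, so drop them entirely.
    for name in SAMPLING_PARAMS:
        req.pop(name, None)

    # (4) Assumed headroom rule: raise max_tokens by 35%, rounded up.
    if "max_tokens" in req:
        req["max_tokens"] = (req["max_tokens"] * 135 + 99) // 100

    # (5) A trailing assistant message is a prefill; remove it.
    messages = req.get("messages", [])
    if messages and messages[-1].get("role") == "assistant":
        req["messages"] = messages[:-1]

    return req
```

Running the helper over logged production requests and diffing the output is a cheap way to see which of the five changes actually affect a given integration before touching application code.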


[MEDIUM] Claude Code v2.1.166: fallbackModel, --thinking disabled, Glob Deny Rules, JetBrains Fix

Source: Anthropic / Claude Code GitHub | Date: June 6, 2026 | Link: https://github.com/anthropics/claude-code/releases/tag/v2.1.166

What changed: v2.1.166 adds fallbackModel (ordered list of up to three fallback models tried when the primary is overloaded or unavailable), extends --fallback-model to interactive sessions, adds glob pattern support in deny rule tool positions, adds MAX_THINKING_TOKENS=0 / --thinking disabled / per-model thinking toggle to silence thinking on default-thinking models, announces the download target before starting claude update, and filters Claude agents list by session URL when typing a URL. Bug fixes include: recurring "image could not be processed" error consuming extra tokens; remote sessions stuck after brief backend disruption during worker registration; JetBrains IDE terminal flickering on 2026.1+ (IntelliJ, PyCharm, WebStorm); Shift+non-ASCII characters dropped in Kitty keyboard protocol terminals (WezTerm, Ghostty, kitty); PowerShell command validation hanging on Windows; orphaned --bg-pty-host processes spinning at 100% CPU on macOS; voice mode requiring /login after toggle; background agent session crash-loops in git worktrees; duplicated thinking text in Ctrl+O view during streaming; 15+ additional fixes.

TL;DR: Claude Code v2.1.166 ships fallbackModel for primary-model failover, a --thinking disabled flag to suppress default thinking on adaptive-thinking models, glob support in deny rules for blanket tool denials, and a significant batch of bug fixes including the JetBrains 2026.1+ flickering regression.

Developer signal: Four things to act on.

  • (1) fallbackModel: Set "fallbackModel": ["claude-opus-4-7", "claude-sonnet-4-6"] in your Claude Code managed settings or user settings to give Claude Code an ordered fallback list if the primary model returns an overloaded error. This is the resilience primitive that should have been there before the first wave of model retirements.
  • (2) --thinking disabled: If you are running automated pipelines with Claude Code using Opus 4.8 or other default-thinking models and want deterministic non-thinking responses (for cost control or latency), --thinking disabled or MAX_THINKING_TOKENS=0 suppresses thinking without changing the model. This replaces the previous workaround of using effort: "low".
  • (3) Glob deny rules: The deny rule tool-name position now accepts glob patterns. "*" denies all tools (useful for read-only or planning-only sessions where no tool execution is permitted). Combine with allow rules for fine-grained tool policy.
  • (4) JetBrains fix: If you use Claude Code inside IntelliJ, PyCharm, or WebStorm on the 2026.1+ release line and have been seeing terminal flickering, this release fixes it. Update via npm i -g @anthropic-ai/claude-code@latest.

Affects you if: You run Claude Code in automated workflows and need fallback model resilience; you run Claude Code in JetBrains IDEs on 2026.1+; you build with default-thinking models and want to suppress thinking selectively; you manage deny rules for tool permissions

Adoption effort: Quick (update; configure fallbackModel in settings if you want the resilience feature)

Primary source: https://github.com/anthropics/claude-code/releases/tag/v2.1.166

Quality gate score: 9 (official Anthropic source +3, concrete new features with specific technical details +2, GitHub primary source link +2, within scan window +1, technical audience +1)
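The ordered-failover behavior that fallbackModel describes can be sketched in a few lines. This is not Claude Code's implementation: ModelOverloaded, run_with_fallback, and the call callback are hypothetical names used only to illustrate "try the primary, then each fallback in order". The three-entry cap is taken from the release summary above.

```python
from typing import Callable, Sequence

MAX_FALLBACKS = 3  # cap reported for fallbackModel in the release summary


class ModelOverloaded(Exception):
    """Hypothetical error raised when a model is overloaded or unavailable."""


def run_with_fallback(primary: str, fallbacks: Sequence[str],
                      call: Callable[[str], str]) -> tuple[str, str]:
    """Try the primary model, then each fallback in order.

    Returns (model_used, result). Re-raises the last overload error
    if every model in the chain fails.
    """
    if len(fallbacks) > MAX_FALLBACKS:
        raise ValueError(f"at most {MAX_FALLBACKS} fallback models are allowed")

    last_error = None
    for model in (primary, *fallbacks):
        try:
            return model, call(model)
        except ModelOverloaded as err:
            last_error = err  # remember the failure and move down the list
    raise last_error
```

Only the overload error triggers the next attempt in this sketch; any other exception propagates immediately, which keeps real bugs from being masked by a silent model switch.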


Research

[MEDIUM] CL-Bench: Frontier Models Barely Improve with Experience in Stateful Environments

Source: UC Berkeley / University of Wisconsin-Madison / Snorkel AI | Date: June 4–5, 2026 (arXiv 2606.05661) | Link: https://arxiv.org/abs/2606.05661

What changed: Introduces the first expert-validated benchmark specifically designed to measure whether LLM-based systems genuinely improve with experience across stateful, real-world tasks, a capability assumed in production agent designs but never systematically tested before.

TL;DR: CL-Bench, from UC Berkeley / UW-Madison / Snorkel AI, tests continual learning in six domains across agent architectures (naive ICL through dedicated memory systems). It finds that even the best dedicated memory systems achieve only modest improvement over naive ICL: ACE (a dedicated memory system) reaches 8.6% normalized gain at $62.8 per full run, while ICL with Claude Sonnet 4.6 hits 13.5% stability gain on signal processing tasks and GPT-5.4 with Codex achieves 9% on the same tasks.

Developer signal: The key finding is that dedicated memory architectures (the kind builders spend weeks implementing) do not dramatically outperform naive in-context learning on the CL-Bench gain metric. The paper introduces a "gain metric" that isolates learning improvement from underlying model capability. That is the right framing, because a more capable model can "improve" just by being smarter, not by actually learning. Two implications for builders:

  • (1) Before investing in a custom memory layer for your agent, test whether naive context accumulation already provides most of the learning benefit on your actual task. CL-Bench indicates the bar for memory systems to win is higher than intuition suggests.
  • (2) The benchmark is public at continual-learning-bench.com and worth running your agent architecture against if stateful learning matters for your use case. The six domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, demand forecasting) cover a range of realistic applications.

The gain metric methodology, which isolates learning gain from prior capability, is worth borrowing for your own agent evals.

Affects you if: You are building stateful agents with long-term memory systems; you are evaluating whether to invest in memory architecture vs. just using a more capable model

Adoption effort: Moderate (run your architecture against the benchmark; methodology re-evaluation required for existing agent memory systems)

Primary source: https://arxiv.org/abs/2606.05661

Quality gate score: 6 (concrete benchmark numbers with specific models +2, arXiv primary source link +2, within scan window +1, technical audience +1; recognized institutions but not a frontier model lab, so no +3)
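One way to borrow the idea of separating learning gain from prior capability in an internal eval is sketched below. The paper's exact formula is not reproduced in this digest, so the headroom-normalized definition here, the function name normalized_gain, and the sample scores are all assumptions for illustration rather than CL-Bench's metric.

```python
def normalized_gain(baseline: float, with_experience: float, ceiling: float = 1.0) -> float:
    """Fraction of the remaining headroom recovered by learning (assumed definition).

    baseline:        score on held-out tasks with no prior experience
    with_experience: score on the same tasks after the agent has seen earlier ones
    ceiling:         best achievable score on the task
    """
    headroom = ceiling - baseline
    if headroom <= 0:
        raise ValueError("baseline already at or above ceiling; gain is undefined")
    return (with_experience - baseline) / headroom


# Hypothetical scores: a strong model that barely learns vs. a weaker one that learns more.
strong = normalized_gain(baseline=0.80, with_experience=0.82)
weak = normalized_gain(baseline=0.40, with_experience=0.52)
print(f"strong model gain: {strong:.2f}, weaker model gain: {weak:.2f}")
```

Under this assumed definition the strong model scores about 0.10 and the weaker one about 0.20, even though the strong model's final accuracy is higher. That separation of "got better" from "was already good" is the point of the exercise.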


[MEDIUM] DeployBench: Top LLMs Deploy Research Artifacts at 7.8%–51.0% Pass Rates

Source: arXiv 2606.05238 | Date: June 4–5, 2026 | Link: https://arxiv.org/abs/2606.05238

What changed: Introduces a benchmark specifically targeting the gap between agentic coding ability and real-world environment setup, an ability most existing agent benchmarks (including SWE-bench) assume away by providing a working environment.

TL;DR: DeployBench provides 51 research-artifact deployment tasks (AI/ML, computer systems, scientific computing) evaluated by a hidden pipeline that executes the paper's designated experiment and checks outputs. It finds that state-of-the-art LLMs with OpenHands achieve pass rates between 7.8% and 51.0%.

Developer signal: The 51% ceiling for best-in-class performance on real-world environment setup is the key number. These are tasks with complete instructions (a research paper and its artifact), not underspecified requests: the agent has to read, interpret, and execute a full software environment setup from scratch. The 7.8% floor shows how wide the distribution is. Three things to take from this:

  • (1) If you are building research or data science agents that need to set up software environments, the gap between what models do on SWE-bench and what they can actually do starting from a bare machine is substantial. DeployBench's task set is the most realistic measure of this capability available.
  • (2) The benchmark covers GPU/CUDA config, multi-language toolchains, and legacy artifact compatibility, the hard parts of real deployment that container-based evals exclude.
  • (3) The evaluation uses OpenHands as the agent harness; if you want to compare your own agent framework, the benchmark infrastructure is available at the OpenHands benchmarks repo.

Affects you if: You are building agents for scientific computing, ML research, or infrastructure automation that require self-directed environment setup

Adoption effort: Moderate (evaluate your agent against the benchmark; interpret deployment capability gaps before building production systems that assume agents can bootstrap their own environments)

Primary source: https://arxiv.org/abs/2606.05238

Quality gate score: 6 (concrete benchmark numbers: 7.8%–51.0% across 4 LLMs, 51 tasks +2, arXiv primary source link +2, within scan window +1, technical audience +1)
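As a sanity check on the headline range, the reported rates line up with whole-task counts if each rate is computed over all 51 tasks. That denominator is an assumption; the paper may aggregate differently (for example, over repeated runs).

```python
TOTAL_TASKS = 51  # task count reported for DeployBench


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


print(pass_rate(4))   # 7.8
print(pass_rate(26))  # 51.0
```

Under that assumption, the floor corresponds to 4 of 51 tasks passing and the ceiling to 26 of 51, which makes the result easier to picture: the best configuration still fails on roughly half of the artifacts.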


Tooling

No new major tooling releases in this 24h period. See Quick Hits for llama.cpp and SDK incremental updates.


Benchmarks & Leaderboards

LMArena Agent Arena Launches (June 4, just outside 24h window)

The Agent Arena leaderboard went live on June 4, 2026, ranking models on real-world agentic task evaluation at scale. Unlike the text/chat arena, which uses human preference votes, Agent Arena measures behavioral signals: file downloads, disapproval events, retries, tool reliability, task completion confirmation, steerability, instruction following, recovery speed, and hallucination rates. On June 5, mistral-medium-3.5 was added to the Code Arena WebDev leaderboard, and krea-2-medium, krea-2-large, and Cosmos3-Super-Text2Image were added to the Text-to-Image leaderboard.

No movement in the main text leaderboard ELO bands (top cluster ~1,480–1,561) or SWE-bench Verified (Claude Mythos Preview at 93.9% unchanged) for June 5–6. Full Agent Arena rankings visible at arena.ai/leaderboard/agent.


Trends & Emerging Tech

LMArena Shifts Agentic Evaluation from Preference Votes to Behavioral Signals

Source: Arena AI (LMArena) | Date: June 4, 2026 | Link: https://arena.ai/leaderboard/agent

What's happening: The Agent Arena launched measuring behavioral signals from real sessions, such as file downloads (task completion proxy), disapproval events (user corrections), retries (model failure recovery), and steerability, rather than asking users "which response do you prefer?" The distinction matters: a model that generates plausible-sounding but wrong tool calls might win preference votes while failing on behavioral signals. This is also the first time an LMArena leaderboard runs agentic evals at crowd scale rather than in a controlled evaluation harness.

Why watch this: If behavioral signal rankings diverge significantly from preference vote rankings for the same models, it suggests preference-based evals, the dominant evaluation method in most published agentic benchmarks today, are miscalibrated for agentic use cases. The Agent Arena's methodology is publicly available; the pattern of measuring behavioral traces rather than stated preferences is worth stealing for internal agent evals. Watch for the first leaderboard analysis post from the Arena team comparing behavioral rank vs. chat rank for the same model lineup.
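The divergence question can be made concrete with a rank correlation between two leaderboards. The sketch below computes Spearman's rho for two rankings of the same models. The model labels and both orderings are made up for illustration and are not Arena data.

```python
def spearman_rho(rank_a: list[str], rank_b: list[str]) -> float:
    """Spearman rank correlation for two orderings of the same items (no ties)."""
    if sorted(rank_a) != sorted(rank_b) or len(set(rank_a)) != len(rank_a):
        raise ValueError("both rankings must contain the same unique items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    position_in_b = {item: i for i, item in enumerate(rank_b)}
    d_squared = sum((i - position_in_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical orderings, best model first.
chat_rank = ["model_a", "model_b", "model_c", "model_d"]   # preference votes
agent_rank = ["model_b", "model_a", "model_d", "model_c"]  # behavioral signals
print(f"rho = {spearman_rho(chat_rank, agent_rank):.2f}")
```

A rho near 1 means the two leaderboards order models the same way; values well below 1 indicate the kind of divergence that would make chat preference rankings a poor guide for agentic deployment decisions.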

OpenAI Lockdown Mode Rolls Out to All Personal ChatGPT Accounts

Source: OpenAI | Date: June 4–5, 2026 | Link: https://help.openai.com/en/articles/20001061-lockdown-mode

What's happening: Lockdown Mode, previously enterprise-only, is now available to all personal ChatGPT accounts (Free, Go, Plus, Pro) and self-serve Business accounts. When enabled, it limits or disables live web access, image support in responses, Deep Research, Agent Mode, Canvas networking, live connectors, and file downloads, specifically to reduce the attack surface for prompt-injection-based data exfiltration. Critically, Lockdown Mode and Developer Mode cannot be used simultaneously: enabling either disables the other.

Why watch this: The Lockdown/Developer Mode mutual exclusion is the relevant developer constraint. If you build automated workflows using ChatGPT's web interface (Operator API, shared GPTs) and your users might enable Lockdown Mode, your workflow's tool-calling surface disappears. For organizations deploying ChatGPT to security-sensitive staff, Lockdown Mode could become a default policy that silently breaks agent-mode workflows. The move to personal accounts also signals that prompt injection defense is becoming a mainstream user-facing feature, which may accelerate similar controls appearing in the Responses API.


Technical Discussions

Nothing cleared the quality bar this period. No Hacker News threads with score >200 and concrete technical depth found for June 5–6, 2026. Simon Willison published a personal project post on June 6 ("Running Python code in a sandbox with MicroPython and WASM," releasing micropython-wasm as an alpha package for agent code execution sandboxing in Datasette) — interesting direction for lightweight agent sandboxing but alpha quality and personal project scope; moved to Horizon. No new posts from Nathan Lambert (most recent qualifying post: June 1), Eugene Yan, or Sebastian Raschka in the scan window.


Quick Hits

  • Claude Code v2.1.167 (June 6) — bug fixes and reliability improvements on top of v2.1.166; update to @latest to pick up both. [https://github.com/anthropics/claude-code/releases]
  • anthropic-sdk-python v0.106.0 (June 5) — marks claude-opus-4-1-20250805 as deprecated in SDK types; fixes Foundry client copy() and with_options() returning incorrect clients. [https://github.com/anthropics/anthropic-sdk-python/releases]
  • anthropic-sdk-python v0.107.0 (June 6) — small updates to Managed Agents types; no breaking changes. [https://github.com/anthropics/anthropic-sdk-python/releases]
  • LiteLLM v1.88.0-rc.3 (June 5, pre-release) — security hardening: hardens GHSA-q775 session-token budget-ceiling exemption against default_key_generate_params; do not deploy pre-releases to production, but track stable v1.88.0 for this fix. [https://github.com/BerriAI/litellm/releases]
  • llama.cpp June 6 builds (b9537–b9543) — b9543 adds "frame merge" support for Qwen3.5-based video/multimodal inference (first video-capable VLM support in this series); b9537 fixes off-by-one comparisons to n_gpu_layers that could silently misconfigure GPU layer assignment; b9536 (June 5) improves OpenCL get_rows, cpy, concat, and q6_k flat GEMV operations for non-CUDA GPU users. [https://github.com/ggml-org/llama.cpp/releases]

Worth Watching (Announced, Not Yet Shipped)

⚠️⚠️⚠️ Gemini API Legacy Schema (Interactions) — Hard Removal June 8 (2 days) — MOST URGENT

(Countdown updated — 2 days remaining) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/interactions-breaking-changes-may-2026 The Api-Revision: 2026-05-07 opt-out header stops working June 8. Applications using response.outputs structure must migrate to response.steps. Action today: grep your codebase for response.outputs and Api-Revision: 2026-05-07. 2 days is the entire remaining window — act today.
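The grep suggested above can also be run as a small script, which is handy in CI. This is a minimal sketch: the two search strings come from the migration notice as summarized here, while the function name find_legacy_usage and the file-suffix list are assumptions to adjust for your repository.

```python
from pathlib import Path

LEGACY_MARKERS = ("response.outputs", "Api-Revision: 2026-05-07")
# Assumed set of file types worth scanning; extend for your codebase.
SOURCE_SUFFIXES = {".py", ".ts", ".js", ".go", ".java", ".sh", ".yaml", ".yml"}


def find_legacy_usage(root: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, marker) for every legacy marker found under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for marker in LEGACY_MARKERS:
                if marker in line:
                    hits.append((str(path), line_number, marker))
    return hits
```

Call find_legacy_usage(".") from the repository root and fail the build if the list is non-empty. A header set through a dictionary entry will not match the literal "Api-Revision: 2026-05-07" string, so it is worth searching for Api-Revision on its own as well.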

⚠️⚠️ Windows Local AI Runtime — KB5039239 June 9 (3 days)

(Countdown updated) Source: Windows Developer Blog | Link: https://blogs.windows.com/windowsdeveloper/2026/06/02/build-2026-furthering-windows-as-the-trusted-platform-for-development/ Windows Update KB5039239 delivers the expanded on-device AI stack (Aion 1.0 runtime, CPU/GPU/NPU support) on June 9. Required for production use of Aion 1.0 Instruct and Aion 1.0 Plan on end-user devices. Aion 1.0 open weights land on Hugging Face in July.

⚠️⚠️⚠️ Claude Sonnet 4 + Opus 4 — Retirement June 15 (9 days)

(Countdown updated) Source: Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations claude-sonnet-4-20250514 and claude-opus-4-20250514 return errors June 15. Migrate to claude-sonnet-4-6-20260217 and claude-opus-4-8 respectively. Review the Opus 4.8 migration guide before upgrading — adaptive thinking replaces budget_tokens; setting temperature, top_p, or top_k to non-default values returns a 400 error.

⚠️⚠️ Gemini CLI Hard Stop — June 18 (12 days)

(Countdown updated) Source: Google Developers Blog | Link: https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/ gemini CLI and Gemini Code Assist IDE extensions stop serving requests for Google AI Pro, Ultra, and free personal users on June 18. Replacement is Antigravity CLI (agy). Audit CLI scripts and CI pipeline steps now — Antigravity CLI does not have 1:1 feature parity.

⚠️⚠️ Gemini API Unrestricted Key Deadline — June 19 (13 days)

(Countdown updated) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/api-key All unrestricted Gemini API keys blocked June 19. Restrict via AI Studio → API Keys → "Restrict to Gemini API." Takes 2 minutes; no code changes required.

⚠️ Gemini Image Models Shutdown — June 25 (19 days)

(Countdown updated) Source: Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/deprecations gemini-3.1-flash-image-preview and gemini-3-pro-image-preview shutting down June 25, 2026. Migrate to stable image model equivalents before the shutdown date.

⚠️ GPT-4.5 Retirement from ChatGPT — June 27 (21 days)

(Countdown updated) Source: OpenAI Platform Changelog | Link: https://platform.openai.com/docs/changelog GPT-4.5 being retired from the ChatGPT product surface on June 27; direct API route retirement unconfirmed. Audit gpt-4.5 model identifiers in code.

⚠️ Claude Opus 4.1 Retirement — August 5 (60 days)

(New — announced June 5, 2026) Source: Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations claude-opus-4-1-20250805 retires August 5. Migrate to claude-opus-4-8. This is a Significant migration effort if coming from a pre-4.7 model — see API & SDK Changes section for full migration checklist.

⚠️ OpenAI Reusable Prompts (v1/prompts) Shutdown — November 30 (177 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations Deprecated June 3, shutdown November 30, 2026. Move prompt content to application code.

⚠️ OpenAI Evals Platform Shutdown — November 30 (177 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations Read-only October 31, shutdown November 30, 2026. Export eval configs before October 31.

⚠️ OpenAI Agent Builder Shutdown — November 30 (177 days)

Source: OpenAI | Link: https://developers.openai.com/api/docs/deprecations Shutdown November 30, 2026. Migrate to Agents SDK (openai.agents) or ChatGPT Workspace Agents.

Claude Mythos — Public Release "Once Stronger Safeguards Ready"

(Carried — status unchanged) Source: Anthropic | Link: https://www.anthropic.com/news/expanding-project-glasswing No timeline given. Currently: no public API, no claude.ai access at any tier. Leads SWE-bench Verified at 93.9% (internal benchmark as of June 2, 2026).

Gemini 3.5 Pro — Expected July 2026

(Carried — no official date) Sundar Pichai stated "give us until next month" at Google I/O 2026 (May 19). No official announcement, pricing, model ID, or benchmark numbers.


<details> <summary>🔭 Horizon — Open Questions, Emerging Patterns & Grounded Speculation</summary>

This section operates under different rules than the digest above. Evidence-grounded speculation is allowed. Pure prediction is not. Every claim here must cite a source from this digest or a real paper/benchmark. Label each entry by type so the reader knows what kind of thinking they're engaging with.

[PATTERN] Two benchmark papers exposing the same agent capability gap, from different angles, within a day of each other

CL-Bench (2606.05661) and DeployBench (2606.05238) were both submitted around June 4–5, from unrelated teams, targeting different capability gaps. CL-Bench asks: do agents actually improve with experience? Answer: barely. Dedicated memory systems barely beat naive ICL (ACE achieves 8.6% normalized gain at $62.8/run). DeployBench asks: can agents set up a real software environment from scratch? Answer: barely, with 7.8%–51.0% pass rates for frontier models. There is no sign that either team knew about the other's submission. The pattern: the community is independently arriving at the same observation, that current agent capabilities are overestimated by existing benchmarks because those benchmarks make environment-availability and statefulness assumptions that don't hold in production. Two evaluation groups from different labs, no apparent coordination, same week.

Grounded in: CL-Bench arXiv 2606.05661 (this digest, Research); DeployBench arXiv 2606.05238 (this digest, Research)

[TENSION] Claude Code's fallbackModel feature and Anthropic's fast deprecation cadence are pulling in opposite directions

v2.1.166 ships fallbackModel, the ability to specify up to three fallback models if the primary is overloaded or unavailable. This is explicitly a resilience feature for long-running sessions. Simultaneously, Anthropic deprecated Opus 4.1 on June 5 (60-day window), and Opus 4 / Sonnet 4 retire in 9 days. A fallbackModel list of ["claude-opus-4-7", "claude-sonnet-4-6"] needs to stay current as models age out; it is not a set-once-and-forget setting. The tension: the feature adds resilience against transient model outages, but the fast deprecation cadence means the fallback list itself is a maintenance burden. Teams that deploy Claude Code at scale will need a governance process to keep their fallbackModel list in sync with Anthropic's deprecation schedule. Worth watching: does Anthropic publish a fallbackModel recommendation alongside each deprecation notice?

Grounded in: Claude Code v2.1.166 fallbackModel (this digest, API & SDK Changes); Opus 4.1 deprecation (this digest, API & SDK Changes); Sonnet/Opus 4 retirement June 15 (this digest, Worth Watching)

[OPEN QUESTION] Does LMArena's behavioral signal methodology actually rank models differently than preference votes for agentic tasks?

The Agent Arena (launched June 4) measures behavioral signals (file downloads as task completion proxies, disapproval events as correction proxies, retries as failure proxies) rather than asking users to choose a preferred response. This is the right instinct: preference votes reward plausible-sounding responses, while behavioral signals reward task completion. But the key question is whether the rankings actually diverge. If Claude Opus 4.8 and GPT-5.5 rank the same in both the chat arena and the agent arena, the methodology refinement didn't change anything useful. If they rank differently, it suggests the chat preference leaderboard is measuring the wrong thing for agentic deployment decisions. The Arena team has not yet published a comparison. This is the most important early data point to watch from the Agent Arena launch.

Grounded in: LMArena Agent Arena launch June 4 (this digest, Benchmarks & Leaderboards); CL-Bench finding that gain metric diverges from raw capability (this digest, Research)

[RESEARCH THREAD] MicroPython-WASM as a lightweight code execution sandbox for agents (Simon Willison, alpha)

Simon Willison published micropython-wasm on June 6: an alpha Python package enabling in-process MicroPython execution via WASM, released as a plugin for Datasette Agent as datasette-agent-micropython. The idea: instead of spinning up a Docker container or subprocess for agent code execution, run MicroPython in a WASM sandbox in the same process with strict memory and CPU limits. It is very much alpha, and the MicroPython stdlib coverage gap versus CPython is significant. But the architectural pattern of embedding a constrained language runtime in a WASM sandbox as a code execution tool is relevant for developers who want agent code execution without the overhead of a full container spawn. If WASM isolation proves sufficient for the risk profile, this could eliminate the cold-start latency that currently makes container-based code execution unsuitable for interactive agent use. Not ready for production evaluation, but worth watching through the 2-3 releases that follow initial alpha.

Grounded in: Simon Willison, simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/ (excluded from main digest as an alpha personal project); adjacent to CL-Bench finding that stateful agent environments are a bottleneck

[IF THIS CONTINUES] At Anthropic's current deprecation cadence, any production Anthropic API integration needs automated deprecation monitoring by end of 2026

In the past 60 days: Claude Haiku 3 retired (April 20), Claude Opus 4.7 launched (April 16), Claude Opus 4.8 launched (May 28), Opus 4 + Sonnet 4 were scheduled to retire June 15, and Opus 4.1 was deprecated June 5 for an August 5 retirement. At this pace, there is approximately one deprecation event per 30–45 days. The API model name claude-opus-4-1-20250805 is a dated version string, and that date (August 5, 2025) is exactly one year before the retirement date, suggesting Anthropic may be moving toward a one-year lifecycle for non-flagship models. If that cadence holds, any production integration without automated deprecation monitoring (e.g., polling the deprecation page or using the Rate Limits API to detect model availability) will regularly encounter unexpected 400 errors from retired model IDs. The mitigation is straightforward: subscribe to the Anthropic deprecation email list, set calendar reminders at 60/30/14 days before each listed retirement date, and test the replacement model before the window closes. The tooling for this doesn't yet exist as a first-class API feature, but it probably should.

Grounded in: Claude Opus 4.1 deprecation June 5 (this digest); Opus 4 + Sonnet 4 retiring June 15 (this digest, Worth Watching); Anthropic model deprecations page (primary source)
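The 60/30/14-day reminder schedule suggested above is easy to generate from a list of retirement dates. A minimal sketch, using the retirement dates listed in this digest; the function name reminder_dates and the dictionary layout are assumptions for illustration.

```python
from datetime import date, timedelta

REMINDER_OFFSETS_DAYS = (60, 30, 14)

# Retirement dates as listed in this digest.
RETIREMENTS = {
    "claude-opus-4-1-20250805": date(2026, 8, 5),
    "claude-sonnet-4-20250514": date(2026, 6, 15),
    "claude-opus-4-20250514": date(2026, 6, 15),
}


def reminder_dates(retirement: date, today: date) -> list[date]:
    """Reminder dates 60/30/14 days before retirement that are not already past."""
    candidates = (retirement - timedelta(days=offset) for offset in REMINDER_OFFSETS_DAYS)
    return [d for d in candidates if d >= today]


today = date(2026, 6, 6)  # digest date
for model, retirement in RETIREMENTS.items():
    days_left = (retirement - today).days
    upcoming = ", ".join(d.isoformat() for d in reminder_dates(retirement, today))
    print(f"{model}: {days_left} days left; reminders: {upcoming or 'none remaining'}")
```

For the models retiring June 15 every reminder date has already passed, which is the failure mode the schedule is meant to prevent: by the time a retirement is 9 days out, the only remaining action is the migration itself.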

</details>

Excluded: ~56 items below quality gate threshold, outside scan window, or duplicate coverage. Near-misses: Grok Imagine 1.5 (June 3 — 1 day outside window); Grok Connectors with MCP (May 6 — outside window); Nathan Lambert "Open and closed models are on different exponentials" (June 1 — outside window); HF transformers v5.10.2 clip-model fix (June 4 — borderline, patch release scope); AdaPlanBench arXiv 2606.05622 (no concrete model performance numbers found); Simon Willison micropython-wasm (alpha personal project — moved to Horizon); OpenAI Canvas deprecated in GPT-5.5 Instant (product surface only, no API impact); OpenAI Lockdown Mode for enterprise (earlier rollout date — this period covered personal account expansion only, moved to Trends); LMArena text ELO (no movement June 5–6); LMArena Agent Arena initial rankings (arena.ai 403 — could not verify launch rankings, covered in Benchmarks section with attribution); SWE-bench (no new entries); AWS ML blog (no June 5–6 items); Azure AI (no June 5–6 specific items); NVIDIA TensorRT-LLM (no June 5–6 items); Together AI / Fireworks AI / Modal (no June 5–6 items); Groq (June 2026 news is funding raise, not technical release); xAI June 5–6 release notes (403 access); Mistral (last release May 22, no June 5–6 items); Meta AI (no June 5–6 items); Cohere (no June 5–6 items); vLLM (GitHub releases page returned 2024 dates — possible display issue, no confirmed June 5–6 release); DeepMind arXiv 2606.03237 (no code, no ML benchmarks — covered in prior digest Horizon).
