AI Developer Digest

Tue, May 26, 2026

11 signals that cleared the gate | 25 scanned | 20 min read

The Signal — start here

The Gemini Interactions API deadline that this digest has tracked for nine consecutive days has arrived: as of today (May 26), the outputs → steps schema switch is the live default for all Gemini Interactions API requests. Applications that have not migrated are now silently parsing the wrong response structure, with no exception raised. The legacy schema stays available via an opt-out header until June 8, after which it is permanently removed. That is the single most important item in this digest. Beyond the deadline, today is again a very light day: the post-Google I/O release lull is now at day six. The only other substantive news is llama.cpp, which continues shipping high-signal individual builds. b9330 fixed a misclassified matrix-multiply operation on Nemotron models that was silently dropping GPU acceleration, restoring throughput from 64.9 to 103.22 tokens/second (a 59% improvement). The cause was a mislabeled operation type, not a missing GPU or a wrong quantization: the kind of silent degradation that is hard to diagnose without benchmarking.

Must-reads today

Gemini Interactions API outputs → steps is NOW LIVE — if your code reads response.outputs, it is silently parsing wrong data as of today; act before June 8 when the legacy schema is permanently removed

llama.cpp b9330 — if you're running Nemotron models, you may have been running at 64.9 tok/sec instead of 103.22 tok/sec due to a mislabeled GPU operation; update to b9330 to restore full throughput

Breaking Changes

●Breaking

Gemini Interactions API `outputs → steps` — Default Schema Switch Is Now Live

What changed

The new steps-based response schema became the default for all Gemini Interactions API requests today; the old outputs array schema is now opt-out only (via Api-Revision: 2026-05-07 header) until June 8, 2026, when it is permanently removed.
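For raw HTTP clients, the opt-out can be sketched as a small header helper. The header name and value come from the text above; the function name `build_headers`, its parameters, and the choice to treat June 8 itself as already past the cutoff are assumptions made for illustration.

```python
from datetime import date

# Values quoted in this digest entry.
LEGACY_REVISION = "2026-05-07"
LEGACY_REMOVAL_DATE = date(2026, 6, 8)


def build_headers(base_headers: dict, use_legacy_schema: bool, today: date) -> dict:
    """Return request headers, adding the opt-out header only while it can still work."""
    headers = dict(base_headers)  # copy so the caller's dict is not mutated
    if use_legacy_schema:
        # Assumption: the legacy schema is unavailable from June 8 onward.
        if today >= LEGACY_REMOVAL_DATE:
            raise RuntimeError("legacy outputs schema removed; migrate to steps")
        headers["Api-Revision"] = LEGACY_REVISION
    return headers
```

Failing loudly after the cutoff is a deliberate choice here: a forgotten opt-out flag then surfaces as an exception instead of a silently mis-parsed response.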

TL;DR

As of May 26, 2026, Gemini Interactions API responses use a steps array by default — code still reading response.outputs is silently receiving incorrect data with no exception raised.

Developer signal

Unmigrated applications do not crash — they silently parse incorrect response structures, making this a data-integrity failure, not an availability failure. The migration requires updating any code that reads response.outputs to instead iterate response.steps, filtering by step.type (user_input, model_output, google_search_call, google_search_result, function_call). Multi-turn history code that builds conversation context from outputs must also be updated. Python SDK ≥2.0.0 and JS SDK ≥2.0.0 adopt the new schema automatically, but hand-written response parsing code and raw HTTP clients must be updated manually. The Api-Revision: 2026-05-07 header opts back into the legacy schema as a temporary escape hatch — use it to buy time, not as a permanent fix. New Gemini API features shipped after May 7 will not appear in outputs responses even with the opt-out header; migrating now unlocks new capabilities, not just compliance. Hard deadline: June 8 — after that, the legacy schema is gone and there is no header to fall back to.
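A minimal sketch of the parsing change for hand-written clients, assuming the response has already been decoded into a Python dict. The `steps` key and the step type names come from the text above; the helper names and the `text` field in the usage example are hypothetical, since the digest does not describe the fields inside each step.

```python
from typing import Any

# Step types listed in this digest entry.
KNOWN_STEP_TYPES = {
    "user_input",
    "model_output",
    "google_search_call",
    "google_search_result",
    "function_call",
}


def steps_of_type(response: dict[str, Any], step_type: str) -> list[dict[str, Any]]:
    """Return all steps of one type, raising if the response is not steps-based."""
    if "steps" not in response:
        # Turn the silent failure mode into an explicit error.
        raise ValueError("response has no 'steps' array; is this the legacy schema?")
    return [step for step in response["steps"] if step.get("type") == step_type]


def unknown_step_types(response: dict[str, Any]) -> set[str]:
    """Report step types this parser does not know about yet."""
    seen = {str(step.get("type")) for step in response.get("steps", [])}
    return seen - KNOWN_STEP_TYPES
```

For example, with a hand-built response such as `{"steps": [{"type": "user_input", "text": "hi"}, {"type": "model_output", "text": "hello"}]}`, calling `steps_of_type(resp, "model_output")` returns only the second step. Checking `unknown_step_types` in tests is one way to notice step types added after this code was written.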

Affects you if

You are calling the Gemini Interactions API (not the standard generateContent API) and reading response.outputs in your parsing code, or building conversation history from the outputs array.

Effort

Moderate (response parsing code must be updated; multi-turn history management requires review; test across all step types before removing the opt-out header)

Google AI for Developers | Date: May 26, 2026 | Link: https://ai.google.dev/gemini-api/docs/interactions-breaking-changes-may-2026

Model Releases

Nothing new in this scan window. Most recent major releases: Gemini 3.5 Flash and Cohere Command A+ (May 19–20, covered in May 21 digest); Claude Opus 4.7 (April 16); GPT-5.5 (April 23). Note: Qwen 3.7 Max (Alibaba, released May 20–21) appears not to have been covered in prior digest windows — see near-misses below.

API & SDK Changes

Nothing new in this scan window beyond the Gemini Interactions API breaking change (see [BREAKING] section above). Anthropic Platform last changelog entry: May 19, 2026 (MCP tunnels, self-hosted sandboxes, managed agent MCP config, large output spill). OpenAI Platform last entry: May 12, 2026 (DALL-E deprecation, Realtime API Beta removal). Google AI changelog returned 403 on direct fetch; confirmed via search that no new entries beyond the Interactions API switch were published today.

Research

Nothing cleared the quality bar this period. arXiv cs.CL and cs.AI listing pages returned 403 errors at fetch time. Hugging Face Papers Daily returned no papers confirmed within the May 25–26 scan window. Papers surfaced via search (AutoTool, AgentInfer co-design, Active Inference for Multi-LLM Systems) predate the scan window or lack confirmed recognized-lab authorship with associated code repositories.

Tooling

Notable

llama.cpp b9330 — Nemotron Throughput Regression Fixed (64.9 → 103.22 tokens/sec)

What changed

ffn_latent operations in Nemotron models were reclassified from GGML_OP_MUL (elementwise multiply) to MUL_MAT (matrix multiply), fixing an incorrect operation type that caused backend buffer probes to test the wrong operation, silently routing FFN layer weights to CPU instead of GPU.
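The failure mode can be illustrated with a toy model of a placement probe. This is not llama.cpp code: the capability table, the `place_weight` function, and the type names are invented for illustration, assuming only that a loader decides where a weight lives by asking the backend whether it supports the operation it believes will consume that weight.

```python
# Hypothetical capability table: (operation, weight type) pairs a pretend GPU backend accepts.
GPU_SUPPORTED = {
    ("MUL_MAT", "q5_k"),
    ("MUL_MAT", "f32"),
    ("MUL", "f32"),
}


def place_weight(probed_op: str, weight_type: str) -> str:
    """Pick a device by probing the op the loader thinks will use this weight."""
    if (probed_op, weight_type) in GPU_SUPPORTED:
        return "GPU"
    return "CPU"  # silent fallback: no error, just slower


# Probing with the wrong label sends a quantized weight to the CPU,
# even though the op that actually runs (MUL_MAT) is GPU-capable.
mislabeled = place_weight("MUL", "q5_k")
corrected = place_weight("MUL_MAT", "q5_k")
```

In this toy, `mislabeled` is `"CPU"` and `corrected` is `"GPU"`, which mirrors the described divergence between the probed operation and the executed one.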

TL;DR

A one-line operation-type fix in Nemotron model loading restores GPU-accelerated throughput from 64.9 to 103.22 tokens/second on Nemotron 3 Super 120B Q5_K_M — a 59% improvement with no configuration changes required.

Developer signal

If you are running Nemotron models via llama.cpp (Nemotron 3 Super 120B or other Nemotron variants that route FFN layers through matrix multiply), you have likely been operating at roughly 63% of expected throughput (64.9 of 103.22 tokens/second) since the bug was introduced. No error is raised and no warning is logged; the model runs normally, only slower. Update to b9330 or newer and benchmark: if your throughput jumps 50–60%, this was your issue. The fix is transparent: update the build, with no configuration changes required. If you previously ran benchmarks on Nemotron models with a pre-b9330 llama.cpp build, those numbers understate GPU capability and should be rerun. The root cause (an operation-type mislabel that makes the probe path diverge from the execution path) is a class of bug that can recur as llama.cpp adds more backend types, so routine throughput benchmarking after updates is the reliable detection method.
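The arithmetic behind these figures, plus a simple regression check of the kind suggested here, can be sketched as follows. The two throughput numbers come from the release; the function names and the 10% tolerance are assumptions chosen for illustration.

```python
def percent_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


def percent_of_expected(observed: float, expected: float) -> float:
    """How much of the expected throughput was actually observed, in percent."""
    return observed / expected * 100.0


def looks_like_regression(observed: float, baseline: float, tolerance_pct: float = 10.0) -> bool:
    """Flag a run that falls more than `tolerance_pct` below a recorded baseline."""
    return observed < baseline * (1.0 - tolerance_pct / 100.0)


gain = round(percent_gain(64.9, 103.22), 1)          # 59.0
share = round(percent_of_expected(64.9, 103.22), 1)  # 62.9
```

Recording a baseline tokens/second figure per model and build, then running `looks_like_regression` after each update, is one lightweight way to catch silent CPU fallbacks like this one.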

Affects you if

You are running Nemotron 3 Super 120B or other Nemotron variants with MUL_MAT FFN layers via llama.cpp on GPU.

Effort

Quick (update to b9330 or newer; no config changes needed)

ggml-org/llama.cpp (GitHub) | Date: May 26, 2026 (02:48 UTC) | Link: https://github.com/ggml-org/llama.cpp/releases/tag/b9330

Notable

llama.cpp b9329 + b9334 — CUDA Fast Walsh-Hadamard Transform Kernel

What changed

b9329 added a warp-level-optimized Fast Walsh-Hadamard Transform (FWHT) CUDA kernel with loop unrolling and int-sized index arithmetic; b9334 (same day) fixed a missing PDL synchronization point for the new kernel and improved fallback behavior.

TL;DR

llama.cpp's CUDA backend gains a native FWHT kernel for hardware-accelerated Walsh-Hadamard transform operations, replacing prior CPU fallback for model architectures that include this operation.

Developer signal

The Fast Walsh-Hadamard Transform is used in certain efficient attention variants and emerging subquadratic/SSM-family model architectures. If you run models that include FWHT operations (e.g., some Mamba-family and SSM-based models) on CUDA, this replaces a likely CPU fallback with a native GPU implementation. No configuration is required: the kernel is selected automatically when the operation is present. Apply both b9329 and b9334 together; the sync fix in b9334 addresses a race condition introduced in b9329 and should not be skipped. The commit message also mentions a warp-size-64 variant. NVIDIA warps are 32 threads wide, so that variant presumably targets hardware with 64-wide wavefronts rather than NVIDIA's standard warp width; the release notes summarized here do not say which devices it covers.
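For readers unfamiliar with the operation, here is a plain CPU reference of the unnormalized Fast Walsh-Hadamard Transform in its standard butterfly form. It is a textbook sketch for understanding what the transform computes, not the CUDA kernel from these builds; the function name `fwht` is chosen for illustration.

```python
def fwht(values: list[float]) -> list[float]:
    """Unnormalized Walsh-Hadamard transform (Hadamard ordering), O(n log n)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)  # work on a copy
    h = 1
    while h < n:
        # Combine blocks of size h pairwise with a sum/difference butterfly.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a


print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the transform twice returns the input scaled by its length, which makes a convenient self-check when validating a faster implementation against this reference.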

Affects you if

You are running SSM-family or other models that use FWHT operations via llama.cpp on CUDA.

Effort

Quick (update to b9334 or newer; both fixes are included in that build)

ggml-org/llama.cpp (GitHub) | Date: May 26, 2026 (b9329: 02:00 UTC; b9334: 09:00 UTC) | Links: https://github.com/ggml-org/llama.cpp/releases/tag/b9329 and https://github.com/ggml-org/llama.cpp/releases/tag/b9334

Notable

llama.cpp b9319 — New GGUF Loading APIs: `gguf_init_from_callback` and `gguf_init_from_buffer`

What changed

Added gguf_init_from_buffer (load a GGUF model from a pre-allocated memory buffer) and gguf_init_from_callback (load via a custom callback function invoked per read), with hardened file offset calculations and overflow prevention.

TL;DR

llama.cpp's GGUF loading layer gains two new entry points that enable loading models from memory buffers and custom data sources, not just local file paths, alongside hardened file offset calculations and overflow checks.

Developer signal

Previously, loading a GGUF model in llama.cpp required a file path. The new buffer API lets you pass pre-loaded model bytes directly, useful when you've mmap'd the model, downloaded it into memory, or received it over a network stream. The callback API is more general — it invokes your function for each read operation, enabling lazy loading (only read the weight chunks you need), custom decryption, streaming from object storage, or any non-file data source. File offset overflow protection was added in the same build, making both APIs suitable for production use. If you are embedding llama.cpp in an application that manages model storage differently from local files — cloud functions where disk is expensive, edge devices, distributed caches, S3-backed weight storage — these APIs significantly reduce the I/O plumbing you need to write.
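The difference between the two loading styles can be sketched in Python by parsing the fixed-size GGUF header through a read callback. This is a conceptual illustration only: it does not call llama.cpp, and the C signatures of `gguf_init_from_buffer` and `gguf_init_from_callback` are not given in this digest. The `(offset, size) -> bytes` callback shape and all function names below are assumptions; the header layout (magic, version, tensor count, metadata key/value count, little-endian) follows the GGUF v2+ format.

```python
import struct
from typing import Callable

ReadFn = Callable[[int, int], bytes]  # hypothetical callback shape: (offset, size) -> bytes

HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor count, metadata KV count


def reader_from_buffer(buf: bytes) -> ReadFn:
    """Wrap an in-memory buffer as a bounds-checked read callback."""
    def read(offset: int, size: int) -> bytes:
        if offset < 0 or size < 0 or offset + size > len(buf):
            raise ValueError("read outside buffer")
        return buf[offset:offset + size]
    return read


def parse_gguf_header(read: ReadFn) -> dict:
    """Read only the header bytes via the callback; the data source is opaque."""
    magic, version, n_tensors, n_kv = HEADER.unpack(read(0, HEADER.size))
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Because the parser only sees `read`, the same code works whether the bytes come from a buffer, a decrypting wrapper, or ranged requests against object storage, which is the flexibility the callback style is meant to provide.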

Affects you if

You are embedding llama.cpp in a custom application and need to load GGUF models from memory buffers, network streams, object storage, or other non-filesystem sources.

Effort

Moderate (new API integration required; not a drop-in replacement for existing file-path loading paths)

ggml-org/llama.cpp (GitHub) | Date: May 25, 2026 (23:07 UTC) | Link: https://github.com/ggml-org/llama.cpp/releases/tag/b9319

Benchmarks & Leaderboards

No new leaderboard movements confirmed in this scan window. From search-sourced tracker data (lmarena.ai direct fetch returned 403): mai-image-2.5-preview was added to the LMArena Text-to-Image leaderboard as of May 26, with no Elo rating published yet. The Text leaderboard continues to show Claude Opus 4.6 at Elo ~1504, with Gemini 3.1 Pro Preview and Claude Opus 4.6 Thinking in a statistical tie within the top three (overlapping confidence intervals). Code arena: GPT-5.2-codex at Elo 1569, unchanged since January 2026. No confirmed new model entries or significant rank movements.

Trends & Emerging Tech

The Post-Google I/O Lull and the June Deadline Convergence

What's happening

Today marks the sixth consecutive day with zero main-section entries from major AI labs (Anthropic, OpenAI, Google, Meta, Mistral, xAI). The single new breaking-change item in this digest is not a new feature — it is a pre-announced deadline that arrived. Meanwhile, four hard deprecation deadlines cluster in the next 24 days: GitHub Copilot billing switch (June 1), Gemini 2.0 Flash/Lite shutdown (June 1), Claude Sonnet 4/Opus 4 retirement (June 15), and Gemini API key restriction (June 19). The developer maintenance load this week is paradoxically higher than the week of Google I/O itself.
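The countdown arithmetic can be checked with a few lines of date math. The dates are the ones quoted in this digest; the dictionary keys and the function name are chosen for illustration.

```python
from datetime import date

DIGEST_DATE = date(2026, 5, 26)

# Deadlines as stated in this digest.
DEADLINES = {
    "GitHub Copilot billing switch": date(2026, 6, 1),
    "Gemini 2.0 Flash/Lite shutdown": date(2026, 6, 1),
    "Claude Sonnet 4/Opus 4 retirement": date(2026, 6, 15),
    "Gemini API key restriction": date(2026, 6, 19),
}


def days_remaining(today: date) -> dict[str, int]:
    """Days from `today` until each deadline (negative once a deadline has passed)."""
    return {name: (deadline - today).days for name, deadline in DEADLINES.items()}


for name, days in sorted(days_remaining(DIGEST_DATE).items(), key=lambda item: item[1]):
    print(f"{days:>3} days  {name}")
```

Run from the digest date, this gives 6, 6, 20, and 24 days respectively, so all four fall inside the 24-day window.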

Why watch this

The pattern of "conference-dense release wave" followed by "multi-week quiet plus deadline pressure" suggests a predictable AI engineering calendar: labs concentrate new features around flagship events (I/O, DevDay), and the following weeks force deprecation compliance. If you are running engineering teams, the current quiet window is the right time to complete the four June deadline migrations, rather than the week each one fires. The next release wave will likely cluster around Gemini 3.5 Pro (expected June 2026) and potentially OpenAI's historically late-June cadence, meaning June 1–19 may be the most operationally dense stretch of Q2: two model shutdowns, one billing change, one API key deadline, and (likely) new model releases to evaluate simultaneously.

Pattern across May 21–26 digests | Date: May 26, 2026

Technical Discussions

Nothing cleared the quality bar this period. Nathan Lambert published "Some ideas for what comes next, May 2026" on Interconnects.ai today, but the page returned 403 on direct fetch and search snippets were insufficient to confirm concrete benchmark data or primary-source technical claims meeting the quality gate threshold.

Quick Hits

llama.cpp b9318 (May 25, 20:25) — server: MTP layer kv-cache should respect draft type ctk — fixes kv-cache handling in multi-token prediction server layers where the draft model type was not being respected; affects speculative decoding setups using MTP. [https://github.com/ggml-org/llama.cpp/releases/tag/b9318]
llama.cpp b9320 (May 25, 23:58) — TP: fix ggml context size calculation — resolves memory leaks and incorrect context size computations in tensor parallelism mode; affects multi-GPU setups using TP. [https://github.com/ggml-org/llama.cpp/releases/tag/b9320]
llama.cpp b9326 (May 26, 00:32) — ggml library synchronization update; dependency maintenance, no user-facing changes. [https://github.com/ggml-org/llama.cpp/releases/tag/b9326]
llama.cpp b9331 (May 26, 05:10) — CI workflow reorganization: SYCL builds disabled, Android/HIP/WebGPU/other backend CI jobs extracted into separate workflows; no runtime changes. [https://github.com/ggml-org/llama.cpp/releases/tag/b9331]
llama.cpp b9333 (May 26, 06:12) — metal: add apple device id — expanded Metal backend device identification for additional Apple hardware; minor compatibility update. [https://github.com/ggml-org/llama.cpp/releases/tag/b9333]
llama.cpp b9351 (May 26, 19:55) — Build published; no technical changes documented in release notes at time of scan. [https://github.com/ggml-org/llama.cpp/releases/tag/b9351]

Worth Watching (Announced, Not Yet Shipped)

⚠️⚠️ GitHub Copilot — Metered Billing Transition June 1 (6 days)

(Announced April 28, 2026)

All GitHub Copilot plans switch from request-based to token-based AI Credit billing on June 1. Code completions remain free. Agent-heavy workflows (multi-file edits, long-horizon reasoning with o3/GPT-5+) now carry explicit per-token costs. Audit projected usage in the GitHub preview bill experience before June 1.

GitHub Blog | Link: https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/

⚠️⚠️ Gemini 2.0 Flash + 2.0 Flash Lite — Shutdown June 1 (6 days)

(Carried from May 21–25 digests)

gemini-2.0-flash and gemini-2.0-flash-lite return errors on June 1, 2026. Migration: gemini-2.5-flash ($0.30/$2.50/MTok) or gemini-2.5-flash-lite ($0.10/$0.40/MTok).
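A quick way to compare the two migration targets is a per-request cost estimate. This sketch assumes the paired figures above are input and output prices in USD per million tokens, which the digest does not state explicitly; the function name and the sample token counts are hypothetical.

```python
# Assumption: (input price, output price) in USD per million tokens.
PRICES_PER_MTOK = {
    "gemini-2.5-flash": (0.30, 2.50),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a workload under the assumed price interpretation."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return input_tokens / 1_000_000 * input_price + output_tokens / 1_000_000 * output_price


# Sample workload: 2M input tokens and 0.5M output tokens.
flash = estimate_cost("gemini-2.5-flash", 2_000_000, 500_000)            # about 1.85
flash_lite = estimate_cost("gemini-2.5-flash-lite", 2_000_000, 500_000)  # about 0.40
```

Plugging in your own monthly token counts gives a rough sense of the price gap before you evaluate quality on either model.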

Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/deprecations

⚠️ Claude Sonnet 4 + Opus 4 — Retirement June 15 (20 days)

(Carried from May 22–25 digests)

claude-sonnet-4-20250514 and claude-opus-4-20250514 return errors on June 15, 2026. No automatic failover. Migration: Sonnet 4 → claude-sonnet-4-6-20260217; Opus 4 → claude-opus-4-7-20260416. Opus 4.7 has API breaking changes versus Opus 4.6 — read the migration guide before upgrading.
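Since there is no automatic failover, one option is to centralize the remapping in your own code and audit configurations for retired IDs. The model IDs below are the ones quoted above; the function names are hypothetical, and the sketch only rewrites the ID string, so the Opus 4.7 breaking changes still need separate handling.

```python
# Retiring model ID -> replacement ID, as listed in this digest.
REPLACEMENTS = {
    "claude-sonnet-4-20250514": "claude-sonnet-4-6-20260217",
    "claude-opus-4-20250514": "claude-opus-4-7-20260416",
}


def resolve_model(model_id: str) -> str:
    """Map a retiring model ID to its replacement; pass other IDs through unchanged."""
    return REPLACEMENTS.get(model_id, model_id)


def find_retiring(model_ids: list[str]) -> list[str]:
    """List the configured IDs that will start returning errors on June 15, 2026."""
    return [model_id for model_id in model_ids if model_id in REPLACEMENTS]
```

Running `find_retiring` over every model ID in your configuration files is a cheap first audit before the retirement date.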

Anthropic | Link: https://platform.claude.com/docs/en/about-claude/model-deprecations

Gemini API Unrestricted Key Deadline — June 19 (24 days)

(Carried from May 21–25 digests)

All unrestricted Gemini API keys blocked June 19. Restrict via AI Studio → API Keys → "Restrict to Gemini API" (one-click action).

Google AI for Developers | Link: https://ai.google.dev/gemini-api/docs/api-key

Ollama v0.30.0 — Still Pre-Release (rc23 as of May 22)

(Carried from May 15 digest)

v0.30.0 restructures Ollama to use llama.cpp directly as backend, with MLX for Apple Silicon inference. No stable GA date announced.

Ollama (GitHub) | Link: https://github.com/ollama/ollama/releases

Gemini 3.5 Pro — Expected ~June 2026

(Carried from May 21–25 digests)

Confirmed in internal testing at Gemini 3.5 Flash launch (May 19). No model ID, pricing, or benchmarks disclosed.

Google (Google I/O 2026) | Link: https://blog.google/innovation-and-ai/technology/ai/google-io-2026-all-our-announcements/

Filtered from 30+ primary sources against a published quality rubric. No press releases, no fluff — only what changes what you build.
