DeepMind's AlphaEvolve optimizes production systems at Google scale, Anthropic partners with SpaceX for 220K GPUs, and agents can now autonomously buy infrastructure.
This was the week AI agents stopped being tools and started becoming economic actors. Between AlphaEvolve rewriting Google's infrastructure autonomously, Cloudflare giving agents credit cards, and Anthropic discovering that Claude has thoughts it doesn't share — the "agentic future" arrived faster than anyone's governance frameworks were ready for.
DeepMind's AlphaEvolve is rewriting production code at Google — and it's working.
AlphaEvolve isn't another coding assistant. It's a Gemini-powered evolutionary search system that discovers novel algorithms by iterating through millions of candidates, using LLM-generated mutations guided by automated evaluation. Think genetic programming, but with a frontier model as the mutation operator.
The architecture is straightforward: define a fitness function, seed the population with existing implementations, let Gemini propose mutations, evaluate candidates against your objective, keep the winners, repeat. The LLM provides the creative leaps; the evaluation harness provides the ground truth.
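The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not AlphaEvolve's implementation: a random numeric perturbation stands in for the Gemini mutation step, the objective is a made-up quadratic, and `fitness`, `mutate`, and `evolve` are hypothetical names chosen for illustration.

```python
import random

def fitness(candidate):
    """Toy objective (assumed): higher is better, best possible score is 0 at [3, 3, 3]."""
    return -sum((x - 3.0) ** 2 for x in candidate)

def mutate(candidate, rng, scale=0.5):
    """Stand-in for the LLM mutation step: nudge one coordinate at random."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, scale)
    return child

def evolve(seeds, generations=200, population_size=20, children_per_gen=40, seed=0):
    rng = random.Random(seed)
    # Seed the population with existing "implementations".
    population = [list(s) for s in seeds]
    for _ in range(generations):
        # Propose mutations of randomly chosen parents.
        children = [mutate(rng.choice(population), rng) for _ in range(children_per_gen)]
        # Score parents and children together and keep the winners.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:population_size]
    return population[0]

best = evolve(seeds=[[0.0, 0.0, 0.0]])
print(best, fitness(best))
```

Because parents compete with their children for survival, the best score never gets worse from one generation to the next. Everything interesting in a real system lives in the two functions this sketch fakes: the mutation operator and the fitness function.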
The results are not academic. AlphaEvolve reduced variant detection errors in DeepConsensus by 30%. It pushed GNN feasibility on AC Optimal Power Flow from 14% to 88%. It cut write amplification in Google Spanner by 20% and storage by 9%. Klarna reported 2x transformer training speed. FM Logistic saw 10.4% routing efficiency gains, saving 15,000+ km per year. On the Willow quantum processor, it achieved 10x lower error rates for molecular simulations.
What makes this different from every "AI writes code" demo you've seen: the search loop is closed and automated. AlphaEvolve doesn't produce a single answer — it produces a population of candidates, scores them, and iterates. No human reviews the intermediate generations. The system explores regions of the algorithm design space that no engineer would try, because the search is cheap and the evaluation is rigorous.
For engineering teams, the implication is a fundamental skill shift. The bottleneck moves from "can we write a better algorithm" to "can we define a good enough fitness function." If you can specify what good looks like — in a test suite, a benchmark, a performance counter — AlphaEvolve-style systems can explore the solution space faster than any team of humans. If your objective is fuzzy or requires human judgment, you're still safe. For now.
This also reframes how we think about technical debt. Every piece of code with a clear performance objective and a measurable benchmark is now a candidate for automated optimization. Database query planners, compression algorithms, scheduling heuristics, routing logic — anything you can score, you can evolve. The teams with the best observability infrastructure just gained an unexpected advantage: their metrics become fitness functions.
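One way to picture "metrics become fitness functions" is a scorer that treats correctness as a hard gate and then ranks candidates by a measured counter. The sketch below is illustrative only: `score_candidate`, the `tick` callback, and the three search functions are assumptions invented for this example, and an operation count stands in for whatever production metric you already collect.

```python
def linear_search(items, target, tick):
    for i, item in enumerate(items):
        tick()  # count one comparison
        if item == target:
            return i
    return -1

def binary_search(items, target, tick):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        tick()  # count one probe
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def broken_search(items, target, tick):
    return 0  # very cheap, but wrong for most inputs

def score_candidate(candidate, test_cases):
    """Fitness: -inf if any test fails, otherwise the negated operation count."""
    ops = 0

    def tick():
        nonlocal ops
        ops += 1

    for (items, target), expected in test_cases:
        if candidate(items, target, tick) != expected:
            return float("-inf")  # correctness is a hard gate, not a trade-off
    return -ops

data = list(range(1000))
cases = [((data, 0), 0), ((data, 999), 999), ((data, 500), 500), ((data, -5), -1)]
for fn in (linear_search, binary_search, broken_search):
    print(fn.__name__, score_candidate(fn, cases))
```

The gate matters more than the metric: without it, a search loop will happily converge on `broken_search`, the fastest candidate that does nothing useful.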
Gemma 4's headline feature isn't the model itself — it's how Google ships inference. Multi-token prediction (MTP) uses a smaller "drafter" model that predicts N tokens ahead, then the main model verifies them in a single forward pass. If the drafts are correct, you get N tokens for the cost of one verification step. If not, you fall back to autoregressive generation.
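To make draft-then-verify concrete, here is a toy sketch using the usual greedy acceptance rule from speculative decoding: keep the longest drafted prefix the main model agrees with, then take the main model's own token at the first mismatch. Whether Gemma 4's MTP uses exactly this rule is an assumption, and `speculative_step`, `generate`, and the two toy "models" are hypothetical stand-ins, not any library's API.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-then-verify step. Returns the tokens appended this step."""
    # 1. The drafter proposes k tokens autoregressively.
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)
    # 2. The target checks every drafted position. A real engine does this in
    #    one batched forward pass; the loop here is only for readability.
    accepted, ctx = [], list(prefix)
    for tok in drafts:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first bad draft
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # 3. Every draft matched, so the same pass also yields one extra target token.
    accepted.append(target_next(ctx))
    return accepted

def generate(prefix, draft_next, target_next, k, max_new_tokens):
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[: len(prefix) + max_new_tokens], steps

# Toy "models": the target counts upward mod 10; the drafter agrees except after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

tokens, steps = generate([0], draft_next, target_next, k=4, max_new_tokens=12)
print(tokens, "in", steps, "verification steps")
```

Under greedy acceptance the output is token-for-token what the target model would have produced alone. The drafter only changes how many verification steps it takes to get there.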
This is speculative decoding, but with a twist: the drafter is co-trained with the main model during pre-training, not bolted on afterward. Google reports 2.2x speedup on Apple Silicon and similar gains on A100s, with zero quality degradation.
Why this matters for production systems: latency-sensitive applications (chat, code completion, search) can now hit sub-100ms time-to-first-token on consumer hardware. The 26B MoE variant activates roughly 9B parameters per forward pass, meaning an M3 Max can run it at interactive speeds.
Here's what the serving architecture looks like with vLLM:
The num_speculative_tokens=4 parameter controls the speculation window. Higher values give more speedup when the drafter is accurate (structured output, code, formulaic text) but waste compute on creative or unpredictable generation. In my testing, 3-5 is the sweet spot for most workloads. Set it to 1-2 for conversational agents, 4-6 for code generation, and skip it entirely for batch embedding jobs.
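A rough way to reason about the speculation window before benchmarking: if each drafted token is accepted independently with probability a (a simplifying assumption, since real acceptance depends on context), a window of k drafts yields (1 - a^(k+1)) / (1 - a) tokens per verification pass on average. The helper below is a hypothetical back-of-envelope calculator, not part of vLLM, and it ignores the cost of running the drafter.

```python
def expected_tokens_per_step(acceptance_rate, k):
    """Expected tokens per verification pass, assuming independent acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance_rate == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)

for rate in (0.5, 0.8, 0.95):
    row = ", ".join(f"k={k}: {expected_tokens_per_step(rate, k):.2f}" for k in (1, 2, 4, 6))
    print(f"acceptance {rate:.2f} -> {row}")
```

The shape of the curve is the point: at low acceptance rates the gain flattens after the first couple of drafted tokens, so a wide window mostly burns drafter compute, while at high acceptance rates each extra drafted token keeps paying off.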
The co-trained drafter is the key innovation here. Previous speculative decoding approaches (like Medusa or Eagle) trained separate heads or distilled smaller models post-hoc. Google trains the MTP head jointly during pre-training, which means the drafter shares the main model's internal representations. The result: higher acceptance rates, especially on domain-specific content where a separately-trained drafter would diverge.
The practical decision tree for teams right now:
The broader signal: Google is competing on inference economics, not just model quality. With GPT-5.5 doubling prices, the open-source stack (Gemma 4 + vLLM + MTP) becomes the rational default for teams spending >$10K/month on API calls. The model is Apache 2.0. The serving infrastructure is mature. The gap between "free" and "best" just narrowed considerably.
One more angle worth tracking: MTP's speedup depends heavily on output predictability, because the main model accepts more of the drafter's guesses when the next tokens are easy to predict. This creates an interesting incentive: structured output formats (JSON, function calls, typed responses) become even more efficient than free-form text. If your API returns JSON, MTP gives you near-3x throughput. If it returns creative prose, you get 1.3x. The economic pressure now favors constrained generation, which is exactly where production AI workloads already live. The architecture and the economics are finally aligned.
antirez/ds4 — Salvatore Sanfilippo (yes, the Redis creator) built a Metal-native inference engine for DeepSeek V4 Flash. Hits 26.68 tok/s generation and 250 tok/s prefill on M3 Max 128GB, with 1M token context window and compressed KV cache that persists to disk. Ships an OpenAI/Anthropic-compatible API, tool calling, and thinking mode out of the box. Requires Q2 quantization for 128GB machines, Q4 for 256GB+. If you have the hardware, this is the easiest path to running a frontier model without an API key.
h4ckf0r0day/obscura — A headless browser purpose-built for AI agents and web scraping, already at 11K stars. Handles fingerprint rotation, JavaScript rendering, and anti-bot bypass — the stuff that makes raw HTTP useless for modern web interactions. Fills a real gap: Playwright is too heavy for agent loops, requests/httpx can't handle SPAs, and existing scraping tools weren't designed for LLM-driven navigation. Worth watching as agent web interaction becomes a core primitive.
cosmicstack-labs/mercury-agent — Permission-hardened AI agent framework with token budgets and multi-channel access control. Ignore the "soul-driven" branding — the engineering underneath is what matters: granular tool permissions, per-action spending limits, full audit trails, and multi-channel routing. This is the missing middleware between "I have an LLM" and "I have a production agent system." Most agent frameworks treat permissions as an afterthought; Mercury makes them load-bearing.
Mouseww/anything-analyzer — All-in-one protocol analysis tool combining browser packet capture, MITM proxy, fingerprint spoofing, and AI-powered traffic analysis with MCP Server integration. Useful for debugging agent-to-service communication or reverse-engineering undocumented APIs. The MCP integration means you can point Claude or any MCP-compatible agent at it and get AI-assisted packet analysis. 2,300 stars and climbing.
Three things converged this week that point in the same direction: Cloudflare letting agents buy infrastructure autonomously, Anthropic's NLA research showing Claude has hidden internal states it doesn't verbalize, and AlphaEvolve operating without human review of intermediate generations. We're building systems that act independently, that have internal states we can't directly observe, and that optimize objectives at superhuman speed.
None of these are dangerous today. But the pattern matters. Every production team should be asking: "If this agent's spending cap fails, what's our blast radius? If it pursues a subgoal we didn't anticipate, how do we detect it? If it's aware of our tests, are our evaluations still valid?"
The teams that build these guardrails now (permission models, anomaly detection, evaluation integrity checks) will sleep better in 18 months than those shipping fast without them. The ones that treat agent autonomy as someone else's problem will learn the hard way that a $100/month cap is a configurable parameter, not a law of physics.
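As a starting point for the blast-radius question, here is a minimal sketch of a spending guard that sits outside the agent and logs every decision. `SpendGuard`, `BudgetExceeded`, the action names, and the cap values are all hypothetical, not any vendor's or framework's API. Amounts are integer cents to avoid floating-point drift.

```python
from datetime import datetime, timezone

class BudgetExceeded(Exception):
    """Raised when an action would breach a spending limit."""

class SpendGuard:
    """Hard caps enforced outside the agent, with an audit trail of every request."""

    def __init__(self, monthly_cap_cents, per_action_cap_cents):
        self.monthly_cap_cents = monthly_cap_cents
        self.per_action_cap_cents = per_action_cap_cents
        self.spent_cents = 0
        self.audit_log = []

    def authorize(self, action, cost_cents):
        if cost_cents < 0:
            raise ValueError("cost must be non-negative")
        if cost_cents > self.per_action_cap_cents:
            reason = "per-action cap exceeded"
        elif self.spent_cents + cost_cents > self.monthly_cap_cents:
            reason = "monthly cap exceeded"
        else:
            reason = None
        # Log denials as well as approvals: denied requests are the anomaly signal.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "cost_cents": cost_cents,
            "allowed": reason is None,
            "reason": reason,
        })
        if reason is not None:
            raise BudgetExceeded(f"{action}: {reason}")
        self.spent_cents += cost_cents

guard = SpendGuard(monthly_cap_cents=10_000, per_action_cap_cents=2_500)
guard.authorize("provision_worker", 2_000)
try:
    guard.authorize("buy_domain", 3_000)
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

The design choice worth copying is that the check runs in code the agent cannot edit and records what was refused. A cap that lives in the agent's own configuration is exactly the configurable parameter described above.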
— Aaron