Anthropic launches Dynamic Workflows with hundreds of parallel subagents, a Paris startup hits 3,000 tok/s on commodity GPUs, and Cloudflare reviews 131K PRs with AI.
The frontier models keep getting smarter, but this week's real story is the infrastructure catching up. When you can run 3,000 tokens per second on standard GPUs and orchestrate hundreds of agents in a single session, the bottleneck shifts from "can the model do this" to "can your architecture keep up."
Anthropic shipped Opus 4.8 on Tuesday. The benchmarks improved — 4x fewer unremarked code flaws, new highs on prosocial alignment metrics, misaligned behavior rates "substantially lower" than Opus 4.7 — but the headline feature isn't the model itself. It's Dynamic Workflows: the ability to run hundreds of parallel subagents in a single Claude Code session.
This matters because it changes the unit of work. Previously, an AI coding agent operated as a single thread of execution — read file, think, edit, repeat. Dynamic Workflows lets you decompose a codebase migration into 200 parallel tasks, each with its own agent, coordinated by a deterministic script layer. You write the orchestration in plain JavaScript — loops, conditionals, fan-out — while the model handles each leaf task. Anthropic claims teams are using this for "migrations across hundreds of thousands of lines of code."
Think about what this means for large-scale refactoring. You're not waiting for a single context window to process 500 files sequentially. You're spawning a fleet of agents, each handling one file or one module, with a coordinator that merges results and handles conflicts. The architecture mirrors how we already think about distributed systems — except the workers are LLMs.
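To make the fan-out shape concrete, here is a minimal sketch of a deterministic coordinator in Python using asyncio. It is not Anthropic's Dynamic Workflows API (which the announcement says is scripted in JavaScript). `run_leaf_task`, `migrate`, and the file paths are hypothetical names, and the model call is replaced by a placeholder.

```python
import asyncio


async def run_leaf_task(path: str) -> dict:
    """Hypothetical stand-in for one subagent handling one file."""
    await asyncio.sleep(0)  # placeholder for the real model call
    return {"path": path, "status": "ok", "patch": f"# migrated {path}"}


async def migrate(paths: list[str], max_parallel: int = 200) -> dict:
    """Deterministic coordinator: fan out, bound concurrency, merge results."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(path: str) -> dict:
        async with gate:
            try:
                return await run_leaf_task(path)
            except Exception as exc:
                # A failed leaf is recorded, not allowed to sink the whole run.
                return {"path": path, "status": "failed", "error": str(exc)}

    results = await asyncio.gather(*(guarded(p) for p in paths))
    return {
        "done": [r["path"] for r in results if r["status"] == "ok"],
        "retry": [r["path"] for r in results if r["status"] != "ok"],
    }


summary = asyncio.run(migrate([f"src/module_{i}.py" for i in range(500)]))
```

The point of the design is that the loop, the concurrency cap, and the retry list are ordinary code, so the only non-deterministic part is what happens inside each leaf.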
The pricing stayed flat: $5/$25 per million tokens (input/output). Fast mode runs at 2.5x speed for double the price, which is still 3x cheaper than previous models' fast tiers. The API model ID is claude-opus-4-8, and the model is available on Enterprise, Team, and Max plans.
Three other announcements worth noting: effort control lets you dial reasoning depth from quick-fire to exhaustive (options: default "high", plus "extra" and "max"). System entries in the Messages API allow mid-conversation instruction updates without breaking the prompt cache — you can steer long-running agent sessions without paying the cache-miss penalty. And the model's honesty improvements are measurable: it's 4x less likely to let code flaws pass without flagging them, which directly impacts the reliability of automated code review pipelines.
Anthropic also teased Project Glasswing and Claude Mythos Preview, a "higher-intelligence model class" currently limited to cybersecurity applications. General availability is promised in the "coming weeks." The naming alone tells you something about where they think the ceiling is, and why they're gating access to security use cases first.
A Paris-based startup called Kog.ai published benchmarks this week showing 3,000 output tokens per second from a single request on 8x AMD MI300X GPUs. On 8x NVIDIA H200s, they hit 2,100 tok/s. The model: a 2B dense coding model called Laneformer, running in FP16 with no quantization.
What makes this interesting isn't the raw number — it's what they didn't use. No speculative decoding. No pruning. No KV cache compression. No quantization. Just a fundamentally different execution model.
The core insight: at batch size 1, LLM decode speed is bounded by memory bandwidth, not compute. Their 2B FP16 model weighs ~4GB. With 8x MI300X delivering ~33.6 TB/s effective bandwidth, the theoretical ceiling is ~8,400 tok/s. They're achieving 36% Memory Bandwidth Utilization (MBU). Standard inference stacks hit maybe 10-15%.
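The bandwidth ceiling is simple enough to check by hand. A rough sketch, assuming every decoded token streams the full weights once and ignoring KV cache and activation traffic (the function name is mine, the numbers are the ones quoted above):

```python
def decode_ceiling_tok_s(params_billion: float, bytes_per_param: int,
                         bandwidth_tb_s: float) -> float:
    """Upper bound on batch-1 decode speed if each token reads all weights once."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes


# 2B parameters in FP16 (2 bytes each) is about 4 GB of weights.
ceiling = decode_ceiling_tok_s(2, 2, 33.6)

# Memory Bandwidth Utilization: achieved rate over the theoretical ceiling.
mbu = 3000 / ceiling

print(f"ceiling ~{ceiling:,.0f} tok/s, MBU ~{mbu:.0%}")
```

Plug in a 10-15% MBU instead and the same hardware lands somewhere around 840-1,260 tok/s, which is the gap the rest of this section is about.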
Where does the rest go? Kernel launch overhead. A typical inference stack launches ~10 kernels per layer across 25 layers, or 250 launches per token. At ~4.5µs per launch, that's 1,125µs of dead time per token before any useful work happens. At 3,000 tok/s, your entire per-token budget is 333µs. You literally cannot afford kernel launches.
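The budget math is worth writing down. Note that 10 kernels across 25 layers is 250 launches per token, and the 1,125µs total corresponds to roughly 4.5µs per launch, so that is the per-launch value assumed in this sketch (function names are mine):

```python
def launch_overhead_us(kernels_per_layer: int, layers: int,
                       launch_us: float) -> float:
    """Dead time per token spent purely on kernel launches, in microseconds."""
    return kernels_per_layer * layers * launch_us


def token_budget_us(tok_per_s: float) -> float:
    """Total wall-clock time available per token at a target decode rate."""
    return 1e6 / tok_per_s


overhead = launch_overhead_us(10, 25, 4.5)  # 250 launches per token
budget = token_budget_us(3000)

# Best case if launches were the ONLY cost and compute were free.
launch_bound_tok_s = 1e6 / overhead

print(f"overhead {overhead:.0f} us vs budget {budget:.0f} us per token")
print(f"launch-bound ceiling ~{launch_bound_tok_s:.0f} tok/s")
```

Under these assumptions launch overhead alone caps you below 900 tok/s, no matter how fast the memory system is.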
Their solution is a persistent monokernel — one GPU-resident program for the entire decode path. No launches, no CPU round-trips, no framework overhead. The hot path uses raw CUDA+PTX on NVIDIA and HIP+CDNA ISA on AMD. No PyTorch. No Triton. No CUTLASS. No NCCL.
They also built custom collective communication (KCCL) achieving AllReduce in under 3µs versus ~8µs for vendor libraries. On MI300X specifically, they mapped physical memory-address-to-IOD routing and placed buffers so each XCD polls from its local HBM stack — bringing barrier latency down to ~600ns.
The projected numbers for production MoE models are compelling:
For context: most production deployments get 50-100 tok/s per request from frontier models today. If Kog's approach generalizes to larger architectures, we're looking at a 10-25x improvement in user-perceived latency without touching the model at all. Real-time voice applications need ~200 tok/s to feel natural. Interactive coding assistants feel sluggish below ~80 tok/s. At 3,000 tok/s, you're generating faster than any human can read — the model is no longer the latency bottleneck; your network is.
The trade-off is engineering complexity. Writing and maintaining hand-tuned ISA kernels for every GPU architecture is brutal. You need deep hardware expertise, and every new chip generation requires re-optimization. There's a reason most of the industry leans on PyTorch and Triton — they trade performance for portability and developer velocity. Kog is betting that for inference serving (as opposed to training), the performance gap is large enough to justify the engineering investment.
The team is 11 people. Five PhDs. $5M raised from Varsity VC and BPI France. They're building the inference equivalent of what io_uring did for Linux I/O — proving that decades of software abstraction layers were hiding massive performance wins that only become visible when you rewrite from first principles.
SmallCode: A terminal-based coding agent engineered for 8B-35B local models on consumer hardware. Instead of assuming frontier-model capabilities, it compensates for small-model failure modes with architectural guardrails. Two-stage tool routing halves schema overhead by making the model pick a category first (read/write/search/run/plan), then showing only the relevant tool schemas. A context budget engine caps tool results at 4K chars with semantic compression. A forgiving parser handles JSON, YAML, XML, and plain-text tool calls, because 8B models produce inconsistent output formats. "Adaptive model routing" tracks per-model failure rates and automatically escalates to stronger models when local failure rates spike past configurable thresholds. It also has early-stop detection for repetition loops and patch spirals. 1.6K stars, MIT licensed.
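Two of those guardrails are easy to picture in code. This is a sketch of the idea, not SmallCode's implementation: the tool registry and function names are hypothetical, and the result cap uses plain head-and-tail truncation as a stand-in for the semantic compression the project describes.

```python
# Hypothetical tool registry, grouped by the five categories named above.
TOOLS = {
    "read": [{"name": "read_file", "params": ["path"]}],
    "write": [
        {"name": "write_file", "params": ["path", "content"]},
        {"name": "apply_patch", "params": ["path", "diff"]},
    ],
    "search": [{"name": "grep", "params": ["pattern", "path"]}],
    "run": [{"name": "shell", "params": ["command"]}],
    "plan": [{"name": "update_plan", "params": ["steps"]}],
}

MARKER = "\n...[truncated]...\n"


def stage_one_choices() -> list[str]:
    """Stage 1: the model sees only category names, no schemas."""
    return sorted(TOOLS)


def stage_two_schemas(category: str) -> list[dict]:
    """Stage 2: show only the schemas for the category the model picked."""
    key = category.strip().lower()  # tolerate sloppy model output
    if key not in TOOLS:
        raise ValueError(f"unknown category: {category!r}")
    return TOOLS[key]


def cap_result(text: str, limit: int = 4000) -> str:
    """Keep the head and tail of an oversized tool result within `limit` chars."""
    if len(text) <= limit:
        return text
    keep = limit - len(MARKER)
    if keep < 2:
        raise ValueError("limit too small to truncate meaningfully")
    head_len = keep - keep // 2
    tail_len = keep // 2
    return text[:head_len] + MARKER + text[len(text) - tail_len:]
```

The saving comes from stage one: a small model choosing among five words is a much easier task than choosing among every tool schema at once.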
OpenSquilla: Token-efficient agent framework that routes each conversational turn to the cheapest capable model using an on-device LightGBM+ONNX classifier called SquillaRouter. Classification evaluates length, language, code presence, keywords, and semantic embeddings, and critically, the prompt never leaves your machine to make the routing decision. On PinchBench 1.2.1 (25 tasks), it scores 0.9251 at $0.69 total cost versus 0.9255 at $6.23 for Claude Opus 4.7 alone: nearly identical quality at 9x lower cost. It also scales system prompts by complexity: lightweight prompts for trivial turns, full instructions for hard ones. It supports 20+ providers, persistent memory via SQLite+vector search, and MCP client/server mode, and has a layered sandbox with Bubblewrap isolation. 2.1K stars, Apache 2.0.
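The "cheapest capable model" rule reduces to a filter and a min. A toy sketch follows, with a hand-written heuristic standing in for the LightGBM+ONNX classifier. The model names, capability tiers, and per-token costs are made up for illustration and are not OpenSquilla's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: int       # 1 = trivial turns only, 3 = hardest turns
    cost_per_mtok: float  # illustrative number, not a real price


# Hypothetical model pool.
MODELS = [
    Model("small-local", 1, 0.0),
    Model("mid-tier", 2, 1.0),
    Model("frontier", 3, 25.0),
]

HARD_KEYWORDS = ("refactor", "prove", "architecture")


def difficulty(prompt: str) -> int:
    """Crude stand-in for the classifier: code presence, length, keywords."""
    score = 1
    if "def " in prompt or "import " in prompt:
        score += 1
    if len(prompt) > 2000 or any(k in prompt.lower() for k in HARD_KEYWORDS):
        score += 1
    return min(score, 3)


def route(prompt: str) -> Model:
    """Pick the cheapest model whose capability covers the estimated difficulty."""
    need = difficulty(prompt)
    capable = [m for m in MODELS if m.capability >= need]
    return min(capable, key=lambda m: m.cost_per_mtok)


# Sanity check on the benchmark figures quoted above.
cost_ratio = 6.23 / 0.69
```

Because the scoring runs locally on features of the prompt, nothing has to be sent to a provider before the routing decision is made.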
AiSOC: Self-hostable AI security operations center with 52 first-party connectors (EDR/XDR, SIEM, cloud, identity, SaaS, network), a 600-line LangGraph orchestrator, and a Neo4j-backed MITRE ATT&CK reasoning engine for attack-path reconstruction. Every LLM decision (prompt, response, evidence cited, tool call) gets logged to an Investigation Ledger and is fully replayable. It ships with a 200-incident synthetic dataset across 55 templates and an eval harness that gates every PR in CI. Compare that to commercial SOC vendors who won't let you see their agent's reasoning. It supports air-gapped deployment with local Ollama models. 1.1K stars, MIT licensed.
The Kog.ai result is the one I keep thinking about. We've spent three years assuming inference cost drops come from quantization, distillation, or cheaper hardware. Turns out a meaningful chunk of the cost was just bad software engineering — kernel launches, framework abstractions, vendor communication libraries nobody bothered to replace. A team of 11 in Paris just demonstrated that the software stack was leaving an order of magnitude on the table.
If their approach generalizes — and the projected numbers for production MoE models suggest it does — the "GPU shortage" narrative needs revision. It might partially be a "software efficiency shortage" wearing a hardware costume. When you can get 4,000 tok/s from a 3B-active MoE on existing MI300X hardware, the cost curve for real-time AI applications drops below thresholds that unlock entirely new product categories. That changes the economics of everything we're building — and it doesn't require waiting for next-gen silicon.
— Aaron