Compare Needle 26M, FunctionGemma 270M, Qwen 0.6B, and Granite 350M for on-device tool calling. Architecture and benchmarks.

TL;DR: Needle is a 26M-parameter model distilled from Gemini 3.1 that outperforms models 10-25x its size on single-shot function calling, decoding at 1,200 tokens/second on edge hardware. For on-device AI agents that need to route user commands to the right tool with correct arguments — smartwatches, phones, glasses — specialized tiny models now beat general-purpose small LLMs by trading conversational breadth for tool-calling precision.
Every AI agent framework — LangChain, AgentCore, CrewAI — sends tool-calling requests to large cloud models. GPT-4o, Claude Sonnet, Gemini Pro all handle function calling excellently, but they require network round-trips of 500-2000ms, cost $3-15 per million tokens, and expose user data to third-party APIs. For the next wave of AI agents running on personal devices, this architecture breaks down.
Consider a smartwatch that needs to parse "set a timer for 8 minutes" into {"name": "set_timer", "arguments": {"duration_minutes": 8}}. The latency budget is 50ms. The memory budget is 30MB. The privacy requirement is absolute — no data leaves the device. No 70B parameter model fits this constraint. No cloud API meets this latency.
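For concreteness, here is what the timer tool's schema might look like, sketched in the widely used JSON Schema style (the exact on-device format varies by runtime; this is illustrative only):

```python
import json

# Hypothetical schema for the timer tool, JSON Schema style
set_timer_tool = {
    "name": "set_timer",
    "description": "Start a countdown timer on the device.",
    "parameters": {
        "type": "object",
        "properties": {"duration_minutes": {"type": "integer"}},
        "required": ["duration_minutes"],
    },
}

# The model's entire job: map the utterance to a call against this schema
utterance = "set a timer for 8 minutes"
expected_call = {"name": "set_timer", "arguments": {"duration_minutes": 8}}
print(json.dumps(expected_call))
```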
This gap created demand for tiny models specialized exclusively for tool calling — models that sacrifice general knowledge and conversational ability to achieve near-perfect accuracy on a narrow task: given a user query and a set of available tools, select the correct tool and generate valid arguments.
In early 2026, four models compete in this space, each with different architectural bets.
Before comparing models, it helps to understand what "tool calling" means computationally. The model receives a structured prompt containing the user's natural-language query and a schema for each available tool: its name, a description, and its typed parameters.
From that prompt, the model must select the correct tool, extract argument values from the query, and emit a valid JSON invocation conforming to the tool's schema.
This is fundamentally a structured generation problem, not a conversational one. The model doesn't need world knowledge, reasoning chains, or multi-turn context. It needs pattern matching between natural language intents and function signatures, plus reliable JSON generation.
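A minimal sketch of how such a prompt might be assembled. The template below is purely illustrative; real tool-calling models use their own formats learned during fine-tuning:

```python
import json

def build_prompt(query: str, tools: list) -> str:
    # Serialize each tool schema, then append the user query.
    # A production model's prompt template is part of its training.
    tool_block = "\n".join(json.dumps(t) for t in tools)
    return f"TOOLS:\n{tool_block}\n\nQUERY: {query}\n\nCALL:"

prompt = build_prompt(
    "weather in Paris",
    [{"name": "get_weather", "parameters": {"location": "string"}}],
)
```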
This insight is why specialized small models can beat general-purpose LLMs 25x their size — they allocate all their capacity to the specific computation needed, rather than spreading parameters across encyclopedic knowledge and general reasoning.
The size difference is striking. Needle is 10x smaller than its nearest competitor and 23x smaller than Qwen. Yet it claims superior single-shot function calling accuracy. How?
Needle uses what Cactus Compute calls a "Simple Attention Network" — an encoder-decoder architecture with unusual design choices optimized specifically for the function-calling task.
The critical design choice: no feed-forward network layers. In a standard transformer, each layer has a self-attention block followed by a 2-layer FFN (typically 4x the embedding dimension). The FFN acts as a key-value memory that stores factual knowledge learned during pre-training. By removing it entirely from the encoder, Needle makes an explicit architectural statement: the encoder doesn't need to store knowledge — it only needs to understand the structural relationship between tokens.
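To see how much capacity this frees up, consider the parameter count of one transformer layer. The embedding dimension below is assumed for illustration; Needle's exact configuration isn't published:

```python
d_model = 512  # assumed embedding dimension, not a published figure

# Self-attention: Q, K, V, and output projections
attn_params = 4 * d_model * d_model        # 1,048,576

# Standard FFN: two linear layers at 4x expansion
ffn_params = 2 * d_model * (4 * d_model)   # 2,097,152

standard_layer = attn_params + ffn_params  # 3,145,728
ffn_share = ffn_params / standard_layer    # exactly 2/3

# Dropping the FFN removes roughly two-thirds of each layer's
# parameters, which can be reinvested in attention capacity.
```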
For tool calling, this makes sense. The encoder processes the user query and tool schemas. It doesn't need to know that Paris is the capital of France — it needs to understand that "weather in Paris" maps to the location parameter of get_weather. This is purely relational, not factual.
The decoder includes cross-attention over encoder outputs and self-attention for autoregressive generation. It generates the structured JSON output token by token.
Needle uses 8 attention heads with only 4 key-value heads — meaning KV heads are shared between pairs of query heads. This halves the KV cache memory during inference, critical for edge devices with limited RAM.
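The saving is easy to verify with a back-of-the-envelope calculation. Only the 8-query/4-KV-head split comes from the article; the other dimensions are assumed for illustration:

```python
d_model, n_heads, n_kv_heads = 512, 8, 4      # head counts per the article
head_dim = d_model // n_heads                  # 64
n_layers, seq_len, bytes_per_val = 6, 256, 2   # assumed; 2 bytes = fp16

def kv_cache_bytes(kv_heads: int) -> int:
    # One K and one V tensor per layer: 2 * seq * kv_heads * head_dim
    return 2 * seq_len * kv_heads * head_dim * bytes_per_val * n_layers

mha_cache = kv_cache_bytes(n_heads)      # full multi-head attention
gqa_cache = kv_cache_bytes(n_kv_heads)   # grouped-query attention
# Halving the KV heads halves the cache, exactly as described
```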
The embedding matrix is shared between encoder input, decoder input, and output projection. With a vocabulary of only 8,192 tokens, this saves approximately 8M parameters (8192 × 512 × 2 matrices that would otherwise be separate).
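The stated saving checks out arithmetically. The vocabulary size is from the article; the embedding dimension of 512 is assumed:

```python
vocab_size, d_model = 8192, 512  # 8,192-token vocabulary per the article

untied = 3 * vocab_size * d_model  # encoder embed + decoder embed + output head
tied = 1 * vocab_size * d_model    # one matrix shared across all three roles
saved = untied - tied              # 8192 * 512 * 2 = 8,388,608, roughly 8M
```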
By eliminating FFN layers and using a tiny vocabulary, nearly all parameters go to attention — the component that actually matters for understanding query-tool relationships.
Needle wasn't trained from scratch on tool-calling examples. It was distilled from Gemini 3.1 in a two-phase process:
Phase 1 (distillation pre-training): The 26M student model learns to predict the next token on 200 billion tokens, with Gemini 3.1's output logits as soft targets. This transfers general language understanding — syntax, semantics, common patterns — without requiring the student to have enough parameters to store all of Gemini's factual knowledge.
The key insight: distillation transfers capability (how to process language) more efficiently than knowledge (what facts are true). A 26M model can learn Gemini's linguistic computation patterns even though it can't store Gemini's world knowledge.
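Mechanically, soft-target distillation minimizes the divergence between teacher and student token distributions. A minimal sketch of the standard objective (temperature and logits here are illustrative; Cactus Compute has not published its exact loss):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in Hinton et al.'s original formulation.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student))
    return kl * temperature ** 2

loss = distillation_loss([2.0, 0.5, -1.0], [2.1, 0.4, -0.9])
```

When student and teacher logits match exactly, the loss is zero; the gradient pushes the student's full output distribution, not just its top-1 token, toward the teacher's.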
Phase 2 (task-specific fine-tuning): The pre-trained model is then fine-tuned on 2 billion tokens of single-shot function-call data — pairs of (query + tools) → (tool invocation JSON). This phase narrows the model's general language ability into the specific structured generation task.
The brevity of Phase 2 (45 minutes vs 27 hours) suggests that Phase 1 does most of the heavy lifting — the model already knows how to process language; it just needs to learn the specific output format.
Cactus Compute's tooling includes a web UI where you can test queries against your tool schemas and iterate on fine-tuning, making it straightforward to validate accuracy before deploying to a device.
On edge hardware, Needle completes a full tool-call generation in approximately 40ms — well within the 100ms latency budget required for responsive voice assistants.
At INT4 quantization, Needle fits in 13MB — small enough to run alongside other applications on a smartwatch with 512MB total RAM.
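These latency and memory figures are mutually consistent with the article's other numbers, which is a useful sanity check:

```python
decode_rate = 1200     # tokens/second, per the article
gen_latency = 0.040    # ~40ms per full tool call, per the article
tokens_generated = decode_rate * gen_latency   # about 48 tokens
# A compact tool-call JSON is a few dozen tokens, so the numbers line up.

params = 26_000_000                 # 26M parameters
int4_bytes = params * 4 / 8         # 4 bits per weight
int4_mb = int4_bytes / 1_000_000    # 13.0 MB, matching the reported footprint
```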
Cactus Compute reports Needle beating all comparison models on single-shot tool calling. The critical nuance: this is specifically for single-shot scenarios where the model sees one query, a tool schema, and must produce one correct tool call. For multi-turn conversations where the model must maintain context, larger models still dominate.
This specialization trade-off is the central insight: Needle sacrifices general capability for extreme performance on a narrow task. It cannot hold a conversation. It cannot answer questions. It cannot reason about multi-step plans. It does one thing — map queries to tool calls — and does it better than models 25x its size.
Here's a practical architecture for deploying small tool-calling models in production.
The tool-calling model is one component in a pipeline. On a modern smartphone, the entire pipeline runs in under 200ms end-to-end.
No cloud. No network latency. No API costs. No privacy concerns.
What happens when the user query doesn't clearly map to any tool? Small models need a fallback strategy: if the on-device model is unsure or no tool matches, escalate the request to a larger model.
This hybrid architecture uses the tiny model for the roughly 80% of queries that map cleanly to tools, and falls back to a cloud model only when needed.
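A sketch of that routing logic with stubbed model calls. `run_local_model` and `call_cloud_model` are hypothetical placeholders standing in for real inference, not an actual API:

```python
CONFIDENCE_THRESHOLD = 0.8  # tuned per deployment

def run_local_model(query: str, tools: list):
    # Stub for on-device inference: returns (call, confidence)
    if "timer" in query:
        return {"name": "set_timer", "arguments": {"duration_minutes": 8}}, 0.95
    return None, 0.1

def call_cloud_model(query: str, tools: list) -> dict:
    # Stub for the cloud fallback path
    return {"name": "cloud_fallback", "arguments": {"query": query}}

def route(query: str, tools: list) -> dict:
    call, confidence = run_local_model(query, tools)
    if call is not None and confidence >= CONFIDENCE_THRESHOLD:
        return call                         # stays on device
    return call_cloud_model(query, tools)   # ambiguous query escalates

result = route("set a timer for 8 minutes", [])
```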
Needle represents a broader trend: instead of making small models good at everything (the Phi/Gemma approach), make them perfect at one thing through targeted distillation. The same pattern has implications well beyond tool calling.
The common thread is that many "AI tasks" in production systems are actually narrow pattern-matching problems disguised as general language understanding. A 26M model with the right architecture and training data can handle them better than a 70B generalist.
Tool calling and function calling refer to the same capability — an AI model selecting a specific function and generating structured arguments based on a natural language query. "Function calling" was the original term used by OpenAI's API, while "tool calling" became the broader industry standard encompassing MCP tools, API endpoints, and local functions. Both describe the model's ability to output structured JSON that invokes external capabilities rather than generating free-text responses.
Needle replaces GPT-4o only for single-shot tool routing on edge devices with fixed tool schemas. If your application needs multi-turn conversations, complex reasoning about which tools to chain together, or handles hundreds of different tool schemas dynamically, large cloud models remain necessary. Needle excels when latency, privacy, and cost constraints make cloud models impractical and the tool-calling pattern is predictable.
Cactus Compute's pipeline generates 10,000+ synthetic examples per tool set, which takes approximately 45 minutes to fine-tune. In practice, 1,000-5,000 high-quality examples per tool typically achieve 95%+ accuracy on well-defined schemas. The data should cover variations in phrasing, parameter edge cases (missing optional params, multiple valid phrasings), and negative examples (queries that don't match any tool).
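A toy version of template-based example generation that covers phrasing variations for one tool. The templates and helper below are hypothetical; Cactus Compute's actual pipeline is not public:

```python
import itertools

# Hypothetical phrasing templates for the set_timer tool
TEMPLATES = [
    "set a timer for {n} minutes",
    "start a {n} minute timer",
    "remind me in {n} minutes",
]

def synth_examples(durations):
    # Cross every phrasing template with every parameter value
    for template, n in itertools.product(TEMPLATES, durations):
        yield {
            "query": template.format(n=n),
            "target": {"name": "set_timer",
                       "arguments": {"duration_minutes": n}},
        }

examples = list(synth_examples([1, 5, 8, 30]))  # 3 templates x 4 values
```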
At INT4 quantization (13MB), Needle runs on virtually any modern processor including smartphone CPUs (Snapdragon 8 Gen 3, Apple A17), smartwatch chips (Snapdragon W5+), and embedded systems (Raspberry Pi 5, ESP32-S3 with PSRAM). The model requires no GPU — pure CPU inference at 1,200 tokens/second is fast enough for real-time tool calling. Any device with 50MB of free RAM can run Needle alongside other applications.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.