Open agentic model routing for coding agents

Agent-as-a-Router
Agentic Model Routing for Coding Tasks

A router that chooses the most suitable LLM for solving the current coding task in an agent workflow, verifies what happened, and carries that experience into the next task.

~10K coding tasks
8 frontier LLMs
80K+ verified responses
Updated Agent-as-a-Router pipeline figure

Routing in the loop

Static routers decide from frozen priors. ACRouter learns from execution signals across context, action, feedback, and memory.

01 Orchestrator: Combines priors, task metadata, retrieved neighbors, and a compact policy model.

02 Verifier: Aggregates AST checks, execution, prompt tests, and rule signals into feedback.

03 Memory: Stores task embeddings, chosen model, observed quality, cost, and verification trace.
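The three components above can be sketched as a small loop. This is a minimal illustration, not ACRouter's implementation: the names `Memory`, `MemoryEntry`, `verify`, and `route` are hypothetical, the verifier is assumed to reduce its checks to the fraction that passed, and a simple prior-weighted average over retrieved neighbors stands in for the compact policy model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    embedding: list   # task embedding
    model: str        # model chosen for the task
    quality: float    # verified quality in [0, 1]
    cost: float       # observed cost in USD
    trace: dict       # verification trace


@dataclass
class Memory:
    entries: list = field(default_factory=list)

    def neighbors(self, embedding, k=3):
        # Nearest stored tasks by Euclidean distance between embeddings.
        ranked = sorted(self.entries, key=lambda e: math.dist(e.embedding, embedding))
        return ranked[:k]

    def store(self, entry):
        self.entries.append(entry)


def verify(signals):
    # Assumption: each check is a boolean; quality is the fraction that passed.
    score = sum(signals.values()) / len(signals)
    return score, dict(signals)


def route(embedding, priors, memory, k=3, prior_weight=1.0):
    # Blend each model's prior with the verified quality seen on similar tasks.
    totals = {m: prior_weight * p for m, p in priors.items()}
    counts = {m: prior_weight for m in priors}
    for entry in memory.neighbors(embedding, k):
        if entry.model in totals:
            totals[entry.model] += entry.quality
            counts[entry.model] += 1
    return max(priors, key=lambda m: totals[m] / counts[m])


# Hypothetical priors and a two-task stream.
priors = {"model_a": 0.6, "model_b": 0.5}
memory = Memory()

task = [0.1, 0.9]
first = route(task, priors, memory)            # no memory yet: the prior decides
score, trace = verify({"ast_ok": True, "tests_pass": False, "rules_ok": False})
memory.store(MemoryEntry(task, first, score, 0.02, trace))

second = route([0.1, 0.8], priors, memory)     # a similar task, now with feedback
```

In this toy run the first choice follows the prior, the weak verification result is written to memory, and the second, similar task is routed to the other model.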

Updated Routing in the loop figure

CodeRouterBench

A stream-shaped benchmark for model selection across single-turn coding tasks and an OOD agentic-programming stream.

Task dimensions: 9+1. Nine ID coding dimensions plus OOD agentic programming.

Router gain: +15.3%. Relative lift from giving the router performance statistics.

Evaluation: Regret. Routers are compared by cumulative regret over realistic streams.
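A sketch of how cumulative regret over a stream might be computed. The exact definition is not given here, so this assumes per-task regret is the best score among the candidate models minus the score of the chosen model, which is consistent with the Oracle row reporting a CumReg of 0. The function name and the toy scores are hypothetical.

```python
from itertools import accumulate


def cumulative_regret(scores, choices):
    """Running regret over a task stream.

    scores:  one dict per task, mapping model name -> verified score.
    choices: the model the router picked for each task, in stream order.
    """
    per_task = [max(s.values()) - s[c] for s, c in zip(scores, choices)]
    return list(accumulate(per_task))


# Hypothetical three-task stream with two candidate models.
scores = [
    {"model_a": 1.0, "model_b": 0.0},
    {"model_a": 0.0, "model_b": 1.0},
    {"model_a": 0.5, "model_b": 1.0},
]

always_a = cumulative_regret(scores, ["model_a"] * 3)
oracle = cumulative_regret(scores, ["model_a", "model_b", "model_b"])
```

Here the always-`model_a` policy ends the stream with a regret of 1.5, while the oracle stays at 0 throughout.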

Updated CodeRouterBench construction figure

Main experiment

Routing Results

Grouped by component-configuration taxonomy. The left block measures in-distribution coding tasks; the right block measures real-world OOD agentic programming.

ID AvgPerf: 49.98%. ACRouter leads all non-oracle routers on 2,919 in-distribution tasks.

OOD CumReg: 17.0. The agentic router keeps the lowest regret on 176 held-out agentic-programming tasks.

Cost signal: Perf / USD. Efficient baselines can be cheap, but their OOD quality drops sharply.

Routing results across in-distribution and OOD tests. CumReg is cumulative regret across all tasks.
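| Taxonomy | Router | ID AvgPerf % | ID CumReg | ID Perf / USD | OOD AvgPerf % | OOD CumReg | OOD Perf / USD |
|---|---|---|---|---|---|---|---|
| Upper bound | Oracle | 57.00 | 0 | 8.20 | 75.89 | 0 | 2.32 |
| Agent-as-a-Router | ACRouter (ours) | 49.98 | 205.5 | 3.79 | 62.50 | 17.0 | 1.18 |
| Dynamic: Online Bandit | LinTS | 46.48 | 307.4 | 4.49 | 46.43 | 35.9 | 0.75 |
| Dynamic: Online Bandit | LinUCB | 46.84 | 296.9 | 4.38 | 49.82 | 31.1 | 0.96 |
| Static: Heuristic | DimensionBest | 47.50 | 277.4 | 3.69 | -- | -- | -- |
| Static: Heuristic | kNN Retrieval | 47.18 | 286.7 | 6.07 | 14.29 | 66.7 | 1.45 |
| Static: Trained Policy | LogReg | 47.26 | 284.4 | 6.27 | 19.64 | 61.8 | 1.17 |
| Static: Trained Policy | RouteLLM-BERT | 47.22 | 285.5 | 6.22 | 21.43 | 59.4 | 1.30 |
| Static: Trained Policy | TF-IDF+MLP | 46.97 | 292.8 | 6.11 | 13.39 | 67.9 | 1.17 |
| Static: Trained Policy | Qwen3.5-0.8B-Finetuned | 46.41 | 309.1 | 6.82 | 55.36 | 27.2 | 0.74 |
| Static: Trained Policy | RouteLLM-MF | 46.16 | 316.5 | 6.19 | 8.93 | 72.7 | 0.94 |
| Single-Model Baselines | Always-Opus 4.6 | 43.83 | 387.1 | 1.29 | 57.14 | 26.7 | 0.64 |
| Single-Model Baselines | Always-Kimi-K2.5 | 36.66 | 593.3 | 12.62 | 18.75 | 62.3 | 1.22 |
| Single-Model Baselines | Always-Qwen3.5-Plus | 37.16 | 580.2 | 2.05 | 2.68 | 80.1 | 0.19 |
| Single-Model Baselines | Random | 38.75 | 533.6 | 2.48 | 31.25 | 50.4 | 0.85 |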

Bold green values mark the strongest non-oracle quality/regret result. Gold values mark the strongest cost-efficiency result. DimensionBest is not applicable to OOD because unseen agentic-programming tasks have no predefined dimension-to-model mapping.

Evidence views

Why routing matters beyond the table

Three compact views show model complementarity, regret over task streams, and the cost-performance frontier behind the headline numbers.

Complementarity: Performance, cost, and efficiency analysis across coding dimensions.

Performance varies by coding dimension, while cost and AvgPerf per dollar expose why a single premium model is not always the right deployment choice.

Regret over streams: Cumulative regret across in-distribution and OOD task streams.

Static routers accumulate regret faster on in-distribution tasks and collapse on OOD tasks, while ACRouter keeps lower regret as verified memory accumulates.

Trade-off frontier: Cost-performance Pareto frontier analysis.

ACRouter extends the deployable frontier upward in both ID and OOD, with higher AvgPerf and lower cost than always choosing a premium model.
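For readers less familiar with the frontier view, here is a minimal sketch of how a cost-performance Pareto frontier can be extracted from a set of routers. The function name and all cost and performance numbers are hypothetical placeholders, not values from the benchmark.

```python
def pareto_frontier(points):
    """Return the names of non-dominated options, sorted by cost.

    points: dict mapping name -> (cost, perf). An option is dominated when
    another option is no more expensive and no worse, and strictly better
    on at least one of the two axes.
    """
    frontier = []
    for name, (cost, perf) in points.items():
        dominated = any(
            c <= cost and p >= perf and (c < cost or p > perf)
            for other, (c, p) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Hypothetical (cost in USD, average performance) pairs.
points = {
    "cheap_model": (1.0, 0.40),
    "router": (3.0, 0.55),
    "mid_model": (4.0, 0.45),
    "premium_model": (8.0, 0.50),
}

frontier = pareto_frontier(points)
```

With these placeholder numbers the router dominates both the mid-priced and the premium model, so only the cheap model and the router remain on the frontier.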

Agentic Artifacts

An ARA-style entry is ready for ACRouter, benchmark splits, score matrices, verifier traces, and the held-out agentic stream.

Manifest: agentic-artifacts.json

The best router keeps learning from execution.

Updated OOD results sharpen the split: ACRouter keeps the lowest regret among routers, while a standalone GPT-5.4 backend resolves 75.00% on the same 176 agentic-programming tasks.

Capability layers and Pareto analysis

Citation

@article{agent2026zhou,
  title         = {Agent-as-a-Router: Agentic Model Routing for Coding Tasks},
  author        = {Pengfei Zhou and Zhiwei Tang and Yixing Ma and Jiasheng Tang and Yizeng Han and Zhenglin Wan and Fanqing Meng and Wei Wang and Bohan Zhuang and Wangbo Zhao and Yang You},
  journal       = {arXiv preprint arXiv:2606.22902},
  year          = {2026},
  archivePrefix = {arXiv},
  eprint        = {2606.22902},
  url           = {https://arxiv.org/abs/2606.22902},
}