Open agentic model routing for coding agents

Agent-as-a-Router
Agentic Model Routing for Coding Tasks

A router that chooses the most suitable LLM for the current coding task in an agent workflow, verifies the outcome, and carries that experience into the next task.

~10K

coding tasks

frontier LLMs

80K+

verified responses

Figure: Agent-as-a-Router pipeline.

Routing in the loop

Static routers decide from frozen priors. ACRouter learns from execution signals across context, action, feedback, and memory.

Orchestrator

Combines priors, task metadata, retrieved neighbors, and a compact policy model.

Verifier

Aggregates AST checks, execution, prompt tests, and rule signals into feedback.
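
As a rough illustration of how such signals might be folded into one feedback score, the sketch below combines four checks with fixed weights. The `Signals` fields, the `WEIGHTS` values, and `aggregate_feedback` are hypothetical names introduced here; the page does not say how ACRouter weights its signals. Only the AST check is implemented concretely, using Python's standard `ast.parse`; the execution, test, and rule results are assumed to come from elsewhere.

```python
import ast
from dataclasses import dataclass


def ast_check(source: str) -> bool:
    """Return True if the candidate response parses as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


@dataclass
class Signals:
    ast_ok: bool       # did the code parse?
    exec_ok: bool      # did it run without raising (reported by a sandbox)?
    tests_passed: int  # number of prompt tests that passed
    tests_total: int   # number of prompt tests that were run
    rules_ok: bool     # did rule-based checks pass?


# Hypothetical weights for illustration; not taken from the paper.
WEIGHTS = {"ast": 0.1, "exec": 0.2, "tests": 0.6, "rules": 0.1}


def aggregate_feedback(s: Signals) -> float:
    """Weighted sum of the four signals, in the range [0, 1]."""
    test_rate = s.tests_passed / s.tests_total if s.tests_total else 0.0
    return (
        WEIGHTS["ast"] * s.ast_ok
        + WEIGHTS["exec"] * s.exec_ok
        + WEIGHTS["tests"] * test_rate
        + WEIGHTS["rules"] * s.rules_ok
    )
```

A linear blend is the simplest choice that keeps every signal visible in the final score; a real verifier could just as well gate on hard failures first.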

Memory

Stores task embeddings, chosen model, observed quality, cost, and verification trace.
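
One way to picture how memory could feed back into the next routing decision is the minimal sketch below. It stores the fields listed above and retrieves nearest neighbors by cosine similarity. `Record`, `Memory`, `choose_model`, and the 50/50 blend of priors with observed quality are assumptions made for illustration; in particular, the fixed blend stands in for the compact policy model, whose design the page does not describe.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    embedding: list[float]  # task embedding
    model: str              # chosen model
    quality: float          # observed quality, e.g. a verifier score in [0, 1]
    cost: float             # cost in USD
    trace: str              # verification trace


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Memory:
    records: list[Record] = field(default_factory=list)

    def add(self, record: Record) -> None:
        self.records.append(record)

    def neighbors(self, embedding, k=3):
        """Return the k stored records most similar to the query embedding."""
        ranked = sorted(
            self.records,
            key=lambda r: cosine(embedding, r.embedding),
            reverse=True,
        )
        return ranked[:k]


def choose_model(embedding, memory, priors, k=3):
    """Pick the model with the best blend of prior and neighbor quality."""
    observed = {}
    for r in memory.neighbors(embedding, k):
        observed.setdefault(r.model, []).append(r.quality)

    def score(model):
        seen = observed.get(model)
        if not seen:
            return priors[model]  # no evidence yet: fall back to the prior
        return 0.5 * priors[model] + 0.5 * sum(seen) / len(seen)

    return max(priors, key=score)
```

With an empty memory the choice reduces to the static prior; as verified records accumulate near a task, observed quality starts to override it.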

CodeRouterBench

A stream-shaped benchmark for model selection across single-turn coding tasks and an out-of-distribution (OOD) agentic-programming stream.

9+1

Task dimensions

Nine in-distribution (ID) coding dimensions plus OOD agentic programming.

+15.3%

Router gain

Relative lift from giving the router performance statistics.

Regret

Evaluation

Routers are compared by cumulative regret over realistic streams.
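
A minimal sketch of the metric, assuming per-task regret is the gap between the best score any candidate model achieves on that task and the score of the model the router picked. The function name and input layout are hypothetical; the page does not give the exact definition used in CodeRouterBench.

```python
from itertools import accumulate


def cumulative_regret(scores, choices):
    """Running regret over a task stream.

    scores:  one {model: score} dict per task, in stream order.
    choices: the model the router picked for each task.
    Returns a list whose i-th entry is the total regret after task i.
    """
    per_task = [max(s.values()) - s[c] for s, c in zip(scores, choices)]
    return list(accumulate(per_task))


# Toy stream: the router is right on task 1, wrong on task 2, tied on task 3.
stream = [
    {"A": 1.0, "B": 0.0},
    {"A": 0.0, "B": 1.0},
    {"A": 0.5, "B": 0.5},
]
curve = cumulative_regret(stream, ["A", "A", "B"])
```

The final entry of the curve corresponds to a single CumReg figure; the whole curve shows how regret grows as the stream proceeds. An oracle that always picks the best model has zero regret by construction.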

Figure: CodeRouterBench construction.

Main experiment

Routing Results

Grouped by component-configuration taxonomy. The left block measures in-distribution coding tasks; the right block measures real-world OOD agentic programming.

ID AvgPerf 49.98%

ACRouter leads all non-oracle routers on 2,919 in-distribution tasks.

OOD CumReg 17.0

The agentic router keeps the lowest regret on 176 held-out agentic-programming tasks.

Cost signal Perf / USD

Efficient baselines can be cheap, but their OOD quality drops sharply.

Routing results across in-distribution and OOD tests. CumReg is cumulative regret across all tasks.
| Taxonomy | Router | ID AvgPerf % | ID CumReg | ID Perf / USD | OOD AvgPerf % | OOD CumReg | OOD Perf / USD |
|---|---|---|---|---|---|---|---|
| Upper bound | Oracle | 57.00 | 0 | 8.20 | 75.89 | 0 | 2.32 |
| Agent-as-a-Router | ACRouter (ours) | 49.98 | 205.5 | 3.79 | 62.50 | 17.0 | 1.18 |
| Dynamic: Online Bandit | LinTS | 46.48 | 307.4 | 4.49 | 46.43 | 35.9 | 0.75 |
| Dynamic: Online Bandit | LinUCB | 46.84 | 296.9 | 4.38 | 49.82 | 31.1 | 0.96 |
| Static: Heuristic | DimensionBest | 47.50 | 277.4 | 3.69 | -- | -- | -- |
| Static: Heuristic | kNN Retrieval | 47.18 | 286.7 | 6.07 | 14.29 | 66.7 | 1.45 |
| Static: Trained Policy | LogReg | 47.26 | 284.4 | 6.27 | 19.64 | 61.8 | 1.17 |
| | RouteLLM-BERT | 47.22 | 285.5 | 6.22 | 21.43 | 59.4 | 1.30 |
| | TF-IDF+MLP | 46.97 | 292.8 | 6.11 | 13.39 | 67.9 | 1.17 |
| | Qwen3.5-0.8B-Finetuned | 46.41 | 309.1 | 6.82 | 55.36 | 27.2 | 0.74 |
| | RouteLLM-MF | 46.16 | 316.5 | 6.19 | 8.93 | 72.7 | 0.94 |
| Single-Model Baselines | Always-Opus 4.6 | 43.83 | 387.1 | 1.29 | 57.14 | 26.7 | 0.64 |
| | Always-Kimi-K2.5 | 36.66 | 593.3 | 12.62 | 18.75 | 62.3 | 1.22 |
| | Always-Qwen3.5-Plus | 37.16 | 580.2 | 2.05 | 2.68 | 80.1 | 0.19 |
| | Random | 38.75 | 533.6 | 2.48 | 31.25 | 50.4 | 0.85 |

Bold green values mark the strongest non-oracle quality/regret result. Gold values mark the strongest cost-efficiency result. DimensionBest is not applicable to OOD because unseen agentic-programming tasks have no predefined dimension-to-model mapping.

Evidence views

Why routing matters beyond the table

Three compact views show model complementarity, regret over task streams, and the cost-performance frontier behind the headline numbers.

Complementarity: **Performance, cost, and efficiency analysis across coding dimensions.**
Performance varies by coding dimension, while cost and AvgPerf per dollar show why a single premium model is not always the right deployment choice.

Regret over streams: **Cumulative regret across in-distribution and OOD task streams.**
Regret for static routers grows faster on in-distribution tasks and rises sharply on OOD tasks, while ACRouter keeps lower regret as verified memory accumulates.

Trade-off frontier: **Cost-performance Pareto frontier analysis.**
ACRouter extends the deployable frontier upward in both ID and OOD, with higher AvgPerf and lower cost than always choosing a premium model.
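
To make the idea of a frontier concrete, the sketch below computes the non-dominated set from the in-distribution columns of the results table, treating AvgPerf % as the quality axis and Perf / USD as the efficiency axis. Using Perf / USD in place of raw cost is an assumption made here because the page does not list per-router cost, so the result need not match the frontier figure. `pareto_frontier` is a hypothetical helper name.

```python
# (router, ID AvgPerf %, ID Perf / USD), copied from the results table.
ROUTERS = [
    ("ACRouter", 49.98, 3.79),
    ("LinTS", 46.48, 4.49),
    ("LinUCB", 46.84, 4.38),
    ("DimensionBest", 47.50, 3.69),
    ("kNN Retrieval", 47.18, 6.07),
    ("LogReg", 47.26, 6.27),
    ("RouteLLM-BERT", 47.22, 6.22),
    ("TF-IDF+MLP", 46.97, 6.11),
    ("Qwen3.5-0.8B-Finetuned", 46.41, 6.82),
    ("RouteLLM-MF", 46.16, 6.19),
    ("Always-Opus 4.6", 43.83, 1.29),
    ("Always-Kimi-K2.5", 36.66, 12.62),
    ("Always-Qwen3.5-Plus", 37.16, 2.05),
    ("Random", 38.75, 2.48),
]


def pareto_frontier(points):
    """Names of points not dominated on (quality, efficiency).

    Higher is better on both axes. A point is dominated if another point
    is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, perf, eff in points:
        dominated = any(
            p2 >= perf and e2 >= eff and (p2 > perf or e2 > eff)
            for _, p2, e2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(ROUTERS)
```

On these two axes, ACRouter sits at the high-quality end of the non-dominated set and Always-Opus 4.6 is dominated, which is consistent with the claim above for the in-distribution block.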

Agentic Artifacts

An ARA-style entry is ready for ACRouter, benchmark splits, score matrices, verifier traces, and the held-out agentic stream.

Manifest: agentic-artifacts.json

The best router keeps learning from execution.

Updated OOD results sharpen the split: ACRouter keeps the lowest regret among routers, while a standalone GPT-5.4 backend resolves 75.00% on the same 176 agentic-programming tasks.

Citation

@article{agent2026zhou,
  title         = {Agent-as-a-Router: Agentic Model Routing for Coding Tasks},
  author        = {Pengfei Zhou and Zhiwei Tang and Yixing Ma and Jiasheng Tang and Yizeng Han and Zhenglin Wan and Fanqing Meng and Wei Wang and Bohan Zhuang and Wangbo Zhao and Yang You},
  journal       = {arXiv preprint arXiv:2606.22902},
  year          = {2026},
  archivePrefix = {arXiv},
  eprint        = {2606.22902},
  url           = {https://arxiv.org/abs/2606.22902},
}
