we ran Llama 3.2 3B locally. unmodified. no fine-tuning. no fancy framework. just the raw model + Keiro research API.

~85% on SimpleQA. 4,326 questions.

without Keiro? 4%.

PPLX Sonar Pro: 85.8%.

ROMA: 93.9%, on a 357B model.

OpenDeepSearch: 88.3%, on DeepSeek-R1 671B.

SGR: 86.1%, on GPT-4.1-mini with Tavily (SGR also skipped questions).
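
why skipped questions matter: SimpleQA grades each answer as correct, incorrect, or not attempted, so a score can be computed over all questions or only over the attempted ones. a minimal sketch of the difference is below. the grade list is made-up illustration data, not anyone's real results, and which denominator each number quoted here uses is not something this sketch claims to know.

```python
from collections import Counter


def simpleqa_scores(grades):
    """Return accuracy over all questions and over attempted questions only.

    `grades` is a list of strings: "correct", "incorrect" or "not_attempted".
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall": counts["correct"] / total if total else 0.0,
        "given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


# Illustration only: 6 correct, 2 incorrect, 2 skipped (made-up numbers).
example_grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
print(simpleqa_scores(example_grades))
```

skipping hard questions leaves "overall" untouched but lifts "given_attempted", so two systems are only comparable when both scores use the same denominator.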

we're sitting right next to all of them. with a 3B model. running on your laptop.

DeepSeek-R1 671B with no search? 30.1%. Qwen-2.5 72B? 9.1%.

no LangChain. no research framework. just a small script, a small model, and a good API.
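
roughly what "a small script" can look like, as a sketch and not the actual benchmark script (that one is at the link further down). `search` and `generate` are hypothetical callables you would write yourself: one wrapping the research API, one wrapping the local 3B model. the prompt wording and `max_snippets` are assumptions too.

```python
from typing import Callable, List


def build_prompt(question: str, snippets: List[str]) -> str:
    """Put numbered search snippets in front of the question."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the sources below.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    search: Callable[[str], List[str]],  # hypothetical: wraps the research API
    generate: Callable[[str], str],      # hypothetical: wraps the local model
    max_snippets: int = 5,               # assumed cap to keep the prompt short
) -> str:
    """Search first, then let the small model answer from what was found."""
    snippets = search(question)[:max_snippets]
    return generate(build_prompt(question, snippets)).strip()
```

the model never has to remember the fact. it only has to read the retrieved text and pull the answer out, which is a much easier job for 3B parameters.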

cost per query: $0.005.
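
for scale, here is the arithmetic for the full benchmark, assuming exactly one API query per question (an assumption, a real run may retry or issue more than one query):

```python
from decimal import Decimal

QUESTIONS = 4326                   # SimpleQA question count from the post
COST_PER_QUERY = Decimal("0.005")  # dollars per query, from the post


def total_cost(n_queries: int, cost_per_query: Decimal = COST_PER_QUERY) -> Decimal:
    """Total API spend in dollars, assuming one query per question."""
    return n_queries * cost_per_query


print(total_cost(QUESTIONS))  # 21.630
```

so under that assumption the whole 4,326-question run comes to about $21.63 in API calls.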

Anyone with a decent laptop can run a 3B model, write a small script, plug in the Keiro research API, and get results that compete with systems backed by hundreds of billions of parameters and serious infrastructure spend.

Benchmark script link + results --> https://lnkd.in/gdZJtGf9

Keiro research -- https://lnkd.in/g_QaCygZ

