llm-eval — a pluggable A/B harness for benchmarking LLMs across providers via API calls.
Inspired by (and built to test) this paper; its finding still mostly holds.
install:
pip install -e .
quickstart:
llm-eval list

llm-eval run --provider openai --model gpt-4o-mini \
  --benchmarks gsm8k,arc_challenge,nameindex --n 20

llm-eval ab \
  -v gpt4o=openai/gpt-4o-mini \
  -v sonnet=anthropic/claude-sonnet-4-6 \
  --benchmarks gsm8k,mmlu_pro,math,middlematch --n 50 --trials 3 \
  --compare gpt4o,sonnet \
  --out-xlsx results/ab.xlsx
benchmarks:
openbookqa, arc_easy, arc_challenge, mmlu_pro, gsm8k, math, nameindex, middlematch
providers: openai, anthropic, google, openrouter
output per run:
results/<run_id>.jsonl — every API call
results/<run_id>.xlsx — Raw, Summary, Per_Item, McNemar, Config sheets
cache/calls.jsonl — keyed dedup cache; reruns are free
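
The per-run JSONL can be post-processed directly. A minimal sketch, assuming each record carries "benchmark" and "correct" fields (the field names are a guess; check a real results file first):

import json
from collections import defaultdict

# Tally per-benchmark accuracy from a run log.
# Assumed fields: "benchmark" and "correct" — verify against an actual results/<run_id>.jsonl.
totals, hits = defaultdict(int), defaultdict(int)
with open("results/<run_id>.jsonl") as f:  # substitute a real run id
    for line in f:
        rec = json.loads(line)
        totals[rec["benchmark"]] += 1
        hits[rec["benchmark"]] += int(rec["correct"])

for name in sorted(totals):
    print(f"{name}: {hits[name] / totals[name]:.1%} ({hits[name]}/{totals[name]})")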
extending:
benchmarks — subclass Benchmark, @register_benchmark("name") (see the sketch after this list)
measurers — subclass Measurer, @register_measurer("name")
providers — subclass Provider, @register_provider("name")
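
A minimal sketch of a custom benchmark. Only Benchmark and @register_benchmark("name") come from the list above; the import paths and the items/score method names are assumptions about the base-class interface, so adjust them to the real one:

from llm_eval.registry import register_benchmark  # module path is a guess
from llm_eval.benchmarks import Benchmark         # base-class location is a guess

@register_benchmark("reverse_digits")
class ReverseDigits(Benchmark):
    """Toy benchmark: ask the model to reverse a digit string."""

    def items(self):
        # Yield prompt/answer pairs; the real harness may use a richer item type.
        for s in ["31415", "27182", "16180"]:
            yield {"prompt": f"Reverse the digits of {s}. Answer with digits only.",
                   "answer": s[::-1]}

    def score(self, item, response):
        # Exact-match scoring on the stripped response text.
        return float(response.strip() == item["answer"])

Measurers and providers presumably follow the same pattern: subclass the base class, register under a name, and that name becomes addressable from the CLI.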