Reza Ramji

CASE-Bench

A feasibility-constrained business-ideation benchmark for LLMs.

LLMs tend to fail at idea generation in two ways: bland textbook answers, and impressive-sounding proposals that fall apart on contact with reality. CASE-Bench measures the capability that matters in practice — generating solutions to a real business problem that are feasible, high-impact, and original — and separately reports how far a model's ideas explore beyond the known playbook, without letting exploration masquerade as quality.

Each of 16 cases pairs a constrained business problem with a curated set of reference solutions (each tagged with its core mechanism). A model proposes ideas without seeing the references; a fixed panel of judges rates every idea on feasibility, impact, and originality, and a portfolio-diversity diagnostic measures how varied each model's slate is.
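To make the rating step concrete, here is a minimal sketch of how panel ratings could be rolled up into a single score per idea and per case. It is an illustration, not the benchmark's actual code: the `Rating` type, the `idea_score` and `case_score` helpers, the 0–100 rating scale, and the equal weighting of judges and criteria are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean

# The three criteria named in the text.
CRITERIA = ("feasibility", "impact", "originality")


@dataclass
class Rating:
    """One judge's rating of one idea (hypothetical 0-100 scale)."""
    judge: str
    feasibility: float
    impact: float
    originality: float


def idea_score(ratings: list[Rating]) -> float:
    """Equal-weight mean over criteria, then over judges (an assumption)."""
    if not ratings:
        raise ValueError("an idea needs at least one rating")
    per_judge = [mean(getattr(r, c) for c in CRITERIA) for r in ratings]
    return mean(per_judge)


def case_score(ideas: list[list[Rating]]) -> float:
    """Mean idea score across every idea a model proposed for one case."""
    return mean(idea_score(ratings) for ratings in ideas)
```

Averaging per judge first keeps each judge's weight equal even if one of them rates more ideas than the others; a real scoring rule might weight the criteria differently.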

Dot-and-whisker chart: Idea Quality with 95% CIs — Sonnet 4.6 at 59.3 (CI 57.3–61.4) clearly above Haiku 4.5 at 52.9 (CI 50.6–55.4), non-overlapping; and Portfolio Diversity where the two models tie with overlapping CIs around 94.
Canonical run: 16 cases × 3 samples, judged by a disjoint Opus panel, with 95% bootstrap confidence intervals. Sonnet 4.6 leads on Idea Quality with non-overlapping CIs (Δ 6.3, paired-bootstrap p≈0.0005); the two models tie on the diversity diagnostic. Figure rendered directly from the benchmark's output, not drawn by hand.
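For readers unfamiliar with the statistics in the caption, the sketch below shows one common way to compute a percentile-bootstrap confidence interval and a paired-bootstrap p-value from per-case scores. It assumes resampling at the case level and a two-sided sign test on the resampled mean difference. The function names, the resample count, and the seed are hypothetical, and the benchmark's own procedure may differ.

```python
import random
from statistics import mean


def percentile(sorted_vals: list[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_boot))
    return percentile(means, alpha / 2), percentile(means, 1 - alpha / 2)


def paired_bootstrap(a, b, n_boot=10_000, seed=0):
    """Observed mean difference (a - b) and a two-sided bootstrap p-value.

    a[i] and b[i] must be the two models' scores on the same case, so each
    resample keeps the pairing intact.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    boot = [mean(rng.choices(diffs, k=n)) for _ in range(n_boot)]
    # Count resamples on each side of zero; the smaller tail drives the p-value.
    at_or_below = sum(d <= 0 for d in boot)
    at_or_above = sum(d >= 0 for d in boot)
    # The +1 terms keep the p-value strictly above zero.
    p = min(1.0, 2 * (min(at_or_below, at_or_above) + 1) / (n_boot + 1))
    return mean(diffs), p
```

Pairing by case matters here: both models answer the same 16 problems, so differencing within each case removes case difficulty from the comparison and gives a tighter test than comparing two independent intervals.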

What makes it trustworthy

Open source (MIT): github.com/Zed-Rez/casebench