I was curious enough to have Codex create a similar benchmark: https://github.co...

jcheng · 2025-10-05T19:38:48 1759693128

gpt-5 also got 100/100 for both CSV and JSON.

    uv run inspect eval evals/table_formats_eval.py@table_formats_csv --model openai/gpt-5 --limit 100
    uv run inspect eval evals/table_formats_eval.py@table_formats_json --model openai/gpt-5 --limit 100

xnx · 2025-10-05T23:27:26 1759706846

Cool tool. I tried a few different things to get to work with google/gemini-2.5-pro, but couldn't figure it out.

jcheng · 2025-10-06T01:27:06 1759714026

    uv add google-genai
    uv run scripts/run_benchmarks.py --models google/gemini-2.5-pro --formats markdown_kv --limit 100

And add GOOGLE_API_KEY=<your-key-here> to a file called .env in the repo root.

Unfortunately I started getting "quota exceeded" almost immediately, but it did give 6/6 correct answers before it crapped out.

xnx · 2025-10-06T02:07:34 1759716454

Thanks! That worked perfectly.

100 samples:

- gemini-2.5-pro: 100%

- gemini-2.5-flash: 97%

catlifeonmars · 2025-10-06T05:22:56 1759728176

Curious: how many iterations did you run of each benchmark and what was the variance?

solsane · 2025-10-07T22:14:38 1759875278

Did it make tool calls e.g. write code to read them?

jcheng · 2025-10-08T19:03:20 1759950200

No, I didn’t provide any tools

fragmede · 2025-10-06T05:16:18 1759727778

how about PNG?