uv add google-genai
uv run scripts/run_benchmarks.py --models google/gemini-2.5-pro --formats markdown_kv --limit 100
Unfortunately, I started getting "quota exceeded" errors almost immediately, but it did give 6/6 correct answers before it crapped out.
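If the quota errors are per-minute rate limits rather than a hard daily cap, retrying with exponential backoff might get a run like this through. Here's a rough sketch of the idea, not what the benchmark script actually does: `QuotaExceededError` and `call_with_backoff` are made-up names, and you'd need to swap in whatever exception the client library really raises.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the error the API client raises on quota limits."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on QuotaExceededError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # Wait base_delay * 2^attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

You'd wrap each model call, e.g. `call_with_backoff(lambda: ask_model(prompt))`, where `ask_model` is a placeholder for the real request function. It won't help if the daily quota is simply used up.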
100 samples:
- gemini-2.5-pro: 100%
- gemini-2.5-flash: 97%