9 Open-Source AI Models, 46 Scenarios: The 2026 Benchmark for Business Chart Generation
Which open-source AI model generates the most accurate business charts? We tested 9 models across 46 real-world analytics scenarios on an Ollama server — measuring accuracy, metric compliance, chart quality, and speed. No cherry-picking. No cloud APIs. Just raw results.
What Changed Since v1
In our v1 benchmark, we tested 12 models across 32 scenarios. This time:
- 9 open-source models — focused on the most relevant options available on Ollama
- 46 test scenarios (up from 32) — including decision analysis, comparison periods, derived dimensions, and multi-step follow-up queries
- New metric: Compliance — do models follow formatting rules, thresholds, and metric specifications?
- New metric: Chart Quality — not just "did it generate a chart?" but "is it a good chart?" rated 1–5
- Follow-up queries — can models handle multi-step conversations like "now change this to a pie chart"?
Test Methodology
The 9 Models
All models tested on an Ollama server, representing the most popular open-source options for business analytics:
| # | Model | Parameters | Why Included |
|---|---|---|---|
| 1 | GLM-5.1 | — | Latest Zhipu model, strong reasoning |
| 2 | Gemma4 31B | 31B | Google's latest, largest Gemma |
| 3 | DeepSeek-V4-Pro | — | DeepSeek's flagship reasoning model |
| 4 | Kimi-K2.6 | — | Moonshot AI's latest, strong multilingual |
| 5 | Qwen-3.5 | — | Alibaba's latest, balanced performance |
| 6 | GPT-OSS 120B | 120B | Large open-source model |
| 7 | MiniMax-M3 | — | MiniMax's multimodal model |
| 8 | Gemini-3-Flash | — | Google's fast model |
| 9 | Mistral-Large-3 | — | Mistral's flagship, fastest response |
The 46 Scenarios — 17 Categories
| Category | Count | What It Tests |
|---|---|---|
| Basic Charts (bar, line, pie, area) | 8 | Core chart generation |
| Decision Analysis | 6 | Heatmaps, scatter, funnel, radar — complex analytical queries |
| Follow-up Queries | 6 | Multi-step: modify style, rebuild chart type, keep filters |
| Comparison Period | 1 | Year-over-year comparisons |
| Derived Dimensions | 1 | Age buckets from dates |
| Multi-Measure | 1 | Multiple metrics on one chart |
| Performance Matrix | 1 | Quadrant/scatter with dimensions |
| Advanced Charts (sankey, treemap, sunburst, etc.) | 22 | Specialized chart types |
Evaluation Criteria (4 Dimensions)
- Accuracy (40%) — Correct chart type + correct data column mapping
- Metric Compliance (25%) — Did the model follow thresholds, color rules, and formatting directives?
- Chart Quality (20%) — Is the resulting chart clear, well-structured, and presentation-ready? (1–5 scale)
- Speed (15%) — Response time on the Ollama server
Results: Overall Accuracy
The headline finding: only 2 models achieved 100% accuracy across all 46 scenarios.
| # | Model | Correct | Partial | Wrong | Accuracy |
|---|---|---|---|---|---|
| 🥇 | GLM-5.1 | 46/46 | 0 | 0 | 100% |
| 🥇 | Gemma4 31B | 46/46 | 0 | 0 | 100% |
| 3 | DeepSeek-V4-Pro | 45/46 | 0 | 1 | 97.8% |
| 4 | Kimi-K2.6 | 44/46 | 1 | 1 | 95.7% |
| 5 | Qwen-3.5 | 43/46 | 2 | 1 | 93.5% |
| 6 | GPT-OSS 120B | 40/46 | 3 | 3 | 87.0% |
| 7 | MiniMax-M3 | 38/46 | 3 | 5 | 82.6% |
| 8 | Gemini-3-Flash | 34/46 | 2 | 10 | 73.9% |
| 9 | Mistral-Large-3 | 33/46 | 2 | 11 | 71.7% |
Results: Metric Compliance
Compliance measures whether the model follows specific metric rules — thresholds, filters, color directives, and formatting specifications.
8 out of 9 models achieved 100% compliance. Only MiniMax-M3 dropped slightly to 99.5%.
This is a significant finding: most open-source models now handle structured output formatting reliably. The challenge is no longer "can it follow rules?" but "can it generate the right chart type?"
Results: Chart Quality
All 9 models scored 5.0/5 on chart quality when they produced correct output. The quality gap between models has narrowed dramatically compared to v1.
Key insight: When a model gets the chart type and data mapping right, the output quality is consistently excellent. The differentiation is entirely in accuracy, not quality.
Results: Speed
| # | Model | Avg. Response | Accuracy | Speed-Accuracy Ratio |
|---|---|---|---|---|
| 1 | Mistral-Large-3 | 1.1s | 71.7% | Fast but inaccurate |
| 2 | GLM-5.1 | 5.6s | 100% | Best balance |
| 3 | Kimi-K2.6 | 6.0s | 95.7% | Strong contender |
| 4 | Gemini-3-Flash | 6.6s | 73.9% | Fast but inaccurate |
| 5 | Qwen-3.5 | 10.0s | 93.5% | Good accuracy |
| 6 | Gemma4 31B | 10.2s | 100% | Accurate but slower |
| 7 | DeepSeek-V4-Pro | 11.5s | 97.8% | Very accurate |
| 8 | GPT-OSS 120B | 11.7s | 87.0% | Heavy and mediocre |
| 9 | MiniMax-M3 | 15.6s | 82.6% | Slow and inaccurate |
Category Deep Dive: Decision Analysis
Decision analysis was the hardest category — involving heatmaps, scatter plots, funnels, and radar charts. We tested 6 scenarios per model.
| Model | Correct | Accuracy | Key Issue |
|---|---|---|---|
| GLM-5.1 | 6/6 | 100% | None |
| Gemma4 31B | 6/6 | 100% | None |
| DeepSeek-V4-Pro | 6/6 | 100% | None |
| Kimi-K2.6 | 6/6 | 100% | None |
| Qwen-3.5 | 5/6 | 83.3% | Funnel misidentified as waterfall |
| GPT-OSS 120B | 4/6 | 66.7% | Missing metric specs |
| MiniMax-M3 | 4/6 | 66.7% | Missing metric specs |
| Gemini-3-Flash | 1/6 | 16.7% | Failed to generate metric specs in 5/6 |
| Mistral-Large-3 | 0/6 | 0% | Complete metric spec failure |
Mistral-Large-3 completely failed the decision category — it couldn't generate the required metric specification headers for any of the 6 decision prompts. Despite being the fastest model (1.1s), it produced zero usable charts in this category.
Gemini-3-Flash had a similar pattern: it failed 5 out of 6 decision prompts with the same error — missing_metric_spec_header.
Category Deep Dive: Follow-up Queries
Can models handle conversation? We tested multi-step interactions:
- "Show me revenue by region as a bar chart" → bar chart generated
- "Now change the style" → style modified
- "Now convert this to a pie chart, keep the filter" → chart type changed
- "Now change this line chart to area" → chart type changed
First response accuracy: All 9 models generated correct first responses.
Follow-up accuracy:
| Model | Follow-up Correct | Key Issue |
|---|---|---|
| GLM-5.1 | 3/3 | Perfect |
| Gemma4 31B | 3/3 | Perfect |
| Kimi-K2.6 | 3/3 | Perfect |
| Qwen-3.5 | 3/3 | Perfect |
| MiniMax-M3 | 3/3 | Perfect |
| DeepSeek-V4-Pro | 2/3 | Failed rebuild line→area |
| GPT-OSS 120B | 3/3 | Perfect |
| Gemini-3-Flash | 2/3 | Failed rebuild bar→pie |
| Mistral-Large-3 | 1/3 | Failed 2/3 follow-ups |
Key Findings
1. GLM-5.1 Is the Surprise Leader
GLM-5.1 (Zhipu AI) achieved 100% accuracy, 100% compliance, 100% chart quality, and 5.6s average response time. It handled every single category perfectly — including decision analysis and follow-up queries.
This is remarkable for a model that wasn't on most "best of" lists for analytics tasks.
2. Compliance Is No Longer the Differentiator
In v1, compliance was a key differentiator. In v2, 8 out of 9 models achieved 100% compliance. The challenge has shifted from "can it follow rules?" to "can it understand what chart to generate?" — particularly for complex analytical queries.
3. Decision Analysis Exposes a Critical Gap
The decision category (heatmaps, funnels, radar, scatter with analytical intent) revealed the biggest model differentiation:
- 4 models scored 100%
- 2 models scored 66.7%
- 1 model scored 16.7%
- 1 model scored 0%
This 100-point spread means model selection matters enormously for analytical use cases.
4. Speed vs. Accuracy Trade-off
Mistral-Large-3 is 5× faster than GLM-5.1 but 28 percentage points less accurate. For interactive dashboards where users need immediate feedback, this trade-off matters. But a fast wrong chart is worse than a slow correct one.
5. The Gap Between Top and Bottom Is Growing
In v1, the accuracy gap between best and worst was ~25 points (87.5% vs ~62%). In v2, it's 28.3 points (100% vs 71.7%). While top models improved significantly, bottom models stagnated on complex tasks.
Recommended Models by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| 🏆 General analytics | GLM-5.1 | 100% accuracy, 5.6s speed, best all-rounder |
| Maximum accuracy (no speed concern) | Gemma4 31B | 100% accuracy, 10.2s speed |
| Fast interactive use | Kimi-K2.6 | 95.7% accuracy at 6.0s — best speed-accuracy ratio after GLM |
| Decision analysis & complex queries | GLM-5.1 or Gemma4 31B | Only 4 models scored 100% on decision category |
| Budget-conscious deployment | Qwen-3.5 | 93.5% accuracy, widely available |
| ❌ Avoid for analytics | Mistral-Large-3 | 0% on decision category, 71.7% overall |
What This Means for Business Analytics
The 2026 benchmark reveals three shifts from v1:
- Compliance is solved. Almost all models follow formatting rules. The battle has moved to understanding complex analytical intent.
- Decision analysis is the new differentiator. If you need heatmaps, funnels, or radar charts, model selection is critical. 4 models handle this perfectly; 2 fail completely.
- GLM-5.1 is the new benchmark leader. Previously, Llama 3.1 8B led. The landscape has shifted.
Try It Yourself
We built LivChart to make local AI dashboards accessible. You can try different models and see which works best for your data.
- Try the free playground — no sign-up, no installation
- Set up Ollama + LivChart in 5 minutes
- Try the LivChart-optimized model on Ollama
What model do you use for analytics? We'd love to hear your experience — share on X or discuss on r/LocalLLaMA.
This is v2 of the LivChart AI Benchmark. We'll update as new models are released. See v1 results here.