9 Open-Source AI Models, 46 Scenarios: The 2026 Benchmark for Business Chart Generation

Which open-source AI model generates the most accurate business charts? We tested 9 models across 46 real-world analytics scenarios on an Ollama server — measuring accuracy, metric compliance, chart quality, and speed. No cherry-picking. No cloud APIs. Just raw results.

What Changed Since v1

In our v1 benchmark, we tested 12 models across 32 scenarios. This time:

  • 9 open-source models — focused on the most relevant options available on Ollama
  • 46 test scenarios (up from 32) — including decision analysis, comparison periods, derived dimensions, and multi-step follow-up queries
  • New metric: Compliance — do models follow formatting rules, thresholds, and metric specifications?
  • New metric: Chart Quality — not just "did it generate a chart?" but "is it a good chart?" rated 1–5
  • Follow-up queries — can models handle multi-step conversations like "now change this to a pie chart"?

Test Methodology

The 9 Models

All models tested on an Ollama server, representing the most popular open-source options for business analytics:

# Model Parameters Why Included
1 GLM-5.1 Latest Zhipu model, strong reasoning
2 Gemma4 31B 31B Google's latest, largest Gemma
3 DeepSeek-V4-Pro DeepSeek's flagship reasoning model
4 Kimi-K2.6 Moonshot AI's latest, strong multilingual
5 Qwen-3.5 Alibaba's latest, balanced performance
6 GPT-OSS 120B 120B Large open-source model
7 MiniMax-M3 MiniMax's multimodal model
8 Gemini-3-Flash Google's fast model
9 Mistral-Large-3 Mistral's flagship, fastest response

The 46 Scenarios — 17 Categories

Category Count What It Tests
Basic Charts (bar, line, pie, area) 8 Core chart generation
Decision Analysis 6 Heatmaps, scatter, funnel, radar — complex analytical queries
Follow-up Queries 6 Multi-step: modify style, rebuild chart type, keep filters
Comparison Period 1 Year-over-year comparisons
Derived Dimensions 1 Age buckets from dates
Multi-Measure 1 Multiple metrics on one chart
Performance Matrix 1 Quadrant/scatter with dimensions
Advanced Charts (sankey, treemap, sunburst, etc.) 22 Specialized chart types

Evaluation Criteria (4 Dimensions)

  1. Accuracy (40%) — Correct chart type + correct data column mapping
  2. Metric Compliance (25%) — Did the model follow thresholds, color rules, and formatting directives?
  3. Chart Quality (20%) — Is the resulting chart clear, well-structured, and presentation-ready? (1–5 scale)
  4. Speed (15%) — Response time on the Ollama server

Results: Overall Accuracy

The headline finding: only 2 models achieved 100% accuracy across all 46 scenarios.

# Model Correct Partial Wrong Accuracy
🥇 GLM-5.1 46/46 0 0 100%
🥇 Gemma4 31B 46/46 0 0 100%
3 DeepSeek-V4-Pro 45/46 0 1 97.8%
4 Kimi-K2.6 44/46 1 1 95.7%
5 Qwen-3.5 43/46 2 1 93.5%
6 GPT-OSS 120B 40/46 3 3 87.0%
7 MiniMax-M3 38/46 3 5 82.6%
8 Gemini-3-Flash 34/46 2 10 73.9%
9 Mistral-Large-3 33/46 2 11 71.7%

Results: Metric Compliance

Compliance measures whether the model follows specific metric rules — thresholds, filters, color directives, and formatting specifications.

8 out of 9 models achieved 100% compliance. Only MiniMax-M3 dropped slightly to 99.5%.

This is a significant finding: most open-source models now handle structured output formatting reliably. The challenge is no longer "can it follow rules?" but "can it generate the right chart type?"

Results: Chart Quality

All 9 models scored 5.0/5 on chart quality when they produced correct output. The quality gap between models has narrowed dramatically compared to v1.

Key insight: When a model gets the chart type and data mapping right, the output quality is consistently excellent. The differentiation is entirely in accuracy, not quality.

Results: Speed

# Model Avg. Response Accuracy Speed-Accuracy Ratio
1 Mistral-Large-3 1.1s 71.7% Fast but inaccurate
2 GLM-5.1 5.6s 100% Best balance
3 Kimi-K2.6 6.0s 95.7% Strong contender
4 Gemini-3-Flash 6.6s 73.9% Fast but inaccurate
5 Qwen-3.5 10.0s 93.5% Good accuracy
6 Gemma4 31B 10.2s 100% Accurate but slower
7 DeepSeek-V4-Pro 11.5s 97.8% Very accurate
8 GPT-OSS 120B 11.7s 87.0% Heavy and mediocre
9 MiniMax-M3 15.6s 82.6% Slow and inaccurate

Category Deep Dive: Decision Analysis

Decision analysis was the hardest category — involving heatmaps, scatter plots, funnels, and radar charts. We tested 6 scenarios per model.

Model Correct Accuracy Key Issue
GLM-5.1 6/6 100% None
Gemma4 31B 6/6 100% None
DeepSeek-V4-Pro 6/6 100% None
Kimi-K2.6 6/6 100% None
Qwen-3.5 5/6 83.3% Funnel misidentified as waterfall
GPT-OSS 120B 4/6 66.7% Missing metric specs
MiniMax-M3 4/6 66.7% Missing metric specs
Gemini-3-Flash 1/6 16.7% Failed to generate metric specs in 5/6
Mistral-Large-3 0/6 0% Complete metric spec failure

Mistral-Large-3 completely failed the decision category — it couldn't generate the required metric specification headers for any of the 6 decision prompts. Despite being the fastest model (1.1s), it produced zero usable charts in this category.

Gemini-3-Flash had a similar pattern: it failed 5 out of 6 decision prompts with the same error — missing_metric_spec_header.

Category Deep Dive: Follow-up Queries

Can models handle conversation? We tested multi-step interactions:

  1. "Show me revenue by region as a bar chart" → bar chart generated
  2. "Now change the style" → style modified
  3. "Now convert this to a pie chart, keep the filter" → chart type changed
  4. "Now change this line chart to area" → chart type changed

First response accuracy: All 9 models generated correct first responses.

Follow-up accuracy:

Model Follow-up Correct Key Issue
GLM-5.1 3/3 Perfect
Gemma4 31B 3/3 Perfect
Kimi-K2.6 3/3 Perfect
Qwen-3.5 3/3 Perfect
MiniMax-M3 3/3 Perfect
DeepSeek-V4-Pro 2/3 Failed rebuild line→area
GPT-OSS 120B 3/3 Perfect
Gemini-3-Flash 2/3 Failed rebuild bar→pie
Mistral-Large-3 1/3 Failed 2/3 follow-ups

Key Findings

1. GLM-5.1 Is the Surprise Leader

GLM-5.1 (Zhipu AI) achieved 100% accuracy, 100% compliance, 100% chart quality, and 5.6s average response time. It handled every single category perfectly — including decision analysis and follow-up queries.

This is remarkable for a model that wasn't on most "best of" lists for analytics tasks.

2. Compliance Is No Longer the Differentiator

In v1, compliance was a key differentiator. In v2, 8 out of 9 models achieved 100% compliance. The challenge has shifted from "can it follow rules?" to "can it understand what chart to generate?" — particularly for complex analytical queries.

3. Decision Analysis Exposes a Critical Gap

The decision category (heatmaps, funnels, radar, scatter with analytical intent) revealed the biggest model differentiation:

  • 4 models scored 100%
  • 2 models scored 66.7%
  • 1 model scored 16.7%
  • 1 model scored 0%

This 100-point spread means model selection matters enormously for analytical use cases.

4. Speed vs. Accuracy Trade-off

Mistral-Large-3 is 5× faster than GLM-5.1 but 28 percentage points less accurate. For interactive dashboards where users need immediate feedback, this trade-off matters. But a fast wrong chart is worse than a slow correct one.

5. The Gap Between Top and Bottom Is Growing

In v1, the accuracy gap between best and worst was ~25 points (87.5% vs ~62%). In v2, it's 28.3 points (100% vs 71.7%). While top models improved significantly, bottom models stagnated on complex tasks.

Recommended Models by Use Case

Use Case Recommended Model Why
🏆 General analytics GLM-5.1 100% accuracy, 5.6s speed, best all-rounder
Maximum accuracy (no speed concern) Gemma4 31B 100% accuracy, 10.2s speed
Fast interactive use Kimi-K2.6 95.7% accuracy at 6.0s — best speed-accuracy ratio after GLM
Decision analysis & complex queries GLM-5.1 or Gemma4 31B Only 4 models scored 100% on decision category
Budget-conscious deployment Qwen-3.5 93.5% accuracy, widely available
❌ Avoid for analytics Mistral-Large-3 0% on decision category, 71.7% overall

What This Means for Business Analytics

The 2026 benchmark reveals three shifts from v1:

  1. Compliance is solved. Almost all models follow formatting rules. The battle has moved to understanding complex analytical intent.
  2. Decision analysis is the new differentiator. If you need heatmaps, funnels, or radar charts, model selection is critical. 4 models handle this perfectly; 2 fail completely.
  3. GLM-5.1 is the new benchmark leader. Previously, Llama 3.1 8B led. The landscape has shifted.

Try It Yourself

We built LivChart to make local AI dashboards accessible. You can try different models and see which works best for your data.

What model do you use for analytics? We'd love to hear your experience — share on X or discuss on r/LocalLLaMA.

This is v2 of the LivChart AI Benchmark. We'll update as new models are released. See v1 results here.