Glossary
LLM-Score
A 0–100 score that tracks how often and how accurately LLMs talk about your brand.
Definition
LLM-Score is a composite 0–100 metric from Getllmspy. It measures how often major models (ChatGPT, Gemini, Perplexity, Claude, YandexGPT, Alice, GigaChat, DeepSeek) mention your brand, and whether those mentions are accurate and usable. Unlike SEO rank tracking, this score reflects what users actually read in generated answers.
Scale benchmarks
| Range | How to read it |
|---|---|
| 0–20 | Brand is effectively invisible in LLM answers |
| 21–45 | Occasional mentions, often with factual errors |
| 46–65 | Steady presence in a subset of models |
| 66–80 | Strong visibility with rare hallucinations |
| 81–100 | Category-leading presence |
Median LLM-Score across the Getllmspy dataset is 38 (April 2026 snapshot).
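The benchmark table can be read as a simple lookup. A minimal sketch, assuming integer headline scores; the function name `read_band` and the shortened labels are illustrative and not part of Getllmspy:

```python
# Upper bounds and readings copied from the benchmark table above.
BANDS = [
    (20, "effectively invisible"),
    (45, "occasional mentions, often with factual errors"),
    (65, "steady presence in a subset of models"),
    (80, "strong visibility with rare hallucinations"),
    (100, "category-leading presence"),
]


def read_band(score: int) -> str:
    """Return the benchmark reading for an integer 0-100 headline score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: 100 is the last upper bound")


# The dataset median of 38 falls in the 21-45 band.
print(read_band(38))
```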
Worked score example
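Assume illustrative weights α=0.40, β=0.35, γ=0.20, δ=0.15. These sum to 1.10, so they are not normalized to 1 as the schematic formula assumes; treat the arithmetic as a demonstration only. For one monthly slice: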
- Mention signal M = 0.68
- Correctness signal C = 0.74
- Sentiment signal S = 0.57
- Hallucination penalty H = 0.19
LLM-Score ≈ 100 × (0.40×0.68 + 0.35×0.74 + 0.20×0.57 − 0.15×0.19)
≈ 100 × (0.272 + 0.259 + 0.114 − 0.0285)
≈ 61.65
Rounded headline score: 62.
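The same arithmetic as a small function. This is a sketch of the schematic formula only, with the example weights as defaults; the name `llm_score` and the range check are assumptions, and production weights and normalization may differ:

```python
def llm_score(m, c, s, h, alpha=0.40, beta=0.35, gamma=0.20, delta=0.15):
    """Schematic score: 100 * (alpha*M + beta*C + gamma*S - delta*H)."""
    for name, value in {"M": m, "C": c, "S": s, "H": h}.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 100 * (alpha * m + beta * c + gamma * s - delta * h)


# Signals from the monthly slice in the worked example.
raw = llm_score(0.68, 0.74, 0.57, 0.19)
print(f"{raw:.2f}")  # 61.65
print(round(raw))    # 62
```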
Practical reading pattern
| Scenario | Score move | What to do |
|---|---|---|
| Content update shipped, score +6 | Broad improvement | Keep the prompt pack stable and confirm in the next run |
| Score flat, contradictions up | Hidden quality issue | Fix source pages before adding more pages |
| Score up only in one model | Narrow gain | Expand validation to other high-traffic models |
Mini chart (8 weekly runs):
W1 54 ▇▇▇▇▇▇
W2 55 ▇▇▇▇▇▇
W3 57 ▇▇▇▇▇▇▇
W4 56 ▇▇▇▇▇▇
W5 60 ▇▇▇▇▇▇▇▇
W6 61 ▇▇▇▇▇▇▇▇
W7 63 ▇▇▇▇▇▇▇▇▇
W8 62 ▇▇▇▇▇▇▇▇
The series climbs 8 points over eight weeks, with 1-point dips at W4 and W8. This pattern suggests real progress with normal weekly noise.
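The week-over-week moves behind the chart can be checked in a few lines. A minimal sketch using the eight values shown above; the variable names are illustrative:

```python
weekly = [54, 55, 57, 56, 60, 61, 63, 62]  # W1..W8 from the chart

# Week-over-week deltas: each run minus the previous run.
deltas = [later - earlier for earlier, later in zip(weekly, weekly[1:])]

net_change = weekly[-1] - weekly[0]
largest_drop = min(deltas)

print(deltas)        # [1, 2, -1, 4, 1, 2, -1]
print(net_change)    # 8
print(largest_drop)  # -1
```

No single drop exceeds 1 point, while the net change is 8 points.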
How it's computed
End-to-end, the pipeline looks like this:
- Prompt pack — a fixed set of category questions (your brand is not pasted into the literal prompt text), see prompt pack.
- Model fan-out — each model in coverage runs the same scripted steps; answers are stored as dated snapshots.
- Per-answer signals — was the brand mentioned appropriately, how factually consistent is the narrative, what is the sentiment, and is there a penalty for clear contradictions.
- Normalization — signals are scaled to 0–1 per model and prompt, weighted (traffic + regional relevance), then aggregated into one headline number for the brand and topic.
- Reporting — the UI shows the headline score, a per-model table, and quotes; you should always read the number next to the underlying answers.
Weights can evolve by product version, but the structure stays the same: presence + correctness + sentiment, with penalties for contradictions.
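The normalization and aggregation steps can be sketched as a weighted average over models. Everything here is an assumption for illustration: the `ModelSignals` type, the single `weight` field standing in for traffic and regional relevance, the signal weights, and the input numbers. The production pipeline may aggregate differently:

```python
from dataclasses import dataclass


@dataclass
class ModelSignals:
    """Per-model signals, each already scaled to the 0-1 range."""
    mention: float
    correctness: float
    sentiment: float
    hallucination: float
    weight: float  # hypothetical blend of traffic and regional relevance


def headline_score(models, alpha=0.40, beta=0.35, gamma=0.20, delta=0.15):
    """Blend per-model scores by model weight, then scale to 0-100."""
    total_weight = sum(m.weight for m in models.values())
    if total_weight <= 0:
        raise ValueError("at least one model needs a positive weight")
    blended = 0.0
    for m in models.values():
        per_model = (alpha * m.mention + beta * m.correctness
                     + gamma * m.sentiment - delta * m.hallucination)
        blended += (m.weight / total_weight) * per_model
    return 100 * blended


# Made-up inputs for two placeholder models; not real measurements.
models = {
    "model_a": ModelSignals(0.80, 0.70, 0.60, 0.10, weight=3.0),
    "model_b": ModelSignals(0.40, 0.50, 0.50, 0.30, weight=1.0),
}
print(round(headline_score(models), 1))  # 60.0
```

The higher-weight model dominates the headline number, which is why a per-model table is worth reading next to it.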
A simplified aggregation model
Illustrative only — production weights and normalization may differ, but the idea is the same: three signals minus a hallucination penalty.
Schematic formula
LLM-Score ≈ 100 × ( α·M + β·C + γ·S − δ·H )
M ∈ [0,1] is the share of answers with a correct brand mention; C ∈ [0,1] is factual consistency; S ∈ [0,1] is normalized sentiment; H ∈ [0,1] is a hallucination penalty; α+β+γ+δ = 1.
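A short derivation of the range this expression can take, assuming all four signals lie in [0,1] and the weights are non-negative. The expression is smallest when M = C = S = 0 and H = 1, and largest when M = C = S = 1 and H = 0:

```latex
-100\,\delta \;\le\; 100\,(\alpha M + \beta C + \gamma S - \delta H) \;\le\; 100\,(\alpha + \beta + \gamma) \;=\; 100\,(1 - \delta)
```

The last equality uses α+β+γ+δ = 1. Under these assumptions the raw expression stays below 100 whenever δ > 0; how production normalization handles the ends of the range is not specified here.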
Headline score (example): 72 (illustrative, not your live score)
- Presence: share of answers that mention the brand appropriately
- Factual consistency: alignment with known facts
- Normalized sentiment: positive / neutral / negative blend
What you see in a typical report
Model roll-up
LLM-Score (snapshot): 67
WoW change: +5
Largest lift: YandexGPT (+12 quote hits)
Risk: Perplexity — wrong site URL in 2/10 answers
Answer quote
“In this category, brands X and Y are named most often; your brand appears in the context of …” — verbatim text explains *why* the score moved.
From a dated snapshot: model label + timestamp
How it works in practice
What to open first in a report
- Per-model roll-up — where the score moved after a GEO launch or an AEO content refresh.
- Quotes — one or two sentences from the model often explain jumps better than the scalar: wrong HQ address, confused sibling brand, etc.
- Prompt-level breakdown — a high LLM-Score can still hide failures on a handful of “money” prompts; fix those scenarios before polishing the average.
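The prompt-level check in the last item can be sketched as a filter. This assumes a per-prompt score on the same 0-100 scale; the prompt names, the "money" tag set, the threshold of 50, and all numbers are made up for illustration:

```python
from statistics import mean

# Made-up per-prompt scores; prompt names are placeholders.
prompt_scores = {
    "best tools in category": 82,
    "brand vs competitor": 78,
    "pricing question": 35,
    "how to get started": 85,
}
money_prompts = {"brand vs competitor", "pricing question"}  # hypothetical tag


def failing_money_prompts(scores, money, threshold=50):
    """Return money prompts scoring below the threshold, worst first."""
    failing = [(p, s) for p, s in scores.items() if p in money and s < threshold]
    return sorted(failing, key=lambda item: item[1])


# The average looks healthy even though one money prompt is failing.
print(mean(prompt_scores.values()))
print(failing_money_prompts(prompt_scores, money_prompts))
```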
Mini walkthrough
Suppose LLM-Score is 67, up from 62 last week. The model table shows YandexGPT gaining 12 correct-name quotes while Perplexity still cites the wrong domain, so prioritize org-card and source fixes before writing more blog posts.
Related lenses: GPI for overall visibility pressure, Share of Voice when competitive mention share matters most.
How to read it
Use the score as a summary, then read the quotes. A high score with wrong facts in high-intent prompts still requires action.
LLM-Score recovery plan (first 2 weeks)
- Pull top 20 low-quality answers by business importance.
- Tag each issue: wrong fact, outdated page, missing citation, brand confusion.
- Fix the top 3 recurring issue types in source pages.
- Re-run the same pack and compare only those prompts first.
This keeps the work focused on recurring problems and gives you a measurable quality loop.
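The tagging and prioritization steps above can be sketched with a counter. The review log below is made up, and the prompt ids are placeholders; only the four issue tags come from the plan:

```python
from collections import Counter

# Made-up review log: (prompt id, issue tag).
reviewed = [
    ("p01", "wrong fact"),
    ("p02", "outdated page"),
    ("p03", "wrong fact"),
    ("p04", "brand confusion"),
    ("p05", "missing citation"),
    ("p06", "wrong fact"),
    ("p07", "outdated page"),
    ("p08", "brand confusion"),
]

# Count how often each issue type recurs and keep the three most common.
issue_counts = Counter(tag for _, tag in reviewed)
top_three = issue_counts.most_common(3)
print(top_three)

# Prompts to compare first after the re-run: those hit by a top-three issue.
top_tags = {tag for tag, _ in top_three}
rerun_first = [pid for pid, tag in reviewed if tag in top_tags]
print(rerun_first)
```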
Weekly volatility guide
| Move | Typical interpretation |
|---|---|
| ±1 to ±2 points | Normal model noise |
| ±3 to ±5 points | Real content/distribution effect likely |
| More than ±5 points in one week | Major source or model behavior change |
Always validate with quote-level evidence before reporting large deltas.
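The guide can be applied as a small classifier on the week-over-week delta. A minimal sketch: the function name is illustrative, and treating anything up to 2 points as noise and anything above 2 and up to 5 as a likely effect is an assumption, since the table does not define values between its rows:

```python
def classify_move(delta: float) -> str:
    """Map a week-over-week point change to the guide's interpretation."""
    size = abs(delta)
    if size <= 2:
        return "normal model noise"
    if size <= 5:
        return "real content/distribution effect likely"
    return "major source or model behavior change"


for delta in (+1, -4, +6):
    print(delta, classify_move(delta))
```

A label from this function is a starting hypothesis; the quotes still decide whether the move is real.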
LLM-Score vs Share of Voice
LLM-Score includes quality; Share of Voice is mostly volume. You can lead SoV and still lose trust if models cite wrong facts.
When to use
- Tracking weekly after an llms.txt or schema change.
- Comparing how different models see you (ChatGPT vs YandexGPT).
- Reporting to leadership: one number per brand per month.