LLM-Score
A 0–100 score that combines mention rate, correctness, and sentiment of how LLMs answer about your brand.
Definition
LLM-Score is a composite 0–100 metric produced by Getllmspy that shows how correctly and how often major language models — ChatGPT, Gemini, Perplexity, Claude, YandexGPT, Alice, GigaChat, DeepSeek — mention your brand when users ask questions in your category. Unlike classic SEO rankings, LLM-Score measures presence inside generated answers, not inside SERP lists. It rolls up three signals: mention rate across a prompt pack, factual correctness (no hallucinations), and tone.
Scale benchmarks
| Range | How to read it |
|---|---|
| 0–20 | Brand is effectively invisible in LLM answers |
| 21–45 | Occasional mentions, often with factual errors |
| 46–65 | Steady presence in a subset of models |
| 66–80 | Strong visibility with rare hallucinations |
| 81–100 | Category-leading presence |
Median LLM-Score across the Getllmspy dataset is 38 (April 2026 snapshot).
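If you script against exported scores, a small helper can translate a number into the bands above. A minimal sketch in Python; the function and thresholds simply restate the table and are not part of any Getllmspy API:

```python
def llm_score_band(score: float) -> str:
    """Map an LLM-Score (0-100) to the benchmark bands above."""
    if not 0 <= score <= 100:
        raise ValueError("LLM-Score is defined on the 0-100 scale")
    if score <= 20:
        return "effectively invisible in LLM answers"
    if score <= 45:
        return "occasional mentions, often with factual errors"
    if score <= 65:
        return "steady presence in a subset of models"
    if score <= 80:
        return "strong visibility with rare hallucinations"
    return "category-leading presence"

print(llm_score_band(38))  # the dataset median falls in the 21-45 band
```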
How it's computed
End-to-end, the pipeline looks like this:
- Prompt pack — a fixed set of category questions; your brand is not pasted into the literal prompt text (see the prompt pack entry).
- Model fan-out — each model in coverage runs the same scripted steps; answers are stored as dated snapshots.
- Per-answer signals — was the brand mentioned appropriately, how factually consistent is the narrative, what is the sentiment, and is there a penalty for clear contradictions.
- Normalization — signals are scaled to 0–1 per model and prompt, weighted (traffic + regional relevance), then aggregated into one headline number for the brand and topic.
- Reporting — the UI shows the headline score, a per-model table, and quotes; always read the number alongside the underlying answers.
The formula and charts below are didactic; production weights can differ while keeping the same structure.
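In the same didactic spirit, here is a minimal sketch of the per-answer record the pipeline might produce and how the mention-rate signal falls out of it; every name and field below is hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AnswerSignals:
    model: str            # e.g. "ChatGPT", "YandexGPT"
    prompt_id: str        # which question in the prompt pack
    snapshot_date: date   # answers are stored as dated snapshots
    mentioned: bool       # brand mentioned appropriately?
    consistency: float    # factual consistency, scaled to 0-1
    sentiment: float      # normalized sentiment, scaled to 0-1
    contradicts: bool     # clear contradiction -> hallucination penalty

def mention_rate(answers: list[AnswerSignals]) -> float:
    """Share of answers with an appropriate brand mention (the M signal)."""
    return sum(a.mentioned for a in answers) / len(answers)
```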
A simplified aggregation model
Illustrative only — production weights and normalization may differ, but the idea is the same: three signals minus a hallucination penalty.
Schematic formula
LLM-Score ≈ 100 × ( α·M + β·C + γ·S − δ·H )
M is the share of answers with a correct brand mention; C ∈ [0,1] is factual consistency; S ∈ [0,1] is normalized sentiment; H ∈ [0,1] is a hallucination penalty; α + β + γ = 1, with δ ≥ 0 scaling the penalty, and the result is clamped to the 0–100 scale.
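Translated directly into code, with made-up weights that satisfy α + β + γ = 1 (production weights differ, as noted above):

```python
def llm_score(M: float, C: float, S: float, H: float,
              alpha: float = 0.40, beta: float = 0.35,
              gamma: float = 0.25, delta: float = 0.50) -> int:
    """Schematic LLM-Score: three weighted signals minus a hallucination penalty."""
    raw = alpha * M + beta * C + gamma * S - delta * H
    return round(100 * min(max(raw, 0.0), 1.0))  # clamp to the 0-100 scale

# Illustrative signal values that reproduce the example headline score below:
print(llm_score(M=0.80, C=0.80, S=0.56, H=0.04))  # -> 72
```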
Headline score (example)
72 (illustrative, not your live score). The number blends three signals:

| Signal | What it captures |
|---|---|
| Presence | Share of answers that mention the brand appropriately |
| Factual consistency | Alignment with known facts |
| Normalized sentiment | Positive / neutral / negative blend |
What you see in a typical report
Model roll-up
- LLM-Score (snapshot): 67
- WoW change: +5
- Largest lift: YandexGPT (+12 quote hits)
- Risk: Perplexity — wrong site URL in 2/10 answers
Answer quote
- “In this category, brands X and Y are named most often; your brand appears in the context of …” — verbatim text explains *why* the score moved.
- Taken from a dated snapshot: model label + timestamp.
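If the roll-up above were exported, it might serialize roughly like this; the field names are hypothetical, and only the values echo the example report:

```python
rollup = {
    "llm_score": 67,       # snapshot headline number
    "wow_change": +5,      # week-over-week movement
    "per_model": {
        "YandexGPT":  {"lift": "+12 quote hits"},
        "Perplexity": {"risk": "wrong site URL in 2/10 answers"},
    },
    "quotes": [],          # verbatim answers, each with model label + timestamp
}
```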
How it works in practice
What to open first in a report
- Per-model roll-up — where the score moved after a GEO launch or an AEO content refresh.
- Quotes — one or two sentences from the model often explain jumps better than the scalar: wrong HQ address, confused sibling brand, etc.
- Prompt-level breakdown — a high LLM-Score can still hide failures on a handful of “money” prompts; fix those scenarios before polishing the average.
Mini walkthrough
Suppose this week's LLM-Score is 67 against 62 last week. The model table shows YandexGPT adding +12 correct-name quotes while Perplexity still swaps your domain, so prioritize org-card and source fixes before writing more blog posts.
Related lenses: GPI for overall visibility pressure, Share of Voice when competitive mention share matters most.
How to read it
Use the scale above as a rule of thumb. Always pair the score with qualitative quotes from the report: a high number with toxic context still needs a content response.
LLM-Score vs Share of Voice
LLM-Score blends correctness and sentiment with presence. Share of Voice only measures how often your brand appears compared to competitors. A brand can win SoV and still have a low LLM-Score if the models misquote it.
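To make the contrast concrete, SoV is usually a plain mention share (the exact Getllmspy definition may differ), so it carries no correctness signal at all:

```python
def share_of_voice(brand_mentions: int, all_brand_mentions: int) -> float:
    """SoV: how often your brand appears relative to every brand mention."""
    return brand_mentions / all_brand_mentions

# A brand can lead here with 42% of mentions...
print(share_of_voice(brand_mentions=42, all_brand_mentions=100))  # -> 0.42
# ...yet score low on LLM-Score if many of those answers misstate facts (low C).
```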
When to use
- Tracking weekly after an llms.txt or schema change.
- Comparing how different models see you (ChatGPT vs YandexGPT).
- Reporting to leadership: one number per brand per month.