
Glossary

LLM-Score

LLM-Score is a 0–100 metric showing how correctly and how often large language models mention your brand for your topic prompts.
  • A 0–100 score that tracks how often and how accurately LLMs talk about your brand.


Definition

LLM-Score is a composite 0–100 metric from Getllmspy. It measures how often major models (ChatGPT, Gemini, Perplexity, Claude, YandexGPT, Alice, GigaChat, DeepSeek) mention your brand, and whether those mentions are accurate and usable. Unlike SEO rank tracking, this score reflects what users actually read in generated answers.

Scale benchmarks

Range     How to read it
0–20      Brand is effectively invisible in LLM answers
21–45     Occasional mentions, often with factual errors
46–65     Steady presence in a subset of models
66–80     Strong visibility with rare hallucinations
81–100    Category-leading presence

Median LLM-Score across the Getllmspy dataset is 38 (April 2026 snapshot).
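
The benchmark bands can be expressed as a small lookup. This is a minimal sketch: the function name `read_band` is hypothetical, and rounding fractional scores to a whole number before the lookup is an assumption, since the bands are only defined on whole-number scores.

```python
# Upper bound of each band and its reading, copied from the benchmark table.
BANDS = [
    (20, "Brand is effectively invisible in LLM answers"),
    (45, "Occasional mentions, often with factual errors"),
    (65, "Steady presence in a subset of models"),
    (80, "Strong visibility with rare hallucinations"),
    (100, "Category-leading presence"),
]


def read_band(score: float) -> str:
    """Return the benchmark reading for a 0-100 headline score (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError("LLM-Score must be between 0 and 100")
    rounded = round(score)  # assumption: fractional scores are rounded first
    for upper, reading in BANDS:
        if rounded <= upper:
            return reading
    raise AssertionError("unreachable: 100 is the last upper bound")


# The dataset median of 38 falls in the second band.
print(read_band(38))
```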

Worked score example

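Assume illustrative weights: α=0.40, β=0.35, γ=0.20, δ=0.15. These sum to 1.10 rather than 1, so they are not normalized and serve only to show the arithmetic. For one monthly slice: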

  • Mention signal M = 0.68
  • Correctness signal C = 0.74
  • Sentiment signal S = 0.57
  • Hallucination penalty H = 0.19
LLM-Score ≈ 100 × (0.40×0.68 + 0.35×0.74 + 0.20×0.57 − 0.15×0.19)
          ≈ 100 × (0.272 + 0.259 + 0.114 − 0.029)
          ≈ 61.6

Rounded headline score: 62.
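
The same arithmetic can be reproduced in a few lines of Python. This is a sketch of the schematic formula only, not the production scoring code, and the function name `llm_score` is hypothetical. Without intermediate rounding the result is 61.65; the 61.6 above comes from rounding the penalty term to 0.029 first.

```python
def llm_score(m, c, s, h, alpha=0.40, beta=0.35, gamma=0.20, delta=0.15):
    """Schematic LLM-Score: weighted signals minus a hallucination penalty.

    Default weights are the illustrative ones from the worked example.
    """
    return 100 * (alpha * m + beta * c + gamma * s - delta * h)


score = llm_score(m=0.68, c=0.74, s=0.57, h=0.19)
print(f"{score:.2f}")  # 61.65
print(round(score))    # 62
```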

Practical reading pattern

Scenario                            Score move              What to do
Content update shipped, score +6    Broad improvement       Keep pack stable and confirm in next run
Score flat, contradictions up       Hidden quality issue    Fix source pages before adding more pages
Score up only in one model          Narrow gain             Expand validation to other high-traffic models

Mini chart (8 weekly runs):

W1 54 ▇▇▇▇▇▇
W2 55 ▇▇▇▇▇▇
W3 57 ▇▇▇▇▇▇▇
W4 56 ▇▇▇▇▇▇
W5 60 ▇▇▇▇▇▇▇▇
W6 61 ▇▇▇▇▇▇▇▇
W7 63 ▇▇▇▇▇▇▇▇▇
W8 62 ▇▇▇▇▇▇▇▇

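Across the eight runs the score rises 8 points (54 to 62), and the only declines are single 1-point dips in W4 and W8. This pattern suggests real progress with normal weekly noise.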
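
To check a series like this without eyeballing the bars, week-over-week deltas are enough. A minimal sketch using the eight values from the chart (variable names are illustrative):

```python
weekly = [54, 55, 57, 56, 60, 61, 63, 62]  # W1..W8 from the mini chart

# Difference between each run and the one before it.
deltas = [current - previous for previous, current in zip(weekly, weekly[1:])]
net_change = weekly[-1] - weekly[0]
largest_drop = min(deltas)

print(deltas)        # [1, 2, -1, 4, 1, 2, -1]
print(net_change)    # 8
print(largest_drop)  # -1
```

A positive net change combined with small, isolated drops is the signature of a real trend rather than noise.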

How it's computed

End-to-end, the pipeline looks like this:

  1. Prompt pack — a fixed set of category questions (your brand is not pasted into the literal prompt text), see prompt pack.
  2. Model fan-out — each model in coverage runs the same scripted steps; answers are stored as dated snapshots.
  3. Per-answer signals — was the brand mentioned appropriately, how factually consistent is the narrative, what is the sentiment, and is there a penalty for clear contradictions.
  4. Normalization — signals are scaled to 0–1 per model and prompt, weighted (traffic + regional relevance), then aggregated into one headline number for the brand and topic.
  5. Reporting — the UI shows the headline score, a per-model table, and quotes; you should always read the number next to the underlying answers.

Weights can evolve by product version, but the structure stays the same: presence + correctness + sentiment, with penalties for contradictions.
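
Step 4 can be pictured as a weighted average over per-model scores. The sketch below is an assumption about the general shape, not Getllmspy's implementation: the `ModelResult` type, the per-model scores, and the weights are all hypothetical, and the real weighting by traffic and regional relevance is not specified here.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str
    score: float   # per-model score already normalized to 0-1
    weight: float  # hypothetical traffic / regional relevance weight


def headline_score(results: list[ModelResult]) -> float:
    """Weighted average of per-model scores, scaled to 0-100."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(r.score * r.weight for r in results) / total_weight
    return 100 * weighted


# Hypothetical numbers for illustration only.
results = [
    ModelResult("ChatGPT", 0.70, 0.5),
    ModelResult("YandexGPT", 0.60, 0.3),
    ModelResult("Perplexity", 0.40, 0.2),
]
print(round(headline_score(results)))  # 61
```

Dividing by the total weight means the weights do not need to sum to 1, which keeps the headline number on the 0–100 scale when models are added to or removed from coverage.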

A simplified aggregation model

Illustrative only — production weights and normalization may differ, but the idea is the same: three signals minus a hallucination penalty.

Schematic formula

LLM-Score ≈ 100 × ( α·M + β·C + γ·S − δ·H )

M is the share of answers with a correct brand mention; C ∈ [0,1] is factual consistency; S is normalized sentiment; H ∈ [0,1] is a hallucination penalty; α+β+γ+δ = 1.

Headline score (example): 72 (illustrative, not your live score)

  • Presence: share of answers that mention the brand appropriately
  • Factual consistency: alignment with known facts
  • Normalized sentiment: positive / neutral / negative blend

What you see in a typical report

Model roll-up

  • LLM-Score (snapshot): 67
  • Week-over-week (WoW) change: +5
  • Largest lift: YandexGPT (+12 quote hits)
  • Risk: Perplexity, wrong site URL in 2/10 answers

Answer quote

“In this category, brands X and Y are named most often; your brand appears in the context of …” The verbatim text explains why the score moved.

Each quote comes from a dated snapshot and carries a model label and timestamp.

How it works in practice

What to open first in a report

  • Per-model roll-up — where the score moved after a GEO launch or an AEO content refresh.
  • Quotes — one or two sentences from the model often explain jumps better than the scalar: wrong HQ address, confused sibling brand, etc.
  • Prompt-level breakdown — a high LLM-Score can still hide failures on a handful of “money” prompts; fix those scenarios before polishing the average.
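
The prompt-level check in the last bullet can be sketched as a filter over per-prompt scores. Everything here is hypothetical: the prompts, the scores, the threshold of 50, and the function name `failing_money_prompts` are invented for illustration and do not come from a real report.

```python
def failing_money_prompts(prompt_scores, money_prompts, threshold=50):
    """Return (prompt, score) pairs for money prompts below threshold, worst first."""
    failing = [
        (prompt, score)
        for prompt, score in prompt_scores.items()
        if prompt in money_prompts and score < threshold
    ]
    return sorted(failing, key=lambda item: item[1])


# Hypothetical per-prompt scores on a 0-100 scale.
prompt_scores = {
    "best crm for small teams": 82,
    "crm pricing comparison": 41,
    "crm with email sync": 77,
    "is example-brand crm secure": 35,
    "crm onboarding tips": 90,
}
money = {"crm pricing comparison", "is example-brand crm secure"}

average = sum(prompt_scores.values()) / len(prompt_scores)
print(average)  # 65.0
print(failing_money_prompts(prompt_scores, money))
# [('is example-brand crm secure', 35), ('crm pricing comparison', 41)]
```

In this made-up slice the average looks acceptable at 65, yet both high-intent prompts fall below the threshold.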

Mini walkthrough

Suppose LLM-Score is 67, up from 62 last week. The model table shows YandexGPT adding +12 correct-name quotes while Perplexity still swaps your domain. Prioritize org-card and source fixes before writing more blog posts.

Related lenses: GPI for overall visibility pressure, Share of Voice when competitive mention share matters most.

How to read it

Use the score as a summary, then read the quotes. A high score with wrong facts in high-intent prompts still requires action.

LLM-Score recovery plan (first 2 weeks)

  1. Pull top 20 low-quality answers by business importance.
  2. Tag each issue: wrong fact, outdated page, missing citation, brand confusion.
  3. Fix the top 3 recurring issue types in source pages.
  4. Re-run the same pack and compare only those prompts first.

This prevents random activity and gives you a measurable quality loop.
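
Steps 2 and 3 amount to tagging and counting. A minimal sketch, assuming each reviewed answer has already been given exactly one of the four tags from step 2; the answer IDs and the tag assignments are hypothetical:

```python
from collections import Counter

ISSUE_TYPES = {"wrong fact", "outdated page", "missing citation", "brand confusion"}


def top_issue_types(tagged_answers, n=3):
    """Count issue tags and return the n most frequent as (tag, count) pairs."""
    for answer_id, tag in tagged_answers:
        if tag not in ISSUE_TYPES:
            raise ValueError(f"unknown issue tag for {answer_id}: {tag!r}")
    counts = Counter(tag for _, tag in tagged_answers)
    return counts.most_common(n)


# Hypothetical review of ten low-quality answers.
tagged = [
    ("p01", "wrong fact"), ("p02", "brand confusion"), ("p03", "wrong fact"),
    ("p04", "outdated page"), ("p05", "wrong fact"), ("p06", "missing citation"),
    ("p07", "brand confusion"), ("p08", "outdated page"), ("p09", "wrong fact"),
    ("p10", "brand confusion"),
]
print(top_issue_types(tagged))
# [('wrong fact', 4), ('brand confusion', 3), ('outdated page', 2)]
```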

Weekly volatility guide

Move                     Typical interpretation
±1 to ±2 points          Normal model noise
±3 to ±5 points          Real content/distribution effect likely
>5 points in one week    Major source or model behavior change

Always validate with quote-level evidence before reporting large deltas.
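
The guide maps directly onto a small classifier. This is a sketch with two assumptions the table does not settle: a move of 0 is treated as noise, and fractional moves between 2 and 3 points fall into the middle band. The function name `classify_move` is hypothetical.

```python
def classify_move(delta: float) -> str:
    """Map a week-over-week score change to the volatility guide's reading."""
    size = abs(delta)
    if size <= 2:  # assumption: 0 counts as noise too
        return "Normal model noise"
    if size <= 5:  # assumption: moves between 2 and 3 land here
        return "Real content/distribution effect likely"
    return "Major source or model behavior change"


# A move from 62 to 67 is +5 points.
print(classify_move(67 - 62))  # Real content/distribution effect likely
```

The classifier only labels the size of the move; the quote-level check is still what confirms the cause.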

LLM-Score vs Share of Voice

LLM-Score includes quality; Share of Voice is mostly volume. You can lead SoV and still lose trust if models cite wrong facts.

When to use

  • Tracking weekly after an llms.txt or schema change.
  • Comparing how different models see you (ChatGPT vs YandexGPT).
  • Reporting to leadership: one number per brand per month.