Glossary

LLM-Score

LLM-Score is a 0–100 metric that shows how often and how correctly large language models mention your brand in answers to your topic prompts.
  • A 0–100 score that combines mention rate, correctness, and sentiment of how LLMs answer about your brand.

  • See your own LLM-Score in the demo report.

Definition

LLM-Score is a composite 0–100 metric produced by Getllmspy that shows how correctly and how often major language models — ChatGPT, Gemini, Perplexity, Claude, YandexGPT, Alice, GigaChat, DeepSeek — mention your brand when users ask questions in your category. Unlike classic SEO rankings, LLM-Score measures presence inside generated answers, not inside SERP lists. It rolls up three signals: mention rate across a prompt pack, factual correctness (no hallucinations), and tone.

Scale benchmarks

Range     How to read it
0–20      Brand is effectively invisible in LLM answers
21–45     Occasional mentions, often with factual errors
46–65     Steady presence in a subset of models
66–80     Strong visibility with rare hallucinations
81–100    Category-leading presence

Median LLM-Score across the Getllmspy dataset is 38 (April 2026 snapshot).
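The bands above can be turned into a quick lookup helper. This is only a sketch: the cutoffs and readings are copied from the benchmark table, while the function itself is illustrative, not part of the product.

```python
def llm_score_band(score: int) -> str:
    """Map a 0-100 LLM-Score to its reading from the benchmark table."""
    if not 0 <= score <= 100:
        raise ValueError("LLM-Score is defined on 0-100")
    bands = [
        (20, "Brand is effectively invisible in LLM answers"),
        (45, "Occasional mentions, often with factual errors"),
        (65, "Steady presence in a subset of models"),
        (80, "Strong visibility with rare hallucinations"),
        (100, "Category-leading presence"),
    ]
    for upper, reading in bands:
        if score <= upper:
            return reading

# The dataset median of 38 falls in the 21-45 band.
print(llm_score_band(38))
```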

How it's computed

End-to-end, the pipeline looks like this:

  1. Prompt pack — a fixed set of category questions (your brand is not pasted into the literal prompt text), see prompt pack.
  2. Model fan-out — each model in coverage runs the same scripted steps; answers are stored as dated snapshots.
  3. Per-answer signals — was the brand mentioned appropriately, how factually consistent is the narrative, what is the sentiment, and is there a penalty for clear contradictions.
  4. Normalization — signals are scaled to 0–1 per model and prompt, weighted (traffic + regional relevance), then aggregated into one headline number for the brand and topic.
  5. Reporting — the UI shows the headline score, a per-model table, and quotes; always read the number alongside the underlying answers.
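The normalization and aggregation steps (3–4) can be sketched in a few lines. Everything here is an assumption for illustration: the signal weights, the contradiction penalty, and the `AnswerSignals` shape are invented, not the production pipeline.

```python
from dataclasses import dataclass

@dataclass
class AnswerSignals:
    model: str
    mentioned: bool      # was the brand mentioned appropriately?
    consistency: float   # factual consistency, scaled to 0-1
    sentiment: float     # normalized sentiment, scaled to 0-1
    contradiction: bool  # clear contradiction triggers a penalty

def aggregate(snapshots: list[AnswerSignals],
              model_weights: dict[str, float]) -> float:
    """Blend per-answer signals, weight by model, roll up to 0-100."""
    total_w, acc = 0.0, 0.0
    for s in snapshots:
        # Per-model weight stands in for traffic + regional relevance.
        w = model_weights.get(s.model, 1.0)
        signal = (0.4 * float(s.mentioned)
                  + 0.35 * s.consistency
                  + 0.25 * s.sentiment)
        if s.contradiction:
            signal -= 0.2  # hallucination penalty (assumed value)
        acc += w * max(signal, 0.0)
        total_w += w
    return round(100 * acc / total_w, 1) if total_w else 0.0
```

A perfect answer set (always mentioned, fully consistent, positive tone, no contradictions) aggregates to 100 under these weights; real packs land far lower.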

The formula and charts below are didactic; production weights can differ while keeping the same structure.

A simplified aggregation model

Illustrative only — production weights and normalization may differ, but the idea is the same: three signals minus a hallucination penalty.

Schematic formula

LLM-Score ≈ 100 × ( α·M + β·C + γ·S − δ·H )

M is the share of answers with a correct brand mention; C ∈ [0,1] is factual consistency; S is normalized sentiment; H ∈ [0,1] is a hallucination penalty; α+β+γ+δ = 1.
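A direct transcription of the schematic, with assumed weights α=0.4, β=0.3, γ=0.2, δ=0.1 (they sum to 1, as required, but production values may differ):

```python
def llm_score(M, C, S, H, alpha=0.4, beta=0.3, gamma=0.2, delta=0.1):
    """Schematic LLM-Score: three signals minus a hallucination penalty.

    M, C, S, H are all in [0, 1]; the weights must sum to 1.
    """
    assert abs(alpha + beta + gamma + delta - 1) < 1e-9
    return 100 * (alpha * M + beta * C + gamma * S - delta * H)

# e.g. 80% correct mentions, high consistency, neutral-positive tone,
# a mild hallucination penalty -> roughly 70
print(llm_score(M=0.8, C=0.9, S=0.6, H=0.1))
```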

Headline score (example): 72 (illustrative, not your live score)

  • Presence: share of answers that mention the brand appropriately
  • Factual consistency: alignment with known facts
  • Normalized sentiment: positive / neutral / negative blend

What you see in a typical report

Model roll-up

  • LLM-Score (snapshot): 67
  • WoW change: +5
  • Largest lift: YandexGPT (+12 quote hits)
  • Risk: Perplexity returns the wrong site URL in 2/10 answers

Answer quote

“In this category, brands X and Y are named most often; your brand appears in the context of …” — the verbatim text explains *why* the score moved. Each quote comes from a dated snapshot with a model label and timestamp.

How it works in practice

What to open first in a report

  • Per-model roll-up — where the score moved after a GEO launch or an AEO content refresh.
  • Quotes — one or two sentences from the model often explain jumps better than the scalar: wrong HQ address, confused sibling brand, etc.
  • Prompt-level breakdown — a high LLM-Score can still hide failures on a handful of “money” prompts; fix those scenarios before polishing the average.

Mini walkthrough

Suppose LLM-Score is 67 versus 62 last week. The model table shows YandexGPT adding 12 correct-name quotes while Perplexity still swaps your domain: prioritize org-card and source fixes before writing more blog posts.

Related lenses: GPI for overall visibility pressure, Share of Voice when competitive mention share matters most.

How to read it

Use the scale above as a rule of thumb. Always pair the score with qualitative quotes from the report: a high number with toxic context still needs a content response.

LLM-Score vs Share of Voice

LLM-Score blends correctness and sentiment with presence. Share of Voice only measures how often your brand appears compared to competitors. A brand can win SoV and still have a low LLM-Score if the models misquote it.
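A toy comparison makes the difference concrete. The counts and helper functions below are hypothetical; they only show that mention share and mention correctness can diverge.

```python
# Hypothetical counts over a 10-prompt pack (illustrative only):
brand_a = {"mentions": 6, "correct": 2}  # mentioned often, misquoted often
brand_b = {"mentions": 4, "correct": 4}  # mentioned less, always correct

total_mentions = brand_a["mentions"] + brand_b["mentions"]

def sov(brand):
    """Share of Voice: mention share among all brand mentions."""
    return 100 * brand["mentions"] / total_mentions

def mention_correctness(brand):
    """The correctness signal LLM-Score blends in; SoV ignores it."""
    return brand["correct"] / brand["mentions"]

# Brand A wins SoV (60 vs 40) but only a third of its mentions are
# correct, so its LLM-Score would suffer where brand B's would not.
print(sov(brand_a), mention_correctness(brand_a))
print(sov(brand_b), mention_correctness(brand_b))
```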

When to use

  • Tracking weekly after an llms.txt or schema change.
  • Comparing how different models see you (ChatGPT vs YandexGPT).
  • Reporting to leadership: one number per brand per month.