
GEO Score 4-Layer Metric Design

MJ · 13 min read

Analysis of the design philosophy and hierarchical dependencies of the four GEO Score layers (Inclusion, Prominence, Quality, Stability), and their implications for decision-making.

Scope of This Post

WICHI’s GEO Score measures how a brand is represented in AI search engine responses. This post covers the design philosophy and conceptual framework behind the 4 layers that constitute the GEO Score. Specific weights, algorithms, and scoring formulas are core product IP and are not disclosed here.


Why Four Layers

The Structural Limitation of a Single Score

Producing a single GEO Score number is easy. Count brand mentions, or convert exposure rate to a percentage. The problem is that a single number cannot tell you what is going well and what is broken.

A few real scenarios make this limitation clear:

  • Mentioned but negatively. The brand appears in AI responses, but in contexts like “more expensive than X” or “alternatives include Y.” Exposure exists, so a single score looks fine — but the mention actually benefits competitors.
  • Frequently mentioned but inaccurate. The brand is mentioned often, but AI describes features it does not offer or gets pricing wrong. If users trust and act on this, it erodes brand credibility.
  • Great today, gone tomorrow. Running the same query today puts the brand at #1; running it tomorrow omits it entirely. A one-time measurement cannot capture this instability.
  • Top position, negative context. The brand appears in the first sentence of the response, but as “a service to be cautious about.” Positionally it is the top mention, but effectively it is harmful.
  • Many mentions, no citations. The brand name appears multiple times, but no source links or domain citations are included. This suggests AI “knows about” the brand but does not “trust” it as a source.

All these scenarios collapse into the same number under a single-score system. A score of 70 could mean “strong visibility,” “frequently mentioned but inaccurate,” or “good today, uncertain tomorrow” — and there is no way to tell which.

Problems with Existing Approaches

Common approaches to measuring AI search visibility each have inherent limitations:

| Approach | How It Measures | Limitation |
|---|---|---|
| Simple mention counting | Count brand name appearances | Ignores context (positive/negative) |
| Binary exposure check | Mentioned / not mentioned | No position, weight, or quality information |
| Sentiment analysis alone | Positive/negative/neutral classification | No exposure or position information |
| Rank-based | Rank vs. competitors | Does not reflect consistency (stability) |
| SOV (Share of Voice) alone | Brand’s share of total mentions | Ignores quality and accuracy |

The common thread: one-dimensionality. Each approach captures only one facet of AI search visibility while ignoring the rest. The result is that different approaches can produce contradictory conclusions from the same data.

“Measure what matters, not what’s easy. A single number is easy to produce but cannot carry the information needed for decision-making.”

The Answer: Layered Architecture

WICHI instead chose an architecture of four independently measured layers, guided by three core principles:

  1. Each layer answers one question. “Is it mentioned?”, “Is it prominent?”, “Is it accurate?”, “Is it consistent?” — each question is independent, and each answer demands different action.
  2. Logical dependencies exist between layers. You cannot measure the quality of a brand that is not mentioned. These dependencies determine the interpretation order.
  3. Layer-level patterns matter more than aggregate scores. A total of 70 means entirely different things depending on how the score is distributed across layers.

```mermaid
graph TD
    A["GEO Score"] --> B["L1: Inclusion<br/>Exposure"]
    A --> C["L2: Prominence<br/>Visibility"]
    A --> D["L3: Quality<br/>Accuracy & Sentiment"]
    A --> E["L4: Stability<br/>Consistency"]

    B -->|"Prerequisite"| C
    B -->|"Prerequisite"| D
    E -->|"Confidence check"| B
    E -->|"Confidence check"| C
    E -->|"Confidence check"| D

    style B fill:#e8f4f8,stroke:#2196F3
    style C fill:#fff3e0,stroke:#FF9800
    style D fill:#e8f5e9,stroke:#4CAF50
    style E fill:#fce4ec,stroke:#E91E63
```

Note the arrow directions. L1 (Inclusion) is a prerequisite for L2 and L3, while L4 (Stability) validates the confidence of all other layers. This structural relationship is what differentiates a 4-layer architecture from merely listing four scores.


Layer Details

Layer Overview

| Layer | Name | Core Question | What It Measures | Activation Condition |
|---|---|---|---|---|
| L1 | Inclusion (Exposure) | Is the brand mentioned? | Presence, citation inclusion, share of voice | Always |
| L2 | Prominence (Visibility) | How visibly is it positioned? | Position within response, weight, depth of coverage | L1 > 0 |
| L3 | Quality (Accuracy) | Is the content accurate and favorable? | Sentiment, accuracy, narrative alignment | L1 > 0 |
| L4 | Stability (Consistency) | Are results consistent across runs? | Cross-run variance, volatility | 2+ runs required |

L1 — Inclusion (Exposure)

Core Question

“Does the AI search engine recognize this brand’s existence?”

Inclusion is the most fundamental layer. It measures whether the brand name appears in AI search responses. If this score is low, the other three layers are moot — you cannot assess the quality or stability of a brand that is not mentioned.

Why Not a Binary Value

Treating Inclusion as simple Yes/No loses information. “Mentioned in 1 of 10 queries” and “mentioned in 9 of 10 queries” are both “mentioned,” but represent completely different states. Inclusion therefore combines multiple sub-signals into a continuous score between 0 and 1.

What It Measures

| Sub-Signal | Description | Why It Is Needed |
|---|---|---|
| Brand mention presence | Does the brand name appear in response text? | Most basic existence check |
| Citation inclusion | Is the brand’s domain included in source citations? | Whether AI treats the brand as a trustworthy source |
| Share of Voice (SOV) | Brand’s share among all brands mentioned in the response | Relative position within the category |

These three sub-signals are combined because each carries different meaning. Appearing in text versus being cited as a source are different things. An AI engine may mention a brand in text without providing citation links — signaling “aware of it but does not trust it as an official source.”
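
As an illustration only, here is a minimal Python sketch of how such sub-signals could combine into a continuous score. The field names, weights, and formula are hypothetical; WICHI’s actual scoring is not disclosed.

```python
from dataclasses import dataclass

@dataclass
class InclusionSignals:
    """Sub-signals extracted from one AI response (hypothetical names)."""
    mentioned: bool        # brand name appears in response text
    cited: bool            # brand domain appears in the citation list
    share_of_voice: float  # brand mentions / all brand mentions, in [0, 1]

def inclusion_score(signals: InclusionSignals,
                    w_mention: float = 0.4,
                    w_citation: float = 0.3,
                    w_sov: float = 0.3) -> float:
    """Combine sub-signals into a continuous [0, 1] score.

    Weights are illustrative only; the production formula is not public.
    """
    if not signals.mentioned:
        return 0.0  # no mention -> L1 is zero, downstream layers become N/A
    return (w_mention * 1.0
            + w_citation * (1.0 if signals.cited else 0.0)
            + w_sov * signals.share_of_voice)
```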

Interpretation Guide

| Inclusion Level | Meaning | Priority Action |
|---|---|---|
| Very low | Brand barely recognized in AI search | Secure external sources, provide structured data |
| Low | Sporadic exposure in select queries only | Develop per-query content strategy |
| Medium | Intermittent exposure in category queries | Focus on improving exposure consistency |
| High | Stable exposure across most relevant queries | Shift to L2-L3 optimization |
| Very high | Exposure + citation in nearly all queries | Maintain + expand into new query territories |

```mermaid
flowchart LR
    Q["AI Search Query"] --> R["AI Response Generated"]
    R --> M{"Brand<br/>mentioned?"}
    M -->|No| X["L1 = 0<br/>L2-L4 cannot be measured"]
    M -->|Yes| C{"Citation<br/>included?"}
    C -->|No| S1["Text mention only"]
    C -->|Yes| S2["Text + citation present"]
    S1 --> SOV["SOV Calculation"]
    S2 --> SOV
    SOV --> L1["L1 Inclusion Score"]

    style X fill:#ffebee,stroke:#c62828
    style L1 fill:#e8f4f8,stroke:#2196F3
```

The Inclusion Trap

High Inclusion is not unconditionally good. Appearing on a “worst services” list also produces high Inclusion. This is precisely why L3 (Quality) exists, and why Inclusion alone must never be used to judge GEO status.

Traditional SEO has a parallel: high impressions mean nothing if click-through rate is low. Inclusion is the GEO equivalent of impressions — the other layers play the roles of CTR and conversion.


L2 — Prominence (Visibility)

Core Question

“Where in the response does the brand appear, and how much weight does it receive?”

If Inclusion measures “whether it exists,” Prominence measures “quality of existence.” Being described in detail as the top recommendation versus being listed as a one-liner under “other options” affects user behavior in completely different ways.

Why Position Matters

The position effect in AI search responses is similar to but more extreme than traditional SERP position effects. In SERPs, users can scroll through the page. AI search responses are typically presented as a single continuous text block. Users who get their answer from the first portion likely never read the rest.

Aggarwal et al. (KDD 2024) proposed PAWC (Position-Adjusted Word Count), a concept that quantifies this phenomenon. Brand-related text positioned earlier in the response receives higher weight.

What It Measures

| Sub-Signal | Description | What It Reflects |
|---|---|---|
| Position-adjusted score (PAWC-based) | Higher score for earlier position and greater volume | User attention pattern favoring earlier content |
| Brand mention weight | Ratio of brand-related content to total response length | How deeply AI covers the brand |

PAWC Explained

The core idea is straightforward: the same volume of text has different visibility depending on whether it appears in the 1st paragraph or the 5th. Earlier text is read by more users; later text may not be read at all.

| Scenario | Position in Response | Brand-Related Volume | Relative PAWC |
|---|---|---|---|
| A | First paragraph | 50 words | High |
| B | Third paragraph | 50 words | Medium |
| C | Last paragraph | 50 words | Low |
| D | First + third paragraphs | 30 + 20 words | Lower than A, higher than B |

Scenarios A and C allocate the same 50 words to the brand, but their PAWC-measured visibility differs dramatically. This is the gap that simple mention counting or word-count tallying cannot capture.
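
To make the idea concrete, here is a rough sketch of position-adjusted counting in the spirit of PAWC. The 1/position decay and the paragraph-level brand matching are illustrative assumptions, not the published formula.

```python
def position_adjusted_word_count(paragraphs: list[str], brand: str) -> float:
    """Weight brand-related word counts by paragraph position.

    Earlier paragraphs get higher weight (1/position, an illustrative
    decay choice). Simplification: counts every word in any paragraph
    that mentions the brand, rather than isolating brand-related spans.
    """
    score = 0.0
    for i, para in enumerate(paragraphs, start=1):
        if brand.lower() in para.lower():
            words = len(para.split())
            score += words / i  # paragraph 1 counts fully, paragraph 3 at 1/3
    return score

# E.g., 50 brand-adjacent words in paragraph 1 score 50.0;
# the same 50 words in paragraph 5 score only 10.0.
```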

Interpretation Guide

| Prominence Level | Meaning | Typical State |
|---|---|---|
| Very low | Mentioned but peripheral | “Other options” list, bottom of comparison table |
| Low | Present but not noticeable | Brief mention in mid-position |
| Medium | Covered with meaningful weight | One of several options described in detail |
| High | Among top recommendations | Introduced early as a primary option |
| Very high | First recommendation, most detailed coverage | Presented as the core answer |

Relationship Between Prominence and Inclusion

Prominence is meaningful only when Inclusion exceeds 0 — you cannot discuss the position of a brand that is not mentioned. However, high Inclusion does not guarantee high Prominence. A brand mentioned in all 10 queries (high L1) but each time as a brief last-line note (low L2) is entirely possible.

Patterns created by combining these two layers:

| Pattern | L1 | L2 | Interpretation |
|---|---|---|---|
| Invisible | Low | N/A | AI does not recognize the brand |
| Wallflower | High | Low | Mentioned but with low weight |
| Spotlighted | High | High | Well-exposed with strong weight |
| Occasional spotlight | Medium | High | High weight in select queries only |

L3 — Quality (Accuracy and Sentiment)

Core Question

“Is what AI says about the brand accurate and favorable?”

Quality evaluates the “content” of exposure. Being visible and prominent means nothing if the content is inaccurate or negative — in fact, because users tend to trust AI responses and act on them, high-prominence inaccurate content can be worse than no visibility at all.

Why Quality Is the Most Complex Layer

Inclusion can be measured via text matching. Prominence can be measured through structural attributes — position and volume. Quality requires evaluating the meaning of text. “This service is expensive but good” and “This service is good but expensive” use nearly identical words but convey different nuances.

This is why Quality measurement incorporates LLM-based evaluation (LLM-as-a-Judge). Simple keyword matching or rule-based sentiment analysis cannot capture contextual meaning at this level.
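
For intuition, here is a minimal sketch of what an LLM-as-a-Judge call for these dimensions might look like. The prompt, rubric, and `call_llm` client are hypothetical placeholders; WICHI’s actual evaluation pipeline is not disclosed.

```python
import json

# Hypothetical judge rubric; the real prompt and schema are not public.
JUDGE_PROMPT = """You are evaluating how an AI search response portrays a brand.
Brand: {brand}
Response excerpt: {excerpt}

Rate each dimension from 0 (worst) to 1 (best) and return JSON:
{{"sentiment": ..., "accuracy_confidence": ..., "narrative_alignment": ...}}
Base accuracy_confidence only on claims checkable against the brand
facts below; do not guess.
Brand facts: {facts}"""

def judge_quality(brand: str, excerpt: str, facts: str, call_llm) -> dict:
    """Ask a judge model to score quality dimensions.

    `call_llm` stands in for any chat-completion client that takes a
    prompt string and returns the model's text output.
    """
    raw = call_llm(JUDGE_PROMPT.format(brand=brand, excerpt=excerpt, facts=facts))
    return json.loads(raw)  # production code would validate the schema
```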

What It Measures

Quality combines multiple dimensions, each evaluating a different aspect of brand mentions.

| Dimension | Description | What Low Scores Mean |
|---|---|---|
| Sentiment | Overall tone toward the brand | Mentioned in negative or critical contexts |
| Accuracy | Factual correctness of stated information | Non-existent features claimed, wrong prices, incorrect dates |
| Narrative Alignment | Alignment with the brand’s intended core messaging | AI positions the brand differently than intended |
| Hallucination Risk | Proportion of AI-generated information that contradicts facts | Non-existent features, services, or prices presented as real |

```mermaid
graph TD
    R["Brand Mention in AI Response"] --> S["Sentiment Analysis"]
    R --> A["Accuracy Verification"]
    R --> N["Narrative Alignment"]
    R --> H["Hallucination Detection"]

    S --> Q["L3 Quality Score"]
    A --> Q
    N --> Q
    H --> Q

    Q --> I1{"High?"}
    I1 -->|Yes| G["Exposure benefits the brand"]
    I1 -->|No| B["Exposure may harm the brand"]

    style Q fill:#e8f5e9,stroke:#4CAF50
    style G fill:#c8e6c9,stroke:#388E3C
    style B fill:#ffcdd2,stroke:#c62828
```

The Severity of Hallucination

In AI search, hallucination is not merely a technical error — it is a business risk. If AI states “this service offers a free trial” when no free trial exists, users become disappointed and brand trust erodes. In regulated industries (finance, healthcare), AI-amplified misinformation can create legal exposure.

The Quality layer detects such hallucinations and clearly identifies cases where high Inclusion coexists with low Quality — a diagnosis impossible with simple mention counting.

Interpretation Guide

| Quality Level | Meaning | Priority Action |
|---|---|---|
| Very low | Mentioned with negative or inaccurate information | Create correction content, update official information sources |
| Low | Neutral but core messaging not reflected | Strengthen USP-focused content |
| Medium | Mostly accurate with some inaccuracies | Target specific inaccuracies for correction |
| High | Accurate and positively described | Maintain + fine-tune narrative alignment |
| Very high | Core messaging accurately reflected, positive tone, no hallucinations | Maintain ideal state |

The Quality Paradox: High Inclusion + Low Quality

The most dangerous pattern is high L1 with low L3. This combination means exposure is actively harming the brand.

| Scenario | Example | Risk Level |
|---|---|---|
| Inaccurate information spreading | “This service is free” (actually paid) | High — directly causes user disappointment |
| Competitor-favorable context | “A more affordable alternative to A is B” | Medium — indirect revenue loss |
| Outdated information | Described with 2-year-old pricing or features | Medium — user confusion |
| Negative tone | “Known for having many issues” | High — brand image damage |

L4 — Stability (Consistency)

Core Question

“Can we trust the measurement results? Do repeated runs produce consistent outcomes?”

Stability differs in nature from the other three layers. While L1-L3 measure “what is the current state,” L4 verifies “how trustworthy is that measurement.” AI search responses can vary for the same query based on execution timing, model version, region, and more.

Why AI Search Is Inherently Unstable

Traditional search engines return relatively stable results for the same query. Google’s SERP changes over time but does not swing dramatically day to day. AI search engines (ChatGPT, Perplexity, Gemini, etc.) are structurally more volatile.

| Volatility Factor | Description |
|---|---|
| Model updates | AI model updates change responses to identical queries |
| Temperature parameter | Generation randomness means the same query can yield different results |
| Context window | Prior conversation context can alter responses to the same query |
| Real-time data | Some AI engines incorporate live web data, causing time-dependent variation |
| Region and language settings | User settings can alter responses to the same query |

What It Measures

| Sub-Signal | Description | Activation Condition |
|---|---|---|
| Response Drift | GEO Score difference between current and previous results for the same query | 2+ runs |
| Citation Volatility | Brand’s inclusion/exclusion fluctuation in citation lists | 2+ runs |
| Prompt Sensitivity | Result differences across similar query variations | 2+ runs |
| Model Version Drift | Result differences across AI model versions | 2+ runs |

```mermaid
flowchart TD
    R1["Run 1"] --> S1["L1-L3 Score Set A"]
    R2["Run 2"] --> S2["L1-L3 Score Set B"]
    R3["Run N"] --> S3["L1-L3 Score Set N"]

    S1 --> CMP["Cross-Run Comparison"]
    S2 --> CMP
    S3 --> CMP

    CMP --> D{"Variance level?"}
    D -->|"Low"| ST["L4 High<br/>Results are trustworthy"]
    D -->|"High"| UN["L4 Low<br/>Results unreliable"]

    style ST fill:#c8e6c9,stroke:#388E3C
    style UN fill:#ffcdd2,stroke:#c62828
```

Why 2+ Runs Are Required

Stability is fundamentally a comparative metric: consistency cannot be assessed from a single measurement. This is an inherent limitation of measurement, and one that WICHI acknowledges explicitly.

On a single run, L4 is deactivated. Expressing “stability cannot yet be assessed” by simply not measuring it is more honest than generating a score with insufficient data. Explicit uncertainty beats false confidence.
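
A small sketch of how this could be encoded: return a score only when two or more runs exist, otherwise return nothing at all. The dispersion formula and the 0.5 normalization constant are illustrative assumptions, not WICHI’s formula.

```python
from statistics import pstdev

def stability_score(run_scores: list[float]) -> float | None:
    """Map cross-run dispersion of a layer score to [0, 1].

    Returns None on a single run: stability is undefined, not zero.
    Normalizes population std-dev against an assumed maximum plausible
    spread of 0.5; both choices are purely illustrative.
    """
    if len(run_scores) < 2:
        return None  # explicit uncertainty beats false confidence
    spread = pstdev(run_scores)
    return max(0.0, 1.0 - spread / 0.5)

print(stability_score([0.72]))              # None -> not measurable
print(stability_score([0.70, 0.72, 0.71]))  # ~0.98 -> consistent runs
print(stability_score([0.90, 0.30, 0.65]))  # ~0.51 -> volatile runs
```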

Interpretation Guide

| Stability Level | Meaning | Implication |
|---|---|---|
| Not measurable | Single run completed only | Current L1-L3 scores are reference only, insufficient for decision-making |
| Low | Results vary significantly across runs | Defer L1-L3-based decisions, additional measurement needed |
| Medium | Some variation exists but overall trend holds | L1-L3 directional trends are reliable, specific numbers are reference only |
| High | Consistent results across repeated runs | L1-L3 scores can be used for decision-making |

Business Significance of Stability

The Stability layer is also a key SaaS differentiator. One-time measurement tools can only provide L1-L3. A recurring subscription model provides L4 through repeated measurement. This layer demonstrates with data “why continuous monitoring is needed” rather than “measure once and done.”

In an environment where AI search engines continuously update and responses continuously change, the value of one-time measurement depreciates rapidly. The Stability layer quantifies this depreciation rate.


Inter-Layer Relationships

Dependency Structure

The four layers are measured independently, but interpretation follows a clear dependency structure.

```mermaid
graph BT
    L4["L4: Stability<br/>Confidence Layer"] -.->|"Determines confidence<br/>of all layers"| L1
    L4 -.-> L2
    L4 -.-> L3

    L1["L1: Inclusion<br/>Prerequisite Layer"] -->|"L1 > 0 required"| L2["L2: Prominence<br/>Position & Weight Layer"]
    L1 -->|"L1 > 0 required"| L3["L3: Quality<br/>Content & Accuracy Layer"]

    style L1 fill:#e8f4f8,stroke:#2196F3
    style L2 fill:#fff3e0,stroke:#FF9800
    style L3 fill:#e8f5e9,stroke:#4CAF50
    style L4 fill:#fce4ec,stroke:#E91E63
```

L1 is the prerequisite for L2 and L3. You cannot discuss position or quality when the brand is not mentioned at all. When L1 is 0, L2 and L3 are N/A.

L2 and L3 are independent of each other. A brand can be described prominently but inaccurately (high L2, low L3), or accurately but inconspicuously (low L2, high L3). These two layers measure different axes.

L4 is a meta-layer. It measures not the “value” but the “confidence” of L1-L3. Low L4 means L1-L3 scores, however good they look, are unreliable for decision-making.
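
One way to encode these dependencies so that unmeasurable layers surface as N/A (here `None`) rather than as misleading zeros. The gating rules follow the description above; the structure and field names are a hypothetical sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeoLayers:
    inclusion: float                    # L1: always measured
    prominence: Optional[float] = None  # L2: None when L1 == 0
    quality: Optional[float] = None     # L3: None when L1 == 0
    stability: Optional[float] = None   # L4: None until 2+ runs

def build_layers(l1: float, l2: float, l3: float,
                 run_count: int, l4: Optional[float]) -> GeoLayers:
    """Apply the dependency rules: L1 gates L2/L3, run count gates L4."""
    if l1 == 0:
        return GeoLayers(inclusion=0.0)  # L2/L3 stay N/A, not 0
    return GeoLayers(
        inclusion=l1,
        prominence=l2,
        quality=l3,
        stability=l4 if run_count >= 2 else None,
    )
```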

Key Layer Combination Patterns

With four layers each having high/low states, 16 theoretical combinations exist. Since L1 being low makes L2 and L3 irrelevant, practical patterns are more limited. Here are the most commonly observed patterns:

| Pattern Name | L1 | L2 | L3 | L4 | Diagnosis | Priority Action |
|---|---|---|---|---|---|---|
| Invisible | Low | N/A | N/A | N/A | Does not exist in AI | Content + source acquisition |
| Wallflower | High | Low | High | High | Mentioned but low weight | Strengthen positioning |
| Backfire | High | High | Low | High | Prominently wrong | Urgent information correction |
| Honor student | High | High | High | High | Ideal state | Maintain + expand |
| Unstable honor student | High | High | High | Low | Good but unstable | Continuous monitoring |
| Misunderstood | High | Med | Low | High | Consistently misrepresented | Fundamental content overhaul |

“The same aggregate score can represent entirely different states. Without examining layer patterns, you will prescribe the wrong treatment.”

Interpretation Order

Layer dependencies determine the interpretation sequence:

  1. Check L1 first. If L1 is very low, there is no point discussing other layers. “First, you must exist.”
  2. If L1 is sufficient, check L2 and L3 together. Diagnose whether position (L2), content (L3), or both are problematic.
  3. Check L4 last. Determine how trustworthy the L1-L3 diagnosis is. If L4 is low, treat steps 1-2 as provisional and seek additional measurement.

This sequence resembles medical triage. Rather than interpreting all tests simultaneously, you confirm prerequisites first and progressively deepen the analysis.
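
The triage sequence can be expressed directly as a routine that walks the layers in dependency order. The thresholds and diagnosis labels below are illustrative, borrowed from the pattern table above.

```python
from typing import Optional

def triage(l1: float,
           l2: Optional[float],
           l3: Optional[float],
           l4: Optional[float],
           high: float = 0.7, low: float = 0.3) -> str:
    """Interpret layer scores in dependency order; None means N/A."""
    # Step 1: existence first ("first, you must exist")
    if l1 < low:
        return "Invisible: build content and acquire external sources"
    # Step 2: position and content together
    if l3 is not None and l3 < low:
        diagnosis = "Backfire: correct information urgently"
    elif l2 is not None and l2 < low:
        diagnosis = "Wallflower: strengthen positioning"
    else:
        diagnosis = "Honor student: maintain and expand"
    # Step 3: confidence check last
    if l4 is None:
        return diagnosis + " (provisional: stability not yet measurable)"
    if l4 < high:
        return diagnosis + " (provisional: unstable, re-measure)"
    return diagnosis
```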


Design Philosophy

Principle 1: Measure What Matters, Not What Is Easy

“Measure what matters, not what’s easy.”

Counting brand mentions is easy — a single regex suffices. But determining whether those mentions are positive, accurate, and stable requires a far more complex pipeline. The Quality layer’s sentiment analysis, accuracy verification, and narrative alignment assessment involve LLM-based evaluation. The Stability layer demands the cost of repeated execution.

The reason for accepting this complexity is simple: decisions based on simple measurements lead to wrong actions. “Mention count increased” is less useful than “mention count increased but so did the proportion of inaccurate information.”

Principle 2: Decomposition Over Aggregation

WICHI does provide an approximate composite score from the 4 layers. However, this composite is a dashboard summary, not the basis for decisions.

An example with two brands sharing an identical aggregate score of 65 makes this clear:

| | Brand A | Brand B |
|---|---|---|
| L1 Inclusion | 90 | 60 |
| L2 Prominence | 80 | 70 |
| L3 Quality | 30 | 70 |
| L4 Stability | 60 | 60 |
| Composite (reference) | ~65 | ~65 |
| Diagnosis | Backfire — high exposure, inaccurate | Wallflower — insufficient exposure |
| Priority action | Urgent information correction | Content strategy expansion |

With the same score, the required actions are diametrically opposed. Brand A might actually benefit from reducing exposure (stopping misinformation spread), while Brand B needs to increase it. The aggregate score alone makes this distinction impossible.

Principle 3: Explicit Uncertainty

The decision to deactivate Stability on a single run reflects this principle. Estimating a score when data is absent is less honest than stating “this dimension cannot yet be measured.”

The same principle applies to other layers. When L1 is 0, L2 and L3 display N/A rather than arbitrary values. Presenting unmeasurable things as measured is misleading to users.

Principle 4: Diagnosis Leads to Prescription

The key outcome of this architecture is not the final aggregate score. It is layer-level patterns. Patterns determine diagnosis, and diagnosis determines prescription (action).

| Pattern | Diagnosis | Prescription |
|---|---|---|
| Low L1 | Absent presence | Content creation, external source acquisition, structured data |
| High L1, Low L2 | Insufficient weight | Strengthen differentiation, create comparison content |
| High L1, Low L3 | Harmful exposure | Correct information, update official sources, strengthen FAQ |
| High L1-L3, Low L4 | Unstable | Regular monitoring, track AI model changes |
| High L1-L4 | Optimal state | Maintain + expand into new query territories |

The design’s purpose is not to raise scores per se, but to diagnose which layer has problems and apply the appropriate remedy.


Generalizability

Beyond WICHI

The 4-layer framework was designed for a specific product, but conceptually it applies to any system measuring AI search visibility. The four core questions — does it exist, is it prominent, is it accurate, is it consistent — are equally valid whether the subject is a brand, a product, or an information source.

Extension to Other Domains

| Domain | L1 Interpretation | L2 Interpretation | L3 Interpretation | L4 Interpretation |
|---|---|---|---|---|
| Brand GEO | Mention presence | Position in response | Sentiment/accuracy | Cross-run consistency |
| Academic sources | Citation presence | Citation position/weight | Citation accuracy | Temporal stability |
| News sources | Reference presence | Headline/body placement | Factual accuracy | Issue persistence |
| Product comparison | Candidate inclusion | Recommendation rank | Spec accuracy | Query variation stability |

Specific implementations (sub-signals, weights, evaluation methods) vary by domain, but the 4-layer structure follows a universal logic: Exist, Stand Out, Be Accurate, Be Consistent.

Implementation Considerations

For those applying this framework in their own systems:

  1. L1 is easiest to implement, L3 is hardest. Start with text matching (L1) and incrementally increase complexity.
  2. L4 requires time. It activates only after 2+ measurements, so early operation uses L1-L3 only.
  3. Inter-layer weights should vary by domain. In healthcare or finance, where accuracy is paramount, L3 should carry high weight. For early-stage startups, where exposure itself is the priority, L1 should dominate. (A configuration sketch follows this list.)
  4. If using an LLM Judge for Quality measurement, the Judge’s own consistency (Stability) becomes a concern. Judge model responses also fluctuate, so Judge-level stability must be managed separately.
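
Point 3 above might reduce to a simple per-domain weight table, sketched below with invented values; nothing here reflects WICHI’s actual weights.

```python
# Hypothetical per-domain layer weights; each row sums to 1.0.
DOMAIN_WEIGHTS: dict[str, dict[str, float]] = {
    "early_stage_startup": {"L1": 0.50, "L2": 0.25, "L3": 0.15, "L4": 0.10},
    "healthcare":          {"L1": 0.20, "L2": 0.15, "L3": 0.50, "L4": 0.15},
    "finance":             {"L1": 0.20, "L2": 0.15, "L3": 0.50, "L4": 0.15},
    "default":             {"L1": 0.30, "L2": 0.25, "L3": 0.30, "L4": 0.15},
}

def composite(scores: dict[str, float], domain: str = "default") -> float:
    """Reference-only composite; layer patterns, not this number, drive action."""
    w = DOMAIN_WEIGHTS.get(domain, DOMAIN_WEIGHTS["default"])
    return sum(w[layer] * scores.get(layer, 0.0) for layer in w)
```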

Summary

The 4-layer GEO Score design answers these questions in order:

```mermaid
flowchart LR
    Q1["Does it exist?<br/>L1 Inclusion"] --> Q2["Is it prominent?<br/>L2 Prominence"]
    Q2 --> Q3["Is it accurate?<br/>L3 Quality"]
    Q3 --> Q4["Can we trust it?<br/>L4 Stability"]

    style Q1 fill:#e8f4f8,stroke:#2196F3
    style Q2 fill:#fff3e0,stroke:#FF9800
    style Q3 fill:#e8f5e9,stroke:#4CAF50
    style Q4 fill:#fce4ec,stroke:#E91E63
```

Each layer is measured independently, but interpretation follows a sequence. L1 is the prerequisite, L4 validates confidence. Layer-level patterns matter more than aggregate scores — patterns determine diagnosis, and diagnosis determines action.

What this design deliberately rejects is the convenience of a single number. “Inclusion High, Prominence High, Quality Low, Stability Not Measured” in four lines carries more information and connects to more precise action than “GEO Score 72” in one line.
