
GEO Score 4-Layer Metric Design

MJ · 13 min read

Analysis of the design philosophy and hierarchical dependencies of the four GEO Score layers (Inclusion, Prominence, Quality, Stability), and their implications for decision-making.

Scope of This Post

WICHI’s GEO Score measures how a brand is represented in AI search engine responses. This post covers the design philosophy and conceptual framework behind the 4 layers that constitute the GEO Score. Specific weights, algorithms, and scoring formulas are core product IP and are not disclosed here.


Why Four Layers

The Structural Limitation of a Single Score

Producing a single GEO Score number is easy. Count brand mentions, or convert exposure rate to a percentage. The problem is that a single number cannot tell you what is going well and what is broken.

A few real scenarios make this limitation clear:

  • Mentioned but negatively. The brand appears in AI responses, but in contexts like “more expensive than X” or “alternatives include Y.” Exposure exists, so a single score looks fine — but the mention actually benefits competitors.
  • Frequently mentioned but inaccurate. The brand is mentioned often, but AI describes features it does not offer or gets pricing wrong. If users trust and act on this, it erodes brand credibility.
  • Great today, gone tomorrow. Running the same query today puts the brand at #1; running it tomorrow omits it entirely. A one-time measurement cannot capture this instability.
  • Top position, negative context. The brand appears in the first sentence of the response, but as “a service to be cautious about.” Positionally it is the top mention, but effectively it is harmful.
  • Many mentions, no citations. The brand name appears multiple times, but no source links or domain citations are included. This suggests AI “knows about” the brand but does not “trust” it as a source.

All these scenarios collapse into the same number under a single-score system. A score of 70 could mean “strong visibility,” “frequently mentioned but inaccurate,” or “good today, uncertain tomorrow” — and there is no way to tell which.

Problems with Existing Approaches

Common approaches to measuring AI search visibility each have inherent limitations:

| Approach | How It Measures | Limitation |
|---|---|---|
| Simple mention counting | Count brand name appearances | Ignores context (positive/negative) |
| Binary exposure check | Mentioned / not mentioned | No position, weight, or quality information |
| Sentiment analysis alone | Positive/negative/neutral classification | No exposure or position information |
| Rank-based | Rank vs. competitors | Does not reflect consistency (stability) |
| SOV (Share of Voice) alone | Brand’s share of total mentions | Ignores quality and accuracy |

The common thread: one-dimensionality. Each approach captures only one facet of AI search visibility while ignoring the rest. The result is that different approaches can produce contradictory conclusions from the same data.

“Measure what matters, not what’s easy. A single number is easy to produce but cannot carry the information needed for decision-making.”

The Answer: Layered Architecture

WICHI instead chose an architecture of four independently measured layers, guided by three core principles:

  1. Each layer answers one question. “Is it mentioned?”, “Is it prominent?”, “Is it accurate?”, “Is it consistent?” — each question is independent, and each answer demands different action.
  2. Logical dependencies exist between layers. You cannot measure the quality of a brand that is not mentioned. These dependencies determine the interpretation order.
  3. Layer-level patterns matter more than aggregate scores. A total of 70 means entirely different things depending on how the score is distributed across layers.

```mermaid
graph TD
    A["GEO Score"] --> B["L1: Inclusion<br/>Exposure"]
    A --> C["L2: Prominence<br/>Visibility"]
    A --> D["L3: Quality<br/>Accuracy & Sentiment"]
    A --> E["L4: Stability<br/>Consistency"]

    B -->|"Prerequisite"| C
    B -->|"Prerequisite"| D
    E -->|"Confidence check"| B
    E -->|"Confidence check"| C
    E -->|"Confidence check"| D

    style B fill:#e8f4f8,stroke:#2196F3
    style C fill:#fff3e0,stroke:#FF9800
    style D fill:#e8f5e9,stroke:#4CAF50
    style E fill:#fce4ec,stroke:#E91E63
```

Note the arrow directions. L1 (Inclusion) is a prerequisite for L2 and L3, while L4 (Stability) validates the confidence of all other layers. This structural relationship is what differentiates a 4-layer architecture from merely listing four scores.


Layer Details

Layer Overview

| Layer | Name | Core Question | What It Measures | Activation Condition |
|---|---|---|---|---|
| L1 | Inclusion (Exposure) | Is the brand mentioned? | Presence, citation inclusion, share of voice | Always |
| L2 | Prominence (Visibility) | How visibly is it positioned? | Position within response, weight, depth of coverage | L1 > 0 |
| L3 | Quality (Accuracy) | Is the content accurate and favorable? | Sentiment, accuracy, narrative alignment | L1 > 0 |
| L4 | Stability (Consistency) | Are results consistent across runs? | Cross-run variance, volatility | 2+ runs required |

L1 — Inclusion (Exposure)

Core Question

“Does the AI search engine recognize this brand’s existence?”

Inclusion is the most fundamental layer. It measures whether the brand name appears in AI search responses. If this score is low, the other three layers are moot — you cannot assess the quality or stability of a brand that is not mentioned.

Why Not a Binary Value

Treating Inclusion as simple Yes/No loses information. “Mentioned in 1 of 10 queries” and “mentioned in 9 of 10 queries” are both “mentioned,” but represent completely different states. Inclusion therefore combines multiple sub-signals into a continuous score between 0 and 1.

What It Measures

| Sub-Signal | Description | Why It Is Needed |
|---|---|---|
| Brand mention presence | Does the brand name appear in response text? | Most basic existence check |
| Citation inclusion | Is the brand’s domain included in source citations? | Whether AI treats the brand as a trustworthy source |
| Share of Voice (SOV) | Brand’s share among all brands mentioned in the response | Relative position within the category |

These three sub-signals are combined because each carries different meaning. Appearing in text versus being cited as a source are different things. An AI engine may mention a brand in text without providing citation links — signaling “aware of it but does not trust it as an official source.”
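
As an illustration only, here is a minimal Python sketch of how such sub-signals could combine into a continuous score. The field names, weights, and formula are hypothetical; WICHI’s actual scoring is not disclosed.

```python
from dataclasses import dataclass

@dataclass
class InclusionSignals:
    """Sub-signals extracted from one AI response (hypothetical names)."""
    mentioned: bool        # brand name appears in response text
    cited: bool            # brand domain appears in the citation list
    share_of_voice: float  # brand mentions / all brand mentions, in [0, 1]

def inclusion_score(signals: InclusionSignals,
                    w_mention: float = 0.4,
                    w_citation: float = 0.3,
                    w_sov: float = 0.3) -> float:
    """Combine sub-signals into a continuous [0, 1] score.

    Weights are illustrative only; the production formula is not public.
    """
    if not signals.mentioned:
        return 0.0  # no mention -> L1 is zero, downstream layers become N/A
    return (w_mention * 1.0
            + w_citation * (1.0 if signals.cited else 0.0)
            + w_sov * signals.share_of_voice)
```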

Interpretation Guide

| Inclusion Level | Meaning | Priority Action |
|---|---|---|
| Very low | Brand barely recognized in AI search | Secure external sources, provide structured data |
| Low | Sporadic exposure in select queries only | Develop per-query content strategy |
| Medium | Intermittent exposure in category queries | Focus on improving exposure consistency |
| High | Stable exposure across most relevant queries | Shift to L2-L3 optimization |
| Very high | Exposure + citation in nearly all queries | Maintain + expand into new query territories |

```mermaid
flowchart LR
    Q["AI Search Query"] --> R["AI Response Generated"]
    R --> M{"Brand<br/>mentioned?"}
    M -->|No| X["L1 = 0<br/>L2-L4 cannot be measured"]
    M -->|Yes| C{"Citation<br/>included?"}
    C -->|No| S1["Text mention only"]
    C -->|Yes| S2["Text + citation present"]
    S1 --> SOV["SOV Calculation"]
    S2 --> SOV
    SOV --> L1["L1 Inclusion Score"]

    style X fill:#ffebee,stroke:#c62828
    style L1 fill:#e8f4f8,stroke:#2196F3
```

The Inclusion Trap

High Inclusion is not unconditionally good. Appearing on a “worst services” list also produces high Inclusion. This is precisely why L3 (Quality) exists, and why Inclusion alone must never be used to judge GEO status.

Traditional SEO has a parallel: high impressions mean nothing if click-through rate is low. Inclusion is the GEO equivalent of impressions — the other layers play the roles of CTR and conversion.


L2 — Prominence (Visibility)

Core Question

“Where in the response does the brand appear, and how much weight does it receive?”

If Inclusion measures “whether it exists,” Prominence measures “quality of existence.” Being described in detail as the top recommendation versus being listed as a one-liner under “other options” affects user behavior in completely different ways.

Why Position Matters

The position effect in AI search responses is similar to but more extreme than traditional SERP position effects. In SERPs, users can scroll through the page. AI search responses are typically presented as a single continuous text block. Users who get their answer from the first portion likely never read the rest.

Aggarwal et al. (KDD 2024) proposed PAWC (Position-Adjusted Word Count), a concept that quantifies this phenomenon. Brand-related text positioned earlier in the response receives higher weight.

What It Measures

| Sub-Signal | Description | What It Reflects |
|---|---|---|
| Position-adjusted score (PAWC-based) | Higher score for earlier position and greater volume | User attention pattern favoring earlier content |
| Brand mention weight | Ratio of brand-related content to total response length | How deeply AI covers the brand |

PAWC Explained

The core idea is straightforward: the same volume of text has different visibility depending on whether it appears in the 1st paragraph or the 5th. Earlier text is read by more users; later text may not be read at all.

| Scenario | Position in Response | Brand-Related Volume | Relative PAWC |
|---|---|---|---|
| A | First paragraph | 50 words | High |
| B | Third paragraph | 50 words | Medium |
| C | Last paragraph | 50 words | Low |
| D | First + third paragraphs | 30 + 20 words | Lower than A, higher than B |

Scenarios A and C allocate the same 50 words to the brand, but their PAWC-measured visibility differs dramatically. This is the gap that simple mention counting or word-count tallying cannot capture.
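
To make the idea concrete, here is a rough sketch of position-adjusted counting in the spirit of PAWC. The 1/position decay and the paragraph-level brand matching are illustrative assumptions, not the published formula.

```python
def position_adjusted_word_count(paragraphs: list[str], brand: str) -> float:
    """Weight brand-related word counts by paragraph position.

    Earlier paragraphs get higher weight (1/position, an illustrative
    decay choice). Simplification: counts every word in any paragraph
    that mentions the brand, rather than isolating brand-related spans.
    """
    score = 0.0
    for i, para in enumerate(paragraphs, start=1):
        if brand.lower() in para.lower():
            words = len(para.split())
            score += words / i  # paragraph 1 counts fully, paragraph 3 at 1/3
    return score

# E.g., 50 brand-adjacent words in paragraph 1 score 50.0;
# the same 50 words in paragraph 5 score only 10.0.
```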

Interpretation Guide

| Prominence Level | Meaning | Typical State |
|---|---|---|
| Very low | Mentioned but peripheral | “Other options” list, bottom of comparison table |
| Low | Present but not noticeable | Brief mention in mid-position |
| Medium | Covered with meaningful weight | One of several options described in detail |
| High | Among top recommendations | Introduced early as a primary option |
| Very high | First recommendation, most detailed coverage | Presented as the core answer |

Relationship Between Prominence and Inclusion

Prominence is meaningful only when Inclusion exceeds 0 — you cannot discuss the position of a brand that is not mentioned. However, high Inclusion does not guarantee high Prominence. A brand mentioned in all 10 queries (high L1) but each time as a brief last-line note (low L2) is entirely possible.

Patterns created by combining these two layers:

| Pattern | L1 | L2 | Interpretation |
|---|---|---|---|
| Invisible | Low | N/A | AI does not recognize the brand |
| Wallflower | High | Low | Mentioned but with low weight |
| Spotlighted | High | High | Well-exposed with strong weight |
| Occasional spotlight | Medium | High | High weight in select queries only |

L3 — Quality (Accuracy and Sentiment)

Core Question

“Is what AI says about the brand accurate and favorable?”

Quality evaluates the “content” of exposure. Being visible and prominent means nothing if the content is inaccurate or negative — in fact, because users tend to trust AI responses and act on them, high-prominence inaccurate content can be worse than no visibility at all.

Why Quality Is the Most Complex Layer

Inclusion can be measured via text matching. Prominence can be measured through structural attributes — position and volume. Quality requires evaluating the meaning of text. “This service is expensive but good” and “This service is good but expensive” use nearly identical words but convey different nuances.

This is why Quality measurement incorporates LLM-based evaluation (LLM-as-a-Judge). Simple keyword matching or rule-based sentiment analysis cannot capture contextual meaning at this level.
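
For intuition, here is a minimal sketch of what an LLM-as-a-Judge call for these dimensions might look like. The prompt, rubric, and `call_llm` client are hypothetical placeholders; WICHI’s actual evaluation pipeline is not disclosed.

```python
import json

# Hypothetical judge rubric; the real prompt and schema are not public.
JUDGE_PROMPT = """You are evaluating how an AI search response portrays a brand.
Brand: {brand}
Response excerpt: {excerpt}

Rate each dimension from 0 (worst) to 1 (best) and return JSON:
{{"sentiment": ..., "accuracy_confidence": ..., "narrative_alignment": ...}}
Base accuracy_confidence only on claims checkable against the brand
facts below; do not guess.
Brand facts: {facts}"""

def judge_quality(brand: str, excerpt: str, facts: str, call_llm) -> dict:
    """Ask a judge model to score quality dimensions.

    `call_llm` stands in for any chat-completion client that takes a
    prompt string and returns the model's text output.
    """
    raw = call_llm(JUDGE_PROMPT.format(brand=brand, excerpt=excerpt, facts=facts))
    return json.loads(raw)  # production code would validate the schema
```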

What It Measures

Quality combines multiple dimensions, each evaluating a different aspect of brand mentions.

| Dimension | Description | What Low Scores Mean |
|---|---|---|
| Sentiment | Overall tone toward the brand | Mentioned in negative or critical contexts |
| Accuracy | Factual correctness of stated information | Non-existent features claimed, wrong prices, incorrect dates |
| Narrative Alignment | Alignment with the brand’s intended core messaging | AI positions the brand differently than intended |
| Hallucination Risk | Proportion of AI-generated information that contradicts facts | Non-existent features, services, or prices presented as real |

```mermaid
graph TD
    R["Brand Mention in AI Response"] --> S["Sentiment Analysis"]
    R --> A["Accuracy Verification"]
    R --> N["Narrative Alignment"]
    R --> H["Hallucination Detection"]

    S --> Q["L3 Quality Score"]
    A --> Q
    N --> Q
    H --> Q

    Q --> I1{"High?"}
    I1 -->|Yes| G["Exposure benefits the brand"]
    I1 -->|No| B["Exposure may harm the brand"]

    style Q fill:#e8f5e9,stroke:#4CAF50
    style G fill:#c8e6c9,stroke:#388E3C
    style B fill:#ffcdd2,stroke:#c62828
```

The Severity of Hallucination

In AI search, hallucination is not merely a technical error — it is a business risk. If AI states “this service offers a free trial” when no free trial exists, users become disappointed and brand trust erodes. In regulated industries (finance, healthcare), AI-amplified misinformation can create legal exposure.

The Quality layer detects such hallucinations and clearly identifies cases where high Inclusion coexists with low Quality — a diagnosis impossible with simple mention counting.

Interpretation Guide

| Quality Level | Meaning | Priority Action |
|---|---|---|
| Very low | Mentioned with negative or inaccurate information | Create correction content, update official information sources |
| Low | Neutral but core messaging not reflected | Strengthen USP-focused content |
| Medium | Mostly accurate with some inaccuracies | Target specific inaccuracies for correction |
| High | Accurate and positively described | Maintain + fine-tune narrative alignment |
| Very high | Core messaging accurately reflected, positive tone, no hallucinations | Maintain ideal state |

The Quality Paradox: High Inclusion + Low Quality

The most dangerous pattern is high L1 with low L3. This combination means exposure is actively harming the brand.

| Scenario | Example | Risk Level |
|---|---|---|
| Inaccurate information spreading | “This service is free” (actually paid) | High — directly causes user disappointment |
| Competitor-favorable context | “A more affordable alternative to A is B” | Medium — indirect revenue loss |
| Outdated information | Described with 2-year-old pricing or features | Medium — user confusion |
| Negative tone | “Known for having many issues” | High — brand image damage |

L4 — Stability (Consistency)

Core Question

“Can we trust the measurement results? Do repeated runs produce consistent outcomes?”

Stability differs in nature from the other three layers. While L1-L3 measure “what is the current state,” L4 verifies “how trustworthy is that measurement.” AI search responses can vary for the same query based on execution timing, model version, region, and more.

Why AI Search Is Inherently Unstable

Traditional search engines return relatively stable results for the same query. Google’s SERP changes over time but does not swing dramatically day to day. AI search engines (ChatGPT, Perplexity, Gemini, etc.) are structurally more volatile.

| Volatility Factor | Description |
|---|---|
| Model updates | AI model updates change responses to identical queries |
| Temperature parameter | Generation randomness means the same query can yield different results |
| Context window | Prior conversation context can alter responses to the same query |
| Real-time data | Some AI engines incorporate live web data, causing time-dependent variation |
| Region and language settings | User settings can alter responses to the same query |

What It Measures

| Sub-Signal | Description | Activation Condition |
|---|---|---|
| Response Drift | GEO Score difference between current and previous results for the same query | 2+ runs |
| Citation Volatility | Brand’s inclusion/exclusion fluctuation in citation lists | 2+ runs |
| Prompt Sensitivity | Result differences across similar query variations | 2+ runs |
| Model Version Drift | Result differences across AI model versions | 2+ runs |

```mermaid
flowchart TD
    R1["Run 1"] --> S1["L1-L3 Score Set A"]
    R2["Run 2"] --> S2["L1-L3 Score Set B"]
    R3["Run N"] --> S3["L1-L3 Score Set N"]

    S1 --> CMP["Cross-Run Comparison"]
    S2 --> CMP
    S3 --> CMP

    CMP --> D{"Variance level?"}
    D -->|"Low"| ST["L4 High<br/>Results are trustworthy"]
    D -->|"High"| UN["L4 Low<br/>Results unreliable"]

    style ST fill:#c8e6c9,stroke:#388E3C
    style UN fill:#ffcdd2,stroke:#c62828
```

Why 2+ Runs Are Required

Stability is fundamentally a comparative metric: consistency cannot be assessed from a single measurement. This is an inherent limitation of measurement, and one that WICHI acknowledges explicitly.

On a single run, L4 is deactivated. Expressing “stability cannot yet be assessed” by simply not measuring it is more honest than generating a score with insufficient data. Explicit uncertainty beats false confidence.
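
A small sketch of how this could be encoded: return a score only when two or more runs exist, otherwise return nothing at all. The dispersion formula and the 0.5 normalization constant are illustrative assumptions, not WICHI’s formula.

```python
from statistics import pstdev

def stability_score(run_scores: list[float]) -> float | None:
    """Map cross-run dispersion of a layer score to [0, 1].

    Returns None on a single run: stability is undefined, not zero.
    Normalizes population std-dev against an assumed maximum plausible
    spread of 0.5; both choices are purely illustrative.
    """
    if len(run_scores) < 2:
        return None  # explicit uncertainty beats false confidence
    spread = pstdev(run_scores)
    return max(0.0, 1.0 - spread / 0.5)

print(stability_score([0.72]))              # None -> not measurable
print(stability_score([0.70, 0.72, 0.71]))  # ~0.98 -> consistent runs
print(stability_score([0.90, 0.30, 0.65]))  # ~0.51 -> volatile runs
```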

Interpretation Guide

| Stability Level | Meaning | Implication |
|---|---|---|
| Not measurable | Single run completed only | Current L1-L3 scores are reference only, insufficient for decision-making |
| Low | Results vary significantly across runs | Defer L1-L3-based decisions, additional measurement needed |
| Medium | Some variation exists but overall trend holds | L1-L3 directional trends are reliable, specific numbers are reference only |
| High | Consistent results across repeated runs | L1-L3 scores can be used for decision-making |

Business Significance of Stability

The Stability layer is also a key SaaS differentiator. One-time measurement tools can only provide L1-L3. A recurring subscription model provides L4 through repeated measurement. This layer demonstrates with data “why continuous monitoring is needed” rather than “measure once and done.”

In an environment where AI search engines continuously update and responses continuously change, the value of one-time measurement depreciates rapidly. The Stability layer quantifies this depreciation rate.


Inter-Layer Relationships

Dependency Structure

The four layers are measured independently, but interpretation follows a clear dependency structure.

```mermaid
graph BT
    L4["L4: Stability<br/>Confidence Layer"] -.->|"Determines confidence<br/>of all layers"| L1
    L4 -.-> L2
    L4 -.-> L3

    L1["L1: Inclusion<br/>Prerequisite Layer"] -->|"L1 > 0 required"| L2["L2: Prominence<br/>Position & Weight Layer"]
    L1 -->|"L1 > 0 required"| L3["L3: Quality<br/>Content & Accuracy Layer"]

    style L1 fill:#e8f4f8,stroke:#2196F3
    style L2 fill:#fff3e0,stroke:#FF9800
    style L3 fill:#e8f5e9,stroke:#4CAF50
    style L4 fill:#fce4ec,stroke:#E91E63
```

L1 is the prerequisite for L2 and L3. You cannot discuss position or quality when the brand is not mentioned at all. When L1 is 0, L2 and L3 are N/A.

L2 and L3 are independent of each other. A brand can be described prominently but inaccurately (high L2, low L3), or accurately but inconspicuously (low L2, high L3). These two layers measure different axes.

L4 is a meta-layer. It measures not the “value” but the “confidence” of L1-L3. Low L4 means L1-L3 scores, however good they look, are unreliable for decision-making.
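
One way to encode these dependencies so that unmeasurable layers surface as N/A (here `None`) rather than as misleading zeros. The gating rules follow the description above; the structure and field names are a hypothetical sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeoLayers:
    inclusion: float                    # L1: always measured
    prominence: Optional[float] = None  # L2: None when L1 == 0
    quality: Optional[float] = None     # L3: None when L1 == 0
    stability: Optional[float] = None   # L4: None until 2+ runs

def build_layers(l1: float, l2: float, l3: float,
                 run_count: int, l4: Optional[float]) -> GeoLayers:
    """Apply the dependency rules: L1 gates L2/L3, run count gates L4."""
    if l1 == 0:
        return GeoLayers(inclusion=0.0)  # L2/L3 stay N/A, not 0
    return GeoLayers(
        inclusion=l1,
        prominence=l2,
        quality=l3,
        stability=l4 if run_count >= 2 else None,
    )
```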

Key Layer Combination Patterns

With four layers each having high/low states, 16 theoretical combinations exist. Since L1 being low makes L2 and L3 irrelevant, practical patterns are more limited. Here are the most commonly observed patterns:

| Pattern Name | L1 | L2 | L3 | L4 | Diagnosis | Priority Action |
|---|---|---|---|---|---|---|
| Invisible | Low | N/A | N/A | N/A | Does not exist in AI | Content + source acquisition |
| Wallflower | High | Low | High | High | Mentioned but low weight | Strengthen positioning |
| Backfire | High | High | Low | High | Prominently wrong | Urgent information correction |
| Honor student | High | High | High | High | Ideal state | Maintain + expand |
| Unstable honor student | High | High | High | Low | Good but unstable | Continuous monitoring |
| Misunderstood | High | Med | Low | High | Consistently misrepresented | Fundamental content overhaul |

“The same aggregate score can represent entirely different states. Without examining layer patterns, you will prescribe the wrong treatment.”

Interpretation Order

Layer dependencies determine the interpretation sequence:

  1. Check L1 first. If L1 is very low, there is no point discussing other layers. “First, you must exist.”
  2. If L1 is sufficient, check L2 and L3 together. Diagnose whether position (L2), content (L3), or both are problematic.
  3. Check L4 last. Determine how trustworthy the L1-L3 diagnosis is. If L4 is low, treat steps 1-2 as provisional and seek additional measurement.

This sequence resembles medical triage. Rather than interpreting all tests simultaneously, you confirm prerequisites first and progressively deepen the analysis.
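
The triage sequence can be expressed directly as a routine that walks the layers in dependency order. The thresholds and diagnosis labels below are illustrative, borrowed from the pattern table above.

```python
from typing import Optional

def triage(l1: float,
           l2: Optional[float],
           l3: Optional[float],
           l4: Optional[float],
           high: float = 0.7, low: float = 0.3) -> str:
    """Interpret layer scores in dependency order; None means N/A."""
    # Step 1: existence first ("first, you must exist")
    if l1 < low:
        return "Invisible: build content and acquire external sources"
    # Step 2: position and content together
    if l3 is not None and l3 < low:
        diagnosis = "Backfire: correct information urgently"
    elif l2 is not None and l2 < low:
        diagnosis = "Wallflower: strengthen positioning"
    else:
        diagnosis = "Honor student: maintain and expand"
    # Step 3: confidence check last
    if l4 is None:
        return diagnosis + " (provisional: stability not yet measurable)"
    if l4 < high:
        return diagnosis + " (provisional: unstable, re-measure)"
    return diagnosis
```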


Design Philosophy

Principle 1: Measure What Matters, Not What Is Easy

“Measure what matters, not what’s easy.”

Counting brand mentions is easy — a single regex suffices. But determining whether those mentions are positive, accurate, and stable requires a far more complex pipeline. The Quality layer’s sentiment analysis, accuracy verification, and narrative alignment assessment involve LLM-based evaluation. The Stability layer demands the cost of repeated execution.

The reason for accepting this complexity is simple: decisions based on simple measurements lead to wrong actions. “Mention count increased” is less useful than “mention count increased but so did the proportion of inaccurate information.”

Principle 2: Decomposition Over Aggregation

WICHI does provide an approximate composite score from the 4 layers. However, this composite is a dashboard summary, not the basis for decisions.

An example with two brands sharing an identical aggregate score of 65 makes this clear:

| | Brand A | Brand B |
|---|---|---|
| L1 Inclusion | 90 | 60 |
| L2 Prominence | 80 | 70 |
| L3 Quality | 30 | 70 |
| L4 Stability | 60 | 60 |
| Composite (reference) | ~65 | ~65 |
| Diagnosis | Backfire — high exposure, inaccurate | Wallflower — insufficient exposure |
| Priority action | Urgent information correction | Content strategy expansion |

With the same score, the required actions are diametrically opposed. Brand A might actually benefit from reducing exposure (stopping misinformation spread), while Brand B needs to increase it. The aggregate score alone makes this distinction impossible.

Principle 3: Explicit Uncertainty

The decision to deactivate Stability on a single run reflects this principle. Estimating a score when data is absent is less honest than stating “this dimension cannot yet be measured.”

The same principle applies to other layers. When L1 is 0, L2 and L3 display N/A rather than arbitrary values. Presenting unmeasurable things as measured is misleading to users.

Principle 4: Diagnosis Leads to Prescription

The key outcome of this architecture is not the final aggregate score. It is layer-level patterns. Patterns determine diagnosis, and diagnosis determines prescription (action).

| Pattern | Diagnosis | Prescription |
|---|---|---|
| Low L1 | Absent presence | Content creation, external source acquisition, structured data |
| High L1, Low L2 | Insufficient weight | Strengthen differentiation, create comparison content |
| High L1, Low L3 | Harmful exposure | Correct information, update official sources, strengthen FAQ |
| High L1-L3, Low L4 | Unstable | Regular monitoring, track AI model changes |
| High L1-L4 | Optimal state | Maintain + expand into new query territories |

The design’s purpose is not to raise scores per se, but to diagnose which layer has problems and apply the appropriate remedy.


Generalizability

Beyond WICHI

The 4-layer framework was designed for a specific product, but conceptually it applies to any system measuring AI search visibility. The four core questions — does it exist, is it prominent, is it accurate, is it consistent — are equally valid whether the subject is a brand, a product, or an information source.

Extension to Other Domains

| Domain | L1 Interpretation | L2 Interpretation | L3 Interpretation | L4 Interpretation |
|---|---|---|---|---|
| Brand GEO | Mention presence | Position in response | Sentiment/accuracy | Cross-run consistency |
| Academic sources | Citation presence | Citation position/weight | Citation accuracy | Temporal stability |
| News sources | Reference presence | Headline/body placement | Factual accuracy | Issue persistence |
| Product comparison | Candidate inclusion | Recommendation rank | Spec accuracy | Query variation stability |

Specific implementations (sub-signals, weights, evaluation methods) vary by domain, but the 4-layer structure follows a universal logic: Exist, Stand Out, Be Accurate, Be Consistent.

Implementation Considerations

For those applying this framework in their own systems:

  1. L1 is easiest to implement, L3 is hardest. Start with text matching (L1) and incrementally increase complexity.
  2. L4 requires time. It activates only after 2+ measurements, so early operation uses L1-L3 only.
  3. Inter-layer weights should vary by domain. In healthcare or finance, where accuracy is paramount, L3 should carry high weight. For early-stage startups, where exposure itself is the priority, L1 should dominate. (A configuration sketch follows this list.)
  4. If using an LLM Judge for Quality measurement, the Judge’s own consistency (Stability) becomes a concern. Judge model responses also fluctuate, so Judge-level stability must be managed separately.
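
Point 3 above might reduce to a simple per-domain weight table, sketched below with invented values; nothing here reflects WICHI’s actual weights.

```python
# Hypothetical per-domain layer weights; each row sums to 1.0.
DOMAIN_WEIGHTS: dict[str, dict[str, float]] = {
    "early_stage_startup": {"L1": 0.50, "L2": 0.25, "L3": 0.15, "L4": 0.10},
    "healthcare":          {"L1": 0.20, "L2": 0.15, "L3": 0.50, "L4": 0.15},
    "finance":             {"L1": 0.20, "L2": 0.15, "L3": 0.50, "L4": 0.15},
    "default":             {"L1": 0.30, "L2": 0.25, "L3": 0.30, "L4": 0.15},
}

def composite(scores: dict[str, float], domain: str = "default") -> float:
    """Reference-only composite; layer patterns, not this number, drive action."""
    w = DOMAIN_WEIGHTS.get(domain, DOMAIN_WEIGHTS["default"])
    return sum(w[layer] * scores.get(layer, 0.0) for layer in w)
```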

Summary

The 4-layer GEO Score design answers these questions in order:

```mermaid
flowchart LR
    Q1["Does it exist?<br/>L1 Inclusion"] --> Q2["Is it prominent?<br/>L2 Prominence"]
    Q2 --> Q3["Is it accurate?<br/>L3 Quality"]
    Q3 --> Q4["Can we trust it?<br/>L4 Stability"]

    style Q1 fill:#e8f4f8,stroke:#2196F3
    style Q2 fill:#fff3e0,stroke:#FF9800
    style Q3 fill:#e8f5e9,stroke:#4CAF50
    style Q4 fill:#fce4ec,stroke:#E91E63
```

Each layer is measured independently, but interpretation follows a sequence. L1 is the prerequisite, L4 validates confidence. Layer-level patterns matter more than aggregate scores — patterns determine diagnosis, and diagnosis determines action.

What this design deliberately rejects is the convenience of a single number. “Inclusion High, Prominence High, Quality Low, Stability Not Measured” in four lines carries more information and connects to more precise action than “GEO Score 72” in one line.
