Analysis of the design philosophy and hierarchical dependencies of the four GEO Score layers (Inclusion, Prominence, Quality, Stability), and their implications for decision-making.
Scope of This Post
WICHI’s GEO Score measures how a brand is represented in AI search engine responses. This post covers the design philosophy and conceptual framework behind the 4 layers that constitute the GEO Score. Specific weights, algorithms, and scoring formulas are core product IP and are not disclosed here.
Why Four Layers
The Structural Limitation of a Single Score
Producing a single GEO Score number is easy. Count brand mentions, or convert exposure rate to a percentage. The problem is that a single number cannot tell you what is going well and what is broken.
A few real scenarios make this limitation clear:
- Mentioned but negatively. The brand appears in AI responses, but in contexts like “more expensive than X” or “alternatives include Y.” Exposure exists, so a single score looks fine — but the mention actually benefits competitors.
- Frequently mentioned but inaccurate. The brand is mentioned often, but AI describes features it does not offer or gets pricing wrong. If users trust and act on this, it erodes brand credibility.
- Great today, gone tomorrow. Running the same query today puts the brand at #1; running it tomorrow omits it entirely. A one-time measurement cannot capture this instability.
- Top position, negative context. The brand appears in the first sentence of the response, but as “a service to be cautious about.” Positionally it is the top mention, but effectively it is harmful.
- Many mentions, no citations. The brand name appears multiple times, but no source links or domain citations are included. This suggests AI “knows about” the brand but does not “trust” it as a source.
All these scenarios collapse into the same number under a single-score system. A score of 70 could mean “strong visibility,” “frequently mentioned but inaccurate,” or “good today, uncertain tomorrow” — and there is no way to tell which.
Problems with Existing Approaches
Common approaches to measuring AI search visibility each have inherent limitations:
| Approach | How It Measures | Limitation |
|---|---|---|
| Simple mention counting | Count brand name appearances | Ignores context (positive/negative) |
| Binary exposure check | Mentioned/not mentioned | No position, weight, or quality information |
| Sentiment analysis alone | Positive/negative/neutral classification | No exposure or position information |
| Rank-based | Rank vs. competitors | Does not reflect consistency (stability) |
| SOV (Share of Voice) alone | Brand’s share of total mentions | Ignores quality and accuracy |
The common thread: one-dimensionality. Each approach captures only one facet of AI search visibility while ignoring the rest. The result is that different approaches can produce contradictory conclusions from the same data.
“Measure what matters, not what’s easy. A single number is easy to produce but cannot carry the information needed for decision-making.”
The Answer: Layered Architecture
WICHI chose a 4-layer architecture in which each layer is measured independently. Three core principles guide it:
- Each layer answers one question. “Is it mentioned?”, “Is it prominent?”, “Is it accurate?”, “Is it consistent?” — each question is independent, and each answer demands different action.
- Logical dependencies exist between layers. You cannot measure the quality of a brand that is not mentioned. These dependencies determine the interpretation order.
- Layer-level patterns matter more than aggregate scores. A total of 70 means entirely different things depending on how the score is distributed across layers.
```mermaid
graph TD
A["GEO Score"] --> B["L1: Inclusion<br/>Exposure"]
A --> C["L2: Prominence<br/>Visibility"]
A --> D["L3: Quality<br/>Accuracy & Sentiment"]
A --> E["L4: Stability<br/>Consistency"]
B -->|"Prerequisite"| C
B -->|"Prerequisite"| D
E -->|"Confidence check"| B
E -->|"Confidence check"| C
E -->|"Confidence check"| D
style B fill:#e8f4f8,stroke:#2196F3
style C fill:#fff3e0,stroke:#FF9800
style D fill:#e8f5e9,stroke:#4CAF50
style E fill:#fce4ec,stroke:#E91E63
```
Note the arrow directions. L1 (Inclusion) is a prerequisite for L2 and L3, while L4 (Stability) validates the confidence of all other layers. This structural relationship is what differentiates a 4-layer architecture from merely listing four scores.
Layer Details
Layer Overview
| Layer | Name | Core Question | What It Measures | Activation Condition |
|---|---|---|---|---|
| L1 | Inclusion (Exposure) | Is the brand mentioned? | Presence, citation inclusion, share of voice | Always |
| L2 | Prominence (Visibility) | How visibly is it positioned? | Position within response, weight, depth of coverage | L1 > 0 |
| L3 | Quality (Accuracy) | Is the content accurate and favorable? | Sentiment, accuracy, narrative alignment | L1 > 0 |
| L4 | Stability (Consistency) | Are results consistent across runs? | Cross-run variance, volatility | 2+ runs required |
L1 — Inclusion (Exposure)
Core Question
“Does the AI search engine recognize this brand’s existence?”
Inclusion is the most fundamental layer. It measures whether the brand name appears in AI search responses. If this score is low, the other three layers are moot — you cannot assess the quality or stability of a brand that is not mentioned.
Why Not a Binary Value
Treating Inclusion as simple Yes/No loses information. “Mentioned in 1 of 10 queries” and “mentioned in 9 of 10 queries” are both “mentioned,” but represent completely different states. Inclusion therefore combines multiple sub-signals into a continuous score between 0 and 1.
What It Measures
| Sub-Signal | Description | Why It Is Needed |
|---|---|---|
| Brand mention presence | Does the brand name appear in response text? | Most basic existence check |
| Citation inclusion | Is the brand’s domain included in source citations? | Whether AI treats the brand as a trustworthy source |
| Share of Voice (SOV) | Brand’s share among all brands mentioned in the response | Relative position within the category |
These three sub-signals are combined because each carries different meaning. Appearing in text versus being cited as a source are different things. An AI engine may mention a brand in text without providing citation links — signaling “aware of it but does not trust it as an official source.”
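To make this concrete, here is a minimal sketch of how presence, citation, and share-of-voice signals could be folded into a single 0-1 Inclusion value. The `QueryObservation` structure, the weights, and the averaging are illustrative assumptions; the actual sub-signal definitions and weighting are product IP and are not shown here.

```python
from dataclasses import dataclass

@dataclass
class QueryObservation:
    """One AI search response observed for one query (illustrative structure)."""
    brand_mentioned: bool  # brand name appears in the response text
    brand_cited: bool      # brand's domain appears in the citation list
    brand_share: float     # brand's share of all brand mentions in this response (0-1)

def inclusion_score(observations: list[QueryObservation],
                    w_presence: float = 0.5,
                    w_citation: float = 0.3,
                    w_sov: float = 0.2) -> float:
    """Combine presence rate, citation rate, and share of voice into a 0-1 score.

    The weights are placeholders, not WICHI's actual formula.
    """
    if not observations:
        return 0.0
    presence_rate = sum(o.brand_mentioned for o in observations) / len(observations)
    citation_rate = sum(o.brand_cited for o in observations) / len(observations)
    shares = [o.brand_share for o in observations if o.brand_mentioned]
    sov = sum(shares) / len(shares) if shares else 0.0
    return w_presence * presence_rate + w_citation * citation_rate + w_sov * sov
```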
Interpretation Guide
| Inclusion Level | Meaning | Priority Action |
|---|---|---|
| Very low | Brand barely recognized in AI search | Secure external sources, provide structured data |
| Low | Sporadic exposure in select queries only | Develop per-query content strategy |
| Medium | Intermittent exposure in category queries | Focus on improving exposure consistency |
| High | Stable exposure across most relevant queries | Shift to L2-L3 optimization |
| Very high | Exposure + citation in nearly all queries | Maintain + expand into new query territories |
```mermaid
flowchart LR
Q["AI Search Query"] --> R["AI Response Generated"]
R --> M{"Brand<br/>mentioned?"}
M -->|No| X["L1 = 0<br/>L2-L4 cannot be measured"]
M -->|Yes| C{"Citation<br/>included?"}
C -->|No| S1["Text mention only"]
C -->|Yes| S2["Text + citation present"]
S1 --> SOV["SOV Calculation"]
S2 --> SOV
SOV --> L1["L1 Inclusion Score"]
style X fill:#ffebee,stroke:#c62828
style L1 fill:#e8f4f8,stroke:#2196F3
```
The Inclusion Trap
High Inclusion is not unconditionally good. Appearing on a “worst services” list also produces high Inclusion. This is precisely why L3 (Quality) exists, and why Inclusion alone must never be used to judge GEO status.
Traditional SEO has a parallel: high impressions mean nothing if click-through rate is low. Inclusion is the GEO equivalent of impressions — the other layers play the roles of CTR and conversion.
L2 — Prominence (Visibility)
Core Question
“Where in the response does the brand appear, and how much weight does it receive?”
If Inclusion measures “whether it exists,” Prominence measures “quality of existence.” Being described in detail as the top recommendation versus being listed as a one-liner under “other options” affects user behavior in completely different ways.
Why Position Matters
The position effect in AI search responses is similar to but more extreme than traditional SERP position effects. In SERPs, users can scroll through the page. AI search responses are typically presented as a single continuous text block. Users who get their answer from the first portion likely never read the rest.
Aggarwal et al. (KDD 2024) proposed PAWC (Position-Adjusted Word Count), a concept that quantifies this phenomenon. Brand-related text positioned earlier in the response receives higher weight.
What It Measures
| Sub-Signal | Description | What It Reflects |
|---|---|---|
| Position-adjusted score (PAWC-based) | Higher score for earlier position and greater volume | User attention pattern favoring earlier content |
| Brand mention weight | Ratio of brand-related content to total response length | How deeply AI covers the brand |
PAWC Explained
The core idea is straightforward: the same volume of text has different visibility depending on whether it appears in the 1st paragraph or the 5th. Earlier text is read by more users; later text may not be read at all.
| Scenario | Position in Response | Brand-Related Volume | Relative PAWC |
|---|---|---|---|
| A | First paragraph | 50 words | High |
| B | Third paragraph | 50 words | Medium |
| C | Last paragraph | 50 words | Low |
| D | First + Third paragraph | 30 + 20 words | Lower than A, higher than B |
Scenarios A and C allocate the same 50 words to the brand, but their PAWC-measured visibility differs dramatically. This is the gap that simple mention counting or word-count tallying cannot capture.
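The ranking in the table can be reproduced with a simple position-decay model. The exponential decay below is a deliberately simplified illustration of the position-adjusted idea, not the exact formulation from Aggarwal et al. nor the one used in the GEO Score.

```python
import math

def position_adjusted_word_count(paragraph_word_counts: list[int],
                                 decay: float = 0.5) -> float:
    """Weight brand-related word counts by paragraph position.

    paragraph_word_counts[i] = number of brand-related words in paragraph i
    (0-indexed from the top of the response). Earlier paragraphs get more
    weight; `decay` controls how quickly later paragraphs are discounted.
    This decay curve is an assumption for illustration only.
    """
    return sum(words * math.exp(-decay * i)
               for i, words in enumerate(paragraph_word_counts))

# Scenarios from the table above (a 5-paragraph response, 50 brand-related words):
scenario_a = position_adjusted_word_count([50, 0, 0, 0, 0])   # first paragraph
scenario_b = position_adjusted_word_count([0, 0, 50, 0, 0])   # third paragraph
scenario_c = position_adjusted_word_count([0, 0, 0, 0, 50])   # last paragraph
scenario_d = position_adjusted_word_count([30, 0, 20, 0, 0])  # split placement
print(scenario_a, scenario_d, scenario_b, scenario_c)          # A > D > B > C
```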
Interpretation Guide
| Prominence Level | Meaning | Typical State |
|---|---|---|
| Very low | Mentioned but peripheral | “Other options” list, bottom of comparison table |
| Low | Present but not noticeable | Brief mention in mid-position |
| Medium | Covered with meaningful weight | One of several options described in detail |
| High | Among top recommendations | Introduced early as a primary option |
| Very high | First recommendation, most detailed coverage | Presented as the core answer |
Relationship Between Prominence and Inclusion
Prominence is meaningful only when Inclusion exceeds 0 — you cannot discuss the position of a brand that is not mentioned. However, high Inclusion does not guarantee high Prominence. A brand mentioned in all 10 queries (high L1) but each time as a brief last-line note (low L2) is entirely possible.
Patterns created by combining these two layers:
| Pattern | L1 | L2 | Interpretation |
|---|---|---|---|
| Invisible | Low | — | AI does not recognize the brand |
| Wallflower | High | Low | Mentioned but with low weight |
| Spotlighted | High | High | Well-exposed with strong weight |
| Occasional spotlight | Medium | High | High weight in select queries only |
L3 — Quality (Accuracy and Sentiment)
Core Question
“Is what AI says about the brand accurate and favorable?”
Quality evaluates the “content” of exposure. Being visible and prominent means nothing if the content is inaccurate or negative — in fact, high-prominence inaccurate content can be worse than no visibility at all, because users are likely to trust and act on AI responses.
Why Quality Is the Most Complex Layer
Inclusion can be measured via text matching. Prominence can be measured through structural attributes — position and volume. Quality requires evaluating the meaning of text. “This service is expensive but good” and “This service is good but expensive” use nearly identical words but convey different nuances.
This is why Quality measurement incorporates LLM-based evaluation (LLM-as-a-Judge). Simple keyword matching or rule-based sentiment analysis cannot capture contextual meaning at this level.
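As a rough illustration of the LLM-as-a-Judge approach, the sketch below builds a judging prompt and parses a structured verdict. The prompt wording, the output schema, and the caller-supplied `call_llm` function are all assumptions for illustration, not WICHI's evaluation pipeline.

```python
import json

JUDGE_PROMPT = """You are evaluating how a brand is represented in an AI search response.

Brand facts (ground truth):
{brand_facts}

AI response excerpt about the brand:
{excerpt}

Return JSON with these fields:
- "sentiment": "positive" | "neutral" | "negative"
- "inaccurate_claims": list of statements that contradict the brand facts
- "reasoning": one short sentence
"""

def judge_brand_mention(excerpt: str, brand_facts: str, call_llm) -> dict:
    """Ask a judge model to assess sentiment and accuracy of a brand mention.

    `call_llm` is a caller-supplied function (prompt: str) -> str, so any
    LLM client can be plugged in.
    """
    prompt = JUDGE_PROMPT.format(brand_facts=brand_facts, excerpt=excerpt)
    raw = call_llm(prompt)
    return json.loads(raw)  # in practice, validate the schema and retry on parse errors
```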
What It Measures
Quality combines multiple dimensions, each evaluating a different aspect of brand mentions.
| Dimension | Description | What Low Scores Mean |
|---|---|---|
| Sentiment | Overall tone toward the brand | Mentioned in negative or critical contexts |
| Accuracy | Factual correctness of stated information | Non-existent features claimed, wrong prices, incorrect dates |
| Narrative Alignment | Alignment with the brand’s intended core messaging | AI positions the brand differently than intended |
| Hallucination Risk | Proportion of AI-generated information that contradicts facts | Non-existent features, services, or prices presented as real |
```mermaid
graph TD
R["Brand Mention in AI Response"] --> S["Sentiment Analysis"]
R --> A["Accuracy Verification"]
R --> N["Narrative Alignment"]
R --> H["Hallucination Detection"]
S --> Q["L3 Quality Score"]
A --> Q
N --> Q
H --> Q
Q --> I1{"High?"}
I1 -->|Yes| G["Exposure benefits the brand"]
I1 -->|No| B["Exposure may harm the brand"]
style Q fill:#e8f5e9,stroke:#4CAF50
style G fill:#c8e6c9,stroke:#388E3C
style B fill:#ffcdd2,stroke:#c62828
```
The Severity of Hallucination
In AI search, hallucination is not merely a technical error — it is a business risk. If AI states “this service offers a free trial” when no free trial exists, users become disappointed and brand trust erodes. In regulated industries (finance, healthcare), AI-amplified misinformation can create legal exposure.
The Quality layer detects such hallucinations and clearly identifies cases where high Inclusion coexists with low Quality — a diagnosis impossible with simple mention counting.
Interpretation Guide
| Quality Level | Meaning | Priority Action |
|---|---|---|
| Very low | Mentioned with negative or inaccurate information | Create correction content, update official information sources |
| Low | Neutral but core messaging not reflected | Strengthen USP-focused content |
| Medium | Mostly accurate with some inaccuracies | Target specific inaccuracies for correction |
| High | Accurate and positively described | Maintain + fine-tune narrative alignment |
| Very high | Core messaging accurately reflected, positive tone, no hallucinations | Maintain ideal state |
The Quality Paradox: High Inclusion + Low Quality
The most dangerous pattern is high L1 with low L3. This combination means exposure is actively harming the brand.
| Scenario | Example | Risk Level |
|---|---|---|
| Inaccurate information spreading | “This service is free” (actually paid) | High — directly causes user disappointment |
| Competitor-favorable context | “A more affordable alternative to A is B” | Medium — indirect revenue loss |
| Outdated information | Described with 2-year-old pricing or features | Medium — user confusion |
| Negative tone | “Known for having many issues” | High — brand image damage |
L4 — Stability (Consistency)
Core Question
“Can we trust the measurement results? Do repeated runs produce consistent outcomes?”
Stability differs in nature from the other three layers. While L1-L3 measure “what is the current state,” L4 verifies “how trustworthy is that measurement.” AI search responses can vary for the same query based on execution timing, model version, region, and more.
Why AI Search Is Inherently Unstable
Traditional search engines return relatively stable results for the same query. Google’s SERP changes over time but does not swing dramatically day to day. AI search engines (ChatGPT, Perplexity, Gemini, etc.) are structurally more volatile.
| Volatility Factor | Description |
|---|---|
| Model updates | AI model updates change responses to identical queries |
| Temperature parameter | Generation randomness means the same query can yield different results |
| Context window | Prior conversation context can alter responses to the same query |
| Real-time data | Some AI engines incorporate live web data, causing time-dependent variation |
| Region and language settings | User settings can alter responses to the same query |
What It Measures
| Sub-Signal | Description | Activation Condition |
|---|---|---|
| Response Drift | GEO Score difference between current and previous results for the same query | 2+ runs |
| Citation Volatility | Brand’s inclusion/exclusion fluctuation in citation lists | 2+ runs |
| Prompt Sensitivity | Result differences across similar query variations | 2+ runs |
| Model Version Drift | Result differences across AI model versions | 2+ runs |
```mermaid
flowchart TD
R1["Run 1"] --> S1["L1-L3 Score Set A"]
R2["Run 2"] --> S2["L1-L3 Score Set B"]
R3["Run N"] --> S3["L1-L3 Score Set N"]
S1 --> CMP["Cross-Run Comparison"]
S2 --> CMP
S3 --> CMP
CMP --> D{"Variance level?"}
D -->|"Low"| ST["L4 High<br/>Results are trustworthy"]
D -->|"High"| UN["L4 Low<br/>Results unreliable"]
style ST fill:#c8e6c9,stroke:#388E3C
style UN fill:#ffcdd2,stroke:#c62828
```
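A minimal sketch of the cross-run comparison shown above, assuming L1-L3 are expressed on a 0-1 scale: drift is taken as the standard deviation of each layer across runs, and the thresholds are placeholders rather than product values.

```python
from statistics import pstdev

def stability_level(run_scores: list[dict[str, float]],
                    low_threshold: float = 0.05,
                    high_threshold: float = 0.15) -> str:
    """Classify cross-run consistency of L1-L3 scores (each on a 0-1 scale).

    run_scores example: [{"L1": 0.8, "L2": 0.6, "L3": 0.7}, ...], one dict per run.
    Thresholds are illustrative assumptions, not product values.
    """
    if len(run_scores) < 2:
        return "not measurable"  # Stability is comparative; a single run is not enough
    drifts = [pstdev(run[layer] for run in run_scores) for layer in ("L1", "L2", "L3")]
    worst = max(drifts)
    if worst < low_threshold:
        return "high"
    if worst < high_threshold:
        return "medium"
    return "low"
```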
Why 2+ Runs Are Required
Stability is fundamentally a comparative metric: consistency cannot be assessed from a single measurement. This is a basic limitation of measurement, and WICHI acknowledges it explicitly rather than hiding it.
On a single run, L4 is deactivated. Expressing “stability cannot yet be assessed” by simply not measuring it is more honest than generating a score with insufficient data. Explicit uncertainty beats false confidence.
Interpretation Guide
| Stability Level | Meaning | Implication |
|---|---|---|
| Not measurable | Only a single run completed | Current L1-L3 scores are for reference only, insufficient for decision-making |
| Low | Results vary significantly across runs | Defer L1-L3-based decisions, additional measurement needed |
| Medium | Some variation exists but overall trend holds | L1-L3 directional trends are reliable, specific numbers are reference only |
| High | Consistent results across repeated runs | L1-L3 scores can be used for decision-making |
Business Significance of Stability
The Stability layer is also a key SaaS differentiator. One-time measurement tools can only provide L1-L3. A recurring subscription model provides L4 through repeated measurement. This layer demonstrates with data “why continuous monitoring is needed” rather than “measure once and done.”
In an environment where AI search engines continuously update and responses continuously change, the value of one-time measurement depreciates rapidly. The Stability layer quantifies this depreciation rate.
Inter-Layer Relationships
Dependency Structure
The four layers are measured independently, but interpretation follows a clear dependency structure.
```mermaid
graph BT
L4["L4: Stability<br/>Confidence Layer"] -.->|"Determines confidence<br/>of all layers"| L1
L4 -.-> L2
L4 -.-> L3
L1["L1: Inclusion<br/>Prerequisite Layer"] -->|"L1 > 0 required"| L2["L2: Prominence<br/>Position & Weight Layer"]
L1 -->|"L1 > 0 required"| L3["L3: Quality<br/>Content & Accuracy Layer"]
style L1 fill:#e8f4f8,stroke:#2196F3
style L2 fill:#fff3e0,stroke:#FF9800
style L3 fill:#e8f5e9,stroke:#4CAF50
style L4 fill:#fce4ec,stroke:#E91E63
```
L1 is the prerequisite for L2 and L3. You cannot discuss position or quality when the brand is not mentioned at all. When L1 is 0, L2 and L3 are N/A.
L2 and L3 are independent of each other. A brand can be described prominently but inaccurately (high L2, low L3), or accurately but inconspicuously (low L2, high L3). These two layers measure different axes.
L4 is a meta-layer. It measures not the “value” but the “confidence” of L1-L3. Low L4 means L1-L3 scores, however good they look, are unreliable for decision-making.
Key Layer Combination Patterns
With four layers each having high/low states, 16 theoretical combinations exist. Since L1 being low makes L2 and L3 irrelevant, practical patterns are more limited. Here are the most commonly observed patterns:
| Pattern Name | L1 | L2 | L3 | L4 | Diagnosis | Priority Action |
|---|---|---|---|---|---|---|
| Invisible | Low | — | — | — | Does not exist in AI | Content + source acquisition |
| Wallflower | High | Low | High | High | Mentioned but low weight | Strengthen positioning |
| Backfire | High | High | Low | High | Prominently wrong | Urgent information correction |
| Honor student | High | High | High | High | Ideal state | Maintain + expand |
| Unstable honor student | High | High | High | Low | Good but unstable | Continuous monitoring |
| Misunderstood | High | Med | Low | High | Consistently misrepresented | Fundamental content overhaul |
“The same aggregate score can represent entirely different states. Without examining layer patterns, you will prescribe the wrong treatment.”
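As an illustration only, the patterns above could be mapped from layer scores with a small rule set like the following; the thresholds and the catch-all label are assumptions, not the product's classification logic.

```python
def classify_pattern(l1: float, l2: float, l3: float, l4: float,
                     low: float = 0.4, high: float = 0.7) -> str:
    """Map 0-1 layer scores to the named patterns from the table above.

    Thresholds (`low`, `high`) are placeholders, not product values.
    """
    if l1 < low:
        return "Invisible"               # L2/L3 are moot when the brand is barely mentioned
    if l3 < low:
        return "Backfire" if l2 >= high else "Misunderstood"
    if l2 < low:
        return "Wallflower"
    if l2 >= high and l3 >= high:
        return "Unstable honor student" if l4 < low else "Honor student"
    return "Mixed mid-range"             # catch-all not in the table, shown for completeness
```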
Interpretation Order
Layer dependencies determine the interpretation sequence:
- Check L1 first. If L1 is very low, there is no point discussing other layers. “First, you must exist.”
- If L1 is sufficient, check L2 and L3 together. Diagnose whether position (L2), content (L3), or both are problematic.
- Check L4 last. Determine how trustworthy the L1-L3 diagnosis is. If L4 is low, treat steps 1-2 as provisional and seek additional measurement.
This sequence resembles medical triage. Rather than interpreting all tests simultaneously, you confirm prerequisites first and progressively deepen the analysis.
Design Philosophy
Principle 1: Measure What Matters, Not What Is Easy
“Measure what matters, not what’s easy.”
Counting brand mentions is easy — a single regex suffices. But determining whether those mentions are positive, accurate, and stable requires a far more complex pipeline. The Quality layer’s sentiment analysis, accuracy verification, and narrative alignment assessment involve LLM-based evaluation. The Stability layer demands the cost of repeated execution.
The reason for accepting this complexity is simple: decisions based on simple measurements lead to wrong actions. “Mention count increased” is less useful than “mention count increased but so did the proportion of inaccurate information.”
Principle 2: Decomposition Over Aggregation
WICHI does provide an approximate composite score from the 4 layers. However, this composite is a dashboard summary, not the basis for decisions.
An example with two brands sharing an identical aggregate score of 65 makes this clear:
| | Brand A | Brand B |
|---|---|---|
| L1 Inclusion | 90 | 60 |
| L2 Prominence | 80 | 70 |
| L3 Quality | 30 | 70 |
| L4 Stability | 60 | 60 |
| Composite (reference) | ~65 | ~65 |
| Diagnosis | Backfire — high exposure, inaccurate | Wallflower — insufficient exposure |
| Priority action | Urgent information correction | Content strategy expansion |
With the same score, the required actions are diametrically opposed. Brand A might actually benefit from reducing exposure (stopping misinformation spread), while Brand B needs to increase it. The aggregate score alone makes this distinction impossible.
Principle 3: Explicit Uncertainty
The decision to deactivate Stability on a single run reflects this principle. Estimating a score when data is absent is less honest than stating “this dimension cannot yet be measured.”
The same principle applies to other layers. When L1 is 0, L2 and L3 display N/A rather than arbitrary values. Presenting unmeasurable things as measured is misleading to users.
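One way to make explicit uncertainty concrete in a data model is to use optional fields instead of defaulting to zero. The schema below is an illustrative sketch, not WICHI's internal representation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeoScoreReport:
    """Layer scores with explicit 'not measurable' states (illustrative only)."""
    inclusion: float                     # L1 is always measurable
    prominence: Optional[float] = None   # None (N/A) when inclusion is 0
    quality: Optional[float] = None      # None (N/A) when inclusion is 0
    stability: Optional[float] = None    # None until at least 2 runs exist

    def __post_init__(self):
        if self.inclusion == 0 and (self.prominence is not None or self.quality is not None):
            raise ValueError("L2/L3 cannot be reported for a brand that is never mentioned")
```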
Principle 4: Diagnosis Leads to Prescription
The key outcome of this architecture is not the final aggregate score. It is layer-level patterns. Patterns determine diagnosis, and diagnosis determines prescription (action).
| Pattern | Diagnosis | Prescription |
|---|---|---|
| Low L1 | No presence in AI search | Content creation, external source acquisition, structured data |
| High L1, Low L2 | Insufficient weight | Strengthen differentiation, create comparison content |
| High L1, Low L3 | Harmful exposure | Correct information, update official sources, strengthen FAQ |
| High L1-L3, Low L4 | Unstable | Regular monitoring, track AI model changes |
| High L1-L4 | Optimal state | Maintain + expand into new query territories |
The design’s purpose is not to raise scores per se, but to diagnose which layer has problems and apply the appropriate remedy.
Generalizability
Beyond WICHI
The 4-layer framework was designed for a specific product, but conceptually it applies to any system measuring AI search visibility. The four core questions — does it exist, is it prominent, is it accurate, is it consistent — are equally valid whether the subject is a brand, a product, or an information source.
Extension to Other Domains
| Domain | L1 Interpretation | L2 Interpretation | L3 Interpretation | L4 Interpretation |
|---|---|---|---|---|
| Brand GEO | Mention presence | Position in response | Sentiment/accuracy | Cross-run consistency |
| Academic sources | Citation presence | Citation position/weight | Citation accuracy | Temporal stability |
| News sources | Reference presence | Headline/body placement | Factual accuracy | Issue persistence |
| Product comparison | Candidate inclusion | Recommendation rank | Spec accuracy | Query variation stability |
Specific implementations (sub-signals, weights, evaluation methods) vary by domain, but the 4-layer structure follows a universal logic: Exist, Stand Out, Be Accurate, Be Consistent.
Implementation Considerations
For those applying this framework in their own systems:
- L1 is easiest to implement, L3 is hardest. Start with text matching (L1) and incrementally increase complexity.
- L4 requires time. It activates only after 2+ measurements, so early operation uses L1-L3 only.
- Inter-layer weights should vary by domain. In healthcare or finance where accuracy is paramount, L3 weight should be high. For early-stage startups where exposure itself is the priority, L1 weight should dominate (a configuration sketch follows this list).
- If using an LLM Judge for Quality measurement, the Judge’s own consistency (Stability) becomes a concern. Judge model responses also fluctuate, so Judge-level stability must be managed separately.
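For the weighting point above, a per-domain configuration might look like the following sketch. The numbers are placeholders chosen only to show the shape of the idea, not recommended values.

```python
# Illustrative per-domain layer weights (each set sums to 1.0); values are placeholders.
LAYER_WEIGHTS = {
    "default":       {"L1": 0.30, "L2": 0.25, "L3": 0.25, "L4": 0.20},
    "healthcare":    {"L1": 0.20, "L2": 0.15, "L3": 0.45, "L4": 0.20},  # accuracy dominates
    "finance":       {"L1": 0.20, "L2": 0.15, "L3": 0.45, "L4": 0.20},
    "early_startup": {"L1": 0.50, "L2": 0.25, "L3": 0.15, "L4": 0.10},  # exposure first
}

def composite_score(layers: dict[str, float], domain: str = "default") -> float:
    """Weighted composite for dashboard summaries only, not for decisions."""
    weights = LAYER_WEIGHTS.get(domain, LAYER_WEIGHTS["default"])
    return sum(weights[name] * layers.get(name, 0.0) for name in weights)
```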
Summary
The 4-layer GEO Score design answers these questions in order:
```mermaid
flowchart LR
Q1["Does it exist?<br/>L1 Inclusion"] --> Q2["Is it prominent?<br/>L2 Prominence"]
Q2 --> Q3["Is it accurate?<br/>L3 Quality"]
Q3 --> Q4["Can we trust it?<br/>L4 Stability"]
style Q1 fill:#e8f4f8,stroke:#2196F3
style Q2 fill:#fff3e0,stroke:#FF9800
style Q3 fill:#e8f5e9,stroke:#4CAF50
style Q4 fill:#fce4ec,stroke:#E91E63
```
Each layer is measured independently, but interpretation follows a sequence. L1 is the prerequisite, L4 validates confidence. Layer-level patterns matter more than aggregate scores — patterns determine diagnosis, and diagnosis determines action.
What this design deliberately rejects is the convenience of a single number. “Inclusion High, Prominence High, Quality Low, Stability Not Measured” in four lines carries more information and connects to more precise action than “GEO Score 72” in one line.
Related Posts

Designing the 9-Bucket Query Framework
Documentation of WICHI's 9-Bucket query framework. Defines 3 Zones and 9 Buckets based on brand presence to measure AI's organic recommendations effectively.

Multi-Engine Architecture — Parallel Collection from 3 AI Search Engines
Analysis of multi-engine architecture design principles that leverage response variance as signals, featuring parallel collection structures and scalability via the adapter pattern.

Six GEO Business Opportunities and WICHI's Choice
Strategic analysis of three opportunity factors in the AI search (GEO) market and why WICHI chose 'SaaS-based monitoring' over advertising or agency models.