GEO Paper Review: Definition and Foundational Frameworks


MJ · 14 min read

Review of the KDD 2024 GEO paper and Chen et al. 2025. Covers the academic definition of GEO, the PAWC visibility metric, and AI search's preference for earned media.

Scope of This Review

This post provides an academic review of two papers that form the foundation of GEO (Generative Engine Optimization).

  1. Aggarwal, P. et al. (2024), “GEO: Generative Engine Optimization” — formally published at KDD 2024
  2. Chen, Y. et al. (2025), “Generative Engine Optimization: How to Dominate AI Search” — preprint

The term “GEO” has been spreading rapidly across the industry, but the actual academic base remains thin. These two papers serve as the twin pillars of that thin foundation. Aggarwal et al. established the academic definition of GEO along with its benchmark and measurement metrics, while Chen et al. provided empirical data on how AI search structurally differs from traditional search in its behavior.

This post covers the background, a detailed analysis of the primary paper (Aggarwal et al.), a detailed analysis of the secondary paper (Chen et al.), a cross-analysis of the two, and the gaps remaining in current literature along with their practical implications.

Background: From SEO to GEO

A Structural Shift in Search Paradigms

Traditional search engines operate on the “ten blue links” model. Users enter a query, and a ranked list of indexed web pages appears. SEO (Search Engine Optimization) is the strategy for climbing those rankings, and its success criteria are clear: position on the SERP (Search Engine Results Page) is the performance metric.

Generative search engines fundamentally change this structure. Systems like ChatGPT, Perplexity, Google AI Overviews, and Bing Copilot produce free-form synthesized responses to user queries. They aggregate information from multiple sources and present it as a single unified text. In this structure, the very concept of “ranking” becomes ambiguous.

In traditional search, the competition was “who makes it to page one.” In generative search, the competition is “whose information does the AI mention when generating its response.” When the object of measurement changes, the optimization strategy must change too.

The Starting Point for GEO Research

Before 2024, academic research on GEO was virtually nonexistent. The industry had been discussing “AI search optimization” as a concept, but no systematic framework existed for what to measure or which strategies were effective. Aggarwal et al.’s KDD 2024 paper was the first academic attempt to fill this gap.

```mermaid
flowchart TB
    subgraph "Pre-Research State"
        A["Only SEO frameworks existed"] --> B["SERP ranking-based measurement"]
        B --> C["Not applicable to generative search"]
    end
    subgraph "Aggarwal et al. 2024 Contributions"
        D["Defined the GEO concept"] --> E["GEO-Bench benchmark"]
        E --> F["Proposed PAWC visibility metric"]
        F --> G["Tested 9 optimization strategies"]
    end
    subgraph "Chen et al. 2025 Contributions"
        H["Empirical analysis of 30K responses"] --> I["Discovered earned media bias"]
        I --> J["Identified engine-specific sensitivity differences"]
    end
    C -->|"Academic gap"| D
    G -->|"Empirical work on top of framework"| H
```

Paper 1: The Academic Starting Point for GEO (Aggarwal et al., KDD 2024)

Aggarwal et al.’s paper is a field-defining paper in that it was the first to formally define GEO as an academic concept and propose a systematic benchmark and measurement metrics. Its acceptance at KDD 2024 itself reflects the academic community’s recognition of the research’s contribution.

Research Design Overview

The research design consists of three stages:

  1. Benchmark construction: Collected queries across diverse domains and corresponding generative engine responses to build GEO-Bench
  2. Metric design: Proposed a metric system for quantifying content visibility in generative responses (Word Count, Position-Adjusted Word Count, Impression Count)
  3. Optimization experiments: Applied nine content optimization strategies and measured visibility changes

GEO-Bench is the first systematic benchmark for measuring content visibility in generative search engine responses. Traditional SEO had a clear measurement standard in SERP rankings. Generative engine responses, however, are free-form text, which required redefining the very concept of “visibility” from scratch.

Dataset Composition

The GEO-Bench query set was designed to cover diverse domains. The authors classified queries by domain and collected the content of source websites referenced by the generative engine for each query.

| Component | Details |
| --- | --- |
| Query source | Based on real user search queries |
| Domain scope | Multiple domains including law, medicine, technology, education, finance |
| Response collection target | BingChat/Copilot-based generative engine |
| Collection period | Snapshot at specific points in 2023–2024 |
| Included data | Queries, generated response text, cited/referenced source URLs, source content |

Benchmarking Approach

GEO-Bench’s benchmarking approach works as follows: for each query, the generative engine produces a response, and the system tracks how much and in what position each source website is mentioned within that response. After modifying (optimizing) source content, the same query is run again to generate a new response, and visibility changes between pre- and post-optimization are measured.

The core assumption of this approach is that the generative engine references similar sources for the same query. In practice, responses to the same query can vary depending on execution time, model version, and user context, so this assumption has limited robustness. The authors conducted multiple repeated experiments to control for this, though this does not fully resolve the benchmark’s fundamental limitations.
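The pre/post measurement loop can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `run_engine` stands in for a live generative engine API (which would see the published content variant), and `score` stands in for a visibility metric such as PAWC. Averaging over repeated runs mirrors the authors' approach to controlling response variability.

```python
def measure_optimization_effect(query, source_url, contents, run_engine, score, runs=3):
    """Compare a source's visibility before and after content optimization.

    contents: dict with "before" and "after" content variants.
    run_engine(query, content) -> generated response text (hypothetical stub
    for the live engine, which would retrieve the published variant).
    score(response, source_url) -> visibility score, e.g. PAWC.
    Repeated runs average out run-to-run response variability.
    """
    def avg(content):
        return sum(score(run_engine(query, content), source_url)
                   for _ in range(runs)) / runs

    return avg(contents["after"]) - avg(contents["before"])
```

A positive return value indicates the optimization improved the source's measured visibility for that query.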

Visibility Measurement Metric System

Aggarwal et al. proposed three visibility metrics. These metrics follow a progression from simple to sophisticated.

| Metric | Definition | Characteristics |
| --- | --- | --- |
| Word Count | Number of words derived from a specific source within the response | Simplest; ignores position information |
| Position-Adjusted Word Count (PAWC) | Word count weighted by position within the response | Higher weight for earlier mentions; key metric |
| Impression Count | Number of times a source is cited/mentioned in the response | Frequency-based; measures exposure count rather than depth |

The Design Logic of PAWC

PAWC (Position-Adjusted Word Count) is the paper’s most important methodological contribution. Rather than simple mention count (Word Count), it incorporates position-based weighting within the response.

The design rationale is rooted in user attention distribution. Just as top results in traditional SERPs receive more clicks, sources mentioned at the beginning of generative responses are assumed to leave a stronger impression on users. Being mentioned in the first paragraph versus the last paragraph differs in impact on the user.

PAWC measures not “how often something is mentioned” but “how favorably positioned the mention is.” This shift in perspective is the core of GEO measurement.

PAWC’s weighting function is a decaying function that assigns higher scores to earlier positions in the response. The shape of this function — whether linear decay, exponential decay, or logarithmic decay — can affect measurement results. The authors chose a specific decay function, but did not present the empirical data (such as user eye-tracking data on generative responses) that would justify this choice. This represents both a methodological limitation of PAWC and a point requiring follow-up research.
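A minimal sketch makes the contrast between Word Count and PAWC concrete. The exponential decay below is an assumption chosen for illustration; the paper's exact weighting function is not reproduced here, and `spans` (word-index ranges attributed to one source) is a simplified representation of source attribution.

```python
import math

def word_count(spans):
    """Plain Word Count: total words attributed to the source."""
    return sum(end - start for start, end in spans)

def pawc(spans, decay=0.005):
    """Position-Adjusted Word Count sketch.

    spans: (start, end) word-index ranges in the response attributed
    to one source. Each word at index i contributes exp(-decay * i),
    so earlier words count for more. The decay shape and rate are
    illustrative assumptions, not the paper's published function.
    """
    return sum(math.exp(-decay * i)
               for start, end in spans
               for i in range(start, end))
```

With this weighting, a 20-word mention at the start of a response scores higher than the same 20 words at the end, even though their Word Count is identical.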

Nine Optimization Strategies and Experimental Results

The experimental core of the paper is a comparison of how various content optimization strategies affect visibility in generative engine responses. The authors tested a total of nine strategies.

Strategy Descriptions

| # | Strategy | Description |
| --- | --- | --- |
| 1 | Cite Sources | Explicitly cite references for claims |
| 2 | Add Statistics | Add quantitative data, figures, and statistics to the body text |
| 3 | Include Quotations | Include direct quotes from experts and research findings |
| 4 | Fluency Optimization | Improve sentence readability and grammatical naturalness |
| 5 | Technical Terms | Use domain-specific terminology appropriately |
| 6 | Authoritative Tone | Write in an expert, authoritative voice |
| 7 | Keyword Stuffing | Repeatedly insert target keywords |
| 8 | Simple Language | Simplify with easy vocabulary and short sentences |
| 9 | Unique Words | Use synonyms and uncommon expressions for vocabulary diversity |

Experimental Results

| Strategy | Visibility Change (PAWC) | Effect Classification |
| --- | --- | --- |
| Cite Sources | High improvement | Effective |
| Add Statistics | High improvement (up to +40%) | Most effective |
| Include Quotations | Moderate improvement | Effective |
| Fluency Optimization | Slight improvement | Limited |
| Technical Terms | Domain-dependent | Conditionally effective |
| Authoritative Tone | Slight improvement | Limited |
| Keyword Stuffing | Not significant / Negative | Ineffective |
| Simple Language | Not significant | Ineffective |
| Unique Words | Not significant | Ineffective |

The most notable result is the clear gap between the top three strategies (Cite Sources, Add Statistics, Include Quotations) and the bottom three (Keyword Stuffing, Simple Language, Unique Words).

Adding statistics yielded up to 40% visibility improvement in generative engine responses. Meanwhile, keyword stuffing showed no significant effect.

Domain-Specific Differences

Not all strategies performed equally across all domains. The authors also reported domain-specific effect differences.

| Domain | Most Effective Strategy | Notes |
| --- | --- | --- |
| Legal/Regulatory | Cite Sources | Authoritative source citations are key |
| Science/Technology | Add Statistics | Quantitative evidence is decisive |
| Medical/Health | Cite Sources + Authoritative Tone | Trust signals operate in combination |
| General Information | Add Statistics + Include Quotations | Specificity is universally effective |

The academic implication is clear: GEO strategies are not a “universal formula” but require adjustment based on domain context. While this pattern is also observed in SEO, domain dependency tends to be stronger in GEO.

What These Results Mean

Three key messages emerge from the experimental results.

First, GEO is not a simple variant of SEO. Keyword stuffing, which was effective in SEO, has no effect in GEO. Generative engines respond to information specificity and verifiability, not keyword density.

Second, “concrete evidence” is the core driver of visibility. The fact that statistics, quotations, and source citations were all commonly effective suggests that generative engines prefer “evidence-backed content.” This aligns with the fact that LLM training data includes substantial volumes of evidence-centric texts such as academic papers, Wikipedia articles, and news stories.

Third, surface-level optimization (sentence polishing, vocabulary changes) has limited practical effect. What matters in GEO is not “how you write” but “what you include.”

```mermaid
flowchart LR
    subgraph "Effective Strategies (Information Specificity)"
        S1["Cite Sources"] --> R1["Source verifiability ↑"]
        S2["Add Statistics"] --> R2["Quantitative evidence ↑"]
        S3["Include Quotations"] --> R3["Expert authority ↑"]
    end
    subgraph "Ineffective Strategies (Surface-Level Changes)"
        S4["Keyword Stuffing"] --> R4["Only keyword density ↑"]
        S5["Simple Language"] --> R5["Expression simplified"]
        S6["Unique Words"] --> R6["Vocabulary diversity"]
    end
    R1 --> V["Visibility improvement"]
    R2 --> V
    R3 --> V
    R4 --> X["No effect"]
    R5 --> X
    R6 --> X
```

Limitations of Paper 1

GEO-Bench has limitations along several dimensions.

Temporal stability unverified. GEO-Bench is a snapshot at a specific point in time. When the generative engine’s model is updated, the effectiveness of the same strategies may change. There is no longitudinal verification of whether strategies effective in 2024 remain valid in 2025.

Limited engine scope. The experiments were confined to BingChat/Copilot. Whether the same results are reproducible on other generative engines such as ChatGPT, Perplexity, or Google AI Overviews has not been confirmed.

PAWC weighting function not validated. There is no independent verification that PAWC’s positional weights reflect actual user attention distribution. Since it was designed by extrapolating from SEO’s CTR (Click-Through Rate) distribution, the metric’s validity may weaken if user behavior patterns unique to generative responses differ.

Causal mechanisms unexplained. The observation that “adding statistics increases visibility” exists, but there is no causal analysis of “why generative engines cite content with statistics more often.”

Paper 2: Empirical Behavior Analysis (Chen et al., 2025)

Chen et al.’s research goes a step further on the framework that Aggarwal et al. established. It moves beyond “do GEO strategies work?” to ask “what types of sources does AI search actually prefer, and how does this differ across engines?”

Research Methodology

Chen et al. conducted a large-scale empirical analysis. The key methodology is summarized below.

| Item | Details |
| --- | --- |
| Number of responses analyzed | Approximately 30,000 |
| Target engines | ChatGPT, Perplexity, Google AI Overviews |
| Analysis dimensions | Type (domain), frequency, and inter-engine differences of cited sources |
| Comparison baseline | Differences from traditional Google search results |
| Query types | Information-seeking queries across diverse domains |

The scale of 30,000 responses is the largest among GEO-related empirical studies. The analysis classified the domain type of sources cited by each engine and compared the distribution to source distributions in traditional Google search results.

Earned Media Bias: The Key Finding

The most notable finding is that AI search engines systematically prefer third-party authoritative sources (earned media) over brand-owned channels (owned media).

Owned Media vs Earned Media

| Type | Definition | Examples |
| --- | --- | --- |
| Owned Media | Channels directly owned/operated by the brand | Corporate websites, brand blogs, company apps |
| Earned Media | Content voluntarily created by third parties | News articles, review sites, forum discussions, Wikipedia |
| Paid Media | Content exposed through paid placement | Advertisements, sponsored content, paid listings |

According to Chen et al.’s analysis, the majority of sources cited in AI search engine responses were of the earned media type. Review sites, news articles, forum discussions, and Wikipedia were significantly more likely to be cited than brand official websites.

Traditional Google search exposed owned media and earned media in relatively balanced proportions. This bias in AI search means that the shift from SEO to GEO demands not just a technical change but a structural transformation of content strategy.
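The owned/earned split used in this kind of citation analysis can be approximated from the cited URL's domain. The sketch below is an illustrative simplification, not Chen et al.'s method: `EARNED_HINTS` is a tiny hypothetical seed list (a real study would use a far larger taxonomy), and paid media cannot be detected from the URL alone.

```python
from urllib.parse import urlparse

# Illustrative seed list of typical earned-media domains (assumption,
# not from the paper); real analyses need a much broader taxonomy.
EARNED_HINTS = {"wikipedia.org", "reddit.com", "nytimes.com", "trustpilot.com"}

def classify_source(url, brand_domains):
    """Rough owned/earned classification of a cited source URL.

    brand_domains: set of domains the brand controls (owned media).
    Any third-party domain is treated as earned media here; paid
    placements are indistinguishable from the URL by itself.
    """
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if any(host == d or host.endswith("." + d) for d in brand_domains):
        return "owned"
    return "earned"
```

Applied over a corpus of cited URLs, counting the two labels yields the owned-vs-earned distribution that the study compares against traditional search results.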

Domain Distribution Analysis

Chen et al. systematically classified the domain types of cited sources. Key findings are summarized below.

| Source Type | Citation Frequency in AI Search | Compared to Traditional Google Search |
| --- | --- | --- |
| News/Media outlets | High | Similar or slightly higher |
| Review/Comparison sites | Very high | Considerably higher |
| Wikipedia/Encyclopedias | High | Similar |
| Forums/Communities (Reddit, etc.) | Medium to high | Considerably higher |
| Academic/Research institutions | Medium | Similar |
| Brand official sites | Low | Considerably lower |
| Personal blogs | Low | Slightly lower |

The notably high citation frequency of review/comparison sites and forums/communities compared to traditional Google search is particularly significant. Meanwhile, brand official site citation frequency was lower compared to traditional search.

Regarding possible causes of this pattern, Chen et al. did not present direct causal analysis but proposed several hypotheses: it could result from generative engines referencing more earned media in their training data; it could be that third-party sources receive higher relevance scores during the retrieval stage of the RAG (Retrieval-Augmented Generation) pipeline; or it could be that LLMs, designed to produce “neutral and comprehensive responses,” naturally prefer third-party perspectives.

Engine-Specific Citation Pattern Comparison

Another important contribution from Chen et al. is the systematic identification of differences across AI search engines.

```mermaid
flowchart TB
    Q["Same query input"] --> E1["ChatGPT"]
    Q --> E2["Perplexity"]
    Q --> E3["Google AI Overviews"]
    E1 --> R1["Response + cited sources A"]
    E2 --> R2["Response + cited sources B"]
    E3 --> R3["Response + cited sources C"]
    R1 --> D["Citation pattern comparison"]
    R2 --> D
    R3 --> D
    D --> F1["Source type distribution differences"]
    D --> F2["Freshness sensitivity differences"]
    D --> F3["Language-based citation variation"]
    D --> F4["Query phrasing sensitivity differences"]
```

Three Sensitivity Dimensions

Chen et al. systematically analyzed engine-specific sensitivity differences across three variables.

Freshness. The speed and extent to which each engine reflects recent information varies. Perplexity showed a tendency to cite recent content more quickly, while ChatGPT’s heavy reliance on training data meant slower freshness reflection. Google AI Overviews occupied a middle position through integration with its own search index.

Language. When the same query was entered in English versus a non-English language, the cited sources differed considerably. English queries broadly cited global English-language sources, while non-English queries showed increased reliance on local-language sources, with an overall reduction in source diversity.

Query Phrasing. When the same search intent was expressed in different phrasings, response consistency varied across engines. Some engines were sensitive to query phrasing, citing different sources for queries with similar intent, while others were more robust at identifying intent and cited similar sources regardless of phrasing variations.

| Sensitivity Variable | ChatGPT | Perplexity | Google AI Overviews |
| --- | --- | --- | --- |
| Freshness | Low (training data dependent) | High (real-time search) | Medium (index-based) |
| Language | Medium | Medium to high | High (leverages local index) |
| Query Phrasing | High (phrasing sensitive) | Medium | Low (robust intent detection) |

The practical implication of these results is clear: GEO visibility measured from a single engine, single language, and single query can distort the actual exposure state. For those serious about GEO, visibility measurement across multiple engines, multiple languages, and multiple query phrasings is necessary.
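A multi-dimensional measurement of this kind can be sketched as a grid over engines, languages, and query phrasings. Everything here is a hypothetical harness: `run_query` stands in for each engine's API, and `mentioned` for whatever brand-detection logic the measurement uses.

```python
from itertools import product

def visibility_grid(phrasings, engines, languages, run_query, mentioned):
    """Measure brand visibility across engines x languages x phrasings.

    run_query(engine, query, lang) -> response text (stub for the
    live engine APIs); mentioned(response) -> True if the brand is
    cited. Returns per-cell hit flags plus the overall mention rate,
    so single-engine blind spots become visible.
    """
    grid = {}
    for engine, lang, query in product(engines, languages, phrasings):
        grid[(engine, lang, query)] = mentioned(run_query(engine, query, lang))
    rate = sum(grid.values()) / len(grid)
    return grid, rate
```

A brand that looks well-covered on one engine but scores zero cells on another is exactly the distortion the single-engine measurement warning refers to.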

Limitations of Paper 2

The limitations of Chen et al.’s research are also evident.

Peer review incomplete. It is at the preprint stage and has not undergone formal peer review. Academic community verification of the methodology and conclusions has not yet occurred.

No causal analysis. The earned media bias was observed, but no causal analysis was presented as to whether it stems from training data composition, retrieval pipeline ranking logic, or prompt design. Correlation alone makes it difficult to determine optimization strategy direction.

Domain concentration. The analyzed queries are concentrated in specific domains, requiring caution when generalizing to other verticals such as B2B, SaaS, or healthcare.

Snapshot analysis limitations. Despite the scale of 30K responses, this is an analysis at a specific point in time. Since AI engines are continuously updated, there is no guarantee that results from the analysis period remain current.

Cross-Analysis of the Two Papers

Complementary Structure

Both papers are valuable when read independently, but a more complete picture emerges when read together. Aggarwal et al. established the framework for “what is GEO and how do we measure it,” and Chen et al. provided empirical evidence for “how does AI search actually differ from traditional search in its behavior.”

| Comparison Dimension | Aggarwal et al. (2024) | Chen et al. (2025) |
| --- | --- | --- |
| Core question | How to define and measure GEO | How AI search differs from traditional search |
| Contribution type | Concept definition + benchmark + metrics | Empirical analysis + cross-engine comparison |
| Methodology | Benchmark construction + controlled optimization experiments | Large-scale response collection + comparative analysis |
| Key outputs | GEO-Bench, PAWC, 9-strategy effect analysis | Earned media bias, 3 sensitivity dimensions |
| Analysis scale | Domain-specific query sets | ~30,000 responses |
| Target engines | BingChat/Copilot | ChatGPT, Perplexity, Google AI Overviews |
| Academic status | Formally published at KDD 2024 | Preprint |
| Practical applicability | Medium (provides strategic direction) | High (identifies engine-specific differences) |

What the Two Papers Say Together

Connecting Aggarwal et al.’s finding that “adding statistics” and “citing sources” are effective with Chen et al.’s finding that “AI search prefers earned media” reveals a single coherent pattern.

Generative engines prefer “content that is verifiable, specific, and written from a third-party perspective.” This represents a qualitatively different paradigm from the keyword optimization, brand emphasis, and domain authority that were central to SEO.

This pattern can be summarized as follows:

| SEO Paradigm | GEO Paradigm |
| --- | --- |
| Keyword density optimization | Information specificity optimization |
| Building domain authority for owned sites | Securing mentions on third-party channels |
| SERP ranking = performance | Mention position/frequency in responses = performance |
| Single-engine (Google) optimization | Multi-engine optimization |
| Static rankings | Dynamic/probabilistic mentions |

```mermaid
flowchart TB
    subgraph "Aggarwal et al. Contributions"
        A1["GEO definition"] --> A2["GEO-Bench"]
        A2 --> A3["PAWC metric"]
        A3 --> A4["Strategy effect measurement"]
        A4 --> A5["Concrete evidence = effective"]
    end
    subgraph "Chen et al. Contributions"
        B1["30K response analysis"] --> B2["Earned media bias"]
        B2 --> B3["Engine-specific differences"]
        B3 --> B4["Multi-engine measurement essential"]
    end
    A5 --> C["Integrated implications"]
    B4 --> C
    C --> D1["Verifiable content first"]
    C --> D2["Third-party channel strategy essential"]
    C --> D3["Single-engine optimization is insufficient"]
```

What the Literature Still Lacks

Despite these two papers laying the groundwork for GEO, significant gaps remain in the current literature. These gaps must be filled for GEO to mature from an academic concept into an actionable strategic framework.

1. Absence of Longitudinal Studies

Both studies are snapshot analyses at specific points in time. When generative engine models are updated, the effectiveness of the same strategies may change. There are no longitudinal studies tracking how GEO strategy effectiveness changes when models shift from GPT-4 to GPT-4o, or from Gemini 1.5 to 2.0.

In SEO, the impact of Google’s algorithm updates (Panda, Penguin, BERT, etc.) on SEO strategies has been cumulatively studied. A similar longitudinal analysis is needed for GEO, but none has been conducted yet.

2. Insufficient User Behavior Models

PAWC’s positional weights are extrapolated from SEO’s CTR distribution. However, there is no guarantee that how users read generative responses is the same as how they scan SERPs. The pattern of reading free-form text may differ from the pattern of scanning a list of links.

Validating this requires eye-tracking research on generative responses. Empirical data on which parts of AI responses users actually pay attention to, and which brand mentions they notice and remember, must be obtained before PAWC’s weighting function can be validated.

3. Causal Mechanisms Unexplained

Research to date shows “what works” but does not explain “why it works.”

  • Why is content containing statistical data cited more often?
  • Is the earned media bias caused by training data bias, RAG pipeline ranking logic, or a combination of both?
  • At which stage of the generative engine (retrieval, reranking, generation) are source selections determined?

Without understanding causal mechanisms, strategies can only rely on empirical observation, and strategy effectiveness cannot be predicted following engine updates.

4. Domain Generalization

The query sets of both studies are concentrated in specific domains. Whether the same strategies are reproducible in domains with strong domain-specific characteristics — such as B2B SaaS, technical infrastructure, healthcare, or finance — has not been verified.

B2B domains in particular have queries of a fundamentally different nature from B2C. “Best CRM software recommendations” and “enterprise data pipeline architecture comparison” may have different source types and distributions referenced by generative engines. Domain-specific replication studies are needed.

5. ROI Linkage Models

There is almost no research on the pathway from GEO visibility improvement to actual business outcomes. “Being mentioned 40% more in AI responses” does not mean “actual traffic increases by 40%.”

Research is absent on the conversion rate at each stage of the funnel — GEO visibility, brand awareness, website visits, conversions (purchases, signups) — and on GEO’s contribution to overall marketing ROI. Without this linkage model, it is difficult to justify investment in GEO.

6. No Consideration of Multimodal Responses

Both papers analyze only text-based responses. However, current generative engines already produce responses that include images, charts, code blocks, and video references. A methodology for measuring visibility in such multimodal responses has not yet been proposed.

Practical Implications

Caution is needed when directly applying academic findings to practice, as differences exist between research environments and real operating environments. Nevertheless, the direction derived from both papers is clear.

Shift content strategy direction. The strategy must shift from keyword-centric to “evidence density”-centric. Statistics, citations, and explicit source attribution have a stronger effect in GEO than in SEO.

Growing importance of earned media acquisition. Optimizing only your own website is insufficient for securing visibility in GEO. A strategy for securing mentions on third-party channels — review sites, news outlets, forums, Wikipedia — is essential.

Multi-engine monitoring. Visibility must be monitored separately on each major generative engine: ChatGPT, Perplexity, Google AI Overviews, and others. Optimization on a single engine may not transfer to others.

Need for measurement framework construction. Currently, few commercial tools can continuously measure GEO visibility. Even internally, it is necessary to build a system that periodically collects generative engine responses to key queries and tracks the frequency and position of mentions for your brand and competitors.
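The per-response tracking such an internal system needs can start very small: count brand mentions and record how early the first one appears. The function below is a minimal sketch under that assumption; substring matching on lowercased words is a crude proxy for real entity detection.

```python
def mention_stats(response, brand):
    """Frequency and earliest relative position of a brand in one response.

    Position is the word index of the first mention divided by response
    length, so lower means earlier (more favorable under a PAWC-style
    view). Returns (count, position); position is None when absent.
    Naive substring matching is an illustrative simplification.
    """
    words = response.lower().split()
    hits = [i for i, w in enumerate(words) if brand.lower() in w]
    if not hits:
        return 0, None
    return len(hits), hits[0] / len(words)
```

Running this periodically over collected responses for key queries, for both your brand and competitors, yields the mention-frequency and mention-position time series the section describes.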


References

  • Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024).
  • Chen, Y. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Working Paper.