GEO Paper Review: Definition and Foundational Frameworks


MJ · 14 min read

Review of the KDD 2024 GEO paper and Chen et al. 2025. Covers the academic definition of GEO, the PAWC visibility metric, and AI search's preference for earned media.

Scope of This Review

This post provides an academic review of two papers that form the foundation of GEO (Generative Engine Optimization).

  1. Aggarwal, P. et al. (2024), “GEO: Generative Engine Optimization” — formally published at KDD 2024
  2. Chen, Y. et al. (2025), “Generative Engine Optimization: How to Dominate AI Search” — preprint

The term “GEO” has been spreading rapidly across the industry, but the actual academic base remains thin. These two papers serve as the twin pillars of that thin foundation. Aggarwal et al. established the academic definition of GEO along with its benchmark and measurement metrics, while Chen et al. provided empirical data on how AI search structurally differs from traditional search in its behavior.

This post covers the background, a detailed analysis of the primary paper (Aggarwal et al.), a detailed analysis of the secondary paper (Chen et al.), a cross-analysis of the two, and the gaps remaining in current literature along with their practical implications.

Background: From SEO to GEO

A Structural Shift in Search Paradigms

Traditional search engines operate on the “ten blue links” model. Users enter a query, and a ranked list of indexed web pages appears. SEO (Search Engine Optimization) is the strategy for climbing those rankings, and its success criteria are clear: position on the SERP (Search Engine Results Page) is the performance metric.

Generative search engines fundamentally change this structure. Systems like ChatGPT, Perplexity, Google AI Overviews, and Bing Copilot produce free-form synthesized responses to user queries. They aggregate information from multiple sources and present it as a single unified text. In this structure, the very concept of “ranking” becomes ambiguous.

In traditional search, the competition was “who makes it to page one.” In generative search, the competition is “whose information does the AI mention when generating its response.” When the object of measurement changes, the optimization strategy must change too.

The Starting Point for GEO Research

Before 2024, academic research on GEO was virtually nonexistent. The industry had been discussing “AI search optimization” as a concept, but no systematic framework existed for what to measure or which strategies were effective. Aggarwal et al.’s KDD 2024 paper was the first academic attempt to fill this gap.

```mermaid
flowchart TB
    subgraph "Pre-Research State"
        A["Only SEO frameworks existed"] --> B["SERP ranking-based measurement"]
        B --> C["Not applicable to generative search"]
    end
    subgraph "Aggarwal et al. 2024 Contributions"
        D["Defined the GEO concept"] --> E["GEO-Bench benchmark"]
        E --> F["Proposed PAWC visibility metric"]
        F --> G["Tested 9 optimization strategies"]
    end
    subgraph "Chen et al. 2025 Contributions"
        H["Empirical analysis of 30K responses"] --> I["Discovered earned media bias"]
        I --> J["Identified engine-specific sensitivity differences"]
    end
    C -->|"Academic gap"| D
    G -->|"Empirical work on top of framework"| H
```

Paper 1: The Academic Starting Point for GEO (Aggarwal et al., KDD 2024)

Aggarwal et al.’s paper is a field-defining paper in that it was the first to formally define GEO as an academic concept and propose a systematic benchmark and measurement metrics. Its acceptance at KDD 2024 itself reflects the academic community’s recognition of the research’s contribution.

Research Design Overview

The research design consists of three stages:

  1. Benchmark construction: Collected queries across diverse domains and corresponding generative engine responses to build GEO-Bench
  2. Metric design: Proposed a metric system for quantifying content visibility in generative responses (Word Count, Position-Adjusted Word Count, Impression Count)
  3. Optimization experiments: Applied nine content optimization strategies and measured visibility changes

GEO-Bench is the first systematic benchmark for measuring content visibility in generative search engine responses. Traditional SEO had a clear measurement standard in SERP rankings. Generative engine responses, however, are free-form text, which required redefining the very concept of “visibility” from scratch.

Dataset Composition

The GEO-Bench query set was designed to cover diverse domains. The authors classified queries by domain and collected the content of source websites referenced by the generative engine for each query.

| Component | Details |
| --- | --- |
| Query source | Based on real user search queries |
| Domain scope | Multiple domains including law, medicine, technology, education, finance |
| Response collection target | BingChat/Copilot-based generative engine |
| Collection period | Snapshot at specific points in 2023–2024 |
| Included data | Queries, generated response text, cited/referenced source URLs, source content |

Benchmarking Approach

GEO-Bench’s benchmarking approach works as follows: for each query, the generative engine produces a response, and the system tracks how much and in what position each source website is mentioned within that response. After modifying (optimizing) source content, the same query is run again to generate a new response, and visibility changes between pre- and post-optimization are measured.

The core assumption of this approach is that the generative engine references similar sources for the same query. In practice, responses to the same query can vary depending on execution time, model version, and user context, so this assumption has limited robustness. The authors conducted multiple repeated experiments to control for this, though this does not fully resolve the benchmark’s fundamental limitations.
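The pre/post measurement loop can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `run_engine` stands in for a live generative engine API (which would see the published content variant), and `score` stands in for a visibility metric such as PAWC. Averaging over repeated runs mirrors the authors' approach to controlling response variability.

```python
def measure_optimization_effect(query, source_url, contents, run_engine, score, runs=3):
    """Compare a source's visibility before and after content optimization.

    contents: dict with "before" and "after" content variants.
    run_engine(query, content) -> generated response text (hypothetical stub
    for the live engine, which would retrieve the published variant).
    score(response, source_url) -> visibility score, e.g. PAWC.
    Repeated runs average out run-to-run response variability.
    """
    def avg(content):
        return sum(score(run_engine(query, content), source_url)
                   for _ in range(runs)) / runs

    return avg(contents["after"]) - avg(contents["before"])
```

A positive return value indicates the optimization improved the source's measured visibility for that query.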

Visibility Measurement Metric System

Aggarwal et al. proposed three visibility metrics. These metrics follow a progression from simple to sophisticated.

| Metric | Definition | Characteristics |
| --- | --- | --- |
| Word Count | Number of words derived from a specific source within the response | Simplest; ignores position information |
| Position-Adjusted Word Count (PAWC) | Word count weighted by position within the response | Higher weight for earlier mentions; key metric |
| Impression Count | Number of times a source is cited/mentioned in the response | Frequency-based; measures exposure count rather than depth |

The Design Logic of PAWC

PAWC (Position-Adjusted Word Count) is the paper’s most important methodological contribution. Rather than simple mention count (Word Count), it incorporates position-based weighting within the response.

The design rationale is rooted in user attention distribution. Just as top results in traditional SERPs receive more clicks, sources mentioned at the beginning of generative responses are assumed to leave a stronger impression on users. Being mentioned in the first paragraph versus the last paragraph differs in impact on the user.

PAWC measures not “how often something is mentioned” but “how favorably positioned the mention is.” This shift in perspective is the core of GEO measurement.

PAWC’s weighting function is a decaying function that assigns higher scores to earlier positions in the response. The shape of this function — whether linear decay, exponential decay, or logarithmic decay — can affect measurement results. The authors chose a specific decay function, but did not present the empirical data (such as user eye-tracking data on generative responses) that would justify this choice. This represents both a methodological limitation of PAWC and a point requiring follow-up research.
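A minimal sketch makes the contrast between Word Count and PAWC concrete. The exponential decay below is an assumption chosen for illustration; the paper's exact weighting function is not reproduced here, and `spans` (word-index ranges attributed to one source) is a simplified representation of source attribution.

```python
import math

def word_count(spans):
    """Plain Word Count: total words attributed to the source."""
    return sum(end - start for start, end in spans)

def pawc(spans, decay=0.005):
    """Position-Adjusted Word Count sketch.

    spans: (start, end) word-index ranges in the response attributed
    to one source. Each word at index i contributes exp(-decay * i),
    so earlier words count for more. The decay shape and rate are
    illustrative assumptions, not the paper's published function.
    """
    return sum(math.exp(-decay * i)
               for start, end in spans
               for i in range(start, end))
```

With this weighting, a 20-word mention at the start of a response scores higher than the same 20 words at the end, even though their Word Count is identical.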

Nine Optimization Strategies and Experimental Results

The experimental core of the paper is a comparison of how various content optimization strategies affect visibility in generative engine responses. The authors tested a total of nine strategies.

Strategy Descriptions

| # | Strategy | Description |
| --- | --- | --- |
| 1 | Cite Sources | Explicitly cite references for claims |
| 2 | Add Statistics | Add quantitative data, figures, and statistics to the body text |
| 3 | Include Quotations | Include direct quotes from experts and research findings |
| 4 | Fluency Optimization | Improve sentence readability and grammatical naturalness |
| 5 | Technical Terms | Use domain-specific terminology appropriately |
| 6 | Authoritative Tone | Write in an expert, authoritative voice |
| 7 | Keyword Stuffing | Repeatedly insert target keywords |
| 8 | Simple Language | Simplify with easy vocabulary and short sentences |
| 9 | Unique Words | Use synonyms and uncommon expressions for vocabulary diversity |

Experimental Results

| Strategy | Visibility Change (PAWC) | Effect Classification |
| --- | --- | --- |
| Cite Sources | High improvement | Effective |
| Add Statistics | High improvement (up to +40%) | Most effective |
| Include Quotations | Moderate improvement | Effective |
| Fluency Optimization | Slight improvement | Limited |
| Technical Terms | Domain-dependent | Conditionally effective |
| Authoritative Tone | Slight improvement | Limited |
| Keyword Stuffing | Not significant / Negative | Ineffective |
| Simple Language | Not significant | Ineffective |
| Unique Words | Not significant | Ineffective |

The most notable result is the clear gap between the top three strategies (Cite Sources, Add Statistics, Include Quotations) and the bottom three (Keyword Stuffing, Simple Language, Unique Words).

Adding statistics yielded up to 40% visibility improvement in generative engine responses. Meanwhile, keyword stuffing showed no significant effect.

Domain-Specific Differences

Not all strategies performed equally across all domains. The authors also reported domain-specific effect differences.

| Domain | Most Effective Strategy | Notes |
| --- | --- | --- |
| Legal/Regulatory | Cite Sources | Authoritative source citations are key |
| Science/Technology | Add Statistics | Quantitative evidence is decisive |
| Medical/Health | Cite Sources + Authoritative Tone | Trust signals operate in combination |
| General Information | Add Statistics + Include Quotations | Specificity is universally effective |

The academic implication is clear: GEO strategies are not a “universal formula” but require adjustment based on domain context. While this pattern is also observed in SEO, domain dependency tends to be stronger in GEO.

What These Results Mean

Three key messages emerge from the experimental results.

First, GEO is not a simple variant of SEO. Keyword stuffing, which was effective in SEO, has no effect in GEO. Generative engines respond to information specificity and verifiability, not keyword density.

Second, “concrete evidence” is the core driver of visibility. The fact that statistics, quotations, and source citations were all commonly effective suggests that generative engines prefer “evidence-backed content.” This aligns with the fact that LLM training data includes substantial volumes of evidence-centric texts such as academic papers, Wikipedia articles, and news stories.

Third, surface-level optimization (sentence polishing, vocabulary changes) has limited practical effect. What matters in GEO is not “how you write” but “what you include.”

```mermaid
flowchart LR
    subgraph "Effective Strategies (Information Specificity)"
        S1["Cite Sources"] --> R1["Source verifiability ↑"]
        S2["Add Statistics"] --> R2["Quantitative evidence ↑"]
        S3["Include Quotations"] --> R3["Expert authority ↑"]
    end
    subgraph "Ineffective Strategies (Surface-Level Changes)"
        S4["Keyword Stuffing"] --> R4["Only keyword density ↑"]
        S5["Simple Language"] --> R5["Expression simplified"]
        S6["Unique Words"] --> R6["Vocabulary diversity"]
    end
    R1 --> V["Visibility improvement"]
    R2 --> V
    R3 --> V
    R4 --> X["No effect"]
    R5 --> X
    R6 --> X
```

Limitations of Paper 1

GEO-Bench has limitations along several dimensions.

Temporal stability unverified. GEO-Bench is a snapshot at a specific point in time. When the generative engine’s model is updated, the effectiveness of the same strategies may change. There is no longitudinal verification of whether strategies effective in 2024 remain valid in 2025.

Limited engine scope. The experiments were confined to BingChat/Copilot. Whether the same results are reproducible on other generative engines such as ChatGPT, Perplexity, or Google AI Overviews has not been confirmed.

PAWC weighting function not validated. There is no independent verification that PAWC’s positional weights reflect actual user attention distribution. Since it was designed by extrapolating from SEO’s CTR (Click-Through Rate) distribution, the metric’s validity may weaken if user behavior patterns unique to generative responses differ.

Causal mechanisms unexplained. The observation that “adding statistics increases visibility” exists, but there is no causal analysis of “why generative engines cite content with statistics more often.”

Paper 2: Empirical Behavior Analysis (Chen et al., 2025)

Chen et al.’s research goes a step further on the framework that Aggarwal et al. established. It moves beyond “do GEO strategies work?” to ask “what types of sources does AI search actually prefer, and how does this differ across engines?”

Research Methodology

Chen et al. conducted a large-scale empirical analysis. The key methodology is summarized below.

| Item | Details |
| --- | --- |
| Number of responses analyzed | Approximately 30,000 |
| Target engines | ChatGPT, Perplexity, Google AI Overviews |
| Analysis dimensions | Type (domain), frequency, and inter-engine differences of cited sources |
| Comparison baseline | Differences from traditional Google search results |
| Query types | Information-seeking queries across diverse domains |

The scale of 30,000 responses is the largest among GEO-related empirical studies. The analysis classified the domain type of sources cited by each engine and compared the distribution to source distributions in traditional Google search results.

Earned Media Bias: The Key Finding

The most notable finding is that AI search engines systematically prefer third-party authoritative sources (earned media) over brand-owned channels (owned media).

Owned Media vs Earned Media

| Type | Definition | Examples |
| --- | --- | --- |
| Owned Media | Channels directly owned/operated by the brand | Corporate websites, brand blogs, company apps |
| Earned Media | Content voluntarily created by third parties | News articles, review sites, forum discussions, Wikipedia |
| Paid Media | Content exposed through paid placement | Advertisements, sponsored content, paid listings |

According to Chen et al.’s analysis, the majority of sources cited in AI search engine responses were of the earned media type. Review sites, news articles, forum discussions, and Wikipedia were significantly more likely to be cited than brand official websites.

Traditional Google search exposed owned media and earned media in relatively balanced proportions. This bias in AI search means that the shift from SEO to GEO demands not just a technical change but a structural transformation of content strategy.
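The owned/earned split used in this kind of citation analysis can be approximated from the cited URL's domain. The sketch below is an illustrative simplification, not Chen et al.'s method: `EARNED_HINTS` is a tiny hypothetical seed list (a real study would use a far larger taxonomy), and paid media cannot be detected from the URL alone.

```python
from urllib.parse import urlparse

# Illustrative seed list of typical earned-media domains (assumption,
# not from the paper); real analyses need a much broader taxonomy.
EARNED_HINTS = {"wikipedia.org", "reddit.com", "nytimes.com", "trustpilot.com"}

def classify_source(url, brand_domains):
    """Rough owned/earned classification of a cited source URL.

    brand_domains: set of domains the brand controls (owned media).
    Any third-party domain is treated as earned media here; paid
    placements are indistinguishable from the URL by itself.
    """
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if any(host == d or host.endswith("." + d) for d in brand_domains):
        return "owned"
    return "earned"
```

Applied over a corpus of cited URLs, counting the two labels yields the owned-vs-earned distribution that the study compares against traditional search results.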

Domain Distribution Analysis

Chen et al. systematically classified the domain types of cited sources. Key findings are summarized below.

| Source Type | Citation Frequency in AI Search | Compared to Traditional Google Search |
| --- | --- | --- |
| News/Media outlets | High | Similar or slightly higher |
| Review/Comparison sites | Very high | Considerably higher |
| Wikipedia/Encyclopedias | High | Similar |
| Forums/Communities (Reddit, etc.) | Medium to high | Considerably higher |
| Academic/Research institutions | Medium | Similar |
| Brand official sites | Low | Considerably lower |
| Personal blogs | Low | Slightly lower |

The notably high citation frequency of review/comparison sites and forums/communities compared to traditional Google search is particularly significant. Meanwhile, brand official site citation frequency was lower compared to traditional search.

Regarding possible causes of this pattern, Chen et al. did not present direct causal analysis but proposed several hypotheses: it could result from generative engines referencing more earned media in their training data; it could be that third-party sources receive higher relevance scores during the retrieval stage of the RAG (Retrieval-Augmented Generation) pipeline; or it could be that LLMs, designed to produce “neutral and comprehensive responses,” naturally prefer third-party perspectives.

Engine-Specific Citation Pattern Comparison

Another important contribution from Chen et al. is the systematic identification of differences across AI search engines.

```mermaid
flowchart TB
    Q["Same query input"] --> E1["ChatGPT"]
    Q --> E2["Perplexity"]
    Q --> E3["Google AI Overviews"]
    E1 --> R1["Response + cited sources A"]
    E2 --> R2["Response + cited sources B"]
    E3 --> R3["Response + cited sources C"]
    R1 --> D["Citation pattern comparison"]
    R2 --> D
    R3 --> D
    D --> F1["Source type distribution differences"]
    D --> F2["Freshness sensitivity differences"]
    D --> F3["Language-based citation variation"]
    D --> F4["Query phrasing sensitivity differences"]
```

Three Sensitivity Dimensions

Chen et al. systematically analyzed engine-specific sensitivity differences across three variables.

Freshness. The speed and extent to which each engine reflects recent information varies. Perplexity showed a tendency to cite recent content more quickly, while ChatGPT’s heavy reliance on training data meant slower freshness reflection. Google AI Overviews occupied a middle position through integration with its own search index.

Language. When the same query was entered in English versus a non-English language, the cited sources differed considerably. English queries broadly cited global English-language sources, while non-English queries showed increased reliance on local-language sources, with an overall reduction in source diversity.

Query Phrasing. When the same search intent was expressed in different phrasings, response consistency varied across engines. Some engines were sensitive to query phrasing, citing different sources for queries with similar intent, while others were more robust at identifying intent and cited similar sources regardless of phrasing variations.

| Sensitivity Variable | ChatGPT | Perplexity | Google AI Overviews |
| --- | --- | --- | --- |
| Freshness | Low (training data dependent) | High (real-time search) | Medium (index-based) |
| Language | Medium | Medium to high | High (leverages local index) |
| Query Phrasing | High (phrasing sensitive) | Medium | Low (robust intent detection) |

The practical implication of these results is clear: GEO visibility measured from a single engine, single language, and single query can distort the actual exposure state. For those serious about GEO, visibility measurement across multiple engines, multiple languages, and multiple query phrasings is necessary.
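A multi-dimensional measurement of this kind can be sketched as a grid over engines, languages, and query phrasings. Everything here is a hypothetical harness: `run_query` stands in for each engine's API, and `mentioned` for whatever brand-detection logic the measurement uses.

```python
from itertools import product

def visibility_grid(phrasings, engines, languages, run_query, mentioned):
    """Measure brand visibility across engines x languages x phrasings.

    run_query(engine, query, lang) -> response text (stub for the
    live engine APIs); mentioned(response) -> True if the brand is
    cited. Returns per-cell hit flags plus the overall mention rate,
    so single-engine blind spots become visible.
    """
    grid = {}
    for engine, lang, query in product(engines, languages, phrasings):
        grid[(engine, lang, query)] = mentioned(run_query(engine, query, lang))
    rate = sum(grid.values()) / len(grid)
    return grid, rate
```

A brand that looks well-covered on one engine but scores zero cells on another is exactly the distortion the single-engine measurement warning refers to.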

Limitations of Paper 2

The limitations of Chen et al.’s research are also evident.

Peer review incomplete. It is at the preprint stage and has not undergone formal peer review. Academic community verification of the methodology and conclusions has not yet occurred.

No causal analysis. The earned media bias was observed, but no causal analysis was presented as to whether it stems from training data composition, retrieval pipeline ranking logic, or prompt design. Correlation alone makes it difficult to determine optimization strategy direction.

Domain concentration. The analyzed queries are concentrated in specific domains, requiring caution when generalizing to other verticals such as B2B, SaaS, or healthcare.

Snapshot analysis limitations. Despite the scale of 30K responses, this is an analysis at a specific point in time. Since AI engines are continuously updated, there is no guarantee that results from the analysis period remain current.

Cross-Analysis of the Two Papers

Complementary Structure

Both papers are valuable when read independently, but a more complete picture emerges when read together. Aggarwal et al. established the framework for “what is GEO and how do we measure it,” and Chen et al. provided empirical evidence for “how does AI search actually differ from traditional search in its behavior.”

| Comparison Dimension | Aggarwal et al. (2024) | Chen et al. (2025) |
| --- | --- | --- |
| Core question | How to define and measure GEO | How AI search differs from traditional search |
| Contribution type | Concept definition + benchmark + metrics | Empirical analysis + cross-engine comparison |
| Methodology | Benchmark construction + controlled optimization experiments | Large-scale response collection + comparative analysis |
| Key outputs | GEO-Bench, PAWC, 9-strategy effect analysis | Earned media bias, 3 sensitivity dimensions |
| Analysis scale | Domain-specific query sets | ~30,000 responses |
| Target engines | BingChat/Copilot | ChatGPT, Perplexity, Google AI Overviews |
| Academic status | Formally published at KDD 2024 | Preprint |
| Practical applicability | Medium (provides strategic direction) | High (identifies engine-specific differences) |

What the Two Papers Say Together

Connecting Aggarwal et al.’s finding that “adding statistics” and “citing sources” are effective with Chen et al.’s finding that “AI search prefers earned media” reveals a single coherent pattern.

Generative engines prefer “content that is verifiable, specific, and written from a third-party perspective.” This represents a qualitatively different paradigm from the keyword optimization, brand emphasis, and domain authority that were central to SEO.

This pattern can be summarized as follows:

| SEO Paradigm | GEO Paradigm |
| --- | --- |
| Keyword density optimization | Information specificity optimization |
| Building domain authority for owned sites | Securing mentions on third-party channels |
| SERP ranking = performance | Mention position/frequency in responses = performance |
| Single-engine (Google) optimization | Multi-engine optimization |
| Static rankings | Dynamic/probabilistic mentions |

```mermaid
flowchart TB
    subgraph "Aggarwal et al. Contributions"
        A1["GEO definition"] --> A2["GEO-Bench"]
        A2 --> A3["PAWC metric"]
        A3 --> A4["Strategy effect measurement"]
        A4 --> A5["Concrete evidence = effective"]
    end
    subgraph "Chen et al. Contributions"
        B1["30K response analysis"] --> B2["Earned media bias"]
        B2 --> B3["Engine-specific differences"]
        B3 --> B4["Multi-engine measurement essential"]
    end
    A5 --> C["Integrated implications"]
    B4 --> C
    C --> D1["Verifiable content first"]
    C --> D2["Third-party channel strategy essential"]
    C --> D3["Single-engine optimization is insufficient"]
```

What the Literature Still Lacks

Despite these two papers laying the groundwork for GEO, significant gaps remain in the current literature. These gaps must be filled for GEO to mature from an academic concept into an actionable strategic framework.

1. Absence of Longitudinal Studies

Both studies are snapshot analyses at specific points in time. When generative engine models are updated, the effectiveness of the same strategies may change. There are no longitudinal studies tracking how GEO strategy effectiveness changes when models shift from GPT-4 to GPT-4o, or from Gemini 1.5 to 2.0.

In SEO, the impact of Google’s algorithm updates (Panda, Penguin, BERT, etc.) on SEO strategies has been cumulatively studied. A similar longitudinal analysis is needed for GEO, but none has been conducted yet.

2. Insufficient User Behavior Models

PAWC’s positional weights are extrapolated from SEO’s CTR distribution. However, there is no guarantee that how users read generative responses is the same as how they scan SERPs. The pattern of reading free-form text may differ from the pattern of scanning a list of links.

Validating this requires eye-tracking research on generative responses. Empirical data on which parts of AI responses users actually pay attention to, and which brand mentions they notice and remember, must be obtained before PAWC’s weighting function can be validated.

3. Causal Mechanisms Unexplained

Research to date shows “what works” but does not explain “why it works.”

  • Why is content containing statistical data cited more often?
  • Is the earned media bias caused by training data bias, RAG pipeline ranking logic, or a combination of both?
  • At which stage of the generative engine (retrieval, reranking, generation) are source selections determined?

Without understanding causal mechanisms, strategies can only rely on empirical observation, and strategy effectiveness cannot be predicted following engine updates.

4. Domain Generalization

The query sets of both studies are concentrated in specific domains. Whether the same strategies are reproducible in domains with strong domain-specific characteristics — such as B2B SaaS, technical infrastructure, healthcare, or finance — has not been verified.

B2B domains in particular have queries of a fundamentally different nature from B2C. “Best CRM software recommendations” and “enterprise data pipeline architecture comparison” may have different source types and distributions referenced by generative engines. Domain-specific replication studies are needed.

5. ROI Linkage Models

There is almost no research on the pathway from GEO visibility improvement to actual business outcomes. “Being mentioned 40% more in AI responses” does not mean “actual traffic increases by 40%.”

Research is absent on the conversion rate at each stage of the funnel — GEO visibility, brand awareness, website visits, conversions (purchases, signups) — and on GEO’s contribution to overall marketing ROI. Without this linkage model, it is difficult to justify investment in GEO.

6. No Consideration of Multimodal Responses

Both papers analyze only text-based responses. However, current generative engines already produce responses that include images, charts, code blocks, and video references. A methodology for measuring visibility in such multimodal responses has not yet been proposed.

Practical Implications

Caution is needed when directly applying academic findings to practice, as differences exist between research environments and real operating environments. Nevertheless, the direction derived from both papers is clear.

Shift content strategy direction. The strategy must shift from keyword-centric to “evidence density”-centric. Statistics, citations, and explicit source attribution have a stronger effect in GEO than in SEO.

Growing importance of earned media acquisition. Optimizing only your own website is insufficient for securing visibility in GEO. A strategy for securing mentions on third-party channels — review sites, news outlets, forums, Wikipedia — is essential.

Multi-engine monitoring. Visibility must be monitored separately on each major generative engine: ChatGPT, Perplexity, Google AI Overviews, and others. Optimization on a single engine may not transfer to others.

Need for measurement framework construction. Currently, few commercial tools can continuously measure GEO visibility. Even internally, it is necessary to build a system that periodically collects generative engine responses to key queries and tracks the frequency and position of mentions for your brand and competitors.
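The per-response tracking such an internal system needs can start very small: count brand mentions and record how early the first one appears. The function below is a minimal sketch under that assumption; substring matching on lowercased words is a crude proxy for real entity detection.

```python
def mention_stats(response, brand):
    """Frequency and earliest relative position of a brand in one response.

    Position is the word index of the first mention divided by response
    length, so lower means earlier (more favorable under a PAWC-style
    view). Returns (count, position); position is None when absent.
    Naive substring matching is an illustrative simplification.
    """
    words = response.lower().split()
    hits = [i for i, w in enumerate(words) if brand.lower() in w]
    if not hits:
        return 0, None
    return len(hits), hits[0] / len(words)
```

Running this periodically over collected responses for key queries, for both your brand and competitors, yields the mention-frequency and mention-position time series the section describes.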


References

  • Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024).
  • Chen, Y. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Working Paper.