GEO Paper Review: Evaluation Systems and Manipulation Risks

MJ · 13 min read

Review of SAGEO Arena and CORE papers, analyzing the need for integrated GEO evaluation frameworks and the vulnerability of AI search rankings (91.4% Top-5 manipulation success rate).

Scope of This Review

This post reviews two papers that demonstrate how the GEO (Generative Engine Optimization) field is expanding beyond its initial definition phase into evaluation framework construction and security risk analysis.

  1. Kim et al. (2026), “SAGEO Arena” — preprint
  2. Jin et al. (2026), “CORE: Controlling Output Rankings” — preprint (submitted to ICLR 2026)

The former challenges the limitations of existing GEO evaluation methodology and proposes a unified pipeline evaluation approach. The latter empirically demonstrates the possibility of adversarial ranking manipulation in AI responses. This review covers the academic contributions, limitations, and implications for GEO field maturation of both papers.

While the papers reviewed previously — Aggarwal et al. (2024, KDD) and Chen et al. (2025) — defined “what GEO is,” and Wu et al.’s AutoGEO and Bagga et al.’s E-GEO addressed “how to optimize,” these two papers occupy the position of “how to measure properly, and what risks exist.” Academic fields typically mature along the trajectory of definition, measurement, and then security, and GEO is on this trajectory.

Full Context of the GEO Research Flow

The positioning of all papers reviewed so far can be visualized as follows:

```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Definition"]
        A["Aggarwal et al. (2024)\nGEO concept definition\nGEO-Bench, PAWC"]
        B["Chen et al. (2025)\nEmpirical behavior analysis\nEarned media bias"]
    end
    subgraph Phase2["Phase 2: Optimization"]
        C["Wu et al. (2025)\nAutoGEO\nQuality-preserving optimization"]
        D["Bagga et al. (2025)\nE-GEO\nE-commerce vertical"]
    end
    subgraph Phase3["Phase 3: Measurement + Security"]
        E["Kim et al. (2026)\nSAGEO Arena\nUnified evaluation framework"]
        F["Jin et al. (2026)\nCORE\nRanking manipulation demonstration"]
    end
    Phase1 --> Phase2 --> Phase3
```

What is notable in this flow is that the two Phase 3 papers illuminate the same problem from different directions. SAGEO Arena asks “is our current evaluation sufficient?” while CORE asks “is the current system safe?” These questions are superficially independent but fundamentally converge on the same concern: the trustworthiness of the GEO ecosystem.


Paper 1: The Entire Pipeline Must Be Evaluated — SAGEO Arena

Kim et al.’s SAGEO Arena starts from a critique of existing approaches that treat GEO and SEO as separate optimization domains. The paper’s core claim is simple yet far-reaching: GEO evaluation that only measures the generation stage is structurally incomplete.

The Problem: Blind Spots in Existing GEO Evaluation

Previous GEO research — including Aggarwal et al. (2024) covered in earlier reviews — primarily focused on content visibility at the generation stage. GEO-Bench’s PAWC metric, Chen et al.’s earned media analysis, and AutoGEO’s cooperative GEO all measured “how much a specific source is included in the AI-generated response.”

However, real AI search systems are not single-stage. They consist of a multi-stage pipeline: retrieval, reranking, and generation. Here the fundamental problem arises: content optimized for the generation stage can never appear in AI responses if it fails to make it into the retrieval candidate pool in the first place.

Breaking this problem down concretely:

| Pipeline Stage | Role | Coverage by Prior GEO Research |
| --- | --- | --- |
| Retrieval | Extracts query-relevant candidates from a large document pool | Almost none |
| Reranking | Reorders candidates by relevance, authority, and freshness | Indirect (implied in earned media analysis) |
| Generation | Generates the final response text and cites sources | Most research concentrated here |

SAGEO Arena’s core argument originates from the asymmetry in this table. The vast majority of GEO research has focused only on the Generation stage, but the first gate for practical visibility is Retrieval.

SAGEO Arena Architecture

The authors built a unified evaluation environment covering the entire retrieval-reranking-generation pipeline. It measures SEO (search exposure) and GEO (generative response exposure) within the same framework.

```mermaid
flowchart TD
    Q["User query"] --> R["Retrieval stage"]
    R -->|"n candidate documents"| RR["Reranking stage"]
    RR -->|"Top k documents"| G["Generation stage"]
    G --> A["AI response + citations"]

    subgraph SAGEO["SAGEO Arena evaluation scope"]
        R
        RR
        G
    end

    subgraph SEO_Metrics["SEO Metrics"]
        RM1["Retrieval Hit Rate"]
        RM2["Average Retrieval Rank"]
        RM3["Reranking Position"]
    end

    subgraph GEO_Metrics["GEO Metrics"]
        GM1["Citation Inclusion"]
        GM2["Position in Response"]
        GM3["Attribution Quality"]
    end

    R --> SEO_Metrics
    RR --> SEO_Metrics
    G --> GEO_Metrics
```

The significance of this design is that it redefines GEO from “content optimization” to pipeline optimization. It enables decomposed measurement of which content attributes operate at which pipeline stage.
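To make the pipeline framing concrete, here is a deliberately minimal sketch of the retrieval, reranking, and generation stages that SAGEO Arena evaluates end to end. Everything here is an illustrative assumption (toy bag-of-words "embeddings," a hand-picked relevance/authority blend, a placeholder generator), not the paper's implementation.

```python
# Toy sketch of a retrieval -> reranking -> generation pipeline.
# All scoring rules are illustrative assumptions, not SAGEO Arena's code.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Stand-in for a dense embedder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: dict[str, str], n: int = 3) -> list[str]:
    """Stage 1: pull the top-n candidates from the full document pool."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)[:n]

def rerank(query: str, candidates: list[str], docs: dict[str, str],
           authority: dict[str, float], k: int = 2) -> list[str]:
    """Stage 2: reorder candidates by relevance blended with an authority prior."""
    q = embed(query)

    def score(d: str) -> float:
        return 0.7 * cosine(q, embed(docs[d])) + 0.3 * authority.get(d, 0.0)

    return sorted(candidates, key=score, reverse=True)[:k]

def generate(query: str, top_k: list[str]) -> str:
    """Stage 3 placeholder: a real system would call an LLM here."""
    return f"Answer to '{query}' citing: {', '.join(top_k)}"

docs = {
    "a": "schema markup improves retrieval visibility",
    "b": "cooking pasta with fresh tomatoes",
    "c": "structured data and metadata drive retrieval visibility gains",
}
authority = {"a": 0.2, "c": 0.9}
query = "structured data retrieval visibility"
candidates = retrieve(query, docs)
cited = rerank(query, candidates, docs, authority)
response = generate(query, cited)
```

The point of the sketch is structural: document "b" can never be cited, no matter how well written, because it is filtered out before the generation stage ever sees it.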

Methodology Details

SAGEO Arena’s experimental design consists of the following elements:

| Experimental Element | Details |
| --- | --- |
| Evaluation target | Simultaneous measurement of SEO and GEO metrics |
| Pipeline configuration | Retrieval (embedding-based search) → Reranking (cross-encoder) → Generation (LLM) |
| Optimization variables | Structured information (metadata, schema markup), body text, citation density, etc. |
| SEO metrics | Retrieval Hit Rate, Average Retrieval Rank |
| GEO metrics | Citation Inclusion Rate, Position-weighted Visibility |
| Query set | Information-seeking and compound queries across multiple domains |

Notably, the authors isolated optimization variables individually for experimentation. This is a design for causally analyzing “which optimization has an effect at which stage.” Prior GEO research has rarely attempted this level of stage-by-stage decomposition analysis.
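The stage-level metrics above are straightforward to compute once per-query logs exist. The sketch below shows one plausible way to derive Retrieval Hit Rate, Average Retrieval Rank, and Citation Inclusion Rate from hypothetical logs; the record format is my own assumption, since the paper's actual schema is not described here.

```python
# Stage-level SEO/GEO metrics computed from hypothetical per-query logs.
# The log format is an assumption made for illustration.

def retrieval_hit_rate(records: list[dict], target: str) -> float:
    """Fraction of queries where the target document entered the candidate pool."""
    hits = sum(1 for r in records if target in r["retrieved"])
    return hits / len(records)

def avg_retrieval_rank(records: list[dict], target: str) -> float:
    """Mean 1-based rank of the target among retrieved candidates (hits only)."""
    ranks = [r["retrieved"].index(target) + 1
             for r in records if target in r["retrieved"]]
    return sum(ranks) / len(ranks) if ranks else float("inf")

def citation_inclusion_rate(records: list[dict], target: str) -> float:
    """Fraction of queries where the generated answer actually cited the target."""
    cited = sum(1 for r in records if target in r["citations"])
    return cited / len(records)

logs = [
    {"retrieved": ["d1", "ours", "d3"], "citations": ["ours"]},
    {"retrieved": ["d2", "d4", "ours"], "citations": ["d2"]},
    {"retrieved": ["d5", "d6", "d7"],   "citations": ["d5"]},  # retrieval miss
    {"retrieved": ["ours", "d1", "d2"], "citations": ["ours", "d1"]},
]
hit = retrieval_hit_rate(logs, "ours")          # 3/4 = 0.75
rank = avg_retrieval_rank(logs, "ours")         # (2 + 3 + 1) / 3 = 2.0
cite_rate = citation_inclusion_rate(logs, "ours")  # 2/4 = 0.5
```

Separating the metrics this way is exactly what enables the stage-by-stage attribution the authors aim for: a low hit rate and a high citation rate point to retrieval, not content quality, as the bottleneck.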

SEO and GEO Metric Integration

Another dimension of SAGEO Arena’s contribution is presenting a structure that simultaneously measures SEO and GEO metrics within a single framework. Previously, these two domains were almost completely separated.

| Aspect | Traditional SEO Evaluation | Traditional GEO Evaluation | SAGEO Unified Evaluation |
| --- | --- | --- | --- |
| Measurement target | SERP rankings, click rates | Citation frequency in AI responses | Entire pipeline |
| Perspective | Search engine crawling | Generative model output | Both simultaneously |
| Optimization strategy | Technical SEO + content | Content structure + authority | Stage-by-stage decomposition |
| Limitation | Does not reflect AI search | Ignores retrieval stage | Generalizability of experimental environment |

This integration matters practically because in the real world, SEO and GEO apply to the same content. A single web page can appear in Google SERPs and also be cited in AI responses from Perplexity or ChatGPT Search. Optimizing the two metrics separately risks improvements on one side causing degradation on the other, and SAGEO Arena provides the tooling to make this tradeoff visible.
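That tradeoff only becomes visible when both metrics are scored jointly. The sketch below is a hypothetical illustration, with weights and numbers invented for the example: a content rewrite improves the GEO metric while degrading the SEO metric, and only the joint view catches the net regression.

```python
# Hypothetical joint SEO+GEO scoring. Weights and example values are
# invented for illustration; they are not from SAGEO Arena.

def joint_score(seo_visibility: float, geo_citation: float,
                w_seo: float = 0.5, w_geo: float = 0.5) -> float:
    """Both inputs normalized to [0, 1]; higher is better."""
    return w_seo * seo_visibility + w_geo * geo_citation

before = {"seo": 0.80, "geo": 0.40}
after = {"seo": 0.55, "geo": 0.60}  # rewrite helped citations, hurt retrieval

geo_improved = after["geo"] > before["geo"]
joint_regressed = (joint_score(after["seo"], after["geo"])
                   < joint_score(before["seo"], before["geo"]))
```

A team tracking only the GEO metric would ship this change; a team tracking the joint score would not.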

Key Results

| Optimization Target | Retrieval Hit Rate Change | Avg. Retrieval Rank Change | Generation Citation Rate Change |
| --- | --- | --- | --- |
| Structured information optimization | +22% | +2.72 | Significant improvement |
| Body text optimization alone | Negligible | Negligible | Some improvement |
| Citation density optimization | Minor | Minor | Significant improvement |
| Structured info + body text combined | Maximum | Maximum | Maximum |

The most notable result is that optimizing structured information — metadata, schema markup, structured data — improved retrieval hit rate by 22% and improved average retrieval rank by 2.72 positions. Meanwhile, optimizing body text alone showed no significant effect at the retrieval stage.

Key finding: Structured information optimization shows the greatest effect at the retrieval stage (+22% hit rate), while body text optimization only has partial effect at the generation stage. GEO strategies limited to body text optimization are structurally insufficient.

This result is intuitively interpretable. The retrieval stage of AI search systems relies on embedding-based similarity search and metadata filtering, while qualitative improvements to body content only take effect after passing through this stage. Structured information is the critical signal that gets content through the retrieval “gate,” while body text influences citation decisions during the generation stage.

Retrieval Improvement Mechanisms Analyzed

It is worth examining more specifically through which mechanisms the +22% retrieval hit rate improvement occurs.

The retrieval stage of AI search systems generally operates through two pathways:

  1. Dense retrieval: Embeds queries and documents in the same vector space and calculates similarity. In this case, structured metadata (title, description, schema markup) directly affects embedding quality.
  2. Sparse retrieval + metadata filtering: Combines traditional search methods like BM25 with metadata-based filters. Documents rich in structured data are more likely to be included at the filtering stage.
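A toy illustration of both pathways (my own example, not from the paper): structured metadata adds query-relevant terms to the text representation used for similarity scoring, and it supplies explicit fields that a metadata filter can match on. The documents, the JSON-LD snippet, and the crude shared-term score below are all hypothetical.

```python
# Why structured metadata can help on both retrieval pathways (toy example).
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercased alphanumeric tokens, ignoring punctuation and quoting."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, text: str) -> int:
    """Crude stand-in for dense similarity: count of shared distinct terms."""
    return len(tokens(query) & tokens(text))

body = "Our guide walks through installing the toolkit step by step."
meta = '{"@type": "HowTo", "name": "toolkit installation guide"}'  # JSON-LD-style

query = "toolkit installation guide"
score_body_only = overlap_score(query, body)          # misses "installation"
score_with_meta = overlap_score(query, body + " " + meta)

# Pathway 2: a metadata filter can only keep documents that expose the field.
def passes_filter(document_meta: dict, required_type: str) -> bool:
    return document_meta.get("@type") == required_type

has_schema = passes_filter({"@type": "HowTo"}, "HowTo")
no_schema = passes_filter({}, "HowTo")
```

On pathway 1 the metadata closes a vocabulary gap ("installing" in the body vs. "installation" in the query); on pathway 2 the document without a `@type` field is simply excluded, regardless of body quality.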

SAGEO Arena’s results suggest that structured information operates favorably in both pathways. This is structurally the same pattern as the importance of schema markup and metadata optimization in traditional technical SEO, empirically confirming that this principle remains valid in the AI search era.

Limitations

The paper is at the preprint stage, and whether the pipeline configuration used represents the architecture of all commercial AI search engines has not been verified. In particular, Google AI Overview, Perplexity, and ChatGPT Search each likely use different retrieval-generation pipelines.

Specific limitations are summarized below:

| Limitation | Detail | Follow-up Research Needed |
| --- | --- | --- |
| Pipeline representativeness | Unverified whether experimental pipeline sufficiently represents commercial systems | Replication experiments with multiple pipeline configurations |
| Structured info type decomposition | "Structured information" scope is broad; needs detailed analysis of which types contribute most | Separate effect isolation for schema markup, meta tags, Open Graph, etc. |
| Cross-engine comparison | Single-pipeline experiment cannot capture inter-engine differences | Multi-engine comparison experiments |
| Temporal stability | Point-in-time results; changes from model updates not measured | Longitudinal study |
| Query type scope | Primarily information-seeking queries; commercial, local, and other intent types not included | Integration with vertical benchmarks like E-GEO |

Paper 2: Can Rankings Be Manipulated? — CORE

Jin et al.’s CORE illuminates the dark side of GEO. If optimization is possible, is malicious manipulation also possible? This question must be raised in every optimization domain, and GEO is no exception.

Ethical Positioning of the Research

First, the nature of this paper must be clarified. CORE is a vulnerability disclosure, not a manipulation guide. This follows the standard approach of AI security research.

In cybersecurity, “responsible disclosure” — the practice of publicly reporting vulnerabilities to promote defense — is well established. Recent AI security research such as prompt injection studies (Perez & Ribeiro, 2022) and data poisoning studies (Carlini et al., 2023) follows the same frame. The fact that CORE was submitted to ICLR 2026 itself reflects the academic community’s recognition of this as legitimate security research.

Ethical framing: CORE’s 91.4% success rate should be read not as “here’s how to manipulate” but as “this is how vulnerable current systems are.”

Research Design

This paper systematically experiments whether AI search response rankings can be deliberately manipulated from an adversarial perspective. The research design consists of building an attack type taxonomy and measuring each type’s effectiveness.

Attack Vector Taxonomy

The attack vectors presented by CORE are classified as follows:

```mermaid
flowchart TD
    A["AI Search Ranking Manipulation\n(Output Ranking Manipulation)"]
    A --> B["Content-level attacks"]
    A --> C["Metadata-level attacks"]
    A --> D["Cross-channel attacks"]

    B --> B1["Authority signal injection"]
    B --> B2["Citation network manipulation"]
    B --> B3["Keyword-structure hybrid"]

    C --> C1["Schema markup abuse"]
    C --> C2["Meta tag inflation"]

    D --> D1["Multi-source consistency attack"]
    D --> D2["Synthetic backlinks"]
```

The characteristics and success rates of each attack vector are summarized below:

| Attack Vector | Description | Top-5 Promotion Success Rate | Top-10 Promotion Success Rate |
| --- | --- | --- | --- |
| Authority signal injection | Insert statistics, citations, expert opinions to make content appear authoritative | High | Very high |
| Citation network manipulation | Induce multiple external sources to reference the target content | High | High |
| Keyword-structure hybrid | Combine semantic keyword optimization with structured formatting | Medium | High |
| Schema markup abuse | Insert fabricated structured data into markup | Medium | Medium |
| Multi-source consistency | Repeat identical claims across multiple channels to generate a consensus signal | Very high | Very high |
| Combined attack | Use a combination of the above vectors | 91.4% | 96.2% |

Key Results and Their Implications

The headline number reported by CORE is a Top-5 ranking promotion success rate of 91.4%. This means that attempts to place specific content within the top 5 positions of an AI response succeeded 91.4% of the time.

Key numbers: Under combined attack conditions (multiple vectors combined), Top-5 promotion success rate 91.4%, Top-10 promotion success rate 96.2%. These are results under controlled experimental conditions, but they suggest that current AI search systems have remarkably low robustness.

The implications of these numbers need to be analyzed in contrast with SEO’s history.

| Comparison Dimension | Google Search (2024 baseline) | AI Search Systems (CORE experiment) |
| --- | --- | --- |
| Anti-spam defense history | 20+ years of accumulated defense mechanisms | Early stage |
| Manipulation success rate | Even sophisticated Black Hat SEO has limited success | 91.4% (combined attack) |
| Detection mechanisms | SpamBrain, manual actions, algorithmic penalties | Mostly not built |
| Normative framework | White Hat / Black Hat distinction established | No standards exist |
| Industry self-regulation | Quality Raters guidelines | None |

This contrasts with traditional Google search, which has developed defense mechanisms against spam, link farms, and click manipulation over decades. AI search systems are being launched as commercial services with virtually no such defense history.

What 91.4% Means for the Industry

Interpreting this number from an industry perspective reveals problems on three dimensions.

First, user trust. AI search is growing on the expectation of “providing more accurate and unbiased answers.” However, if ranking manipulation is this easy, the premise that top results in AI responses are necessarily the most relevant sources collapses. This undermines the foundation of user trust.

Second, market fairness. Access to manipulation techniques is not equal. Only actors with technical capability and resources can attempt manipulation, which creates structural disadvantages for smaller content producers. The pattern from SEO, where large corporations invested massive budgets in link building to push small sites out, risks repeating in GEO.

Third, information ecosystem contamination. In a circular structure where AI search feeds into the training data of other AI systems, manipulated rankings could be reflected in the training of next-generation models. This could lead to long-term information ecosystem distortion beyond single-point manipulation.

The Need for Defense Mechanisms

Jin et al. present not only attack demonstrations but also defense directions. Defense mechanisms proposed or implied in the paper can be structured as follows:

```mermaid
flowchart TD
    D["Defense Mechanisms\n(Defense Layers)"]
    D --> L1["Layer 1: Input Validation"]
    D --> L2["Layer 2: Pipeline Hardening"]
    D --> L3["Layer 3: Output Monitoring"]
    D --> L4["Layer 4: Normative Framework"]

    L1 --> L1a["Content authenticity verification"]
    L1 --> L1b["Metadata consistency checks"]
    L1 --> L1c["Source trustworthiness scoring"]

    L2 --> L2a["Adversarial training"]
    L2 --> L2b["Multi-signal cross-validation"]
    L2 --> L2c["Ranking anomaly detection"]

    L3 --> L3a["Response consistency monitoring"]
    L3 --> L3b["Time-based ranking fluctuation tracking"]
    L3 --> L3c["User feedback integration"]

    L4 --> L4a["White Hat GEO criteria definition"]
    L4 --> L4b["Manipulation detection transparency reports"]
    L4 --> L4c["Industry self-regulation"]
```

The role and current implementation status of each defense layer:

| Defense Layer | Role | Current Status | Implementation Difficulty |
| --- | --- | --- | --- |
| Input validation | Verify content and metadata authenticity before indexing | Basic level | Medium |
| Pipeline hardening | Ensure robustness against adversarial inputs at each retrieval/reranking/generation stage | Virtually nonexistent | High |
| Output monitoring | Detect anomalies in ranking patterns of generated responses | Partial | Medium |
| Normative framework | Define industry standards drawing the line between optimization and manipulation | Absent | High (non-technical factors) |
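As a sense of what the output-monitoring layer could look like, here is a minimal sketch of ranking anomaly detection: flag a source whose position in AI responses jumps far outside its historical range. The z-score rule, the threshold, and the example data are my own illustrative choices, not mechanisms described in the paper.

```python
# Toy sketch of output-layer "ranking anomaly detection": flag a source
# whose new rank deviates sharply from its historical distribution.
# Rule and threshold are illustrative assumptions.
from statistics import mean, stdev

def rank_anomaly(history: list[int], new_rank: int,
                 z_threshold: float = 3.0) -> bool:
    """True if new_rank deviates from the historical mean by > z_threshold sigmas."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_rank != mu
    return abs(new_rank - mu) / sigma > z_threshold

# A source that has hovered around rank 8-10 suddenly appearing at rank 1
# is the kind of jump a combined attack could produce.
history = [9, 8, 10, 9, 8, 9, 10, 8]
suspicious = rank_anomaly(history, 1)
normal = rank_anomaly(history, 9)
```

A production system would need far more than this (per-query baselines, seasonality, legitimate ranking changes from fresh content), but even this crude check illustrates that the monitoring layer is implementable with signals platforms already have.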

Intersection with AI Security Research

CORE’s research structurally connects with existing AI security research agendas.

| AI Security Agenda | Definition | GEO Counterpart |
| --- | --- | --- |
| Prompt injection | Insert malicious instructions into an LLM to induce unintended behavior | Insert text in content that induces the LLM to cite that source |
| Data poisoning | Inject malicious samples into training data to alter model behavior | Deploy manipulated information across indexable web content |
| Adversarial examples | Subtly modify inputs to disrupt model classification/output | Fine-tune content to disrupt ranking algorithms |
| Model extraction | Observe model behavior to reverse-engineer internal logic | Observe ranking fluctuation patterns to reverse-engineer ranking algorithms |

These correspondences suggest that GEO security is not a standalone new problem but can be addressed by extending existing AI security frameworks. However, GEO-specific characteristics — multi-stage pipeline, an open input space of web content, real-time impact on commercial services — introduce additional complexity.

Limitations

A gap may exist between the specific pipeline configuration of the experimental environment and reproducibility on commercial systems. Additionally, the 91.4% figure is a result under controlled experimental conditions, and actual commercial systems may have additional filtering layers. However, verifying whether such defenses are sufficient is itself a future research task.

Additionally, the paper’s attack scenarios are static. Real-world manipulation attempts are dynamic processes that adapt and evolve in response to defense mechanisms. This arms race dynamic is beyond the scope of this paper but is a topic that must be addressed in future research.


Cross-Analysis: Evaluation Integrity and Manipulation Risk

Placing both papers side by side reveals not just a comparison but a unified problem: GEO evaluation framework integrity and ranking manipulation risk are two sides of the same coin.

Comparison Frame

| Dimension | Kim et al. — SAGEO Arena | Jin et al. — CORE |
| --- | --- | --- |
| Perspective | Evaluation methodology | Security/adversarial analysis |
| Core question | Is current evaluation sufficient? | Is the current system safe? |
| Core contribution | Full-pipeline evaluation framework | Ranking manipulation vulnerability demonstration |
| Key numbers | Retrieval hit rate +22% | Top-5 promotion 91.4% success |
| GEO redefinition | Content optimization → pipeline optimization | Optimization → includes security |
| Methodology | Unified evaluation environment + variable isolation experiments | Attack taxonomy + success rate measurement |
| Practical audience | GEO strategists, evaluation researchers | Platform security teams, policy makers |
| Status | Preprint | Preprint / ICLR 2026 submission |

Structural Connection: Interdependence of Evaluation and Security

SAGEO Arena’s pipeline evaluation structure and CORE’s attack vector analysis map to each other with striking precision. At the exact point where SAGEO Arena discovers “structured information is decisive at the retrieval stage,” CORE’s “schema markup abuse” attack operates.

This correspondence visualized:

| Pipeline Stage | SAGEO Finding | CORE Attack Vector | Implication |
| --- | --- | --- | --- |
| Retrieval | Structured info +22% effect | Schema markup abuse, metadata manipulation | The most effective optimization point is also the most vulnerable attack point |
| Reranking | Multi-signal combination determines ranking | Multi-source consistency attack, synthetic backlinks | Vulnerable to attacks that synthesize trust signals |
| Generation | Content structure and citations affect citation rate | Authority signal injection, citation network manipulation | Generation model overly reliant on authority signals |

This correspondence reveals a fundamental dilemma: The information needed for effective GEO (structured data, authority signals, citations) is the same information exploited for manipulation. This is structurally the same problem as in SEO, where “good SEO” and “Black Hat SEO” use the same technical mechanisms and differ only in intent.

The Need for an Integrated Framework

The conclusion drawn from cross-analyzing both papers is clear: for the GEO ecosystem to mature healthily, evaluation and security must be treated not as separate research agendas but within a single integrated framework.

```mermaid
flowchart TD
    subgraph Integrated["Integrated GEO Framework"]
        direction TB
        E["Evaluation\nSAGEO Arena direction"]
        S["Security\nCORE direction"]
        E <-->|"Interdependent"| S
    end

    subgraph SEO_Layer["Existing SEO Infrastructure"]
        T["Technical SEO"]
        C["Content SEO"]
    end

    subgraph GEO_Layer["GEO Extension"]
        P["Pipeline Optimization"]
        D["Defensive GEO"]
    end

    SEO_Layer --> Integrated
    Integrated --> GEO_Layer
    P --> O["Practical GEO Strategy"]
    D --> O
```

This integrated framework has the following practical implications:

  1. Include a security dimension in evaluation metrics: When measuring GEO performance, robustness against manipulation should be evaluated alongside ranking improvements.
  2. Defensive GEO strategies: A defensive perspective is needed to ensure your content is not displaced by manipulated content.
  3. Platform-content producer collaboration: A cooperative structure is needed where platforms build defense mechanisms and content producers optimize in alignment with them.

Comprehensive Synthesis: Cumulative Insights from 6 GEO Papers

Combining the 4 papers from previous reviews with these 2 reveals the overall picture painted by all 6 GEO papers.

| Paper | Year | Core Contribution | GEO Maturity Stage |
| --- | --- | --- | --- |
| Aggarwal et al. | 2024 | GEO concept definition, GEO-Bench, PAWC | Definition |
| Chen et al. | 2025 | Earned media bias, engine-specific differences | Definition |
| Wu et al. (AutoGEO) | 2025 | Quality-preserving auto-optimization | Optimization |
| Bagga et al. (E-GEO) | 2025 | E-commerce vertical benchmark | Optimization |
| Kim et al. (SAGEO Arena) | 2026 | Unified pipeline evaluation | Measurement |
| Jin et al. (CORE) | 2026 | Ranking manipulation vulnerability demonstration | Security |

The current position of the GEO field as cumulatively demonstrated by these 6 papers:

  • Definition: Established. Consensus is forming on what GEO is and how it differs from SEO.
  • Measurement: Early frameworks exist but are not standardized. Proposals like PAWC and SAGEO exist, but no industry-standard metrics have been adopted.
  • Optimization: Approaches have begun from both general-purpose and vertical sides. The feasibility of quality-preserving optimization is demonstrated, but replication scope is limited.
  • Security: At the problem identification stage. Vulnerabilities are empirically demonstrated, but implementation and effectiveness verification of defense mechanisms have not begun.

The still-unsolved problems — standardized evaluation protocols, cross-platform comparable benchmarks, defense mechanism effectiveness verification, longitudinal studies, and ROI linkage models — are gaps that must be filled in the field’s next phase.


Industry Implications: The Need for Defensive GEO

The most important practical takeaway from combining both papers is the need for the concept of Defensive GEO.

Limitations of Current GEO Strategy

Current GEO strategy discourse mostly focuses on “how to get better exposure in AI search.” This corresponds to the offensive strategy of “how to raise rankings” in SEO. However, CORE’s results show this approach alone is insufficient.

A 91.4% manipulation success rate means competitors can technically push your content down in rankings. In this environment, it becomes a strategic imperative not only to raise your own content but also to avoid being displaced by manipulation attempts.

Components of Defensive GEO

A defensive GEO strategy should include the following elements:

  1. Multi-stage optimization: Following SAGEO Arena’s lesson, optimize across the entire pipeline including structured information, not just body text. This makes it harder for a single attack vector to overturn rankings.
  2. Multi-source consistency: Ensure your information is consistently referenced across diverse, trustworthy third-party sources. This is a legitimate defense against CORE’s “multi-source consistency attack.”
  3. Monitoring systems: Continuously monitor your visibility in AI search responses to detect anomalous ranking fluctuations early.
  4. Content authenticity reinforcement: Focus on content types that are difficult to fabricate — original data, proprietary research findings, verifiable information.

Practical conclusion: GEO strategy must expand from “how to raise visibility” to “how to raise visibility and how to defend it.” SAGEO Arena tells us “what to optimize,” and CORE tells us “what to defend against.”


References

  • Kim, J. et al. (2026). SAGEO Arena. Preprint.
  • Jin, Z. et al. (2026). CORE: Controlling Output Rankings. Preprint / ICLR 2026 Submission.
  • Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD 2024.
  • Chen, Y. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Working Paper.
  • Wu, Z. et al. (2025). AutoGEO. Preprint.
  • Bagga, N. et al. (2025). E-GEO. Preprint.