GEO Paper Review: Evaluation Systems and Manipulation Risks

MJ · 13 min read

Review of SAGEO Arena and CORE papers, analyzing the need for integrated GEO evaluation frameworks and the vulnerability of AI search rankings (91.4% Top-5 manipulation success rate).

Scope of This Review

This post reviews two papers that demonstrate how the GEO (Generative Engine Optimization) field is expanding beyond its initial definition phase into evaluation framework construction and security risk analysis.

  1. Kim et al. (2026), “SAGEO Arena” — preprint
  2. Jin et al. (2026), “CORE: Controlling Output Rankings” — preprint (submitted to ICLR 2026)

The former challenges the limitations of existing GEO evaluation methodology and proposes a unified pipeline evaluation approach. The latter empirically demonstrates the possibility of adversarial ranking manipulation in AI responses. This review covers the academic contributions, limitations, and implications for GEO field maturation of both papers.

While the papers reviewed previously — Aggarwal et al. (2024, KDD) and Chen et al. (2025) — defined “what GEO is,” and Wu et al.’s AutoGEO and Bagga et al.’s E-GEO addressed “how to optimize,” these two papers occupy the position of “how to measure properly, and what risks exist.” Academic fields typically mature along the trajectory of definition, measurement, and then security, and GEO is on this trajectory.

Full Context of the GEO Research Flow

The positioning of all papers reviewed so far can be visualized as follows:

```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Definition"]
        A["Aggarwal et al. (2024)\nGEO concept definition\nGEO-Bench, PAWC"]
        B["Chen et al. (2025)\nEmpirical behavior analysis\nEarned media bias"]
    end
    subgraph Phase2["Phase 2: Optimization"]
        C["Wu et al. (2025)\nAutoGEO\nQuality-preserving optimization"]
        D["Bagga et al. (2025)\nE-GEO\nE-commerce vertical"]
    end
    subgraph Phase3["Phase 3: Measurement + Security"]
        E["Kim et al. (2026)\nSAGEO Arena\nUnified evaluation framework"]
        F["Jin et al. (2026)\nCORE\nRanking manipulation demonstration"]
    end
    Phase1 --> Phase2 --> Phase3
```

What is notable in this flow is that the two Phase 3 papers illuminate the same problem from different directions. SAGEO Arena asks “is our current evaluation sufficient?” while CORE asks “is the current system safe?” These questions are superficially independent but fundamentally converge on the same concern: the trustworthiness of the GEO ecosystem.


Paper 1: The Entire Pipeline Must Be Evaluated — SAGEO Arena

Kim et al.’s SAGEO Arena starts from a critique of existing approaches that treat GEO and SEO as separate optimization domains. The paper’s core claim is simple yet far-reaching: GEO evaluation that only measures the generation stage is structurally incomplete.

The Problem: Blind Spots in Existing GEO Evaluation

Previous GEO research — including Aggarwal et al. (2024) covered in earlier reviews — primarily focused on content visibility at the generation stage. GEO-Bench’s PAWC metric, Chen et al.’s earned media analysis, and AutoGEO’s cooperative GEO all measured “how much a specific source is included in the AI-generated response.”

However, real AI search systems are not single-stage. They consist of a multi-stage pipeline: retrieval, reranking, and generation. Here the fundamental problem arises: content optimized for the generation stage can never appear in AI responses if it fails to make it into the retrieval candidate pool in the first place.

Breaking this problem down concretely:

| Pipeline Stage | Role | Coverage by Prior GEO Research |
| --- | --- | --- |
| Retrieval | Extracts query-relevant candidates from a large document pool | Almost none |
| Reranking | Reorders candidates by relevance, authority, and freshness | Indirect (implied in earned media analysis) |
| Generation | Generates the final response text and cites sources | Most research concentrated here |

SAGEO Arena’s core argument originates from the asymmetry in this table. The vast majority of GEO research has focused only on the Generation stage, but the first gate for practical visibility is Retrieval.

SAGEO Arena Architecture

The authors built a unified evaluation environment covering the entire retrieval-reranking-generation pipeline. It measures SEO (search exposure) and GEO (generative response exposure) within the same framework.

```mermaid
flowchart TD
    Q["User query"] --> R["Retrieval stage"]
    R -->|"n candidate documents"| RR["Reranking stage"]
    RR -->|"Top k documents"| G["Generation stage"]
    G --> A["AI response + citations"]

    subgraph SAGEO["SAGEO Arena evaluation scope"]
        R
        RR
        G
    end

    subgraph SEO_Metrics["SEO Metrics"]
        RM1["Retrieval Hit Rate"]
        RM2["Average Retrieval Rank"]
        RM3["Reranking Position"]
    end

    subgraph GEO_Metrics["GEO Metrics"]
        GM1["Citation Inclusion"]
        GM2["Position in Response"]
        GM3["Attribution Quality"]
    end

    R --> SEO_Metrics
    RR --> SEO_Metrics
    G --> GEO_Metrics
```

The significance of this design is that it redefines GEO from “content optimization” to pipeline optimization. It enables decomposed measurement of which content attributes operate at which pipeline stage.
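To make the pipeline framing concrete, here is a deliberately minimal sketch of the retrieval, reranking, and generation stages that SAGEO Arena evaluates end to end. Everything here is an illustrative assumption (toy bag-of-words "embeddings," a hand-picked relevance/authority blend, a placeholder generator), not the paper's implementation.

```python
# Toy sketch of a retrieval -> reranking -> generation pipeline.
# All scoring rules are illustrative assumptions, not SAGEO Arena's code.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Stand-in for a dense embedder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: dict[str, str], n: int = 3) -> list[str]:
    """Stage 1: pull the top-n candidates from the full document pool."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)[:n]

def rerank(query: str, candidates: list[str], docs: dict[str, str],
           authority: dict[str, float], k: int = 2) -> list[str]:
    """Stage 2: reorder candidates by relevance blended with an authority prior."""
    q = embed(query)

    def score(d: str) -> float:
        return 0.7 * cosine(q, embed(docs[d])) + 0.3 * authority.get(d, 0.0)

    return sorted(candidates, key=score, reverse=True)[:k]

def generate(query: str, top_k: list[str]) -> str:
    """Stage 3 placeholder: a real system would call an LLM here."""
    return f"Answer to '{query}' citing: {', '.join(top_k)}"

docs = {
    "a": "schema markup improves retrieval visibility",
    "b": "cooking pasta with fresh tomatoes",
    "c": "structured data and metadata drive retrieval visibility gains",
}
authority = {"a": 0.2, "c": 0.9}
query = "structured data retrieval visibility"
candidates = retrieve(query, docs)
cited = rerank(query, candidates, docs, authority)
response = generate(query, cited)
```

The point of the sketch is structural: document "b" can never be cited, no matter how well written, because it is filtered out before the generation stage ever sees it.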

Methodology Details

SAGEO Arena’s experimental design consists of the following elements:

| Experimental Element | Details |
| --- | --- |
| Evaluation target | Simultaneous measurement of SEO and GEO metrics |
| Pipeline configuration | Retrieval (embedding-based search) → Reranking (cross-encoder) → Generation (LLM) |
| Optimization variables | Structured information (metadata, schema markup), body text, citation density, etc. |
| SEO metrics | Retrieval Hit Rate, Average Retrieval Rank |
| GEO metrics | Citation Inclusion Rate, Position-weighted Visibility |
| Query set | Information-seeking and compound queries across multiple domains |

Notably, the authors isolated optimization variables individually for experimentation. This is a design for causally analyzing “which optimization has an effect at which stage.” Prior GEO research has rarely attempted this level of stage-by-stage decomposition analysis.
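The stage-level metrics above are straightforward to compute once per-query logs exist. The sketch below shows one plausible way to derive Retrieval Hit Rate, Average Retrieval Rank, and Citation Inclusion Rate from hypothetical logs; the record format is my own assumption, since the paper's actual schema is not described here.

```python
# Stage-level SEO/GEO metrics computed from hypothetical per-query logs.
# The log format is an assumption made for illustration.

def retrieval_hit_rate(records: list[dict], target: str) -> float:
    """Fraction of queries where the target document entered the candidate pool."""
    hits = sum(1 for r in records if target in r["retrieved"])
    return hits / len(records)

def avg_retrieval_rank(records: list[dict], target: str) -> float:
    """Mean 1-based rank of the target among retrieved candidates (hits only)."""
    ranks = [r["retrieved"].index(target) + 1
             for r in records if target in r["retrieved"]]
    return sum(ranks) / len(ranks) if ranks else float("inf")

def citation_inclusion_rate(records: list[dict], target: str) -> float:
    """Fraction of queries where the generated answer actually cited the target."""
    cited = sum(1 for r in records if target in r["citations"])
    return cited / len(records)

logs = [
    {"retrieved": ["d1", "ours", "d3"], "citations": ["ours"]},
    {"retrieved": ["d2", "d4", "ours"], "citations": ["d2"]},
    {"retrieved": ["d5", "d6", "d7"],   "citations": ["d5"]},  # retrieval miss
    {"retrieved": ["ours", "d1", "d2"], "citations": ["ours", "d1"]},
]
hit = retrieval_hit_rate(logs, "ours")          # 3/4 = 0.75
rank = avg_retrieval_rank(logs, "ours")         # (2 + 3 + 1) / 3 = 2.0
cite_rate = citation_inclusion_rate(logs, "ours")  # 2/4 = 0.5
```

Separating the metrics this way is exactly what enables the stage-by-stage attribution the authors aim for: a low hit rate and a high citation rate point to retrieval, not content quality, as the bottleneck.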

SEO and GEO Metric Integration

Another dimension of SAGEO Arena’s contribution is presenting a structure that simultaneously measures SEO and GEO metrics within a single framework. Previously, these two domains were almost completely separated.

| Aspect | Traditional SEO Evaluation | Traditional GEO Evaluation | SAGEO Unified Evaluation |
| --- | --- | --- | --- |
| Measurement target | SERP rankings, click rates | Citation frequency in AI responses | Entire pipeline |
| Perspective | Search engine crawling | Generative model output | Both simultaneously |
| Optimization strategy | Technical SEO + content | Content structure + authority | Stage-by-stage decomposition |
| Limitation | Does not reflect AI search | Ignores retrieval stage | Generalizability of experimental environment |

This integration matters practically because in the real world, SEO and GEO apply to the same content. A single web page can appear in Google SERPs and also be cited in AI responses from Perplexity or ChatGPT Search. Optimizing the two metrics separately risks improvements on one side causing degradation on the other, and SAGEO Arena provides the tooling to make this tradeoff visible.
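That tradeoff only becomes visible when both metrics are scored jointly. The sketch below is a hypothetical illustration, with weights and numbers invented for the example: a content rewrite improves the GEO metric while degrading the SEO metric, and only the joint view catches the net regression.

```python
# Hypothetical joint SEO+GEO scoring. Weights and example values are
# invented for illustration; they are not from SAGEO Arena.

def joint_score(seo_visibility: float, geo_citation: float,
                w_seo: float = 0.5, w_geo: float = 0.5) -> float:
    """Both inputs normalized to [0, 1]; higher is better."""
    return w_seo * seo_visibility + w_geo * geo_citation

before = {"seo": 0.80, "geo": 0.40}
after = {"seo": 0.55, "geo": 0.60}  # rewrite helped citations, hurt retrieval

geo_improved = after["geo"] > before["geo"]
joint_regressed = (joint_score(after["seo"], after["geo"])
                   < joint_score(before["seo"], before["geo"]))
```

A team tracking only the GEO metric would ship this change; a team tracking the joint score would not.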

Key Results

| Optimization Target | Retrieval Hit Rate Change | Avg. Retrieval Rank Change | Generation Citation Rate Change |
| --- | --- | --- | --- |
| Structured information optimization | +22% | +2.72 | Significant improvement |
| Body text optimization alone | Negligible | Negligible | Some improvement |
| Citation density optimization | Minor | Minor | Significant improvement |
| Structured info + body text combined | Maximum | Maximum | Maximum |

The most notable result is that optimizing structured information — metadata, schema markup, structured data — improved retrieval hit rate by 22% and improved average retrieval rank by 2.72 positions. Meanwhile, optimizing body text alone showed no significant effect at the retrieval stage.

Key finding: Structured information optimization shows the greatest effect at the retrieval stage (+22% hit rate), while body text optimization only has partial effect at the generation stage. GEO strategies limited to body text optimization are structurally insufficient.

This result is intuitively interpretable. The retrieval stage of AI search systems relies on embedding-based similarity search and metadata filtering, while qualitative improvements to body content only take effect after passing through this stage. Structured information is the critical signal that gets content through the retrieval “gate,” while body text influences citation decisions during the generation stage.

Retrieval Improvement Mechanisms Analyzed

It is worth examining more specifically through which mechanisms the +22% retrieval hit rate improvement occurs.

The retrieval stage of AI search systems generally operates through two pathways:

  1. Dense retrieval: Embeds queries and documents in the same vector space and calculates similarity. In this case, structured metadata (title, description, schema markup) directly affects embedding quality.
  2. Sparse retrieval + metadata filtering: Combines traditional search methods like BM25 with metadata-based filters. Documents rich in structured data are more likely to be included at the filtering stage.
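A toy illustration of both pathways (my own example, not from the paper): structured metadata adds query-relevant terms to the text representation used for similarity scoring, and it supplies explicit fields that a metadata filter can match on. The documents, the JSON-LD snippet, and the crude shared-term score below are all hypothetical.

```python
# Why structured metadata can help on both retrieval pathways (toy example).
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercased alphanumeric tokens, ignoring punctuation and quoting."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, text: str) -> int:
    """Crude stand-in for dense similarity: count of shared distinct terms."""
    return len(tokens(query) & tokens(text))

body = "Our guide walks through installing the toolkit step by step."
meta = '{"@type": "HowTo", "name": "toolkit installation guide"}'  # JSON-LD-style

query = "toolkit installation guide"
score_body_only = overlap_score(query, body)          # misses "installation"
score_with_meta = overlap_score(query, body + " " + meta)

# Pathway 2: a metadata filter can only keep documents that expose the field.
def passes_filter(document_meta: dict, required_type: str) -> bool:
    return document_meta.get("@type") == required_type

has_schema = passes_filter({"@type": "HowTo"}, "HowTo")
no_schema = passes_filter({}, "HowTo")
```

On pathway 1 the metadata closes a vocabulary gap ("installing" in the body vs. "installation" in the query); on pathway 2 the document without a `@type` field is simply excluded, regardless of body quality.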

SAGEO Arena’s results suggest that structured information operates favorably in both pathways. This is structurally the same pattern as the importance of schema markup and metadata optimization in traditional technical SEO, empirically confirming that this principle remains valid in the AI search era.

Limitations

The paper is at the preprint stage, and whether the pipeline configuration used represents the architecture of all commercial AI search engines has not been verified. In particular, Google AI Overview, Perplexity, and ChatGPT Search each likely use different retrieval-generation pipelines.

Specific limitations are summarized below:

| Limitation | Detail | Follow-up Research Needed |
| --- | --- | --- |
| Pipeline representativeness | Unverified whether experimental pipeline sufficiently represents commercial systems | Replication experiments with multiple pipeline configurations |
| Structured info type decomposition | "Structured information" scope is broad; needs detailed analysis of which types contribute most | Separate effect isolation for schema markup, meta tags, Open Graph, etc. |
| Cross-engine comparison | Single-pipeline experiment cannot capture inter-engine differences | Multi-engine comparison experiments |
| Temporal stability | Point-in-time results; changes from model updates not measured | Longitudinal study |
| Query type scope | Primarily information-seeking queries; commercial, local, and other intent types not included | Integration with vertical benchmarks like E-GEO |

Paper 2: Can Rankings Be Manipulated? — CORE

Jin et al.’s CORE illuminates the dark side of GEO. If optimization is possible, is malicious manipulation also possible? This question must be raised in every optimization domain, and GEO is no exception.

Ethical Positioning of the Research

First, the nature of this paper must be clarified. CORE is a vulnerability disclosure, not a manipulation guide. This follows the standard approach of AI security research.

In cybersecurity, “responsible disclosure” — the practice of publicly reporting vulnerabilities to promote defense — is well established. Recent AI security research such as prompt injection studies (Perez & Ribeiro, 2022) and data poisoning studies (Carlini et al., 2023) follows the same frame. The fact that CORE was submitted to ICLR 2026 itself reflects the academic community’s recognition of this as legitimate security research.

Ethical framing: CORE’s 91.4% success rate should be read not as “here’s how to manipulate” but as “this is how vulnerable current systems are.”

Research Design

This paper systematically experiments whether AI search response rankings can be deliberately manipulated from an adversarial perspective. The research design consists of building an attack type taxonomy and measuring each type’s effectiveness.

Attack Vector Taxonomy

The attack vectors presented by CORE are classified as follows:

```mermaid
flowchart TD
    A["AI Search Ranking Manipulation\n(Output Ranking Manipulation)"]
    A --> B["Content-level attacks"]
    A --> C["Metadata-level attacks"]
    A --> D["Cross-channel attacks"]

    B --> B1["Authority signal injection"]
    B --> B2["Citation network manipulation"]
    B --> B3["Keyword-structure hybrid"]

    C --> C1["Schema markup abuse"]
    C --> C2["Meta tag inflation"]

    D --> D1["Multi-source consistency attack"]
    D --> D2["Synthetic backlinks"]
```

The characteristics and success rates of each attack vector are summarized below:

| Attack Vector | Description | Top-5 Promotion Success Rate | Top-10 Promotion Success Rate |
| --- | --- | --- | --- |
| Authority signal injection | Insert statistics, citations, expert opinions to make content appear authoritative | High | Very high |
| Citation network manipulation | Induce multiple external sources to reference the target content | High | High |
| Keyword-structure hybrid | Combine semantic keyword optimization with structured formatting | Medium | High |
| Schema markup abuse | Insert fabricated structured data into markup | Medium | Medium |
| Multi-source consistency | Repeat identical claims across multiple channels to generate a consensus signal | Very high | Very high |
| Combined attack | Use a combination of the above vectors | 91.4% | 96.2% |

Key Results and Their Implications

The headline number reported by CORE is a Top-5 ranking promotion success rate of 91.4%. This means that attempts to place specific content within the top 5 positions of an AI response succeeded 91.4% of the time.

Key numbers: Under combined attack conditions (multiple vectors combined), Top-5 promotion success rate 91.4%, Top-10 promotion success rate 96.2%. These are results under controlled experimental conditions, but they suggest that current AI search systems have remarkably low robustness.

The implications of these numbers need to be analyzed in contrast with SEO’s history.

| Comparison Dimension | Google Search (2024 baseline) | AI Search Systems (CORE experiment) |
| --- | --- | --- |
| Anti-spam defense history | 20+ years of accumulated defense mechanisms | Early stage |
| Manipulation success rate | Even sophisticated Black Hat SEO has limited success | 91.4% (combined attack) |
| Detection mechanisms | SpamBrain, manual actions, algorithmic penalties | Mostly not built |
| Normative framework | White Hat / Black Hat distinction established | No standards exist |
| Industry self-regulation | Quality Raters guidelines | None |

This contrasts with traditional Google search, which has developed defense mechanisms against spam, link farms, and click manipulation over decades. AI search systems are being launched as commercial services with virtually no such defense history.

What 91.4% Means for the Industry

Interpreting this number from an industry perspective reveals problems on three dimensions.

First, user trust. AI search is growing on the expectation of “providing more accurate and unbiased answers.” However, if ranking manipulation is this easy, the premise that top results in AI responses are necessarily the most relevant sources collapses. This undermines the foundation of user trust.

Second, market fairness. Access to manipulation techniques is not equal. Only actors with technical capability and resources can attempt manipulation, which creates structural disadvantages for smaller content producers. The pattern from SEO, where large corporations invested massive budgets in link building to push small sites out, risks repeating in GEO.

Third, information ecosystem contamination. In a circular structure where AI search feeds into the training data of other AI systems, manipulated rankings could be reflected in the training of next-generation models. This could lead to long-term information ecosystem distortion beyond single-point manipulation.

The Need for Defense Mechanisms

Jin et al. present not only attack demonstrations but also defense directions. Defense mechanisms proposed or implied in the paper can be structured as follows:

```mermaid
flowchart TD
    D["Defense Mechanisms\n(Defense Layers)"]
    D --> L1["Layer 1: Input Validation"]
    D --> L2["Layer 2: Pipeline Hardening"]
    D --> L3["Layer 3: Output Monitoring"]
    D --> L4["Layer 4: Normative Framework"]

    L1 --> L1a["Content authenticity verification"]
    L1 --> L1b["Metadata consistency checks"]
    L1 --> L1c["Source trustworthiness scoring"]

    L2 --> L2a["Adversarial training"]
    L2 --> L2b["Multi-signal cross-validation"]
    L2 --> L2c["Ranking anomaly detection"]

    L3 --> L3a["Response consistency monitoring"]
    L3 --> L3b["Time-based ranking fluctuation tracking"]
    L3 --> L3c["User feedback integration"]

    L4 --> L4a["White Hat GEO criteria definition"]
    L4 --> L4b["Manipulation detection transparency reports"]
    L4 --> L4c["Industry self-regulation"]
```

The role and current implementation status of each defense layer:

| Defense Layer | Role | Current Status | Implementation Difficulty |
| --- | --- | --- | --- |
| Input validation | Verify content and metadata authenticity before indexing | Basic level | Medium |
| Pipeline hardening | Ensure robustness against adversarial inputs at each retrieval/reranking/generation stage | Virtually nonexistent | High |
| Output monitoring | Detect anomalies in ranking patterns of generated responses | Partial | Medium |
| Normative framework | Define industry standards drawing the line between optimization and manipulation | Absent | High (non-technical factors) |
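As a sense of what the output-monitoring layer could look like, here is a minimal sketch of ranking anomaly detection: flag a source whose position in AI responses jumps far outside its historical range. The z-score rule, the threshold, and the example data are my own illustrative choices, not mechanisms described in the paper.

```python
# Toy sketch of output-layer "ranking anomaly detection": flag a source
# whose new rank deviates sharply from its historical distribution.
# Rule and threshold are illustrative assumptions.
from statistics import mean, stdev

def rank_anomaly(history: list[int], new_rank: int,
                 z_threshold: float = 3.0) -> bool:
    """True if new_rank deviates from the historical mean by > z_threshold sigmas."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_rank != mu
    return abs(new_rank - mu) / sigma > z_threshold

# A source that has hovered around rank 8-10 suddenly appearing at rank 1
# is the kind of jump a combined attack could produce.
history = [9, 8, 10, 9, 8, 9, 10, 8]
suspicious = rank_anomaly(history, 1)
normal = rank_anomaly(history, 9)
```

A production system would need far more than this (per-query baselines, seasonality, legitimate ranking changes from fresh content), but even this crude check illustrates that the monitoring layer is implementable with signals platforms already have.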

Intersection with AI Security Research

CORE’s research structurally connects with existing AI security research agendas.

| AI Security Agenda | Definition | GEO Counterpart |
| --- | --- | --- |
| Prompt injection | Insert malicious instructions into an LLM to induce unintended behavior | Insert text in content that induces the LLM to cite that source |
| Data poisoning | Inject malicious samples into training data to alter model behavior | Deploy manipulated information across indexable web content |
| Adversarial examples | Subtly modify inputs to disrupt model classification/output | Fine-tune content to disrupt ranking algorithms |
| Model extraction | Observe model behavior to reverse-engineer internal logic | Observe ranking fluctuation patterns to reverse-engineer ranking algorithms |

These correspondences suggest that GEO security is not a standalone new problem but can be addressed by extending existing AI security frameworks. However, GEO-specific characteristics — multi-stage pipeline, an open input space of web content, real-time impact on commercial services — introduce additional complexity.

Limitations

A gap may exist between the specific pipeline configuration of the experimental environment and reproducibility on commercial systems. Additionally, the 91.4% figure is a result under controlled experimental conditions, and actual commercial systems may have additional filtering layers. However, verifying whether such defenses are sufficient is itself a future research task.

Additionally, the paper’s attack scenarios are static. Real-world manipulation attempts are dynamic processes that adapt and evolve in response to defense mechanisms. This arms race dynamic is beyond the scope of this paper but is a topic that must be addressed in future research.


Cross-Analysis: Evaluation Integrity and Manipulation Risk

Placing both papers side by side reveals not just a comparison but a unified problem: GEO evaluation framework integrity and ranking manipulation risk are two sides of the same coin.

Comparison Frame

| Dimension | Kim et al. — SAGEO Arena | Jin et al. — CORE |
| --- | --- | --- |
| Perspective | Evaluation methodology | Security/adversarial analysis |
| Core question | Is current evaluation sufficient? | Is the current system safe? |
| Core contribution | Full-pipeline evaluation framework | Ranking manipulation vulnerability demonstration |
| Key numbers | Retrieval hit rate +22% | Top-5 promotion 91.4% success |
| GEO redefinition | Content optimization → pipeline optimization | Optimization → includes security |
| Methodology | Unified evaluation environment + variable isolation experiments | Attack taxonomy + success rate measurement |
| Practical audience | GEO strategists, evaluation researchers | Platform security teams, policy makers |
| Status | Preprint | Preprint / ICLR 2026 submission |

Structural Connection: Interdependence of Evaluation and Security

SAGEO Arena’s pipeline evaluation structure and CORE’s attack vector analysis map to each other with striking precision. At the exact point where SAGEO Arena discovers “structured information is decisive at the retrieval stage,” CORE’s “schema markup abuse” attack operates.

This correspondence visualized:

| Pipeline Stage | SAGEO Finding | CORE Attack Vector | Implication |
| --- | --- | --- | --- |
| Retrieval | Structured info +22% effect | Schema markup abuse, metadata manipulation | The most effective optimization point is also the most vulnerable attack point |
| Reranking | Multi-signal combination determines ranking | Multi-source consistency attack, synthetic backlinks | Vulnerable to attacks that synthesize trust signals |
| Generation | Content structure and citations affect citation rate | Authority signal injection, citation network manipulation | Generation model overly reliant on authority signals |

This correspondence reveals a fundamental dilemma: The information needed for effective GEO (structured data, authority signals, citations) is the same information exploited for manipulation. This is structurally the same problem as in SEO, where “good SEO” and “Black Hat SEO” use the same technical mechanisms and differ only in intent.

The Need for an Integrated Framework

The conclusion drawn from cross-analyzing both papers is clear: for the GEO ecosystem to mature healthily, evaluation and security must be treated not as separate research agendas but within a single integrated framework.

```mermaid
flowchart TD
    subgraph Integrated["Integrated GEO Framework"]
        direction TB
        E["Evaluation\nSAGEO Arena direction"]
        S["Security\nCORE direction"]
        E <-->|"Interdependent"| S
    end

    subgraph SEO_Layer["Existing SEO Infrastructure"]
        T["Technical SEO"]
        C["Content SEO"]
    end

    subgraph GEO_Layer["GEO Extension"]
        P["Pipeline Optimization"]
        D["Defensive GEO"]
    end

    SEO_Layer --> Integrated
    Integrated --> GEO_Layer
    P --> O["Practical GEO Strategy"]
    D --> O
```

This integrated framework has the following practical implications:

  1. Include a security dimension in evaluation metrics: When measuring GEO performance, robustness against manipulation should be evaluated alongside ranking improvements.
  2. Defensive GEO strategies: A defensive perspective is needed to ensure your content is not displaced by manipulated content.
  3. Platform-content producer collaboration: A cooperative structure is needed where platforms build defense mechanisms and content producers optimize in alignment with them.

Comprehensive Synthesis: Cumulative Insights from 6 GEO Papers

Combining the 4 papers from previous reviews with these 2 reveals the overall picture painted by all 6 GEO papers.

| Paper | Year | Core Contribution | GEO Maturity Stage |
| --- | --- | --- | --- |
| Aggarwal et al. | 2024 | GEO concept definition, GEO-Bench, PAWC | Definition |
| Chen et al. | 2025 | Earned media bias, engine-specific differences | Definition |
| Wu et al. (AutoGEO) | 2025 | Quality-preserving auto-optimization | Optimization |
| Bagga et al. (E-GEO) | 2025 | E-commerce vertical benchmark | Optimization |
| Kim et al. (SAGEO Arena) | 2026 | Unified pipeline evaluation | Measurement |
| Jin et al. (CORE) | 2026 | Ranking manipulation vulnerability demonstration | Security |

The current position of the GEO field as cumulatively demonstrated by these 6 papers:

  • Definition: Established. Consensus is forming on what GEO is and how it differs from SEO.
  • Measurement: Early frameworks exist but are not standardized. Proposals like PAWC and SAGEO exist, but no industry-standard metrics have been adopted.
  • Optimization: Approaches have begun from both general-purpose and vertical sides. The feasibility of quality-preserving optimization is demonstrated, but replication scope is limited.
  • Security: At the problem identification stage. Vulnerabilities are empirically demonstrated, but implementation and effectiveness verification of defense mechanisms have not begun.

The still-unsolved problems — standardized evaluation protocols, cross-platform comparable benchmarks, defense mechanism effectiveness verification, longitudinal studies, and ROI linkage models — are gaps that must be filled in the field’s next phase.


Industry Implications: The Need for Defensive GEO

The most important practical takeaway from combining both papers is the need for the concept of Defensive GEO.

Limitations of Current GEO Strategy

Current GEO strategy discourse mostly focuses on “how to get better exposure in AI search.” This corresponds to the offensive strategy of “how to raise rankings” in SEO. However, CORE’s results show this approach alone is insufficient.

A 91.4% manipulation success rate means competitors can technically push your content down in rankings. In this environment, it becomes a strategic imperative not only to raise your own content but also to avoid being displaced by manipulation attempts.

Components of Defensive GEO

A defensive GEO strategy should include the following elements:

  1. Multi-stage optimization: Following SAGEO Arena’s lesson, optimize across the entire pipeline including structured information, not just body text. This makes it harder for a single attack vector to overturn rankings.
  2. Multi-source consistency: Ensure your information is consistently referenced across diverse, trustworthy third-party sources. This is a legitimate defense against CORE’s “multi-source consistency attack.”
  3. Monitoring systems: Continuously monitor your visibility in AI search responses to detect anomalous ranking fluctuations early.
  4. Content authenticity reinforcement: Focus on content types that are difficult to fabricate — original data, proprietary research findings, verifiable information.

Practical conclusion: GEO strategy must expand from “how to raise visibility” to “how to raise visibility and how to defend it.” SAGEO Arena tells us “what to optimize,” and CORE tells us “what to defend against.”


References

  • Kim, J. et al. (2026). SAGEO Arena. Preprint.
  • Jin, Z. et al. (2026). CORE: Controlling Output Rankings. Preprint / ICLR 2026 Submission.
  • Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD 2024.
  • Chen, Y. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Working Paper.
  • Wu, Z. et al. (2025). AutoGEO. Preprint.
  • Bagga, N. et al. (2025). E-GEO. Preprint.