Review of SAGEO Arena and CORE papers, analyzing the need for integrated GEO evaluation frameworks and the vulnerability of AI search rankings (91.4% Top-5 manipulation success rate).
Scope of This Review
This post reviews two papers that demonstrate how the GEO (Generative Engine Optimization) field is expanding beyond its initial definition phase into evaluation framework construction and security risk analysis.
- Kim et al. (2026), “SAGEO Arena” — preprint
- Jin et al. (2026), “CORE: Controlling Output Rankings” — preprint (submitted to ICLR 2026)
The former challenges the limitations of existing GEO evaluation methodology and proposes a unified pipeline evaluation approach. The latter empirically demonstrates that adversarial ranking manipulation of AI responses is possible. This review covers both papers' academic contributions and limitations, and what they imply for the maturation of the GEO field.
While the papers reviewed previously — Aggarwal et al. (2024, KDD) and Chen et al. (2025) — defined “what GEO is,” and Wu et al.’s AutoGEO and Bagga et al.’s E-GEO addressed “how to optimize,” these two papers occupy the position of “how to measure properly, and what risks exist.” Academic fields typically mature along the trajectory of definition, measurement, and then security, and GEO is on this trajectory.
Full Context of the GEO Research Flow
The positioning of all papers reviewed so far can be visualized as follows:
```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Definition"]
        A["Aggarwal et al. (2024)\nGEO concept definition\nGEO-Bench, PAWC"]
        B["Chen et al. (2025)\nEmpirical behavior analysis\nEarned media bias"]
    end
    subgraph Phase2["Phase 2: Optimization"]
        C["Wu et al. (2025)\nAutoGEO\nQuality-preserving optimization"]
        D["Bagga et al. (2025)\nE-GEO\nE-commerce vertical"]
    end
    subgraph Phase3["Phase 3: Measurement + Security"]
        E["Kim et al. (2026)\nSAGEO Arena\nUnified evaluation framework"]
        F["Jin et al. (2026)\nCORE\nRanking manipulation demonstration"]
    end
    Phase1 --> Phase2 --> Phase3
```
What is notable in this flow is that the two Phase 3 papers illuminate the same problem from different directions. SAGEO Arena asks “is our current evaluation sufficient?” while CORE asks “is the current system safe?” These questions are superficially independent but fundamentally converge on the same concern: the trustworthiness of the GEO ecosystem.
Paper 1: The Entire Pipeline Must Be Evaluated — SAGEO Arena
Kim et al.’s SAGEO Arena starts from a critique of existing approaches that treat GEO and SEO as separate optimization domains. The paper’s core claim is simple yet far-reaching: GEO evaluation that only measures the generation stage is structurally incomplete.
The Problem: Blind Spots in Existing GEO Evaluation
Previous GEO research — including Aggarwal et al. (2024) covered in earlier reviews — primarily focused on content visibility at the generation stage. GEO-Bench’s PAWC metric, Chen et al.’s earned media analysis, and AutoGEO’s cooperative GEO all measured “how much a specific source is included in the AI-generated response.”
However, real AI search systems are not single-stage. They consist of a multi-stage pipeline: retrieval, reranking, and generation. Here the fundamental problem arises: content optimized for the generation stage can never appear in AI responses if it fails to make it into the retrieval candidate pool in the first place.
Breaking this problem down concretely:
| Pipeline Stage | Role | Coverage by Prior GEO Research |
|---|---|---|
| Retrieval | Extracts query-relevant candidates from a large document pool | Almost none |
| Reranking | Reorders candidates by relevance, authority, and freshness | Indirect (implied in earned media analysis) |
| Generation | Generates the final response text and cites sources | Most research concentrated here |
SAGEO Arena’s core argument originates from the asymmetry in this table. The vast majority of GEO research has focused only on the Generation stage, but the first gate for practical visibility is Retrieval.
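The gating effect behind this argument can be made concrete with a minimal sketch. The scoring functions below are toy stand-ins (token overlap instead of real embedding models and rerankers), not SAGEO Arena's implementation, but they illustrate the structural point: a document filtered out at retrieval can never be cited at generation, no matter how well its body text is optimized.

```python
# Minimal sketch of a retrieval -> reranking -> generation pipeline.
# All scoring is a toy stand-in for embeddings/cross-encoders.
from collections import Counter

def tokens(text):
    return text.lower().split()

def overlap_score(query, doc):
    """Toy relevance: count of shared tokens (stand-in for embedding similarity)."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    return sum((q & d).values())

def retrieve(query, corpus, n=2):
    """Stage 1: narrow the full corpus to n candidates."""
    return sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)[:n]

def rerank(query, candidates, k=2):
    """Stage 2: reorder candidates (same toy score here) and keep top k."""
    return sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)[:k]

def generate(query, context):
    """Stage 3: 'cite' whatever survived the first two gates (stand-in for an LLM)."""
    return {"answer": f"Answer to: {query}", "citations": context}

corpus = [
    "generative engine optimization improves ai search visibility",
    "structured metadata helps retrieval systems find documents",
    "a recipe for sourdough bread with whole wheat flour",
]

query = "how does metadata affect ai search retrieval"
cited = generate(query, rerank(query, retrieve(query, corpus)))["citations"]
# The sourdough document can never be cited: it was filtered out at retrieval.
```

The off-topic document is eliminated at stage 1, so no amount of generation-stage polish could make it appear in the answer, which is exactly the blind spot the paper targets.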
SAGEO Arena Architecture
The authors built a unified evaluation environment covering the entire retrieval-reranking-generation pipeline. It measures SEO (search exposure) and GEO (generative response exposure) within the same framework.
```mermaid
flowchart TD
    Q["User query"] --> R["Retrieval stage"]
    R -->|"n candidate documents"| RR["Reranking stage"]
    RR -->|"Top k documents"| G["Generation stage"]
    G --> A["AI response + citations"]
    subgraph SAGEO["SAGEO Arena evaluation scope"]
        R
        RR
        G
    end
    subgraph SEO_Metrics["SEO Metrics"]
        RM1["Retrieval Hit Rate"]
        RM2["Average Retrieval Rank"]
        RM3["Reranking Position"]
    end
    subgraph GEO_Metrics["GEO Metrics"]
        GM1["Citation Inclusion"]
        GM2["Position in Response"]
        GM3["Attribution Quality"]
    end
    R --> SEO_Metrics
    RR --> SEO_Metrics
    G --> GEO_Metrics
```
The significance of this design is that it redefines GEO from “content optimization” to pipeline optimization. It enables decomposed measurement of which content attributes operate at which pipeline stage.
Methodology Details
SAGEO Arena’s experimental design consists of the following elements:
| Experimental Element | Details |
|---|---|
| Evaluation target | Simultaneous measurement of SEO and GEO metrics |
| Pipeline configuration | Retrieval (embedding-based search) → Reranking (cross-encoder) → Generation (LLM) |
| Optimization variables | Structured information (metadata, schema markup), body text, citation density, etc. |
| SEO metrics | Retrieval Hit Rate, Average Retrieval Rank |
| GEO metrics | Citation Inclusion Rate, Position-weighted Visibility |
| Query set | Information-seeking and compound queries across multiple domains |
Notably, the authors isolated optimization variables individually for experimentation. This is a design for causally analyzing “which optimization has an effect at which stage.” Prior GEO research has rarely attempted this level of stage-by-stage decomposition analysis.
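The metric families in the table can be sketched with their common information-retrieval definitions; the exact formulas in SAGEO Arena may differ, so treat this as an illustration of what each number measures. Each record represents one query's outcome for a single target document.

```python
# Common IR-style definitions of the paper's metric families (illustrative;
# SAGEO Arena's exact formulations may differ).

def retrieval_hit_rate(records, n=10):
    """Fraction of queries where the target appears in the top-n retrieved set."""
    hits = [r for r in records
            if r["retrieval_rank"] is not None and r["retrieval_rank"] <= n]
    return len(hits) / len(records)

def average_retrieval_rank(records):
    """Mean rank over queries where the target was retrieved (lower is better)."""
    ranks = [r["retrieval_rank"] for r in records if r["retrieval_rank"] is not None]
    return sum(ranks) / len(ranks)

def citation_inclusion_rate(records):
    """Fraction of queries where the generated answer cites the target."""
    return sum(1 for r in records if r["cited"]) / len(records)

records = [
    {"retrieval_rank": 2,    "cited": True},
    {"retrieval_rank": 7,    "cited": False},
    {"retrieval_rank": None, "cited": False},  # never retrieved -> never citable
    {"retrieval_rank": 1,    "cited": True},
]

print(retrieval_hit_rate(records))       # 0.75
print(average_retrieval_rank(records))   # ~3.33
print(citation_inclusion_rate(records))  # 0.5
```

Note how the third record drags down both the hit rate and the citation rate at once: a retrieval failure propagates to the GEO metric, which is the coupling the unified framework is designed to expose.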
SEO and GEO Metric Integration
Another dimension of SAGEO Arena’s contribution is presenting a structure that simultaneously measures SEO and GEO metrics within a single framework. Previously, these two domains were almost completely separated.
| Aspect | Traditional SEO Evaluation | Traditional GEO Evaluation | SAGEO Unified Evaluation |
|---|---|---|---|
| Measurement target | SERP rankings, click rates | Citation frequency in AI responses | Entire pipeline |
| Perspective | Search engine crawling | Generative model output | Both simultaneously |
| Optimization strategy | Technical SEO + content | Content structure + authority | Stage-by-stage decomposition |
| Limitation | Does not reflect AI search | Ignores retrieval stage | Generalizability of experimental environment |
This integration matters practically because in the real world, SEO and GEO apply to the same content. A single web page can appear in Google SERPs and also be cited in AI responses from Perplexity or ChatGPT Search. Optimizing the two metrics separately risks improvements on one side causing degradation on the other, and SAGEO Arena provides the tooling to make this tradeoff visible.
Key Results
| Optimization Target | Retrieval Hit Rate Change | Avg. Retrieval Rank Change | Generation Citation Rate Change |
|---|---|---|---|
| Structured information optimization | +22% | +2.72 | Significant improvement |
| Body text optimization alone | Negligible | Negligible | Some improvement |
| Citation density optimization | Minor | Minor | Significant improvement |
| Structured info + body text combined | Maximum | Maximum | Maximum |
The most notable result is that optimizing structured information — metadata, schema markup, structured data — improved retrieval hit rate by 22% and moved documents up the retrieval ranking by 2.72 positions on average. Meanwhile, optimizing body text alone showed no significant effect at the retrieval stage.
Key finding: Structured information optimization shows the greatest effect at the retrieval stage (+22% hit rate), while body text optimization only has partial effect at the generation stage. GEO strategies limited to body text optimization are structurally insufficient.
This result is intuitively interpretable. The retrieval stage of AI search systems relies on embedding-based similarity search and metadata filtering, while qualitative improvements to body content only take effect after passing through this stage. Structured information is the critical signal that gets content through the retrieval “gate,” while body text influences citation decisions during the generation stage.
Retrieval Improvement Mechanisms Analyzed
It is worth examining more specifically through which mechanisms the +22% retrieval hit rate improvement occurs.
The retrieval stage of AI search systems generally operates through two pathways:
- Dense retrieval: Embeds queries and documents in the same vector space and calculates similarity. In this case, structured metadata (title, description, schema markup) directly affects embedding quality.
- Sparse retrieval + metadata filtering: Combines traditional search methods like BM25 with metadata-based filters. Documents rich in structured data are more likely to be included at the filtering stage.
SAGEO Arena’s results suggest that structured information operates favorably in both pathways. This is structurally the same pattern as the importance of schema markup and metadata optimization in traditional technical SEO, empirically confirming that this principle remains valid in the AI search era.
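The two pathways can be illustrated with a toy example. The vectors below are plain term-frequency bags standing in for learned embeddings, and the metadata field names are invented for the sketch; the point is only the mechanism: metadata concatenated into the indexed representation pulls the document's vector toward likely queries, and structured fields let a document survive filters that bare body text fails.

```python
# Toy illustration of the dense and sparse retrieval pathways, and why
# structured metadata helps in both. TF bags stand in for embeddings;
# the metadata fields are invented for this sketch.
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "best crm software for small business"
body = "our product helps teams track customers and close deals faster"

# Pathway 1 (dense-like): metadata concatenated into the embedded text
# shifts the document's vector toward the queries it should match.
with_meta = body + " title: best crm software for small business teams"

print(cosine(vec(query), vec(body)))       # 0.0 — no term overlap at all
print(cosine(vec(query), vec(with_meta)))  # clearly higher

# Pathway 2 (sparse + filtering): documents carrying the right structured
# fields survive a metadata filter that bare body text would fail.
doc_meta = {"schema_type": "SoftwareApplication"}
passes_filter = doc_meta.get("schema_type") == "SoftwareApplication"
```

The body text here is perfectly good marketing copy, yet it shares no vocabulary with the query; the metadata is what bridges the gap, mirroring the paper's finding that body-only optimization fails the retrieval gate.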
Limitations
The paper is at the preprint stage, and whether the pipeline configuration used represents the architecture of all commercial AI search engines has not been verified. In particular, Google AI Overview, Perplexity, and ChatGPT Search each likely use different retrieval-generation pipelines.
Specific limitations are summarized below:
| Limitation | Detail | Follow-up Research Needed |
|---|---|---|
| Pipeline representativeness | Unverified whether experimental pipeline sufficiently represents commercial systems | Replication experiments with multiple pipeline configurations |
| Structured info type decomposition | “Structured information” scope is broad; needs detailed analysis of which types contribute most | Separate effect isolation for schema markup, meta tags, Open Graph, etc. |
| Cross-engine comparison | Single-pipeline experiment cannot capture inter-engine differences | Multi-engine comparison experiments |
| Temporal stability | Point-in-time results; changes from model updates not measured | Longitudinal study |
| Query type scope | Primarily information-seeking queries; commercial, local, and other intent types not included | Integration with vertical benchmarks like E-GEO |
Paper 2: Can Rankings Be Manipulated? — CORE
Jin et al.’s CORE illuminates the dark side of GEO. If optimization is possible, is malicious manipulation also possible? This question must be raised in every optimization domain, and GEO is no exception.
Ethical Positioning of the Research
First, the nature of this paper must be clarified. CORE is a vulnerability disclosure, not a manipulation guide. This follows the standard approach of AI security research.
In cybersecurity, “responsible disclosure” — the practice of reporting vulnerabilities publicly so that defenses can be built — is well established. Recent AI security research such as prompt injection studies (Perez & Ribeiro, 2022) and data poisoning studies (Carlini et al., 2023) follows the same frame. The fact that CORE was submitted to ICLR 2026 itself reflects the academic community’s recognition of this as legitimate security research.
Ethical framing: CORE’s 91.4% success rate should be read not as “here’s how to manipulate” but as “this is how vulnerable current systems are.”
Research Design
This paper systematically experiments whether AI search response rankings can be deliberately manipulated from an adversarial perspective. The research design consists of building an attack type taxonomy and measuring each type’s effectiveness.
Attack Vector Taxonomy
The attack vectors presented by CORE are classified as follows:
```mermaid
flowchart TD
    A["AI Search Ranking Manipulation\n(Output Ranking Manipulation)"]
    A --> B["Content-level attacks"]
    A --> C["Metadata-level attacks"]
    A --> D["Cross-channel attacks"]
    B --> B1["Authority signal injection"]
    B --> B2["Citation network manipulation"]
    B --> B3["Keyword-structure hybrid"]
    C --> C1["Schema markup abuse"]
    C --> C2["Meta tag inflation"]
    D --> D1["Multi-source consistency attack"]
    D --> D2["Synthetic backlinks"]
```
The characteristics and success rates of each attack vector are summarized below:
| Attack Vector | Description | Top-5 Promotion Success Rate | Top-10 Promotion Success Rate |
|---|---|---|---|
| Authority signal injection | Insert statistics, citations, expert opinions to make content appear authoritative | High | Very high |
| Citation network manipulation | Induce multiple external sources to reference the target content | High | High |
| Keyword-structure hybrid | Combine semantic keyword optimization with structured formatting | Medium | High |
| Schema markup abuse | Insert fabricated structured data into markup | Medium | Medium |
| Multi-source consistency | Repeat identical claims across multiple channels to generate a consensus signal | Very high | Very high |
| Combined attack | Use a combination of the above vectors | 91.4% | 96.2% |
Key Results and Their Implications
The headline number reported by CORE is a Top-5 ranking promotion success rate of 91.4%. This means that attempts to place specific content within the top 5 positions of an AI response succeeded 91.4% of the time.
Key numbers: Under combined attack conditions (multiple vectors combined), Top-5 promotion success rate 91.4%, Top-10 promotion success rate 96.2%. These are results under controlled experimental conditions, but they suggest that current AI search systems have remarkably low robustness.
The implications of these numbers need to be analyzed in contrast with SEO’s history.
| Comparison Dimension | Google Search (2024 baseline) | AI Search Systems (CORE experiment) |
|---|---|---|
| Anti-spam defense history | 20+ years of accumulated defense mechanisms | Early stage |
| Manipulation success rate | Even sophisticated Black Hat SEO has limited success | 91.4% (combined attack) |
| Detection mechanisms | SpamBrain, manual actions, algorithmic penalties | Mostly not built |
| Normative framework | White Hat / Black Hat distinction established | No standards exist |
| Industry self-regulation | Quality Raters guidelines | None |
This contrasts with traditional Google search, which has developed defense mechanisms against spam, link farms, and click manipulation over decades. AI search systems are being launched as commercial services with virtually no such defense history.
What 91.4% Means for the Industry
Interpreting this number from an industry perspective reveals problems on three dimensions.
First, user trust. AI search is growing on the expectation of “providing more accurate and unbiased answers.” However, if ranking manipulation is this easy, the premise that top results in AI responses are necessarily the most relevant sources collapses. This undermines the foundation of user trust.
Second, market fairness. Access to manipulation techniques is not equal. Only actors with technical capability and resources can attempt manipulation, which creates structural disadvantages for smaller content producers. The pattern from SEO, where large corporations invested massive budgets in link building to push small sites out, risks repeating in GEO.
Third, information ecosystem contamination. In a circular structure where AI search feeds into the training data of other AI systems, manipulated rankings could be reflected in the training of next-generation models. This could lead to long-term information ecosystem distortion beyond single-point manipulation.
The Need for Defense Mechanisms
Jin et al. present not only attack demonstrations but also defense directions. Defense mechanisms proposed or implied in the paper can be structured as follows:
```mermaid
flowchart TD
    D["Defense Mechanisms\n(Defense Layers)"]
    D --> L1["Layer 1: Input Validation"]
    D --> L2["Layer 2: Pipeline Hardening"]
    D --> L3["Layer 3: Output Monitoring"]
    D --> L4["Layer 4: Normative Framework"]
    L1 --> L1a["Content authenticity verification"]
    L1 --> L1b["Metadata consistency checks"]
    L1 --> L1c["Source trustworthiness scoring"]
    L2 --> L2a["Adversarial training"]
    L2 --> L2b["Multi-signal cross-validation"]
    L2 --> L2c["Ranking anomaly detection"]
    L3 --> L3a["Response consistency monitoring"]
    L3 --> L3b["Time-based ranking fluctuation tracking"]
    L3 --> L3c["User feedback integration"]
    L4 --> L4a["White Hat GEO criteria definition"]
    L4 --> L4b["Manipulation detection transparency reports"]
    L4 --> L4c["Industry self-regulation"]
```
The role and current implementation status of each defense layer:
| Defense Layer | Role | Current Status | Implementation Difficulty |
|---|---|---|---|
| Input validation | Verify content and metadata authenticity before indexing | Basic level | Medium |
| Pipeline hardening | Ensure robustness against adversarial inputs at each retrieval/reranking/generation stage | Virtually nonexistent | High |
| Output monitoring | Detect anomalies in ranking patterns of generated responses | Partial | Medium |
| Normative framework | Define industry standards drawing the line between optimization and manipulation | Absent | High (non-technical factors) |
Intersection with AI Security Research
CORE’s research structurally connects with existing AI security research agendas.
| AI Security Agenda | Definition | GEO Counterpart |
|---|---|---|
| Prompt injection | Insert malicious instructions into LLM to induce unintended behavior | Insert text in content that induces the LLM to cite that source |
| Data poisoning | Inject malicious samples into training data to alter model behavior | Deploy manipulated information across indexable web content |
| Adversarial examples | Subtly modify inputs to disrupt model classification/output | Fine-tune content to disrupt ranking algorithms |
| Model extraction | Observe model behavior to reverse-engineer internal logic | Observe ranking fluctuation patterns to reverse-engineer ranking algorithms |
These correspondences suggest that GEO security is not a standalone new problem but can be addressed by extending existing AI security frameworks. However, GEO-specific characteristics — multi-stage pipeline, an open input space of web content, real-time impact on commercial services — introduce additional complexity.
Limitations
A gap may exist between the specific pipeline configuration of the experimental environment and reproducibility on commercial systems. Additionally, the 91.4% figure is a result under controlled experimental conditions, and actual commercial systems may have additional filtering layers. However, verifying whether such defenses are sufficient is itself a future research task.
Additionally, the paper’s attack scenarios are static. Real-world manipulation attempts are dynamic processes that adapt and evolve in response to defense mechanisms. This arms race dynamic is beyond the scope of this paper but is a topic that must be addressed in future research.
Cross-Analysis: Evaluation Integrity and Manipulation Risk
Placing both papers side by side reveals not just a comparison but a unified problem: GEO evaluation framework integrity and ranking manipulation risk are two sides of the same coin.
Comparison Frame
| Dimension | Kim et al. — SAGEO Arena | Jin et al. — CORE |
|---|---|---|
| Perspective | Evaluation methodology | Security/adversarial analysis |
| Core question | Is current evaluation sufficient? | Is the current system safe? |
| Core contribution | Full-pipeline evaluation framework | Ranking manipulation vulnerability demonstration |
| Key numbers | Retrieval hit rate +22% | Top-5 promotion 91.4% success |
| GEO redefinition | Content optimization → pipeline optimization | Optimization → includes security |
| Methodology | Unified evaluation environment + variable isolation experiments | Attack taxonomy + success rate measurement |
| Practical audience | GEO strategists, evaluation researchers | Platform security teams, policy makers |
| Status | Preprint | Preprint / ICLR 2026 submission |
Structural Connection: Interdependence of Evaluation and Security
SAGEO Arena’s pipeline evaluation structure and CORE’s attack vector analysis map to each other with striking precision. At the exact point where SAGEO Arena discovers “structured information is decisive at the retrieval stage,” CORE’s “schema markup abuse” attack operates.
This correspondence visualized:
| Pipeline Stage | SAGEO Finding | CORE Attack Vector | Implication |
|---|---|---|---|
| Retrieval | Structured info +22% effect | Schema markup abuse, metadata manipulation | The most effective optimization point is also the most vulnerable attack point |
| Reranking | Multi-signal combination determines ranking | Multi-source consistency attack, synthetic backlinks | Vulnerable to attacks that synthesize trust signals |
| Generation | Content structure and citations affect citation rate | Authority signal injection, citation network manipulation | Generation model overly reliant on authority signals |
This correspondence reveals a fundamental dilemma: The information needed for effective GEO (structured data, authority signals, citations) is the same information exploited for manipulation. This is structurally the same problem as in SEO, where “good SEO” and “Black Hat SEO” use the same technical mechanisms and differ only in intent.
The Need for an Integrated Framework
The conclusion drawn from cross-analyzing both papers is clear: for the GEO ecosystem to mature healthily, evaluation and security must be treated not as separate research agendas but within a single integrated framework.
```mermaid
flowchart TD
    subgraph Integrated["Integrated GEO Framework"]
        direction TB
        E["Evaluation\nSAGEO Arena direction"]
        S["Security\nCORE direction"]
        E <-->|"Interdependent"| S
    end
    subgraph SEO_Layer["Existing SEO Infrastructure"]
        T["Technical SEO"]
        C["Content SEO"]
    end
    subgraph GEO_Layer["GEO Extension"]
        P["Pipeline Optimization"]
        D["Defensive GEO"]
    end
    SEO_Layer --> Integrated
    Integrated --> GEO_Layer
    P --> O["Practical GEO Strategy"]
    D --> O
```
This integrated framework has the following practical implications:
- Include a security dimension in evaluation metrics: When measuring GEO performance, robustness against manipulation should be evaluated alongside ranking improvements.
- Defensive GEO strategies: A defensive perspective is needed to ensure your content is not displaced by manipulated content.
- Platform-content producer collaboration: A cooperative structure is needed where platforms build defense mechanisms and content producers optimize in alignment with them.
Comprehensive Synthesis: Cumulative Insights from 6 GEO Papers
Combining the 4 papers from previous reviews with these 2 reveals the overall picture painted by all 6 GEO papers.
| Paper | Year | Core Contribution | GEO Maturity Stage |
|---|---|---|---|
| Aggarwal et al. | 2024 | GEO concept definition, GEO-Bench, PAWC | Definition |
| Chen et al. | 2025 | Earned media bias, engine-specific differences | Definition |
| Wu et al. (AutoGEO) | 2025 | Quality-preserving auto-optimization | Optimization |
| Bagga et al. (E-GEO) | 2025 | E-commerce vertical benchmark | Optimization |
| Kim et al. (SAGEO Arena) | 2026 | Unified pipeline evaluation | Measurement |
| Jin et al. (CORE) | 2026 | Ranking manipulation vulnerability demonstration | Security |
The current position of the GEO field as cumulatively demonstrated by these 6 papers:
- Definition: Established. Consensus is forming on what GEO is and how it differs from SEO.
- Measurement: Early frameworks exist but are not standardized. Proposals like PAWC and SAGEO exist, but no industry-standard metrics have been adopted.
- Optimization: Approaches have begun from both general-purpose and vertical sides. The feasibility of quality-preserving optimization is demonstrated, but replication scope is limited.
- Security: At the problem identification stage. Vulnerabilities are empirically demonstrated, but implementation and effectiveness verification of defense mechanisms have not begun.
The still-unsolved problems — standardized evaluation protocols, cross-platform comparable benchmarks, defense mechanism effectiveness verification, longitudinal studies, and ROI linkage models — are gaps that must be filled in the field’s next phase.
Industry Implications: The Need for Defensive GEO
The most important practical takeaway from combining both papers is the need for the concept of Defensive GEO.
Limitations of Current GEO Strategy
Current GEO strategy discourse mostly focuses on “how to get better exposure in AI search.” This corresponds to the offensive strategy of “how to raise rankings” in SEO. However, CORE’s results show this approach alone is insufficient.
A 91.4% promotion success rate means a competitor can, with high probability, push their own content into the top positions and displace yours. In this environment, it becomes a strategic imperative not only to raise your own content's visibility but also to avoid being displaced by manipulation attempts.
Components of Defensive GEO
A defensive GEO strategy should include the following elements:
- Multi-stage optimization: Following SAGEO Arena’s lesson, optimize across the entire pipeline including structured information, not just body text. This makes it harder for a single attack vector to overturn rankings.
- Multi-source consistency: Ensure your information is consistently referenced across diverse, trustworthy third-party sources. This is a legitimate defense against CORE’s “multi-source consistency attack.”
- Monitoring systems: Continuously monitor your visibility in AI search responses to detect anomalous ranking fluctuations early.
- Content authenticity reinforcement: Focus on content types that are difficult to fabricate — original data, proprietary research findings, verifiable information.
Practical conclusion: GEO strategy must expand from “how to raise visibility” to “how to raise visibility and how to defend it.” SAGEO Arena tells us “what to optimize,” and CORE tells us “what to defend against.”
References
- Kim, J. et al. (2026). SAGEO Arena. Preprint.
- Jin, Z. et al. (2026). CORE: Controlling Output Rankings. Preprint / ICLR 2026 Submission.
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD 2024.
- Chen, Y. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Working Paper.
- Wu, Z. et al. (2025). AutoGEO. Preprint.
- Bagga, N. et al. (2025). E-GEO. Preprint.
Related Posts

GEO Paper Review: Optimization Approaches and Vertical Applications
Review of AutoGEO (quality-preserving auto-optimization) and E-GEO (e-commerce vertical benchmark) papers, analyzing how GEO optimization seeks a Pareto optimal between visibility and utility.

GEO Paper Review: Definition and Foundational Frameworks
Review of the KDD 2024 GEO paper and Chen et al. 2025. Covers the academic definition of GEO, the PAWC visibility metric, and AI search's preference for earned media.

HubSpot, Semrush, Adobe, and Conductor Enter GEO — How Incumbents Are Moving
Analysis of major players entering the GEO market (SearchGPT, Perplexity, Google AI Overviews), their response characteristics, and step-by-step response strategies for enterprises.