Review of AutoGEO (quality-preserving auto-optimization) and E-GEO (e-commerce vertical benchmark) papers, analyzing how GEO optimization seeks a Pareto-optimal balance between visibility and utility.
Review Context
GEO research has been branching rapidly since Aggarwal et al. defined the concept at KDD 2024. While early work focused on proving “does GEO work?”, 2025 preprints shifted the questions to “how do we optimize?” and “which domains do we apply it to?” This transition signals that the GEO field is moving beyond proof of concept into the engineering phase.
Looking at SEO’s evolution as a reference, this branching was predictable. SEO also started from a single question — “how does a search engine rank crawled content?” — before splitting along domain and technique axes into technical SEO, on-page SEO, local SEO, e-commerce SEO, and more. GEO is following the same trajectory.
The two papers covered here represent two different axes of that branching.
| Paper | Authors | Core Question | Approach |
|---|---|---|---|
| AutoGEO | Wu et al. (2025) | Can we automate optimization while preserving quality? | General-purpose framework |
| E-GEO | Bagga et al. (2025) | How do we measure GEO in e-commerce? | Vertical-specific benchmark |
Wu et al.’s AutoGEO tackles the challenge of general-purpose optimization automation with quality preservation, while Bagga et al.’s E-GEO addresses the challenge of building a benchmark for the e-commerce vertical. This review analyzes each paper in order, followed by cross-comparison and unresolved issues.
The Flow of GEO Optimization Research
Understanding where these two papers sit requires grasping the broader flow of GEO research. The table below organizes major GEO studies from 2024-2025 chronologically.
| Period | Paper/Study | Core Contribution | Research Stage |
|---|---|---|---|
| 2024 Q2 | Aggarwal et al. (KDD) | GEO concept definition, GEO-Bench | Definition |
| 2025 Q1 | Chen et al. | Empirical behavior analysis, PAWC metric | Measurement |
| 2025 Q1 | Wu et al. (AutoGEO) | Quality-preserving auto-optimization | Optimization |
| 2025 Q1 | Bagga et al. (E-GEO) | E-commerce-specific benchmark | Vertical application |
| 2025-2026 | Kim et al. (SAGEO Arena) | Full pipeline evaluation | Evaluation framework |
| 2025-2026 | Jin et al. (CORE) | Ranking manipulation risk demonstration | Security |
AutoGEO and E-GEO address research questions that naturally follow the definition and measurement stages: “how do we execute?” and “where do we apply?” Once the definition stage established “what GEO is” and the measurement stage established “how to quantify GEO,” the remaining questions are about execution and application.
AutoGEO: Automated Optimization Without Sacrificing Quality
Wu et al.’s AutoGEO (2025 preprint) directly addresses a core concern in GEO research: Does GEO optimization degrade content quality? This question is intimately tied to patterns that have repeated throughout SEO history. Just as keyword stuffing and link farms — strategies that compromise content integrity to boost search rankings — polluted the SEO ecosystem, the same concern exists for GEO.
Problem Definition: The GEO Dilemma
The specific problem AutoGEO aims to solve is as follows.
Most existing GEO optimization attempts have been either manual or tend to compromise the content’s original purpose — delivering useful information to users — during the optimization process. For example, inserting unnecessary statistics to increase citation probability, or excessively adding authoritative citations to the point of disrupting the content’s natural flow.
Wu et al. decompose this problem along three axes:
| Axis | Problem | Limitation of Existing Approaches |
|---|---|---|
| Automation | Manual optimization cannot scale | Rule-based automation lacks precision |
| Quality preservation | Optimization degrades usefulness | Assumes a visibility-quality tradeoff |
| Generalization | Only effective for specific engines/queries | Performance drops on domain transfer |
AutoGEO sets the goal of solving all three simultaneously. Its core claim is that visibility and quality are not zero-sum but can exist in a cooperative relationship.
AutoGEO Architecture
AutoGEO’s pipeline consists of three phases, each designed as an independent module.
```mermaid
flowchart TD
    A[Input content] --> B[Phase 1: Automatic preference rule extraction]
    B --> C[Rule set R]
    C --> D[Phase 2: Rule-based rewriting]
    D --> E[Rewritten content]
    E --> F[Phase 3: Quality preservation verification]
    F -->|Pass| G[Optimized content]
    F -->|Fail| H[Rule adjustment]
    H --> D
    style B fill:#e8f4fd,stroke:#333
    style D fill:#e8f4fd,stroke:#333
    style F fill:#fde8e8,stroke:#333
```
Phase 1: Preference Rule Extraction
In this phase, AutoGEO automatically analyzes which content characteristics generative engines prefer. Specifically, the process involves:
- Collecting generative engine responses for various queries
- Comparing cited sources against uncited sources
- Extracting content characteristics commonly found in cited sources — structural elements, information density, writing style, citation patterns — as rules
This process is similar to reverse engineering a search engine’s ranking factors in traditional SEO, but differs methodologically because the target is not a traditional search algorithm but LLM behavioral patterns. Since LLMs operate on probabilistic patterns based on training data and prompts rather than explicit ranking algorithms, rule extraction uses the LLM itself as an analytical tool rather than statistical analysis.
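The cited-vs-uncited comparison above can be sketched as a small feature-lift analysis. The features, lift threshold, and sample documents below are illustrative assumptions, not the paper's actual extraction method, which uses the LLM itself as the analyzer:

```python
# Hypothetical sketch of Phase 1: derive preference rules by comparing
# structural features of cited vs. uncited sources. Feature names and the
# lift threshold are illustrative assumptions, not taken from the paper.

def extract_features(doc: str) -> dict:
    """Crude structural features of a document (illustrative only)."""
    lines = doc.splitlines()
    return {
        "has_table": any("|" in ln for ln in lines),
        "has_list": any(ln.lstrip().startswith(("-", "*")) for ln in lines),
        "has_numbers": any(ch.isdigit() for ch in doc),
    }

def extract_rules(cited: list[str], uncited: list[str],
                  min_lift: float = 1.5) -> list[str]:
    """Keep features that are markedly more frequent in cited sources."""
    def rate(docs, feat):
        return sum(extract_features(d)[feat] for d in docs) / max(len(docs), 1)
    rules = []
    for feat in ("has_table", "has_list", "has_numbers"):
        c, u = rate(cited, feat), rate(uncited, feat)
        if c > 0 and c >= min_lift * max(u, 1e-9):
            rules.append(f"prefer:{feat}")
    return rules

cited = ["| spec | value |\n- bullet point\nRated 4.5", "- item one\n- item two"]
uncited = ["A long narrative paragraph without structure."]
print(extract_rules(cited, uncited))
```

A production version would replace the hand-coded features with LLM-driven analysis of the engine's citation behavior, as the paper describes.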
Phase 2: Rule-Based Rewriting
The extracted rule set is applied to existing content for automatic rewriting. The key to this phase is the precision of rule application. Rather than applying all rules uniformly, appropriate rule subsets are selectively applied based on the content’s domain and type.
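The selective application described above could look like the following. The rule-to-domain mapping and the rewrite transformations are hypothetical, since the paper does not publish its rule format:

```python
# Hypothetical sketch of Phase 2: apply only the rule subset relevant to
# the content's domain. Rule names, domain tags, and rewrite functions
# are illustrative assumptions.

RULE_DOMAINS = {
    "prefer:spec_table": {"ecommerce"},
    "prefer:citations": {"academic", "news"},
    "prefer:step_list": {"ecommerce", "academic"},
}

REWRITES = {
    "prefer:spec_table": lambda text: text + "\n| spec | value |",
    "prefer:citations": lambda text: text + "\n[source: n]",
    "prefer:step_list": lambda text: "Steps:\n" + text,
}

def rewrite(text: str, rules: list[str], domain: str) -> str:
    """Apply only rules whose domain set includes this content's domain."""
    for rule in rules:
        if domain in RULE_DOMAINS.get(rule, set()):
            text = REWRITES[rule](text)
    return text

out = rewrite("Product X overview.",
              ["prefer:spec_table", "prefer:citations"], "ecommerce")
print(out)  # citation rule is skipped: it is not mapped to e-commerce
```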
Phase 3: Quality Preservation Verification
This phase evaluates whether the rewritten content maintains the original’s utility. Wu et al. measure utility across four dimensions:
| Utility Dimension | Definition | Measurement Method |
|---|---|---|
| Completeness | Are the core information elements from the original preserved? | Retention rate of key information elements |
| Accuracy | Have factual errors been introduced during rewriting? | Fact-check-based verification |
| Readability | Is the rewritten content’s readability maintained or improved? | Readability metrics |
| Naturalness | Is the writing style natural, without mechanical optimization artifacts? | Human evaluation + automated evaluation |
Content that fails verification is fed back to Phase 2 with adjusted rules. This iterative loop is what makes AutoGEO "cooperative": rather than simply maximizing visibility, the system searches for a Pareto-optimal point between visibility and quality.
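The verify-or-adjust loop can be sketched as follows. The scoring functions, the 0.8 threshold, and the "drop the last rule" adjustment are illustrative stand-ins for the paper's verifiers:

```python
# Hypothetical sketch of the Phase 3 loop: rewrite, score utility on the
# four dimensions, and retry with a reduced rule set on failure.

def utility_scores(original: str, rewritten: str) -> dict:
    """Stand-in checks; the paper uses fact-checking, readability metrics, etc."""
    kept = sum(w in rewritten for w in original.split())
    return {
        "completeness": kept / max(len(original.split()), 1),
        "accuracy": 1.0,      # placeholder for fact-check verification
        "readability": 1.0,   # placeholder for a readability metric
        "naturalness": 1.0,   # placeholder for human/automated evaluation
    }

def optimize(original: str, rules: list[str],
             threshold: float = 0.8, max_rounds: int = 3) -> str:
    active = list(rules)
    for _ in range(max_rounds):
        # Toy rewrite: append a marker per active rule
        candidate = original + "".join(f"\n[{r}]" for r in active)
        if min(utility_scores(original, candidate).values()) >= threshold:
            return candidate  # passed verification
        active = active[:-1]  # drop the most aggressive rule and retry
    return original  # fall back to the unmodified content

print(optimize("Key facts about the product.", ["add_table", "add_stats"]))
```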
Multi-Agent Cooperative Framework
The technical core of AutoGEO lies in its multi-agent cooperative framework. Agents responsible for each phase operate independently while collaborating under a shared objective function — visibility improvement + quality maintenance.
```mermaid
flowchart LR
    subgraph Analyzer["Analyzer Agent"]
        A1[Query analysis]
        A2[Engine response collection]
        A3[Rule extraction]
    end
    subgraph Optimizer["Optimizer Agent"]
        O1[Rule selection]
        O2[Content rewriting]
        O3[Change tracking]
    end
    subgraph Validator["Validator Agent"]
        V1[Utility evaluation]
        V2[Quality score calculation]
        V3[Pass/reject decision]
    end
    Analyzer -->|Rule set| Optimizer
    Optimizer -->|Rewriting results| Validator
    Validator -->|Feedback| Optimizer
    Validator -->|Rule adjustment request| Analyzer
```
A notable aspect of this structure is that the Validator agent does more than simply pass or reject — it analyzes the failure cause and provides feedback to both the Optimizer and Analyzer agents. This enables iterative improvement of the entire system.
This design aligns with the multi-agent patterns actively being researched in LLM-based systems; it echoes findings that multiple role-separated agents collaborating on a complex task outperform a single LLM handling everything.
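The Validator's cause-directed feedback could be routed along these lines. The diagnosis rule (low naturalness implies a rewriting problem, otherwise a rule problem) and the thresholds are illustrative assumptions:

```python
# Minimal sketch of feedback routing: the Validator returns not just
# pass/fail but a diagnosed cause, which decides whether the Optimizer
# (bad rewrite) or the Analyzer (bad rule set) should act next.
# The diagnosis heuristic and 0.8 thresholds are illustrative.

def validator(result: dict) -> tuple[bool, str]:
    if result["utility"] >= 0.8:
        return True, "ok"
    # Low naturalness suggests the rewriting step is at fault;
    # otherwise suspect the extracted rules themselves.
    cause = "optimizer" if result["naturalness"] < 0.8 else "analyzer"
    return False, cause

def route(result: dict) -> str:
    passed, cause = validator(result)
    return "done" if passed else f"feedback->{cause}"

print(route({"utility": 0.6, "naturalness": 0.5}))
print(route({"utility": 0.6, "naturalness": 0.9}))
print(route({"utility": 0.9, "naturalness": 0.9}))
```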
Experimental Design
Wu et al.’s experimental design is summarized below.
| Experimental Element | Details |
|---|---|
| Target engines | Multiple generative search engines (see paper for specific engine names) |
| Query set | Information-seeking queries across diverse domains |
| Comparison groups | Original content (baseline), naive optimization, AutoGEO |
| Evaluation metrics | GEO visibility metrics + content utility metrics (dual evaluation) |
| Key results | Average GEO metric +35.99%, utility metrics no degradation |
Particularly notable in the experimental design is the inclusion of "naive optimization" as a separate baseline. This ensures the conclusion is not the trivial "optimization improves performance," but rather "AutoGEO's cooperative approach preserves quality better than naive optimization."
Key Result Analysis
AutoGEO achieved an average 35.99% improvement on GEO visibility metrics, and this improvement was achieved without statistically significant degradation in content utility.
This result can be interpreted on two levels.
First, the quantitative significance. 35.99% is meaningful as an absolute number, but what matters more is that it was achieved without quality degradation. Compared to naive optimization, AutoGEO shows similar visibility improvement magnitudes but significantly less utility degradation. This suggests the cooperative framework’s feedback loop is actually working.
Second, the structural significance. It challenges the implicit assumption that visibility and quality are locked in a tradeoff. This bears on the legitimacy of the entire GEO field: if GEO optimization inevitably degrades content quality, GEO becomes a technology that harms user experience. AutoGEO's results provide a counterargument to this concern.
However, several caveats apply when interpreting these results:
- 35.99% is an “average.” Variance by domain and query type likely exists.
- “No utility degradation” is measured against criteria that Wu et al. themselves established. External validation is needed.
- The results are tied to the generative engine version at the time of experimentation and require separate verification for continued validity after engine updates.
Theoretical Implications of Cooperative GEO
The concept of “cooperative GEO” proposed by Wu et al. extends beyond a mere technical approach to carry implications as an ethical framework for the GEO field.
Throughout SEO’s history, Google repeatedly emphasized the principle of “create content for users,” but in practice, strategies that harmed user experience for higher search rankings were pervasive. The same pattern could easily repeat in GEO: inserting unnecessary elements to increase the probability of AI engine citations, or distorting content’s arguments — “GEO spam” — could emerge.
AutoGEO’s cooperative framework presents a technical solution to this problem. By embedding quality verification into the optimization process, quality-degrading optimization is blocked at the system level. If this approach spreads industry-wide, it could preemptively prevent the formation of a perception that “optimized content equals low-quality content.”
E-GEO: An E-Commerce-Specific Benchmark
Bagga et al.’s E-GEO (2025 preprint) takes a different direction from AutoGEO. In a landscape where most GEO research focuses on informational queries, E-GEO builds the first systematic benchmark for queries with commercial intent.
Why E-Commerce GEO Requires Separate Research
E-commerce queries are fundamentally different in nature from information-seeking queries. This difference is not merely “the query content differs” — it means the optimization target itself is different.
| Dimension | Information-Seeking Queries | E-Commerce Queries |
|---|---|---|
| User intent | Understanding, learning | Purchase decision |
| Expected response format | Explanations, analysis | Comparisons, recommendations, specs |
| Distance to conversion | Indirect (awareness → interest) | Direct (comparison → purchase) |
| Trust basis | Expertise, sources | Price, reviews, usage experience |
| Time sensitivity | Relatively low | High (price fluctuations, stock) |
| Optimization goal | Increase citation probability | Citation + purchase conversion |
“Best wireless earbuds recommendation” and “fundamental principles of quantum mechanics” are both search queries, but the structure of the AI engine’s response, citation patterns, and the type of information users expect are completely different. A general-purpose GEO benchmark cannot capture this difference.
E-GEO Benchmark Composition
The scale and composition of the benchmark E-GEO built is as follows:
| Item | Details |
|---|---|
| Query scale | 7,000+ realistic product queries |
| Query source | Based on actual e-commerce search logs |
| Rewriting strategies | 15 heuristic-based rewriting approaches |
| Optimization method | Iterative Prompt Optimization (IPO) |
| Application targets | E-commerce product descriptions, reviews, comparison content |
| Evaluation engines | Multiple generative search engines |
The 7,000+ query scale is considerably larger than existing GEO benchmarks. Considering that Aggarwal et al.'s GEO-Bench used query sets in the hundreds, E-GEO represents a substantial effort to capture the diversity of the e-commerce domain at scale.
Product Query Taxonomy
One of E-GEO’s key contributions is proposing a taxonomy that systematically classifies e-commerce queries. This classification empirically demonstrates that different optimization strategies are needed for each query type.
```mermaid
flowchart TD
    Q[E-commerce Query] --> C1[Product Discovery]
    Q --> C2[Product Comparison]
    Q --> C3[Purchase Decision]
    Q --> C4[Usage & Troubleshooting]
    C1 --> C1a["'Best wireless earbuds'"]
    C1 --> C1b["'Running shoes under $100'"]
    C2 --> C2a["'AirPods vs Galaxy Buds'"]
    C2 --> C2b["'Dyson V15 vs V12 differences'"]
    C3 --> C3a["'AirPods Pro 2 price'"]
    C3 --> C3b["'Galaxy S25 pre-order'"]
    C4 --> C4a["'AirPods one side not working'"]
    C4 --> C4b["'Dyson filter replacement cycle'"]
```
Each query type elicits different AI engine response patterns, and therefore requires different optimization strategies. E-GEO’s experimental results by query type are summarized below.
| Query Type | AI Engine Response Characteristics | Effective Optimization Strategy | Ineffective Strategy |
|---|---|---|---|
| Product Discovery | List-format, category-based | Structured spec tables, category tags | Simple keyword insertion |
| Product Comparison | Comparison tables, pros/cons analysis | Clear comparison framework, numerical data | One-sided recommendations |
| Purchase Decision | Price/stock info, purchase links | Price history, discount info, buying guide | Generic product descriptions |
| Usage/Troubleshooting | Step-by-step guides, FAQ | Structured resolution steps, visual guides | Long narrative explanations |
15 Rewriting Heuristics
The 15 rewriting heuristics designed by E-GEO are optimization strategies specialized for e-commerce content. These strategies differ in character from the general-purpose GEO strategies applied to academic or news content. Below is a category-based classification:
| Category | Strategy Examples | Applicable Types |
|---|---|---|
| Structural optimization | Spec table insertion, comparison matrix addition, FAQ structuring | All types |
| Data enrichment | Price info addition, user review summary insertion, benchmark figures | Discovery, comparison |
| Trust signals | Expert citations, verified data source attribution, test results | Purchase decision |
| Format optimization | Pros/cons lists, rating summaries, explicit recommendation reasoning | Comparison, discovery |
| Intent matching | Buying guide tone, problem-solving step sequencing, usage scenarios | Purchase/usage |
The effectiveness of these 15 strategies was not uniform. The gap between the most and least effective strategies in E-GEO’s experiments was large, meaning an approach of “just apply any strategy to e-commerce content” is not viable.
Iterative Prompt Optimization (IPO)
Another methodological contribution from E-GEO is Iterative Prompt Optimization (IPO). Rather than performing optimization with a single prompt, this approach gradually improves prompts through multiple iterations.
E-GEO’s iterative prompt optimization showed significant improvement in visibility metrics with multiple iterations compared to a single attempt. The iteration effect was most pronounced for comparison-type queries.
This methodology contrasts with AutoGEO’s rule-based approach. Where AutoGEO extracts explicit rules and applies them, E-GEO’s IPO explores optimal results through iterative prompt adjustments without explicitly defining rules.
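The iterate-and-keep-the-best character of IPO can be sketched as below. In E-GEO the score would come from measured engine visibility; here a toy length-based scorer and a random mutation step stand in, so everything in this snippet is an illustrative assumption:

```python
# Hypothetical sketch of iterative prompt optimization: refine the
# rewriting prompt over several rounds, retaining the best-scoring variant.
import random

def visibility_score(prompt: str) -> float:
    # Toy stand-in: longer, more specific prompts score higher here.
    # A real system would measure actual citation outcomes instead.
    return min(len(prompt) / 100.0, 1.0)

def mutate(prompt: str, rng: random.Random) -> str:
    additions = ["Add a comparison table.", "Include price data.",
                 "Summarize reviews."]
    return prompt + " " + rng.choice(additions)

def ipo(seed_prompt: str, rounds: int = 5, seed: int = 0) -> str:
    rng = random.Random(seed)
    best, best_score = seed_prompt, visibility_score(seed_prompt)
    for _ in range(rounds):
        candidate = mutate(best, rng)
        score = visibility_score(candidate)
        if score > best_score:          # keep only improving variants
            best, best_score = candidate, score
    return best

print(ipo("Rewrite this product page for AI search."))
```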
E-Commerce-Specific Findings
Here are the findings unique to the e-commerce domain from E-GEO.
General-purpose GEO strategies achieve only about 40-60% effectiveness on e-commerce queries. Visibility improvement becomes significantly higher when e-commerce-specific strategies are applied.
This finding directly contradicts the assumption that “GEO strategies are universally applicable.” When the domain changes, the optimization strategy must change too.
Specifically, the elements effective at driving AI engine citations in e-commerce content are:
- Price comparison data: Citation probability increases when content includes tables quantitatively comparing prices across competing products.
- Spec tables: Presenting key specifications in structured tables increases the probability of AI engines citing that source when generating comparison responses.
- Real-use review summaries: Content summarizing real usage experiences by category is preferred over simple star ratings.
- Purchase decision trees: Conditional recommendation structures like “if your budget is under $X, choose A; if above, choose B” are favorable for citation.
Conversely, strategies effective for academic content — such as authoritative citations and statistical data density — showed relatively lower effectiveness in e-commerce contexts.
Cross-Analysis: General-Purpose vs Vertical-Specific
Placing the two papers side by side reveals the two clear directions in which GEO research is branching.
Comparison Framework
| Dimension | AutoGEO | E-GEO |
|---|---|---|
| Core question | Can we optimize while maintaining quality? | Can we build a benchmark for a specific domain? |
| Approach direction | General-purpose | Vertical-specific |
| Methodology | Rule extraction + auto-rewriting + quality verification | Heuristic design + iterative prompt optimization |
| Agent architecture | Multi-agent cooperative | Single optimization loop |
| Query type focus | Information-seeking queries | Commercial-intent queries |
| Scale | Diverse domain query sets | 7,000+ e-commerce queries |
| Evaluation criteria | GEO metrics + utility dual measurement | E-commerce visibility metrics |
| Automation level | High (from rule extraction to rewriting) | Medium (heuristics manual, application automatic) |
| Practical implications | Justification for content teams adopting GEO | Strategy basis for e-commerce operators in AI search |
Convergence Points
Though superficially different, the two approaches converge at several points.
First, the importance of structured content. In both AutoGEO’s preference rule extraction results and E-GEO’s rewriting heuristics, structured content (tables, lists, step-by-step guides) is commonly found to have higher AI engine citation probability than unstructured narrative content. This aligns with the technical characteristic that LLMs encode structured information more effectively from their training data and find it easier to reference structured sources during response generation.
Second, the need for content-type differentiation. Whether it’s AutoGEO selectively applying rules based on content domain or E-GEO measuring different optimization strategy effects by query type, both arrive at the conclusion that “a single GEO strategy does not work for all content.”
Third, the validity of iterative optimization. Both AutoGEO’s feedback loops and E-GEO’s iterative prompt optimization share the conclusion that multiple iterations produce better results than a single optimization pass.
Combination Potential
```mermaid
flowchart TD
    subgraph Combined["Combined Framework (Hypothesis)"]
        A[AutoGEO's rule extraction] --> B[Domain-specific rule filtering]
        B --> C[Merge with E-GEO's e-commerce heuristics]
        C --> D[Integrated rewriting]
        D --> E[AutoGEO's quality verification]
        E -->|Pass| F[Optimization complete]
        E -->|Fail| G[Adjust via E-GEO's IPO]
        G --> D
    end
The two approaches are not mutually exclusive. In fact, they could become more powerful when combined. Applying AutoGEO’s cooperative GEO framework (rule extraction + quality verification) to E-GEO’s e-commerce domain is the natural next step. Specifically:
- Use e-commerce domain queries as training data in AutoGEO’s rule extraction phase
- Compare extracted rules with E-GEO’s 15 heuristics to strengthen domain-specific rules
- Add e-commerce-specific utility metrics (price accuracy, spec completeness, etc.) to AutoGEO’s quality verification phase
- Integrate E-GEO’s iterative prompt optimization into AutoGEO’s feedback loop
A model that performs commerce-specific optimization while maintaining quality would be the most practically useful.
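The second integration step, comparing extracted rules with E-GEO's heuristics to strengthen domain-specific rules, could be sketched as a weighted merge. All rule names, weights, and the agreement boost are illustrative assumptions:

```python
# Hypothetical sketch: merge AutoGEO-style extracted rules with
# E-GEO-style heuristics, amplifying rules both sources agree on.

def merge_rules(extracted: dict, heuristics: dict,
                agree_boost: float = 1.5) -> dict:
    """Both inputs map rule name -> weight; agreement is amplified."""
    merged = {}
    for rule in set(extracted) | set(heuristics):
        w = extracted.get(rule, 0.0) + heuristics.get(rule, 0.0)
        if rule in extracted and rule in heuristics:
            w *= agree_boost  # both sources agree: strengthen the rule
        merged[rule] = round(w, 2)
    return merged

auto_rules = {"spec_table": 0.6, "citations": 0.8}
egeo_heuristics = {"spec_table": 0.7, "price_history": 0.9}
print(merge_rules(auto_rules, egeo_heuristics))
```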
Unresolved Issues (Gap Analysis)
Both papers share the structural limitations of GEO research at this stage. These limitations are not weaknesses of individual papers but rather indicators that the GEO field is still in its early phase.
Limitation 1: Generative Engine Volatility
Both papers experiment based on AI engine responses at a specific point in time. However, LLM-based search engines undergo frequent model updates, prompt changes, and ranking logic modifications. An optimization strategy effective today could be invalidated by the next model update.
This problem existed in SEO too, but to a different degree. Google’s search algorithm updates (Panda, Penguin, BERT, etc.) occurred on an annual basis, with continuity in ranking logic between updates. The volatility of LLM-based engines is far greater. The model architecture itself can change, and even the same model can produce very different response patterns depending on prompt engineering.
Limitation 2: No Standardized Evaluation Metrics
Whether AutoGEO and E-GEO use the same GEO metric standards, and whether cross-comparison is possible, remains unclear. A unified benchmark across all GEO research does not yet exist.
| Paper | Metrics Used | Cross-Comparability |
|---|---|---|
| Aggarwal et al. (2024) | GEO-Bench proprietary metrics | Serves as reference point (de facto standard) |
| Chen et al. (2025) | PAWC metric | Partially compatible |
| Wu et al. (AutoGEO) | GEO visibility + utility | Aggarwal-based but extended |
| Bagga et al. (E-GEO) | E-commerce visibility | Independent metrics, compatibility unclear |
This situation is analogous to the NLP field before GLUE/SuperGLUE emerged. When each study uses its own benchmark and metrics, cross-study comparison becomes impossible and measuring overall field progress becomes difficult.
Limitation 3: Missing Business KPI Linkage
Neither paper addresses how a 35.99% GEO metric improvement affects actual traffic, conversion rates, or revenue. The gap between academic benchmarks and business KPIs is an area that future research must fill.
In the causal chain from GEO visibility improvement to actual traffic inflow to conversion to revenue, current research only covers the first step.
This missing linkage is particularly pronounced in E-GEO’s e-commerce benchmark. In e-commerce, GEO’s value must ultimately be measured by revenue contribution, but E-GEO stops at visibility metrics.
Limitation 4: No Multilingual/Multicultural Validation
Both papers conduct English-centric experiments. Whether GEO optimization strategies work identically in non-English languages like Korean, Japanese, and Chinese has not been verified. Since LLM training data distributions differ by language and content consumption patterns differ by culture, direct transfer of English-language results is risky.
Limitation 5: No Multimodal Content Consideration
Current GEO research is focused on text content. However, in e-commerce, the share of multimodal content — product images, video reviews, infographics — is substantial. If multimodal AI search becomes widespread, text-based GEO strategies alone will be insufficient.
Future Research Directions
Synthesizing the above limitations, the directions that future GEO optimization research must address are:
| Research Direction | Necessity | Difficulty |
|---|---|---|
| Unified benchmark construction | Enable cross-study comparability | High (requires community consensus) |
| Business KPI linkage | Prove GEO’s practical value | High (requires A/B testing) |
| Engine volatility response | Sustainable optimization strategies | Medium (monitoring systems) |
| Multilingual expansion | Validate non-English applicability | Medium (dataset construction) |
| Vertical expansion | Domains beyond e-commerce | Medium (requires domain expertise) |
| Multimodal GEO | Optimization including images/video | High (new methodology needed) |
Practitioner-Oriented Implications
Here are the practically actionable takeaways from both papers.
GEO Optimization Does Not Presuppose Quality Degradation
AutoGEO’s cooperative GEO results suggest it is possible to move beyond the “optimization vs. quality” dichotomy. There is academic basis to dispel the concern that “optimizing will make content worse” when content teams adopt GEO. However, this does not mean “any kind of optimization preserves quality.” This conclusion is only valid under a systematic framework with embedded quality verification.
Domain-Specific GEO Strategies Are Necessary
What E-GEO demonstrates is the limitation of general-purpose GEO strategies. Strategies effective in e-commerce differ from those effective for academic content. This implies that benchmarks and strategies specialized for each vertical — e-commerce, healthcare, finance, travel — are needed.
Structure Is Key
Both papers converge on the conclusion that structured content (tables, lists, step-by-step guides) is more favorable for AI engine citations than unstructured narrative. This is an immediately actionable strategy. Simply adding structural elements to existing content can improve AI search visibility.
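As a concrete instance of that takeaway, key/value facts already present in prose can be restructured into a markdown spec table; the field names below are illustrative:

```python
# Minimal sketch of the "add structure" takeaway: render product facts
# as a markdown spec table, a format both papers find more citable.

def to_spec_table(specs: dict[str, str]) -> str:
    rows = ["| Spec | Value |", "|---|---|"]
    rows += [f"| {k} | {v} |" for k, v in specs.items()]
    return "\n".join(rows)

print(to_spec_table({"Battery life": "6 h", "Weight": "5.3 g", "ANC": "Yes"}))
```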
Understand the Realistic Level of Automation
AutoGEO’s automatic rewriting is rule-based, not fully autonomous. E-GEO’s heuristics are also human-designed. At this point, GEO optimization is most realistic as a tool-assisted approach. The expectation that “AI will handle the optimization automatically” is premature.
Monitoring Systems Are Essential
Given the volatility of generative engines, a single round of optimization is not the end — continuous monitoring and re-optimization are necessary. This is the same context in which rank monitoring tools are essential in SEO. GEO also needs AI engine response monitoring systems, and this area is still lacking in tooling.
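A monitoring check along these lines could trigger re-optimization; the data source and the 20% drop threshold are illustrative assumptions:

```python
# Hypothetical sketch of a monitoring check: track a page's citation
# rate over time and flag re-optimization on a significant drop.

def needs_reoptimization(history: list[float],
                         drop_threshold: float = 0.2) -> bool:
    """history: periodic citation rates for a page, oldest first."""
    if len(history) < 2:
        return False
    baseline = max(history[:-1])  # best rate observed before this period
    current = history[-1]
    return current < baseline * (1 - drop_threshold)

print(needs_reoptimization([0.30, 0.32, 0.31, 0.22]))
print(needs_reoptimization([0.30, 0.31, 0.33]))
```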
References
- Wu, Z. et al. (2025). AutoGEO: Automating Generative Engine Optimization with Cooperative Content Rewriting. Preprint.
- Bagga, N. et al. (2025). E-GEO: A Testbed for Generative Engine Optimization in E-Commerce. Preprint.
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. Proceedings of KDD 2024.
- Chen, J. et al. (2025). Generative Engine Optimization: How to Dominate AI Search. Preprint.