Multi-Engine Architecture — Parallel Collection from 3 AI Search Engines


MJ · 12 min read

Analysis of multi-engine architecture design principles that leverage response variance as signals, featuring parallel collection structures and scalability via the adapter pattern.

Why Multiple Engines

When designing a GEO (Generative Engine Optimization) monitoring system, the first decision is which AI search engines to analyze.

A single-engine approach is tempting. You only need one parser, maintenance burden for response format changes stays small, and API costs remain low. WICHI’s initial prototype targeted just one engine.

But when you send the same query to different AI search engines, the results diverge significantly. This is not a matter of surface-level phrasing; the differences are structural.

Why Engine Responses Differ

Each AI search engine uses different training data, different search indexes, different ranking algorithms, and different response generation strategies. These differences directly affect which brands get mentioned, in what order they appear, and which sources are cited.

| Differentiating Factor | Impact on Results |
| --- | --- |
| Training data composition and cutoff | Whether a brand's latest information is reflected |
| Search index scope | Types and range of web sources referenced |
| Source preference | Weighting toward encyclopedic, community, or official sources |
| Response generation strategy | List-based, comparative, or narrative formats |
| Citation method | Inline citations, footnote lists, direct URL exposure |

For example, sending “best project management tools in 2026” to three AI search engines might yield: Engine A recommends Brand X first, citing the official site and review outlets. Engine B omits Brand X entirely and centers on Brand Y with community feedback citations. Engine C mentions both X and Y but in different order and context.

The Structural Limitation of Single-Engine Analysis

Reporting single-engine results as “brand visibility in AI search” is misleading — it represents visibility on that particular engine, not across the AI search ecosystem as a whole. Since you cannot control which engine end users choose, and market share in AI search is shifting rapidly, single-engine analysis is fundamentally incomplete.

graph TD
    Q[Same Search Query] --> E1[Engine A]
    Q --> E2[Engine B]
    Q --> E3[Engine C]

    E1 --> R1["Brand X: Ranked #1<br/>Brand Y: Not mentioned<br/>Brand Z: Ranked #3"]
    E2 --> R2["Brand X: Not mentioned<br/>Brand Y: Ranked #1<br/>Brand Z: Ranked #2"]
    E3 --> R3["Brand X: Ranked #2<br/>Brand Y: Ranked #3<br/>Brand Z: Ranked #1"]

    R1 --> AN[Cross-Engine Analysis]
    R2 --> AN
    R3 --> AN
    AN --> INS["Brand Z: Mentioned by all 3 → Strong signal<br/>Brand X: 2/3 engines → Moderate signal<br/>Brand Y: 2/3 engines, high variance → Possible source bias"]

This diagram illustrates a critical point: analyzing any single engine would have led to an entirely different conclusion. Looking at Engine A alone, you would conclude “Brand X has the highest visibility.” But aggregating all three engines reveals “Brand Z has the most stable visibility” — a fundamentally different takeaway.

Design Principle: AI search visibility should be measured not by ranking on any single engine, but by the consistency and quality of brand mentions across multiple engines.


Design Principles

Here are the core principles applied when designing the multi-engine collection system. These principles are not specific to WICHI — they apply broadly to any multi-LLM system.

Principle 1: Response Variance Is Signal, Not Noise

Initially, we viewed cross-engine variance as “a problem to be unified.” In practice, that variance turned out to be the most valuable insight.

When all engines consistently recommend a particular brand, it means that brand has strong online presence across diverse source types. Conversely, a brand recommended by only one engine likely has content concentrated in the source types that engine favors.

| Pattern | Meaning | Diagnosis |
| --- | --- | --- |
| Consistently high mentions across all engines | Strong presence across diverse sources | Brand visibility is healthy |
| Consistently low mentions across all engines | Overall lack of online presence | Full content strategy review needed |
| High mentions on specific engine(s) only | Concentrated in certain source types | Reinforce content on source types the underperforming engines prefer |
| Large rank variance across engines | Brand positioning perceived differently by source type | Develop separate content strategies per source type |

Rather than dismissing variance with "each engine is different," a GEO report should present engine-by-engine scores side by side; that comparison is the report's core structure.

Principle 2: Consensus and Divergence

The most useful framework for interpreting multi-engine data is “consensus vs. divergence.”

graph LR
    subgraph Consensus Pattern
        CA[Engine A: Brand X #1] --> CS[Strong Signal]
        CB[Engine B: Brand X #1] --> CS
        CC[Engine C: Brand X #1-2] --> CS
    end

    subgraph Divergence Pattern
        DA[Engine A: Brand Y #1] --> DS[Weak Signal + Needs Diagnosis]
        DB[Engine B: Brand Y Not Mentioned] --> DS
        DC[Engine C: Brand Y #4] --> DS
    end

Consensus: Multiple engines mention the same brand at similar positions. This is a strong signal that the brand exists consistently across diverse information sources. Higher consensus means higher confidence in the brand’s GEO score.

Divergence: Engines produce significantly different results. This is itself a diagnostic target. Divergence triggers the question “Why does this engine omit this brand?” — and the answer to that question points directly to content strategy gaps.

Design Principle: Consensus increases score confidence. Divergence reveals improvement opportunities. Both are meaningful data.

Principle 3: Partial Results Are Valid

If one of three engines fails, the results from the remaining two are still valid. Delivering partial results with explicit notation is better than withholding all data while waiting for perfection. The key requirement: always disclose which engine’s data is missing. This principle runs through the entire error-handling strategy.

Principle 4: Engines Are Plugins

The AI search market is evolving rapidly. New engines emerge, market share shifts, and API specifications change. Each engine should be designed as an independent module implementing a common interface, so that adding or removing engines does not affect the rest of the pipeline.


Parallel Collection Architecture

After committing to multi-engine collection, the next design choice was the collection method.

Sequential vs. Parallel Collection

| Factor | Sequential | Parallel |
| --- | --- | --- |
| Implementation complexity | Low | Higher (async control required) |
| Total elapsed time | Sum of Engines A + B + C | max(A, B, C) |
| Error handling | Straightforward (sequential try-catch) | Independent per-engine error handling needed |
| Debugging | Easy (trace in order) | Concurrent log separation required |
| Rate limit management | Natural spacing | Burst control needed |
| User-perceived speed | Slow (tens of seconds to minutes) | Fast (limited by slowest engine) |
| Scalability | Linear increase per engine added | Near-constant wait time regardless of engine count |

AI search engines respond slowly compared to typical REST APIs — model inference alone takes seconds to tens of seconds. Sequential processing of three engines makes a single query analysis prohibitively long; repeating this for dozens of queries pushes the total pipeline into minutes.

Parallel collection makes the slowest engine’s response time the total elapsed time. Whether there are three engines or five, latency converges to the single slowest response. Since WICHI is a SaaS where users press an analysis button and wait for results, perceived speed is a direct usability metric.

We chose parallel.
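The latency argument can be sketched with Python's `asyncio`. This is an illustrative model, not WICHI's actual code; the engine names and delays are placeholders and `asyncio.sleep` stands in for real API latency.

```python
import asyncio

# Stand-in for a per-engine collection call; the sleep models API latency.
async def collect(engine: str, latency: float) -> str:
    await asyncio.sleep(latency)
    return f"{engine} results"

async def collect_all() -> list:
    # All three coroutines start together, so total elapsed time is
    # max(latencies), not their sum. return_exceptions=True keeps one
    # engine's failure from cancelling the others.
    return await asyncio.gather(
        collect("engine_a", 0.03),
        collect("engine_b", 0.01),
        collect("engine_c", 0.02),
        return_exceptions=True,
    )

results = asyncio.run(collect_all())
```

With sequential awaits the run would take the sum of the three delays; with `gather`, it converges to the slowest one.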

Async Parallel Collection Flow

sequenceDiagram
    participant U as User
    participant API as API Server
    participant EA as Engine A Adapter
    participant EB as Engine B Adapter
    participant EC as Engine C Adapter
    participant DB as Database

    U->>API: Analysis request (query list)
    API->>API: Prepare query list

    par Parallel Collection
        API->>EA: Send all queries (async)
        API->>EB: Send all queries (async)
        API->>EC: Send all queries (async)
    end

    Note over EA,EC: Each engine also runs<br/>queries concurrently<br/>(with concurrency limits)

    EA-->>API: Engine A results (or partial failure)
    EB-->>API: Engine B results (or partial failure)
    EC-->>API: Engine C results (or partial failure)

    API->>API: Aggregate + normalize results
    API->>DB: Store normalized responses
    API->>U: Progress status update

The key design points are as follows.

Inter-engine parallelism: All three engines receive requests simultaneously. Each engine’s collection proceeds independently. If one engine runs slowly, it does not affect collection from the others.

Intra-engine concurrency control: Within each engine, multiple queries are processed concurrently. However, unlimited concurrent requests would trigger rate limits, so a per-engine semaphore caps concurrent request counts. This cap is configurable per engine.

Inter-request delay: Beyond concurrency limits, short delays between requests prevent burst patterns from hitting the engine’s instantaneous rate limits.
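The two intra-engine controls, a semaphore cap and a minimum inter-request delay, can be sketched as follows. The concurrency value and delay are illustrative, not WICHI's actual limits, and the sleep inside the semaphore stands in for the API call.

```python
import asyncio

MAX_CONCURRENCY = 3   # per-engine cap on in-flight requests (illustrative)
REQUEST_DELAY = 0.01  # minimum spacing between requests, in seconds
completed = []

async def run_engine(queries):
    # The semaphore must be created inside the running event loop.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def send_query(query):
        # At most MAX_CONCURRENCY queries hold the semaphore at once;
        # the delay spreads request starts to avoid burst patterns.
        async with sem:
            await asyncio.sleep(REQUEST_DELAY)  # stands in for the API call
            completed.append(query)

    await asyncio.gather(*(send_query(q) for q in queries))

asyncio.run(run_engine([f"q{i}" for i in range(10)]))
```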

The Adapter Pattern

Each engine is isolated behind an adapter implementing a common interface. The common interface defines:

  • Input: Query text, system prompt
  • Output: Raw response, list of mentioned brands, list of citations
  • Errors: Standardized error types for timeout, rate limit, authentication failure, etc.

In practice, each adapter constructs requests matching that engine’s API specification and converts responses into the common format. Adding a new engine requires only implementing a new adapter — the rest of the pipeline (evaluation, metric calculation, insight generation) remains unchanged.
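A minimal sketch of that common interface in Python, using `typing.Protocol`. The class and field names here are hypothetical, and the stub adapter returns a canned response where a real one would call Engine A's API and parse its format.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class EngineResponse:
    # Common response object every adapter must produce.
    raw_text: str
    brands: list = field(default_factory=list)      # mentioned brands
    citations: list = field(default_factory=list)   # cited source URLs

class EngineAdapter(Protocol):
    name: str
    async def query(self, text: str, system_prompt: str) -> EngineResponse: ...

class EngineAAdapter:
    name = "engine_a"
    async def query(self, text: str, system_prompt: str) -> EngineResponse:
        # A real adapter would build Engine A's API request here and
        # convert its response into the common format.
        return EngineResponse(raw_text=f"stub answer for: {text}")

resp = asyncio.run(EngineAAdapter().query("best project tools", "neutral assistant"))
```

Because the pipeline depends only on `EngineAdapter` and `EngineResponse`, swapping or adding engines never touches evaluation or metric code.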

graph TB
    subgraph Pipeline
        QE[Query Engine] --> RC[Response Collector]
        RC --> JE[Evaluation Engine]
        JE --> MC[Metric Calculator]
        MC --> IG[Insight Generator]
    end

    subgraph Adapter Layer
        RC --> IF{Common Interface}
        IF --> AA[Adapter A]
        IF --> AB[Adapter B]
        IF --> AC[Adapter C]
        IF -.-> AD[Adapter D — Future Extension]
    end

    AA --> EPA[Engine A API]
    AB --> EPB[Engine B API]
    AC --> EPC[Engine C API]
    AD -.-> EPD[Engine D API]

Design Principle: Engine additions and removals should occur exclusively within the adapter layer, without affecting pipeline logic.

Response Normalization

Each engine returns raw responses in different structures. Normalization is required for consistent downstream processing.

| Normalization Target | Description | Cross-Engine Variance Example |
| --- | --- | --- |
| Brand mention extraction | Detecting brand names in response text | Official name vs. abbreviation vs. mixed-language variants |
| Citation parsing | Extracting source URLs and domains | Inline markdown links vs. footnote-style numbering vs. API field |
| Mention position | Relative location of brand within response | First paragraph vs. mid-list vs. conclusion |
| Response length | Token count of raw response | Default response lengths vary by engine |
| Format | Markdown structure | Heading usage, list style, dividers |

The core principle of normalization is “preserve the raw response while extracting metadata into a unified schema.” Raw text is stored as-is, but parsed data — brand mentions, citations, positions — is stored in an engine-agnostic schema. This allows the downstream Judge engine and metric calculation logic to operate without per-engine branching.
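The "preserve raw, normalize metadata" split can be sketched like this. The schema fields and the naive substring-based extractor are illustrative assumptions, not WICHI's production parser, which would need per-engine citation parsing and more robust brand matching.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BrandMention:
    brand: str
    position: int   # order of appearance in the response text
    engine: str

@dataclass(frozen=True)
class NormalizedResponse:
    engine: str
    raw_text: str             # preserved verbatim for audit and re-parsing
    mentions: tuple           # engine-agnostic metadata
    citation_domains: tuple   # left empty here; a real parser fills this

def normalize(engine: str, raw_text: str, known_brands: list) -> NormalizedResponse:
    # Naive extraction for illustration: substring match on known brand
    # names, ordered by where each first appears in the text.
    lowered = raw_text.lower()
    found = sorted(
        (b for b in known_brands if b.lower() in lowered),
        key=lambda b: lowered.find(b.lower()),
    )
    mentions = tuple(
        BrandMention(brand=b, position=i, engine=engine)
        for i, b in enumerate(found)
    )
    return NormalizedResponse(engine, raw_text, mentions, ())

n = normalize("engine_a", "We recommend Brand X over Brand Y.",
              ["Brand Y", "Brand X", "Brand Z"])
```

Downstream logic then reads `mentions` without caring which engine produced the text.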


Response Replication and Reliability

Why Send the Same Query Multiple Times

AI model responses are stochastic. The same query sent to the same engine twice may yield different results. Especially with temperature settings above zero, different brands may be mentioned or their order may change.

Because of this stochastic nature, visibility measurement based on a single response is merely “a snapshot of that moment.” Brand A being recommended first in one response does not guarantee the same outcome in the next.

To address this, the same query is sent to each engine multiple times, and statistical results across multiple responses are used. A brand mentioned in 3 out of 3 runs has different visibility stability than one mentioned in 1 out of 3.
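A minimal sketch of that replication statistic, with made-up run data for illustration:

```python
from collections import Counter

def mention_rate(runs):
    """Fraction of replicated runs in which each brand was mentioned."""
    # set(run) counts a brand at most once per run.
    counts = Counter(brand for run in runs for brand in set(run))
    return {brand: n / len(runs) for brand, n in counts.items()}

# Three replications of the same query on one engine (illustrative data).
runs = [
    ["Brand Z", "Brand X"],
    ["Brand Z", "Brand Y"],
    ["Brand Z", "Brand X"],
]
rates = mention_rate(runs)
# Brand Z appears in 3/3 runs (stable), Brand X in 2/3, Brand Y in 1/3.
```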

| Replication Count | Advantages | Disadvantages |
| --- | --- | --- |
| 1 | Minimum cost, maximum speed | Vulnerable to stochastic variation, low confidence |
| 3 | Reasonable stability, acceptable cost | 3x request volume |
| 5+ | High statistical confidence | Cost and time increase sharply, diminishing returns |

WICHI chose 3 replications per query. While not statistically perfect, this represents a reasonable balance of cost, time, and stability. Increasing to 5 showed marginal improvement over 3, while cost and time scaled proportionally.

Total Request Volume

Queries × replications × engines = total API calls. For WICHI: approximately 40 queries × 3 replications × 3 engines ≈ 360 API calls per analysis run. Efficiently handling this volume requires parallel collection and concurrency control.


Operational Challenges

Multi-engine parallel collection has clear design intent, but ongoing operational difficulties persist.

1. Response Format Unification

Each engine returns responses in different structures. Parsing logic for detecting brand mentions and extracting context must be maintained separately for each engine.

Specific differences include:

  • Citation handling: One engine inserts [1], [2] style numbered citations inline and lists URLs at the bottom. Another uses inline markdown links. A third returns citation lists in a separate API response field.
  • List structure: One engine presents recommendations as numbered lists, another uses heading-plus-paragraph format, and a third responds with comparison tables.
  • Language handling: Engines differ in their use of Korean brand names, English brand names, or mixed representations.

When an engine changes its response format, the corresponding parser must be updated. Such changes often happen without prior notice, requiring continuous monitoring.

2. Rate Limit Management

Sending many requests in parallel increases the risk of hitting per-engine rate limits.

| Limit Type | Description | Mitigation |
| --- | --- | --- |
| RPM (Requests Per Minute) | Per-minute request cap | Concurrency limits + inter-request delay |
| TPM (Tokens Per Minute) | Per-minute token cap | Response length monitoring |
| Daily limit | Daily total request/token cap | Usage tracking + queue when limits approach |
| Burst limit | Instantaneous spike blocking | Minimum inter-request delay |

Each engine has different rate policies, and some specifics are not publicly documented, requiring empirical discovery of safe thresholds. Since a rate limit on one engine should not halt the entire collection, independent per-engine limit management is essential.

Retry strategy for 429 (Too Many Requests) responses is also critical. Immediate retries fail against still-active limits, so exponential backoff is applied — short wait for the first retry, progressively longer waits, giving the engine time to release restrictions.
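Exponential backoff can be sketched as a small wrapper. The error type, retry count, and base delay are illustrative; the jitter factor keeps concurrent workers from retrying in lockstep.

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 Too Many Requests response."""

async def with_backoff(call, max_retries=4, base_delay=0.01):
    # Wait base, 2*base, 4*base, ... between retries, plus random jitter.
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Demo: a call that hits the rate limit twice, then succeeds.
attempts = 0
async def flaky_call():
    global attempts
    attempts += 1
    if attempts < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

result = asyncio.run(with_backoff(flaky_call))
```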

3. Partial Failure Handling

When one of three engines returns a timeout or error, the question is what to do. Three options were considered during design.

Option A: Full retry. Re-collect from all three engines if any fails. Ensures data completeness but wastes cost and time on already-successful engines. Also risks infinite retries if the failing engine is persistently down.

Option B: Failed engine retry only. Retry only the failed engine, preserving successful results. Reasonable, but still needs retry limits and a final-failure strategy.

Option C: Accept partial results. Generate the report from successful engines, explicitly noting which engine’s data is missing.

WICHI combines B and C: retry the failed engine a limited number of times; if it still fails, generate the report from remaining engines with explicit missing-engine notation.

flowchart TD
    START[Start Parallel Collection] --> PA[Engine A Collection]
    START --> PB[Engine B Collection]
    START --> PC[Engine C Collection]

    PA --> |Success| SA[Store Result A]
    PB --> |Failure| RB{Retry limit exceeded?}
    PC --> |Success| SC[Store Result C]

    RB --> |No| RBR[Backoff then retry]
    RBR --> |Success| SB[Store Result B]
    RBR --> |Failure| RB
    RB --> |Yes| SKIP[Skip Engine B — Record as missing]

    SA --> MERGE[Aggregate Results]
    SB --> MERGE
    SC --> MERGE
    SKIP --> MERGE

    MERGE --> REPORT[Generate Report<br/>Note missing engines]
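The flow above, as a compact sketch combining Options B and C. The engine functions, retry counts, and backoff delays are hypothetical; the point is bounded retries per engine followed by explicit missing-engine notation.

```python
import asyncio

async def collect_with_fallback(engines, max_retries=2):
    """Run all engines in parallel; retry each failure a bounded number
    of times, then record persistently failing engines as missing."""
    async def run_one(name, fetch):
        for attempt in range(max_retries + 1):
            try:
                return name, await fetch(), None
            except Exception as exc:
                if attempt == max_retries:
                    return name, None, exc       # give up: record as missing
                await asyncio.sleep(0.01 * (2 ** attempt))  # backoff
    outcomes = await asyncio.gather(*(run_one(n, f) for n, f in engines.items()))
    return {
        "results": {n: r for n, r, e in outcomes if e is None},
        "missing": [n for n, r, e in outcomes if e is not None],
    }

async def ok():
    return "data"

async def broken():
    raise TimeoutError("engine down")

report = asyncio.run(collect_with_fallback(
    {"engine_a": ok, "engine_b": broken, "engine_c": ok}))
```

Engine B's persistent failure ends up in `missing`, while the report still carries A's and C's results.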

4. Latency Management

Even with parallel collection, total elapsed time is determined by the “slowest engine.” When response speed varies significantly across engines, fast engines sit idle while waiting for the slow one.

Strategies for managing this:

Timeout settings: Each engine gets a maximum wait time. Exceeding it treats that engine’s collection as failed and proceeds with partial results. Too-short timeouts miss legitimate but slow responses; too-long timeouts tie the entire pipeline to one slow engine.

Progress feedback: Rather than waiting for everything to finish, per-engine collection status is communicated to the user in real time. Updates like “Engine A complete, Engine B in progress, Engine C in progress” reduce perceived wait time.

5. Extensibility for Adding/Removing Engines

The AI search market is shifting rapidly. When new engines appear or existing engines’ market share changes, the collection targets need adjustment.

Each engine addition follows a cycle:

  1. API integration: Understand the engine’s API spec, set up authentication, implement request/response formats
  2. Adapter development: Write an adapter implementing the common interface
  3. Parser development: Build a parser for extracting brand mentions, citations, etc. from that engine’s responses
  4. Normalization verification: Confirm parsed output follows the same schema as existing engines
  5. Rate limit exploration: Discover the engine’s rate limits and adjust concurrency settings
  6. Integration testing: Verify the full pipeline works correctly when the new engine runs alongside existing ones

With the adapter pattern, steps 1-3 are contained within the adapter layer, and pipeline code remains untouched. Registering a new engine in the configuration automatically includes it in parallel collection.

Design Principle: The number of files that need modification when adding an engine should be minimized. Ideally: one adapter file + one config file.
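One common way to realize this, sketched here with a hypothetical decorator-based registry (not necessarily how WICHI implements it): the pipeline only ever iterates the registry, so adding Engine D means one adapter class plus one registration line.

```python
REGISTRY = {}

def register(name):
    """Class decorator: adds an adapter to the engine registry."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator

@register("engine_a")
class EngineAAdapter:
    pass

@register("engine_b")
class EngineBAdapter:
    pass

# The parallel collector includes exactly the registered engines.
active = sorted(REGISTRY)
```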


Finding the Right Number of Engines

How Many Is Enough

Intuitively, “more is better” — but engine count has a diminishing returns threshold.

| Engine Count | Advantages | Disadvantages |
| --- | --- | --- |
| 1 | Simple implementation, minimal cost | Biased results, incomplete visibility measurement |
| 2 | Minimal cross-check possible | When two engines disagree, no tiebreaker |
| 3 | Consensus/divergence judgment possible (2 vs. 1) | Moderate operational complexity |
| 4-5 | Finer pattern detection | Proportional cost and maintenance increase, diminishing new insights |
| 6+ | Statistical robustness | Sharply rising cost and complexity, declining ROI |

Three is the minimum unit for judging “consensus” and “divergence.” With only two engines, there is no way to determine which is more representative when results differ. With three, a “2 vs. 1” majority structure emerges. While majority rule is not always correct, it provides the minimum basis for identifying divergence patterns.

WICHI currently uses 3 engines. A fourth could be added as the AI search market evolves, but at this stage, deepening analysis across the existing three delivers more value than adding another engine.

Engine Selection Criteria

The criteria for choosing which engines to include:

| Criterion | Description |
| --- | --- |
| Market share | Engines with more actual users take priority |
| Source diversity | Engines referencing different source types than existing ones add more cross-check value |
| API stability | Engines with stable APIs and infrequent breaking changes |
| Response quality | Engines that include meaningful brand recommendations and citations |
| Cost efficiency | Engines whose API costs are reasonable relative to analytical value |

The ideal engine combination is one where each engine has strengths in different source types, maximizing the value of cross-checking. Two engines that reference similar sources provide less diagnostic value than two engines with complementary source-type strengths.


Generalizable Patterns

The multi-engine architecture yields patterns applicable to multi-LLM systems in general.

Pattern 1: Fan-Out / Fan-In

Send the same input to multiple LLMs simultaneously (Fan-Out) and integrate all responses after collection (Fan-In). Beyond GEO monitoring, this applies to:

  • Quality verification: Send the same question to multiple models and check answer consistency
  • Diversity: Collect responses from multiple models to the same prompt and select the best
  • Hallucination detection: If multiple models agree, the answer is more likely factual; disagreement flags verification needs

The key is the Fan-In stage: “how to integrate.” Options include simple majority vote, weighted average, or a separate Judge model evaluating responses.
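The simplest of those Fan-In options, majority vote, can be sketched in a few lines (illustrative data, not a production aggregator):

```python
from collections import Counter

def majority_vote(answers):
    """Fan-in by simple majority: return the most common answer and its
    agreement ratio, usable as a rough confidence score."""
    counts = Counter(answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)

# Three engines answering the same question (made-up responses).
answer, agreement = majority_vote(["Brand Z", "Brand Z", "Brand X"])
# Two of three engines agree, so agreement is 2/3.
```

Weighted averaging or a Judge model replaces `majority_vote` when answers are not directly comparable strings.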

Pattern 2: Graceful Degradation

A design where partial engine failures do not halt the entire system. Partial results are accepted, with missing portions transparently indicated.

Core principles:

  • One engine’s failure must not affect other engines’ collection (isolation)
  • Confidence of partial results must be stated (transparency)
  • Retry counts must be bounded (prevent infinite loops)
  • Total failure (all engines fail) requires separate handling

Pattern 3: Adapter-Based Extension

The adapter pattern hides each LLM behind a common interface. Because the LLM market changes rapidly, tight coupling to any specific model or provider means model replacement impacts the entire system.

With the adapter pattern:

  • Model replacement requires only adapter changes
  • New model addition requires no existing code changes
  • A/B testing (running two models side by side) becomes natural
  • Transitioning from single-model to multi-model can happen incrementally

Pattern 4: Async Pipeline

LLM API calls have long, unpredictable response times. Synchronous processing ties the entire pipeline to the slowest call. An asynchronous pipeline addresses this structurally.

Why async design is especially important in multi-LLM systems:

  • Response time variance across LLMs is large (depends on model, server state, input length)
  • Rate-limit-induced waits are frequent
  • Retries must not block other requests
  • Users need mid-process progress updates

Pattern 5: Response Normalization Layer

To process responses from different LLMs uniformly, a layer that separates raw text from metadata and normalizes metadata into a unified schema is essential. Without this layer, every piece of downstream logic must branch on which engine produced the response, and each new engine multiplies those branches across the codebase.

| Layer | Input | Output | Role |
| --- | --- | --- | --- |
| Adapter | Engine-specific API response | Common response object | API spec abstraction |
| Normalization | Common response object | Normalized metadata | Metadata schema unification |
| Analysis | Normalized metadata | Metrics, insights | Engine-agnostic logic |

Current Limitations and Future Work

The current architecture has unsolved problems.

Temporal response drift: The same query to the same engine may produce different results at different times, due to model updates, training data changes, and index refreshes. Currently, WICHI provides single-point-in-time snapshots. Time-series tracking is a future priority.

Engine weighting: Currently, all three engines’ results are treated equally. In reality, engines have different market shares, and weighting by share would produce more realistic visibility measurements. However, AI search market share data itself remains uncertain, so weighting is on hold.

Regional variation: Even the same engine may produce different results for Korean-language vs. English-language queries, and results may vary based on user location settings. The current system is specialized for Korean queries; multilingual support requires separate design work.


Summary

Multi-engine architecture is not simply about “collecting more data.” It is a design that leverages cross-engine response variance as the core analytical signal. This required design choices including async parallel collection, adapter-based extension, response normalization, and partial failure tolerance — each carrying implementation complexity and operational cost.

The reason for maintaining this structure despite those costs is straightforward: accurately measuring brand visibility in AI search requires multi-engine analysis as a non-negotiable prerequisite. A single engine’s results cannot reveal the full picture, and without variance, diagnosis is impossible.
