What Determines AI Citations? An Analysis of 658 Sources Across 30 Queries

15 min read

TL;DR

We tested whether structural website optimization predicts AI search citations using 485 domains across 30 intent-based queries. The result: website structure (schema markup, HTML semantics, trust signals) shows no statistically significant correlation with LLM citation frequency (Pearson r = 0.009). Domain Authority is the only measurable predictor, explaining just 2.2% of variance. Over half of citations changed between identical runs, and YouTube was the single most-cited domain — especially for product queries. Follow-up analyses tested three alternative theories: content relevance gating (62× same-topic vs cross-topic citation rate), threshold effects (none found), and structural necessity (a Score=11 site was cited). AI search appears to sample relevant sources probabilistically rather than rank pages deterministically.

Bottom Line Up Front

I ran 90 Perplexity API calls across 30 queries and extracted every citation. 658 sources, 485 unique domains. The structural factors that GEO consultants recommend optimizing — schema markup, HTML semantics, trust signals — showed zero correlation with getting cited. The only variable with any statistical weight was Domain Authority, and it explained just 2.2% of variance. Over half of citations changed between identical runs.

What I Set Out to Test

AI search engines — Perplexity, ChatGPT, Google AI Overviews — generate answers by citing external sources. Unlike traditional search with ten blue links, AI answers typically cite 3–7 sources per response. Getting cited means traffic, trust, and visibility in a channel that is replacing conventional search for many query types.

This has created a "GEO" (Generative Engine Optimization) industry — tools and consultants promising that technical optimization leads to more AI citations. The logic sounds reasonable: structure your data correctly, and AI engines will find you and cite you more often.

But almost nobody has tested these claims with real citation data. Most GEO advice is based on inference from how RAG systems work in theory, not on observation of what actually gets cited. I built an AI Search Readiness Score that measures exactly the factors GEO consultants recommend. Then I tested whether those factors predict citations.

The question was direct: does structural readiness — schema markup, content format, trust signals — actually predict whether a site gets cited by an AI search engine?

How I Collected the Data

I wrote 30 intent-based queries across three verticals: SaaS (10), E-commerce (10), and Services (10). Each query reflected a real purchase or evaluation intent — the kind where AI citations translate into business outcomes.

Example queries:

  • "Best AI-powered CRM for remote sales teams in 2026" (SaaS)
  • "Best long-distance running shoes for marathon training" (E-commerce)
  • "Top-rated immigration lawyers in Lisbon for digital nomad visas" (Services)

Each query was run 3 times through the Perplexity API using the sonar-reasoning-pro model at temperature=0. That gave me 90 total API runs. From those runs, I extracted every cited URL — 658 citations across 485 unique domains.
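Citation extraction mostly reduces to normalizing each returned URL to a bare domain so that repeat citations of the same site collapse together. A minimal sketch of that step — the `citations` list here is illustrative, not the actual API response, whose exact shape may vary by model version:

```python
from urllib.parse import urlparse

def to_domain(url: str) -> str:
    """Normalize a citation URL to a bare domain (drop scheme, path, and 'www.')."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Illustrative citation URLs from one query run (not real study data)
citations = [
    "https://www.youtube.com/watch?v=abc123",
    "https://www.g2.com/categories/crm",
    "https://youtube.com/watch?v=def456",
]

unique_domains = sorted({to_domain(u) for u in citations})
# three citations collapse to two unique domains: g2.com, youtube.com
```

This normalization is what turns 658 raw citations into 485 unique domains.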

I then scored each domain with my AI Search Readiness Score (0–100 across 4 dimensions: Machine Readability, Extractability, Trust & Entity, Offering Readiness). I also collected Moz Domain Authority and domain age as controls.

The analysis plan, scoring formula, and query list were all pre-registered before data collection. Two deviations from protocol were documented: the original LLM model (sonar-pro) was deprecated mid-study and replaced with sonar-reasoning-pro, and the final sample (441 scored domains) fell short of the target (500) due to crawler timeouts.

| Metric | Value |
|---|---|
| Queries | 30 (10 per vertical) |
| LLM runs | 90 (3 replicates × 30) |
| Citations extracted | 658 |
| Unique domains | 485 |
| Domains scored | 441 (91%) |
| LLM model | Perplexity sonar-reasoning-pro |
| Temperature | 0 |
| Controls | Moz Domain Authority, Domain Age |

What the Dataset Looks Like

The citation distribution was heavily skewed. Of 441 scored domains, roughly 60% were never cited in any of the 90 runs. A small number of domains appeared repeatedly across multiple queries. Most appeared once or not at all.

The AI Search Readiness Score across all scored domains had a mean of 50.4 (SD: 13.3), ranging from 5 to 95. The distribution was approximately normal — the scoring model differentiates well across the population, even though that differentiation turned out not to predict citations.

275 unique domains were cited at least once. The most frequently cited domains tell their own story:

| Domain | Total Citations | Queries Cited In |
|---|---|---|
| youtube.com | 33 | 9 |
| g2.com | 10 | 5 |
| clutch.co | 8 | 4 |
| forbes.com | 7 | 4 |
| techradar.com | 6 | 3 |

Domain Authority ranged from 1 to 100. Cited domains had a slightly higher mean DA (43.1) compared to uncited domains (40.6), but the difference was not statistically significant (p = 0.248). Many low-DA sites were cited while many high-DA sites were not. Domain age showed no meaningful difference between groups — the median age for both was approximately 8 years.

Finding 1: Structure Does Not Predict Citations

This was the central hypothesis. Sites with higher AI Search Readiness Scores should be cited more frequently. The data rejected it.

The Pearson correlation between Score and citation count was r = 0.009 (p = 0.849). Effectively zero.

I ran an OLS regression with Domain Authority and Domain Age as covariates. The Score coefficient was not significant (p = 0.557). The overall model R² was 0.022 — all three variables combined explain just over 2% of citation variance. Domain Authority was the only variable approaching significance, contributing nearly all of that 2.2%.

I also tested a logistic regression (cited vs. not cited, binary). The Score coefficient had p = 0.071 — marginally outside the standard threshold, and the effect direction was inconsistent with the optimization hypothesis.

I tested each of the four sub-scores independently (MR, EX, TR, OR). None were statistically significant after Bonferroni correction. I also segmented by Domain Authority tier (low, medium, high) to check whether structure matters more for low-authority domains. It did not — the correlation was near zero in every segment.

A post-hoc power analysis confirmed the study was adequately powered: with n = 441, I could detect correlations as small as |r| ≥ 0.133 at 80% power. The null result is not due to insufficient sample size.
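The sensitivity figure follows from the Fisher z-transform of r, whose sampling standard error is 1/√(n − 3). A quick check, with the standard normal quantiles for α = 0.05 (two-sided) and 80% power hard-coded:

```python
import math

def min_detectable_r(n: int, z_alpha: float = 1.95996, z_power: float = 0.84162) -> float:
    """Smallest |r| detectable: atanh(r) is ~Normal with SE = 1/sqrt(n - 3)."""
    z = (z_alpha + z_power) / math.sqrt(n - 3)
    return math.tanh(z)  # back-transform from the z scale to the r scale

r_min = min_detectable_r(441)  # ≈ 0.133, matching the reported sensitivity
```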

To be clear about what this means: my scoring model measures exactly the factors that GEO consultants recommend optimizing — schema markup, FAQ blocks, trust signals, content structure, heading hierarchy, product data completeness. If these factors drove citation selection in any meaningful way, I would see at least a weak correlation. The data shows nothing. Not a weak signal buried in noise — no signal at all.

What This Means

Structural optimization — adding schema markup, improving HTML semantics, fixing robots.txt — does not measurably increase the probability of being cited by Perplexity. These are hygiene factors: necessary for crawlability but not sufficient for citation. The factors that actually drive citation selection appear to operate at a different level — likely content relevance, domain reputation, and the LLM's retrieval index composition.

Finding 2: Citations Are Probabilistic, Not Deterministic

This was the finding that surprised me most. Despite using temperature=0 (which should produce deterministic output), the citation lists varied significantly between identical runs.

Of all domain-query citation pairs, 52.5% were unstable — they did not appear consistently across all three replicates of the same query. Only 47.5% appeared in all 3 runs. 29.3% appeared in just 1 of the 3 replicates — essentially random appearances.

The mean Jaccard similarity between citation sets from different replicates of the same query was 64.4%, ranging from 18% to 100%. Some queries produced nearly identical citation lists every time. Others changed dramatically between runs.

Why does this happen at temperature=0? Because the LLM output layer is only one component. The retrieval layer — which selects documents from the search index before the LLM sees them — introduces its own non-determinism. Search indices are dynamic, retrieval rankings shift with index updates, and different API server instances may have slightly different cached states.

The practical implication: any single citation check is unreliable. If you run one query through Perplexity and see your site cited, there is a roughly 50% chance it will not appear on the next identical run. Citation monitoring tools that report based on single snapshots are reporting noise. Meaningful measurement requires multiple samples per query over time.

Stability by the Numbers

  • 47.5% of citation pairs appeared in all 3 replicates (stable)
  • 23.2% appeared in exactly 2 of 3 replicates (partially stable)
  • 29.3% appeared in only 1 of 3 replicates (unstable)
  • Mean Jaccard similarity across replicates: 64.4%
  • Jaccard range: 18% to 100%
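Both stability metrics are plain set operations over the three replicate citation lists of a query. A sketch with illustrative domains (not study data):

```python
def jaccard(a: set, b: set) -> float:
    """Intersection-over-union of two citation sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Three replicates of one query (illustrative)
runs = [
    {"youtube.com", "g2.com", "forbes.com"},
    {"youtube.com", "g2.com", "techradar.com"},
    {"youtube.com", "g2.com"},
]

# Mean pairwise Jaccard across the 3 replicates
pair_idx = [(0, 1), (0, 2), (1, 2)]
mean_jaccard = sum(jaccard(runs[i], runs[j]) for i, j in pair_idx) / len(pair_idx)

# Stability: in how many of the 3 runs does each domain appear?
counts = {d: sum(d in run for run in runs) for run in runs for d in run}
stable   = {d for d, c in counts.items() if c == 3}  # appeared in all 3
unstable = {d for d, c in counts.items() if c == 1}  # appeared in only 1
```

For this toy query, two domains are stable, two are one-off appearances, and the mean Jaccard lands around 0.61 — close to the study-wide 64.4%.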

Finding 3: Content Format Shapes What Gets Cited

The single most-cited domain in my entire dataset was youtube.com — 33 citations across 9 different queries. YouTube was not just cited frequently; it was cited stably. In most queries where YouTube appeared, it showed up in all 3 replicates.

The distribution was not uniform across verticals. 76% of YouTube citations came from e-commerce queries — product reviews, comparison videos, "best of" roundups. SaaS queries rarely cited YouTube. Services queries cited it occasionally for how-to content.

Review and comparison platforms showed similar vertical specificity. G2.com (10 citations) and Clutch.co (8 citations) appeared almost exclusively in SaaS and Services queries. Forbes and TechRadar appeared primarily in product comparison contexts. None of these platforms have high AI Search Readiness Scores by my rubric — their citations are driven by content format and domain authority, not structural optimization.

This suggests a hypothesis worth further testing: the retrieval layer is modality-sensitive. It does not merely match text similarity — it appears to weight content format relative to query intent. E-commerce queries retrieve video reviews. SaaS queries retrieve comparison platforms. Services queries retrieve directories.

If this holds, matching the expected content format for your query type matters more than matching a structural checklist. A well-produced YouTube review may outperform a perfectly structured product page for certain query intents.

| Vertical | Top Cited Domain Type | Example |
|---|---|---|
| E-commerce | Video platforms, product review sites | YouTube, TechRadar |
| SaaS | Software review & comparison platforms | G2, Capterra |
| Services | Directories, professional networks | Clutch, LinkedIn |

What I Think This Means

These findings do not mean structural optimization is worthless. They mean it is necessary but not sufficient — and the industry has been overselling it as the primary lever.

1. GEO Is Not Schema Optimization

Structural readiness — schema markup, proper HTML, trust signals — is hygiene. It ensures AI crawlers can access and parse your content. But passing these checks does not meaningfully increase citation probability. Think of it like having a clean storefront: necessary for customers to shop, but not the reason they walk in.

2. Content Relevance Dominates

The sites that get cited consistently are the ones whose content directly matches query intent in both substance and format. A 2,000-word comparison guide answering "best X for Y" will outperform a perfectly marked-up product page describing a single product. The match is semantic, not structural. This aligns with how RAG retrieval works: embedding similarity is computed on text content, not on schema markup or HTML attributes.

3. Video Is a Fast Path for Product Visibility

YouTube's dominance in e-commerce citations was the most unexpected finding. If you sell physical products, investing in video reviews and comparisons may yield faster AI citation results than any amount of on-site structural optimization. The retrieval system appears to actively prefer video content for product evaluation queries.

4. Test Citations Repeatedly, Not Once

With 52.5% of citation pairs being unstable across identical runs, a single citation check is essentially a coin flip. Any meaningful citation strategy requires monitoring over time with multiple samples per query. If your monitoring tool reports based on single snapshots, you are making decisions on noise.

5. Domain Authority Still Matters — Marginally

DA was the only variable with any statistical weight, though it explained just 2.2% of variance. Established, well-linked domains get preferential treatment in search indices. Building real brand authority through thought leadership and genuine expertise remains relevant. But it is a long-term play, not a quick optimization. And even DA is a weak predictor — many high-DA sites were never cited, while some DA-20 niche sites appeared consistently.

Limitations

I want to be explicit about what this study does and does not tell us.

  • Single LLM, single time point. I tested only Perplexity (sonar-reasoning-pro). ChatGPT, Claude, Google AI Overviews, and Bing Copilot may weight different signals. The retrieval pipeline varies across platforms.
  • Correlation, not causation. I measured whether structural readiness correlates with citations. I cannot manipulate citations in a controlled experiment.
  • Black box retrieval. I cannot observe the internal retrieval and reranking pipeline. I see inputs (queries) and outputs (citations), but the middle layer is opaque.
  • Score model limitations. My scoring model may not capture the right structural signals. But it covers the most commonly recommended GEO factors — schema, FAQ, trust signals, content structure — so if these factors mattered, I would expect to see at least a weak signal.
  • Scoring gaps. 9% of domains (44 of 485) could not be scored due to crawler timeouts or bot blocking. These domains are not randomly distributed — they skew toward sites with unusual architectures.
  • Vertical depth. 30 queries across 3 verticals means 10 per vertical. Enough for aggregate analysis but too shallow for robust per-vertical conclusions.

Note: The following analyses are exploratory, not pre-registered. I conducted them on the same dataset to investigate alternative explanations for the null finding. The pre-registered results are reported above.

Going Deeper: Three Alternative Theories

A null result invites the question: is the relationship truly absent, or am I measuring the wrong thing? I tested three alternative theories on the same dataset.

Theory 1: Content Relevance Gates Citation Opportunity

My primary study used 30 queries as the denominator for citation rate, but each domain is relevant to only about 10 queries (its own vertical). Using all 30 creates a 3× dilution effect. A SaaS tool will never be cited for a running shoes query, regardless of its structural readiness.

I assigned each domain to its primary topic, then computed separate citation rates for same-topic and cross-topic query pairs.

| Metric | Same-Topic | Cross-Topic |
|---|---|---|
| Citation rate | 5.17% | 0.08% |

The ratio: same-topic citations are 62× more frequent than cross-topic.

Content relevance is overwhelmingly the dominant gate. Domains are cited 62 times more often for queries in their own vertical. But does removing this dilution reveal a hidden Score signal?

No. When I restrict to same-topic pairs only, the correlation between Score and within-topic citation rate is r = −0.010 (p = 0.845). Still zero. An OLS regression on 3,580 same-topic domain-query pairs with clustered standard errors confirms: the Score coefficient is non-significant (p = 0.898), while Domain Authority retains a small but significant effect (p = 0.008).

Interpretation: Content relevance determines which domains have a chance of being cited. But within that pool of relevant domains, structural readiness does not predict who gets cited.
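The dilution correction amounts to splitting domain-query pairs by topic match before computing rates. A sketch on toy records (the field layout is hypothetical; the real analysis ran over 14,550 pairs):

```python
# Each record: (domain_topic, query_topic, was_cited) — toy data
pairs = [
    ("saas", "saas", True),  ("saas", "saas", False),
    ("ecommerce", "ecommerce", True),
    ("saas", "ecommerce", False), ("saas", "services", False),
    ("ecommerce", "saas", False), ("services", "saas", True),
]

def citation_rate(records):
    return sum(cited for _, _, cited in records) / len(records)

same  = [p for p in pairs if p[0] == p[1]]   # same-topic pairs
cross = [p for p in pairs if p[0] != p[1]]   # cross-topic pairs

same_rate, cross_rate = citation_rate(same), citation_rate(cross)
ratio = same_rate / cross_rate  # the study's analogous ratio was 62x
```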

Theory 2: Is There a Score Threshold?

Perhaps the relationship is not linear but threshold-based — below some Score, citation is unlikely; above it, other factors dominate.

I divided all 441 scored domains into quintiles by Score and computed the percentage cited in each:

| Quintile | Score Range | n | % Cited | Mean Citation Rate |
|---|---|---|---|---|
| Q1 (lowest) | 5–38 | 89 | 34.8% | 0.02 |
| Q2 | 38–46 | 89 | 36.0% | 0.01 |
| Q3 | 46–53 | 93 | 40.9% | 0.01 |
| Q4 | 53–62 | 88 | 46.6% | 0.02 |
| Q5 (highest) | 62–95 | 82 | 45.1% | 0.02 |

There is a mild gradient in the binary outcome (35% to 47% cited), but the continuous citation rate is flat at 0.01–0.02 across all quintiles. The gradient in the binary outcome does not survive when confounders like Domain Authority are considered.

A LOWESS smoother confirms no non-linear pattern. A segmented regression scan across all decile boundaries found no meaningful step. The overall R² for Score alone remains 0.0001.

Interpretation: There is no Score threshold above which citation becomes meaningfully more likely. The relationship is flat.
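A boundary scan like this reduces to computing, at each candidate cutoff, the citation-rate difference above versus below it. A minimal sketch on toy data constructed to be flat, using decile boundaries as the candidate cutoffs:

```python
from statistics import quantiles

# Toy (score, cited) rows with a flat relationship by construction:
# every third score is "cited", no matter how high the score is
rows = [(s, s % 3 == 0) for s in range(5, 96)]

def rate(subset):
    return sum(c for _, c in subset) / len(subset) if subset else 0.0

cutoffs = quantiles([s for s, _ in rows], n=10)  # 9 decile boundaries

steps = [(cut, rate([r for r in rows if r[0] >= cut])
             - rate([r for r in rows if r[0] < cut])) for cut in cutoffs]

largest_step = max(abs(d) for _, d in steps)
# stays near zero: no cutoff separates cited from uncited scores
```

When a real threshold exists, `largest_step` spikes at the boundary; here, as in the study data, it does not.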

Theory 3: Is Structural Readiness a Necessary Condition?

Even if Score does not predict citation frequency, perhaps it acts as a floor: sites below some minimum simply cannot be cited. This "necessary but not sufficient" pattern would still justify structural optimization.

The data falsifies this directly. The lowest-scoring domain that was cited had a Score of 11 out of 100 (motaword.com, cited in 1 of 30 queries). Meanwhile, the highest-scoring domain that was never cited had a Score of 95.

Citation rates across score deciles range from 29% (bottom decile, Score 5–35) to 56% (decile 8, Score 57–62), with no monotonic pattern. Domains in the bottom decile still get cited at a meaningful rate. The cited and uncited score distributions overlap almost entirely.

Key Finding

A domain with Score = 11 was cited by Perplexity. A domain with Score = 95 was never cited. There is no structural floor below which citation becomes impossible.

What All Three Theories Tell Us

These analyses converge on a clear picture:

  • Content relevance is the dominant prerequisite. The 62× same-topic vs cross-topic ratio dwarfs any structural effect. Being about the right topic matters more than how your site is built.
  • Score is neither sufficient nor necessary. Low-score sites get cited; high-score sites get ignored. The structural factors I measure do not gate citation access.
  • No non-linear rescue exists. The relationship is not hidden behind a threshold or step function. It is genuinely flat.
  • The honest conclusion: Structural optimization is hygiene, not competitive advantage. It ensures your site is crawlable and parseable, but it does not determine whether an AI search engine will cite you.

What's Next

This analysis is part of a broader research effort. Here is what I am planning:

  • Multi-model replication: Running the same 30 queries through ChatGPT, Claude, and Gemini to test whether the null result is Perplexity-specific or generalizes across LLM search engines.
  • Longitudinal tracking: Repeating the study monthly to measure whether the relationship between structure and citations changes as retrieval systems evolve.
  • Public dataset: The full dataset is available for download — 485 domains, 14,550 domain-query pairs, all scores, citation counts, and stability metrics. See the data dictionary for column descriptions.

I am also exploring whether the null result is specific to Perplexity or generalizes across all AI search engines. Perplexity uses its own retrieval pipeline, which may differ significantly from Google AI Overviews (which has direct access to the Knowledge Graph and structured data integration). It is plausible that schema markup matters more for Google's AI answers — and that is worth testing directly.

For the complete statistical methodology, regression tables, and per-vertical breakdowns, read the full whitepaper.

Check Your Own Site

Structural readiness may not predict citations, but it is still the baseline. Sites that block AI crawlers or lack basic schema markup cannot be cited at all. Check whether your foundation is in place.

Frequently Asked Questions

Does website structure predict AI search citations?

No. In our pre-registered study of 485 domains across 30 queries, AI Search Readiness Score (measuring schema markup, HTML semantics, trust signals, and content structure) showed zero correlation with citation frequency (r = 0.009, p = 0.849). The study was adequately powered to detect any practically meaningful effect.

What is the strongest predictor of AI citations?

Domain Authority was the only statistically significant predictor, but it explains just 2.2% of variance. The remaining 98% is driven by unmeasured factors — likely content relevance, retrieval pipeline mechanics, and model training data.

Are AI search citations consistent or do they change?

Citations are surprisingly inconsistent. Even at temperature=0, 52.5% of domain-query citation pairs were unstable across 3 replicate runs. Only 47.5% appeared in all 3 runs. A single citation check is unreliable — you need multiple samples.

Why does YouTube appear so often in AI search results?

YouTube was the most-cited domain in our dataset (33 citations across 9 queries), with 76% of those from e-commerce queries. This suggests AI retrieval is modality-sensitive: for product comparison queries, video reviews match the format the model is looking for.

Should I still optimize my website for AI search?

Structural optimization (clean HTML, schema markup, good meta tags) is hygiene — necessary but not sufficient for citations. Focus on content relevance, format matching (video for products, comparison data for SaaS), and building real domain authority rather than expecting technical fixes alone to drive AI visibility.

Does content relevance explain the null result?

Partially. Domains are cited 62× more often for queries in their own vertical (5.17%) than for irrelevant queries (0.08%). Content relevance is clearly the primary gate. However, even within same-topic pairs, Score still shows zero correlation with citation (r = −0.010, p = 0.845). So relevance explains which domains have a chance, but Score does not predict who wins within that pool.

Is there a minimum Score needed to get cited by AI search?

No. The lowest-scoring cited domain in our dataset had a Score of 11 out of 100 (motaword.com). Meanwhile, the highest-scoring uncited domain had a Score of 95. Citation rates across score deciles range from 29% to 56% with no monotonic pattern. There is no floor below which citation becomes impossible.


Alexey Tolmachev

Senior Systems Analyst · AI Search Readiness Researcher

Senior Systems Analyst with 14 years of experience in data architecture, system integration, and technical specification design. Researches how AI search engines process structured data and select citation sources. Creator of the methodology.

Check Your AI Search Readiness

Get your free AI Search Readiness Score in under 2 minutes. See exactly what to fix so ChatGPT, Perplexity, and Google AI Overviews can find and cite your content.

Scan My Site — Free

No credit card required.
