We Tested Whether AI Search Readiness Score Predicts LLM Citations. It Doesn't.
TL;DR
A pre-registered study of 485 domains and 90 Perplexity API runs found no statistically significant association between AI Search Readiness Score and LLM citation frequency (Pearson r = 0.009, p = 0.849). Moz Domain Authority was the only predictor to survive multiple-comparison correction, and only barely (adjusted p = 0.047), explaining roughly 2% of variance. Notably, DA predicts citation intensity (how often a cited domain is cited) but not citation selection (whether a domain gets cited at all). The full model explains about 2% of citation variance, suggesting that content relevance and retrieval pipeline mechanics — not structural website characteristics — are the dominant drivers of LLM citation decisions.
I built a scoring tool that measures 26 structural characteristics of a website — Schema.org markup, robots.txt configuration, heading hierarchy, entity signals, the works. I was convinced these factors would predict how often a site gets cited in AI-generated answers. Then I tested that assumption with real data. Across 485 domains, 30 queries, and 90 Perplexity API runs, the answer came back: no statistically significant association (Pearson r = 0.009, p = 0.849). Not weak. Not marginal. Zero.
I pre-registered the study, made sure it had adequate statistical power, and ran every robustness check I could think of. The null result held through all of them. This is not what I wanted to find — I built the score, I sell audits based on it. But the data does not care about my business model, and pretending otherwise would be dishonest.
Disclosure: I am Alexey Tolmachev, the creator of AI Search Readiness Score (getaisearchscore.com). I designed and published this study despite the null result, per the pre-registration commitment. Raw data and analysis code are available upon request.
Why I Ran This Study
The "Generative Engine Optimization" (GEO) industry is growing fast. Agencies sell structural website optimization as the path to AI visibility. I was one of those people — I built a tool around that exact premise. But at some point I realized: where is the actual evidence that any of this works?
- I could not find a single public empirical study that tested whether measurable structural signals predict LLM citation probability using open methodology. Not one.
- The GEO paper by Aggarwal et al. (2023) showed that content modifications can improve citation rates, but they tested content-level interventions (adding statistics, citations, quotations) — not the structural stuff I was scoring.
- Industry claims about Schema.org, robots.txt, and E-E-A-T signals improving AI visibility were everywhere, but nobody had checked them against actual LLM behavior data.
- With Gartner projecting a 25% drop in traditional search volume by 2026 as AI search grows, the question of what actually drives AI citation felt urgent enough to warrant a proper test.
So I decided to test my own product. If the score works, I want to know. If it does not, I also want to know — and I think other people in this space deserve to know too.
Study Design: How I Tested It
I pre-registered everything before collecting any data. Hypotheses, score formula, query list, analysis plan — all frozen in advance. This matters because it is very easy to rationalize results after the fact, and I wanted to make sure I could not talk myself out of a null finding.
Data Collection
| Parameter | Value |
|---|---|
| LLM model | Perplexity sonar-reasoning-pro |
| Temperature | 0 (deterministic) |
| Queries | 30 (10 SaaS + 10 E-commerce + 10 Services) |
| Replicates per query | 3 (90 total API runs) |
| Total citations extracted | 658 |
| Search engine results | 573 (Google + Bing + DuckDuckGo via SearXNG) |
| Unique domains | 485 |
| Successfully scored | 441 (90.9%) |
| Collection date | March 12, 2026 (single 8-hour session) |
The whole pipeline — LLM queries, search engine scraping, scoring, enrichment — ran in a single 8-hour session. I built custom collectors for each data source, a scorer that ran my production formula against every domain, and an enricher that pulled Domain Authority and domain age. Not glamorous work, but necessary if you want a clean dataset.
How Citation Was Measured
A domain was marked as "cited" for a query if it appeared in 2 or more of the 3 replicate runs (majority vote). The citation rate for each domain is the fraction of the 30 queries where it was cited. Each domain was scored using the production AI Search Readiness Score formula (26 checks, 4 baskets) and enriched with Moz Domain Authority and domain age data.
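This majority-vote rule is easy to state in code. A minimal sketch, where the function name and the `runs` data layout are illustrative assumptions, not the study's actual pipeline code:

```python
from collections import defaultdict

def citation_rates(runs, n_queries=30, threshold=2):
    """Majority-vote citation measure described above.

    `runs` maps (query_id, replicate_id) -> set of cited domains
    (an assumed layout). A domain is "cited" for a query if it
    appears in at least `threshold` of the replicates; its
    citation rate is the fraction of queries where that holds.
    """
    # Count, per (query, domain), how many replicates cited it
    votes = defaultdict(int)
    for (query_id, _replicate), domains in runs.items():
        for domain in domains:
            votes[(query_id, domain)] += 1

    # Majority vote per query, then rate across all queries
    queries_cited = defaultdict(int)
    for (query_id, domain), n in votes.items():
        if n >= threshold:
            queries_cited[domain] += 1
    return {d: q / n_queries for d, q in queries_cited.items()}

# Toy example: both domains reach 2-of-3 agreement on query 0
runs = {
    (0, 0): {"example.com"},
    (0, 1): {"example.com", "other.org"},
    (0, 2): {"other.org"},
}
rates = citation_rates(runs)  # each domain cited on 1 of 30 queries
```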
Statistical Power
At n = 441 and α = 0.05, the study had 80% power to detect correlations of |r| ≥ 0.133, so any effect of practically meaningful size would very likely have been detected. The observed r = 0.009 is far below this threshold.
I want to be clear about what this means: the study was not underpowered. If there were a real relationship of any practical size, 441 domains would have been enough to find it. The effect is not hiding in noise — it is simply not there.
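The 0.133 threshold can be reproduced from first principles using the Fisher z approximation for Pearson's r. This is a stdlib sketch of the standard power calculation, not the study's own code:

```python
from math import sqrt, tanh
from statistics import NormalDist

def min_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest correlation detectable with the given power,
    via the Fisher z approximation (two-sided test)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # ≈ 1.96
    z_beta = nd.inv_cdf(power)            # ≈ 0.84
    z = (z_alpha + z_beta) / sqrt(n - 3)  # required Fisher-z effect
    return tanh(z)                        # back-transform to r

r_min = min_detectable_r(441)  # ≈ 0.133, as stated above
```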
Key Findings
Finding 1: My Score Does Not Predict Citations
Primary Result
Pearson r = 0.009, p = 0.849. OLS coefficient p = 0.557. The null hypothesis was not rejected. No statistically significant association between AI Search Readiness Score and LLM citation frequency was detected.
When I first saw these numbers, I reran the analysis. Then I checked the data pipeline for bugs. Then I ran it a third time. The result did not change. A correlation of 0.009 is statistically indistinguishable from zero — my score has no predictive relationship with whether Perplexity cites a domain.
This held across every analytical specification I tried:
- OLS regression: p = 0.557
- Logistic regression (cited vs. not): p = 0.071
- Hurdle model (cited domains only): negative coefficient, p = 0.097
- Sensitivity analysis with Google Rank: p = 0.902
- All three Domain Authority segments: non-significant
- With and without high-authority "giants" (DA > 80): non-significant
I kept looking for some angle where the score would matter. Maybe it works for low-DA sites? No. High-DA sites? No. Maybe one of the four sub-scores carries the signal? Also no. At some point you have to accept what the data is telling you.
Finding 2: Domain Authority Is the Only Significant Predictor (Barely)
Moz Domain Authority was the only variable to survive Bonferroni correction, and only barely (r = 0.129, adjusted p = 0.047). But even DA explains only about 2% of variance in citation rate. Not exactly a strong signal.
The interesting part: DA predicts citation intensity (how often a domain is cited), not citation selection (whether it gets cited at all). In the logistic regression, DA was not significant (p = 0.144). In the hurdle model, DA was highly significant only in the intensity part. I think of this as an "amplifier" effect — domains that enter the retrieval pool for other reasons get cited more frequently if they have higher DA. But DA does not determine pool entry.
Finding 3: The Model Explains Almost Nothing
The best model (OLS with AI Score + DA + Domain Age) explained only 2.2% of variance (R² = 0.022). Even the hurdle model on cited-only domains reached just 8.8% — and none of that came from the AI Score.
98% of what determines LLM citation behavior is not captured by any variable in this study.
That number stopped me. I spent months building a 26-check scoring system, and the entire model — my score plus Domain Authority plus domain age — explains 2% of the outcome. The likely dominant factors are content relevance to the specific query and the opaque mechanics of each LLM's retrieval pipeline. Neither of which I measured, and neither of which a structural audit can capture.
Full Correlation Results
Here is the complete correlation matrix. I am including it in full because I think transparency about null results matters more than looking good. Every row is a different way of asking "does this variable predict citations?" and the answer is consistently no.
| Variable | Pearson r | p (raw) | p (Bonferroni) | Significant? |
|---|---|---|---|---|
| Total AI Score | +0.009 | 0.849 | 1.000 | No |
| Machine Readability | −0.008 | 0.870 | 1.000 | No |
| Extractability | −0.013 | 0.790 | 1.000 | No |
| Trust & Entity | −0.101 | 0.034 | 0.238 | No (confounded) |
| Offering Readiness | +0.082 | 0.084 | 0.588 | No |
| Moz Domain Authority | +0.129 | 0.007 | 0.047 | Borderline |
| Domain Age | +0.026 | 0.593 | 1.000 | No |
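The Bonferroni column is just each raw p multiplied by the number of comparisons (seven here), capped at 1.0. A sketch using the raw p-values from the table; the table's 0.047 for Domain Authority versus the 0.049 below reflects rounding of the underlying raw p:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni adjustment: multiply each raw p by the number
    of comparisons, cap at 1.0, then test against alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p <= alpha for p in adjusted]

# Raw p-values from the seven rows of the table above
raw = [0.849, 0.870, 0.790, 0.034, 0.084, 0.007, 0.593]
adjusted, significant = bonferroni(raw)
# Only Domain Authority (0.007 * 7 ≈ 0.049) stays under 0.05
```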
LLM Citations Are Less Stable Than You Think
This was a side finding, but I think it is important for anyone tracking "AI visibility." Even at temperature = 0 (which should produce deterministic output), citation behavior was only moderately stable across 3 replicates of each query:
- 47.5% of citations appeared in all 3 runs (high stability)
- 23.2% appeared in 2 of 3 runs
- 29.3% appeared in only 1 run (essentially random)
Nearly a third of citations are not reproducible even under identical conditions. I suspect this comes from the retrieval (RAG) layer, not the generation layer — the LLM is deterministic, but the search index it queries probably is not. The practical implication: if you are measuring your "AI visibility" from a single query run, you are measuring signal plus a lot of noise. I learned this the hard way while building my monitoring features.
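The stability breakdown above can be computed with a short helper. The data layout is an illustrative assumption, not the study's actual code:

```python
from collections import Counter

def stability_profile(queries):
    """Share of citations appearing in 1, 2, or 3 of the
    replicate runs. `queries` is a list of per-query replicate
    lists, each replicate a set of cited domains."""
    appearance_counts = Counter()
    for replicates in queries:
        votes = Counter()
        for domains in replicates:
            votes.update(domains)
        # votes.values() holds 1, 2, or 3 per cited domain
        appearance_counts.update(votes.values())
    total = sum(appearance_counts.values())
    return {k: appearance_counts[k] / total for k in (1, 2, 3)}
```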
What This Means for Site Owners and SEO Practitioners
1. Be Skeptical of "AI SEO" Services That Promise Citation Improvements
I am saying this as someone who built one of those tools. The data does not support the claim that structural website optimization — Schema.org markup, robots.txt tuning, E-E-A-T signals, heading hierarchy — will increase your LLM citation probability. If someone is selling you a structural optimization package as a guaranteed path to AI visibility, ask them for their evidence. As of March 2026, the empirical evidence points to zero correlation.
2. Domain Authority Matters — But Not How You Think
DA was the only significant predictor, but it functions as an amplifier, not a gate. Building domain authority through backlinks and brand recognition may increase how often you are cited once you are already in the retrieval pool. But DA does not determine whether you get into the pool in the first place.
3. Content Relevance Is Likely the Dominant Factor
With 98% of citation variance unexplained, the most plausible driver is something I did not measure: how well your content actually answers the specific query posed. This is consistent with the GEO research by Aggarwal et al. that found content-level modifications (adding statistics, quotations, citations) could improve LLM citation rates by 15–41%. The structure of your site probably matters far less than whether your content directly and thoroughly answers what someone asked.
4. Structural Readiness May Still Be Necessary — Just Not Sufficient
I want to be fair to my own tool here. These results do not prove that structural optimization is useless. A site that blocks AI crawlers in robots.txt obviously cannot be cited. Structural readiness may be a prerequisite that, once met, provides no additional advantage. Think of it like having a phone number listed — necessary for people to call you, but it does not determine whether they want to. The score still finds real technical problems. It just does not predict citations.
Limitations and Caveats
I want to be thorough about what this study cannot tell you, because overstating a null result is just as bad as overstating a positive one:
- Single model: I only tested Perplexity. ChatGPT, Gemini, and Claude may behave differently. It is entirely possible that some other LLM does weight structural signals. I have no evidence either way.
- Single snapshot: All data was collected on one day (March 12, 2026). LLM behavior changes with model updates and index refreshes. This result is a snapshot, not a permanent truth.
- Content relevance not controlled: This is the biggest limitation and I know it. The most important variable — how well each site's content matches each query — was not measured. A future study needs to control for this.
- Zero-inflated data: 60% of domains had zero citations. OLS is not ideal for this distribution. I ran hurdle models as robustness checks, and the null result held, but the data shape is not pretty.
- Correlation, not causation: Even if the score were significant, that would not prove that improving your score causes more citations. This is observational data.
- 30 queries: Broad but not deep. Individual verticals had only 10 queries each. A study with 100+ queries per vertical might reveal patterns I missed.
- My own scoring formula: The null result applies to my specific 26-check formula. A different structural score with different weights might perform differently. I tested the best formula I could build, but I am one person.
The full whitepaper documents 13 specific limitations. The complete study with regression diagnostics, sub-score analysis, and segment breakdowns is available upon request. I would genuinely welcome someone replicating this with different models or more queries.
What I Think Should Come Next
This study answered one question and raised several more. Here is what I think needs to happen — whether I do it or someone else does:
- Multi-model replication: Run the same 30 queries against ChatGPT, Gemini, and Claude to see if structural preferences differ by model. If one model cares about Schema.org and another does not, that is useful information.
- Content relevance as a control variable: Add BM25 or semantic similarity scoring between domain content and each query to isolate the structural signal. This is the study I wish I had done first.
- Longitudinal tracking: Repeat the study monthly to see if the relationship changes as LLM retrieval pipelines evolve. Maybe structural signals will start mattering as retrieval becomes more sophisticated.
- Intervention study: Take a set of domains, improve their structural scores, and measure whether citation rates change. This is the causal test that my observational study cannot provide.
- Retrieval pool analysis: Investigate what determines whether a domain enters the LLM's retrieval candidate pool — the selection mechanism that neither my score nor DA could explain.
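A relevance control like the one proposed above could start from something as simple as BM25 between each query and each domain's page text. A toy sketch with whitespace tokenization; a real study would use a proper tokenizer or an off-the-shelf implementation such as rank_bm25:

```python
from collections import Counter
from math import log

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 scores of each document against a query,
    with the +1-inside-log idf variant to avoid negative idf.
    Intended only to show how a relevance covariate could be
    built, not as a production retrieval implementation."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency in this doc
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores
```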
What I Learned From Disproving My Own Product
I had a hypothesis: that measurable structural characteristics would predict LLM citation. I spent months building a 26-check scoring system around that hypothesis. Then the data said no — at least for Perplexity in March 2026.
I do not regret building the score. Structural readiness audits identify real technical problems: blocked crawlers, missing schema, JavaScript-only rendering, absent entity data. These are worth fixing regardless of whether they directly cause citations. But I need to be honest about what the score can and cannot claim. It is a diagnostic tool for structural website quality. It is not a predictor of LLM citation rates.
If the data changes in future studies — perhaps with different models or after retrieval pipelines evolve — I will update these conclusions. But I will not pretend the current evidence says something it does not.
The GEO industry needs more studies like this: pre-registered, open-methodology, willing to publish null results. Too much of the current discourse is based on anecdotes and vendor claims. If we want to understand what actually drives AI citation, we need empirical evidence — including the evidence that is inconvenient for the people running the studies.
— Alexey Tolmachev, Senior Systems Analyst
Methodology Summary
- Design: Pre-registered observational cross-sectional study
- Sample: 485 unique domains (441 scored) from LLM citations + search results
- Queries: 30 intent-based queries across SaaS, E-commerce, and Services
- LLM: Perplexity sonar-reasoning-pro, temperature=0, 3 replicates × 30 queries
- Score: 26-check AI Search Readiness Score (0–100) across 4 baskets
- Controls: Moz Domain Authority, domain age
- Analysis: Bivariate correlations (Bonferroni-corrected), OLS, logistic regression, sub-score decomposition, hurdle model, sensitivity analysis with Google Rank
- Statistical software: Python 3.14, scipy 1.15, statsmodels 0.14
- Data: Available upon request (dataset_domain.csv, 485 rows)
Frequently Asked Questions
Does AI Search Readiness Score predict LLM citations?
No. In a pre-registered study of 485 domains across 30 queries and 90 Perplexity API runs, AI Search Readiness Score showed no statistically significant correlation with citation frequency (Pearson r = 0.009, p = 0.849). This null result held across every analytical specification tested: OLS regression, logistic regression, hurdle model, and sensitivity analysis with Google Rank.
What does predict whether a website gets cited by AI search engines?
Moz Domain Authority was the only statistically significant predictor, but it explains only about 2% of citation variance and works as an "amplifier" (increasing citation frequency for already-cited domains) rather than a "gate" (determining whether a domain gets cited at all). The remaining 98% of citation behavior is driven by unmeasured factors — most likely content relevance to the specific query and the LLM's retrieval pipeline mechanics.
Are LLM citations consistent across repeated queries?
Only moderately. Even at temperature=0 (deterministic output), only 47.5% of citations appeared consistently across all 3 replicates of each query. 29.3% appeared in just 1 of 3 runs, making them essentially random. This instability likely comes from the retrieval (RAG) layer rather than the generation layer.
Does this mean GEO (Generative Engine Optimization) is useless?
Not necessarily. This study tested structural characteristics (schema markup, crawlability, entity signals) and found no correlation with citations. But content-level interventions (adding statistics, quotations, citations to your text) have been shown to improve citation rates by 15–41% in the GEO paper by Aggarwal et al. (2023). Structural readiness may be a necessary prerequisite that provides no additional advantage once met.
How was this study designed to avoid bias?
The study was pre-registered before data collection: hypotheses, score formula, query list, and analysis plan were frozen in advance. The author disclosed a conflict of interest (being the creator of the score tested). The study committed to publishing regardless of the result. All data and code are available upon request.
Alexey Tolmachev
Senior Systems Analyst · AI Search Readiness Researcher
Senior Systems Analyst with 14 years of experience in data architecture, system integration, and technical specification design. Researches how AI search engines process structured data and select citation sources. Creator of the AI Search Readiness Score methodology.
Check Your AI Search Readiness
Get your free AI Search Readiness Score in under 2 minutes. See exactly what to fix so ChatGPT, Perplexity, and Google AI Overviews can find and cite your content.
Scan My Site — Free. No credit card required.
Related Articles
We Audited 98 Websites for AI Search Readiness. Here's What We Found.
Original data from 98 AI search readiness audits: average score 52.8/100, 91% fail on review markup, only 18.1% citation rate. The first public dataset on AI search readiness.
What Is an AI Search Readiness Score? How It Works and Why It Matters
An AI Search Readiness Score is a diagnostic metric (0–100) that measures whether a website is prepared for citation by AI search engines. Covers the 4-dimension framework, original data from 100 audits, signal correlations, common failures, and the citation funnel.
How to Improve Your Citation Rate in AI Search Engines
Data-driven guide to improving your citation rate in AI search. 10-step action plan with before/after metrics and citation tracking methods.
Content Relevance Predicts AI Citations — Not SEO Score
Empirical study: content relevance (BM25 + embeddings) predicts AI citations with AUC 0.915. Our 26-check AI Readiness Score adds nothing (p=0.14). 438 domains, 30 queries, 13,140 pairs.
