How LLMs Actually Parse Your Content: Chunking, Readability, and Citations

18 min read

TL;DR

AI search engines process your content through a 7-step RAG pipeline: crawl, parse, chunk, embed, retrieve, re-rank, generate. Most "LLM SEO" advice targets steps 1-2 (access), but citations are determined at steps 5-7 (retrieval and generation). My study of 441 domains found zero correlation (r=0.009) between structural readiness scores and actual citations - but content relevance showed a 62x difference. Structure is table stakes. Relevance wins.

I built an AI Search Readiness scoring tool with 26 checks. Crawlability, schema markup, heading hierarchy, FAQ blocks, trust signals - the whole list. Then I ran an empirical study across 441 domains and 14,550 domain-query pairs. The correlation between my readiness score and actual AI citations was r=0.009. Effectively zero.

That result forced a question I couldn't ignore: if 26 structural checks don't predict citations, what does? The answer turned out to be content relevance - same-topic pages got cited at 5.17% versus 0.08% for cross-topic pages, a 62x difference. But that only tells you what matters. It doesn't explain why. To understand why, you need to look at how LLMs actually process your content - the mechanics of the pipeline that sits between your published page and an AI's citation.

I read Olaf Kopp's article on LLM Readability and agreed with much of the framework - the emphasis on natural language quality, information hierarchy, and context management makes sense. But I noticed something: no data. The framework is built on patent research and reasoning, not empirical measurement. So I decided to write the version with evidence. Where I have data, I'll show it. Where I don't, I'll say so.

What Actually Happens When an AI Reads Your Page

When someone asks ChatGPT, Perplexity, or Google AI Overviews a question that triggers web search, your content goes through a pipeline called Retrieval-Augmented Generation (RAG). Understanding each step reveals where content gets lost - and where optimization actually matters.

Step 1: Crawl and Fetch

AI bots (GPTBot, PerplexityBot, ClaudeBot, Google-Extended) fetch your HTML just like traditional search crawlers. This is binary - either they can access your page or they can't. If your robots.txt blocks these user agents, the pipeline ends here. No fetch, no citation, no exceptions.

What to check: Does your robots.txt allow GPTBot and PerplexityBot? Many CDN or CMS defaults still block them. This is the one structural check that is genuinely binary and genuinely matters.
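Here's a minimal check you can run locally using only Python's standard library. The robots.txt content, bot list, and URL are illustrative - point the function at your own site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt - substitute your own site's file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: PerplexityBot
Disallow:

User-agent: *
Disallow: /admin/
"""

AI_BOTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]

def check_ai_access(robots_txt: str, url: str = "https://example.com/blog/post") -> dict:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() applies the most specific matching User-agent group and
    # falls back to the '*' group when no named group matches.
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}

# Every bot should come back True for this URL under the rules above.
print(check_ai_access(ROBOTS_TXT))
```

If any bot comes back False for your real file, the pipeline ends at Step 1 for that platform.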

Step 2: Parse HTML and Extract Text

The bot receives your raw HTML and needs to extract readable text from it. This is where JavaScript-heavy sites fail. Most AI crawlers either don't execute JavaScript at all or execute it with significant timeouts and limitations. If your content loads via client-side API calls after page render, the bot sees an empty page. I've written about this in detail in my article on AI crawlers and JavaScript.

What to check: Disable JavaScript in your browser and reload. If your main content disappears, AI bots likely can't see it either.

Step 3: Chunk the Content

Here is where things get interesting - and where most "LLM SEO" advice falls apart. The extracted text is split into chunks, typically 400-512 tokens each (roughly 300-400 words, about 2-3 paragraphs). These chunks become the atomic units that the system works with from this point forward. I'll cover chunking mechanics in detail in the next section.

Step 4: Embed Chunks into Vectors

Each chunk is converted into a numerical vector - a point in high-dimensional space where semantically similar content clusters together. The quality of this embedding determines how well the system can match your content to user queries. This step is largely outside your control. But what you can control is whether the chunk that gets embedded contains a clear, self-contained answer. A chunk that mixes three different sub-topics will produce a muddled embedding that matches nothing well.
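To see the dilution effect, here's a deliberately crude sketch. I'm substituting a bag-of-words count vector for a real neural embedding (an obvious simplification), but the geometry carries over: off-topic material inflates the chunk's vector without adding anything the query can match.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a neural embedding: a bag-of-words count vector.
    # Real systems use learned dense vectors, but the dilution effect
    # shown below carries over.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query   = "how does chunking work"
focused = "chunking splits text into units chunking boundaries matter"
mixed   = "chunking splits text into units pricing plans vary and hiring news follows"

# The mixed chunk contains the same on-topic words, but the off-topic
# material inflates its norm and drags the similarity down.
assert cosine(embed(query), embed(focused)) > cosine(embed(query), embed(mixed))
```

A chunk that stays on one topic concentrates its vector where the query can find it; a chunk that wanders splits its signal across directions that no single query points at.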

Step 5: Retrieve Top-K Chunks

When a user asks a question, the system converts the query into the same vector space and retrieves the most similar chunks - usually the top 5-20 across all indexed content. This is the critical filter. Your content competes against every other indexed page in the system's corpus. The chunk needs to be close enough to the query embedding to make the cut. Content relevance is what determines this - not schema markup, not heading hierarchy, not FAQ blocks. Pure semantic similarity between the query and your chunk.

This is exactly what my data shows. The 62x difference between same-topic and cross-topic citation rates is the retrieval step in action. If your content isn't about what the user is asking, your chunks never get retrieved, and nothing downstream matters.
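The mechanic is simple enough to sketch in a few lines. The 3-dimensional vectors and page names below are made up (production embeddings have hundreds or thousands of dimensions), but the operation - rank every chunk by cosine similarity to the query, keep the top k - is exactly this:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Made-up 3-d embeddings keyed by page; real ones have hundreds of dims.
corpus = {
    "chunking-guide": [0.9, 0.1, 0.0],
    "pricing-page":   [0.0, 0.2, 0.9],
    "faq-chunks":     [0.8, 0.3, 0.1],
    "team-bio":       [0.1, 0.9, 0.2],
}
query_vec = [1.0, 0.2, 0.0]  # imagine: "how does chunking work"

def retrieve_top_k(query, corpus, k=2):
    # Rank every chunk by similarity to the query, keep the best k.
    ranked = sorted(corpus, key=lambda doc: cosine(query, corpus[doc]), reverse=True)
    return ranked[:k]

print(retrieve_top_k(query_vec, corpus))  # → ['chunking-guide', 'faq-chunks']
```

Note what's absent from this step: no schema check, no heading check, no trust signal. The only input is vector geometry.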

Step 6: Re-rank Retrieved Chunks

After initial retrieval, most production systems apply a re-ranker - a more expensive model that looks at the query-chunk pair more carefully and reorders the results. This is where factors beyond pure semantic similarity start to matter: source authority, content freshness, specificity of the answer, and presence of verifiable claims. But re-ranking only operates on the chunks that made it through retrieval. It doesn't rescue irrelevant content.

One factor that likely plays a role at this stage is information gain - a concept Olaf Kopp has analyzed in depth based on Google patents. The idea: a document scores higher if it contains information that isn't present in other documents the system has already processed. In a RAG context, this means the re-ranker may prefer chunks that add something new to the response - not just chunks that repeat what other sources already said. This aligns with the triangulation behavior observed in Profound's 700K conversation analysis: LLMs cite multiple distinct sources per answer, each contributing a different facet. If your content says the same thing as ten other sites, it has low information gain - even if it's perfectly relevant.

I should be honest: I haven't tested information gain empirically in a RAG context. The concept comes from Google Search patents, and how exactly it transfers to ChatGPT or Perplexity's retrieval is an open question. But the logic is sound - if you want to be cited, say something the other sources don't.
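To make the idea concrete, here's a hypothetical novelty score - my own sketch, not a signal any platform documents. It scores a candidate chunk by the share of its terms that the already-cited sources don't cover:

```python
def novelty(candidate: str, already_cited: list[str]) -> float:
    # Hypothetical score: fraction of the candidate's terms that no
    # already-cited source covers. Not a documented production signal.
    cand = set(candidate.lower().split())
    seen = {w for doc in already_cited for w in doc.lower().split()}
    return len(cand - seen) / len(cand) if cand else 0.0

cited  = ["chunks are 400 to 512 tokens", "chunking splits on headings"]
repeat = "chunks are 400 to 512 tokens long"
fresh  = "overlap between chunks preserves boundary context"

print(round(novelty(repeat, cited), 2), round(novelty(fresh, cited), 2))  # → 0.14 0.83
```

A real re-ranker would work in embedding space rather than on word overlap, but the intuition is the same: the chunk that restates what's already in the context adds almost nothing; the chunk that covers a new facet scores high.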

Step 7: Generate Response and Select Citations

The LLM receives the top-ranked chunks as context and generates a response. It then selects which sources to cite based on which chunks it actually used to form the answer. This is where the "Lost in the Middle" problem (which I'll cover in Section 3) becomes relevant - the position of information within the context window affects whether the LLM pays attention to it.

The key insight: Most "LLM SEO" advice targets Steps 1-2 (access and parsing) while the actual citation decision happens at Steps 5-7 (retrieval, re-ranking, generation). Access is necessary but not sufficient. What determines citations is whether your chunk is the best semantic match for the query and whether it provides a clear, quotable answer.

How Chunking Actually Works (The Technical Reality)

Chunking is the process that turns your 3,000-word article into discrete units that the RAG system indexes and retrieves. Understanding how it works reveals why content structure matters in ways that most advice gets wrong.

Recursive Character Splitting: The Default

Most production RAG systems - including what we can infer from Perplexity and ChatGPT's browsing mode - use some variant of recursive character splitting. This is not sophisticated. The algorithm tries to split on paragraph breaks first, then sentence breaks, then word breaks, until each chunk is under the target size (typically 400-512 tokens). There is usually a 10-20% overlap between consecutive chunks so that information at chunk boundaries isn't lost.

Heading boundaries (H2, H3 tags) serve as natural split points. When the chunker encounters a heading, it will almost always start a new chunk there. This means your H2 sections are, in practice, likely to become individual chunks - or at least chunk boundaries.
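A minimal version of the algorithm fits in one function. This is a sketch of the recursive-splitting family (the 10-20% overlap between consecutive chunks is omitted for brevity), with character counts standing in for tokens:

```python
def recursive_split(text: str, max_len: int = 200, seps=("\n\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    paragraph breaks, then sentence breaks, then word breaks."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # Hard fallback: cut at a fixed width, even mid-word.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # A single piece is still too big: recurse with finer separators.
                chunks.extend(recursive_split(part, max_len, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks
```

Notice what the algorithm never does: it never asks what the text means. It only looks for the coarsest separator that keeps pieces under the size limit, which is why paragraph and heading boundaries end up doing the semantic work.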

The Chunk Size Sweet Spot

NVIDIA's 2024 RAG benchmarks tested different chunk sizes and found that page-level chunking (treating entire pages as single chunks) achieved 0.648 accuracy - worse than smaller, more focused chunks. The sweet spot in most benchmarks falls around 400-512 tokens. Smaller chunks are more precise but lose context. Larger chunks preserve context but introduce noise that dilutes the semantic match to any specific query.

What does this mean practically? Each H2 section of your content should be approximately 300-400 words (roughly 400-512 tokens) and should contain one complete, self-contained idea. If your critical answer spans two H2 sections, chunking may split it across two separate chunks - and neither chunk alone will be a strong match for the query.

Practical rule: Write each H2 section as if it might be the only thing the AI sees. Because, after chunking, it might be. Include the question, the answer, and enough context to be quotable - all within one section.
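If you draft in Markdown, you can audit this mechanically. The sketch below splits a draft on H2 headings and estimates tokens with the rough rule of thumb of ~1.33 tokens per English word - both the heuristic and the 400-512 band are approximations, not guarantees about any specific tokenizer:

```python
import re

def audit_sections(markdown: str, lo=400, hi=512):
    """Flag H2 sections whose estimated token count falls outside [lo, hi].
    Uses the rough heuristic of ~1.33 tokens per English word."""
    sections = re.split(r"^## ", markdown, flags=re.M)[1:]
    report = []
    for sec in sections:
        heading, _, body = sec.partition("\n")
        tokens = int(len(body.split()) * 1.33)
        status = "ok" if lo <= tokens <= hi else ("too short" if tokens < lo else "too long")
        report.append((heading.strip(), tokens, status))
    return report

draft = "## How Does Chunking Work?\n" + ("answer " * 320) + "\n## Contact\nEmail us."
for heading, tokens, status in audit_sections(draft):
    print(heading, tokens, status)
```

Sections flagged "too short" are candidates for merging into a neighbor; "too long" sections are candidates for a split under a new H2.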

The Mike King Debate: Does Semantic Splitting Make Structure Irrelevant?

Mike King from iPullRank published an article arguing that semantic chunking makes HTML structure less important for RAG systems. His argument: if the chunker uses semantic understanding to split content, then headings and paragraph structure are just hints, not hard boundaries.

He is partially right. Advanced chunking methods exist - agentic chunking, semantic splitting, document-aware parsing. But the key question is what production systems at scale actually use. Running semantic splitting on billions of pages is computationally expensive. Based on public documentation and inference from system behavior, most large-scale RAG deployments still use some form of recursive splitting with heading awareness. The economics of processing the entire web push systems toward simpler, faster methods.

Even if semantic splitting becomes universal tomorrow, well-structured content with clear heading hierarchy and self-contained sections will still produce better chunks than a wall of text. Good structure is robust to any chunking method. Poor structure is only salvageable by the most sophisticated ones.

The "Lost in the Middle" Problem

This is the insight I haven't seen any competitor or "LLM SEO" advisor connect to content strategy. It comes from a 2023 Stanford/Berkeley paper by Liu et al. - "Lost in the Middle" - and it fundamentally changes how you should think about where you place key information.

The U-Shaped Attention Curve

Liu et al. tested how well LLMs use information placed at different positions within their context window. The finding: LLMs have a U-shaped attention curve. Information at the beginning and end of the context gets high attention. Information in the middle is dramatically deprioritized.

When a model like GPT-4 receives 20 retrieved documents, it disproportionately attends to the first few and the last few. The documents in positions 7-15 might as well not be there for many query types. In Liu et al.'s experiments, moving the relevant document from the edges of the context to the middle degraded answer accuracy sharply - in some settings to below what the model scored with no retrieved documents at all.

What This Means for Your Content

The implication operates at two levels. First, at the retrieval level: your chunk's rank among retrieved documents affects how much attention it receives. Higher-ranked chunks (better semantic match to the query) get more attention. This reinforces the content relevance argument - being the top semantic match matters enormously.

Second, at the within-chunk level: even after your chunk is retrieved and fed to the LLM, information at the beginning and end of the chunk gets more attention than information buried in the middle. This means front-loading your key claim - the BLUF (Bottom Line Up Front) principle - isn't just a writing style preference. It's aligned with how the model's attention architecture actually works.

The inverted pyramid works for AI not because of tradition, but because of architectural attention bias. The first sentence of each section is the most likely to be attended to, processed, and cited. Put your quotable claim there.

Mitigation Exists, But Isn't Universal

It's worth noting that researchers are working on this problem. Zhang et al. published "Found in the Middle" at NeurIPS 2024, proposing a technique called Ms-PoE (Multi-scale Positional Encoding) that reports an average accuracy gain of up to 3.8 points on middle-position information. Some production systems may already incorporate similar mitigation. But you can't know which systems have implemented it and which haven't. Writing with BLUF is a robust strategy regardless - it works with or without positional bias mitigation.

I think of it this way: optimizing for the "Lost in the Middle" problem is free. Front-loading your answer costs you nothing and helps you in every scenario. It hurts you in none.
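For completeness, here's what mitigation looks like from the RAG builder's side - a reordering pass in the spirit of LangChain's long-context reorder transform, which pushes the weakest documents into the middle positions the model deprioritizes. Content authors can't control this step; I include it because it shows why the edge positions are the ones worth winning:

```python
def reorder_for_attention(docs_best_first):
    """Interleave ranked documents so the strongest land at the edges of
    the context and the weakest end up in the middle - the positions the
    U-shaped attention curve deprioritizes."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(reorder_for_attention(["d1", "d2", "d3", "d4", "d5"]))
# → ['d1', 'd3', 'd5', 'd4', 'd2']: best doc first, runner-up last, weakest in the middle.
```

Whether a given production system does this is unknowable from the outside - which is exactly why being the top-ranked chunk, rather than the seventh, is the robust strategy.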

What My Data Shows About Content vs Structure

Let me be specific about what I measured, because precision matters when you're making claims about data. I ran my AI Search Readiness audit across 441 domains with 14,550 domain-query pairs. For each domain, I calculated a readiness score based on 26 structural and content checks. Then I measured actual citation rates using the Perplexity API across 30 carefully designed queries.

The Null Result

The Pearson correlation between readiness score and citation rate: r=0.009, p=0.849. This is as close to zero as you can get in empirical research. My 26 checks - measuring schema markup, heading hierarchy, FAQ presence, meta descriptions, trust signals, and more - collectively predicted nothing about whether a site would actually be cited by an AI search engine.

I tested multiple alternative hypotheses. Maybe there's a threshold effect - you need a minimum score for citations? No. Maybe the score is a necessary condition but not sufficient? No. Maybe it only matters within the same topic? Still no (within-topic r=-0.010). Every angle I tried produced the same answer: structural readiness scores don't predict citations.

The Content Relevance Signal

The one variable that did predict citations was content relevance. When I categorized citation checks by whether the query topic matched the domain's content, the numbers were stark: 5.17% citation rate for same-topic versus 0.08% for cross-topic - a 62x difference.

This aligns perfectly with the RAG pipeline mechanics. Content relevance determines whether your chunks pass the retrieval filter (Step 5). If your chunk isn't semantically close to the query, nothing else saves it. Not schema markup. Not FAQ blocks. Not heading hierarchy.

Structure as Table Stakes

This doesn't mean structure is irrelevant. It means structure is necessary but not sufficient. You need crawlable, parseable content for the pipeline to even start (Steps 1-2). You need decent heading structure for chunking to produce coherent units (Step 3). You need clear language for embeddings to be accurate (Step 4). But all of that is table stakes. It gets you into the game. Content relevance is what wins it.

The Search Atlas study reached a similar conclusion from a different angle: schema markup alone doesn't influence citation rates. Their data and mine converge on the same point - structural optimization is necessary infrastructure, not a competitive advantage.

Olaf Kopp's LLM Readability Framework - What's Right and What's Missing

Kopp's framework identifies several factors that influence whether LLMs can effectively process and cite content: natural language quality, information structuring, hierarchy, context management, and consistency. I think most of this is directionally correct. Let me be specific about where I agree and where I see gaps.

What Kopp Gets Right

The emphasis on natural language quality aligns with embedding mechanics. Clear, well-written text produces better embeddings, which leads to better retrieval. Ambiguous or convoluted language creates muddled vectors that don't match cleanly to any query. This is real.

Information hierarchy matters because it influences chunking. As I described above, heading boundaries become chunk boundaries. A logical H2/H3 structure means the chunker produces coherent, self-contained units. This is real too.

Context management - maintaining semantic consistency within a section - helps the embedding model produce a focused vector for each chunk. A section that covers one topic cleanly will have a stronger semantic signal than one that jumps between ideas. Also real.

What's Missing

No empirical validation. The framework is built on patent analysis and logical reasoning. These are valid inputs, but patents describe what companies might implement, not what they actually deploy at scale. I've seen enough gap between patent filings and production systems to treat patents as hypotheses, not evidence.

Chunk retrieval mechanics are underweighted. The framework focuses heavily on making content "readable" by LLMs. But the first hurdle isn't readability - it's retrieval. Your content needs to be found before it can be read. The 62x content relevance gap in my data suggests that retrieval is the dominant filter, not readability.

The "Lost in the Middle" problem isn't addressed. Kopp's framework doesn't account for positional attention bias within the context window. Even perfectly readable content can be deprioritized if it lands in the wrong position among retrieved documents.

Content relevance is underweighted relative to structure. The framework gives roughly equal weight to structural readability and topical authority. My data suggests the weighting should be dramatically skewed toward relevance. Getting your structure from a D to a B matters far less than being the best answer to the query.

Platform-Specific Citation Mechanics

Not all AI search platforms work the same way. Understanding the differences helps you prioritize. Here is what we know from available data:

| Platform | Retrieval Method | Source Selection Factors | What Matters Most |
| --- | --- | --- | --- |
| ChatGPT | Bing browsing mode (31% of queries trigger web search; 59% for local intent) | Domain authority ~40%, content quality ~35%, platform trust ~25% | Being in Bing index, domain authority |
| Perplexity | Always-RAG, proprietary index | Content relevance, freshness, answer quality | Topic match, answer specificity |
| Google AI Overviews | Google Search index + ranking signals | Traditional ranking + content extractability | Already ranking well + structured format |
| Claude | Training data only (no web browsing) | Training corpus selection | Being in the training data |

Some data points worth noting from Profound's analysis of 700,000 ChatGPT conversations: ChatGPT dominates AI referral traffic at 84.2% share. The top 10 most-cited domains capture only 12% of all citations. That last number is important - it means 88% of citations go to the long tail. You don't need to be Wikipedia to get cited. You need to be the most relevant answer to specific queries in your niche.

Also notable: 31% of ChatGPT queries trigger web search. The rest rely on the model's training data. For Perplexity, every query triggers retrieval. This means Perplexity is actually the more optimizable platform - every query gives your content a chance to be retrieved, while ChatGPT only searches the web for a third of conversations.

What to Actually Do: A Pipeline-Grounded Checklist

Based on the pipeline mechanics, here is what I'd prioritize. I've ordered these by which pipeline step they address, which roughly corresponds to impact order:

1. Allow AI Crawlers to Access Your Content (Step 1)

Check your robots.txt for GPTBot, PerplexityBot, and ClaudeBot. This is binary and non-negotiable. If they can't fetch your pages, nothing else matters. Many WordPress security plugins and enterprise firewalls block these user agents by default.

2. Ensure Content Renders Without JavaScript (Step 2)

Most AI crawlers don't execute JavaScript reliably. If your main content is loaded via client-side API calls, AI bots see an empty page. Use server-side rendering or static generation. This is table stakes.

3. Write Each H2 Section as a Self-Contained Answer Unit (Step 3)

Target 300-400 words per H2 section. Each section should contain one complete thought that stands on its own after chunking. Include the question, the answer, and enough context to be quotable. If your key claim requires reading two sections to understand, chunking may split it across chunks and neither will be cited.

4. Front-Load the Answer in Each Section (Steps 5-7)

The BLUF principle - put your quotable claim in the first sentence of each section. This addresses both the embedding quality (the beginning of a chunk heavily influences its vector representation) and the "Lost in the Middle" attention bias. The first sentence is the most likely to be attended to, processed, and cited.

5. Use Questions as H2 Subheadings (Step 5)

Question-format headings improve query-to-chunk matching during retrieval. When someone asks Perplexity "how does chunking work for LLM search?" and your H2 is "How Does Chunking Work for LLM Search?", the semantic similarity between the query and your chunk gets a boost from the heading being included in the chunk text.

6. Include Specific Data Points with Sources (Steps 6-7)

LLMs prefer verifiable claims when generating cited responses. "Our study of 441 domains found r=0.009 correlation" is more citable than "structure doesn't matter much." Named studies, specific numbers, and attributed claims give the LLM confidence that citing your source adds credibility to its response.

7. Add FAQ Blocks with Real Questions (Steps 3, 5)

A 2025 study by Relixir found that pages with FAQ content had a 41% citation rate compared to 15% for pages without FAQ blocks. This makes sense through the pipeline lens: FAQ question-answer pairs are naturally self-contained chunks that map directly to user queries. Each Q&A is essentially a pre-chunked, query-aligned content unit.

8. Go Deep on Your Topic (Step 5)

Content relevance is 62x more impactful than structural optimization based on my data. The single best thing you can do for AI citations is be the most comprehensive, authoritative answer within your specific topic. Depth within your niche beats breadth across topics. A 5,000-word guide that covers one topic thoroughly will outperform five 1,000-word posts that each cover a topic superficially.

Caveat on the Relixir FAQ data: I haven't independently verified their methodology or sample. The 41% vs 15% numbers are widely cited in the LLM SEO community but should be treated as indicative, not definitive. My own data doesn't isolate FAQ presence as a separate variable - it's one of 26 checks in the composite score.

A Systems Analyst's Perspective

I spent 14 years as a systems analyst before building this tool. Systems thinking is my default mode. And when I look at the "LLM SEO" industry through that lens, what I see is a field that has skipped the most basic analytical step: measuring whether its recommended interventions actually produce the claimed outcomes.

I built my scoring tool thinking structure equals citations. I invested months in 26 checks across four dimensions. The data proved me wrong. That's not a failure - that's science working as intended. The failure would have been ignoring the data and continuing to sell structural audits as citation improvement tools.

Here is how I think about it now: structure is like having a clean, well-lit shop window. It's necessary. Customers can't buy from you if they can't see your products. But the shop window doesn't determine whether someone walks in. That depends on whether you have what they're looking for.

The Profound data reinforces this: the top 10 most-cited domains capture only 12% of all ChatGPT citations. The long tail - smaller, focused sites - gets 88%. This means you don't need massive domain authority or perfect technical optimization. You need to be the best answer to specific questions in your specific domain.

Most of what gets published about "LLM SEO" is unfalsifiable advice. "Write better content." "Use structured data." "Be authoritative." These statements can't be wrong because they can't be tested as stated. My contribution is narrower but more honest: here is what I measured, here is what I found, here is what the RAG pipeline mechanics suggest should work, and here is what I still don't know.

What I don't know is still a lot. I don't know the exact chunking method Perplexity uses. I don't know how ChatGPT's browsing mode re-ranks results. I don't know whether Google AI Overviews uses the same retrieval pipeline as standard search. Nobody outside these companies does. Anyone who claims otherwise is speculating.

What I do know is this: if you want to be cited by AI, focus on being the most relevant, specific, well-evidenced answer to the questions your audience is asking. Make sure the pipeline can access and parse your content. Structure it so chunking produces coherent units. Front-load your answers. Then stop tweaking structural knobs and invest in depth and relevance. That is what the data supports.

Frequently Asked Questions

What is chunking and why does it matter for AI citations?

Chunking is the process of splitting your page content into smaller pieces (typically 400-512 tokens) that get individually indexed and retrieved by AI search systems. Each chunk becomes an atomic unit - the system retrieves and evaluates chunks, not whole pages. If your key answer spans two chunks, neither may be strong enough to get retrieved. Writing self-contained H2 sections of 300-400 words helps ensure each chunk contains a complete, citable answer.

Does content structure affect AI citations?

Structure affects citations indirectly but not directly. Good heading hierarchy (H2/H3) creates better chunk boundaries during the chunking step. Crawlable, parseable HTML is required for the pipeline to even start. But structural readiness scores show zero correlation with actual citation rates (r=0.009 across 441 domains). Structure is necessary infrastructure - it gets you into the game - but content relevance is what wins citations.

What is the "Lost in the Middle" problem?+

Research by Liu et al. (2023) found that LLMs have a U-shaped attention curve: they attend strongly to information at the beginning and end of their context window, but deprioritize information in the middle. This means front-loading your key claims (BLUF - Bottom Line Up Front) in each section isn't just good writing style - it aligns with how the model's attention architecture actually works.

Which AI search platform is easiest to optimize for?

Perplexity is the most optimizable because every query triggers web retrieval, giving your content a chance to be found. ChatGPT only searches the web for about 31% of queries (59% for local intent). Google AI Overviews relies heavily on existing Google Search rankings. Claude doesn't browse the web at all - it only uses training data.

How important are FAQ blocks for AI citations?

A 2025 study by Relixir found 41% citation rate for pages with FAQ content versus 15% without. This makes sense through the pipeline lens: FAQ question-answer pairs are naturally self-contained chunks that map directly to user queries. However, this data hasn't been independently verified, so treat it as indicative rather than definitive.

Alexey Tolmachev

Senior Systems Analyst · AI Search Readiness Researcher

Senior Systems Analyst with 14 years of experience in data architecture, system integration, and technical specification design. Researches how AI search engines process structured data and select citation sources. Creator of the methodology.
