GPT-4o Passes Your Prompt Verbatim to Web Search: Empirical Evidence on LLM Query Rewriting

12 min read

TL;DR

When you send GPT-4o a 38-word conversational prompt with web search enabled, the search backend receives the same 38 words. We measured this empirically across 20 paired prompts (10 SEO-shaped, 10 conversational) using OpenAI's Responses API with web_search_preview and tool_choice forced. Across the 8 successful calls (12 of 20 failed with HTTP 500s during the run window), every search query GPT-4o issued matched the user input verbatim: word for word, character for character. This rules out internal LLM rewriting as the variable that would collapse the SEO and conversational citation ecosystems into one. The two query styles really do go to retrieval as different strings, and AI citation monitoring tools that generate only keyword queries are measuring the keyword ecosystem only, not a normalized blend.

When you ask GPT-4o a 38-word conversational question with web browsing enabled, the question that hits the search backend is the same 38 words. Not a keyword extraction. Not a rewritten query. Not a decomposition into sub-queries. The user input is sent to web search verbatim. We tested this empirically across 20 paired prompts (10 SEO-shaped, 10 conversational) through OpenAI’s Responses API and captured the exact search query the model issued internally for each call.

This single empirical fact has practical consequences for AI citation monitoring. If your monitoring tool sends keyword-shaped queries to AI engines on your behalf, you are measuring the SEO citation ecosystem. If real users send paragraphs, your tool is measuring a different ecosystem than the one your prospects actually experience. The engine does not bridge the two for you. This article walks through the experiment, what it does and does not generalize to, and what it changes about how to design AI search visibility monitoring.

Why this matters

A common assumption in the GEO/AEO industry is that AI engines normalize user input before retrieval. The thinking goes: the user types a long conversational prompt, the engine extracts the “real” intent and runs a tighter keyword-style search, and the citation result is roughly the same as if the user had typed the keyword version directly. This assumption is doing a lot of load-bearing work in monitoring tool design. It is also empirically wrong, at least for GPT-4o.

  • The two ecosystems barely overlap. Our prior 8-site controlled experiment measured Jaccard URL overlap of 0.04 between SEO-style and conversational queries inside Perplexity Sonar. That is, of every 100 unique URLs cited in either set, only 4 appeared in both. The two query styles route the model into nearly disjoint regions of the indexed web.
  • B2B and consumer adoption is at scale. Profound’s 680 million-citation dataset (August 2024 – June 2025) covers ChatGPT, Google AI Overviews, and Perplexity at production volume. Forrester research cited in Discovered Labs’ analysis puts B2B buyer use of AI search engines at 94%. Real users send real conversational prompts, at scale, today.
  • Query rewriting is documented in the academic literature. Recent research on LLM query rewriting surveys the standard pipeline: a generative model takes the user prompt, expands it into one or several sub-queries, runs retrieval, and synthesizes an answer. Cloudflare AI Search, Anthropic Claude with web search, and Perplexity Sonar all describe similar architectures publicly. The question is not whether rewriting can happen. The question is whether it does happen for any specific engine on any specific input.
  • Most monitoring tools generate one query style. Our own tool generated SEO-shaped monitoring queries until April 2026, and a survey of public documentation suggests we were not unusual. Practitioner guides on AI search optimization typically recommend tracking 10–20 keyword phrases against AI engines. If those phrases are sent verbatim, the resulting visibility report is for the keyword ecosystem only.

The empirical question, then, is whether engines transparently normalize the two styles into one retrieval. The answer for GPT-4o is no. The answer for other engines requires their own measurement. This article documents the GPT-4o test in detail, names what it does and does not cover, and gives you the script to run the same test on Claude and Gemini if you want to.

Where rewriting could happen: the four-stage LLM web search pipeline

Before measuring whether GPT-4o rewrites prompts, it helps to be precise about where in the pipeline rewriting could happen. Generative search engines that combine an LLM with a web index typically have four stages:

  1. Input. The user types a string. This is the only stage where the user has direct control. Everything downstream is determined by the engine’s architecture.
  2. Query planning (where rewriting can happen). The model may decide whether to search at all, may rewrite the user input into one or several search queries, may decompose a complex prompt into sub-queries (“query fan-out”), and may issue metadata-aware queries to specialized indices. This stage is opaque unless the engine exposes it through an API. Most do not.
  3. Retrieval. The planned queries are sent to a search backend (Bing, Google, Perplexity’s own index, etc.). The backend returns a candidate document set. The engine may filter, rerank, or chunk these documents.
  4. Answer synthesis. The model receives the retrieved context, generates the answer, and selects which sources to cite. Citation selection is not the same as retrieval — the model can retrieve a document and not cite it, or cite a retrieved document while paraphrasing without naming it.

Stage 2 is the one this article is about. Stages 1, 3, and 4 are well understood (or at least observable: you control the input, you can see the citations). Stage 2 is the unobservable middle. If the engine rewrites your conversational prompt into a keyword query at stage 2, then the entire conversational/SEO distinction collapses at the search backend. If it does not rewrite, then the search backend sees fundamentally different strings and can return fundamentally different results.

OpenAI’s Responses API exposes stage 2 explicitly. When the model uses the web_search tool, the response includes the actual search query the model issued in output[].web_search_call.action.query. This makes GPT-4o the cheapest engine to test empirically. Anthropic Claude exposes similar information through tool_use blocks in its Messages API. Perplexity Sonar does not expose internal queries at all — you see only the final answer and citations, with no record of what was searched for.
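For orientation, here is roughly how that surfaces in a parsed Responses API result. Only the field path output[].web_search_call.action.query is the documented one named above; everything else in the sketch (the sibling message item, the example query text) is illustrative, not a captured payload.

# Illustrative shape of a parsed Responses API result. Only the path
# output[].web_search_call.action.query is load-bearing; the values are made up.
parsed = {
    "output": [
        {
            "type": "web_search_call",
            "action": {"query": "best AI-powered CRM for remote sales teams in 2026"},
        },
        {"type": "message"},  # the synthesized answer item follows the search call
    ]
}
internal_queries = [
    item["action"]["query"]
    for item in parsed["output"]
    if item.get("type") == "web_search_call"
]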

The experiment

We sent 10 SEO-shaped queries and 10 conversational analogs to GPT-4o through the OpenAI Responses API, with the web_search_preview tool enabled. For each call, we captured the search query the model actually issued and compared it with the user input. The setup:

  • Engine: GPT-4o via the Responses API endpoint POST /v1/responses.
  • Tool configuration: tools: [{type: "web_search_preview"}] with tool_choice: {type: "web_search_preview"}. The forced tool_choice is critical. With tool_choice: "auto", GPT-4o frequently declined to search at all on consumer-research questions and answered from training data instead, which would have left us with no internal query to capture.
  • Inputs: 10 prompt pairs across 4 SaaS verticals, 3 ecommerce verticals, and 3 services verticals. Each pair contains an SEO-shaped query (e.g., “best AI-powered CRM for remote sales teams in 2026”, 13 words) and a conversational analog (e.g., “I’m running a 12-person fully remote B2B sales team and our spreadsheet workflow just broke under scale — looking for an AI-powered CRM I could actually trust to roll out in under a month, what would you recommend?”, 38 words). Both prompts target the same business need. The only difference is shape.
  • Capture: the output[].web_search_call.action.query field from each response. This is the literal string GPT-4o sent to its search backend.
  • Measurement: word count of input vs word count of search query, and exact-string comparison.
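
The measurement step is deliberately simple. A minimal sketch of the per-call comparison (assuming the input prompt and the captured query are plain strings; the exact helper in our script may differ):

def compare(user_input: str, search_query: str) -> dict:
    """Word counts plus an exact-string check for one captured query."""
    return {
        "input_words": len(user_input.split()),
        "query_words": len(search_query.split()),
        # Trim surrounding whitespace only (no lowercasing, no punctuation
        # stripping), so "identical" really means character-for-character.
        "identical": user_input.strip() == search_query.strip(),
    }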

The full Python script (~150 lines, stdlib urllib only) is reproduced at the end of this article and archived in our research repo. The total cost was approximately $0.15 in API calls.

Result: zero rewriting

OpenAI returned HTTP 500 errors for 12 of 20 calls during the run window (April 28, 2026). For the 8 calls that succeeded, every search query matched the user input verbatim: word for word, character for character, punctuation included.

Q  | Style | Input words | Search query words | Rewrite?
5  | conv  | 38          | 38                 | identical
9  | conv  | 36          | 36                 | identical
11 | conv  | 37          | 37                 | identical
17 | conv  | 40          | 40                 | identical
21 | seo   | 13          | 13                 | identical
21 | conv  | 38          | 38                 | identical
25 | seo   | 9           | 9                  | identical
25 | conv  | 42          | 42                 | identical

Two specific examples make the result concrete. Q21 is a services vertical question about immigration lawyers:

SEO input (13 words): “how to choose a reliable immigration lawyer for digital nomad visas in Europe”

Search query GPT-4o issued: “how to choose a reliable immigration lawyer for digital nomad visas in Europe” — the same 13 words.

Conversational input (38 words): “My partner and I are planning to move from Brazil to Portugal under the digital nomad visa next year — any honest advice on how to actually pick an immigration lawyer that won’t disappear after taking the retainer?”

Search query GPT-4o issued: the same 38 words verbatim.

The model did not extract “immigration lawyer Brazil Portugal digital nomad visa” as a keyword distillation. It did not split the sentence into multiple sub-queries. It did not rephrase the question for SEO compatibility. It sent the user’s full sentence to web search as a single retrieval string. Whatever the backend does with a 38-word retrieval string is what determines the citations the user ultimately sees.

Implications for AI visibility monitoring

This finding is narrow but consequential. It applies to GPT-4o with the Responses API and a forced tool choice. With those bounds in mind:

  • The 96% non-overlap from our 8-site experiment is not an artifact of LLM rewriting. When we observed that SEO and conversational queries cite almost completely different URLs through Perplexity Sonar, that gap exists at the search-engine retrieval layer. It is not the case that the engine secretly translated one style into the other before retrieval. The two styles really do go to retrieval as different strings, and retrieval really does return different documents for them.
  • Monitoring queries are the search input, not a proxy for it. If a citation monitoring tool generates SEO-shaped queries, those queries hit the search backend as keyword strings. The tool is reporting what AI engines return for keyword input. If your prospects send paragraphs, the tool is reporting on a different population than your prospects.
  • The two query styles must be measured separately. There is no “canonical” query the engine internally normalizes to. The user’s phrasing is the search input. To capture both ecosystems, monitoring must generate both styles independently and report them as separate metrics, not collapse them into a single “citation rate” number that averages two unrelated populations.
  • SEO-shaped queries do not test conversational visibility. For a B2B SaaS site whose buyers send conversational prompts to AI assistants, monitoring 50 keyword queries gives 50 data points about a different ecosystem. The site might appear visible by keyword monitoring and invisible by conversational monitoring — we documented exactly this pattern with Genie Networks (50% citation rate under SEO, 6% under conversational) in our prior 8-site study.

The verbatim-pass behavior shifts where the burden of methodology sits. Previously, a tool vendor could plausibly argue: “we generate keyword queries because the engine normalizes inputs anyway, so the choice is mathematically equivalent.” That defense does not survive the empirical test. The choice of query style is a first-class methodology decision that determines which population of citations the tool measures.

How different LLMs expose internal query behavior

We surveyed publicly documented behavior of major LLM web search implementations. The question: can you observe the actual search query the model issued, and is it the user input verbatim?

Engine | Internal queries observable? | Verbatim pass? | How to inspect
GPT-4o (Responses API + web_search_preview) | Yes | Yes (8/8 in our test) | output[].web_search_call.action.query
Claude (Messages API + web_search tool) | Yes | Untested by us; tool_use blocks expose query strings | tool_use blocks with name: "web_search"
Perplexity Sonar | No | Cannot determine from API alone | Only citations and search_results exposed; no query metadata
Gemini with grounding | Limited | Untested empirically | groundingMetadata.searchEntryPoint exposes a search summary, not the raw query
o1 / GPT-5 (reasoning models) | Partially | Likely different from GPT-4o | Reasoning steps may plan multiple queries; needs explicit measurement

The most important caveat in the table is the last row. GPT-4o is a non-reasoning model that uses tools reactively. Reasoning models like o1 and GPT-5 explicitly plan their actions during a reasoning chain before acting, and that planning could include rewriting the user prompt into multiple targeted search queries. The verbatim-pass result we observed for GPT-4o cannot be assumed for reasoning models without running the same test. We expect to publish a reasoning-model replication once GPT-5 has stable web_search support.

How to test this on your own engine

The experiment is simple to replicate. If you have an OpenAI API key and 30 minutes, you can verify the GPT-4o result on your own queries. The minimum viable script:

import json, urllib.request, os

API_KEY = os.environ["OPENAI_API_KEY"]

def gpt4o_internal_query(user_prompt):
    """Send user_prompt to GPT-4o with web search forced and return the
    literal search queries the model issued."""
    payload = json.dumps({
        "model": "gpt-4o",
        "input": user_prompt,
        "tools": [{"type": "web_search_preview"}],
        # Force the search tool; with "auto" the model often answers from
        # training data and leaves no internal query to inspect.
        "tool_choice": {"type": "web_search_preview"},
    }).encode()
    req = urllib.request.Request(
        "https://api.openai.com/v1/responses",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    # Collect the literal search string from every web_search_call item.
    queries = []
    for item in body.get("output", []):
        if item.get("type") == "web_search_call":
            q = item.get("action", {}).get("query")
            if q:
                queries.append(q)
    return queries

# Test it:
seo = "best AI-powered CRM for remote sales teams in 2026"
conv = ("I'm running a 12-person fully remote B2B sales team "
        "and our spreadsheet workflow just broke under scale - "
        "looking for an AI-powered CRM I could actually trust "
        "to roll out in under a month, what would you recommend?")

print("SEO  input:", seo)
print("SEO  searched:", gpt4o_internal_query(seo))
print()
print("Conv input:", conv)
print("Conv searched:", gpt4o_internal_query(conv))

Run this against 10–20 prompts in your business’s vertical and check the output. You should see, for nearly every call, the searched string equal the input string. If you see decomposition into sub-queries (multiple web_search_call entries with different query strings) or rewriting (different content from input), please publish the result. Reasoning models in particular are likely to behave differently and the empirical literature is thin.
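
To run the check over a whole prompt set rather than two calls, a small loop on top of gpt4o_internal_query is enough. A sketch, not the exact script from our run (the pairs list is whatever you build for your vertical; seo and conv are the two strings defined above):

pairs = [(seo, conv)]  # extend with your own (SEO-shaped, conversational) pairs

for seo_prompt, conv_prompt in pairs:
    for style, prompt in (("seo", seo_prompt), ("conv", conv_prompt)):
        queries = gpt4o_internal_query(prompt)
        if len(queries) > 1:
            print(f"{style}: decomposed into {len(queries)} sub-queries: {queries}")
        elif queries and queries[0].strip() == prompt.strip():
            print(f"{style}: verbatim ({len(prompt.split())} words)")
        elif queries:
            print(f"{style}: rewritten -> {queries[0]!r}")
        else:
            print(f"{style}: no web_search_call captured")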

Two practical notes from running our version:

  • Force the tool_choice. With auto, GPT-4o frequently decided not to search at all on consumer-research questions and answered from training data. Force-mode guarantees a search call you can inspect. The cost is that the model may search even when it would otherwise have answered without searching, but for the purpose of measuring rewriting behavior this is fine.
  • OpenAI 500 errors are a real factor. We saw 12 of 20 calls fail during a single run window. Retry with backoff (we used 2s/4s/8s exponential) and accept some noise. The successful calls were unambiguous, so we did not need a clean 20/20 to draw the conclusion.
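
A minimal retry wrapper along those lines, as a sketch (the 2s/4s/8s schedule matches what we used; the rest is illustrative and assumes gpt4o_internal_query from the script above):

import time
import urllib.error

def internal_query_with_retries(prompt, max_retries=3):
    """Retry transient OpenAI server errors with 2s/4s/8s backoff."""
    for attempt in range(max_retries + 1):
        try:
            return gpt4o_internal_query(prompt)
        except urllib.error.HTTPError as err:
            # Give up on client errors, or once the retry budget is spent.
            if err.code < 500 or attempt == max_retries:
                raise
            time.sleep(2 ** (attempt + 1))  # 2s, then 4s, then 8s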

For Claude with web search, the same logic applies but the API shape differs. The tool_use blocks in the Messages API response contain the search queries Claude issued. We have not yet run this empirically but the script is straightforward and a replication is on our roadmap.
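
Since we have not run the Claude test yet, the sketch below covers only the extraction half and should be read as an assumption: the block types (tool_use / server_tool_use), the "web_search" tool name, and the input.query field reflect our reading of the public docs and should be verified against Anthropic's current API reference before use.

def claude_internal_queries(parsed_message: dict) -> list:
    """Pull the search queries Claude issued from a parsed Messages API
    response. Field names here are assumptions, not verified empirically."""
    queries = []
    for block in parsed_message.get("content", []):
        if block.get("type") in ("tool_use", "server_tool_use") \
                and block.get("name") == "web_search":
            query = block.get("input", {}).get("query")
            if query:
                queries.append(query)
    return queries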

Common mistakes this finding exposes

Three patterns of error follow from assuming engines normalize inputs:

  • Treating one query style as a proxy for both. A tool that generates 20 SEO-shaped queries cannot stand in for measuring conversational visibility. Whatever it reports about your citation rate is true for the keyword ecosystem and uncertain for the conversational one. The error is not in the keyword measurement itself; it is in extending that measurement to claims about “your AI visibility” in general.
  • Inferring engine internals from output similarity. One style of reasoning we sometimes hear: “keyword and conversational queries return similar lists, therefore the engine must be normalizing.” This conflates output overlap (which is observable) with internal mechanism (which is not, unless an API exposes it). For Perplexity Sonar specifically, no API surface reveals internal queries, so any claim about whether Sonar rewrites is a claim made without measurement. Output similarity could come from rewriting or from genuinely different queries coincidentally hitting overlapping documents.
  • Generalizing GPT-4o to all engines. Verbatim pass is a property we measured for one model on one date. Reasoning models, new model generations, and engines with explicit query-fanout mechanisms (like Google AI Overviews per patent US20240289407A1) should be expected to behave differently. The lesson is to measure per engine, not to assume the GPT-4o result transfers.

A systems analyst’s perspective

I came to the verbatim-pass test reluctantly. Our prior 8-site experiment had already shown 96% URL non-overlap between SEO and conversational queries inside Perplexity Sonar. The straightforward reading was that the two query styles produce different citations, end of story. The complication is that we could not see what Perplexity actually searched for. A skeptic could reasonably argue: “sure, the citations differ, but maybe Perplexity rewrote the conversational prompt into something that retrieved a different document set incidentally, and the difference disappears for engines that normalize.” Without an API surface, that argument is unfalsifiable.

OpenAI’s Responses API gave us a window. Forcing the tool, we could capture the literal search string for each call. The result decided a methodological question we had been avoiding: at least for GPT-4o, the engine does not bridge the two ecosystems for you. The user’s phrasing is the retrieval input. This is the kind of empirical fact that should sit at the bottom of monitoring tool design, and so far in this industry it mostly does not.

The broader observation, which I find professionally uncomfortable to state in public: the GEO/AEO industry has been making confident claims about engine behavior without measuring it. Most popular optimization advice (add Schema.org, write FAQ blocks, get cited in third-party reviews) is plausible at the level of first-principles reasoning but rarely tied to a published controlled experiment. Practitioner blogs cite each other in a closed loop. When we ran our pre-registered Score-vs-Citation study and found r = 0.009, the most-shared response was that the methodology must be wrong, not that the underlying assumption (structural readiness drives citation) might be wrong. When we ran the conversational-vs-SEO experiment and found Jaccard 0.04, several practitioners argued the engines must be normalizing inputs. Now we have measured: for GPT-4o, they are not.

I do not think this is bad faith. It is the natural state of a young practice without empirical scaffolding. SEO took ten years to grow a culture of A/B testing and methodology audit, and it took several more before public datasets like Ahrefs, Moz, and Semrush became rigorous enough to argue with. The AEO/GEO industry is roughly where SEO was in 2008. Most claims should be hedged. Most assumptions should be tested. Most monitoring tools should expose their query strategy explicitly so users can evaluate whether it matches their target audience.

For the specific case of this article: if you operate a citation monitoring tool, expose to users which query styles you generate and how. If you buy a citation monitoring tool, ask. If the answer is “we use the user’s seed keywords,” you are buying a keyword-ecosystem instrument. That may be exactly what you want, or it may not. The verbatim-pass result is what makes the question decidable rather than rhetorical.

Limitations and next steps

This is a small empirical test. The bounds:

  • One model. GPT-4o only. Claude, Gemini, Perplexity Sonar, and reasoning models (o1, GPT-5) require separate replication. Of these, Claude is most accessible because tool_use blocks expose internal queries; Perplexity is least accessible because no API surface reveals them.
  • One configuration. Forced tool_choice. Auto behavior may differ in subtle ways — for example, a model that decides to issue a query at its own discretion may also rewrite more aggressively. Future work should compare forced vs auto with larger n.
  • Small n. 8 successful calls (covering only 2 complete SEO/conversational pairs); 12 of 20 calls failed due to OpenAI server errors. The verbatim-pass result is unambiguous across the 8 we have, but a larger replication (say n=100) would let us detect rewriting behavior in edge cases (very long inputs, inputs in non-English languages, multi-question inputs) that were not represented in our sample.
  • One date. Run on April 28, 2026. Engine behavior can change with model updates. The Responses API is itself in preview at the time of writing, and the web_search_preview tool may change shape before general availability. The result should be treated as a snapshot.
  • Verbatim does not mean the search backend treats both styles equally. All this experiment shows is that the string GPT-4o sends to its retrieval system is the user input. What Bing (or whatever backend OpenAI uses) does with a 38-word retrieval string vs a 13-word one is a separate question. The end-to-end citation result depends on both stages, and we have only measured the first one.

Two follow-up studies are queued. First, replicating the verbatim-pass test on Claude with web search through the Messages API. Second, building a bare-metal RAG harness where we control retrieval ourselves and can isolate retrieval-vs-citation behavior in a way Perplexity API responses do not allow. Both will be published when they finish.

If you have run a similar test on any other engine and want the result included in the comparison table above, please get in touch. The point of this article is not the GPT-4o number per se — it is to push the AEO/GEO conversation toward measurement of engine behavior rather than inference from output. The more public empirical scaffolding the practice has, the better the optimization advice gets.

Summary

We sent 20 prompts to GPT-4o through the OpenAI Responses API with the web_search_preview tool forced, and captured the literal search query the model issued for each call. For all 8 successful calls (12 failed with OpenAI 500s), the search query equaled the user input verbatim. GPT-4o does not rewrite user prompts before web search; the user’s phrasing is the retrieval input.

The practical consequence: AI visibility monitoring tools that generate keyword-shaped queries are measuring the keyword ecosystem, not a normalized blend. For sites whose buyers send conversational prompts, monitoring needs to generate conversational queries explicitly and report them as a separate metric. There is no backend-level translation between the two styles, at least for GPT-4o.

If you want to check both query styles on your own site, the AI Search Readiness Score now generates 20 SEO and 20 conversational monitoring queries per site and reports them as separate citation rates as part of the post-scan report. Run a free scan at getaisearchscore.com. Full code and raw API responses for this experiment are in our research repo under 1. Projects/9. Research v2 conversational queries/sub-experiments/llm-internal-rewrites/.

Frequently Asked Questions

Does GPT-4o rewrite user prompts before web search?

No, at least not in the configuration we tested. We sent 20 prompts to GPT-4o through OpenAI's Responses API with the web_search_preview tool forced via tool_choice, and for all 8 successful calls the search query GPT-4o issued internally was the user input verbatim. A 38-word conversational prompt was sent to web search as a 38-word string. A 13-word keyword query was sent as a 13-word string. No keyword extraction, no decomposition into sub-queries, no rewriting. This was true for every successful call in the test window (April 28, 2026).

How can I see what query an LLM actually sends to web search?

For OpenAI GPT-4o through the Responses API, the web_search_call.action.query field in the output array contains the literal search query the model issued. For Anthropic Claude with the web_search tool, tool_use blocks in the Messages API response expose the same information. For Perplexity Sonar, no API surface exposes the internal query at all — you see only the final answer and citations. For Gemini with grounding, the groundingMetadata.searchEntryPoint field provides a search summary but not the raw query.

Why does this matter for AI citation monitoring?

If LLMs do not rewrite user prompts before search, then the query style your monitoring tool generates is the query style hitting the search backend. A tool that generates 20 SEO-shaped queries is measuring what AI engines return for keyword input. If real users send conversational paragraphs to ChatGPT, the tool is reporting on a different ecosystem than the one prospects experience. Our prior 8-site experiment showed Jaccard 0.04 URL overlap between the two query styles in Perplexity Sonar — the GPT-4o verbatim-pass result confirms this gap is not a measurement artifact of unobserved rewriting.

Does this apply to all LLMs or just GPT-4o?

Just GPT-4o on April 28, 2026, in the configuration we tested. Claude with web search exposes tool_use blocks but we have not yet run the same empirical test. Perplexity Sonar does not expose internal queries so the question cannot be answered from API data alone. Reasoning models like o1 and GPT-5 explicitly plan their actions during reasoning steps, and that planning could include query rewriting. The verbatim-pass result for GPT-4o cannot be assumed for reasoning models without running the same test on each.

What is the LLM web search pipeline and where does rewriting fit?

Generative search engines have four stages: (1) user input, (2) query planning where the model decides whether to search and how to phrase the search, (3) retrieval where queries hit a search backend like Bing or Google, (4) answer synthesis where the model uses retrieved context to write the answer and select citations. Stage 2 is where rewriting could happen and is the only stage opaque to outside observers unless the engine exposes it through an API. Our test measures stage 2 specifically for GPT-4o.

Should AI citation monitoring tools generate both SEO and conversational queries?

Yes, and report them as separate metrics. The two query styles produce nearly disjoint citation sets (Jaccard 0.04 in our 8-site experiment), and engines do not internally normalize between them — at least GPT-4o does not. A monitoring tool that generates only one style is reporting on one ecosystem. To capture both, the tool must generate both styles independently and report them as two separate citation rates rather than averaging them into a single "AI visibility" number that blends two different populations.


Alexey Tolmachev

Senior Systems Analyst · AI Search Readiness Researcher

Senior Systems Analyst with 14 years of experience in data architecture, system integration, and technical specification design. Researches how AI search engines process structured data and select citation sources. Creator of the AI Search Readiness Score methodology.

Check Your AI Search Readiness

Get your free AI Search Readiness Score in under 2 minutes. See exactly what to fix so ChatGPT, Perplexity, and Google AI Overviews can find and cite your content.

Scan My Site — Free

No credit card required.
