Your SEO team just showed you a dashboard where everything is green. Rankings holding, technical health at 88, Core Web Vitals passing. Meanwhile ChatGPT, Gemini, and Perplexity are recommending your competitor by name and skipping you entirely. Both things are true at the same time, and that gap is exactly what a traditional audit is structurally incapable of seeing.
This is a walkthrough of the actual workflow I run when an enterprise brand asks why it's losing ground in AI search. It is not a tool review. It's a data pipeline, one that connects the technical SEO debt sitting in your crawl exports to the reason an LLM won't cite you. I can't share the proprietary skill I use for the final synthesis layer, so the value here is the methodology and the stack around it. If you build this internally, you can run it without me.
The Hook & The Problem
Why the traditional SEO audit stops short
A classic technical audit answers one question: can Googlebot crawl, render, and rank this page? That question still matters. But it was designed for a world where the end state is a blue link in a SERP, and the user clicks it.
LLMs don't work that way. They don't rank ten pages and hand the user a list. They parse your content, extract entities and facts, decide whether they trust the source, and then generate an answer that may or may not name you. Being indexable is the floor, not the goal. You can pass every Lighthouse check and still be functionally invisible to a model that couldn't confidently figure out what your company does or whether to vouch for it.
Why standalone "AI visibility tools" fail on their own
The opposite failure is just as common. A team buys an AI visibility tracker, sees their recommendation rate is low, and has no idea what to do about it. The tool tells them that they're losing. It rarely tells them why.
The why almost always lives in the technical and structural layer the AI tool never looks at: a thin entity definition, a homepage buried under inline JavaScript, missing or contradictory structured data, an llms.txt that doesn't exist, content that buries the answer six paragraphs down. AI visibility tools measure the symptom. Technical SEO data holds the diagnosis. Run either one in isolation and you get half a picture.
The Dual-Engine Audit
The framework I use treats the brand as something being read by two different engines with two different appetites:
- The crawler engine (Google): wants clean rendering, fast pages, valid markup, clear internal linking, and indexable content. This is where most of your existing audit data already lives.
- The context-window engine (LLMs): wants unambiguous entity signals, machine-readable facts it can extract without guessing, structured data that resolves who you are and what you sell, and content shaped so the answer is easy to lift and cite.
Optimizing for one does not automatically satisfy the other, but they share a foundation. A site that's a mess for crawlers is almost always worse for LLMs, because the model has even less tolerance for ambiguity than the ranking algorithm does. The dual-engine audit measures both and connects the findings so each one explains the other.
The Enterprise GEO Audit Stack (My Real Workflow)
Here's the exact four-step process I ran recently to diagnose a brand whose AI recommendation rate had been sliding for two quarters while their SEO metrics looked fine. It's human-in-the-loop by design. The tooling does the heavy lifting; the judgment stays with you.
Step 1: Raw Data Collection (The Foundation)
You can't synthesize what you haven't gathered. The first phase is unglamorous and non-negotiable: pull the raw technical and content data into one place.
- Deep technical crawl via Screaming Frog. Run a full crawl, not a sample. Render JavaScript so you see what the page actually exposes after execution, not just the raw HTML. Pull in Google Search Console, GA4, and PageSpeed Insights data via their APIs as well (if page speed is your focus, see my guide on achieving a flawless 100/100 PageSpeed score). Export everything that signals friction: response codes, indexability status, canonical conflicts, rendered word count, structured data presence and validation, internal link depth, and page weight. Heavy inline JavaScript and CSS, broken internal links, and missing security headers all show up here, and all of them degrade how cleanly an agent can parse the page.
- Keyword and content-gap data via a coverage tool. The point isn't the keyword volumes. It's the topical map: which entities and questions your category owns that your site barely addresses. Those gaps are the queries an LLM is answering with someone else's content.
At the end of Step 1 you have a pile of CSVs. Thousands of rows. This is where most audits drown; the analyst opens the export, scans the first 200 rows, and pattern-matches from memory. That's not analysis. That's guessing with a spreadsheet open.
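To make "read every row, not the first 200" concrete, here is a minimal sketch that scans a whole crawl export and attaches friction flags to each URL. The column names (`Address`, `Status Code`, `Indexability`, `Word Count`, `Crawl Depth`) are assumptions modelled on a typical crawler export, and both thresholds are placeholders I picked for illustration, so adjust them to your own file and site.

```python
import csv
import io

# Inline sample standing in for a real export file.
# Column names are assumptions; rename them to match your own headers.
SAMPLE_EXPORT = """Address,Status Code,Indexability,Word Count,Crawl Depth
https://example.com/,200,Indexable,850,0
https://example.com/pricing,200,Indexable,90,1
https://example.com/old-page,404,Non-Indexable,0,3
https://example.com/docs/deep,200,Indexable,600,6
"""

THIN_WORDS = 150  # hypothetical threshold for "thin" rendered content
MAX_DEPTH = 4     # hypothetical threshold for "buried" pages


def friction_flags(row):
    """Return the friction labels that apply to one crawl row."""
    flags = []
    if row["Status Code"] != "200":
        flags.append("non_200")
    if row["Indexability"] != "Indexable":
        flags.append("non_indexable")
    if int(row["Word Count"]) < THIN_WORDS:
        flags.append("thin_content")
    if int(row["Crawl Depth"]) > MAX_DEPTH:
        flags.append("deep_link")
    return flags


def scan_export(csv_text):
    """Read every row and map each problematic URL to its flags."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = {}
    for row in reader:
        flags = friction_flags(row)
        if flags:
            flagged[row["Address"]] = flags
    return flagged


flagged = scan_export(SAMPLE_EXPORT)
for url, flags in flagged.items():
    print(url, flags)
```

For a real export, swap the inline sample for the file contents. The point is only that the whole file gets evaluated against explicit rules instead of being skimmed.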
Step 2: The RAG Extraction Phase (Using NotebookLM)
Instead of reading the CSVs, I make a model read them for me, but in a grounded, source-locked way, not a "summarize this" way.
I upload the raw Screaming Frog and content-gap exports into Google's NotebookLM. Because NotebookLM is built to ground its answers in the sources you give it, it is far less likely to hallucinate findings from general training data; it's answering from your crawl, which is exactly what you want for an audit.
Then I prompt it to act as a technical analyst with a specific job:
- Identify the pages with the highest concentration of technical debt, and group them by type of friction, not just list them.
- Flag the specific signals that prevent an AI agent from cleanly parsing the brand's core entity: thin or missing entity definitions, pages where rendered content diverges from raw HTML, absent or invalid structured data, answer content buried deep in the DOM.
- Separate observed evidence from inference. I want it to label what the crawl proves versus what it suspects, so I'm never publishing a finding I can't defend.
I export those structured insights into Google Docs. Now I have a readable, source-grounded technical narrative instead of 4,000 rows, and a document I can feed into the next stage.
The discipline that matters here: keep the model locked to your data. The moment it starts theorizing from general knowledge, you've lost the audit. Source-grounded extraction is the whole point.
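If you want to carry the "observed versus inferred" discipline into whatever you store downstream, one option is to make it a required field on every finding. This is a sketch of my own bookkeeping idea, not NotebookLM's output format; the `Finding` class, its field names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    url: str
    friction_type: str  # e.g. "structured_data", "entity_definition"
    basis: str          # "observed" (the crawl proves it) or "inferred"
    evidence: str       # the export value or reasoning behind the finding


def group_by_friction(findings):
    """Group findings by friction type, keeping observed and inferred apart."""
    grouped = defaultdict(lambda: {"observed": [], "inferred": []})
    for finding in findings:
        if finding.basis not in ("observed", "inferred"):
            raise ValueError(f"unknown basis: {finding.basis!r}")
        grouped[finding.friction_type][finding.basis].append(finding)
    return dict(grouped)


# Hypothetical findings transcribed from the extraction step.
findings = [
    Finding("https://example.com/pricing", "structured_data", "observed",
            "no JSON-LD block present in rendered HTML"),
    Finding("https://example.com/loans", "structured_data", "observed",
            "Product markup fails validation"),
    Finding("https://example.com/pricing", "entity_definition", "inferred",
            "page never states in plain text what the product is"),
]

grouped = group_by_friction(findings)
for friction_type, buckets in grouped.items():
    print(friction_type,
          "observed:", len(buckets["observed"]),
          "inferred:", len(buckets["inferred"]))
```

Rejecting any finding without a valid `basis` means nothing reaches the report unlabelled, which is the property you need when a stakeholder asks "how do you know?"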
Step 3: Integrating LLM Citation Data
Now the two engines meet. Up to this point I have a strong picture of the crawler-engine problems. Step 3 brings in the context-window-engine reality, what the LLMs are actually doing with this brand.
This is where you cross-reference the NotebookLM technical narrative against real AI visibility data:
- Recommendation rate vs. mention rate. Appearing in an AI answer is not the same as being recommended in one. A brand can be mentioned constantly and recommended almost never, and that gap is usually the entire business problem. Track them as two separate numbers.
- Citation source analysis. When ChatGPT or Perplexity answers a query in your category, where is it pulling from? In the work I've done with citation-tracking platforms, the overwhelming majority of citations in some verticals come from third-party sources rather than the brand's own domain, which suggests the model either doesn't trust your site or can't cleanly extract from it directly. (For a detailed case study on where ChatGPT gets its citations, see our analysis of 143,010 citations in UK banking.)
- Topic-level breakdown. Map recommendation rate against the specific product or service topics that actually drive revenue, not vanity queries. A high recommendation rate on a low-value topic is noise.
For this layer I pull data from a tracking platform that runs persona-based, multi-turn conversations rather than single one-shot prompts. A single prompt tells you what a model says once; a real buyer's research journey is a conversation, not a single question. I'm looking for the join: topics where recommendation rate is collapsing that are also the topics where Step 2 flagged thin entity signals and missing structured data. When those two datasets line up, you've found your causal chain: technical debt on the left, lost recommendations on the right.
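The join described above can be sketched in a few lines. Everything here is illustrative: the conversation log format, the topic names, the 0.5 cut-off, and the set of debt-affected topics are assumptions standing in for whatever your tracking platform and Step 2 actually produce.

```python
from collections import Counter

# Hypothetical log: (topic, brand_mentioned, brand_recommended) per conversation.
conversations = [
    ("business loans", True, False),
    ("business loans", True, False),
    ("business loans", True, True),
    ("business loans", False, False),
    ("savings accounts", True, True),
    ("savings accounts", True, True),
]

# Hypothetical Step 2 output: topics whose pages carry parseability debt.
topics_with_debt = {"business loans"}

LOW_RECOMMENDATION = 0.5  # placeholder cut-off, not an industry benchmark


def topic_rates(rows):
    """Compute mention rate and recommendation rate per topic, separately."""
    totals, mentions, recommendations = Counter(), Counter(), Counter()
    for topic, mentioned, recommended in rows:
        totals[topic] += 1
        mentions[topic] += int(mentioned)
        recommendations[topic] += int(recommended)
    return {
        topic: {
            "mention_rate": mentions[topic] / totals[topic],
            "recommendation_rate": recommendations[topic] / totals[topic],
        }
        for topic in totals
    }


rates = topic_rates(conversations)

# The join: weak recommendation rate AND technical debt on the same topic.
suspects = sorted(
    topic for topic, r in rates.items()
    if r["recommendation_rate"] < LOW_RECOMMENDATION and topic in topics_with_debt
)

for topic in suspects:
    print(topic, rates[topic])
```

In this toy data, "business loans" is mentioned in three of four conversations but recommended in only one, and it is also the topic carrying technical debt. An overlap like that is where I would start digging; it narrows the search, and the page-level evidence from Step 2 is what turns it into a defensible finding.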
Step 4: Executive Synthesis (The LLM Polish)
The final step turns a technical diagnosis into something two very different audiences will act on. I feed the NotebookLM technical docs and the citation dataset into a custom Claude setup whose only job is synthesis, translating debt into business impact.
The output is split into two modes deliberately:
- Boss Mode (for the CMO/VP): the business narrative. "Recommendation rate on your three highest-LTV product categories dropped from X to Y. The root cause is parseability debt on those category pages; the model can't reliably extract what you offer, so it recommends competitors it understands better. Here's the revenue exposure and the priority order." No jargon. Impact, cause, sequence.
- Operator Mode (for the content/dev team): the punch list. Page-level, prioritized P0/P1/P2. "P0: fix rendered-content divergence on these 12 category pages. P1: add and validate Organization + Product structured data with a consistent @id strategy. P2: restructure answer content so the extractable fact sits in the first 150 words." Falsifiable, assignable, shippable.
The model is the synthesizer, not the analyst. Every finding traces back to evidence collected in Steps 1–3. Claude is making the diagnosis legible to two audiences; it is not inventing the diagnosis. That distinction is what keeps the output defensible in a room full of skeptical stakeholders.
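One way to enforce "every finding traces back to evidence" before anything reaches the synthesis layer is to refuse punch-list items that have no evidence attached. This is a sketch under my own assumptions, not the proprietary setup: the `PunchItem` structure and the evidence strings are hypothetical, and the tasks simply mirror the Operator Mode example above.

```python
from dataclasses import dataclass, field

PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


@dataclass
class PunchItem:
    priority: str
    task: str
    evidence: list = field(default_factory=list)  # references back to Steps 1-3


def build_punch_list(items):
    """Validate items and return them ordered P0, P1, P2."""
    for item in items:
        if item.priority not in PRIORITY_ORDER:
            raise ValueError(f"unknown priority: {item.priority!r}")
        if not item.evidence:
            raise ValueError(f"no evidence attached to task: {item.task!r}")
    # sorted() is stable, so items keep their input order within a priority.
    return sorted(items, key=lambda item: PRIORITY_ORDER[item.priority])


# Hypothetical items; the evidence strings are placeholders.
items = [
    PunchItem("P2", "Move the extractable fact into the first 150 words",
              ["Step 2: answer content buried deep in the DOM"]),
    PunchItem("P0", "Fix rendered-content divergence on category pages",
              ["Step 1: rendered vs raw word count export"]),
    PunchItem("P1", "Add and validate Organization + Product structured data",
              ["Step 1: structured data validation export"]),
]

punch_list = build_punch_list(items)
for item in punch_list:
    print(f"{item.priority}: {item.task} (evidence: {'; '.join(item.evidence)})")
```

The validation is the useful part. If the synthesizer only ever receives items that passed this gate, it has nothing to invent from.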
Key Metrics to Actually Care About
Impressions and average position were built for the crawler engine. They tell you almost nothing about whether an LLM will vouch for you. These are the metrics that matter for the context-window engine:
- Recommendation rate vs. mention rate. The single most important distinction in this whole discipline. Mention is presence. Recommendation is the model actively steering the user toward you. The buying decision happens in the gap between them. Track the ratio, watch it move, and tie it to the topics that pay your bills.
- Entity validation. Can a machine state, unambiguously, who you are, what you do, and what you sell, from your structured data and on-page signals alone? This is the GEO equivalent of E-E-A-T made machine-readable (which we explore in our ultimate guide to LLM brand visibility & GEO). Weak entity clarity is the most common root cause I find behind a low recommendation rate, and it's the thing the open-source audit frameworks (the dageno-style "entity clarity and trust" scoring approaches now circulating on GitHub) are explicitly trying to quantify. Worth understanding even if you build your own.
- Structured data readiness, for two readers. Google's parsers check structured data against a rulebook: valid or not, eligible for a rich result or not. LLMs read it more loosely, as a fact sheet they use to ground an answer. That dual purpose is why a clean, consolidated JSON-LD approach, one coherent block per page, a consistent @id strategy, and aggressive sameAs entity linking, pays off twice. It's the core idea behind the dual-engine JSON-LD frameworks gaining traction: markup that satisfies the crawler's eligibility check and gives the model an unambiguous entity to remember. One block, two engines, no contradictions.
- Citation source mix. What share of AI answers in your category cite your own domain versus third parties? A low own-domain share is a trust-and-extractability signal, and it's directly actionable through Steps 2–4.
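To illustrate the structured data point from the list above, here is a minimal sketch of one consolidated JSON-LD block with a shared @id and sameAs links, generated from Python. The company, product, URLs, and profile links are all placeholders, and the output is not checked against any rich-result eligibility rules, so treat it as a shape to adapt and validate separately.

```python
import json

SITE = "https://www.example.com"      # placeholder domain
ORG_ID = f"{SITE}/#organization"      # one stable @id, reused wherever the org is referenced

organization = {
    "@type": "Organization",
    "@id": ORG_ID,
    "name": "Example Co",
    "url": SITE,
    "description": "Example Co sells accounting software for small businesses.",
    "sameAs": [                        # placeholder profiles that identify the same entity
        "https://www.linkedin.com/company/example-co",
        "https://en.wikipedia.org/wiki/Example_Co",
    ],
}

product = {
    "@type": "Product",
    "@id": f"{SITE}/products/ledger/#product",
    "name": "Ledger",
    "description": "Cloud bookkeeping for teams of up to 50 people.",
    "brand": {"@id": ORG_ID},          # points at the organization instead of redefining it
}

# One @graph per page keeps every entity in a single coherent block.
graph = {"@context": "https://schema.org", "@graph": [organization, product]}
json_ld = json.dumps(graph, indent=2)

print(f'<script type="application/ld+json">\n{json_ld}\n</script>')
```

The design choice that matters is the reference by @id: the product never restates who the organization is, so there is only one definition for either engine to read and nothing to contradict it.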
Move your reporting onto these axes and the conversation with leadership changes. You stop reporting on traffic that may not exist and start reporting on whether the machines deciding your category trust you.
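Of these axes, citation source mix is the simplest to compute yourself once you have the cited URLs. A minimal sketch, assuming your tracking platform can export a flat list of cited URLs per topic; the URLs and the domain below are made up.

```python
from urllib.parse import urlparse


def own_domain_share(cited_urls, own_domain):
    """Share of citations whose host is the brand's domain or a subdomain of it."""
    if not cited_urls:
        return 0.0
    own_domain = own_domain.lower()
    own = 0
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == own_domain or host.endswith("." + own_domain):
            own += 1
    return own / len(cited_urls)


# Hypothetical citations collected for one topic.
cited = [
    "https://www.example.com/loans",
    "https://en.wikipedia.org/wiki/Loan",
    "https://reviews.example.org/example-co",
    "https://example.com/about",
]

share = own_domain_share(cited, "example.com")
print(f"own-domain share: {share:.0%}")
```

Matching on the exact host or a dot-prefixed suffix avoids counting lookalike domains such as `notexample.com` as your own.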
None of this is a single tool you buy and switch on. The open-source agents are useful reference points, and the tracking platforms give you the citation layer, but the pipeline that connects rendered-content debt to a collapsing recommendation rate, and then translates it into something a CMO and a developer will both act on, is something you assemble. It took me months of trial and error to get the handoffs between crawl data, source-grounded extraction, citation cross-referencing, and dual-mode synthesis to produce something defensible instead of a pile of disconnected reports. Most of that time was spent learning what not to include.
The brands that win in AI search over the next few years won't be the ones with the most tools. They'll be the ones who built the pipeline, who treat AI visibility as a measurable, diagnosable, fixable system rather than a black box they check on once a quarter.
Need this deployed for your brand? I build custom LLM visibility audits and tracking infrastructures for enterprise brands.
If your SEO dashboard is green but the models aren't recommending you, that's the exact gap this framework was built to close. Let's fix it: get in touch via the contact form on my homepage or go to the contact page to drop me a line directly.