Name: GEO Audit -- 100-Site Survey -- 2026-04-29
Published: 2026-04-29
License: https://creativecommons.org/licenses/by/4.0/
Keywords: GEO, Generative Engine Optimization, AI citation, bot crawl, structured data, llms.txt, sitemap, Source Grounding Ratio, Relevance Ratio, Retrieval Token Cost

Executive Summary

13/13	Top 13-Signal Score	top10lists.us, sole perfect score, rank 1 of 98
3/13	Cohort Median (13-signal)
2/8	Cohort Median (8 binary)
98	Sites Measured	of 100 targeted
69	Fully Audited	28 blocked · 1 unreachable · 2 error

Between 2026-03-30 and 2026-04-29 we audited 100 websites across 31 industries against a 13-signal protocol — 8 binary readiness signals plus 5 quantitative metrics — using the v3.4 audit endpoint. The cohort spans real estate, finance, government, news, education, e-commerce, healthcare, AI infrastructure and reference platforms. Every site receives the same scan, the same thresholds, and the same arithmetic. The result is a structural map of where AI retrieval pipelines can ingest the open web today, and where they cannot.

The headline finding is uncomfortable: Source Grounding Ratio, the verifiable-claim density signal, clears at just 11.1% of the cohort. Roughly one site in nine writes pages that an AI can cite with attribution. Top10Lists.us is the sole 13/13 result; the next-highest score is 9/13 (elevenlabs.io). 28 of 100 sites, including The New York Times, CoStar, Apartments.com and Glassdoor, block AI crawlers at the WAF and produce no measurable signals at all. This isn't an edge case: roughly one in three-and-a-half large web properties is structurally invisible to ChatGPT, Claude, Perplexity, and Gemini. Among the 69 sites that complete a full audit, the binary readiness layer alone has a median of 2/8: bots get a meaningful page, sometimes JSON-LD, and very little else.

Per-signal, the failure modes are specific, and they sort cleanly into three layers, not one. Only one of the five quantitative metrics is purely infrastructure; the other four live in the data-structure and content layers, where configuration alone cannot move the score. Sitemap throughput (RPS ≥ 1,000), the lone purely-infrastructure metric, passes at 74.1%, the highest of the five. RPS is a hosting-and-pipeline problem: a properly configured sitemap on competent infrastructure clears it. The other four metrics do not yield to infrastructure work. Relevance Ratio (RR ≥ 0.85) passes at 44.6%, a data-structure problem: more than half the cohort renders templates in which scaffold, not text, takes up more than 15% of the response. Retrieval Token Cost (RTC ≤ 0.50) passes at 29.2%, also a data-structure problem at root, with content as the secondary lever: seven sites in ten spend too many tokens on JS scaffolding and chrome relative to the usable content delivered. Last-Modified Recency (LMR ≤ 30 days) clears at 43.2%, a content-discipline problem: stale sitemaps mean editorial cadence isn't producing fresh material the pipeline can detect, regardless of how the sitemap is hooked up. Source Grounding Ratio (SGR ≥ 0.30, original calibration) clears at 11.1%; it is the moat metric, and the hardest of the five to fake. SGR is hard not because it is the only non-infrastructure signal (most of these are non-infrastructure) but because of which non-infrastructure layer it sits in: it demands a content discipline of attributed, verifiable claims. You cannot configure your way to high SGR; you have to write that way.

The failures are not concentrated in any one industry; they are universal. Government sites such as NASA and data.gov clear SGR but lag on AI-facing infrastructure files (no llms.txt, no MCP, no AI content feed); CDC shows the same infrastructure gaps, and its nonzero SGR (0.0833) still falls short of the 0.30 bar. E-commerce and proptech (Shopify, Stripe, AppFolio, Yardi) clear infrastructure plumbing but all miss grounding (SGR), and most also miss the data-structure layer (RR) or the composite (RTC): the three problems that infrastructure investments alone don't solve. Major news (WSJ, NYT, Bloomberg) blocks bots at the perimeter. Major social and SaaS (LinkedIn, Netflix, Salesforce) sit in the same band as much smaller properties. The notable exceptions prove the moat: data.gov and nasa.gov clear SGR (0.9762 and 0.5000 respectively) on the strength of attributed primary-source content, despite missing several infrastructure signals, confirming that SGR measures genuine content legibility, not plumbing.

For AI discovery, this is the citation problem in concrete form. Retrieval pipelines need four kinds of signal to cite a source efficiently: clean ingestion paths (the 8 binary signals — infrastructure), efficient page structure (RR — data structure), verifiable claim density (SGR — content), and a low overall retrieval cost (RTC — the composite of all three). The 2026-04-29 cohort shows almost no site clearing all four at once. The few that do are over-represented in AI citations relative to their domain authority — the thesis the GeoLocus whitepaper develops in detail. The 9.8% user-bot crawl share observed at Top10Lists.us, the only 13/13 site, is what happens when a small domain solves all four problems the rest of the cohort hasn't: AI assistants reach for it because the path of least resistance leads there.

The full methodology, every command issued, and the script that produces this scorecard are reproducible two ways. Get a free audit of your site in ∼60 seconds against the same 13 signals, or download the reproduction runbook to run the audit on your own machine.

Quantitative Metric Pass Rates

Each of the 5 quantitative metrics is binarized at a threshold. A site earns +1 toward its 13-signal score for each threshold it clears. Pass rates are over the full 98-site measured cohort (null values treated as fail).

Metric	Name	Pass rate	What it measures	Threshold
RR	Relevance Ratio	44.6%	Clean HTML content density delivered to bot UA	RR ≥ 0.85
SGR	Source Grounding	11.1%	Verifiable claims ratio (the moat signal)	SGR ≥ 0.30 (original calibration; current threshold >0.00); 1 in 9 sites clear the original bar
RTC	Retrieval Token Cost	29.2%	Tokens per useful character; lower is better	RTC ≤ 0.50
RPS	Sitemap Throughput	74.1%	URLs indexable per second via sitemap tree	RPS ≥ 1,000
LMR	Last-Modified Recency	43.2%	Median days since sitemap lastmod; lower is better	LMR ≤ 30 days

SGR is the differentiator: At 11.1% cohort pass rate, Source Grounding Ratio is the hardest metric to clear and the strongest competitive moat in the 13-signal rubric. Sites with high external brand authority (data.gov 0.9762, nasa.gov 0.5000) pass SGR despite lower binary signal counts, confirming SGR measures genuine AI-legible content quality, not just infrastructure signals.
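
The binarization and tally described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the published audit-v3.js script: the names `THRESHOLDS`, `passes`, `quantitative_points` and `score_13` are hypothetical, the thresholds are the ones stated in this report (SGR at the original 0.30 calibration), and a null measurement is treated as a fail. The two sample rows are copied from the per-site scorecard.

```python
from typing import Optional

# Thresholds as stated in this report; SGR uses the original 0.30 calibration.
# Each entry maps a metric name to (comparison, threshold).
THRESHOLDS = {
    "RR": (">=", 0.85),
    "SGR": (">=", 0.30),
    "RTC": ("<=", 0.50),
    "RPS": (">=", 1000.0),
    "LMR": ("<=", 30.0),
}


def passes(metric: str, value: Optional[float]) -> bool:
    """True if the value clears the metric's threshold; None (N/A) fails."""
    if value is None:
        return False
    op, threshold = THRESHOLDS[metric]
    return value >= threshold if op == ">=" else value <= threshold


def quantitative_points(metrics: dict) -> int:
    """Number of quantitative thresholds cleared (0 to 5)."""
    return sum(passes(name, metrics.get(name)) for name in THRESHOLDS)


def score_13(binary_signals: list, metrics: dict) -> int:
    """Passing binary signals (S1..S8) plus quantitative points."""
    return sum(bool(s) for s in binary_signals) + quantitative_points(metrics)


# Rows 002 and 003 of the scorecard: S1..S8, then the five metrics.
elevenlabs = score_13(
    [True, True, True, False, True, True, False, True],
    {"RR": 0.902, "SGR": 0.0, "RTC": 7.208, "RPS": 8300.0, "LMR": 1.0},
)
data_gov = score_13(
    [True, False, False, True, False, True, False, False],
    {"RR": 0.884, "SGR": 0.9762, "RTC": 0.080, "RPS": 41700.0, "LMR": 0.0},
)
print(elevenlabs, data_gov)  # 9 8
```

Both totals match the 13★ column for those rows, which suggests the score is a plain sum with no weighting.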

Per-Site Scorecard — All 98 Sites — Sorted by 13-Signal Score

Binary signals: S1 Robots AI bots; S2 llms.txt; S3 llms-full.txt; S4 Sitemap fresh; S5 JSON-LD; S6 Pre-rendered HTML; S7 MCP server; S8 AI content feed. Key: ✓ pass, ✗ fail, — not measured.

Numeric measurements: RR Relevance Ratio (≥ 0.85); SGR ★ Source Grounding (≥ 0.30, the moat); RTC Retrieval Token Cost (≤ 0.50); RPS Sitemap Throughput (≥ 1,000 URLs/sec); LMR Last-Modified Recency (≤ 30 days); 13★ Total 13-signal score (max 13).


#	URL	Outcome	S1	S2	S3	S4	S5	S6	S7	S8	RR	SGR ★	RTC	RPS	LMR (days)	13★
001	top10lists.us	audited	✓	✓	✓	✓	✓	✓	✓	✓	1.000	0.4847	0.086	351.8K	0.7	13
002	elevenlabs.io	audited	✓	✓	✓	✗	✓	✓	✗	✓	0.902	0.0000	7.208	8.3K	1.0	9
003	data.gov	audited	✓	✗	✗	✓	✗	✓	✗	✗	0.884	0.9762	0.080	41.7K	0.0	8
004	edx.org	audited	✓	✓	✗	✓	✓	✓	✗	✗	0.420	N/A	72.022	67.4K	0.1	7
005	supabase.com	audited	✓	✓	✓	✗	✗	✓	✗	✗	0.941	0.0000	0.355	17.0K	N/A	7
006	apartmentlist.com	audited	✓	✗	✗	✓	✓	✓	✗	✗	0.941	0.0000	1.681	857.6K	0.1	7
007	nasa.gov	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.892	0.5000	0.457	515.7K	1007.9	7
008	walmart.com	audited	✓	✓	✓	✗	✓	✓	✗	✓	0.755	0.0000	6.550	N/A	N/A	6
009	shopify.com	audited	✓	✓	✗	✓	✓	✓	✗	✗	0.518	0.0000	2.289	1.31M	451.7	6
010	stripe.com	audited	✓	✓	✗	✗	✓	✓	✗	✗	0.893	0.0000	1.587	2.0K	N/A	6
011	yardi.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.965	N/A	0.256	255	14.5	6
012	bankofamerica.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.992	N/A	0.099	131	12.7	6
013	huggingface.co	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.925	0.0000	0.186	37.8K	2.7	6
014	cloudflare.com	audited	✓	✓	✗	✓	✗	✓	✗	✗	0.711	0.0000	6.081	29.1K	70.6	5
015	appfolio.com	audited	✓	✓	✗	✗	✓	✓	✗	✗	0.818	0.0000	0.822	2.1K	636.7	5
016	coursera.org	audited	✓	✓	✗	✗	✓	✓	✗	✗	0.811	0.0000	27.620	46.6K	N/A	5
017	espn.com	audited	✓	✓	✓	✗	✗	✗	✗	✓	0.789	0.0000	0.160	N/A	N/A	5
018	homelight.com	audited	✗	✓	✓	✗	✓	✓	✗	✗	0.695	0.0000	7.187	2.0K	91.0	5
019	harvard.edu	audited	—	—	—	—	—	—	—	—	0.584	0.0000	0.351	523	257.9	5
020	netflix.com	audited	✗	✓	✓	✗	✗	✓	✗	✓	0.868	0.0000	186.255	N/A	N/A	5
021	bbb.org	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.567	N/A	8.199	1.56M	21.7	5
022	buildium.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.914	0.0000	0.878	2.5K	1793.3	5
023	who.int	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.881	0.0000	0.408	8	N/A	5
024	khanacademy.org	audited	✗	✓	✓	✗	✗	✗	✗	✓	1.000	N/A	0.816	72.9K	106.1	5
025	statefarm.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	1.000	0.0000	0.938	66.0K	1045.7	5
026	wsj.com	audited	✗	✗	✗	✓	✗	✗	✗	✗	1.000	N/A	0.420	1.5K	0.0	5
027	github.com	audited	✓	✓	✗	✗	✗	✓	✗	✓	0.696	0.0000	1.664	N/A	N/A	4
028	lendingtree.com	audited	✓	✗	✗	✓	✓	✓	✗	✗	0.774	0.0625	1.017	N/A	37.0	4
029	ieee.org	audited	✓	✓	✓	✗	✗	✗	✗	✓	N/A	N/A	N/A	N/A	N/A	4
030	notion.so	audited	✓	✓	✗	✗	✗	✓	✓	✗	0.690	0.0000	10.222	N/A	N/A	4
031	apple.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.265	N/A	3.323	23.5K	N/A	4
032	fastexpert.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.806	0.0000	2.732	12.0K	804.9	4
033	realpage.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.192	N/A	5.091	3.0K	1045.1	4
034	progressive.com	audited	✓	✗	✗	✓	✗	✓	✗	✗	0.589	0.0000	3.406	5.3K	581.7	4
035	redfin.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	1.000	N/A	1.209	N/A	N/A	4
036	bankrate.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.546	0.0000	0.525	29.0K	344.8	4
037	nih.gov	audited	✓	✗	✗	✓	✗	✓	✗	✗	0.531	N/A	1.034	6.0K	390.1	4
038	salesforce.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
039	bbc.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	0.899	0.0000	0.422	109	N/A	4
040	arxiv.org	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.964	0.0000	0.051	N/A	N/A	4
041	cdc.gov	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.429	0.0833	0.198	38.7K	644.2	4
042	census.gov	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.855	0.0000	5.224	4.2K	4228.7	4
043	stanford.edu	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.868	0.0000	0.399	N/A	N/A	4
044	cnn.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	0.734	0.0000	5.658	1.3K	1.0	4
045	chase.com	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.060	0.5000	50.349	21.2K	N/A	4
046	fidelity.com	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.406	N/A	123.086	1.8K	0.7	4
047	reddit.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	3
048	wellsfargo.com	audited	✓	✗	✗	✗	✓	✓	✗	✗	0.843	0.0000	0.822	N/A	N/A	3
049	imdb.com	audited	✗	✓	✓	✗	✗	✗	✗	✓	N/A	N/A	N/A	N/A	N/A	3
050	x.com	audited	✗	✓	✓	✗	✗	✗	✗	✓	0.490	N/A	74.213	N/A	N/A	3
051	ratemyagent.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	0.582	0.0000	1.207	47.08M	N/A	3
052	wikidata.org	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.539	0.9500	0.723	N/A	N/A	3
053	mozilla.org	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.447	N/A	0.467	N/A	N/A	3
054	acm.org	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
055	stackoverflow.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	0.881	0.0000	1.239	N/A	N/A	3
056	jstor.org	audited	✗	✗	✗	✗	✓	✓	✗	✗	1.000	0.0000	1.198	N/A	N/A	3
057	mit.edu	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.795	0.0000	0.472	N/A	N/A	3
058	microsoft.com	audited	✓	✗	✗	✗	✗	✓	✗	✗	1.000	N/A	1.127	N/A	N/A	3
059	forbes.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	0.518	0.0000	1.196	747	0.8	3
060	perplexity.ai	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
061	realtor.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
062	npr.org	audited	✗	✗	✗	✗	✗	✓	✗	✗	0.988	0.0000	0.605	110.3K	8389.2	3
063	archive.org	audited	✓	✗	✗	✗	✗	✗	✗	✗	1.000	N/A	0.287	N/A	N/A	3
064	wikipedia.org	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.149	N/A	1.530	N/A	N/A	2
065	pitchbook.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	2
066	hud.gov	audited	✓	✗	✗	✗	✗	✓	✗	✗	0.394	N/A	15.678	N/A	N/A	2
067	turbotenant.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	2
068	anthropic.com	audited	✓	✗	✗	✓	✗	✓	✗	✗	0.388	0.0000	3.267	52	187.7	2
069	google.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
070	substack.com	audited	✓	✗	✗	✗	✗	✗	✗	✗	1.000	N/A	8.220	N/A	37.8	2
071	youtube.com	audited	✓	✗	✗	✗	✗	✗	✗	✗	1.000	N/A	403.929	726	N/A	2
072	openai.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
073	wikimedia.org	audited	✗	✗	✗	✗	✗	✗	✗	✗	1.000	0.0000	0.434	N/A	N/A	2
074	nytimes.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
075	airbnb.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	1
076	webmd.com	audited	✗	✗	✗	✗	✗	✓	✗	✗	0.765	N/A	3.325	N/A	N/A	1
077	zillow.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
078	nerdwallet.com	audited	✗	✗	✗	✗	✗	✓	✗	✗	0.536	0.0323	3.713	862	99.3	1
079	linkedin.com	audited	✗	✗	✗	✗	✗	✓	✗	✗	0.842	0.0000	2.345	N/A	N/A	1
080	indeed.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	1
081	rentcafe.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	1
082	medium.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
083	theguardian.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
084	facebook.com	audited	✗	✗	✗	✗	✗	✗	✗	✗	1.000	N/A	34.532	N/A	N/A	1
085	bloomberg.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
086	w3.org	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
087	fda.gov	unreachable	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
088	amazon.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
089	healthgrades.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
090	yelp.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
091	tripadvisor.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
092	crunchbase.com	audited	✗	✗	✗	✗	✓	✓	✗	✗	1.000	0.0000	0.285	N/A	N/A	4
093	martindale.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
094	avvo.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
095	sec.gov	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
096	glassdoor.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
097	uber.com	error	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
098	costar.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
099	apartments.com	blocked	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0
100	rent.com	error	—	—	—	—	—	—	—	—	N/A	N/A	N/A	N/A	N/A	0

Methodology — 13-Signal Protocol v3.4

8 Binary Signals

Each signal is true/false. A site earns 1 point per passing signal. Blocked/unreachable sites receive 0 for quantitative metrics but may receive partial credit on sitemap-derivable binary signals.

S1	Robots AI bots allowed	robots.txt does not Disallow GPTBot, ClaudeBot, or PerplexityBot
S2	llms.txt present	HTTP 200 on /llms.txt
S3	llms-full.txt present	HTTP 200 on /llms-full.txt
S4	Sitemap fresh	Sitemap lastmod median ≤ 30 days across all URLs
S5	JSON-LD structured data	Valid JSON-LD <script> block present on homepage
S6	Pre-rendered HTML	Homepage delivers meaningful HTML to bot UA (not SPA shell or JS-dependent render)
S7	MCP server live	HTTP 200 on /.well-known/mcp.json
S8	AI content feed	HTTP 200 on /ai-content-index.json or /for-ai
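
Signal S1 can be approximated offline with Python's standard `urllib.robotparser`. This is a minimal sketch, assuming that "does not Disallow" means each of the three named crawlers may fetch the site root. The helper name `allows_ai_bots` and the choice of `/` as the probed path are illustrative and are not taken from audit-v3.js.

```python
from urllib.robotparser import RobotFileParser

# The three crawlers named in the S1 definition.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def allows_ai_bots(robots_txt: str, path: str = "/") -> bool:
    """S1 sketch: True if none of the listed AI crawlers is disallowed for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(bot, path) for bot in AI_BOTS)


# Hypothetical robots.txt bodies for illustration.
open_policy = "User-agent: *\nAllow: /\n"
blocking_policy = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(allows_ai_bots(open_policy))      # True
print(allows_ai_bots(blocking_policy))  # False
```

Signals S2, S3, S7 and S8 are plain HTTP 200 checks on the listed paths and need no parsing step.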

5 Quantitative Metrics (binarized at threshold for 13-signal score)

RR	Relevance Ratio	Clean text chars / total response chars. Threshold: ≥ 0.85
SGR	Source Grounding Ratio ★	Verifiable claims / total claims extracted via Sonnet LLM (claim-extraction-v1). Threshold: ≥ 0.30 (original calibration; current threshold >0.00 — see Threshold recalibration note below). The moat signal.
RTC	Retrieval Token Cost	Response tokens / useful chars × 4. Lower = more efficient retrieval. Threshold: ≤ 0.50
RPS	Sitemap Throughput	URLs indexable per wall-clock second via parallel sitemap tree crawl. Threshold: ≥ 1,000
LMR	Last-Modified Recency	Median days since sitemap lastmod across all indexed URLs. Lower = fresher. Threshold: ≤ 30 days
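
LMR, and the S4 freshness signal that shares its 30-day bar, both come down to a median over sitemap `lastmod` values. The sketch below shows that arithmetic on a single in-memory sitemap. It is an assumption-laden illustration: the function name `lastmod_recency_days` is hypothetical, sitemap index files are not followed, and `lastmod` values without a timezone are assumed to be UTC.

```python
import statistics
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def lastmod_recency_days(sitemap_xml: str, now: datetime) -> Optional[float]:
    """Median age in days of all <lastmod> values; None if there are none."""
    root = ET.fromstring(sitemap_xml)
    ages = []
    for node in root.iter(SITEMAP_NS + "lastmod"):
        text = (node.text or "").strip()
        if not text:
            continue
        # Accept both date-only and full timestamps; map a trailing Z to +00:00.
        stamp = datetime.fromisoformat(text.replace("Z", "+00:00"))
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        ages.append((now - stamp).total_seconds() / 86400)
    return statistics.median(ages) if ages else None


# Hypothetical three-URL sitemap, measured as of the survey date.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-04-28</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2026-04-19T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/c</loc><lastmod>2025-04-29</lastmod></url>
</urlset>"""
now = datetime(2026, 4, 29, tzinfo=timezone.utc)
lmr = lastmod_recency_days(sample, now)
print(lmr, lmr <= 30)  # 10.0 True
```

Using the median rather than the mean keeps one very old URL (365 days in the sample) from pushing an otherwise fresh sitemap over the threshold.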

Threshold recalibration — SGR (2026-05-05)

The SGR pass threshold has since been recalibrated from ≥ 0.30 to >0.00 for forward-looking audits. The original 0.30 threshold proved unachievable for ∼94% of the cohort — max observed SGR across the measured 100-site survey (excluding Top10Lists.us) was 0.0620, so 0.30 produced a near-universal fail that masked meaningful differences between sites that have attribution discipline at low density and sites that have none. This page reports findings under the original 0.30 calibration — the 11.1% cohort pass rate, the per-site SGR scores in the scorecard, and the “1 in 9 sites clear” framing all reflect the original threshold. Live prospect audits at staging.geolocus.ai/api/audit use the recalibrated >0.00 threshold; a separate methodology page documenting the recalibration is in progress.
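
The practical effect of the recalibration shows up when both rules are applied to a few SGR values from the scorecard. This is a small sketch with hypothetical function names; treating N/A as a fail under the recalibrated rule is an assumption carried over from how null values are handled for the pass rates.

```python
from typing import Optional


def sgr_pass_original(sgr: Optional[float]) -> bool:
    """Original calibration: SGR >= 0.30; N/A fails."""
    return sgr is not None and sgr >= 0.30


def sgr_pass_recalibrated(sgr: Optional[float]) -> bool:
    """Recalibrated rule: SGR > 0.00; N/A assumed to fail."""
    return sgr is not None and sgr > 0.0


# SGR values copied from scorecard rows; None stands for N/A.
rows = {
    "top10lists.us": 0.4847,
    "cdc.gov": 0.0833,
    "lendingtree.com": 0.0625,
    "nerdwallet.com": 0.0323,
    "elevenlabs.io": 0.0,
    "edx.org": None,
}
for site, sgr in rows.items():
    print(site, sgr_pass_original(sgr), sgr_pass_recalibrated(sgr))
```

Sites such as cdc.gov, lendingtree.com and nerdwallet.com fail the original bar but pass the recalibrated one, while a measured 0.0000 fails both.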

Reproduce Any Cell

The v3.4 13-signal audit is fully reproducible two ways:

Free in-page audit. Fill the form above and we'll audit your site live in ∼60 seconds against the same 13 signals. Output renders inline, byte-for-byte matching the per-site scorecard schema above.
Self-host. Download the reproduction runbook to get the script, the 100-site cohort definition, the threshold definitions, and the full output schema. Run it locally with Node 20.x — no paid API keys required for the 8 binary signals or RR/RTC/RPS/LMR.

The endpoint behind both paths:

curl "https://staging.geolocus.ai/api/audit?url=https%3A%2F%2Fyour-site.com"
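
For scripted use, the same request URL can be built with the target percent-encoded. The sketch below only constructs the URL; it does not send the request or parse a response, since the response schema is not documented on this page. The helper name `build_audit_url` is hypothetical.

```python
from urllib.parse import quote

# Endpoint as given in the curl command above.
AUDIT_ENDPOINT = "https://staging.geolocus.ai/api/audit"


def build_audit_url(site_url: str) -> str:
    """Percent-encode the target URL, including ':' and '/', into the url parameter."""
    return f"{AUDIT_ENDPOINT}?url={quote(site_url, safe='')}"


print(build_audit_url("https://your-site.com"))
# https://staging.geolocus.ai/api/audit?url=https%3A%2F%2Fyour-site.com
```

Passing `safe=''` matters here: by default `quote` leaves `/` unescaped, which would not match the encoding used in the curl example.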

Source: audit-v3.js — v3.4 deploy on Cloudflare Workers via staging.geolocus.ai. Run: 2026-04-29T16:23:48Z • 98 of 100 sites measured. top10lists.us SGR updated to 0.4847 from live re-run (post-PR-308/309 self-citation improvements; original batch measurement was 0.3577).
