Executive Summary

Top 13-Signal Score: 13/13 (top10lists.us, sole perfect score, rank 1 of 98)
Cohort Median (13-signal): 3/13
Cohort Median (8 binary): 2/8
Sites Measured: 98 of 100 targeted
Fully Audited: 69 (28 blocked · 1 unreachable · 2 error)

Between 2026-03-30 and 2026-04-29 we audited 100 websites across 31 industries against a 13-signal protocol — 8 binary readiness signals plus 5 quantitative metrics — using the v3.4 audit endpoint. The cohort spans real estate, finance, government, news, education, e-commerce, healthcare, AI infrastructure and reference platforms. Every site receives the same scan, the same thresholds, and the same arithmetic. The result is a structural map of where AI retrieval pipelines can ingest the open web today, and where they cannot.

The headline finding is uncomfortable: Source Grounding Ratio, the verifiable-claim density signal, is cleared by just 11.1% of the cohort. Roughly one site in nine writes pages that an AI can cite with attribution. Top10Lists.us is the sole 13/13 result; the next-highest score is 9/13 (elevenlabs.io). 28 of 100 sites, including The New York Times, Bloomberg, CoStar, Apartments.com and Glassdoor, block AI crawlers at the WAF and yield no quantitative measurements at all. This isn't an edge case: in this cohort, roughly one in three-and-a-half large web properties is structurally invisible to ChatGPT, Claude, Perplexity, and Gemini. Among the 69 sites that complete a full audit, the binary readiness layer alone has a median of 2/8: bots get a meaningful page, sometimes JSON-LD, and very little else.

Per signal, the failure modes are specific, and they sort cleanly into three layers: infrastructure, data structure, and content. Only one of the five quantitative metrics is purely infrastructure. Sitemap throughput (RPS ≥ 1,000) passes at 74.1%, the highest of the five. RPS is a hosting-and-pipeline problem: a properly configured sitemap on competent infrastructure clears it.

The other four metrics live in the data-structure and content layers, where configuration alone cannot move the score. Relevance Ratio (RR ≥ 0.85) passes at 44.6%. This is a data-structure problem: more than half the cohort renders templates in which over 15% of the response is scaffold rather than text. Retrieval Token Cost (RTC ≤ 0.50) passes at 29.2%. It is also a data-structure problem at root, with content as the secondary lever: seven sites in ten spend too many tokens on JS scaffolding and chrome for each character of usable content. Last-Modified Recency (LMR ≤ 30 days) clears at 43.2%. This is a content-discipline problem: stale sitemaps mean editorial cadence isn't producing fresh material the pipeline can detect, regardless of how the sitemap is hooked up.

Source Grounding Ratio (SGR ≥ 0.30, original calibration) clears at 11.1%. It is the moat metric and the hardest of the five to fake. SGR is hard not because it is the only non-infrastructure signal (most of these are) but because of the layer it sits in: it demands a content discipline of attributed, verifiable claims. You cannot configure your way to high SGR; you have to write that way.

The failures are not concentrated in any one industry; they are universal. Government sites (NASA, data.gov) clear SGR but lag on AI-facing infrastructure files (no llms.txt, no MCP, no AI content feed). E-commerce and proptech (Shopify, Stripe, AppFolio, Yardi) clear infrastructure plumbing but fail on the data-structure layer (RR), grounding (SGR), and the composite (RTC), the three problems that infrastructure investments alone don't solve. Major news (NYT, Bloomberg, The Guardian) blocks bots at the perimeter. Major social and SaaS (LinkedIn, Netflix, Salesforce) sit in the same band as much smaller properties. The notable exceptions prove the moat: data.gov and nasa.gov clear SGR (0.9762 and 0.5000 respectively) on the strength of attributed primary-source content, despite missing several infrastructure signals, which indicates that SGR measures genuine content legibility, not plumbing.

For AI discovery, this is the citation problem in concrete form. Retrieval pipelines need four kinds of signal to cite a source efficiently: clean ingestion paths (the 8 binary signals — infrastructure), efficient page structure (RR — data structure), verifiable claim density (SGR — content), and a low overall retrieval cost (RTC — the composite of all three). The 2026-04-29 cohort shows almost no site clearing all four at once. The few that do are over-represented in AI citations relative to their domain authority — the thesis the GeoLocus whitepaper develops in detail. The 9.8% user-bot crawl share observed at Top10Lists.us, the only 13/13 site, is what happens when a small domain solves all four problems the rest of the cohort hasn't: AI assistants reach for it because the path of least resistance leads there.

The full methodology, every command issued, and the script that produces this scorecard are reproducible two ways. Get a free audit of your site in ∼60 seconds against the same 13 signals, or download the reproduction runbook to run the audit on your own machine.

Quantitative Metric Pass Rates

Each of the 5 quantitative metrics is binarized at a threshold. A site earns +1 toward its 13-signal score for each threshold it clears. Pass rates are over the full 98-site measured cohort (null values treated as fail).

RR (Relevance Ratio): 44.6% pass. Clean HTML content density delivered to bot UA. Threshold: RR ≥ 0.85
SGR (Source Grounding): 11.1% pass. Verifiable claims ratio; the moat signal. Threshold: SGR ≥ 0.30 (original calibration; current threshold >0.00) • 1 in 9 sites clear the original bar
RTC (Retrieval Token Cost): 29.2% pass. Tokens per useful character; lower is better. Threshold: RTC ≤ 0.50
RPS (Sitemap Throughput): 74.1% pass. URLs indexable per second via sitemap tree. Threshold: RPS ≥ 1,000
LMR (Last-Modified Recency): 43.2% pass. Median days since sitemap lastmod; lower is better. Threshold: LMR ≤ 30 days

SGR is the differentiator: At 11.1% cohort pass rate, Source Grounding Ratio is the hardest metric to clear and the strongest competitive moat in the 13-signal rubric. Sites with high external brand authority (data.gov 0.9762, nasa.gov 0.5000) pass SGR despite lower binary signal counts, confirming SGR measures genuine AI-legible content quality, not just infrastructure signals.
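
The binarization described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the audit script itself: the names `THRESHOLDS`, `metric_points` and `total_score` are hypothetical, and the cutoffs are copied from the metric list above (SGR at its original 0.30 calibration). A missing measurement is scored as a fail, matching the null handling stated for the pass rates.

```python
from typing import Dict, Optional

# Cutoffs as stated in the metric list. "ge" passes when value >= cutoff,
# "le" passes when value <= cutoff.
THRESHOLDS = {
    "RR": ("ge", 0.85),
    "SGR": ("ge", 0.30),  # original calibration
    "RTC": ("le", 0.50),
    "RPS": ("ge", 1000.0),
    "LMR": ("le", 30.0),
}


def metric_points(metrics: Dict[str, Optional[float]]) -> Dict[str, int]:
    """Return 1 or 0 per metric; a missing or None value counts as a fail."""
    points = {}
    for name, (op, cutoff) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            points[name] = 0
        elif op == "ge":
            points[name] = int(value >= cutoff)
        else:
            points[name] = int(value <= cutoff)
    return points


def total_score(binary_passes: int, metrics: Dict[str, Optional[float]]) -> int:
    """13-signal score: passing binary signals plus passing metric thresholds."""
    return binary_passes + sum(metric_points(metrics).values())
```

With the top10lists.us values from the per-site scorecard (RR 1.000, SGR 0.4847, RTC 0.086, RPS 351.8K, LMR 0.7), all five thresholds clear, so eight passing binary signals give a total of 13.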

Per-Site Scorecard — All 98 Sites — Sorted by 13-Signal Score

Binary signals: S1 Robots AI bots S2 llms.txt S3 llms-full.txt S4 Sitemap fresh S5 JSON-LD S6 Pre-rendered HTML S7 MCP server S8 AI content feed    ✓ pass ✗ fail — not measured
Numeric measurements: RR Relevance Ratio (≥ 0.85) · SGR ★ Source Grounding (≥ 0.30, the moat) · RTC Retrieval Token Cost (≤ 0.50) · RPS Sitemap Throughput (≥ 1,000 URLs/sec) · LMR Last-Modified Recency (≤ 30 days) · 13★ Total 13-signal score (max 13)

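| # | URL | Outcome | RR | SGR ★ | RTC | RPS | LMR (days) | 13★ |
|---|---|---|---|---|---|---|---|---|
| 001 | top10lists.us | audited | 1.000 | 0.4847 | 0.086 | 351.8K | 0.7 | 13 |
| 002 | elevenlabs.io | audited | 0.902 | 0.0000 | 7.208 | 8.3K | 1.0 | 9 |
| 003 | data.gov | audited | 0.884 | 0.9762 | 0.080 | 41.7K | 0.0 | 8 |
| 004 | edx.org | audited | 0.420 | N/A | 72.022 | 67.4K | 0.1 | 7 |
| 005 | supabase.com | audited | 0.941 | 0.0000 | 0.355 | 17.0K | N/A | 7 |
| 006 | apartmentlist.com | audited | 0.941 | 0.0000 | 1.681 | 857.6K | 0.1 | 7 |
| 007 | nasa.gov | audited | 0.892 | 0.5000 | 0.457 | 515.7K | 1007.9 | 7 |
| 008 | walmart.com | audited | 0.755 | 0.0000 | 6.550 | N/A | N/A | 6 |
| 009 | shopify.com | audited | 0.518 | 0.0000 | 2.289 | 1.31M | 451.7 | 6 |
| 010 | stripe.com | audited | 0.893 | 0.0000 | 1.587 | 2.0K | N/A | 6 |
| 011 | yardi.com | audited | 0.965 | N/A | 0.256 | 25514.5† | † | 6 |
| 012 | bankofamerica.com | audited | 0.992 | N/A | 0.099 | 13112.7† | † | 6 |
| 013 | huggingface.co | audited | 0.925 | 0.0000 | 0.186 | 37.8K | 2.7 | 6 |
| 014 | cloudflare.com | audited | 0.711 | 0.0000 | 6.081 | 29.1K | 70.6 | 5 |
| 015 | appfolio.com | audited | 0.818 | 0.0000 | 0.822 | 2.1K | 636.7 | 5 |
| 016 | coursera.org | audited | 0.811 | 0.0000 | 27.620 | 46.6K | N/A | 5 |
| 017 | espn.com | audited | 0.789 | 0.0000 | 0.160 | N/A | N/A | 5 |
| 018 | homelight.com | audited | 0.695 | 0.0000 | 7.187 | 2.0K | 91.0 | 5 |
| 019 | harvard.edu | audited | 0.584 | 0.0000 | 0.351 | 523257.9† | † | 5 |
| 020 | netflix.com | audited | 0.868 | 0.0000 | 186.255 | N/A | N/A | 5 |
| 021 | bbb.org | audited | 0.567 | N/A | 8.199 | 1.56M | 21.7 | 5 |
| 022 | buildium.com | audited | 0.914 | 0.0000 | 0.878 | 2.5K | 1793.3 | 5 |
| 023 | who.int | audited | 0.881 | 0.0000 | 0.408 | 8 | N/A | 5 |
| 024 | khanacademy.org | audited | 1.000 | N/A | 0.816 | 72.9K | 106.1 | 5 |
| 025 | statefarm.com | audited | 1.000 | 0.0000 | 0.938 | 66.0K | 1045.7 | 5 |
| 026 | wsj.com | audited | 1.000 | N/A | 0.420 | 1.5K | 0.0 | 5 |
| 027 | github.com | audited | 0.696 | 0.0000 | 1.664 | N/A | N/A | 4 |
| 028 | lendingtree.com | audited | 0.774 | 0.0625 | 1.017 | N/A | 37.0 | 4 |
| 029 | ieee.org | audited | N/A | N/A | N/A | N/A | N/A | 4 |
| 030 | notion.so | audited | 0.690 | 0.0000 | 10.222 | N/A | N/A | 4 |
| 031 | apple.com | audited | 0.265 | N/A | 3.323 | 23.5K | N/A | 4 |
| 032 | fastexpert.com | audited | 0.806 | 0.0000 | 2.732 | 12.0K | 804.9 | 4 |
| 033 | realpage.com | audited | 0.192 | N/A | 5.091 | 3.0K | 1045.1 | 4 |
| 034 | progressive.com | audited | 0.589 | 0.0000 | 3.406 | 5.3K | 581.7 | 4 |
| 035 | redfin.com | audited | 1.000 | N/A | 1.209 | N/A | N/A | 4 |
| 036 | bankrate.com | audited | 0.546 | 0.0000 | 0.525 | 29.0K | 344.8 | 4 |
| 037 | nih.gov | audited | 0.531 | N/A | 1.034 | 6.0K | 390.1 | 4 |
| 038 | salesforce.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 039 | bbc.com | audited | 0.899 | 0.0000 | 0.422 | 109 | N/A | 4 |
| 040 | arxiv.org | audited | 0.964 | 0.0000 | 0.051 | N/A | N/A | 4 |
| 041 | cdc.gov | audited | 0.429 | 0.0833 | 0.198 | 38.7K | 644.2 | 4 |
| 042 | census.gov | audited | 0.855 | 0.0000 | 5.224 | 4.2K | 4228.7 | 4 |
| 043 | stanford.edu | audited | 0.868 | 0.0000 | 0.399 | N/A | N/A | 4 |
| 044 | cnn.com | audited | 0.734 | 0.0000 | 5.658 | 1.3K | 1.0 | 4 |
| 045 | chase.com | audited | 0.060 | 0.5000 | 50.349 | 21.2K | N/A | 4 |
| 046 | fidelity.com | audited | 0.406 | N/A | 123.086 | 1.8K | 0.7 | 4 |
| 047 | reddit.com | blocked | N/A | N/A | N/A | N/A | N/A | 3 |
| 048 | wellsfargo.com | audited | 0.843 | 0.0000 | 0.822 | N/A | N/A | 3 |
| 049 | imdb.com | audited | N/A | N/A | N/A | N/A | N/A | 3 |
| 050 | x.com | audited | 0.490 | N/A | 74.213 | N/A | N/A | 3 |
| 051 | ratemyagent.com | audited | 0.582 | 0.0000 | 1.207 | 47.08M | N/A | 3 |
| 052 | wikidata.org | audited | 0.539 | 0.9500 | 0.723 | N/A | N/A | 3 |
| 053 | mozilla.org | audited | 0.447 | N/A | 0.467 | N/A | N/A | 3 |
| 054 | acm.org | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 055 | stackoverflow.com | audited | 0.881 | 0.0000 | 1.239 | N/A | N/A | 3 |
| 056 | jstor.org | audited | 1.000 | 0.0000 | 1.198 | N/A | N/A | 3 |
| 057 | mit.edu | audited | 0.795 | 0.0000 | 0.472 | N/A | N/A | 3 |
| 058 | microsoft.com | audited | 1.000 | N/A | 1.127 | N/A | N/A | 3 |
| 059 | forbes.com | audited | 0.518 | 0.0000 | 1.196 | 7470.8† | † | 3 |
| 060 | perplexity.ai | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 061 | realtor.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 062 | npr.org | audited | 0.988 | 0.0000 | 0.605 | 110.3K | 8389.2 | 3 |
| 063 | archive.org | audited | 1.000 | N/A | 0.287 | N/A | N/A | 3 |
| 064 | wikipedia.org | audited | 0.149 | N/A | 1.530 | N/A | N/A | 2 |
| 065 | pitchbook.com | blocked | N/A | N/A | N/A | N/A | N/A | 2 |
| 066 | hud.gov | audited | 0.394 | N/A | 15.678 | N/A | N/A | 2 |
| 067 | turbotenant.com | blocked | N/A | N/A | N/A | N/A | N/A | 2 |
| 068 | anthropic.com | audited | 0.388 | 0.0000 | 3.267 | 52187.7† | † | 2 |
| 069 | google.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 070 | substack.com | audited | 1.000 | N/A | 8.220 | N/A | 37.8 | 2 |
| 071 | youtube.com | audited | 1.000 | N/A | 403.929 | 726 | N/A | 2 |
| 072 | openai.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 073 | wikimedia.org | audited | 1.000 | 0.0000 | 0.434 | N/A | N/A | 2 |
| 074 | nytimes.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 075 | airbnb.com | blocked | N/A | N/A | N/A | N/A | N/A | 1 |
| 076 | webmd.com | audited | 0.765 | N/A | 3.325 | N/A | N/A | 1 |
| 077 | zillow.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 078 | nerdwallet.com | audited | 0.536 | 0.0323 | 3.713 | 86299.3† | † | 1 |
| 079 | linkedin.com | audited | 0.842 | 0.0000 | 2.345 | N/A | N/A | 1 |
| 080 | indeed.com | blocked | N/A | N/A | N/A | N/A | N/A | 1 |
| 081 | rentcafe.com | blocked | N/A | N/A | N/A | N/A | N/A | 1 |
| 082 | medium.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 083 | theguardian.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 084 | facebook.com | audited | 1.000 | N/A | 34.532 | N/A | N/A | 1 |
| 085 | bloomberg.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 086 | w3.org | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 087 | fda.gov | unreachable | N/A | N/A | N/A | N/A | N/A | 0 |
| 088 | amazon.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 089 | healthgrades.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 090 | yelp.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 091 | tripadvisor.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 092 | crunchbase.com | audited | 1.000 | 0.0000 | 0.285 | N/A | N/A | 4 |
| 093 | martindale.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 094 | avvo.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 095 | sec.gov | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 096 | glassdoor.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 097 | uber.com | error | N/A | N/A | N/A | N/A | N/A | 0 |
| 098 | costar.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 099 | apartments.com | blocked | N/A | N/A | N/A | N/A | N/A | 0 |
| 100 | rent.com | error | N/A | N/A | N/A | N/A | N/A | 0 |

† RPS and LMR run together in this row and the boundary between the two values cannot be determined, so the digits are shown unsplit in the RPS column.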

Methodology — 13-Signal Protocol v3.4

8 Binary Signals

Each signal is true/false. A site earns 1 point per passing signal. Blocked/unreachable sites receive 0 for quantitative metrics but may receive partial credit on sitemap-derivable binary signals.

S1 Robots AI bots allowed: robots.txt does not Disallow GPTBot, ClaudeBot, or PerplexityBot
S2 llms.txt present: HTTP 200 on /llms.txt
S3 llms-full.txt present: HTTP 200 on /llms-full.txt
S4 Sitemap fresh: Sitemap lastmod median ≤ 30 days across all URLs
S5 JSON-LD structured data: Valid JSON-LD <script> block present on homepage
S6 Pre-rendered HTML: Homepage delivers meaningful HTML to bot UA (not SPA shell or JS-dependent render)
S7 MCP server live: HTTP 200 on /.well-known/mcp.json
S8 AI content feed: HTTP 200 on /ai-content-index.json or /for-ai
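
The five signals that reduce to a robots.txt rule or an HTTP status can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the audit-v3.js logic: S1 is read as "each of the three named bots may fetch the site root", robots.txt parsing is delegated to the standard-library `urllib.robotparser`, and `get_status` is a hypothetical caller-supplied function that returns the HTTP status code for a URL. S4, S5 and S6 need sitemap and HTML inspection and are left out.

```python
from typing import Callable, Dict
from urllib.robotparser import RobotFileParser

# Bots named in the S1 definition.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# Paths from the S2, S3, S7 and S8 definitions; S8 passes if either path answers 200.
STATUS_PATHS = {
    "S2": ("/llms.txt",),
    "S3": ("/llms-full.txt",),
    "S7": ("/.well-known/mcp.json",),
    "S8": ("/ai-content-index.json", "/for-ai"),
}


def robots_allows_ai_bots(robots_txt: str, site_root: str) -> bool:
    """S1 (assumed reading): every named bot may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(bot, site_root + "/") for bot in AI_BOTS)


def status_signals(site_root: str, get_status: Callable[[str], int]) -> Dict[str, bool]:
    """S2, S3, S7, S8: pass when any listed path returns HTTP 200."""
    return {
        signal: any(get_status(site_root + path) == 200 for path in paths)
        for signal, paths in STATUS_PATHS.items()
    }
```

Passing the status lookup in as a function keeps the sketch independent of any particular HTTP client and lets it be exercised without network access.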

5 Quantitative Metrics (binarized at threshold for 13-signal score)

RR (Relevance Ratio): Clean text chars / total response chars. Threshold: ≥ 0.85
SGR (Source Grounding Ratio ★): Verifiable claims / total claims extracted via Sonnet LLM (claim-extraction-v1). Threshold: ≥ 0.30 (original calibration; current threshold >0.00; see Threshold recalibration note below). The moat signal.
RTC (Retrieval Token Cost): Response tokens / useful chars × 4. Lower = more efficient retrieval. Threshold: ≤ 0.50
RPS (Sitemap Throughput): URLs indexable per wall-clock second via parallel sitemap tree crawl. Threshold: ≥ 1,000
LMR (Last-Modified Recency): Median days since sitemap lastmod across all indexed URLs. Lower = fresher. Threshold: ≤ 30 days
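
Two of these definitions translate directly into code. The Python sketch below computes RR as the stated character ratio and LMR as the median age of sitemap lastmod values. It assumes a single sitemaps.org urlset document with ISO 8601 lastmod values; the function names are hypothetical, and how the audit extracts "clean text" or walks nested sitemap indexes is not specified here, so neither is attempted.

```python
import statistics
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional

# Namespace used by sitemaps.org sitemap documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def relevance_ratio(clean_text_chars: int, total_response_chars: int) -> float:
    """RR: clean text characters divided by total response characters."""
    if total_response_chars == 0:
        return 0.0
    return clean_text_chars / total_response_chars


def median_lastmod_age_days(sitemap_xml: str, now: datetime) -> Optional[float]:
    """LMR: median days between each <lastmod> and `now`; None if no lastmod exists."""
    root = ET.fromstring(sitemap_xml)
    ages = []
    for node in root.iter(SITEMAP_NS + "lastmod"):
        if not node.text:
            continue
        text = node.text.strip()
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        stamp = datetime.fromisoformat(text)
        if stamp.tzinfo is None:
            # Assumption: date-only or naive values are treated as UTC.
            stamp = stamp.replace(tzinfo=timezone.utc)
        ages.append((now - stamp).total_seconds() / 86400)
    return statistics.median(ages) if ages else None
```

The `now` argument is passed in explicitly so that a stored sitemap can be re-scored against the original run date.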

Threshold recalibration — SGR (2026-05-05)

The SGR pass threshold has since been recalibrated from ≥ 0.30 to >0.00 for forward-looking audits. The original 0.30 threshold proved unachievable for ∼94% of the cohort. Apart from the five sites that cleared it, the highest SGR in the scorecard is 0.0833 (cdc.gov), so 0.30 produced a near-universal fail that masked meaningful differences between sites that have attribution discipline at low density and sites that have none. This page reports findings under the original 0.30 calibration: the 11.1% cohort pass rate, the per-site SGR scores in the scorecard, and the “1 in 9 sites clear” framing all reflect the original threshold. Live prospect audits at staging.geolocus.ai/api/audit use the recalibrated >0.00 threshold; a separate methodology page documenting the recalibration is in progress.
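
The effect of the recalibration can be illustrated with a handful of SGR values taken from the scorecard. The Python sketch below is illustrative only: the helper names are hypothetical, and treating N/A as a fail under both thresholds is an assumption carried over from the null handling stated for the pass rates.

```python
from typing import Dict, List, Optional

# SGR values copied from the per-site scorecard; None stands for N/A.
SGR_SAMPLE: Dict[str, Optional[float]] = {
    "data.gov": 0.9762,
    "nasa.gov": 0.5000,
    "cdc.gov": 0.0833,
    "lendingtree.com": 0.0625,
    "nerdwallet.com": 0.0323,
    "stripe.com": 0.0000,
    "yardi.com": None,
}


def passes_original(sgr: Optional[float]) -> bool:
    """Original calibration: SGR >= 0.30."""
    return sgr is not None and sgr >= 0.30


def passes_recalibrated(sgr: Optional[float]) -> bool:
    """Recalibrated threshold: SGR > 0.00."""
    return sgr is not None and sgr > 0.0


def flipped_by_recalibration(sample: Dict[str, Optional[float]]) -> List[str]:
    """Sites that fail at 0.30 but pass at >0.00."""
    return [
        site
        for site, sgr in sample.items()
        if passes_recalibrated(sgr) and not passes_original(sgr)
    ]
```

On this sample, `flipped_by_recalibration(SGR_SAMPLE)` returns cdc.gov, lendingtree.com and nerdwallet.com: sites with some attributed claims at low density, which the original bar scored the same as sites with none.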

Reproduce Any Cell

The v3.4 13-signal audit is fully reproducible two ways:

  1. Free in-page audit. Fill the form above and we'll audit your site live in ∼60 seconds against the same 13 signals. Output renders inline, byte-for-byte matching the per-site scorecard schema above.
  2. Self-host. Download the reproduction runbook to get the script, the 100-site cohort definition, the threshold definitions, and the full output schema. Run it locally with Node 20.x — no paid API keys required for the 8 binary signals or RR/RTC/RPS/LMR.

The endpoint behind both paths:

curl "https://staging.geolocus.ai/api/audit?url=https%3A%2F%2Fyour-site.com"
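
The same request can be issued from Python. The sketch below only builds the percent-encoded URL shown in the curl command and returns the response body as text; the response format is not documented in this section, so no parsing is attempted. The function names and the 90-second timeout are illustrative assumptions.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint taken from the curl command in this section.
AUDIT_ENDPOINT = "https://staging.geolocus.ai/api/audit"


def audit_request_url(site_url: str) -> str:
    """Build the audit URL with the target site percent-encoded in `url`."""
    return f"{AUDIT_ENDPOINT}?url={quote(site_url, safe='')}"


def fetch_audit(site_url: str, timeout: float = 90.0) -> str:
    """Call the endpoint and return the raw response body as text."""
    with urlopen(audit_request_url(site_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Passing `safe=''` to `quote` makes it encode the colon and slashes too, which reproduces the `https%3A%2F%2F` form used in the curl example.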

Source: audit-v3.js — v3.4 deploy on Cloudflare Workers via staging.geolocus.ai. Run: 2026-04-29T16:23:48Z • 98 of 100 sites measured. top10lists.us SGR updated to 0.4847 from live re-run (post-PR-308/309 self-citation improvements; original batch measurement was 0.3577).