Can't Know What You Don't Know — She's in the Attic

There is a character in the 1984 film Supergirl — Zaltar, played by the legendary Peter O'Toole — who spends the entire film searching the Phantom Zone for the one source of power that will let him outwit the villain Selena (Faye Dunaway, Columbo). He travels extraordinary distances, accumulates considerable knowledge, and ultimately fails — not because he lacks effort or intelligence, but because he is operating within a frame of reference that was never large enough to begin with. The answer was somewhere he was not looking.

No matter how much knowledge you accumulate, there will always be more out there. Until you find the edges of your own frame, you are stuck within limitations you may not even know exist. You cannot know what you do not know — and the dangerous assumption is that what you can see is the complete picture.

This has a point. Bear with me.

What Google admitted under oath

In October 2023, during the DOJ antitrust case United States v. Google, Pandu Nayak — Google's VP of Search — was cross-examined about how Google's ranking systems actually work. The exchange that followed was more revealing than anything Google has published voluntarily in twenty years of SEO guidance.

Q: RankBrain looks at the top 20 or 30 documents and may adjust their initial score. Is that right?

A: That is correct.

Q: And RankBrain is an expensive process to run?

A: It's certainly more expensive than some of our other ranking components.

Q: So that's, in part, one of the reasons why you just wait until you're down to the final 20 or 30 before you run RankBrain?

A: That is correct.

Q: RankBrain is too expensive to run on hundreds or thousands of results?

A: That is correct.

Source: United States v. Google, Day 24 transcript, October 2023 — Pandu Nayak cross-examination, page 6431

Four consecutive confirmations. The deep-learning component of Google's ranking — the layer that SEOs have built a decade of theory around — is deliberately withheld from the bulk of the index because Google cannot afford to apply it more broadly. The corpus gets culled to tens of thousands of pages first, and from that pool only the top 20 to 30 reach the deep-learning layer at all.

The industry has treated RankBrain and BERT as the definition of how Google ranks. Under oath, Nayak described them as expensive optional layers applied to a narrow window that classical retrieval has already culled. The number is not 20 to 30 because of some fundamental truth about search. It is 20 to 30 because that is what Google's hardware budget would support. The constraint has held — until now.
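To make the shape of that pipeline concrete, here is a minimal sketch of a two-stage ranker: a cheap scorer orders the whole pool, and an expensive scorer is run only on the top window. This is an illustration of the general pattern described in the testimony, not Google's implementation. The function names, the stand-in scorers and the default window of 30 are my own assumptions, and where Nayak said RankBrain "may adjust" an initial score, this sketch simply re-sorts the window.

```python
from typing import Callable, Sequence


def two_stage_rank(
    docs: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    rerank_window: int = 30,  # assumption: mirrors the "20 or 30" in the testimony
) -> list[str]:
    """Order every doc with the cheap scorer, then rescore only the top window."""
    # Stage 1: the cheap scorer runs over the whole candidate pool.
    first_pass = sorted(docs, key=cheap_score, reverse=True)
    head = first_pass[:rerank_window]
    tail = first_pass[rerank_window:]

    # Stage 2: the expensive scorer only ever sees the head.
    reranked_head = sorted(head, key=expensive_score, reverse=True)

    # Anything outside the window keeps its cheap-stage position.
    return reranked_head + tail


if __name__ == "__main__":
    pool = ["a", "bb", "ccc", "dddd"]
    # Hypothetical scorers: length stands in for a cheap relevance signal.
    print(two_stage_rank(pool, cheap_score=len,
                         expensive_score=lambda d: -len(d), rerank_window=2))
```

The point of the sketch is the last line of the function: a page that lands outside the window is never seen by the expensive layer, however well that layer would have scored it.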

"You have been optimising for a window you did not know was artificially narrow. Google has now told us exactly why — and published what comes next."

TurboQuant and the widening window

In March 2026, Google Research published a technique called TurboQuant, a vector quantization (sic) method that compresses the representations used in nearest-neighbour search by a factor of four to four and a half, with performance described as comparable to unquantized models. Indexing time is reduced to, in the paper's words, virtually zero.

If the memory cost of evaluating candidates drops by 75%, the economics that held RankBrain at 20 to 30 candidates no longer apply. A system running on the same hardware could plausibly evaluate a candidate set several times larger. TurboQuant has not been confirmed as deployed in Google Search — the March 2026 core update carried no public commentary linking it to retrieval efficiency. But Google has published the algorithm. The question has shifted from whether the window can widen to what you do before it does.
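The arithmetic behind that claim is simple enough to write down. The sketch below converts a compression factor into a memory saving and into the window that would fit in the same memory. It rests on a loud assumption that the source does not confirm: that memory is the binding constraint and that cost scales linearly with the number of candidates. The function names are hypothetical.

```python
def memory_saving(compression_factor: float) -> float:
    """Fraction of memory saved when vectors shrink by the given factor."""
    return 1 - 1 / compression_factor


def window_at_same_memory(current_window: int, compression_factor: float) -> int:
    """Candidates that fit in the same memory.

    Assumption: memory is the only limit and scales linearly with candidates.
    """
    return int(current_window * compression_factor)


if __name__ == "__main__":
    for factor in (4.0, 4.5):
        print(
            f"factor {factor}: saving {memory_saving(factor):.1%}, "
            f"window 20 -> {window_at_same_memory(20, factor)}, "
            f"window 30 -> {window_at_same_memory(30, factor)}"
        )
```

On those assumptions, a fourfold compression is the 75% saving quoted above, and a window of 30 becomes 120. Real systems are also limited by compute and latency, so treat this as an upper bound on the idea rather than a forecast.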

Getting into the candidate list — especially as a new site

This is where it gets practically useful. The debate about rankings has always assumed the hard problem was ranking within the candidate set. The harder problem — and the one almost nobody talks about — is getting into the candidate set at all. For an established site with years of crawl history, link equity and topical authority, this is largely handled. For a new site, it is the entire challenge.

The candidate set is not assembled by magic. It is built by classical retrieval — postings lists, keyword matching, link signals — before the expensive layers ever fire. Getting into it requires the fundamentals to be in place.

Get crawled properly first

Start with a clean robots.txt and a sitemap submitted to Google Search Console (GSC). Check for accidental noindex directives: the Breeze Furnishings case on this site is a good illustration of what happens when that goes wrong silently. If Google cannot find and index your pages, the candidate set question is academic.
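Both checks can be scripted. The sketch below uses Python's standard library to test whether a robots.txt body allows a given URL and whether a page's HTML carries a robots meta tag containing noindex. The helper names and the example.com URLs are hypothetical, and it does not look at the X-Robots-Tag HTTP header, which can also carry a noindex directive.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaFinder(HTMLParser):
    """Flags <meta name="robots" ...> tags whose content includes noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots_meta = attr.get("name", "").lower() in ("robots", "googlebot")
        if is_robots_meta and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex_meta(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(is_crawl_allowed(robots, "https://example.com/blog/post"))
    print(has_noindex_meta('<meta name="robots" content="noindex, follow">'))
```

Run against every URL in your sitemap, a check like this surfaces the silent failures: a page that is listed for crawling but blocked or marked noindex.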

Internal linking from day one

Even a small site needs a logical structure Google can walk. Pages that are not linked to from anywhere are difficult to discover and difficult to contextualise. Every new page should be reachable from at least one existing page via a relevant anchor.
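One way to check this is to treat the site as a graph and walk it from the home page. The sketch below assumes you already have a map of each page to the internal pages it links to (from a crawl export, for example) and returns every page that cannot be reached by following links. The function name and the sample paths are hypothetical.

```python
def find_orphans(links: dict[str, set[str]], home: str) -> set[str]:
    """Return pages that cannot be reached from `home` via internal links."""
    seen = {home}
    stack = [home]
    # Depth-first walk over the internal link graph.
    while stack:
        page = stack.pop()
        for target in links.get(page, set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)

    # Every page we know about, whether it links out or is only linked to.
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return all_pages - seen


if __name__ == "__main__":
    site = {
        "/": {"/blog"},
        "/blog": {"/blog/rose-quartz"},
        "/blog/rose-quartz": set(),
        "/old-landing": {"/"},  # links out, but nothing links to it
    }
    print(find_orphans(site, home="/"))
```

Note that a page linking out to the rest of the site is still an orphan if nothing links to it, which is the case a quick visual check tends to miss.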

Answer specific, concrete queries

The narrower the question, the less competition for the candidate slot. A new site cannot compete for "crystals" — but it can compete for "rose quartz for anxiety" or "how to use black tourmaline for protection." Specific intent, specific page, specific chance of making the candidate set.

E-E-A-T signals — be a real entity

For a new site, demonstrating that a real, credentialled person is behind it matters more than it does for an established brand. An Ahrefs contributor credit, a named author with a verifiable history, a company registration number in a privacy policy — these are not vanity signals. They are the things that help a new site get taken seriously before it has earned authority through age and links.

Consistent publishing — give crawlers a reason to return

A site that publishes regularly on a predictable schedule gets crawled more frequently than one that publishes sporadically. This matters for new sites because freshness is one of the few signals a new site can influence immediately. Even a modest schedule — two or three posts a month — is better than bursts followed by silence.

The frame you are not seeing — multiple data sources

Here is the watch story. Some years ago, working on site speed improvements for a client, I sat in a meeting with a head of tech and a head of marketing. The head of tech, tasked with understanding the speed of the site, had decided to measure it by loading the site internally on his machine and looking at his watch.

He was not being lazy or careless. He was measuring something. He just had no idea that Google was measuring something completely different — with completely different tools, from completely different locations, under completely different conditions — and that his watch bore no relationship whatsoever to the outcome he was trying to influence. His frame of reference was all he had. It did not occur to him that a different frame existed.

The same problem exists right now in how most sites think about AI traffic. User-driven agents — ChatGPT-User, Claude-User, Perplexity-User — fetch pages on demand when someone asks an AI model about a topic your page covers. They are real visitors. They may be arriving in meaningful numbers. And they do not execute JavaScript.

GA4 is JavaScript-dependent. If your analytics implementation is entirely client-side — which the standard GA4 setup is — every visit from an AI agent is invisible to your dashboard. You have no idea whether your pages are being cited by ChatGPT, referenced by Perplexity or pulled by Claude. The instrument you are using to measure the world does not show you this part of the world.

"GA4 cannot see AI agents. Your server logs can. These are not the same instrument."

The fix is not complicated. Server logs record every request regardless of whether JavaScript fires. In Hostinger, access logs are available under Advanced → Logs in hPanel. Download them and search for the user agents that matter: OAI-SearchBot, Claude-SearchBot and PerplexityBot for index crawlers; ChatGPT-User, Claude-User and Perplexity-User for on-demand retrieval. If those strings are not appearing, those systems are not visiting, and no amount of content optimisation for AI search will change that until the crawlability problem is addressed.
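Searching a downloaded log by hand gets tedious quickly, so here is a minimal sketch that counts hits per agent. It does a plain case-insensitive substring match on each line for the six names listed above and makes no attempt to parse the log format or verify that a request really came from the named service. The sample lines are made up for illustration, with simplified user-agent strings, and are not real traffic.

```python
from collections import Counter
from typing import Iterable

# Agent names as listed in the article.
INDEX_CRAWLERS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")
ON_DEMAND_AGENTS = ("ChatGPT-User", "Claude-User", "Perplexity-User")


def count_ai_agent_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines mentioning each known AI agent name."""
    hits: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in INDEX_CRAWLERS + ON_DEMAND_AGENTS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # one agent per request line
    return hits


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2026:10:00:00 +0000] "GET /rose-quartz HTTP/1.1" 200 5120 "-" "ChatGPT-User"',
    '203.0.113.8 - - [01/Jan/2026:10:05:00 +0000] "GET /black-tourmaline HTTP/1.1" 200 4096 "-" "ChatGPT-User"',
    '203.0.113.9 - - [01/Jan/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "PerplexityBot"',
    '203.0.113.10 - - [01/Jan/2026:10:07:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, count in count_ai_agent_hits(SAMPLE_LINES).most_common():
        print(agent, count)
```

To run it on a real log, open the downloaded file and pass the file object straight in, since a file iterates line by line. An empty result for the on-demand agents is the signal described above: those systems are not fetching your pages.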

Multiple data sources are always better than one. GSC shows you what Google sees. GA4 shows you what JavaScript-executing visitors do. Server logs show you everything else. The head of tech with his watch was not wrong to measure — he was wrong to assume his measurement was complete.

Zaltar never found what he was looking for because he was searching the wrong space. The Omegahedron was already in Midvale. The candidate set is already being assembled. The question is whether your pages are in it — and whether you have the right instruments to know either way.