Indexing OCR text and layered PDFs
Wondering whether PDF overlays were too obscure a topic for the Webmaster Central Blog, I consulted my girlfriend in AdWords, who knows Search and, I believe, represents the general audience reaction:
me: marie, yt? qq
me: when you read the term “pdf overlay” what do you think? does it sound like a feminine hygiene product?
Marie: it sounds more nonsensical than fem hyg pro
me: pdf overlay sounds nonsensical? really? so for search, i’m just referring to a text layer under an image in a pdf.
Marie: not intuitive
im in sales
Given this one datapoint*, this post is on my blog. Here’s the basic gist of three questions about OCR’d content/layered PDFs that I was recently asked.
Can Google index textual content from OCR?
Yes. For example, we can index text layers beneath the image as found in PDF overlays.
(Though I have limited understanding, I’ve found that when people talk to me about PDF overlays/image+text PDFs/layered PDFs/text searchable PDFs, they’re largely referring to the same thing. To the rest of the world there may be important distinctions, and it seems like “PDF overlays” could actually be a superset, but let’s not get bogged down by crazy stuff like being accurate.)
Bottom line, if it’s been OCR’d, yes, it can be indexed. And PDFs with standard text, like our SEO Starter Guide, have been indexed and searchable for years.
So OCR’d content isn’t considered spammy?
The technique is fine. We’re always trying to find more ways to index quality information. In fact, in our own indexing pipeline we’re now using OCR on some documents that have no textual content. It’s early days, though, and of course standard Robots Exclusion Protocol (REP) directives still apply.
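Since REP directives apply to OCR'd documents like anything else, here's a rough sketch of checking whether a scanned PDF is blocked by robots.txt. It uses Python's standard `urllib.robotparser`. The robots.txt rules, the example.com URLs, and the `can_crawl` helper are all made up for illustration, and this is not how our crawler works internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /scans/private/
"""


def can_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules in ROBOTS_TXT allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A scanned PDF under the disallowed directory versus one outside it.
blocked = can_crawl("https://example.com/scans/private/invoice.pdf")
allowed = can_crawl("https://example.com/scans/public/brochure.pdf")
```

With these made-up rules, the PDF under `/scans/private/` is off limits to crawlers and the one under `/scans/public/` is not, whether the text inside came from OCR or was typed in.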
What if I use OCR on every single page I’ve ever written ever, do you think I could rank numero uno for every query forever?
Forever ever? Unlikely. It’s helpful to remember that the quality and compelling-ness of your content still matter. Long ago, like four years, some webmasters thought that if they dumped their entire database on the web, unleashing millions of new spreadsheets and documents, then their rankings would soar! It didn’t pan out.
This OCR-every-document plan has a similar feel.
But back to ranking: if your site has content that you feel is important to have indexed and searchable, try to make that content regular (non-OCR) text on the page. It’s safer and often more user-friendly, because OCR output isn’t always clean, which makes it harder for search engines to index and for users to comprehend.
* Thanks, Marie, for assisting my rigorous research.