Archive for the ‘SEO’ Category

Simple best practice for sitelink titles

Sitelink titles (anchor text) can be influenced by your webmaster charms! The URLs that Google selects for sitelinks, however, are far harder to influence manually.

 

google sitelinks for oprah.com

Oprah’s sitelink titles include “The Oprah Winfrey Show,” “Contact Us,” “Why Oprah Says She’ll Never Diet…”

 

If the titles of your sitelinks aren’t exactly what you hoped for, a troubleshooting tactic is to investigate the anchor text of your internal links (as it’s one of several factors used to determine sitelink titles). For example, here are a few links on Oprah’s homepage:

 

Text link <a href="http://www.oprah.com/omagazine.html">O, The Oprah Magazine</a>
Link to a CSS sprite (so it’s a less common case, but you get the idea) <a class="bookclub" href="http://www.oprah.com/book_club.html" alt="BOOK CLUB">BOOK CLUB</a>

 

Let's pretend Oprah sees her sitelink "BOOK CLUB," but she would prefer it displayed with standard capitalization as "Book Club". One way to help influence this change is for Oprah (or a web-savvy Stedman) to check the anchor text of her internal links and the alt text of her image links -- making sure to use "Book Club," not "BOOK CLUB."
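If you'd like to run this check yourself, here's a minimal sketch that pulls anchor text (and image alt text) out of a page and flags anything in all caps. The class and function names are made up for illustration, and it only sees the HTML you feed it, so treat it as a starting point rather than a full site audit.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []
        elif tag == "img" and self._href is not None and attrs.get("alt"):
            # For image links, the alt text stands in for anchor text.
            self._text.append(attrs["alt"])

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the text compares cleanly.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def all_caps_anchors(html):
    """Return (href, text) pairs whose anchor text is entirely upper case."""
    collector = AnchorTextCollector()
    collector.feed(html)
    return [(href, text) for href, text in collector.links if text.isupper()]
```

Fed the two homepage links quoted above, all_caps_anchors() would return only the book_club.html link with its "BOOK CLUB" text, which is the one to rewrite as "Book Club".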

 

We recently updated our sitelinks FAQ to reflect this tip (thanks to the Sitelinks team for all their help!):

 

[ At the moment, sitelinks are completely automated. We're always working to improve our sitelinks algorithms, and we may incorporate webmaster input in the future. There are best practices you can follow, however, to improve the quality of your sitelinks. For example, for your site's internal links, make sure you use anchor text and alt text that's informative, compact, and avoids repetition. Read a blog post about the importance of link structure. ]

 


Indexing OCR text and layered PDFs

Wondering whether PDF overlays was too obscure a topic for the Webmaster Central Blog, I consulted my girlfriend in AdWords, who has knowledge of Search and I believe represents the general audience reaction:

me: marie, yt? qq

 

Marie: sure

 

me: when you read the term “pdf overlay” what do you think? does it sound like a feminine hygiene product?

 

Marie: it sounds more nonsensical than fem hyg pro

 

me: pdf overlay sounds nonsensical? really? so for search, i’m just referring to a text layer under an image in a pdf.

 

Marie: not intuitive
but again…
im in sales

 

Given this one datapoint*, this post is on my blog. Here’s the basic gist of three questions about OCR’d content/layered PDFs that I was recently asked.

 

Can Google index textual content from OCR?

Yes. For example, we can index text layers beneath the image as found in PDF overlays.

 

(Though I have limited understanding, I’ve found that when people talk to me about PDF overlays/image+text PDFs/layered PDFs/text searchable PDFs, they’re largely referring to the same thing. To the rest of the world there may be important distinctions, and it seems like “PDF overlays” could actually be a superset, but let’s not get bogged down by crazy stuff like being accurate.)

 

Bottom line, if it’s been OCR’d, yes, it can be indexed. And PDFs with standard text, like our SEO Starter Guide, have been indexed and searchable for years.

So OCR’d content isn’t considered spammy?

The technique is fine. We’re always trying to find more ways to index quality information. In fact, in our own Indexing pipeline we’re now using OCR on some documents that are without textual content. It’s the early phase, though, and of course standard REP directives still apply.

What if I use OCR on every single page I’ve ever written ever, do you think I could rank numero uno for every query forever?

Forever ever? Unlikely. It’s helpful to remember that the quality and compelling-ness of your content is still important. Long ago, like four years, some webmasters thought that if they dumped their entire database on the web, unleashing millions of new spreadsheets and documents, then their rankings would soar! It didn’t pan out.

 

This OCR-every-document plan has a similar feel.

 

But back to ranking: if your site has content that you feel is important to have indexed and searchable, try to make the content regular text (non-OCR) on the page. It’s safer and often more user-friendly, because OCR output sometimes isn’t that clear, which makes it hard for search engines to index and for users to comprehend.

* Thanks, Marie, for assisting my rigorous research.


Google & Site performance: The compilation answer album

The comments from my last post about text indent made me feel like Captain Hammer, so this time I’m crossing my fingers to make allies, not enemies.

 

Anyone want to talk about site performance? Don’t we all love a faster site? Users dig it. Webmasters can capitalize on it. It pairs perfectly with a sauvignon blanc!

 

I’ve consolidated information from personal conversations with people like Sreeram Ramachandran and Steve Souders, and I combed WMC blog posts and my blog comments for anything site performance related. This information is accurate as of June 1, 2010.

 

How is a page’s performance measured?

It’s measured very, very carefully… We’re of course experimenting with several types of measurements. For instance, toolbar data from opted-in users is a signal.

 

One of the ways we measure a page’s speed incorporates both download and render time — we pay attention to the time taken from the moment the user clicks on a link until just before that document’s body.onload() handler is called. This includes:

  • DNS resolution
  • network travel time
  • browser time to construct and render the DOM
  • time to parse and execute necessary javascript
  • and so on and so forth

 

If actions are deferred to the body.onload() handler, they won’t affect the page load time in this measurement. Please keep in mind that there are several measurement techniques. I only highlighted one of them.
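To make the arithmetic concrete, here's a tiny sketch of that one measurement. The PageLoad class and the millisecond values in the usage note are hypothetical; the only thing taken from the description above is that the clock stops just before the onload handler runs.

```python
from dataclasses import dataclass


@dataclass
class PageLoad:
    """Hypothetical timestamps in milliseconds, all on the same clock."""
    click: float         # the user clicks the link
    onload_start: float  # just before body.onload() is called
    onload_end: float    # after any work deferred to the onload handler


def measured_load_time(load):
    """Time from the click until just before the onload handler is called."""
    return load.onload_start - load.click


def deferred_time(load):
    """Work done inside the onload handler, which this measurement ignores."""
    return load.onload_end - load.onload_start
```

For example, a page described by PageLoad(click=0, onload_start=400, onload_end=1250) measures as 400 ms, with the remaining 850 ms of deferred work left out of this particular number.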

How big of an impact is site performance on Google rankings?

From our original WMC blog post:

 

[ While site speed is a new signal, it doesn't carry as much weight as the relevance of a page. Currently, fewer than 1% of search queries are affected by the site speed signal in our implementation and the signal for site speed only applies for visitors searching in English on Google.com at this point. ]

 

Also, HT to Jonathan Simon, who pointed out that when ranking based on speed, we focus on the performance of a specific page/result, not the overall performance of the entire site.

Does Webmaster Tools’ Site performance feature consider the site’s geographic preference settings and report accordingly?

Some of our speed statistics come from real user data (opted-in toolbar users). Therefore, if your site’s target audience consists of mainly Australian users, then our performance numbers should reflect their usage.

What about ads? The slowest thing on my website costing me the last 7 points to the full 100 in Page Speed is Google’s AdSense ads.

One factor that makes ads kind of slow is their use of inline calls like document.write(), which doesn’t allow deferred loading (because document.write() may alter the page’s content, the browser has to wait).

 

The good news is that Steve Souders, Alex Russell, along with several of our co-workers and many outside developers, are looking into improving the speed of external factors like ads, etc. There are some promising things to keep an eye out for: HTML5 and its iframe attributes (seamless and srcdoc) and the FRAG tag.

 

Additionally, asynchronous loading would be a terrific improvement in the ads space. In fact, companies like BuySellAds.com are already using this technique to improve performance for their publishers.

What are the typical causes/solutions regarding fixing long time-to-first-byte metrics? Other than reducing the number of requests what other optimizations are there?

Can you flush the document early? It’s covered in Chapter 12 of “Even Faster Web Sites.”
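In case "flush early" is unfamiliar: the idea is to send the first chunk of HTML before the slow server-side work finishes. Here's a minimal sketch as a Python WSGI app. The function names, the /site.css path, and the sleep standing in for backend work are all assumptions for illustration, and whether the first chunk really reaches the browser early also depends on buffering in your server and any proxies in between.

```python
import time


def slow_backend_query():
    """Stand-in for database or API work that delays the rest of the page."""
    time.sleep(0.2)
    return "<p>Results</p>"


def early_flush_app(environ, start_response):
    """WSGI app that sends the <head> before doing the slow work."""
    start_response("200 OK", [("Content-Type", "text/html; charset=UTF-8")])
    # First chunk: lets the browser start fetching the stylesheet right away.
    yield (b"<!DOCTYPE html><html><head>"
           b'<link rel="stylesheet" href="/site.css">'
           b"</head>")
    # The slow work starts only after the first chunk has been handed off.
    body = slow_backend_query()
    yield ("<body>" + body + "</body></html>").encode("utf-8")
```

Because the app is a generator, the server can pass along the head while slow_backend_query() is still running, instead of waiting for the whole page to be built.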

(And then there’s the really old stuff that I’ve answered before about site performance.)

 

Is it possible to check my server response time from different areas around the world?

Yes. WebPagetest.org can test performance from the United States (both East and West Coast—go West Coast!), United Kingdom, China, and New Zealand.

What’s a good response time to aim for?

If your competition is fast, they may provide a better user experience than your site for your same audience.

 

Otherwise, studies by Akamai claim 2 seconds as the threshold for ecommerce site “acceptability.” Just as an FYI, at Google we aim for under a half-second.

Does progressive rendering help users?

Definitely! Progressive rendering is when a browser displays content incrementally as it becomes available, rather than waiting to display all the content at once. This provides users faster visual feedback and helps them feel more in control. Bing experimented with progressive rendering by sending users their visual header (like the logo and searchbox) quickly, then the results/ads once they were available. Bing found a 0.7% increase in satisfaction with progressive rendering. They commented that this improvement was comparable to that of a full feature rollout.

 

How can you implement progressive rendering techniques on your site? Put stylesheets at the top of the page. This allows a browser to start displaying content ASAP.

Sweet! That’s it for now. See you in the comments if you have questions. eof


HTML “text-indent: -9999px” and holding the line

Because today is Towel Day and because it’s just you and me, I can write about stuff I couldn’t say on a large platform like our Webmaster Central Blog. For example, I can write that:

 

If possible, it’s still best to avoid techniques such as “text-indent:-9999px” or “margin:-4000px” or “left:-2000em”.

 

And you can scream at me, “But I do it for accessibility! You’re mean, I’m nice!”

 

And that may be true. Another truth is that using “text-indent: -9999px”, or hiding text (keeping text out of the user’s sight in a browser), is a common spammer’s technique to hide off-topic keywords and/or links to manipulate search engine rankings.

 

hidden links using text-indent
Example of “text-indent:-9999px” to hide unrelated links and boost PageRank to those sites. Search engines will never notice!

 

Google has top-secret algorithms designed to detect when text is hidden/positioned off screen. If this type of hidden text is detected, our important red phone rings, and this becomes one of the signals that may cause us to believe your site is deceptive.

 

Given that Google only wants to return the most relevant sites to users, if we consider your site deceptive, its rankings may be negatively affected.

 

So what should a webmaster do?

 

Try to hold the line — avoid hiding text. We’re trying to find an elegant solution. And once we do, I’ll write an official post.
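If you're not sure whether your own stylesheets use these techniques, here's a rough sketch for finding them. To be clear, this is not how Google detects hidden text (that part is top secret, remember); it's just a regex over the three patterns mentioned above, with a made-up function name and an arbitrary threshold of 1000, and it will miss shorthand values like "margin: 0 -4000px".

```python
import re

# Matches declarations such as "text-indent:-9999px" or "left: -2000em".
# The lookbehind keeps "left" from matching inside "padding-left" and similar.
OFFSCREEN_RE = re.compile(
    r"(?<![\w-])(text-indent|margin(?:-left)?|left)\s*:\s*-(\d+(?:\.\d+)?)(px|em)",
    re.IGNORECASE,
)


def find_offscreen_rules(css, threshold=1000):
    """Return (property, value) pairs whose negative offset is >= threshold."""
    hits = []
    for prop, number, unit in OFFSCREEN_RE.findall(css):
        if float(number) >= threshold:
            hits.append((prop.lower(), "-" + number + unit.lower()))
    return hits
```

Small negative offsets such as "margin-left: -5px" fall under the threshold and are left alone, since those are ordinary layout tweaks rather than text pushed off screen.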

 

What solutions are being considered?

 

With HTML5, my friend Ian Hickson shared a few possibilities that could satisfy both webspam and accessibility needs:

 

  1. Hide content from screen users but show it to screen reader users.
    Use media-specific CSS, e.g. @media speech { } vs @media screen { }.

    Caveat: Not yet implemented by screen readers.

  2. Hide irrelevant content, such as hiding a login form once the user is logged in.
    Use HTML5’s hidden="" attribute.

    Caveat: This was just drafted a few months ago. I’ll get Ian’s latest take on the subject once he returns from paternity leave. Congrats, Ian!

 

Happy Towel Day, everyone!

 

Update made later on Towel Day: Luigi Montanez and I have some crazy connection. He just posted on the same friggin text-indent topic (enjoy my anchor text, Luigi!). Suddenly all that was impossible is possible.


rel=”canonical” for non-HTML files?

Update in June 2011: Google now supports rel=”canonical” in the HTTP header! It’s party time.

 

Q: How would Google implement rel=”canonical” for non-HTML files?

 

A: Likely through the link entity in the HTTP header. It would look something like this:

 

HTTP/1.1 200 OK
Date: Tue, 20 Apr 2010 07:28:14 GMT
Server: Apache/2.2
Content-Type: text/html; charset=UTF-8
Link: <http://www.example.com/preferred-canonical-url.doc>; rel="canonical"
Transfer-Encoding: chunked
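For the curious, here's a minimal sketch of how a server might emit a Link header like the one above, written as a Python WSGI app. The function names, the application/msword content type, and the placeholder body are assumptions for illustration; only the header format and the example URL come from the sample response.

```python
def canonical_link_header(canonical_url):
    """Build a (name, value) header tuple pointing at the preferred URL."""
    if any(ch in canonical_url for ch in "<> \r\n"):
        raise ValueError("URL must be percent-encoded, with no <, > or spaces")
    return ("Link", "<" + canonical_url + '>; rel="canonical"')


def serve_document(environ, start_response):
    """Hypothetical WSGI app serving a duplicate copy of a Word document."""
    headers = [
        ("Content-Type", "application/msword"),
        canonical_link_header(
            "http://www.example.com/preferred-canonical-url.doc"
        ),
    ]
    start_response("200 OK", headers)
    return [b"...document bytes go here..."]
```

The validation step is there because a stray angle bracket or line break in the URL would produce a malformed header.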

 

Q: When will this feature be ready?

 

A: Oh no, sorry if I misled. We probably won’t support this any time soon.

 

Q: Rats!

 

A: That’s not a question.

 

Q: So why wouldn’t you guys support rel=”canonical” in the HTTP header?

 

A: Truth is, we’ve discussed it internally and we’re currently leaning toward the worry that it may cause more damage than benefit.

  • An HTTP header with rel=”canonical” could be too obscure for many webmasters to debug — it’s a lot more obvious to troubleshoot when it’s in the HTML source.
  • We favor verifying correct adoption/implementation before increasing support for new features. For example, we waited some time before rolling out cross-domain rel=”canonical” to be sure same-domain rel=”canonical” was largely properly implemented.
  • Less notably, it’s not an often requested feature.
  • Update on 04/20/2010: We still use URLs in your Sitemap as a hint for your preferred canonical whether it’s HTML or non-HTML content (thanks to John for mentioning this!). So when we have a cluster of duplicates, your Sitemap URL can be the display version and obtain the linking properties from the cluster. Unlike rel=”canonical”, it’s not quite as strong a signal and it doesn’t have the ability to actually cluster dupes.

 

Last thing: If you feel that the lack of HTTP header support for non-HTML files is a gaping hole in rel=”canonical” functionality, let us (me) know. Otherwise, it’ll probably remain low to minuscule priority for some time to come.


Search engines, URLs, and a trailing slash “/”

I wanted to write a general post for the Webmaster Blog about how search engines handle URLs with/without trailing slashes, but turns out that the major engines differ quite a bit in their display.

 

Conducting quick research using the URL for Webmaster Central, which 301s the non-trailing-slash URL, www.google.com/webmasters, to the trailing-slash URL, www.google.com/webmasters/, here’s what I found today:

  • Good news for PageRank and linking properties! Search engines appear to cluster 301’d trailing-slash URLs to/from no-trailing-slash URLs (evidenced by the same cached version for various URL formats)
  • Google generally adheres to the target of your 301, displaying the target URL (with or without the trailing slash) in SERPs. Some examples from SERPs that show the 301 target with/without the trailing slash:
    www.google.com/products
    www.google.com/webmasters/
  • Yahoo! and Bing may remove the trailing slash from results, even if it’s the target of your 301. For the query [google webmaster central], both Yahoo! and Bing show this URL without the trailing slash:
    www.google.com/webmasters
  • Bing can remove more than just the trailing slash of a URL to fit query terms/keywords — they can remove characters, too. For the query [google webmaster tools], Bing shows:
    www.google.com/webmaster
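Since all of these findings hinge on a site picking one form and 301ing the other to it, here's a minimal sketch of that decision in Python. The function name and the rule of leaving file-like paths (anything with a dot in the last segment) alone are my own assumptions, not a description of how any particular server is configured.

```python
from urllib.parse import urlsplit, urlunsplit


def trailing_slash_redirect(url, want_slash=True):
    """Return the URL to 301 to, or None if url is already in preferred form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if path == "/" or "." in last_segment:
        # Leave the root and file-like paths (page.html) untouched.
        target_path = path
    elif want_slash:
        target_path = path if path.endswith("/") else path + "/"
    else:
        target_path = path.rstrip("/")
    if target_path == path:
        return None
    # Rebuild the URL with the new path, keeping any query string intact.
    return urlunsplit(parts._replace(path=target_path))
```

With the default want_slash=True this mirrors the Webmaster Central setup described above: www.google.com/webmasters gets a redirect target ending in a slash, while the slash version returns None because it is already the preferred form.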

 

Details and screenshots

 

Google generally follows 301s and displays URLs with or without a trailing slash accordingly (i.e. we stay pretty true to the URL that was successfully crawled). For example, Google search results reflect www.google.com/webmasters/, the target of the 301, as the canonical version.

 

But Yahoo! seems to remove the trailing slash from the results display (i.e. even though www.google.com/webmasters/ is the 301 target, the slash is removed):

 

Interestingly, clicking this result’s cached version shows the trailing-slash URL, www.google.com/webmasters/. So while the display may differ, it’s likely both URLs, slash and no-trailing-slash, are clustered as expected:

Bing removes the trailing slash, too.

 

But I think Bing also removes the ‘s’ in ‘webmasters’ when it doesn’t match the query term. Here they display www.google.com/webmaster. I found this feature most interesting.

In Bing, both the URLs www.google.com/webmasters and www.google.com/webmaster show the source URL in the cached version as www.google.com/webmasters/. Evidence again that while the display formats may differ, the duplicate content URLs are likely clustered.

Bing’s swapping/adjusting of display URLs to match queries is a pretty neat idea with potentially large implications. And I’m sure Bing prevents keywords in URLs from becoming spammy in these 301 cases. For example, perhaps their results display only allows stemming of the canonical URL from plural to singular nouns (webmasters -> webmaster), not complete variations of keywords.

 

I don’t research search engine behavior outside of Google as much as I should, sorry about that. If you have more findings on trailing slashes and URLs, please share. Would be cool to learn more.

 

Update on May 18, 2010: A few weeks after this post, I published an official Webmaster Central article about how Google handles URLs and the trailing slash.


What’s the optimal server response time?

Fairly valid answers:

 

  1. Faster than your competitors
  2. Under 2 seconds

 

I’d go with #2 as I’m a believer in having metrics for myself independent of others’ performance. It just seems conducive to higher overall happiness.

 

My coworker Sreeram Ramachandran, who developed Site Performance in Webmaster Tools, forwarded me an article by Akamai about response times for eCommerce sites.

 

At Google, we definitely aim for sub-two.

 

You can check your site’s response time from locations throughout the world at WebPagetest.org. For example, a user in Virginia with DSL needs less than a second to run the query [page speed] on Google.com.

 

[Image: response time for a Google query, as measured on WebPagetest.org]

Title and name attributes in HTML anchors

How does Google currently process title and name attributes in HTML anchors?

 

<a title="sweet link!" name="nice name!" href="page.html">foo</a>

 

title = not processed by Google (please keep in mind that it could be useful for other engines or applications)

 

name = not processed for ranking/content relevance, but can be utilized for understanding page structure (such as with JavaScript functions)

 

Thanks to Joachim Kupke (super nice guy) for checking the code to provide clarification.


DateRank: PageRank for singles

DateRank: noun. An authoritative gauge of the guy/girl sitting in front of you that’s far more accurate than your hopes or pre-conceptions.

In the uncertain world of dating in the city, where you often know very little of your date’s background, people find it reassuring to meet friends of the person they’re dating. First of all, phew, your date has friends! Second, it’s truly affirming if your date’s friends are people you could imagine being friends with as well. It’s like hoping that they have quality inbound links.

Last week as Vanessa and I made our way home from the Beauty Bar, she coined this concept as DateRank™. While perhaps dehumanizing and unromantic, the parallels between DateRank and PageRank remain numerous.

 

DateRank                                             | PageRank
“He has really cool friends.”                        | Quality inbound links
“His friends… let’s just say they’re questionable.”  | Links to bad neighborhood
“Eh, we’re friends, but I don’t know her that well.” | rel=”nofollow”

Other dating signals      | Other ranking signals
Name-dropping             | Keyword stuffing
Repetitive/monotonous     | Duplicate content
Looking for a meal ticket | Made for AdSense