The magazine of the Melbourne PC User Group

Google Comes To Town
Trevor Gosbell
 

Is it inevitable that everyone who writes about computers must eventually do a bit about Google? Well, why fight it? Here goes...

In June I was one of the privileged few hundred to pile into a lecture hall at Melbourne University to hear a talk called "Google - Finding Needles in a 20 TB Haystack, 200 Million Times per Day". Yes, Google had come to town and they weren't just sightseeing - rumour had it that they were recruiting, looking for the brightest minds Melbourne has to offer.

Unfortunately my PhD hadn't arrived from dodgy-degrees-r-us.com, so I didn't get an interview, but I was able to elbow my way into the public lecture. The speaker was Dr Craig Nevill-Manning, a Kiwi by origin and Director of Engineering at the New York office of Google. Here are the highlights as I saw them.

On the Cheap

Google does not favour exotic, high-end server hardware. No, Google does things on the cheap. The 200 million hits per day on Google are handled by standard desktop hardware - the sort of thing that you or I might have, only multiplied thousands of times over. They keep their costs low by buying in the "sweet spot" where the price-performance ratio ensures a good bang for their bucks.

Cheaper hardware has its advantages but it also means lower reliability. Failures do occur and Google manages this (indeed they expect it) by having plenty of capacity and redundancy, so that the user experiences a reliable site. One presumes that there are dozens of people employed by Google to constantly replace parts or entire machines as they fail.

Doing it cheap has always been the Google way. According to Nevill-Manning the original setup at google.stanford.edu was a mixed bag of hardware left over from previous research work - complete with an external disk drive casing made of Lego (he even had photos to prove it). And later when they moved into dedicated hosting premises they took in their homemade server racks with motherboards mounted directly onto metal trays insulated with cork. Apparently the bemused management at their hosting facility was somewhat concerned about the fire risk!

And like all good IT start-ups, the early Google team served the obligatory time in a friend's garage.

How they Do It

Along with monitoring this vast farm of hardware, Nevill-Manning reckons that Google's other main challenge is, not surprisingly, indexing the Web. He outlined their approach to hypertext analysis, which includes taking into account:

The last one is the Google method of rating the "reputation" of every site on the Web. PageRank is not influenced by traffic to a site, but by the number of links to a site and the reputation of the sites making those links. It all sounds a bit self-referential (i.e. we decide on the ranking of a site by looking at the ranking of other sites, which we ranked by looking at the ranking of other sites ...) but you've got to admit it seems to work.
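For the curious, that self-referential loop can be sketched in a few lines of Python. This is only a minimal sketch of the PageRank idea as it has been described publicly, not Google's production code. The function name pagerank, the damping factor of 0.85, the iteration count and the toy four-page Web are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate page reputation from a link graph by repeated refinement.

    links maps each page name to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    # Start by giving every page the same reputation.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its reputation on, split among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its share evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# A toy Web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy Web, "c" comes out on top because it has the most incoming links, and "a" outranks "b" and "d" because its single incoming link comes from the well-regarded "c".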

And if you don't believe that Google uses incoming link text to index pages, try to predict what will be at the top of a search for the term "click here". I'll leave it as an exercise for the reader to explain that result.

Metadata is not used in Google indexing because they have found it to be unreliable and often deliberately misleading. (Metadata is extra data that provides information about other data, such as the description and keywords fields in an HTML document. See the metadata entry in the Free On-Line Dictionary Of Computing: http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?metadata.) Also, metadata is usually invisible and Google ignores all "invisibles" when indexing. And the indexing algorithm is clever - it can tell if you're trying something underhanded to get into the search engines, like hiding white text on a white background or making some text unreadably tiny - that sort of text is effectively invisible, so Google ignores it.

But the cleverness also works in positive ways. For example, images are indexed using related information such as the surrounding text, text in the image tag ALT attribute, and the file name of the image to give clues to the image contents.
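To make the idea concrete, here is a rough sketch in Python of how clue words might be gathered for an image from its ALT text and file name. It is an illustration only, not a description of Google's indexer: the names ImageClueParser and image_clues and the sample tag are invented for this example, and the surrounding page text is left out to keep it short.

```python
import re
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse


class ImageClueParser(HTMLParser):
    """Collect clue words for each <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.clues = {}  # image src -> set of lowercase clue words

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        if not src:
            return
        # The file name without its directory or extension.
        stem = PurePosixPath(urlparse(src).path).stem
        text = f"{attr.get('alt') or ''} {stem}"
        # Split on anything that is not a letter or digit.
        self.clues[src] = set(re.findall(r"[a-z0-9]+", text.lower()))


def image_clues(html):
    """Return a dict mapping each image src to its clue words."""
    parser = ImageClueParser()
    parser.feed(html)
    return parser.clues


sample = '<p><img src="/photos/sydney-harbour_bridge.jpg" alt="Bridge at sunset"></p>'
print(image_clues(sample))
```

Even with nothing but the tag itself to go on, the sample image picks up words like "sydney", "harbour", "bridge" and "sunset".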

So search results depend on a combination of the relevance of search terms used and the reputation of Web sites (represented by PageRank).

Advertising

Nevill-Manning rejected any suggestion that top ranking in Google can be bought - either by those that advertise with them or others. This seems to be a very difficult claim for an outside observer to prove or disprove, but a little thought suggests that it is highly unlikely because the core value of Google is the ability to produce highly relevant results. Any weakening of the results also weakens the credibility of Google. And as Nevill-Manning pointed out, their "Sponsored Links", though prominently featured, are clearly labelled and physically separated from the "real" results.

Google AdSense http://www.google.com/adsense/ is the service that provides small advertisements to Web sites, each targeted to the content of the current page. This provides some curly problems for Google - for example, in a page that mentions "Java", is the topic Indonesia, coffee, or programming?

Working for Mr Google

Apart from getting their name on one of the coolest business cards in the world, staff of Google have a very attractive perk: "20% time". This allows Google folks to spend 20% of their time sitting around dreaming up stuff and trying things that they wouldn't otherwise have the time to do. There are no strings attached - it's free-form innovation time.

Some of the resulting work is released into the wild at Google labs http://labs.google.com, and a few end up as part of the main Google system. Products of 20% time include:
  • Froogle - "smart shopping through Google" http://froogle.google.com
  • local.google - search by geographic location - not yet available for Australia http://local.google.com
  • define - get word meanings simply by typing "define (your word here)" in the Google search box
  • spell checking - how many times has Google asked you "Did you mean: ..."? (To illustrate the need for spell checking, Nevill-Manning showed a depressingly long list of the creative ways people have misspelled "Britney Spears"...)

But then, some other ideas are just interesting solutions in search of a problem, like Google Sets http://labs.google.com/sets.

Google Zeitgeist

Nevill-Manning closed with a story that illustrates Google's place in the mainstream.

A couple of years ago on the US game show "Who Wants to Be a Millionaire?" a contestant reached the top prize question, which was: In "The Brady Bunch" what was Carol Brady's maiden name? In accordance with the rules of the game, the contestant was allowed to phone a friend - a friend who was waiting with the Google search page open. In the time available the friend tried to search for "carol brady maiden name" but didn't quite get time to return the correct answer.

However, the contestant's friend was not the only searcher that night - Google statistics showed a sharp peak of thousands of searches on "carol brady maiden name" - firstly at the time the show was broadcast on the east coast and then several times during the evening as the syndicated show went to air in different time zones across the country.

I'm really not sure what to make of that.
 
Don’t Want To Be In Google?

Do you have a site or part of a site that you’d like to keep hidden from Google (and other prying Web crawlers and robots)? The main way that Google finds stuff on the Web is with a “Web crawler” called Googlebot. Googlebot is a program that trundles around the Web, reading pages and following links — mapping the Web as it goes. Other search engines have similar tools.

So if you want to stay hidden on the Web the solution to your problem is to put the brakes on Googlebot and its mates.

robots.txt

Your first option is to put a file called robots.txt in the root of your Web site. That is, the URL will be http://www.yoursite.com/robots.txt.

The simplest entry to put into this file is:
   User-Agent: *
   Disallow: /


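This asks Googlebot and other well-behaved crawlers to stay out of your entire site. Note that robots.txt is a request rather than a lock: crawlers that choose to ignore it can still poke around. See the Robots.txt Tutorial http://www.searchengineworld.com/robots/robots_tutorial.htm for more information.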
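If you'd like to check how a polite crawler would read your robots.txt before you upload it, Python's standard urllib.robotparser module can do the job. The following is a small sketch rather than a definitive tool: the helper name is_allowed, the /private/ rule and the page names are made up for illustration, and www.yoursite.com is the same placeholder used above.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The entry shown above: everything is off limits to every crawler.
BLOCK_ALL = """\
User-Agent: *
Disallow: /
"""

# An assumed variation: only one directory is off limits.
BLOCK_PRIVATE = """\
User-Agent: *
Disallow: /private/
"""

print(is_allowed(BLOCK_ALL, "Googlebot", "http://www.yoursite.com/index.html"))
print(is_allowed(BLOCK_PRIVATE, "Googlebot", "http://www.yoursite.com/private/diary.html"))
print(is_allowed(BLOCK_PRIVATE, "Googlebot", "http://www.yoursite.com/index.html"))
```

The first two checks should come back False and the third True, which shows that a Disallow line covers everything beneath the path it names.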

The Robots Meta Tag

You can also lock out Googlebot on a page-by-page basis using the robots meta tag. Place the following tag in the <head> section of your page:
    <meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

Well-behaved bots will not index this page or follow any links on the page.
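To see what a bot makes of the tag, here is a rough Python sketch that pulls the robots directives out of a page using the standard html.parser module. It is an illustration only and says nothing about how Googlebot itself is written. The names RobotsMetaParser and robots_directives and the sample page are assumptions for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The tag name arrives lowercased, but attribute values do not,
        # so compare the name value case-insensitively.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of lowercase robots directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = """<html><head>
<title>Keep out</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head><body><p>Nothing to see here.</p></body></html>"""

found = robots_directives(page)
print("index this page?", "noindex" not in found)
print("follow its links?", "nofollow" not in found)
```

A crawler that honours the tag would run a check along these lines before adding the page to its index or queueing up its links.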

Cache-flow Problem?

Google also keeps a backup store or cache of Web pages. This is often useful when the page you’re after is temporarily out of action, if the host site is offline for example. But if you want your pages kept out of the Google cache, it’s meta tags again:
    <meta name="ROBOTS" content="NOARCHIVE">

Think carefully before you cut yourself off from the search engines, but if you’re sure you want to go it alone then these methods should see your site left in peace.

More Information

• See the Robots Exclusion http://www.robotstxt.org/wc/exclusion.html page for more information on both robots.txt and the robots meta tag.
• Google Information for Webmasters http://www.google.com/webmasters/faq.html is also helpful.

Reprinted from the August 2004 issue of PC Update, the magazine of Melbourne PC User Group, Australia
