The magazine of the Melbourne PC User Group
Google Comes To Town
Trevor Gosbell
Is it inevitable that everyone who writes about computers must eventually do a
bit about Google? Well, why fight it? Here goes...
In June I was one of the privileged few hundred to pile into a lecture hall at
Melbourne University to hear a talk called "Google - Finding Needles in a 20 TB
Haystack, 200 Million Times per Day". Yes, Google had come to town and they
weren't just sightseeing - rumour had it that they were recruiting, looking for
the brightest minds Melbourne has to offer.
Unfortunately my PhD hadn't arrived from dodgy-degrees-r-us.com, so I didn't get
an interview, but I was able to elbow my way into the public lecture. The speaker
was Dr Craig Nevill-Manning, a Kiwi by origin and Director of Engineering at the
New York office of Google. Here are the highlights as I saw them.
On the Cheap
Google does not favour exotic, high-end server hardware. No, Google does things
on the cheap. The 200 million hits per day on Google are handled by standard
desktop hardware - the sort of thing that you or I might have, only multiplied
thousands of times over. They keep their costs low by buying in the "sweet spot"
where the price-performance ratio ensures a good bang for their bucks.
Cheaper hardware has its advantages but it also means lower reliability.
Failures do occur and Google manages this (indeed they expect it) by having
plenty of capacity and redundancy, so that the user experiences a reliable site.
One presumes that there are dozens of people employed by Google to constantly
replace parts or entire machines as they fail.
Doing it cheap has always been the Google way. According to Nevill-Manning the
original setup at google.stanford.edu was a mixed bag of hardware left over from
previous research work - complete with an external disk drive casing made of
Lego (he even had photos to prove it). And later when they moved into dedicated
hosting premises they took in their homemade server racks with motherboards
mounted directly onto metal trays insulated with cork. Apparently the bemused
management at their hosting facility was somewhat concerned about the fire risk!
And like all good IT start-ups, the early Google team served the obligatory time
in a friend's garage.
How they Do It
Along with monitoring this vast farm of hardware, Nevill-Manning reckons that
Google's other main challenge is, not surprisingly, indexing the Web. He
outlined their approach to hypertext analysis, which includes taking into
account:
The last one, PageRank, is the Google method of rating the "reputation" of every site on
the Web. PageRank is not influenced by traffic to a site, but by the number of
links to a site and the reputation of sites making those links. It all sounds a
bit self-referential (i.e. we decide on the ranking of a site by looking at the
ranking of other sites, which we ranked by looking at the ranking of other sites
...) but you've got to admit it seems to work.
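For the curious, the self-referential idea can be sketched in a few lines of
Python. This is only the simplified textbook form of a link-based ranking, not
Google's actual code: the four-page "web", the function name, the damping
factor of 0.85 and the number of passes are all my own assumptions for
illustration. Every page starts with an equal score, then repeatedly hands its
score on to the pages it links to until the numbers settle down.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Rank pages by incoming links (simplified textbook sketch).

    links maps each page name to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Start with every page having the same reputation.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score...
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # ...plus an equal share of the score of each page linking to it.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# A made-up web of four pages: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, page c comes out on top because three pages link to it, and
page a comes second because its single incoming link is from the well-regarded
c. Traffic plays no part at all.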
And if you don't believe that Google uses incoming link text to index pages, try
to predict what will be at the top of a search for the term "click here". I'll
leave it as an exercise for the reader to explain that result.
Metadata is not used in Google indexing because they have found it to be
unreliable and often deliberately misleading.
(Metadata is extra data that provides information about other data, such as the
description and keywords fields in an HTML document. See the metadata entry in
the Free On-Line Dictionary Of Computing:
http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?metadata.)
Also, metadata is usually invisible and Google ignores all "invisibles" when
indexing. And the indexing algorithm is clever - it can tell if you're trying
something underhanded to get into the search engines like hiding white text on a
white background or making some text unreadably tiny - that sort of text is
effectively invisible, so Google ignores it.
But the cleverness also works in positive ways. For example, images are indexed
using related information such as the surrounding text, text in the image tag
ALT attribute, and the file name of the image to give clues to the image
contents.
So search results depend on a combination of the relevance of search terms used
and the reputation of Web sites (represented by PageRank).
Advertising
Nevill-Manning rejected any suggestion that top ranking in Google can be bought
- either by those that advertise with them or others. This seems to be a very
difficult claim for an outside observer to prove or disprove, but a little
thought suggests that it is highly unlikely because the core value of Google is
the ability to produce highly relevant results. Any weakening of the results
also weakens the credibility of Google. And as Nevill-Manning pointed out their
"Sponsored Links", though prominently featured, are clearly labelled and
physically separated from the "real" results.
Google AdSense http://www.google.com/adsense/ is the service that provides small
advertisements to Web sites, each targeted to the content of the current page.
This provides some curly problems for Google - for example, in a page that
mentions "Java", is the topic Indonesia, coffee, or programming?
Working for Mr Google
Apart from getting their name on one of the coolest business cards in the world,
staff of Google have a very attractive perk: "20% time". This allows Google
folks to spend 20% of their time sitting around dreaming up stuff and trying
things that they wouldn't otherwise have the time to do. There are no strings
attached - it's free-form innovation time.
Some of the resulting work is released into the wild at Google Labs
http://labs.google.com, and a few end up as part of the main Google system.
Products of 20% time include:
- Froogle - "smart shopping through Google"
http://froogle.google.com
- local.google - search by geographic location - not yet available for Australia
http://local.google.com
- define - get word meanings simply by typing "define (your word here)" in the Google search box
- spell checking - how many times has Google asked you "Did you mean: ..."? (To
illustrate the need for spell checking, Nevill-Manning showed a depressingly
long list of the creative ways people have misspelled "Britney Spears"...)
But then, some other ideas are just interesting solutions in search of a
problem, like Google Sets http://labs.google.com/sets.
Google Zeitgeist
Nevill-Manning closed with a story that illustrates Google's place in the
mainstream.
A couple of years ago on the US game show "Who Wants to Be a Millionaire?" a
contestant reached the top prize question which was: In "The Brady Bunch" what
was Carol Brady's maiden name? In accordance with the rules of the game, the
contestant was allowed to phone a friend - a friend who was waiting with the
Google search page open. In the time available the friend tried to search for
"carol brady maiden name" but didn't quite get time to return the correct
answer.
However, the contestant's friend was not the only searcher that night - Google
statistics showed a sharp peak of thousands of searches on "carol brady maiden
name" - firstly at the time the show was broadcast on the east coast and then
several times during the evening as the syndicated show went to air in different
time zones across the country.
I'm really not sure what to make of that.
Don’t Want To Be In Google?
Do you have a site or part of a site that you’d like to keep hidden from
Google (and other prying Web crawlers and robots)? The main way that
Google finds stuff on the Web is with a “Web crawler” called Googlebot.
Googlebot is a program that trundles around the Web, reading pages and
following links — mapping the Web as it goes. Other search engines have
similar tools.
So if you want to stay hidden on the Web the solution to your problem is
to put the brakes on Googlebot and its mates.
robots.txt
Your first option is to put a file called robots.txt in the root of your
Web site. That is, the URL will be
http://www.yoursite.com/robots.txt.
The simplest entry to put into this file is:
User-Agent: *
Disallow: /
This stops Googlebot and other crawlers from poking around in your site.
See the Robots.txt Tutorial
http://www.searchengineworld.com/robots/robots_tutorial.htm for more
information.
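If you'd like to check what a set of rules will do before uploading it, here
is a minimal sketch using Python's standard urllib.robotparser module, which
reads robots.txt rules the way a polite crawler would. It illustrates the
rules only and says nothing about how Googlebot itself is built. The second
rule set, with its /private/ folder, is a hypothetical example of mine to show
a partial block, and the page names are made up.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The entry from this article: every crawler is shut out of the whole site.
BLOCK_EVERYTHING = """\
User-Agent: *
Disallow: /
"""

# Hypothetical variation: only one folder is off limits.
BLOCK_ONE_FOLDER = """\
User-Agent: *
Disallow: /private/
"""

closed = build_parser(BLOCK_EVERYTHING)
partial = build_parser(BLOCK_ONE_FOLDER)

home = "http://www.yoursite.com/index.html"
notes = "http://www.yoursite.com/private/notes.html"

print("Block everything, home page allowed?", closed.can_fetch("Googlebot", home))
print("Block one folder, home page allowed?", partial.can_fetch("Googlebot", home))
print("Block one folder, private page allowed?", partial.can_fetch("Googlebot", notes))
```

Remember that robots.txt is a request, not a lock: it only works on crawlers
that choose to honour it.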
The Robots Meta Tag
You can also lock out Googlebot on a page-by-page basis using the robots
meta tag. Place the following tag in the <head> section of your page:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
Well-behaved bots will not index this page or follow any links on the
page.
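To see how a well-behaved bot might read this tag, here is a rough sketch
using Python's standard html.parser module. The class and function names are
my own inventions, and this is an assumption about how a polite crawler could
work rather than a description of Googlebot. It simply collects whatever
directives appear in a page's robots meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The name is matched case-insensitively: ROBOTS and robots both count.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = """<html><head>
<title>Keep out</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head><body><a href="secret.html">Secret stuff</a></body></html>"""

found = robots_directives(page)
print("Index this page?", "noindex" not in found)
print("Follow its links?", "nofollow" not in found)
```

A crawler working this way would fetch the page, see both directives, and then
neither add the page to its index nor follow the link to secret.html.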
Cache-flow Problem?
Google also keeps a backup store or cache of Web pages. This is often
useful when the page you’re after is temporarily out of action, if the
host site is offline for example. But if you want your pages kept out of
the Google cache, it’s meta tags again:
<meta name="ROBOTS" content="NOARCHIVE">
Think carefully before you cut yourself off from the search engines, but
if you’re sure you want to go it alone then these methods should see your
site left in peace.
More Information
• See the Robots Exclusion
http://www.robotstxt.org/wc/exclusion.html page for more information
on both robots.txt and the robots meta tag.
• Google Information for Webmasters
http://www.google.com/webmasters/faq.html is also helpful.
Reprinted from the August 2004 issue of PC Update, the magazine of Melbourne PC User Group, Australia