Googlebot is Google’s web crawling bot, or web crawler (sometimes also called a “spider”). Web crawling is the process by which the crawler fetches new and updated web pages to be added to the Google index. Google uses a huge set of computers to crawl billions of web pages on the web.
Googlebot finds web pages in two ways: through the Add URL form at www.google.com/addurl.html, and by following links it discovers while crawling the web.
Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot or spider, which finds and retrieves web pages on the web and hands them over to the Google indexer. It’s easy to imagine Googlebot as a little spider running across cyberspace, but in reality Googlebot doesn’t travel the web at all. It functions much like your web browser: it sends a request to a web server for a web page, downloads the entire page, and then hands it over to Google’s indexer.
Googlebot consists of many computers, requesting and fetching web pages much faster than you can with your web browser. In fact, Googlebot can request thousands of different web pages simultaneously. To avoid overwhelming web servers or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it is capable of doing.
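That per-server pacing can be pictured with a short sketch. The following is a minimal Python illustration of fetching a URL while waiting between hits to the same host; the 5-second delay and the use of urllib are assumptions for the example, not details of Google’s actual implementation:

    import time
    import urllib.request
    from urllib.parse import urlparse

    DELAY_PER_HOST = 5.0   # illustrative pause between requests to the same server
    last_hit = {}          # host -> time of the most recent request

    def polite_fetch(url):
        # Wait if this host was contacted too recently, then fetch the page.
        host = urlparse(url).netloc
        wait = DELAY_PER_HOST - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        with urllib.request.urlopen(url) as response:
            return response.read()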
Googlebot is used to search the Internet. It is web crawling software from Google that scans for, finds, and indexes new web pages on Google. In other words, “Googlebot is the name of the web crawler or spider for Google. Googlebot will visit websites that have been submitted for indexing every once in a while to update its index.”
Note: Googlebot only follows HREF (“hypertext reference”) links and SRC (“source”) links. Starting from a list of a website’s URLs, Googlebot collects the information used to build a searchable index for Google’s indexer.
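To make the HREF/SRC idea concrete, here is a minimal link-collection sketch using Python’s standard html.parser module; the sample HTML string is invented purely for illustration:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        # Collect the values of href and src attributes from a page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    collector = LinkCollector()
    collector.feed("<a href='/about.html'>About</a> <img src='/logo.png'>")
    print(collector.links)   # ['/about.html', '/logo.png']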
Function of Googlebot
Googlebot functions as a spider that crawls the content on a site and interprets the directives in the site owner’s robots.txt file.
How to use Googlebot
Current version: Googlebot 2.1
Tag: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Switching User-Agent to Googlebot: Firefox extension (User Agent Switcher); a request-level sketch follows this list.
Tips: For Googlebot to function fully, give it all the access it needs.
Reminders: Ensure the “Prevent Spiders” option is set to true in your admin sessions settings.
Updates/changes to Googlebot: check your robots.txt file and keep its content current.
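To see roughly what a server returns when it believes Googlebot is calling, you can send a request that carries the user-agent tag listed above. A minimal Python sketch follows; www.example.com is a placeholder domain, not a real target:

    import urllib.request

    req = urllib.request.Request(
        "http://www.example.com/",
        headers={"User-Agent": "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as response:
        page = response.read().decode("utf-8", errors="replace")
    print(page[:200])   # first part of what the server returned to "Googlebot"

Note that many sites verify Googlebot by reverse DNS lookup, so a spoofed user-agent will not always receive the same content as the real crawler.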
How to Allow/Disallow Googlebot (manually):
To Allow Googlebot
Allow: / (or list a directory or page that you want to allow)
To Block Googlebot
Disallow: / (or list a directory or page that you want to disallow)
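For example, a minimal robots.txt aimed specifically at Googlebot might combine both directives (the /private/ path is only an illustration):

    User-agent: Googlebot
    Disallow: /private/
    Allow: /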
How to create a robots.txt file:
- Go to the Webmaster Tools Home page and click the site/property you want.
- Under Site configuration, click Crawler access.
- Click the Generate robots.txt tab to allow robot access to your site.
- In the Files or directories box, type /.
- Click Add.
- Your robots.txt file will be generated automatically.
- Save your robots.txt file (Note: It must reside in the root of the domain and must be named “robots.txt”.)
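Assuming you typed / under the allow action as described above, the generated file would look roughly like this (the exact output may differ):

    User-agent: *
    Allow: /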
How to ensure the robots.txt file is working properly:
- Go to the Webmaster Tools Home page and click the site/property you want.
- Under Site configuration, click Crawler access. Click the Test robots.txt tab.
- Copy the content of your robots.txt file and paste it into the first box. In the URLs box, list the URLs you want to test it against.
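The same check can be scripted. Here is a minimal sketch using Python’s standard urllib.robotparser, which fetches a live robots.txt and asks whether Googlebot may fetch a given URL; www.example.com and the test path are placeholders:

    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("http://www.example.com/robots.txt")
    parser.read()   # download and parse the site's robots.txt

    # True if the rules allow Googlebot to fetch this URL
    print(parser.can_fetch("Googlebot", "http://www.example.com/private/page.html"))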
Pros and Cons of Googlebot
Pros:
- It can quickly build a list of links from across the web.
- It recrawls popular, frequently changing web pages to keep the index current.
Cons:
- It only follows HREF links and SRC links.
- It takes up an enormous amount of bandwidth.
- Some pages may take longer to find, so crawling may occur once a month rather than daily.
- It must be set up/programmed to function properly.
Other Googlebot Options
- Googlebot-Mobile crawls pages for Google’s mobile index
- Googlebot-Image crawls pages for Google’s image index
- Mediapartners-Google crawls pages to determine AdSense content/ads
- AdsBot-Google crawls pages to check Google AdWords landing pages
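Each of these specialized crawlers can be addressed by name in robots.txt. For instance, to keep a directory of images out of Google’s image index while leaving the main crawler untouched (the /photos/ path is only an illustration):

    User-agent: Googlebot-Image
    Disallow: /photos/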