Types of Google Crawlers

Overview of Google Crawlers and Fetchers (User Agents)

Google uses crawlers and fetchers to interact with websites and gather information for its various products. These interactions can be automatic or triggered by user requests.

Crawlers (also called "robots" or "spiders") are automated programs that systematically browse websites by following links from one page to another. Think of them as tireless explorers mapping out the vast landscape of the internet. Google's primary crawler, responsible for gathering information for Google Search results, is called Googlebot.

Fetchers, on the other hand, behave more like an individual user's browser: they request a single specific URL when prompted by a user or another program.

This document provides a detailed overview of the different Google crawlers and fetchers, how they are identified in your server logs, and how you can manage their access to your site using the robots.txt file.

Understanding User Agents

Every time a crawler or fetcher accesses a webpage, it sends a request to the server hosting that page. This request includes a string of text called the user agent, which identifies the specific crawler or fetcher making the request. You can think of the user agent as the crawler's name tag.

Here's an example of what a user agent string might look like in your server logs:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This string tells us that the request came from Googlebot. The user agent string also carries additional information, such as the crawler's version (2.1 in this case). The leading "Mozilla/5.0" is a legacy compatibility token that nearly all browsers and bots include; it does not identify the actual software making the request.
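
To make this concrete, here is a minimal Python sketch that checks a raw log line for the Googlebot token and extracts the version number. The function name and regular expression are illustrative choices, not part of any Google tooling, and keep in mind that a user agent string alone can be spoofed by third parties:

import re

def detect_googlebot(log_line):
    # Return the Googlebot version if the user agent names Googlebot, else None.
    match = re.search(r"Googlebot/(\d+\.\d+)", log_line)
    return match.group(1) if match else None

# The example user agent string from above:
line = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(detect_googlebot(line))  # prints: 2.1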

Types of Google Crawlers and Fetchers

Google uses a variety of crawlers and fetchers, each with a specific purpose. They are broadly categorized as follows:

1. Common Crawlers

These are the workhorses of Google, constantly crawling the web to gather information for Google Search results and other Google products. They always respect the instructions provided in your robots.txt file, making them well-behaved guests on your website.

Here are some of the most common crawlers you might encounter. Note that many of them answer to more than one user agent token; the sketch after this list shows how that matching works:

  • Googlebot (Smartphone & Desktop): This is the main crawler responsible for indexing websites for Google Search. It simulates both smartphone and desktop browsers to understand how your website appears to users on different devices.

    • User agent tokens: Googlebot

    • Example user agent string (Desktop): Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/99.0.4844.51 Safari/537.36

    • Example user agent string (Smartphone): Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

  • Googlebot Image: This crawler specifically focuses on discovering and indexing images for Google Images and other Google products that utilize visual content.

    • User agent tokens: Googlebot-Image, Googlebot

    • Example user agent string: Googlebot-Image/1.0

  • Googlebot News: As its name suggests, this crawler is dedicated to finding and indexing news articles for Google News.

    • User agent tokens: Googlebot-News, Googlebot

    • Example user agent string: Googlebot-News doesn't have a separate string of its own; it uses the various Googlebot user agent strings.

  • Googlebot Video: This crawler scours the web for video content to index for Google Video and other video-based products.

    • User agent tokens: Googlebot-Video, Googlebot

    • Example user agent string: Googlebot-Video/1.0

  • Google StoreBot: This crawler is specifically designed to understand e-commerce websites. It crawls product pages, shopping carts, and checkout pages to gather information for Google Shopping and other relevant products.

    • User agent token: Storebot-Google

    • Example user agent string (Desktop): Mozilla/5.0 (X11; Linux x86_64; Storebot-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36

    • Example user agent string (Mobile): Mozilla/5.0 (Linux; Android 10; SM-G981B; Storebot-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36

  • Google-InspectionTool: This crawler powers the testing tools available in Google Search Console, such as the Rich Results Test and URL Inspection tool. It behaves similarly to Googlebot and helps you understand how Google sees your website.

    • User agent tokens: Google-InspectionTool, Googlebot

    • Example user agent string (Mobile): Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 (compatible; Google-InspectionTool/1.0;)

    • Example user agent string (Desktop): Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 (compatible; Google-InspectionTool/1.0;)

  • GoogleOther: This is a generic crawler used by various Google product teams for fetching publicly accessible content for internal research and development purposes. It's like a curious intern exploring the web for interesting information.

    • User agent token: GoogleOther

    • Example user agent string (Mobile): Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 (compatible; GoogleOther)

    • Example user agent string (Desktop): Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 (compatible; GoogleOther)

  • GoogleOther-Image & GoogleOther-Video: These crawlers function similarly to GoogleOther but are specifically designed to fetch publicly accessible image and video URLs.

    • User agent tokens: GoogleOther-Image and GoogleOther (for images); GoogleOther-Video and GoogleOther (for videos)

    • Example user agent strings: GoogleOther-Image/1.0 and GoogleOther-Video/1.0

  • Google-Extended: This is a standalone product token rather than a separate crawler. You can use it in robots.txt to manage whether publicly available content from your site helps improve Google products like Gemini Apps and Vertex AI generative APIs. It's important to note that Google-Extended doesn't affect your website's inclusion or ranking in Google Search.

    • User agent token: Google-Extended

    • Example user agent string: Google-Extended does not have a separate user agent string and uses existing Google user agents for crawling.
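
As noted above, several of these crawlers answer to more than one user agent token: Googlebot Image, for example, obeys rules addressed to either Googlebot-Image or Googlebot, with the more specific token taking precedence, and falls back to the * wildcard if neither is present. The Python sketch below is a simplified illustration of that matching logic, not Google's implementation; the token table and function name are illustrative:

# Tokens each crawler responds to, most specific first,
# taken from the "User agent tokens" entries above.
CRAWLER_TOKENS = {
    "Googlebot-Image": ["Googlebot-Image", "Googlebot"],
    "Googlebot-News": ["Googlebot-News", "Googlebot"],
    "Googlebot-Video": ["Googlebot-Video", "Googlebot"],
}

def governing_group(crawler, robots_groups):
    # Return the most specific user agent group in a robots.txt
    # file that applies to the given crawler.
    for token in CRAWLER_TOKENS.get(crawler, [crawler]):
        if token in robots_groups:
            return token
    return "*" if "*" in robots_groups else None

# A robots.txt containing groups for Googlebot and Googlebot-Image:
print(governing_group("Googlebot-Image", {"Googlebot", "Googlebot-Image"}))  # Googlebot-Image
print(governing_group("Googlebot-News", {"Googlebot", "*"}))                 # Googlebot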

2. Special-Case Crawlers

These crawlers are tied to specific Google products, such as Google Ads, where there's an agreement between the crawled site and the product about the crawl process. Unlike common crawlers, they may ignore the global rules in your robots.txt file: AdsBot, for example, ignores the global user agent (*) but still obeys rules that address it by its own token, as the snippet after this list shows.

Here are some examples of special-case crawlers:

  • APIs-Google: This crawler is used by Google APIs to deliver push notifications to web browsers.

    • User agent token: APIs-Google

    • Example user agent string: APIs-Google (+https://developers.google.com/webmasters/APIs-Google.html)

  • AdsBot (Mobile Web & Desktop): These crawlers analyze the quality of advertisements displayed on websites.

    • User agent tokens: AdsBot-Google-Mobile (Mobile) and AdsBot-Google (Desktop)

    • Example user agent string (Mobile): Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)

    • Example user agent string (Desktop): Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 (compatible; AdsBot-Google; +http://www.google.com/adsbot.html)

  • AdSense & Mobile AdSense: These crawlers analyze the content of websites to deliver relevant ads through the AdSense program.

    • User agent token: Mediapartners-Google

    • Example user agent string (Desktop): Mediapartners-Google

    • Example user agent string (Mobile): Mozilla/5.0 (iPhone; CPU iPhone OS 13_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1 (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)

  • Google-Safety: This crawler is responsible for identifying and flagging security threats like malware on websites. It's like a security guard patrolling the web to keep users safe. Because it handles abuse-specific crawling, it ignores robots.txt rules.

    • Example user agent string: Google-Safety
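
Because AdsBot ignores the global user agent, keeping it out of a directory requires a rule that names it directly. A hypothetical robots.txt (the /private/ path is only an example) might look like this:

User-agent: *
Disallow: /private/

User-agent: AdsBot-Google
Disallow: /private/

The first group alone would not stop AdsBot; the second group repeats the rule under AdsBot's own token so that it applies.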

3. User-Triggered Fetchers

These fetchers are activated by user requests to perform specific actions. Because each fetch is requested by a user, they generally ignore the instructions in your robots.txt file.

Here are some examples of user-triggered fetchers:

  • Feedfetcher: This fetcher is used by Google Podcasts and Google News to retrieve and update RSS or Atom feeds, ensuring users have access to the latest content.

    • User agent token: FeedFetcher-Google

    • Example user agent string: FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)

  • Google Publisher Center: This fetcher processes news feeds provided by publishers through the Google Publisher Center. It helps ensure that news content is accurately displayed in Google News.

    • Example user agent string: GoogleProducer; (+http://goo.gl/7y4SX)

  • Google Read Aloud: Upon a user's request, this fetcher accesses web pages and uses text-to-speech technology to read the content aloud, making information accessible to visually impaired users or anyone who prefers to listen rather than read.

    • Example user agent string (Desktop): Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)

    • Example user agent string (Mobile): Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)

  • Google Site Verifier: This fetcher helps verify website ownership in Google Search Console. When you verify your website, Google Site Verifier fetches a specific file to confirm you have access to the site.

    • Example user agent string: Mozilla/5.0 (compatible; Google-Site-Verification/1.0)

Managing Crawlers with robots.txt

The robots.txt file acts as a set of instructions for web crawlers, telling them which parts of your website they can and cannot access. You can use user agent tokens in your robots.txt file to create specific rules for different Google crawlers.

Here's an example of how you can use robots.txt to allow Googlebot to crawl your entire website, but prevent Googlebot-Image from accessing your /images/ directory:

User-agent: Googlebot
Disallow: 

User-agent: Googlebot-Image
Disallow: /images/

In this example, the first rule allows Googlebot to crawl the entire website (an empty Disallow: value means nothing is blocked). The second rule specifically targets Googlebot-Image and instructs it not to crawl any URL whose path begins with /images/.
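
If you want to sanity-check rules like these programmatically, Python's standard library ships urllib.robotparser. One caveat for this sketch: unlike Google's crawlers, which obey the most specific matching group, RobotFileParser applies the first group whose token matches, so the more specific Googlebot-Image group is listed first here (the rules string and URLs are illustrative):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/page.html"))          # True
print(parser.can_fetch("Googlebot-Image", "https://example.com/images/a.jpg")) # False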

Conclusion

Understanding Google's crawlers and fetchers, their specific roles, and how to manage their access using the robots.txt file empowers you to control how your website interacts with Google's various products and services. This granular control allows you to optimize your website's visibility, manage server load, and protect sensitive content.
