Verifying Googlebot and Other Google Crawlers

Any client can claim to be Googlebot in its user-agent string, so requests that identify themselves as Google crawlers should be verified before you trust them. Verification protects your site from malicious actors posing as Google crawlers. This guide describes the crawler types and the two ways to verify them.

Understanding Google Crawler Types

Google employs various crawlers, each serving different purposes. They are categorized as follows:

| Type | Description | Reverse DNS Mask | IP Ranges |
|---|---|---|---|
| Googlebot | The primary crawler for Google Search; strictly adheres to robots.txt rules. | crawl-***-***-***-***.googlebot.com or geo-crawl-***-***-***-***.geo.googlebot.com | googlebot.json |
| Special-case crawlers | Perform specific tasks (e.g., AdsBot); may or may not adhere to robots.txt rules. | rate-limited-proxy-***-***-***-***.google.com | special-crawlers.json |
| User-triggered fetchers | Activated by user actions (e.g., Google Site Verifier); ignore robots.txt rules. | ***-***-***-***.gae.googleusercontent.com or google-proxy-***-***-***-***.google.com | user-triggered-fetchers.json and user-triggered-fetchers-google.json |
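
The reverse DNS masks above can be turned into patterns for sorting hostnames by crawler type. The following is a minimal sketch, not an official mapping: it assumes that each `***` in a mask stands for one decimal octet of an IPv4 address (as in crawl-66-249-66-1.googlebot.com), and the names OCTETS, CRAWLER_PATTERNS, and crawler_type are hypothetical helpers introduced for illustration.

```python
import re

# Assumption: each "***" in a mask is one decimal octet of an IPv4 address.
OCTETS = r"\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}"

# One (label, pattern) pair per crawler type from the table.
CRAWLER_PATTERNS = [
    ("Googlebot", re.compile(
        rf"^(crawl-{OCTETS}\.googlebot\.com|geo-crawl-{OCTETS}\.geo\.googlebot\.com)$")),
    ("Special-case crawlers", re.compile(
        rf"^rate-limited-proxy-{OCTETS}\.google\.com$")),
    ("User-triggered fetchers", re.compile(
        rf"^({OCTETS}\.gae\.googleusercontent\.com|google-proxy-{OCTETS}\.google\.com)$")),
]

def crawler_type(hostname):
    """Return the crawler type whose mask matches hostname, or None."""
    # Reverse DNS answers often end with a trailing dot; drop it first.
    name = hostname.rstrip(".").lower()
    for label, pattern in CRAWLER_PATTERNS:
        if pattern.match(name):
            return label
    return None

if __name__ == "__main__":
    print(crawler_type("crawl-66-249-66-1.googlebot.com."))
```

A hostname match alone proves nothing, because anyone who controls reverse DNS for their own IP range can return a Google-looking name. Treat it as a classification step to combine with one of the verification methods.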

Verification Methods

There are two primary methods to verify if a crawler is legitimate:

1. Manual Verification (Using Command Line Tools)

This method is ideal for occasional checks and can be performed using the host command:

Step 1: Reverse DNS Lookup

Use the IP address from your server logs and run a reverse DNS lookup to obtain the hostname:

host 66.249.66.1

Example Output:

1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

Step 2: Verify Domain Name

Confirm that the returned hostname ends in one of the following Google domains:

googlebot.com
google.com
googleusercontent.com

Step 3: Forward DNS Lookup

Perform a forward DNS lookup using the domain name obtained in Step 1:

host crawl-66-249-66-1.googlebot.com

Example Output:

crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Step 4: Cross-Verification

Ensure the IP address returned in Step 3 matches the original IP address from your server logs.
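
The four manual steps can also be scripted. The sketch below is one possible way to do it with Python's standard socket module, under a few assumptions: it handles IPv4 only (socket.gethostbyname_ex returns IPv4 addresses), and the names GOOGLE_SUFFIXES, hostname_is_google, and verify_crawler_ip are hypothetical. The two lookup parameters exist only so that the DNS calls can be replaced in tests.

```python
import socket

# Domains from Step 2, with a leading dot so "fakegooglebot.com" cannot match.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def hostname_is_google(hostname):
    """Step 2: check that the hostname ends in one of the Google domains."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES)

def verify_crawler_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes the reverse and forward DNS checks."""
    # Default to real DNS lookups; callers may inject stand-ins for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        hostname = reverse_lookup(ip)          # Step 1: reverse DNS lookup
        if not hostname_is_google(hostname):   # Step 2: verify the domain
            return False
        addresses = forward_lookup(hostname)   # Step 3: forward DNS lookup
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False
    return ip in addresses                     # Step 4: cross-verification

if __name__ == "__main__":
    # Performs live DNS queries when run directly.
    print(verify_crawler_ip("66.249.66.1"))
```

Because each call makes two DNS queries, this approach suits occasional checks; caching the result per IP address is a sensible addition if you run it against busy logs.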

2. Automatic Verification (IP Address Matching)

For large-scale verification, automatically compare the crawler's IP address against Google's published IP ranges.

Step 1: Access IP Range Lists

Download the appropriate JSON file based on the crawler type:

Googlebot: googlebot.json
Special-case crawlers: special-crawlers.json
User-triggered fetchers: user-triggered-fetchers.json and user-triggered-fetchers-google.json

Step 2: IP Address Matching

Implement a script or utilize a library (depending on your programming language) to:

Parse the downloaded JSON file.
Check if the crawler's IP address falls within any of the listed IP ranges (represented in CIDR notation).

Example Python Code (Using the ipaddress module):

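import ipaddress
import json

def is_google_ip(ip_address, json_file):
    """Return True if ip_address falls inside any prefix listed in json_file."""
    # Parse the address once; ValueError is raised if it is malformed.
    ip = ipaddress.ip_address(ip_address)

    with open(json_file, "r") as f:
        data = json.load(f)

    for entry in data["prefixes"]:
        # Each entry holds either an "ipv4Prefix" or an "ipv6Prefix" key.
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed versions simply evaluate to False.
        if prefix and ip in ipaddress.ip_network(prefix):
            return True
    return False

# Example usage (expects googlebot.json in the working directory)
crawler_ip = "66.249.66.1"
if is_google_ip(crawler_ip, "googlebot.json"):
    print(f"{crawler_ip} belongs to Googlebot.")
else:
    print(f"{crawler_ip} is not a verified Googlebot IP.")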
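
For large-scale checks it can help to parse the prefixes once and reuse them for every address in a log, instead of re-reading the file per lookup. The following is a rough sketch of that idea. SAMPLE_JSON only imitates the shape of the published files (its prefixes are illustrative placeholders, not Google's current list), and load_networks and classify_ips are hypothetical names.

```python
import ipaddress
import json

# Illustrative data in the same shape as the published JSON files.
# These prefixes are placeholders for the example only.
SAMPLE_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "66.249.66.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
"""

def load_networks(json_text):
    """Parse every IPv4 and IPv6 prefix in json_text into network objects."""
    data = json.loads(json_text)
    networks = []
    for entry in data["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def classify_ips(ips, networks):
    """Map each IP string to True if it falls inside any of the networks."""
    results = {}
    for raw in ips:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            results[raw] = False  # malformed log entries are treated as unverified
            continue
        results[raw] = any(ip in net for net in networks)
    return results

if __name__ == "__main__":
    nets = load_networks(SAMPLE_JSON)
    print(classify_ips(["66.249.66.1", "203.0.113.9"], nets))
```

In practice you would pass the contents of the downloaded file to load_networks and refresh it periodically, since the published ranges can change.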

Verifying Other Google Services

To verify if an IP address belongs to other Google services (like Google Cloud functions), you can use the general list of Google IP addresses available publicly.

Note: The IP ranges in the JSON files are written in CIDR notation (for example, 192.0.2.0/24 covers 192.0.2.0 through 192.0.2.255). You can use online tools or programming libraries to check whether an IP address belongs to a specific CIDR block.
