Google's Interpretation of Robots.txt

How Google Interprets the robots.txt Specification

Google's automated web crawlers, like Googlebot, use the Robots Exclusion Protocol (REP) to determine which parts of a website they are allowed to crawl. This means that before crawling a site, Googlebot will download and parse the site's robots.txt file to identify any access restrictions. It's important to note that the REP isn't applicable to all of Google's crawlers. For instance, user-controlled crawlers (like those used for feed subscriptions) or crawlers focused on user safety (like those analyzing for malware) may not adhere to the REP.

This document provides a detailed explanation of how Google interprets the REP, expanding on the original standard outlined in RFC 9309.

What is a robots.txt File?

A robots.txt file acts as a set of instructions for web crawlers. If you want to prevent crawlers from accessing certain sections of your website, you can create a robots.txt file with specific rules. It's a simple text file placed in the root directory of your website that uses a specific syntax to define which crawlers can access which parts of your site.

Example of a robots.txt file

Let's say you have an e-commerce website and you don't want search engines to index your internal administrative pages. Here's how your robots.txt file might look:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /images/product-prototypes/ 

In this example:

  • The first group of rules applies to all web crawlers ("User-agent: *") and prevents them from accessing any URLs that start with /admin/ or /private/.

  • The second group specifically targets Google's image crawler ("User-agent: Googlebot-Image") and disallows access to a directory containing product prototype images. The short sketch after this list shows how these rules resolve for a few sample URLs.
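
Below is a minimal sketch using Python's standard urllib.robotparser module, which implements the general Robots Exclusion Protocol rather than Google's exact parser; the crawler name "MyCrawler" is just a placeholder:

from urllib.robotparser import RobotFileParser

# The example rules from above, as they would be served at
# https://example.com/robots.txt.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /images/product-prototypes/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Generic crawlers fall under the "*" group: /admin/ is blocked, product pages are not.
print(parser.can_fetch("MyCrawler", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/products/shoes"))  # True

# Googlebot-Image matches its own group, so only the prototypes directory is blocked.
print(parser.can_fetch("Googlebot-Image", "https://example.com/images/product-prototypes/x.png"))  # False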

New to robots.txt?

If you're just getting started with robots.txt, we recommend checking out these resources:

  • Intro to robots.txt: [Link to Google's introductory guide]

  • Tips for creating a robots.txt file: [Link to Google's best practices guide]

File Location and Range of Validity

For Google to find and interpret your robots.txt file correctly, you must place it in the top-level directory of your site and serve it over a supported protocol; for Google Search, the supported protocols are HTTP, HTTPS, and FTP.

Remember:

  • The URL for the robots.txt file is case-sensitive.

  • Crawlers access the file differently depending on the protocol:

    • HTTP/HTTPS: Crawlers use an HTTP non-conditional GET request.

    • FTP: Crawlers use a standard RETR (RETRIEVE) command with anonymous login.

The rules within the robots.txt file apply specifically to the host, protocol, and port number where it's hosted.

Examples of Valid robots.txt URLs and their Scope

  • https://example.com/robots.txt
    • Valid for: all files under https://example.com/, including subdirectories.
    • Not valid for: other subdomains (e.g., https://blog.example.com/), different protocols (e.g., http://example.com/), or non-standard port numbers (e.g., https://example.com:8181/).

  • https://www.example.com/robots.txt
    • Valid for: all files under https://www.example.com/.
    • Not valid for: https://example.com/ or https://shop.www.example.com/.

  • https://example.com/folder/robots.txt
    • Not a valid location; crawlers won't find it here.

  • https://www.exämple.com/robots.txt
    • Valid for: https://www.exämple.com/ and its punycode equivalent (e.g., https://xn--exmple-cua.com/).
    • Not valid for: https://www.example.com/.

  • ftp://example.com/robots.txt
    • Valid for: all files accessible via FTP on ftp://example.com/.
    • Not valid for: https://example.com/.

  • https://212.96.82.21/robots.txt
    • Valid for: crawling only when the IP address 212.96.82.21 is used as the host name.
    • Not valid for: https://example.com/ (even if that site is hosted on the same IP).

  • https://example.com:443/robots.txt
    • Valid for: https://example.com:443/ and the equivalent default-port URL https://example.com/.
    • Not valid for: https://example.com:444/.

  • https://example.com:8181/robots.txt
    • Valid for: only content served on the non-standard port 8181 (e.g., https://example.com:8181/).
    • Not valid for: https://example.com/.
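
The scoping rule behind these examples can be summarized in a few lines of code. The following is a minimal, illustrative Python sketch (not Google's implementation) that builds the robots.txt URL governing any given page URL from its protocol, host, and port:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme, host, and any explicit port; drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/folder/page.html"))  # https://example.com/robots.txt
print(robots_txt_url("https://blog.example.com/post"))         # https://blog.example.com/robots.txt
print(robots_txt_url("https://example.com:8181/shop"))         # https://example.com:8181/robots.txt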

Handling of Errors and HTTP Status Codes

When Googlebot attempts to access your robots.txt file, the server's HTTP status code response significantly impacts how Google will proceed.

HTTP Status Code Responses and Google's Interpretation

  • 2xx (Success)
    • Description: the robots.txt file was fetched successfully.
    • Google's interpretation: Google processes the directives in the robots.txt file as provided.

  • 3xx (Redirection)
    • Description: the server is redirecting the request for the robots.txt file.
    • Google's interpretation: Google follows up to five redirects; if the robots.txt file still isn't found, Google treats this as a 404 error. Google does not follow logical redirects within the robots.txt file itself (e.g., redirects using frames, JavaScript, or meta refresh tags).

  • 4xx (Client Errors)
    • Description: there was an issue with the client's request (e.g., file not found).
    • Google's interpretation: Google generally treats all 4xx errors, except 429 (Too Many Requests), as if there were no robots.txt file, assuming no crawl restrictions. Important: do not use 401 (Unauthorized) or 403 (Forbidden) to manage crawl rate; use the appropriate crawl rate management methods instead.

  • 5xx (Server Errors)
    • Description: a server-side error prevented fulfilling the request; this includes the 429 (Too Many Requests) status code.
    • Google's interpretation: Google temporarily interprets 5xx errors, including 429, as a full disallow and retries fetching the robots.txt file. For prolonged outages (over 30 days), Google uses its last cached copy of the robots.txt file; if none is available, Google assumes no crawl restrictions.

  • Other Errors
    • Description: issues like DNS problems, network timeouts, invalid responses, or connection interruptions.
    • Google's interpretation: these are treated the same as 5xx server errors.

Note: If you need to temporarily suspend crawling, serve a 503 (Service Unavailable) HTTP status code for every URL on your site.

Misconfigured 5xx Errors: If Google detects that your server is incorrectly configured to return a 5xx error instead of a 404 (Not Found) for missing pages, Google will treat those 5xx errors as 404s. For example, if the error message on a page returning a 5xx code is "Page not found," Google would interpret this as a 404 error.
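
As a rough illustration of the behavior described above, here is a minimal Python sketch of a fetch routine built on the requests library. It mirrors the documented interpretation of status codes but is a simplified assumption, not Google's implementation:

import requests

MAX_REDIRECTS = 5  # Google follows up to five redirects for robots.txt

def fetch_robots_txt(url: str):
    """Return (policy, body); policy is 'parse', 'allow_all', or 'disallow_all'."""
    session = requests.Session()
    session.max_redirects = MAX_REDIRECTS
    try:
        response = session.get(url, timeout=10, allow_redirects=True)
    except requests.TooManyRedirects:
        return "allow_all", ""         # too many redirects: treated like a 404
    except requests.RequestException:
        return "disallow_all", ""      # DNS/network errors behave like 5xx
    status = response.status_code
    if 200 <= status < 300:
        return "parse", response.text  # success: use the rules in the file
    if status == 429 or status >= 500:
        return "disallow_all", ""      # temporary full disallow, retried later
    if 400 <= status < 500:
        return "allow_all", ""         # other 4xx: assume no crawl restrictions
    return "allow_all", ""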

Caching

To optimize efficiency, Google generally caches robots.txt files for up to 24 hours. However, the cache duration may be longer if refreshing the file proves difficult (e.g., due to timeouts or 5xx errors).

  • Shared Cache: The cached robots.txt may be shared among different Google crawlers.

  • Cache Control: You can influence how long Google caches the file by setting a max-age directive in the Cache-Control HTTP header of your robots.txt response, as in the example below.
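
For instance, response headers like the following would let a fetched copy be reused for up to one day (86,400 seconds); the exact headers depend on how your server is configured:

HTTP/1.1 200 OK
Content-Type: text/plain
Cache-Control: max-age=86400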

File Format

Your robots.txt file must adhere to the following format requirements:

  • Encoding: Use UTF-8 encoding.

  • Line Separators: Separate lines with CR, CR/LF, or LF.

  • Invalid Lines: Google ignores invalid lines, including a Unicode Byte Order Mark (BOM) at the beginning of the file, and uses only the valid lines. If the downloaded content isn't a valid robots.txt file, Google still attempts to extract whatever rules it can and ignores the rest.

  • Character Encoding: If the encoding isn't UTF-8, Google may ignore unsupported characters, potentially invalidating your rules.

  • File Size Limit: Google currently enforces a file size limit of 500 kibibytes (KiB); content beyond that limit is ignored. To stay under the limit, consolidate rules or move excluded content into a separate directory, as in the example below.
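
For example, rather than listing many individual files, you could group them under one directory and block it with a single rule (the directory and file names here are purely illustrative):

# Instead of one Disallow line per file:
# Disallow: /drafts/report-2021.pdf
# Disallow: /drafts/report-2022.pdf
# Disallow: /drafts/report-2023.pdf
# ...move the files into a single directory and block it once:
User-agent: *
Disallow: /drafts/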

Syntax

A valid line in your robots.txt file follows this structure:

<field>:<value>

  • Field: Specifies the directive (e.g., User-agent, Disallow).

  • Colon (:): Separates the field from the value.

  • Value: Provides the instruction for the field.

Example:

User-agent: Googlebot
Disallow: /private-area/ 

Additional Syntax Rules:

  • Spaces: While spaces are optional, they are recommended for readability. Leading and trailing spaces on a line are ignored.

  • Comments: Use the # character to add comments. Everything after the # on the same line is ignored.

Example with Comments and Spacing:

# This is a comment explaining the rule below
User-agent: *  # Applies to all crawlers
Disallow: /secret-recipes/ # Keep those recipes hidden! 
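
To make these rules concrete, here is a deliberately simplified Python sketch that tokenizes individual lines into a field and a value, trimming whitespace and stripping # comments; it illustrates the syntax above and is not Google's actual parser:

def parse_line(line: str):
    """Return (field, value) for a valid line, or None for blank, comment-only, or invalid lines."""
    line = line.split("#", 1)[0]                  # everything after '#' is a comment
    if ":" not in line:
        return None                               # invalid or blank line: ignored
    field, value = line.split(":", 1)             # split only on the first colon
    field, value = field.strip(), value.strip()   # leading/trailing spaces are ignored
    if not field:
        return None
    return field.lower(), value

print(parse_line("User-agent: *  # Applies to all crawlers"))       # ('user-agent', '*')
print(parse_line("Disallow: /secret-recipes/"))                     # ('disallow', '/secret-recipes/')
print(parse_line("# This is a comment explaining the rule below"))  # None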

By carefully structuring your robots.txt file and understanding how Google interprets its directives, you can effectively control how Google crawls your website, ensuring that sensitive content is protected and that your site's most important pages are indexed correctly.
