Google's Interpretation of Robots.txt

How Google Interprets the robots.txt Specification

Google's automated web crawlers, like Googlebot, use the Robots Exclusion Protocol (REP) to determine which parts of a website they are allowed to crawl. This means that before crawling a site, Googlebot will download and parse the site's robots.txt file to identify any access restrictions. It's important to note that the REP isn't applicable to all of Google's crawlers. For instance, user-controlled crawlers (like those used for feed subscriptions) or crawlers focused on user safety (like those analyzing for malware) may not adhere to the REP.

This document provides a detailed explanation of how Google interprets the REP, expanding on the original standard outlined in RFC 9309.

What is a robots.txt File?

A robots.txt file acts as a set of instructions for web crawlers. If you want to prevent crawlers from accessing certain sections of your website, you can create a robots.txt file with specific rules. It's a simple text file placed in the root directory of your website that uses a specific syntax to define which crawlers can access which parts of your site.

Example of a robots.txt file

Let's say you have an e-commerce website and you don't want search engines to index your internal administrative pages. Here's how your robots.txt file might look:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /images/product-prototypes/ 

In this example:

  • The first group of rules applies to all crawlers ("User-agent: *") and prevents them from accessing any URL whose path starts with /admin/ or /private/.

  • The second group targets Google's image crawler ("User-agent: Googlebot-Image") specifically and disallows access to a directory containing product prototype images.
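
If you want to sanity-check rules like these before publishing them, you can test them locally. The sketch below uses Python's standard-library urllib.robotparser; its matching behavior differs from Google's in some edge cases (wildcards, for instance), so treat it as a rough check rather than a definitive verdict, and note that the URLs are placeholders.

from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a literal string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /images/product-prototypes/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic crawler may not fetch /admin/, but product pages are fine.
print(parser.can_fetch("SomeBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("SomeBot", "https://example.com/products/shoes"))  # True

# Googlebot-Image is blocked from the prototype images directory.
print(parser.can_fetch(
    "Googlebot-Image",
    "https://example.com/images/product-prototypes/new-shoe.png"))  # False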

New to robots.txt?

If you're just getting started with robots.txt, we recommend checking out these resources:

  • Intro to robots.txt: [Link to Google's introductory guide]

  • Tips for creating a robots.txt file: [Link to Google's best practices guide]

File Location and Range of Validity

For Google to find and interpret your robots.txt file correctly, you must place it in the top-level directory of your site and serve it over a supported protocol (for Google Search: HTTP, HTTPS, or FTP).

Remember:

  • The URL for the robots.txt file is case-sensitive, like any other URL, and the file itself must be named robots.txt in all lowercase.

  • Crawlers access the file differently depending on the protocol:

    • HTTP/HTTPS: Crawlers fetch the file with a non-conditional HTTP GET request.

    • FTP: Crawlers use a standard RETR (RETRIEVE) command with anonymous login.

The rules within the robots.txt file apply specifically to the host, protocol, and port number where it's hosted.

Examples of Valid robots.txt URLs and their Scope

For example, a robots.txt file served at https://example.com/robots.txt applies to every URL on https://example.com/ (such as https://example.com/products/shoes), but it does not apply to:

  • https://m.example.com/ (a different host)

  • http://example.com/ (a different protocol)

  • https://example.com:8181/ (a different port)

Each of those host, protocol, and port combinations needs its own robots.txt file.
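
If you ever need to work out programmatically where the governing robots.txt for a given page lives, the rule above boils down to "same protocol, host, and port, at the path /robots.txt". Below is a minimal Python sketch of that derivation; the helper name robots_txt_url is invented for illustration, and the URLs are placeholders.

from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    Scope is defined by protocol, host, and port, so the file always
    lives at the root of that combination.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://example.com/products/shoes?id=1"))
# https://example.com/robots.txt

print(robots_txt_url("https://example.com:8181/admin/"))
# https://example.com:8181/robots.txt (a different port means a different robots.txt)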

Handling of Errors and HTTP Status Codes

When Googlebot attempts to access your robots.txt file, the server's HTTP status code response significantly impacts how Google will proceed.

HTTP Status Code Responses and Google's Interpretation

In broad terms, Google handles the status codes as follows:

  • 2xx (success): Google processes the robots.txt file as served.

  • 3xx (redirection): Google follows at least five redirect hops; if no robots.txt file is found after that, Google treats it as a 404.

  • 4xx (client errors), except 429: Google behaves as if no robots.txt file exists, meaning there are no crawl restrictions.

  • 5xx (server errors) and 429: Google temporarily treats the entire site as disallowed and retries the file later. If the robots.txt file remains unreachable for more than 30 days, Google falls back to the last cached copy or, if none is available, assumes there are no crawl restrictions.

Note: If your website needs a temporary suspension of crawling, serve a 503 (Service Unavailable) HTTP status code for all URLs on your site.

Misconfigured 5xx Errors: If Google detects that your server is incorrectly configured to return a 5xx error instead of a 404 (Not Found) for missing pages, Google will treat those 5xx errors as 404s. For example, if the error message on a page returning a 5xx code is "Page not found," Google would interpret this as a 404 error.
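
To make that decision flow concrete, here is a minimal Python sketch of how a well-behaved crawler might translate the status code of a robots.txt fetch into a crawl decision. It only illustrates the behavior described above, not Googlebot's actual implementation, and the names are invented for this example.

from enum import Enum

class CrawlDecision(Enum):
    USE_RULES = "parse the robots.txt file and obey its rules"
    ALLOW_ALL = "behave as if no robots.txt file exists"
    DISALLOW_ALL = "temporarily stop crawling and retry later"

def decision_for_status(status: int) -> CrawlDecision:
    """Map the final HTTP status of a robots.txt fetch to a crawl decision.

    Assumes redirects have already been followed; a fetch that never
    reaches a non-3xx response is treated like a 404 (no restrictions).
    """
    if 200 <= status < 300:
        return CrawlDecision.USE_RULES      # success: obey the downloaded rules
    if status == 429 or status >= 500:
        return CrawlDecision.DISALLOW_ALL   # server trouble: back off and retry
    return CrawlDecision.ALLOW_ALL          # 404 and other 4xx: no restrictions

print(decision_for_status(200).name)  # USE_RULES
print(decision_for_status(404).name)  # ALLOW_ALL
print(decision_for_status(503).name)  # DISALLOW_ALL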

Caching

To optimize efficiency, Google generally caches robots.txt files for up to 24 hours. However, the cache duration may be longer if refreshing the file proves difficult (e.g., due to timeouts or 5xx errors).

  • Shared Cache: The cached robots.txt may be shared among different Google crawlers.

  • Cache Control: You can influence Google's caching behavior using the max-age directive within the Cache-Control HTTP header in your robots.txt response.
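
As an illustration of that Cache-Control interaction, the short Python sketch below extracts max-age from a robots.txt response header and falls back to the generic 24-hour lifetime when the directive is absent. The helper name is invented, and Google's real caching logic is of course more involved.

import re

def robots_cache_lifetime(cache_control: str, default_seconds: int = 24 * 60 * 60) -> int:
    """Return how long (in seconds) to keep a cached robots.txt file,
    honoring a max-age directive and otherwise defaulting to 24 hours."""
    match = re.search(r"\bmax-age=(\d+)", cache_control)
    return int(match.group(1)) if match else default_seconds

# A header like this asks crawlers to refresh the file after one hour:
print(robots_cache_lifetime("public, max-age=3600"))  # 3600

# No max-age directive: fall back to the generic 24-hour lifetime.
print(robots_cache_lifetime("no-transform"))          # 86400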

File Format

Your robots.txt file must adhere to the following format requirements:

  • Encoding: Use UTF-8 encoding.

  • Line Separators: Separate lines with CR, CR/LF, or LF.

  • Invalid Lines: Google ignores invalid lines, as well as a Unicode Byte Order Mark (BOM) at the beginning of the file, and uses only the valid lines. If the downloaded content isn't a valid robots.txt file at all, Google still tries to extract whatever valid rules it can and ignores the rest.

  • Character Encoding: If the encoding isn't UTF-8, Google may ignore unsupported characters, potentially invalidating your rules.

  • File Size Limit: Google currently enforces a 500 kibibytes (KiB) file size limit. Content exceeding this limit is ignored. To reduce file size, consolidate rules or move excluded content to a separate directory.
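
If you want to check a file against these constraints before publishing it, a rough pre-flight check might look like the Python sketch below. It assumes the file is on local disk, and the function name, messages, and path are made up for illustration.

def check_robots_txt(raw: bytes) -> str:
    """Run rough pre-flight checks mirroring the format constraints above."""
    max_size = 500 * 1024  # 500 KiB; content beyond this limit is ignored
    if len(raw) > max_size:
        return "too large: rules beyond 500 KiB would be ignored"
    # A UTF-8 byte order mark at the start of the file is ignored by parsers.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8: unsupported characters may invalidate rules"
    # CR, CR/LF, and LF are all accepted as line separators.
    lines = text.splitlines()
    return f"looks ok: {len(lines)} lines, {len(raw)} bytes"

with open("robots.txt", "rb") as fh:  # illustrative path
    print(check_robots_txt(fh.read()))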

Syntax

A valid line in your robots.txt file follows this structure:

<field>:<value>

  • Field: Specifies the directive (e.g., User-agent, Disallow).

  • Colon (:): Separates the field from the value.

  • Value: Provides the instruction for the field.

Example:

User-agent: Googlebot
Disallow: /private-area/ 

Additional Syntax Rules:

  • Spaces: While spaces are optional, they are recommended for readability. Leading and trailing spaces on a line are ignored.

  • Comments: Use the # character to add comments. Everything after the # on the same line is ignored.

Example with Comments and Spacing:

# This is a comment explaining the rule below
User-agent: *  # Applies to all crawlers
Disallow: /secret-recipes/ # Keep those recipes hidden! 
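
The syntax above is simple enough to parse by hand. The Python sketch below is not Google's parser, just a minimal illustration of the line grammar described in this section: strip the comment, trim whitespace, and split on the first colon. The function name is invented.

def parse_robots_line(line: str):
    """Parse one robots.txt line into a (field, value) pair, or return None
    for blank lines, comments, and lines that aren't <field>:<value>."""
    line = line.split("#", 1)[0]       # everything after '#' is a comment
    line = line.strip()                # leading and trailing spaces are ignored
    if ":" not in line:
        return None
    field, value = line.split(":", 1)  # split on the first colon only
    return field.strip().lower(), value.strip()

print(parse_robots_line("User-agent: *  # Applies to all crawlers"))
# ('user-agent', '*')
print(parse_robots_line("Disallow: /secret-recipes/ # Keep those recipes hidden!"))
# ('disallow', '/secret-recipes/')
print(parse_robots_line("# This is a comment explaining the rule below"))
# None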

By carefully structuring your robots.txt file and understanding how Google interprets its directives, you can effectively control how Google crawls your website, ensuring that sensitive content is protected and that your site's most important pages are indexed correctly.
