Google's Interpretation of Robots.txt
How Google Interprets the robots.txt Specification
Google's automated web crawlers, like Googlebot, use the Robots Exclusion Protocol (REP) to determine which parts of a website they are allowed to crawl. This means that before crawling a site, Googlebot will download and parse the site's robots.txt file to identify any access restrictions. It's important to note that the REP isn't applicable to all of Google's crawlers. For instance, user-controlled crawlers (like those used for feed subscriptions) or crawlers focused on user safety (like those analyzing for malware) may not adhere to the REP.
This document provides a detailed explanation of how Google interprets the REP, expanding on the original standard outlined in RFC 9309.
What is a robots.txt File?
A robots.txt file acts as a set of instructions for web crawlers. If you want to prevent crawlers from accessing certain sections of your website, you can create a robots.txt file with specific rules. It's a simple text file placed in the root directory of your website that uses a specific syntax to define which crawlers can access which parts of your site.
Example of a robots.txt file
Let's say you have an e-commerce website and you don't want search engines to index your internal administrative pages. Here's how your robots.txt file might look:
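One plausible version of such a file (the /product-prototypes/ path is a hypothetical stand-in for the prototype-image directory):

```
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /product-prototypes/
```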
In this example:

- The first group of rules applies to all web crawlers (User-agent: *) and prevents them from accessing any URL that starts with /admin/ or /private/.
- The second group specifically targets Google's image crawler (User-agent: Googlebot-Image) and disallows access to a directory containing product prototype images.
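You can check rules like these locally with Python's standard urllib.robotparser. This is a sketch under the assumptions of the example above; the /product-prototypes/ directory is a hypothetical path, not one from the original rules.

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the e-commerce example; /product-prototypes/ is a
# hypothetical directory for the prototype images.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /product-prototypes/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)
parser.modified()  # mark the rules as loaded so can_fetch() gives answers

# Any crawler is blocked from /admin/, but product pages stay crawlable.
print(parser.can_fetch("*", "https://example.com/admin/settings"))
print(parser.can_fetch("*", "https://example.com/products/widget"))
print(parser.can_fetch("Googlebot-Image",
                       "https://example.com/product-prototypes/x.png"))
```

Note that the more specific Googlebot-Image group replaces, rather than extends, the wildcard group for that crawler, which matches how grouped rules work in the REP.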
New to robots.txt?
If you're just getting started with robots.txt, we recommend checking out these resources:
Intro to robots.txt: [Link to Google's introductory guide]
Tips for creating a robots.txt file: [Link to Google's best practices guide]
File Location and Range of Validity
For Google to find and interpret your robots.txt file correctly, you must place it in the top-level directory of your site, using a supported protocol (HTTP, HTTPS, or FTP for Google Search).
Remember:
- The URL for the robots.txt file is case-sensitive.
- Crawlers access the file differently depending on the protocol:
  - HTTP/HTTPS: crawlers use a non-conditional GET request.
  - FTP: crawlers use a standard RETR (RETRIEVE) command with anonymous login.
The rules within the robots.txt file apply specifically to the host, protocol, and port number where it's hosted.
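This scoping can be made concrete with a small Python sketch (a simplified illustration of the rule, not Google's actual logic) that derives the governing robots.txt URL for any page URL:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    The file always lives at the root of the same scheme, host, and
    port; a different subdomain, protocol, or port needs its own file.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/folder/page.html"))
# https://example.com/robots.txt
print(robots_txt_url("https://example.com:8181/shop/"))
# https://example.com:8181/robots.txt
```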
Examples of Valid robots.txt URLs and their Scope
| robots.txt URL | Valid for | Not valid for |
| --- | --- | --- |
| https://example.com/robots.txt | All files under https://example.com/, including subdirectories | Other subdomains (e.g., https://blog.example.com/), different protocols (e.g., http://example.com/), or non-standard port numbers (e.g., https://example.com:8181/) |
| https://www.example.com/robots.txt | All files under https://www.example.com/ | https://example.com/, https://shop.www.example.com/ |
| https://example.com/folder/robots.txt | Not a valid location; crawlers won't find it here | N/A |
| https://www.exämple.com/robots.txt | https://www.exämple.com/ and its punycode equivalent (e.g., https://xn--exmple-cua.com/) | https://www.example.com/ |
| ftp://example.com/robots.txt | All files accessible via FTP on ftp://example.com/ | https://example.com/ |
| https://212.96.82.21/robots.txt | Only crawling with the IP address 212.96.82.21 as the host name | https://example.com/ (even if hosted on that IP) |
| https://example.com:443/robots.txt | https://example.com:443/ and the equivalent default-port https://example.com/ | https://example.com:444/ |
| https://example.com:8181/robots.txt | Only content on the non-standard port 8181 (e.g., https://example.com:8181/) | https://example.com/ |
Handling of Errors and HTTP Status Codes
When Googlebot attempts to access your robots.txt file, the server's HTTP status code response significantly impacts how Google will proceed.
HTTP Status Code Responses and Google's Interpretation
2xx (Success)
Indicates the robots.txt file was fetched successfully. Google will process the directives in the robots.txt file as provided.
3xx (Redirection)
The server is redirecting the request for the robots.txt file. Google will follow up to five redirects; if the robots.txt file is still not found, Google treats this as a 404 error. Google does not follow logical redirects within the robots.txt file itself (e.g., redirects using frames, JavaScript, or meta refresh tags).
4xx (Client Errors)
Indicates an issue with the client's request (e.g., file not found). Google will generally treat all 4xx errors, except for 429 (Too Many Requests), as if there's no robots.txt file present, assuming no crawl restrictions. Important: do not use 401 (Unauthorized) or 403 (Forbidden) to manage crawl rate; use the appropriate methods for crawl rate management instead.
5xx (Server Errors)
Indicates a server-side error prevented fulfilling the request; Google also treats the 429 (Too Many Requests) status code this way. Google temporarily interprets 5xx errors, including 429, as a full disallow and will retry fetching the robots.txt file. For prolonged outages (over 30 days), Google will use its last cached copy of the robots.txt file; if nothing is available, Google assumes no crawl restrictions.
Other Errors
Issues like DNS problems, network timeouts, invalid responses, or connection interruptions are treated the same way as server errors.
Note: If your website needs a temporary suspension of crawling, serve a 503 (Service Unavailable) HTTP status code for all URLs on your site.
Misconfigured 5xx Errors: If Google detects that your server is incorrectly configured to return a 5xx error instead of a 404 (Not Found) for missing pages, Google will treat those 5xx errors as 404s. For example, if the error message on a page returning a 5xx code is "Page not found," Google would interpret this as a 404 error.
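The status-code handling above can be sketched as a small decision function. This is a simplified model of the behavior described in this section, not Google's implementation:

```python
def robots_fetch_outcome(status: int) -> str:
    """Map an HTTP status for a robots.txt fetch to crawl behavior."""
    if 200 <= status < 300:
        return "process rules"            # file fetched; apply its directives
    if 300 <= status < 400:
        return "follow redirect"          # up to five hops, then treated as 404
    if status == 429:
        return "temporary full disallow"  # 429 is handled like a server error
    if 400 <= status < 500:
        return "no restrictions"          # as if no robots.txt file exists
    if 500 <= status < 600:
        return "temporary full disallow"  # retried; cached copy used after 30 days
    return "treat as server error"        # DNS/network problems, etc.

print(robots_fetch_outcome(404))  # no restrictions
print(robots_fetch_outcome(503))  # temporary full disallow
```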
Caching
To optimize efficiency, Google generally caches robots.txt files for up to 24 hours. However, the cache duration may be longer if refreshing the file proves difficult (e.g., due to timeouts or 5xx errors).

- Shared cache: The cached robots.txt may be shared among different Google crawlers.
- Cache control: You can influence Google's caching behavior using the max-age directive within the Cache-Control HTTP header in your robots.txt response.
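As a sketch of how such a header is read on the client side, the following extracts max-age from a Cache-Control value, falling back to the 24-hour default mentioned above (this is an illustrative helper, not Google's parser):

```python
import re

def max_age_seconds(cache_control: str, default: int = 24 * 3600) -> int:
    """Extract max-age from a Cache-Control header value.

    Falls back to 24 hours, mirroring the usual cache window
    described above.
    """
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else default

print(max_age_seconds("public, max-age=86400"))  # 86400
print(max_age_seconds("no-store"))               # 86400 (default)
```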
File Format
Your robots.txt file must adhere to the following format requirements:

- Encoding: Use UTF-8 encoding.
- Line separators: Separate lines with CR, CR/LF, or LF.
- Invalid lines: Google ignores invalid lines, including the Unicode Byte Order Mark (BOM) at the beginning of the file; only valid lines are used. If the downloaded content isn't a valid robots.txt file, Google will attempt to extract rules and ignore the rest.
- Character encoding: If the encoding isn't UTF-8, Google may ignore unsupported characters, potentially invalidating your rules.
- File size limit: Google currently enforces a 500 kibibyte (KiB) file size limit. Content exceeding this limit is ignored. To reduce file size, consolidate rules or move excluded content to a separate directory.
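These rules can be approximated in a short decoding sketch: drop a leading BOM, ignore undecodable bytes, and truncate at 500 KiB. This mimics the lenient handling described above; it is not Google's actual parser.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, per the limit stated above

def effective_robots_content(raw: bytes) -> str:
    """Decode fetched robots.txt bytes the way a lenient parser might:
    strip a UTF-8 BOM, skip undecodable bytes, truncate at 500 KiB.
    """
    clipped = raw[:MAX_ROBOTS_BYTES]
    return clipped.decode("utf-8-sig", errors="ignore")

# An oversized file with a BOM: the rules at the top still survive.
oversized = b"\xef\xbb\xbfUser-agent: *\n" + b"# filler\n" * 100_000
text = effective_robots_content(oversized)
print(text.startswith("User-agent"))  # True (BOM removed)
```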
Syntax
A valid line in your robots.txt file follows the structure "field: value", where:

- Field: Specifies the directive (e.g., User-agent, Disallow).
- Colon (:): Separates the field from the value.
- Value: Provides the instruction for the field.
Example:
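For instance (the /nogooglebot/ path is purely illustrative):

```
User-agent: Googlebot
Disallow: /nogooglebot/
```

Here User-agent is the field and Googlebot is its value; the Disallow line then blocks that crawler from any URL starting with /nogooglebot/.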
Additional Syntax Rules:
- Spaces: While spaces are optional, they are recommended for readability. Leading and trailing spaces on a line are ignored.
- Comments: Use the # character to add comments. Everything after the # on the same line is ignored.
Example with Comments and Spacing:
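For example (the /drafts/ directory is a hypothetical path):

```
# Keep all crawlers out of the drafts area
User-agent: *        # applies to every crawler
Disallow: /drafts/   # everything under /drafts/ is off-limits
```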
By carefully structuring your robots.txt file and understanding how Google interprets its directives, you can effectively control how Google crawls your website, ensuring that sensitive content is protected and that your site's most important pages are indexed correctly.