Managing Crawl Budget for Large Sites

Large Site Owner's Guide to Managing Your Crawl Budget

This guide helps you optimize how Google crawls very large and frequently updated websites.

Do you need this guide?

This guide is not for you if:

  • Your site has a small number of pages that change infrequently.

  • Your new or updated pages are typically crawled the same day they are published.

If this describes your situation, simply keep your sitemap up to date and regularly check your index coverage in Google Search Console.

If you have content that's been available for a while but hasn't been indexed, this guide won't help. Instead, use the URL Inspection tool to find out why your page isn't indexed.

Who is this guide for?

This is an advanced guide designed for websites with:

  • Large Size & Moderate Updates: 1 million+ unique pages with weekly content changes.

  • Medium/Large Size & Frequent Updates: 10,000+ unique pages with daily content changes.

  • Significant Undiscovered Content: A large portion of URLs classified as "Discovered - currently not indexed" in Google Search Console.

Note: These numbers are rough estimates.

Understanding Crawl Budget

What is Crawl Budget?

The web is vast. Google can't crawl every URL on every site. This means there's a limit to how much time Googlebot (Google's crawler) spends on each site. We call this a site's crawl budget.

Important: Crawling doesn't equal indexing. Google evaluates each crawled page, deciding whether to index it based on factors like quality and relevance.

Crawl Budget Components:

Two primary elements determine your crawl budget:

  1. Crawl Capacity Limit: How many requests Googlebot can make to your server simultaneously without overloading it. This limit fluctuates depending on:

    • Crawl Health: Fast site responses increase the limit (more parallel requests). Slow responses or server errors decrease it.

    • Google's Limits: Google utilizes vast but finite resources. We balance crawling needs across the entire web.

  2. Crawl Demand: How much Google wants to crawl your site, based on:

    • Perceived Inventory: The number of unique, valuable URLs Google thinks your site has. Duplicate content and unnecessary URLs waste crawl budget. This factor is largely within your control.

    • Popularity: More popular URLs get crawled more frequently to keep their index information fresh.

    • Staleness: Google periodically recrawls to detect changes.

    • Site Events: Major events, like site migrations, trigger increased crawling to reindex content under new URLs.

Crawl Budget in a Nutshell:

Think of crawl budget as the number of URLs Googlebot can and wants to crawl on your site. Even with a high capacity limit, low crawl demand means less frequent visits.

Example:

Imagine two news websites, A and B, both publishing 100 articles daily:

  • Website A:

    • Poorly structured, with many duplicate pages.

    • Articles are poorly written and attract little traffic.

  • Website B:

    • Well-organized with unique, high-quality articles.

    • Articles are shared widely and attract significant traffic.

Googlebot would likely allocate a larger crawl budget to Website B, due to its:

  • Higher crawl demand (popularity and content quality)

  • More efficient use of crawl capacity (less wasted on duplicates)

Best Practices for Optimizing Crawl Budget

You can't directly control your crawl budget, but you can influence it by making your site more crawl-friendly and valuable. Here's how:

1. Manage Your URL Inventory:

Guide Googlebot to the pages you want crawled and indexed.

  • Example:

    • Use a robots.txt file to prevent crawling of:

      • Internal search result pages (e.g., /search?q=keyword)

      • Pages with dynamically generated content that isn't valuable for search engines.

    User-agent: *
    Disallow: /search?
    Disallow: /dynamic-content/

2. Consolidate Duplicate Content:

Focus crawling on unique content, not duplicated URLs.

  • Example:

    • Point all variations of a product page (e.g., different colors) to the main product page with a canonical tag, so Google consolidates its crawling and indexing signals on one URL.

    <link rel="canonical" href="https://www.example.com/product-page" />

3. Use Robots.txt Effectively:

Use robots.txt to block crawling of pages that are important to users but that you don't want to appear in search results.

  • Example:

    • Block infinite-scroll pages that duplicate the content of your paginated pages, so crawl budget isn't wasted on them.

    User-agent: *
    Disallow: /category/page=

Don't use robots.txt for:

  • Temporarily Reallocating Crawl Budget: This tactic is ineffective. Blocking some URLs doesn't shift crawl budget to other pages unless Google is already hitting your site's serving limit.

  • Keeping Pages Out of Search Results: robots.txt blocks crawling, not indexing. For pages you want crawled but kept out of search results, use a noindex directive instead (see the sketch after this list).
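
The noindex rule can be delivered as a robots meta tag or as an X-Robots-Tag HTTP header. Below is a minimal sketch, assuming Flask and a hypothetical /internal-report page; it shows one way to send the header, not a prescription from this guide.

    # Minimal sketch (Flask assumed; the /internal-report route is hypothetical):
    # serve the page normally, but send a noindex directive via the X-Robots-Tag
    # HTTP header. The page must NOT be blocked in robots.txt, or Googlebot will
    # never fetch it and never see the header.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/internal-report")
    def internal_report():
        response = make_response("Useful for users, not needed in search results")
        response.headers["X-Robots-Tag"] = "noindex"
        return response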

4. Handle Removed Pages Correctly:

Return a 404 (Not Found) or 410 (Gone) status code for permanently deleted pages.

  • Example:

    • Implement server-side logic to return the appropriate status code for removed product pages (a minimal sketch follows this list).

    • This tells Google to remove the page from its index and not waste crawl budget revisiting it.
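
As a concrete illustration, here is a minimal sketch in Python using Flask (an assumption; any server stack works the same way). The RETIRED_PRODUCT_IDS set and the get_product() lookup are hypothetical placeholders for your own data layer.

    # Minimal sketch (Flask assumed; RETIRED_PRODUCT_IDS and get_product()
    # are hypothetical placeholders).
    from flask import Flask, abort

    app = Flask(__name__)

    RETIRED_PRODUCT_IDS = {"sku-123", "sku-456"}  # permanently removed products

    def get_product(product_id):
        # Placeholder for a real database lookup; returns None if unknown.
        return None

    @app.route("/products/<product_id>")
    def product_page(product_id):
        if product_id in RETIRED_PRODUCT_IDS:
            abort(410)  # Gone: removed permanently, stop recrawling
        product = get_product(product_id)
        if product is None:
            abort(404)  # Not Found
        return f"Product page for {product_id}"

Both status codes tell Google the page is gone; 410 explicitly signals that the removal is permanent.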

5. Eliminate Soft 404 Errors:

Pages returning a 200 (OK) status code when they should be 404 errors waste crawl budget.

  • Example:

    • Use the Coverage Report in Google Search Console to identify and fix soft 404s.

    • Ensure your server returns a real 404 status code for missing pages rather than a 200 with an error message (see the sketch after this list).
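
A common cause of soft 404s is a catch-all "not found" template served with a 200 status. The sketch below, again assuming Flask, returns a friendly error page with the correct status code.

    # Minimal sketch (Flask assumed): render a friendly error page, but send a
    # real 404 status code. Returning this same body with status 200 is exactly
    # what produces a soft 404.
    from flask import Flask

    app = Flask(__name__)

    @app.errorhandler(404)
    def page_not_found(error):
        return "<h1>Sorry, we couldn't find that page.</h1>", 404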

6. Maintain an Up-to-Date Sitemap:

Your sitemap helps Google discover and prioritize important pages.

  • Example:

    • Use a dynamic sitemap generator that automatically updates when you add or remove content (a sketch of the output it produces follows this list).

    • Submit your sitemap through Google Search Console.
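
The sketch below shows the kind of XML a dynamic generator produces; the example page and date are hypothetical, and in practice the (url, last-modified) pairs would come from your CMS or database each time the sitemap is regenerated.

    # Minimal sketch of a dynamic sitemap generator (the example page is
    # hypothetical; real entries would be read from your CMS or database).
    from datetime import date
    from xml.sax.saxutils import escape

    def build_sitemap(pages):
        """pages: iterable of (url, last_modified_date) tuples."""
        lines = ['<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for url, last_modified in pages:
            lines.append("  <url>")
            lines.append(f"    <loc>{escape(url)}</loc>")
            lines.append(f"    <lastmod>{last_modified.isoformat()}</lastmod>")
            lines.append("  </url>")
        lines.append("</urlset>")
        return "\n".join(lines)

    print(build_sitemap([("https://www.example.com/product-page", date(2024, 1, 15))]))

Keep the <lastmod> values accurate; Google relies on them only when they prove consistently trustworthy.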

Remember: Increasing your crawl budget isn't about tricks; it's about making your site more valuable and easier for Google to crawl. This benefits both you and your users.
