Crawl Budget: Why Google Doesn't Index All Your Pages

9 min read

Quick Definition: Crawl budget is the number of pages Google will crawl on your site within a given timeframe (usually measured per day). If your site has 10,000 pages but Google only crawls 100/day, it'll take 100 days to index everything, assuming you don't add new pages in the meantime.

Key insight: Small sites (under 1,000 pages) rarely need to worry about crawl budget. Large sites, e-commerce stores, and news sites should optimize it.

TLDR

Crawl budget is how many pages Google will crawl on your site per day. Small sites under 1,000 pages don't need to worry: Google crawls them fully within days. Large sites waste budget on duplicate content, redirect chains, broken links, and infinite filter combinations. One e-commerce site blocked low-value filter pages and went from 500 products crawled daily to 2,000, cutting new product indexing from months to one week. Optimize by submitting an XML sitemap, blocking low-value pages, and improving server speed.


How Crawl Budget Works

Google’s crawler (Googlebot) has limited resources. It decides:

  1. How many pages to crawl on your site (crawl rate)
  2. Which pages to prioritize (crawl demand)

Crawl rate limit:

  • Determined by your server’s capacity
  • Google won’t crawl so fast it crashes your server
  • Higher for sites with fast servers and good hosting

Crawl demand:

  • How popular is the page? (traffic, backlinks)
  • How frequently does it update?
  • Is it already indexed and ranking?

Crawl budget = Rate limit × Demand


Who Needs to Care About Crawl Budget?

You SHOULD optimize if:

  • E-commerce site with 10,000+ products
  • News site publishing 50+ articles/day
  • Site with millions of pages (large directories, databases)
  • International site with many language/country variations
  • Site with heavy URL parameters (filters, sorts, sessions)
  • Site suffering from slow indexing (new pages take weeks to appear)

You probably DON’T need to worry if:

  • Blog with under 1,000 pages
  • Small business site (5-50 pages)
  • Portfolio or brochure site
  • New site with limited content

Google’s own guidance: Sites under 1,000 URLs are crawled efficiently without intervention.


What Wastes Crawl Budget

1. Duplicate Content

Problem:

example.com/product/blue-widget
example.com/product/blue-widget?ref=homepage
example.com/product/blue-widget?sort=price
example.com/product/blue-widget?color=blue

Google crawls 4 URLs, but they’re all the same content.

Fix:

  • Use canonical tags pointing to /product/blue-widget
  • Block crawl-wasting parameters in robots.txt, e.g. Disallow: /*? (this blocks every parameterized URL, so use it carefully)
  • Note: Search Console's URL Parameters tool used to handle this, but Google retired it in 2022
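A canonical tag, placed in the head of every parameterized variant, might look like this (URL taken from the example above):

```html
<!-- In the <head> of every variant of the page, including
     /product/blue-widget?ref=homepage and ?sort=price -->
<link rel="canonical" href="https://example.com/product/blue-widget">
```

All four URLs then consolidate their signals onto the one canonical URL.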

2. Low-Quality/Thin Pages

Examples:

  • Empty category pages
  • “No results found” search pages
  • Paginated pages with minimal content
  • Automatically generated doorway pages

Fix:

  • Noindex thin pages
  • Consolidate content
  • Use robots.txt to block crawling

3. Soft 404s (Fake 404s)

Problem: Pages that don’t exist but return 200 OK instead of 404 Not Found.

Example:

GET /this-page-doesnt-exist
Response: 200 OK
Body: "Sorry, page not found"

Google crawls these thinking they’re real pages, wasting budget.

Fix: Return proper 404 status codes for missing pages.
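As a sketch of the fix (the route and catalog here are hypothetical), the point is to send the 404 status code along with the error body, never 200:

```python
# Hypothetical catalog; in practice this would be a database lookup.
PRODUCTS = {"blue-widget": "Blue Widget product page"}

def handle_request(path):
    """Return (status_code, body) for a request path.

    A missing page must return 404 with the error body.
    Returning 200 + "not found" text is a soft 404.
    """
    prefix = "/product/"
    if path.startswith(prefix) and path[len(prefix):] in PRODUCTS:
        return 200, PRODUCTS[path[len(prefix):]]
    return 404, "Sorry, page not found"
```

The body can stay user-friendly; only the status code needs to change.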

4. Redirect Chains

Problem:

Page A → 301 → Page B → 301 → Page C → 301 → Page D

Google must crawl 4 URLs to reach the final destination.

Fix: Redirect directly:

Page A → 301 → Page D
Page B → 301 → Page D
Page C → 301 → Page D
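If you maintain redirects as a source-to-target map, collapsing chains can be automated; a minimal cycle-safe sketch (the map format is an assumption):

```python
def flatten_redirects(redirects):
    """Rewrite a {source: target} redirect map so that every
    source points directly at its final destination."""
    flat = {}
    for src in redirects:
        seen = {src}           # guard against redirect loops
        dst = redirects[src]
        while dst in redirects and dst not in seen:
            seen.add(dst)
            dst = redirects[dst]
        flat[src] = dst
    return flat
```

Run it over your redirect config whenever you add a new redirect, so A → B → C chains never accumulate.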

5. Infinite Spaces (Faceted Navigation)

Problem: E-commerce filters creating millions of combinations:

/shoes
/shoes?color=red
/shoes?color=red&size=10
/shoes?color=red&size=10&brand=nike
/shoes?color=red&size=10&brand=nike&price=50-100
...

Fix:

  • Use noindex on filtered pages
  • Implement rel="canonical" to main category
  • Block filter parameters in robots.txt
  • Use AJAX filters that don't change the URL
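For example, blocking the filter parameters from the URLs above in robots.txt (parameter names are from the example; adjust to your own):

```text
User-agent: *
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /*?*brand=
Disallow: /*?*price=
```

Googlebot supports the * wildcard in robots.txt paths, so each rule matches the parameter wherever it appears in the query string.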

6. Broken Internal Links

Problem: Internal links pointing to non-existent pages.

Why it wastes budget: Google crawls the 404, gets nothing useful, but still counts it against your budget.

Fix:

  • Run regular broken link audits (Screaming Frog, Ahrefs)
  • Fix internal 404s (update links or redirect)

7. Orphaned Pages

Problem: Pages with zero internal links pointing to them.

Why it matters: If Google can’t find the page through your site navigation, it may never crawl it (unless it has external backlinks).

Fix:

  • Add pages to your sitemap
  • Link to them from relevant pages
  • Check for orphans with crawl tools
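Crawl tools find orphans by comparing your full page list (e.g. the sitemap) against what's reachable by following links from the homepage; a simplified sketch (the link-graph format is an assumption):

```python
from collections import deque

def find_orphans(all_pages, links, start="/"):
    """Return pages from all_pages (e.g. your sitemap) that are
    unreachable by following internal links from the start page."""
    seen = {start}
    queue = deque([start])
    while queue:                       # breadth-first crawl of the link graph
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(all_pages) - seen)
```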

How to Optimize Crawl Budget

1. Submit an XML Sitemap

Why it helps: Tells Google exactly which pages exist and how often they change.

How:

  • Generate sitemap (most CMS do this automatically)
  • Submit via Google Search Console
  • Keep it updated (remove deleted pages, add new ones)

Sitemap priorities:

<url>
  <loc>https://example.com/important-page</loc>
  <priority>1.0</priority>
  <changefreq>daily</changefreq>
</url>

Note: These fields are hints at best. Google has said it ignores priority and changefreq entirely, and uses lastmod only when it's kept accurate.
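If your CMS doesn't generate a sitemap, a minimal one can be built with Python's standard library (the URL list is a placeholder):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(page_urls):
    """Build a minimal sitemap XML document from a list of page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="unicode")
```

Write the result to /sitemap.xml and submit that URL in Search Console.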

2. Fix Crawl Errors

Check Google Search Console:

  • Coverage → Errors
  • Look for server errors (500, 503)
  • Fix broken redirects
  • Resolve DNS issues

Common errors:

  • Server error (5xx)
  • Redirect error
  • Submitted URL not found (404)

3. Improve Site Speed

Why it matters: Faster servers = Google can crawl more pages in the same time.

Optimizations:

  • Upgrade hosting (shared → VPS → dedicated)
  • Enable gzip compression
  • Optimize database queries
  • Use a CDN for static assets
  • Reduce server response time (aim for <200ms)

Check speed:

  • Google Search Console → Settings → Crawl Stats
  • Shows avg response time, crawl requests/day

4. Use Robots.txt Strategically

Block low-value pages:

User-agent: *
Disallow: /search?
Disallow: /filter?
Disallow: /cart/
Disallow: /checkout/
Disallow: /admin/

Allow directives are only needed as exceptions inside the same User-agent group; everything not disallowed is crawlable by default:

Allow: /products/
Allow: /blog/

5. Manage URL Parameters

Google retired Search Console's URL Parameters tool in 2022, so parameter handling now lives on your site:

  • Tracking (utm_source): point canonical tags at the clean URL
  • Sorts (price-low-high): canonicalize to the unsorted URL or block in robots.txt
  • Filters (color=red): canonicalize to the main category page
  • Pagination (page=2): leave crawlable so deep pages stay discoverable

6. Update Content Regularly

Why: Google prioritizes crawling pages that change frequently.

Strategy:

  • Refresh old blog posts (add new info, update dates)
  • Keep product descriptions current
  • Remove outdated seasonal content
  • Publish new content consistently

Evidence Google is crawling:

  • Google Search Console → Settings → Crawl Stats
  • Check “Total crawl requests” over time

7. Internal Linking

Why it helps: Google discovers pages by following links. More internal links = easier discovery.

Best practices:

  • Link to new pages from high-authority pages (homepage, popular posts)
  • Use descriptive anchor text
  • Don’t bury important pages 5+ clicks deep
  • Create hub pages linking to related content

8. Monitor and Adjust Crawl Rate

Google Search Console → Settings → Crawl stats:

  • Shows current crawl rate (requests/day)
  • You can't raise it manually, and Google retired the manual rate limiter in 2024; Googlebot now throttles itself based on how your server responds

If crawl rate is too low:

  • Improve server speed
  • Fix crawl errors
  • Add internal links to important pages
  • Update content more frequently

Checking Your Crawl Budget

Google Search Console

Settings → Crawl Stats:

  • Total crawl requests: Pages crawled per day
  • Total download size: Data transferred
  • Average response time: Server speed
  • Crawl requests by status: 200, 404, 301, etc.

What good stats look like:

  • Crawl requests increasing over time (if adding content)
  • Most requests returning 200 OK
  • Low 404 and 500 errors
  • Average response time under 500ms

Red flags:

  • Decreasing crawl requests (Google losing interest)
  • High 500 errors (server issues)
  • Slow response times (> 1 second)

Server Logs

Advanced: Analyze server logs to see exactly what Googlebot crawls.

Tools:

  • Screaming Frog Log File Analyzer
  • Splunk
  • Custom scripts (grep/awk)

What to look for:

  • Which pages Google crawls most
  • Pages Google never crawls (orphans)
  • Crawl frequency per section
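A custom script can be as small as counting Googlebot hits per path (this assumes common/combined-format access-log lines; adjust parsing to your log format):

```python
from collections import Counter

def googlebot_hits(log_lines):
    """Count Googlebot requests per URL path from access-log lines
    in the common/combined format: ... "GET /path HTTP/1.1" ..."""
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        try:
            request = line.split('"')[1]      # e.g. 'GET /shoes HTTP/1.1'
            counts[request.split()[1]] += 1   # tally the request path
        except IndexError:
            continue                          # skip malformed lines
    return counts
```

Note that the user-agent string can be spoofed; for real audits, also verify the requesting IP resolves back to Google before trusting a line.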

Case Study: E-commerce Site

Problem:

  • 50,000 product pages
  • Google crawling 500 pages/day
  • New products taking 3+ months to index

Investigation:

  • 70% of crawl budget wasted on filter pages (/shoes?color=red&size=10...)
  • 15% wasted on session IDs (/product?session=abc123)
  • 10% on broken images and CSS files

Solution:

  1. Noindexed all filter combination pages
  2. Blocked session parameters in robots.txt
  3. Fixed broken links
  4. Submitted product-only sitemap

Result:

  • Crawl budget shifted to actual product pages
  • Google now crawling 2,000+ products/day
  • New products indexed within 1 week

Common Myths

Myth: “More pages = better SEO”

Reality: 10,000 thin pages waste crawl budget. 100 high-quality pages rank better.

Myth: “I can increase crawl budget by requesting it”

Reality: Google sets crawl budget based on your site’s authority, server speed, and content quality. You can’t manually increase it.

Myth: “XML sitemaps increase crawl budget”

Reality: Sitemaps help Google discover pages, but don’t increase the total number of pages crawled per day. They help prioritize WHICH pages get crawled.

Myth: “Small sites need to optimize crawl budget”

Reality: If your site has under 1,000 pages, Google crawls it fully within days. Don’t waste time optimizing.


Quick Reference

Crawl budget wasters:

  • Duplicate content
  • Redirect chains
  • Soft 404s
  • URL parameters (filters, sorts, tracking)
  • Slow server response
  • Broken links

Crawl budget optimizations:

  • Submit XML sitemap
  • Use robots.txt to block low-value pages
  • Fix crawl errors (500s, redirects)
  • Improve server speed
  • Manage URL parameters in Search Console
  • Add internal links to important pages

What Surmado Checks

Surmado Scan looks for:

  • Crawl errors (500, 404, redirect chains)
  • Duplicate content wasting crawl budget
  • URL parameters creating infinite spaces
  • Slow server response times
  • Orphaned pages not linked internally

Related: Robots.txt Essentials | XML Sitemaps Explained | Server Response Codes

Next Steps

Run Surmado Scan to optimize crawl efficiency

View all Scan features →
