
Robots.txt Essentials: Control What Search Engines Crawl

10 min read


Quick Definition: The robots.txt file is a text file at the root of your website (example.com/robots.txt) that tells search engine crawlers which pages they can or can’t access. Think of it as a “Do Not Enter” sign for bots.

Critical to understand: Robots.txt blocks crawling, not indexing. A blocked page can still appear in search results if linked from other sites.

TLDR

Robots.txt controls which pages search engines can crawl, but it doesn’t prevent indexing: blocked pages can still rank if linked externally. Use it to block admin pages, manage crawl budget, and declare your sitemap location. Don’t use it to hide private content or to block CSS and JavaScript files. The critical mistake: accidentally blocking your entire site with a single misplaced rule. Test changes in Search Console’s robots.txt report before deploying, and always keep a backup.


Why Robots.txt Matters

Good uses:

  • Block admin pages (/wp-admin/, /admin/)
  • Block duplicate content (/print/, ?utm_source=)
  • Manage crawl budget (prevent wasting crawls on unimportant pages)
  • Tell Google where your sitemap is

Bad uses:

  • Hiding private content (use password protection or noindex instead)
  • Blocking pages you want to rank (surprisingly common mistake!)
  • Blocking CSS/JavaScript files (breaks Google’s rendering)

Basic Robots.txt Syntax

Simplest Example

User-agent: *
Disallow:

Translation: “All bots can crawl everything.”

Block One Folder

User-agent: *
Disallow: /admin/

Translation: “All bots: Don’t crawl anything in the /admin/ folder.”

Block Entire Site

User-agent: *
Disallow: /

Translation: “All bots: Don’t crawl any page.” (Used for staging sites)

Allow Everything Except One Folder

User-agent: *
Disallow: /private/
Allow: /

Block Specific Bot

User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:

Translation: “Google can’t crawl /no-google/, but all other bots can crawl everything.”
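Per-agent rules like these can be sanity-checked locally with Python’s standard-library `urllib.robotparser` (note that it implements simple prefix matching, not Google’s full wildcard syntax):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:
""".splitlines())

# Googlebot is blocked from /no-google/; every other bot is not
print(rp.can_fetch("Googlebot", "/no-google/page.html"))  # False
print(rp.can_fetch("Bingbot", "/no-google/page.html"))    # True
```

`can_fetch()` answers “may this user-agent crawl this URL?”, which makes it easy to turn a planned rule change into a quick regression check.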


Common User-Agents

User-agent      | Bot
----------------|-----------------------------------
*               | All bots (wildcard)
Googlebot       | Google’s main crawler
Googlebot-Image | Google Images
Bingbot         | Bing search
AhrefsBot       | Ahrefs SEO tool crawler
SemrushBot      | Semrush SEO tool crawler
GPTBot          | OpenAI’s ChatGPT crawler
CCBot           | Common Crawl (used by AI systems)

Example: Block AI crawlers but allow Google:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Advanced Patterns

Wildcards (*)

Block all PDFs:

User-agent: *
Disallow: /*.pdf$

Block all URLs with ? (query parameters):

User-agent: *
Disallow: /*?

Block URLs with specific parameter:

User-agent: *
Disallow: /*?utm_source=
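Google-style wildcard rules like the ones above can be approximated by translating each pattern into a regular expression: `*` becomes “any sequence of characters” and a trailing `$` becomes an end-of-URL anchor. A minimal sketch (the `rule_to_regex` helper is illustrative, not a standard API):

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then restore robots.txt semantics:
    # '*' matches any sequence of characters; a trailing '$' anchors the end.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True: blocked
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: '$' anchors the match
```

This also shows why `$` matters: without it, `/*.pdf` would match any URL merely containing “.pdf”, including ones with query strings appended.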

Allow Overrides Disallow

Google applies the most specific (longest) matching rule, so a more specific Allow beats a broader Disallow. Block an entire folder EXCEPT one file:

User-agent: *
Disallow: /private/
Allow: /private/public-document.html

Sitemap Directive

Always include your sitemap location:

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Multiple sitemaps:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

Why this matters:

  • Tells search engines where to find your sitemap
  • Doesn’t require manual submission to Search Console
  • Speeds up discovery of new content
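Because Sitemap lines stand apart from any User-agent group, pulling them out of a robots.txt file takes only a few lines. A small sketch:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    # Sitemap directives are case-insensitive and can appear anywhere in the file
    return [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]

robots = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
"""
print(sitemap_urls(robots))
```

This is handy for auditing a list of domains: fetch each site’s /robots.txt and confirm every one declares at least one sitemap.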

Real-World Examples

E-commerce Site

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Sitemap: https://shop.example.com/sitemap.xml

WordPress Site

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

Sitemap: https://blog.example.com/sitemap_index.xml

Marketing Site (Block SEO Tool Crawlers)

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Staging Site (Block Everything)

User-agent: *
Disallow: /

# No sitemap because we don't want ANY indexing

Common Mistakes

Blocking CSS and JavaScript

Bad:

User-agent: *
Disallow: /css/
Disallow: /js/

Why it’s bad:

  • Google can’t render your page properly
  • May hurt mobile-friendliness scores
  • Can impact Core Web Vitals assessment

Fix: Remove these blocks or use:

User-agent: *
Allow: /css/
Allow: /js/

Blocking Pages You Want to Rank

Scenario: You want /special-offer/ to rank, but you block it:

User-agent: *
Disallow: /special-offer/

Result: Google never crawls it, never indexes it, never ranks it.

Fix: Remove the Disallow line, or use noindex meta tag if you want crawling but not indexing.

Forgetting the Trailing Slash

Intent: Block only the /admin/ folder

Wrong:

Disallow: /admin

This also blocks /administration/, /admin-tools/, /adminhelp/, etc.

Right:

Disallow: /admin/
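The difference is easy to demonstrate with Python’s standard-library `urllib.robotparser`, which uses the same prefix matching:

```python
from urllib import robotparser

loose, strict = robotparser.RobotFileParser(), robotparser.RobotFileParser()
loose.parse(["User-agent: *", "Disallow: /admin"])    # no trailing slash
strict.parse(["User-agent: *", "Disallow: /admin/"])  # trailing slash

print(loose.can_fetch("Googlebot", "/administration/"))   # False: over-blocks!
print(strict.can_fetch("Googlebot", "/administration/"))  # True: only /admin/ blocked
print(strict.can_fetch("Googlebot", "/admin/panel"))      # False: blocked as intended
```

The trailing slash narrows the prefix so that only URLs inside the /admin/ folder match.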

Expecting Privacy from Robots.txt

Wrong assumption: “If I block it in robots.txt, it won’t appear in Google.”

Reality:

  • Blocked pages can still be indexed if linked from other sites
  • Anyone can view your robots.txt file at yoursite.com/robots.txt
  • Hackers often check robots.txt to find admin pages

For actual privacy:

  • Use password protection (HTTP authentication)
  • Use noindex meta tag
  • Use server-side access controls

Testing on Live Site Without Backup

Mistake: Editing robots.txt directly on production site, accidentally blocking everything:

User-agent: *
Disallow: /

Result: Your entire site disappears from Google within days.

Prevention:

  1. Always keep a backup of your current robots.txt
  2. Test changes with Search Console’s robots.txt report (the legacy robots.txt Tester was retired in 2023)
  3. Use staging environment first

How to Create/Edit Robots.txt

Check If You Have One

Visit yoursite.com/robots.txt in a browser. If you see a 404, you don’t have one.

Create One (Apache/Nginx)

  1. Create a file named robots.txt (all lowercase)
  2. Upload to your site’s root directory (same level as index.html)
  3. Verify at yoursite.com/robots.txt

WordPress

Manual method:

  1. Use FTP/file manager to access root directory
  2. Create robots.txt file
  3. Add your rules

Plugin method:

  • Yoast SEO: Tools → File Editor → Robots.txt
  • Rank Math: General Settings → Edit Robots.txt

Shopify

Shopify auto-generates robots.txt. You can customize it by adding a robots.txt.liquid template:

  1. Admin → Online Store → Themes → Edit code
  2. Add a new template of type robots.txt
  3. Edit the generated robots.txt.liquid

Default Shopify robots.txt:

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /checkout
# ... (more Shopify defaults)

Sitemap: https://yourstore.myshopify.com/sitemap.xml

Testing Your Robots.txt

Google Search Console

  1. Go to Search Console → Settings → robots.txt (the robots.txt report replaced the legacy robots.txt Tester in 2023)
  2. Confirm Google can fetch your file and view the last crawled version
  3. Review any syntax problems the report flags
  4. To test specific URLs against your rules, use a third-party robots.txt tester or a local parser

Manual Testing

Say you want to confirm that /admin/panel is blocked by this rule:

User-agent: Googlebot
Disallow: /admin/

Test the URL yoursite.com/admin/panel against your rules in a tester tool:

  • Blocked: ✓ Correct
  • Allowed: ✗ Pattern doesn’t match — adjust the rule
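You can also script these checks locally with Python’s standard-library `urllib.robotparser` before touching production (keeping in mind that it implements plain prefix matching, not Google’s * and $ wildcards):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /admin/",
])

# Verify the rule does what you intended before deploying it
assert not rp.can_fetch("Googlebot", "/admin/panel")  # blocked: correct
assert rp.can_fetch("Googlebot", "/special-offer/")   # still crawlable
print("robots.txt rules behave as expected")
```

To test the live file instead of a local draft, call `rp.set_url("https://yoursite.com/robots.txt")` followed by `rp.read()`, then run the same `can_fetch()` checks.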


Crawl-Delay Directive (Controversial)

Syntax:

User-agent: *
Crawl-delay: 10

Translation: “Wait 10 seconds between requests.”

Problems:

  • Googlebot ignores it entirely
  • Bingbot respects it
  • Can slow down indexing dramatically

When to use: Only if your server is being overloaded by aggressive bots.

Better solution: Use server rate limiting or contact the bot owner.


Robots.txt vs Meta Robots vs X-Robots-Tag

Method          | Blocks Crawling | Blocks Indexing | Use Case
----------------|-----------------|-----------------|-------------------------------------
Robots.txt      | Yes             | No              | Block access to folders/files
Meta robots tag | No              | Yes             | Prevent specific pages from ranking
X-Robots-Tag    | No              | Yes             | Prevent PDFs/images from ranking

Example: Private page that’s already indexed

<!-- Use meta tag, NOT robots.txt -->
<meta name="robots" content="noindex, nofollow">

Quick Reference

Safe default robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml

Key principles:

  1. Block admin/duplicate/sensitive paths
  2. Don’t block CSS/JavaScript
  3. Always include sitemap directive
  4. Test before deploying
  5. Keep a backup

What Surmado Checks

Surmado Scan looks for:

  • Robots.txt exists and is accessible
  • Sitemap directive is present
  • No accidental Disallow: / blocking entire site
  • CSS/JavaScript files aren’t blocked
  • Syntax errors (missing colons, wrong casing)

Related: XML Sitemaps Explained | Noindex & Nofollow | Crawl Budget

Help Us Improve This Article

Know a better way to explain this? Have a real-world example or tip to share?

Contribute and earn credits:

  • Submit: Get $25 credit (Signal, Scan, or Solutions)
  • If accepted: Get an additional $25 credit ($50 total)
  • Plus: Byline credit on this article
Contribute to This Article