
Robots.txt Essentials: Control What Search Engines Crawl

10 min read


Quick Definition: The robots.txt file is a text file at the root of your website (example.com/robots.txt) that tells search engine crawlers which pages they can or can’t access. Think of it as a “Do Not Enter” sign for bots.

Critical to understand: Robots.txt blocks crawling, not indexing. A blocked page can still appear in search results if linked from other sites.

TLDR

Robots.txt controls which pages search engines can crawl, but it doesn’t prevent indexing: blocked pages can still rank if linked externally. Use it to block admin pages, manage crawl budget, and declare your sitemap location. Don’t use it to hide private content or to block CSS and JavaScript files. The critical mistake: accidentally blocking your entire site with a single misplaced rule. Test changes in Search Console’s robots.txt report before deploying, and always keep a backup.


Why Robots.txt Matters

Good uses:

  • Block admin pages (/wp-admin/, /admin/)
  • Block duplicate content (/print/, ?utm_source=)
  • Manage crawl budget (prevent wasting crawls on unimportant pages)
  • Tell Google where your sitemap is

Bad uses:

  • Hiding private content (use password protection or noindex instead)
  • Blocking pages you want to rank (surprisingly common mistake!)
  • Blocking CSS/JavaScript files (breaks Google’s rendering)

Basic Robots.txt Syntax

Simplest Example

User-agent: *
Disallow:

Translation: “All bots can crawl everything.”

Block One Folder

User-agent: *
Disallow: /admin/

Translation: “All bots: Don’t crawl anything in the /admin/ folder.”

Block Entire Site

User-agent: *
Disallow: /

Translation: “All bots: Don’t crawl any page.” (Used for staging sites)

Allow Everything Except One Folder

User-agent: *
Disallow: /private/
Allow: /

Block Specific Bot

User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:

Translation: “Google can’t crawl /no-google/, but all other bots can crawl everything.”
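Per-agent rules like these can be sanity-checked locally with Python’s standard-library `urllib.robotparser` (note that it implements simple prefix matching, not Google’s full wildcard syntax):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:
""".splitlines())

# Googlebot is blocked from /no-google/; every other bot is not
print(rp.can_fetch("Googlebot", "/no-google/page.html"))  # False
print(rp.can_fetch("Bingbot", "/no-google/page.html"))    # True
```

`can_fetch()` answers “may this user-agent crawl this URL?”, which makes it easy to turn a planned rule change into a quick regression check.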


Common User-Agents

User-agent      | Bot
----------------|-----------------------------------
*               | All bots (wildcard)
Googlebot       | Google’s main crawler
Googlebot-Image | Google Images
Bingbot         | Bing search
AhrefsBot       | Ahrefs SEO tool crawler
SemrushBot      | Semrush SEO tool crawler
GPTBot          | OpenAI’s ChatGPT crawler
CCBot           | Common Crawl (used by AI systems)

Example: Block AI crawlers but allow Google:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Advanced Patterns

Wildcards (*)

Block all PDFs:

User-agent: *
Disallow: /*.pdf$

Block all URLs with ? (query parameters):

User-agent: *
Disallow: /*?

Block URLs with specific parameter:

User-agent: *
Disallow: /*?utm_source=
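Google-style wildcard rules like the ones above can be approximated by translating each pattern into a regular expression: `*` becomes “any sequence of characters” and a trailing `$` becomes an end-of-URL anchor. A minimal sketch (the `rule_to_regex` helper is illustrative, not a standard API):

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then restore robots.txt semantics:
    # '*' matches any sequence of characters; a trailing '$' anchors the end.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True: blocked
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: '$' anchors the match
```

This also shows why `$` matters: without it, `/*.pdf` would match any URL merely containing “.pdf”, including ones with query strings appended.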

Allow Overrides Disallow

Google applies the most specific (longest) matching rule, so a more specific Allow beats a broader Disallow. Block an entire folder EXCEPT one file:

User-agent: *
Disallow: /private/
Allow: /private/public-document.html

Sitemap Directive

Always include your sitemap location:

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Multiple sitemaps:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

Why this matters:

  • Tells search engines where to find your sitemap
  • Doesn’t require manual submission to Search Console
  • Speeds up discovery of new content
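Because Sitemap lines stand apart from any User-agent group, pulling them out of a robots.txt file takes only a few lines. A small sketch:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    # Sitemap directives are case-insensitive and can appear anywhere in the file
    return [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]

robots = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
"""
print(sitemap_urls(robots))
```

This is handy for auditing a list of domains: fetch each site’s /robots.txt and confirm every one declares at least one sitemap.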

Real-World Examples

E-commerce Site

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Sitemap: https://shop.example.com/sitemap.xml

WordPress Site

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

Sitemap: https://blog.example.com/sitemap_index.xml

Marketing Site (Block SEO Tool Crawlers)

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Staging Site (Block Everything)

User-agent: *
Disallow: /

# No sitemap because we don't want ANY indexing

Common Mistakes

Blocking CSS and JavaScript

Bad:

User-agent: *
Disallow: /css/
Disallow: /js/

Why it’s bad:

  • Google can’t render your page properly
  • May hurt mobile-friendliness scores
  • Can impact Core Web Vitals assessment

Fix: Remove these blocks or use:

User-agent: *
Allow: /css/
Allow: /js/

Blocking Pages You Want to Rank

Scenario: You want /special-offer/ to rank, but you block it:

User-agent: *
Disallow: /special-offer/

Result: Google never crawls it, never indexes it, never ranks it.

Fix: Remove the Disallow line, or use noindex meta tag if you want crawling but not indexing.

Forgetting the Trailing Slash

Intent: Block only the /admin/ folder

Wrong:

Disallow: /admin

This also blocks /administration/, /admin-tools/, /adminhelp/, etc.

Right:

Disallow: /admin/
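The difference is easy to demonstrate with Python’s standard-library `urllib.robotparser`, which uses the same prefix matching:

```python
from urllib import robotparser

loose, strict = robotparser.RobotFileParser(), robotparser.RobotFileParser()
loose.parse(["User-agent: *", "Disallow: /admin"])    # no trailing slash
strict.parse(["User-agent: *", "Disallow: /admin/"])  # trailing slash

print(loose.can_fetch("Googlebot", "/administration/"))   # False: over-blocks!
print(strict.can_fetch("Googlebot", "/administration/"))  # True: only /admin/ blocked
print(strict.can_fetch("Googlebot", "/admin/panel"))      # False: blocked as intended
```

The trailing slash narrows the prefix so that only URLs inside the /admin/ folder match.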

Expecting Privacy from Robots.txt

Wrong assumption: “If I block it in robots.txt, it won’t appear in Google.”

Reality:

  • Blocked pages can still be indexed if linked from other sites
  • Anyone can view your robots.txt file at yoursite.com/robots.txt
  • Hackers often check robots.txt to find admin pages

For actual privacy:

  • Use password protection (HTTP authentication)
  • Use noindex meta tag
  • Use server-side access controls

Testing on Live Site Without Backup

Mistake: Editing robots.txt directly on production site, accidentally blocking everything:

User-agent: *
Disallow: /

Result: Your entire site disappears from Google within days.

Prevention:

  1. Always keep a backup of your current robots.txt
  2. Test changes with Search Console’s robots.txt report (the legacy robots.txt Tester was retired in 2023)
  3. Use staging environment first

How to Create/Edit Robots.txt

Check If You Have One

Visit yoursite.com/robots.txt in a browser. If you see a 404, you don’t have one.

Create One (Apache/Nginx)

  1. Create a file named robots.txt (all lowercase)
  2. Upload to your site’s root directory (same level as index.html)
  3. Verify at yoursite.com/robots.txt

WordPress

Manual method:

  1. Use FTP/file manager to access root directory
  2. Create robots.txt file
  3. Add your rules

Plugin method:

  • Yoast SEO: Tools → File Editor → Robots.txt
  • Rank Math: General Settings → Edit Robots.txt

Shopify

Shopify auto-generates robots.txt. You can customize it by adding a robots.txt.liquid template:

  1. Admin → Online Store → Themes → Edit code
  2. Add a new template of type robots.txt
  3. Edit the generated robots.txt.liquid

Default Shopify robots.txt:

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /checkout
# ... (more Shopify defaults)

Sitemap: https://yourstore.myshopify.com/sitemap.xml

Testing Your Robots.txt

Google Search Console

  1. Go to Search Console → Settings → robots.txt (the robots.txt report replaced the legacy robots.txt Tester in 2023)
  2. Confirm Google can fetch your file and view the last crawled version
  3. Review any syntax problems the report flags
  4. To test specific URLs against your rules, use a third-party robots.txt tester or a local parser

Manual Testing

Say you want to confirm that /admin/panel is blocked by this rule:

User-agent: Googlebot
Disallow: /admin/

Test the URL yoursite.com/admin/panel against your rules in a tester tool:

  • Blocked: ✓ Correct
  • Allowed: ✗ Pattern doesn’t match — adjust the rule
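You can also script these checks locally with Python’s standard-library `urllib.robotparser` before touching production (keeping in mind that it implements plain prefix matching, not Google’s * and $ wildcards):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /admin/",
])

# Verify the rule does what you intended before deploying it
assert not rp.can_fetch("Googlebot", "/admin/panel")  # blocked: correct
assert rp.can_fetch("Googlebot", "/special-offer/")   # still crawlable
print("robots.txt rules behave as expected")
```

To test the live file instead of a local draft, call `rp.set_url("https://yoursite.com/robots.txt")` followed by `rp.read()`, then run the same `can_fetch()` checks.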


Crawl-Delay Directive (Controversial)

Syntax:

User-agent: *
Crawl-delay: 10

Translation: “Wait 10 seconds between requests.”

Problems:

  • Googlebot ignores it entirely
  • Bingbot respects it
  • Can slow down indexing dramatically

When to use: Only if your server is being overloaded by aggressive bots.

Better solution: Use server rate limiting or contact the bot owner.


Robots.txt vs Meta Robots vs X-Robots-Tag

Method          | Blocks Crawling | Blocks Indexing | Use Case
----------------|-----------------|-----------------|-------------------------------------
Robots.txt      | Yes             | No              | Block access to folders/files
Meta robots tag | No              | Yes             | Prevent specific pages from ranking
X-Robots-Tag    | No              | Yes             | Prevent PDFs/images from ranking

Example: Private page that’s already indexed

<!-- Use meta tag, NOT robots.txt -->
<meta name="robots" content="noindex, nofollow">

Quick Reference

Safe default robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml

Key principles:

  1. Block admin/duplicate/sensitive paths
  2. Don’t block CSS/JavaScript
  3. Always include sitemap directive
  4. Test before deploying
  5. Keep a backup

What Surmado Checks

Surmado Scan looks for:

  • Robots.txt exists and is accessible
  • Sitemap directive is present
  • No accidental Disallow: / blocking entire site
  • CSS/JavaScript files aren’t blocked
  • Syntax errors (missing colons, wrong casing)

Related: XML Sitemaps Explained | Noindex & Nofollow | Crawl Budget

Help Us Improve This Article

Know a better way to explain this? Have a real-world example or tip to share?

Contribute and earn credits:

  • Submit: Get $25 credit (Signal, Scan, or Solutions)
  • If accepted: Get an additional $25 credit ($50 total)
  • Plus: Byline credit on this article
Contribute to This Article