Robots.txt Essentials: Control What Search Engines Crawl
10 min read
Quick Definition: The robots.txt file is a text file at the root of your website (example.com/robots.txt) that tells search engine crawlers which pages they can or can’t access. Think of it as a “Do Not Enter” sign for bots.
Critical to understand: Robots.txt blocks crawling, not indexing. A blocked page can still appear in search results if linked from other sites.
TLDR
Robots.txt controls which pages search engines can crawl, but it doesn’t prevent indexing: blocked pages can still rank if linked externally. Use it to block admin pages, manage crawl budget, and declare your sitemap location. Don’t use it to hide private content or block CSS and JavaScript files. Critical mistake: accidentally blocking your entire site with a single misplaced rule. Test changes with Google Search Console’s robots.txt tester before deploying and always keep a backup.
Why Robots.txt Matters
Good uses:
- Block admin pages (/wp-admin/, /admin/)
- Block duplicate content (/print/, ?utm_source=)
- Manage crawl budget (prevent wasting crawls on unimportant pages)
- Tell Google where your sitemap is
Bad uses:
- Hiding private content (use password protection or noindex instead)
- Blocking pages you want to rank (surprisingly common mistake!)
- Blocking CSS/JavaScript files (breaks Google’s rendering)
Basic Robots.txt Syntax
Simplest Example
User-agent: *
Disallow:
Translation: “All bots can crawl everything.”
Block One Folder
User-agent: *
Disallow: /admin/
Translation: “All bots: Don’t crawl anything in the /admin/ folder.”
Block Entire Site
User-agent: *
Disallow: /
Translation: “All bots: Don’t crawl any page.” (Used for staging sites)
Allow Everything Except One Folder
User-agent: *
Disallow: /private/
Allow: /
Block Specific Bot
User-agent: Googlebot
Disallow: /no-google/
User-agent: *
Disallow:
Translation: “Google can’t crawl /no-google/, but all other bots can crawl everything.”
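You can sanity-check per-bot groups like this programmatically. Python’s standard-library `urllib.robotparser` handles plain prefix rules (note it does not understand Google’s `*`/`$` wildcards), so it works well for a quick check of examples like the one above:

```python
from urllib import robotparser

# The multi-group example from above, fed in as a list of lines.
rules = [
    "User-agent: Googlebot",
    "Disallow: /no-google/",
    "",
    "User-agent: *",
    "Disallow:",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot is blocked from /no-google/, every other bot is not.
print(rp.can_fetch("Googlebot", "https://example.com/no-google/page"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/no-google/page"))    # True
```

The parser matches user-agent names case-insensitively and falls back to the `*` group when no named group applies, which mirrors how real crawlers pick their rule group.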
Common User-Agents
| User-agent | Bot |
|---|---|
| * | All bots (wildcard) |
| Googlebot | Google’s main crawler |
| Googlebot-Image | Google Images |
| Bingbot | Bing search |
| AhrefsBot | Ahrefs SEO tool crawler |
| SemrushBot | Semrush SEO tool crawler |
| GPTBot | OpenAI’s ChatGPT crawler |
| CCBot | Common Crawl (used by AI systems) |
Example: Block AI crawlers but allow Google:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
Advanced Patterns
Wildcards (*)
Block all PDFs:
User-agent: *
Disallow: /*.pdf$
Block all URLs with ? (query parameters):
User-agent: *
Disallow: /*?
Block URLs with specific parameter:
User-agent: *
Disallow: /*?utm_source=
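Google treats `*` as “match any run of characters” and a trailing `$` as “end of URL.” A rough Python sketch of that matching (an illustration of the rules above with a hypothetical helper, not Google’s actual implementation):

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # Hypothetical helper: '*' matches any run of characters,
    # and a trailing '$' anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: '$' anchors the match
```

Note the second URL escapes the rule only because of the `$` anchor; without it, `/*.pdf` would also block PDF URLs that carry query parameters.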
Allow Overrides Disallow
Block entire folder EXCEPT one file:
User-agent: *
Disallow: /private/
Allow: /private/public-document.html
Sitemap Directive
Always include your sitemap location:
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Multiple sitemaps:
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Why this matters:
- Tells search engines where to find your sitemap
- Doesn’t require manual submission to Search Console
- Speeds up discovery of new content
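Because the `Sitemap` directive is case-insensitive and isn’t tied to any `User-agent` group, extracting every declared sitemap is a one-pass scan. A minimal Python sketch (the helper name is hypothetical):

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    # Hypothetical helper: collect every Sitemap directive in the file.
    # The directive is case-insensitive and independent of User-agent groups.
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

sample = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
"""
print(sitemap_urls(sample))
# ['https://example.com/sitemap-pages.xml', 'https://example.com/sitemap-posts.xml']
```

Splitting on the first colon only is deliberate: the URL itself contains `https:`, so a naive `split(":")` would truncate it.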
Real-World Examples
E-commerce Site
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /
Sitemap: https://shop.example.com/sitemap.xml
WordPress Site
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/
Sitemap: https://blog.example.com/sitemap_index.xml
Marketing Site (Block SEO Tool Crawlers)
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Staging Site (Block Everything)
User-agent: *
Disallow: /
# No sitemap. Password-protect staging too: Disallow alone doesn't keep it out of the index.
Common Mistakes
Blocking CSS and JavaScript
Bad:
User-agent: *
Disallow: /css/
Disallow: /js/
Why it’s bad:
- Google can’t render your page properly
- May hurt mobile-friendliness scores
- Can impact Core Web Vitals assessment
Fix: Remove these blocks or use:
User-agent: *
Allow: /css/
Allow: /js/
Blocking Pages You Want to Rank
Scenario: You want /special-offer/ to rank, but you block it:
User-agent: *
Disallow: /special-offer/
Result: Google never crawls it, never indexes it, never ranks it.
Fix: Remove the Disallow line, or use noindex meta tag if you want crawling but not indexing.
Forgetting the Trailing Slash
Intent: Block only the /admin/ folder
Wrong:
Disallow: /admin
This also blocks /administration/, /admin-tools/, /adminhelp/, etc.
Right:
Disallow: /admin/
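The prefix behavior is easy to demonstrate with Python’s `urllib.robotparser`, which uses the same simple prefix matching as the base robots.txt rules:

```python
from urllib import robotparser

# Rule without the trailing slash: a bare prefix.
missing_slash = robotparser.RobotFileParser()
missing_slash.parse(["User-agent: *", "Disallow: /admin"])

# Rule with the trailing slash: limited to the folder.
with_slash = robotparser.RobotFileParser()
with_slash.parse(["User-agent: *", "Disallow: /admin/"])

url = "https://example.com/administration/team"
print(missing_slash.can_fetch("Googlebot", url))  # False: /admin also catches /administration/
print(with_slash.can_fetch("Googlebot", url))     # True: only the /admin/ folder is blocked
```

Both variants still block `/admin/panel`; the trailing slash only stops the rule from spilling over onto sibling paths.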
Expecting Privacy from Robots.txt
Wrong assumption: “If I block it in robots.txt, it won’t appear in Google.”
Reality:
- Blocked pages can still be indexed if linked from other sites
- Anyone can view your robots.txt file at yoursite.com/robots.txt
- Hackers often check robots.txt to find admin pages
For actual privacy:
- Use password protection (HTTP authentication)
- Use noindex meta tag
- Use server-side access controls
Testing on Live Site Without Backup
Mistake: Editing robots.txt directly on production site, accidentally blocking everything:
User-agent: *
Disallow: /
Result: Your entire site disappears from Google within days.
Prevention:
- Always keep a backup of your current robots.txt
- Test changes with Google Search Console’s Robots.txt Tester
- Use staging environment first
How to Create/Edit Robots.txt
Check If You Have One
Visit yoursite.com/robots.txt in a browser. If you see a 404, you don’t have one.
Create One (Apache/Nginx)
- Create a file named robots.txt (all lowercase)
- Upload it to your site’s root directory (same level as index.html)
- Verify at yoursite.com/robots.txt
WordPress
Manual method:
- Use FTP/file manager to access root directory
- Create a robots.txt file
- Add your rules
Plugin method:
- Yoast SEO: Tools → File Editor → Robots.txt
- Rank Math: General Settings → Edit Robots.txt
Shopify
Shopify auto-generates robots.txt. You can customize via:
- Admin → Online Store → Preferences
- Scroll to “Robots.txt”
- Edit template
Default Shopify robots.txt:
User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /checkout
# ... (more Shopify defaults)
Sitemap: https://yourstore.myshopify.com/sitemap.xml
Testing Your Robots.txt
Google Search Console
- Go to Search Console
- Open the robots.txt report (Settings → robots.txt); the old Legacy Tools tester has been retired
- Confirm your file was fetched successfully and parsed without errors
- Use the URL Inspection tool to check whether a specific URL is blocked by robots.txt
Manual Testing
Test if /admin/panel is blocked:
User-agent: Googlebot
Disallow: /admin/
Check the path against the rule: /admin/panel starts with the disallowed prefix /admin/, so the rule applies.
- Blocked: ✓ the pattern matches as intended
- Allowed: ✗ the pattern doesn’t match
Crawl-Delay Directive (Controversial)
Syntax:
User-agent: *
Crawl-delay: 10
Translation: “Wait 10 seconds between requests.”
Problems:
- Googlebot ignores it entirely
- Bingbot respects it
- Can slow down indexing dramatically
When to use: Only if your server is being overloaded by aggressive bots.
Better solution: Use server rate limiting or contact the bot owner.
Robots.txt vs Meta Robots vs X-Robots-Tag
| Method | Blocks Crawling | Blocks Indexing | Use Case |
|---|---|---|---|
| Robots.txt | Yes | No | Block access to folders/files |
| Meta robots tag | No | Yes | Prevent specific pages from ranking |
| X-Robots-Tag | No | Yes | Prevent PDFs/images from ranking |
Example: Private page that’s already indexed
<!-- Use meta tag, NOT robots.txt -->
<meta name="robots" content="noindex, nofollow">
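For non-HTML files such as PDFs and images, where there is no place for a meta tag, the equivalent signal is the X-Robots-Tag HTTP response header sent by your server. A sketch of what the response looks like:

```http
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```

The header is added in your server configuration (for example, Apache or nginx), not in the file itself, and the crawler must be able to fetch the URL to see it, which is another reason not to block such paths in robots.txt.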
Quick Reference
Safe default robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /
Sitemap: https://example.com/sitemap.xml
Key principles:
- Block admin/duplicate/sensitive paths
- Don’t block CSS/JavaScript
- Always include sitemap directive
- Test before deploying
- Keep a backup
What Surmado Checks
Surmado Scan looks for:
- Robots.txt exists and is accessible
- Sitemap directive is present
- No accidental Disallow: / blocking the entire site
- CSS/JavaScript files aren’t blocked
- Syntax errors (missing colons, wrong casing)
→ Related: XML Sitemaps Explained | Noindex & Nofollow | Crawl Budget