Your robots.txt file is one of the most powerful — and dangerous — files on your website. A single incorrect line can accidentally block Google from crawling your entire site, wiping your rankings overnight. Yet most site owners set it up once and never check it again. This guide explains exactly how robots.txt works, what to block, what to never block, and the critical mistakes that can destroy your SEO.
What Is a robots.txt File?
A robots.txt file is a plain text file placed at the root of your website (e.g. https://yoursite.com/robots.txt) that tells search engine crawlers which pages or directories they are and are not allowed to crawl. It follows a standard called the Robots Exclusion Protocol.
Googlebot — Google's crawler — fetches your robots.txt file before crawling any other page on your site. If your robots.txt is misconfigured or missing, Googlebot still crawls your site but may waste crawl budget on pages you would prefer it to skip.
Google's own robots.txt documentation describes it as the way to tell search engine crawlers which URLs they can access on your site, which makes it one of the most fundamental technical SEO files.
The Basic robots.txt Syntax
A robots.txt file is made up of simple directives in plain text:
- `User-agent:` specifies which crawler the rule applies to. `User-agent: *` means all crawlers; `User-agent: Googlebot` means Google specifically.
- `Disallow:` tells the crawler not to visit a specific path. `Disallow: /admin/` blocks the admin directory.
- `Allow:` explicitly permits access to a path — useful when a parent directory is blocked but one subdirectory should be crawlable.
- `Sitemap:` tells all crawlers the URL of your XML sitemap.
Example of a basic robots.txt:
```
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /storage/
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
```
What You Should Block in robots.txt
Block pages that offer no value to search engine users and would waste Googlebot's crawl budget:
- `/wp-admin/` or `/admin/` — admin dashboards are not for public users
- `/login/` — login and authentication pages
- `/storage/` or `/private/` — private server directories
- `/cart/` and `/checkout/` — e-commerce transaction pages
- `/search?` — internal search result pages (these create near-infinite duplicate URLs)
- `/staging/` or `/dev/` — development or staging environments
- `/*.pdf$` — PDF files (if you do not want them crawled)
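Put together, a robots.txt that implements these blocks might look like the sketch below. The paths are illustrative; match them to the directories your site actually uses:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /login/
Disallow: /storage/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search?
Disallow: /staging/
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
```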
What You Should Never Block
These are the most common and damaging robots.txt mistakes:
- CSS and JavaScript files — Google needs to render these to understand your page layout. Blocking them can cause Google to misinterpret your content.
- Your homepage — never block `/`
- Your blog posts and tool pages — any page you want indexed must be crawlable
- Your sitemap — never block the sitemap URL
- Image directories — unless you deliberately do not want images indexed
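A pattern that respects these rules is to block an admin directory while explicitly re-allowing the one file inside it that front-end pages depend on. The WordPress example below is a widely used illustration of the `Allow:` override; adapt the paths to your own platform:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```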
The Critical Mistake That Destroys Rankings
The single most catastrophic robots.txt error is this one line:
```
Disallow: /
```
This tells every search engine crawler not to crawl any page on your entire site. It takes effect quickly: Google generally caches robots.txt for no more than 24 hours, so within a day all crawling stops, snippets drop out of search results, and rankings collapse. This single mistake has wiped rankings from websites overnight.
It typically happens when a developer sets `Disallow: /` on a staging environment to prevent it from being indexed, then accidentally pushes that robots.txt to production.
Always check your live robots.txt after any deployment. Visit https://yourdomain.com/robots.txt directly in your browser and confirm it does not contain `Disallow: /` for all user agents.
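You can also automate this check in your deployment pipeline. The sketch below uses Python's standard-library urllib.robotparser; the domain and paths are placeholders to replace with your own:

```python
# check_robots.py: fail the deployment if the live robots.txt
# blocks critical pages for Googlebot.
import sys
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"  # placeholder: your production domain

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live file

# Paths every deployment must leave crawlable (adjust to your site).
for path in ("/", "/blog/", "/sitemap.xml"):
    if not parser.can_fetch("Googlebot", f"{SITE}{path}"):
        print(f"BLOCKED by robots.txt: {path}")
        sys.exit(1)  # non-zero exit fails the CI step

print("robots.txt allows all critical paths")
```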
robots.txt vs noindex: The Critical Difference
These two tools are often confused but they do fundamentally different things:
| Feature | robots.txt Disallow | noindex Meta Tag |
|---|---|---|
| What it does | Prevents crawling | Prevents indexing |
| Can page still rank? | Yes (if linked to externally) | No |
| Does Google read the page? | No | Yes (to see noindex) |
| Best for | Admin pages, private directories | Duplicate content, thank-you pages |
A page blocked by robots.txt can still appear in Google search results if external sites link to it — Google sees the link but cannot read the page. To truly remove a page from search results, you must use the noindex meta tag, not robots.txt.
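For reference, the noindex instruction lives in the page itself, either as a meta tag in the HTML head or as an HTTP response header (the header form is useful for non-HTML files such as PDFs):

```html
<!-- In the <head> of the page you want removed from search results -->
<meta name="robots" content="noindex">
```

The equivalent HTTP header is X-Robots-Tag: noindex. Either way, the page must remain crawlable so Google can read the instruction.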
robots.txt and Crawl Budget
Crawl budget is the number of pages Googlebot will crawl on your site within a given timeframe. For large sites with thousands of pages, using robots.txt to block low-value pages (admin pages, internal search results, duplicate URL parameters) preserves crawl budget for your important content pages.
For most small to medium sites (under 1,000 pages), crawl budget is rarely a limiting factor. However, if you notice important pages taking weeks to be indexed, reviewing your robots.txt for unnecessary crawl budget waste is a useful starting point.
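For example, parameter-driven duplicate URLs can be blocked with wildcard patterns, which Google supports in robots.txt. The parameter names below are illustrative; substitute the ones your site actually generates:

```
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?sessionid=
```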
How to Test Your robots.txt
Google Search Console includes a robots.txt report under Settings → robots.txt, which shows the robots.txt file Google has fetched for your site and flags any errors it found. To check whether a specific URL is blocked under your current rules, run it through the URL Inspection tool.
You can also build and validate your robots.txt with our free Robots.txt Generator, which checks syntax and produces a correctly formatted file.
robots.txt Examples by Site Type
Standard blog or content site:
```
User-agent: *
Disallow: /admin/
Disallow: /login/
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
```
Laravel/PHP application:
```
User-agent: *
Disallow: /admin/
Disallow: /storage/
Disallow: /vendor/
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
```
How Google Treats robots.txt Directives
It is important to understand that robots.txt governs crawling, not indexing, so its effects are narrower than many site owners expect. In particular:
- Google will respect `Disallow` rules and will not crawl those URLs
- However, Google may still index a blocked URL if external sites link to it — robots.txt prevents crawling, not indexing
- Google may continue to show a blocked page in search results as a URL-only result (without a description) if it was previously indexed or has external links
- To fully remove a page from Google, use a `noindex` meta tag AND allow crawling — Google must be able to crawl the page to read the noindex instruction
robots.txt for Performance and Security
Beyond SEO, robots.txt serves two additional purposes for many sites:
Performance: Blocking aggressive bots that crawl your site unnecessarily can reduce server load. While Google and Bing are well-behaved, some scrapers and AI training bots crawl sites at high frequency. You can block specific user agents:
```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```
Privacy: For sites with user-generated content or private sections, robots.txt keeps well-behaved crawlers away from logged-in areas, user profile URLs, and internal API endpoints. Combine it with noindex meta tags for belt-and-suspenders protection on truly private content.
Remember: your robots.txt file is publicly accessible at yourdomain.com/robots.txt and can be read by anyone. Do not put sensitive paths in it — this effectively creates a roadmap to private areas for malicious actors. Use proper authentication and server-level access controls for genuinely sensitive content.
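If you want noindex coverage enforced at the server level (including for non-HTML files), the header variant can be set in your web server configuration. A minimal Apache sketch, assuming mod_headers is enabled and /private/ is a hypothetical directory you want kept out of search results:

```
# Send a noindex header for everything under /private/
<LocationMatch "^/private/">
    Header set X-Robots-Tag "noindex, nofollow"
</LocationMatch>
```

Note that Google only sees this header if the URL is crawlable, so do not also Disallow these paths in robots.txt if you rely on the header.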
Testing and Validating Your robots.txt File
Before deploying a robots.txt change to a live site, always test it. A single misplaced disallow directive can accidentally block your entire site from Google — a mistake that can take weeks to recover from once discovered.
Use Google Search Console's robots.txt report: Go to Settings → robots.txt in Google Search Console to confirm that Google has fetched your current file without errors. To test whether a specific URL is allowed or blocked for Googlebot, run it through the URL Inspection tool before and after your change goes live.
Check after every change: Any time you update robots.txt, verify that your most important pages are still crawlable. It takes less than two minutes and prevents potentially serious indexing mistakes.
Monitor in Google Search Console: After publishing changes to robots.txt, watch the Page indexing report (formerly Coverage) in GSC. If you see a sudden spike in "Blocked by robots.txt" statuses, you have likely accidentally blocked pages you intended to allow. Revert the change and test again before republishing.
Common robots.txt testing checklist:
- Homepage (/) is allowed for all user agents
- Key content pages (/blog/, /tools/) are allowed
- Admin paths (/wp-admin/, /admin/) are blocked
- The Sitemap directive at the bottom points to the correct URL
- No wildcard disallows that accidentally block important content
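If you keep robots.txt in version control, this checklist can run as an automated test before every deployment. A minimal pytest-style sketch using Python's standard-library urllib.robotparser; the file path, domain, and URL paths are placeholders to adapt:

```python
# test_robots.py: validate robots.txt rules before they go live.
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"  # placeholder domain


def load_rules(path="robots.txt"):
    # Parse the local robots.txt file that is about to be deployed.
    parser = RobotFileParser()
    with open(path) as f:
        parser.parse(f.read().splitlines())
    return parser


def test_critical_paths_allowed():
    rules = load_rules()
    for path in ("/", "/blog/", "/tools/", "/sitemap.xml"):
        assert rules.can_fetch("*", f"{SITE}{path}"), f"{path} is blocked"


def test_admin_paths_blocked():
    rules = load_rules()
    for path in ("/wp-admin/", "/admin/"):
        assert not rules.can_fetch("*", f"{SITE}{path}"), f"{path} is crawlable"
```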
Frequently Asked Questions
Does blocking a page in robots.txt remove it from Google?
No. Blocking a page in robots.txt prevents Googlebot from crawling it, but the page can still appear in search results if external sites link to it. To fully remove a page from Google search, use a noindex meta tag instead.
What happens if I have no robots.txt file?
Googlebot will crawl your entire site without restriction. This is not necessarily harmful for small sites, but it means admin pages and private directories may be crawled. A basic robots.txt blocking admin directories is always recommended.
Can robots.txt hurt my SEO?
Yes — if misconfigured. Accidentally blocking important pages or your entire site can devastate rankings. Always test your robots.txt after any changes and verify live pages are accessible to Googlebot.
How do I generate a robots.txt file?
Use our free Robots.txt Generator to create a correctly formatted robots.txt file. Enter your site URL, select the pages to block, and copy the output directly to your server.