Almost every website benefits from a robots.txt file. It is a plain text file at the root of your domain that tells search engine crawlers (Googlebot, Bingbot, and others) which pages they are allowed to visit and which to skip. Get it wrong and you either waste your crawl budget on pages that should never rank, or you accidentally block Google from your most important content. This guide covers syntax, what to block, platform-specific templates, advanced directives, and how to test the file properly before anything goes live.
What Is a Robots.txt File and How Does It Work?
A robots.txt file lives at the root of your domain, always at https://yourdomain.com/robots.txt. You cannot place it in a subdirectory, and each subdomain needs its own file. Before crawling your pages, a well-behaved crawler fetches this file (or reuses a recently cached copy) and applies its rules when deciding what to access next.
The file uses the Robots Exclusion Protocol, which began as an informal standard in 1994 and was formalised as RFC 9309 in 2022. Virtually every search engine and well-behaved crawler respects it: Googlebot, Bingbot, DuckDuckBot, Yandex, and Apple's Applebot all check robots.txt before crawling.
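Because the location is fixed, the robots.txt URL that governs any page can be derived from the page URL alone. The following is a minimal Python sketch of that rule; the function name robots_txt_url is illustrative and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?ref=home"))
# https://yourdomain.com/robots.txt
print(robots_txt_url("https://shop.yourdomain.com/cart"))
# https://shop.yourdomain.com/robots.txt
```

The second call shows why a subdomain is treated separately: its rules come from its own robots.txt, not from the one on the main domain.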
Three things robots.txt is not:
- Not a security measure. Malicious bots ignore robots.txt completely. Never rely on it to protect sensitive data — use server-level authentication (passwords, IP allowlisting) instead.
- Not a guarantee of de-indexing. If a page is blocked in robots.txt but other sites link to it, Google can still index that page based on the links alone; it just will not have read the content. To prevent a page from appearing in Google results, use a noindex meta tag on the page itself.
- Not compulsory. If your site has no robots.txt, crawlers assume everything is accessible. But having no file wastes crawl budget and risks exposing pages you do not want indexed.
The crawl budget — the number of pages Google will crawl on your site in a given time period — is directly influenced by your robots.txt. A well-configured file ensures Google spends its crawl budget on your best pages, not on login forms and admin panels.
Complete Robots.txt Syntax Reference
A robots.txt file is made of one or more groups. Each group targets specific crawlers and lists what they can and cannot access.
Basic structure:
    # This is a comment, ignored by all crawlers
    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/
    Crawl-delay: 1
    Sitemap: https://yourdomain.com/sitemap.xml
Every directive explained:
User-agent: — Specifies which crawler the rules below apply to. Use * for all crawlers. Use a specific name (e.g., Googlebot, Bingbot, DuckDuckBot) to target just one. If you have multiple groups, crawlers follow the most specific group that matches them. A rule for Googlebot overrides the * rule for Google's crawler.
Disallow: — The path the specified crawler must not access. Disallow: /admin/ blocks everything under /admin/. An empty Disallow: with nothing after it means "allow everything" — this is a common source of confusion. It does NOT mean "block everything."
Allow: — Explicitly permits access to a path. Most useful when you need to carve out an exception inside a blocked directory. For example: block /wp-admin/ but allow /wp-admin/admin-ajax.php for plugin functionality.
Crawl-delay: Tells crawlers to wait N seconds between requests. Useful for servers that struggle under heavy crawl load. Note: Google does not support Crawl-delay and ignores it. Search Console's old crawl rate setting has been retired, so if Googlebot is overloading your server, Google's guidance is to temporarily return 500, 503, or 429 status codes, which makes it slow down.
Sitemap: — Points crawlers to your XML sitemap. Not part of the original Robots Exclusion Protocol but supported by all major search engines. You can include multiple Sitemap lines if you have more than one sitemap file.
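To see how a parser reads these directives, the basic structure can be loaded into Python's standard-library urllib.robotparser. This is only a sketch: that module does plain prefix matching, does not understand wildcards, and is not how Googlebot itself is implemented. The site_maps() call assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# This is a comment
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 1
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network request

print(parser.can_fetch("*", "https://yourdomain.com/admin/settings"))  # False
print(parser.can_fetch("*", "https://yourdomain.com/public/page"))     # True
print(parser.can_fetch("*", "https://yourdomain.com/blog/post"))       # True (no rule matches)
print(parser.crawl_delay("*"))                                         # 1
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```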
Syntax rules that catch people out:
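- Paths are case-sensitive: /Admin/ and /admin/ are two different paths
- Each directive must be on its own separate line, with no inline combinations
- Separate groups (different User-agent blocks) with a blank line; some parsers treat a blank line as the end of a group
- When Allow and Disallow rules overlap, Google and other parsers that follow the current standard (RFC 9309) apply the most specific (longest) matching path regardless of order, and Allow wins a tie. Some older parsers apply the first matching rule instead, so placing an Allow exception before the broader Disallow keeps both kinds of parser in agreement
- An empty Disallow (Disallow:) with nothing following means "allow everything"; if you want to block all crawlers from everything, use Disallow: /
- Comments start with # and can appear anywhere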
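Parsers differ on how overlapping Allow and Disallow rules are resolved: Google documents that the most specific (longest) matching rule applies, while some older parsers stop at the first match. The sketch below contrasts the two strategies on plain path prefixes. The function names and the (directive, prefix) rule format are illustrative assumptions, not any crawler's actual code.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)


def longest_match_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


def first_match_allowed(rules: List[Rule], path: str) -> bool:
    """The first rule whose prefix matches decides the outcome."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "allow"
    return True


rules = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]

print(longest_match_allowed(rules, "/wp-admin/admin-ajax.php"))  # True
print(first_match_allowed(rules, "/wp-admin/admin-ajax.php"))    # False
print(longest_match_allowed(rules, "/wp-admin/options.php"))     # False
```

With the Allow line listed first, both functions return True for admin-ajax.php, which is why that ordering is the safer one when you cannot be sure which strategy a crawler uses.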
How to Create One Free in 3 Steps
You do not need to write robots.txt syntax by hand. Use the free robots.txt generator at SearchRankTool to build the correct file without touching code:
Step 1: Open the generator
Go to searchranktool.com/robots-txt-generator. No account, no signup. Open it in any browser.
Step 2: Configure your rules
Select which directories to block, whether to allow all crawlers or restrict specific bots, and enter your sitemap URL. The generator writes the correctly formatted robots.txt syntax as you make each selection.
Step 3: Copy and upload
Copy the generated content. On your local computer, create a new plain text file named exactly robots.txt — lowercase, no capital letters, no spaces, no .html extension. Upload this file to the root directory of your web server: the same folder that contains your index.html, index.php, or wp-config.php.
After uploading, test it by visiting https://yourdomain.com/robots.txt directly in your browser. You should see the plain text content — not a download prompt, not a 404 error.
If you see a 404, the file is in the wrong directory or has the wrong name. If you see a server error, check the file's permissions and your server configuration. Also confirm the file is saved as UTF-8 plain text, not rich text or HTML.
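If you would rather script the file than use a web form, the same three steps can be approximated in a few lines of Python. This is a minimal sketch: build_robots_txt is a hypothetical helper, and the blocked paths and sitemap URL are placeholders to replace with your own.

```python
from typing import Iterable


def build_robots_txt(disallow: Iterable[str], sitemap_url: str,
                     user_agent: str = "*") -> str:
    """Build a single-group robots.txt as a string."""
    lines = [f"User-agent: {user_agent}"]
    paths = list(disallow)
    if paths:
        lines += [f"Disallow: {path}" for path in paths]
    else:
        lines.append("Disallow:")  # empty Disallow means allow everything
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


content = build_robots_txt(["/admin/", "/login"],
                           "https://yourdomain.com/sitemap.xml")
print(content)
```

To save the result as UTF-8 plain text, something like `Path("robots.txt").write_text(content, encoding="utf-8")` from the standard pathlib module is enough; the file still has to be uploaded to the web root as described above.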
What to Block, What to Allow, and Why
The biggest mistake people make with robots.txt is blocking too much or too little. Here is a comprehensive guide by content type:
Always block these paths (add to Disallow):
- /admin/ or /wp-admin/: Admin control panels have no value in search results and waste crawl budget
- /login, /logout, /signup, /register: Authentication pages should never appear in Google
- /cart, /checkout, /order-confirmation: Transaction pages generate duplicate thin content
- /search or /?s=: Internal search results pages are thin, near-duplicate content that can soak up crawl budget
- /feed/: RSS feeds duplicate your blog content
- /cdn-cgi/: Cloudflare system directory, not user content
- /private/, /staging/, /dev/: Any non-production directories
- /thank-you, /success: Post-conversion pages with minimal content
Never block these (keep them crawlable):
- Your homepage (/): blocking this de-indexes your entire site
- Blog posts, articles, guides: this is your indexable content
- Product and service pages: these are what you want ranking
- Category and tag pages (if they have substantial content)
- CSS files (/css/), JavaScript files (/js/), images (/images/): blocking these prevents Google from rendering your pages. A page Google cannot render looks broken in search results.
- Your sitemap file (/sitemap.xml)
Block with care — depends on your site:
- /tag/, /author/: WordPress taxonomy pages. Block if they are thin; allow if they have substantial content
- /page/2, /page/3: Paginated archives. Allow if the content on each page is unique and valuable
- /?utm_source=: UTM-tagged URLs create duplicate pages. Block using wildcards (see advanced directives section)
- Print versions of pages (/print/): duplicate content
Ready-to-Use Templates by Platform
Copy the template for your platform and replace yourdomain.com with your actual domain:
WordPress:
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    Disallow: /wp-login.php
    Disallow: /xmlrpc.php
    Disallow: /search
    Disallow: /?s=
    Disallow: /feed/
    Disallow: /comments/feed/
    Sitemap: https://yourdomain.com/sitemap.xml
The Allow: /wp-admin/admin-ajax.php exception is important — many WordPress plugins use this AJAX endpoint to load dynamic content. Blocking it breaks plugin functionality.
Shopify / eCommerce:
    User-agent: *
    Disallow: /admin
    Disallow: /cart
    Disallow: /orders
    Disallow: /checkouts
    Disallow: /checkout
    Disallow: /account
    Disallow: /customers
    Sitemap: https://yourdomain.com/sitemap.xml
Laravel / Custom PHP:
    User-agent: *
    Disallow: /admin
    Disallow: /login
    Disallow: /logout
    Disallow: /register
    Disallow: /dashboard
    Disallow: /storage/
    Disallow: /vendor/
    Sitemap: https://yourdomain.com/sitemap.xml
Minimal (allow everything, just declare sitemap):
    User-agent: *
    Disallow:
    Sitemap: https://yourdomain.com/sitemap.xml
An empty Disallow: tells all crawlers they can access everything. Use this if you genuinely want all pages indexed and just need to declare your sitemap.
Advanced Directives: Wildcards and Crawl-Delay
The original 1994 standard only supports matching by path prefix, but Google and most modern crawlers also support two wildcard patterns, which are now part of RFC 9309:
* (asterisk) — matches any sequence of characters within a path. For example:
    # Block all URLs containing ?utm_ (UTM-tagged duplicates)
    Disallow: /*?utm_
    # Block all PDF files
    Disallow: /*.pdf$
    # Block all URLs with session IDs
    Disallow: /*?sessionid=
$ (dollar sign) — matches the end of a URL exactly. Use it to block specific file types without blocking directories that contain similar strings:
    # Block .pdf files specifically ($ = end of URL)
    Disallow: /*.pdf$
    # Block .zip downloads
    Disallow: /*.zip$
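One way to reason about these two wildcards is to translate a pattern into a regular expression. The Python sketch below does that under simple assumptions (no percent-encoding normalisation, and pattern_matches is an illustrative helper name); it shows the matching idea, not any search engine's actual matcher.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal characters, then let each * match any sequence.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def pattern_matches(pattern: str, url_path: str) -> bool:
    """True if the pattern matches from the start of the URL path."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(pattern_matches("/*.pdf$", "/files/report.pdf"))             # True
print(pattern_matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(pattern_matches("/*?utm_", "/page?utm_source=news"))         # True
print(pattern_matches("/*?utm_", "/page?ref=a&utm_source=news"))   # False
```

The last line is worth noting: a pattern written as /*?utm_ only catches URLs where the UTM parameter comes first in the query string, so a second rule such as /*&utm_ may be needed to cover the rest.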
Targeting specific crawlers:
You can write separate rule groups for different bots. This is useful when you want to block AI training crawlers (like GPTBot or Google-Extended) while keeping your site fully accessible to search engine crawlers:
    # Normal search engine crawlers: allow everything
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    # Block AI training scrapers
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    # Block all other crawlers from admin
    User-agent: *
    Disallow: /admin/
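A crawler obeys only the one group that matches it best, so the groups above do not stack. The Python sketch below checks this with the standard-library urllib.robotparser, used here as a stand-in for a real crawler; the name SomeOtherBot is a made-up user agent for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "https://yourdomain.com"
print(parser.can_fetch("Googlebot", site + "/admin/"))     # True
print(parser.can_fetch("GPTBot", site + "/blog/post"))     # False
print(parser.can_fetch("SomeOtherBot", site + "/admin/"))  # False
print(parser.can_fetch("SomeOtherBot", site + "/blog/"))   # True
```

The first result is the one that surprises people: because Googlebot has its own group with an empty Disallow, the /admin/ rule in the * group does not apply to it. If you want Googlebot kept out of /admin/ as well, repeat that rule inside the Googlebot group.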
Robots.txt and Crawl Budget
Crawl budget is the number of pages Googlebot will crawl on your site within a given time period. Google allocates crawl budget based on your site's overall authority and server response speed. For most small sites (under 500 pages), crawl budget is not a limiting factor — Google will crawl everything it finds. For larger sites (10,000+ pages), managing crawl budget becomes critical.
Robots.txt helps with crawl budget in two ways:
Blocking low-value pages means Googlebot spends more of its allocated crawl budget on your important pages. A site with 10,000 product pages and 5,000 admin/duplicate pages that are blocked will have Google crawling 10,000 valuable pages instead of 15,000 mixed ones — meaning each valuable page gets crawled and re-indexed more frequently.
Reducing unnecessary crawl requests reduces server load. Heavy crawling of thin pages can slow down your site for real users if your server is resource-constrained, so blocking those pages frees capacity for visitors.
Important: blocking a page in robots.txt does NOT remove it from Google's index if it was previously indexed. To remove an already-indexed page, either return a 404 or 410 HTTP response from the server, or add a noindex tag and leave the page crawlable so Google can read it. Google Search Console's Removals tool hides a URL quickly, but only temporarily. Robots.txt prevents future crawling; it does not undo past indexing.
For a full guide on how crawl budget affects rankings, see our crawl budget SEO guide.
How to Test Your Robots.txt Before It Goes Live
A wrong robots.txt rule can knock your entire site out of Google's index. Always test before and after making changes:
Method 1: Google Search Console robots.txt report
Go to Google Search Console → Settings → robots.txt. This report shows the robots.txt file Google has fetched for your site, when it was last crawled, and any warnings or errors it found. To test an individual URL, use the URL Inspection tool as your URL tester: enter any URL on your site and it reports whether Googlebot is allowed to crawl it or is blocked by robots.txt.
Method 2: Direct browser check
Open https://yourdomain.com/robots.txt in your browser. You should see the plain text content of your file. If you see a 404 error, the file is not in the correct directory. If you see HTML (a webpage), your server or CMS is answering the request with a page instead of the file, so check the file's name, extension, and location.
Method 3: Google's Rich Results Test
While primarily for structured data, Google's Rich Results Test at search.google.com/test/rich-results will flag if the URL you enter is blocked by robots.txt. Useful for quickly checking individual pages.
What to check after every edit:
- Your homepage (/) is accessible and never blocked
- Your sitemap URL is listed and accessible
- Your 5 most important pages are individually accessible according to the URL Inspection tool in GSC
- Admin, login, and cart pages are blocked as expected
- No CSS or JavaScript files are blocked (these affect how Google renders your pages)
After updating robots.txt, Google typically re-fetches it within 24 hours. Changes take effect the next time Googlebot visits the affected URLs, which can take days to weeks for less frequently crawled pages. For urgent changes, use URL Inspection in GSC to request a re-crawl of the affected pages; this queues them sooner but does not guarantee an immediate crawl.
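The post-edit checklist above can also be run as a script before you upload anything. The sketch below is one possible approach using Python's standard-library urllib.robotparser; the audit function and the sample URLs are hypothetical. That parser uses first-match prefix rules without wildcards, so treat its verdicts as a first pass and confirm important pages in Search Console.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def audit(robots_txt: str, must_allow: Iterable[str],
          must_block: Iterable[str], agent: str = "Googlebot") -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    if not parser.site_maps():
        problems.append("no Sitemap line found")
    return problems


DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /css/
Sitemap: https://yourdomain.com/sitemap.xml
"""

site = "https://yourdomain.com"
problems = audit(
    DRAFT,
    must_allow=[site + "/", site + "/css/site.css", site + "/blog/robots-guide"],
    must_block=[site + "/admin/", site + "/cart", site + "/login"],
)
for problem in problems:
    print(problem)
# blocked but should be crawlable: https://yourdomain.com/css/site.css
# crawlable but should be blocked: https://yourdomain.com/login
```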
The 7 Most Dangerous Robots.txt Mistakes
These are the errors that cause real ranking damage — all are easy to make and easy to avoid once you know them:
1. Disallow: /
The most catastrophic error. This one line blocks ALL crawlers from your ENTIRE website. If deployed to a live site, Google stops crawling every page, and your listings lose their descriptions and start dropping out of results over the following days and weeks. This typically happens accidentally when someone copies a dev robots.txt to production. Always check for this before deploying.
2. Blocking CSS and JavaScript files
Google renders your pages visually before ranking them. If it cannot load your CSS and JavaScript, it sees a broken, unstyled page — and ranks it poorly. Never add /css/, /js/, /assets/, or /static/ to your Disallow list.
3. Using robots.txt to hide sensitive content
Robots.txt is publicly visible — anyone can read it by visiting /robots.txt. Listing your admin panel path in robots.txt actually advertises its existence to anyone looking. Use server-level authentication for truly sensitive areas.
4. Confusing robots.txt with noindex
Blocking a page in robots.txt does not remove it from Google's index if it was previously indexed. Google can still show the page in results — it just won't have fresh content from crawling. To actually de-index a page, use <meta name="robots" content="noindex"> on the page and keep it accessible to crawlers so Google can read the noindex instruction. See our noindex tag guide for details.
5. Blocking pages you want to rank
The most common ongoing mistake is blocking pages accidentally. This happens when a developer blocks a directory (e.g., /resources/) not realising that blog posts or product pages live under that path. Use the URL Inspection tool in GSC to check your most important pages whenever you change robots.txt.
6. Not including your sitemap
The Sitemap: line in robots.txt is the fastest way to tell all crawlers where to find every page on your site. Missing it means crawlers rely entirely on link-following to discover your pages — slower and less reliable.
7. Leaving a development or staging site open to crawlers
If your development or staging site is publicly accessible and you forget to block it, Google may index it and treat it as duplicate content of your live site. Blocking it with robots.txt (Disallow: /) stops crawling, but HTTP authentication is the safer choice because it keeps both crawlers and visitors out. Whichever you use, make sure the dev robots.txt never gets copied to production.
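Mistake 1 is easy to catch automatically before a deploy. The following Python sketch scans a robots.txt for groups that contain a blanket Disallow: / rule; agents_with_blanket_disallow is a hypothetical helper, and it deliberately ignores any Allow exceptions in the same group, so it flags candidates for review instead of proving a block.

```python
from typing import List


def agents_with_blanket_disallow(robots_txt: str) -> List[str]:
    """List the user agents whose group contains 'Disallow: /'."""
    flagged: List[str] = []
    agents: List[str] = []
    in_rules = False  # True once the current group has Allow/Disallow lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                flagged.extend(agents)
    return flagged


dev_file = "User-agent: *\nDisallow: /\n"
live_file = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /admin/\n"
    "Sitemap: https://yourdomain.com/sitemap.xml\n"
)

print(agents_with_blanket_disallow(dev_file))   # ['*']
print(agents_with_blanket_disallow(live_file))  # ['GPTBot']
```

A deploy script could refuse to publish when "*" or "Googlebot" appears in the result for the production file, while still permitting deliberate blocks of individual bots.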
Frequently Asked Questions
Does every website need a robots.txt file?
Not strictly required, but strongly recommended for any site beyond a personal one-page site. Without robots.txt, search crawlers access everything they can find — admin panels, login pages, search result pages, and duplicate content. This wastes your crawl budget on pages that have no business being indexed, potentially reducing how frequently Google crawls your important content. Set one up in the first week after launch.
What happens if robots.txt blocks the wrong page?
If you block a page that was previously indexed, Google will stop crawling it. The page remains in the index temporarily, but its listing will eventually lose its description or drop from results entirely, since Google can no longer verify its content. Always test robots.txt changes before deploying them, and confirm the result afterwards with the URL Inspection tool in Google Search Console. If you accidentally block an important page, fix the robots.txt immediately and use GSC → URL Inspection → Request Indexing to trigger a re-crawl.
Can robots.txt block Google from indexing a page?
Blocking a page in robots.txt prevents Google from crawling it, but does not guarantee removal from the index. If other sites link to the blocked page, Google can still show it in results based on those links; it just will not have current content. To actually prevent indexing, the page must be accessible to crawlers (so they can read the instruction) AND contain a <meta name="robots" content="noindex"> tag in the HTML head. Robots.txt and noindex serve different purposes, and combining them on the same URL defeats the noindex: a crawler that is blocked from a page never sees the tag.
How often should I update my robots.txt?
Review and update robots.txt whenever you add new directories that should be blocked (staging areas, admin tools), change your URL structure, launch new site sections, or migrate platforms. After any major change, check every important page with the URL Inspection tool in Google Search Console and submit your sitemap again to prompt a re-crawl. A quarterly review is good practice for active sites; treat it the same as reviewing your sitemap for completeness.