A robots.txt file tells cooperative crawlers which parts of a site they may request and where to find supporting crawl metadata such as sitemap locations. It matters because crawl control is usually a balancing act: you want bots to discover public pages efficiently without wasting effort on admin paths, duplicate search pages, or staging areas that should stay out of routine crawl flows.
This generator turns that policy work into a structured draft. It can start from presets, build multiple user-agent sections, normalize allow and disallow paths, add sitemap and host lines, emit crawl-delay values where you want them, and sort directives before producing three synchronized outputs: the final text file, a directive table, and a JSON representation of the same rules.
That is useful when you are drafting a new policy, cleaning up a hand-written file, or importing an existing ruleset to make it easier to inspect. A site owner might begin with a standard allow-all preset, add a protected section for /admin and /search, then check the warnings panel before copying the text into the site root.
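A minimal draft in the spirit of that example might look like the following. This is illustrative only; the exact header comment, ordering, and spacing depend on the generator's settings:

```text
# Generated robots.txt draft (illustrative)
User-agent: *
Allow: /
Disallow: /admin
Disallow: /search

Sitemap: https://example.com/sitemap.xml
```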
The boundaries are just as important as the convenience. A crawler policy is not the same thing as access control, and a blocked path can still appear in search results if other signals expose the URL. The package itself reinforces that cautious reading with warnings about missing sitemap lines, a blank host field, and the uneven support that Crawl-delay receives across search engines.
This implementation is also intentionally narrower than the full universe of crawler directives. It focuses on user-agent groups, allow and disallow rules, crawl-delay, sitemap lines, an optional host line, and import parsing for those constructs. It does not try to become a general crawler-governance workbench for every vendor-specific extension.
Start with the simplest question: are you opening crawling, selectively shaping it, or blocking it? The summary line answers that immediately. "Crawling open" means the current sections impose no disallow rules, "Selective crawling" means some paths are restricted, and "All crawling blocked" appears when the ruleset effectively shuts the door for the matching crawlers.
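The three summary states can be sketched as a small classifier. The section shape (dicts with `allow` and `disallow` path lists) is an assumption for illustration, not the tool's internal representation:

```python
def summarize(sections):
    """Classify a draft as open, selective, or fully blocked.

    sections: list of dicts with "allow" and "disallow" path lists.
    """
    disallows = [p for s in sections for p in s["disallow"]]
    if not disallows:
        return "Crawling open"
    # A root disallow with no allow carve-outs shuts the door entirely.
    if any(p == "/" for p in disallows) and not any(s["allow"] for s in sections):
        return "All crawling blocked"
    return "Selective crawling"

print(summarize([{"allow": [], "disallow": []}]))          # Crawling open
print(summarize([{"allow": [], "disallow": ["/admin"]}]))  # Selective crawling
print(summarize([{"allow": [], "disallow": ["/"]}]))       # All crawling blocked
```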
The presets are useful only as starting positions. Standard gives a permissive base, Block all is useful for draft or staging environments, Hide admin/search blocks common private areas, and Throttle polite crawlers adds rate-limiting hints. Once the preset is applied, every field remains editable, so the safest habit is to treat the preset as a rough shape and then read the generated text line by line.
Use one section per crawler family when policy actually differs. If Googlebot, Bingbot, and all other crawlers should behave the same way, one wildcard section is cleaner than three almost-identical blocks. If a particular bot needs a different allow path or delay value, split it into its own section so the difference is obvious in both the rendered text and the directive table.
The host and sitemap fields deserve more attention than people often give them. Sitemap lines help crawlers find index files quickly, while the host field is an optional helper that some environments still want even though it sits outside the core RFC. If those fields are blank, the package warns you before publication because the file may still be syntactically valid while being operationally incomplete for the policy you intended.
Import is most valuable when you inherit a messy file. Paste the existing text, let the parser rebuild sections, and then review what came across cleanly. That is especially useful if comments, host lines, sitemap entries, and user-agent groups have been mixed together in a way that is hard to scan quickly in raw text.
The generator works from a normalized section list. Each section has one user-agent token, zero or more allow paths, zero or more disallow paths, an optional crawl-delay value, and an optional note. When the final file is assembled, the package can insert a generated header comment, add the note as a comment above the section, write the user-agent line, then emit allow and disallow directives in sorted order if sorting is enabled.
The package also distinguishes between an empty rule set and an intentionally open rule set. If a section has no allow or disallow entries and the allow-all placeholder setting is enabled, it writes a blank Disallow: line. That keeps the generated file explicit about allowing crawling rather than simply omitting directives and leaving readers to infer the intent.
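Putting those two paragraphs together, rendering one section could be sketched like this. The function name and section shape are assumptions for illustration; only the output format follows the behavior described above:

```python
def render_section(section, sort_rules=True, allow_all_placeholder=True):
    """Render one user-agent group as robots.txt lines (illustrative sketch)."""
    lines = []
    if section.get("note"):
        lines.append(f"# {section['note']}")  # note becomes a comment above the group
    lines.append(f"User-agent: {section['agent']}")
    allow = sorted(section["allow"]) if sort_rules else section["allow"]
    disallow = sorted(section["disallow"]) if sort_rules else section["disallow"]
    lines += [f"Allow: {p}" for p in allow]
    lines += [f"Disallow: {p}" for p in disallow]
    # An intentionally open section gets an explicit blank Disallow line.
    if not allow and not disallow and allow_all_placeholder:
        lines.append("Disallow:")
    if section.get("crawl_delay") is not None:
        lines.append(f"Crawl-delay: {section['crawl_delay']}")
    return "\n".join(lines)

print(render_section({"agent": "*", "allow": [], "disallow": ["/search", "/admin"]}))
```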
Host and sitemap lines are appended after all user-agent groups. The host value is cleaned from the site URL or from manual input, while sitemap entries are normalized so leading slashes can be expanded against the supplied site URL. The resulting JSON export mirrors that assembled policy, including the preset name, cleaned host, normalized sitemap list, sorted section rules, note fields, and a generation timestamp.
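The cleaning steps for host and sitemap values can be approximated with the standard library. These helper names are assumptions for illustration, not the package's API:

```python
from urllib.parse import urljoin, urlparse

def normalize_sitemaps(entries, site_url):
    """Expand site-relative sitemap entries against the site URL (sketch)."""
    out = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        # Leading-slash entries are expanded; absolute URLs pass through unchanged.
        out.append(urljoin(site_url, entry) if entry.startswith("/") else entry)
    return out

def clean_host(site_url):
    """Reduce a full site URL to the bare host for the optional Host line."""
    return urlparse(site_url).netloc

print(normalize_sitemaps(["/sitemap.xml"], "https://example.com"))
print(clean_host("https://example.com/some/path"))
```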
The import path is deliberately practical rather than ambitious. The parser reads lines, keeps leading comments as pending notes, creates a new section when it sees User-agent:, appends Allow, Disallow, and Crawl-delay directives to the current section, and captures the first host line plus all sitemap lines it encounters. If it finds no user-agent groups at all, it stops with an import error instead of pretending that the file can be reconstructed safely.
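The import behavior described above can be sketched as a line-oriented parser. This is a simplified reconstruction under assumed data shapes, not the package's actual importer:

```python
def parse_robots(text):
    """Rebuild sections from robots.txt text, in the spirit of the importer."""
    sections, host, sitemaps, pending_note = [], None, [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            pending_note.append(line.lstrip("# "))  # kept as a pending note
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            sections.append({"agent": value, "allow": [], "disallow": [],
                             "crawl_delay": None,
                             "note": " ".join(pending_note) or None})
            pending_note = []
        elif key in ("allow", "disallow") and sections:
            sections[-1][key].append(value)
        elif key == "crawl-delay" and sections:
            sections[-1]["crawl_delay"] = value
        elif key == "sitemap":
            sitemaps.append(value)
        elif key == "host" and host is None:  # first host line wins
            host = value
    if not sections:
        raise ValueError("import error: no User-agent groups found")
    return {"sections": sections, "host": host, "sitemaps": sitemaps}
```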
All generation and export steps stay in the browser. The file text, table rows, JSON payload, CSV output, DOCX export, and copied row snippets are assembled locally. That is useful for privacy, but it also means responsibility stays with the user to validate the final draft against the live site, especially because crawler support differs across directives and search engines.
| Directive or element | How this package handles it | Compatibility note |
|---|---|---|
| User-agent | Creates one explicit section per selected crawler token, with support for custom values. | Specific groups are easier to audit than duplicated wildcard sections. |
| Allow and Disallow | Accepts one path per line and can sort rules before rendering or export. | These are core robots constructs and the most important policy surface in the tool. |
| Crawl-delay | Supports both a global default and per-section values. | Support varies by crawler, and Google explicitly does not honor it. |
| Sitemap | Normalizes absolute or site-relative entries and appends them after rule groups. | Useful for crawler discovery and broadly supported by major engines. |
| Host | Emits a cleaned host line when present. | This sits outside RFC 9309 and should be treated as an optional helper rather than a universal rule. |
| Package warning or behavior | What it means | Why it matters |
|---|---|---|
| Blank host warning | The package warns when the host field is empty. | A valid file can still be missing deployment details the user expected to include. |
| No sitemap warning | The package warns when no sitemap URL is present. | Crawlers may still work, but sitemap discovery becomes less explicit. |
| Crawl-delay warning | The package warns that some crawlers ignore the directive. | Prevents users from assuming delay controls are consistently enforced everywhere. |
| Import requires user-agent groups | If the parser finds no user-agent rules, import fails. | Stops a malformed or partial text block from being turned into misleading output. |
| Block-all summary | The summary changes to All crawling blocked when a section blocks everything for its crawler. | Draws attention to the most operationally risky draft state before publication. |
| View | What it contains | Exports available |
|---|---|---|
| Robots.txt | The final assembled plain-text policy with comments, user-agent groups, host, and sitemap lines. | Clipboard copy and text download. |
| Directive Table | One row per normalized section with summaries of allow, disallow, crawl-delay, and notes. | CSV copy, CSV download, DOCX export, and per-row copy. |
| JSON | The preset, cleaned host, sitemap list, normalized section objects, and generation timestamp. | Clipboard copy and JSON download. |
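As a rough sketch, a JSON export for the hide-admin/search scenario could be shaped like the fragment below. The field names here are illustrative assumptions, not the tool's documented schema:

```json
{
  "preset": "hide-admin-search",
  "host": "example.com",
  "sitemaps": ["https://example.com/sitemap.xml"],
  "sections": [
    {
      "userAgent": "*",
      "allow": ["/"],
      "disallow": ["/admin", "/search"],
      "crawlDelay": null,
      "note": "Public site, private admin paths"
    }
  ],
  "generatedAt": "2024-01-01T00:00:00Z"
}
```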
The text view is the source of truth because it shows exactly what would be published as robots.txt. The table is easier to scan for section-by-section differences, and the JSON view is better for automation or structured review. If those three views do not seem to tell the same story, stop and resolve the mismatch before you use the output.
The summary badges are quick signals, not a policy audit. A high disallow count does not automatically mean the draft is good, and a blocks-all badge does not tell you whether the file is appropriate for production or only for staging. What matters is whether the right crawler groups are getting the right paths under the right deployment circumstances.
The warnings panel is the best place to catch false confidence. A draft can look neat and still be operationally weak if it omits sitemap lines, relies on crawl-delay for Google, or uses a host line as though every crawler will treat it the same way. The package warns precisely because those issues are easy to miss in a clean-looking text block.
The biggest interpretation trap is confusing crawl control with indexing control or security. A disallow rule does not hide a URL from everyone on the internet, and it is not a substitute for authentication, authorization, or a noindex strategy when indexing behavior is the actual goal.
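A quick sanity check on what a draft actually blocks can be run locally with Python's standard-library parser, independent of any generator. The draft text and URLs here are illustrative:

```python
from urllib import robotparser

# An illustrative draft in the shape this generator produces.
draft = """\
User-agent: *
Disallow: /admin
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(draft.splitlines())

# Confirms crawl behavior only; it says nothing about indexing or security.
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```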
A team selects the block-all preset for a staging site, keeps a wildcard user-agent, and leaves a note that explains the environment. The summary flips to All crawling blocked, which is exactly what they want for a short-lived pre-release site. Before publication they still need to confirm that the file is in the root directory and that the server returns a normal success status for it.
A site owner starts with the protect-sensitive preset and then adds a sitemap line for the public content tree. The directive table makes it easy to confirm that /admin, /login, and /search are disallowed while the rest of the site remains open. That is a good example of using the tool for selective crawl shaping rather than blanket blocking.
A marketing team inherits a text file with comments, several user-agent blocks, and a few sitemap lines scattered through the document. They paste it into the import box, let the tool rebuild the normalized sections, and then use the table and JSON views to check whether the original intent is still coherent. That review is often faster than editing the raw text blind.
Does a disallow rule keep a page out of search results? Not necessarily. Crawl blocking and index suppression are different problems, and a disallowed URL can still appear in search if other signals expose it.

Why does the package warn about Crawl-delay? Because support is uneven. Some crawlers honor it, while Google explicitly documents that it does not.

Is the Host line part of the standard? No. The package can emit it as an optional helper, but it sits outside RFC 9309 and should be treated as compatibility-sensitive.

What happens if imported text contains no User-agent lines? Import fails with an error instead of fabricating a ruleset, because the package needs explicit crawler groups to rebuild the draft safely.