Sitemap XML Generator
Generate sitemap XML from same-host URLs or paths, normalize lastmod and crawl hints, review exclusions, and catch protocol limits before publishing.
Introduction:
Search crawlers discover pages through links first, but a sitemap gives them a deliberate inventory to check. The XML format is especially useful when a site has pages that are hard to reach from navigation, recently changed content, separate sections managed by different systems, or enough URLs that a plain manual list becomes risky.
A good sitemap is not a promise that every page will be indexed. It is a clean statement of which canonical URLs belong to one host and which optional facts, such as a last modified date, are reliable enough to share. If a CMS or hosting platform already publishes a reliable sitemap, that maintained source is usually safer than a second hand-built list. Crawlers still evaluate robots rules, redirects, canonical tags, page quality, duplicate content, and normal crawl constraints before deciding what to fetch or show in search results.
The most common mistakes happen before the XML is even written. A site owner may mix www and non-www URLs, include a staging host, paste campaign links with tracking parameters, or fill every lastmod field with the generation date. Those choices can produce well-formed XML while still creating a weak crawl handoff.
Three ideas matter most when preparing a sitemap inventory: location, freshness, and intent. Location means each URL should belong to the host that will publish or submit the sitemap. Freshness means lastmod should describe a meaningful page change, not a scheduled export. Intent means the list should contain the URLs you actually want search engines to consider as canonical, not every possible path a server can return.
| Field | What it tells crawlers | Common mistake |
|---|---|---|
| loc | The absolute URL to consider for discovery. | Listing duplicates, tracking URLs, old protocol variants, or the wrong host. |
| lastmod | The date or date-time of a significant page update. | Using today's date for unchanged pages. |
| changefreq | An optional hint about expected change frequency. | Treating it as a crawl schedule. |
| priority | An optional same-site importance hint from 0.0 to 1.0. | Expecting it to improve ranking against other sites. |
Large sites also have a mechanical boundary: one sitemap file is limited to 50,000 URLs and 50 MB uncompressed. Bigger inventories need multiple sitemap files and usually a sitemap index. Smaller sites still benefit from the same discipline because duplicate or stale URLs can make troubleshooting harder after submission.
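The split described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the generator's own logic: the sitemap-N.xml file names and the base_url parameter are hypothetical, and only the 50,000-URL limit comes from the text.

```python
from typing import Iterator, List
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file URL limit stated in the text


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_FILE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_index(base_url: str, file_count: int) -> str:
    """Return a sitemap index listing sitemap-1.xml .. sitemap-N.xml.

    The file naming scheme is an assumption made for this example.
    """
    entries = "\n".join(
        f"  <sitemap><loc>{escape(base_url)}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, file_count + 1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )
```

Each chunk would be written as its own sitemap file, and the index is the file submitted or listed in robots.txt.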
A sitemap works best beside robots rules, redirects, canonical tags, internal links, and webmaster-tool checks. It helps crawlers find and revisit preferred URLs, but it should not be used to hide architecture problems or to compensate for pages that are blocked, redirected incorrectly, or duplicated across several canonical forms. After publication, the most useful checks are simple: fetch the public file, confirm that it returns sitemap XML, list it in robots.txt when appropriate, and watch webmaster-tool feedback for fetch or parsing errors.
How to Use This Tool:
Start with the host that owns the sitemap, then paste or load the URL inventory and review the generated checks before publishing the XML.
- Set Site origin to the exact http or https origin for the sitemap, using the site's canonical host and protocol. Relative paths will use this origin.
- Paste one URL or root-relative path per line in URL inventory, or choose Browse TXT/CSV. A row may include optional fields after the URL as lastmod, changefreq, and priority.
- Choose Lastmod handling. Use per-line values when they come from a CMS or crawl export, apply a default only when missing rows truly share that date, or choose Omit lastmod when dates are uncertain.
- Set Optional hint tags. Global hints applies one changefreq and priority to every included row, Per-line with fallback respects row values when valid, and Omit hints writes only URL and date fields.
- Leave De-duplicate URLs on for a publication draft. Use Output order for stable diffs, Strip tracking query params when campaign parameters are not canonical, and Include schema location when a validator workflow expects it.
- Check the summary, URL Inventory, and Publish Check. If the draft is blocked, fix the origin, add at least one valid same-host URL, split oversized files, or correct invalid default dates before copying Sitemap XML.
Interpreting Results:
The summary is a readiness signal, not the full audit. Sitemap draft blocked means the XML is missing a required condition. Sitemap ready with review notes means XML can be produced, but one or more rows, dates, hints, exclusions, or handoff checks still deserve attention.
- Sitemap XML is the text to publish only after blockers and important review notes are resolved.
- URL Inventory shows which rows are Included, Review, or Excluded, with row-level notes for fragments, tracking cleanup, duplicate removal, invalid dates, cross-host URLs, and parse failures.
- Publish Check compares the draft against host scope, entry count, size limits, lastmod policy, optional hints, XML escaping, source exclusions, and discovery handoff.
- Passing checks do not prove that the listed pages are crawlable, canonical, indexable, or reachable over live HTTP. Test representative URLs after publishing and compare Search Console or webmaster-tool feedback.
Technical Details:
The standard XML sitemap format uses a urlset root element in the sitemap namespace. Each url entry must contain one absolute loc value. The optional lastmod, changefreq, and priority elements add metadata, but they do not replace the required URL.
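As an illustration of that structure, the following Python sketch builds a urlset with the standard library's ElementTree module. The build_urlset function and the dictionary shape of each entry are assumptions made for this example and do not describe how the generator itself writes XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries: Iterable[Dict[str, str]]) -> str:
    """Serialize entries as a sitemap urlset.

    Each entry needs a "loc" key; "lastmod", "changefreq", and
    "priority" are written only when present and non-empty.
    """
    root = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if entry.get(tag):
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(root, encoding="unicode")  # text is entity-escaped here
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

ElementTree escapes characters such as & when it serializes element text, so a loc with a query string stays well-formed without manual escaping.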
Canonicalization matters because crawlers compare sitemap entries with the rest of the site's signals. A path such as /pricing becomes useful only after it is resolved to a full URL. Query strings may identify real content, but analytics parameters and fragments usually do not belong in a canonical sitemap URL. When an absolute URL already includes a protocol, review it against the site's chosen canonical protocol before publishing.
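A rough Python sketch of that resolution and cleanup follows. The resolve_loc name, the list of tracking parameters, and the choice to return None for cross-host URLs are all assumptions for illustration; the generator's actual parameter list is not documented here.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Illustrative tracking keys only; a real cleanup list may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def resolve_loc(origin: str, raw: str, strip_tracking: bool = True) -> Optional[str]:
    """Resolve a path or URL against origin; return None for other hosts."""
    absolute = urljoin(origin.rstrip("/") + "/", raw.strip())
    parts = urlsplit(absolute)
    if parts.netloc.lower() != urlsplit(origin).netloc.lower():
        return None  # cross-host entries do not belong in this sitemap
    query = parse_qsl(parts.query, keep_blank_values=True)
    if strip_tracking:
        query = [
            (key, value)
            for key, value in query
            if not key.lower().startswith(TRACKING_PREFIXES)
            and key.lower() not in TRACKING_KEYS
        ]
    # An empty fifth component drops the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", urlencode(query), ""))
```

Note that parsing and re-encoding the query can normalize its percent-encoding, so a production version may prefer to filter the raw query string instead.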
Transformation Core:
| Stage | Rule | Result Cue |
|---|---|---|
| Origin check | The origin must parse as http or https. | An invalid origin blocks the draft. |
| Row parsing | Each non-empty, non-comment line is split as a URL plus optional date, frequency, and priority fields. | Pipe, tab, and CSV-style comma rows are accepted. |
| URL resolution | Root-relative paths use the selected origin, while absolute URLs must stay on the selected host. | Different hosts are excluded from the XML. |
| URL cleanup | Fragments are removed. Common tracking query parameters are removed only when that option is enabled. | The inventory note records cleanup that affects an included row. |
| Optional fields | lastmod must be a valid date or date-time. changefreq must match a protocol value. priority is constrained to 0.0 through 1.0. | Invalid dates are omitted or replaced by the selected policy, and invalid hints fall back when fallback mode is active. |
| XML writing | Included values are entity-escaped before loc, lastmod, changefreq, and priority are written. | A URL containing & remains valid XML text. |
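The row-parsing and hint-fallback stages might look like the Python sketch below. The delimiter precedence (pipe, then tab, then comma) and the function names are assumptions; the changefreq values are the ones defined by the sitemaps.org protocol.

```python
from typing import Dict, Optional

# changefreq values defined by the sitemaps.org protocol
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def split_row(line: str) -> Dict[str, str]:
    """Split one inventory line into url, lastmod, changefreq, priority."""
    # Assumed precedence: the first delimiter found in this order wins.
    for delimiter in ("|", "\t", ","):
        if delimiter in line:
            cells = line.split(delimiter)
            break
    else:
        cells = [line]
    # Pad so that missing optional fields become empty strings.
    cells = [cell.strip() for cell in cells] + [""] * 4
    url, lastmod, changefreq, priority = cells[:4]
    return {"url": url, "lastmod": lastmod, "changefreq": changefreq, "priority": priority}


def pick_changefreq(value: str, fallback: Optional[str] = None) -> Optional[str]:
    """Keep a valid protocol value, otherwise return the fallback."""
    value = value.strip().lower()
    return value if value in CHANGEFREQ_VALUES else fallback
```

One limitation of this simplification: a comma-delimited row whose URL itself contains a comma would be split in the wrong place, so a real parser needs CSV quoting or a stricter rule.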
Protocol and Review Limits:
| Check | Limit or Rule | Response |
|---|---|---|
| URLs per sitemap | 50,000 | Split larger inventories into multiple sitemap files. |
| Uncompressed file size | 50 MB | Split the file or publish a compressed version while keeping the uncompressed size within the limit. |
| loc length | 2,048 characters | Shorten, canonicalize, or remove overlong URLs before publishing. |
| Host scope | one host | Generate a separate sitemap for a different host or subdomain. |
| Google hint handling | advisory | Expect changefreq and priority to be ignored by Google; accurate URLs and dates matter more. |
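The numeric limits in the table can be checked with a few lines of Python. This is a sketch with hypothetical names; the loc comparison follows the sitemaps.org wording that a loc value must be less than 2,048 characters, and the generator's exact boundary handling is not stated in this document.

```python
from typing import List

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed (52,428,800 bytes)
MAX_LOC_CHARS = 2_048


def check_limits(locs: List[str], xml_text: str) -> List[str]:
    """Return a list of human-readable limit violations (empty when clean)."""
    problems = []
    if len(locs) > MAX_URLS:
        problems.append(f"{len(locs)} URLs exceeds the {MAX_URLS} per-file limit")
    size = len(xml_text.encode("utf-8"))  # measure bytes, not characters
    if size > MAX_BYTES:
        problems.append(f"{size} bytes exceeds the {MAX_BYTES} byte limit")
    for loc in locs:
        # sitemaps.org: loc must be less than 2,048 characters
        if len(loc) >= MAX_LOC_CHARS:
            problems.append(f"loc too long ({len(loc)} characters): {loc[:60]}...")
    return problems
```

Measuring the encoded byte length matters because non-ASCII characters take more than one byte in UTF-8.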
A row such as /docs/getting-started?utm_source=newsletter | 2026-04-12 | weekly | 0.7 is resolved under the selected origin, the tracking parameter is removed when that cleanup option is on, the W3C-compatible date is kept, and the remaining values are emitted as one escaped url entry.
Comment lines and blank lines are skipped before row validation, so a working inventory can keep short notes while the XML stays limited to included URLs. Sorting by URL or path depth changes only the order of emitted entries; it does not override exclusion rules, date validation, hint fallback, duplicate handling, or the one-host sitemap boundary.
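A small Python sketch of the skipping and depth-ordering behavior is shown below. The # comment marker is an assumption, since the comment syntax is not specified here, and the function names are hypothetical.

```python
from typing import List
from urllib.parse import urlsplit


def usable_lines(text: str, comment_prefix: str = "#") -> List[str]:
    """Drop blank lines and comment lines before any row validation."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith(comment_prefix)]


def path_depth(loc: str) -> int:
    """Count non-empty path segments, so "/" has depth 0."""
    return len([segment for segment in urlsplit(loc).path.split("/") if segment])


def sort_by_depth(locs: List[str]) -> List[str]:
    """Order by path depth, then alphabetically for stable diffs."""
    return sorted(locs, key=lambda loc: (path_depth(loc), loc))
```

Sorting operates on the already-included list, which mirrors the point above that ordering never changes which rows are included.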
Privacy Notes:
Pasted inventory text and selected TXT, CSV, or TSV files are read in the current browser session to build the sitemap output. The generator does not crawl listed URLs, fetch page content, submit the sitemap to search engines, or verify live HTTP status.
- Remove staging URLs, private paths, tokens, and campaign parameters before sharing XML, JSON, CSV, or document exports.
- Treat generated sitemap.xml as public once published because it advertises URLs you want crawlers to discover.
- Run a separate crawl, link check, or webmaster-tool validation when live reachability and indexing feedback matter.
Worked Examples:
A small documentation site sets Site origin to its canonical web origin and pastes /, /pricing, and /docs/getting-started | 2026-04-12 | weekly | 0.7. The URL Inventory shows three Included rows, Publish Check passes host scope and protocol limits, and Sitemap XML contains three loc entries.
A CMS export contains a blog-host page while the origin is the main site host. That row appears as Excluded because the host does not match. The right fix is to generate a separate sitemap for the blog host, not to force the row into the main site's file.
A row such as /sale#signup | 2026-02-31 | daily | 1.2 needs review before publishing. The fragment is removed, the invalid lastmod is omitted or replaced depending on Lastmod handling, and the priority is constrained to the valid sitemap range. Check the row note and Publish Check before copying the XML.
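The date and priority handling in the last example can be approximated in Python as follows. Treating "constrained" as clamping to the nearest bound is an assumption, as are the function names. This sketch also accepts only full dates and ISO date-times, although the W3C format permits year-only and year-month values too.

```python
from datetime import date, datetime
from typing import Optional


def valid_lastmod(value: str) -> bool:
    """Return True for a real calendar date or ISO date-time."""
    text = value.strip()
    if not text:
        return False
    try:
        date.fromisoformat(text)  # rejects impossible dates such as 2026-02-31
        return True
    except ValueError:
        pass
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"  # older Python versions do not parse "Z"
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def clamp_priority(value: str) -> Optional[str]:
    """Clamp a numeric priority into 0.0..1.0; return None if not a number."""
    try:
        number = float(value)
    except ValueError:
        return None
    if number != number:  # NaN is not a usable priority
        return None
    return f"{min(1.0, max(0.0, number)):.1f}"
```

With these helpers, the row above would lose its lastmod value unless a default date policy supplies one, and its priority would be written as 1.0.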
FAQ:
Can I mix subdomains in one sitemap?
No. The inventory is constrained to the host in Site origin. Use a separate sitemap for each host, for example one for www.example.com and another for blog.example.com.
Should every URL have today's lastmod date?
Only use today's date when the page content actually changed today. If the date is uncertain, choose Omit lastmod or use per-line dates from a reliable CMS, crawl, or deployment source.
Why did a URL disappear from Sitemap XML?
Open URL Inventory. Rows can be excluded for missing values, parse failures, cross-host URLs, overlong loc values, or duplicate URLs when De-duplicate URLs is on.
Does a passing Publish Check mean the pages will be indexed?
No. Publish Check reviews sitemap structure and handoff risks. Indexing still depends on crawl access, canonical signals, redirects, page quality, and search engine decisions.
Glossary:
- urlset: The root XML element for a standard sitemap file.
- loc: The required absolute URL for one sitemap entry.
- lastmod: An optional date or date-time for a meaningful page change.
- changefreq: An optional protocol hint for likely update frequency.
- priority: An optional same-site importance hint from 0.0 to 1.0.
- sitemap index: A file that lists multiple sitemap files for larger inventories.