Sitemap URL Extractor

Current extraction

Start from:

Choose the source that matches your workflow before extracting URLs.

Sitemap URL:

The backend fetcher blocks private hosts and returns parsed URL rows plus source sitemap evidence.

Sitemap XML:

Paste XML, drop a TXT/XML export, or load a sample. URL rows update locally as you edit.

Include URL filter:

Leave blank to include every valid URL from the sitemap.

Exclude URL filter:

Use this to remove archives, faceted paths, or campaign URLs before export.

De-duplicate URLs:

Leave on for migration and crawl QA exports unless you are investigating duplicate sitemap entries.

Distribution chart:

Host and section views are useful for migrations; depth and file type views catch crawl-shape issues.

Index sitemap locs:

Turn on when your task is collecting child sitemap URLs from an index file.


Strip URL fragments:

The original URL remains visible in JSON evidence when normalization changes it.

Case-sensitive filters:

Affects include and exclude matching only.



A sitemap audit often starts with a deceptively simple question: which URLs is the site telling crawlers to consider? The answer matters during migrations, launch checks, content cleanups, crawl-budget reviews, and Search Console troubleshooting because the sitemap is usually the most deliberate URL inventory a site owner publishes.

An XML sitemap is a crawler hint, not a crawl report. It can name pages, images, videos, news entries, alternate-language URLs, and update dates, but the ordinary page inventory is built around loc values. A sitemap index is one step higher: it lists child sitemap files instead of page URLs, which is why a large site may have one index file and many smaller page sitemaps underneath it.

urlset: The XML root that normally holds page URL entries. Each url entry needs a loc value and may include optional hints such as lastmod.
sitemapindex: The XML root that points to other sitemap files. Its loc values are child sitemap URLs, not page URLs.
lastmod: A date or date-time hint for a meaningful update. It is useful only when it stays consistent with the real page or sitemap change.

Extraction turns XML into a reviewable table. Instead of scanning tags by eye, an auditor can count included rows, see whether URLs share one host, compare top path sections, find duplicate loc values, and spot missing update dates. That inventory is especially useful before a migration redirect map, a focused crawler run, or a conversation with developers about stale generated sitemaps.

Common sitemap audit questions and limits
Audit Question	Why Extraction Helps	What It Cannot Prove
Which URLs are advertised?	Every usable `loc` value becomes a row that can be filtered, copied, or handed to another crawler.	The page may still redirect, be blocked, return an error, or canonicalize elsewhere.
Does the sitemap stay in scope?	Host, path depth, and top path section make cross-host or wrong-directory entries easier to notice.	Verified cross-site submission and Search Console ownership still need separate confirmation.
Can update dates be trusted?	`lastmod` coverage shows how much of the inventory carries a freshness hint.	A present date is not proof that the date reflects a significant page change.
Is the file within protocol limits?	URL count and source size checks expose oversized single-file inventories.	A valid size does not guarantee that search engines will crawl or index every URL.

Sitemap audit flow from XML source to loc values, normalized URLs, review signals, and live crawler checks

A common mistake is to treat a clean extraction as an indexation verdict. Sitemaps help discovery, especially for large, recently changed, isolated, or media-heavy content, but they do not override robots rules, canonical tags, redirects, duplicate handling, quality signals, or normal crawler discovery through links.

The best use of an extracted sitemap list is as a starting inventory. Once the rows look right, the important samples still need live checks against HTTP status, robots controls, canonical URLs, internal links, and search-console feedback.

How to Use This Tool:

Pick the source mode first, then use filters and review tabs to narrow the inventory without losing evidence about the rows that were removed.

Set Start from to Paste sitemap XML for pasted XML, a dropped TXT or XML file, or a sample. Choose Fetch sitemap URL only for a public http or https sitemap URL that is safe for remote retrieval.
In paste mode, add a urlset, a sitemapindex, or plain text containing URLs. The source hint reports how many loc values were parsed locally, and warnings appear when XML parsing fails or when only fallback URL text could be recovered.
In fetch mode, enter the public sitemap address in Sitemap URL and run Extract. Invalid addresses, unsupported protocols, timeouts, private hosts, and unreachable sources are reported as alerts so you can correct the address or switch to pasted XML.
Use Include URL filter and Exclude URL filter when the audit needs only part of the inventory. Patterns may be comma-separated or line-separated, and * works as a wildcard. Case-sensitive filters changes only this matching step.
Leave De-duplicate URLs on for normal exports. Keep Strip URL fragments on unless you are deliberately checking fragment-bearing source values. Turn on Index sitemap locs only when you want child sitemap file URLs in URL List.
Read Coverage Audit before exporting. It summarizes source type, included URLs, host scope, lastmod coverage, duplicate rows, filter exclusions, protocol limits, and index-source handling.
Use URL List for the included rows, Review Ledger for included and excluded evidence, Source Sitemaps for fetched or child sitemap files, URL Distribution for grouped counts, Plain URLs for a clean handoff list, and JSON when another process needs structured audit evidence.
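The include and exclude behaviour described in the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's own code: the names compile_patterns and keep_url are hypothetical, and the sketch assumes a pattern may match anywhere in the URL rather than having to cover the whole string.

```python
import re


def compile_patterns(text: str, case_sensitive: bool = False) -> list[re.Pattern]:
    """Split comma- or line-separated patterns and treat * as a wildcard."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = []
    for token in re.split(r"[,\n]+", text):
        token = token.strip()
        if not token:
            continue
        # Escape everything literally, then let each * stand for any run of characters.
        regex = ".*".join(re.escape(part) for part in token.split("*"))
        patterns.append(re.compile(regex, flags))
    return patterns


def keep_url(url: str, include: list[re.Pattern], exclude: list[re.Pattern]) -> bool:
    """Apply include patterns first, then exclude patterns."""
    # An empty include list means every valid URL is a candidate.
    if include and not any(p.search(url) for p in include):
        return False
    return not any(p.search(url) for p in exclude)
```

For example, keep_url("https://example.com/docs/intro", compile_patterns("/docs/*"), compile_patterns("utm_*")) keeps the row, while the same URL with ?utm_source=mail appended is dropped by the exclude pattern.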

If the included count is unexpectedly zero, check whether you pasted an index file, turned on a narrow include filter, excluded the same paths you meant to keep, selected a gzipped local file, or supplied XML that needs repair.

Interpreting Results:

The headline number is the included URL count after parsing, normalization, optional fragment removal, de-duplication, and active filters. Trust it as the working set for the current settings, not as a claim that those pages are crawlable or indexed.

Sitemap extractor output cues and recommended checks
Output Cue	What It Means	Best Follow-up
Multiple hosts	Included rows span more than one host or subdomain.	Confirm that cross-site submission or verified ownership is intentional before treating the mix as valid.
Low lastmod coverage	Few included rows carry an update date.	Use CMS, crawl, or deployment evidence before relying on the sitemap for freshness review.
Duplicate locs	The same normalized URL appeared more than once.	Inspect canonical spelling, tracking parameters, fragments, generator logic, and repeated child sitemaps.
Protocol limit check: Blocked	The included count or measured pasted source size exceeds a single-sitemap boundary.	Split the inventory into smaller sitemap files and use an index file where needed.
Index sources: Review	The source listed child sitemap files, and those child URLs are evidence rather than page rows unless included.	Fetch the child files for page URLs, or turn on Index sitemap locs only when collecting sitemap-file addresses.
Invalid loc values	One or more rows could not be accepted as `http` or `https` URLs.	Fix malformed, relative, unsupported-scheme, or accidentally escaped values before publishing the sitemap.

A clean Coverage Audit lowers parsing risk, but it does not settle crawlability. Test representative rows for HTTP status, redirects, robots controls, canonical targets, and Search Console sitemap feedback before using the extracted list as a launch or migration sign-off.

Technical Details:

The sitemap protocol separates page inventories from sitemap inventories. A page sitemap uses urlset and contains url children with required loc values. A sitemap index uses sitemapindex and contains sitemap children whose loc values point to other sitemap files.
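As a minimal sketch of that split, the Python below reads the root element and collects only the loc values that sit directly inside its url or sitemap entries. The function names read_sitemap and local_name and the returned tuple are illustrative assumptions, not the extractor's actual interface.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def read_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return (root name, loc values) for a urlset or sitemapindex document."""
    root = ET.fromstring(xml_text)
    kind = local_name(root.tag)
    # urlset holds <url> entries; sitemapindex holds <sitemap> entries.
    # Any other root yields no entries in this sketch.
    entry_name = {"urlset": "url", "sitemapindex": "sitemap"}.get(kind)
    locs: list[str] = []
    for entry in root:
        if local_name(entry.tag) != entry_name:
            continue
        # Only direct <loc> children count, so nested image or video
        # locations inside a <url> entry are not mistaken for page URLs.
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs
```

A caller can branch on the first element of the result: "urlset" means the list holds page URLs, while "sitemapindex" means it holds child sitemap addresses that still need to be fetched.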

URL extraction is a transformation problem. The source must be recognized, candidate URL strings must be decoded into absolute http or https URLs, and review fields such as host, path depth, top path section, file type, update date, and source role must stay attached to each row.
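One way to picture the per-row review fields is the sketch below, which uses Python's urllib.parse. The function name describe_url, the dictionary keys, and the exact definitions of depth (count of non-empty path segments) and file type (extension of the last segment) are assumptions made for illustration; the tool may define them differently.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit


def describe_url(raw: str) -> Optional[dict]:
    """Return review fields for an absolute http(s) URL, or None if rejected."""
    try:
        parts = urlsplit(raw.strip())
    except ValueError:  # for example, a malformed bracketed IPv6 host
        return None
    # Relative paths have no scheme or host; other schemes are not accepted.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    segments = [s for s in parts.path.split("/") if s]
    # Treat a trailing slash as a directory, so "/v1.2/" has no file type.
    suffix = "" if parts.path.endswith("/") else PurePosixPath(parts.path).suffix
    return {
        "url": raw.strip(),
        "host": parts.hostname,  # urlsplit lowercases the host
        "depth": len(segments),
        "section": segments[0] if segments else "/",
        "file_type": suffix.lstrip(".").lower(),
    }
```

Returning None for rejected values keeps the acceptance rule in one place; a fuller version would return the rejection reason so it can be shown in a review table.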

Transformation Core

Sitemap URL extraction transformation rules
Stage	Rule	Result Cue
Source recognition	`urlset` XML yields page URL rows. `sitemapindex` XML yields child sitemap rows. Non-standard XML is scanned for `loc` values, and non-XML text is scanned for URL-looking strings.	Source type names the recognized source shape.
URL acceptance	Only absolute `http` and `https` URLs are accepted for included rows. Unsupported schemes, malformed values, and relative paths are marked for review.	Status shows Extracted, Normalized, or the exclusion reason.
Fragment handling	When fragment stripping is enabled, a URL ending in an in-page anchor is compared and exported without the fragment.	Review Ledger keeps the row evidence while the included URL shows the normalized value.
Filter order	Include patterns are evaluated before exclude patterns. Matching uses the normalized URL, host, and path, with case-insensitive matching by default.	Filter exclusions counts rows removed by active filters.
De-duplication	When enabled, the first normalized URL remains included and later copies are excluded.	Duplicate locs and Review Ledger identify repeated entries.
Index handling	Child sitemap `loc` values are source evidence by default. They become included rows only when index sitemap locs are intentionally included.	Source Sitemaps and Index sources explain the current role.
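
The fragment, filter, and de-duplication rows above can be combined into one pass, sketched here in Python. The function name build_ledger, the row keys, and the exclusion reason strings are hypothetical; only the Extracted and Normalized labels come from the table. The sketch also assumes that filters run before de-duplication, which changes the recorded reason for a row but not the final included set.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import urldefrag

Predicate = Callable[[str], bool]


def build_ledger(
    locs: Iterable[str],
    include: Optional[Predicate] = None,
    exclude: Optional[Predicate] = None,
    strip_fragments: bool = True,
    dedupe: bool = True,
) -> list[dict]:
    """Return one ledger row per source loc, with a status and included flag."""
    seen: set[str] = set()
    ledger = []
    for number, raw in enumerate(locs, start=1):
        # urldefrag drops everything from '#' onward.
        url = urldefrag(raw).url if strip_fragments else raw
        if include is not None and not include(url):
            status = "Outside include filter"
        elif exclude is not None and exclude(url):
            status = "Matched exclude filter"
        elif dedupe and url in seen:
            status = "Duplicate"  # the first normalized copy stays included
        else:
            status = "Normalized" if url != raw else "Extracted"
            seen.add(url)
        ledger.append({
            "number": number,
            "original": raw,  # kept as evidence when normalization changes the URL
            "url": url,
            "status": status,
            "included": status in ("Extracted", "Normalized"),
        })
    return ledger
```

Keeping excluded rows in the same list, rather than discarding them, is what lets a review view explain why a given source entry is missing from the export.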

Distribution Share Formula

Grouped distribution counts are calculated after exclusions. The selected group can be top path section, host, path depth, or file type.

\text{share} = \frac{\text{URLs in selected group}}{\text{included URLs}} \times 100

If 240 included rows contain 60 URLs under /docs/, the top-section share is 60 / 240 × 100 = 25%. Changing filters, duplicate handling, fragment stripping, or index inclusion changes the denominator, so distribution shares should be compared only when those settings stay the same.
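
The same calculation can be written as a short Python sketch. The function name distribution_shares is hypothetical, and the input is assumed to be one group label per included row.

```python
from collections import Counter


def distribution_shares(groups: list[str]) -> dict[str, float]:
    """Map each group label to its percentage share of the included rows."""
    total = len(groups)
    if total == 0:
        return {}
    counts = Counter(groups)
    # most_common() orders groups from largest to smallest count.
    return {group: count / total * 100 for group, count in counts.most_common()}
```

With 60 rows labelled /docs/ and 180 labelled /blog/, the result maps /docs/ to 25.0 and /blog/ to 75.0, matching the worked figure above.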

Protocol and Retrieval Boundaries

Sitemap protocol and retrieval boundaries
Boundary	Rule	Why It Matters
Single sitemap size	One sitemap is limited to `50,000` URLs and `50 MB` uncompressed.	Larger inventories need multiple sitemap files and usually a sitemap index.
Sitemap index size	One index file may list up to `50,000` child sitemaps and is also limited to `50 MB` uncompressed.	A large site can need more than one index file when both child-file count and file size grow.
Absolute URLs	Sitemaps should use fully specified URLs with scheme and host.	Relative paths are not enough for reliable crawler submission.
Host and directory scope	A sitemap normally covers URLs on the same host and within the applicable directory scope, unless cross-site submission has been set up.	Mixed hosts are a review signal even when they are technically intentional.
`lastmod`	The date should represent a meaningful page or sitemap-file change and use a supported date or date-time format.	Repeatedly inaccurate dates reduce trust in the freshness signal.
`changefreq` and `priority`	They are optional hints, and Google does not use them for crawl scheduling or ranking decisions.	They should not be treated as commands, quality signals, or proof of page importance.
Public retrieval	Server-assisted fetching accepts public `http` and `https` sources, follows a bounded number of child sitemap files, caps retrieved bytes, and blocks private-network or localhost targets.	Use pasted XML for private inventories and expect very large public sitemap sets to need separate review passes.
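
The two single-file limits in the table can be checked with a few lines of Python. This is a sketch only: the function name protocol_limit_issues and its messages are hypothetical, and 50 MB is read as 52,428,800 bytes, the figure given on the sitemaps.org protocol page.

```python
from typing import Optional

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed


def protocol_limit_issues(url_count: int, xml_text: Optional[str] = None) -> list[str]:
    """Return a list of limit problems; an empty list means within limits."""
    issues = []
    if url_count > MAX_URLS:
        issues.append(f"{url_count} URLs exceeds the {MAX_URLS} URL limit")
    if xml_text is not None:
        # Size is measured on the encoded bytes, not the character count.
        size = len(xml_text.encode("utf-8"))
        if size > MAX_BYTES:
            issues.append(f"{size} bytes exceeds the {MAX_BYTES} byte limit")
    return issues
```

The size check is optional here because a fetched source may already have been measured elsewhere; for pasted text, passing the source string covers both limits in one call.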

Browser file import is meant for ordinary TXT or XML review files. Gzipped local files need to be fetched as public sitemap URLs or decompressed before import, and large local files should be split before they are pasted for review so the browser remains responsive.

Privacy Notes:

Pasted text, dropped TXT/XML content, and selected local files are parsed in the browser. Fetch mode is different: the public sitemap URL you enter is sent to a Simplified Tools retrieval service, which requests the public source and returns extracted rows, source evidence, and warnings.

Use paste mode for confidential sitemap text, staging inventories, signed URLs, internal hosts, or launch plans that should not be requested by a remote service.
Fetch mode rejects localhost and private-network targets, but a public URL can still reveal the existence and timing of content listed in that sitemap.
Before sharing results, check Review Ledger and Coverage Audit so filters, duplicates, index handling, and invalid rows are not mistaken for missing pages.
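
The localhost and private-network rejection mentioned above can be approximated with Python's ipaddress module. This is a hedged sketch of the general technique, not the retrieval service's actual rule set, and is_blocked_target is a hypothetical name. It inspects only literal hosts; a real fetcher would also need to resolve DNS names and re-check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_target(url: str) -> bool:
    """Return True for non-http(s) URLs, localhost names, and private IP literals."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a DNS name this sketch cannot judge.
        return False
    return address.is_private or address.is_loopback or address.is_link_local
```

Under these assumptions, http://192.168.1.10/sitemap.xml and http://localhost:8080/sitemap.xml are blocked, while https://example.com/sitemap.xml passes the literal check.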

Worked Examples:

Migration inventory check

A migration team pastes a urlset with 4,820 page entries. URL List shows 4,806 included rows after De-duplicate URLs removes repeated canonical URLs. Coverage Audit reports one host and 4,100 rows with lastmod, so the next pass can focus on duplicate source rows and pages that lack an update date.
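
The arithmetic behind this example can be made explicit with a small Python sketch. The function name audit_counts and its keys are hypothetical, and the sketch assumes the 4,100 lastmod figure is counted against the included rows.

```python
def audit_counts(source_rows: int, included_rows: int, rows_with_lastmod: int) -> dict:
    """Derive removed rows, missing lastmod rows, and lastmod coverage percent."""
    return {
        "removed": source_rows - included_rows,
        "missing_lastmod": included_rows - rows_with_lastmod,
        "lastmod_coverage_pct": round(rows_with_lastmod / included_rows * 100, 1),
    }
```

For the figures above, audit_counts(4820, 4806, 4100) reports 14 removed rows, 706 rows without lastmod, and 85.3 percent lastmod coverage.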

Index file with no page rows

A large-site source contains a sitemapindex with 18 child sitemap files. With Index sitemap locs off, Included URLs is 0 and Index sources is marked Review because the child sitemap URLs are evidence only. Turning the option on includes those 18 sitemap-file URLs when the goal is collecting child files, not page URLs.

Oversized sitemap boundary

A generated page sitemap contains 52,300 included rows. Protocol limit check is marked Blocked because a single sitemap file should stay at or below 50,000 URLs. The practical fix is to split the inventory into multiple sitemap files, then list those files from a sitemap index.
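
A split along those lines might look like the Python sketch below. The helper names split_into_sitemaps and build_index are hypothetical, and the child sitemap addresses passed to build_index are placeholders that a real generator would choose.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_into_sitemaps(urls: list[str], chunk_size: int = 50_000) -> list[list[str]]:
    """Cut the URL list into consecutive chunks of at most chunk_size."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def build_index(child_sitemap_urls: list[str]) -> str:
    """Render a sitemapindex document that lists the child sitemap files."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for url in child_sitemap_urls:
        # escape() handles &, <, and > so query strings stay valid XML.
        lines.append(f"  <sitemap><loc>{escape(url)}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)
```

Applied to 52,300 URLs, split_into_sitemaps returns two chunks of 50,000 and 2,300 rows, and build_index can then list the two resulting files.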

Focused documentation export

An editorial audit needs only documentation pages. Include URL filter is set to /docs/*, Exclude URL filter removes campaign parameters, and Distribution chart is set to Top path section. The included count drops from 12,000 to 2,340, Filter exclusions records the removed rows, and Plain URLs becomes a clean crawl seed list.

Broken XML salvage

A copied sitemap fragment is missing closing tags. The warning reports that XML parsing failed, but URL-looking text is recovered as rows. That fallback is useful for quick salvage work, but a publication audit should still fetch or paste the original sitemap so lastmod, child sitemap entries, and source roles are not lost.
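
The salvage behaviour can be sketched in Python as a parse attempt followed by a text scan. This illustrates the general fallback technique rather than the tool's parser: the function name extract_locs, the mode labels, and the URL pattern are all assumptions.

```python
import html
import re
import xml.etree.ElementTree as ET

# URL-looking text: http(s):// followed by anything up to whitespace, a tag, or a quote.
URL_TEXT = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_locs(text: str) -> tuple[str, list[str]]:
    """Return ("xml", locs) when parsing works, else ("fallback", scanned URLs)."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Broken XML: recover URL-looking strings and decode entities such as &amp;.
        return "fallback", [html.unescape(m) for m in URL_TEXT.findall(text)]
    # Coarse scan of every <loc> element, with or without a namespace prefix.
    locs = [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
    return "xml", locs
```

The fallback branch returns only URL strings, which is why lastmod values and source roles are lost when the XML cannot be parsed.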

FAQ:

Why did a sitemap index show zero included URLs?

A sitemap index lists child sitemap files, not page URLs. Check Source Sitemaps, turn on Index sitemap locs when you want those child file addresses, or fetch the child sitemap files to extract page rows.

Can I paste a plain URL list instead of XML?

Yes. If the pasted source has no XML markup, the parser scans for http and https URLs and labels the source as a plain URL list.

Why were rows excluded from URL List?

Rows can be excluded because the URL is invalid, a duplicate while de-duplication is on, outside the include filter, or matched by the exclude filter. Review Ledger shows the row-level reason.

Should sitemap URLs include fragments?

Usually no. A fragment identifier points to a location within a page, and Google's URL structure guidance discourages using fragments to serve distinct page content. Strip URL fragments keeps duplicate checks focused on the page URL.

Why did my local file fail to import?

The browser import path rejects gzipped files and large local files. Fetch a public gzipped sitemap by URL, or decompress and split the file before loading it in paste mode.

Does a clean extraction mean the sitemap is ready to submit?

No. It means the source produced reviewable rows under the current settings. Check protocol limits, canonical choices, redirects, robots rules, noindex signals, and Search Console feedback before treating the sitemap as ready.

Glossary:

urlset: The sitemap XML root that contains page URL entries.
sitemapindex: The sitemap XML root that lists child sitemap files.
loc: The URL value inside a page entry or child sitemap entry.
lastmod: An optional date or date-time for the last meaningful change to a page or sitemap file.
Host scope: The host or subdomain covered by the sitemap URLs being reviewed.
De-duplication: The rule that keeps the first normalized URL and excludes later matching copies.
