XML sitemaps are structured URL inventories for search crawlers and site audits. A normal sitemap lists page URLs in loc elements, while a sitemap index lists other sitemap files that hold the page URLs. Extracting those rows is useful during migrations, crawl reviews, content audits, and cleanup work where the first question is simply which URLs the sitemap is advertising.

The extracted list is not the same thing as an indexation report. A URL can appear in a sitemap and still be blocked, redirected, canonicalized elsewhere, duplicated, stale, or ignored by a search engine. The value of an extraction is that it turns XML into a reviewable table before deeper crawl, log, or Search Console checks begin.

[Figure: Sitemap extraction flow from XML or text through loc rows, filters, de-duplication, and audit outputs]

Sitemap extraction also helps expose shape problems. Multi-host inventories, missing lastmod values, duplicate canonical URLs, fragment links, index files with no page rows, and oversized files are easier to discuss when each URL has its host, path depth, source, and review status beside it.

The main caution is scope. A sitemap is a crawler hint, not a guarantee that a page is live, crawlable, canonical, or selected for search results. Use the extracted rows as a starting inventory, then verify important samples against the site, crawl rules, redirects, and Search Console evidence.

How to Use This Tool:

Choose the input path first, then refine the extracted rows with filters and review the audit signals before using the output in another crawl or migration workflow.

  1. Set Start from to Paste sitemap XML for local review, or choose Fetch sitemap URL when the sitemap is public and should be retrieved for you.
  2. For pasted work, place XML in Sitemap XML, drop a TXT or XML file, or load one of the samples. A standard urlset produces page URL rows, and a sitemapindex produces child sitemap evidence unless you include index entries later.
  3. For public fetching, enter an absolute http or https value in Sitemap URL and run Extract. If the target is not public, uses another protocol, times out, or cannot be parsed, fix the address or use pasted XML instead.
  4. Use Include URL filter and Exclude URL filter when you only need part of the inventory. Patterns may be comma-separated or line-separated, and * works as a wildcard for path or host matching.
  5. Keep De-duplicate URLs on for normal audits so the first normalized URL is kept and later copies are counted as duplicates. In Advanced, change Distribution chart, Index sitemap locs, Strip URL fragments, or Case-sensitive filters only when those choices match the review question.
  6. Read URL List for the included rows, Review Ledger for included and excluded row evidence, Coverage Audit for source-level checks, and Source Sitemaps for fetched sitemap files or child sitemap index rows.
  7. Use URL Distribution, Plain URLs, and JSON when you need grouped counts, a clean one-URL-per-line list, or structured evidence for handoff. If the summary says no rows match, remove filters, inspect Review Ledger, include index sitemap locs, or repair the source XML before exporting.

A solid first pass ends with the expected included URL count, no unexpected invalid rows, host scope that matches the sitemap's purpose, and a clear reason for every exclusion.

Interpreting Results:

The headline count is the number of included rows after parsing, normalization, de-duplication, and filters. Treat it as an extracted working set, not as proof that every URL should stay in the final sitemap.

Sitemap extraction result cues and follow-up checks

| Visible Cue | Best First Reading | What to Check Next |
| --- | --- | --- |
| Multiple hosts | The extracted rows span more than one host or subdomain. | Confirm whether this is a verified cross-site workflow or a sitemap-scope mistake. |
| Low lastmod coverage | Many rows have no page update date in the source. | Use crawl or CMS evidence before assuming the sitemap can guide refresh scheduling. |
| Duplicate locs | The same normalized URL appeared more than once. | Check canonical URL spelling, tracking parameters, fragments, and duplicate sitemap entries. |
| Protocol limit blocked | The included row count or measured source size exceeds a sitemap protocol limit. | Split the inventory and use a sitemap index before submitting or publishing it. |
| Index sources review | The source is an index of sitemap files, not necessarily a list of page URLs. | Use Source Sitemaps to confirm child sitemap addresses, then include child locs only when collecting sitemap file addresses or fetch the child files for page rows. |

False confidence is common after a clean parse. A valid loc row can still point to a noindex page, redirected URL, soft 404, duplicate canonical, or blocked path. Use the audit rows to pick samples for live checks instead of treating the extracted list as a final SEO verdict.

Technical Details:

The sitemap protocol has two related XML roots. A urlset contains page entries, with one required loc value per url entry and optional lastmod, changefreq, and priority fields. A sitemapindex contains sitemap entries whose loc values point to other sitemap files.
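The difference between the two roots can be sketched in a few lines of Python. This is a minimal illustration using the standard library, not the tool's own parser; the function names `local_name` and `read_sitemap` and the returned dictionary shape are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_sitemap(xml_text: str) -> dict:
    """Return the root kind and the loc/lastmod pairs found in its entries.

    Hypothetical helper: a urlset is read through its <url> entries and a
    sitemapindex through its <sitemap> entries.
    """
    root = ET.fromstring(xml_text)
    kind = local_name(root.tag)
    entry_tag = {"urlset": "url", "sitemapindex": "sitemap"}.get(kind)
    rows = []
    if entry_tag is None:
        return {"kind": "unknown", "rows": rows}
    for entry in root:
        if local_name(entry.tag) != entry_tag:
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in entry}
        if fields.get("loc"):  # loc is the one required value per entry
            rows.append({"loc": fields["loc"],
                         "lastmod": fields.get("lastmod") or None})
    return {"kind": kind, "rows": rows}
```

Calling `read_sitemap` on a urlset returns page rows, while the same call on a sitemapindex returns child sitemap addresses that still need to be fetched before any page rows exist.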

Extraction starts by deciding whether the source is a URL sitemap, an index sitemap, generic XML with loc tags, or plain text containing URL-looking strings. After that, each candidate URL is normalized, checked for an http or https scheme, grouped by host and path details, filtered, and labeled for review.
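The normalization and grouping step might look roughly like the sketch below. It assumes that path depth means the number of non-empty path segments and that the top path section is the first segment; the function name `normalize` and the field names are hypothetical, and the tool's actual rules may differ.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize(raw: str, strip_fragments: bool = True):
    """Return derived row fields, or None when the URL is not http(s)."""
    original = raw.strip()
    parts = urlsplit(original)
    # Relative paths and other schemes are rejected outright.
    if parts.scheme.lower() not in ("http", "https") or not parts.hostname:
        return None
    fragment = "" if strip_fragments else parts.fragment
    url = urlunsplit((parts.scheme, parts.netloc, parts.path,
                      parts.query, fragment))
    segments = [s for s in parts.path.split("/") if s]
    extension = posixpath.splitext(segments[-1])[1] if segments else ""
    return {
        "url": url,
        "original": original,          # kept as evidence when the URL changes
        "host": parts.hostname,        # urlsplit lowercases the hostname
        "depth": len(segments),        # assumption: count of path segments
        "section": f"/{segments[0]}/" if segments else "/",
        "file_type": extension.lstrip(".") or None,
        "changed": url != original,
    }
```

Keeping both `url` and `original` in the row mirrors the idea that the original value stays visible as evidence whenever normalization changes it.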

Transformation Core

Sitemap URL extraction transformation rules

| Stage | Rule | Result Cue |
| --- | --- | --- |
| Source recognition | urlset XML yields page URL rows. sitemapindex XML yields child sitemap rows. Non-standard XML falls back to visible loc values, and plain text falls back to URL-looking strings. | Source type identifies the path used. |
| URL normalization | Only http and https URLs are accepted. Fragment identifiers can be stripped before duplicate checks and output. | Status becomes Normalized, Extracted, or an exclusion reason. |
| Row anatomy | Each valid row is split into host, path, path depth, top path section, file type, optional lastmod, and source evidence. | URL List, Review Ledger, and URL Distribution show the derived review fields. |
| Filtering | Include patterns are checked first. Exclude patterns then remove matches. Matching is case-insensitive unless Case-sensitive filters is enabled. | Filter exclusions counts rows removed by the active filters. |
| De-duplication | When enabled, the first normalized URL is kept and later copies are excluded from the included set. | Duplicate locs shows how many repeats were found, and Review Ledger shows which rows were removed from the included export. |
| Index handling | Index sitemap loc values are kept as source evidence by default. They can be included in URL List when the task is collecting child sitemap file URLs. | Index sources explains whether those rows are evidence or included rows, and Source Sitemaps lists them for export. |
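The filtering and de-duplication rules can be illustrated with a short sketch. Several details here are assumptions rather than documented behavior: patterns are treated as unanchored matches where `*` stands for any run of characters, duplicates are detected before filters run, and the status labels are hypothetical.

```python
import re


def compile_patterns(text: str, case_sensitive: bool = False):
    """Split comma- or line-separated patterns and turn '*' into a wildcard."""
    flags = 0 if case_sensitive else re.IGNORECASE
    items = [p.strip() for p in re.split(r"[,\n]", text) if p.strip()]
    return [re.compile(".*".join(re.escape(part) for part in p.split("*")), flags)
            for p in items]


def label_rows(urls, include="", exclude="", dedupe=True, case_sensitive=False):
    """Label each normalized URL with the first rule that decides its fate."""
    inc = compile_patterns(include, case_sensitive)
    exc = compile_patterns(exclude, case_sensitive)
    seen, rows = set(), []
    for url in urls:
        if dedupe and url in seen:
            status = "Duplicate"            # the first copy was already kept
        elif inc and not any(p.search(url) for p in inc):
            status = "Outside include filter"
        elif any(p.search(url) for p in exc):
            status = "Matched exclude filter"
        else:
            status = "Included"
        seen.add(url)
        rows.append((url, status))
    return rows
```

Because every row keeps a status, the same list can feed both an included-only export and a ledger that explains each exclusion.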

Count Core

Distribution percentages are simple shares of the included URL set after exclusions have been applied.

$$\text{share} = \frac{\text{group URLs}}{\text{included URLs}} \times 100$$

For example, if 120 included rows contain 36 URLs under /docs/, the section share is 36 / 120 × 100 = 30%. The same rule is used when grouping by host, path depth, or file type.
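The same share calculation can be written as a small helper. This is an illustrative sketch; the function name `shares` and the rounding to one decimal place are assumptions.

```python
from collections import Counter


def shares(group_labels):
    """Map each group label to its percentage of the included rows."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(n * 100 / total, 1) for label, n in counts.items()}


# 36 of 120 included rows sit under /docs/, matching the example above.
sections = ["/docs/"] * 36 + ["/blog/"] * 84
print(shares(sections)["/docs/"])  # 30.0
```

The input is one label per included row, so the same helper works for host, path depth, or file type groupings.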

Protocol and Safety Boundaries

Sitemap protocol limits and extraction boundaries

| Boundary | Rule | Why It Matters |
| --- | --- | --- |
| URL count | A single sitemap file is limited to 50,000 URLs. | Larger inventories need multiple sitemap files and usually a sitemap index. |
| Uncompressed size | A sitemap file is limited to 50 MB uncompressed. | Compressed transfer does not remove the uncompressed-size limit. |
| loc format | Protocol documents expect fully specified URLs that include the scheme. | Relative paths, malformed values, and unsupported schemes cannot be trusted as sitemap URLs. |
| Host scope | Ordinary sitemap files are normally scoped to the host that serves them, with cross-site submission requiring verified ownership workflows. | Multiple hosts in one extraction are a review cue, not an automatic publishing plan. |
| lastmod | The value should reflect the page's last meaningful change and use a supported date or date-time format. | Repeated or unreliable dates can reduce the usefulness of the signal. |
| changefreq and priority | They are optional hints, and major search engines may ignore them. | Do not treat them as crawl commands, ranking signals, or proof of page importance. |
| Public fetch | The fetch path accepts public http and https sitemap URLs and rejects private or localhost targets. | This reduces accidental internal-network probing and keeps private sitemaps out of remote review. |

A pasted source can be much larger than a normal textarea example, but browser file import has its own practical size limit. For very large or gzipped public sitemaps, fetch by URL when the target is safe to retrieve publicly, or decompress and split the source before review.
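The two hard protocol limits are simple to check before publishing. The sketch below assumes the caller already knows the included URL count and the uncompressed size in bytes; the function name and warning wording are hypothetical.

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed, i.e. 52,428,800 bytes


def protocol_limit_warnings(url_count: int, uncompressed_bytes: int) -> list:
    """Return one warning per sitemap protocol limit that is exceeded."""
    warnings = []
    if url_count > MAX_URLS:
        warnings.append(
            f"{url_count} URLs exceeds the {MAX_URLS} per-file limit; "
            "split the inventory and use a sitemap index."
        )
    if uncompressed_bytes > MAX_BYTES:
        warnings.append(
            f"{uncompressed_bytes} bytes exceeds the {MAX_BYTES}-byte limit; "
            "split the file before publishing."
        )
    return warnings
```

For pasted text, the size can be measured with `len(xml_text.encode("utf-8"))`, since the limit applies to the uncompressed document.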

Privacy Notes:

Pasted XML, dropped text, and selected local files are parsed in the browser session. Public fetching is different: the sitemap URL is sent to a server-side fetch service, which retrieves the public sitemap and returns extracted rows and warnings.

  • Do not fetch signed URLs, private staging sitemaps, intranet hosts, password-protected paths, or sitemap URLs that reveal confidential launch plans.
  • Use pasted XML when you need to review sensitive source text without sending a public URL for retrieval.
  • Check the Coverage Audit warnings before sharing output, especially when filters, duplicate removal, or index handling changed the included set.
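A rough pre-check for whether a sitemap URL is suitable for public fetching could look like the sketch below. It is not the fetch service's actual guard: it only inspects the URL text, and the function name `looks_public` is hypothetical. A real server-side check would also need to validate the addresses a hostname resolves to.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_public(url: str) -> bool:
    """Return False for non-http(s) URLs, localhost, and non-global IP literals."""
    parts = urlsplit(url.strip())
    host = parts.hostname
    if parts.scheme.lower() not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: ordinary hostnames pass this
        # text-only check and are verified again after DNS resolution.
        return True
    return address.is_global
```

A check like this catches the obvious cases, such as loopback and private-range addresses, before any request is sent.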

Worked Examples:

Migration URL pull

A migration audit starts by pasting a urlset with 4,820 page entries. URL List shows 4,806 included rows after De-duplicate URLs removes repeated canonical URLs. Coverage Audit reports one host and 4,100 rows with lastmod, so the next review can focus on the duplicate rows and the pages with no update date.
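The figures in this example can be turned into the follow-up numbers a reviewer needs. The sketch below assumes that every row dropped between parsing and inclusion was a duplicate, as the example describes; the function and field names are hypothetical.

```python
def audit_summary(parsed: int, included: int, with_lastmod: int) -> dict:
    """Derive follow-up counts from the headline numbers of an extraction."""
    return {
        "duplicates_removed": parsed - included,
        "missing_lastmod": included - with_lastmod,
        "lastmod_coverage_pct": round(with_lastmod * 100 / included, 1),
    }


print(audit_summary(parsed=4820, included=4806, with_lastmod=4100))
# {'duplicates_removed': 14, 'missing_lastmod': 706, 'lastmod_coverage_pct': 85.3}
```

Under that assumption, the review queue is 14 duplicate rows plus 706 pages with no update date.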

Index sitemap review

A paste from a large site contains a sitemapindex with 18 child sitemap files. With Index sitemap locs off, URL List has no page rows and Index sources explains that child sitemap URLs are source evidence only. Turning the option on includes those 18 sitemap-file URLs, which suits a task of collecting child files rather than page URLs.

Section-only export

An editorial team needs only documentation URLs. Include URL filter is set to /docs/*, Exclude URL filter is set to ?utm_, and Distribution chart is set to Top path section. The included count drops from 12,000 to 2,340, Filter exclusions records the removed rows, and Plain URLs becomes a clean handoff list for a focused crawl.

Broken source recovery

A copied sitemap fragment is missing closing XML tags. The warning reports an XML parse failure, but URL-looking text is recovered into rows. That fallback is useful for salvage work, but a publication audit should still fetch or paste the original sitemap so lastmod, index entries, and source evidence are not lost.
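A fallback of this kind can be sketched as a strict parse followed by a text scan. The regular expression, the function name `recover_urls`, and the source labels are assumptions for illustration; the tool's own recovery rules are not documented here.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a "URL-looking string" is http(s):// followed by characters
# that are not whitespace, angle brackets, or quotes.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def recover_urls(text: str):
    """Try a strict XML parse first, then fall back to scanning for URLs."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Broken or non-XML input: lastmod and entry structure are lost.
        return "text-fallback", URL_RE.findall(text)
    locs = [(el.text or "").strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc"]
    return "xml", [u for u in locs if u]
```

The fallback branch returns bare URLs only, which is why the original sitemap is still needed when lastmod values or index structure matter.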

FAQ:

Why did my sitemap index show no page URLs?

A sitemap index lists other sitemap files. Leave Index sitemap locs off when those child files are only evidence, turn it on when collecting child sitemap URLs, or fetch the child sitemap files to extract page rows.

Can I paste a plain list of URLs?

Yes. When no XML markup is present, the parser looks for http and https URLs in the text and labels the source as a plain URL list.

Why were some URLs excluded?

Rows can be excluded because they are invalid, duplicated while de-duplication is on, outside the include filter, or matched by the exclude filter. Check the Status column and Coverage Audit evidence.

Should fragments be kept in sitemap URLs?

Usually no. Fragment identifiers are not normal sitemap targets, and Strip URL fragments helps duplicate detection by comparing the page URL without the in-page anchor.

Does a clean extraction mean the sitemap is ready for Google?

No. It means the source could be parsed into reviewable rows. Check protocol limits, canonical choices, redirects, robots rules, noindex signals, and Search Console feedback before treating the sitemap as ready.

Glossary:

urlset
The sitemap XML root that contains page URL entries.
sitemapindex
The sitemap XML root that lists child sitemap files.
loc
The URL value inside a page entry or child sitemap entry.
lastmod
An optional date or date-time for the last meaningful update to a page or sitemap file.
Host scope
The host or subdomain covered by the extracted sitemap URLs.
De-duplication
The removal of later copies of a URL after normalization, keeping only the first matching row.
