HTML Unescaper
Decode escaped HTML text or small files in your browser, inspect each entity change, and flag unresolved references or script-shaped output.
Introduction
Escaped HTML often appears when markup or symbols need to travel as ordinary text. A less-than sign can become &lt;, an ampersand can become &amp;, and a copyright mark can become &copy;. Those spellings prevent characters from being interpreted too early, but they also make copied CMS content, email fragments, log lines, and code samples harder to read.
Unescaping reverses that representation. Character references turn back into the Unicode characters they name or number, so the text becomes easier to inspect. The important safety point is that decoding changes visibility, not trust. A decoded <script> string is easier to spot than &lt;script&gt;, but it still needs to stay as text unless the destination has its own sanitizer and review process.
- Named reference
- A readable HTML name such as &copy;, &amp;, or &lt;.
- Decimal numeric reference
- A base-10 code point form such as &#60; for the less-than sign.
- Hex numeric reference
- A hexadecimal code point form such as &#x3C; or &#x3c;.
- Semicolonless reference
- A legacy-compatible spelling such as &copy that some HTML parsing paths accept, but that can be ambiguous in source review.
Escaping depth is easy to misread. One decode pass turns &lt; into <. A double-escaped fragment such as &amp;lt; first becomes &lt;, then needs a second pass before the less-than sign appears. Nested escaping is common after text crosses more than one storage, email, translation, ticketing, or logging system.
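The pass behavior can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: it assumes Python's standard `html.unescape` as the decoder, and `decode_passes` is a hypothetical helper name.

```python
import html
from typing import Tuple


def decode_passes(text: str, max_passes: int = 1) -> Tuple[str, int]:
    """Apply up to max_passes decode passes and stop early when a pass changes nothing.

    Returns the final text and the number of passes that changed it.
    """
    changed_passes = 0
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to resolve, so further passes are pointless
        text = decoded
        changed_passes += 1
    return text, changed_passes


print(decode_passes("&amp;lt;", 1))  # ('&lt;', 1)
print(decode_passes("&amp;lt;", 6))  # ('<', 2)
```

With a limit of 6, the double-escaped input still stops after two changing passes, because the third pass finds nothing to decode.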
Strictness matters because HTML keeps some older reference spellings for compatibility. A browser-style decode may accept missing semicolons for names such as &copy, while a stricter source review may need that token to stay visible. Unicode normalization is a separate choice. It can make visually equivalent text easier to compare, but compatibility normalization can also fold characters that were distinct in the original source.
The safest habit is to decode for inspection, not for rendering. Treat unresolved references, replacement characters, tag-shaped text, and script-shaped text as review clues. Unescaping helps reveal what a string contains; it does not sanitize HTML, remove dangerous code, or decide whether the decoded output belongs in a page.
How to Use This Tool:
Start with the escaped text, choose how strict the decode should be, then use the ledger and audit rows to decide whether the output is ready to copy.
- Paste content into Escaped HTML, use Browse file, or drop one TXT, HTML, or HTM file onto the textarea. Files larger than 1 MB show a size warning instead of loading.
- Set Decode passes from 1 to 6. Use 1 for ordinary escaped text and 2 for common double-escaped fragments such as &amp;lt;. Decoding stops early when a pass makes no change.
- Choose Reference mode. HTML5 tolerant follows browser-style resolution for accepted legacy forms, while Strict semicolons decodes only references that end in ;.
- Open Advanced when comparison or cleanup requires it. Leave Unicode normalization as No change for the closest decoded text, choose NFC or NFKC for matching work, and enable Trim outer whitespace only when edge whitespace should be removed.
- Read the summary badges before copying. Decoded text ready means no unresolved or strict-skipped references remain, while Decoded with unresolved references, Markup text, or Script signal means the output needs review.
- Use Decoded Text for the readable result, Entity Ledger for reference-by-reference evidence, Safety Audit for warning signals, and JSON when a structured record is useful.
Interpreting Results:
The decoded text is the main result, but the counts explain whether it is complete. No entities found means the input did not contain reference-shaped tokens that matched the scanner. If the decoded count is lower than the total reference count, check Entity Ledger for Skipped or Unresolved rows before assuming the text is finished.
Text only is a first-pass safety signal, not a security guarantee. Markup text means the decoded output contains tag-shaped strings. Script signal is stronger because it looks for script tags, iframes, embedded objects, event-handler attributes, and javascript: text. Keep those results as text until a separate sanitizer or code review says they are safe for the intended destination.
Repeat checks need the same settings. Changing Decode passes, Reference mode, Unicode normalization, or whitespace trimming can change the final text, ledger statuses, audit rows, and copied output, especially when nested escaping or semicolonless names are present.
Technical Details:
HTML character references use an ampersand-led token to stand in for Unicode text. Named references depend on the HTML named-character table, while decimal and hexadecimal numeric references point directly at a Unicode code point. Several spellings can map to the same character: &lt;, &#60;, and &#x3C; all resolve to the less-than sign.
The semicolon is the clearest delimiter for a reference. HTML keeps legacy compatibility for some names without the semicolon, which is why &copy may still become the copyright symbol in tolerant parsing. That lenience is useful for copied web text but risky for audits, because adjacent letters and digits can change where a parser decides the name ends.
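The difference between the two readings can be illustrated with a short sketch. It assumes Python's standard `html.unescape` as a stand-in for tolerant, browser-style resolution; `STRICT_REF` and `strict_unescape` are hypothetical names, and the strict rule shown here (a full-name table lookup that requires the semicolon) is one possible interpretation rather than the tool's actual code.

```python
import html
import re
from html.entities import html5

# Hypothetical strict pattern: a named, decimal, or hex reference that ends in ";".
STRICT_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _resolve_strict(match: "re.Match[str]") -> str:
    token = match.group(0)
    body = token[1:]  # for example "copy;" or "#60;"
    if body.startswith("#"):
        return html.unescape(token)  # numeric reference with its semicolon
    # Named reference: the whole name must be in the table, no prefix matching.
    return html5.get(body, token)


def strict_unescape(text: str) -> str:
    """Decode only semicolon-terminated references and leave everything else visible."""
    return STRICT_REF.sub(_resolve_strict, text)


sample = "Copyright &copy 2026 &amp; Co."
print(html.unescape(sample))    # Copyright © 2026 & Co.
print(strict_unescape(sample))  # Copyright &copy 2026 & Co.

# Adjacent letters change where a tolerant parser decides the name ends:
print(html.unescape("&notit;"))    # ¬it;
print(strict_unescape("&notit;"))  # &notit;
```

The `&notit;` case shows the audit risk: tolerant resolution reads the legacy name `not` and leaves `it;` behind, while the strict reading keeps the whole token visible.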
Transformation Core
Unescaping is a bounded transformation over reference-shaped tokens. Each pass scans the current text, replaces tokens that resolve, records evidence for every detected reference, and gives the changed text to the next pass only when another pass may reveal nested escaping.
| Step | Rule | What to check |
|---|---|---|
| Reference scan | Detect named references, decimal numeric references, and hexadecimal numeric references, with or without a trailing semicolon. | Entity Ledger shows the reference text, type, offset, pass number, and sequence. |
| Tolerant resolution | Resolve semicolon-terminated references and browser-accepted legacy forms. | Changed rows receive Decoded status and list the resulting Unicode code point or points. |
| Strict resolution | Resolve only references that end with ;. | Missing-semicolon tokens remain unchanged and receive Skipped status. |
| Unresolved handling | Leave the token unchanged when the reference name or numeric value does not resolve. | Unresolved rows point to spelling errors, unsupported names, or malformed numeric references. |
| Post-decode cleanup | Apply selected Unicode normalization and optional outer-whitespace trimming after decoding finishes. | Safety Audit records normalization and trim findings when those options are active. |
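The steps in the table can be sketched as a single scanning pass that records one ledger row per detected reference. This is an illustrative sketch under stated assumptions: the `REF` pattern, the `scan_pass` helper, and the row field names are hypothetical, and Python's `html.unescape` stands in for tolerant resolution.

```python
import html
import re
from html.entities import html5
from typing import Dict, List

# Hypothetical scanner: reference-shaped tokens with an optional trailing semicolon.
REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*)(;?)")


def scan_pass(text: str, strict: bool = False, pass_no: int = 1) -> List[Dict]:
    """Return one ledger-style row per reference-shaped token found in text."""
    rows = []
    for seq, m in enumerate(REF.finditer(text), start=1):
        token, body, semi = m.group(0), m.group(1), m.group(2)
        if body[0] == "#":
            kind = "hex" if body[1] in "xX" else "decimal"
        else:
            kind = "named"
        if strict and not semi:
            decoded, status = token, "Skipped"  # missing semicolon stays visible
        else:
            if strict and kind == "named":
                decoded = html5.get(body + ";", token)  # full-name lookup only
            else:
                decoded = html.unescape(token)
            status = "Decoded" if decoded != token else "Unresolved"
        rows.append({
            "sequence": seq,
            "offset": m.start(),
            "reference": token,
            "type": kind,
            "pass": pass_no,
            "status": status,
            "decoded": decoded,
        })
    return rows


for row in scan_pass("&lt; &#60; &#x3C; &copy &bogus;", strict=True):
    print(row["offset"], row["reference"], row["type"], row["status"])
```

Running the same input with `strict=False` changes only the `&copy` row, from Skipped to Decoded; `&bogus;` stays Unresolved in both modes.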
Numeric references can expose invalid or unusual Unicode results. A replacement character in the output usually means decoded text contains a value that could not be represented as an ordinary character. The audit marks replacement characters for review because they often indicate damaged encoding, copied binary data, or a reference that was legal-looking but not meaningful for the destination.
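A few numeric references show how replacement characters arise. The sketch below assumes Python's `html.unescape`, which follows the HTML5 rules for invalid code points; `count_replacement_characters` is a hypothetical helper for the kind of check the audit describes.

```python
import html

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacement_characters(text: str) -> int:
    """Count U+FFFD occurrences so decoded output can be flagged for review."""
    return text.count(REPLACEMENT)


# NUL, a lone surrogate, and a value past U+10FFFF all decode to U+FFFD.
for ref in ("&#0;", "&#xD800;", "&#x110000;"):
    decoded = html.unescape(ref)
    print(ref, "->", f"U+{ord(decoded):04X}")  # each line ends with U+FFFD

# Some legal-looking values are remapped rather than replaced:
# 0x80 resolves to the euro sign under the HTML5 compatibility table.
print(f"U+{ord(html.unescape('&#x80;')):04X}")  # U+20AC

print(count_replacement_characters(html.unescape("ok &#0; &#xD800; done")))  # 2
```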
Reference and Output Signals
| Signal | Meaning | Practical response |
|---|---|---|
| Decoded | The reference changed to one or more Unicode characters. | Compare the decoded value and code points when exact characters matter. |
| Skipped | Strict mode preserved a semicolonless reference-shaped token. | Add the semicolon, switch to tolerant mode, or keep the token unchanged for source review. |
| Unresolved | The parser did not turn the token into a different character. | Check spelling, numeric base, and whether the token is meant to be literal text. |
| Markup signal | The decoded text contains tag-shaped segments. | Treat the output as code or markup evidence, not automatically safe content. |
| Script signal | The decoded text contains script-bearing patterns such as event handlers or javascript:. | Do not render it as trusted HTML without a separate sanitizer and security review. |
NFC normalization composes canonically equivalent Unicode where possible, which is useful when accents or combining marks should compare the same. NFKC also applies compatibility folding, so full-width forms, ligatures, and other compatibility characters may become more ordinary-looking text. That can help search and matching, but it can erase distinctions that mattered in the original source.
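The difference between the two forms is easy to see with Python's standard `unicodedata.normalize`. The `post_decode_cleanup` helper below is a hypothetical sketch of a post-decode step that combines normalization with optional outer-whitespace trimming; it is not taken from the tool.

```python
import unicodedata
from typing import Optional


def post_decode_cleanup(text: str, form: Optional[str] = None, trim: bool = False) -> str:
    """Optionally normalize (for example "NFC" or "NFKC") and trim outer whitespace."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    if trim:
        text = text.strip()
    return text


# NFC composes "e" + combining acute accent into the single character U+00E9.
print(post_decode_cleanup("e\u0301", "NFC") == "\u00e9")  # True

# NFC leaves the "fi" ligature (U+FB01) alone; NFKC folds it to two letters.
print(post_decode_cleanup("\ufb01le", "NFC") == "\ufb01le")  # True
print(post_decode_cleanup("\ufb01le", "NFKC"))               # file

# NFKC also folds full-width letters to ordinary ASCII.
print(post_decode_cleanup("\uff21\uff22\uff23", "NFKC"))  # ABC

# With no form selected, only trimming applies.
print(repr(post_decode_cleanup("  Tom & Jerry \n", trim=True)))  # 'Tom & Jerry'
```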
The audit is deliberately conservative. It searches for common markup-shaped and script-bearing patterns, counts unresolved references, and records replacement characters, but it is not a complete HTML parser or cross-site scripting scanner. Security still depends on the context where the decoded text will be used: plain text, HTML body, attribute value, script, style, URL, and rich-text editor contexts all need different handling.
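A pattern check of this kind might look like the sketch below. The two regular expressions and the `audit_signal` helper are hypothetical approximations written for illustration; the tool's actual patterns are not documented here, and, like any pattern check, this one can miss obfuscated input or flag harmless text.

```python
import re

# Hypothetical patterns, deliberately simple and conservative.
TAG_SHAPED = re.compile(r"</?[A-Za-z][^<>]*>")
SCRIPT_SHAPED = re.compile(
    r"<\s*(?:script|iframe|object|embed)\b"  # script-bearing elements
    r"|\bon[a-z]+\s*="                       # event-handler attributes such as onerror=
    r"|javascript:",                         # javascript: text
    re.IGNORECASE,
)


def audit_signal(decoded: str) -> str:
    """Classify decoded text as Script signal, Markup text, or Text only."""
    if SCRIPT_SHAPED.search(decoded):
        return "Script signal"
    if TAG_SHAPED.search(decoded):
        return "Markup text"
    return "Text only"


print(audit_signal("<script>alert(1)</script>"))                     # Script signal
print(audit_signal("<img src=x onerror=alert(1)>"))                  # Script signal
print(audit_signal('<article class="notice">Tom & Jerry</article>')) # Markup text
print(audit_signal("Tom & Jerry"))                                   # Text only
```

The script check runs first so that script-bearing markup is never reported as ordinary markup text. A result of Text only means none of these patterns matched, not that the string is safe in every context.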
Limitations, Privacy, and Safety Notes:
- Pasted text, dropped text, small file reading, decoding, ledger rows, audit rows, and exports are handled in the current browser session after the page loads.
- TXT, HTML, and HTM file loading is capped at 1 MB so a very large source does not make the browser-side review sluggish.
- Script and markup signals are pattern checks. They help surface risky-looking output, but they are not a full sanitizer, parser, or vulnerability assessment.
- Unicode normalization can change exact characters. Leave it off when the decoded text must be preserved character for character after the references are resolved.
Advanced Tips:
- Use Strict semicolons for source audits where missing delimiters should remain visible. Use HTML5 tolerant when the goal is to see what browser-compatible text will usually become.
- Increase Decode passes only until nested escaping is exposed. The pass limit is 6, and decoding stops early when a pass no longer changes the text.
- Choose NFC for ordinary canonical matching of accents and combining marks. Choose NFKC only when compatibility folding is acceptable for the review job.
- Keep Trim outer whitespace off when copied snippets need exact leading or trailing spaces. Turn it on for cleanup only after the decoded text has been inspected.
- Use Entity Ledger before copying output from a double-escaped source. It shows which pass changed each reference and which tokens remained skipped or unresolved.
Worked Examples:
Double-escaped CMS text
A copied fragment such as &amp;lt;article class=&amp;quot;notice&amp;quot;&amp;gt;Tom &amp;amp; Jerry&amp;lt;/article&amp;gt; needs Decode passes set to 2. Decoded Text becomes <article class="notice">Tom & Jerry</article>. Entity Ledger shows the first pass exposing nested references and the second pass resolving them, while Safety Audit reports a markup signal because the result contains tag-shaped text.
Strict review of a missing semicolon
With Reference mode set to Strict semicolons, Copyright &copy 2026 &amp; Co. leaves &copy unchanged and decodes &amp; to &. Entity Ledger marks the missing-semicolon token as Skipped, and the summary includes an unresolved count. Add the semicolon or switch to HTML5 tolerant only when browser-compatible lenience is the desired reading.
Script-shaped output after one pass
The source &lt;script&gt;alert(1)&lt;/script&gt; decodes in one pass to visible script markup. Decoded Text displays it as text, Safety Audit reports Script signal, and the warning text explains that the output should not be rendered without review. That result is useful for a ticket or code review because it makes hidden markup visible.
File too large to load
Dragging a 2 MB HTML export onto Escaped HTML produces the file-size warning instead of decoding. Split the export, paste a smaller excerpt, or save a text sample that is 1 MB or smaller. Once the source loads, the Input size row in Safety Audit confirms the character count being reviewed.
FAQ:
Why did a reference stay unchanged?
Check Entity Ledger. Skipped means Strict semicolons preserved a missing-semicolon token, while Unresolved means the parser did not recognize the reference name or numeric value.
Should I use one decode pass or more?
Use one pass for ordinary escaped HTML and two passes when the source contains nested escaping such as &amp;lt;. The control accepts 1 to 6 passes and stops after a pass no longer changes the text.
Does decoded output become safe HTML?
No. Decoding reveals characters; it does not sanitize HTML, remove scripts, or prove that markup can be rendered safely. Treat Markup signal and Script signal as prompts for a separate sanitizer or security review.
What does Unicode normalization change?
NFC composes canonically equivalent characters where possible, while NFKC also folds many compatibility characters. Leave normalization off when the exact decoded text matters more than matching or search behavior.
Which files can I load?
Use Browse file or drag and drop for TXT, HTML, or HTM files up to 1 MB. Larger files show a local size warning and are not decoded in the browser tab.
Is my pasted text sent to a server for decoding?
After the page loads, pasted text, small file reading, decoding, ledger rows, audit rows, and exports are produced in the browser session. Handle copied or downloaded output according to the sensitivity of the source text.
Glossary:
- Character reference
- An ampersand-led HTML sequence that represents another Unicode character.
- Named reference
- A character reference that uses a name from the HTML named-character table, such as &copy;.
- Numeric reference
- A decimal or hexadecimal character reference that points to a Unicode code point.
- Decode pass
- One scan through the current text, replacing references that resolve during that scan.
- Semicolonless reference
- A reference-shaped token without the usual trailing semicolon, accepted by some tolerant HTML parsing paths.
- Unicode normalization
- A post-decode rewrite that converts Unicode text into NFC or NFKC form for comparison or matching.
- Script signal
- An audit finding that decoded text contains script-bearing patterns such as event handlers, script tags, or javascript: text.