Unicode Character Lookup
Look up a character, code point, grapheme, or escape with Unicode names, U+ values, UTF bytes, normalization checks, and hidden-text warnings.
Text bugs often start with a mark that looks ordinary. A pasted check mark, accented letter, emoji, or space-like character can carry more structure than the eye can see. Before a font draws it, the value has passed through Unicode code points, an encoding form, escape syntax, normalization rules, and sometimes a browser or terminal display choice.
Unicode separates the number assigned to a character from the way that character is stored or drawn. The number is the code point, usually written as U+ plus hexadecimal digits. UTF-8 encodes each scalar value as one to four bytes. UTF-16 stores values up to U+FFFF as one 16-bit unit and supplementary values as a surrogate pair. Grapheme clustering decides what readers experience as one character, so one visible emoji or accented letter can still be several code points.
- Code point: A numbered Unicode value, such as U+2713 for a check mark.
- Unicode scalar value: Any code point except the surrogate range U+D800 to U+DFFF.
- Grapheme cluster: The sequence a reader usually treats as one character, even when it contains joins, modifiers, or combining marks.
- Normalization form: A rule for comparing equivalent text sequences, such as composed and decomposed accent forms.
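The separation between code point, UTF-8 bytes, and UTF-16 units can be illustrated with a short Python sketch. The `describe` helper is a hypothetical name introduced here for illustration, not part of the tool, and it assumes well-formed text with no lone surrogates.

```python
def describe(text: str) -> list:
    """Return one record per code point: U+ notation, UTF-8 bytes, UTF-16 units."""
    records = []
    for ch in text:
        utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
        units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]
        records.append({
            "u_plus": f"U+{ord(ch):04X}",
            "utf8": ch.encode("utf-8").hex(" ").upper(),
            "utf16_units": units,
        })
    return records


# A check mark fits in one UTF-16 unit; an emoji above U+FFFF needs two.
for row in describe("\u2713\U0001F600"):
    print(row)
```

The check mark yields three UTF-8 bytes and one UTF-16 unit, while the emoji yields four bytes and a surrogate pair, even though each is a single code point.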
Most mistakes come from collapsing those layers too early. A percent-encoded URL value describes UTF-8 bytes, not a character name. A JavaScript escape and a CSS escape can look similar while following different parsing rules. A private-use value can be valid Unicode but meaningless outside the system that defined it. A noncharacter or lone surrogate may be useful diagnostic evidence, but it is not a normal interchange character.
A character lookup gives a precise handle for review, not a final policy decision. Programming languages, identifiers, domain names, terminal encodings, database collations, search systems, and security filters can all add stricter rules. The useful evidence is the exact sequence, the representation used to store or transmit it, and the destination rule that will accept or reject it.
How to Use This Tool:
Begin with the form you actually received. The same mark can be pasted as literal text, copied from a ticket as U+ notation, found in source code as an escape, or recovered from a URL as percent-encoded bytes.
- Enter one character, grapheme cluster, code point token, escape sequence, or percent-encoded UTF-8 byte string in Character, grapheme, or code point. Empty input returns a prompt instead of a partial lookup.
- Set Interpret input as. Use Auto detect for common notations, Literal grapheme when the visible pasted text is the value, Code point token or sequence for explicit numeric notation, and Escaped text for code or URL-style input.
- When escape syntax matters, open Advanced and choose Escape dialect. The available choices cover automatic detection, JavaScript or JSON escapes, CSS escapes, and percent-encoded UTF-8.
- Use Example preset only when you want a known case such as a visible symbol, combining sequence, zero-width character, CSS escape, or private-use value. Custom keeps the current query and settings.
- Choose Normalization preview when you need to compare the source value with NFC, NFD, NFKC, or NFKD. Leave it at None when you only want the original sequence.
- Turn on Flag internal-only values before text leaves a controlled fixture, and turn on Flag invisible controls when hidden whitespace, format characters, or combining marks may explain a copy-and-paste problem.
- Read Code Point Reference before copying anything. A ready result shows the lookup status, U+ sequence, visible preview, cluster shape, type and script profiles, UTF rows, normalization notes, and review flags.
Interpreting Results:
Lookup complete means the enabled checks did not find a high-severity condition. Review needed means at least one part of the sequence deserves care, such as a surrogate, noncharacter, private-use value, control, whitespace, format code point, invisible mark, or normalization change. Treat the label as triage and read the review notes for the reason.
U+ sequence is the most reliable way to describe the value to another person. The visible preview can be helpful, but it can hide zero-width code points, make several code points look like one mark, or render differently on another platform. If Cluster shape reports more than one code point or more than one grapheme, check Sequence Audit before calling it a single ordinary character.
Copy Targets are destination-specific. Literal text is convenient for ordinary visible characters. U+ notation is safer for bug reports, specs, review notes, and tickets. JavaScript, JSON, CSS, HTML numeric references, UTF bytes, and percent-encoded UTF-8 should be copied only when the receiving system expects that representation.
Normalization rows need careful reading. If NFC, NFD, NFKC, or NFKD changes the sequence, the displayed text may still look unchanged while exact equality, byte counts, search keys, signatures, and database comparisons change. Copy a normalized literal only when the destination rule calls for that form.
Technical Details:
Unicode code points cover U+0000 through U+10FFFF. The range U+D800 through U+DFFF is reserved for UTF-16 surrogate code units, so those values are not Unicode scalar values by themselves. Well-formed UTF-8, UTF-16, and UTF-32 encode scalar values, while UTF-16 uses pairs of surrogate units to represent supplementary scalar values above U+FFFF.
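The surrogate arithmetic behind that paragraph can be sketched as follows. The function names `is_scalar_value` and `to_surrogate_pair` are hypothetical helpers for illustration; the arithmetic itself is the standard UTF-16 encoding rule.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point in U+0000..U+10FFFF outside the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary scalar value into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only values above U+FFFF need a surrogate pair")
    offset = cp - 0x10000           # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F469)
print(f"{high:04X} {low:04X}")  # D83D DC69
```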
Visible text is a sequence, not just a single number. Text segmentation groups code points into grapheme clusters for cursor movement and reader perception. Normalization compares equivalent sequences, especially composed and decomposed forms. Security review adds another layer because invisible controls, bidirectional controls, lookalike characters, and private-use values can make the displayed text differ from what storage or source review sees.
Transformation Core
| Input path | Accepted shape | Transformation | Result to inspect |
|---|---|---|---|
| Literal grapheme | Pasted mark, combining sequence, or emoji cluster | The first grapheme cluster is selected; extra literal graphemes are reported as ignored. | Exact code points for the visible pasted unit. |
| Code point token | U+2713, 0x2713, decimal notation, JavaScript code point notation, or numeric HTML reference | Each token is converted to a numeric code point and then to the corresponding sequence. | Readable name, decimal value, category, script, and copy-safe forms. |
| JavaScript or JSON escape | \u2713, \u{1F469}, \x20, or simple escapes such as \n | The escape syntax is decoded before Unicode inspection. | The string that the escape would produce. |
| CSS escape | A backslash followed by one to six hexadecimal digits, with optional terminator whitespace | The CSS hexadecimal value is decoded, and optional terminator whitespace is not kept as text. | The character or sequence represented by the stylesheet escape. |
| Percent-encoded UTF-8 | One or more encoded byte tokens such as %E2%9C%93 | The bytes are decoded as UTF-8. | The text value carried by a URL or encoded payload fragment. |
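As a rough illustration of the code point token row, the sketch below converts a few common notations to a numeric code point. It is an assumption about how such parsing could work, not the tool's actual parser: `parse_token` is a hypothetical name, and the accepted shapes are limited to U+ notation, 0x notation, plain decimal, and numeric HTML references.

```python
import re

# One alternative per notation; named groups record which one matched.
_TOKEN = re.compile(
    r"""^(?:
        U\+(?P<uplus>[0-9A-Fa-f]{1,6})
      | 0[xX](?P<hex>[0-9A-Fa-f]{1,6})
      | &\#[xX](?P<htmlhex>[0-9A-Fa-f]{1,6});
      | &\#(?P<htmldec>[0-9]{1,7});
      | (?P<dec>[0-9]{1,7})
    )$""",
    re.VERBOSE,
)

_BASES = {"uplus": 16, "hex": 16, "htmlhex": 16, "htmldec": 10, "dec": 10}


def parse_token(token: str) -> int:
    """Convert one code point token to its numeric value."""
    match = _TOKEN.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognized code point token: {token!r}")
    for name, base in _BASES.items():
        digits = match.group(name)
        if digits is not None:
            value = int(digits, base)
            if value > 0x10FFFF:
                raise ValueError("value is above U+10FFFF")
            return value
    raise ValueError("no notation matched")  # not reachable when the regex matches


# A sequence is parsed token by token.
sequence = [parse_token(t) for t in "U+0065 U+0301".split()]
print([f"U+{cp:04X}" for cp in sequence])  # ['U+0065', 'U+0301']
```

Note that a bare decimal token such as 10003 is treated as a code point here, which mirrors the ambiguity between digits as text and digits as notation.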
Encoding Core
| Value range | UTF-8 shape | UTF-16 shape | Boundary note |
|---|---|---|---|
| U+0000 to U+007F | One byte | One code unit | ASCII-compatible range. |
| U+0080 to U+07FF | Two bytes | One code unit | Still one UTF-16 unit. |
| U+0800 to U+FFFF, excluding surrogates | Three bytes | One code unit | Surrogate code points are excluded from scalar encoding. |
| U+10000 to U+10FFFF | Four bytes | Two code units | UTF-16 uses a high-surrogate and low-surrogate pair. |
| U+D800 to U+DFFF | Not available as a standalone scalar value | Diagnostic code unit value | A lone surrogate should be documented, not emitted as UTF-8 or an HTML numeric reference. |
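The ranges in the table can be expressed as two small functions and checked against Python's own encoders at each boundary. The names `utf8_length` and `utf16_units` are hypothetical helpers for illustration.

```python
def utf8_length(cp: int) -> int:
    """Number of UTF-8 bytes for a Unicode scalar value."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    raise ValueError("above U+10FFFF")


def utf16_units(cp: int) -> int:
    """Number of 16-bit code units for a Unicode scalar value."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if cp > 0x10FFFF:
        raise ValueError("above U+10FFFF")
    return 1 if cp <= 0xFFFF else 2


# Compare the table logic with the built-in encoders at every range boundary.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    assert utf8_length(cp) == len(ch.encode("utf-8"))
    assert utf16_units(cp) == len(ch.encode("utf-16-be")) // 2
```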
Normalization uses four standard forms. NFC composes canonical equivalents where possible. NFD decomposes canonical equivalents. NFKC and NFKD add compatibility mappings, which can change characters chosen for legacy, layout, or compatibility reasons. Compatibility normalization is useful for comparison in some systems, but it can erase distinctions that mattered in the original text.
| Review condition | Technical reason | Safer handling |
|---|---|---|
| Surrogate | Not a standalone Unicode scalar value. | Keep the U+ value and recover the original UTF-16 context. |
| Private-use value | Meaning depends on a private agreement, font, or application. | Carry the originating system context with the value. |
| Noncharacter | Reserved for internal use and not intended for open interchange. | Use explicit notation in logs, tickets, and test cases. |
| Control, whitespace, or format code point | May be invisible, directional, or easy to lose during review. | Share escaped or U+ notation instead of raw copied text. |
| Combining mark | Rendering and matching depend on the neighboring base character and normalization state. | Compare source and normalization rows before changing storage. |
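A minimal sketch of how the review conditions in the table could be detected, using the general category from Python's `unicodedata` module. The `review_flags` function and its label strings are hypothetical; the tool's own checks and thresholds may differ.

```python
import unicodedata


def review_flags(cp: int) -> list:
    """Return review labels for one code point, based on its general category."""
    category = unicodedata.category(chr(cp))
    flags = []
    if category == "Cs":
        flags.append("surrogate")
    if category == "Co":
        flags.append("private-use")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        flags.append("noncharacter")
    if category == "Cc":
        flags.append("control")
    if category == "Cf":
        flags.append("format")
    if category in ("Zs", "Zl", "Zp"):
        flags.append("whitespace")
    if category in ("Mn", "Mc", "Me"):
        flags.append("combining mark")
    return flags


for cp in (0x2713, 0x200B, 0x0301, 0xE000, 0xFFFE, 0xD800):
    print(f"U+{cp:04X}", review_flags(cp))
```

An ordinary visible symbol such as U+2713 returns an empty list, while U+200B ZERO WIDTH SPACE is reported as a format code point.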
A compact example is U+0065 U+0301, Latin small letter e followed by a combining acute accent. NFC composes it to U+00E9, while NFD keeps the decomposed pair. The screen may show the same accented letter, but exact matching, byte output, and copy targets are different.
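The same example can be reproduced with Python's `unicodedata.normalize`. The `normalization_report` helper is a hypothetical name used only for this sketch.

```python
import unicodedata


def normalization_report(text: str) -> dict:
    """Show the code point sequence each form produces and whether it changed."""
    report = {}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(form, text)
        report[form] = {
            "sequence": " ".join(f"U+{ord(ch):04X}" for ch in result),
            "changed": result != text,
        }
    return report


decomposed = "\u0065\u0301"  # e followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

print(normalization_report(decomposed))
print(decomposed == composed)                # False: different sequences
print(len(decomposed.encode("utf-8")))       # 3 bytes
print(len(composed.encode("utf-8")))         # 2 bytes
print(unicodedata.normalize("NFKC", "\ufb01"))  # the fi ligature becomes "fi"
```

The last line shows a compatibility mapping: NFC leaves U+FB01 alone, while NFKC replaces it with two ASCII letters, which is the kind of distinction that compatibility forms can erase.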
Privacy and Accuracy Notes:
The character inspection runs in the browser. The page may load public Unicode name and property reference data so it can label characters more precisely; if that data cannot be loaded, it falls back to broad local labels for categories such as control, private-use, noncharacter, whitespace, and combining mark.
The entered character is not posted to a separate character lookup service by the page. Normal browser requests for page assets and reference data can still expose standard request metadata to the services that deliver those assets. Clipboard copies, CSV downloads, JSON downloads, and DOCX exports are created from the current browser result.
Worked Examples:
Documenting a check mark. Enter U+2713 with Code point token or sequence. The reference rows show the Unicode name, decimal value, U+ sequence, type profile, and UTF encodings. Copy U+ notation for a ticket so the evidence does not depend on the recipient's font.
Finding an accent mismatch. Enter U+0065 U+0301 and set Normalization preview to NFC. If the selected normalization changes the source, the visible accented letter can look unchanged while the stored sequence changes to a composed value.
Checking a CSS emoji escape. Use Escaped text with CSS escapes for \1F469 \200D \1F4BB. The decoded sequence includes U+200D ZERO WIDTH JOINER, so the audit shows that the visible cluster is built from separate code points.
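A simplified sketch of CSS escape decoding for this example follows. It assumes the rules in CSS Syntax Level 3 for hexadecimal escapes (one to six hex digits, one optional trailing whitespace character, and U+FFFD for zero, surrogates, or out-of-range values); `decode_css_escapes` is a hypothetical name, and edge cases such as an escape at end of input are not handled.

```python
import re

# Backslash + 1-6 hex digits + one optional whitespace, or backslash + any other
# single character that is not a newline or hex digit.
_CSS_ESCAPE = re.compile(
    r"\\(?:([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\n\r\f])?|([^\n\r\f0-9A-Fa-f]))"
)


def decode_css_escapes(text: str) -> str:
    """Replace CSS escapes in text with the characters they represent."""
    def replace(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return match.group(2)  # simple escape such as \" keeps the character
        cp = int(hex_digits, 16)
        if cp == 0 or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            return "\uFFFD"  # replacement character for invalid values
        return chr(cp)

    return _CSS_ESCAPE.sub(replace, text)


decoded = decode_css_escapes(r"\1F469 \200D \1F4BB")
print([f"U+{ord(ch):04X}" for ch in decoded])  # ['U+1F469', 'U+200D', 'U+1F4BB']
```

The terminator spaces are consumed by the escapes, so the decoded value contains three code points and no literal spaces.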
Debugging a hidden separator. Enter %E2%80%8B as percent-encoded UTF-8. It decodes to U+200B ZERO WIDTH SPACE. The preview cannot show a normal glyph, so the review flag and U+ sequence are the reliable evidence.
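The hidden separator case can be reproduced with the standard library. This sketch uses `urllib.parse.unquote_to_bytes` and a strict UTF-8 decode; `decode_percent_utf8` is a hypothetical wrapper name.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_utf8(token: str) -> str:
    """Turn percent-encoded bytes into text, rejecting malformed UTF-8."""
    raw = unquote_to_bytes(token)
    return raw.decode("utf-8", errors="strict")


hidden = decode_percent_utf8("%E2%80%8B")
print(len(hidden))               # 1
print(f"U+{ord(hidden):04X}")    # U+200B
print(repr(hidden))              # '\u200b'
```

Printing the raw value would show nothing useful, which is why the U+ notation and the escaped `repr` are the forms worth sharing. A truncated input such as %E2%80 raises `UnicodeDecodeError` under strict decoding rather than producing a partial character.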
FAQ:
Why did plain digits become a code point?
Auto detect treats a standalone decimal token as code point notation when it looks intentional. If the digits themselves are the text, switch Interpret input as to Literal grapheme.
Why can one visible character contain several code points?
Combining marks, variation selectors, zero-width joiners, and emoji modifiers can form one grapheme cluster from multiple code points. Cluster shape and Sequence Audit show the pieces.
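To see the pieces of such a cluster outside the tool, a short sketch can list each code point with its Unicode name. The `audit` function is a hypothetical name. Python's standard library does not segment grapheme clusters, so this only shows the code points, not the cluster boundaries.

```python
import unicodedata


def audit(text: str) -> list:
    """List (U+ notation, Unicode name) for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# One visible accented letter, two code points.
print(audit("e\u0301"))

# One visible emoji, three code points joined by ZERO WIDTH JOINER.
print(audit("\U0001F469\u200D\U0001F4BB"))
```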
Why are UTF-8 and HTML references unavailable for a surrogate?
A lone surrogate is not a Unicode scalar value. The lookup can show the diagnostic U+ value and UTF-16 unit, but it does not create UTF-8 bytes, percent-encoded UTF-8, or HTML numeric references for that malformed standalone value.
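Python's own UTF-8 encoder behaves the same way, which makes for a quick check. The `try_utf8` helper is a hypothetical name for this sketch.

```python
def try_utf8(cp: int):
    """Return the UTF-8 bytes for a code point, or None if it cannot be encoded."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None  # lone surrogates are rejected by the strict encoder


print(try_utf8(0x2713))  # b'\xe2\x9c\x93'
print(try_utf8(0xD800))  # None
```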
What does a normalization change mean?
It means the selected normalization form produced a different code point sequence. Copy the normalized literal only when the destination storage, comparison, or search rule expects that form.
Can this prove a character is allowed in an identifier?
No. The lookup identifies the sequence and flags common Unicode hazards, but programming languages, domain names, usernames, and databases can apply stricter identifier rules.
Glossary:
- Code point: A numbered Unicode value, usually written in U+ notation.
- Unicode scalar value: Any Unicode code point except the surrogate range.
- Grapheme cluster: A code point sequence that a user commonly experiences as one character.
- Combining mark: A code point that modifies a neighboring base character, such as an accent mark.
- Normalization: A standard process that converts equivalent or compatibility-related Unicode sequences into a selected form.
- Private-use value: A Unicode value reserved for application-specific meaning.
- Noncharacter: A reserved code point intended for internal use rather than normal open interchange.