Duplicate Line Remover
Clean a one-entry-per-line list in your browser, remove repeats with case and whitespace rules, and audit kept or dropped rows.
Introduction
Repeated rows usually appear after several ordinary jobs have been mixed together: a spreadsheet column is pasted into a form, a contact export is merged with an older list, a keyword set is copied from more than one source, or command output is saved and compared by hand. The visible problem is a duplicate line, but the real decision is whether two rows should count as the same item after spacing, capitalization, and blank rows are handled.
Line de-duplication is most dependable when each physical line represents one complete item: one email address, one URL, one keyword, one ID, one log pattern, or one short record. The cleanup then creates a comparison value for each row and groups rows that share that value. Strict matching protects codes and technical identifiers, while forgiving matching catches common copy-paste noise such as edge spaces and capitalization drift.
- Comparison key: The value used to decide whether two rows match after the chosen case and whitespace rules are applied.
- Occurrence: One appearance of a matching row in the original list, including its source line position.
- Retained row: The row that remains in the cleaned list when a duplicate group is reduced to one entry.
The safest rule depends on the list's purpose. Email lists, tag sets, keyword batches, and copied table columns often benefit from relaxed matching because inconsistent capitalization or stray spaces usually do not change the item. Product codes, license keys, fixed-width rows, code snippets, and case-sensitive identifiers need stricter rules because a small visible difference may be part of the value.
Retention rules matter as much as matching rules. Keeping the first occurrence preserves the earliest source position, which fits priority lists, manual allowlists, and audit trails. Keeping the last occurrence fits later-overrides-earlier merges. Keeping only rows that appear once is a different operation: every repeated value is removed completely, which is useful for finding singletons but surprising when a user expected one copy of each repeated item to remain.
Line-based cleanup does not parse records. A CSV row is still one line, a stack trace may span several lines, and two visually different rows may still refer to the same customer, product, or URL after normalization outside the text itself. De-duplicating by line is fast and predictable, but it is not a substitute for column-aware imports, fuzzy matching, or identity resolution.
The main tradeoff is between catching messy repeats and protecting meaningful differences. Check the duplicate groups before a cleaned list replaces source material used for email sends, redirects, firewall rules, access lists, billing imports, or production data changes.
How to Use This Tool:
Choose the matching rules before trusting the cleaned text. The counts and ledger are there to prove which rows survived and why.
- Paste one entry per row in Line list, drop text onto the box, or use Browse TXT for a TXT, CSV, or LOG file under 5 MB. The summary should change from the waiting message to kept-line, removed-line, and unique-key counts.
- Use Load sample if you want a quick test case. The sample includes capitalization differences, edge spaces, a repeated value, and a blank row, so it shows how the main settings interact.
- Set Case-sensitive match. Leave it off for ordinary email lists, tags, and keyword merges; turn it on when values such as ABC and abc must remain separate.
- Choose the whitespace rules. Trim line edges removes leading and trailing spaces before matching and from the retained output. Collapse inner spaces for matching treats repeated spaces and tabs inside a row as one space only for comparison.
Leave inner-space collapse off for code, fixed-width rows, serials, or IDs where spacing inside the line is meaningful.
- Set Ignore blank lines. Keep it on when empty rows are accidental. Turn it off only when a blank row should remain as a deliberate separator in Cleaned Lines.
- Open Advanced and choose Keep rule. Use first occurrence for source-order lists, last occurrence for newer-overrides-older merges, and unique-only when every repeated value should be removed.
Unique-only can return an empty cleaned list when every comparable row belongs to a duplicate group.
- Choose Output order. Preserve source order keeps retained rows in their original positions, while A-Z and Z-A sorting happen only after duplicate removal.
- Review Line Metrics and Duplicate Ledger. If a file fails, choose a text-like file under 5 MB or paste the rows directly. If the ledger shows the wrong kept or removed row, adjust the matching rules before copying Cleaned Lines.
Interpreting Results:
Kept output lines is the number of rows left in the cleaned text. Removed lines counts entries dropped by the selected keep rule. Unique match keys counts the comparable values after the case, whitespace, and blank-line rules are applied.
A low removed count does not prove that the source list was clean. Strict settings can keep messy variants apart even when they refer to the same item. A high removed count does not automatically mean the output is better either, because relaxed settings can merge values whose capitalization or spacing carried meaning.
- Use Duplicate Ledger when Removed lines is nonzero. It shows the key preview, occurrence count, kept source line, removed source lines, removed text, and kept text.
- Use Comparable lines to see whether blank rows were included or ignored.
- Use Output order to confirm whether the retained rows stayed in source order or were sorted after de-duplication.
- Use the copied or downloaded text only after checking the duplicate groups that matter most to the list's purpose.
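A ledger of this kind could be assembled roughly as follows. This is a minimal sketch, assuming trimmed edges, case-insensitive matching, ignored blanks, and a keep-first rule; the function name `build_ledger` and the dictionary fields are illustrative and not the tool's actual internals.

```python
from collections import defaultdict

def build_ledger(lines):
    """Group rows by a relaxed key and report kept and removed source lines.

    Assumed settings: trim edges, case-insensitive, ignore blanks, keep first.
    """
    groups = defaultdict(list)  # key -> [(1-based line number, trimmed text)]
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank rows are ignored in this sketch
        groups[text.lower()].append((number, text))

    ledger = []
    for key, occurrences in groups.items():
        if len(occurrences) < 2:
            continue  # only duplicate groups belong in the ledger
        kept, *removed = occurrences
        ledger.append({
            "key": key,
            "occurrences": len(occurrences),
            "kept_line": kept[0],
            "removed_lines": [n for n, _ in removed],
            "kept_text": kept[1],
            "removed_text": [t for _, t in removed],
        })
    return ledger

rows = ["alpha", "beta", "Alpha", "beta ", "", "gamma", "beta"]
ledger = build_ledger(rows)
```

Each ledger entry ties a removed row back to its source line number, which is what makes it possible to check a specific group before trusting the cleaned text.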
Technical Details:
Duplicate-line cleanup is a deterministic grouping problem. Each source row becomes a comparison key, rows with the same key form a duplicate group, and a keep rule chooses which source row survives. Matching is global across the full input, so repeated values do not need to be adjacent.
Matching and output are related but not identical. Edge trimming can change the retained output text because the row is trimmed before it is kept. Inner-space collapse affects comparison only, so a retained row can keep its original internal spacing even when repeated spaces were ignored for matching.
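The difference between retained text and comparison key can be sketched as below. The helper `normalize` and its parameter names are hypothetical, and the sketch assumes that edge trimming covers spaces and tabs.

```python
import re

def normalize(line, trim_edges=True, collapse_inner=True, case_sensitive=False):
    """Return (output_text, comparison_key) for one row."""
    # Edge trimming changes the text that is kept, not only the key.
    output = line.strip(" \t") if trim_edges else line
    key = output
    if collapse_inner:
        # Inner-space collapse is applied to the key only.
        key = re.sub(r"[ \t]+", " ", key)
    if not case_sensitive:
        key = key.lower()
    return output, key

output, key = normalize("  Red   Apple ")
strict_output, strict_key = normalize(
    "AB  12", collapse_inner=False, case_sensitive=True
)
```

Here `output` keeps its three inner spaces while `key` has one, so the row would match `red apple` yet still be written out with its original internal spacing.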
Transformation Core:
| Stage | Rule | Resulting Check |
|---|---|---|
| Split | Line breaks divide the source into rows. Carriage returns are normalized to line feeds first. | Original lines counts every split row. |
| Blank policy | When blank rows are ignored, rows whose comparable value is empty are excluded before grouping. | Comparable lines and ignored blank rows explain the difference. |
| Whitespace policy | Edge trimming removes leading and trailing spaces before matching and in the retained output. Inner-space collapse treats repeated tabs and spaces inside a row as one space for matching. | Duplicate Ledger shows what text survived. |
| Case policy | Case-sensitive matching keeps capitalization distinct. Case-insensitive matching compares lowercase forms for ordinary list cleanup. | Unique match keys changes when capitalization variants merge. |
| Retention | The keep rule (first occurrence, last occurrence, or only values that appear once) decides which row, if any, is retained for each group. | Kept output lines and Removed lines show the final effect. |
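The Split and Blank policy stages might look like the sketch below. The function names are illustrative, and treating whitespace-only rows as blank is an assumption made here rather than something the table states.

```python
def split_rows(source):
    """Normalize carriage returns to line feeds, then split into rows."""
    return source.replace("\r\n", "\n").replace("\r", "\n").split("\n")

def comparable_rows(rows, ignore_blank=True):
    """Drop blank rows before grouping when blanks are ignored."""
    if not ignore_blank:
        return list(rows)
    # Assumption: a row that is empty after trimming counts as blank.
    return [row for row in rows if row.strip() != ""]

rows = split_rows("alpha\r\nbeta\r\n\r\nalpha\rgamma")
comparable = comparable_rows(rows)
ignored_blank = len(rows) - len(comparable)
```

With this input, five original lines yield four comparable lines and one ignored blank row.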
Keep Rule Behavior:
| Keep Rule | What Survives | When It Fits |
|---|---|---|
| Keep first occurrence | The earliest row in each duplicate group remains, and later matching rows are removed. | Original-order lists, priority lists, hand-built allowlists, and source-of-truth exports. |
| Keep last occurrence | The latest row in each duplicate group remains, and earlier matching rows are removed. | Exports where later rows are more recent or later sources should override earlier sources. |
| Keep only lines that appear once | Every row in a duplicate group is removed, leaving only comparison keys with one occurrence. | Finding singletons, spotting one-off values, or removing all repeated items from a diagnostic list. |
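The three keep rules can be expressed over a list of comparison keys. The sketch below is one possible formulation; the rule labels `"first"`, `"last"`, and `"unique"` are hypothetical names chosen for the example.

```python
from collections import Counter

def apply_keep_rule(keys, rule="first"):
    """Return the indices of surviving rows, in source order.

    keys: one comparison key per comparable row.
    rule: "first", "last", or "unique" (illustrative labels).
    """
    if rule == "unique":
        counts = Counter(keys)
        return [i for i, key in enumerate(keys) if counts[key] == 1]
    if rule not in ("first", "last"):
        raise ValueError(f"unknown keep rule: {rule}")
    chosen = {}
    for i, key in enumerate(keys):
        # "last" overwrites on every match; "first" keeps the earliest index.
        if rule == "last" or key not in chosen:
            chosen[key] = i
    return sorted(chosen.values())

keys = ["a", "b", "a", "c", "b"]
first = apply_keep_rule(keys, "first")
last = apply_keep_rule(keys, "last")
unique = apply_keep_rule(keys, "unique")
```

Note that the unique-only rule returns an empty list for input such as `["x", "x"]`, which matches the empty-output case described for lists where every row belongs to a duplicate group.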
Formula Core:
The reduction percentage is computed from comparable rows, not from the raw source line count. Ignored blank rows do not inflate the denominator.
If a pasted list has 7 original rows, 1 ignored blank row, and 3 removed duplicate rows, the comparable count is 6 and the reduction is 3 / 6 x 100% = 50.0%. If blank lines are not ignored, the denominator can change because blank rows become comparable rows.
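The arithmetic above can be checked with a short function. This is a sketch only: the name `reduction_percent`, the one-decimal rounding, and the zero-denominator fallback are assumptions, not confirmed tool behavior.

```python
def reduction_percent(original, ignored_blank, removed):
    """Percentage of comparable rows that were removed."""
    comparable = original - ignored_blank
    if comparable == 0:
        return 0.0  # assumed fallback when nothing is comparable
    return round(removed / comparable * 100, 1)

example = reduction_percent(original=7, ignored_blank=1, removed=3)
```

If the blank row were counted as comparable instead, the same three removals would be measured against a denominator of 7 rather than 6.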
Sorting is deliberately late in the sequence. The retained rows are selected first, using their original source positions and the selected keep rule. A-Z or Z-A ordering then rearranges the retained output, so sorting does not change which occurrence was kept.
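The ordering of the two steps can be illustrated with a small sketch. It assumes case-insensitive matching and a case-insensitive sort key, and the option labels `"az"` and `"za"` are hypothetical.

```python
def dedupe_then_sort(lines, keep="first", order="source"):
    """Select survivors by source position, then optionally sort them."""
    chosen = {}  # lowercase key -> index of the surviving row
    for index, line in enumerate(lines):
        key = line.lower()
        if keep == "last" or key not in chosen:
            chosen[key] = index
    survivors = [lines[i] for i in sorted(chosen.values())]
    # Sorting happens only after the survivors are fixed.
    if order == "az":
        return sorted(survivors, key=str.lower)
    if order == "za":
        return sorted(survivors, key=str.lower, reverse=True)
    return survivors

rows = ["Beta", "alpha", "beta", "Gamma"]
source_order = dedupe_then_sort(rows)
a_to_z = dedupe_then_sort(rows, order="az")
```

In both results the surviving spelling is `Beta` from the first source line; sorting moves it but never swaps it for the later `beta`.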
CSV and LOG files are read as text for line cleanup. A CSV row is treated as one line, not as parsed cells, so a duplicate email column inside a wider CSV should be extracted before cleaning. Multi-line CSV fields, stack traces, and wrapped log events can also break the one-row-per-item assumption.
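One way to extract a single column before line cleanup is Python's standard `csv` module, which handles quoted fields that a plain line split would not. The helper name `extract_column` and the sample data are illustrative.

```python
import csv
import io

def extract_column(csv_text, column):
    """Return the values of one named column, one entry per record."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [row[column] for row in reader]

csv_text = (
    "name,email\n"
    "Ann,ann@example.com\n"
    "Bob,bob@example.com\n"
    '"Ann, Jr.",ann@example.com\n'
)
emails = extract_column(csv_text, "email")
```

The resulting list has one email per entry, so the repeated address can be caught by line-based cleanup even though the full CSV rows differ.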
Case-insensitive matching is useful for ordinary capitalization cleanup, but it is not a complete language-aware identity check. Accents, composed characters, compatibility characters, and locale-sensitive letters can still require manual review when lists mix languages, scripts, or normalized forms.
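The sketch below shows two cases where a plain lowercase comparison falls short. It is a pre-processing idea for lists that need it, not a description of what the tool does; `relaxed_key` is a hypothetical helper that combines NFC normalization with Python's `str.casefold`.

```python
import unicodedata

composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

def relaxed_key(text):
    """Normalize to NFC, then apply full case folding."""
    return unicodedata.normalize("NFC", text).casefold()

# Lowercasing alone leaves the two spellings of the same word distinct.
lower_differs = composed.lower() != decomposed.lower()
normalized_matches = relaxed_key(composed) == relaxed_key(decomposed)

# German sharp s: lowercase keeps it, case folding expands it to "ss".
sharp_s_lower_differs = "Straße".lower() != "STRASSE".lower()
sharp_s_fold_matches = relaxed_key("Straße") == relaxed_key("STRASSE")
```

Whether such values should merge depends on the list; for identifiers, merging them may be wrong, which is why manual review remains the safer default.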
Privacy Notes:
Pasted text and selected TXT, CSV, or LOG content are processed in the browser session for this cleanup workflow. File loading is limited to text-like files under 5 MB, which keeps the full text read manageable in the tab. The cleanup path does not require uploading the line list to a server, but copied and downloaded results should still be handled according to the sensitivity of the source data.
Worked Examples:
A keyword merge starts with alpha, beta, Alpha, beta , gamma, and another beta. With Case-sensitive match off, Trim line edges on, ignored blanks on, and first occurrence selected, Cleaned Lines keeps alpha, beta, and gamma. Duplicate Ledger shows one alpha group and one beta group.
An access-code list contains AB-12, ab-12, and AB-12. Turning Case-sensitive match on keeps AB-12 and ab-12 as separate keys, so only the exact repeated AB-12 rows are grouped. Leaving it off would merge the two capitalizations.
A vendor export is pasted twice, with the newer rows later in the file. Keep last occurrence keeps the later row from each duplicate group while Output order can still preserve source order or sort the retained rows after de-duplication.
A troubleshooting case starts with an empty cleaned output after selecting Keep only lines that appear once. That means every comparable row belonged to a duplicate group under the current settings. Switch back to Keep first occurrence, or make matching stricter if some repeated-looking values should remain distinct.
FAQ:
Does sorting change which duplicate is kept?
No. Duplicate grouping and the keep rule run first. A-Z or Z-A sorting is applied only to the retained rows.
Why did one blank line remain?
When Ignore blank lines is off, blank rows are comparable and share a single match key. With keep first occurrence, one blank row can remain while later blank rows are removed.
Can duplicates be removed when they are far apart?
Yes. Rows are grouped across the whole source list by comparison key, so matching entries do not need to be adjacent.
Does it find near-duplicates or misspellings?
No. Matching uses the selected case, blank-line, and whitespace rules. It does not perform fuzzy matching, spelling correction, or identity matching across different values.
Will it de-duplicate one CSV column?
Only if that column is the text being cleaned. A full CSV row is treated as one line, so extract the target column first when the repeated value is only one field inside a larger row.
Why did the file fail to load?
The browser-side file path is for text-like files under 5 MB. Use TXT, CSV, or LOG content, or paste the lines directly into Line list.
Glossary:
- Comparison key: The value used for grouping after case, whitespace, and blank-line rules are applied.
- Duplicate group: Two or more comparable rows that share the same comparison key.
- Comparable lines: The rows left after the blank-line policy has been applied.
- Keep rule: The rule that keeps the first row, the last row, or only rows that appear once.
- Source order: The order rows appeared in before any optional A-Z or Z-A sorting was applied.
- Line-based cleanup: A text cleanup method that treats each physical line as one item, rather than parsing columns or multi-line records.
References:
- The Unicode Standard, Chapter 3: Conformance, The Unicode Consortium.
- FileReader: readAsText() method, MDN Web Docs.