
DOCX conversion is usually a content-extraction job, not a perfect imitation of a Word screen. A Word document stores text, tables, styles, images, links, review notes, and document properties inside a compressed Open XML package. When that material moves into HTML, Markdown, TXT, RTF, XML, CSV, EPUB, or JSON, the useful question is which parts of the document need to survive the move.

Figure: DOCX package moving through part inspection into HTML, text, and evidence outputs.

The safest conversion choice depends on the destination. HTML is useful when a document is becoming web or content-management copy. Markdown is better for repository notes and review comments. TXT strips the result down to readable text. CSV is narrow by design: it extracts table rows or paragraph-like blocks rather than producing a faithful spreadsheet. XML is for auditing selected package parts, while JSON keeps the selected source details, settings, outputs, messages, and package counts together for handoff.

DOCX files can carry more than visible paragraphs. Images, hyperlinks, comments, footnotes, endnotes, style definitions, and document properties may affect the converted result or the review risk. A clean conversion therefore needs both an output artifact and evidence about what was found in the source package.

A browser-side converter is best for quick review, content migration, and lightweight extraction. It should not be treated as a Word renderer. Page layout, page breaks, complex numbering, exact typography, tracked-change semantics, and unsupported embedded objects may need manual checking in Word, LibreOffice, or the final publishing system.

Technical Details:

A DOCX file is an Open XML word-processing package. The package contains separate parts for the main document body, relationships, styles, properties, media, comments, footnotes, and other features. The visible text usually lives in WordprocessingML paragraphs, runs, tables, and text nodes. Relationships connect the body to images, hyperlinks, and referenced assets.
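The package layout described above can be explored with a short sketch. It builds a tiny DOCX-like archive in memory and counts paragraphs and tables in the body part. The helper names (`build_sample_docx`, `inspect_package`) are hypothetical, the sample archive is deliberately incomplete, and the sketch assumes the main body sits at the conventional `word/document.xml` path; a real package names its main part through relationships.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p, w:tbl, and w:t elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx() -> bytes:
    """Build a tiny DOCX-like package in memory (illustrative, not a complete document)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Release notes</w:t></w:r></w:p>"
        "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", body)
        package.writestr("word/styles.xml", "<styles/>")
    return buffer.getvalue()


def inspect_package(data: bytes) -> dict:
    """Count package entries plus paragraphs and tables in the main body part."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
        # Assumption: the body lives at the conventional path.
        has_body = "word/document.xml" in names
        counts = {"entries": len(names), "has_body": has_body, "paragraphs": 0, "tables": 0}
        if has_body:
            root = ET.fromstring(package.read("word/document.xml"))
            # iter() walks nested elements too, so paragraphs inside table cells are counted.
            counts["paragraphs"] = len(list(root.iter(f"{{{W_NS}}}p")))
            counts["tables"] = len(list(root.iter(f"{{{W_NS}}}tbl")))
    return counts


summary = inspect_package(build_sample_docx())
print(summary)
```

Note that the paragraph inside the table cell is counted as a paragraph, which is one reason package counts and visible paragraph counts can differ.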

Conversion starts by reading the selected file as binary data, then opening the ZIP-style package and looking for the main document body. The primary HTML path maps semantic Word styles into headings, paragraphs, quotes, lists, tables, emphasis, links, and images where supported. The resulting HTML is cleaned before display so unsafe elements, event handlers, inline style attributes, unsafe link targets, and unsafe image sources do not pass through unchanged.
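The cleanup step can be pictured with a minimal allowlist sanitizer. This is a sketch of the general technique, not the converter's actual cleaner: the tag list, the safe URL prefixes, and the `sanitize` helper are all assumptions chosen for illustration, and a production sanitizer needs a maintained library rather than a hand-written parser.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; the real tool's list is not documented here.
ALLOWED_TAGS = {"h1", "h2", "h3", "p", "a", "ul", "ol", "li", "blockquote",
                "table", "tr", "td", "th", "strong", "em"}
DROPPED_WITH_CONTENT = {"script", "style"}
SAFE_PREFIXES = ("http:", "https:", "mailto:", "#")


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            # Drop event handlers (onclick, onload, ...) and inline styles.
            if name.startswith("on") or name == "style":
                continue
            # Drop link or image targets that do not start with a safe prefix.
            if name in ("href", "src") and not value.strip().lower().startswith(SAFE_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="x()" style="color:red">Hi '
         '<a href="javascript:alert(1)">link</a><script>bad()</script></p>')
print(sanitize(dirty))
```

In this sketch an unsafe `href` is removed while the link text stays, so readers still see the words even though the target is gone. Relative URLs are also dropped here, which is stricter than many real sanitizers.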

Several outputs are derived from the same cleaned content. Markdown is rendered from the sanitized HTML. Plain text is extracted from the cleaned blocks, with table handling controlled by the chosen separator and layout mode. RTF is generated from extracted text rather than full Word layout. CSV is built from tables or paragraph-like blocks. EPUB is assembled as a simple EPUB 3 package from extracted text. XML output returns one selected package part for inspection.

Transformation Core

DOCX conversion pipeline

| Stage | What Happens | Review Point |
| --- | --- | --- |
| File gate | Accepts one `.docx` file and rejects other extensions or documents above 50 MB. | Legacy `.doc` files must be saved as `.docx` first. |
| Package inspection | Counts entries, paragraphs, heading cues, tables, media, comments, footnotes, endnotes, style definitions, hyperlinks, and external link targets. | Missing main body content marks the package as needing review. |
| Semantic conversion | Maps Word styles and document blocks into HTML, optionally applying a selected style map and custom mapping lines. | Custom mappings should match trusted source styles. |
| Cleanup | Removes unsafe markup and applies the selected image policy before derived formats are produced. | Conversion Messages records cleanup and image notes. |
| Derived artifacts | Builds HTML, Markdown, TXT, RTF, XML, CSV, EPUB manifest text, JSON evidence, and package chart data from the converted source. | Check the output tab and Package Ledger before publishing or sharing. |
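The file gate is the simplest stage to sketch. The function below is a hypothetical illustration of the rule stated in the pipeline (one `.docx` file, nothing above 50 MB). It assumes "50 MB" means 50 × 1024 × 1024 bytes, which the description does not specify, and the message strings are paraphrases rather than the tool's exact wording.

```python
from pathlib import PurePath

# Assumption: "50 MB" is treated as 50 MiB; the source does not say which unit is used.
MAX_BYTES = 50 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, message) for a candidate upload, based on name and size only."""
    extension = PurePath(name).suffix.lower()
    if extension == ".doc":
        return False, "Legacy .doc files must be saved as .docx first."
    if extension != ".docx":
        return False, "Only .docx files are accepted."
    if size_bytes > MAX_BYTES:
        return False, "Browser-side conversions must stay under 50 MB."
    return True, "accepted"


print(check_file("release-notes.docx", 1_200_000))
print(check_file("archive.docx", 62 * 1024 * 1024))
```

Checking the extension before the size means a legacy `.doc` gets the more useful "save as .docx" message even when it is also too large.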

Format Behavior

DOCX output format behavior

| Output | Best Fit | Important Limit |
| --- | --- | --- |
| HTML | CMS paste, web content review, and semantic markup handoff. | It is sanitized content, not a pixel match for Word layout. |
| Markdown | Repository notes, issue comments, and text-first publishing. | Tables, images, and links depend on the selected Markdown style and image policy. |
| TXT | Plain text extraction with paragraph, compact, or continuous line breaks. | Formatting and visual table structure are intentionally reduced. |
| RTF | Readable text handoff to Word-compatible editors. | The RTF is generated from extracted text, not original Word layout. |
| XML | Auditing document body, styles, relationships, core properties, or app properties. | Only the selected package part is shown. |
| CSV | Table extraction or paragraph-block inventories. | It does not infer spreadsheet formulas or merged visual layout. |
| EPUB | Simple ebook-style proof from extracted text. | The generated EPUB package uses extracted text and a basic navigation spine. |
| JSON | Audit handoff with source details, settings, outputs, messages, and package counts. | It can include generated content, so review before sharing. |

The Strict package check setting adds CRC validation during package inspection. That can help when a document seems corrupt, but it may be slower on large files. The normal path is faster and still checks whether the main document body exists.
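To see what CRC validation catches, the sketch below damages one byte range in a stored ZIP entry and then checks every entry's CRC-32. It uses Python's standard `zipfile.ZipFile.testzip()`, which reads each entry and returns the name of the first one that fails. The `first_corrupt_entry` helper is hypothetical and only illustrates the idea; it is not how the browser tool performs its check.

```python
import io
import zipfile


def first_corrupt_entry(data: bytes):
    """Return the name of the first entry whose CRC-32 fails, or None if all pass."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        # testzip() decompresses every entry, which is why strict checks are slower.
        return package.testzip()


# Build a small uncompressed package so a payload edit is easy to make.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as package:
    package.writestr("word/document.xml", "<document>hello</document>")
good = buffer.getvalue()

# Change the stored payload without updating the recorded checksum.
damaged = good.replace(b"hello", b"jello", 1)

print(first_corrupt_entry(good))
print(first_corrupt_entry(damaged))
```

The damaged archive still opens and still lists its entries, which is why a fast read that only looks for the main body can miss this kind of corruption.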

Everyday Use & Decision Guide:

Start with HTML when the document is heading into a CMS or web editor. Use the fragment option for paste-ready markup, the complete document option when someone needs a standalone HTML file, and text-only HTML when Word styling is getting in the way of clean paragraphs.

Choose Markdown for repository workflows, release notes, and review comments. GitHub-flavored Markdown keeps pipe tables when possible, while plain Markdown favors simpler readable text. For documents with tables, compare Markdown Output with CSV Output before deciding which handoff is easier to check.

Use TXT when the goal is text extraction. Paragraph spacing keeps blank lines between blocks, compact lines puts each extracted block on its own line, and continuous text folds blocks into one stream. Leave table extraction on when table cells carry important content; switch it off when tables are decorative or duplicate nearby paragraphs.
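The three TXT layouts differ only in how extracted blocks are joined. A minimal sketch, assuming the blocks have already been extracted as strings and using hypothetical mode names (`paragraph`, `compact`, `continuous`) that mirror the descriptions above:

```python
def layout_text(blocks: list[str], mode: str = "paragraph") -> str:
    """Join extracted text blocks according to a layout mode."""
    # Collapse internal whitespace and skip blocks that are empty after trimming.
    cleaned = [" ".join(block.split()) for block in blocks if block.strip()]
    if mode == "paragraph":
        return "\n\n".join(cleaned)  # blank line between blocks
    if mode == "compact":
        return "\n".join(cleaned)    # one block per line
    if mode == "continuous":
        return " ".join(cleaned)     # single stream of text
    raise ValueError(f"unknown layout mode: {mode}")


blocks = ["Release 2.1", "  Fixed   export bug. ", "", "Added CSV output."]
for mode in ("paragraph", "compact", "continuous"):
    print(repr(layout_text(blocks, mode)))
```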

  • Use HTML or Markdown for publishing copy that still needs headings, lists, links, and tables.
  • Use XML when you need to inspect the source package rather than create reader-facing copy.
  • Use CSV for tables or paragraph inventories, not for a full document replica.
  • Use JSON when another reviewer needs the outputs, settings, package evidence, and messages in one record.

Stop and review when Conversion Messages reports cleanup notes, image events, or conversion warnings. A ready status means the converter produced output; it does not prove that the converted content is publication-ready. Compare the generated result with the original document when headings, tables, images, comments, or links matter.

The selected document is read through the browser file picker or drop target. The conversion path does not post the file to a backend conversion service, but the resulting artifacts can still contain source text, links, comments, or document properties. Treat downloaded outputs and copied JSON as document content.

Step-by-Step Guide:

Use the visible status, output tabs, and evidence tables as checkpoints rather than treating conversion as a single button press.

  1. Drop a DOCX file into Source DOCX or use Browse DOCX. The source status should show the file name, size, extension, and modified date after the file is accepted.
  2. If the file is rejected, check the message below the drop zone. Only `.docx` files are accepted, and browser-side conversions must stay under 50 MB.
  3. Choose Convert to based on the destination format. The active result tab changes to the matching output, such as HTML Output, Markdown Output, Extracted Text, XML Output, CSV Output, EPUB Package, or JSON.
  4. Adjust the format-specific controls. For HTML and Markdown, set Style map and Images. For TXT, choose Text layout, Table separator, and Tables. For XML, choose the XML part. For CSV, choose Tables or Paragraph blocks. For EPUB, set a title when the source file name is not suitable.
  5. Open Advanced only when needed. Add a custom style map for known Word style names, set a filename prefix for exports, include comment references when review notes matter, preserve empty paragraphs for editor round trips, trust embedded style maps only for trusted sources, and turn on Strict package check when corruption is suspected.
  6. Review the summary and Package Ledger. Paragraph, table, media, comment, style, hyperlink, and external-link counts tell you whether the output needs closer inspection.
  7. Check Conversion Messages before copying or downloading. Cleanup notes, image handling notes, and conversion warnings should be resolved or documented before the result is reused.
  8. Use DOCX Structure Chart when package counts need a quick visual review. The chart reflects package evidence, not content quality.

After conversion, verify the chosen output in its destination system. A clean HTML preview in the tool can still need final checks in a CMS, Markdown renderer, spreadsheet, ebook reader, or document editor.

Interpreting Results:

The main output tab should match the selected Convert to format, but the evidence tabs often explain whether the result is trustworthy. Package Ledger shows what the source contained. Conversion Messages explains cleanup, warnings, image treatment, and whether the browser conversion engine was available.

Use character counts in the summary as a quick sanity check, not a quality score. A short HTML or TXT result from a long source document can mean the file has unsupported structure, no readable main body, or table-heavy content that needs a different output mode.

  • Local output ready means conversion completed and the selected output exists.
  • Needs review means the file, package, or conversion step reported an error that should be fixed before reuse.
  • Image placeholders or Images removed means the output intentionally avoids embedded image data.
  • External targets in Package Ledger means the source document contains links that deserve manual checking.

A successful conversion does not mean the original document was safe, complete, or visually preserved. Check output text, headings, links, tables, comments, and images against the source before publishing or sending the artifact onward.

Worked Examples:

Release Notes To CMS HTML

A 1.2 MB release-notes document has headings, two tables, and no embedded media. Choose HTML, keep HTML fragment for CMS paste, set Style map to Article headings and quotes, and leave Images on Inline embedded images. HTML Output should contain headings, paragraphs, links, and table markup, while Package Ledger should show the paragraph and table counts. If Conversion Messages is clear, paste the HTML into the CMS and compare the rendered headings and tables with the source document.

Table Inventory To CSV

A project handoff document has three approval tables and several narrative sections. Choose CSV and set CSV source to Tables. CSV Output should begin with Table and Row columns followed by extracted cell columns. If the output says no tables were found, switch CSV source to Paragraph blocks to create a content inventory, or return to the original document and confirm that the apparent tables are real Word tables rather than positioned text.
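The shape of that CSV can be sketched as follows. The example assumes tables have already been extracted as lists of rows of cell text. The `Table` and `Row` columns come from the description above; the `Cell 1`, `Cell 2` header names and the `tables_to_csv` helper are hypothetical.

```python
import csv
import io


def tables_to_csv(tables: list[list[list[str]]]) -> str:
    """Flatten extracted tables into one CSV with Table and Row index columns."""
    # Pad every row to the widest row so all records have the same column count.
    width = max((len(row) for table in tables for row in table), default=0)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Table", "Row"] + [f"Cell {i}" for i in range(1, width + 1)])
    for table_index, table in enumerate(tables, start=1):
        for row_index, row in enumerate(table, start=1):
            padding = [""] * (width - len(row))
            writer.writerow([table_index, row_index] + row + padding)
    return out.getvalue()


approval_tables = [
    [["Item", "Owner"], ["Spec", "Ana"]],
    [["Approved"]],
]
print(tables_to_csv(approval_tables))
```

Merged cells and formulas have no equivalent in this flat shape, which matches the limit noted for CSV output.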

Large Or Invalid Source File

A 62 MB file named archive.docx is dropped into Source DOCX. The source status reports that browser-side conversions must stay under 50 MB, so no output tabs should be trusted for that attempt. Compress images or split the document, then retry with a smaller `.docx`. If the source is a legacy `.doc`, open it in Word or LibreOffice and save a new `.docx` copy before trying again.

Package Audit For Links And Comments

A policy draft converts to Markdown, but Package Ledger shows external targets and review artifacts. Open Conversion Messages and Package Ledger before copying Markdown Output. If comments are relevant, turn on Include comment references in Advanced and convert again. If external links matter, inspect them in the converted output and compare them with the source document before sharing JSON or Markdown with another reviewer.
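External targets come from the package relationships part. In Open XML packages, a relationship that points outside the package carries `TargetMode="External"`. The sketch below lists those targets from a relationships document; the sample XML and the `external_targets` helper are illustrative assumptions, and a real package stores this data in a part such as `word/_rels/document.xml.rels`.

```python
import xml.etree.ElementTree as ET

# Namespace of the Open Packaging Conventions relationships part.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_RELS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def external_targets(rels_xml: str) -> list[str]:
    """Return targets of relationships marked as external to the package."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target", "")
        for rel in root.iter(f"{{{RELS_NS}}}Relationship")
        if rel.get("TargetMode") == "External"
    ]


# Illustrative sample: one internal styles part and one external hyperlink.
sample_rels = f"""
<Relationships xmlns="{RELS_NS}">
  <Relationship Id="rId1" Type="{OFFICE_RELS}/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="{OFFICE_RELS}/hyperlink"
                Target="https://example.com/policy" TargetMode="External"/>
</Relationships>
"""

print(external_targets(sample_rels))
```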

FAQ:

Does the converter upload my DOCX?

The selected file is read through the browser file picker or drop target, and the conversion path does not call a backend conversion service for the document. Review downloaded artifacts and JSON because they can contain source text and document details.

Why are old Word documents rejected?

Only `.docx` files are accepted. A legacy `.doc` file uses an older binary format, so open it in Word or LibreOffice and save it as `.docx` before converting.
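A renamed legacy file can be told apart from a real DOCX by its first bytes. ZIP archives with at least one entry begin with `PK\x03\x04`, while legacy binary Word files use the OLE compound file signature `D0 CF 11 E0 A1 B1 1A E1`. The sketch below is a supplementary check and is not described as part of the converter, which is documented as gating on the file extension; `sniff_container` is a hypothetical helper.

```python
# Signature of a ZIP local file header (present when the archive has entries).
ZIP_MAGIC = b"PK\x03\x04"
# Signature of an OLE compound file, the container used by legacy .doc files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(head: bytes) -> str:
    """Classify a file by its leading bytes: 'zip', 'ole', or 'unknown'."""
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


print(sniff_container(b"PK\x03\x04\x14\x00"))
print(sniff_container(OLE_MAGIC + b"\x00" * 8))
print(sniff_container(b"{\\rtf1"))
```

A `zip` result only shows that the container is a ZIP archive; it does not prove the archive is a valid DOCX package.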

Why does the converted HTML not match Word exactly?

The HTML path converts document structure and semantic styles, then sanitizes the result. Exact Word layout, pagination, typography, and some embedded objects may not survive that move.

What should I do when Conversion Messages shows warnings?

Read the warning detail and action column before copying the output. Warnings often point to cleanup, image handling, style mapping, or source formatting that needs a manual comparison with the original document.

When should I turn on Strict package check?

Use it when a DOCX appears damaged or package evidence looks suspicious. It validates CRC values during inspection, which can be slower on large documents.

Glossary:

DOCX
A modern Word document format based on an Open XML package.
Open XML
The document format family that defines the XML vocabularies and package structure used by `.docx` files.
WordprocessingML
The XML markup used for word-processing document content such as paragraphs, runs, tables, and text.
Package Ledger
The evidence table that reports source file details, package readability, body content, media, links, review artifacts, and styles.
Style map
A set of rules that maps named Word styles to output elements such as headings, paragraphs, quotes, or comment references.
CRC validation
A package integrity check used by Strict package check when corruption is suspected.

References: