OCR PDF Tool
Extract PDF text with auto native-text fallback, OCR language, page ranges, DPI, page ledger, confidence, readiness checks, and TXT/JSON outputs.
Introduction:
Optical character recognition (OCR) turns the printed text on a scanned PDF page into selectable text. That matters when the document was scanned as a picture, exported from a copier, or received as a locked image where normal search and copy commands find nothing useful.
PDF text extraction and OCR are different jobs. A text-native PDF may already contain characters that can be read directly. A scanned page has to be rendered as an image, inspected by an OCR engine, and converted back into words with some loss of certainty. The same document can contain both kinds of pages, especially when a cover sheet is scanned and later pages were generated from office software.
OCR output is best treated as a review draft, not as a legal or archival copy of the document. Smudged scans, skewed pages, tiny print, handwriting, stamps, and multi-column layouts can change letters, drop punctuation, or move words out of order. Page-level confidence and word counts help you decide where to proofread first, but they do not prove that every character is correct.
Searchable PDF creation is a separate step from text extraction. A searchable PDF needs an invisible text layer aligned back over the page image. This tool focuses on extracting text, showing per-page evidence, and making the result easy to audit before any later production workflow writes text back into a PDF.
Technical Details:
A PDF page can expose text in two ways. Text-native pages carry character data that can be read from the PDF text content. Image-only pages carry pixels, so the page has to be rendered to a canvas before OCR can estimate the characters. The automatic route checks for enough embedded text first, then falls back to OCR when the page behaves like a scan.
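The automatic route can be pictured as a threshold check on normalized embedded text. The sketch below is illustrative only: the names `normalize` and `choose_route`, the whitespace normalization, and the threshold of 20 characters are assumptions, since the tool's actual minimum is not stated here.

```python
import re

# Hypothetical threshold; the tool's real minimum is not documented on this page.
MIN_EMBEDDED_CHARS = 20


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def choose_route(embedded_text: str, min_chars: int = MIN_EMBEDDED_CHARS) -> str:
    """Pick the embedded-text route when enough characters exist, else OCR."""
    if len(normalize(embedded_text)) >= min_chars:
        return "Embedded text"
    return "OCR raster"


print(choose_route("Policy overview and effective dates for 2024"))
print(choose_route("  \n "))
```

A page with only stray whitespace in its text layer falls through to raster OCR, which matches the "behaves like a scan" case described above.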
OCR accuracy depends on three practical inputs: the page image quality, the recognition language, and the page segmentation assumption. Higher rendering resolution can help small print, but it also creates more pixels for the browser to hold in memory. Page segmentation tells the OCR engine whether to detect the layout automatically or to treat the page as a single column, a single block, or sparse text scattered across the page.
Recognition Route Rules
The route choice changes both speed and confidence labeling. Native text is faster and reports Native text instead of a percentage because the characters came from the PDF text content. Raster OCR returns a confidence percentage from the recognition pass.
| Route | When it runs | What it reports | Main caution |
|---|---|---|---|
| Auto text or OCR | Checks embedded text first and uses it when the normalized page text reaches the local minimum. | Embedded text route, Native text confidence label, word count, character count, and runtime. | Native text can still be poorly ordered when the PDF text layer was built badly. |
| OCR rendered pages | Rasterizes every selected page and sends the page image to OCR. | OCR raster route, percentage confidence, word count, character count, and runtime. | Higher DPI can improve small print but raises memory use and processing time. |
| Blank page result | The selected page produced no normalized text. | Blank status with zero words and zero characters. | A blank result can mean a real blank page, an unsupported layout, low contrast, or a failed recognition setup. |
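One way to picture what each route reports is as a small record built from the page text. In this sketch the `LedgerRow` fields and the `build_row` helper are hypothetical names, counting words by whitespace splitting is an assumption, and the `n/a` placeholder is invented for illustration. Only the labels `Embedded text`, `OCR raster`, `Native text`, `Ready`, and `Blank` come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LedgerRow:
    page: int
    status: str
    route: str
    confidence: str
    words: int
    characters: int


def build_row(page: int, route: str, text: str,
              ocr_confidence: Optional[float] = None) -> LedgerRow:
    """Build one ledger row from a page's extracted text."""
    normalized = " ".join(text.split())  # collapse whitespace
    status = "Ready" if normalized else "Blank"
    if route == "Embedded text":
        confidence = "Native text"  # no percentage for embedded text
    elif ocr_confidence is not None:
        confidence = f"{ocr_confidence:.0f}%"
    else:
        confidence = "n/a"  # placeholder label, an assumption
    words = len(normalized.split()) if normalized else 0
    return LedgerRow(page, status, route, confidence, words, len(normalized))


print(build_row(1, "Embedded text", "Hello  world\n"))
print(build_row(2, "OCR raster", "Invoice total due", 91.4))
```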
Rendering and Memory Bounds
PDF coordinates use points, with 72 points per inch. Rendering at 200 DPI uses a scale of 200 divided by 72. A letter-size page at that scale creates far more pixels than the same page at screen size, which is why the tool caps resolution and blocks pages whose canvas would exceed the pixel guard.
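The arithmetic can be checked with a short calculation. This sketch assumes a US letter page of 612 by 792 points and clamps DPI to the 120 to 300 range this tool accepts for OCR resolution. The `MAX_CANVAS_PIXELS` value and both function names are hypothetical, since the tool's actual pixel guard is not stated.

```python
POINTS_PER_INCH = 72
MIN_DPI, MAX_DPI = 120, 300        # accepted OCR resolution range
MAX_CANVAS_PIXELS = 16_000_000     # hypothetical guard value, an assumption


def render_size(width_pt: float, height_pt: float, dpi: int):
    """Return (width_px, height_px, total_pixels) for a page rendered at dpi."""
    dpi = max(MIN_DPI, min(MAX_DPI, dpi))  # clamp to the accepted range
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px, width_px * height_px


def exceeds_guard(pixels: int, limit: int = MAX_CANVAS_PIXELS) -> bool:
    """True when the rendered canvas would be larger than the guard allows."""
    return pixels > limit


print(render_size(612, 792, 200))  # (1700, 2200, 3740000)
print(render_size(612, 792, 300))  # (2550, 3300, 8415000)
```

At 200 DPI the letter page becomes 1700 by 2200 pixels, about 3.7 million pixels, compared with roughly 0.48 million at the native 72 points per inch.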
Controls and Bounds
| Control | Accepted values | Result impact | Review cue |
|---|---|---|---|
| Pages | all, one page, or ascending comma-separated ranges such as 1,3,8-10. | Defines which pages appear in OCR Text, Page OCR Ledger, and JSON. | Duplicate pages are ignored with a warning. |
| OCR resolution | 120 to 300 DPI, with 200 DPI as the ordinary scan starting point. | Controls raster size before OCR. | Lower it if a page exceeds the canvas pixel guard. |
| OCR language | English, Spanish, French, German, or English + Spanish. | Chooses the language model for raster OCR. | Use the main printed language; mixed-language recognition is slower. |
| Page segmentation | Auto page, Single column, Single block, or Sparse text. | Changes how the OCR engine groups marks into lines and words. | Single block can help clean forms; Sparse text can help loose snippets. |
| Page guard | 1 to 50 selected pages. | Blocks accidental long OCR jobs before recognition starts. | Narrow the range before raising the guard. |
| Preserve spacing | On or Off. | Keeps wider inter-word spacing when the engine can infer it. | Leave it off for prose; try it for simple columns or fixed-width scans. |
The current extraction stops at text, ledgers, and JSON. It does not write an invisible text layer back into the PDF, so the readiness ledger keeps Searchable PDF output marked Blocked and Enablement marked Disabled.
Everyday Use & Decision Guide:
Start with Auto text or OCR unless you know every selected page is scanned. That route saves time on text-native pages and still falls back to raster OCR when a page has too little embedded text. Switch to OCR rendered pages when you want every page treated like a scan, such as after a copier pass or when a bad PDF text layer produces scrambled copy.
Keep the first run small. Load one PDF, leave Pages at a short range such as 1 or 1-3, keep OCR resolution near 200 DPI, and check the Page OCR Ledger before running a large batch. The ledger shows route, confidence, words, characters, runtime, and detail for each page, so it is the quickest way to spot a page that needs a different setup.
- Use 144 DPI for a quick check when the print is large and clean.
- Use 200 DPI for ordinary office scans.
- Use 300 DPI for small print only after the range is narrow enough for the browser page guard.
- Try Single column or Single block when auto segmentation breaks a form or simple report into awkward lines.
- Turn on Preserve spacing only when wider gaps help columns or fixed-width text stay readable.
The main wrong assumption is that a copied OCR result is finished text. Confidence can be high while names, invoice numbers, dates, or punctuation still need proofreading. Use OCR Text for the draft, then use Page OCR Ledger and Readiness Ledger to decide which pages deserve manual review.
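A reviewer could turn the ledger into a proofreading order with a small sort. The sketch below is one possible heuristic, not the tool's behavior: the row shape (dicts with `page`, `status`, and a numeric `confidence` that is `None` for native text) and the function name `review_order` are assumptions. It lists blank pages first, then raster OCR pages from lowest to highest confidence, then native-text pages.

```python
def review_order(rows):
    """Return page numbers in a suggested proofreading order."""
    def key(row):
        if row["status"] == "Blank":
            return (0, 0.0)            # check blanks against the original first
        if row["confidence"] is None:
            return (2, 0.0)            # native text: no percentage to rank by
        return (1, row["confidence"])  # OCR pages, lowest confidence first
    return [row["page"] for row in sorted(rows, key=key)]


rows = [
    {"page": 1, "status": "Ready", "confidence": None},
    {"page": 2, "status": "Ready", "confidence": 91.0},
    {"page": 3, "status": "Blank", "confidence": 0.0},
    {"page": 4, "status": "Ready", "confidence": 62.5},
]
print(review_order(rows))  # [3, 4, 2, 1]
```

The order is only a starting point, because a high-confidence page can still contain a wrong digit in an invoice number or date.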
Treat the JSON view as the audit handoff. It carries the selected parameters, PDF summary, selected pages, extracted text, page results, warnings, and readiness rows in one structured payload.
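The rough shape of such a payload might look like the following. Every key name and sample value here is illustrative and does not reflect the tool's real schema. The sketch only shows how the seven listed parts could sit together in one serializable object.

```python
import json

# All keys and values are hypothetical placeholders for illustration.
payload = {
    "parameters": {"route": "Auto text or OCR", "dpi": 200, "pages": "1,8-10"},
    "pdf": {"page_count": 10},
    "selected_pages": [1, 8, 9, 10],
    "text": "(extracted text goes here)",
    "pages": [
        {"page": 1, "status": "Ready", "route": "Embedded text",
         "confidence": "Native text"},
    ],
    "warnings": [],
    "readiness": [
        {"check": "Searchable PDF output", "status": "Blocked"},
    ],
}

payload_json = json.dumps(payload, indent=2)
restored = json.loads(payload_json)  # round-trips without loss
print(sorted(restored))
```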
Step-by-Step Guide:
Use this flow to get a reviewable text draft without running more pages than the browser can handle.
- Drop or browse one item in Scanned PDF. The summary should change from Choose PDF to the page count, and Readiness Ledger should show PDF source as loaded.
- Choose Recognition route. Use Auto text or OCR for mixed PDFs, or OCR rendered pages when the file is known to be scanned image pages.
- Set OCR language, Pages, and OCR resolution. If the range is invalid, the error explains whether the page number is outside the PDF, not ascending, or in the wrong format.
- Open Advanced only when needed. Adjust Page segmentation, Page guard, Preserve spacing, or PDF password before running again.
- Click Run OCR. The progress bar and action label show inspection or recognition progress, and Cancel stops a run before all selected pages finish.
- Review OCR Text, then check Page OCR Ledger for route, confidence, words, characters, and runtime. Use Readiness Ledger for blocked range, page guard, searchable PDF, privacy, and enablement checks.
- Copy or download the text only after the page rows look sensible. Use JSON when another reviewer needs the parameters and warnings together with the extracted text.
If a password-protected PDF fails, enter the password in Advanced, reselect or inspect the file again, and rerun the same page range.
Interpreting Results:
The most important output is the combination of text and page evidence. OCR Text gives the draft copy. Page OCR Ledger tells you how each page was read, how many words and characters were produced, and whether the page returned Ready or Blank.
- Embedded text with Native text usually means the page already carried readable characters, not that OCR proved the image correct.
- OCR raster with low confidence deserves proofreading, especially for identifiers, totals, names, and dates.
- Blank should be checked against the original page before you treat it as intentionally empty.
- Searchable PDF output marked Blocked means the result is extracted text only, not a modified PDF.
A clean run does not mean the document is production-ready. Compare any important value against the original scan before pasting it into a contract, spreadsheet, filing system, or customer record.
Worked Examples:
Three-page office scan
A user drops a 3-page scanned invoice, keeps Recognition route on Auto text or OCR, selects Pages as 1-3, leaves English at 200 DPI, and runs OCR. The Page OCR Ledger reports OCR raster for each page with word and character counts. The final OCR Text is useful for search and review, but invoice numbers and totals still need checking against the image.
Mixed PDF with a text-native cover page
A 10-page policy PDF has a generated cover page and scanned attachments. With Pages set to 1,8-10, Auto reads page 1 as Embedded text with Native text confidence and OCRs the selected attachment pages as raster images. That split is expected: the output combines both routes into one OCR Text view while the ledger preserves the route per page.
Range blocked by the page guard
A user enters all on a 24-page PDF while Page guard is 8. The run stops before OCR starts and Readiness Ledger marks Browser page guard as blocked. The practical fix is to narrow Pages to a small range such as 1-8, or raise the guard only after lowering DPI and accepting the longer browser runtime.
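The guard in this example amounts to a count comparison made before any recognition work begins. A minimal sketch, assuming a hypothetical `check_page_guard` helper and result shape, with the 1 to 50 bound taken from the Page guard control:

```python
def check_page_guard(selected_pages, guard):
    """Compare the number of selected pages with the guard before OCR starts."""
    if not 1 <= guard <= 50:
        raise ValueError("Page guard must be between 1 and 50")
    count = len(selected_pages)
    blocked = count > guard
    detail = f"{count} selected, guard allows {guard}"
    return {"check": "Browser page guard", "blocked": blocked, "detail": detail}


all_pages = list(range(1, 25))               # "all" on a 24-page PDF
print(check_page_guard(all_pages, 8))        # blocked
print(check_page_guard(all_pages[:8], 8))    # pages 1-8 pass
```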
FAQ:
Does this create a searchable PDF?
No. The current result is extracted text plus review ledgers and JSON. The readiness check for Searchable PDF output stays blocked because no invisible text layer is written back into the PDF.
Why does one page show Native text instead of a confidence percentage?
That page was read from embedded PDF text in Auto mode. Confidence percentages appear for raster OCR pages because those pages were rendered and recognized from pixels.
What page range formats work?
Use all, one page number, or ascending comma-separated ranges such as 1,3,8-10. A backward range, out-of-bounds page, or unsupported token triggers a range error.
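The range rules can be sketched as a small parser. This is an illustration under assumptions, not the tool's code: `parse_pages` is a hypothetical name, the ascending check is applied only inside each range such as 8-10, and duplicates are dropped with a warning instead of raising an error.

```python
import re

TOKEN = re.compile(r"^(\d+)(?:-(\d+))?$")  # "3" or "8-10"


def parse_pages(spec: str, page_count: int):
    """Return (pages, warnings) for a page selection string."""
    text = spec.strip().lower()
    if text == "all":
        return list(range(1, page_count + 1)), []
    pages, warnings, seen = [], [], set()
    for raw in text.split(","):
        token = raw.strip()
        match = TOKEN.match(token)
        if not match:
            raise ValueError(f"Unsupported page token: {token!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        if start > end:
            raise ValueError(f"Range is not ascending: {token}")
        if start < 1 or end > page_count:
            raise ValueError(f"Page outside the PDF (1-{page_count}): {token}")
        for page in range(start, end + 1):
            if page in seen:
                warnings.append(f"Duplicate page {page} ignored")
            else:
                seen.add(page)
                pages.append(page)
    return pages, warnings


print(parse_pages("1,3,8-10", 10))  # ([1, 3, 8, 9, 10], [])
```

With these rules, `10-8`, `12` on a 10-page PDF, and `a-b` each raise a `ValueError`, while `1-3,3` returns pages 1 to 3 plus one duplicate warning.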
Does the selected PDF leave my browser?
The selected PDF is read through browser file APIs and is not sent to a tool-specific backend. OCR scripts and language data may load from external hosts when recognition is needed.
What should I try when the OCR text looks wrong?
Rerun a smaller range, choose the main printed language, raise DPI only for small print, and try Single column, Single block, or Sparse text when the page layout is not ordinary prose.
Glossary:
- OCR: Optical character recognition, the process of estimating text from page images.
- Embedded text: Character data already present inside a PDF page, separate from the visible page image.
- Raster OCR: Recognition run after a PDF page is rendered into pixels.
- DPI: Dots per inch, used here as the rendering resolution for OCR.
- Page segmentation: The assumption about how text is arranged on the page before recognition groups words and lines.
- Searchable PDF: A PDF with recognized text aligned as an invisible text layer over the original page image.
References:
- File API, MDN Web Docs, April 10, 2025.
- Module: pdfjsLib, Mozilla PDF.js API draft.
- Command Line Usage, Tesseract OCR tessdoc.
- Tesseract.js project page, Project Naptha.