OCR PDF Tool
Extract text from scanned or text-native PDFs in your browser, with OCR fallback, confidence ledgers, page guards, and text downloads.
Introduction
A PDF can look complete while hiding a basic question: are the words stored as characters, or are they only painted into a page image? Many PDFs contain real text from a word processor, form system, invoice tool, or publishing app. Others are pictures of pages because they came from a scanner, copier, fax workflow, phone camera, or document archive. Both files may look identical in a viewer, but they behave very differently when you try to copy, search, index, translate, or reuse the words.
Optical character recognition, usually shortened to OCR, reads printed characters from a page image and turns them into editable text. It is most useful when a document is image-only or partly image-only, such as scanned receipts, signed forms, old reports, paper letters, stamped records, and mixed PDF packets. Native PDF text extraction is different: it reads characters that are already present inside the document, so it is usually faster and avoids many recognition errors.
A useful OCR pass starts by recognizing which kind of PDF you have. A bank statement exported from an account portal may already contain selectable text, while a signed version of the same statement may be a scanned image. A court bundle, insurance packet, or office archive can contain both types page by page. Treating every page as an image wastes time, but assuming every page has usable text can miss the scanned pages that matter most.
Several factors change OCR quality before any software setting is chosen: scan resolution, contrast, skew, page rotation, print size, blur, bleed-through, stamps, handwriting, tables, columns, and marks near the text. Higher resolution can help small print, but it also makes each page image larger and slower to process. Layout is another source of error because OCR may recognize the words correctly while placing them in an order that does not match the original page.
- Searchable PDF usually means a PDF has a text layer that search tools can read, whether that layer came from the original document or from a previous OCR process.
- Image-only PDF means the visible page is a picture, so copying text from a viewer may select nothing or produce unreliable fragments.
- OCR confidence is a clue from the recognition pass, not a proof that names, amounts, dates, punctuation, or table cells are correct.
How to Use This Tool:
Use the controls to choose a small, reviewable page set first. Long scanned PDFs are easier to audit when you confirm the settings on a few pages before processing more of the document.
- Drop or browse one file in Scanned PDF. If the file is password protected, open Advanced, enter the PDF password, and run the file inspection again.
- Choose Auto text or OCR when the PDF may already contain text. Choose OCR rendered pages when every selected page should be read from a rendered page image.
- Set OCR language, Pages, and OCR resolution. The page range accepts all, a single page such as 1, or comma-separated ranges such as 1,3,8-10.
- Use Text layout to control the copied or downloaded text format. Page headings are helpful for multi-page review, while continuous text is cleaner for a short single-page extraction.
- Open Advanced when the scan has unusual layout. Page segmentation changes the OCR layout assumption, Page guard blocks accidental long jobs, and Preserve spacing can help simple columns or fixed-width scans.
- Run OCR and watch the progress bar. If the Readiness Ledger reports an invalid range, a guard block, or a file parsing problem, fix that item before relying on the extracted text.
- Review OCR Text first, then use Page OCR Ledger, Readiness Ledger, and JSON to check page routes, confidence, warnings, word counts, and processing details before copying or downloading the result.
Interpreting Results:
Start with the route and status for each page. Embedded text means the page had enough native PDF text to skip OCR. OCR raster means the page was rendered as an image and recognized with the selected language, resolution, segmentation, and spacing options. A high OCR confidence score reduces review effort, but it does not make the text a certified transcript.
Use word and character counts as sanity checks. A page marked Blank may really be blank, but it may also be a low-contrast scan, a page with handwriting, a decorative cover page, or a page whose text is too small for the selected settings. When the output will be used for filing, quoting, data entry, or search indexing, proofread names, dates, totals, invoice numbers, addresses, and table rows against the original PDF image.
| Signal | Meaning | What to check |
|---|---|---|
| Native text | Enough embedded characters were found to avoid image recognition. | Check reading order on forms, tables, headers, footers, and multi-column pages. |
| OCR confidence | The recognition pass estimated how likely its page text is to be correct. | Proofread low-confidence pages and any page with small print or noisy marks. |
| Page guard | The selected page count is above the browser safety limit you set. | Narrow the page range, then process the document in smaller groups. |
| DPI or pixel warning | The rendered page image would be too large for a safe browser pass. | Lower OCR resolution or split the work into fewer pages. |
| Text only output | The result is extracted text and review data, not a rewritten PDF. | Use another workflow if you need a searchable PDF with an embedded text layer. |
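The review signals in the table can be turned into a simple triage pass. The following is a minimal sketch, not the tool's own logic: the `PageRow` fields, the 0 to 100 confidence scale, and the 80-point threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageRow:
    """One ledger row (hypothetical shape, not the tool's export format)."""
    page: int
    route: str         # "Embedded text" or "OCR raster"
    confidence: float  # assumed 0-100 scale
    words: int


def pages_needing_review(rows, min_confidence=80.0):
    """Map page number -> reason for every page that deserves a manual check."""
    flagged = {}
    for row in rows:
        if row.words == 0:
            # A "blank" page may be a low-contrast scan or handwriting.
            flagged[row.page] = "no words found: confirm the page is really blank"
        elif row.route == "OCR raster" and row.confidence < min_confidence:
            flagged[row.page] = "low OCR confidence: proofread against the page image"
    return flagged


sample = [
    PageRow(page=1, route="Embedded text", confidence=100.0, words=412),
    PageRow(page=2, route="OCR raster", confidence=62.5, words=198),
    PageRow(page=3, route="OCR raster", confidence=0.0, words=0),
]
for page, reason in pages_needing_review(sample).items():
    print(page, reason)
```

A pass like this only narrows where to look. Pages that clear the threshold can still contain wrong names, amounts, or dates.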
Technical Details:
PDF pages are measured in user-space points, where the normal page unit is one seventy-second of an inch. A page can expose a text layer, a drawing program, images, forms, or a mixture of those objects. Text extraction reads the text layer when it is present, while OCR needs a raster image of the page because character recognition works on pixels.
The automatic route uses native PDF text when the extracted and normalized page text reaches a practical minimum length. If that page does not provide enough text, the page is rendered at the selected dots per inch and passed through OCR. That page-by-page decision matters for mixed documents because it lets a text-native page stay fast while still recognizing scanned pages in the same file.
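The per-page decision can be sketched as follows. This is an illustration under stated assumptions: the 24-character minimum comes from the rules table on this page, but the normalization step (collapsing whitespace and trimming) and the function names are hypothetical.

```python
import re

MIN_NATIVE_CHARS = 24  # threshold quoted in the rules table on this page


def normalize_text(text: str) -> str:
    """Collapse whitespace runs and trim the ends (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip()


def choose_route(native_text: str, force_ocr: bool = False) -> str:
    """Return the route label for one page."""
    if force_ocr:
        # "OCR rendered pages" mode: always render and recognize.
        return "OCR raster"
    if len(normalize_text(native_text)) >= MIN_NATIVE_CHARS:
        return "Embedded text"
    # Too little native text: fall back to rendering the page for OCR.
    return "OCR raster"


print(choose_route("Invoice 2024-001 total due 120.00"))
print(choose_route("  \n 12 \n"))
```

Because the decision is made per page, a mixed packet can report both routes in the same run.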
Formula Core
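Each rendered dimension in pixels equals the page dimension in points multiplied by DPI / 72, so the pixel count grows with the square of the DPI. Doubling the DPI quadruples the number of pixels (apart from rounding), so a higher setting can improve small print while also increasing memory use and runtime.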
For example, a letter-size page is about 612 by 792 points. At 200 DPI, the scale is 2.7778, producing about 1,700 by 2,200 pixels, or roughly 3.74 million pixels. At 300 DPI, the same page is about 2,550 by 3,300 pixels, or roughly 8.42 million pixels. The browser safety cutoff blocks any single rendered page above 64,000,000 pixels.
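The arithmetic above can be checked with a short sketch. The 72 points per inch and the 64,000,000-pixel cutoff come from this page. The rounding rule and the function names are assumptions.

```python
POINTS_PER_INCH = 72
MAX_PAGE_PIXELS = 64_000_000  # single-page cutoff quoted in the text


def raster_size(width_pt: float, height_pt: float, dpi: int):
    """Return (width_px, height_px, total_pixels) for a page rendered at dpi."""
    # Multiply before dividing so common page sizes stay exact.
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px, width_px * height_px


def exceeds_cutoff(width_pt: float, height_pt: float, dpi: int) -> bool:
    """True when the rendered page would be larger than the safety cutoff."""
    return raster_size(width_pt, height_pt, dpi)[2] > MAX_PAGE_PIXELS


print(raster_size(612, 792, 200))      # letter page at 200 DPI
print(raster_size(612, 792, 300))      # letter page at 300 DPI
print(exceeds_cutoff(2592, 3456, 300)) # 36 x 48 inch poster at 300 DPI
```

A letter page stays far below the cutoff even at 300 DPI. The cutoff matters mainly for unusually large pages such as posters or engineering drawings.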
Rule Core
| Rule | Boundary | Effect |
|---|---|---|
| PDF selection | One PDF is processed at a time. | Extra dropped files are ignored, and non-PDF files stop with an error. |
| Page range | all, single pages, comma lists, and ascending ranges are valid. | Duplicates are ignored with a warning; invalid, descending, or out-of-bounds ranges stop before OCR. |
| Native text route | Automatic mode uses embedded text when normalized page text has at least 24 characters. | That page reports Embedded text and Native text confidence. |
| OCR route | OCR-only mode, or automatic mode below the native-text threshold, renders the page image. | The page reports OCR confidence, runtime, words, characters, and route details. |
| Resolution range | DPI is constrained to 120 through 300. | Higher values are useful for small print but make each rendered page larger. |
| Page guard | The guard is constrained to 1 through 50 pages, with 8 pages as the default. | Selected ranges above the guard are blocked before recognition starts. |
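The page-range, resolution, and guard rules in the table can be sketched as below. The boundaries (ascending ranges only, DPI 120 through 300, guard 1 through 50 with a default of 8) come from the table. The function names, error messages, and warning text are hypothetical.

```python
def parse_page_range(spec: str, page_count: int):
    """Return (pages, warnings); raise ValueError for ranges that must stop before OCR."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1)), []
    pages, warnings, seen = [], [], set()
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        first, last = first.strip(), last.strip()
        if not first.isdecimal() or (sep and not last.isdecimal()):
            raise ValueError(f"invalid page token: {part!r}")
        start = int(first)
        end = int(last) if sep else start
        if end < start:
            raise ValueError(f"descending range: {part!r}")
        if start < 1 or end > page_count:
            raise ValueError(f"out-of-bounds range: {part!r}")
        for page in range(start, end + 1):
            if page in seen:
                # Duplicates are dropped with a warning, not an error.
                warnings.append(f"duplicate page {page} ignored")
            else:
                seen.add(page)
                pages.append(page)
    return pages, warnings


def clamp_dpi(dpi: int) -> int:
    """Constrain the resolution to the 120-300 range."""
    return max(120, min(300, dpi))


def within_guard(pages, guard: int = 8) -> bool:
    """True when the selection fits the page guard (itself clamped to 1-50)."""
    guard = max(1, min(50, guard))
    return len(pages) <= guard


pages, warnings = parse_page_range("1,3,8-10", page_count=12)
print(pages, warnings, within_guard(pages))
```

Raising an error for invalid input, while only warning on duplicates, mirrors the split in the table between conditions that stop before OCR and conditions that are tolerated.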
Recognition Settings
Language data affects character choices and dictionary bias. A single-language pass is usually faster and cleaner when the document is consistent. Mixed English and Spanish recognition can help bilingual pages, but it gives the OCR engine a wider set of possibilities and may take longer.
Page segmentation tells OCR what shape of text to expect. Automatic page segmentation is the general default. Single-column mode is better for one-column pages with varying text sizes, single-block mode fits clean blocks such as simple forms or table-of-contents pages, and sparse-text mode can help labels or loose snippets spread across a page.
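For readers who want to reproduce similar settings outside the browser, the options above map onto the Tesseract command line. This is a sketch under a clear assumption: it targets a locally installed `tesseract` binary, not this tool's internals, and the mode names and helper function are hypothetical. The `-l`, `--psm`, `--dpi`, and `-c preserve_interword_spaces=1` options are standard Tesseract CLI options.

```python
# Tesseract page segmentation modes matching the four layouts described above.
PSM_MODES = {
    "auto": 3,           # fully automatic page segmentation, no orientation detection
    "single_column": 4,  # a single column of text of variable sizes
    "single_block": 6,   # a single uniform block of text
    "sparse_text": 11,   # as much text as possible, in no particular order
}


def build_tesseract_args(image_path, languages, segmentation="auto",
                         dpi=200, preserve_spacing=False):
    """Build an argument list for the tesseract CLI; nothing is executed here."""
    args = [
        "tesseract", image_path, "stdout",
        "-l", "+".join(languages),          # e.g. "eng+spa" for bilingual pages
        "--psm", str(PSM_MODES[segmentation]),
        "--dpi", str(dpi),
    ]
    if preserve_spacing:
        args += ["-c", "preserve_interword_spaces=1"]
    return args


print(build_tesseract_args("page-1.png", ["eng", "spa"], "single_column", 300, True))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the named language data are installed.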
Limitations, Privacy, and Accuracy Notes:
The selected PDF is read in the browser after you choose it or drop it. The PDF file itself is not uploaded for OCR processing, but recognition engine files and language data may be fetched when image recognition is needed. Avoid saving or sharing screenshots after entering a PDF password, and close the tab when you are finished with sensitive documents.
- OCR is weakest on handwriting, skewed scans, low contrast, tiny print, stamps, watermarks, and dense tables.
- Native text extraction can still have odd reading order when the original PDF layout is complex.
- The output is plain extracted text plus review data; it does not create a searchable PDF or preserve exact page geometry.
Worked Examples:
| Scenario | Input choices | Expected result | Review step |
|---|---|---|---|
| Mixed contract packet | Auto text or OCR, 1-6, 200 DPI, page headings. | Text-native pages show Embedded text; signed scan pages show OCR raster. | Use Page OCR Ledger to find pages that need manual proofreading. |
| Small scanned invoice | OCR rendered pages, 1, 300 DPI, single block if the page is clean. | OCR Text returns the invoice text and the ledger reports confidence, words, and characters. | Verify invoice number, date, total, tax, and supplier name against the page image. |
| Guarded long archive | all on a 40-page scan with the default 8-page Page guard. | Readiness Ledger reports the selected range is blocked before OCR starts. | Run 1-8, then continue with later ranges after confirming the settings. |
| Large poster page | 300 DPI on an unusually large PDF page. | A pixel warning can stop the page if the rendered image exceeds the browser safety cutoff. | Lower OCR resolution or process a smaller page range. |
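For the guarded long archive, the follow-up ranges can be worked out ahead of time. A minimal sketch, assuming the default 8-page guard. The helper name is hypothetical.

```python
def guard_batches(page_count: int, guard: int = 8):
    """Split pages 1..page_count into consecutive range strings that fit the guard."""
    if guard < 1:
        raise ValueError("guard must be at least 1")
    batches = []
    for start in range(1, page_count + 1, guard):
        end = min(start + guard - 1, page_count)
        # A final batch of one page is written as a single page number.
        batches.append(str(start) if start == end else f"{start}-{end}")
    return batches


print(guard_batches(40))
```

Each string can be pasted into Pages in turn, after the settings have been confirmed on the first batch.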
FAQ:
Why did a page skip OCR?
In Auto text or OCR mode, a page with at least 24 normalized native text characters is read directly from the PDF text layer instead of being rendered for OCR.
Why does the output order look strange?
PDF text layers and OCR layout analysis can both place words in an order that differs from the visual page, especially on forms, columns, headers, footers, and tables. Check the original page when order matters.
Will 300 DPI always improve results?
No. Higher DPI can help small print, but it also raises memory use and runtime. It will not fix blur, poor contrast, skew, handwriting, or badly cropped pages.
What should I do when the page range is rejected?
Use all, a single page number, or a comma-separated list of pages and ascending ranges such as 1,3,8-10. Reverse ranges and out-of-bounds page numbers must be corrected, and a selection larger than the Page guard must be narrowed, before OCR starts.
Can this make a searchable PDF?
No. The output is extracted text, page review data, and JSON. A searchable PDF requires writing a text layer back into the PDF, which is outside this tool's output scope.
Glossary:
- Embedded text: Characters already present in the PDF page data, separate from the visible page image.
- OCR raster: A rendered page image that is passed through optical character recognition.
- DPI: Dots per inch, the resolution used when turning a PDF page into pixels for OCR.
- Page segmentation: The OCR layout assumption, such as automatic page, single column, single block, or sparse text.
- Text layer: The searchable or copyable character data that may already exist inside a PDF page.
- Searchable PDF: A PDF with a text layer that search and selection tools can read.