Scanned PDF Has No Searchable Text: How to OCR It

A scanned PDF where you can’t select text or search isn’t corrupted — it’s an image-only PDF, which is exactly what most scanners produce by default. Each page is a picture of the document, with no underlying text data for the reader to find. The fix is OCR (optical character recognition): a process that looks at the image, identifies the text, and adds an invisible text layer behind it. Once OCR is applied, text becomes selectable, copyable, and searchable.

Quick fix

The free tool OCRmyPDF handles this in one command. It uses Tesseract internally and produces a searchable PDF without changing the visual appearance of the original.

Install it (one-time setup):

# macOS with Homebrew
brew install ocrmypdf

# Ubuntu / Debian
sudo apt install ocrmypdf

# Windows or any platform with Python
# (pip installs only the Python package; Tesseract and Ghostscript
# must be installed separately)
pip install ocrmypdf

Then OCR the file:

ocrmypdf input.pdf output.pdf

That’s it. Open output.pdf and you’ll find selectable, searchable text. The original input.pdf is untouched.
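
To confirm the result from the command line instead of opening the file, one option is to check whether any text can be extracted at all. The sketch below assumes pdftotext (part of poppler-utils, not of OCRmyPDF) is installed; has_text_layer is a helper name made up for this example.

```shell
# Hypothetical helper: succeeds if any non-whitespace text can be extracted.
# Assumes pdftotext from poppler-utils is on the PATH.
has_text_layer() {
    # "-" sends the extracted text to stdout; tr strips spaces, newlines
    # and the form feeds that pdftotext emits between pages
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

Running has_text_layer output.pdf && echo "searchable" should print the message for the OCRed file, while the image-only input.pdf should fail the same check.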

If you have Adobe Acrobat Pro installed (note: not the free Acrobat Reader, which doesn’t include OCR), use it directly: open the PDF, then Tools > Scan & OCR > Recognize Text > In This File. Acrobat performs OCR using Adobe’s own engine and updates the file in place.

If that didn’t work

For non-English documents, specify the language. OCRmyPDF defaults to English; OCR quality on non-English text without the language flag is poor.

ocrmypdf --language fra input.pdf output.pdf      # French
ocrmypdf --language deu input.pdf output.pdf      # German
ocrmypdf --language jpn input.pdf output.pdf      # Japanese

The codes are Tesseract’s three-letter language codes. Running tesseract --list-langs shows which languages are currently installed; if the one you need is missing, install the matching Tesseract language pack:

# Ubuntu / Debian (one package per language)
sudo apt install tesseract-ocr-fra

# macOS with Homebrew (all languages in one package)
brew install tesseract-lang
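
Documents that mix languages can be handled by joining codes with a plus sign, for example --language eng+fra. Before a long run it can be worth checking that each pack is really installed. A minimal sketch follows; lang_installed is a hypothetical helper name, and it assumes tesseract --list-langs prints a header line followed by one code per line.

```shell
# Hypothetical helper: succeeds if Tesseract lists the given language code.
lang_installed() {
    # 2>&1 in case the list is printed on stderr rather than stdout;
    # -x matches the whole line, -F treats the code as a literal string
    tesseract --list-langs 2>&1 | grep -xF "$1" >/dev/null
}
```

For example: lang_installed fra && ocrmypdf --language eng+fra input.pdf output.pdf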

For low-quality scans where the text is faint, skewed, or noisy, add image-processing flags:

ocrmypdf --deskew --rotate-pages input.pdf output.pdf

--deskew corrects pages scanned at a slight angle. --rotate-pages detects pages scanned upside-down or sideways and rotates them. Both improve OCR accuracy at the cost of some processing time. The --clean flag invokes unpaper to remove background noise before OCR; install unpaper separately if you want to use it.
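
Because --clean only works when unpaper is present, a small wrapper can add it opportunistically. This is a sketch rather than anything OCRmyPDF ships; ocr_low_quality is a hypothetical name.

```shell
# Hypothetical wrapper: always deskew and auto-rotate, and add --clean
# only when unpaper is available on this machine.
ocr_low_quality() {
    if command -v unpaper >/dev/null 2>&1; then
        ocrmypdf --deskew --rotate-pages --clean "$1" "$2"
    else
        ocrmypdf --deskew --rotate-pages "$1" "$2"
    fi
}
```

Call it the same way as ocrmypdf itself: ocr_low_quality input.pdf output.pdf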

If a PDF already has some OCR text — perhaps poor-quality OCR from earlier software — and you want to redo it:

ocrmypdf --redo-ocr input.pdf output.pdf

This strips the existing hidden OCR text and applies fresh OCR using the currently installed Tesseract. Visible text that is part of the page itself is left unchanged.
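
Without a flag such as --redo-ocr, OCRmyPDF stops with an error when a page already contains text instead of overwriting it. If you don’t know in advance which files have an old text layer, one approach is to try a plain run first and retry only on that specific failure. The sketch below assumes exit status 6 means “file already has text”, as listed in OCRmyPDF’s documented exit codes; check this against your installed version. ocr_or_redo is a hypothetical name.

```shell
# Hypothetical wrapper: plain OCR first, then --redo-ocr only if OCRmyPDF
# reports that the input already contains text (assumed exit status 6).
ocr_or_redo() {
    status=0
    ocrmypdf "$1" "$2" || status=$?
    if [ "$status" -eq 6 ]; then
        status=0
        ocrmypdf --redo-ocr "$1" "$2" || status=$?
    fi
    return "$status"
}
```

Any other failure (bad input, missing dependency) is passed through unchanged, so the caller still sees the original error status.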

Advanced recovery

For one-off scanned PDFs where installing tools isn’t worth the trouble, Google Drive performs OCR for free. Upload the PDF to Drive, right-click it, and choose Open with > Google Docs. Google Docs runs OCR as it opens the file and produces a Doc containing the recognized text. You can then copy the text or download the Doc as a new PDF. Note that the result is a reflowed text document rather than the original page images with a hidden text layer, so the original layout is usually lost. This is fast and effective; the trade-off is that you’ve uploaded the document to Google’s servers, so don’t use it for sensitive material.

For scans where each page is a separate image file (JPEG, PNG, TIFF) rather than already a PDF, combine them first:

img2pdf page1.jpg page2.jpg page3.jpg -o combined.pdf
ocrmypdf combined.pdf searchable.pdf

img2pdf is recommended over Ghostscript or ImageMagick for image-to-PDF conversion because it preserves image quality without re-encoding.
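
The two steps can be wrapped so that the intermediate combined PDF is cleaned up automatically. A minimal sketch, with images_to_searchable_pdf as a hypothetical name; it takes the output file first and the page images after it, in page order.

```shell
# Hypothetical wrapper: combine images into a temporary PDF, OCR it,
# then remove the temporary file whether or not OCR succeeded.
images_to_searchable_pdf() {
    out=$1
    shift                                   # remaining arguments are the images
    tmp=$(mktemp -d) || return 1
    status=0
    img2pdf "$@" -o "$tmp/combined.pdf" &&
        ocrmypdf "$tmp/combined.pdf" "$out" || status=$?
    rm -rf "$tmp"
    return "$status"
}
```

For example: images_to_searchable_pdf searchable.pdf page-*.jpg. Shell globs sort names alphabetically, so page-10.jpg comes before page-2.jpg; zero-padded names (page-01.jpg, page-02.jpg) keep the pages in order.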

For batch processing of many scanned PDFs at once:

for f in *.pdf; do ocrmypdf "$f" "ocr-$f"; done

This runs sequentially, which is usually fine: OCRmyPDF already spreads the pages of each file across your CPU cores (adjustable with --jobs), so running several copies in parallel rarely helps.
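
The one-line loop writes its output next to the input, so running it a second time would also pick up the ocr- files. For repeated use, a variant that writes to a separate folder and skips finished files is safer. This is a sketch under those assumptions; ocr_batch is a hypothetical name.

```shell
# Hypothetical helper: OCR every PDF in a source folder into a separate
# destination folder, skipping files that were already processed.
ocr_batch() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    failed=0
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue             # the glob matched nothing
        out="$dest/$(basename "$f")"
        [ -e "$out" ] && continue           # done on an earlier run
        ocrmypdf "$f" "$out" || {
            echo "failed: $f" >&2
            failed=$((failed + 1))
        }
    done
    [ "$failed" -eq 0 ]                     # succeed only if nothing failed
}
```

For example: ocr_batch scans searchable. A failure on one file is reported but doesn’t stop the rest of the batch.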

Why this happens

Scanners produce PDFs in two fundamentally different ways. The default and simpler mode is image-only PDF: each page is a JPEG, PNG, or TIFF image embedded inside a PDF wrapper. The PDF file knows what the page looks like but has no information about the text on the page. Search returns nothing because there is nothing to search.

The other mode is searchable PDF (sometimes called “text-under-image”): the scanner runs OCR during scanning, identifies the text in the image, and embeds an invisible text layer underneath the image. Visually the page looks identical, but search and copy-paste work because the text layer is there. Most modern document scanners can produce searchable PDFs if you enable OCR in the scan settings, but they don’t always default to it, especially on older devices or in basic scan modes.

Phone scanning apps (Adobe Scan, Microsoft Office Lens, Google Drive scan) generally do produce searchable PDFs, because they run OCR automatically after capture. Dedicated office scanners often don’t, because they prioritize speed and assume the user will OCR later if needed. Multi-function printers vary widely.

OCR quality depends on a few factors. Resolution matters most: below 200 DPI results are unreliable, 300 DPI is the practical target for ordinary print, and 400 DPI or higher helps with small type. Image clarity matters next: a scan with even contrast and clear text outperforms a scan with shadows, smudges, or compression artifacts, even at the same resolution. Font matters too: Tesseract handles modern serif and sans-serif fonts well but struggles with handwriting, decorative fonts, and text rendered at very small sizes.
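
If you’re unsure what resolution a scan was made at, pdfimages -list (from poppler-utils, not part of OCRmyPDF) reports the effective DPI of each embedded image. The sketch below lists pages under a threshold. It assumes the usual column layout of that command, with two header lines and x-ppi in the 13th column, which may differ between poppler versions; low_res_pages is a hypothetical name.

```shell
# Hypothetical helper: print "page x-ppi" for each embedded image whose
# horizontal resolution is below the threshold (default 300).
# Assumes pdfimages -list prints two header lines and x-ppi in column 13.
low_res_pages() {
    pdfimages -list "$1" |
        awk -v min="${2:-300}" 'NR > 2 && $13 + 0 < min { print $1, $13 }'
}
```

For example, low_res_pages input.pdf checks against 300 DPI and low_res_pages input.pdf 200 against 200 DPI. Pages it reports are candidates for rescanning at a higher setting.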

Preventing this in future

If you scan documents regularly, configure your scanner to enable OCR at scan time. Most modern scanners have a setting called “Searchable PDF,” “OCR PDF,” or similar. The OCR happens during the scan and the resulting file is searchable from the start, with no separate processing step needed.

If your scanner doesn’t support built-in OCR, add OCRmyPDF to your scan workflow. Set up a watched folder where scanned PDFs land, and have OCRmyPDF process new files automatically. The OCRmyPDF documentation includes recipes for this with watchmedo and similar tools.
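
If you’d rather not install a file watcher, a plain polling loop gives a rough equivalent. This is a simplified stand-in, not the watchmedo recipe from the OCRmyPDF documentation; process_inbox and the folder names are hypothetical.

```shell
# Hypothetical helper: OCR every PDF waiting in the inbox, then move the
# original to an archive folder so it is not processed again.
process_inbox() {
    inbox=$1
    outbox=$2
    archive=$3
    mkdir -p "$outbox" "$archive" || return 1
    for f in "$inbox"/*.pdf; do
        [ -e "$f" ] || continue             # inbox is empty
        name=$(basename "$f")
        if ocrmypdf "$f" "$outbox/$name"; then
            mv "$f" "$archive/$name"        # keep the original, out of the inbox
        else
            echo "OCR failed, leaving in inbox: $f" >&2
        fi
    done
}
```

To keep it running, call it in a loop such as: while true; do process_inbox inbox searchable originals; sleep 30; done. One caveat of polling is that a file the scanner is still writing may be picked up too early; a failed file stays in the inbox and is retried on the next pass.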

For phone scanning, prefer apps that produce searchable output. Adobe Scan and Microsoft Office Lens both do this well, and both are free.

If the PDF in question arrived via email and won’t open at all (rather than opening but lacking searchable text), the problem is more likely transfer corruption than missing OCR — see PDF from email won’t open. For PDFs that have other rendering or content problems, the PDF repair pillar covers the broader landscape of file-level issues.

Last verified: April 2026