Complete Guide to pikepdf for PDF Repair in Python
pikepdf is the right answer when you have many damaged PDFs to repair, when PDF handling is part of a larger Python workflow, or when you want the same repair behavior as qpdf but called from code rather than a shell. It is a Python wrapper around libqpdf — the same C++ library that powers the qpdf command-line tool — so the underlying repair behavior is identical. What changes is the interface: a Pythonic object model instead of shell flags, exception-based error handling instead of exit codes, and the ability to script complex operations across thousands of files without invoking a separate process for each one.
If you only have one damaged PDF and don’t write Python, qpdf is the simpler choice. If you have a folder of damaged PDFs, an automation that needs to inspect or modify PDFs, or a server-side workflow, pikepdf is usually faster to build and easier to maintain.
This guide covers installation, the recipes that handle most real tasks, how pikepdf’s automatic repair behavior interacts with diagnostics, and what pikepdf does not do.
When to use pikepdf
pikepdf is the right tool for:
Scripted batch repair. Open every PDF in a directory, save it back to a new location, and the act of opening and saving rebuilds structure for damaged files. A few lines of Python replaces a shell loop calling qpdf.
PDF processing inside a larger Python application. OCR pipelines, document workflows, web services that accept user-uploaded PDFs and need to sanitize them before further processing. pikepdf is the foundation library for OCRmyPDF for exactly this reason.
Page-level operations on many files. Splitting, merging, reordering, rotating, and extracting pages, mostly expressed as list-style operations on the Pdf.pages collection.
Programmatic encryption and decryption. Adding or removing passwords as part of an automated pipeline, with full control over permission flags.
XMP and document metadata editing. Reading and writing XMP metadata blocks programmatically without external tools.
pikepdf is not the right tool for:
Rendering PDFs to images. pikepdf does not draw page content. Pair it with PyMuPDF, pdf2image, or pypdfium2 for rasterization.
Text extraction. pikepdf can read content streams as raw byte sequences but does not interpret them into readable text. Use pdfminer.six, pypdf, or PyMuPDF for extraction.
One-off repair when you don’t write Python. qpdf’s command line is a single invocation. Setting up Python and pip just to run pikepdf for one file is overkill.
Recovering severely damaged content streams. pikepdf inherits qpdf’s limits — it rebuilds structure but cannot reconstruct content that is genuinely missing. Ghostscript’s re-rendering approach sometimes salvages files pikepdf cannot.
Installation
pikepdf requires Python 3.9 or newer. Binary wheels are published for Windows, macOS, and Linux on both x86-64 and ARM64 architectures, so no compiler is needed for typical installations.
The standard install:
pip install pikepdf
If you use a virtual environment — and you should — activate it first:
python -m venv .venv
source .venv/bin/activate      # macOS/Linux
.venv\Scripts\activate         # Windows (run this instead of the line above)
pip install pikepdf
For conda users:
conda install -c conda-forge pikepdf
Verify the install:
import pikepdf
print(pikepdf.__version__)
If pip install falls back to building from source — usually because no wheel matches your platform — you’ll need a C++17 compiler and the qpdf development headers installed first. The pikepdf documentation has the full instructions for that path; on most modern platforms the wheel install is enough and the source build is unnecessary.
Common recipes
All examples assume import pikepdf and a damaged or working PDF named input.pdf in the current directory.
Repair a damaged PDF
The fundamental pikepdf repair pattern relies on the fact that Pdf.open() silently repairs structural damage when it can, and pdf.save() writes a fresh, well-structured file:
with pikepdf.Pdf.open('input.pdf') as pdf:
    pdf.save('output.pdf')
This is functionally equivalent to qpdf input.pdf output.pdf from the command line. For most damaged-but-recoverable PDFs, this single open-and-save cycle produces a clean output file.
By default you cannot save back to the input file directly: pikepdf holds the input open for the lifetime of the Pdf object and refuses to overwrite it. Save to a new path, then replace the original with os.replace() if needed:
import os

import pikepdf

with pikepdf.Pdf.open('input.pdf') as pdf:
    pdf.save('input.pdf.tmp')
os.replace('input.pdf.tmp', 'input.pdf')
Batch repair a directory
The case where pikepdf earns its keep over qpdf:
from pathlib import Path

import pikepdf

source = Path('damaged_pdfs')
dest = Path('repaired_pdfs')
dest.mkdir(exist_ok=True)

for pdf_path in source.glob('*.pdf'):
    try:
        # Opening triggers the automatic repair; saving writes a fresh file.
        with pikepdf.Pdf.open(pdf_path) as pdf:
            pdf.save(dest / pdf_path.name)
    except pikepdf.PdfError as e:
        # Raised when the file is too damaged to open at all.
        print(f'Failed: {pdf_path.name}: {e}')
The try block catches files that are too damaged to open at all. Files that open with warnings still produce output: pikepdf does not raise on recoverable damage, it repairs the structure while opening.
Diagnose without modifying
Pdf.check() returns a list of structural problems found in the file:
with pikepdf.Pdf.open('input.pdf') as pdf:
    problems = pdf.check()
    for problem in problems:
        print(problem)
There is a real subtlety here: because Pdf.open() already attempts repair as part of opening the file, check() may report fewer problems than qpdf --check would on the same file from the command line. check() reports what remains after the automatic repair pass, not what was originally wrong. If you need the pre-repair diagnostic, use the Job interface to invoke qpdf-equivalent behavior:
from pikepdf import Job
Job(['pikepdf', '--check', 'input.pdf']).run()
This runs the same checks as qpdf --check and prints the same output to stdout.
Split, merge, and rearrange pages
Pages are accessed as a Python list via pdf.pages:
# Extract pages 1-3 (PDF pages 1-3 are Python indices 0-2)
with pikepdf.Pdf.open('input.pdf') as pdf:
    new_pdf = pikepdf.Pdf.new()
    new_pdf.pages.extend(pdf.pages[0:3])
    new_pdf.save('first-three.pdf')

# Merge two PDFs
with pikepdf.Pdf.open('a.pdf') as a, pikepdf.Pdf.open('b.pdf') as b:
    a.pages.extend(b.pages)
    a.save('merged.pdf')

# Reverse page order
with pikepdf.Pdf.open('input.pdf') as pdf:
    pdf.pages.reverse()
    pdf.save('reversed.pdf')

# Delete a page
with pikepdf.Pdf.open('input.pdf') as pdf:
    del pdf.pages[2]  # remove page 3
    pdf.save('shorter.pdf')
Rotate pages
Each page has a rotate() method. The relative=True argument adds to whatever rotation the page already has:
with pikepdf.Pdf.open('input.pdf') as pdf:
    for page in pdf.pages:
        page.rotate(180, relative=True)
    pdf.save('upside-down.pdf')
Use relative=False only if you specifically want to override existing rotation rather than add to it.
Decrypt a password-protected PDF
If you have the password:
with pikepdf.Pdf.open('encrypted.pdf', password='your-password') as pdf:
    pdf.save('decrypted.pdf')
By default the saved file has no encryption applied. pikepdf does not crack or recover lost passwords — if you don’t have the password, pikepdf cannot help.
Encrypt a PDF
from pikepdf import Encryption
with pikepdf.Pdf.open('input.pdf') as pdf:
    pdf.save(
        'encrypted.pdf',
        encryption=Encryption(
            user='user-password',
            owner='owner-password'
        )
    )
The default is AES-256, which is the strongest option pikepdf supports and the right choice for new files. To restrict permissions while leaving the file openable:
from pikepdf import Encryption, Permissions
restrictions = Permissions(extract=False, modify_other=False)
with pikepdf.Pdf.open('input.pdf') as pdf:
    pdf.save(
        'restricted.pdf',
        encryption=Encryption(
            user='', owner='owner-password', allow=restrictions
        )
    )
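A blank user password lets anyone open the file, while compliant viewers apply the permission restrictions to anyone who has not supplied the owner password. The restrictions depend on the viewer honoring them; they are not a hard technical barrier.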
Use the Job interface for qpdf-style commands
For operations that map cleanly to qpdf command-line invocations, the Job interface lets you write what looks like a qpdf command in Python:
from pikepdf import Job
# Equivalent to: qpdf --linearize input.pdf output.pdf
Job(['pikepdf', '--linearize', 'input.pdf', 'output.pdf']).run()
# JSON form, useful for building jobs programmatically
Job({
    'inputFile': 'input.pdf',
    'outputFile': 'output.pdf',
    'linearize': ''
}).run()
This is the right interface when porting an existing qpdf shell script to Python.
Limitations and known issues
Inherits qpdf’s repair limits. Anything qpdf cannot fix, pikepdf cannot fix either. The two share a backend; using pikepdf is not a workaround for a file that qpdf rejects.
Pdf.check() reports post-repair state. Because the file has already been repaired by the time check() runs, the method may show no problems on a file that qpdf --check would flag. This is correct behavior, not a bug — but it surprises people. Use the Job interface if you need the pre-repair view.
No content rendering. pikepdf can read the bytes of a content stream but does not interpret PDF drawing commands into rasterized output. For visual rendering, pair pikepdf with a separate rendering library.
No text extraction. Page content is accessible as raw bytes or as parsed content-stream tokens, but turning those into human-readable text requires a separate library that handles font encodings, ligatures, and layout.
Cannot overwrite the input file by default. A pikepdf Pdf object holds the source file open until closed. Save to a different path, then atomically replace the original if needed.
Build-from-source path requires C++ tooling. Binary wheels cover most modern platforms, but unusual architectures or older systems may force a source build with its own dependency chain.
Not thread-safe within a single Pdf object. Multiple threads sharing one Pdf instance will produce undefined behavior. Concurrent processing of separate files is fine; concurrent modification of a single file is not.
Alternatives
qpdf is the command-line equivalent and shares the same backend. For interactive one-off use, qpdf is faster to reach for. For Python integration, pikepdf is the natural choice. See the complete guide to qpdf.
pypdf is a pure-Python PDF library with no compiled dependencies. It is more portable than pikepdf and easier to install in restricted environments, but has weaker repair behavior for damaged files. Useful when the install constraints rule out pikepdf.
PyMuPDF wraps the MuPDF rendering engine and offers strong text extraction, image rendering, and a different repair surface. PyMuPDF is AGPL-licensed (with a commercial alternative), which has implications for redistribution that pikepdf’s MPL-2.0 license does not.
Ghostscript approaches PDF problems by re-rendering rather than restructuring. It sometimes recovers files that pikepdf and qpdf cannot, at the cost of losing form fields, annotations, signatures, and tagged accessibility structure. See the complete guide to Ghostscript for PDF recovery.
Apache PDFBox is the equivalent library for Java environments. It offers the same family of capabilities and the same underlying tradeoffs around what structural repair can and cannot achieve. See the complete guide to Apache PDFBox.
Last verified: April 2026