Complete Guide to Apache PDFBox for PDF Repair

Apache PDFBox is a Java library for working with PDF documents, with a bundled set of command-line tools that ship as a single executable JAR file. It is the default choice for PDF work in Java environments, much as qpdf is in C++ and pikepdf is in Python. For repair work specifically, PDFBox is more useful as a programming library than as a command-line tool: it has no dedicated repair command, but its parser tolerates many structural errors, and a load-and-resave cycle through the Java API rebuilds a PDF’s structure for most common damage modes.

Before you go further, there is one thing about PDFBox that matters more than anything else in this guide: PDFBox 2.x and 3.x have completely different command-line syntaxes. Code and tutorials written for one will not work on the other. Both branches are actively maintained — 2.0.36 and 3.0.7 are the current releases at time of writing — and switching between them changes every command in this guide. The recipes below cover both syntaxes side by side.

This guide covers what PDFBox is best at, the licensing situation (uncomplicated, in PDFBox’s case), installation, the recipes that actually work for repair-adjacent tasks, and where PDFBox falls short of dedicated repair tools.

When to use Apache PDFBox

PDFBox is the right tool for:

PDF processing inside a Java application. If your stack is Java or JVM-based — Spring, Kotlin, Scala, Clojure — PDFBox integrates as a Maven or Gradle dependency without external process calls.

Repairing PDFs as a side effect of a load-and-resave cycle. Programmatically opening a damaged PDF with Loader.loadPDF() (3.x) or PDDocument.load() (2.x) and saving it back with document.save() reconstructs the file’s structure. This is the PDFBox equivalent of qpdf input.pdf output.pdf.

Inspecting the internal structure of a PDF. The bundled debug GUI shows the object tree, content streams, and metadata in a way that helps diagnose what is wrong with a damaged file.

Decompressing PDF streams for diagnosis. The decode command (3.x) or WriteDecodedDoc (2.x) produces a version of the PDF with all stream filters removed, which makes the file readable in a text editor and reveals what the content actually contains.

Text extraction with reasonable accuracy. PDFBox’s ExtractText (2.x) / export:text (3.x) handles most fonts and layouts adequately for downstream processing.

PDFBox is not the right tool for:

Quick command-line PDF repair. PDFBox has no --repair flag and no dedicated repair subcommand. The repair behavior is an implicit side effect of opening and re-saving programmatically. For one-off command-line repair, qpdf is much faster to reach for. See the complete guide to qpdf.

Rendering PDFs to images at production speed. PDFBox can rasterize, but it is slower than dedicated rendering libraries like MuPDF or Poppler.

Working with severely corrupted PDFs. PDFBox’s parser is lenient but not magical. Files that defeat qpdf’s reconstruction will usually defeat PDFBox too.

Avoiding a Java runtime. PDFBox requires Java. If you are not already running Java, installing a JDK just for PDFBox is heavier than the alternatives. qpdf ships as a small native binary; pikepdf installs as a Python wheel.

A note on licensing

PDFBox is licensed under the Apache License 2.0. This is a permissive license: you can use PDFBox in commercial products, redistribute it, modify it, and bundle it with proprietary software without releasing your own source code. Attribution and a copy of the license must be included with any redistribution.

Compared to Ghostscript’s AGPL-or-commercial dual license, PDFBox is the obvious choice when license terms matter for your project.

Installation

Command-line tool

The PDFBox command-line tools ship as an executable JAR file you download directly from the Apache project. There is no platform-specific installer.

Download pdfbox-app-3.0.7.jar (or the equivalent 2.x release) from pdfbox.apache.org/download.cgi. The file is around 10 MB and runs anywhere Java does.

You will need a Java runtime to execute it. Java 8 or newer works for both PDFBox 2.x and 3.x. Check whether one is already installed with:

java -version

If you do not have Java installed, the OpenJDK distributions from Adoptium (formerly AdoptOpenJDK) are the cleanest choice. Download an installer for your platform from adoptium.net.

Once both are in place, verify PDFBox runs:

java -jar pdfbox-app-3.0.7.jar --help

This lists the available subcommands. The 2.x version prints a different list — see the recipes below for the syntax differences.

Java library

For programmatic use, add PDFBox to your build’s dependencies.

Maven:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.7</version>
</dependency>

Gradle:

implementation 'org.apache.pdfbox:pdfbox:3.0.7'

For 2.x compatibility, substitute the version number; the artifact and group are the same.

Common recipes

All command-line examples assume pdfbox-app-3.0.7.jar is in the current directory. The 2.x equivalents are shown alongside where the syntax differs.

Repair a damaged PDF programmatically

There is no command-line repair invocation. The closest is to open the PDF and immediately save it back under a new name, which forces PDFBox to rewrite the structure. In Java, with the 3.x API:

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import java.io.File;

public class RepairPdf {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = Loader.loadPDF(new File("input.pdf"))) {
            document.save("output.pdf");
        }
    }
}

PDFBox’s parser is tolerant of common structural errors. The Loader.loadPDF() call (3.x) or PDDocument.load() (2.x) silently repairs many issues during the read, and save() writes a clean output file.

For 2.x, the equivalent is:

import org.apache.pdfbox.pdmodel.PDDocument;
import java.io.File;

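// PDFBox 2.x: PDDocument.load() takes the place of Loader.loadPDF().
// try-with-resources closes the document even if save() throws.
try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
    document.save("output.pdf");
}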
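A re-save that completes without an exception is not proof that the content survived. The sketch below, written against the 3.x API, reopens the rewritten file and compares page counts as a cheap sanity check. The class name, method name, and the choice of page count as the check are illustrative assumptions, not part of PDFBox.

```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.IOException;

public class RepairAndVerify {

    /**
     * Re-saves input to output (PDFBox 3.x API) and returns the page count,
     * after confirming the rewritten file reports the same number of pages.
     */
    public static int repair(File input, File output) throws IOException {
        int pagesBefore;
        try (PDDocument document = Loader.loadPDF(input)) {
            pagesBefore = document.getNumberOfPages();
            document.save(output); // write to a new file, never over the original
        }
        // Reopen the result: loading cleanly with an unchanged page count is a
        // reasonable (not conclusive) sign that the rewrite worked.
        try (PDDocument repaired = Loader.loadPDF(output)) {
            int pagesAfter = repaired.getNumberOfPages();
            if (pagesAfter != pagesBefore) {
                throw new IOException("page count changed: " + pagesBefore + " -> " + pagesAfter);
            }
            return pagesAfter;
        }
    }

    public static void main(String[] args) throws IOException {
        int pages = repair(new File("input.pdf"), new File("output.pdf"));
        System.out.println("Rewrote " + pages + " page(s) to output.pdf");
    }
}
```

A matching page count says nothing about whether each page still renders correctly, so open the output in a viewer before discarding the original.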

Repair via split-and-merge from the command line

When you must work from the command line and qpdf is not available, splitting a PDF into single pages and merging them back together forces PDFBox to rewrite the structure as a side effect. This is a workaround, not an officially blessed repair pattern, but it works for many damaged files.

In PDFBox 3.x:

java -jar pdfbox-app-3.0.7.jar split -i=input.pdf -split=1
java -jar pdfbox-app-3.0.7.jar merge -i=input-1.pdf -i=input-2.pdf -i=input-3.pdf -o=repaired.pdf

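The -split=1 value produces one file per page, named after the input file with a page-number suffix (input-1.pdf, input-2.pdf, and so on). You then merge them all back, listing every page file in order. This is clearly inefficient compared to qpdf, and you have to know how many pages the file has. Use this only when qpdf and pikepdf are not options.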

In PDFBox 2.x, the syntax differs:

java -jar pdfbox-app-2.0.36.jar PDFSplit -split 1 input.pdf
java -jar pdfbox-app-2.0.36.jar PDFMerger input-1.pdf input-2.pdf input-3.pdf repaired.pdf
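Typing out one argument per page gets tedious past a handful of pages. As a rough sketch using only the JDK, the helper below scans a directory for the split output and assembles the 3.x merge command in page order. The class and method names are hypothetical, and it assumes the split files follow the input-1.pdf naming shown above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MergeCommandBuilder {

    /** Builds a 3.x merge command covering every "<prefix>-<n>.pdf" in dir, in page order. */
    public static List<String> build(Path dir, String prefix, String jar, String output)
            throws IOException {
        Pattern name = Pattern.compile(Pattern.quote(prefix) + "-(\\d+)\\.pdf");
        List<Path> pages = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path candidate : stream) {
                if (name.matcher(candidate.getFileName().toString()).matches()) {
                    pages.add(candidate);
                }
            }
        }
        // Numeric sort: input-10.pdf must follow input-9.pdf, not input-1.pdf.
        pages.sort(Comparator.comparingInt((Path p) -> pageNumber(name, p)));

        List<String> command = new ArrayList<>(Arrays.asList("java", "-jar", jar, "merge"));
        for (Path page : pages) {
            command.add("-i=" + page);
        }
        command.add("-o=" + output);
        return command;
    }

    private static int pageNumber(Pattern name, Path file) {
        Matcher m = name.matcher(file.getFileName().toString());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a split output: " + file);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) throws IOException {
        List<String> command =
                build(Paths.get("."), "input", "pdfbox-app-3.0.7.jar", "repaired.pdf");
        System.out.println(String.join(" ", command));
        // To run it rather than print it:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

The numeric sort matters: a plain alphabetical listing would place page 10 before page 2 and silently reorder the merged document. For 2.x, the same file list would be passed as positional arguments to PDFMerger instead.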

Inspect a PDF’s internal structure

The debug command (3.x) or PDFDebugger (2.x) opens a Swing GUI that lets you browse the PDF object tree, view content streams, and examine fonts and metadata:

# 3.x
java -jar pdfbox-app-3.0.7.jar debug input.pdf

# 2.x
java -jar pdfbox-app-2.0.36.jar PDFDebugger input.pdf

This is the most useful PDFBox command for diagnosing why a PDF is broken. The object tree on the left shows every object in the file; the panel on the right shows what each object contains. If the file has a damaged xref, the debugger will show partial structure or report errors during the load — useful information that the command-line tools don’t surface.
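Before opening the debugger, it can help to know whether the file is simply truncated. The following JDK-only sketch looks for the %PDF- header near the start of the file and the startxref keyword and %%EOF marker near the end. The class name and the 1024-byte search window are assumptions made for illustration, and this is a coarse triage step, not a validator.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfQuickCheck {
    private static final int WINDOW = 1024; // how far from each end of the file to look

    /** True if the "%PDF-" header appears near the start of the file. */
    public static boolean hasHeader(byte[] data) {
        return head(data).contains("%PDF-");
    }

    /** True if the "startxref" keyword (pointer to the xref) appears near the end. */
    public static boolean hasStartXref(byte[] data) {
        return tail(data).contains("startxref");
    }

    /** True if the "%%EOF" end-of-file marker appears near the end of the file. */
    public static boolean hasEofMarker(byte[] data) {
        return tail(data).contains("%%EOF");
    }

    private static String head(byte[] data) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        return new String(data, 0, Math.min(WINDOW, data.length), StandardCharsets.ISO_8859_1);
    }

    private static String tail(byte[] data) {
        int from = Math.max(0, data.length - WINDOW);
        return new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file into memory; fine for a sketch, wasteful for huge files.
        byte[] data = Files.readAllBytes(Paths.get(args.length > 0 ? args[0] : "input.pdf"));
        System.out.println("header:    " + hasHeader(data));
        System.out.println("startxref: " + hasStartXref(data));
        System.out.println("%%EOF:     " + hasEofMarker(data));
    }
}
```

A missing %%EOF or startxref usually points to a truncated download or an interrupted write, which tells you to expect xref errors when the file is loaded.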

Decompress PDF streams for inspection

PDF content streams are usually compressed, which makes the file unreadable in a text editor. The decode command (3.x) or WriteDecodedDoc (2.x) writes a version with all streams uncompressed:

# 3.x
java -jar pdfbox-app-3.0.7.jar decode -i=input.pdf -o=decoded.pdf

# 2.x
java -jar pdfbox-app-2.0.36.jar WriteDecodedDoc input.pdf decoded.pdf

The decoded file is much larger than the original but is readable in any text editor. This is invaluable when you need to understand exactly what is in a damaged file’s content streams.

Extract text

# 3.x
java -jar pdfbox-app-3.0.7.jar export:text -i=input.pdf -o=output.txt

# 2.x
java -jar pdfbox-app-2.0.36.jar ExtractText input.pdf output.txt

Useful as a damage-assessment step: if PDFBox can extract text, the content streams are readable even if the file’s higher-level structure is broken.
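The same assessment can be done page by page through the API, which narrows down where the damage sits. This is a sketch against the 3.x API using PDFTextStripper; the class name, the method name, and the convention of recording -1 for a page that fails are assumptions chosen for illustration.

```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PageTextProbe {

    /** Returns the extracted character count per page, or -1 where extraction failed. */
    public static int[] charactersPerPage(File input) throws IOException {
        try (PDDocument document = Loader.loadPDF(input)) {
            PDFTextStripper stripper = new PDFTextStripper();
            int[] counts = new int[document.getNumberOfPages()];
            for (int i = 0; i < counts.length; i++) {
                stripper.setStartPage(i + 1); // page numbers are 1-based
                stripper.setEndPage(i + 1);
                try {
                    counts[i] = stripper.getText(document).trim().length();
                } catch (IOException e) {
                    counts[i] = -1; // this page's content stream could not be read
                }
            }
            return counts;
        }
    }

    public static void main(String[] args) throws IOException {
        int[] counts = charactersPerPage(new File(args.length > 0 ? args[0] : "input.pdf"));
        for (int i = 0; i < counts.length; i++) {
            String result = counts[i] < 0 ? "unreadable" : counts[i] + " characters";
            System.out.println("page " + (i + 1) + ": " + result);
        }
    }
}
```

A count of zero is not necessarily damage: scanned pages contain images rather than text, so treat zeros as a prompt to look at the page, not as a verdict.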

Extract embedded images

# 3.x
java -jar pdfbox-app-3.0.7.jar export:images -i=input.pdf

# 2.x
java -jar pdfbox-app-2.0.36.jar ExtractImages input.pdf

Image files are written next to the input file (here, the current directory), named after the input file with sequential suffixes.

Encrypt and decrypt

# 3.x — encrypt
java -jar pdfbox-app-3.0.7.jar encrypt -i=input.pdf -O=owner-password -U=user-password

# 3.x — decrypt
java -jar pdfbox-app-3.0.7.jar decrypt -i=encrypted.pdf -password=your-password

# 2.x — encrypt
java -jar pdfbox-app-2.0.36.jar Encrypt -O owner-password -U user-password input.pdf

# 2.x — decrypt
java -jar pdfbox-app-2.0.36.jar Decrypt -password your-password encrypted.pdf

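PDFBox supports the RC4 and AES encryption schemes defined in the PDF specification. When no output file is given, as in the examples above, both commands write the result back over the input file, so work on a copy. As with every other PDF tool, decryption requires the password; PDFBox does not crack lost passwords.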
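Inside a Java application, the equivalent of the decrypt command is to load the file with its password and save it with security removed. A minimal sketch against the 3.x API follows; the class and method names are illustrative, and the file names and password are placeholders.

```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

import java.io.File;
import java.io.IOException;

public class RemoveEncryption {

    /** Loads an encrypted PDF with the given password and writes an unencrypted copy. */
    public static void decrypt(File input, String password, File output) throws IOException {
        try (PDDocument document = Loader.loadPDF(input, password)) {
            document.setAllSecurityToBeRemoved(true); // save without encryption
            document.save(output);                    // separate file; the input is untouched
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            decrypt(new File("encrypted.pdf"), "your-password", new File("decrypted.pdf"));
            System.out.println("Wrote decrypted.pdf");
        } catch (InvalidPasswordException e) {
            // Thrown by loadPDF when the password does not open the document.
            System.err.println("Wrong password: " + e.getMessage());
        }
    }
}
```

Writing to a separate output file keeps the encrypted original intact, which matters when the file is also damaged and the first attempt may not succeed.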

Limitations and known issues

Two CLI syntaxes in active use. Tutorials, Stack Overflow answers, and forum posts that predate the PDFBox 3.0 release use the 2.x command names, which do not work in 3.x; likewise, 3.x subcommands do not work in 2.x. Always check which version a given recipe targets before running it.
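When translating an older recipe, it can help to have the renames in one place. The lookup below is a small sketch that collects only the command pairs used in this guide; the class and method names are invented for illustration. It covers names only: the argument syntax differs as well, with 2.x taking positional arguments where the 3.x recipes mostly use -i= style options.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class CommandNames {
    // 2.x tool name -> 3.x subcommand, limited to the commands covered in this guide.
    private static final Map<String, String> TWO_TO_THREE;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("PDFSplit", "split");
        m.put("PDFMerger", "merge");
        m.put("PDFDebugger", "debug");
        m.put("WriteDecodedDoc", "decode");
        m.put("ExtractText", "export:text");
        m.put("ExtractImages", "export:images");
        m.put("Encrypt", "encrypt");
        m.put("Decrypt", "decrypt");
        TWO_TO_THREE = Collections.unmodifiableMap(m);
    }

    /** Returns the 3.x subcommand for a 2.x tool name, or throws if none is recorded. */
    public static String toThreeX(String twoXName) {
        String name = TWO_TO_THREE.get(twoXName);
        if (name == null) {
            throw new IllegalArgumentException("no mapping recorded for " + twoXName);
        }
        return name;
    }

    public static void main(String[] args) {
        // Print the whole table in insertion order.
        TWO_TO_THREE.forEach((oldName, newName) ->
                System.out.println(oldName + " -> " + newName));
    }
}
```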

No command-line repair flag. Repair is a side effect of programmatic load-and-save, or of split-and-merge from the command line. PDFBox does not advertise itself as a repair tool, and the workflows feel like workarounds because they are.

Java startup overhead. Each invocation of the PDFBox CLI starts a JVM, which adds noticeable latency compared to native tools like qpdf. For batch processing many files from the command line, this adds up. Programmatic use within a long-lived JVM avoids the problem.
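To illustrate the long-lived JVM approach, the sketch below re-saves every PDF in a directory within a single process, so the JVM starts once rather than once per file. It targets the 3.x API; the class name, method name, and directory names are placeholders, and catching every exception per file is a deliberate simplification so that one bad file does not stop the batch.

```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BatchRepair {

    /** Re-saves every *.pdf in inputDir into outputDir; returns how many succeeded. */
    public static int repairAll(Path inputDir, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        int repaired = 0;
        // The glob matches file names only and does not descend into subdirectories.
        try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(inputDir, "*.pdf")) {
            for (Path pdf : pdfs) {
                File target = outputDir.resolve(pdf.getFileName()).toFile();
                try (PDDocument document = Loader.loadPDF(pdf.toFile())) {
                    document.save(target);
                    repaired++;
                } catch (Exception e) {
                    // One unreadable file should not stop the batch; report and continue.
                    System.err.println("skipped " + pdf + ": " + e);
                }
            }
        }
        return repaired;
    }

    public static void main(String[] args) throws IOException {
        int count = repairAll(Paths.get("damaged"), Paths.get("repaired"));
        System.out.println("Re-saved " + count + " file(s)");
    }
}
```

Keep the output directory separate from the input directory so originals are never overwritten, and check the skipped files with qpdf, since it sometimes opens files that PDFBox refuses.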

Repair behavior is less aggressive than qpdf. PDFBox’s parser handles many common errors, but qpdf’s xref reconstruction is more thorough for the specific case of damaged cross-reference tables. Files that PDFBox refuses sometimes open in qpdf.

No native binaries. PDFBox is Java only. If you cannot install a JRE in your environment, PDFBox is not an option.

API changes between major versions. Code written against PDFBox 2.x will not compile against 3.x without changes. The most visible change is PDDocument.load() becoming Loader.loadPDF(), but many other methods have moved or changed signatures. Migration guides exist in the PDFBox documentation.

Slower than dedicated rendering libraries. PDFBox can rasterize PDFs to images, but for high-volume rendering, MuPDF or Poppler are substantially faster.

Alternatives

qpdf is the right first choice for command-line PDF repair. Native binary, fast, no JVM dependency, dedicated repair behavior. See the complete guide to qpdf.

pikepdf is qpdf’s Python wrapper and the natural choice for Python-based PDF processing. Same backend as qpdf, Pythonic API, MPL-2.0 licensed. See the complete guide to pikepdf.

Ghostscript uses re-rendering rather than structural repair and sometimes recovers files that PDFBox and qpdf cannot. The cost is the loss of form fields, signatures, annotations, and tagged structure. AGPL-licensed. See the complete guide to Ghostscript for PDF recovery.

iText is another Java PDF library, dual-licensed under AGPL and a commercial license. More feature-rich than PDFBox in some areas (forms, signatures, accessibility) but the licensing is restrictive for proprietary use. PDFBox’s Apache 2.0 license is the cleaner choice when license terms matter.

OpenPDF is a fork of an older version of iText, dual-licensed under the LGPL and MPL. Functional but less actively maintained than PDFBox; consider it only when you need iText-style features under a more permissive license than current iText offers.

Last verified: April 2026