
Handling PDF files
Several APIs support extracting text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source library that allows Java programmers to work with PDF documents. In this section, we illustrate how to extract simple text from a PDF document. The Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.
The sample file used in this example, PDF File.pdf, contains the following text:
This is a simple PDF file. It consists of several bullets:
- Line 1
- Line 2
- Line 3
This is the end of the document.
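A try-with-resources block is used to close the document automatically and to catch any IOException thrown while reading it. The PDDocument class represents the PDF document being processed. Its static load method loads the PDF file specified by the File object:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// The document is closed automatically when the try block exits
try (PDDocument document = PDDocument.load(new File("PDF File.pdf"))) {
...
} catch (IOException e) {
// Handle exceptions
}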
Once the document is loaded, the getText method of the PDFTextStripper class extracts its text. Inside the try block, the text is extracted and then displayed as shown here:
PDFTextStripper stripper = new PDFTextStripper();
String documentText = stripper.getText(document);
System.out.println(documentText);
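The fragments above can be assembled into one small program. The following is a minimal sketch, assuming PDFBox 2.0.x is on the classpath; the class name PdfTextExtractor and the helper extractText are hypothetical names introduced only for illustration. In PDFBox 3.x, PDDocument.load was replaced by Loader.loadPDF, so this sketch would need adjusting there.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {

    // Hypothetical helper: loads the PDF, extracts all of its text,
    // and closes the document before returning.
    public static String extractText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) {
        // Use the first argument as the path, or fall back to the sample file name
        String path = args.length > 0 ? args[0] : "PDF File.pdf";
        try {
            System.out.println(extractText(new File(path)));
        } catch (IOException e) {
            System.err.println("Could not read " + path + ": " + e.getMessage());
        }
    }
}
```

Keeping the extraction in its own method means the document is always closed, even if getText throws, and the caller only deals with a String.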
The output of this example follows. Notice that the bullets are returned as question marks.
This is a simple PDF file. It consists of several bullets:
? Line 1
? Line 2
? Line 3
This is the end of the document.
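One possible cause of the question marks is that the console's character encoding cannot represent the bullet character, although a PDF font without a usable Unicode mapping can produce a similar result. The sketch below uses only the JDK and assumes the extracted string actually contains a Unicode bullet such as U+2022. The class BulletCleanup and the method normalizeBullets are hypothetical, and the list of bullet code points is an illustrative guess rather than something PDFBox defines.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class BulletCleanup {

    // Hypothetical helper: replaces a few common bullet code points
    // (assumed, not exhaustive) with an ASCII hyphen.
    public static String normalizeBullets(String text) {
        return text.replaceAll("[\u2022\u25CF\u25E6\uF0B7]", "-");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Stand-in for the text returned by PDFTextStripper.getText
        String documentText = "\u2022 Line 1\n\u2022 Line 2\n\u2022 Line 3\n";

        // Option 1: replace the bullets with a character any console can show
        System.out.print(normalizeBullets(documentText));

        // Option 2: keep the bullets and write the text as UTF-8;
        // whether they display correctly still depends on the terminal
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.print(documentText);
    }
}
```

If the question marks are already present in the extracted string itself, neither option helps, and the cause lies in how the PDF encodes its fonts.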
This has been a brief introduction to PDFBox. It is a powerful tool when we need to extract text from, or otherwise manipulate, PDF documents.