Java: Data Science Made Easy

Overview of PDF files

The Portable Document Format (PDF) is a file format that is not tied to a specific platform or software application. A PDF document can hold formatted text and images. PDF is an open standard (ISO 32000), so documents can be created and read by many different tools.

A large number of documents are stored as PDF, which makes the format a valuable source of data. Several Java APIs provide access to PDF documents, including Apache PDFBox and iText. Techniques for extracting information from a PDF document are illustrated in Chapter 3, Data Cleaning.
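Before handing a file to a PDF library, it can help to confirm that it really is a PDF. The following is a minimal sketch using only the standard library (Java 11 or later). It relies on the fact that a conforming PDF file starts with the marker `%PDF-` followed by a version number. The class name `PdfSniffer` and its methods are hypothetical names chosen for this illustration and are not part of any PDF API. The check is deliberately strict and assumes the marker sits at the very first byte of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of PDFBox or any other library.
final class PdfSniffer {
    // A conforming PDF file begins with this marker, followed by a version such as 1.7.
    private static final String MARKER = "%PDF-";

    /** Returns the version declared in the header, or empty if the bytes do not start like a PDF. */
    static Optional<String> pdfVersion(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost or rejected.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (!text.startsWith(MARKER)) {
            return Optional.empty();
        }
        // Collect the digits and dots that follow the marker, for example "1.7".
        int end = MARKER.length();
        while (end < text.length()) {
            char c = text.charAt(end);
            if ((c < '0' || c > '9') && c != '.') {
                break;
            }
            end++;
        }
        String version = text.substring(MARKER.length(), end);
        return version.isEmpty() ? Optional.empty() : Optional.of(version);
    }

    /** Reads only the first few bytes of the file and inspects them. */
    static Optional<String> pdfVersion(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return pdfVersion(in.readNBytes(16));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PdfSniffer.java report.pdf notes.txt
        for (String arg : args) {
            Path p = Path.of(arg);
            String label = pdfVersion(p).map(v -> "PDF " + v).orElse("not a PDF");
            System.out.println(p + ": " + label);
        }
    }
}
```

A check like this only filters candidate files cheaply. Extracting the text itself still requires a PDF library, as covered in Chapter 3.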