Read PDF with Java

If you need to read tables from a PDF document, you can use Tabula (tabula-java), a Java library designed specifically for extracting tables from PDF files. It is built on top of PDFBox, so it works directly with a PDFBox PDDocument.

Here is an example of how you can use Tabula to read tables from a PDF:

First, add the following dependency to your Maven pom.xml file:

<dependency>
    <groupId>technology.tabula</groupId>
    <artifactId>tabula</artifactId>
    <version>1.0.3</version>
</dependency>

Now, you can use the following code to extract tables:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.BasicExtractionAlgorithm;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class PDFReader {
    public static void main(String[] args) {
        File file = new File("path_to_your_pdf_file.pdf");
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(document);
            System.out.println("Text extracted:\n" + text);
            
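            // ObjectExtractor wraps the open document and produces Tabula Page objects.
            ObjectExtractor oe = new ObjectExtractor(document);
            // BasicExtractionAlgorithm infers columns from text positions, so it does not need ruling lines.
            BasicExtractionAlgorithm bea = new BasicExtractionAlgorithm();

            // Page numbers are 1-based; change the argument to read a different page.
            Page page = oe.extract(1);
            List<Table> tables = bea.extract(page);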
            for (Table table : tables) {
                for (List<RectangularTextContainer> row : table.getRows()) {
                    for (RectangularTextContainer cell : row) {
                        System.out.print(cell.getText() + ";");
                    }
                    System.out.println();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In the code above, after extracting the plain text, we create an ObjectExtractor and a BasicExtractionAlgorithm. The call oe.extract(1) returns the first page (page numbers start at 1; change the number as needed), and bea.extract(page) returns the tables detected on that page. For each table, the code prints every row on its own line, with a semicolon after each cell's text.

Please replace "path_to_your_pdf_file.pdf" with the actual path to your PDF file.
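
The printing loop in the Tabula example writes a semicolon after every cell, so each line ends with a trailing separator, and a cell that itself contains a semicolon cannot be told apart from two cells. Below is a minimal sketch of one way to handle that, assuming you first collect each row's cell texts into a List<String>. RowFormatter, escapeCell and formatRow are hypothetical names used for illustration; they are not part of Tabula or PDFBox, and the quoting follows the usual CSV convention of doubling embedded quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tabula): formats already-extracted cell texts.
final class RowFormatter {

    private RowFormatter() {}

    // Quote a cell only when it contains the separator, a double quote, or a line break.
    static String escapeCell(String cell) {
        String value = (cell == null) ? "" : cell.trim();
        boolean needsQuotes = value.contains(";") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        // CSV-style escaping: double any embedded quotes, then wrap the cell in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join one row's cells with ';' and no trailing separator.
    static String formatRow(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escapeCell(cell));
        }
        return String.join(";", escaped);
    }

    public static void main(String[] args) {
        // Sample data standing in for the text of each cell in one table row.
        List<String> row = Arrays.asList("Item", "Price; EUR", "He said \"ok\"");
        System.out.println(formatRow(row));
    }
}
```

To use it with the Tabula loop, add cell.getText() for each cell of a row to a List<String> and print RowFormatter.formatRow(...) once per row instead of printing each cell separately.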