How to Extract PDF Tables in Python?

New Configuration (How To)

Situation

PDF (Portable Document Format) is a file format that captures all the elements of a printed document as an electronic image that you can view, navigate, print, or forward to someone else. PDF files can be created with Adobe Acrobat, among other tools.

Solution

Steps to follow

Suppose a PDF file contains the following table:

User_ID  Name   Occupation
1        David  Product Manager
2        Leo    IT Administrator
3        John   Lawyer

We want to read this table into our Python program.

Method 1: Using tabula-py

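tabula-py is a simple Python wrapper around tabula-java, which can read tables from a PDF. Because it runs tabula-java under the hood, a Java runtime must be installed on the machine. You can install tabula-py, together with the tabulate library used for printing, with the following commands: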

pip install tabula-py
pip install tabulate
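
Since tabula-py hands the actual extraction to tabula-java, it can help to confirm that a java executable is reachable on the PATH before going further. The following is only a minimal sketch using the standard library; the helper name java_available is made up for this illustration.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH.")
    else:
        print("Java not found; install a Java runtime before calling read_pdf().")
```

If the check reports that Java is missing, read_pdf() is likely to fail when it tries to start tabula-java.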

The functions used in the example are:

read_pdf(): reads the tables from the PDF file at the given path; by default in tabula-py 2.x it returns a list of pandas DataFrames, one per table found

tabulate(): formats tabular data as a plain-text table

 

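from tabula import read_pdf
from tabulate import tabulate

# read_pdf returns a list of DataFrames, one per table found (tabula-py 2.x default)
tables = read_pdf("abc.pdf", pages="all")  # path to the PDF file
for df in tables:
    # print each table with its column names as the header row
    print(tabulate(df, headers="keys", showindex=False))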
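
Once the tables are read, they can also be turned into plain Python records for further processing. The following is a sketch under assumptions, not a definitive implementation: the helper names read_tables and to_records are hypothetical, "abc.pdf" is the example file used above, and read_pdf() is assumed to return a list of pandas DataFrames, as it does by default in tabula-py 2.x. The lattice flag asks tabula to rely on the ruling lines of the table; whether it helps depends on how the PDF was produced.

```python
from typing import Any, Dict, List


def read_tables(path: str, pages: str = "all", lattice: bool = False) -> list:
    """Read the tables from the PDF at `path` (needs tabula-py and Java)."""
    # Imported lazily so to_records() can be used without tabula installed
    from tabula import read_pdf

    return read_pdf(path, pages=pages, lattice=lattice)


def to_records(tables: list) -> List[Dict[str, Any]]:
    """Flatten a list of DataFrames into one list of row dictionaries."""
    records: List[Dict[str, Any]] = []
    for df in tables:
        # DataFrame.to_dict(orient="records") gives one dict per table row
        records.extend(df.to_dict(orient="records"))
    return records


# Example usage (requires tabula-py, Java, and the file abc.pdf):
#     for row in to_records(read_tables("abc.pdf")):
#         print(row)
```

For the sample table, each record would be a dictionary keyed by the column names User_ID, Name, and Occupation, provided tabula detects the header row correctly.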

Solution type

Permanent
