Thursday, 13 October 2016

How to do data scraping from PDF files using PHP?

How to do data scraping from PDF files using PHP?

Situations arise when you want to scrap data from PDF or want to search PDF files for matching text. Suppose you have website where users uploads PDF files and you want to give search functionality to user which searches all uploaded PDF file content for matching text and show all PDFs that contains matching search keywords.

Or you might have all London real estate properties details in PDF report file and you want to quickly grab scrape data from PDF reports then you might need PDF scraping library.

To integrate such functionality to web application is not similar to normal search functionality that we do with database search.

Here is the straight solution for this problem. This involves PDF Data Scraping to plain text and match search terms. I have written this post for the people who want to do PDF data scraping or want to make their PDF files to be Searchable.

We are going to use class named class.pdf2text.php which converts PDF text to into ASCII text, so the class is known for PDF extraction. This PHP class ignores anything in PDF that is not a text.

Let’s see very basic example (Taken from author’s file):


include "class.pdf2text.php";

$a = new PDF2Text();
$a->setFilename('web-scraping-service.pdf'); //grab the pdf file reside in folder where PHP files resides.

$a->decodePDF();//converts PDF content to text
echo $a->output();


“Web Scraping is a technique using which programmer can automate the copy paste manual work and save the time. This is PDF w eb scraping using PHP. We at Web Data Scraping offer Web Scraping and Data Scraping Service. Vist our website”

For more complex extraction you can apply regular expression on the text you get and can parse text that you want from PDF. But keep in mind this has limitation and do not work with all types of PDF extraction.

But the wonderful use of this class is to make utility that allow user to search inside PDF when they search on web search bar. Last but not least, You can also find many PDF scraping software available in market that can do complex scraping from PDF files.


No comments:

Post a Comment