Use Python to Recognize Text, Compare Data From Scanned Documents

Background From Previous Post

This post builds on a previous article, Research Scanned Documents with Optical Character Recognition, which explained how to use Python to recognize text in png files using Optical Character Recognition.

Optical Character Recognition (OCR) is the ability of a computer to recognize words in images. OCR is particularly useful when dealing with scanned documents.

The previous post addressed how to install Python, the text editor Sublime Text (which can also run Python scripts), and Pip, the Python tool for installing libraries. Then, it explained how to install the OCR engine Tesseract and the Python libraries Pillow and Pytesseract. Finally, the post explained how different simple Python scripts can read text and then print it out in Sublime Text, write the text into a text file, or convert the png file into a pdf that has a layer of OCR over the text.

For example, the following script would take a png file named screen.png, read the text, and write it into a text file.

from PIL import Image
import pytesseract

f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

Moving Forward, Converting Data into a Python List

Now if we take our output text file, we can turn it into a list by making this addition:

from PIL import Image
import pytesseract

f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
with open('output.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

If we “print(rows)”, the new list named “rows” will contain the same names as the text file, with one line of the file per element.

If we wanted to write the list “rows” into a text file, we cannot simply write “f.write(rows)” because “write” requires a string, not a list. So if we want to create a text file the same as the one we already created, we have to write the script like this:


from PIL import Image
import pytesseract


#print(pytesseract.image_to_string(Image.open('screen.png')))

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

f = open("demofile4.txt", "w")
for row in rows:
	f.write(row)
f.close()

This addition will create a second text file, “demofile4.txt”, that is exactly the same as “demofile3.txt”. This simple task does not achieve anything by itself, but it shows how to work with the data. And if we have a list, we can compare its contents to another list. So with the names in a list, we can do things like see if any of the donation recipients also appear in the financial disclosure documents of politicians (meaning that a politician also works for the donation recipient).

Now, observe what happens if we run the script above and tell it to print the list in Sublime Text: the elements in the printed list end with “\n”.

Each “\n” represents a line break. This means that the data preserves the line breaks of the original document. This will be relevant in a moment.
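
As a small illustration of those line breaks, here is a sketch that uses an in-memory file (io.StringIO) in place of “demofile3.txt” so that it runs without any OCR. The two sample lines are made up for this example.

```python
import io

# Stand-in for open('demofile3.txt'): an in-memory file with two made-up lines.
txtfile = io.StringIO("First Example Org\nSecond Example Org\n")

rows = []
for row in txtfile:
    rows.append(row)  # each row keeps its trailing "\n"

# rstrip("\n") removes the line break from the end of each element.
clean_rows = [row.rstrip("\n") for row in rows]

print(rows)
print(clean_rows)
```

The first print shows the elements with “\n” still attached, and the second shows them with the line breaks removed.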

There are other measures that can be taken, such as removing all symbol characters or making all text lowercase to make them more easily compared. However, these measures will not be addressed at the moment.
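
For readers who do want to try those measures, one possible approach is sketched below. The function name “normalize” is made up for this example, and it only removes the ASCII symbols listed in Python's “string.punctuation”, so treat it as a starting point rather than a complete solution.

```python
import string

def normalize(name):
    """Lowercase a name, drop ASCII punctuation, and collapse extra spaces."""
    lowered = name.lower()
    no_symbols = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_symbols.split())

# Made-up sample name with stray spaces, symbols, and a line break.
print(normalize("  Example Org., Inc.\n"))
```

Running “normalize” on every element of both lists before comparing them would make two differently punctuated or capitalized versions of the same name count as a match.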

Converting the Data Directly to a List

But backing up a bit, it seems there should be a more efficient way to make a list without creating a file. However, note that if we try to put the data directly into the list (instead of putting it into a text file first), as in the script below, it does not work:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
data = (pytesseract.image_to_string(Image.open('screen.png')))
for line in data:
	rows.append(line)

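If we printed the list “rows” by typing “print(rows)”, we would get a list of each individual character from all of the names. This happens because looping over a string goes through it one character at a time, not one line at a time.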
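
The same behavior can be seen without any OCR. In the sketch below, the string assigned to “data” is a made-up stand-in for the text that pytesseract would return.

```python
data = "Ann\nBob"  # made-up stand-in for the OCR output string

rows = []
for line in data:      # looping over a string yields single characters
    rows.append(line)

print(rows)
```

The printed list contains seven one-character elements, including the “\n” between the two names.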

Here is a better script. The original data preserves line breaks, so we can assign the OCR output to a variable (named “data”) and separate the names using the “split()” method (as explained at geeksforgeeks.com), choosing to split on line breaks with “\n” (as explained at netinformation.com). We put that together as “data.split('\n')” and assign the result to the variable “t”. Finally, we go through each item in “t” (which has now been separated), call each item “i”, and append each “i” to “rows”.

This creates a list of every name.

from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string(Image.open('screen.png')))
t = data.split('\n')
for i in t:
	rows.append(i)

If we add in a “print(rows)”, we see that empty lines are identified as individual elements in the list, like this:

‘ ‘,

You may also notice that the character [ appears a lot in the list.

It is worth mentioning that if there are certain characters that you want removed from the list, for example a “[”, you can simply add the code below:

rows = [s.replace('[', '') for s in rows]

The resulting list no longer contains the “[” character.

This is particularly useful when you want to compare one list to another and you want cleaner data.
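
The empty elements mentioned earlier can be dropped in the same pass. The following is only a sketch: the sample list is made up to stand in for the result of “data.split('\n')”, and the new list name “cleaned” is an assumption for this example.

```python
# Made-up stand-in for the list produced by data.split('\n').
rows = ["[Example Org One", " ", "", "Example Org Two["]

cleaned = []
for s in rows:
    s = s.replace("[", "").strip()  # drop stray brackets, then surrounding spaces
    if s:                           # skip entries that are now empty
        cleaned.append(s)

print(cleaned)
```

Only the two names remain in “cleaned”, which avoids blank entries showing up as false matches when two lists are compared.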

Now we have a proper list of the names that we can manipulate and compare.

How to Compare Data from Different Scanned Documents

From here, we will use a common example that involves looking at which nonprofits a corporation donates to and checking if any of them are involved with local politicians. In this case we are looking at the pharmaceutical company AbbVie and a congressman who is involved in areas of potential interest to the company.

To explain why we would research this, it should be noted that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, about how corporate foundations are more likely to give charitable donations to a nonprofit if it is affiliated with a politician.

The study found that “a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” In addition, the researchers concluded that “a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study identified specific politicians of interest by checking if they were on any committees that were lobbied by a given company. If so, the researchers checked for a link via a nonprofit and donations. So in our example we will look into whether a specific company has given to a nonprofit affiliated with a politician of interest. The existence of such a link does not prove nefarious intent, but it is interesting and in some cases it can be an indicator of a deeper relationship.

This example is relevant because it requires researching files from two different databases (the IRS nonprofit filings database and the congressional financial disclosures database) with records that are often badly scanned documents. The lack of OCR on these files makes it difficult to find links.

We can use the png of donation recipients identified in the public tax filings of the AbbVie Foundation (affiliated with the pharmaceutical company with the same name, which has a presence in Rhode Island). We will name this file abbvie.png.

For the second file we will use a list of nonprofit organizations that have Rhode Island-based congressman Jim Langevin on their board to see if any of his nonprofits receive money from the aforementioned foundation. We will name this file langevin.png. The information for both files was scraped from records made available via ProPublica’s Nonprofit Explorer.

In this case we will use the following Python script to read both documents and compare the data.


from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string(Image.open('abbvie.png')))
t = data.split('\n')
for i in t:
	rows.append(i)

rows2 = []
data2 = (pytesseract.image_to_string(Image.open('langevin.png')))
t2 = data2.split('\n')
for i2 in t2:
	rows2.append(i2)

names = []
for name in rows:
	if name in rows2:
		names.append(name)
print(names)

Running the script identified that the name “Adoption Rhode Island” was in both files (along with a lot of empty lines shared by both documents).
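
To keep those empty lines out of the results, the comparison could skip blank entries and ignore differences in capitalization. The sketch below shows one way to do that. The function name “common_names” and the two sample lists are made up for this example; in practice the lists would be the “rows” and “rows2” produced by the script above.

```python
def common_names(rows, rows2):
    """Return non-blank names that appear in both lists, ignoring case."""
    # A set makes the "is this name in the other list" check fast.
    seen = {name.strip().lower() for name in rows2 if name.strip()}
    matches = []
    for name in rows:
        key = name.strip().lower()
        if key and key in seen and name.strip() not in matches:
            matches.append(name.strip())
    return matches

# Made-up sample lists standing in for the two OCR results.
rows = ["Example Org One", "", "Adoption Rhode Island", " "]
rows2 = ["", "adoption rhode island", "Example Org Three"]
print(common_names(rows, rows2))
```

With these sample lists, only “Adoption Rhode Island” is reported, and the blank entries shared by both lists are ignored.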

Compare Many Documents

Now that we have explained how to compare two documents, let’s look at how to compare many documents.

Let’s say you have 20 pdf files of a hundred pages each and another set of 20 pdf files, and you want to find common names (or anything else) that exist in both groups. First go to pdftopng.com.

Upload one set of 20 pdf documents and it will return a zip folder to download. Inside the folder is a set of png files; each page of each pdf file is converted into a separate png file. Next, open Command Prompt / Terminal and navigate to the folder (you may want to copy the files from the zip folder into a regular folder first). Once you have navigated there, type “dir /b > filenames.txt” in the Windows Command Prompt (or “ls > filenames.txt” in a macOS or Linux terminal) and it will create a text file in the folder with all of the file names. The list may also include “filenames.txt” itself, in which case delete that line. Copy the png files and the text file into the same folder as the Python script, and then change the script so that, instead of opening a single png file by name, it reads the file names from the text file.
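
As an alternative to the “dir /b” command, the list of file names could also be created with Python itself. This is only a sketch: the function name “write_filenames” is made up for this example, and the demo at the bottom uses empty placeholder files in a temporary folder rather than real scans.

```python
import tempfile
from pathlib import Path

def write_filenames(folder, out_name="filenames.txt"):
    """Write the names of all png files in `folder` to a text file, one per line."""
    folder = Path(folder)
    names = sorted(p.name for p in folder.glob("*.png"))
    (folder / out_name).write_text("\n".join(names) + "\n")
    return names

# Demo in a temporary folder with empty placeholder files (not real scans).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("policy2.png", "policy1.png", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_names = write_filenames(tmp)
    demo_text = (Path(tmp) / "filenames.txt").read_text()

print(demo_names)
```

Because the function only looks for files ending in “.png”, the text file it creates never lists itself.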

The script can then read the names of the png files from the text file and OCR each one in turn. You could also put the files in a separate folder and put the path to each file in the text file. This could be accomplished with a “Find and Replace” tool. For example, if every file name starts with the same word, like “policy” (which will be the case because that is how the website names the files), and the path to each file is something like “documents/research/policy1.png” (with the number changing for every file), you could tell the Find and Replace tool to find all instances of the word “policy” and replace them with “documents/research/policy”. This would leave the rest of each file name unchanged.
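
The scripts earlier in this post open one png file by name, so working through a list of files needs a loop. Below is one possible sketch, not a definitive version. The names “read_paths”, “ocr_image”, and “ocr_many” are invented for this example, and the “folder” argument does the same job as the Find and Replace step. The demo at the end uses a made-up list file and a fake OCR function so that it runs without any images.

```python
import os
import tempfile

def read_paths(list_file, folder=""):
    """Read one file name per line and prefix each with `folder`."""
    paths = []
    with open(list_file, "r") as f:
        for line in f:
            name = line.strip()
            if name:  # skip blank lines in the list
                paths.append(os.path.join(folder, name))
    return paths

def ocr_image(path):
    """OCR one png with pytesseract (imported here so the rest runs without it)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))

def ocr_many(paths, ocr=ocr_image):
    """Return one combined list of text lines from every image in `paths`."""
    rows = []
    for path in paths:
        rows.extend(ocr(path).split("\n"))
    return rows

# Demo with a made-up list file and a fake OCR function, so no images are needed.
with tempfile.TemporaryDirectory() as tmp:
    list_file = os.path.join(tmp, "filenames.txt")
    with open(list_file, "w") as f:
        f.write("policy1.png\npolicy2.png\n")
    demo_paths = read_paths(list_file, folder="documents/research")

demo_rows = ocr_many(demo_paths, ocr=lambda path: "text from " + path)
print(demo_rows)
```

With real scans, calling “ocr_many(read_paths('filenames.txt'))” would use pytesseract on every listed file and return a single list that can be compared in the same way as “rows” and “rows2”.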

Now repeat these steps for the second group of files, compare the two resulting lists in the same way as in the two-document example, and you are done.

That’s it.
