Use Python to Recognize Text, Compare Data From Scanned Documents

Background From Previous Post

This post follows on from a previous article, Research Scanned Documents with Optical Character Recognition, which explained how to use Python to recognize text in png files using Optical Character Recognition.

Optical Character Recognition (OCR) is basically the ability for a computer to recognize words in photos. OCR is particularly useful when dealing with scanned documents.

The previous post addressed how to install Python, the text editor Sublime Text, and Pip, the Python tool for installing libraries. Then, it explained how to install Pillow, Tesseract, and Pytesseract. Finally, the post explained how different simple Python scripts can read text and then print it out in Sublime Text, create a text file and print the text into it, or convert the png file into a pdf that has a layer of OCR over the text.

For example the following script would take a png file named screen.png, read the text and print it out into a text file.

from PIL import Image
import pytesseract

f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

Moving Forward, Converting Data into a Python List

Now if we take our output text file, we can turn it into a list by making this addition:

from PIL import Image
import pytesseract

f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
with open('output.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

With the new list named “rows”, if we “print(rows)”, we will see the same list of names.

If we want to print the list “rows” into a text file, we cannot simply write “f.write(rows)” because that function requires a string. So if we want to create a text file the same as the one we already created, we have to write the script like this:


from PIL import Image
import pytesseract


#print(pytesseract.image_to_string(Image.open('screen.png')))

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

f = open("demofile4.txt", "w")
for row in rows:
	f.write(row)
f.close()

This addition will create a second text file, “demofile4.txt”, that is exactly the same as “demofile3.txt”. This simple task does not achieve anything new, but it shows how to work with the data. And once we have a list, we can compare its contents to another list. So with the names in a list, we can do things like check whether any of the donation recipients also appear in the financial disclosure documents of politicians (meaning that a politician also works for the donation recipient).

Now, observe that if we run the script above and tell it to print the list in Sublime Text, this is the result (script is on top, results are on bottom):

Each “\n” represents a line break. This means that the data preserves the line breaks. This will be relevant in a moment.

There are other measures that can be taken, such as removing all symbol characters or making all text lowercase to make them more easily compared. However, these measures will not be addressed at the moment.

Converting the Data Directly to a List

But backing up a bit, it seems there should be a more efficient way to make a list without creating a file. However, note that if we try to put the data directly into the list (instead of putting it into a text file first), like in the script below, it does not work:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

rows = []
data = (pytesseract.image_to_string(Image.open('screen.png')))
for line in data:
	rows.append(line)

If we printed the list “rows” by typing “print(rows)”, we would get a list of each individual character from all of the names.

Here is a better script below. The original data preserves line breaks, so we can import the data, assign it to a variable (named “data”), and separate the names using the “.split()” function (as explained here in geeksforgeeks.com), choosing to split on the line breaks with “\n” (as explained here in netinformation.com). We put that together as “data.split(‘\n’)” and assign the result to the variable “t”. Finally, we build the list by looping over each item in “t” (each separated name, which we call “i”) and appending each “i” to “rows”.

This creates a list of every name.

from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string(Image.open('screen.png')))
t = data.split('\n')
for i in t:
	rows.append(i)

If we add in a “print(rows)”, the results look like this:

We see that the empty lines are identified as individual elements in the list, like this:

‘ ‘,

You may also notice that the character [ appears a lot in the list.

It is worth mentioning that if there are certain characters that you want removed from the list, like for example a “[“, you can simply add the code below:

# remove every "[" character from each entry in the list
rows = [s.replace('[', '') for s in rows]

The resulting list is clean

This is particularly useful when you want to compare one list to another and you want cleaner data.
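
For example, here is a sketch of the kind of clean-up pass you might run before comparing lists; the sample entries are made up, and “rows” stands in for the OCR output split into lines as in the scripts above:

# A possible clean-up pass before comparing lists.
# The sample entries below are placeholders standing in for OCR output split into lines.
rows = ['Adoption Rhode Island ', '', '[Some Charity', ' ']

cleaned = []
for s in rows:
    s = s.replace('[', '').strip()   # drop stray "[" characters and surrounding whitespace
    if s:                            # skip entries that are now empty
        cleaned.append(s.lower())    # lowercase so capitalization differences do not block a match

print(cleaned)  # ['adoption rhode island', 'some charity']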

Now you have a proper list of the names that we can manipulate and compare.

How to Compare Data from Different Scanned Documents

From here, we will use a common example: looking at which nonprofits a corporation donates to and checking whether any of them are involved with local politicians. In this case we are looking at the pharmaceutical company AbbVie and a congressman who is involved in areas of potential interest to the company.

To explain why we would research this, it should be noted that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, about how corporate foundations are more likely to give charitable donations to a nonprofit if it is affiliated with a politician.

The study found that “a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” In addition, the researchers concluded that “a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study identified specific politicians of interest by checking if they were on any committees that were lobbied by a given company. If so, the researchers would check for a link via a nonprofit and donations. So in our example we will look into whether a specific company has given to a nonprofit affiliated with a politician of interest. The existence of such a link does not prove nefarious intent, but it is interesting and in some cases it can be an indicator of a deeper relationship.

This example is relevant because it requires researching files from two different databases (the IRS nonprofit filings database and the congressional financial disclosures database) with records that are often badly scanned documents. The lack of OCR on these files makes it difficult to find links.

We can use the png of donation recipients identified in the public tax filings of the AbbVie Foundation (affiliated with the pharmaceutical company with the same name, which has a presence in Rhode Island). We will name this file abbvie.png.

donation recipients

For the second file we will use a list of nonprofit organizations that have Rhode Island-based congressman Jim Langevin on their board, to see if any of his nonprofits receive money from the aforementioned foundation. The information for both files was scraped from records made available via ProPublica's Nonprofit Explorer.

In this case we will use the following Python script to read both documents and compare the data.


from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string(Image.open('abbvie.png')))
t = data.split('\n')
for i in t:
	rows.append(i)

rows2 = []
data2 = (pytesseract.image_to_string(Image.open('langevin.png')))
t2 = data2.split('\n')
for i2 in t2:
	rows2.append(i2)

names = []
for name in rows:
	if name in rows2:
		names.append(name)
print(names)

Below you can see that running the script identified that the name “Adoption Rhode Island” appears in both files (along with a lot of empty lines shared by both docs).
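
If you want to avoid those empty matches, one option is to skip blank entries when building the comparison. Here is a minimal sketch; the two sample lists are placeholders standing in for the “rows” and “rows2” lists built by the script above:

# Sketch: the same comparison, but skipping blank lines so only real names are reported.
# rows and rows2 are placeholder lists standing in for the two OCR'd documents.
rows = ['Adoption Rhode Island', '', ' ', 'Some Other Charity']
rows2 = ['Adoption Rhode Island', '', ' ', 'A Different Nonprofit']

names = []
for name in rows:
    if name.strip() and name in rows2:   # name.strip() is falsy for empty or whitespace-only lines
        names.append(name)

print(names)  # ['Adoption Rhode Island']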

Compare Many Documents

Now that we have explained how to compare two documents, let’s look at how to compare many documents.

Let’s say you have 20 pdf files of a hundred pages each, you have another set of 20 pdf files, and you want to find common names or anything else that exists in both groups. First go to pdftopng.com.


Upload one set of 20 pdf documents and it will return a zip folder to download. Inside the folder is a set of png files: each page of each pdf file will have been converted into a separate png file. Next, open Command Prompt / Terminal and navigate to the folder (you may consider copying the files from the zip folder to a regular folder first). Once you have navigated there, type “dir /b > filenames.txt” in the Command Prompt and it will create a text file in the folder with all of the file names. Copy those png files into the same folder as the Python script, and then replace the name of the png file in the Python script with the name of the text file (which should also be in the same folder as the Python script now).

The script will read the file names of the png files from the text file and then open each png file to OCR it; a sketch of such a script follows below. You could also put the files in a separate folder and put the path to each file in the text file. This can be accomplished with the “Find and Replace” tool. For example, if every file starts with the same word (which will be the case, because that is how the website names the files for you), like “policy”, and the path to each file is something like “documents/research/policy1.png” (with the number changing for every file), you could tell the Find and Replace tool to find all instances of the word “policy” and replace them with “documents/research/policy”. This would leave the rest of the file names unchanged.
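
Here is a rough sketch of what that looping script could look like. It assumes the list of file names was saved as filenames.txt (one name or path per line, as produced by the dir command above); adjust the names and paths to match your own folders.

# Sketch: OCR every png listed in filenames.txt and collect all of the lines in one list.
# filenames.txt is assumed to contain one png file name (or path) per line.
from PIL import Image
import pytesseract

all_rows = []
with open('filenames.txt', 'r') as txtfile:
    for line in txtfile:
        png_name = line.strip()
        if not png_name.endswith('.png'):   # skip blank lines and anything that is not a png
            continue
        data = pytesseract.image_to_string(Image.open(png_name))
        for name in data.split('\n'):
            all_rows.append(name)

# all_rows can now be compared against the list built from the second group of files,
# the same way rows and rows2 were compared earlier.
print(len(all_rows))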

Now repeat these steps for the second group of files and you are done.

That’s it.

Research Scanned Documents with Optical Character Recognition

Optical Character Recognition (OCR) means that your computer can read words in photos and scanned documents. When you have a document or documents where you want to copy and paste the words into a search engine or you want to do a word search for a specific name in the document, OCR will make that possible.

For example, two previous posts described how you can research nonprofits’ tax records or politicians’ financial disclosures in government databases, but the records are all in badly scanned pdf documents that do not recognize the words. Therefore, if you want to find out whether Exxon Mobil’s foundation donated to a Senator’s personal nonprofit, you potentially have to search through a lot of pages of tax records to look for the name.

This article will explain how to use Python to add OCR to files.

Get Started on Installations

You will need Python, Sublime Text, and Pip for the basics.

To start with, if you are completely new you can download Python from https://www.python.org/downloads/. Then you can download Sublime Text, a text editor for writing and running Python scripts, at https://www.sublimetext.com/3.

Now, access your Command Line (if you are using Windows) or Terminal (if you are using Mac).

Pip is included in Python but you can see guidance for installation and updating at https://pip.pypa.io/en/stable/installing/.

If you get a message saying you need to upgrade pip, you can do so in the Command Line by typing: python3 -m pip install --upgrade pip

The next group of installations is Pillow, Tesseract, and Pytesseract.

Next install Pillow by going here for instructions.

(Install Pillow on Windows by typing: python3 -m pip install --upgrade Pillow)

from – https://pillow.readthedocs.io/en/stable/installation.html

Tesseract

There is the command-line program Tesseract and its third-party Python “wrapper” named Pytesseract (in practice, a wrapper is a library that lets your Python scripts call the underlying program).

Click here for a basic overview of Tesseract and its installation and usage; see the screenshot below. Following the screenshot there is a link to the more detailed documentation on installation.

Go here for the installation documentation.

Note in the window below, in the second paragraph, that for Windows, if you want to OCR different languages you need to click on the link that says “download the appropriate training data”, which brings you to a webpage that offers a different download for each language.

Tesseract installers for Windows are available here.

The Windows installer should identify the location where Tesseract is installed. This is necessary to know so that “you can put it in your PATH” (a phrase that is often used but rarely explained with Python).

You have now installed Tesseract, and if you are using Windows it will default to the location above.

How to Add Something to the PATH

Next, tesseract must be “added to the PATH”, meaning that the tesseract directory must be added to the PATH environment variable.

The following instructions are for Windows.

1- go to System Properties

2- click on Environment Variables

3 – a box will appear that is titled Environment Variables; within it, find where it says System Variables, and underneath that there is a list of variables; choose the one titled PATH

4 – a new window appears; click on one of the empty lines, then click on Browse and find tesseract, click on the tesseract folder (or “directory”), then click OK until you have closed every box

Now, when you open the Command Prompt, you should be able to run the tesseract command from any folder.

Pytesseract

Next, install the Python wrapper library Pytesseract, which uses your existing Tesseract installation to read image files and output strings and objects that can be used in Python scripts.

You can run “pip install pytesseract”, or go here, download the file, and then run “python setup.py install”.
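
Once both are installed, a quick way to confirm that Python can find Tesseract is a check like the sketch below. If Tesseract is not on your PATH, pytesseract also lets you point directly at the executable; the Windows path shown is only an example of a common install location, not necessarily yours.

import pytesseract

# If tesseract is not on your PATH, you can point pytesseract at the executable directly.
# The path below is only an example of a typical Windows install location.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# If everything is set up correctly, this prints the installed Tesseract version.
print(pytesseract.get_tesseract_version())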

Command Line / Terminal command for OCR

The Python capabilities described here require that you have a png file, not a pdf. Most pdf conversion tools are not friendly to Windows, but the methods here will work just as well on Mac.

How to get a png file? Most snipping and screenshot tools will automatically create a png file of the image you are capturing. If you are working with a pdf file and only need one page, you can snip an image of that page.

Here is a very basic OCR Python script. The script must be saved to the same folder/directory as the png file that you want to read. If not, then you would put the path instead of just the name. So if the Python script was saved to a folder named “first” but the png file was in a folder named “second” located inside that same folder named “first”, then instead of (‘yourfilename.png’) you would type (‘second/yourfilename.png’). This is otherwise known as the path to your file.

from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('yourfilename.png')))

There are many tools for running Python scripts, but for this post we suggested using Sublime Text. To run a script in Sublime Text you must save it (File, Save or Ctrl-S) and then take the unintuitive step of choosing to “build” it (Tools, Build or Ctrl-B).

For this example I used a png file of a list of donation recipients from the Target Foundation’s tax records that looks like this

The aforementioned python script will produce a list of words and phrases printed out into the python interpreter and should look like this:

If you want to print that same list into a txt file, you could go to the Command Line / Terminal, navigate to the folder with the png file, and then type “tesseract”, space, the png file name, space, then the name you want to give the text file with your list, like so: tesseract yourfile.png textfile (tesseract adds the .txt extension itself, producing textfile.txt).

The following script will take your original png file and convert it into a pdf file that has an OCR layer over the words:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

pdf = pytesseract.image_to_pdf_or_hocr('pngtarget.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

The resulting pdf looks like this (with words highlighted):

This method was equally successful when used on the full page of the original document that looked like this:

How to create and print a list to a TXT file

According to the instructions from w3schools.com, if we want to create a new text file:

So based on this information, if we want to print our items into a text file, we use this script:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

The resulting text file looks like this:

That’s it.

PACs, Foundations, Press Releases:

How to Research Corporate Efforts at Influence

The Dirt Digger’s Guide identifies political contributions, nonprofit donations, and press releases as three key factors in a corporation’s ability to influence others. Corporations spend a lot of money on donations to political and charitable causes but instead of making direct payments they create PACs and foundations. PACs are used because of laws against direct corporate political donations and foundations are used to avoid taxes.

To find a PAC, the easiest method is to go to OpenSecrets.com and search for the name of the company and “PAC”. For example, the Target Corporation has a Target Corporation PAC with a profile page on Open Secrets that displays its information.

The source of the information is the Federal Election Commission website, and you can see in the registration details section of the Open Secrets profile page below that there is a link to the source.

The FEC.gov website has its own profile page for the Target Corp. PAC. The format is not as good as the one on Open Secrets but if you want to cite your data it is good to know the source.

A corporation’s foundation is usually easy to find with just a search on the corporation name and the word foundation. As another example, Propublica’s nonprofit explorer has a page on Target Foundation tax records.

Note that in Section B from the most recent tax filing available there is only one source of its funding, the Target Corporation itself. This is standard practice for a corporation’s foundation.

Depending on the tax record, the corporation may list the recipients in an attachment at the end or, in this case from 2015, in Statement A. Notice that Minnesota and Minneapolis appear a lot in the list. This is likely because the corporate headquarters is in Minneapolis. (An upcoming post will explain how to make these documents word-searchable with optical character recognition.)

Note that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence” (click here to read), revealing that S&P 500 corporations commonly donate to nonprofits linked to politicians that they want to influence. (A previous post addressed how to look up whether a politician, or anyone else, is misusing a nonprofit for personal gain.) See below for the summation of the study’s findings.

“In our first analysis using these data, we show that a non-profit is more than four times
more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study also found that “…a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” It also noted that there was a high probability that the foundation would stop donating to the nonprofit when the politician lost their bid for re-election.

Therefore, the nonprofits that receive donations can be compared to politicians’ financial disclosures, which usually list whether they are involved in a nonprofit. A researcher can look at which committees a company lobbies and then look up the disclosures of the politicians who sit on those committees; a previous post addressed how to look up lobbying and disclosure records.

Press Releases

On a side note, it is important to remember that corporations use press releases as possibly the most direct way to release information and try to improve their public image. Specifically, press releases are used by corporations to influence the public, release information that is legally required to be made public, and describe/acknowledge their own activities in the manner that they want to be viewed (which is arguably true of any organization).

There are three good places to find press releases. First, almost every corporation’s website will have a page entirely dedicated to press releases. Second, press releases are often included in SEC filings, so you can do a keyword search in the SEC’s EDGAR database for the phrase “press release” along with the corporation’s stock ticker to ensure you only get results from that one corporation. Finally, there is a website, https://www.prnewswire.com/, that exists to post and cover corporate press releases. One of the benefits of using this website is that you can also read other press releases on the same topic from different sources/corporations.

The EDGAR keyword search function (often overlooked)

How to Research U.S. Gov. Contracts : Part 3 – Contract and Tender Lookup

This article will show how to look up a government contract in order to see what useful information about the contracting company is available.

Brief recap of the path leading to this point:

The Federal Contractor Misconduct Database identified in part 1 revealed that a company was cited for violations related to the death of a contract trainer involved in training sea lions for the U.S. Navy.

The company’s registration information and unique identifier, known as its DUNS number, were identified in part 2 by looking up the company in SAM.gov and DUNS.com

The company is named Science Applications International Corporation (SAIC), a branch of the larger corporation with the same name, and its address is 12010 Sunset Hills, Reston, VA. The company’s DUNS number is 078883327. The original violation record listed that the company was working on the U.S. Navy’s Marine Mammal System Support program when the violation occurred and when the trainer drowned.

Contract Lookup

The violation record did not identify the specific contract, so one may look through the available records for it. The record listed that the violation was committed by the identified company while it was contracted on the Mark 6 / Marine Mammal Support System program in 2014. In theory, this means that the researcher needs to find the contract that fits those three criteria.

The company’s DUNS number that was identified in the previous posts can be used in USAspending.gov, which makes federal government contract information public, to lookup contracts with the company. The website can also search based on the company name or other inputs.

To do so, at the site click on Award Search, then Keyword Search. If we search here for the company name there are too many results. Next, try clicking on Award Search, then Advanced Search. On the left side of the screen find “Recipient Name”, type in the company name, and hit enter (which adds the filter but does not actually start the search); then scroll down and submit the search.

This also produces a lot of results. So the next option is to search on the company’s DUNS number. After this, we can add in a keyword search for the name of the program to filter down our results for instances of the company contracting for that specific program.

We see in the results (click here) that when we search for the company (or its DUNS number) with the name of the program, we only get two contracts and the first one started in 2015 and the other in 2020.


However, we know from the record of the violation that it occurred in 2014.

This is a problem that is NOT common. Basically, the company had a different name when it took the contract. Back then, it was known as Leidos, a company that split off into two other companies known as SAIC and Leidos. The original Leidos, the new Leidos, and SAIC all have different DUNS numbers. This does undermine the value of using the DUNS number to search for a company.

But to investigate the problem, assuming one did not know about the company’s history, a quick Google search for SAIC finds on the Wikipedia page that the company started on September 27, 2013.

To confirm the information from wikipedia, we see in the company’s SAM.gov registration from the Part 2 post that the company registered on the following day.

One can see from the two contracts identified in the search on USAspending that they are scheduled from 2015 to 2020, and from 2020 to 2025, which suggests that the contracts are usually scheduled in 5 year increments. One can infer that the previous contract was scheduled from 2010 to 2015. At this point one knows that even though SAIC was working on the contract in 2014, the contract may have begun in 2010 before SAIC registered with SAM.gov in 2013.

Since SAIC spun off from Leidos, we can search for that company name in conjunction with the name of the program. Hence, we see that Leidos started a contract in 2009 and it continued through the date of the violation in 2014 and ended in 2015. Therefore, this result is the contract that SAIC was fulfilling at the time of the violation.


We click on the result and are brought to a page with the contract information. This information includes the contract identification number N6600110C0070.

There is a description of the contract that is kind of vague and not very useful:

There is a history of the transactions showing when the company was paid over time and what services were rendered.

Unique identifier information for the award and the government entity that is the source of the contract.

Information about the recipient:

Competition details are interesting. For example, this shows that the contract was intended for open competition but only one company bid for it, meaning that the company did not face any competition for the contract. If we wanted to assess the company based on its ability to obtain a contract, it appears significantly easier (and therefore requiring much less competency from the company) to acquire contracts without facing competition, unless there is some other relevant information that is not apparent here.

Sometimes executives are listed but that is not the case here:

Tender Lookup

We take the solicitation ID, which is also the contract ID, from the contract and go to beta.sam.gov, where we can look up the original tender. When you go to the website, you can use the basic search function, or if you want the advanced search you have to take the not-obvious step of scrolling down and clicking on Search Contract Opportunities.

When you are looking for old tenders, make sure to uncheck the box that says Active Only.

Finally, we search for the ID number and find the original tender.

This is useful because the tender has a description of the contract and that gives us details of what the contract actually entailed.

Additional Contracts

We can also look up to see if the company acquired additional contracts for the same program. Returning to USAspending.gov and searching for the company name and the program name we see that after the first contract ended, the company obtained two more on the same program.


Interestingly, we see at beta.sam.gov that before the last tender was issued, the Navy issued a notice of intent for sole source procurement to SAIC for the program.

If we open this notice, it states that the Navy will not open the tender to open bidding, but will instead offer it directly to SAIC. It is not clear from the notice why the tender will not be open to bids from different companies, but it confirms that SAIC did not face any competition for its most recent contract on the program.

That’s it! An upcoming post will show how to do more in-depth research.

How to Check If a Website Is Safe

There are several ways to check a website to see if it is dangerous.

You can “scan” the website infrastructure for telltale signs of danger. Find out if known malware refers back to the website, or check the downloadable files on the site itself. Investigate the website’s links or its SSL certificates for clues.

There are also blacklists of websites that are considered security threats and this is the best place to start. The easiest first step is to click here to use Virus Total, which has a tool (click here) to check suspicious urls. Virus Total will check your url against 60+ blacklists.

See the results below from searching on “search-ish.com”, and you can see the beginning of a list of different checks that came up clean for my website.

These are the results from searching my website

I also like that they have a nice “summary” section for those of us who do not have PhDs in Computer Science. See below:

You can also use Threat Intelligence Platform for the same purpose. But keep in mind that it will not give you a simple “yes” or “no” answer. Instead it will provide you with various kinds of evidence that the site is or is not safe.

To be extra careful, you can also check if the site’s IP address is blacklisted by looking up the IP on Ultra Tools and searching it at IPvoid.

Malware Affiliated With the Site

Another method is to look for malware that has been hosted on the website in the past, which is pretty damning for the site. You can also search for known malware that refers to the site, which is a sign that the website is affiliated with the processes of the malware. You can search for either of these on the aforementioned Virus Total and Threat Intelligence Platform. However, it is not clear where they obtain this information.

Suspicious File on the Website

If you want to check if a particular file is dangerous you can upload it to Virus Total for analysis and it will be checked against 60 virus databases. While it is definitely recommended that you do not download or in any way work with potentially dangerous files, if you plan to do so it is recommended that you use a virtual machine.

Using a Virtual Machine

For a newcomer, dealing with a virtual machine is a bit of a hassle, so feel free to jump to the next section. Still reading? Okay so if you want to actually go to the suspicious website, maybe download suspicious files to upload them elsewhere for inspection, you can use a virtual machine to mitigate the danger to your computer.

The Intercept’s guide for novices on setting up a virtual machine

A virtual machine is basically an isolated computer within your computer. The idea is that if a malicious file infects your virtual machine it should not be able to infect your regular computer, though there are a few documented cases where this is possible. For a primer on how to set one up and use it, click here to see the guidance from The Intercept. I hate to pawn you off to another website to learn virtual machines, but I can’t do a better job than they did.

Additional Names on the SSL Certificate

One can also check the website’s SSL certificate for signs that it could be dangerous. The SSL certificate is a type of digital certificate that authenticates the website, so that when you go to cnn.com you are actually going to CNN’s website. Generally you should expect the certificate to have one domain, possibly with additional subdomains. Scammers often use many domains on the same certificate.

You can look up any website’s SSL certificate at censys.io, or if you scanned the website on Virus Total you can look in your results under the heading “Subject Alternative Names.” However, censys.io will give you more detailed findings.

Does It Use a Phishing Kit?

While there are many companies that offer customers different kinds of pre-built websites (WordPress, for example), there are similar providers that offer pre-built phishing websites to scammers. These are called “phishing kits” and they have several telltale characteristics. To check if a site was built with a phishing kit, you can send its url to urlscan.io. UrlScan will find evidence of a phishing kit and whether other sites use the same kit.

This is a bit long but you see the results below for a search on this website.

Right up top, you can click “similar”.

You click there and you can see specifically whether there are indications of a phishing kit (like in the website below that is not mine):

Additional Security Websites

In addition to the websites mentioned above, you can find a list of websites that check the safety of suspicious sites by clicking here.

How to Use APIs (explained from scratch)

photo from Petr Gazarov’s What is an API?

This post explains APIs with Python but assumes no prior knowledge of either.

Python Headers, Requests, Json, and APIs

An API is a means for someone, or more specifically their Python script, to communicate directly with a website’s server to obtain information and/or manipulate data that might not otherwise be available. This avoids the difficulty of getting a Python script to interact with a webpage.

To use APIs, one needs to understand Python’s requests and json libraries as well as Python dictionaries. This article provides a walkthrough for using these tools.

Background – How to Get Started With Python and Requests From Scratch

To start with, if you are completely new you can download Python from https://www.python.org/downloads/. Then you can download Sublime Text, a text editor for writing and running Python scripts, at https://www.sublimetext.com/3.

Now, access your Command Line (if you are using Windows) or Terminal (if you are using Mac).

Next we want to get a new python library called “requests” and we will do so by using pip. Pip is included in Python but you can see guidance for installation and updating at https://pip.pypa.io/en/stable/installing/.

To obtain “requests” (if you are using Windows) from the command line, type “python -m pip install requests”.

If you are using Mac, from Terminal type “pip install requests”.

It can be useful to avoid learning the more official definition of an API, because it is often considered esoteric to the point of being counterproductive.

Requests – a GET Request

The following is a walkthrough for a standard GET request. We start with the “requests” library, which is standard for using APIs, web scraping, and generally using Python with the Internet. Requests are basically used to request information from a website or API.

First, to start our script we import the requests library by typing “import requests” at the top.

Second, identify the url that gives us the location of the API. This url should be identified in the api documentation (api documentation is explained below). If we were web scraping, this would be the url of the webpage we are scraping, but that is a separate topic. We assign the url to what is called a variable, in this case named “api_url”, by typing: api_url = “http://FAKE-WEBSITE-URL/processing.php”.

SIDENOTE: A “variable” is kind of like a name or container for information, and that information is called a “value”. So in a script you create a name for the variable and assign the information/value to that variable by typing ” the name of the variable = the information/value “. So in this script the variable name is api_url and the value is the string of characters that make up the url, wrapped in quotes: “http://FAKE-WEBSITE-URL/processing.php”.

Finally, we use the requests library to create a GET request with the format “requests.get(api_url)”. The request first appears in the context of being assigned to “api_response”. It might seem weird that the request first appears in the context of saying “something equals the request”. It is easier to think of it as: your computer first reads the request before looking at the variable name, then gets the data (also known as the API response) and brings it back, and then gives the data a tag, which is the variable name. That may not be exactly accurate, but it is easier to understand.

import requests
api_url = "http://FAKE-WEBSITE-URL/processing.php"
api_response = requests.get(api_url)

Requests – a POST Request

Usually with requests you will do a GET request, because you are basically just getting data that is already there on the website or from the API. In other cases you will do a POST request, when you need to send information to a server/url in order to get certain information in return. However, the lines dividing GET and POST are often blurred, and sometimes the two requests can be used interchangeably.

Here is an example of a POST request. In this case, I want to obtain information about a specific person from a database. Therefore I change “requests.get” to “requests.post” and, instead of only putting the url in the request like in the script above, I also include parameters in the form of “data=params” to tell the database the relevant information.

The request parameters (identified as “params” in the script) specify what information you are looking for with your script.

import requests
params = {'firstname': 'John', 'lastname': 'Smith'}
r = requests.post("http://FAKE-WEBSITE-URL/processing.php", data=params)
print(r.text)

The response to the request will include information for the specified person instead of all the information at the url.

API Documentation

Each API has certain requirements for how the code is written and formatted. These specific demands should be explained in a guide that accompanies the API on the website that explains or identifies the API itself. The guide is referred to as the “api documentation.”

For example, the website faceplusplus.com offers a tool that will compare faces in photos, and there is the option to use an API to access the tool. The website includes the api documentation for Face++ (as shown below), where it identifies the requirements and specifications for your script to access their API.

Note that the documentation below identifies the url that needs to be used and that the script must use a POST request. The documentation also identifies the names of the request parameters used in the script (parameters are one of several ways to include a bit of data in a request; headers, explained later in this article, are another).

How to use Face++ is explained in OSINT: Automate Face Comparison With Python and Face++ and in Python for Investigations if You Don’t Know Python.

API Response

Now, back to the original GET request below.

import requests
api_url = "http://FAKE-WEBSITE-URL/processing.php"
api_response = requests.get(api_url)

The response from the server, which we assigned to “api_response”, will be written in JSON format. So we need to make the json response more readable. To do this we need Python’s “json” library (the term json here is put in quotes to specify that it refers to the Python library named “json”, not the JSON data format). The json library is included in Python’s standard library, so there is nothing extra to install for it.

(The requests library, as referenced above, did need to be installed with pip: on Windows, from the command line, type “python -m pip install requests”; on Mac, from Terminal, type “pip install requests”.)

Next we add the line “import json” to our script (this refers to importing Python’s json library, not the JSON format itself).

Then we use the “json.loads()” function from the json library to process the “api_response” that is written in JSON. We specify that we want to process the text of the api response by typing “api_response.text” when we put the response in the “json.loads()” function, so in full we type “json.loads(api_response.text)”, and we assign the result to “python_dictionary”. In order to make the response data more readable, the json.loads() function transforms the data into a python dictionary (explained more below).

Here is how it looks:

import json
import requests
api_url = "url for the api here (listed in the api documentation)"
api_response = requests.get(api_url)
python_dictionary = json.loads(api_response.text)

For more information on this topic, look at the book Mining Social Media, in particular p. 48, or consider purchasing it.

Recap and Explanation – So, we used the “json.loads()” function from the json library to transform JSON data into a python dictionary. However (per Mining Social Media, p. 49), the loads() function requires text, but by default the requests library returns api_response as a response object that displays its HTTP status code, which is generally a numbered response, like 200 for a working website or 404 for one that wasn’t found.

So if you typed “print(api_response)” you would get the status code.

We need to access the text of our response, or in this case the JSON rendering of our data. We can do so by adding “.text” after the “api_response” variable; the entire construction thus looks like this: json.loads(api_response.text). This converts the response of our api call into text so that our Python script can interpret it as JSON keys and values. We put these JSON keys and values in a Python dictionary, which is used for key-value pairs in general.
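
To make the difference concrete, here is a small sketch (using the same placeholder url style as above) that prints the status code, the raw text, and the parsed dictionary side by side:

import json
import requests

api_url = "http://FAKE-WEBSITE-URL/processing.php"   # placeholder url
api_response = requests.get(api_url)

print(api_response.status_code)                      # the HTTP status code, e.g. 200 or 404
print(api_response.text)                             # the raw text of the response (JSON as a string)
python_dictionary = json.loads(api_response.text)    # the same data as a Python dictionary
print(python_dictionary)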

Python Dictionary

A Python dictionary contains key-value pairs (explained below) and it is also defined by its formatting. So here is an example of what a Python dictionary and its formatting look like:

headers = {'Content-Type': 'application/json',
'Authorization': 'api_token'}

The dictionary is enclosed in {} and its contents are formatted in key-value pairs (a “key” is assigned to a “value” and if you want to obtain a particular value you can call on its key).

For example, a dictionary would appear in our script like this: "Dictionary_Title = {'key1': 'value1', 'key2': 'value2'}". Separately, we can call upon a value by typing "Dictionary_Title['key1']" and it will give us 'value1', because value1 is the value that was assigned to key1.
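
Written out as a runnable snippet, that looks like this:

# A simple dictionary: each key is assigned a value.
Dictionary_Title = {'key1': 'value1', 'key2': 'value2'}

print(Dictionary_Title['key1'])   # prints: value1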

However, dictionaries can also contain dictionaries within them. See below, where “key2” is a dictionary within a dictionary:

Dictionary_Title = {'key': 'value',
    'key2': {
        'value2': 'subvalue',
        'value3': 'subvalue2'}}

In the example above, key2 is a dictionary within the larger dictionary named Dictionary_Title. Therefore, if we want to get a value in a dictionary within a dictionary, like subvalue2, we would structure our call like this: "Dictionary_Title['key2']['value3']", and that would give us subvalue2.

Note that sometimes a very large dictionary is assigned to a variable, so watch for data preceded by something that looks like this: “item: [“. That means that item is a name holding the data that follows (in this case a list containing dictionaries).
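
For example, here is a small made-up sketch of that pattern, where a variable holds a list and each element of the list is a dictionary:

# Made-up example of the pattern that looks like  item: [ ...  in an API response.
item = [
    {'name': 'first result', 'score': 10},
    {'name': 'second result', 'score': 7},
]

print(item[0]['name'])   # index into the list first, then into the dictionary: prints "first result"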

Authentication

APIs will commonly require some form of authentication, like an authentication code, also referred to as a key or a token. This is a means for the API owner to limit access or possibly charge users. Typically the user will have to sign up for an account with the API owner in order to obtain the code; this is often referred to as a developer account and is geared toward app developers.

The owner’s API documentation will give instructions for how to include the code in your Python script.

API authentication can be a difficult matter. For reference, consider looking at the requests library’s documentation on authentication, click here.

Often, the documentation will instruct the user to include the code as a “param”, “in the header”, or to use Oauth.

Params Authentication

The API documentation for Face++ (for more information about using Face++, see my article on Secjuice by clicking here) specifically requests that your api key and api secret (assigned to you when you get an account) be included as request parameters.

Therefore, in your script you would create a params dictionary with the keys identified above and include that dictionary in your request by typing “params” or “params=params”, as seen below.


params = {'api_key': 'YOUR UNIQUE API_KEY', 
'api_secret' : 'YOUR UNIQUE API_SECRET', 
}

r = requests.post(api_url, params=params)
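
Putting that together, a full Face++ call might look something like the sketch below. The extra parameter names (image_url1 and image_url2) and the endpoint url are things to confirm in the Face++ documentation itself; the key, secret, and image urls here are placeholders.

import requests

# Sketch of a Face++ request. Confirm the exact endpoint url and parameter names
# in the Face++ api documentation; everything below is a placeholder.
api_url = 'URL FROM THE FACE++ API DOCUMENTATION'
params = {'api_key': 'YOUR UNIQUE API_KEY',
    'api_secret': 'YOUR UNIQUE API_SECRET',
    'image_url1': 'http://example.com/photo1.jpg',
    'image_url2': 'http://example.com/photo2.jpg',
}

r = requests.post(api_url, params=params)
print(r.text)   # the response is JSON text; json.loads(r.text) turns it into a dictionary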

Headers are a bit more complicated and therefore need a section of their own to explain.

HTTP Headers

Every time you send a request to a server (which includes things like clicking on a link or doing almost anything on the internet) an HTTP header will be attached to the request automatically.

Headers are a bit of data attached to a request that is sent to a server, and they provide information about the request itself. For example, when you open a website your browser sends a request with a header that identifies information about you, such as the fact that you are using a Chrome browser. Your Python script, behind the scenes, also includes a header that identifies itself as a script instead of a browser. This process is automated, but you can choose to create a custom header to include in your python script.

A custom header is often needed when you are using an API. Many APIs require that you obtain a sort of authorization code in order to use the API. You must include that authorization code in your script’s header so that the API will give you permission to use it.

In order to create a custom header, you add to your script a python dictionary named headers. Then specify in your request that this dictionary should be used as the header by typing “headers=headers”. See below:

headers = {'Content-Type': 'application/json',
 'Authorization': 'Bearer {0}'.format(api_token)}

Api_response = requests.get(api_url, headers=headers)

This custom header will get priority over the automated header, so, for example, you can set your custom header to identify your Python script as (essentially) a person using a web browser so that you can avoid bot-detection software. A separate article will address how to make your script look human in order to avoid bot-detection software.
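
For example, here is a sketch of a custom header that presents the script as an ordinary browser; the User-Agent string is just an illustrative example of a browser identifier, and the url is the same placeholder used above:

import requests

# Sketch: a custom header that identifies the request as coming from a web browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

api_response = requests.get('http://FAKE-WEBSITE-URL/processing.php', headers=headers)
print(api_response.status_code)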

There are several predetermined key types and associated meanings. See here for a full list

The api documentation will often give specific instructions for how you must set up the headers for your scripts. The lines above set up a dictionary containing your request headers.

This sets two headers at once. The Content-Type header tells the server to expect JSON-formatted data in the body of the request. The Authorization header needs to include our token, so we use Python’s string formatting to insert our api_token variable into the string as we create it. We could have put the token in here as a literal string, but separating it out makes several things easier down the road.

See the more official documentation of custom headers below, from the documentation for requests:

“What goes into the request or response headers? Often, the request headers include your authentication token, and the response headers provide current information about your use of the service, such as how close you are to a rate limit”

Github API Authentication

This walkthrough of the Github API shows how to use an authentication token as opposed to the authentication-free version (Github allows people who do not have an account/token to use its api a limited amount). For more information, there is a great tutorial for the Github API; click here.

Without the API token:

import json
import requests
username = "search-ish"
following = []

api_url = ("https://api.github.com/users/%s/following" % username)
api_response = requests.get(api_url)
orgs = json.loads(api_response.text)
for org in orgs:
    following.append(org['login'])
print(following)

api_url = ("https://api.github.com/rate_limit")
api_response = requests.get(api_url)
print(api_response.text)

As a result, the script shows that the user “search-ish” follows one person, “lamthuyvo”. But we also see that the limit of searches is set at 60 and that I have used 10 of them already.

With the API token:

import json
import requests
username = "search-ish"

headers = {"Authorization": "bearer fake_token_pasted_here"}

following = []
api_url = ("https://api.github.com/users/%s/following" % username)
api_response = requests.get(api_url, headers=headers)
orgs = json.loads(api_response.text)
for org in orgs:
    following.append(org['login'])
print(following)

api_url = ("https://api.github.com/rate_limit")
api_response = requests.get(api_url, headers=headers)
print(api_response.text)

In this script, we have gotten a “developer account” (this is generally the name of the kind of account you need in order to obtain an access token). Github uses the widely used OAuth2 standard, and Github’s api documentation says that it wants you to put a bearer token in the header. So we use the fake token “1234” and type it with the following formatting:

headers = {"Authorization": "bearer 1234"}

This specifies that the type of authorization is a bearer token and provides the token itself.

Next, we tell our GET request to include this header information by typing the following

api_response = requests.get(api_url, headers=headers)

This is a bit confusing, but when you type “headers=headers” you are essentially saying that the HTTP headers are the “headers” variable that you just created. This will only replace the Authorization part of the original, automatic HTTP headers.

When we run this script we get the following:

Note that we get the same answer to our followers question and now the limit is set to 5,000 because that is the limit for my account.

That’s it, good luck.

How to Research U.S. Gov. Contracts : Part 2 – SAM.gov

This article is a follow up to the post U.S. Government Contracts Case Study: Part 1 – Contractor Misconduct but it is NOT necessary to read it first.

This part of the walkthrough involves researching a company in special databases for companies that have contracted for the U.S. government. At this point we know the company had a contract with the government and we have identified its name and address. Based on this information we will look up the company registration.

The company we are researching is Science Applications International Corporation (SAIC) and its address is 12010 Sunset Hills, Reston, VA. We are researching it because we found a record about a violation related to a drowning death.

SAM.gov

The next step is to go to SAM.gov, because any company that has ever even tried to get a government contract will be registered on this site. SAM.gov is the site that the government uses to announce contract offerings (also known as “tenders”) so that companies can bid on them. Companies HAVE to register on this site before applying for a contract.

We search for our company name and we get a lot of results. So we use the advanced search function and specify the company’s address (obtained from its website and confirmed from OpenCorporates.com) and get the registration for our company.

SAM.gov registration summary

The most important piece of information is the DUNS number, which is 078883327 for this company. The DUNS number is a unique identifier used to find a company in government databases. This way there is no confusion if the company name is misspelled in a record or if other companies have the same name. “DUNS” refers to a separate database maintained by the company Dun and Bradstreet; more on that later.

One aspect of the SAM registration that is useful is the POC section, because it identifies certain officials in the company, their contact information, and their responsibilities. For example, our company has POCs assigned for Electronic Business, Government Business, and Accounts Receivable.

Note the drop-down menu next to where it says View Historical Record. The website will let you look at past registrations, which often list different names. This is helpful for finding additional current and former employees. More importantly, if you are researching a past contract, you can look up which member of the company was affiliated with it at the time.

There are two more bits of information that could be useful here. Click on “Core Data” on the left and you will see a section like the one below. First, you see when the company first registered with SAM.gov, which is essentially when the company decided to start seeking government contracts.

This is an important date for a researcher who is looking into a company’s SEC filings or lobbying disclosures. Any changes that occurred at that point in time might be related to the company’s desire for government contracts. But that is beyond the scope of this post.

Second, you see the company’s congressional district identified as “VA 11” also known as the 11th District of Virginia. A researcher should consider investigating the relationship between the company and the member of Congress representing this district. An upcoming post will give in depth instruction on how to do that and specifically look at SAIC.

DUNS Database (at dnb.com)

Companies that bid for government contracts must also register with the DUNS database at Dun & Bradstreet to get their DUNS number. We can use the company’s number to look up its registration at DUNS for additional information. We will address the information that is available for free.

To do so, we go to dnb.com and input the number in the search function on the top right of the website. We pull up the record for our company, which starts with the basics.

The Company Profile provides a basic description of the company and what it does.

DUNS estimates that the company has 5 employees at its primary address, but it is not clear what that estimate is based on so it is hard to validate.

The record also points out that the company we are researching is a branch of a larger corporation with the same name, which explains why there were so many results when we searched the company’s name in SAM.gov.

There are several tools available if you want to research the larger web of a corporation, but this article is focused on databases that are specific to government contracts. The aforementioned tools, that will not be addressed in this article, include https://www.corporationwiki.com/, https://enigma.com/, https://public.tableau.com/s/, and https://www.thomasnet.com/.

Finally, the record also identifies several employees as potential contacts for further research.

That is it for looking up registrations, the next post will address investigating the contract.

U.S. Government Contracts Case Study: Part 1 – Contractor Misconduct

When a company takes a contract with the U.S. government, it is required to make public a lot of information that would normally stay private in private-sector contracts. This information is a great opportunity for any researcher interested in the company.

This post will be the first in a series that provides a walkthrough for researching a U.S. government contract or contractor. The case study focuses on a contractor that trained sea lions for the U.S. Navy and a contract that ultimately involved the death of a trainer.

We start with the corporation Science Applications International Corporation (SAIC). The company website mentions that it has contracted for the U.S. government.

In theory, a researcher should first look up the company’s registration details, then contracts, then tenders. However, the most interesting information is often found in violations committed in the process of fulfilling government contracts. Therefore, the first step is to go to the Federal Contractor Misconduct Database that is maintained by the Project On Government Oversight (POGO), a nonprofit government watchdog organization.

When we search for SAIC, we see that the database lists 24 instances of misconduct with cumulative penalties of over half a billion dollars. The page lists POGO correspondence with SAIC and a list identifying each instance of misconduct, with a link to the source of the information.

Below we see some of the listed incidents of misconduct and the penalties leveled against the company. The incident that looks most interesting and will be the subject of this article is the incident named Drowning Death on Mark 6 Sea Lion Program.

Clicking on this incident’s title brings us to a page on the incident itself. This includes a summary of the incident, identifies the enforcement agency as the U.S. Occupational Safety and Health Administration (OSHA), and includes a link to the OSHA decision.

With these kinds of incidents (most do not involve a death, but they are not rare), you can expect an initial inspection with the results documented publicly on the OSHA website, an OSHA decision, and a final decision from the Occupational Safety and Health Review Commission (OSHRC) that will be posted on oshrc.gov.

All OSHA fatality-related reports are published here. We can find the report for the incident in question because we know the date, but the search feature is fairly user friendly regardless. The report shows that the inspector observed a second “serious violation” at the time, but the review commission later ruled against it. Given that some of these documents can be lengthy, it helps to know what to look for; in this case we want to find out why the commission ruled for one violation and against the other.

In addition, by clicking on the violation ID numbers we can see the inspector’s reported observations/violations at that time. This sheds more light on the incident.

The OSHA review commission report provides a lot more details about the incident and alleged violations that would be particularly useful for a due diligence review of the company’s actions. For starters, the commission explains that a “serious” violation has a specific meaning. Namely, that if an accident occurred, it “must be likely to cause death or serious physical harm.”

Furthermore, the review goes on to detail several criteria necessary to establish a serious violation: that the hazard existed, that the employer knew about the hazard in advance, that the hazard risked death or serious injury, and that there were feasible means to abate the risk that the employer did not take. The report details the evidence for why it deemed that the company’s actions met each of those criteria, and why they met some but not all of the criteria for the second alleged violation.

This is important because, if we were going to assess the company’s decisions and abilities, this information shows that the company did not merely make an oversight. Rather, according to the report, the company made intentional decisions that led to the violation.

The report’s details also add further negative information related to the second alleged violation, indicating that there were problems, even if not enough to be deemed a “serious violation.”

Finally, the report makes a vague reference to a previous incident where a Navy diver had to be rescued and resuscitated and how this provides evidence that the company recognized the existence of certain hazards. This is discussed in greater detail in the final decision.

We can go to oshrc.gov and do a simple search for “SAIC” in the search function on the top right to find the final decision. The final decision contains a lot of detailed information about the company, its history, and how it operates, including new information obtained from testimony. In the section below, there is text that appears to address the previous incident in which a diver had to be rescued. These details suggest that the previous incident occurred at the same location while the company was providing similar services.

Ultimately, these three documents provide very detailed and valuable information about the company, and much or all of this information could have remained secret with a private sector contract.

If a researcher were writing a due diligence report, they could note that OSHA cited “serious” violations in its inspection following the incident and that the OSHRC final decision noted, on page 8, a previous “near miss” showing that the company’s employees had been put in danger in the past.

Further posts on this case study will explore researching the contracts themselves and investigating the company with contract-related websites.

See PART 2

Dark Net, Deep Net, and the Open Net

One can easily write at length to describe the differences between the deep net, dark net, and the open net, but they can also be summed up simply as follows.

OPEN NET

The open net is what you would call the “regular” internet. If there is a phrase on the open net, such as the name “Olivia Wilde,” then you can simply google it. If the name only appears once on the open net and it’s in a news article on cnn.com, then Google will find that article with a quick search.

DEEP NET

The deep net generally refers to information or records that are stored in databases and cannot be discovered via Google. These databases, known as deep web databases, often store government records and can only be accessed via specific websites that exist on the open net as portals to the deep net. For example, property records are stored in deep web databases. Therefore, if one “Olivia Wilde” owned a house in Miami-Dade County, you would never find that record by googling the person’s name. You could ONLY find that record by going to the Miami-Dade County Government website.

There is a specific page (see picture above) on that website where you can search for a name in the county’s deep web database of property records. This is the ONLY place on the entire internet where you can search for that record, because it is the only public access point to that database.

TOR AND THE DARK NET

TOR is a free service that enables users to have secure and anonymous internet activity. Here is how it works. When a person uses TOR, from their perspective they merely open a TOR browser and type in a website’s url to connect. This is similar to any other web browser, but with a very slow connection.

Behind the scenes, instead of directly linking the person’s browser to the website, TOR redirects the person’s internet traffic through three proxy nodes and then connects to the website. TOR has a network of several thousand proxy nodes around the world that it uses for this purpose.

This is illustrated below where Alice is using TOR, which means that her internet traffic takes a circuitous route to its destination.

The TOR browser also encrypts the traffic from the person’s computer to the first node and between the first, second, and third nodes. TOR does not encrypt the internet traffic from the third node to the website. This is demonstrated below, where the encrypted parts of the path are highlighted in green but the last hop from the last TOR node is unencrypted.

Because the last leg of this internet trail is not encrypted, the website can only see that an anonymous person is connecting to it from a TOR node. TOR nodes are more or less publicly known, so websites will know when traffic is coming from the TOR network.

Dark net websites will only allow traffic coming from these TOR nodes. By contrast, some open net websites will not allow traffic from TOR nodes.

As a result of the encryption and proxies, it is almost impossible for any government to monitor the content of a TOR user’s internet browsing. A government can see that the person is accessing TOR, but not what they are doing. Many regimes try to prohibit the public from accessing TOR so that they can better monitor their citizens’ internet traffic.
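For researchers who want their own Python scripts to take the same route, here is a minimal sketch, assuming a Tor client is already running locally on its default SOCKS port 9050 (Tor Browser listens on 9150 instead) and that the requests library is installed with SOCKS support (pip install requests[socks]):

import requests

# Route both HTTP and HTTPS traffic through the local Tor SOCKS proxy.
# "socks5h" (rather than "socks5") makes DNS lookups happen inside Tor as well.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# check.torproject.org reports whether the request arrived via the Tor network.
response = requests.get("https://check.torproject.org/", proxies=proxies, timeout=60)
print("Congratulations" in response.text)  # True if the traffic exited through Tor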

Public focus often centers on the seedier and criminal side of the dark net, but there are many legitimate uses for it as well. For example, charitable groups use the dark net to give people living under authoritarian regimes secure and anonymous access to reputable news sources, which those regimes often repress.

How to Web Scrape Corporate Profiles with Python

Editor’s Note: this post presumes that the reader has at least a passing knowledge of the programming languages Python and HTML. If this does not apply to you, sorry, you may not enjoy reading this. Consider looking at the crash courses on Python and HTML at Python For Beginners and Beginners’ Guide to HTML.

This post will explain how to use Python to scrape employee profiles from corporate websites. This is the first project in this website’s series of posts on web scraping and provides a good primer for the subject matter. Subsequent posts explain increasingly advanced projects.

Web scraping employee profiles on company websites is basically an advanced version of copy and paste. Below is a very simple Python script written to scrape employee profiles, and it can be applied to almost any corporate website with only minor edits.

To use this script, you must provide the url for your webpage and identify the HTML elements that contain the employee profile information that you want to scrape. Run the script and it will produce a CSV file with the scraped information.

What is an HTML element?

I believe this is best answered by Wikipedia:

This means that any piece of text in a website is part of an HTML element and is encapsulated in an HTML tag. If you right click on a webpage and choose to view the HTML, you will see a simple bit of HTML code before and after each segment of text. See the diagram below showing different parts of an HTML element for a line of text that simply reads “This is a paragraph.”

In order to scrape the content “This is a paragraph.” using Python, you will identify the content based on its HTML tag. So instead of telling your script to scrape the text that reads “This is a paragraph.”, you tell it to scrape the text with the “p” tag.

However, there could be other elements in the page with the same tag, in which case you would need to include more information about the element to specify the one you wanted. Alternatively, if there were 1,000 elements with the same tag and you wanted to scrape all of them (imagine a list of 1,000 relevant email addresses), you can just tell your script to scrape the text with the “p” tag and it will get all 1,000 of them.
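To make that concrete, here is a minimal sketch using BeautifulSoup (the same library the full script below relies on) that selects text by tag rather than by content; the sample HTML is invented purely for illustration:

from bs4 import BeautifulSoup

# Invented sample HTML with two elements that share the "p" tag.
html = "<html><body><p>This is a paragraph.</p><p>person@example.com</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Grab only the first element with the "p" tag.
print(soup.find("p").text)        # prints: This is a paragraph.

# Grab every element with the "p" tag, however many there are.
for element in soup.find_all("p"):
    print(element.text)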

Scraping Profiles

1 – The first step is to find the webpage where the employee profiles reside (for the sake of simplicity, we will only use one webpage in our example today). Copy and paste this url into the script as the value assigned to the “url” variable. For my test case, I will use the webpage for the Exxonmobil Management Committee (https://corporate.exxonmobil.com/Company/Who-we-are/Management-Committee). I chose a page with only a few profiles for the sake of simplicity, but you can use any profile page regardless of how many profiles are listed on it.

2 – The next step is to choose what information to scrape. For my example I will choose only the employees’ names and positions.

3 – To scrape a specific kind of data (such as job titles) repeatedly, you need to find out how to identify that information in the HTML. This is essentially how you will tell your script which info to scrape.

If you go to my example webpage you see that the job title on each employee profile looks similar (text font and size, formatting, etc). That is, in part, because there are common characteristics in the HTML elements associated with the job titles on the webpage. So you identify those common characteristics and write in the python web scraping script “get every HTML element with this specific characteristic and label them all as ‘Job Titles.'”

4 – Here is the tricky part: you need to identify this information based on its location in the HTML framework of the website. To do this, you find where the relevant information is located on the webpage. In my example, see the photo below, I want to scrape the employees’ position titles. The first profile on the page is Mr. Neil Chapman, Senior Vice President. So I need to figure out how to identify the location of the words “Senior Vice President” in the website’s HTML code. To do this, I right-click my cursor on the words “Senior Vice President” and choose “inspect.” Every browser has its own version of this, but the option should include the term “inspect.” This opens a window in my browser that shows the HTML and highlights the item I clicked on (“Senior Vice President”) in the HTML code. The photo below shows that clicking on that text on the website reveals that the same text is located within the HTML framework between the “<h4>” tags.

In the script below, you will see that we pull the position title from the text of the “h4” tag and the corresponding employee’s name from the text of the “h2” tag with class “table-of-contents”.

Then, for a test case, you run the script below:

import requests
from bs4 import BeautifulSoup
import csv

rows = []
url = 'https://corporate.exxonmobil.com/Company/Who-we-are/Management-Committee'

# Download the page and parse its HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Narrow the search to the wrapper that holds the profiles, then grab each profile section
content = soup.find("div", class_="article--wrapper")
sections = content.find_all("section", class_="articleMedia articleMedia-half")

# Pull the name (h2) and position title (h4) out of each profile section
for section in sections:
    name = section.find("h2", class_="table-of-contents").text
    position = section.find("h4", class_=[]).text
    row = {"name": name,
           "position": position
           }
    rows.append(row)

# Write the results to a CSV file in the same folder as this script
# (newline="" keeps the csv module from adding blank rows on Windows)
with open("management_board.csv", "w", newline="") as csvfile:
    fieldnames = ["name", "position"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

The result is a csv file, located in the same folder where this script is saved, that will have this information:

Okay so that obviously was a bit of work for something that you could have just copied and pasted. Why should you still be interested?

For starters, you can now identify any other information in the profiles and add it to the script. Identify the location of the information, add a new variable under “for section in sections:” alongside “name” and “position”, and add a matching column title to “fieldnames”.
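As an example, here is a hedged sketch of the same script with one extra field added; the “p” selector for a short bio is an assumption chosen purely for illustration, so you would swap in whatever element actually holds the information on your target page:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.exxonmobil.com/Company/Who-we-are/Management-Committee'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
content = soup.find("div", class_="article--wrapper")

rows = []
for section in content.find_all("section", class_="articleMedia articleMedia-half"):
    bio_tag = section.find("p")  # hypothetical extra field; this selector is assumed
    rows.append({
        "name": section.find("h2", class_="table-of-contents").text,
        "position": section.find("h4", class_=[]).text,
        "bio": bio_tag.text if bio_tag else "",
    })

# The new field also needs a matching column header in fieldnames
with open("management_board.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["name", "position", "bio"])
    writer.writeheader()
    writer.writerows(rows)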

Furthermore, the original script, without any alterations, would also work if there were one thousand profiles on that page.

Finally, this is a very effective method for webpages whose content you cannot copy and paste at all. For example, imagine you run into the kind of employee profile page that, according to Bluleadz, is increasingly popular. Notice how the page requires you to hover your cursor over a team member to see their information?

The aforementioned method of web scraping can scrape all of the unseen profile information in one quick go and present it in a friendly format.

As webpage design continues to evolve, these kinds of techniques will prove invaluable.