Python For OSINT: Find Websites Where an Email Registered Accounts – Holehe

We can identify a list of websites where a specific email address has registered accounts in a very straightforward process of 6 steps. This process does not require the reader to know, learn, or download Python, but does use a Python script.

CREDIT WHERE IT IS DUE: This process does not require a knowledge of Python but the explanation must address the computer language a bit. The Github user account Megadose hosts an amazing Python script named Holehe and deserves a lot of credit for this creation.

The script will find the different social media networks where the email is registered to an account. However, because many people do not know anything about Python, I’ve come up with a process intended for people that do not know anything about python and do not want to learn about it. It is a rote process requires no knowledge of python, no downloads, and no thinking at all.

Instructions for First Time Using Holehe

Step 1 (the hardest) – Click here and sign up for an account on Github . Sorry, that is more than one step but the process is simple and it gets easy afterward.

Step 2 – Login to Gitpod. You do not need to sign up for Gitpod if you already have a Github account. Go here ( and you will see an option on the left to sign in with your Github account even though you don’t have a Gitpod account. See below:

Once you have logged in your page will probably look like this:

Step 3 – Copy and paste this url into your browser and hit Enter:

Why? – Basically, you are making a url that consists of the gitpod website url, a hashtage, and the url of the github page for the script.

Here is the explanation. We want to run a Python script but to do so we need a development environment. Normally you would download it but in this case, Gitpod provides a development environment online where you can run Python scripts. When you identify a script posted on Github you create a url of the Gitpod website’s url, a Hashtag, and the url for the page hosting the Python script. So with our script hosted at, we create a url like this: + /# +

Gitpod will create a workspace, a virtual computer, specifically for running the script. The script and its affiliated files will be downloaded though you will likely still have to run the file, or its equivalent. If you go to the script’s page on Github there should be instructions for downloading and running the script.

Wait for Gitpod to do some processing and then your computer should look like this:

Step 4 – At the bottom of the screen find where it says “/workspace/holehe $”.

Click to the right of these words and type “python3 install” and then hit enter.

Step 5 – Wait for the install to complete and then right click on the folder on the type left that is named “holehe” (not the one titled “holehe.egg-info”). When you right click on the folder a drop down menu appears, choose “open in terminal”.

A new tab has appeared in the terminal, notice the new tab that reads “gitpod /workspace/holehe/holehe” and the cursor is located next to a similarly named prompt.

Step 6 – Finally, choose your email that you want to research, in our case we will use the fake example email “”

Now the last thing to do is type in the final sequence with your email in place of our fake example. So find the prompt “gitpod /workspace/holehe/holehe $” and next to it you will type the following and hit enter:


Results: Separately, I ran a real email address for an example to show how the results should look. The output is a list of 50 or so social media networks and other websites. If the name of the site/network is purple, there was no account on it, green means there is an account registered with the email address and red means the script could not check the site. Hopefully you got something like this:

Now that you’ve done this once, the process will be much easier in the future.

Next time

Login to Gitpod and there will be a workspace named for the script. It should look like the image below. Just click “Open” on the workspace.

From here, you repeat steps 4, 5, and 6. You install, open a new terminal tab, and run the command.

That’s it!

Python for OSINT: Find If a Username is listed as an Email or Social Media Account

This post will explain how to use the profil3r tool to automatically search if a given username or real name is used in various social media or an email address for gmail, yahoo, or hotmail.

If you already have accounts on Github and Gitpod, login and jump to step 3.

Step 1 – (the hardest) – Click here and sign up for an account on Github . Sorry, that is more than one step but the process is simple and it gets easy afterward.

Step 2 – Login to Gitpod. You do not need to sign up for Gitpod if you already have a Github account. Go here ( and you will see an option on the left to sign in with your Github account even though you don’t have a Gitpod account. See below:

Step 3 -Go to and then click on green button that says Gitpod. You’ll then have to wait for a bit.

If the gitpod button does not appear on this page for you, you can alternatively paste the following url into a new tab:

Step 4 -That brings you to Gitpod where it will set up a virtual machine.

Step 5 – see where it says “gitpod /workspace/Profil3r $” and next to it type

sudo python3 install

Step 6 – hit enter, wait for the install to complete. Then, as shown below, type “sudo python3 -p [USERNAME]”

I am going to use the example username “usernameforme”, so I typed

sudo python3 -p usernameforme

Step 7 – Hit Enter and then this screen appears.

If you had entered a person’s name “john smith”, you could choose here if you want the script to search with any, none, or all of the separators listed. as noted in the instructions, move the little yellow arrow up or down with the direction arrows on your keyboard. Choose an option by hitting “space” and unchoose it by hitting the same button. you can also check/choose all options with the “a” button or uncheck them all with the “i” button.

Or you can do nothing at all. Regardless of what you choose, (including nothing) hit enter when you are done.

Step 8 – Now you have a series of options for what you want the script to search. Feel free to hit “a” to choose all, and then hit enter.

Step 9 – wait and get your results. For each version of the username or name (from Step 7) it will check each website listed. Below we see there are several social media sites with the username “usernameforme” but nothing on Soundcloud.

The possible email addresses might be a bit confusing. The script searches if there are email addresses for gmail, yahoo, or hotmail with that username. If it says [SAFE] next to the email address, that means it did NOT find evidence that the email address exists.

That is because the script searches for instances where the email address is listed in a data breach. If the email address is not in a data breach, the website tells the script that the email address is “safe”. For our purposes, if the email address WERE in a data breach, that would prove that it does exist.

Feel free to follow up researching the email address on a website like

That’s it! You’re Done!

Update: Multiple Usernames

It is also possible to run multiple usernames at once, but be careful not to overload it.

To add more usernames, run the same command but add additional usernames separated by commas, like below:

sudo python3 -p username1, username2, username3

See the post (Find A Twitter Account’s…) that identifies the Twitter accounts most commonly mentioned by a given account.

You can then take the most commonly mentioned accounts and research them all with

Python for OSINT: Find if a Phone has an Instagram Account

This guide will walk through how to use Gitpod to run the Megadose/Ignorant script that checks if a phone number is registered to any Instagram account. Note that it will not identify the specific account.

If you already have Gitpod and Github accounts, login to both and jump to step 3.

Step 1 – (the hardest) – Click here and sign up for an account on Github . Sorry, that is more than one step but the process is simple and it gets easy afterward.

Step 2 – Login to Gitpod. You do not need to sign up for Gitpod if you already have a Github account. Go here ( and you will see an option on the left to sign in with your Github account even though you don’t have a Gitpod account. See below:

Once you have logged in your page will probably look like this:

Step 3 – Copy and paste this url into your browser and hit Enter:

Why? – Basically, you are making a url that consists of the gitpod website url, a hashtage, and the url of the github page for the script.

Here is the explanation. We want to run a Python script but to do so we need a development environment. Normally you would download it but in this case, Gitpod provides a development environment online where you can run Python scripts. When you identify a script posted on Github you create a url of the Gitpod website’s url, a Hashtag, and the url for the page hosting the Python script. So with our script hosted at, we combine it along the gitpod parameters: + /# +

to make this url –

Gitpod will create a workspace, a virtual computer, specifically for running the script. The script and its affiliated files will be downloaded though you will likely still have to run the file, or its equivalent. If you go to the script’s page on Github there should be instructions for downloading and running the script.

Wait for Gitpod to do some processing and then your computer should look like this:

Step 4 – At the bottom of the screen find where it says “/workspace/ignorant $”.

Click to the right of these words and type “python3 install” and then hit enter.

Step 5 – Then when it is done “python3 install” and hit enter

Step 6 – Wait for the install to complete and then right click on the folder on the type left that is named “ignorant” (not the one titled “ignorant.egg-info”). When you right click on the folder a drop down menu appears, choose “open in terminal”.

Step 7 – A new tab has appeared in the terminal, notice the new tab that reads “gitpod /workspace/ignorant/ignorant” and the cursor is located next to a similarly named prompt.

Finally, choose your phone number of choice, (i will choose a US-based fake number 123456789, and the US country code is 1) so I would type:

“ignorant 1 123456789”

and you will get results like you see below

Note that the script also checks if the phone was used to register or accounts.

Now that you’ve done this once, the process will be much easier in the future.

Next time

Login to Gitpod and there will be a workspace named for the script. It should look like the image below. Just click on the workspace.

This should reopen your workspace to right where you left off and you can run the code again.

Find a Twitter Account’s Original Content and/or Most Common Words

This post addresses how to take a Twitter account and:

  • a.) create a list of the tweets that were original content, not retweets
  • b.) identify the topics most discussed in the last 4 thousand tweets
  • c.) list the other Twitter accounts most often mentioned (and follow up to investigate them with

Most investigators looking at a Twitter account merely scroll through some of the latest tweets or look at the bio. But you can analyze the entirety of a Twitter account by identifying the main topics of its contents or filtering out retweets so you can just analyze the original content tweeted from the account.

SIDENOTE: We must address Tweet Topic Explorer, which is a great tool for understanding the content of a twitter account. But there are two issues that can get in the way. First, sometimes the most common words that it finds are actually usernames. Second, sometimes it is hard to find if there is any original content coming from an account because there are so many retweets.

This post will explain how to use Google Sheets to filter down to the original content tweets and find the most common words from an account without including usernames and/or contents from retweets.

For this example we are going to use the twitter account for Osint Combine (@osintcombine).

First go to All My Tweets (, plug in the twitter handle and choose “tweets” so it downloads all of the account’s tweets.

Second, highlight and copy all of the tweets’ content. You can use Ctrl + A if there are too many tweets to highlight manually. It will not impact the process if it highlights and copies the other words on the page that are not tweets.

Third, go to Google Sheets ( and paste the tweets into a column. in the B2 cell.

If you want to get rid of all of the formatting, then instead of just pasting into the cell you can right-click in the cell and then choose “Paste Special” and then choose “Paste values only”.

a.) Remove Retweets

If you want to remove retweets and only see original content then at this point highlight the column and then click on Data and then Create Filter

A little box appears at the top of the column

Click on it, choose “Filter by condition”, in the dropdown choose “text does not contain” and then in the box that appears type “RT” or and then click “OK”.

Count Words

First, highlight the column with the texts and then click on “Data”, “split text to columns” and then in the little Separator box that appears click the dropdown and choose “space”

An Aside – How to Filter Out Irrelevant Words

I find that you can filter out a lot of irrelevant words (“the”, “at”) if you excludes words below a certain character minimum. The following will let you find and delete all words that are 6 characters or less.

Highlight all by hitting CTRL and A at the same time

Click “edit” and then “find and replace”

In the new box type next to “Find” type the following:


SIDENOTE: This is a “regular expression” that refers to all words with 1 to 6 characters. Change the numbers however you like for your own spreadsheet.

Next to “replace with” enter a  space

Click check next to “search using regular expressions”

and then click “Replace All”

Now my screen looks like this, with lots of spaces

End of the Aside, Returning to Steps for Counting Words

Second, find out how many rows down your data goes and how many columns across.

So in mine here i see that it goes down to row 237

And across as far as column BH. Note that the “split text to columns” tool will highlight the columns as far as there are data in them (so you know you’ve hit the end of the columns when they aren’t highlighted anymore).

That means our data stretches from the box B2 on its top left to BH237 on its bottom right

Third, in box a2 we type =FLATTEN(B2:BH237) and hit enter. This combines all of the data from each box in the area you specified into one column.

Now all of the individual words will be listed in Column A.

Note that at first it might look like Column A is still full of empty spaces. Don’t worry, that is only because the empty cells are included. Scroll down and you will see the rest.

Fourth, type a title in cell A1, like “words”

Fifth, highlight column A, click “Data”, choose “pivot table”, make sure the box that appears has “new sheet” chosen, and then click “create” in the box.

Sixth, on the right next to “Rows” click “Add” and then choose “words” (or whatever you typed into cell A1). Then Next to “Values” click “Add” and then “words”. it should default to COUNTA under “Summarize by”.

Remove Usernames?

Seventh, at this point you may decide to remove all usernames from your data. This is pretty simple

Highlight column A, click on “Data”, then “Filter views”, then “create new filter view”.

Next click on the upside down arrow next to “words” in A1 and choose to sort A–>Z

You will see all of the numbers and weird characters come first at the top of the list

If you scroll down you will see all of the usernames in a row (since they all start with @), so just highlight all of the rows: left-click on the row number next to the first username and then hold shift and left-click on the row number next to the last one.

Then right-click anywhere on the row numbers and choose to hide all of the rows

Alternatively, you could choose to only see the usernames by highlighting all of the other rows without usernames and hiding them

An Easier Way To List Or Remove Usernames?

There is a second option on how to list only the usernames or exclude them. But it is worth noting that it does not ALWAYS work due to some of the complexities of Google Sheets.

Basically, as soon as you’ve populated your pivot table.

Alternate Step Seven:

Highlight column A, click on “Data”, then “Filter views”, then “create new filter view”.

Next click on the upside down arrow next to “words” in A1.

If you want to only list the usernames, in the dropdown menu choose “Filter by condition”, then “Text starts with” and then just input the “@”. See below:

You may see duplicates of the usernames because of a comma at the end or something of that sort like “@searchish” and “@searchish:”

To resolve this, return to step 1 and after completing that first step go to “edit” then “find and replace” then choose to search for “:” and replace with a space ” ” and hit “replace all”.

By contrast, to exclude usernames, after you click on “Filter by condition”, instead of choosing “Text starts with”, choose “text does not contain”.

Research the top usernames?

If you chose to find the most common usernames, you may can do some follow-up research on them by searching for various accounts and email addresses using the same usernames all at the same time. The method for doing so is pretty easy and explained in the post Find if a Username is listed…

The post will show how to run the profil3r script with one command as shown below

Now regardless of whether you removed your usernames or not, continue as follows

Now you can exit out of the filter view by hitting the X on the right side (you might have to close out your Pivot Table Editor in order to see it.

Eighth, click the little empty box above the “1” marker for row 1 to highlight everything

Click “Data”, “filter views”, then “create new filter view”.

Ninth, click the little upside down triangle in column B and choose “Sort Z->A”

And now you see:
The list above accurately shows that OSINT Combine is affiliated with OSINT training, platforms, and speakers, in addition to Australia and we see the twitter handles of a number of known OSINT experts with links to the organization.

That’s it, you’re done!

Data Analysis for Beginners

This post explains how to conduct Data Analysis on large sets of data using Google Sheets and is intended for people with no background in the subject. The post will explain how to import data into Sheets and how to summarize, sort, filter, and modify the data by performing custom math on it. Finally, we will walk through how to cross reference data from different sources.

CREDIT WHERE IT IS DUE: The post is based on following the instructions from Lam Thuy Vo’s book Mining Social Media. Chapter 6 from this book focuses on Data Analysis is available online here. This information was originally identified by the website Tools For Reporters, here.

In the words of the author, you learn “how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.

This post will address how to import, modify, and analyze data in Google Sheets.

Part 1 we will go over how to import data into Google Sheets, make it recognize numbers, and using the function for splitting data from one column into two columns.

Part 2 will go over how to create a Pivot Table (which provides statistics about our raw data) to count how many times a specific value occurs in a column. This will be used in the example to find out how many times a twitter account tweeted on each date. The goal of the example is to identify if the twitter account is a bot based on the frequency of its tweets.

Part 3 describes how to use formulas and functions to analyze data. While Part 2 aggregated how data, Part 3 will use math equations such as find the average number of tweets per day by a given Twitter account in order to see if it is within the bounds of a normal human user or a bot.

Part 4 addresses how to sort and filter data. You can choose one value in your data (such as the number of retweets) and order all of your data based on ascending or descending number of retweets. You can also sort the data in the pivot table (which is based on the raw data). Filtering allows you to hide data you do not need or find the most relevant data (maybe search for all tweets that mention a specified person).

At this point, the reader has a general understanding about what kind of data analysis is available, but there are too many functions and tools to learn them all. It is better to obtain data and then consider what you would like to do with it. Then work backwards to look at the Sheets functions and tools available to find a way to achieve your goal. Note that Parts 3 and 4 reference lists of the available functions and tools.

Part 5 addresses how to cross reference data from different sets based on a common data type or value, for example if both datasheets have a “dates” column.

Part 1

Importing Data to Google Sheets

First we are going to import the data, which was originally obtained directly from the Twitter API (a subject for a separate post) and we start with the data in CSV format.

There are a plethora of potential data sources that allow you to download data in this kind of format. For example, the FEC provides CSV data for public campaign financing (click here) and OpenSecrets provides CSV data on corporate lobbying records (click here for Exxon example)

Go to Google Sheets and start a new sheet, it should say “blank” under “start a new spreadsheet” on the left side.

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file, (preferably a csv file) the following window should appear:

Then choose the following in order:

1- under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

At this point we are not worried about whether Sheets can differentiate between a number, word, or date.

Now make a copy of the original data set. (you can also use “file” then “version history” to look at past versions)

8 – Click on “sheet1” (you will make this one the original to put aside in case you need to reference the original data later) and rename it “original data”, or whatever.

9 – Then, to create a copy to work with, click the small arrow next to “Sheet1” / “original data” or whatever you call it now, and in the menu that appears you will choose “duplicate”. In the book they name the second sheet “step 1: modify and format” so we will use the same here for the sake of clarity.

Transform data in columns from string to numbers

Then we choose to highlight column C by clicking on the “C”. Then click on “format” then “number” then “number” again. This transforms every value in the column from a string to a number. Repeat this process for the Retweets column which also contains numbers.

Split Text To Columns

We want to find out how many times the twitter account tweeted on each day. So we will count how many times the same date occurs in our raw data on the tweets.

To do this, we need to create a column that only includes the dates of each tweet and then count how many times each date (or “column value”) occurs. So if March 3rd occurs ten times, that means ten tweets came out on that date.

The problem is that our data only provides a column that provides the date and time together for each tweet. That means that several tweets on the same date, (example: “March 3rd, at 12pm” and “March 3rd, at 4pm”) will be counted as separate values instead of identifying that the tweets occurred on the same date.

The solution is to use the Split Text to Columns tool which will separate the dates into one column and the times into a second column. This will let us count the values in the dates column so we can see how many times each date occurred.

How to Use the Split Text to Columns Tool

1 – Click on the new sheet (so you are not working with the original data) and find the “Split Text to Columns” tool under the Data menu

The tool looks for common characteristics in the columns (like a semicolon in-between two numbers) so that it can separate the data.

2 – Choose a column to split and then create an empty column to the right of it.

3 – To do so right-click the letter above the column and choose “Insert 1 right”.

Now you can use the Split Text to Columns tool without overriding any data.

4 – Click on the column you want to split so that it is highlighted, and then in the menu above click on “Data”.

5 – Click on Data and then choose Split Text to Columns.

It will split the text between the two columns based on the default setting “Detect Automatically”. But a small window appears, titled Separator, and it will give you different options for how you want to separate the data.

6 – We will choose “Space”.

Note that these columns of dates and times are still formatted as strings. That is because the separator tool works best with strings so it makes sense to work with the data as strings now, and then convert it later.

7 – For the sake of clarity, we now change the column headers to “date” and “time” to reflect the information in each column.

Now the information is ready to create a Pivot Table, which will summarize the data for us.

Part 2

According Lam Thuy Vo, at this point the data is prepped and the next part is to aggregate the data by using pivot tables and formulas.

Pivot tables can, according to the author (p. 110 from the book), “summarize data…in various ways…we can count the number of times a term occurs in a column or calculate the sum of numerical values based on a date or category.” This is done with the creation of a summary table that gives statistics about our raw data.

In the example from the book the reader uses pivot tables to find how many times each date occurs in a column.


Recall that the data originally provided the date and time in the same column.

so first the data looked like this:

This image has an empty alt attribute; its file name is image-37.png

Therefore in part 1 the reader had to separate the dates into a separate column. The point is so that every time a tweet occurred on a specific date, the column value would only be the date. Therefore if we count the number of times the same date/column value occurred in that column, that would be the number of times a tweet occurred on that date.

How to Create a Pivot Table (shorter explanation)

Note that if you want a short explanation on creating pivot tables, google provides one here.

1 – In short, to create a pivot table you highlight the data you want to use, find and click on “create pivot table.

2 – Then when the table is created, you input data into the table by using the pivot table editor. In the editor there are two options, “rows” and “columns”, you will use these to choose what data to include.

3 – Within the Rows/Columns options, you will choose the data by choosing column headers from the original spreadsheet and each unique value from that data will be listed once in the table. Rows/Columns also lets you order the data ascending or descending.

4 – Finally, you can use the pivot table editor’s “values” option to summarize the data (by adding, counting, finding the average, etc.) or the “filter” option to filter out data values based on the value itself (in which case you choose which specific values to include) or based on a conditional, which is explained below.

Select Data and Create Pivot Table

Returning to the guidance from Mining Social Media: The first step in creating the pivot table is to choose the data that it will analyze and we want to choose all of the data on our sheet so we click on this little rectangle in the top left where the column and row headings meet. This highlights everything so that the pivot table will have access to all of the data. The pivot table will not necessarily use all of the data, but this gives the option to use it for whatever data we want.

This click on the Data tab and then Pivot Table. A window will pop up, make sure New Sheet is chosen and then click Create.
A new sheet will appear, named Pivot Table 1.

Pivot Table Editor

On the right side of the pivot table there is something called the “Pivot table editor”. You use this to put the highlighted data from the sheet into your pivot table and choose how to analyze it.

Note that it suggests some possible things to do, and it even recognizes your data column headers in its suggestions. You can find out what these suggested options would look like with your data. To do this, if you hover your cursor over any of the suggestions, a magnifying glass appears on the right side. Click on the magnifying glass to see what kind of table these suggestions would produce.

For example, we click on the magnifying glass that appears next to “Number of unique id for each language”, and get the table seen below. The table shows that there are 5 times that a tweet is identified as “ar” (arabic) in the data under the language column’s information. This does not put the data into your pivot table, it just shows what it would look like.

Returning to Our Pivot Table and Counting Dates

As previously noted, there are options on the right side there are options for how to “populate” the pivot table with data and analysis. We want to create a column with each unique value from the “date” column in our sheet. Eventually we want to show next to that column how many times each of those dates occurred, but for now we just want a column that shows each unique date.

In the Pivot Table Editor we see the Rows option and click on Add. This means that you are selecting to put your data in rows and when you click add you choose what source of information (which column’s information) you want to input into the rows of the pivot table.

A drop down menu appears and it has the names of each of the column headers from the “step 1: modify and format” table. You click on Date and now the table should look like the screenshot below.

It might seem confusing that the Rows option basically entails choosing a column from the raw data. It is best to think of it as choosing what data will be added into rows.

A new column will appear in the pivot table (named “date”) and in the rows below it each unique value from the Date column (from the “step 1: modify and format” sheet) will appear once. Recall that in the “step 1” table each of these unique values (which are the dates) appeared several times.

If we had chosen “Columns” and then “date”, we would have gotten each unique value (date) in a different column in one row. See below:

Now returning to our table that has each unique value in different rows in one column. Our goal is to count how many times each unique date occurred in the date column of our sheet (“step 1: modify and format”).

So now we want to do analysis on the unique values and display it in a new column.

The Values option in Pivot Table Editor will let us do analysis on our data. So we choose which data we want to work with (in this case it will be the “date” column). To do this we go to Values and click Add. A drop down appears and we choose “date” once again.

Just to be clear about how the Values option works, you choose one of the columns from your raw data and then choose a function to apply to it. So for example, you can add every number in the column together, or multiply them, count them, of find the average.

A table appears under values that shows that “date” is the column we are working with and under “summarize by” we can click below it and a drop down appears with different ways to summarize the data. We choose COUNTA. This option will count the number of times each unique value in the date column appears.

There was also a COUNT option available, but we do not use that because it will only work with data that is formatted as a number. Other options include SUM, AVERAGE, MEDIAN, PRODUCT, MAX, and MIN. Google has a help page here that lists and explains each of the available functions.

Also see that a new column appears (named “COUNTA of date”) and each shows how many times each date appears and automatically puts the count for each one next to the unique value in the rows of the first column.

Success! we now have a pivot table that shows how many times each date occurred in our data. The first row shows that the date 2016-12-10 occurred 3 times, meaning there were 3 tweets on that date.

The author explains in the book that this information is useful because you can identify if a Twitter account is probably a bot based on how often it tweets. The author sites that, according to the Digital Forensic Research Lab (, tweeting more than 72 times per day is “suspicious.”

With that information in mind we can look at our data on a twitter account and judge if it is tweeting on any give day at a “suspicious” rate. (the answer is “yes”)


The values option in the dropdown gives a list of ways to summarize data but they are not always easy to understand. Here is an explanation of the ways to use Values to summarize data:

  • SUM – Returns the sum of a series of numbers and/or cells.
  • COUNTA – Returns the number of values in a dataset.
  • COUNT – Returns the number of numeric values in a dataset.
  • COUNTUNIQUE – Counts the number of unique values in a list of specified values and ranges.
  • AVERAGE – The AVERAGE function returns the numerical average value in a dataset, ignoring text.
  • MAX – Returns the maximum value in a numeric dataset.
  • MIN – Returns the minimum value in a numeric dataset.
  • MEDIAN – Returns the median value in a numeric dataset.
  • PRODUCT – Returns the result of multiplying a series of numbers together.
  • STDEV – The STDEV function calculates the standard deviation based on a sample.
  • STDEVP – Calculates the standard deviation based on an entire population.
  • VAR – Calculates the variance based on a sample.
  • VARP – Calculates the variance based on an entire population.

Part 3

Formulas and Functions

Google Sheets has a basic instruction page and video here to explain how to use functions to create formulas.

The intro video explains that a Function is a built-in operation in Sheets, such as adding or subtracting. Functions appear as options in dropdown lists and they can also be typed directly into Sheets. For example, we used the Function COUNTA in the pivot table to count how many times each date appeared.

Within Sheets, we can also type functions directly into a cell by typing the “=” sign, the name of the function, and then specifying what data (or the location of the data) in parenthesis.

For example, look at the sheet below:

In this case we want to add the values in the two cells A2 and A3, and put the return value in the cell B4.

1- So we start by typing the “=” sign so that sheets knows we are typing a function

2- We type the name of the function, in this case the function name “SUM” is for doing addition

3- Finally specify which cells we want to add by identifying them in parenthesis and separating them with a comma. If you hit enter, the cell B4 will just have the number 4 in it and the formula disappears from view but still remains for the cell.

Side Note: when specifying cells in parenthesis you can use (A2:A7) to mean add all the cells from A2 to A7. You could also type (A:A) to refer to the entire A column. You can choose from cells in a different sheet within the same file by typing the name of the sheet in quotes followed by an exclamation mark. So if you had a different sheet named Sheet 2 and you wanted to get the sum of cells A2 to A7, you would type =SUM(‘Sheet2’ !A2:A7). Also, you could just type =SUM( and then use your cursor to choose the cells.

A Formula uses a Function (or several Functions) to obtain a specific result. So in the previous example we created a Formula in cell B4 that used the SUM function to add the values of A2 and A3.

Smart Copying

The author goes on the point out the value of smart copying. Accordingly, if we want to find the length of each tweet, we can return to the sheet Step 1: Modify and Format and add a column to the right the text column (which is the H column). Right-click on the column header and choose “insert 1 right” (which creates the empty I column).

So if we type the formula =len(H2) in the I2 block and then hit enter, we will get the number of characters (of “length”) for the H2 block.

Sheets actually suggests using smart copying to fill in the column below with the same formula but accounting for the different placement of the cells.

or you could find the little box in the bottom right corner of the first cell with the forumla in it and pull down to other tabs. This will also smart copy to the other cells.

Part 4

Sorting Data

Vo says that the next step is to learn how to sort and filter the results, or data in general, to “rank or isolate data.”

In short, to sort or filter data, you want to remove the formatting by copying and pasting the data into a new sheet, but paste using Paste Special. The =n you highlight it and create a filter view, which allows you to sort and filter. This does not alter the data, it only rearranges it, or hides parts of it.

Vo uses the example goal of sorting the data in the pivot data to see how the suspected bot tweeted on its busiest dat. To do this, Vo explains how to do so by “creating a new sheet with our aggerate results and changing the entire sheet to a filter view.

Paste Special

We will want to copy and paste the data from the pivot table into a new sheet to isolate our work. Note that those cells have data and formulas, so we want to only paste the cells’ data without the formula. To do this we will use Paste Special.

First we create a new sheet by clicking on the plus sign on the bottom left.

The next step is to highlight the cells with the data you want, which in this case is the entire table.

Open the new sheet (named “Sheet 3”) and right-click in the A1 box.

In the drop down menu, choose “Paste Special” and then “Paste Values Only”. This will paste only the data, and not any of the formatting (excluding the formatting includes excluding the formulas). The data is now formatted as strings of integers.

We need to convert the formatting of the data (the dates) from integers to dates (in the formatting sense of the word dates) so that Sheets will adjust how it deals with this data. To do this, you highlight the newly pasted cells and click the Format option in the title bar. From the dropdown menu choose Number and then Date.

Filter View

We need to turn the data into a filter view.

First, we select all of the data in the sheet by selecting the box in the top left corner between the row and column headers.

Clicking this box highlights every cell in the sheet.

Next look at the tools under the headers and find the filter icon, which looks like a funnel, see below:

If it is not there, you may need to click the 3 dots icon seen below. When you see the filter icon, click on the upside down triangle to the right of the icon.

Make sure the cells are all still highlighted/selected and click on the triangle and choose Create New Filter View.

This creates a new heading row and filter icons at the tops of the columns, as below:

The new icons in the first row can be used for the filter functions. If we click on one, (in this case we are using the Date column) we are given a dropdown menu with sort and filter options.

Rearrange in ascending or descending

At the top there are options to sort the data in ascending or descending order (“Sort A->Z” means ascending, regardless of what kind of data you are using). If we choose to sort in ascending order based on the Date column, the values in the COUNTA column will also be rearranged to align with the Date column values.

This is also where it is important that the dates column is formatted as dates. Sheets will recognize the values as dates and arrange them in chronological order. If we had not changed the formatting Sheets would view the values as integers and arranged them as lowest to highest.

Filter by value

We can also filter out certain values by using the options at the bottom of the menu. There is a list (that continues downward past what is visible) of every kind of value in the dates column and there is also a search function if you have a particular value in mind. By unclicking the checkmark next to one of the values, Sheets will hide the value, but only in this particular sheet.

Filter by condition

Finally, the filter dropdown menu gives the option to filter based on a condition. A condition is basically an “if…then…” statement. If we click on the filter option in the menu a minimized dropdown will appear, if we click on that we see a list of conditions we can use.

For example, if we were looking through the content of the tweets, we could use this to look for tweets that mention something specific.

Part 5

Combining Data from Different Sources

Sheets allows us to cross reference data sets and combine them into one data set. This is also known as Merging Data Sets.

What are we going to do specifically? We will take two pivot tables in two separate sheets. The first table (from before) will show how many times per day on specific dates a twitter account (the bot account @sunneversets) tweeted.

The second table shows the number of times a second twitter account (a human’s account, @nostarchpress) tweeted per day on the same dates.

Then we will create a third sheet where we put the data side by side. To do so, first we will choose a date range for the data we want to look at and we create one column with each date in the range listed in separate rows. We name the first column “dates” and then name the next two columns “Pivot Table 1 – sunneversets account” and “Pivot Table 2 – nostarchpress account”. Finally, we create a formula that we will put in the 2nd and 3rd columns that will automatically find and input the relevant data.

The end result will look like this:

This image has an empty alt attribute; its file name is image-36.png

How to Cross Reference Data

We use the formula called =vlookup() to cross reference and merge the data sets based on a common value. Basically that means that both data sets have to have a column for the same kind of data (such as a date column). One data set shows how many tweets occurred on each date, while another data set might show how many political speeches occurred on those dates.

To provide an example of a second data set to reference, Lam Thuy Vo provides a second twitter account’s csv file (nostarch_tweets.csv) on her GitHub page.

This new file is made up of data on a human being’s twitter account.

Merging the Spreadsheets

Vo points out that the first step is to create another new sheet by clicking on the plus sign on the bottom left in Google Sheets. Name the new sheet “cross reference sheet”.

As a reminder, to import data from a csv file into a Google Sheets sheet, open a new sheet and then take the following steps:

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file, (preferably a csv file) the following window should appear:

Then choose the following in order:

1 – under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

4 – then we split the dates and times from the one “created_at” column into two columns, one for dates and one for times. To do so we highlight the column, click on Data, choose “split text to columns” and when a window pops up asking to choose a separator, choose space. Name the column with dates “date”.

5 – highlight the date column, click on format and in the dropdown choose number, and then date.

6 – we also need to turn this data into a pivot table like the last one, so click the box in the top left corner between the headers of the columns and rows so that everything is highlighted, click on the data tab, choose create pivot table, create new, name the new pivot table Pivot Table 2, click on the new pivot table and under Rows choose date, under Values also choose date, and then again under Values choose COUNTA.

Prep Cross Reference Sheet with the Common Value

In the sheet named “cross reference sheet” we are going to focus on the data from the two csv files that occured within a specific range of values in one of the columns. In the example from the book Mining Social Media, the author focuses on the data in the date range April 13th to May 1st 2017. To do this, you type “date” in the A1 cell and then need to input the dates in the column below. To do this quickly, type the first two dates in the first two cells and highlight them as seen below.

Next, click on the blue square on the bottom right and pull down to highlight the 20 empty cells below it. Sheets will figure out the pattern and automatically fill the rest of the cells with dates.

Using the =vlookup() Formula

According to Lam Thuy Vo, the =vlookup() formula “looks at a value in one table, looks up the value in another table, and then retrieves data based on this common value”.

So we will use the sheet “cross reference sheet” that has the first column of dates for our date range, and then we will put formulas in the second and third columns that will find the relevant data from the two pivot tables and input the data into the “cross reference sheet” columns.

But first (according to para 2 on page 118) we need to set up a new column for the @sunneversets100’s daily tweet counts right next to the date column.

we start by naming the second column “Pivot Table 1 – sunneversets account”. This column will have the number of tweets for that twitter account listed for each date identified in the first column. To do this, we will write a formula for the first cell (located at B2), that will look up the value in cell A2 (which has the date 4/13/2017), and look in Pivot Table 1 for the value associated with the date 4/13/2017 and input it into cell B2 of the cross reference sheet.

The =vlookup() formula takes four “arguments” in the “()” section.

First Argument – This argument is the value you that will be looked up in another table. In our case is the the first date in the series of dates, which is identified by its cell location at A2. So we put it as =vlookup(A2, …) .

Second Argument – This is a bit complicated and therefore best explained in the author’s own words:

Recall that Pivot Table 1 looks like this:

So we are telling the =vlookup() formula to look at columns A and B because A has the common value (the date) that exists in both pivot tables and B has the values that are unique to Pivot Table 1.

Third Argument – This argument might seem a bit redundant. The formula needs to know where the relevant data is located in the Pivot Table sheet relative to the column with the common value. The common value in Pivot TAble 1 is the date in the date column (the first column) and the relevant data is in the second column next to it. We communicate the location to the formula with the number 2. The number 2 informs the formula that the data is essentially, one column to the right. so our formula looks now looks like this:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, …)

Fourth Argument – In the fourth argument we tell the formula, as the author explains, “whether the range we’re looking at has been sorted in the same order as the table we created for our data merge.” This is not the case, so we input FALSE into the formula. The author recommends to always input FALSE because it will still find the data even if it has been sorted. The complete formula is as such:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE)

Input the =vlookup() Formula

In our “cross reference sheet” we input the formula in cell B2 as seen below:

cross reference sheet

The formula looks at cell A2 in the sheet “cross reference sheet”, sees that the value in the cell is 4/13/2017. The formula then looks at the “Pivot Table 1” sheet (see below) and looks in the column A to find the same value. It finds the value 4/13/2017 in the 5th row of the A column, and then looks at the value in the B column to the right and see that the value in B5 is 1054. So the formula brings that value back to the cell B2 in “cross reference sheet”.

So, upon hitting enter, the formula shows the value 1054 in cell B2 (as seen below). We can smart copy the formula into the rest of the column by clicking on the small blue box at the bottom right of cell B2 and dragging it down to the rest of the column.

In order for the third column to perform the same task for Pivot Table 2, we use the same formula but change “Pivot Table 1” to “Pivot Table 2”. Smart copy again and the table should look like this:

Notice that some cells have #N/A written in them, that is because there was no data available. There is also a red mark in the top right corner of these cells which indicates that there was an error. We can fix this by altering the formula so that it knows that if there is an error, it should just input 0. This is accurate because when there is no data in our case, that was because there were no tweets from the account on that date.

To make this change we use the =iferror() formula. This formula takes 2 arguments.

The first argument is the entire previous formula followed by a common (it might seem like the formula should be inside a set of parentheses but that is not how this works). we also remove the = sign from the =vlookup() formula.

So we take the formula =vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE) and put it in =iferror() as the first argument. Which yields:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), …)

The second argument is what we want to input in the cell if the original formula yields an error. In our case we want a 0 so we use a 0 as our second argument. The final product looks like this:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), 0)

Now we replace the formula in the sheet “cross reference sheet” with this one and we get the following:

That’s it! Congratulations, you have successfully learned to cross reference data from two different sources!

The Path Forward

Lam Thuy Vo does a great job of addressing the path forward after you have learned the functionalities to perform her example. The following sources can be used to learn or merely as a reference as you use Sheets for you own purposes

The Google for Sheets is here –

A list of available formulas –​/docs/table/25273/).

If you find yourself performing the same task repeatedly, the author notes that you may want to write your own custom functions and there is a handy walk-through here –​/apps-script/guides/sheets/functions/.

The author sums up what you have learned as follows:

In this chapter, you saw how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.

Congratulations again!

U.S. Politics and Nonprofit: a case study in corruption

First published on

One of the most important but unnoticed changes in the field of investigative research in recent years is the growing role of nonprofits connected to politics and corporations. U.S. corporations are increasingly using charitable donations as a new form of political influence at a time of increasing scrutiny of the more established methods such as political donations and lobbying.

The total amount of U.S. corporate charitable donations for political influence is “280 percent larger than annual PAC contributions and about 40 percent of total federal lobbying expenditures”, according to an extensive study by the National Bureau of Economic Research.

However, this also presents a great opportunity for investigative research. The U.S. government provides a unique and diffuse array of public databases that are a great source of information for any diligent researcher. Moreover, corporate efforts to influence politics via donations to nonprofits are often well-documented for anyone that can connect the dots across these databases.

Another Version of Corporate Influence

This development has been well-documented in several news stories in the past 20 years but the best source of evidence and information comes from the National Bureau of Economic Research (NBER). The NBER published an extensive study, “Tax-Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, that provides in-depth evidence revealing that this is now a common practice.

With this in mind, I will explain how to research nonprofits linked to politicians (the same methods work for nonprofits owned by the extremely rich) and show how to make sense of the available public records.

Joe Barton Foundation

One of the best examples of political misuse of a nonprofit is the Joe Barton Foundation. The founder and namesake arguably created this nonprofit primarily to generate good PR for himself and payouts to friend and family.

Years later, an article in the Washington Times summed up that
“[Joe Barton] the top Republican on the House Energy and Commerce Committee operates a foundation that has raised donations from the industries his committee oversees…taking credit when companies give directly to community groups in the foundation’s name – essentially bypassing a 2007 congressional requirement that donations from lobbying interests to lawmakers’ charities be disclosed”

The NBER study’s authors referenced Congressman Joe Barton as someone that used his nonprofit for his own benefit, so I chose him and his foundation, the Joe Barton Family Foundation, as a case study that shows classic methods of embezzlement and corruption.

Classic Methods

There are three well-established ways to look for signs of embezzlement. First, investigate whether an official used their position of power to hire a close friend or family member into a job even though they are unqualified for it. This is not proof in and of itself. But embezzlement often requires more than one person so the first person needs a second person that they can trust (like a friend or family member).

For example, Former Angola President Jose Eduardo dos Santos used his position to hire his own daughter (Isabel dos Santos) to run the state-owned oil company. Isabel used that position to embezzle so much money that she became the richest woman in Africa, according to the BBC.

In another example, it was discovered in 2016 that then-Prime Minister of Iceland Sigmundur Davíð Gunnlaugsson used his position to make his personal investments grow substantially. But to avoid a conflict of interest, Sigmundur first sold his investments to his wife for $1 according to The Guardian. In this case, Sigmundur assumed that his wife would share the ill-gotten gains with him.

Second, check whether the organization paid for work that never existed. Isabel dos Santos tried to hide her embezzlement by signing contracts for the state-run oil company to pay a second company (which she secretly owned) for products and services that never existed. Therefore, on paper, it looks like the money was spent for legitimate purposes. A trademark of this activity is when financial documents say that the company paid for something vague or hard to prove.

Finally, a researcher can look for a company or organization that is receiving money but does not seem to be producing or doing anything.

Nonprofit Funding

At this point, it is worth discussing nonprofit funding. There is a popular misconception that it reflects poorly on a nonprofit if it spends a lot of its money on running the nonprofit itself (overhead costs), rather than spending the money on its service or charity. It is therefore easy for critics of any nonprofit to point out how much of the organization’s money is being spent on overhead costs.

In truth, the majority of research (see hereherehere, or here) asserts that the percentage of a nonprofit’s funding spent on overhead is not a sign of how effective the organization is at its purpose. This is a red herring and researchers should avoid making conclusions based on the nonprofit’s overhead.

However, in our case study, we will look into how the nonprofit spent its money and if there is evidence of any work at all resulting from those costs.

This is an important difference from analyzing how efficiently the nonprofit spent money on work that did happen.

Following the footsteps of Jerry Seper and his investigation published in the Washington Times (“Exclusive: Barton Foundation Not So Charitable“)

As noted above, politicians commonly use charities to their benefit to improve their public image and even gain financially at times. At first glance, it might be difficult to imagine how a charity could end up gaining political support for a candidate. This case study will demonstrate how one politician used his charity in this way.

This study follows the case of Congressman Joe L. Barton during the period he was Chair of the House Energy Committee (that fact will be relevant later) from 2004 to 2008 and his creation of a private foundation. These events occurred a while ago, but I am using resources and methods that are intended for modern-day investigations.

Local Press

While he was a sitting member of Congress, Barton repeatedly made public statements about his foundation’s work and its donations to local charities. Usually when someone owns a foundation with their name on it, they are using their own money to finance it and its grants.

Barton founded a foundation in 2005 called the Joe Barton Family Foundation. The reported purpose of this foundation was to support charities in Barton’s district, Texas’ 6th congressional district.

A quick google search would show the kinds of PR that the foundation generated for Barton. Here are some examples.

Press reported that the Barton Family Foundation pledged to raise up to $400,000 to help build a Boys and Girls Club and to raise $500,000 to help build a regional kitchen and offices for the Meals-on-Wheels program. Both projects were in Barton’s district.

For the local Boys and Girls Club (BGC), Barton publicized his role in supporting the project at public events. According to a statement by AT&T, Barton accepted a $50,000 check on behalf of the BGC at a public event organized by the donor AT&T. Barton made public statements about the project and an AT&T spokesperson publicly thanked Barton for making the project possible.

Barton also attended the groundbreaking ceremony and gave a public speech. A local paper reported that Barton was

introduced at the event as the person “making this event possible” and then “Barton took the stage and said the dream had become a reality”.

Barton’s foundation blatantly gained attention in the local press and praise for him. Research by the Washington Times after the fact showed at the time the local Ennis Daily News reported Barton was “the special guest at a VIP Reception” because he had made the first donation to the Meals-on-Wheels program.

Texas Monthly magazine reported in 2005 that a local town hall meeting “burst into applause” for Barton when he announced his foundation’s pledge for the Boys and Girls Club.

The Washington Times reported in a 2009 article: “The Barton foundation was among several groups honoured for their philanthropy during Nov. 6 ceremonies in Fort Worth sponsored by the Association of Fundraising Professionals. The foundation was flagged for its pledge of $500,000 for the Meals-on-Wheels program.”

Starting the Investigation

Along the way, we will be using the reKnowledge DIB extension to facilitate, document, and visualize our research so it will regularly come in handy.

Picture2After finishing with the requisite googling, the best place to start an investigation into a U.S. politician is often their financial disclosures. Congress has a website where the public can access these disclosures. This is where politicians are supposed to disclose other sources of income and relevant outside activities., a website that allows you to search many government registries at once, is usually a good place to start. With nonprofits, your results in Open Corporates may vary depending on the state. A search for the foundation’s name finds registration information for it from a Texas registry.

The registration information in Open Corporates includes an incorporation date (April 5th 2005), an address (3100 sprocket drive, Arlington, Texas), a few names of “directors”, and a link to the source of the information (The Texas Comptroller of Public Accounts) with a URL (

We do not know yet which, if any, of these entities and people could be important so we use the reKnowledge DIB extension to identify entities and take note of them for now.


Sometimes the original source of the information has a bit more detail, so I look up the foundation at the URL provided and find the registration itself.


One notable addition is the Registered Agent Name, one Gary Martin.
Finally, one easy step I learned from the book, A Deal With the Devil by Blake Ellis and Melanie Hicken, is to check if the place has a real physical presence.

You can do this by checking the address in Google Maps or Street View. A more legitimate operation will often have some form of office with their name at the address.

But when we search 3100 Sprocket Dr what we find is not an office for the Barton Foundation, but a company called Martin Sprocket & Gear. Using reKnowledge we create a node for this company and link it to this address. If we check with Street View there is no sight of any office with the words “Barton” or “Foundation”. We only see the business’ headquarters building.


If the name “Martin” sounds familiar, that is because we just saw that the Registered Agent Name for the Barton Foundation is someone named Gary Martin. A search in Open Corporates shows that this person, Gary Martin, also owns the company that is located at the same address where he appears to have registered the Barton Foundation.

A Registered Agent is an entity like a company or a nonprofit organization that can legally designate a registered agent to receive correspondence on its behalf, such as legal summons. If the agent is not affiliated with the company, it will usually be a lawyer that provides this service for a fee.

In practice, the agent can act as a barrier to the entity by hiding the entity’s owner, address, phone number, etc. On paper, the entity can often list the registered agent’s contact information instead of its own.

Bad actors can abuse the role of a registered agent to hide things about themselves.

Martin and Barton Relationship

Rather than being the foundation’s owner, employee, or lawyer, it appears more likely that Martin was just someone tied to Barton and had a separate, maybe pre-existing, relationship.

If we look at campaign contributions on the website for the Federal Election Commission we see Martin’s name come up again. Data from showed that Martin made personal donations to Barton’s campaigns regularly since 1990. There is also a disproportionate amount of individual contributions to Barton from people that work Martin’s company, Martin Sprocket & Gear, and their spouses.


Once again we use reKnowledge to take note of newly discovered relationships focused on Martin. We use the reKnowledge extension to automatically identify known entities and individuals on our screen. Then we can quickly create the relevant relationships between Martin and Barton.


The list below provides a sum of contributions, grouped together based on each contributor’s employer” from individuals to Barton. The description from OpenSecrets about the data explains that, “the organizations themselves did not donate, rather the money came from the organizations’ PACs, their members or employees or owners, and those individuals’ immediate families.”


This further suggests that Barton and Martin had a pre-existing relationship and maybe Barton just asked a friend/associate to register the foundation at his company’s address. That would be easier than creating a real headquarters. There is certainly nothing wrong with running a “shoestring” operation, but as we will soon discover, the foundation was NOT being run on the cheap.

At this point we know we have seen Martin’s name come up directly or indirectly associated with Barton. Barton’s nonprofit is registered with an address where Martin’s company is located. Martin and his company are significant donors to Barton’s campaigns. And of course, Martin is registered as an employee of Barton’s foundation. We use reKnowledge to look back on all of the connections we have seen and visualize them in the analytical workbench. Based on this evidence, we can now assume there is a direct relationship between the two individuals.


At this point, we have well established that the director and registered agent for Barton’s nonprofit is his close associate. In the context of the classic means of embezzlement, this is a data point worth holding on to.

Tax Filings

ProPublica lookup for the nonprofit

We know what was going on publicly with the foundation, but we can look at the tax records to see what was happening behind the scenes. There have been many cases of politicians using their nonprofits to avoid taxes, improve their public image, and/or even enrich their friends and family.

Nonprofit organizations in the U.S. must file their taxes in Form 990 every year and the IRS makes those records available to the public. But they are very hard to read if you do not understand what you are looking at.

There is an IRS public database for looking up the tax records of nonprofits but if you look up the records for the Barton Foundation, you see the records only go back to 2016. However, ProPublica maintains a similar database of filings obtained from the IRS but in a more user-friendly format and with records going back much further.

We can search for the nonprofit by name at the Propublica site and we get a list of the foundation’s records going back to its establishment in 2005. Let’s look at the first tax filing.

At first glance, the tax filing is a daunting and confusing mess of numbers and tax phrases that make no sense. So I will identify the important parts and explain what to look for in them.

Questions to Ask

Here is how the tax record for 2005 appears.


Now that we’ve found the records, we want to consider a few questions to answer. Did the politician fund the nonprofit or are they spending other people’s money?

Did the nonprofit’s actions reflect public statements by the politician?
Who benefited financially from the nonprofit?

To elaborate on the last question, did anyone close to the politician benefit (because that would suggest a conflict of interest)?

Unfortunately, the public tax records for nonprofits come in a poor quality PDF that lacks optical character recognition (which would let us do a word search). ProPublica is converting more recent records, but that does not help us here. The quality of the files will not be a problem if you know where to look.

Whose Money is it?

The first question we will consider is whether Barton is funding his foundation. It is generally implied when someone creates a foundation in their name and announces that they will be donating money to a cause, that the person is donating their own money. In most cases this is true.

Let’s find out if the Congressman was using his own money and if not, how much money was coming into the organization. We will start with the foundation’s first filing from 2005, the records are located here at ProPublica.

The first place to look in the tax record is on page 1 where there is a summary of the nonprofit’s expenses in the section named “Part I: Revenues, Expenses, and Changes in Net Assets or Fund Balances”.

In this section, we see in item 1 that the Barton Foundation gained $235,000 that year.


Before proceeding, let’s look for a section called Schedule B to find out if this money is from the politician. We are looking for Schedule B because if any single contributor donated more than $5,000 then they are supposed to be identified in this section. If Barton is funding this operation, he would need to list himself in this section, especially for the sake of the major tax write-off.

However, we find that this record does not have a Schedule B included. Generally speaking, this implies that no one gave more than $5,000, but does not quite prove it. This is common because many nonprofits do not want to identify their contributors.

For our purpose, this is useful because it means the nonprofit’s money did not come from Barton himself, it is other people’s money. It is still possible to find the donors by looking elsewhere.


Corporations often create a foundation to donate money for tax purposes. So, for example, the Target Corporation created a nonprofit foundation called the Target Foundation. See the tax record below that shows that the one contributor to the Target Foundation is the Target Corporation. The company gives money to the foundation to give that money to charities. In this case, the nonprofit will list its contributor to prove that the corporation donated.


Unfortunately for our investigation, the donors and the recipient did not list their contributions in tax filings. However, it is good to know how to look up who is donating where. Nonprofits often do not list where they received money, but the foundations that donated will often list all of their recipients in their filing.

But first, it is important to understand the state of the data you are researching. Nonprofit tax records will show where the money went, but not where it came from. Furthermore, nonprofit tax filings are generally stored as poorly formatted scanned versions of the paper documents in pdf files, so a researcher usually cannot do a word search for any recipients of grants and donations. Researchers can usually only search for the names of the nonprofit that filed the record, not the names (or any other words for that matter) of anyone recorded in the filing itself.

The solution is to look for the foundation or other nonprofit organization that might have donated the money and see if your nonprofit is listed there.
For example, a little detective work revealed Joe Barton’s relationship with the Community Foundation of North Texas and a search of their donation recipients shows a $25,000 donation in the name of the Joe Barton Foundation.


Where Did They Spend The Money?

Let’s return to page 1, part 1. Here, item 17 shows that the Barton Foundation spent about $52,000 in 2005. For more information about those expenses, we can look at items 13-16. We see here that no money was spent on “Program services” (i.e. charitable activities). However, $35,000 was spent on the vaguely named “Management and general” (more on this in a bit)), and $16,000 on “Fundraising.”

We see the path of the money in 2005 below. A lot of money from unidentified donors goes into the foundation, some goes to an employee (director Thomas L. Driskell) and some go to overhead expenses and a lot is left over.


Spending money on fundraising is to be expected and it is also reasonable that the organization did not perform any services in its first year of existence. But the management expenses are more curious.

Senior Officers and Employees

If we scroll down to “Part V-A” we see the organization’s leadership. There are several names here that we could investigate but one, in particular, jumps out. The foundation paid Thomas L Driskell $32,000 for reportedly working 40 hours per week.

So at the end of 2005, during the organization’s admittedly short period of existence, it has not conducted any charitable services but it has paid Mr Driskell $32,000.


Based on the fact that he was paid with foundation money, Driskell is someone worth investigating.

So now we might ask, who is Thomas Driskell? Is he qualified for this job? What is his relationship with Barton?

According to the tax record, Driskell was working 40 hours per week at the foundation, even though he already had a job running an accounting firm, based on the firm’s registration below.


We are interested in Driskell’s qualifications and his relationship with Barton because we want to determine if there is a conflict of interest here.

With regard to running a nonprofit, a conflict of interest is generally defined as a scenario where the person owning the organization chooses a friend or close associate for a position even though they are not qualified. Their friendship conflicts with their obligation to choose qualified employees. This is not illegal as long as the owner did not choose their friend over a separate applicant. However, there is a practice of nonprofit owners giving money to their friends by giving them a job where they do little or nothing but get paid via contributions to the program. Is that what is happening here?

Who is Driskell? We begin by searching for possible job experience in the field. A search in ProPublica’s database and Open Corporates (which is a good but admittedly not an exhaustive way to search for nonprofit experience) do not show any results that indicate that Driskell had any experience managing a nonprofit as of 2005.

A Google search shows that he is an accountant with his accounting firm though, so he has experience handling money. But the word “nonprofit” does not appear in the section of the website for services offered. Further, the word “nonprofit” appears nowhere on the company website which could suggest that the company does not have experience or offer services for nonprofits.

Driskell and Barton Relationship

While social media is often a great place to start researching if two people have a relationship, neither Driskell nor Barton has much of a social media presence, so we will look at alternative means of investigation.

To begin researching Driskell and Barton’s relationship, let’s look into campaign contributions for Barton. We search the FEC database for political contributions here and discover that Driskell and several other people with the same last name (presumed family members) have been donating to Barton’s political campaigns since the 1990s.


Another good database to search for anything related to a member of Congress is At this site, you can look in a variety of government records databases and many related to Congressional documents. This is kind of a Google search of documents related to Congress. What is most important for us is that this database includes transcripts from Congress so we can look for any statements made by Barton about Driskell from the House.

To do this we can search here for any document that has the names of Barton and Driskell together. Our search has one result, which is the transcription of a ceremony from 2008.

Both men attended the ceremony and in the process of their statements, they revealed that they were longtime friends, previously worked together in politics, and were still close. The following are the key highlights:

– Barton’s was quoted saying that his friend “Tommy” is “the guy who got me into politics.”


– Driskell and Barton were from Crockett, Texas, where Driskell used to be the mayor.

– Barton used to be Driskell’s campaign manager.

– In 2008 the men were still good friends, evidenced by the fact that Driskell and his wife interrupted a vacation in South America to attend a ceremony honouring Barton.

Using reKnowledge to identify and record the relationship of the two men within the text of the page.


We now have a good background on Driskell and his relationship with Barton.

Conclusion on Driskell

We know that Driskell has close personal ties with Barton but we arguably did not find reasonable qualifications for his role as the president in the nonprofit.

We could make a subjective argument that Barton faced a conflict of interest between his interest in hiring the best candidate to run the nonprofit as opposed to his interest in supporting a friend and donor. However, this is legally fine.

What is legal

We do not know for certain how the hiring process occurred. From a legal perspective, according to an article about nepotism from, someone like Barton could decide to hire a friend. But legally, Barton would be in trouble if another person applied for the job but Barton chose to hire his less-qualified friend instead of a more-qualified second candidate.

Other Foundation Employees

Now we can look at all of the foundation’s officers as a whole and it is clear that the officers are all close associates, family members, and/or friends of Barton. If we look at the foundation’s officers from 2005 to 2008 tax records, most officers have the same last name as the foundation’s founder.

There are also the names of Barton’s aforementioned close associates/friends Martin and Driskell. Another name that appears is Betty Hodges, who was Barton’s mother-in-law. And there is also Catherine Gillespie, who worked for Barton’s political office as his chief-of-staff for over a decade, according to local news. This gives the impression of an organization run by close friends and relatives instead of officers that are best qualified for the positions.

As previously noted, Barton can legally hire his friends and family, and this is not proof of any wrongdoing. However, this setup is a practice used corrupt officials often hire close friends and family because they can be trusted. So in the context of everything else we will learn in this investigation, the foundation’s hiring practices are questionable.


It took the Barton Foundation three years to get around setting up its website, which speaks to the lax and unproductive nature of the organization.

The information about the foundation’s website is available to us today but was not available to the investigators at the time. However, it is worth mentioning only to the extent that it adds to the impression of suspicious behaviour in retrospect.

During the Google searching for this foundation there appeared to be no website. Later, in the 2008 tax returns, we see that the organization did spend money on a website.


It is unusual for an organization with hundreds of thousands of dollars to take 3 years to get around to setting up a website. It is ultimately a tiny bit more evidence suggesting that the foundation was created for the sake of making public statements about the foundation, and less so to do real charity.

If we Google the name of the foundation and the word “website” we see a reference on (a site that is similar to the ProPublica nonprofit searcher) to a former website’s URL. The website ( is no longer active.

We can try to find more information on the website and its history with a tool called Carbon Dating the Web. A search for the foundation’s website in this tool shows some results. The tool estimates that the website was created in 2009 based on the earliest references to the website’s existence.


The results also show where we can find an archived version of the website on and


The website gives us some information about what the foundation was doing. We see that in 2009 the website shows that the foundation had contributed to two projects (the aforementioned pledges to the Boys and Girls Club and Meals on Wheels).

Looking at the website over time does not show a lot of activity in the foundation.

This website still only had the same two projects to its name as of September 2012. To be fair, the next archive for the website in June 2013
showed that a 3rd project had finally been added to the list. But 6 years later (shortly before the website was shut down), the last archive of the website still showed that there were no new projects.

Moving Forward with the Taxes

Now that we have established the framework of what information is relevant and where to find it, let’s begin skimming through the next few tax returns to see what happened to the money beyond the headlines.

In 2006, per its filings, the unnamed contributors gave the foundation another $195,000.


The foundation finally made a donation and gave $90,000 to the local chapter of the Boys & Girls Club.


While $90,000 is a significant amount of money, it is important to contrast it with Barton’s public statements. First, recall that in the local press referenced above Barton had initially announced a pledge to raise up to $400,000.

Even though the foundation’s tax records for 2006 clearly state that it only donated $90,000, the foundation claimed that it raised $375,000, according to an article published September 23rd 2006 in the local newspaper the Corsicana Daily Sun.

A Lot of Expenses, Little Output

Also of interest, the foundation continues to have abnormally high expenses for an organization that does little beyond receiving money and paying out one check. The foundation spent over $60,000 on contract labour and consulting fees. But it is not clear what those services did.

federal statement

This shows a consistent pattern of an organization that spends a lot of money on itself with high expenses, paid by donors’ money but does very little work. The expenses are always vaguely described as things like “consulting”. In 2007 we see that the foundation did not receive or donate any money, but it managed to spend over $3,000 on accounting and phone bills.


Let’s look at the money from the aggregate. Up through 2007, we see that for two and a half years, the foundation had received $430,000.

Barton Family foundation

However, during that time the foundation only gave out one donation of $90,000 while at the same time the nonprofit paid over $120,000 in expenses.

Another Paycheck

The 2008 filings showed that Barton hired a new president who was paid $48,000. If you look at the new president’s name, it is hard to deny in this case that Barton has a close personal connection to this individual. Amy Barton, the new president, was Joe Barton’s daughter-in-law.

directors list

Second Donation

In 2008 the foundation finally made a second donation, $55,000 to the local Meals on Wheels.

The Statement of Program Services is very interesting.

statement of program

First, we see that the foundation acknowledges that its role only involves giving financial support to other organizations. This confirms that its role in something like building a new Boys & Girls Club does not involve the actual building, just paying a check. This is relevant because it supports the idea that it is very weird for the foundation to have such high expenses when it is, in theory, not doing much work.

Second, we see another stark comparison between expenses and payouts. Expenses that year are listed at $69,818, while its grant was only $55,000.

Third, there is another disparity between actions and public statements. The foundation pledged to raise up to $500,000, not $55,000.

Finally, and most importantly, we learn how the foundation explains the aforementioned disparity. The organization argues that while it pledged to raise up to $500,000, it believes that it can claim credit for donations from OTHER donors’ who gave their money directly to Meals on Wheels. This allows the foundation to publicly pledge very large dollar amounts while donating much smaller amounts. This is, in the view of this post’s author, as ridiculous as it sounds. We will see in the conclusion that many others shared this view.

We can visualize what we now know about the foundation’s finances during this period below. Donations flow into the foundation and the money is paid out to friend and family in salaries, general internal expenses that were hard to pin down, and some of the money was re-donated to other nonprofits.

Overall, the operation appears to be very inefficient, by turning a large amount in donations on the left (from donors to Barton Foundation) into a much smaller amount of donations to charities (bottom right).



At this point, we have obtained a litany of information proving that Congressman Joe Barton grossly misused the Joe Barton Family Foundation for his benefit.

There are two important notes to be made here.

First, this nonprofit’s activities are not unusual. This example is very standard, which means that you can use the same research methods I have described and you can look for the same kinds of information in other nonprofits.

Second, this information eventually formed the basis of a major investigative news scandal and it is important, and hopefully motivating, to note that anyone could find this information in the open sources referenced above.

While this concludes the investigation, there is an interesting epilogue to the story.


The information above was publicly exposed on April 6th 2009 when an article by Jerry Seper, titled “EXCLUSIVE: Barton Foundation Not So Charitable“, was published in the Washington Times.

Amy Barton pressed for an explanation on the $375,000 claim, resorted to a similar explanation that was provided in the tax filings for the Meals on Wheels project. Per the Washington Times, Amy Barton said the following:


Amy Barton also refused to name the donors.


While Amy Barton refused to name donors, a 2009 report from the New York Times revealed one of the major donors as Exelon Corporation. In June 2008, at a time when Barton had introduced legislation to assist corporations with the recycling of spent nuclear fuel, Barton solicited a $25,000 donation to the Foundation from Exelon, which separately has also donated $80,000 to Barton’s campaign funds.

We later learned in an explanation of the scandal by about the reason why Barton insisted on claiming credit for donations made by donors’ directly to the charities (recall that he publicly accepted a check from a donor on behalf of the Boys and Girls Club), instead of asking the donors to give the money to the Barton Foundation.


Where Are They Now?

After the news scandal, the Barton Foundation was investigated by the Office of Congressional Ethics but no charges were filed. Ultimately, these revelations were not enough to bring an end to Barton’s political career. Barton managed to get reelected again and then again and he was even honoured in congress with a portrait ceremony. But that is not the conclusion of Joe Barton’s political story.

Barton remained in office until his career was ultimately ended by the Me Too movement. Barton stepped down from office in 2018 after being caught in a string of sex scandals.

Barton’s Foundation more or less faded out, mostly spending cash on its expenses. The last tax filing available from ProPublica, from 2017, showed the foundation gained $600 but spent over $4,000

After his somewhat undignified departure from politics, Barton still had his longtime friend Gary Martin hosting a sort of goodbye ceremony in his home in 2018 “in honor and appreciation of Congressman Joe Barton”.

youtube icon_1

Python Scripts, Research Tools for Obtaining Data About Youtube Videos

If you are researching a youtube video there are also several research tools available online for obtaining more information about the video. This post will describe how to do more in-depth research of a video.

The Research Tools

This next part will address some of the tools that are available for researching videos online.

What to Look For

If you are researching videos online there are 3 basic research goals to look for.

1 – Search for videos. There are two subreasons to search for videos. 1a – You are looking for information and it might be available on a video, but you don’t know what that video is. In this case you are looking for videos based on topic. 1b -You have already found a video you are very interested in and you want to analyze it but before you can do so, you need to find the original video. Copies of the video may be altered or edited, plus you might be interested in researching the source of the video. To find the original video you must search for other versions of the video online and find the one that was posted online first.

2 – Try to find out who posted the video. This is difficult and basically involves searching if the youtube account username is also used in social media accounts. Also, you can look up the first commenter, or maybe the first few commenters, that posted on the video because they might know the person that posted the video. You can search for social media accounts with their usernames and look into their common friends for potential candidates. Searching for the original video may reveal that it was originally posted on a social media account, which would obviously make the job easier.

3 – Analysis of the video content. This skill is largely outside the bounds of this blog but we identify tools and guides to for this kind of operation. If you are searching videos to see if they have information on a specific topic (like a person, company, legislation) you can now do a word search in youtube to see if anyone says something like mentioning a person, company, or legislation. This feature can help save a lot of time for a researcher going through different videos.

There is a detailed guide for how to anlyze content of different videos to do things like find the location or verify the video in the  Verification Handbook (

1 – Search for Videos

Searching for videos can be difficult. If you are looking for videos on a certain topic you can try using Petey Vid, a text-based search engine that searches exclusively for videos. Keep in mind that text-based searches, whether they are PeteyVid or Google, can only search for text that is affiliated with a video. So if the video is on a blank webpage and has no affiliated words, the search engine can’t find it.

According to Bellingcat’s Aric Toler, there is currently no way to run an Internet search on a video (this refers to the idea of actually uploading the a video to a search engine that would look for other videos based on it, like a reverse image search). So the next best thing is to get thumbnail photos of the video and run reverse image searches on those photos. The idea is the you are hopefully searching for the photo that appears on a video when it has not yet been started. If this sounds confusing we will walk through an example. At the time of this post’s writing, the youtube homepage looked like this screenshot below:

So if you wanted to search for the video on the top right, you could use the snipping tool to capture the photo that is currently on the video while it has not yet been played. You would do a reverse image search on this photo:

Amnesty International also has a tool for the public called the Youtube DataViewer that extracts data from any youtube video and creates four thumbnail images from the video that you can use for reverse image searches. For example, if we paste the youtube video’s url into the tool and run it, we get the following results.

The results above show the name and description of the video, the video ID, and the specific date and time when it was uploaded.

If we scroll down, we see that the tool created four thumbnail images from the video that can be used for a reverse image search, and there is even a link next to each photo that will do the reverse image search.

There is also a link next to each photo that can be used to do a reverse image search.

2 – Try to Find Out Who Posted the Video

This step requires a bit of time but is relatively simple. Google the username

For example, look at this video here. When we click on the user’s ID in the bottom left corner we are brought to a channel homepage.

That brings us to this page below, which has a strange url that does not identify the user well.

Strangely, if you click on “Home” you will get the same webpage the url changes to show the user’s username. See below, the channel’s homepage is the same but the url has chanced to show that the username is “oregonzoo”.

A quick google search of the username reveals the following Twitter account, which of course provides further information on the user.

When we scroll down to the first two commenters we can apply the same method to try to find their social media.

If we find two twitter accounts associated with those commenters we can try to find common friends by using, see below.

Using this tool we can look for common friends that might be the youtube poster.

3 – Analysis of Video Content

Youtube’s Computer-Generated Transcripts

Youtube has a new feature that makes researching videos easier by generating transcripts for each video. To access the feature, click on the three dots below the video on the right side and then click “Open transcript”.

The computer-generated transcript of the video appears next to the video. The transcript is word-searchable so you can save time by searching for a specific name or company is mentioned in the video. in the screenshot below, I searched for the name Tina Larsen and it popped up in the transcript. Note that the transcript also shows the time when different things were said. If you find something interesting in the transcript you can click on the words and youtube will automatically bring the video to that time.

Depending on the video, you may want to go through all of the content very slowly to try to analysis the background or street signs. You can do this with –

The Metadata Viewer is a great tool for gathering all of the data about a video and its publisher in one place. When was the video posted, by which account, how long has that account been around, is the video geolocated, etc.

If the video is not geolocated, you can look to see if the account that posted it (also known as a channel) has ever geolocated any of its videos, which of course implies the location of the account holder and even the original video itself. Geofinder -(

It is not clear if the following tools are actually useful for the purpose of researching a video, but they are interesting enough to warrant a mention. – will let you edit a video or convert it to an audio file. and – will let you download the video to your device as a video or audio file

More Tools

InVid has a tool for indepth analysis of the video content, for the explanation about how to use the tool click here and tool click here.

Download all of the Comments

You can download all of a youtube video’s comments with the following process. Note that the first time you do this it takes a number of steps but it will be much quicker any other time afterwards once it is set up. You will need to get a youtube api key (you only need to do this process once), find the video id number, setup gitpod (only once), and then hit run.

Getting a YouTube API key:

First, go to “”.

click on agree to terms and services and then click on “agree and continue”

This image has an empty alt attribute; its file name is image-1.png

Next click “select a project”

This image has an empty alt attribute; its file name is image-2.png

Then click on “new project”

This image has an empty alt attribute; its file name is image-3.png

In the next screen click on “create”, and then in the screen click on “select project”, on the left slick on “api & services” and then “credentials”

Then click on “create credentials” and then “API key”

This image has an empty alt attribute; its file name is image.png

You are not done yet, next click on “library” on the left and in the next screen scroll down to youtube and then click on “youtube data api v3”

This image has an empty alt attribute; its file name is image-4.png

In the next screen click “enable”. Now your api has access to youtube data!

To return to your api key, click on “credentials” on the left and you are brought to a page that has your key.

For further information, the api documentation is located here – “”

Find the Video ID

Now get the video id from the youtube video of interest. Do this by opening the video and copying this portion of the url after “v=”

Scraping the Comments

If you do not already have them, first you must sign up for accounts on github and They are both pretty quick and easy. Plus, once you have a github account, all you need to do to get a gitpod account is go to the login page and choose “sign in with github”. Now leave both accounts open in two tabs and go to the following url in a third tab.

If your screen has the Gitpod button on the top left, click on it. Sometimes you screen will have a green button there that just says “The code”, if that is the case, copy and paste the following link in a new tab.

Either option will bring you to the following screen

Click on “” on the left and see the code appear. Find lines 7 and 8 where you need to paste your api key and video id. Make sure you put the apk key and video id in quotes

Now your screen should look like this below (but without the redaction over the api key).


Next, on the bottom right see where it says

gitpod /workspace/Download-All-YouTube-Comments-From-Any-Video $

To the right of the $ sign, type “pip install pytube” and then hit enter

Let it run and then click the little arrow on the top right

You will see the script running like below.

When it is done, click on “allComments.tsv” on the top left and the usernames and comments for the video will appear in chronological order.

That’s it, you’re done!

Conflict of Interest: A Case Study in How to Research Lobbying, Financial Disclosures

This post is about how to investigate ties between corporations and politics, with a focus on personal disclosures and lobbyist reports. Thankfully, a lot of activity between corporations and politics is publicly documented but it remains hidden because so few people know where to look.

This article will explain how you can investigate these links by walking though a case study of one congressman in particular.

Hidden in the Deep Web

Art by Sam Chivers, posted at New Scientist

This kind of investigation focuses on finding information that is hidden in a variety of “deep web” databases. Recall that information in a deep web database is hidden from Google and other search engines (think of the “open web” or “surface web” as anything that you can find in a Google search). So you have to actually go to the database’s website in order to find the information in it.

It is kind of like information being buried underground so no one sees it. Think of the regular Internet as the surface, you google a congressperson and see press stories about their new proposed legislation. But when you dig up the buried information (like their personal business that is registered in a deep web database) and bring it to the surface you can see it in context. In this case, that means you can see that the congressperson is proposing legislation that would benefit their personal business.

That example describes the case of Congressman Roger Williams in 2015 who offered an amendment that would directly benefit his personal business. Williams is not the subject of this article, but you can read more about that case by clicking here.

Opportunities for Research

There is a large gap between what most people will consider a conflict-of-interest and what is legally considered a conflict for politicians. The gap between these two definitions is a great place to investigate links between corporations and politicians.

For members of Congress, the House Ethics Committee sets a official standards to determine if a Congressperson has a conflict-of-interest. “But the burden of proof to show that a member improperly wielded [their] influence for personal benefit is steep,” according to an article by the government-focused news agency Roll Call.

In the aforementioned case of Congressman Williams, the Ethics Committee actually deemed that his actions did not meet the standards for a conflict-of-interest.

Case Study

This article uses a case study of one Congressman John Carter of Texas (hereafter referred to as “the congressman”) in 2018 to showcase research methods that can be used in any political-corporate investigations. (This focuses on data from 2018 because some of the relevant data, see here, is not yet available for 2020.)

So if you decided to investigate this congressman, where to begin?

Political Donations

After starting off with your standard Google searches on the congressman, the next step is to look at who is funding the his campaigns.

Who are they connected to?

When a politician receives any political donation of at least $200, it is listed in the Federal Election Commission (FEC) database at

Political donations can also be searched by using other sites that search the FEC database. In particular I recommend Accountability Project and the Donor Lookup tool at, which are more user-friendly. Depending on whether the donor is a person or a company/organization, this tool will identify the donor’s name, address, occupation, amount of money contributed, and the recipient.

This is an amazing tool for tracking down leads. But in the case of the congressman, there was nothing unusual about his donors.

Public Disclosures

Public disclosures are often a wealth of great information. Politicians have to disclose information about their finances which are recorded in various public databases. You are probably thinking “if the politician publicly disclosed this information, how could it have any secrets?”

As it turns out, public disclosures are regularly the source of scandals, like this and this.

As a side note, here is one example of a scandal stemming from a “secret” public disclosure:

“At the height of the Reno City Council’s campaign to oust strip clubs from downtown, Reno City Attorney Karl Hall worked to sell an office building less than a block away from the Wild Orchid Gentleman’s Club, but didn’t disclose the possible conflict of interest, according to an investigation by the Reno Gazette Journal.” (see the full article here)

Politicians often disclose conflicts-of-interest in these databases for years without anyone noticing. The reason is twofold. First, it is not obvious from the disclosure itself that there is a conflict of interest, you have to do a good deal of legwork. So politicians often, and rightly, assume it is unlikely that anyone would look in the first place. Second, if the conflict of interest DOES come to light, the politician is legally safe because they followed the legal requirements for disclosure.

How to Find Disclosure Reports

Now back to our investigation. The House of Representatives has a disclosure database here, (the search function will not appear unless you first click on “search” on the left side).

Side Note: the Senate has a separate disclosure database, here.

The House disclosure database is actually quite user-friendly, see the basic search function below.

These reports identify the congressperson’s financial assets, income, liabilities, major transactions, and a few others.

I search for the congressman for the year 2018 and get one result. Looking at the list of his assets we see he has some cash in his checking account, a pension, a savings account, and then a large amount of money in Exxon Mobil stock.

2018 House financial disclosure record for the congressman

According to this record, the congressman had between $1 and $5 million dollars invested in Exxon Mobil. Compared to the rest of his assets, this stock is well over half of his money. This may or may not be relevant

Open Secrets provides more in-depth research on this topic, see here, and listed that the portion of the congressman’s wealth invested in Exxon Mobil was actually larger. Specifically, Open Secrets lists that the congressman’s top asset was Exxon Mobil $3,175,000, and the next largest asset was Great Western Bank CD $175,000. For some context on his overall wealth, Open Secrets estimated the congressman’s total wealth at $3,375,000, of which, $3,175,000 was in Exxon Mobil stock.

This information is relevant because it strongly suggests that the congressman had a major interest in the financial success of Exxon Mobil. If the company’s stock tanked, so would the congressman’s personal finances.

Specific Contributor’s Donations

This could still be nothing, so lets pull on this thread and see what else we can find. Lets see if Exxon Mobil contributes political donations to the congressman. If we return to the FEC database and look through the congressman’s donors, we do find Exxon Mobil listed there. But is this important?

To find the answer lets check two things. We look at the congressman’s other donations to his campaign and the amount provided. The actual amount donated by Exxon Mobil is not significant compared to the congressman’s other donors. According to OpenSecrets, the company only ranked number 44 on the list of the congressman’s top donors that year.

Plus, a quick search for Exxon Mobil in the database shows that is donates to most members of Congress and in similar amounts that were donated to our congressman. This confirms that there is nothing unusual about the company’s donations to the congressman.

Lets turn away from the congressman for a bit and focus on Exxon Mobil’s lobbying efforts.


When a company lobbies over any legislation it is arguably acknowledging that it believes the legislation can benefit or hurt the company. And of course, legislation that helps/hurts a company like Exxon Mobil will similarly affect its stock and the congressman’s personal finances.

There are databases where companies document their efforts to lobby the government.

First we go to the lobbying databases for the House and Senate. When we search for Exxon Mobil we find documents like this.

Box number 13 shows that over $3 million was spent on lobbying. Keep in mind this is only one document, though admittedly this amount is higher than most of Exxon Mobil’s lobbying price tags.

The prices listed on lobbying forms are not very precise. To understand why, this this explanation provided in a guide by Propublica:

“It can be tricky to figure out how much an organization spent on a particular lobbying engagement. The law only requires lobbyists to report the amount they were paid for federal lobbying each quarter rounded to the nearest $10,000—and if it’s less than $3,000 in a given quarter (or less than $13,000 for organizations with in-house lobbyists), they don’t have to disclose it at all. Plus, some organizations include spending that doesn’t belong in the report—for instance, money spent lobbying state governments or other legal work.”

Now let’s return to the example lobbying record. Scroll down and we see that the company lobbied the House for legislation like “H.R. 195  Making further continuing appropriations for the fiscal year ending September 30, 2018, and other purposes; provisions related to appropriations and energy.”

Researching Legislation

Assuming you have no idea what H.R. 195 is, what can at least find out who has some say over it by finding the committee dealing with it. To research legislation we go to, which has a search bar on the front page where you can search H.R. 195.

We can see who is on the Appropriations Committee by going to and choosing “appropriations” then “about” and then “membership” ( And who do we find but our congressman himself is on this committee.

So the congressman is invested in Exxon Mobil and he sits on a committee that works with the legislation that Exxon Mobil cares about.

At first glance this looks like a conflict of interest, but lets pull the thread a bit more. Maybe the congressman stood aside for that legislation. Lets check with the Reno Gazette Journal’s database for congressional votes ( We see here our congressman did vote.

Focused Research into Lobbying

Let’s look a bit further into other legislation lobbied by Exxon Mobil. Propublica has a great tool for researching lobbying records. See here for an example of documented legislation lobbied by Exxon Mobil. We search on Exxon Mobil and get the following results:

These results show (going from left to right) what issues were being lobbied, who was hiring the lobbyists, who were the hired lobbyists, how many people were hired, and how long they were working for the company. We see from the first result at the top (look at the “Lobbyists” column in particular) that most of Exxon Mobil’s lobbying is done “in house.” That is why the record that we previously discussed showed such a big price tag, it was referring to the amount paid for its own in-house lobbyists. You can see from the other records that most of the contracts are for smaller numbers of lobbyists and therefore smaller price tags.

For our record at the top, in order to find out more information we look at the “Issues” section and click on “Details.” This gives a very detailed listing of what was being lobbyied, when, who the lobbyists were, and other information. For our purposes here, lets focus on finding what Exxon Mobil lobbied that related to our congressman.

We scroll down and see that the legislation being lobbied is separated into different fiscal quarters and subject matter. For a sample, let’s look at the most last quarter of 2018 (recall we are focusing on 2018 data) and see what legislation Exxon Mobil lobbied that was in the House. See at the bottom where it says “Type of Issue” and next to it “Budge/Appropriations”.

We focus on the legislation that starts with “H.R.”, not “S.”, because the congressman is of course in the House of Representatives, not the Senate.

Member of the House may choose to abstain from votes if they feel it directly affects them personally. So we can check if the congressman decided to abstain based on his personal finances. After all, Exxon Mobil certainly believed the legislation in question could help or hurt the company, or else it wouldn’t be paying to lobby it.

Now that we have identified several pieces of legislation that where Exxon Mobil is sufficiently interested or affected by them, we can check if the congressman abstained.

The record mentions the following five pieces of legislation lobbied in the last quarter of 2018:

We return to the RGZ database (click on the legislation’s number in the list to see individual roll calls of the votes) and see that the congressman did not abstain from voting in any of them.


ACCR article on Corporate Lobbyists and H.R. 4015

While this article generally avoids Googling, at this point we do a quick check just to get a feel for the environment and Exxon Mobil’s lobbying efforts in 2018.

A 2018 article from the American Council for Capital Formation reported that Exxon Mobil had been active lobbying in the House for legislation focused on the Security and Exchange Commission that would have a direct benefit to the company. The legislation, H.R. 4015, had passed in the House at that time.

We can use a database maintained by the Reno Gazette Journal (RGZ) to check which members of the House voted for this legislation, and we see here that our congressman did in fact vote in support of it.

Revolving Door

There is a concept in politics and lobbying known as “Revolving Door,” which is often defined as “a movement of personnel between roles as legislators and regulators, on one hand, and members of the industries affected by the legislation and regulation, on the other hand.”

You can research if any of the lobbyists working on a specific contract have a revolving door history with the people that are being lobbied. So for our example, we can look at the contracts for the aforementioned legislation where Exxon Mobil is lobbying the Appropriations Committee where our congressman sits.

How to do this

First, to find the relevant contract, we repeat the process above where we found the specific record for Exxon Mobil’s in-house lobbying of the Committee. We go to the Propublica lobbying tool, search for our company and get results, we chose to look at the details of the company’s in-house lobbying record (but you can choose any) and we get this detailed page here. Once again we scroll down to the 2018 fourth quarter records.

Second, notice in the top right where it says “Original Filing” and there is an xml file link next to it. This is the record for the contract, click here so we can dive in and see which lobbyists worked on which issues.

Third, you will see one or more sections titled “Lobbying Activity” and we see under number 16 what issues were lobbied and under number 17 we see what parts of the government (such as the House of Representatives or a specific Department) were lobbied over those issues. Finally, we see under number 18 the names of the registered lobbyists that worked on these matters.

So we see in the section below for example, that a lobbyist named Dan Easley was one of the lobbyists that lobbied. Next we can look into whether any of the lobbyists listed below have a relevant Revolving Door history where they worked for the parts of the government that they lobbied.

Fourth step, go to the Open Secrets page for data on Federal Lobbying. Search the name of your company in question and you reach its Lobby Profile, select the year in question in the drop down menu on the right and click the tab for Lobbyists. You will see a list of results showing all of the lobbyists employed by the company during that year. If you want, you can choose to filter and only show lobbyists that were former members of Congress, or filter to lobbyists with “Revolving Door Profiles.”

For our purposes we will only search for the lobbyists named in the contract. If their name has two circular arrows to the right of it, as you can see above, that means they have a “Revolving Door Profile”.

In the image below, we see an Exxon Mobil Lobbying record on the right listing several lobbyists that are lobbied Congress and other government entities over various scientific issues (“scientific issues” is obviously not very specific but I am trying to keep this as brief as possible) and one lobbyist listed is Daniel Easley. On the left you see the affiliated data in Open Secrets under Exxon Mobil’s lobbyists. Because Easley is in the record on the right, we know we can find him in the Open Secrets page.

When we see Easley listed in Open Secrets we also observed that he has a “Revolving Door Profile” from the circular arrows to the right of his name. We click on the arrows and open up his “Revolving Door Profile”. The profile shows Easley’s employment history which reveals that he previously worked on the House Science Committee. This is part of why he has “Revolving Door” status. As a former employee of the House Science Committee who is now lobbying scientific issues and legislation, he could be lobbying his former coworkers.

In our example, in the record focused on the Appropriations Committee, we did NOT find any revolving door lobbyists working on these issues. It is possible the other records for Exxon Mobil’s lobbying of the Committee do include revolving door lobbyists, but we will focus on the one record because we do not want to start cherry-picking data.

Backgrounding further with more data

Ok now lets put things into context and ask “but is this typical?”. We are going to look at two pools of data to get our context. First, we can look at Exxon Mobil’s donations on the aggregate and see if we find any patterns. You can obtain this data from Open Secrets and use Python to display. The goal here is to find try to use Exxon Mobil’s donations as a reflection of the company’s areas of interest. In other words, if they spend more on one topic, maybe they care more about it.

If you are familiar with using Python, there is a separate post of mine that explains how to display data about political donations. That post is a walkthrough using Exxon Mobil as an example and it produces the graph above. The data in the graph, obtained from Open Secrets, shows how much money the company donated to members of each committee. The graph shows that Exxon Mobil spent the most money on contributing to members of the Appropriations Committee.

Before when I looked at the data on political donations I thought it looked like the company just gave money to everyone, but now we see a pattern that suggests the company possibly focuses more resources toward the Appropriations Committee.

Now we want to find if it is common for members of the Appropriations Committee, or just Congress in general, to be so invested in Exxon Mobil.

To find out answer, we return again to Open Secrets, which provides a list of members of congress ranked by how much they are invested in Exxon Mobil.

Our congressman is at the top of the list.


Our congressman, is overwhelming invested a company, more so than anyone else in Congress. He sits on a committee, and our company seems to donate most to people in that committee. But most importantly, he sits on a committee that works with legislation that our company is lobbying and paying a good deal of money on it.


This chart is an attempt to explain the situation.

First: The congressman is on a committee that involved in certain legislation.

Second: The company is lobbying that legislation because, in theory, it believes it can benefit from it.

Third: The company’s success it tied to its stock which is tied to the congressman’s personal finances.

Now we, arguably, see a connection between the congressman’s position and his personal finances.

Moving Forward with New Investigations

These research methods can be applied to research any politician or company.

Go ahead and try them elsewhere. Good luck!

What to Look for When Researching U.S. Corporations

This article provides an overview for conducting corporate research, identifying what information is available and how it is useful to a researcher. This post also weaves together previous articles and guides into a coherent whole. We will address where to find information on a company and what kinds of questions you can consider answering with your research.

Basic Research

The initial steps for researching any company is to do a quick Google search on it, then skim any press articles mentioning the company and look at the company website. With that completed, you can then start getting into the more advanced and more important research. This research guide assumes the researcher has already done the aforementioned basic steps.

Advanced Research

Deep Web –  See our article, “Deep Net, Open Net, and the Dark Net” that explains the deep web to understand why this will play an important role in corporate research.

The more advanced research relies on deep web databases, so at this point it is worth taking a moment to understand the “deep web.”  Most people assume that if they google a company’s name, they will find all of the information on the company that is available on the internet. In truth, the best information is usually located in deep web databases, which means the information in those databases will never appear in one’s google results.

Who Owns/Runs the Company – Every company must register with the government and the registration will usually identify the company owner in addition to other kinds of information that vary depending on the location. To find out how to obtain a registration, see “Corporate Research on the Deep Web.”

This article will also address the difference between a standard company and a corporation, from the perspective of corporate research. The main issue here is that a corporation has addition information available that it must file with the Securities and Exchange Commission (SEC) which is available to the public on the SEC’s EDGAR database. Here you can find the annual filing of a company if it is incorporated. The annual filing, among other things, will show you the subsidiaries of a company.

Court Records, Property Records, Local News – Using our three guides for finding this information, you can research court recordsproperty records, and local news for any company or individual subsidiaries.

Company Website – See our guide for website research, “How to Investigate a Company’s Website.” This article ignores the content of the website and instead looks at what is “behind the scenes”.

Contact Information Without a Name – Some records may list only a phone number, or other contact information, where it is supposed to list a person’s name. If this is the case, you can use our guide for How to Research a Phone Number. (We urge you to use these methods only in support of corporate research, this is not intended as an invitation to stalk someone that wants privacy)

Leadership – You can identify important players within a company by reading our guide “How to Find Influential Actors in a Corporation”. You can also learn to conduct focused and in depth research on a corporation’s Board Members by seeing our article on “Researching Board Members.” This guide will identify how to get background information, identify connections, how much money they are paid, and find possible conflicts-of-interest.

Corporate Profiles – Web scraping can be very useful for researching a company, especially if the company website has many employee profiles. You can use our guide to web scrape profiles (or anything else on the Internet), “Enhance Corporate Investigations by Web Scraping with Python.”

Hidden, Unofficial Connections – There are many ways that people and companies can have hidden or unofficial connections with other entities, and these connections often influence the actions of both parties. See our guide on finding these connections “Corporate Research on Hidden Connections.” For an example, see the chart that maps out connections between people, addresses, stocks, and companies that were revealed when secret financial documents were leaked to

Screenshot of entities linked to Blackwater Investments (from

Past Violations –  Our article on “Material disclosures and violations” will identify problems that the company has run into, whether it be legal issues or increased debt. This article will also show how to find when the company hires a new member to its senior ranks and how much they get paid.

Shipping Companies – Shipping companies have their own special factors, identified in our guide “Shipping companies,” that addresses identifying and locating ships, cargo, and discovering shipping violations.

Nonprofits – Companies are often linked to nonprofits via donations or because their leadership also are involved in the nonprofits themselves. See our guide for basic research on nonprofits “Researching a Nonprofit” that will explain methods like finding their tax records.

Nonprofit Corruption, Corporate Influence – Also see our guide for how to identify corruption and misuse of funds in “How to Discover Corruption in Nonprofits.” Note that it is common for corporations in the U.S. to donate funds to nonprofits affiliated with a politician. In fact, a nonprofit is four times more likely to receive a donations from a corporation if a politician sits on its board or runs it, according to a study by the National Bureau of Economic Research.

For an example of corporate links to nonprofits, see the link analysis chart below, it maps out how oil companies influence think tanks (which are also nonprofits). This map was created by, a free tool that identifies hidden connections.

from, click here to see the source

Contracts with the U.S. Government – If the company has ever contracted or tried to contract for the U.S. government, there are several special databases that will provide unique information about the company. You can learn about this with our article “Ties to the US government”.

Questions to Help Reach a Conclusion

The next paragraph provides a number of questions that may help with a corporate investigation for several reasons. You can look at these questions before your research to help you guide your investigation and decide where to look for information. It is also useful to look at these questions at the end of an investigation so that you can take all of your information and create some form of narrative or conclusion based on it. In addition, the guides above show a lot of ways to obtain information, but you may not have time to run through all of them. Plus, a lot of the information available on a company may not be relevant for a researcher depending on why they are investigating the company in the first place.

For those reasons, it can be helpful for a research to address some or all of the following questions: Who owns the company and what connections do they have to other entities? Who runs the company?  What does the company own? Who does it owe money to? Does the company or its owners/staff have secret companies in tax havens? Where does it do business?  What government or public service contracts does it have? What links does it have with politicians and civil servants?  What regulations has it violated? What legal cases have been brought against it? Who is taking action against it? And finally, who can influence the company?

Now you will have a well-researched and analyzed product.

Use Python to Recognize Text, Compare Data From Scanned Documents

Background From Previous Post

This post follows on the instruction from a previous article, Research Scanned Documents with Optical Character Recognition, that explained how to use Python to recognize text in png files with the use of Optical Character Recognition.

Optical Character Recognition (OCR) is basically the ability for a computer to recognize words in photos. OCR is particularly useful when dealing with scanned documents.

The previous post addressed how to install Python, the python interpreter Sublime Text, and the python tool for installing libraries Pip. Then, it explained how to install the python libraries Pillow, Tesseract, and Pytesseract. Finally, the post explained how different simpled Python scripts, can read text and then print in out in the python interpreter, create and print the text into a text file, or convert the png file into a pdf that has a layer of OCR over the text.

For example the following script would take a png file named screen.png, read the text and print it out into a text file.

from PIL import Image
import pytesseract

f = open("output.txt", "w")

Moving Forward, Converting Data into a Python List

Now if we take our output text file, we can turn it into a list by making this addition:

from PIL import Image
import pytesseract

f = open("output.txt", "w")

rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:

With the new list named “rows”, if we “print(rows)”, the list of names will be the same.

If we wanted to print the list “rows” into a text file, we cannot simply write “f.write(rows)” because the function requires a string. So if we want to create a text file the same as the one we already created, we have to write the script like this:

from PIL import Image
import pytesseract


f = open("demofile3.txt", "w")

rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:

f = open("demofile4.txt", "w")
for row in rows:

This addition will create a second text file “demofile4.txt” that is exactly the same as “demofile3.txt”. This simple task does not achieve anything but it shows how to work with the data. And if we have a list, we can compare the contents to another list. So with the names in a list we can do things like see if any of the donation recipients are also in the names of financial disclosure documents for politicians (meaning that a politician also works for the donation recipient).

Now, observe that if we run the script above and tell it to print the list in the python interpreter (Sublime Text), this is the result (script is on top, results are on bottom):

Each “\n” represents a line break. This means that the data recognizes the line breaks. This will be relevant in a moment.

There are other measures that can be taken, such as removing all symbol characters or making all text lowercase to make them more easily compared. However, these measures will not be addressed at the moment.

Converting the Data Directly to a List

But backing up a bit, it seems there should be a more efficient way to make a list without creating a file. However, note that if we tried to put the data directly into the list (instead of putting it into a text file first) like in the script below it did not work:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")

rows = []
data = (pytesseract.image_to_string('screen.png')))
for line in data:

If we printed the list, “rows” by typing “print(rows)”, we would get a list of each individual character from all for the names.

Here is a better script below. The original data recognizes line breaks so we can import the data and assign it to a variable (named “data”) and separate the names by using the “split.()” function (as explained here in and choose to split by the line breaks using “\n” (as explained here in And then we put all of that together as “data.split(‘\n’)” and assign that to the variable “t”. Finally we make it a list by saying each thing in “t” (which has now been separated), and we will call each thing “i”, we will append each “i” to “rows”.

This creates a list of every name.

from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string('screen.png')))
t = data.split('\n')
for i in t:

If we add in a “print(rows)”, the results look like this:

We see that there are empty lines that are identified as individual elements in the list as this:

‘ ‘,

You may also notice that the character [ appears a lot in the list.

It is worth mentioning that if there were certain characters that you want removed from the list, like for example a “[“, you can simply add this coding below:

[s.strip('[') for s in rows]
rows = [s.replace('[', '') for s in rows]

The resulting list is clean

This is particularly useful when you want to compare one list to another and you want cleaner data.

Now you have a proper list of the names that we can manipulate and compare.

How to Compare Data from Different Scanned Documents

From here, we will use a common example that involves looking at which nonprofits a corporation provides donations and checking if any of them are involved with local politicians. In this case we are looking at the pharmaceutical company AabVie and a congressman that is involved in areas of potential interest to the company.

To explain why we would research this, it should be noted that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, about how corporate foundations are more likely to give charitable donations to a nonprofit if it is affiliated with a politician.

The study found that “a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” In addition, the researchers concluded that “a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study identified specific politicians of interest by checking if they were on any committees that were lobbied by a given company. If so, they would check for a link via a nonprofit and donations. So in our example we will look into whether a specific company is given to a nonprofit affiliated with a politician of interest. The existence of such a link does not prove nefarious intent, but it is interesting and in some cases it can be an indicator of a deeper relationship.

This example is relevant because it requires researching files from two different databases (irs nonprofit filings database and congressional financial disclosures database) with records that are often badly scanned documents. The lack of OCR on these files makes it difficult to find links.

We can use the png of donation recipients identified in the public tax filings of the AbbVie Foundation (affiliated with the pharmaceutical company with the same name, which has a presence in Rhode Island). We will name this file abbvie.png.

donation recipients

For the second file we will use a list of nonprofit organizations that have Rhode Island-based congressman Jim Langevin on their board to see if any of his nonprofits receives money from the aforementioned foundation. The information for both files was scraped from records made available via Propublica’s Nonprofit Explorer.

In this case we will use the following Python script to read both documents and compare the data.

from PIL import Image
import pytesseract

rows = []
data = (pytesseract.image_to_string('abbvie.png')))
t = data.split('\n')
for i in t:

rows2 = []
data2 = (pytesseract.image_to_string('langevin.png')))
t2 = data2.split('\n')
for i2 in t2:

names = []
for name in rows:
	if name in rows2:

Below you can see that running the script identified that the name “Adoption Rhode Island” was in both files (along with a lot of empty spaces shared in both docs).

Compare Many Documents

Now that we have explained how to compare two documents, let’s look at how to compare many documents.

Let’s say you have 20 pdf files of a hundred pages each and you have another set of 20 pdf files and you want to find common names or something else that exist in both groups. First go to

Upload one set of 20 pdf documents and it will return a zip folder to download. Inside the folder is a set of png files, each page of each pdf file will be converted into a separate png file. Next, open Command Prompt / Terminal and navigate to the folder (you may consider copy and pasting the files from the zip folder to a regular folder). Once you have navigated there, type “dir /b > filenames.txt” in the Command Prompt and it will create a text file in the folder with all of the file names. Copy and paste those files into the same folder with the Python script and then replace the name of the png file in the python script with the name of the textfile (which should also be in the same folder as the python script now.

The script will read the file’s names of the png files and then go to each png file to OCR it. You could also put the files in a separate folder and put the path to each file in the text file. This could be accomplished with use of the “Find and Replace” tool. For example, if every file starts with the same word (which will be the case because that is how the website will create the files for you) like “policy”, and the path to each file was something like “documents/research/policy1.png” (with the number changing for every file), you could tell the Find and Replace tool to find all instances of the word “policy”, and replace them with “documents/research/policy”. This would leave the rest of the file names unchanged.

Now repeat these steps for the second group of files and you are done.

That’s it.