Find a Twitter Account’s Original Content and/or Most Common Words

This post addresses how to take a Twitter account and:

  • a.) create a list of the tweets that were original content, not retweets
  • b.) identify the topics most discussed in the last 4 thousand tweets
  • c.) list the other Twitter accounts most often mentioned (and follow up to investigate them)

Most investigators looking at a Twitter account merely scroll through some of the latest tweets or look at the bio. But you can analyze the entirety of a Twitter account by identifying the main topics of its contents or filtering out retweets so you can just analyze the original content tweeted from the account.

SIDENOTE: We must address Tweet Topic Explorer, which is a great tool for understanding the content of a Twitter account. But there are two issues that can get in the way. First, sometimes the most common words that it finds are actually usernames. Second, sometimes it is hard to tell whether there is any original content coming from an account because there are so many retweets.

This post will explain how to use Google Sheets to filter down to the original content tweets and find the most common words from an account without including usernames and/or contents from retweets.

For this example we are going to use the Twitter account for OSINT Combine (@osintcombine).

First, go to All My Tweets, plug in the Twitter handle, and choose "Tweets" so it loads all of the account's tweets.

Second, highlight and copy all of the tweets’ content. You can use Ctrl + A if there are too many tweets to highlight manually. It will not impact the process if it highlights and copies the other words on the page that are not tweets.

Third, go to Google Sheets and paste the tweets into the B2 cell.

If you want to get rid of all of the formatting, then instead of just pasting into the cell you can right-click in the cell and then choose “Paste Special” and then choose “Paste values only”.

a.) Remove Retweets

If you want to remove retweets and only see original content, then at this point highlight the column, click on Data, and then Create Filter.

A little box appears at the top of the column

Click on it, choose "Filter by condition", in the dropdown choose "text does not contain", then in the box that appears type "RT" and click "OK".
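The same retweet filter can be sketched in Python. This is a minimal illustration with made-up sample tweets, assuming (as the Sheets filter does) that retweets are the tweets whose text begins with "RT":

```python
# Sketch of the "remove retweets" step, assuming each tweet is one string
# in a list and retweets begin with "RT" (hypothetical sample data).
tweets = [
    "RT @someone: interesting thread",
    "Our new OSINT course is live",
    "RT @other: conference announcement",
    "Check out our latest blog post",
]

# Keep only tweets that do not start with the retweet marker.
original_content = [t for t in tweets if not t.startswith("RT")]

print(original_content)
```

Note that the Sheets condition "text does not contain" is broader: it also hides tweets that merely contain "RT" somewhere in the middle, while `startswith` only drops tweets that open with it.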

Count Words

First, highlight the column with the texts, then click on "Data" and "Split text to columns", and in the little Separator box that appears click the dropdown and choose "space".

An Aside – How to Filter Out Irrelevant Words

I find that you can filter out a lot of irrelevant words ("the", "at") if you exclude words below a certain character minimum. The following will let you find and delete all words that are 6 characters or fewer.

Highlight all by hitting CTRL and A at the same time

Click “edit” and then “find and replace”

In the new box, next to "Find", type the following:


SIDENOTE: This is a “regular expression” that refers to all words with 1 to 6 characters. Change the numbers however you like for your own spreadsheet.
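For readers more comfortable seeing the idea in code: the post's exact regular expression is not reproduced here, but the same short-word cleanup can be sketched in Python. The pattern `\b[A-Za-z]{1,6}\b` is one common way to write "any word of 1 to 6 letters" and is an assumption, not necessarily the pattern used in the original post:

```python
import re

# Sketch of the find-and-replace step: replace every word of 1 to 6
# characters with a space. The pattern below is an assumed equivalent
# of the (unshown) regex from the post; sample text is hypothetical.
text = "the cat discussed geolocation at length yesterday"

filtered = re.sub(r"\b[A-Za-z]{1,6}\b", " ", text)
print(filtered)
```

Only the longer words ("discussed", "geolocation", "yesterday") survive; the short ones become extra spaces, just like in the spreadsheet.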

Next to "replace with" enter a space

Check the box next to "search using regular expressions"

and then click “Replace All”

Now my screen looks like this, with lots of spaces

End of the Aside, Returning to Steps for Counting Words

Second, find out how many rows down your data goes and how many columns across.

So in mine I see that it goes down to row 237

And across as far as column BH. Note that the “split text to columns” tool will highlight the columns as far as there are data in them (so you know you’ve hit the end of the columns when they aren’t highlighted anymore).

That means our data stretches from cell B2 at its top left to BH237 at its bottom right.

Third, in cell A2 we type =FLATTEN(B2:BH237) and hit enter. This combines all of the data from each cell in the area you specified into one column.

Now all of the individual words will be listed in Column A.

Note that at first it might look like Column A is still full of empty spaces. Don’t worry, that is only because the empty cells are included. Scroll down and you will see the rest.
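What =FLATTEN does can be sketched in Python: it walks the rectangular range row by row and emits every cell into a single column, empty cells included. The grid below is hypothetical sample data:

```python
# Sketch of Sheets' =FLATTEN(B2:BH237): collapse a rectangular range of
# cells into one column, in row-major order (hypothetical sample grid).
grid = [
    ["osint", "tools", ""],
    ["training", "", "australia"],
]

flattened = [cell for row in grid for cell in row]

# FLATTEN keeps the empty cells too; they can be dropped afterwards,
# which is why the spreadsheet column looks blank at the top at first.
words_only = [cell for cell in flattened if cell]

print(words_only)
```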

Fourth, type a title in cell A1, like “words”

Fifth, highlight column A, click “Data”, choose “pivot table”, make sure the box that appears has “new sheet” chosen, and then click “create” in the box.

Sixth, on the right next to "Rows" click "Add" and then choose "words" (or whatever you typed into cell A1). Then next to "Values" click "Add" and then "words". It should default to COUNTA under "Summarize by".
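The pivot table with COUNTA is doing a word frequency count. A minimal Python sketch of the same summary, using hypothetical sample words, is:

```python
from collections import Counter

# Sketch of what the pivot table's COUNTA summary produces: how many
# times each unique word appears (hypothetical sample data).
words = ["osint", "training", "osint", "australia", "training", "osint"]

counts = Counter(words)

# Most-common first, like sorting the pivot table Z->A on the count column.
for word, n in counts.most_common():
    print(word, n)
```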

Remove Usernames?

Seventh, at this point you may decide to remove all usernames from your data. This is pretty simple.

Highlight column A, click on “Data”, then “Filter views”, then “create new filter view”.

Next click on the upside down arrow next to “words” in A1 and choose to sort A–>Z

You will see all of the numbers and weird characters come first at the top of the list

If you scroll down you will see all of the usernames in a row (since they all start with @), so just highlight all of the rows: left-click on the row number next to the first username and then hold shift and left-click on the row number next to the last one.

Then right-click anywhere on the row numbers and choose to hide all of the rows

Alternatively, you could choose to only see the usernames by highlighting all of the other rows without usernames and hiding them
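Either way, the logic is just a prefix test: usernames all start with "@". A short sketch with hypothetical sample words:

```python
# Sketch of the username filter: list only the usernames, or exclude
# them, with a simple "@" prefix test (hypothetical sample data).
words = ["@osintcombine", "training", "@searchish", "platforms"]

usernames = [w for w in words if w.startswith("@")]
non_usernames = [w for w in words if not w.startswith("@")]

print(usernames)
print(non_usernames)
```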

An Easier Way To List Or Remove Usernames?

There is a second option on how to list only the usernames or exclude them. But it is worth noting that it does not ALWAYS work due to some of the complexities of Google Sheets.

Basically, you do this as soon as you've populated your pivot table.

Alternate Step Seven:

Highlight column A, click on “Data”, then “Filter views”, then “create new filter view”.

Next click on the upside down arrow next to “words” in A1.

If you want to only list the usernames, in the dropdown menu choose “Filter by condition”, then “Text starts with” and then just input the “@”. See below:

You may see duplicates of the usernames because of a trailing colon or comma, or something of that sort, like "@searchish" and "@searchish:".

To resolve this, return to step 1; after completing that first step, go to "edit", then "find and replace", choose to search for ":", replace with a space " ", and hit "replace all".

By contrast, to exclude usernames, after you click on “Filter by condition”, instead of choosing “Text starts with”, choose “text does not contain”.

Research the top usernames?

If you chose to find the most common usernames, you can do some follow-up research on them by searching for various accounts and email addresses using the same usernames all at the same time. The method for doing so is pretty easy and explained in the post Find if a Username is listed…

The post will show how to run the profil3r script with one command as shown below

Now regardless of whether you removed your usernames or not, continue as follows

Now you can exit out of the filter view by hitting the X on the right side (you might have to close out your Pivot Table Editor in order to see it).

Eighth, click the little empty box above the "1" marker for row 1 to highlight everything.

Click “Data”, “filter views”, then “create new filter view”.

Ninth, click the little upside down triangle in column B and choose “Sort Z->A”

And now you see:
The list above accurately shows that OSINT Combine is affiliated with OSINT training, platforms, and speakers, in addition to Australia, and we see the Twitter handles of a number of known OSINT experts with links to the organization.

That’s it, you’re done!

Find If a Username is listed as an Email or Social Media Account

This post will explain how to use the profil3r tool to automatically search whether a given username or real name is used on various social media platforms or as an email address for Gmail, Yahoo, or Hotmail.

If you already have accounts on GitHub and Gitpod, login and jump to step 3.

Step 1 – (the hardest) – Click here and sign up for an account on GitHub. Sorry, that is more than one step but the process is simple and it gets easy afterward.

Step 2 – Login to Gitpod. You do not need to sign up for Gitpod if you already have a GitHub account. Go here and you will see an option on the left to sign in with your GitHub account even though you don't have a Gitpod account. See below:

Step 3 – Go to the profil3r page on GitHub and then click on the green button that says Gitpod. You'll then have to wait for a bit.

If the gitpod button does not appear on this page for you, you can alternatively paste the following url into a new tab:

Step 4 – That brings you to Gitpod, where it will set up a virtual machine.

Step 5 – See where it says "gitpod /workspace/Profil3r $" and next to it type

sudo python3 install

Step 6 – Hit enter and wait for the install to complete. Then, as shown below, type "sudo python3 -p [USERNAME]"

I am going to use the example username “usernameforme”, so I typed

sudo python3 -p usernameforme

Step 7 – Hit Enter and then this screen appears.

If you had entered a person's name, like "john smith", you could choose here whether you want the script to search with any, none, or all of the separators listed. As noted in the instructions, move the little yellow arrow up or down with the direction arrows on your keyboard. Choose an option by hitting "space" and unchoose it by hitting the same button. You can also check/choose all options with the "a" button or uncheck them all with the "i" button.

Or you can do nothing at all. Regardless of what you choose, (including nothing) hit enter when you are done.

Step 8 – Now you have a series of options for what you want the script to search. Feel free to hit “a” to choose all, and then hit enter.

Step 9 – Wait and get your results. For each version of the username or name (from Step 7) it will check each website listed. Below we see there are several social media sites with the username "usernameforme" but nothing on Soundcloud.

The possible email addresses might be a bit confusing. The script searches if there are email addresses for gmail, yahoo, or hotmail with that username. If it says [SAFE] next to the email address, that means it did NOT find evidence that the email address exists.

That is because the script searches for instances where the email address is listed in a data breach. If the email address is not in a data breach, the website tells the script that the email address is “safe”. For our purposes, if the email address WERE in a data breach, that would prove that it does exist.

Feel free to follow up by researching the email address on a website like

That’s it! You’re Done!

Update: Multiple Usernames

It is also possible to run multiple usernames at once, but be careful not to overload it.

To add more usernames, run the same command but add additional usernames separated by commas, like below:

sudo python3 -p username1, username2, username3

See the post (Find A Twitter Account’s…) that identifies the Twitter accounts most commonly mentioned by a given account.

You can then take the most commonly mentioned accounts and research them all with

Python – Find if a Phone has any Instagram Account

This guide will walk through how to use Gitpod to run the Megadose/Ignorant script that checks if a phone number is registered to any Instagram account. Note that it will not identify the specific account.

If you already have Gitpod and Github accounts, login to both and jump to step 3.

Step 1 – (the hardest) – Click here and sign up for an account on GitHub. Sorry, that is more than one step but the process is simple and it gets easy afterward.

Step 2 – Login to Gitpod. You do not need to sign up for Gitpod if you already have a GitHub account. Go here and you will see an option on the left to sign in with your GitHub account even though you don't have a Gitpod account. See below:

Once you have logged in your page will probably look like this:

Step 3 – Copy and paste this url into your browser and hit Enter:

Why? – Basically, you are making a url that consists of the Gitpod website url, a hashtag, and the url of the GitHub page for the script.

Here is the explanation. We want to run a Python script, but to do so we need a development environment. Normally you would download one, but in this case Gitpod provides a development environment online where you can run Python scripts. When you identify a script posted on GitHub, you create a url out of the Gitpod website's url, a hashtag, and the url for the page hosting the Python script. So with our script hosted at, we combine it with the Gitpod parameters: + /# +

to make this url –
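Since the original links were stripped from this copy of the post, here is the url construction sketched with placeholder addresses. The repository url is hypothetical; the pattern (Gitpod's address, then "#", then the repo page) is the one the post describes:

```python
# Sketch of building the Gitpod workspace url: the Gitpod site, a "#",
# then the GitHub page of the script. The repo url below is hypothetical.
gitpod_url = "https://gitpod.io"                # assumed Gitpod address
repo_url = "https://github.com/user/project"    # placeholder repo page

workspace_url = gitpod_url + "/#" + repo_url
print(workspace_url)
```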

Gitpod will create a workspace, a virtual computer, specifically for running the script. The script and its affiliated files will be downloaded, though you will likely still have to run the file, or its equivalent. If you go to the script's page on GitHub there should be instructions for downloading and running the script.

Wait for Gitpod to do some processing and then your computer should look like this:

Step 4 – At the bottom of the screen find where it says “/workspace/ignorant $”.

Click to the right of these words and type “python3 install” and then hit enter.

Step 5 – Then when it is done, type "python3 install" and hit enter

Step 6 – Wait for the install to complete and then right-click on the folder on the top left that is named "ignorant" (not the one titled "ignorant.egg-info"). When you right-click on the folder a dropdown menu appears; choose "open in terminal".

Step 7 – A new tab appears in the terminal that reads "gitpod /workspace/ignorant/ignorant", and the cursor is located next to a similarly named prompt.

Finally, choose your phone number of choice (I will choose a US-based fake number, 123456789, and the US country code is 1), so I would type "ignorant 1 123456789" and you will get results like you see below

Note that the script also checks if the phone was used to register accounts on other services as well.

Now that you’ve done this once, the process will be much easier in the future.

Next time

Login to Gitpod and there will be a workspace named for the script. It should look like the image below. Just click on the workspace.

This should reopen your workspace to right where you left off and you can run the code again.

Data Analysis for Beginners

This post explains how to conduct Data Analysis on large sets of data using Google Sheets and is intended for people with no background in the subject. The post will explain how to import data into Sheets and how to summarize, sort, filter, and modify the data by performing custom math on it. Finally, we will walk through how to cross reference data from different sources.

CREDIT WHERE IT IS DUE: The post is based on following the instructions from Lam Thuy Vo's book Mining Social Media. Chapter 6 of the book, which focuses on Data Analysis, is available online here. This information was originally identified by the website Tools For Reporters, here.

In the words of the author, you learn "how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data."

This post will address how to import, modify, and analyze data in Google Sheets.

In Part 1 we will go over how to import data into Google Sheets, make it recognize numbers, and use the function for splitting data from one column into two columns.

Part 2 will go over how to create a Pivot Table (which provides statistics about our raw data) to count how many times a specific value occurs in a column. This will be used in the example to find out how many times a twitter account tweeted on each date. The goal of the example is to identify if the twitter account is a bot based on the frequency of its tweets.

Part 3 describes how to use formulas and functions to analyze data. While Part 2 aggregated the data, Part 3 will use math equations, such as finding the average number of tweets per day by a given Twitter account, in order to see whether it is within the bounds of a normal human user or a bot.

Part 4 addresses how to sort and filter data. You can choose one value in your data (such as the number of retweets) and order all of your data based on ascending or descending number of retweets. You can also sort the data in the pivot table (which is based on the raw data). Filtering allows you to hide data you do not need or find the most relevant data (maybe search for all tweets that mention a specified person).

At this point, the reader has a general understanding about what kind of data analysis is available, but there are too many functions and tools to learn them all. It is better to obtain data and then consider what you would like to do with it. Then work backwards to look at the Sheets functions and tools available to find a way to achieve your goal. Note that Parts 3 and 4 reference lists of the available functions and tools.

Part 5 addresses how to cross reference data from different sets based on a common data type or value, for example if both datasheets have a “dates” column.

Part 1

Importing Data to Google Sheets

First we are going to import the data, which was originally obtained directly from the Twitter API (a subject for a separate post) and we start with the data in CSV format.

There are a plethora of potential data sources that allow you to download data in this kind of format. For example, the FEC provides CSV data for public campaign financing (click here) and OpenSecrets provides CSV data on corporate lobbying records (click here for Exxon example)

Go to Google Sheets and start a new sheet; it should say "blank" under "start a new spreadsheet" on the left side.

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file, (preferably a csv file) the following window should appear:

Then choose the following in order:

1- under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

At this point we are not worried about whether Sheets can differentiate between a number, word, or date.

Now make a copy of the original data set. (you can also use “file” then “version history” to look at past versions)

8 – Click on “sheet1” (you will make this one the original to put aside in case you need to reference the original data later) and rename it “original data”, or whatever.

9 – Then, to create a copy to work with, click the small arrow next to “Sheet1” / “original data” or whatever you call it now, and in the menu that appears you will choose “duplicate”. In the book they name the second sheet “step 1: modify and format” so we will use the same here for the sake of clarity.

Transform data in columns from string to numbers

Then we choose to highlight column C by clicking on the “C”. Then click on “format” then “number” then “number” again. This transforms every value in the column from a string to a number. Repeat this process for the Retweets column which also contains numbers.

Split Text To Columns

We want to find out how many times the twitter account tweeted on each day. So we will count how many times the same date occurs in our raw data on the tweets.

To do this, we need to create a column that only includes the dates of each tweet and then count how many times each date (or “column value”) occurs. So if March 3rd occurs ten times, that means ten tweets came out on that date.

The problem is that our data only provides a column that gives the date and time together for each tweet. That means that several tweets on the same date (example: "March 3rd, at 12pm" and "March 3rd, at 4pm") will be counted as separate values instead of being identified as tweets that occurred on the same date.

The solution is to use the Split Text to Columns tool which will separate the dates into one column and the times into a second column. This will let us count the values in the dates column so we can see how many times each date occurred.
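The split-then-count logic can be sketched in Python. The timestamps below are hypothetical, in an assumed "date time" format; splitting on the space mirrors the "Space" separator in the Sheets tool:

```python
from collections import Counter

# Sketch of Split Text to Columns followed by the date count, assuming
# each timestamp is a "date time" string (hypothetical sample data).
timestamps = [
    "2017-03-03 12:00:00",
    "2017-03-03 16:00:00",
    "2017-03-04 09:30:00",
]

# Splitting on the space separates the date from the time.
dates = [ts.split(" ")[0] for ts in timestamps]

# Counting the dates gives tweets per day, like the pivot table will.
tweets_per_date = Counter(dates)
print(tweets_per_date)
```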

How to Use the Split Text to Columns Tool

1 – Click on the new sheet (so you are not working with the original data) and find the “Split Text to Columns” tool under the Data menu

The tool looks for common characteristics in the columns (like a semicolon in-between two numbers) so that it can separate the data.

2 – Choose a column to split and then create an empty column to the right of it.

3 – To do so right-click the letter above the column and choose “Insert 1 right”.

Now you can use the Split Text to Columns tool without overriding any data.

4 – Click on the column you want to split so that it is highlighted, and then in the menu above click on “Data”.

5 – Click on Data and then choose Split Text to Columns.

It will split the text between the two columns based on the default setting “Detect Automatically”. But a small window appears, titled Separator, and it will give you different options for how you want to separate the data.

6 – We will choose “Space”.

Note that these columns of dates and times are still formatted as strings. That is because the separator tool works best with strings so it makes sense to work with the data as strings now, and then convert it later.

7 – For the sake of clarity, we now change the column headers to “date” and “time” to reflect the information in each column.

Now the information is ready to create a Pivot Table, which will summarize the data for us.

Part 2

According to Lam Thuy Vo, at this point the data is prepped and the next part is to aggregate the data by using pivot tables and formulas.

Pivot tables can, according to the author (p. 110 from the book), “summarize data…in various ways…we can count the number of times a term occurs in a column or calculate the sum of numerical values based on a date or category.” This is done with the creation of a summary table that gives statistics about our raw data.

In the example from the book the reader uses pivot tables to find how many times each date occurs in a column.


Recall that the data originally provided the date and time in the same column.

So at first the data looked like this:


Therefore in part 1 the reader had to separate the dates into a separate column. The point is so that every time a tweet occurred on a specific date, the column value would only be the date. Therefore if we count the number of times the same date/column value occurred in that column, that would be the number of times a tweet occurred on that date.

How to Create a Pivot Table (shorter explanation)

Note that if you want a short explanation on creating pivot tables, google provides one here.

1 – In short, to create a pivot table you highlight the data you want to use, then find and click on "create pivot table".

2 – Then when the table is created, you input data into the table by using the pivot table editor. In the editor there are two options, “rows” and “columns”, you will use these to choose what data to include.

3 – Within the Rows/Columns options, you will choose the data by choosing column headers from the original spreadsheet and each unique value from that data will be listed once in the table. Rows/Columns also lets you order the data ascending or descending.

4 – Finally, you can use the pivot table editor’s “values” option to summarize the data (by adding, counting, finding the average, etc.) or the “filter” option to filter out data values based on the value itself (in which case you choose which specific values to include) or based on a conditional, which is explained below.

Select Data and Create Pivot Table

Returning to the guidance from Mining Social Media: The first step in creating the pivot table is to choose the data that it will analyze. We want to choose all of the data on our sheet, so we click on the little rectangle in the top left where the column and row headings meet. This highlights everything so that the pivot table will have access to all of the data. The pivot table will not necessarily use all of the data, but this gives the option to use it for whatever data we want.

Then click on the Data tab and then Pivot Table. A window will pop up; make sure New Sheet is chosen and then click Create.
A new sheet will appear, named Pivot Table 1.

Pivot Table Editor

On the right side of the pivot table there is something called the “Pivot table editor”. You use this to put the highlighted data from the sheet into your pivot table and choose how to analyze it.

Note that it suggests some possible things to do, and it even recognizes your data column headers in its suggestions. You can find out what these suggested options would look like with your data. To do this, if you hover your cursor over any of the suggestions, a magnifying glass appears on the right side. Click on the magnifying glass to see what kind of table these suggestions would produce.

For example, we click on the magnifying glass that appears next to “Number of unique id for each language”, and get the table seen below. The table shows that there are 5 times that a tweet is identified as “ar” (arabic) in the data under the language column’s information. This does not put the data into your pivot table, it just shows what it would look like.

Returning to Our Pivot Table and Counting Dates

As previously noted, there are options on the right side for how to "populate" the pivot table with data and analysis. We want to create a column with each unique value from the "date" column in our sheet. Eventually we want to show next to that column how many times each of those dates occurred, but for now we just want a column that shows each unique date.

In the Pivot Table Editor we see the Rows option and click on Add. This means that you are selecting to put your data in rows and when you click add you choose what source of information (which column’s information) you want to input into the rows of the pivot table.

A drop down menu appears and it has the names of each of the column headers from the “step 1: modify and format” table. You click on Date and now the table should look like the screenshot below.

It might seem confusing that the Rows option basically entails choosing a column from the raw data. It is best to think of it as choosing what data will be added into rows.

A new column will appear in the pivot table (named “date”) and in the rows below it each unique value from the Date column (from the “step 1: modify and format” sheet) will appear once. Recall that in the “step 1” table each of these unique values (which are the dates) appeared several times.

If we had chosen “Columns” and then “date”, we would have gotten each unique value (date) in a different column in one row. See below:

Now returning to our table that has each unique value in different rows in one column. Our goal is to count how many times each unique date occurred in the date column of our sheet (“step 1: modify and format”).

So now we want to do analysis on the unique values and display it in a new column.

The Values option in Pivot Table Editor will let us do analysis on our data. So we choose which data we want to work with (in this case it will be the “date” column). To do this we go to Values and click Add. A drop down appears and we choose “date” once again.

Just to be clear about how the Values option works, you choose one of the columns from your raw data and then choose a function to apply to it. So for example, you can add every number in the column together, or multiply them, count them, or find the average.

A table appears under values that shows that “date” is the column we are working with and under “summarize by” we can click below it and a drop down appears with different ways to summarize the data. We choose COUNTA. This option will count the number of times each unique value in the date column appears.

There was also a COUNT option available, but we do not use that because it will only work with data that is formatted as a number. Other options include SUM, AVERAGE, MEDIAN, PRODUCT, MAX, and MIN. Google has a help page here that lists and explains each of the available functions.
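The COUNTA-versus-COUNT distinction can be shown with a small sketch. The mixed column below is hypothetical; COUNTA counts every non-empty value while COUNT only counts values that are numbers:

```python
# Sketch of the difference between COUNTA and COUNT in Sheets:
# COUNTA counts any non-empty value, COUNT counts numbers only.
# The column below is hypothetical mixed data.
column = ["2016-12-10", "2016-12-10", 42, 3.5, ""]

counta = sum(1 for v in column if v != "")                      # non-empty values
count = sum(1 for v in column if isinstance(v, (int, float)))   # numeric values only

print(counta, count)
```

Since our dates are formatted as strings, COUNT would return zero for them, which is why the post uses COUNTA.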

Also see that a new column appears (named "COUNTA of date") that shows how many times each date appears, automatically putting the count for each one next to the unique value in the rows of the first column.

Success! We now have a pivot table that shows how many times each date occurred in our data. The first row shows that the date 2016-12-10 occurred 3 times, meaning there were 3 tweets on that date.

The author explains in the book that this information is useful because you can identify if a Twitter account is probably a bot based on how often it tweets. The author cites that, according to the Digital Forensic Research Lab, tweeting more than 72 times per day is "suspicious."

With that information in mind we can look at our data on a Twitter account and judge if it is tweeting on any given day at a "suspicious" rate. (The answer is "yes".)
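That judgment is easy to automate once you have the per-date counts. A sketch with hypothetical counts, using the 72-tweets-per-day threshold the post cites:

```python
# Sketch of the bot check: flag any date on which the account tweeted
# more than 72 times, the threshold the Digital Forensic Research Lab
# calls "suspicious". The per-date counts below are hypothetical.
SUSPICIOUS_TWEETS_PER_DAY = 72

tweets_per_date = {
    "2016-12-10": 3,
    "2016-12-11": 95,
    "2016-12-12": 150,
}

suspicious_dates = [
    date for date, n in tweets_per_date.items() if n > SUSPICIOUS_TWEETS_PER_DAY
]

print(suspicious_dates)
```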


The values option in the dropdown gives a list of ways to summarize data but they are not always easy to understand. Here is an explanation of the ways to use Values to summarize data:

  • SUM – Returns the sum of a series of numbers and/or cells.
  • COUNTA – Returns the number of values in a dataset.
  • COUNT – Returns the number of numeric values in a dataset.
  • COUNTUNIQUE – Counts the number of unique values in a list of specified values and ranges.
  • AVERAGE – The AVERAGE function returns the numerical average value in a dataset, ignoring text.
  • MAX – Returns the maximum value in a numeric dataset.
  • MIN – Returns the minimum value in a numeric dataset.
  • MEDIAN – Returns the median value in a numeric dataset.
  • PRODUCT – Returns the result of multiplying a series of numbers together.
  • STDEV – The STDEV function calculates the standard deviation based on a sample.
  • STDEVP – Calculates the standard deviation based on an entire population.
  • VAR – Calculates the variance based on a sample.
  • VARP – Calculates the variance based on an entire population.

Part 3

Formulas and Functions

Google Sheets has a basic instruction page and video here to explain how to use functions to create formulas.

The intro video explains that a Function is a built-in operation in Sheets, such as adding or subtracting. Functions appear as options in dropdown lists and they can also be typed directly into Sheets. For example, we used the Function COUNTA in the pivot table to count how many times each date appeared.

Within Sheets, we can also type functions directly into a cell by typing the "=" sign, the name of the function, and then specifying the data (or the location of the data) in parentheses.

For example, look at the sheet below:

In this case we want to add the values in the two cells A2 and A3, and put the return value in the cell B4.

1- So we start by typing the "=" sign so that Sheets knows we are typing a function

2- We type the name of the function, in this case the function name “SUM” is for doing addition

3- Finally, specify which cells we want to add by identifying them in parentheses and separating them with a comma. If you hit enter, the cell B4 will just have the number 4 in it and the formula disappears from view but still remains for the cell.

Side Note: when specifying cells in parentheses you can use (A2:A7) to mean add all the cells from A2 to A7. You could also type (A:A) to refer to the entire A column. You can choose from cells in a different sheet within the same file by typing the name of the sheet in quotes followed by an exclamation mark. So if you had a different sheet named Sheet2 and you wanted to get the sum of cells A2 to A7, you would type =SUM('Sheet2'!A2:A7). Also, you could just type =SUM( and then use your cursor to choose the cells.

A Formula uses a Function (or several Functions) to obtain a specific result. So in the previous example we created a Formula in cell B4 that used the SUM function to add the values of A2 and A3.
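For readers who think in code, the cell and range notation above maps neatly onto Python lists. A minimal sketch (the cell values are invented, chosen so that =SUM(A2, A3) returns 4 as in the example):

```python
# Model a spreadsheet column as a Python list (index 0 = row 1)
column_a = [None, 1, 3, 2, 5, 4, 6, 7]  # A1 is empty, A2=1, A3=3, ...

# =SUM(A2, A3) - add two specific cells
total_two = column_a[1] + column_a[2]

# =SUM(A2:A7) - add a contiguous range of cells (rows 2 through 7)
total_range = sum(column_a[1:7])

# =SUM(A:A) - add the entire column, skipping empty cells
total_all = sum(v for v in column_a if v is not None)

print(total_two)  # 4
```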

Smart Copying

The author goes on to point out the value of smart copying. Accordingly, if we want to find the length of each tweet, we can return to the sheet Step 1: Modify and Format and add a column to the right of the text column (which is the H column). Right-click on the column header and choose “insert 1 right” (which creates the empty I column).

So if we type the formula =len(H2) in the I2 cell and then hit enter, we will get the number of characters (or “length”) for the H2 cell.

Sheets actually suggests using smart copying to fill in the column below with the same formula but accounting for the different placement of the cells.

Or you could find the little box in the bottom right corner of the first cell with the formula in it and drag it down over the other cells. This will also smart copy to the other cells.
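In programming terms, smart copying a formula down a column is just mapping a function over a list. A minimal sketch (the tweets are invented):

```python
# The H column of tweets, as a list of strings (sample values are made up)
tweets = ["Hello world", "OSINT is fun", "RT @someone: a retweet"]

# Smart copying =len(H2) down the I column applies len() to every row
lengths = [len(t) for t in tweets]

print(lengths)  # [11, 12, 22]
```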

Part 4

Sorting Data

Vo says that the next step is to learn how to sort and filter the results, or data in general, to “rank or isolate data.”

In short, to sort or filter data, you first remove the formatting by copying the data and pasting it into a new sheet using Paste Special. Then you highlight it and create a filter view, which allows you to sort and filter. This does not alter the data; it only rearranges it or hides parts of it.

Vo uses the example goal of sorting the data in the pivot table to see how the suspected bot tweeted on its busiest day. To do this, Vo explains how to do so by “creating a new sheet with our aggregate results and changing the entire sheet to a filter view.”

Paste Special

We will want to copy and paste the data from the pivot table into a new sheet to isolate our work. Note that those cells have data and formulas, so we want to only paste the cells’ data without the formula. To do this we will use Paste Special.

First we create a new sheet by clicking on the plus sign on the bottom left.

The next step is to highlight the cells with the data you want, which in this case is the entire table.

Open the new sheet (named “Sheet 3”) and right-click in the A1 box.

In the drop down menu, choose “Paste Special” and then “Paste Values Only”. This will paste only the data, and not any of the formatting (which also excludes the formulas). The data is now formatted as strings of integers.

We need to convert the formatting of the data (the dates) from integers to dates (in the formatting sense of the word dates) so that Sheets will adjust how it deals with this data. To do this, you highlight the newly pasted cells and click the Format option in the title bar. From the dropdown menu choose Number and then Date.
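Behind the scenes, Sheets (like Excel) stores a date as a count of days since December 30, 1899; the Format menu only changes how that number is displayed. A minimal sketch of the conversion:

```python
from datetime import date, timedelta

# Sheets' internal date epoch (day 0 of its serial numbering)
EPOCH = date(1899, 12, 30)

def serial_to_date(serial):
    """Convert a Sheets date serial number into a calendar date."""
    return EPOCH + timedelta(days=serial)

print(serial_to_date(42838))  # 2017-04-13
```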

Filter View

We need to turn the data into a filter view.

First, we select all of the data in the sheet by selecting the box in the top left corner between the row and column headers.

Clicking this box highlights every cell in the sheet.

Next look at the tools under the headers and find the filter icon, which looks like a funnel, see below:

If it is not there, you may need to click the 3 dots icon seen below. When you see the filter icon, click on the upside down triangle to the right of the icon.

Make sure the cells are all still highlighted/selected and click on the triangle and choose Create New Filter View.

This creates a new heading row and filter icons at the tops of the columns, as below:

The new icons in the first row can be used for the filter functions. If we click on one, (in this case we are using the Date column) we are given a dropdown menu with sort and filter options.

Rearrange in ascending or descending

At the top there are options to sort the data in ascending or descending order (“Sort A->Z” means ascending, regardless of what kind of data you are using). If we choose to sort in ascending order based on the Date column, the values in the COUNTA column will also be rearranged to align with the Date column values.

This is also where it is important that the dates column is formatted as dates. Sheets will recognize the values as dates and arrange them in chronological order. If we had not changed the formatting, Sheets would view the values as integers and arrange them from lowest to highest.
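The same pitfall is easy to demonstrate in Python (the dates and counts below are invented; one date is pushed to October so the text sort fails visibly):

```python
from datetime import datetime

rows = [("10/1/2017", 3), ("4/13/2017", 1054), ("4/20/2017", 57)]

# Sorted as plain text, "10/..." comes before "4/..." because the
# comparison is character by character - not chronological at all
as_text = sorted(rows, key=lambda r: r[0])

# Parsing the strings into real dates gives true chronological order,
# which is what Sheets does once the column is formatted as dates
as_dates = sorted(rows, key=lambda r: datetime.strptime(r[0], "%m/%d/%Y"))

print(as_text[0][0])   # 10/1/2017  (wrong "first" date)
print(as_dates[0][0])  # 4/13/2017  (correct first date)
```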

Filter by value

We can also filter out certain values by using the options at the bottom of the menu. There is a list (that continues downward past what is visible) of every kind of value in the dates column, and there is also a search function if you have a particular value in mind. By unchecking the box next to one of the values, Sheets will hide that value, but only in this particular sheet.

Filter by condition

Finally, the filter dropdown menu gives the option to filter based on a condition. A condition is basically an “if…then…” statement. If we click on the filter-by-condition option in the menu, a minimized dropdown appears; clicking on that shows a list of conditions we can use.

For example, if we were looking through the content of the tweets, we could use this to look for tweets that mention something specific.
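In code, a filter condition is just a membership test over the rows. A minimal sketch with invented tweet text:

```python
tweets = [
    "Big news about the election today",
    "Lunch was great",
    "More election coverage tonight",
]

# "Text contains" condition: keep only tweets mentioning the keyword
matches = [t for t in tweets if "election" in t.lower()]

print(len(matches))  # 2
```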

Part 5

Combining Data from Different Sources

Sheets allows us to cross reference data sets and combine them into one data set. This is also known as Merging Data Sets.

What are we going to do specifically? We will take two pivot tables in two separate sheets. The first table (from before) will show how many times per day on specific dates a twitter account (the bot account @sunneversets) tweeted.

The second table shows the number of times a second twitter account (a human’s account, @nostarchpress) tweeted per day on the same dates.

Then we will create a third sheet where we put the data side by side. To do so, first we will choose a date range for the data we want to look at and we create one column with each date in the range listed in separate rows. We name the first column “dates” and then name the next two columns “Pivot Table 1 – sunneversets account” and “Pivot Table 2 – nostarchpress account”. Finally, we create a formula that we will put in the 2nd and 3rd columns that will automatically find and input the relevant data.

The end result will look like this:


How to Cross Reference Data

We use the formula called =vlookup() to cross reference and merge the data sets based on a common value. Basically that means that both data sets have to have a column for the same kind of data (such as a date column). One data set shows how many tweets occurred on each date, while another data set might show how many political speeches occurred on those dates.

To provide an example of a second data set to reference, Lam Thuy Vo provides a second twitter account’s csv file (nostarch_tweets.csv) on her GitHub page.

This new file is made up of data on a human being’s twitter account.

Merging the Spreadsheets

Vo points out that the first step is to create another new sheet by clicking on the plus sign on the bottom left in Google Sheets. Name the new sheet “cross reference sheet”.

As a reminder, to import data from a csv file into a Google Sheets sheet, open a new sheet and then take the following steps:

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file (preferably a csv file), the following window should appear:

Then choose the following in order:

1 – under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

4 – then we split the dates and times from the one “created_at” column into two columns, one for dates and one for times. To do so we highlight the column, click on Data, choose “split text to columns” and when a window pops up asking to choose a separator, choose space. Name the column with dates “date”.

5 – highlight the date column, click on format and in the dropdown choose number, and then date.

6 – we also need to turn this data into a pivot table like the last one. Click the box in the top left corner between the column and row headers so that everything is highlighted, click on the Data tab, choose “create pivot table”, and create it in a new sheet named Pivot Table 2. Then click on the new pivot table and, under Rows, choose date; under Values, also choose date; and again under Values, choose COUNTA.

Prep Cross Reference Sheet with the Common Value

In the sheet named “cross reference sheet” we are going to focus on the data from the two csv files that occurred within a specific range of values in one of the columns. In the example from the book Mining Social Media, the author focuses on the data in the date range April 13th to May 1st 2017. To do this, you type “date” in the A1 cell and then need to input the dates in the column below. To do this quickly, type the first two dates in the first two cells and highlight them as seen below.

Next, click on the blue square on the bottom right and pull down to highlight the 20 empty cells below it. Sheets will figure out the pattern and automatically fill the rest of the cells with dates.
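Sheets infers the +1 day pattern from the two highlighted cells; in code the same autofill is explicit. A sketch using the book's date range (April 13 through May 1, 2017):

```python
from datetime import date, timedelta

# Start at the first date and add one day per row, 19 rows in total
start = date(2017, 4, 13)
dates = [start + timedelta(days=i) for i in range(19)]

print(dates[0], dates[-1])  # 2017-04-13 2017-05-01
```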

Using the =vlookup() Formula

According to Lam Thuy Vo, the =vlookup() formula “looks at a value in one table, looks up the value in another table, and then retrieves data based on this common value”.

So we will use the sheet “cross reference sheet” that has the first column of dates for our date range, and then we will put formulas in the second and third columns that will find the relevant data from the two pivot tables and input the data into the “cross reference sheet” columns.

But first (according to para 2 on page 118) we need to set up a new column for @sunneversets100’s daily tweet counts right next to the date column.

We start by naming the second column “Pivot Table 1 – sunneversets account”. This column will list the number of tweets for that twitter account for each date identified in the first column. To do this, we will write a formula for the first cell (located at B2) that will take the value in cell A2 (which has the date 4/13/2017), look in Pivot Table 1 for the value associated with that date, and input it into cell B2 of the cross reference sheet.

The =vlookup() formula takes four “arguments” in the “()” section.

First Argument – This argument is the value that will be looked up in the other table. In our case it is the first date in the series of dates, identified by its cell location, A2. So we start with =vlookup(A2, …).

Second Argument – This is a bit complicated and therefore best explained in the author’s own words:

Recall that Pivot Table 1 looks like this:

So we are telling the =vlookup() formula to look at columns A and B because A has the common value (the date) that exists in both pivot tables and B has the values that are unique to Pivot Table 1.

Third Argument – This argument might seem a bit redundant. The formula needs to know where the relevant data is located in the pivot table sheet relative to the column with the common value. The common value in Pivot Table 1 is the date in the date column (the first column) and the relevant data is in the second column next to it. We communicate this location to the formula with the number 2, which tells the formula that the data is, essentially, one column to the right. So our formula now looks like this:

=vlookup(A2, 'Pivot Table 1'!A:B, 2, …)

Fourth Argument – In the fourth argument we tell the formula, as the author explains, “whether the range we’re looking at has been sorted in the same order as the table we created for our data merge.” This is not the case, so we input FALSE into the formula. The author recommends always inputting FALSE because the formula will still find the data even if it has been sorted. The complete formula is as follows:

=vlookup(A2, 'Pivot Table 1'!A:B, 2, FALSE)
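With FALSE as the last argument, =vlookup() behaves much like an exact-match dictionary lookup. A minimal Python sketch (the second count is invented; 1054 is the value from the example):

```python
# Pivot Table 1 as a dict mapping dates to daily tweet counts
pivot_table_1 = {"4/13/2017": 1054, "4/14/2017": 703}

# =vlookup(A2, 'Pivot Table 1'!A:B, 2, FALSE): find the row whose first
# column equals the value in A2 and return its second-column value
a2 = "4/13/2017"
result = pivot_table_1[a2]

print(result)  # 1054
```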

Input the =vlookup() Formula

In our “cross reference sheet” we input the formula in cell B2 as seen below:

cross reference sheet

The formula looks at cell A2 in the sheet “cross reference sheet” and sees that the value in the cell is 4/13/2017. The formula then looks at the “Pivot Table 1” sheet (see below), searching column A for the same value. It finds the value 4/13/2017 in the 5th row of column A, then looks at the value in the B column to the right and sees that the value in B5 is 1054. So the formula brings that value back to cell B2 in “cross reference sheet”.

So, upon hitting enter, the formula shows the value 1054 in cell B2 (as seen below). We can smart copy the formula into the rest of the column by clicking on the small blue box at the bottom right of cell B2 and dragging it down to the rest of the column.

In order for the third column to perform the same task for Pivot Table 2, we use the same formula but change “Pivot Table 1” to “Pivot Table 2”. Smart copy again and the table should look like this:

Notice that some cells have #N/A written in them; that is because there was no data available. There is also a red mark in the top right corner of these cells, which indicates an error. We can fix this by altering the formula so that if there is an error, it inputs 0 instead. This is accurate because in our case, when there is no data it is because there were no tweets from the account on that date.

To make this change we use the =iferror() formula. This formula takes 2 arguments.

The first argument is the entire previous formula, with its leading “=” sign removed, followed by a comma (it might seem like the formula should be wrapped in an extra set of parentheses, but it sits directly inside =iferror()’s own parentheses).

So we take the formula =vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE) and put it in =iferror() as the first argument. Which yields:

=iferror(vlookup(A2, 'Pivot Table 1'!A:B, 2, FALSE), …)

The second argument is what we want to input in the cell if the original formula yields an error. In our case we want a 0 so we use a 0 as our second argument. The final product looks like this:

=iferror(vlookup(A2, 'Pivot Table 1'!A:B, 2, FALSE), 0)
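In code, the =iferror() fallback is like a dictionary lookup with a default value. A minimal sketch (the missing date is invented):

```python
# Pivot Table 1 as a dict mapping dates to daily tweet counts
pivot_table_1 = {"4/13/2017": 1054, "4/14/2017": 703}

# =iferror(vlookup(...), 0): if a date is missing from the pivot table
# (no tweets that day), fall back to 0 instead of an #N/A error
count_known = pivot_table_1.get("4/13/2017", 0)
count_missing = pivot_table_1.get("4/30/2017", 0)

print(count_known, count_missing)  # 1054 0
```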

Now we replace the formula in the sheet “cross reference sheet” with this one and we get the following:

That’s it! Congratulations, you have successfully learned to cross reference data from two different sources!

The Path Forward

Lam Thuy Vo does a great job of addressing the path forward after you have learned the functionalities in her example. The following sources can be used to learn more, or merely as a reference, as you use Sheets for your own purposes.

The Google documentation for Sheets is here –

A list of available formulas –​/docs/table/25273/).

If you find yourself performing the same task repeatedly, the author notes that you may want to write your own custom functions and there is a handy walk-through here –​/apps-script/guides/sheets/functions/.

The author sums up what you have learned as follows:

In this chapter, you saw how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.

Congratulations again!

A Short List of Good Tools to Research a Twitter Account

Geo Social Footprint

Geo Social Footprint ( should show a twitter account’s geotagged tweets on a map and link to the tweets themselves. If you get an error message when you run a twitter account, like “map cannot display”, that often just means that there are no geotagged tweets. Based on the twitter api limitations, it is reasonable to guess that the tool looks at the last 4 thousand tweets.

For geo context, you can find what people nearby are tweeting about by using ( or (

Foller Me

Foller Me ( gives similar info on an account (and is easier to read), such as when they joined, but also gives a larger list of the people the researched account interacts with.

Twitonomy
Twitonomy ( performs analysis on the account as a whole, but you have to remember to search for an account and then actually click on the account name somewhere in the results. It gives information such as which accounts it tweets about or replies to most, how many times the account tends to tweet per day and from what kind of device, and how often they tend to tweet on given hours in the day or days in the week.

For example, at the very bottom of the results here we see that the user of the @searchish_site account uses an Android phone and the search-ish wordpress account to Tweet.

Keep in mind that this tool is a little tricky at first. You have to search an account, and then in the initial results click on the account name again or click on Analyze a Twitter profile in order to get the full results.

Sleeping Time

For redundancy, Sleeping Time ( also gives the hours of use for an account. But this tool gets right to the point and makes an educated guess about the hours of sleep so you don’t have to look into the data yourself and guess if an average of 3 hours at 4am means the person usually sleeps at that time or not.

Tweet Topic Explorer

Tweet Topic Explorer ( identifies the most common words tweeted from the account (excluding useless words like “the”) and allows you to click on any of them to immediately see a list of the tweets with that word. Scan the map for words that might reflect important things about the account user like political views or profession.

See this post here for how to find an account’s closest friends.

Or, see this post, for how to manipulate data and view original posts, most important topics in content, or rank other accounts mentioned in tweets.

Tweet Beaver

Tweet Beaver ( has a variety of tools that are especially useful for assessing a relationship between two accounts (common followers, what have they tweeted at each other, etc.)

All My Tweets ( is a great tool to find an account’s first follower. Just select the account to search, click on “Followers”, and it will give you a list of followers in chronological order, so scroll to the bottom. A number of investigative reporting guides suggest that the first follower is often a person that has a close relationship with the account holder.

This tool will also list all of an account’s tweets, or everything the account has “liked”.

Python Scripts, Research Tools for Obtaining Data About Youtube Videos

If you are researching a youtube video there are also several research tools available online for obtaining more information about the video. This post will describe how to do more in-depth research of a video.

The Research Tools

This next part will address some of the tools that are available for researching videos online.

What to Look For

If you are researching videos online, there are 3 basic research goals.

1 – Search for videos. There are two reasons to search for videos. 1a – You are looking for information that might be available in a video, but you don’t know what that video is. In this case you are searching for videos by topic. 1b – You have already found a video you are very interested in and want to analyze, but before you can do so you need to find the original video. Copies of the video may be altered or edited, plus you might be interested in researching the source of the video. To find the original video you must search for other versions of it online and find the one that was posted first.

2 – Try to find out who posted the video. This is difficult and basically involves checking whether the youtube account username is also used on social media accounts. You can also look up the first commenter, or the first few commenters, on the video because they might know the person that posted it. You can search for social media accounts with their usernames and look into their common friends for potential candidates. Searching for the original video may reveal that it was originally posted on a social media account, which would obviously make the job easier.

3 – Analysis of the video content. This skill is largely outside the bounds of this blog, but we identify tools and guides for this kind of operation. If you are searching videos to see if they have information on a specific topic (like a person, company, or legislation), you can now do a word search in youtube to see if anyone mentions it. This feature can save a researcher a lot of time when going through different videos.

There is a detailed guide for how to analyze the content of different videos to do things like find the location or verify the video in the Verification Handbook (

1 – Search for Videos

Searching for videos can be difficult. If you are looking for videos on a certain topic you can try using Petey Vid, a text-based search engine that searches exclusively for videos. Keep in mind that text-based searches, whether they are PeteyVid or Google, can only search for text that is affiliated with a video. So if the video is on a blank webpage and has no affiliated words, the search engine can’t find it.

According to Bellingcat’s Aric Toler, there is currently no way to run an Internet search on a video (this refers to the idea of actually uploading a video to a search engine that would look for other videos based on it, like a reverse image search). So the next best thing is to get thumbnail photos of the video and run reverse image searches on those photos. The idea is that you are searching for the photo that appears on a video before it has been started. If this sounds confusing, we will walk through an example. At the time of this post’s writing, the youtube homepage looked like the screenshot below:

So if you wanted to search for the video on the top right, you could use the snipping tool to capture the photo that is currently on the video while it has not yet been played. You would do a reverse image search on this photo:

Amnesty International also has a tool for the public called the Youtube DataViewer that extracts data from any youtube video and creates four thumbnail images from the video that you can use for reverse image searches. For example, if we paste the youtube video’s url into the tool and run it, we get the following results.

The results above show the name and description of the video, the video ID, and the specific date and time when it was uploaded.

If we scroll down, we see that the tool created four thumbnail images from the video that can be used for a reverse image search, and there is even a link next to each photo that will do the reverse image search.


2 – Try to Find Out Who Posted the Video

This step requires a bit of time but is relatively simple: Google the username.

For example, look at this video here. When we click on the user’s ID in the bottom left corner we are brought to a channel homepage.

That brings us to this page below, which has a strange url that does not identify the user well.

Strangely, if you click on “Home” you will get the same webpage but the url changes to show the user’s username. See below: the channel’s homepage is the same, but the url has changed to show that the username is “oregonzoo”.

A quick google search of the username reveals the following Twitter account, which of course provides further information on the user.

When we scroll down to the first two commenters we can apply the same method to try to find their social media.

If we find two twitter accounts associated with those commenters we can try to find common friends by using, see below.

Using this tool we can look for common friends that might be the youtube poster.

3 – Analysis of Video Content

Youtube’s Computer-Generated Transcripts

Youtube has a new feature that makes researching videos easier by generating transcripts for each video. To access the feature, click on the three dots below the video on the right side and then click “Open transcript”.

The computer-generated transcript of the video appears next to the video. The transcript is word-searchable, so you can save time by searching for whether a specific name or company is mentioned in the video. In the screenshot below, I searched for the name Tina Larsen and it popped up in the transcript. Note that the transcript also shows the time when different things were said. If you find something interesting in the transcript, you can click on the words and youtube will automatically bring the video to that time.

Depending on the video, you may want to go through all of the content very slowly to try to analyze the background or street signs. You can do this with –

The Metadata Viewer is a great tool for gathering all of the data about a video and its publisher in one place. When was the video posted, by which account, how long has that account been around, is the video geolocated, etc.

If the video is not geolocated, you can look to see if the account that posted it (also known as a channel) has ever geolocated any of its videos, which of course implies the location of the account holder and even the original video itself. Geofinder -(

It is not clear if the following tools are actually useful for the purpose of researching a video, but they are interesting enough to warrant a mention. – will let you edit a video or convert it to an audio file. And – will let you download the video to your device as a video or audio file.

More Tools

InVid has a tool for in-depth analysis of video content; for an explanation of how to use the tool click here, and for the tool itself click here.

Download all of the Comments

You can download all of a youtube video’s comments with the following process. Note that the first time you do this it takes a number of steps, but it will be much quicker every time afterwards once it is set up. You will need to get a youtube api key (you only need to do this once), find the video id number, set up gitpod (only once), and then hit run.

Getting a YouTube API key:

First, go to “”.

Click to agree to the terms of service and then click on “agree and continue”.


Next click “select a project”


Then click on “new project”


In the next screen click on “create”, then in the following screen click on “select project”; on the left click on “api & services” and then “credentials”.

Then click on “create credentials” and then “API key”


You are not done yet: next click on “library” on the left, and in the next screen scroll down to youtube and click on “youtube data api v3”.


In the next screen click “enable”. Now your api has access to youtube data!

To return to your api key, click on “credentials” on the left and you are brought to a page that has your key.

For further information, the api documentation is located here – “”

Find the Video ID

Now get the video id from the youtube video of interest. Do this by opening the video and copying this portion of the url after “v=”

Scraping the Comments

If you do not already have them, first you must sign up for accounts on github and They are both pretty quick and easy. Plus, once you have a github account, all you need to do to get a gitpod account is go to the login page and choose “sign in with github”. Now leave both accounts open in two tabs and go to the following url in a third tab.

If your screen has the Gitpod button on the top left, click on it. Sometimes your screen will have a green button there that just says “Code”; if that is the case, copy and paste the following link in a new tab.

Either option will bring you to the following screen

Click on “” on the left and see the code appear. Find lines 7 and 8, where you need to paste your api key and video id. Make sure you put the api key and video id in quotes.

Now your screen should look like this below (but without the redaction over the api key).


Next, on the bottom right see where it says

gitpod /workspace/Download-All-YouTube-Comments-From-Any-Video $

To the right of the $ sign, type “pip install pytube” and then hit enter

Let it run and then click the little arrow on the top right

You will see the script running like below.

When it is done, click on “allComments.tsv” on the top left and the usernames and comments for the video will appear in chronological order.

That’s it, you’re done!

What to Look for When Researching U.S. Corporations

This article provides an overview for conducting corporate research, identifying what information is available and how it is useful to a researcher. This post also weaves together previous articles and guides into a coherent whole. We will address where to find information on a company and what kinds of questions you can consider answering with your research.

Basic Research

The initial steps for researching any company are to do a quick Google search on it, skim any press articles mentioning the company, and look at the company website. With that completed, you can then start getting into the more advanced and more important research. This research guide assumes the researcher has already done these basic steps.

Advanced Research

Deep Web –  See our article, “Deep Net, Open Net, and the Dark Net” that explains the deep web to understand why this will play an important role in corporate research.

The more advanced research relies on deep web databases, so at this point it is worth taking a moment to understand the “deep web.”  Most people assume that if they google a company’s name, they will find all of the information on the company that is available on the internet. In truth, the best information is usually located in deep web databases, which means the information in those databases will never appear in one’s google results.

Who Owns/Runs the Company – Every company must register with the government and the registration will usually identify the company owner in addition to other kinds of information that vary depending on the location. To find out how to obtain a registration, see “Corporate Research on the Deep Web.”

This article will also address the difference between a standard company and a corporation from the perspective of corporate research. The main issue here is that a corporation has additional information available that it must file with the Securities and Exchange Commission (SEC), which is available to the public on the SEC’s EDGAR database. Here you can find the annual filing of a company if it is incorporated. The annual filing, among other things, will show you the subsidiaries of a company.

Court Records, Property Records, Local News – Using our three guides for finding this information, you can research court records, property records, and local news for any company or individual subsidiaries.

Company Website – See our guide for website research, “How to Investigate a Company’s Website.” This article ignores the content of the website and instead looks at what is “behind the scenes”.

Contact Information Without a Name – Some records may list only a phone number, or other contact information, where it is supposed to list a person’s name. If this is the case, you can use our guide for How to Research a Phone Number. (We urge you to use these methods only in support of corporate research, this is not intended as an invitation to stalk someone that wants privacy)

Leadership – You can identify important players within a company by reading our guide “How to Find Influential Actors in a Corporation”. You can also learn to conduct focused and in-depth research on a corporation’s Board Members by seeing our article on “Researching Board Members.” This guide will show how to get background information, identify connections, find out how much money they are paid, and spot possible conflicts of interest.

Corporate Profiles – Web scraping can be very useful for researching a company, especially if the company website has many employee profiles. You can use our guide to web scrape profiles (or anything else on the Internet), “Enhance Corporate Investigations by Web Scraping with Python.”

Hidden, Unofficial Connections – There are many ways that people and companies can have hidden or unofficial connections with other entities, and these connections often influence the actions of both parties. See our guide on finding these connections, “Corporate Research on Hidden Connections.” For an example, see the chart that maps out connections between people, addresses, stocks, and companies that were revealed when secret financial documents were leaked.

Screenshot of entities linked to Blackwater Investments

Past Violations –  Our article on “Material disclosures and violations” will identify problems that the company has run into, whether it be legal issues or increased debt. This article will also show how to find when the company hires a new member to its senior ranks and how much they get paid.

Shipping Companies – Shipping companies have their own special factors, identified in our guide “Shipping companies,” that addresses identifying and locating ships, cargo, and discovering shipping violations.

Nonprofits – Companies are often linked to nonprofits via donations or because their leadership is also involved in the nonprofits themselves. See our guide for basic research on nonprofits, “Researching a Nonprofit,” which will explain methods like finding their tax records.

Nonprofit Corruption, Corporate Influence – Also see our guide for how to identify corruption and misuse of funds in “How to Discover Corruption in Nonprofits.” Note that it is common for corporations in the U.S. to donate funds to nonprofits affiliated with a politician. In fact, a nonprofit is four times more likely to receive a donation from a corporation if a politician sits on its board or runs it, according to a study by the National Bureau of Economic Research.

For an example of corporate links to nonprofits, see the link analysis chart below, which maps out how oil companies influence think tanks (which are also nonprofits). This map was created with a free tool that identifies hidden connections.


Contracts with the U.S. Government – If the company has ever contracted or tried to contract for the U.S. government, there are several special databases that will provide unique information about the company. You can learn about this with our article “Ties to the US government”.

Questions to Help Reach a Conclusion

The next paragraph provides a number of questions that may help with a corporate investigation for several reasons. You can look at these questions before your research to help you guide your investigation and decide where to look for information. It is also useful to look at these questions at the end of an investigation so that you can take all of your information and create some form of narrative or conclusion based on it. In addition, the guides above show a lot of ways to obtain information, but you may not have time to run through all of them. Plus, a lot of the information available on a company may not be relevant for a researcher depending on why they are investigating the company in the first place.

For those reasons, it can be helpful for a researcher to address some or all of the following questions: Who owns the company and what connections do they have to other entities? Who runs the company? What does the company own? Who does it owe money to? Does the company or its owners/staff have secret companies in tax havens? Where does it do business? What government or public service contracts does it have? What links does it have with politicians and civil servants? What regulations has it violated? What legal cases have been brought against it? Who is taking action against it? And finally, who can influence the company?

Now you will have a well-researched and analyzed product.

Use Python to Recognize Text, Compare Data From Scanned Documents

Background From Previous Post

This post follows on the instruction from a previous article, Research Scanned Documents with Optical Character Recognition, that explained how to use Python to recognize text in png files with the use of Optical Character Recognition.

Optical Character Recognition (OCR) is basically the ability for a computer to recognize words in photos. OCR is particularly useful when dealing with scanned documents.

The previous post addressed how to install Python, the text editor Sublime Text, and Pip, the python tool for installing libraries. Then, it explained how to install the python libraries Pillow, Tesseract, and Pytesseract. Finally, the post explained how different simple Python scripts can read text and then print it out in Sublime Text, create a text file and print the text into it, or convert the png file into a pdf that has a layer of OCR over the text.

For example the following script would take a png file named screen.png, read the text and print it out into a text file.

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into a text file
f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

Moving Forward, Converting Data into a Python List

Now if we take our output text file, we can turn it into a list by making this addition:

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into a text file
f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

# read the text file back in, one line per list element
rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

With the new list named “rows”, if we run “print(rows)”, the list of names will be the same.

If we wanted to print the list “rows” into a text file, we cannot simply write “f.write(rows)” because the function requires a string. So if we want to create a text file the same as the one we already created, we have to write the script like this:

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into a text file
f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

# read the text file back in, one line per list element
rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

# write each element of the list into a second text file
f = open("demofile4.txt", "w")
for row in rows:
	f.write(row)
f.close()

This addition will create a second text file “demofile4.txt” that is exactly the same as “demofile3.txt”. This simple task does not achieve anything but it shows how to work with the data. And if we have a list, we can compare the contents to another list. So with the names in a list we can do things like see if any of the donation recipients are also in the names of financial disclosure documents for politicians (meaning that a politician also works for the donation recipient).
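The comparison itself only needs plain list logic. As a minimal sketch with hypothetical names (not taken from any real filing):

```python
# Hypothetical lists of names, as if OCR'd from two different documents
donation_recipients = ["Adoption Rhode Island", "Food Bank", "Arts Council"]
politician_nonprofits = ["Adoption Rhode Island", "Veterans Fund"]

# Collect every name that appears in both lists
matches = [name for name in donation_recipients if name in politician_nonprofits]
print(matches)  # ['Adoption Rhode Island']
```

A later section applies this same loop to two OCR'd documents.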

Now, observe that if we run the script above and tell it to print the list in the python interpreter (Sublime Text), this is the result (script is on top, results are on bottom):

Each “\n” represents a line break. This means that the data recognizes the line breaks. This will be relevant in a moment.

There are other measures that can be taken, such as removing all symbol characters or making all text lowercase to make them more easily compared. However, these measures will not be addressed at the moment.
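As a sketch of those measures (not used in the scripts that follow, and shown with made-up example entries), lowercasing and stripping punctuation might look like this:

```python
import string

# made-up OCR output with inconsistent case and stray punctuation
rows = ["Adoption Rhode Island,", "[Food Bank", "ARTS Council."]

# lowercase each entry, delete punctuation characters, and trim whitespace
# so near-identical names compare as equal
cleaned = [r.lower().translate(str.maketrans('', '', string.punctuation)).strip()
           for r in rows]
print(cleaned)  # ['adoption rhode island', 'food bank', 'arts council']
```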

Converting the Data Directly to a List

But backing up a bit, it seems there should be a more efficient way to make a list without creating a file. However, note that if we try to put the data directly into the list (instead of putting it into a text file first), like in the script below, it will not work as intended:

from PIL import Image
import pytesseract

rows = []
data = pytesseract.image_to_string(Image.open('screen.png'))
# iterating over a string yields individual characters, not lines
for line in data:
	rows.append(line)

If we printed the list “rows” by typing “print(rows)”, we would get a list of each individual character from all of the names.
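You can see the difference with any string:

```python
data = "Anna\nBen"

# iterating over the string yields individual characters
chars = [c for c in data]
print(chars)  # ['A', 'n', 'n', 'a', '\n', 'B', 'e', 'n']

# splitting on the line break yields whole lines instead
lines = data.split('\n')
print(lines)  # ['Anna', 'Ben']
```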

Here is a better script below. The original data retains its line breaks, so we can assign the OCR output to a variable (named “data”) and separate the names with the “.split()” string method, choosing to split on the line break character “\n”. We put that together as “data.split('\n')” and assign the result to the variable “t”. Finally, we loop through each item in “t” (which now holds one name per element), call each item “i”, and append each “i” to “rows”.

This creates a list of every name.

from PIL import Image
import pytesseract

rows = []
data = pytesseract.image_to_string(Image.open('screen.png'))
t = data.split('\n')
for i in t:
	rows.append(i)

If we add in a “print(rows)”, the results look like this:

We see that there are empty lines that are identified as individual elements in the list, like this:

‘ ‘,

You may also notice that the character [ appears a lot in the list.

It is worth mentioning that if there are certain characters that you want removed from the list, like, for example, a “[“, you can simply add the code below:

# remove the unwanted character from every entry
rows = [s.replace('[', '') for s in rows]

The resulting list is clean.

This is particularly useful when you want to compare one list to another and you want cleaner data.

Now you have a proper list of the names that we can manipulate and compare.

How to Compare Data from Different Scanned Documents

From here, we will use a common example that involves looking at which nonprofits receive donations from a corporation and checking whether any of them are involved with local politicians. In this case we are looking at the pharmaceutical company AbbVie and a congressman who is involved in areas of potential interest to the company.

To explain why we would research this, it should be noted that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, about how corporate foundations are more likely to give charitable donations to a nonprofit if it is affiliated with a politician.

The study found that “a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” In addition, the researchers concluded that “a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study identified specific politicians of interest by checking if they were on any committees that were lobbied by a given company. If so, the researchers would check for a link via a nonprofit and donations. So in our example we will look into whether a specific company has given to a nonprofit affiliated with a politician of interest. The existence of such a link does not prove nefarious intent, but it is interesting and in some cases it can be an indicator of a deeper relationship.

This example is relevant because it requires researching files from two different databases (the IRS nonprofit filings database and the congressional financial disclosures database) with records that are often badly scanned documents. The lack of OCR on these files makes it difficult to find links.

We can use the png of donation recipients identified in the public tax filings of the AbbVie Foundation (affiliated with the pharmaceutical company with the same name, which has a presence in Rhode Island). We will name this file abbvie.png.

donation recipients

For the second file we will use a list of nonprofit organizations that have Rhode Island-based congressman Jim Langevin on their board to see if any of his nonprofits receives money from the aforementioned foundation. The information for both files was scraped from records made available via Propublica’s Nonprofit Explorer.

In this case we will use the following Python script to read both documents and compare the data.

from PIL import Image
import pytesseract

# OCR the first document and split it into a list of lines
rows = []
data = pytesseract.image_to_string(Image.open('abbvie.png'))
t = data.split('\n')
for i in t:
	rows.append(i)

# OCR the second document and split it into a list of lines
rows2 = []
data2 = pytesseract.image_to_string(Image.open('langevin.png'))
t2 = data2.split('\n')
for i2 in t2:
	rows2.append(i2)

# collect every name that appears in both documents
names = []
for name in rows:
	if name in rows2:
		names.append(name)
print(names)

Below you can see that running the script identified that the name “Adoption Rhode Island” was in both files (along with a lot of empty strings shared by both documents).

Compare Many Documents

Now that we have explained how to compare two documents, let’s look at how to compare many documents.

Let’s say you have 20 pdf files of a hundred pages each, you have another set of 20 pdf files, and you want to find common names or anything else that exists in both groups. First, go to a website that converts pdf files into png images.

Upload one set of 20 pdf documents and it will return a zip folder to download. Inside the folder is a set of png files; each page of each pdf file will be converted into a separate png file. Next, open Command Prompt / Terminal and navigate to the folder (you may consider copying and pasting the files from the zip folder to a regular folder). Once you have navigated there, type “dir /b > filenames.txt” in the Command Prompt and it will create a text file in the folder listing all of the file names. Copy and paste those files into the same folder as the Python script, and then replace the name of the png file in the python script with the name of the text file (which should also be in the same folder as the python script now).

The script will read the file names of the png files and then go to each png file to OCR it. You could also put the files in a separate folder and put the path to each file in the text file. This could be accomplished with the “Find and Replace” tool. For example, if every file starts with the same word (which will be the case because that is how the website will create the files for you), like “policy”, and the path to each file was something like “documents/research/policy1.png” (with the number changing for every file), you could tell the Find and Replace tool to find all instances of the word “policy” and replace them with “documents/research/policy”. This would leave the rest of the file names unchanged.
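The same renaming can also be done in Python instead of a Find and Replace tool. A minimal sketch, using hypothetical file names and a hypothetical folder path:

```python
# Hypothetical contents of filenames.txt, one png file name per line
filenames = ["policy1.png", "policy2.png", "policy3.png"]

# prepend the folder path to each file name so the OCR script
# can find the files in their separate folder
paths = ["documents/research/" + name for name in filenames]
print(paths)
```

In practice you would read `filenames` from filenames.txt rather than typing them out.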

Now repeat these steps for the second group of files and you are done.

That’s it.

Research Scanned Documents with Optical Character Recognition

Optical Character Recognition (OCR) means that your computer can read words in photos and scanned documents. When you have a document or documents where you want to copy and paste the words into a search engine or you want to do a word search for a specific name in the document, OCR will make that possible.

For example, two previous posts described how you can research nonprofits’ tax records or politicians’ financial disclosures in government databases, but the records are all in badly scanned pdf documents that do not recognize the words. Therefore, if you want to find out whether Exxon Mobil’s foundation donated to a Senator’s personal nonprofit, you potentially have to search through a lot of pages of tax records to look for the name.

This article will explain how to use Python to add OCR to files.

Get Started on Installations

You will need Python, Sublime Text, and Pip for the basics.

To start with, if you are completely new, you can download Python from its official website. Then you can download Sublime Text, a tool for editing and running Python scripts.

Now, access your Command Line (if you are using Microsoft) or Terminal (if you are using Mac).

Pip is included with Python, but you can find guidance for installing and updating it in the pip documentation.

If you get a message saying you need to upgrade pip, you can do so in the Command Line by typing: python3 -m pip install --upgrade pip

The next group of installations is Pillow, Tesseract, and Pytesseract.

Next install Pillow by going here for instructions.

(Install Pillow on Windows by typing: python3 -m pip install --upgrade Pillow)



There is the command-line program Tesseract and its third party Python “wrapper” (a library that lets you call the program from within Python) named Pytesseract.

Click here for a basic overview of Tesseract and its installation and usage; see the screenshot below. Following the screenshot there is a link to the more detailed documentation on installation.

Go here for installation documentation.

Note in the window below, in the second paragraph, that for Windows, if you want to OCR different languages you need to click on the link that says “download the appropriate training data”, which brings you to a webpage that offers a separate download for each language.

Tesseract installers for Windows are available here.

The installation process for Windows should identify the location where tesseract is downloaded; this is necessary to know so that “you can put it in your PATH” (a phrase that is often used but rarely explained with Python).

You have now installed tesseract, and if you are using Windows it will default to the location above.

How to Add Something to the PATH

Next, tesseract must be “added to the PATH”. This means that the directory containing tesseract must be added to the PATH environment variable.

The following instructions are for Windows.

1 – Go to System Properties

2 – Click on Environment Variables

3 – A box will appear that is titled Environment Variables. Within it, find where it says System Variables; underneath that there is a list of variables. Choose the one titled PATH.

4 – A new window appears. Click on one of the empty lines, then click on Browse and find tesseract. Click on the tesseract folder (or “directory”), then click OK until you have closed every box.

Now when you open the command prompt you should be able to run the tesseract command from any folder.
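One quick way to check whether tesseract is reachable is Python’s standard-library shutil.which, which returns the executable’s full path if it is on the PATH, or None if it is not:

```python
import shutil

# returns the full path to the tesseract executable if it is on the PATH,
# or None if the command cannot be found
location = shutil.which("tesseract")
print(location)
```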


Next, install the python wrapper library, Pytesseract, which uses your existing Tesseract installation to read image files and output strings and objects that can be used in Python scripts.

You can run “pip install pytesseract”, or go here to download the package and then run “python setup.py install”.

Command Line / Terminal command for OCR

The python capabilities described here require that you have a png file, not a pdf. Most pdf conversion tools are not friendly to Windows, but the methods here will work just as well on a Mac.

How to get a png file? Most snipping and screenshot tools will automatically create a png file of the image you are capturing. If you are working with a pdf file and only need one page, you can snip an image of that page.

Here is a very basic OCR python script. The script must be saved to the same folder/directory as the png file that you want to read. If not, then you would put the path instead of just the name. So if the python script was saved to a folder named “first” but the png file was in a folder named “second” that was located in that same folder named first, then instead of (‘yourfilename.png’), you would type (‘second/yourfilename.png’). This is otherwise known as the path to your file.

from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('yourfilename.png')))


There are many ways to run Python scripts, but for this post we suggest using Sublime Text. To run a script in Sublime Text you must save it (File > Save or Ctrl+S), then take the unintuitive step of choosing to “build” it (Tools > Build or Ctrl+B).

For this example I used a png file of a list of donation recipients from the Target Foundation’s tax records that looks like this:

The aforementioned python script will produce a list of words and phrases printed out into the python interpreter and should look like this:

If you want to print out that same list into a txt file, you could go to the Command Line / Terminal, navigate to the folder with the png file, and then type “tesseract”, a space, the png file name, a space, then the base name you want for the text file (tesseract adds the .txt extension itself), like so: tesseract yourfile.png textfile

The following script will take your original png file and convert it into a pdf file that has OCR over the words:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

pdf = pytesseract.image_to_pdf_or_hocr('pngtarget.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)  # pdf type is bytes by default

The resulting pdf looks like this (with words highlighted):

This method was equally successful when used on the full page of the original document that looked like this:

How to create and print a list to a TXT file

In Python, the built-in open() function with the “w” mode will create a new text file for writing.

So based on this information, if we want to print our items into a text file, we use this script:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

The resulting text file looks like this:

That’s it.

PACs, Foundations, Press Releases:

How to Research Corporate Efforts at Influence

The Dirt Digger’s Guide identifies political contributions, nonprofit donations, and press releases as three key factors in a corporation’s ability to influence others. Corporations spend a lot of money on donations to political and charitable causes, but instead of making direct payments they create PACs and foundations: PACs because of laws against direct corporate political donations, and foundations to reduce taxes.

To find a PAC, the easiest method is to go to Open Secrets and search for the name of the company along with “PAC”. For an example, the Target Corporation has a Target Corporation PAC with a profile page on Open Secrets that displays its information.

The source of the information is the Federal Election Commission website, and you can see in the Open Secrets profile page’s registration details section below that there is a link to the source.

The FEC website has its own profile page for the Target Corp. PAC. The format is not as good as the one on Open Secrets, but if you want to cite your data it is good to know the source.

A corporation’s foundation is usually easy to find with just a search on the corporation’s name and the word “foundation”. As another example, Propublica’s Nonprofit Explorer has a page on the Target Foundation’s tax records.

Note that in Section B from the most recent tax filing available there is only one source of its funding: the Target Corporation itself. This is standard practice for a corporation’s foundation.

Depending on the tax record, the corporation may list the recipients in an attachment at the end, or, in this case from 2015, in Statement A. Notice that Minnesota and Minneapolis appear a lot in the list. This is likely because the corporate headquarters is in Minneapolis. (An upcoming post will explain how to make these documents word-searchable with optical character recognition.)

Note that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence” (click here to read), revealing that S&P 500 corporations commonly donate to nonprofits linked to politicians they want to influence. (A previous post addressed how to look up whether a politician, or anyone else, is misusing a nonprofit for personal gain.) See below for the summation of the study’s findings.

“In our first analysis using these data, we show that a non-profit is more than four times
more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study also found that “…a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” It also noted that there was a high probability that the foundation would stop donating to the nonprofit when the politician lost a bid for re-election.

Therefore, the nonprofits that receive donations can be compared to politicians’ financial disclosures, which usually list whether they are involved in a nonprofit. A researcher can look at which committees a company lobbies and then look up the disclosures of the politicians who sit on those committees; a previous post addressed how to look up lobbying and disclosure records.

Press Releases

On a side note, it is important to remember that corporations use press releases as possibly the most direct way to release information and try to improve their public image. Specifically, press releases are used by corporations to influence the public, release information that is legally required to be made public, and describe/acknowledge their own activities in the manner that they want them to be viewed (which is arguably true of any organization).

There are three good places to find press releases. First, almost every corporation’s website has a page entirely dedicated to press releases. Second, press releases are often included in SEC filings, so you can do a keyword search in the SEC’s EDGAR database for the phrase “press release” along with the corporation’s stock ticker to ensure you only get results from that one corporation. Finally, there is a website that exists to post and cover corporate press releases. One of the benefits of using this website is that you can also read other press releases on the same topic from different sources/corporations.

The EDGAR keyword search function (often overlooked)