Python for Beginners – Find if a Phone has any Instagram Account

This guide will walk through how to use Gitpod to run the Megadose/Ignorant script that checks if a phone number is registered to any Instagram account. Note that it will not identify the specific account.

If you already have Gitpod and GitHub accounts, log in to both and jump to Step 3.

Step 1 – (the hardest) – Click here and sign up for an account on GitHub. Sorry, that is more than one step, but the process is simple and it gets easier afterward.

Step 2 – Log in to Gitpod. You do not need to sign up for Gitpod if you already have a GitHub account. Go here (https://gitpod.io/login/) and you will see an option on the left to sign in with your GitHub account even though you don’t have a Gitpod account. See below:

Once you have logged in, your page will probably look like this:

Step 3 – Copy and paste this url into your browser and hit Enter:

gitpod.io/#https://github.com/megadose/ignorant

Why? – Basically, you are making a URL that consists of the Gitpod website’s URL, a hash symbol (#), and the URL of the GitHub page for the script.

Here is the explanation. We want to run a Python script, but to do so we need a development environment. Normally you would download one, but in this case Gitpod provides a development environment online where you can run Python scripts. When you identify a script posted on GitHub, you create a URL out of the Gitpod website’s URL, a hash symbol, and the URL of the page hosting the Python script. So with our script hosted at https://github.com/megadose/ignorant, we combine the pieces as follows:

gitpod.io + /# + https://github.com/megadose/ignorant

to make this url – gitpod.io/#https://github.com/megadose/ignorant

Gitpod will create a workspace, a virtual computer, specifically for running the script. The script and its affiliated files will be downloaded, though you will likely still have to run the setup.py file or its equivalent. If you go to the script’s page on GitHub, there should be instructions for downloading and running the script.

Wait for Gitpod to do some processing and then your screen should look like this:

Step 4 – At the bottom of the screen find where it says “/workspace/ignorant $”.

Click to the right of these words, type “python3 setup.py install”, and then hit Enter.

Step 5 – When the command is done, type “python3 setup.py install” again and hit Enter.

Step 6 – Wait for the install to complete and then right-click on the folder on the top left that is named “ignorant” (not the one titled “ignorant.egg-info”). When you right-click on the folder, a drop-down menu appears; choose “open in terminal”.

Step 7 – A new tab appears in the terminal that reads “gitpod /workspace/ignorant/ignorant”, and the cursor is located next to a similarly named prompt.

Finally, choose your phone number of choice. (I will choose the US-based fake number 123456789, and the US country code is 1.) So I would type “ignorant 1 123456789” and you will get results like you see below.
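In general, the command takes the country code first and then the phone number (this general form is inferred from the example above; check the script’s GitHub page for current usage):

ignorant [country code] [phone number]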

Note that the script also checks if the phone was used to register amazon.com or snapchat.com accounts.

Now that you’ve done this once, the process will be much easier in the future.

Next time

Login to Gitpod and there will be a workspace named for the script. It should look like the image below. Just click on the workspace.

This should reopen your workspace to right where you left off and you can run the code again.

Data Analysis for Beginners

This post explains how to conduct Data Analysis on large sets of data using Google Sheets and is intended for people with no background in the subject. The post will explain how to import data into Sheets and how to summarize, sort, filter, and modify the data by performing custom math on it. Finally, we will walk through how to cross reference data from different sources.

CREDIT WHERE IT IS DUE: The post is based on following the instructions from Lam Thuy Vo’s book Mining Social Media. Chapter 6 of the book, which focuses on data analysis, is available online here. This information was originally identified by the website Tools For Reporters, here.

In the words of the author, you learn “how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.”

This post will address how to import, modify, and analyze data in Google Sheets.

In Part 1 we will go over how to import data into Google Sheets, make it recognize numbers, and use the tool for splitting data from one column into two columns.

Part 2 will go over how to create a Pivot Table (which provides statistics about our raw data) to count how many times a specific value occurs in a column. In the example, this will be used to find out how many times a Twitter account tweeted on each date. The goal of the example is to identify if the Twitter account is a bot based on the frequency of its tweets.

Part 3 describes how to use formulas and functions to analyze data. While Part 2 aggregated the data, Part 3 will use math, such as finding the average number of tweets per day by a given Twitter account, in order to see if it is within the bounds of a normal human user or a bot.

Part 4 addresses how to sort and filter data. You can choose one value in your data (such as the number of retweets) and order all of your data based on ascending or descending number of retweets. You can also sort the data in the pivot table (which is based on the raw data). Filtering allows you to hide data you do not need or find the most relevant data (maybe search for all tweets that mention a specified person).

At this point, the reader has a general understanding about what kind of data analysis is available, but there are too many functions and tools to learn them all. It is better to obtain data and then consider what you would like to do with it. Then work backwards to look at the Sheets functions and tools available to find a way to achieve your goal. Note that Parts 3 and 4 reference lists of the available functions and tools.

Part 5 addresses how to cross reference data from different sets based on a common data type or value, for example if both data sets have a “dates” column.

Part 1

Importing Data to Google Sheets

First we are going to import the data, which was originally obtained directly from the Twitter API (a subject for a separate post). We start with the data in CSV format.

There are a plethora of potential data sources that allow you to download data in this format. For example, the FEC provides CSV data for public campaign financing (click here) and OpenSecrets provides CSV data on corporate lobbying records (click here for an Exxon example).

Go to Google Sheets and start a new sheet; it should say “blank” under “start a new spreadsheet” on the left side.

Click on “file” then “import”; the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file (preferably a CSV file), the following window should appear:

Then choose the following in order:

1 – under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” (since we are using a CSV in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” (we will change the formatting ourselves because the software makes mistakes)

At this point we are not worried about whether Sheets can differentiate between a number, word, or date.

Now make a copy of the original data set. (You can also use “file” then “version history” to look at past versions.)

4 – Click on “Sheet1” (you will make this one the original to set aside in case you need to reference the original data later) and rename it “original data”, or whatever.

5 – Then, to create a copy to work with, click the small arrow next to “Sheet1” / “original data”, or whatever you call it now, and in the menu that appears choose “duplicate”. In the book they name the second sheet “step 1: modify and format”, so we will use the same name here for the sake of clarity.

Transform data in columns from strings to numbers

Highlight column C by clicking on the “C”. Then click on “format”, then “number”, then “number” again. This transforms every value in the column from a string to a number. Repeat this process for the Retweets column, which also contains numbers.

Split Text To Columns

We want to find out how many times the twitter account tweeted on each day. So we will count how many times the same date occurs in our raw data on the tweets.

To do this, we need to create a column that only includes the dates of each tweet and then count how many times each date (or “column value”) occurs. So if March 3rd occurs ten times, that means ten tweets came out on that date.

The problem is that our data only provides a column with the date and time together for each tweet. That means several tweets on the same date (for example, “March 3rd at 12pm” and “March 3rd at 4pm”) will be counted as separate values instead of being identified as tweets that occurred on the same date.

The solution is to use the Split Text to Columns tool which will separate the dates into one column and the times into a second column. This will let us count the values in the dates column so we can see how many times each date occurred.
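For example, a combined cell like the one below (the exact format depends on your data source) would be split into a separate date cell and time cell:

2016-12-10 06:01:57 → 2016-12-10 | 06:01:57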

How to Use the Split Text to Columns Tool

1 – Click on the new sheet (so you are not working with the original data) and find the “Split Text to Columns” tool under the Data menu.

The tool looks for common characteristics in the columns (like a semicolon in-between two numbers) so that it can separate the data.

2 – Choose a column to split and then create an empty column to the right of it.

3 – To do so, right-click the letter above the column and choose “Insert 1 right”.

Now you can use the Split Text to Columns tool without overriding any data.

4 – Click on the column you want to split so that it is highlighted, and then in the menu above click on “Data”.

5 – Click on Data and then choose Split Text to Columns.

It will split the text between the two columns based on the default setting, “Detect Automatically”. A small window titled Separator also appears, giving you different options for how to separate the data.

6 – We will choose “Space”.

Note that these columns of dates and times are still formatted as strings. That is because the separator tool works best with strings, so it makes sense to work with the data as strings now and convert it later.

7 – For the sake of clarity, we now change the column headers to “date” and “time” to reflect the information in each column.

Now the information is ready to create a Pivot Table, which will summarize the data for us.

Part 2

According to Lam Thuy Vo, at this point the data is prepped, and the next part is to aggregate the data by using pivot tables and formulas.

Pivot tables can, according to the author (p. 110 from the book), “summarize data…in various ways…we can count the number of times a term occurs in a column or calculate the sum of numerical values based on a date or category.” This is done with the creation of a summary table that gives statistics about our raw data.

In the example from the book the reader uses pivot tables to find how many times each date occurs in a column.

Recap:

Recall that the data originally provided the date and time in the same column.

So first the data looked like this:


Therefore in Part 1 the reader had to separate the dates into a separate column, so that the column value for each tweet would be only the date. If we count the number of times the same date/column value occurs in that column, that is the number of tweets that occurred on that date.

How to Create a Pivot Table (shorter explanation)

Note that if you want a short explanation on creating pivot tables, Google provides one here.

1 – In short, to create a pivot table you highlight the data you want to use, then find and click on “create pivot table”.

2 – Then, when the table is created, you input data into it by using the pivot table editor. In the editor there are two options, “rows” and “columns”; you will use these to choose what data to include.

3 – Within the Rows/Columns options, you will choose the data by choosing column headers from the original spreadsheet and each unique value from that data will be listed once in the table. Rows/Columns also lets you order the data ascending or descending.

4 – Finally, you can use the pivot table editor’s “values” option to summarize the data (by adding, counting, finding the average, etc.) or the “filter” option to filter out data values based on the value itself (in which case you choose which specific values to include) or based on a conditional, which is explained below.

Select Data and Create Pivot Table

Returning to the guidance from Mining Social Media: The first step in creating the pivot table is to choose the data that it will analyze. We want to choose all of the data on our sheet, so we click on the little rectangle in the top left where the column and row headings meet. This highlights everything so that the pivot table will have access to all of the data. The pivot table will not necessarily use all of the data, but this gives us the option to use whatever data we want.

Then click on the Data tab and then Pivot Table. A window will pop up; make sure New Sheet is chosen and then click Create.
A new sheet will appear, named Pivot Table 1.

Pivot Table Editor

On the right side of the pivot table there is something called the “Pivot table editor”. You use this to put the highlighted data from the sheet into your pivot table and choose how to analyze it.

Note that it suggests some possible things to do, and it even recognizes your data column headers in its suggestions. You can find out what these suggested options would look like with your data: hover your cursor over any of the suggestions and a magnifying glass appears on the right side. Click on the magnifying glass to see what kind of table the suggestion would produce.

For example, we click on the magnifying glass that appears next to “Number of unique id for each language” and get the table seen below. The table shows that there are 5 tweets identified as “ar” (Arabic) in the data under the language column. This does not put the data into your pivot table; it just shows what it would look like.

Returning to Our Pivot Table and Counting Dates

As previously noted, on the right side there are options for how to “populate” the pivot table with data and analysis. We want to create a column with each unique value from the “date” column in our sheet. Eventually we want to show, next to that column, how many times each of those dates occurred, but for now we just want a column that shows each unique date.

In the Pivot Table Editor we see the Rows option and click on Add. This means that you are choosing to put your data in rows, and when you click Add you choose which source of information (which column’s data) you want to input into the rows of the pivot table.

A drop down menu appears and it has the names of each of the column headers from the “step 1: modify and format” table. You click on Date and now the table should look like the screenshot below.

It might seem confusing that the Rows option basically entails choosing a column from the raw data. It is best to think of it as choosing what data will be added into rows.

A new column will appear in the pivot table (named “date”) and in the rows below it each unique value from the Date column (from the “step 1: modify and format” sheet) will appear once. Recall that in the “step 1” table each of these unique values (which are the dates) appeared several times.

If we had chosen “Columns” and then “date”, we would have gotten each unique value (date) in a different column in one row. See below:

Now we return to our table, which has each unique value in a different row of one column. Our goal is to count how many times each unique date occurred in the date column of our sheet (“step 1: modify and format”).

So now we want to do analysis on the unique values and display it in a new column.

The Values option in the Pivot Table Editor will let us do analysis on our data. We choose which data we want to work with (in this case the “date” column). To do this we go to Values and click Add. A drop down appears and we choose “date” once again.

Just to be clear about how the Values option works: you choose one of the columns from your raw data and then choose a function to apply to it. So, for example, you can add every number in the column together, multiply them, count them, or find the average.

A box appears under Values showing that “date” is the column we are working with. Under “summarize by” we can click and a drop down appears with different ways to summarize the data. We choose COUNTA, which will count the number of times each unique value in the date column appears.

There was also a COUNT option available, but we do not use that because it will only work with data that is formatted as a number. Other options include SUM, AVERAGE, MEDIAN, PRODUCT, MAX, and MIN. Google has a help page here that lists and explains each of the available functions.

Also see that a new column appears (named “COUNTA of date”) that shows how many times each date appears, automatically putting the count for each date next to the unique value in the rows of the first column.

Success! We now have a pivot table that shows how many times each date occurred in our data. The first row shows that the date 2016-12-10 occurred 3 times, meaning there were 3 tweets on that date.

The author explains in the book that this information is useful because you can identify if a Twitter account is probably a bot based on how often it tweets. The author cites that, according to the Digital Forensic Research Lab (digitalsherlocks.org), tweeting more than 72 times per day is “suspicious.”

With that information in mind, we can look at our data on a Twitter account and judge if it is tweeting on any given day at a “suspicious” rate. (The answer is “yes”.)

SIDENOTE ON VALUE DROPDOWN MENU

The Values option in the dropdown gives a list of ways to summarize data, but they are not always easy to understand. Here is an explanation of each:

  • SUM – Returns the sum of a series of numbers and/or cells.
  • COUNTA – Returns the number of values in a dataset.
  • COUNT – Returns the number of numeric values in a dataset.
  • COUNTUNIQUE – Counts the number of unique values in a list of specified values and ranges.
  • AVERAGE – The AVERAGE function returns the numerical average value in a dataset, ignoring text.
  • MAX – Returns the maximum value in a numeric dataset.
  • MIN – Returns the minimum value in a numeric dataset.
  • MEDIAN – Returns the median value in a numeric dataset.
  • PRODUCT – Returns the result of multiplying a series of numbers together.
  • STDEV – The STDEV function calculates the standard deviation based on a sample.
  • STDEVP – Calculates the standard deviation based on an entire population.
  • VAR – Calculates the variance based on a sample.
  • VARP – Calculates the variance based on an entire population.

Part 3

Formulas and Functions

Google Sheets has a basic instruction page and video here to explain how to use functions to create formulas.

The intro video explains that a Function is a built-in operation in Sheets, such as adding or subtracting. Functions appear as options in dropdown lists and they can also be typed directly into Sheets. For example, we used the Function COUNTA in the pivot table to count how many times each date appeared.

Within Sheets, we can also type functions directly into a cell by typing the “=” sign, the name of the function, and then specifying the data (or the location of the data) in parentheses.

For example, look at the sheet below:

In this case we want to add the values in the two cells A2 and A3, and put the return value in the cell B4.

1- So we start by typing the “=” sign so that Sheets knows we are typing a function.

2- We type the name of the function; in this case the function name “SUM” is for doing addition.

3- Finally, we specify which cells we want to add by identifying them in parentheses, separated by a comma. If you hit Enter, the cell B4 will just have the number 4 in it; the formula disappears from view but still remains attached to the cell.
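Putting it together, the complete formula typed into cell B4, using the cell references from this example, is:

=SUM(A2, A3)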

Side Note: when specifying cells in parentheses, you can use (A2:A7) to mean all the cells from A2 to A7. You could also type (A:A) to refer to the entire A column. You can choose cells from a different sheet within the same file by typing the name of the sheet in single quotes followed by an exclamation mark. So if you had a different sheet named Sheet2 and you wanted the sum of its cells A2 to A7, you would type =SUM(‘Sheet2’!A2:A7). Also, you could just type =SUM( and then use your cursor to choose the cells.

A Formula uses a Function (or several Functions) to obtain a specific result. So in the previous example we created a Formula in cell B4 that used the SUM function to add the values of A2 and A3.

Smart Copying

The author goes on to point out the value of smart copying. Suppose we want to find the length of each tweet. We can return to the sheet “step 1: modify and format” and add a column to the right of the text column (which is the H column): right-click on the column header and choose “insert 1 right” (which creates the empty I column).

So if we type the formula =len(H2) in the I2 cell and then hit Enter, we will get the number of characters (the “length”) of the H2 cell.

Sheets actually suggests using smart copying to fill in the column below with the same formula, adjusted for the placement of each cell.

Or you could find the little box in the bottom right corner of the first cell with the formula in it and pull down over the other cells. This will also smart copy the formula to them.
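To illustrate what smart copying does, the row number in the copied formula shifts automatically as you fill down the column (cell references follow the example above):

=len(H2) in cell I2
=len(H3) in cell I3
=len(H4) in cell I4, and so on.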

Part 4

Sorting Data

Vo says that the next step is to learn how to sort and filter the results, or data in general, to “rank or isolate data.”

In short, to sort or filter data, you first remove the formatting by copying and pasting the data into a new sheet, pasting with Paste Special. Then you highlight it and create a filter view, which allows you to sort and filter. This does not alter the data; it only rearranges it or hides parts of it.

Vo uses the example goal of sorting the data in the pivot table to see how the suspected bot tweeted on its busiest day. To do this, Vo explains how to do so by “creating a new sheet with our aggregate results and changing the entire sheet to a filter view.”

Paste Special

We will want to copy and paste the data from the pivot table into a new sheet to isolate our work. Note that those cells contain data and formulas, and we want to paste only the cells’ data without the formulas. To do this we will use Paste Special.

First we create a new sheet by clicking on the plus sign on the bottom left.

The next step is to highlight the cells with the data you want, which in this case is the entire table.

Open the new sheet (named “Sheet 3”) and right-click in the A1 box.

In the drop down menu, choose “Paste Special” and then “Paste Values Only”. This will paste only the data and none of the formatting (which includes the formulas). The data is now formatted as strings of integers.

We need to convert the formatting of the data (the dates) from integers to dates so that Sheets will adjust how it deals with this data. To do this, highlight the newly pasted cells and click the Format option in the title bar. From the dropdown menu choose Number and then Date.

Filter View

We need to turn the data into a filter view.

First, we select all of the data in the sheet by selecting the box in the top left corner between the row and column headers.

Clicking this box highlights every cell in the sheet.

Next, look at the tools under the headers and find the filter icon, which looks like a funnel; see below:

If it is not there, you may need to click the 3 dots icon seen below. When you see the filter icon, click on the upside down triangle to the right of the icon.

Make sure the cells are all still highlighted/selected and click on the triangle and choose Create New Filter View.

This creates a new heading row and filter icons at the tops of the columns, as below:

The new icons in the first row can be used for the filter functions. If we click on one (in this case we are using the Date column), we are given a dropdown menu with sort and filter options.

Rearrange in ascending or descending order

At the top there are options to sort the data in ascending or descending order (“Sort A->Z” means ascending, regardless of what kind of data you are using). If we choose to sort in ascending order based on the Date column, the values in the COUNTA column will also be rearranged to align with the Date column values.

This is also where it is important that the dates column is formatted as dates. Sheets will recognize the values as dates and arrange them in chronological order. If we had not changed the formatting, Sheets would view the values as integers and arrange them from lowest to highest.

Filter by value

We can also filter out certain values by using the options at the bottom of the menu. There is a list (that continues downward past what is visible) of every value in the dates column, and there is also a search function if you have a particular value in mind. By unchecking the checkmark next to one of the values, Sheets will hide that value, but only in this particular sheet.

Filter by condition

Finally, the filter dropdown menu gives the option to filter based on a condition. A condition is basically an “if…then…” statement. If we click on the filter-by-condition option in the menu, a minimized dropdown appears; if we click on that, we see a list of conditions we can use.

For example, if we were looking through the content of the tweets, we could use this to look for tweets that mention something specific.

Part 5

Combining Data from Different Sources

Sheets allows us to cross reference data sets and combine them into one data set. This is also known as Merging Data Sets.

What are we going to do, specifically? We will take two pivot tables in two separate sheets. The first table (from before) shows how many times a Twitter account (the bot account @sunneversets) tweeted per day on specific dates.

The second table shows the number of times a second twitter account (a human’s account, @nostarchpress) tweeted per day on the same dates.

Then we will create a third sheet where we put the data side by side. To do so, first we will choose a date range for the data we want to look at and we create one column with each date in the range listed in separate rows. We name the first column “dates” and then name the next two columns “Pivot Table 1 – sunneversets account” and “Pivot Table 2 – nostarchpress account”. Finally, we create a formula that we will put in the 2nd and 3rd columns that will automatically find and input the relevant data.

The end result will look like this:


How to Cross Reference Data

We use the formula =vlookup() to cross reference and merge the data sets based on a common value. Basically, that means both data sets have to have a column for the same kind of data (such as a date column). One data set shows how many tweets occurred on each date, while another data set might show how many political speeches occurred on those dates.

To provide an example of a second data set to reference, Lam Thuy Vo provides a second twitter account’s csv file (nostarch_tweets.csv) on her GitHub page.

This new file is made up of data on a human being’s twitter account.

Merging the Spreadsheets

Vo points out that the first step is to create another new sheet by clicking on the plus sign on the bottom left in Google Sheets. Name the new sheet “cross reference sheet”.

As a reminder, to import data from a csv file into a Google Sheets sheet, open a new sheet and then take the following steps:

Click on “file” then “import”; the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file (preferably a CSV file), the following window should appear:

Then choose the following in order:

1 – under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” (since we are using a CSV in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” (we will change the formatting ourselves because the software makes mistakes)

4 – then we split the dates and times from the one “created_at” column into two columns, one for dates and one for times. To do so, we highlight the column, click on Data, choose “split text to columns”, and when a window pops up asking for a separator, choose space. Name the column with dates “date”.

5 – highlight the date column, click on Format, and in the dropdown choose Number and then Date.

6 – we also need to turn this data into a pivot table like the last one. Click the box in the top left corner between the column and row headers so that everything is highlighted, click on the Data tab, and choose to create a pivot table in a new sheet. Name the new pivot table Pivot Table 2, click on it, under Rows choose date, and under Values also choose date, summarized by COUNTA.

Prep Cross Reference Sheet with the Common Value

In the sheet named “cross reference sheet” we are going to focus on the data from the two CSV files that occurred within a specific range of values in one of the columns. In the example from the book Mining Social Media, the author focuses on the data in the date range April 13th to May 1st, 2017. To do this, you type “date” in the A1 cell and then input the dates in the column below it. To do this quickly, type the first two dates in the first two cells and highlight them as seen below.

Next, click on the blue square on the bottom right and pull down to highlight the 20 empty cells below it. Sheets will figure out the pattern and automatically fill the rest of the cells with dates.

Using the =vlookup() Formula

According to Lam Thuy Vo, the =vlookup() formula “looks at a value in one table, looks up the value in another table, and then retrieves data based on this common value”.

So we will use the sheet “cross reference sheet” that has the first column of dates for our date range, and then we will put formulas in the second and third columns that will find the relevant data from the two pivot tables and input the data into the “cross reference sheet” columns.

But first (according to para 2 on page 118) we need to set up a new column for @sunneversets100’s daily tweet counts right next to the date column.

We start by naming the second column “Pivot Table 1 – sunneversets account”. This column will list the number of tweets from that Twitter account for each date identified in the first column. To do this, we will write a formula in the first cell (located at B2) that will take the value in cell A2 (which has the date 4/13/2017), look in Pivot Table 1 for the value associated with the date 4/13/2017, and input it into cell B2 of the cross reference sheet.

The =vlookup() formula takes four “arguments” in the “()” section.

First Argument – This argument is the value that will be looked up in another table. In our case it is the first date in the series of dates, which is identified by its cell location, A2. So we start with =vlookup(A2, …).

Second Argument – This is a bit complicated and therefore best explained in the author’s own words:

Recall that Pivot Table 1 looks like this:

So we are telling the =vlookup() formula to look at columns A and B because A has the common value (the date) that exists in both pivot tables and B has the values that are unique to Pivot Table 1.

Third Argument – This argument might seem a bit redundant. The formula needs to know where the relevant data is located in the pivot table sheet relative to the column with the common value. The common value in Pivot Table 1 is the date in the date column (the first column), and the relevant data is in the second column next to it. We communicate the location to the formula with the number 2, which tells the formula that the data is, essentially, one column to the right. So our formula now looks like this:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, …)

Fourth Argument – In the fourth argument we tell the formula, as the author explains, “whether the range we’re looking at has been sorted in the same order as the table we created for our data merge.” This is not the case, so we input FALSE into the formula. The author recommends always inputting FALSE because it will still find the data even if it has been sorted. The complete formula is as follows:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE)

Input the =vlookup() Formula

In our “cross reference sheet” we input the formula in cell B2 as seen below:

cross reference sheet

The formula looks at cell A2 in the sheet “cross reference sheet” and sees that the value in the cell is 4/13/2017. The formula then looks at the “Pivot Table 1” sheet (see below) and searches column A for the same value. It finds the value 4/13/2017 in the 5th row of column A, then looks at the value in column B to its right and sees that the value in B5 is 1054. So the formula brings that value back to cell B2 in “cross reference sheet”.

So, upon hitting enter, the formula shows the value 1054 in cell B2 (as seen below). We can smart copy the formula into the rest of the column by clicking on the small blue box at the bottom right of cell B2 and dragging it down to the rest of the column.

In order for the third column to perform the same task for Pivot Table 2, we use the same formula but change “Pivot Table 1” to “Pivot Table 2”. Smart copy again and the table should look like this:

Notice that some cells have #N/A written in them; that is because there was no data available. There is also a red mark in the top right corner of these cells, which indicates an error. We can fix this by altering the formula so that if there is an error, it inputs 0 instead. This is accurate because in our case, when there is no data, it is because there were no tweets from the account on that date.

To make this change we use the =iferror() formula. This formula takes 2 arguments.

The first argument is the entire previous formula followed by a comma (it might seem like the formula should be inside its own set of parentheses, but that is not how this works). We also remove the = sign from the =vlookup() formula.

So we take the formula =vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE) and put it in =iferror() as the first argument. Which yields:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), …)

The second argument is what we want to input in the cell if the original formula yields an error. In our case we want a 0 so we use a 0 as our second argument. The final product looks like this:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), 0)

Now we replace the formula in the sheet “cross reference sheet” with this one and we get the following:

That’s it! Congratulations, you have successfully learned to cross reference data from two different sources!

The Path Forward

Lam Thuy Vo does a great job of addressing the path forward after you have learned the functionality to perform her example. The following sources can be used for learning or merely as a reference as you use Sheets for your own purposes.

The Google help page for Sheets is here – https://support.google.com/docs/answer/6000292?hl=en&ref_topic=2811806/.

A list of available formulas – https://support.google.com/docs/table/25273/.

If you find yourself performing the same task repeatedly, the author notes that you may want to write your own custom functions, and there is a handy walk-through here – https://developers.google.com/apps-script/guides/sheets/functions/.

The author sums up what you have learned as follows:

In this chapter, you saw how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.

Congratulations again!

A Short List of Good Tools to Research a Twitter Account

Geo Social Footprint

Geo Social Footprint (http://geosocialfootprint.com/) should show a twitter account’s geotagged tweets on a map and link to the tweets themselves. For geo context, you can find what people nearby are tweeting about by using (https://www.omnisci.com/demos/tweetmap) or (https://onemilliontweetmap.com/).

Foller Me

Foller Me (https://foller.me/) gives similar info on an account (and is easier to read), such as when they joined, but also gives a larger list of the people that the researched account interacts with.

Twitonomy

Twitonomy (https://www.twitonomy.com/) performs analysis on the account as a whole, but you have to remember to search for an account and then actually click on the account name somewhere in the results. It gives information such as which accounts it tweets about or replies to most, how many times the account tends to tweet per day and from what kind of device, and how often they tend to tweet on given hours in the day or days in the week.

Sleeping Time

For redundancy, Sleeping Time (http://sleepingtime.org/) also gives the hours of use for an account.

Tweet Topic Explorer

Tweet Topic Explorer (http://tweettopicexplorer.neoformix.com/) identifies the most common words tweeted from the account (excluding useless words like “the”) and allows you to click on any of them to immediately see a list of the tweets with that word. Scan the map for words that might reflect important things about the account user like political views or profession.

Tweet Beaver

Tweet Beaver (https://tweetbeaver.com/) has a variety of tools that are especially useful for assessing a relationship between two accounts (common followers, what they have tweeted at each other, etc.).

Python Scripts, Research Tools for Obtaining Data About Youtube Videos

If you are researching a YouTube video, there are two good Python scripts for scraping data on it. There are also several research tools available online for obtaining more information about the video. This post will describe how to use Python and then go into the available research tools.

The Python Scripts and the Youtube API

To use these scripts (or any scripts that scrape YouTube data) you first need to get a YouTube API key. This article will walk through getting a YouTube API key and finding and running the two Python scripts on a given video.

Getting a YouTube API key:

First, go to “https://console.developers.google.com/apis/credentials”.

Agree to the terms of service and then click on “agree and continue”.


Next click “select a project”


Then click on “new project”


In the next screen click on “create”, then in the following screen click on “select project”. On the left, click on “APIs & services” and then “credentials”.

Then click on “create credentials” and then “API key”


You are not done yet; next click on “library” on the left, and in the next screen scroll down to YouTube and click on “YouTube Data API v3”.


In the next screen click “enable”. Now your API has access to YouTube data!

To return to your API key, click on “credentials” on the left and you will be brought to a page that has your key.

The API documentation is located here – “https://developers.google.com/youtube/v3/docs”.
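If you want to test that your key works, here is a minimal Python sketch using the google-api-python-client library (an assumption on our part; the scripts below do not require this exact code). It can be installed with “pip install google-api-python-client”:

from googleapiclient.discovery import build

# Hypothetical placeholders: substitute your own API key and a real video ID
API_KEY = "YOUR_API_KEY"
VIDEO_ID = "SOME_VIDEO_ID"

# Build a client for the YouTube Data API v3 and request one video's metadata
youtube = build("youtube", "v3", developerKey=API_KEY)
response = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

# Print the title and view count of each video returned
for item in response.get("items", []):
    print(item["snippet"]["title"], item["statistics"].get("viewCount"))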

Python Script for Scraping Video Info

There is a Python script on GitHub that will scrape the video’s info and present it in a CSV or Excel file for you.

To find and run the script, first go here to get set up for the script – “https://github.com/lamthuyvo/social-media-data-scripts/blob/master/README.md”

You will need Git; this website will show you how to install it – “https://git-scm.com/book/en/v2/Getting-Started-Installing-Git”

Then go to Terminal or Command Prompt and type the following 3 commands (without the quotes):

“git clone https://github.com/lamthuyvo/social-media-data-scripts.git”

“cd social-media-data-scripts”

“pip install -r requirements.txt”

Then, in Terminal / Command Prompt, navigate to “social-media-data-scripts/01-apis/scripts/”

Find the file named “secrets.py.example” and put your YouTube API key in there where it says “youtube api key =”, then change the file name to “secrets.py”.

Now, for the video you are researching, you need to find its video ID. See the explanation below for finding the ID.

An explanation of where to find a video’s ID is available here – “https://gist.github.com/jakebellacera/d81bbf12b99448188f183141e6696817”

Next, go to the Python script “youtube-get-video-info.py” and input the video ID in line 12 where it says “video_ids = ”.
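For example, the completed line might look like the following (the exact format is an assumption; check the comments in the script itself):

video_ids = ["VIDEO_ID_HERE"]  # hypothetical placeholder for your video's ID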


Run the script with the ID of the video (or videos) that you are researching, and an Excel file will appear in the “output” folder in the “01-apis” folder. The Excel file will have the video’s information for each of the following parameters:

  • youtube_id,
  • publishedAt,
  • channelId,
  • channelTitle,
  • title,
  • description,
  • tags,
  • viewCount,
  • likeCount,
  • dislikeCount,
  • favoriteCount,
  • commentCount, and
  • topicCategories

Python Script to Scrape a Video’s Comments

Also, if you want to scrape all of the comments from a video, you can go here – “https://github.com/Jabrils/Download-All-YouTube-Comments-From-Any-Video?files=1”

Then go to the script DumpAllComments here – “https://github.com/Jabrils/Download-All-YouTube-Comments-From-Any-Video/blob/master/DumpAllComments.py”

To run the script you need to install pytube by typing “pip install pytube” into the terminal.

Copy and paste the script into Sublime Text (or whichever editor you are using) and put your API key and the video ID in lines 7 and 8 as shown below:


Run the script and you will get a TSV (tab separated values) file in the same folder as the script; it will contain all of the video’s comments.

Alternatively, you can click on the “gitpod” button below the script in GitHub and run the script in Gitpod. This is pretty simple. You first need to install pytube by typing “pip install pytube” in the terminal (at the bottom of the screen in the picture below). Then click on the file “DumpAllComments.py” on the left and input the API key and video ID into the script.


Save the file and then click on the little green arrow on the top right.


A file will appear above “DumpAllComments.py” with your comments. The file contents will look like the screenshot below:


That’s it! You are done!

The Research Tools

This next part will address some of the tools that are available for researching videos online.

What to Look For

If you are researching videos online, there are three basic research goals.

1 – Search for videos. There are two sub-reasons to search for videos. 1a – You are looking for information that might be available in a video, but you don’t know what that video is. In this case you are looking for videos by topic. 1b – You have already found a video you are very interested in and you want to analyze it, but before you can do so, you need to find the original video. Copies of the video may be altered or edited, plus you might be interested in researching the source of the video. To find the original video you must search for other versions of the video online and find the one that was posted first.

2 – Try to find out who posted the video. This is difficult and basically involves searching whether the YouTube account username is also used on social media accounts. Also, you can look up the first commenter, or maybe the first few commenters, that posted on the video, because they might know the person that posted it. You can search for social media accounts with their usernames and look into their common friends for potential candidates. Searching for the original video may reveal that it was originally posted on a social media account, which would obviously make the job easier.

3 – Analysis of the video content. This skill is largely outside the bounds of this blog, but we identify tools and guides for this kind of work. If you are searching videos to see if they have information on a specific topic (like a person, company, or legislation), you can now do a word search in YouTube to see if anyone mentions it. This feature can save a researcher a lot of time going through different videos.

Search for Videos

Searching for videos can be difficult. If you are looking for videos on a certain topic you can try using Petey Vid, a text-based search engine that searches exclusively for videos. Keep in mind that text-based searches, whether on Petey Vid or Google, can only search for text that is affiliated with a video. So if the video is on a blank webpage and has no affiliated words, the search engine can’t find it.

According to Bellingcat’s Aric Toler, there is currently no way to run an Internet search on a video (this refers to the idea of actually uploading a video to a search engine that would look for other videos based on it, like a reverse image search). So the next best thing is to get thumbnail photos of the video and run reverse image searches on those photos. The idea is that you are searching for the photo that appears on a video when it has not yet been started. If this sounds confusing, we will walk through an example. At the time of this post’s writing, the YouTube homepage looked like the screenshot below:

So if you wanted to search for the video on the top right, you could use the snipping tool to capture the photo that is currently on the video while it has not yet been played. You would do a reverse image search on this photo:

Amnesty International also has a public tool called the Youtube DataViewer that extracts data from any YouTube video and creates four thumbnail images from the video that you can use for reverse image searches. For example, if we paste the YouTube video’s URL into the tool and run it, we get the following results.

The results above show the name and description of the video, the video ID, and the specific date and time when it was uploaded.

If we scroll down, we see that the tool created four thumbnail images from the video that can be used for a reverse image search, and there is even a link next to each photo that will do the reverse image search.


Try to Find Out Who Posted the Video

This step requires a bit of time but is relatively simple: Google the username.

For example, look at this video here. When we click on the user’s ID in the bottom left corner we are brought to a channel homepage.

That brings us to this page below, which has a strange url that does not identify the user well.

Strangely, if you click on “Home” you will get the same webpage, but the URL changes to show the user’s username. See below: the channel’s homepage is the same, but the URL has changed to show that the username is “oregonzoo”.

A quick Google search of the username reveals the following Twitter account, which of course provides further information on the user.

When we scroll down to the first two commenters we can apply the same method to try to find their social media.

If we find two Twitter accounts associated with those commenters, we can try to find common friends by using Tweetbeaver.com; see below.

Using this tool we can look for common friends that might be the youtube poster.

Youtube’s Computer-Generated Transcripts

YouTube has a new feature that makes researching videos easier by generating transcripts for each video. To access the feature, click on the three dots below the video on the right side and then click “Open transcript”.

The computer-generated transcript of the video appears next to the video. The transcript is word-searchable, so you can save time by searching for a specific name or company mentioned in the video. In the screenshot below, I searched for the name Tina Larsen and it popped up in the transcript. Note that the transcript also shows the time when different things were said. If you find something interesting in the transcript, you can click on the words and YouTube will automatically bring the video to that time.

More Tools

InVid has a tool for in-depth analysis of video content. For an explanation of how to use the tool, click here; for the tool itself, click here.

There is a very good guide for analyzing content in a video, to do things like find the location or verify the video, in the Verification Handbook.

What to Look for When Researching U.S. Corporations

This article provides an overview of conducting corporate research, identifying what information is available and how it is useful to a researcher. This post also weaves together previous articles and guides into a coherent whole. We will address where to find information on a company and what kinds of questions you can consider answering with your research.

Basic Research

The initial steps for researching any company are to do a quick Google search on it, skim any press articles mentioning the company, and look at the company website. With that completed, you can start getting into the more advanced and more important research. This guide assumes the researcher has already done these basic steps.

Advanced Research

Deep Web – See our article, “Deep Net, Open Net, and the Dark Net,” which explains the deep web, to understand why it will play an important role in corporate research.

The more advanced research relies on deep web databases, so at this point it is worth taking a moment to understand the “deep web.” Most people assume that if they google a company’s name, they will find all of the information on the company that is available on the Internet. In truth, the best information is usually located in deep web databases, which means the information in those databases will never appear in one’s Google results.

Who Owns/Runs the Company – Every company must register with the government and the registration will usually identify the company owner in addition to other kinds of information that vary depending on the location. To find out how to obtain a registration, see “Corporate Research on the Deep Web.”

This article will also address the difference between a standard company and a corporation from the perspective of corporate research. The main issue here is that a corporation has additional information available because it must file with the Securities and Exchange Commission (SEC), and those filings are available to the public on the SEC’s EDGAR database. There you can find the annual filing of a company if it is incorporated. The annual filing, among other things, will show you the subsidiaries of a company.

Court Records, Property Records, Local News – Using our three guides for finding this information, you can research court records, property records, and local news for any company or its individual subsidiaries.

Company Website – See our guide for website research, “How to Investigate a Company’s Website.” This article ignores the content of the website and instead looks at what is “behind the scenes”.

Contact Information Without a Name – Some records may list only a phone number, or other contact information, where they are supposed to list a person’s name. If this is the case, you can use our guide on How to Research a Phone Number. (We urge you to use these methods only in support of corporate research; this is not an invitation to stalk someone who wants privacy.)

Leadership – You can identify important players within a company by reading our guide “How to Find Influential Actors in a Corporation”. You can also learn to conduct focused and in depth research on a corporation’s Board Members by seeing our article on “Researching Board Members.” This guide will identify how to get background information, identify connections, how much money they are paid, and find possible conflicts-of-interest.

Corporate Profiles – Web scraping can be very useful for researching a company, especially if the company website has many employee profiles. You can use our guide to web scrape profiles (or anything else on the Internet), “Enhance Corporate Investigations by Web Scraping with Python.”

Hidden, Unofficial Connections – There are many ways that people and companies can have hidden or unofficial connections with other entities, and these connections often influence the actions of both parties. See our guide on finding these connections “Corporate Research on Hidden Connections.” For an example, see the chart that maps out connections between people, addresses, stocks, and companies that were revealed when secret financial documents were leaked to ICIJ.org.

Screenshot of entities linked to Blackwater Investments (from offshoreleaks.icij.org)

Past Violations – Our article on “Material disclosures and violations” will identify problems that the company has run into, whether legal issues or increased debt. This article will also show how to find out when the company hires a new member to its senior ranks and how much they get paid.

Shipping Companies – Shipping companies have their own special factors, identified in our guide “Shipping companies,” that addresses identifying and locating ships, cargo, and discovering shipping violations.

Nonprofits – Companies are often linked to nonprofits via donations or because their leadership is also involved in the nonprofits themselves. See our guide for basic research on nonprofits, “Researching a Nonprofit,” which will explain methods like finding their tax records.

Nonprofit Corruption, Corporate Influence – Also see our guide on how to identify corruption and misuse of funds, “How to Discover Corruption in Nonprofits.” Note that it is common for corporations in the U.S. to donate funds to nonprofits affiliated with a politician. In fact, a nonprofit is four times more likely to receive a donation from a corporation if a politician sits on its board or runs it, according to a study by the National Bureau of Economic Research.

For an example of corporate links to nonprofits, see the link analysis chart below; it maps out how oil companies influence think tanks (which are also nonprofits). This map was created by Littlesis.org, a free tool that identifies hidden connections.

from LittleSis.org, click here to see the source

Contracts with the U.S. Government – If the company has ever contracted or tried to contract with the U.S. government, there are several special databases that will provide unique information about the company. You can learn about this in our article “Ties to the US government”.

Questions to Help Reach a Conclusion

The next paragraph provides a number of questions that may help with a corporate investigation, for several reasons. You can look at these questions before your research to help guide your investigation and decide where to look for information. It is also useful to look at them at the end of an investigation so that you can take all of your information and create some form of narrative or conclusion based on it. In addition, the guides above show a lot of ways to obtain information, but you may not have time to run through all of them, and much of the information available on a company may not be relevant depending on why you are investigating it in the first place.

For those reasons, it can be helpful for a researcher to address some or all of the following questions: Who owns the company and what connections do they have to other entities? Who runs the company? What does the company own? Who does it owe money to? Does the company or its owners/staff have secret companies in tax havens? Where does it do business? What government or public service contracts does it have? What links does it have with politicians and civil servants? What regulations has it violated? What legal cases have been brought against it? Who is taking action against it? And finally, who can influence the company?

Now you will have a well-researched and analyzed product.

Use Python to Recognize Text, Compare Data From Scanned Documents

Background From Previous Post

This post follows on the instruction from a previous article, Research Scanned Documents with Optical Character Recognition, that explained how to use Python to recognize text in png files with the use of Optical Character Recognition.

Optical Character Recognition (OCR) is basically the ability for a computer to recognize words in photos. OCR is particularly useful when dealing with scanned documents.

The previous post addressed how to install Python, the text editor Sublime Text, and the Python package installer Pip. Then, it explained how to install the Pillow and Pytesseract libraries and the Tesseract OCR program. Finally, the post explained how simple Python scripts can read text and print it out inside Sublime Text, create a text file and print the text into it, or convert a png file into a pdf that has a layer of OCR over the text.

For example, the following script takes a png file named screen.png, reads the text, and prints it out into a text file.

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into output.txt
f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

Moving Forward, Converting Data into a Python List

Now if we take our output text file, we can turn it into a list by making this addition:

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into output.txt
f = open("output.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

# read the text file back in, one line at a time, and build a list
rows = []
with open('output.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

With the new list named “rows”, if we “print(rows)”, the printed names will match the contents of the text file.

If we wanted to print the list “rows” into a text file, we cannot simply write “f.write(rows)” because that function requires a string, not a list. So if we want to create a text file the same as the one we already created, we have to write the script like this:


from PIL import Image
import pytesseract

# OCR the image and write the recognized text into demofile3.txt
f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

# read the text file back in and build a list of its lines
rows = []
with open('demofile3.txt', 'r') as txtfile:
	for row in txtfile:
		rows.append(row)

# write each line of the list into a second text file
f = open("demofile4.txt", "w")
for row in rows:
	f.write(row)
f.close()

This addition will create a second text file, “demofile4.txt”, that is exactly the same as “demofile3.txt”. This simple task does not achieve anything new, but it shows how to work with the data. And if we have a list, we can compare its contents to another list. So with the names in a list, we can, for example, check whether any of the donation recipients also appear in politicians’ financial disclosure documents (which would mean a politician also works for a donation recipient).
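As an aside, instead of looping over the list you could join it back into one string and write that in a single call. A minimal sketch (each row read from the file already ends with its own “\n”):

# equivalent to the loop above: join the rows into one string and write it once
with open("demofile4.txt", "w") as f:
	f.write("".join(rows))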

Now, observe that if we run the full script above and tell it to print the list inside Sublime Text, this is the result (script is on top, results are on bottom):

Each “\n” represents a line break, which means the recognized text preserves the line breaks from the original image. This will be relevant in a moment.

There are other clean-up measures that can be taken, such as removing all symbol characters or making all text lowercase, so that two lists can be compared more reliably.
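As a quick illustration, here is a minimal clean-up sketch (it assumes “rows” is the list built above):

# optional clean-up before comparing lists: lowercase everything, trim
# whitespace, and drop any character that is not a letter, digit, or space
cleaned = [s.lower().strip() for s in rows]
cleaned = [''.join(ch for ch in s if ch.isalnum() or ch == ' ') for s in cleaned]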

Converting the Data Directly to a List

But backing up a bit, it seems there should be a more efficient way to make a list without creating a file first. However, if we try to put the data directly into the list (instead of writing it to a text file first), like in the script below, it does not work as intended:

from PIL import Image
import pytesseract

f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

# this loop does not do what we want: iterating over a string yields
# one character at a time, not one line at a time
rows = []
data = pytesseract.image_to_string(Image.open('screen.png'))
for line in data:
	rows.append(line)

If we printed the list “rows” by typing “print(rows)”, we would get a list of each individual character from all of the names.

Here is a better script, below. The original data preserves line breaks, so we can assign the OCR output to a variable (named “data”) and separate the names using the “split()” function (as explained here on geeksforgeeks.com), choosing to split on the line breaks with “\n” (as explained here on netinformation.com). We put that together as “data.split(‘\n’)” and assign the result to the variable “t”. Finally, we loop over each element in “t” (calling each one “i”) and append each “i” to “rows”.

This creates a list of every name.

from PIL import Image
import pytesseract

# OCR the image, then split the text on line breaks to get one name per element
rows = []
data = pytesseract.image_to_string(Image.open('screen.png'))
t = data.split('\n')
for i in t:
	rows.append(i)

If we add in a “print(rows)”, the results look like this:

We see that the empty lines appear as individual elements in the list, like this:

‘ ‘,

You may also notice that the character [ appears a lot in the list.

It is worth mentioning that if there are certain characters you want removed from the list, for example a “[“, you can simply add the code below:

# remove every '[' character from each element of the list
rows = [s.replace('[', '') for s in rows]

The resulting list is clean.

This is particularly useful when you want to compare one list to another and you want cleaner data.

Now you have a proper list of the names that we can manipulate and compare.
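As an aside, Python strings also have a built-in “splitlines()” method, so the whole list can be built in one line. A near-equivalent of the script above (it also avoids the trailing empty element that split(‘\n’) can leave):

from PIL import Image
import pytesseract

# splitlines() splits the OCR output on line breaks, much like split('\n')
rows = pytesseract.image_to_string(Image.open('screen.png')).splitlines()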

How to Compare Data from Different Scanned Documents

From here, we will use a common example: looking at which nonprofits a corporation donates to and checking if any of them are involved with local politicians. In this case we are looking at the pharmaceutical company AbbVie and a congressman who is involved in areas of potential interest to the company.

To explain why we would research this, it should be noted that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence”, about how corporate foundations are more likely to give charitable donations to a nonprofit if it is affiliated with a politician.

The study found that “a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm.” In addition, the researchers concluded that “a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study identified specific politicians of interest by checking if they were on any committees that were lobbied by a given company. If so, the researchers would check for a link via a nonprofit and donations. So in our example we will look into whether a specific company has given to a nonprofit affiliated with a politician of interest. The existence of such a link does not prove nefarious intent, but it is interesting and in some cases can be an indicator of a deeper relationship.

This example is relevant because it requires researching files from two different databases (the IRS nonprofit filings database and the congressional financial disclosures database) whose records are often badly scanned documents. The lack of OCR on these files makes it difficult to find links.

We can use the png of donation recipients identified in the public tax filings of the AbbVie Foundation (affiliated with the pharmaceutical company of the same name, which has a presence in Rhode Island). We will name this file abbvie.png.

donation recipients

For the second file we will use a list of nonprofit organizations that have Rhode Island-based congressman Jim Langevin on their board, to see if any of his nonprofits receives money from the aforementioned foundation. The information for both files was scraped from records made available via ProPublica’s Nonprofit Explorer.

In this case we will use the following Python script to read both documents and compare the data.


from PIL import Image
import pytesseract

# OCR the first document and split it into a list of names
rows = []
data = pytesseract.image_to_string(Image.open('abbvie.png'))
t = data.split('\n')
for i in t:
	rows.append(i)

# OCR the second document the same way
rows2 = []
data2 = pytesseract.image_to_string(Image.open('langevin.png'))
t2 = data2.split('\n')
for i2 in t2:
	rows2.append(i2)

# collect every name that appears in both lists
names = []
for name in rows:
	if name in rows2:
		names.append(name)
print(names)

Below you can see that running the script identified that the name “Adoption Rhode Island” appears in both files (along with a lot of empty lines shared by both docs).
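If you want to suppress those empty matches, a small tweak to the comparison loop (a sketch using the same variable names as above) skips blank or whitespace-only lines:

# only report matches that contain actual text
names = []
for name in rows:
	if name.strip() and name in rows2:
		names.append(name)
print(names)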

Compare Many Documents

Now that we have explained how to compare two documents, let’s look at how to compare many documents.

Let’s say you have 20 pdf files of a hundred pages each, you have another set of 20 pdf files, and you want to find common names (or anything else) that exist in both groups. First go to pdftopng.com.

pdftopng.com

Upload one set of 20 pdf documents and it will return a zip folder to download. Inside the folder is a set of png files: each page of each pdf file is converted into a separate png file. Next, open Command Prompt / Terminal and navigate to the folder (you may want to copy and paste the files from the zip folder into a regular folder first). Once there, type “dir /b > filenames.txt” in the Command Prompt and it will create a text file in the folder listing all of the file names. Copy and paste those files into the same folder as the Python script, and then replace the name of the png file in the Python script with the name of the text file (which should also now be in the same folder as the Python script).

The script will read the file names of the png files and then go to each png file to OCR it. You could also put the files in a separate folder and put the path to each file in the text file. This can be accomplished with a “Find and Replace” tool. For example, if every file starts with the same word (which will be the case, because that is how the website names the files for you), like “policy”, and the path to each file is something like “documents/research/policy1.png” (with the number changing for every file), you can tell the Find and Replace tool to find all instances of the word “policy” and replace them with “documents/research/policy”. This leaves the rest of each file name unchanged.

Now repeat these steps for the second group of files, compare the two groups the same way we compared two documents above, and you are done. A minimal sketch of the whole process appears below.
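This sketch assumes the two lists of file names are saved as filenames1.txt and filenames2.txt (both hypothetical names), with one png name or path per line:

from PIL import Image
import pytesseract

# OCR every png listed in a text file and return all of the recognized lines
def ocr_group(list_file):
	rows = []
	with open(list_file, 'r') as f:
		for line in f:
			name = line.strip()
			if not name:
				continue
			text = pytesseract.image_to_string(Image.open(name))
			for item in text.split('\n'):
				if item.strip():
					rows.append(item)
	return rows

group1 = ocr_group('filenames1.txt')
group2 = ocr_group('filenames2.txt')

# print every line of text that appears in both groups
print([name for name in group1 if name in group2])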

That’s it.

Research Scanned Documents with Optical Character Recognition

Optical Character Recognition (OCR) means that your computer can read words in photos and scanned documents. When you have a document or documents where you want to copy and paste the words into a search engine or you want to do a word search for a specific name in the document, OCR will make that possible.

For example, two previous posts described how you can research nonprofits’ tax records or politicians’ financial disclosures in government databases, but the records are all badly scanned pdf documents where the words are not machine-readable. Therefore, if you want to find out whether Exxon Mobil’s foundation donated to a Senator’s personal nonprofit, you potentially have to search through a lot of pages of tax records to look for the name.

This article will explain how to use Python to add OCR to files.

Get Started on Installations

You will need Python, Sublime Text, and Pip for the basics.

To start, if you are completely new, you can download Python from https://www.python.org/downloads/. Then you can download Sublime Text, a text editor for writing and running Python scripts, at https://www.sublimetext.com/3.

Now, access your Command Prompt (if you are using Windows) or Terminal (if you are using a Mac).

Pip is included in Python but you can see guidance for installation and updating at https://pip.pypa.io/en/stable/installing/.

If you get a message saying you need to upgrade pip, you can do so in the Command Line by typing: python3 -m pip install --upgrade pip

The next group of installations is Pillow, Tesseract, and Pytesseract.

Next install Pillow by going here for instructions.

(Install Pillow on Windows by typing: python3 -m pip install --upgrade Pillow)

from – https://pillow.readthedocs.io/en/stable/installation.html

Tesseract

There is the command-line program Tesseract and its third-party Python “wrapper” named Pytesseract (a wrapper is a library that lets Python code call the underlying program).

Click here for a basic overview of Tesseract and its installation and usage; see the screenshot below. Following the screenshot there is a link to the more detailed documentation on installation.

Go here for installation documentation.

Note in the window below, in the second paragraph, that for Windows, if you want to OCR different languages, you need to click on the link that says “download the appropriate training data”, which brings you to a webpage that offers a different download for each language.

Tesseract installers for Windows are available here.

The Windows installer should identify the location where Tesseract is installed. This is necessary to know so that “you can put it in your PATH” (a phrase that is often used but rarely explained with Python).

You have now installed Tesseract, and if you are using Windows it will default to the location above.

How to Add Something to the PATH

Next, Tesseract must be “added to the PATH”: this means that the directory containing Tesseract must be added to the PATH environment variable.

The following instructions are for Windows.

1 – go to System Properties

2 – click on Environment Variables

3 – a box will appear titled Environment Variables; within it, find where it says System Variables, and underneath that there is a list of variables; choose the one titled PATH

4 – a new window appears; click on one of the empty lines, then click on Browse and find Tesseract, click on the Tesseract folder (or “directory”), then click OK until you have closed every box

Now when you open the command prompt you should be able to run the tesseract command from any folder.

Pytesseract

Pytesseract is the Python wrapper library; it uses your existing Tesseract installation to read image files and output strings and objects that can be used in Python scripts.

To install it, you can run “pip install pytesseract”, or go here, download the file, and then run “python setup.py install”.
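If adding Tesseract to the PATH gives you trouble, Pytesseract also lets you point directly at the Tesseract executable from inside a script. A minimal sketch (the path below assumes the default Windows install location, so adjust it to wherever Tesseract actually is on your machine):

import pytesseract

# tell pytesseract exactly where the tesseract program lives,
# instead of relying on the PATH environment variable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'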

Command Line / Terminal command for OCR

The Python capabilities described here require that you have a png file, not a pdf. Most pdf-to-image conversion tools are not Windows-friendly, but the methods here will work just as well on a Mac.

How do you get a png file? Most snipping and screenshot tools will automatically create a png file of the image you are capturing. If you are working with a pdf file and only need one page, you can snip an image of that page.
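If you have many pages to convert, one programmatic option is the third-party pdf2image library. This is only a sketch: pdf2image depends on the Poppler utilities, which is exactly the part that can be unfriendly on Windows, so treat it as an option to experiment with rather than the method this guide relies on.

from pdf2image import convert_from_path  # pip install pdf2image; also requires Poppler

# convert each page of a pdf into its own png file
pages = convert_from_path('yourfile.pdf', dpi=200)
for number, page in enumerate(pages, start=1):
    page.save(f'yourfile_page{number}.png', 'PNG')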

Here is a very basic OCR Python script. The script must be saved to the same folder (or “directory”) as the png file that you want to read. If not, then you put the path instead of just the name. So if the Python script is saved to a folder named “first” but the png file is in a folder named “second” that is located inside “first”, then instead of (‘yourfilename.png’) you would type (‘second/yourfilename.png’). This is otherwise known as the path to your file.

from PIL import Image
import pytesseract

# OCR the image and print the recognized text
print(pytesseract.image_to_string(Image.open('yourfilename.png')))

There are many ways to run Python scripts, but for this post we suggest using Sublime Text. To run a script in Sublime Text you must save it (File, Save or Ctrl-S), then take the unintuitive step of choosing to “build” it (Tools, Build or Ctrl-B).

For this example I used a png file of a list of donation recipients from the Target Foundation’s tax records that looks like this

The aforementioned Python script will produce a list of words and phrases printed out in Sublime Text, and it should look like this:

If you want to print that same list into a txt file, you can go to the Command Line / Terminal, navigate to the folder with the png file, and run Tesseract directly: type “tesseract”, space, the png file name, space, then the base name you want to give the text file, like so: tesseract yourfile.png textfile (Tesseract adds the .txt extension itself, producing textfile.txt).

The following script will take your original png file and convert it into a pdf file that has OCR over the words:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# convert the png into a pdf with a searchable text layer over the words
pdf = pytesseract.image_to_pdf_or_hocr('pngtarget.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)  # pdf type is bytes by default

The resulting pdf looks like this (with words highlighted):

This method was equally successful when used on the full page of the original document that looked like this:

How to Create and Print a List to a TXT File

According to the instructions from w3schools.com, opening a file with the “w” parameter, as in open("demofile3.txt", "w"), creates the file if it does not exist and lets us write to it.

So based on this information, if we want to print our items into a text file, we use this script:

from PIL import Image
import pytesseract

# OCR the image and write the recognized text into demofile3.txt
f = open("demofile3.txt", "w")
f.write(pytesseract.image_to_string(Image.open('screen.png')))
f.close()

The resulting text file looks like this:

That’s it.

PACs, Foundations, Press Releases:

How to Research Corporate Efforts at Influence

The Dirt Digger’s Guide identifies political contributions, nonprofit donations, and press releases as three key factors in a corporation’s ability to influence others. Corporations spend a lot of money on donations to political and charitable causes but instead of making direct payments they create PACs and foundations. PACs are used because of laws against direct corporate political donations and foundations are used to avoid taxes.

To find a PAC, the easiest method is to go to OpenSecrets.com and search for the name of the company and PAC. For an example, the Target Corporation has a Target Corporation PAC with a profile page on Open Secrets that displays its information.

The source of the information is the Federal Election Commission website, and you can see in the registration details section of the Open Secrets profile page below that there is a link to the source.

The FEC.gov website has its own profile page for the Target Corp. PAC. The format is not as good as the one on Open Secrets but if you want to cite your data it is good to know the source.

A corporation’s foundation is usually easy to find with just a search on the corporation name and the word foundation. As another example, Propublica’s nonprofit explorer has a page on Target Foundation tax records.

Note that in Section B from the most recent tax filing available there is only one source of its funding, the Target Corporation itself. This is standard practice for a corporation’s foundation.

Depending on the tax record, the corporation may list the recipients in an attachment at the end, or, in this case from 2015, in Statement A. Notice that Minnesota and Minneapolis appear a lot in the list. This is likely because Target’s corporate headquarters is in Minneapolis. (An upcoming post will explain how to make these documents word-searchable with optical character recognition.)

Note that the National Bureau of Economic Research published an extensive study, “Tax Exempt Lobbying: Corporate Philanthropy as a Tool For Political Influence” (click here to read), revealing that S&P 500 corporations commonly donate to nonprofits linked to politicians they want to influence. (A previous post addressed how to look up whether a politician, or anyone else, is misusing a nonprofit for personal gain.) See below for the summation of the study’s findings.

“In our first analysis using these data, we show that a non-profit is more than four times more likely to receive grants from a corporate foundation if a politician sits on its board, controlling for the non-profit’s state as well as fine-grained measures of its size and sector.”

The study also found that “…a foundation is more likely to give to a politician-connected non-profit if the politician sits on committees lobbied by the firm,” and noted a high probability that the foundation would stop donating to the nonprofit when the politician lost their bid for re-election.

Therefore, the nonprofits that receive donations can be compared to politicians’ financial disclosures, which usually list whether they are involved in a nonprofit. A researcher can look at which committees a company lobbies and then look up the disclosures of the politicians that sit on those committees; a previous post addressed how to look up lobbying and disclosure records.

Press Releases

On a side note, it is important to remember that corporations use press releases as possibly the most direct way to release information and improve their public image. Specifically, corporations use press releases to influence the public, release information that is legally required to be made public, and describe their own activities in the manner they want them to be viewed (which is arguably true of any organization).

There are three good places to find press releases. First, almost every corporation’s website will have a page entirely dedicated to press releases. Second, press releases are often included in SEC filings, so you can do a keyword search in the SEC’s EDGAR database for the phrase “press release” along with the corporation’s stock ticker to ensure you only get results for that one corporation. Finally, there is a website, https://www.prnewswire.com/, that exists to post and cover corporate press releases. One of the benefits of using this website is that you can also read other press releases on the same topic from different sources/corporations.

The EDGAR keyword search function (often overlooked)

How to Research U.S. Gov. Contracts : Part 3 – Contract and Tender Lookup

This article will show how to look up a government contract in order to see what useful information about the contracting company is available.

Brief recap of the path leading to this point:

The Federal Contractor Misconduct Database identified in part 1 revealed that a company was cited for violations related to the death of a contract trainer involved in training sea lions for the U.S. Navy.

The company’s registration information and unique identifier, known as its DUNS number, were identified in part 2 by looking up the company in SAM.gov and DUNS.com.

The company is named Science Applications International Corporation (SAIC), a branch of the larger corporation with the same name, and its address is 12010 Sunset Hills, Reston, VA. The company’s DUNS number is 078883327. The original violation record listed that the company was working on the U.S. Navy’s Marine Mammal System Support program when the violation occurred and when the trainer drowned.

Contract Lookup

The violation record did not identify the specific contract, so one may look through the available records for it. The record listed that the violation was committed by the identified company while it was contracted on the Mark 6 / Marine Mammal Support System program in 2014. In theory, this means the researcher needs to find the contract that fits those three criteria: the company, the program, and the year.

The company’s DUNS number, identified in the previous posts, can be used on USAspending.gov, which makes federal government contract information public, to look up contracts with the company. The website can also search by company name or other inputs.

To do so, at the site click on Award Search, then Keyword Search. If we search here for the company name, there are too many results. Next, try clicking on Award Search, then Advanced Search. On the left side of the screen find “Recipient Name”, put in the company name, hit enter (which does not actually start the search), and then scroll down and submit the search.

This also produces a lot of results, so the next option is to search on the company’s DUNS number. After this, we can add in a keyword search for the name of the program to filter our results down to instances of the company contracting for that specific program.
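As a side note, USAspending.gov also offers a public API, so the same search can be scripted. This is only a sketch: the endpoint and filter names below follow USAspending’s documented v2 award-search, but treat the exact payload as an assumption to verify against the current API documentation.

import requests

# assumption: USAspending's public v2 award-search endpoint and filter names
payload = {
    "filters": {
        "keywords": ["Marine Mammal"],             # program keyword
        "recipient_search_text": ["078883327"],    # the DUNS number from part 2
        "award_type_codes": ["A", "B", "C", "D"],  # contract award types
    },
    "fields": ["Award ID", "Recipient Name", "Start Date", "End Date"],
}
resp = requests.post("https://api.usaspending.gov/api/v2/search/spending_by_award/", json=payload)
print(resp.json())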

We see in the results (click here) that when we search for the company (or its DUNS number) together with the name of the program, we only get two contracts: one started in 2015 and the other in 2020.


However, we know from the record of the violation that it occurred in 2014.

This is a problem that is NOT common. Basically, the company had a different name when it took the contract. Back then, it was part of the company known as Leidos, which split into two companies: SAIC and Leidos. The original Leidos, the new Leidos, and SAIC all have different DUNS numbers. This does undermine the value of using DUNS numbers to search for a company.

But to investigate the problem, assuming one did not know about the company’s history, a quick Google search for SAIC finds on its Wikipedia page that the company started on September 27, 2013.

To confirm the information from wikipedia, we see in the company’s SAM.gov registration from the Part 2 post that the company registered on the following day.

One can see from the two contracts identified in the USAspending search that they are scheduled from 2015 to 2020 and from 2020 to 2025, which suggests that the contracts are scheduled in 5-year increments. One can infer that the previous contract ran from 2010 to 2015. At this point one knows that even though SAIC was working on the contract in 2014, the contract may have begun in 2010, before SAIC registered with SAM.gov in 2013.

Since SAIC spun off from Leidos, we can search for that company name in conjunction with the name of the program. Doing so, we see that Leidos started a contract in 2009 that continued through the date of the violation in 2014 and ended in 2015. Therefore, this result is the contract that SAIC was fulfilling at the time of the violation.


We click on the result and are brought to a page with the contract information. This information includes the contract identification number N6600110C0070.

There is a description of the contract that is kind of vague and not very useful:

There is a history of the transactions for when the company was paid over time and what services were rendered.

Unique identifier information for the award and the government entity that is the source of the contract.

Information about the recipient:

The competition details are interesting. For example, they show that the contract was intended for open competition but only one company bid on it, meaning the company did not face any competition for the contract. If we wanted to assess the company based on its ability to win contracts, acquiring them without facing competition appears significantly easier (and therefore says much less about the company’s competence), unless there is some other relevant information that is not apparent here.

Sometimes executives are listed but that is not the case here:

Tender Lookup

We take the solicitation ID, which is also the contract ID, from the contract and go to beta.sam.gov, where we can look up the original tender. When you go to the website, you can use the basic search function, but if you want the advanced search you have to take the not-obvious step of scrolling down and clicking on Search Contract Opportunities.

When you are looking for old tenders, make sure to uncheck the box that says Active Only.

Finally, we search for the ID number and find the original tender.

This is useful because the tender has a description of the contract and that gives us details of what the contract actually entailed.

Additional Contracts

We can also look up to see if the company acquired additional contracts for the same program. Returning to USAspending.gov and searching for the company name and the program name we see that after the first contract ended, the company obtained two more on the same program.


Interestingly, we see at beta.sam.gov that before the last tender was issued, the Navy issued a notice of intent for sole source procurement to SAIC for the program.

If we open this notice, it states that the Navy will not put the tender up for open bidding, but will instead offer it directly to SAIC. It is not clear from the notice why the tender will not be open to bids from other companies, but it confirms that SAIC did not face any competition for its most recent contract on the program.

That’s it! An upcoming post will show how to do more in-depth research.

How to Check If a Website Is Safe

There are several ways to check a website to see if it is dangerous.

You can “scan” the website’s infrastructure for telltale signs of danger, find out if known malware refers back to the website, check the downloadable files on the site itself, or investigate the website’s links and SSL certificates for clues.

There are also blacklists of websites that are considered security threats, and these are the best place to start. The easiest first step is to use Virus Total (click here for its tool to check suspicious urls). Virus Total will check your url against 60+ blacklists.
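If you find yourself checking urls often, Virus Total also offers an API for the same check. A minimal sketch, assuming you have signed up for a free Virus Total API key (the key string below is a placeholder):

import base64
import requests

API_KEY = "your_api_key_here"  # placeholder: use your own key
url = "https://search-ish.com"

# the v3 API identifies a url by its unpadded, url-safe base64 encoding
url_id = base64.urlsafe_b64encode(url.encode()).decode().strip("=")
resp = requests.get(
    "https://www.virustotal.com/api/v3/urls/" + url_id,
    headers={"x-apikey": API_KEY},
)

# summary counts of how many scanners flagged the url
print(resp.json()["data"]["attributes"]["last_analysis_stats"])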

Back on the website, see the results below from searching on “search-ish.com”: the beginning of a list of different checks that all came up clean for my website.

These are the results from searching my website

I also like that they have a nice “summary” section for those of us who do not have PhDs in Computer Science. See below:

You can also use Threat Intelligence Platform for the same purpose. But keep in mind that it will not give you a simple “yes” or “no” answer. Instead, it will provide you with various kinds of evidence that the site is or is not safe.

To be extra careful, you can also check if the site’s IP address is blacklisted by looking up the IP on Ultra Tools and searching it at IPvoid.

Malware Affiliated With the Site

Another method is to look for malware that has been hosted on the website in the past, which is pretty damning of the site. You can also search for known malware that refers to the site, which is a sign that the website is affiliated with the malware’s processes. You can search for either of these on the aforementioned Virus Total and Threat Intelligence Platform. However, it is not clear where they obtain this information.

Suspicious File on the Website

If you want to check whether a particular file is dangerous, you can upload it to Virus Total for analysis and it will be checked against 60+ virus databases. It is definitely recommended that you do not download or in any way work with potentially dangerous files, but if you plan to do so anyway, use a virtual machine.

Using a Virtual Machine

For a newcomer, dealing with a virtual machine is a bit of a hassle, so feel free to jump to the next section. Still reading? Okay, so if you want to actually go to the suspicious website, or maybe download suspicious files to upload them elsewhere for inspection, you can use a virtual machine to mitigate the danger to your computer.

The Intercept’s guide for novices on setting up a virtual machine

A virtual machine is basically an isolated computer within your computer. The idea is that if a malicious file infects your virtual machine, it should not be able to infect your regular computer, though there are a few documented cases where escaping is possible. For a primer on how to set one up and use it, click here to see the guidance from The Intercept. I hate to pawn you off to another website to learn virtual machines, but I can’t do a better job than they did.

Additional Names on the SSL Certificate

One can also check the website’s SSL certificate for signs that it could be dangerous. The SSL certificate is a type of digital certificate that authenticates the website, so that when you go to cnn.com you are actually going to CNN’s website. Generally, you should expect the certificate to cover one domain, possibly with additional subdomains. Scammers often use many domains on the same certificate.

You can look up any website’s SSL certificate at censys.io, or, if you scanned the website on Virus Total, you can look in your results under the heading “Subject Alternative Names.” However, censys.io will give you more detailed findings.
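You can also pull the certificate yourself with a few lines of Python’s standard library. A minimal sketch (the hostname is just an example):

import socket
import ssl

# connect to the site over TLS and list the certificate's Subject Alternative Names;
# a long list of unrelated domains on one certificate can be a red flag
hostname = "search-ish.com"
context = ssl.create_default_context()
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print([value for key, value in cert.get("subjectAltName", ()) if key == "DNS"])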

Does It Use a Phishing Kit?

While there are many companies that offer customers different kinds of pre-built websites (WordPress, for example), there are similar providers that offer pre-built phishing websites to scammers. These are called “phishing kits” and they have several telltale characteristics. To check if a site was built with a phishing kit, you can send its url to urlscan.io. UrlScan will look for evidence of a phishing kit and show whether other sites use the same kit.

The results are a bit long, but you can see them below for a search on this website.

Right up top you can click “similar”.

Click there and you can see specifically whether there are indications of a phishing kit (like in the website below, which is not mine):

Additional Security Websites

In addition to the websites mentioned above, you can find a list of websites that check the safety of suspicious sites by clicking here.