# Data Analysis for Beginners

This post explains how to conduct Data Analysis on large sets of data using Google Sheets and is intended for people with no background in the subject. The post will explain how to import data into Sheets and how to summarize, sort, filter, and modify the data by performing custom math on it. Finally, we will walk through how to cross reference data from different sources.

CREDIT WHERE IT IS DUE: The post is based on following the instructions from Lam Thuy Vo’s book Mining Social Media. Chapter 6 from this book focuses on Data Analysis is available online here. This information was originally identified by the website Tools For Reporters, here.

In the words of the author, you learn “how to conduct simple data analysis with Google Sheets: you learned how to import and organize data in Sheets, how to ask specific questions of a data set, and how to answer those questions by modifying, sorting, filtering, and aggregating the data.

This post will address how to import, modify, and analyze data in Google Sheets.

Part 1 we will go over how to import data into Google Sheets, make it recognize numbers, and using the function for splitting data from one column into two columns.

Part 2 will go over how to create a Pivot Table (which provides statistics about our raw data) to count how many times a specific value occurs in a column. This will be used in the example to find out how many times a twitter account tweeted on each date. The goal of the example is to identify if the twitter account is a bot based on the frequency of its tweets.

Part 3 describes how to use formulas and functions to analyze data. While Part 2 aggregated how data, Part 3 will use math equations such as find the average number of tweets per day by a given Twitter account in order to see if it is within the bounds of a normal human user or a bot.

Part 4 addresses how to sort and filter data. You can choose one value in your data (such as the number of retweets) and order all of your data based on ascending or descending number of retweets. You can also sort the data in the pivot table (which is based on the raw data). Filtering allows you to hide data you do not need or find the most relevant data (maybe search for all tweets that mention a specified person).

At this point, the reader has a general understanding about what kind of data analysis is available, but there are too many functions and tools to learn them all. It is better to obtain data and then consider what you would like to do with it. Then work backwards to look at the Sheets functions and tools available to find a way to achieve your goal. Note that Parts 3 and 4 reference lists of the available functions and tools.

Part 5 addresses how to cross reference data from different sets based on a common data type or value, for example if both datasheets have a “dates” column.

# Part 1

## Importing Data to Google Sheets

First we are going to import the data, which was originally obtained directly from the Twitter API (a subject for a separate post) and we start with the data in CSV format.

There are a plethora of potential data sources that allow you to download data in this kind of format. For example, the FEC provides CSV data for public campaign financing (click here) and OpenSecrets provides CSV data on corporate lobbying records (click here for Exxon example)

Go to Google Sheets and start a new sheet, it should say “blank” under “start a new spreadsheet” on the left side.

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file, (preferably a csv file) the following window should appear:

Then choose the following in order:

1- under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

At this point we are not worried about whether Sheets can differentiate between a number, word, or date.

Now make a copy of the original data set. (you can also use “file” then “version history” to look at past versions)

8 – Click on “sheet1” (you will make this one the original to put aside in case you need to reference the original data later) and rename it “original data”, or whatever.

9 – Then, to create a copy to work with, click the small arrow next to “Sheet1” / “original data” or whatever you call it now, and in the menu that appears you will choose “duplicate”. In the book they name the second sheet “step 1: modify and format” so we will use the same here for the sake of clarity.

### Transform data in columns from string to numbers

Then we choose to highlight column C by clicking on the “C”. Then click on “format” then “number” then “number” again. This transforms every value in the column from a string to a number. Repeat this process for the Retweets column which also contains numbers.

### Split Text To Columns

We want to find out how many times the twitter account tweeted on each day. So we will count how many times the same date occurs in our raw data on the tweets.

To do this, we need to create a column that only includes the dates of each tweet and then count how many times each date (or “column value”) occurs. So if March 3rd occurs ten times, that means ten tweets came out on that date.

The problem is that our data only provides a column that provides the date and time together for each tweet. That means that several tweets on the same date, (example: “March 3rd, at 12pm” and “March 3rd, at 4pm”) will be counted as separate values instead of identifying that the tweets occurred on the same date.

The solution is to use the Split Text to Columns tool which will separate the dates into one column and the times into a second column. This will let us count the values in the dates column so we can see how many times each date occurred.

### How to Use the Split Text to Columns Tool

1 – Click on the new sheet (so you are not working with the original data) and find the “Split Text to Columns” tool under the Data menu

The tool looks for common characteristics in the columns (like a semicolon in-between two numbers) so that it can separate the data.

2 – Choose a column to split and then create an empty column to the right of it.

3 – To do so right-click the letter above the column and choose “Insert 1 right”.

Now you can use the Split Text to Columns tool without overriding any data.

4 – Click on the column you want to split so that it is highlighted, and then in the menu above click on “Data”.

5 – Click on Data and then choose Split Text to Columns.

It will split the text between the two columns based on the default setting “Detect Automatically”. But a small window appears, titled Separator, and it will give you different options for how you want to separate the data.

6 – We will choose “Space”.

Note that these columns of dates and times are still formatted as strings. That is because the separator tool works best with strings so it makes sense to work with the data as strings now, and then convert it later.

7 – For the sake of clarity, we now change the column headers to “date” and “time” to reflect the information in each column.

Now the information is ready to create a Pivot Table, which will summarize the data for us.

## Part 2

According Lam Thuy Vo, at this point the data is prepped and the next part is to aggregate the data by using pivot tables and formulas.

Pivot tables can, according to the author (p. 110 from the book), “summarize data…in various ways…we can count the number of times a term occurs in a column or calculate the sum of numerical values based on a date or category.” This is done with the creation of a summary table that gives statistics about our raw data.

In the example from the book the reader uses pivot tables to find how many times each date occurs in a column.

## Recap:

Recall that the data originally provided the date and time in the same column.

so first the data looked like this:

Therefore in part 1 the reader had to separate the dates into a separate column. The point is so that every time a tweet occurred on a specific date, the column value would only be the date. Therefore if we count the number of times the same date/column value occurred in that column, that would be the number of times a tweet occurred on that date.

## How to Create a Pivot Table (shorter explanation)

Note that if you want a short explanation on creating pivot tables, google provides one here.

1 – In short, to create a pivot table you highlight the data you want to use, find and click on “create pivot table.

2 – Then when the table is created, you input data into the table by using the pivot table editor. In the editor there are two options, “rows” and “columns”, you will use these to choose what data to include.

3 – Within the Rows/Columns options, you will choose the data by choosing column headers from the original spreadsheet and each unique value from that data will be listed once in the table. Rows/Columns also lets you order the data ascending or descending.

4 – Finally, you can use the pivot table editor’s “values” option to summarize the data (by adding, counting, finding the average, etc.) or the “filter” option to filter out data values based on the value itself (in which case you choose which specific values to include) or based on a conditional, which is explained below.

## Select Data and Create Pivot Table

Returning to the guidance from Mining Social Media: The first step in creating the pivot table is to choose the data that it will analyze and we want to choose all of the data on our sheet so we click on this little rectangle in the top left where the column and row headings meet. This highlights everything so that the pivot table will have access to all of the data. The pivot table will not necessarily use all of the data, but this gives the option to use it for whatever data we want. This click on the Data tab and then Pivot Table. A window will pop up, make sure New Sheet is chosen and then click Create. A new sheet will appear, named Pivot Table 1.

## Pivot Table Editor

On the right side of the pivot table there is something called the “Pivot table editor”. You use this to put the highlighted data from the sheet into your pivot table and choose how to analyze it.

Note that it suggests some possible things to do, and it even recognizes your data column headers in its suggestions. You can find out what these suggested options would look like with your data. To do this, if you hover your cursor over any of the suggestions, a magnifying glass appears on the right side. Click on the magnifying glass to see what kind of table these suggestions would produce.

For example, we click on the magnifying glass that appears next to “Number of unique id for each language”, and get the table seen below. The table shows that there are 5 times that a tweet is identified as “ar” (arabic) in the data under the language column’s information. This does not put the data into your pivot table, it just shows what it would look like.

## Returning to Our Pivot Table and Counting Dates

As previously noted, there are options on the right side there are options for how to “populate” the pivot table with data and analysis. We want to create a column with each unique value from the “date” column in our sheet. Eventually we want to show next to that column how many times each of those dates occurred, but for now we just want a column that shows each unique date.

In the Pivot Table Editor we see the Rows option and click on Add. This means that you are selecting to put your data in rows and when you click add you choose what source of information (which column’s information) you want to input into the rows of the pivot table.

A drop down menu appears and it has the names of each of the column headers from the “step 1: modify and format” table. You click on Date and now the table should look like the screenshot below.

It might seem confusing that the Rows option basically entails choosing a column from the raw data. It is best to think of it as choosing what data will be added into rows.

A new column will appear in the pivot table (named “date”) and in the rows below it each unique value from the Date column (from the “step 1: modify and format” sheet) will appear once. Recall that in the “step 1” table each of these unique values (which are the dates) appeared several times.

If we had chosen “Columns” and then “date”, we would have gotten each unique value (date) in a different column in one row. See below:

Now returning to our table that has each unique value in different rows in one column. Our goal is to count how many times each unique date occurred in the date column of our sheet (“step 1: modify and format”).

So now we want to do analysis on the unique values and display it in a new column.

The Values option in Pivot Table Editor will let us do analysis on our data. So we choose which data we want to work with (in this case it will be the “date” column). To do this we go to Values and click Add. A drop down appears and we choose “date” once again.

Just to be clear about how the Values option works, you choose one of the columns from your raw data and then choose a function to apply to it. So for example, you can add every number in the column together, or multiply them, count them, of find the average.

A table appears under values that shows that “date” is the column we are working with and under “summarize by” we can click below it and a drop down appears with different ways to summarize the data. We choose COUNTA. This option will count the number of times each unique value in the date column appears.

There was also a COUNT option available, but we do not use that because it will only work with data that is formatted as a number. Other options include SUM, AVERAGE, MEDIAN, PRODUCT, MAX, and MIN. Google has a help page here that lists and explains each of the available functions.

Also see that a new column appears (named “COUNTA of date”) and each shows how many times each date appears and automatically puts the count for each one next to the unique value in the rows of the first column.

Success! we now have a pivot table that shows how many times each date occurred in our data. The first row shows that the date 2016-12-10 occurred 3 times, meaning there were 3 tweets on that date.

The author explains in the book that this information is useful because you can identify if a Twitter account is probably a bot based on how often it tweets. The author sites that, according to the Digital Forensic Research Lab (digitalsherlocks.org), tweeting more than 72 times per day is “suspicious.”

With that information in mind we can look at our data on a twitter account and judge if it is tweeting on any give day at a “suspicious” rate. (the answer is “yes”)

## SIDENOTE ON VALUE DROPDOWN MENU

The values option in the dropdown gives a list of ways to summarize data but they are not always easy to understand. Here is an explanation of the ways to use Values to summarize data:

• SUM – Returns the sum of a series of numbers and/or cells.
• COUNTA – Returns the number of values in a dataset.
• COUNT – Returns the number of numeric values in a dataset.
• COUNTUNIQUE – Counts the number of unique values in a list of specified values and ranges.
• AVERAGE – The AVERAGE function returns the numerical average value in a dataset, ignoring text.
• MAX – Returns the maximum value in a numeric dataset.
• MIN – Returns the minimum value in a numeric dataset.
• MEDIAN – Returns the median value in a numeric dataset.
• PRODUCT – Returns the result of multiplying a series of numbers together.
• STDEV – The STDEV function calculates the standard deviation based on a sample.
• STDEVP – Calculates the standard deviation based on an entire population.
• VAR – Calculates the variance based on a sample.
• VARP – Calculates the variance based on an entire population.

# Part 3

## Formulas and Functions

Google Sheets has a basic instruction page and video here to explain how to use functions to create formulas.

The intro video explains that a Function is a built-in operation in Sheets, such as adding or subtracting. Functions appear as options in dropdown lists and they can also be typed directly into Sheets. For example, we used the Function COUNTA in the pivot table to count how many times each date appeared.

Within Sheets, we can also type functions directly into a cell by typing the “=” sign, the name of the function, and then specifying what data (or the location of the data) in parenthesis.

For example, look at the sheet below:

In this case we want to add the values in the two cells A2 and A3, and put the return value in the cell B4.

1- So we start by typing the “=” sign so that sheets knows we are typing a function

2- We type the name of the function, in this case the function name “SUM” is for doing addition

3- Finally specify which cells we want to add by identifying them in parenthesis and separating them with a comma. If you hit enter, the cell B4 will just have the number 4 in it and the formula disappears from view but still remains for the cell.

Side Note: when specifying cells in parenthesis you can use (A2:A7) to mean add all the cells from A2 to A7. You could also type (A:A) to refer to the entire A column. You can choose from cells in a different sheet within the same file by typing the name of the sheet in quotes followed by an exclamation mark. So if you had a different sheet named Sheet 2 and you wanted to get the sum of cells A2 to A7, you would type =SUM(‘Sheet2’ !A2:A7). Also, you could just type =SUM( and then use your cursor to choose the cells.

A Formula uses a Function (or several Functions) to obtain a specific result. So in the previous example we created a Formula in cell B4 that used the SUM function to add the values of A2 and A3.

## Smart Copying

The author goes on the point out the value of smart copying. Accordingly, if we want to find the length of each tweet, we can return to the sheet Step 1: Modify and Format and add a column to the right the text column (which is the H column). Right-click on the column header and choose “insert 1 right” (which creates the empty I column).

So if we type the formula =len(H2) in the I2 block and then hit enter, we will get the number of characters (of “length”) for the H2 block.

Sheets actually suggests using smart copying to fill in the column below with the same formula but accounting for the different placement of the cells.

or you could find the little box in the bottom right corner of the first cell with the forumla in it and pull down to other tabs. This will also smart copy to the other cells.

## Sorting Data

Vo says that the next step is to learn how to sort and filter the results, or data in general, to “rank or isolate data.”

In short, to sort or filter data, you want to remove the formatting by copying and pasting the data into a new sheet, but paste using Paste Special. The =n you highlight it and create a filter view, which allows you to sort and filter. This does not alter the data, it only rearranges it, or hides parts of it.

Vo uses the example goal of sorting the data in the pivot data to see how the suspected bot tweeted on its busiest dat. To do this, Vo explains how to do so by “creating a new sheet with our aggerate results and changing the entire sheet to a filter view.

### Paste Special

We will want to copy and paste the data from the pivot table into a new sheet to isolate our work. Note that those cells have data and formulas, so we want to only paste the cells’ data without the formula. To do this we will use Paste Special.

First we create a new sheet by clicking on the plus sign on the bottom left.

The next step is to highlight the cells with the data you want, which in this case is the entire table.

Open the new sheet (named “Sheet 3”) and right-click in the A1 box.

In the drop down menu, choose “Paste Special” and then “Paste Values Only”. This will paste only the data, and not any of the formatting (excluding the formatting includes excluding the formulas). The data is now formatted as strings of integers.

We need to convert the formatting of the data (the dates) from integers to dates (in the formatting sense of the word dates) so that Sheets will adjust how it deals with this data. To do this, you highlight the newly pasted cells and click the Format option in the title bar. From the dropdown menu choose Number and then Date.

### Filter View

We need to turn the data into a filter view.

First, we select all of the data in the sheet by selecting the box in the top left corner between the row and column headers.

Next look at the tools under the headers and find the filter icon, which looks like a funnel, see below: If it is not there, you may need to click the 3 dots icon seen below. When you see the filter icon, click on the upside down triangle to the right of the icon.

Make sure the cells are all still highlighted/selected and click on the triangle and choose Create New Filter View.

This creates a new heading row and filter icons at the tops of the columns, as below:

The new icons in the first row can be used for the filter functions. If we click on one, (in this case we are using the Date column) we are given a dropdown menu with sort and filter options.

## Rearrange in ascending or descending

At the top there are options to sort the data in ascending or descending order (“Sort A->Z” means ascending, regardless of what kind of data you are using). If we choose to sort in ascending order based on the Date column, the values in the COUNTA column will also be rearranged to align with the Date column values.

This is also where it is important that the dates column is formatted as dates. Sheets will recognize the values as dates and arrange them in chronological order. If we had not changed the formatting Sheets would view the values as integers and arranged them as lowest to highest.

Filter by value

We can also filter out certain values by using the options at the bottom of the menu. There is a list (that continues downward past what is visible) of every kind of value in the dates column and there is also a search function if you have a particular value in mind. By unclicking the checkmark next to one of the values, Sheets will hide the value, but only in this particular sheet.

Filter by condition

Finally, the filter dropdown menu gives the option to filter based on a condition. A condition is basically an “if…then…” statement. If we click on the filter option in the menu a minimized dropdown will appear, if we click on that we see a list of conditions we can use.

For example, if we were looking through the content of the tweets, we could use this to look for tweets that mention something specific.

## Combining Data from Different Sources

Sheets allows us to cross reference data sets and combine them into one data set. This is also known as Merging Data Sets.

What are we going to do specifically? We will take two pivot tables in two separate sheets. The first table (from before) will show how many times per day on specific dates a twitter account (the bot account @sunneversets) tweeted.

The second table shows the number of times a second twitter account (a human’s account, @nostarchpress) tweeted per day on the same dates.

Then we will create a third sheet where we put the data side by side. To do so, first we will choose a date range for the data we want to look at and we create one column with each date in the range listed in separate rows. We name the first column “dates” and then name the next two columns “Pivot Table 1 – sunneversets account” and “Pivot Table 2 – nostarchpress account”. Finally, we create a formula that we will put in the 2nd and 3rd columns that will automatically find and input the relevant data.

The end result will look like this:

## How to Cross Reference Data

We use the formula called =vlookup() to cross reference and merge the data sets based on a common value. Basically that means that both data sets have to have a column for the same kind of data (such as a date column). One data set shows how many tweets occurred on each date, while another data set might show how many political speeches occurred on those dates.

To provide an example of a second data set to reference, Lam Thuy Vo provides a second twitter account’s csv file (nostarch_tweets.csv) on her GitHub page.

This new file is made up of data on a human being’s twitter account.

Vo points out that the first step is to create another new sheet by clicking on the plus sign on the bottom left in Google Sheets. Name the new sheet “cross reference sheet”.

As a reminder, to import data from a csv file into a Google Sheets sheet, open a new sheet and then take the following steps:

Click on “file” then “import”, the following screen should appear:

Choose “upload” and “select a file from your device”. After choosing your file, (preferably a csv file) the following window should appear:

Then choose the following in order:

1 – under “import location” choose “replace current sheet”

2 – under “separator type” select “comma” – – (since we are using a csv in this case and we need Google Sheets to identify separate values)

3 – under “convert text to numbers, dates, and formulas” select “no” – – (we will change the formatting ourselves because the software makes mistakes)

4 – then we split the dates and times from the one “created_at” column into two columns, one for dates and one for times. To do so we highlight the column, click on Data, choose “split text to columns” and when a window pops up asking to choose a separator, choose space. Name the column with dates “date”.

5 – highlight the date column, click on format and in the dropdown choose number, and then date.

6 – we also need to turn this data into a pivot table like the last one, so click the box in the top left corner between the headers of the columns and rows so that everything is highlighted, click on the data tab, choose create pivot table, create new, name the new pivot table Pivot Table 2, click on the new pivot table and under Rows choose date, under Values also choose date, and then again under Values choose COUNTA.

## Prep Cross Reference Sheet with the Common Value

In the sheet named “cross reference sheet” we are going to focus on the data from the two csv files that occured within a specific range of values in one of the columns. In the example from the book Mining Social Media, the author focuses on the data in the date range April 13th to May 1st 2017. To do this, you type “date” in the A1 cell and then need to input the dates in the column below. To do this quickly, type the first two dates in the first two cells and highlight them as seen below.

Next, click on the blue square on the bottom right and pull down to highlight the 20 empty cells below it. Sheets will figure out the pattern and automatically fill the rest of the cells with dates.

## Using the =vlookup() Formula

According to Lam Thuy Vo, the =vlookup() formula “looks at a value in one table, looks up the value in another table, and then retrieves data based on this common value”.

So we will use the sheet “cross reference sheet” that has the first column of dates for our date range, and then we will put formulas in the second and third columns that will find the relevant data from the two pivot tables and input the data into the “cross reference sheet” columns.

But first (according to para 2 on page 118) we need to set up a new column for the @sunneversets100’s daily tweet counts right next to the date column.

we start by naming the second column “Pivot Table 1 – sunneversets account”. This column will have the number of tweets for that twitter account listed for each date identified in the first column. To do this, we will write a formula for the first cell (located at B2), that will look up the value in cell A2 (which has the date 4/13/2017), and look in Pivot Table 1 for the value associated with the date 4/13/2017 and input it into cell B2 of the cross reference sheet.

The =vlookup() formula takes four “arguments” in the “()” section.

First Argument – This argument is the value you that will be looked up in another table. In our case is the the first date in the series of dates, which is identified by its cell location at A2. So we put it as =vlookup(A2, …) .

Second Argument – This is a bit complicated and therefore best explained in the author’s own words:

Recall that Pivot Table 1 looks like this: So we are telling the =vlookup() formula to look at columns A and B because A has the common value (the date) that exists in both pivot tables and B has the values that are unique to Pivot Table 1.

Third Argument – This argument might seem a bit redundant. The formula needs to know where the relevant data is located in the Pivot Table sheet relative to the column with the common value. The common value in Pivot TAble 1 is the date in the date column (the first column) and the relevant data is in the second column next to it. We communicate the location to the formula with the number 2. The number 2 informs the formula that the data is essentially, one column to the right. so our formula looks now looks like this:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, …)

Fourth Argument – In the fourth argument we tell the formula, as the author explains, “whether the range we’re looking at has been sorted in the same order as the table we created for our data merge.” This is not the case, so we input FALSE into the formula. The author recommends to always input FALSE because it will still find the data even if it has been sorted. The complete formula is as such:

=vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE)

## Input the =vlookup() Formula

In our “cross reference sheet” we input the formula in cell B2 as seen below:

The formula looks at cell A2 in the sheet “cross reference sheet”, sees that the value in the cell is 4/13/2017. The formula then looks at the “Pivot Table 1” sheet (see below) and looks in the column A to find the same value. It finds the value 4/13/2017 in the 5th row of the A column, and then looks at the value in the B column to the right and see that the value in B5 is 1054. So the formula brings that value back to the cell B2 in “cross reference sheet”.

So, upon hitting enter, the formula shows the value 1054 in cell B2 (as seen below). We can smart copy the formula into the rest of the column by clicking on the small blue box at the bottom right of cell B2 and dragging it down to the rest of the column.

In order for the third column to perform the same task for Pivot Table 2, we use the same formula but change “Pivot Table 1” to “Pivot Table 2”. Smart copy again and the table should look like this:

Notice that some cells have #N/A written in them, that is because there was no data available. There is also a red mark in the top right corner of these cells which indicates that there was an error. We can fix this by altering the formula so that it knows that if there is an error, it should just input 0. This is accurate because when there is no data in our case, that was because there were no tweets from the account on that date.

To make this change we use the =iferror() formula. This formula takes 2 arguments.

The first argument is the entire previous formula followed by a common (it might seem like the formula should be inside a set of parentheses but that is not how this works). we also remove the = sign from the =vlookup() formula.

So we take the formula =vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE) and put it in =iferror() as the first argument. Which yields:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), …)

The second argument is what we want to input in the cell if the original formula yields an error. In our case we want a 0 so we use a 0 as our second argument. The final product looks like this:

=iferror(vlookup(A2, ‘Pivot Table 1’!A:B, 2, FALSE), 0)

Now we replace the formula in the sheet “cross reference sheet” with this one and we get the following:

That’s it! Congratulations, you have successfully learned to cross reference data from two different sources!

## The Path Forward

Lam Thuy Vo does a great job of addressing the path forward after you have learned the functionalities to perform her example. The following sources can be used to learn or merely as a reference as you use Sheets for you own purposes