« Back Home

Manipulating Data

Lena Groeger, ProPublica, September 2015

Tools/Data You'll Need

Google Spreadsheets »

You can use Microsoft Excel if you're familiar with it, but for simplicity's sake all of my examples will be in Google Spreadsheets

This Data »

All of our data wrangling will be done on this spreadsheet. Open it up and make your own local copy. In case you're wondering, yes this is actual, real data about deer hunting accidents in Wisconsin. Cool.

Ok, let's manipulate some data.

Spreadsheet programs come in many different flavors, but for the most part they are all copies of the same program. The commands are there, they may be in different menus or locations.

At this point you should have made a copy of this file: https://docs.google.com/spreadsheets/d/1mzd4F_slzOL6BYfLUaKhrXf9xV-KDfO4r5W4jwlgzSc/edit#gid=0 If not, go to that link and File > Make a Copy

Now you have your own copy! Hurray. Save it locally.

Here are some things you should do with every dataset you encounter:

1. The Eyeball Test. Scroll through the spreadsheet and look for patterns or missing spots. Some questions/notes that you should keep in mind:

  1. Where did this data come from?
  2. What was the source?
  3. How was it collected? What can and can't be said with this data?

Typically, you can't understand most data in the real world without making some phone calls.

  1. What do all the columns mean? For example, what's the "MUZDIS" column?
  2. What units are each of the columns in? Which are the most important columns?

In most cases you can track down a data dictionary or at least more context.

All that said, we are going to continue anyway...

2. Set Up Shop. A few things to do before you get started that will make it easier to work with your data.

Freeze top row: View > Freeze

Turn on filters: Data > Filter

3. Data Wrangling. Techniques that will help you with most of your initial data questions.

Sorting

  1. Sort by year (what years covered by this data?)
  2. Victim Age (what's the highest and lowest age?)

Filtering

  1. Filter by county (most of the time you're going to want to look near where you live).
  2. Filter by wound location
  3. Filter by condition (try any victim under age of 40)

Functions

  1. SUM ()
  2. AVERAGE (What's the average age of a shooter? )
  3. COUNTIF (How many people were mistaken for deer?)
  4. MIN/MAX (How old is the oldest/youngest shooter?)

Charting

  1. Scatter plot (try victim age and shooter age)
  2. Bar chart (by year)

Pivot Tables. Say I want to find out which gun brand is associated the most with fatal accidents. What is the best way to do that?

Data > Pivot Tables. We'll use this technique to answer the following questions:

  1. What's the most common gun brand? (Hint: Summarize by COUNTA)
  2. What's the most common wound?
  3. How many victims are male?

4. More Things to Know

Make phone calls. Find the nerd that works with the data and talk to them.

Save early and often

Keep a data diary

Backup your data

If you can, go see it in the real world.

Bulletproof, Bulletproof, Bulletproof. Have other colleagues look at your data before it's published. Also make sure to do the following:

  1. Integrity checks to find flaws.
  2. Add caveats where necessary.
  3. Do your own analysis rather than relying on an agency's analysis.
  4. Read the documentation. Understand the contents of every field.
  5. Know how many records you should have.
  6. Check counts and totals against reports.
  7. Are all possibilities included?
  8. Check with experts.
  9. Know the standards.
  10. Find out what others have done.
  11. Gut check – does it seem wrong?
  12. Go physically see a record or spot check against documents.