« Back Home

Getting Data Without (Much) Code

Lena Groeger, ProPublica, September 2015

Tools You'll Need

Google Chrome »

Firefox and Safari are OK, but all of our examples and tools will be in Chrome. Please don't use Internet Explorer, I beg you.

Google Spreadsheets »

We'll learn a pretty neat trick that let's you grab data with Google Spreadsheets.

Example's We'll Use

Failed Banks: https://www.fdic.gov/bank/individual/failed/banklist.html

School Zone Clusters: http://www.atlanta.k12.ga.us/Page/832

FDA Directory: http://dslo.afdo.org/results/?q=Georgia&unifyfda=1&bystate=1&selected_facets=area_exact:%22100%22

Ready? Let's Get Some Data

Try a Blank Search. First things first. Often you can just try to search with nothing in the search field, and return ALL the data in a database. Let's try it with this example: http://www.asias.faa.gov/pls/apex/

Look for the Download Button. Often websites with data will have CSV, Excel, or other data download options: http://www.oecd.org/gender/data/employmentandunemploymentratebysexandagegroupquarterlydata.htm

Try Google Spreadsheets. Did you know that you can use Google spreadsheets to pull down an html table? You can using a simple formula: =ImportHTML(“url”, “elementtype”, numberElement on page)

Start by opening up a new spreadsheet, then paste either of these two formulas into the first cell:

Example 1: =IMPORTHTML("http://www.atlanta.k12.ga.us/Page/832","table",2)

Example 2:=IMPORTHTML("https://www.fdic.gov/bank/individual/failed/banklist.html","table",0)

Now, can you find your own example of a page to try importing? Remember, IMPORTHTML works with both tables and lists.

PS. Guess what? You just wrote some code. Yup, formulas in Google Spreadsheets, Excel or any other spreadsheet program are all code. NICE!!

Try Scraping It. If you're wondering what in the world scraping is, don't worry, it's not too complicated. It basically means grabbing things on a webpage, and copying them, and then downloading them. But first, we need to talk about what a webpage.

Let's talk about what a webpage is, really.

A webpage is basically a bunch of stuff that gets downloaded. (Really, that's it.) Slightly more technically, it's a bunch of different files that get downloaded: some text files and maybe some images. The best way to see the "behind the scenes" of a webpage is to use a tool called the web inspector to actually look at these files. Let's do that next.

The Web Inspector. As an example webpage, let's go to the home page of the New York Times. Right click (or controll click), and select "Inspect Element." I can now look at the contents of the files making up this webpage! You can see the underlying HTML (the structure of the page) the CSS (which styles the page) and much more.

These are just a few things you can do with the web inspector:

  1. Change a Headline
  2. Change the color of some text
  3. Download an image

And here's a much more detailed look at the Web Inspector and all the magical things you can do with it (by my former colleague Dan Nguyen)

Ok, back to our data.

Now that we've seen the building blocks of a webpage, we can take a look at some tools that can "scrape" some of those elements for us.

Chrome Extension Scraper: This is a super simple chrome extension that lets you select some information and turn it into some structured data. Enable the extension here: https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en. When you've got it enabled and gone to a website you'd like to scrape, click on some type of information (names, emails, URLs, etc) and then right-click (or control-click) and choose "Scrape Similar". There’s an option to bring the results into a Google spreadsheet, or download the CSV.

I'll walk us through a demo using this website: http://www.safekids.org/statelaws

Import.IO: This is a free tool that lets you break down your page down into elements and grab exactly what you want. You can either download the data right away or "train" the program to recognize exactly what you want. It also "crawls" through multiple paginated results. Let's take a look: https://import.io/

We'll try pasting in the same failed banks list as before: https://www.fdic.gov/bank/individual/failed/banklist.html

Here's a pretty good tutorial for more details and advanced uses: http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Scraping with Ruby & Python (not as scary as it sounds)

There are a lot more ways to get data from websites using just a few lines of code. Python & Ruby are both programming languages that let you do that. In some cases (like when some government webpage is giving you a particularly hard time) you might need to turn to these more powerful tools to get the data you want. I'll walk us through a few examples, but for much more detail here's a great guide to web scraping.

Other Useful Tools

Tabula

http://tabula.nerdpower.org/

Tabula turns tabular PDF data into tables. Free software from the Knight-Mozilla OpenNews project.

DownThemAll

https://addons.mozilla.org/en-US/firefox/addon/downthemall/

A Firefox extension that detects what types of assets you might want to download on a page and allows you to download them, well, all. Doesn’t work for everything, but a good quick one to try.

OutwitHub

https://addons.mozilla.org/en-us/firefox/addon/outwit-hub/

Break your page down into its elements that it is made up of. Grab all pictures on a page at once, create a scraper using various HTML elements as start and end points. This is quite powerful, if you learn how to use it well. More robust than DownThemAll.

JQ

http://stedolan.github.io/jq/download/

Helpful command line utility for working with JSON files.

JSON Viewer

http://jsonviewer.stack.hu/

Open Refine

http://openrefine.org/

Tool to inspect and manipulate spreadsheet files, allowing to run queries on it to manipulate it the way you would like.


Note: When grabbing data, please remember to make sure you have permission to get the information you are grabbing. Just because you can take photos off a site, doesn’t mean you can use them for your project. With great power comes great responsibility!

Next up: What to Do With Your Data »