Finding Hidden Data, Sort Of
We’ll go over some project stuff, learn a little about scraping, and use the Chrome Inspector to find hidden data. The main thing to remember is this: if you can see data in a structured way on the page, you can probably automate collecting it. (That doesn’t mean you should…it only means you can!)
Another plug for this sweet tumblr.
Sam and Nausheen are discussing a recent UGC map by the Washington Post (left over from last week).
Scraping the web
The inspector in Chrome is an extremely useful tool for all sorts of functions. Today we’ll use it to help with two things: styling pages and finding “hidden” data. Shan and Kevin give a tour of the Chrome Inspector to find data you didn’t know was there. The inspector does much more than let you experiment with CSS. It also lets you see every asset your web page is loading, even when you can’t see it.
There are tons of resources out there to automate data collection. Here’s a handy tipsheet from Scott Klein and Michelle Minkoff. We’ll use R’s XML package to fetch HTML tables and other kinds of things that are easily automated when we think like a robot.
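To give you a feel for it, here is a minimal sketch in R using the XML package’s `readHTMLTable()`. The URL is a stand-in for any page that has an HTML table on it; the fetch itself is commented out so you can swap in a real page.

```r
# Load the XML package, which can parse HTML tables directly
library(XML)

# readHTMLTable() fetches a page and returns a list of data frames,
# one per <table> element on the page. Note: the XML package can't
# fetch https pages by itself; grab the raw HTML with RCurl first
# if the site requires https.
url <- "http://example.com/page-with-a-table.html"  # stand-in URL

# tables <- readHTMLTable(url, stringsAsFactors = FALSE)
# length(tables)   # how many tables did the page have?
# tables[[1]]      # the first one, as a data frame
```

That’s the whole trick: once the data is in an HTML table, one function call turns it into a data frame you can work with.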
Say you’re a data editor for the New Yorker (or, better yet, you’re trying to convince them that they need one) and you’re helping a reporter with her story about Russia’s recent ban on U.S. adoptions. There is plenty of data on the adoption stats page, but no way to download it all at once. You’re looking for data on all U.S. adoptions, from any country, for as many years as you can get.
Here’s what we’ll do
- Use the inspector to find out where the data lives
- Use R to download that table
- Generalize the code into a function to download the data for any country
- Get a list of all countries in the database
- Apply the function to get a data frame representing all the adoptions in the database
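The steps above can be sketched as an R function. This is a sketch under assumptions: the URL pattern, the country “slug,” and the position of the table on the page are all hypothetical stand-ins — use the inspector to find the site’s real URL structure before adapting it.

```r
library(XML)

# Hypothetical URL pattern: one stats page per country, keyed by a slug.
# The real site's URL structure will differ; find it with the inspector.
adoption_url <- function(country_slug) {
  paste0("http://adoption.state.gov/country/", country_slug, ".html")
}

# Download the adoptions table for one country, and tag each row
# with the country it came from so the rows can be stacked later.
get_country_table <- function(country_slug) {
  tab <- readHTMLTable(adoption_url(country_slug),
                       which = 1, stringsAsFactors = FALSE)
  tab$country <- country_slug
  tab
}

# Apply the function across every country and stack the results into
# one data frame representing all the adoptions in the database:
# countries <- c("russia", "china", "ethiopia")  # scraped from the site
# all_adoptions <- do.call(rbind, lapply(countries, get_country_table))
```

The payoff of writing the function is the last (commented) step: once one country works, `lapply()` over the full country list gets you everything.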
We’ll start setting up the infrastructure for your projects and making sure your pitch is where it should be. Everyone should end up with a new “project” repo with a completed pitch that meets all the requirements, published to your GitHub page.
You should all have a URL that looks like this example, with the same URL structure. When you’re done and Shavin has approved, move on to the next exercises.
Using the network panel…
Say you got in touch with the press office of the National Community Pharmacists Association to see if they could send you data for the locations of all 23,000 community pharmacies in the U.S. The press office replied that they could not send you that data but would be willing to send summary information.
Take a look at their web site. Enter your favorite ZIP code and note the response.
Now open the Chrome inspector. Click the “Network” panel, refresh the browser and do the same search again. What happens? Take a look at the files and see if you can find anything that might resemble data.
Here’s the URL for Kevin’s hometown (Plymouth, MN). Say you want to work with the data for just your town in Excel. Is this ready for cutting and pasting?
With some mild formatting in a text editor, this site might help.
If you could tell a robot what to do to get all this data, what would it be?
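One way to phrase those robot instructions, sketched in R. Everything here is hypothetical: the endpoint, the query parameter, and the file names are stand-ins for whatever the Network panel actually shows you.

```r
# Loop over the ZIP codes you care about, fetch whatever URL the
# Network panel revealed, and save each response to disk for later
# parsing. The endpoint below is made up for illustration.
base_url <- "http://example-pharmacy-locator.org/search?zip="

fetch_zip <- function(zip) {
  url <- paste0(base_url, zip)
  destfile <- paste0("pharmacies_", zip, ".txt")
  # download.file(url, destfile)  # saves the raw response to disk
  Sys.sleep(1)                    # polite robots pause between requests
  destfile
}

# zips <- c("55447", "10027", "94110")  # ZIPs you want data for
# lapply(zips, fetch_zip)
```

In other words: the robot’s instructions are just “build the URL for each ZIP, fetch it, save it, wait a second, repeat.”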
Let’s use the Inspector to…
- find out if Erik and Sean used the same data you did on their homework assignment.
- try to find some help reporting about bicycle destinations in San Francisco. (Goal: a list of people from the SF bay area who commented on this map. Bonus, how many times did they each comment?)
- get a list of GIS coordinates for NYC’s Citi Bike stations