
August 01 2012


Can Google Maps + Fusion Tables Beat OpenBlock?

WRAL.com, North Carolina's most widely read online news site, recently published a tool that allows you to search concealed weapons permits down to the street level. It didn't use OpenBlock to do so. Why?


Or, if you're like many journalistically and technically savvy people I've spoken with over the last few months, you might ask: why would they? There's plenty of evidence out there to suggest the OpenBlock application is essentially a great experiment and proof of concept, but a dud as a useful tool for journalists. Many of the public records portions of Everyblock.com -- OpenBlock's commercial iteration -- are months if not years out of date. It can't be found anywhere on the public sites of the two news organizations in which the Knight Foundation invested $223,625. Only three sites are running the open-source code -- two of those are at universities, and only one was created without funding from the Knight Foundation.

And, you, Thornburg. You don't have a site up and running yet, either.

All excellent points, dear friends. OpenBlock has its problems -- it doesn't work well in multi-city installations, some search functions don't work as you'd expect, there's no easy way to correct incorrect geocoding or even identify possible failures, among other obstacles that I'll describe in greater detail in a later blog post. But the alternatives also have shortcomings. And deciding whether to use OpenBlock depends on which shortcomings will be more tolerable to your journalists, advertisers and readers.


If you want to publish news from multiple cities or from unincorporated areas, or if you serve a rural community, I'd hold off for now. If you visit our public repositories on GitHub you can see the good work the developers at Caktus have been doing to remove these limitations, and I'm proud to say that we have a private staging site that's up and running for our initial partner site. But until we make the set-up process easier, you're going to have to hire a Django developer (at anywhere from $48,000 a year to $150 an hour) to customize the site with your logo, your geographic data, and your news items.

The other limitation of OpenBlock right now is that it isn't going to be cheap to maintain once you do get it up and running. The next priority for me is to make the application scale better to multiple installations and therefore lower the maintenance costs. Within the small OpenBlock community, there's debate about how large a server it requires. The very good developers at OpenPlans, who did a lot of heavy lifting on the code after it was open sourced, say it should run nicely on a "micro" instance of Amazon's EC2 cloud hosting service -- about $180 a year.

But we and Tim Shedor, the University of Kansas student who built LarryvilleKU, find OpenBlock a little too memory intensive for the "micro" instance. We're on an Amazon Web Services "small" instance, and LarryvilleKU is on a similar sized virtual server at MediaTemple. That costs more like $720 a year. And if you add a staging server to make sure your code changes break in private instead of public, you're looking at hosting costs of nearly $1,500 a year.

And that's before your scrapers start breaking. Depending on how conservative you are, you'll want to set aside a budget for fixing each scraper somewhere between one and three times a year. Each fix might be an hour or maybe up to 12 hours of work for a Python programmer (or the good folks at ScraperWiki). If you have three data sources -- arrests, restaurant inspections and home sales, let's say -- then you may get away with a $300 annual scraper maintenance cost, or it may set you back as much as $15,000 a year.
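To make the fragility concrete, here's a sketch of the kind of scraper whose upkeep drives those costs, using only the Python standard library. The table layout, column count and field order are hypothetical -- a real arrests page will differ -- but the failure mode is the point: when the agency redesigns its page, you want an error, not a silent data gap.

```python
from html.parser import HTMLParser

class ArrestTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

def parse_arrests(html, expected_columns=3):
    parser = ArrestTableParser()
    parser.feed(html)
    # Fail loudly when the source page changes shape: this check is what
    # turns a silent data gap into a maintenance ticket.
    for row in parser.rows:
        if len(row) != expected_columns:
            raise ValueError("page layout changed: %r" % (row,))
    return parser.rows
```

Each "scraper fix" in the budget above is usually a change to exactly this kind of parsing logic.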

I've got some ideas on how to reduce those scraper costs, too, but more on that later as well.

Of course, if you have someone on staff who does Python programming, who's done some work with public records, who has published a few Django sites, and who has time to spare, then your costs will go down significantly.

But just in case you don't have such a person on staff or aren't ready to make this kind of investment, what are your alternatives?


Using a Google Map on your news website is a little like playing the saxophone. It's probably the easiest instrument to learn how to play poorly, but pretty difficult to make it really sing. Anyone can create a Google Map of homicides or parking garages or whatever, but it's going to be a static map of only one schema, and it won't be searchable or sortable.


On the other hand, you can also use Google Maps and Fusion Tables to build some really amazing applications, like the ones you might see in The Guardian or on The Texas Tribune or WNYC or The Bay Citizen. You can do all this, but it takes some coding effort and probably a bit more regular care and feeding to keep the site up to date.

I've taken a look at how you might use Google's data tools to replicate something like OpenBlock, although I've not actually done it. If you want to give it a whirl and report back, here's my recipe.


Step 1. Create one Google Docs spreadsheet for each schema, up to a maximum of four spreadsheets. And create one Google Fusion Table for each schema, up to a maximum of four tables.

Step 2. If the data you want is in a CSV file that's been published to the web, you can populate the spreadsheet with a Google Docs function called ImportData. This function -- as well as its sister functions ImportHTML and ImportXML -- will only update 50 records at a time. And I believe this function will pull in new data from the CSV about once an hour. I don't know whether it will append the new rows or overwrite them, or what it would do if only a few of the fields in a record change. If you're really lucky, the data will be in an RSS feed and you can use the ImportFeed function to get past this 50-record limit.

Of course, in the real world almost none of your data will be in these formats. None of mine are. And in that case, you'd have to either re-enter the data into Google Docs by hand or use something like ScraperWiki to scrape a datasource and present it as a CSV or a feed.
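If you take the ScraperWiki route, the hand-off to Google Docs amounts to publishing your scraped records as CSV at a stable URL that ImportData can fetch. A minimal sketch of that serialisation step -- the field names here are hypothetical:

```python
import csv
import io

def records_to_csv(records, fieldnames):
    """Serialize scraped records as CSV text for a feed that a Google
    Docs ImportData() call (or any other consumer) can pull from."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()
```

ScraperWiki offered this kind of CSV endpoint out of the box, which is why I keep mentioning it as the glue layer.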

Step 3. Use a modification of this script to automatically pull the data -- including updates -- from the Google Docs spreadsheet into the corresponding Fusion table you created for that schema.
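I haven't run that script myself, but the heart of the spreadsheet-to-Fusion-Tables step is building SQL-style INSERT statements for the Fusion Tables query API. A hedged sketch of constructing one such statement -- the table ID and column names are placeholders, and this is untested against the live API:

```python
def fusion_insert_sql(table_id, row):
    """Build one INSERT statement in the SQL-like dialect the Fusion
    Tables query API accepts. `row` is a dict of column -> value;
    values are single-quoted and embedded quotes escaped."""
    cols = ", ".join("'%s'" % c for c in row)
    vals = ", ".join("'%s'" % str(v).replace("'", "\\'") for v in row.values())
    return "INSERT INTO %s (%s) VALUES (%s);" % (table_id, cols, vals)
```

A wrapper script would loop over the spreadsheet rows, emit one statement per new or changed record, and POST the batch to the API.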

Step 4. Find the U.S. Census or local county shapefiles for any geographies you want -- such as ZIP codes or cities or school districts -- and convert them to KML.

Step 5. Upload that geographic information into another Fusion Table.

Step 6. Merge the Fusion table from Step 3 with the Fusion table from Step 5.

Step 7. This is really a thousand little steps, each depending on which of OpenBlock's user interface features you'd like to replicate. And, really, it should be preceded by step 6a -- learn JavaScript, SQL, CSS and HTML. Once you've done that, you can build tools that let users search, filter and map the data.

And there's even at least one prototype of using server-side scripting and Google's APIs to build a relatively full-functioning GIS-type web application: https://github.com/odi86/GFTPrototype

After all that, you will have some of the features of OpenBlock, but not others.

Some key OpenBlock features you can replicate with Google Maps and Fusion Tables:

  • Filter by date, street, city, ZIP code or any other field you choose. Fusion Tables is actually a much better interface for searching and filtering -- or doing any kind of reporting work -- than OpenBlock.
  • Show up to four different kinds of news items on one map (five if you don't include a geography layer).
  • Conduct proximity searches. "Show me crimes reported within 1 mile of a specific address."
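That proximity search in the last bullet maps onto Fusion Tables' spatial SQL (ST_INTERSECTS with a CIRCLE of a given radius in meters). A sketch of building such a query string -- the table ID and the "Location" column name are placeholders:

```python
def proximity_query(table_id, lat, lng, radius_m):
    """Build a Fusion Tables spatial query: rows whose Location column
    falls within radius_m meters of the given point. One mile is
    roughly 1,609 meters."""
    return ("SELECT * FROM %s WHERE ST_INTERSECTS("
            "Location, CIRCLE(LATLNG(%f, %f), %d))"
            % (table_id, lat, lng, radius_m))
```

The resulting string gets sent to the Fusion Tables query API, or used as the filter condition on a map layer.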


The OpenBlock features you can't replicate with Google:

  • Use a data source that is anything other than an RSS feed, HTML table, CSV or TSV. That's right, no XLS files unless you manually import them.
  • Use a data source for which you need to combine two CSV files before import. This is the case with our property transactions and restaurant inspections.
  • Update more than 50 records at a time. Definitely a problem for police reports in all but the smallest towns.
  • Use a data source that doesn't store the entire address in a single field. That's a problem for all the records with which we're working.
  • Map more than 100,000 rows in any one Fusion table. In rural counties, this probably wouldn't be a concern. In Columbus County, N.C., there are only 45,000 parcels of land and 9,000 incidents and arrests a year.
  • Use data sources that are larger than 20MB or 400,000 cells. I don't anticipate this would be a problem for any dataset in any county we're working in.
  • Plot more than 2,500 records a day on a map. I don't anticipate hitting this limit either, especially after the initial upload of data.
  • Parse text for an address -- so you can't map news articles, for example.
  • Filter to the block level. If Main Street runs for miles through several towns, you're not going to be able to narrow your search to anything relevant.
  • Create a custom RSS feed, or email alert.


And there's one final feature of OpenBlock that you can't replicate using Google tools without investing a good deal of manual, rote set-up work -- taking advantage of SEO or social media sharing by having a unique URL for a particular geography or news item type. Ideally, if someone searches for "home sales in 27514" I want them to come to my site. And if someone wants to post to Facebook a link to a particular restaurant that was scolded for having an employee with a finger-licking tendency (true story), I'd want them to be able to link directly to that specific inspection incident without forcing their friends to hunt through a bunch of irrelevant 100 scores.

To replicate OpenBlock's URL structure using Google Maps and Fusion Tables, you'd have to create a unique web page and a unique Google map for each city and ZIP code. The geography pages would display a polygon of the selected geography, whether it's a ZIP code or city or anything else, and all of the news items for that geography (up to four schemas, such as arrests, incidents, property sales, and restaurant inspections). That's 55 map pages.

Then you'd have to create a map and a page for each news item type. That's four pages, four Fusion tables, and four Google Docs spreadsheets.

Whew. I'm going to stick with our work on improving the flexibility and scalability of OpenBlock. But it's still worth looking at Google Maps and Fusion Tables for some small and static data use cases. Other tools such as Socrata's Open Data, Caspio and Tableau Public are also worth your time as you begin to think about publishing public data. Each of those has some maintenance costs and its own strengths and weaknesses, but the real obstacle to using any of these tools is public data that isn't in a usable format. We're looking hard at solving that problem with a combination of scraping and crowdsourcing, and I'll report what we've found in an upcoming post.

Ryan Thornburg researches and teaches online news writing, editing, producing and reporting as an assistant professor in the School of Journalism and Mass Communication at the University of North Carolina at Chapel Hill. He has helped news organizations on four continents develop digital editorial products and use new media to hold powerful people accountable, shine light in dark places and explain a complex world. Previously, Thornburg was managing editor of USNews.com, managing editor for Congressional Quarterly's website and national/international editor for washingtonpost.com. He has a master's degree from George Washington University's Graduate School of Political Management and a bachelor's from the University of North Carolina at Chapel Hill.

July 07 2011


The inverted pyramid of data journalism

I’ve been working for some time on picking apart the many processes which make up what we call data journalism. Indeed, if you have read the chapter on data journalism (blogged draft) in my Online Journalism Handbook, or have seen me speak on the subject, you’ll have seen my previous diagram that tries to explain those processes.

I’ve now revised that considerably, and what I’ve come up with bears some explanation. I’ve cheekily called it the inverted pyramid of data journalism, partly because it begins with a large amount of information which becomes increasingly focused as you drill down into it until you reach the point of communicating the results.

What’s more, I’ve also sketched out a second diagram that breaks down how data journalism stories are communicated – an area which I think has so far not been very widely explored. But that’s for a future post.

I’m hoping this will be helpful to those trying to get to grips with data, whether as journalists, developers or designers. This is, as always, work in progress so let me know if you think I’ve missed anything or if things might be better explained.

The inverted pyramid of data journalism

[Diagram: the inverted pyramid of data journalism, by Paul Bradshaw]

Here are the stages explained:


Compile

Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning. Whichever it is, the compilation of data is what defines it as an act of data journalism.

Compiling data can take various forms. At its most simple the data might be:

  1. supplied directly to you by an organisation (how long until we see ‘data releases’ alongside press releases?);
  2. found through using advanced search techniques to plough into the depths of government websites;
  3. compiled by scraping databases hidden behind online forms or pages of results, using tools like OutWit Hub and ScraperWiki;
  4. converted from documents into something that can be analysed, using tools like DocumentCloud;
  5. pulled from APIs;
  6. or collected by you through observation, surveys, online forms or crowdsourcing.
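To illustrate option 5, here is a minimal sketch of unwrapping records from a JSON API response. The ‘results’ envelope key is an assumption – every API nests its rows differently – and the URL in a real pipeline would come from the API’s documentation:

```python
import json
from urllib.request import urlopen

def extract_records(payload, envelope_key="results"):
    """Unwrap the list of records from an API response envelope.
    If the payload is already a bare list, return it unchanged."""
    if isinstance(payload, dict):
        return payload.get(envelope_key, [])
    return payload

def compile_from_api(url):
    """Fetch a JSON API endpoint and return its rows."""
    with urlopen(url) as resp:
        return extract_records(json.load(resp))
```

The point is simply that an API gives you structured rows directly, skipping the scraping and conversion steps above.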

This compilation stage is the most important – not only because everything else rests on it, but because it is probably the stage you will return to most often: at each of the subsequent stages – cleaning, contextualising, combining and communicating – you may find you need to compile further information.


Clean

Having data is just the beginning. Being confident in the stories hidden within it means being able to trust the quality of the data – and that means cleaning it.

Cleaning typically takes two forms: removing human error; and converting the data into a format that is consistent with other data you are using.

For example, datasets will often include some or all of the following: duplicate entries; empty entries; the use of default values to save time or where no information was held; incorrect formatting (e.g. words instead of numbers); corrupted entries or entries with HTML code; multiple names for the same thing (e.g. BBC and B.B.C. and British Broadcasting Corporation); and missing data (e.g. constituency). You can probably suggest others.

There are simple ways to clean up data in Excel or Google Docs such as find and replace, sorting to find unusually high, low, or empty entries, and using filters so that only duplicate entries (i.e. those where a piece of data occurs more than once) are shown.
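Those spreadsheet steps translate directly into a few lines of script. A sketch in Python – dropping exact duplicate rows and flagging rows with empty cells for manual review:

```python
def clean_rows(rows):
    """Deduplicate rows and flag incomplete ones.
    Returns (cleaned, problems): cleaned keeps the first copy of each
    row; problems lists rows containing an empty cell for hand review."""
    seen = set()
    cleaned, problems = [], []
    for row in rows:
        key = tuple(row)
        if key in seen:
            continue  # duplicate entry: keep only the first occurrence
        seen.add(key)
        if any(cell.strip() == "" for cell in row):
            problems.append(row)  # empty entry: needs a human decision
        cleaned.append(row)
    return cleaned, problems
```

Note that the flagged rows are kept, not dropped: deciding what an empty cell means is editorial work, not something to automate away.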

Google Refine adds a lot more power: its ‘common transforms’ function will, for example, convert all entries to lowercase, uppercase or titlecase. It can remove HTML, remove spaces before and after entries (which you can’t see but which computers will see as different to the same data without a space), remove double spaces, join and split cells, and format them consistently. It will also ‘cluster’ entries and allow you to merge those which should be the same. Note: this will work for BBC and B.B.C. but not BBC and British Broadcasting Corporation, so some manual intervention is often needed.
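A rough script equivalent of those common transforms, for the BBC/B.B.C. case – trim, collapse whitespace, strip punctuation and uppercase so the variants cluster together (and, as noted above, ‘British Broadcasting Corporation’ still needs manual merging):

```python
import re

def normalise(name):
    """Normalise an entity name: trim, collapse internal whitespace,
    strip punctuation, and uppercase."""
    name = re.sub(r"\s+", " ", name.strip())
    return re.sub(r"[^\w ]", "", name).upper()
```

Run this over a column before comparing or merging datasets, and most of the trivial variant spellings collapse into one value.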


Context

Like any source, data cannot always be trusted. It comes with its own histories, biases, and objectives. So, like any source, you need to ask questions of it: who gathered it, when, and for what purpose? How was it gathered? (The methodology.) What exactly do they mean by that?

You will also need to understand jargon, such as codes that represent categories, classifications or locations, and specialist terminology.

All the above will most likely lead you to compile further data. For example, knowing the number of crimes in a city is interesting, but only becomes meaningful when you contextualise that alongside the population, or the numbers of police, or the levels of crime 5 years ago, or perceptions of crime, or levels of unemployment, and so on. Statistical literacy is a must here – or at least show your work to someone who has read Ben Goldacre’s book.

Having a clear question at the start of the whole process, by the way, helps ensure you don’t lose your focus at this point, or miss an interesting angle.


Combine

Good stories can be found in a single dataset, but often you will need to combine two. After all, given the choice between a single-source story and a multiple-source one, which would you prefer?

The classic combination is the maps mashup: taking one dataset and combining it with map data to provide an instant visualisation of how something is distributed in space: where are the cuts hitting hardest? Which schools are performing best? What are the most talked-about topics around the world on Twitter right now?

This is so common (largely because the Google Maps API was one of the first journalistically useful APIs) it has almost become a cliche. But still, cliches are often – if not always – effective.

A more mundane combination is to combine two or more datasets with a common data point. That might be a politician’s name, for example, or a school, or a location.

This often means ensuring that the particular data point is formatted the same way across each dataset.

In one, for example, the first and last names might have separate columns, but not in the other (you can concatenate or split cells to solve this).

Or you might have local authority names in one, but local authority codes in another (find another dataset that has both together and use a tool like Google Fusion Tables to merge them).

One might use latitude and longitude; another postcodes, or easting and northing (a postcodes API and Google Refine can help). But once you’ve got them formatted right, you may find some interesting stories or leads for further questions to ask.
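Once the common data point is formatted consistently, the combine step itself is a simple keyed join. A sketch in Python, with rows as dictionaries and hypothetical field names – the same operation a Fusion Tables merge performs:

```python
def join_on(key, left_rows, right_rows):
    """Inner-join two datasets on a shared data point (a name, a
    school, a local-authority code...). Rows are dicts; the key must
    already be formatted identically on both sides."""
    index = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined
```

Rows with no match on the other side are silently dropped here; in practice you would want to count them, since unmatched rows are often where the formatting problems (or the stories) hide.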


Communicate

In data journalism the all-too-obvious thing to do at this point is to visualise the results – on a map, in a chart, an infographic, or an animation. But there’s a lot more to consider here – from the classic narrative, to news apps, case studies and personalisation. In fact there’s so much in this stage alone that I’m going to write a separate post to explore it, and save this space for comments on the process outlined above.
