November 11 2010

14:10

Data cleaning tool relaunches: Freebase Gridworks becomes Google Refine

When I first saw Freebase Gridworks I was a very happy man. Here was a tool that tackled one of the biggest problems in data journalism: cleaning dirty data (and data is invariably dirty). The tool made it easy to identify variations of a single term and clean them up, and to link one set of data to another – and much more besides.
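To give a flavour of what "identifying variations of a single term" can mean in practice, here is a minimal sketch of one common approach, key-collision clustering: each value is reduced to a normalised "fingerprint", and values that share a fingerprint are treated as likely variants of each other. This is an illustration only, not the tool's own code; the function names and the sample values are made up for the example.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a key that ignores case, punctuation, spacing and word order."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation so "Telegraph, Daily" and "Telegraph Daily" match.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_variants(values):
    """Group values by fingerprint; keep only groups with more than one distinct spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Hypothetical dirty column: three spellings of one publisher, plus one clean value.
    column = ["Daily Telegraph", "daily telegraph ", "Telegraph, Daily", "The Guardian"]
    for key, variants in cluster_variants(column).items():
        print(key, "->", variants)
```

Each cluster is only a suggestion: a human still decides which spelling is the right one and whether the grouped values really are the same thing.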

Then Google bought the company that made Gridworks, and now it’s released a new version of the tool under a new name: Google Refine.

It’s notable that Google are explicitly positioning Refine in their video (above) as a “data journalism” tool.

You can download Google Refine here.

Further videos below. The first explains how to take a list on a webpage and convert it into a cleaned-up dataset – a useful alternative to scraping:

The second video explains how to link your data to data from elsewhere, aka “reconciliation” – e.g. extracting latitude and longitude or language.
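The idea behind linking one dataset to another can be sketched in a few lines. The example below is a deliberately simplified, local stand-in: it matches place names in your own rows against a small reference table and copies across latitude and longitude. The function names, field names and sample rows are assumptions for illustration; a real reconciliation step would typically query a much larger external database and cope with fuzzier matches.

```python
def normalise(name: str) -> str:
    """Lower-case a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def reconcile(rows, reference, key="place"):
    """Return copies of rows enriched with lat/lon from the reference table (None if unmatched)."""
    lookup = {normalise(entry["name"]): entry for entry in reference}
    reconciled = []
    for row in rows:
        match = lookup.get(normalise(row[key]))
        enriched = dict(row)  # copy, so the caller's data is left untouched
        enriched["lat"] = match["lat"] if match else None
        enriched["lon"] = match["lon"] if match else None
        reconciled.append(enriched)
    return reconciled


if __name__ == "__main__":
    # Illustrative reference table with approximate coordinates.
    reference = [
        {"name": "London", "lat": 51.5074, "lon": -0.1278},
        {"name": "Paris", "lat": 48.8566, "lon": 2.3522},
    ]
    rows = [{"place": " london "}, {"place": "Atlantis"}]
    for row in reconcile(rows, reference):
        print(row)
```

Unmatched rows are kept with empty coordinates rather than dropped, so it stays obvious which entries still need checking by hand.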

July 22 2010

07:00

The New Online Journalists #6: Conrad Quilty-Harper

As part of an ongoing series on recent graduates who have gone into online journalism, The Telegraph’s new Data Mapping Reporter Conrad Quilty-Harper talks about what got him the job, what it involves, and what skills he feels online journalists need today.

I got my job thanks to Twitter. Chris Brauer, head of online journalism at City University, was impressed by my tweets and my experience, and referred me to the Telegraph when they said they were looking for people to help build the UK Political database.

I spent six weeks working on the database, at first manually creating candidate entries, and later mocking up design elements and cleaning the data using Freebase Gridworks, Excel and Dabble DB. At the time the Telegraph was advertising for a “data juggler” role, and I interviewed for the job and was offered it.

My job involves three elements:

  • Working with reporters to add visualisations to stories based on numbers,
  • Covering the “open data” beat as a reporter, and
  • Creating original stories with visualisations based on data from FOI and other sources.

For my job I need to know how to select and scrape good data, clean it, pick out the stories and visualise it. (P.S. you may have noticed that I’m a “data is singular” kinda guy).

The “data” niche is greatly exciting to me. Feeding into this is the #opendata movement, the new Government’s plan to release more data and the understanding that data-driven journalism as practised in the United States has to come here. There’s clearly a hunger for more data-driven stories, a point well illustrated by a recent letter to the FT.

The mindset you need to have as an online journalist today is to become familiar with and proficient at using tools that make you better at your job. You have to be an early adopter. Get on the latest online service, get the latest gadget and get it before your colleagues and competitors. Find the value in those tools, integrate them into your work, then go and find another tool.

When I blogged for Engadget our team had built an automated picture watermarker for liveblogging. I played with it for a few hours and made a new script that downloaded the pictures from a card, applied the watermark, uploaded the pictures and ejected the SD card. Engadget continues to try out new tools that enable them to do their job faster and better. There are endless innovations being churned out every day from the world of technology. Make time to play with them and make them work for you.
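For anyone curious what that kind of glue script looks like, here is a rough sketch of the same shape of workflow, not the original Engadget script. It walks a card folder, passes each picture through a watermark step and then an upload step. Both steps are placeholders: the default "watermark" just copies the file and the default "upload" does nothing, so you would plug in a real image library and a real upload call. Ejecting the card is platform-specific and is left out. All names here are hypothetical.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def copy_only(source: Path, destination: Path) -> None:
    """Placeholder watermark step: copies the file unchanged."""
    shutil.copy2(source, destination)


def process_card(card_dir, work_dir, watermark=copy_only, upload=lambda path: None):
    """Run every image on the card through the watermark and upload steps, in name order."""
    card, work = Path(card_dir), Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    processed = []
    for photo in sorted(card.iterdir()):
        # Skip folders and anything that does not look like a picture.
        if not photo.is_file() or photo.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = work / photo.name
        watermark(photo, target)  # swap in a real watermarking function here
        upload(target)            # swap in a real upload function here
        processed.append(target)
    return processed
```

Keeping the watermark and upload steps as plain functions passed in as arguments means each one can be replaced or tested on its own without touching the loop.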

If you know of anyone else who should be featured in this series, let me know in the comments.
