June 27 2013

15:52

Nonprofits Learn How to Tell Untold Stories at Data Day 2013

Data can be integrated seamlessly into stories that benefit communities, presenters told nonprofits and journalists at a conference on June 21. The event demonstrated how one can tap into information sources about communities whose voices are often unheard.

Data Day 2013, held at Northeastern University in Boston, showcased how successful data-based stories engage people on an emotional level. Metropolitan Area Planning Council, Northeastern University’s School of Public Policy & Urban Affairs, and The Boston Foundation co-hosted the conference.


September 04 2012

18:47

Top 2 plug-ins for scraping data right in your browser

Scraping information from the Web can be a complicated affair -- but that's not always the case. If you've got a simple data extraction job to do for a story, check out these two inexpensive browser extensions that can help you get the job done.

August 17 2012

14:00

Next Knight News Challenge Calls for Mobile Visionaries

The Knight Foundation, which now offers three rounds of its News Challenge instead of one competition per year, just announced the theme of its next contest: mobile. This round focuses on funding innovators who are using mobile to change the face of the media industry.

Considerable growth in mobile Internet usage over the past few years has meant the way in which people consume news is undoubtedly shifting -- so it's not much of a surprise that mobile would be the theme of one of this year's rounds. In fact, several mobile players have already been the recipients of past News Challenge awards -- think MobileActive, FrontlineSMS, as well as Watchup, Behavio and Peepol.tv, which were winners of the round on networks.

"We know that we (and our kids) have grown attached to our mobile devices," Knight's John Bracken and Christopher Sopher wrote in a blog post announcing the round, "but we have less clarity about the ways people are using them, or might use them, as citizens, content producers and consumers to tell, share and receive stories."

move over, data

The announcement of the next theme comes as round 2, which focuses on data, moves on to the next stage. The round is now closed for submissions, and Knight's team of advisers has selected 16 finalists. The advisers will be conducting interviews and video chats with the finalists over the next couple of weeks. Winners of the data round will be announced in September.

"We've focused the News Challenge this year on big opportunities in news and information -- networks, data and now mobile," Bracken and Sopher wrote in their post. "In some ways, mobile represents both the greatest need and greatest potential for individual citizens and news organizations."

The mobile round will be open to applicants starting on August 29, and Knight will accept entries until September 10.

July 30 2012

14:00

The Fundamental Problem With Political Cookies, Microtargeting

The MIT Technology Review recently posted an article titled "Campaigns to Track Voters with 'Political Cookies'." It freaks me out for a reason I'll get to below.

From Technology Review:

The technology involves matching a person's web identity with information gathered about that person offline, including his or her party registration, voting history, charitable donations, address, age, and even hobbies.

Companies selling political targeting services say "microtargeting" of campaign advertising will make it less expensive, more up to the minute, and possibly less intrusive. But critics say there's a downside to political ads that combine offline and online data. "These are not your mom and pop TV ads. These are ads increasingly designed for you--to tell you what you may want to hear," says Jeff Chester, executive director of the Center for Digital Democracy.

Like most conscientious web users, I'm skeeved by the privacy issues of cookies, even as I tolerate them for their convenience.

But the real, immediate, permanent harm of political cookies, as Chester argues, lies in the other kind of privacy: the privacy that lets you avoid public discussion, the (otherwise positive) right to be left alone.

Targeted ads bypass the public. They needn't be subject to civic debate. In fact, they foreclose the very possibility. With political cookies, civic debate about those messages can only happen within the subject's own head.

When MIT Center for Civic Media's own Matt Stempeck and Dan Schultz propose projects like a recommended daily intake for the news, a credibility API, or automatic bullshit detectors, they're doing a great service but not necessarily a public service. Their work implicitly acknowledges -- and they're right -- that a political message is now predominantly a direct communications experience, from a campaign directly to an individual subject.

toward private politics

It's a private experience. Democracy without the demos. By definition and design, there's no public counterpoint to an ad targeted with cookies.

The earliest examples of American democracy took for granted that debate was public, happening among many individuals and associations of them. And a core logic, without which the rest fails, is that people are persuadable. Campaigns would love to persuade, but it's cheaper to reinforce. And reinforcement happens by aggregating individuals' click and spending data, with targeting taking into account predispositions, self-identification, and biases.

There's no need to persuade. No need, it feels, to be persuaded. No need to live outside our own private politics.

A version of this post originally appeared on the MIT Center for Civic Media blog.

Andrew Whitacre is Communications Director for the MIT Center for Future Civic Media, 2007 Knight News Challenge winner. A native of the nation's capital, Whitacre has written on communications policy issues, starting with work on satellite radio as a student at Wake Forest University.

July 27 2012

14:00

The Importance of NextDrop's Customer Cycle, and How to Improve Service

In our last post on PBS Idea Lab, NextDrop, which informs residents in India via cell phone about the availability of piped water, was trying to scale up in a very short period of time. How did we fare?

Well, I think we discovered the first step to winning: Just get good data about yourself. Period. Even if it's ugly. Because after admitting there's something wrong, the second hardest part is wading through the mess and figuring out what exactly that is!

Let me try to lay out everything we discovered about our service.

Customer Side

Goal: Bill everyone possible and make money.

Immediate problem: Billers wasted a lot of time. Even when they found houses (which often proved difficult), many people were getting late messages, no messages at all, or only intermittent messages, so they didn't want to pay for the service (no argument there), or they simply didn't want the service at all.

Immediate solution: Make a list of areas that have been getting regular messages for the past two weeks, and then call all those people before we actually go out and bill.

Immediate Systems We Put in place

Creation of the "Green List": We look through all of our valvemen data, and using the almighty Excel, we figure out which areas received at least four calls within the last two weeks. Our logic here is that since the supply cycle is now once every 3-4 days, if customers are getting regular messages, valvemen should have called in at least four times over a two-week span. This system is by no means perfect, but it's a start, and at least gets us to the next level.
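
For the curious, that Excel filter boils down to something like the following. This is a minimal sketch only: the call-log file name and its column names are placeholders for illustration, not our actual schema.

import csv
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical call log, one row per valveman call: the area name and the call time.
CALL_LOG = 'valvemen_calls.csv'          # assumed file name
cutoff = datetime.now() - timedelta(days=14)

calls_per_area = Counter()
for row in csv.DictReader(open(CALL_LOG)):          # assumed columns: 'area', 'called_at'
    called_at = datetime.strptime(row['called_at'], '%Y-%m-%d %H:%M')
    if called_at >= cutoff:
        calls_per_area[row['area']] += 1

# The "Green List": areas with at least four valveman calls in the last two weeks.
green_list = sorted(area for area, n in calls_per_area.items() if n >= 4)
print green_list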

Conduct phone surveys: After we see all the areas that are on the Green List, we then call all the customers in that area. We spent two weeks piloting the survey to even figure out what categories/questions we should ask, and we've finally got some classifications the sales team feels good about.

Here are the different categories of NextDrop potential customers:

  • Could Not Contact (people who had phones turned off, didn't answer the call, possibly fake numbers)
  • Satisfied Customers
      • Pay (want to pay for service)
      • Continue
      • 1-month Free Trial (again)
      • Deactivate
  • Unsatisfied Customers
      • Not Getting Messages
      • Wrong Messages

Bill: We just bill the people who are satisfied and want to pay, or who are satisfied but want another free month trial (and have already had one).

our customer cycle

Here's a great flow chart that our sales manager made of our customer cycle (and if any engineers out there think this looks familiar, you're right: it is, in fact, a state diagram. This is why I love hiring engineers!). And let me say, this may look easy, but it took two weeks of analyzing customer behavior to even figure out what states to include and how to move from one state to another.

[Flow chart: NextDrop customer cycle state diagram]

When we finally had data, we discovered some really interesting things about our service:

  • Total number of people called: 1,493
  • Total number of people we could contact: 884 (59%)
  • Total number of deactivated customers: 229
    15% of total customers
    26% of contacted customers
  • Total number of continuing customers: 655
    44% of total customers
    74% of contacted customers
  • Total billable customers: 405
    27% of total customers
    46% of contacted customers
  • Total billed customers: 223
    15% of total customers
    25% of contacted customers
    55% of billable customers
  • Total number of people who paid: 95
    6% of total customers
    23% of billable customers
    43% of billed customers

As you can see, the two major problems we identified were: 1) we were unable to contact 41% of the customers we tried to reach, and 2) a majority of the people we were able to contact were getting incorrect messages (54% of contacted customers).
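
As a sanity check, the funnel percentages above can be reproduced with a few lines of arithmetic. A quick sketch, using only the raw counts from the list:

# Reproducing the funnel percentages from the raw counts above.
total, contacted, deactivated = 1493, 884, 229
continuing, billable, billed, paid = 655, 405, 223, 95

def pct(part, whole):
    # percentage rounded to the nearest whole number, as in the list above
    return int(round(100.0 * part / whole))

print pct(contacted, total)                                    # 59: share of called customers we reached
print pct(deactivated, contacted), pct(continuing, contacted)  # 26 deactivated, 74 continuing (of contacted)
print pct(billable, contacted), pct(billed, billable)          # 46 billable, 55 of billable actually billed
print pct(paid, billed), pct(paid, billable)                   # 43 of billed paid, 23 of billable paid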

troubleshooting problems

And that's where we're at: trying to troubleshoot those two problems. Here are the immediate solutions we're putting in place to increase the number of people we can contact and to place customers in the correct valve area.

Instead of taking "Could Not Contact" customers off the billing list, we are going to try to contact them. We're in the process of seeing what percentage of the "Could Not Contact" customers we can actually find and contact when we bill.

We have an intern, Kristine, from UC Berkeley, who will be working with us for the next six months to figure out how to place people in the correct valve area (because that is the critical question now, isn't it?). Kristine's findings are pretty interesting (and definitely deserve their own blog post), but our first prototype is to test a guess-and-check methodology (sketched in code after the list below):

  • First we call customers and find out when they last got water.
  • Then we sort through our data and see which areas got water on that date (plus or minus a few hours). This should eliminate at least 50% of the areas.
  • Then, to narrow it down even further, we only consider those areas that are geographically close to the customer. This should narrow it down to 4-5 areas to check.
  • We subscribe the customer to these areas and see when he/she gets the correct message. (We will find out through the phone survey.)
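
Here's a rough sketch of that narrowing logic in Python. Everything here is hypothetical for illustration: the supply-log file, its column names, and the nearby-areas lookup are placeholders, not our production code.

import csv
from datetime import datetime, timedelta

# Hypothetical lookup of areas near a customer's neighbourhood (placeholder data).
NEARBY_AREAS = {'Vidyanagar': set(['Vidyanagar-1', 'Vidyanagar-2', 'KHB Colony'])}

def candidate_areas(last_water_time, neighbourhood, supply_log='supply_log.csv', window_hours=3):
    # Step 1: areas whose recorded supply falls within a few hours of when the
    # customer says they last got water.
    matches = set()
    for row in csv.DictReader(open(supply_log)):        # assumed columns: 'area', 'supplied_at'
        supplied_at = datetime.strptime(row['supplied_at'], '%Y-%m-%d %H:%M')
        if abs(supplied_at - last_water_time) <= timedelta(hours=window_hours):
            matches.add(row['area'])
    # Step 2: keep only the areas geographically close to the customer.
    return matches & NEARBY_AREAS.get(neighbourhood, set())

# Step 3 (offline): subscribe the customer to each remaining candidate area and
# confirm by phone survey which one delivers the correct message.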

That's what we are going to try -- we'll let you know how that goes.

steps toward progress

In any case, I think the tunnel has a light at the end of it, so that's all we can really ask for -- progress!

And, as always, we will keep you updated on our progress, what seems to work, what doesn't, and more importantly, why.

Additionally, and most importantly, we're hiring! We are looking for enthusiastic and passionate individuals who want to be a part of our team. If you love problem solving and finding creative solutions, we want you!

As always, please feel free to write comments, offer insight, ask questions, or just say hi. Our proverbial door is always open!

A version of this post first appeared on the NextDrop blog.

Anu Sridharan graduated from the University of California, Berkeley in 2010 with a master's degree in civil systems engineering; she received her bachelor's degree from UC Berkeley as well. During her time there, Sridharan researched the optimization of pipe networked systems in emerging economies as well as new business models for the dissemination of water purification technologies for arsenic removal. Sridharan also served as the education and health director for a water and sanitation project in the slums of Mumbai, India, where she piloted a successful volunteer recruitment and community training model.

July 25 2012

21:15

NYTimes Latest Push

The Data Universe team at The New York Times has over the past few months overhauled the company's push infrastructure. What had been accomplished using a vendor's push notification service is now managed through our new DU Push Mobile Center. It's a flexible system that has improved the speed at which we push news alerts out to millions of customers (4 minutes in most cases) and should enable plenty of future enhancements and new services.

April 30 2012

11:58

Working With Excel Spreadsheet Files Without Using Excel…

One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV;-) The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?

There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…

(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)

Firstly, Google Refine. Google Refine is a cross-platform, browser-based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconciliation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):

Once loaded, you get a preview view..

You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.

If I then create a project around this dataset, I can trivially export it again using a format of my own preference:

So that’s one way of using Google Refine as a simple file converter service that allows you to preview, and to a certain extent shape, the data in an XLS spreadsheet, as well as converting it to other file types.

The second approach I want to mention is to use a really handy Python software library (xlrd – Excel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.

import scraperwiki
import xlrd

#cribbing https://scraperwiki.com/docs/python/python_excel_guide/
def cellval(cell):
    # return None for empty cells, otherwise the cell's value
    if cell.ctype == xlrd.XL_CELL_EMPTY: return None
    return cell.value

def dropper(table):
    # drop a previously created ScraperWiki table, if it exists
    if table != '':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

def reGrabber():
    #dropper('GPpracticeLookup')
    url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls'
    # grab the raw spreadsheet bytes and open them with xlrd
    xlbin = scraperwiki.scrape(url)
    book = xlrd.open_workbook(file_contents=xlbin)

    sheet = book.sheet_by_index(0)

    # the column headings sit in row 9 of the sheet (index 8); strip any full stops from the second heading
    keys = sheet.row_values(8)
    keys[1] = keys[1].replace('.', '')
    print keys

    # the data rows start immediately after the heading row
    for rownumber in range(9, sheet.nrows):
        # create dictionary of the row values
        values = [ cellval(c) for c in sheet.row(rownumber) ]
        data = dict(zip(keys, values))
        #print data
        scraperwiki.sqlite.save(table_name='GPpracticeLookup', unique_keys=['Practice Code'], data=data)

#Comment out the next line if you don't want to regrab the data from the original spreadsheet
reGrabber()

You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.
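
For example, back inside a scraper the saved table can be queried with SQL. A quick sketch only, assuming the classic ScraperWiki Python library's sqlite.select call alongside the sqlite.save used above:

import scraperwiki

# Pull a few rows back out of the GPpracticeLookup table populated by reGrabber() above.
for row in scraperwiki.sqlite.select('* from GPpracticeLookup limit 5'):
    print row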

(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)

So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools out there you can appropriate to help you get the data out of the file and into a form you can work with:-)

PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile on Mac OS X?

PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…


April 27 2012

21:09

Exploring GP Practice Level Prescribing Data

Some posts I get a little bit twitchy about writing. Accessing and Visualising Sentencing Data for Local Courts was one, and this is another: exploring practice level prescription data (get the data).

One of the reasons it feels “dangerous” is that the rationale behind the post is to demonstrate some of the mechanics of engaging with the data at a context free level, devoid of any real consideration about what the data represents, whilst using a data set that does have meaning, the interpretation of which can be used as the basis of making judgements about various geographical areas, for example.

The datasets that are the focus of this post relate to GP practice level prescription data. One datafile lists GP practices (I’ve uploaded this to Google Fusion tables), and includes practice name, identifier, and address. I geocoded the Google Fusion tables version of the data according to practice postcode, so we can see on a map how the practices are distributed:

(There are a few errors in the geocoding that could probably be fixed by editing the corresponding data rows, and adding something like “, UK” to the postcode. (I’ve often thought it would be handy if you could force Google Fusion Table’s geocoder to only return points within a particular territory…))

The prescription data includes data at the level of item counts by drug name or prescription item per month for each practice. Trivially, we might do something like take the count of methadone prescriptions for each practice, and plot a map sizing points at the location of each practice by the number of methadone prescriptions by that practice. All well and good if we bear in mind the fact that the data hasn’t been normalised by the size of the practice, doesn’t take into account the area over which the patients are distributed, doesn’t take into account the demographics of the practice’s constituency (or recognise that a particular practice may host a special clinic, or the sample month may have included an event that drew in a large transient population with a particular condition, or whatever). A good example to illustrate this taken from another context might be “murder density” in London. It wouldn’t surprise me if somewhere like Russell Square came out as a hot spot – not because there are lots of murders there, but because a bomb went off on a single occasion killing multiple people… Another example of “crime hot spots” might well be courts or police stations, places that end up being used as default/placeholder locations if the actual location of crime isn’t known. And so on.

The analyst responsible for creating quick and dirty sketch maps will hopefully be mindful of the factors that haven’t been addressed in the construction of a sketch, and will consequently treat with suspicion any result unless they’ve satisfied themselves that various factors have been taken into account, or discount particular results that are not the current focus of the question they are asking themselves of the data in a particular way.

So when it comes to producing a post like this looking at demonstrating some practical skills, care needs to be taken not to produce charts or maps that appear to say one thing when indeed they say nothing… So bear that in mind: this post isn’t about how to generate statistically meaningful charts and tables; it’s about the mechanics of getting rows of data out of big files and into a form we can start to make sense of.

Another reason I’m a little twitchy about this post relates to describing certain skills in an open and searchable/publicly discoverable forum. (This is one reason why folk often demonstrate core skills on “safe” datasets or randomly generated data files.) In the post Googling Nasties and Oopses on University and Public Sector Websites, a commenter asked: “is it really ethical to post that information?” in the context of an example showing how to search for confidential spreadsheet information using a web search engine. I could imagine a similar charge being leveled at a post that describes certain sorts of data wrangling skills. Maybe some areas of knowledge should be limited to the priesthood..?

To mitigate against any risks of revealing things best left undiscovered, I could draw on the NHS Information Centre’s Evaluation and impact assessment – proposal to publish practice-level prescribing data[PDF] as well as the risks acknowledged by the recent National Audit Office report on Implementing transparency (risks to privacy, of fraud, and other possible unintended consequences). But I won’t, for now…. (dangerrrrrroussssssssss…;-)

(Academically speaking, it might be interesting to go through the NHS Info Centre’s risk assessment and see just how far we can go in making those risks real using the released data set as a “white hat data hacker”, for example! I will go through the risk assessment properly in another post.)

So… let the journey into the data begin, and the reason why I felt the need to have a play with this data set:

Note: Due to the large file size (over 500MB) standard spreadsheet applications will not be able to handle the volumes of data contained in the monthly datasets. Data users will need to analyse the information using specialist data-handling software.

Hmmm… that’s not very accessible is it?!

However, if you’ve read my previous posts on Playing With Large (ish) CSV Files or Postcards from a Text Processing Excursion, or maybe even the aforementioned local sentencing data post, you may have some ideas about how to actually work with this file…

So fear not – if you fancy playing along, you should already be set up tooling-wise if you’re on a Mac or a Linux computer. (If you’re on a Windows machine, I can’t really help – you’ll probably need to install something like GnuWin or Cygwin – if any Windows users could add support in the comments, please do:-)

Download the data (all 500MB+ of it – it’s published unzipped/uncompressed (a zipped version comes in at a bit less than 100MB)) and launch a terminal.

I downloaded the December 2011 files as nhsPracticesDec2011.csv and nhsPrescribingDataDec2011.CSV so those are the filenames I’ll be using.

To look at the first few lines of each file we can use the head command:

head nhsPrescribingDataDec2011.CSV
head nhsPracticesDec2011.csv

Inspection of the practices data suggests that counties for each practice are specified, so I can generate a subset of the practices file listing just practices on the ISLE OF WIGHT by issuing a grep (search) command and sending (>) the result to a new file:

grep WIGHT nhsPracticesDec2011.CSV > wightPracDec2011.csv

The file wightPracDec2011.csv should now contain details of practices (one per row) based on the Isle of Wight. We can inspect the first few lines of the file using the head command, or use more to scroll through the data one page at a time (hit the space bar to move on a page, q to quit).

head wightPracDec2011.csv
more wightPracDec2011.csv

Hmmm.. there’s a rogue practice in there from the Wirral – let’s refine the grep a little:

grep 'OF WIGHT' nhsPracticesDec2011.CSV > wightPracDec2011.csv
more wightPracDec2011.csv

From looking at the data file itself, along with the prescribing data release notes/glossary, we can see that each practice has a unique identifier. From previewing the head of the prescription data itself, as well as from the documentation, we know that the large prescription data file contains identifiers for each practice too. So based on the previous steps, can you figure out how to pull out the rows from the prescriptions file that relate to drugs issued by the Ventnor medical centre, which has code J84003? Like this, maybe?

grep J84003 nhsPrescribingDataDec2011.CSV > wightPrescDec2011_J84003.csv
head wightPrescDec2011_J84003.csv

(It may take a minute or two, so be patient…)

We can check how many rows there actually are as follows:

wc -l wightPrescDec2011_J84003.csv

I was thinking it would be nice to be able to get prescription data from all the Isle of Wight practices, so how might we go about that? From reviewing my previous text mining posts, I noticed that I could pull out data from a file by column:

cut -f 2 -d ',' wightPracDec2011.csv

This lists column two of the file wightPracDec2011.csv where columns are comma delimited.

We can send this list of codes to the grep command to pull out records from the large prescriptions file for each of the codes we grabbed using the cut command (I asked on Twitter for how to do this, and got a reply back that seemed to do the trick pretty much by return of tweet from @smelendez):

cut -d ',' -f 2 wightPracDec2011.csv | grep nhsPrescribingDataDec2011.CSV -f - > iwPrescDec2011.csv
more iwPrescDec2011.csv

We can sort the result by column – for example, in alphabetic order by column 5 (-k 5), the drugs column:

sort -t ',' -k 5 iwPrescDec2011.csv | head

Or we can sort by decreasing (-r) total ingredient cost:

sort -t ',' -k 7 -r iwPrescDec2011.csv | head

Or in decreasing order of the largest number of items:

sort -t ',' -k 6 -r iwPrescDec2011.csv | head

One problem with looking at those results is that we can’t obviously recognise the practice. (That might be a good thing, especially if we looked at item counts in increasing order… Whilst we don’t know how many patients were in receipt of one or more items of drug x if 500 or so items were prescribed in the reporting period across several practices, if there is only one item of a particular drug prescribed for one practice, then we’re down to one patient in receipt of that item across the island, which may be enough to identify them…) I leave it as an exercise for the reader to work out how you might reconcile the practice codes with practice names (Merging Datasets with Common Columns in Google Refine might be one way? Merging Two Different Datasets Containing a Common Column With R and R-Studio another..?).
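
(If you do want a quick and dirty route, here is a minimal sketch of that reconciliation in Python rather than Refine or R. The column positions are assumptions based on the description above – practice code and name in columns 2 and 3 of the practices file, practice code in column 3 of the prescriptions extract – so check them against your own copies of the files first.)

import csv

# Build a lookup from practice code to practice name out of the practices file.
# Assumed layout: column 2 = practice code, column 3 = practice name.
names = {}
for row in csv.reader(open('wightPracDec2011.csv')):
    names[row[1].strip()] = row[2].strip()

# Print each Isle of Wight prescription row annotated with the practice name.
# Assumed layout: practice code in column 3 of the prescriptions extract.
for row in csv.reader(open('iwPrescDec2011.csv')):
    print names.get(row[2].strip(), 'UNKNOWN'), ','.join(row)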

Using the iwPrescDec2011.csv file, we can now search to see how many items of a particular drug are prescribed across island practices using searches of the form:

grep Aspirin iwPrescDec2011.csv
grep 'Peppermint Oil' iwPrescDec2011.csv

And this is where we need to start taking a little care… Scanning through that data by eye, a bit of quick mental arithmetic (divide column 7 by column 6) suggests that the unit price for peppermint oil is different across practices. So is there a good reason for this? I would guess that the practices may well be describing different volumes of peppermint oil as single prescription items, which makes a quick item cost calculation largely meaningless? I guess we need to check the data glossary/documentation to confirm (or deny) this?
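
(For what it’s worth, that mental arithmetic looks something like this as a quick Python sketch, again assuming the column layout described above – drug name in column 5, items in column 6, total ingredient cost in column 7, and practice code in column 3.)

import csv

# Rough cost per item for the Peppermint Oil rows: ingredient cost / item count.
for row in csv.reader(open('iwPrescDec2011.csv')):
    if 'Peppermint Oil' in row[4]:
        items, cost = float(row[5]), float(row[6])
        if items:
            print row[2], items, cost, round(cost / items, 2)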

Okay – enough for now… maybe I’ll see how we can do a little more digging around this data in another post…


16:09

Daily Must Reads, April 27, 2012

The best stories across the web on media and technology, curated by Lily Leung.

1. Rupert Murdoch apologizes for hacking scandal (NYT)

2. Providence may sell its stake in Hulu for $2 billion (Bloomberg)

3. Redbox revenue grows 39 percent in the first quarter (LAT)

4. Gawker still embraces anonymous commenters as other media orgs push them away (Gawker)

5. Free data-journalism handbook to launch Saturday (Online Journalism Blog)

6. Why flying drones may be a big part of the future of journalism (Fast Company)

April 23 2012

13:10

NextDrop: Water Utilities in India Need Good Data

In places like the United States, we have access to more data than we ever know what to do with. We measure everything from what the average historical temperature is on a certain day for a city, to how good a restaurant is, to how much energy we consume. Because of this access, we base many of our critical decisions on this data (or at least that's the hope). Essentially, because we have had access, we know how to use this data.

nextdrop.jpg.jpg

However, this isn't the case everywhere.

Fact: Just because you have access to data does not guarantee that you will use it appropriately. Using it appropriately requires behavior change, something that, as anyone will tell you, is incredibly difficult.

This is the hard part about data -- not the production of it, but the usage. This means that simply providing technology is not a solution. It is technology and the realization of the potential results that will produce meaningful change.

making data-driven decisions

This is similar to the situation water utilities are facing in India. There's no real incentive to get good data, and it makes sense. They have many things to worry about -- mainly, the reduction of non-revenue water. Data is tricky, because the results are more of the intangible kind. You need initial buy-in, and lots of time, in order to build your case for making data-driven decisions.

We know that these data-driven decisions will, in time, reduce non-revenue water, but it will take some time. And unfortunately, in a world that wants sexy solutions along with fast results, this does not come easily.

We're hoping that in the future, other stakeholders will promote the acquisition of quality data, and will push the utilities to make data-driven decisions. From academics to other government agencies, we see a need for other stakeholders to push this agenda and create this water data market for water utilities.

And when that happens, NextDrop will be there to provide that quality data to the utilities to help them become more efficient.

April 21 2012

15:36

Insights into data journalism in Argentina

Angelica Peralta Ramos, multimedia development manager at La Nación in Argentina, gave an insight into the challenges of doing data journalism.

In her ISOJ talk, she explained how La Nacion started doing data visualisations with few resources and in a less than friendly government environment.

Peralta pointed out that Argentina ranks 100th out of 180 on the corruption index. The country does not have a freedom of information law and is not part of the open government initiative.

But there is hope, said Peralta. La Nacion wanted to do data journalism but didn’t have any programmers, so they adopted tools for non-programmers such as Tableau Public and Excel.

One of its initiatives involved gathering data on inflation to try to reveal more accurate inflation levels.

The newspaper has been taking public data and seeking to derive meaning from masses of figures.

For example, La Nacion took 400 PDFs with tables of 235,000 rows that recorded subsidies to bus companies to figure out who was getting what.

It is using software to keep track of updates to the PDFs to show how subsidies to the companies are on the rise.

Peralta’s short presentation showed how some media organisations are exploring data journalism in circumstances which are very different to the US or UK.

La Nacion have a data blog and will be posting links to the examples mentioned by Peralta.

15:09

Making data visualisation useful for audiences

At ISOJ, Alberto Cairo, lecturer in visual journalism, University of Miami, raised some critical questions about the visualisation of data in journalism.

Cairo explained that an information graphic is a tool for presenting information and for exploring information.

In the past, infographics were about editing data down and summarising it. But this worries him, he said, as it just presents information and does not allow readers to explore the data.

Today we have the opposite trend, which often ends up as data art that doesn’t help readers understand the data.

Cairo cited a New York Times project mapping neighbourhoods which he said forced readers to become their own reporters and editors to understand the data.

We have to create layers, he said. We have the presentation layer and we have the exploration layer, and these are complementary.

But readers need help to navigate the data, he said. Part of the task is giving clues to readers to understand the complexity of data.

Cairo quoted a visualisation mantra by Ben Shneiderman: “Overview first, zoom and filter, then details-on-demand.”

His approach echoed earlier comments by Brian Boyer, news applications editor, Chicago Tribune Media Group. Boyer said that we should make data beautiful and inspirational, but also useful to the audience.

 

April 20 2012

06:26

Programming and journalism students: A conversation

I think it’s pretty cool to use Storify to sort out the threads of a bunch of simultaneous conversations on Twitter:

[View the story "Programming and journalism students: A conversation" on Storify]

Please join in — on Twitter, on Facebook, or here.

April 12 2012

16:49

‘Hypothesis generator’ helps mine huge datasets

A tool created through a collaboration with Harvard and MIT could soon help journalists find relationships in massive amounts of data — even if they don’t know what they’re looking for.

March 28 2012

14:00

Colorful City Tracking Maps Launch Under Creative Commons

Maps.stamen.com, the second installment of the City Tracking project funded by the Knight News Challenge, is live.

These unique cartographic styles and tiles, based on data from Open Street Map, are available for the entire world, downloadable for use under a Creative Commons Attribution 3.0 license, and free.

takes deep breath

There are three styles available: toner, terrain, and watercolor:

  • Toner is about stripping online cartography down to its absolute essentials. It uses just black and white, describing a baseline that other kinds of data can be layered on. Stripping out any kind of color or image makes it easier to focus on the interactive nature of online cartography: When do different labels show up for different cities? What should the thickness of freeways be at different zoom levels? And so forth. This project is the one that Nathaniel Vaughn Kelso is hacking on at all hours, and it's great to be seeing Natural Earth data get more tightly integrated into the project over time.
  • Terrain occupies a middle ground: "shaded hills, nice big text, and green where it belongs." In keeping with City Tracking's mandate to make it easier for people to tell stories about cities, this is an open-source alternative to Google's terrain maps, and it uses all open-source software like Skeletron to improve on the base line cartographic experience. Mike Migurski has been heading up this design, with help from Gem Spear and Nelson Minar.
  • Watercolor pushes through to the other side of normal, bending the rules of traditional legibility in order to explore some new terrain. It incorporates hand-painted textures and algorithmic rule sets into a design that looks like it's been done by 10,000 slaves in the basement, but is rendered on the fly. Geraldine Sarmiento and Zach Watson did the lion's share of the design and development on this one. This design is a mixed bag for me: I'm delighted to see it out in the world, but it's the thing that's pretty much kept me from looking at anything else for the last month and a half.

The code that runs Toner and Terrain is available for download and use at the City Tracking GitHub repository; we're going to wait on watercolor a little while until we can get some of the kinks ironed out. We talked about waiting to launch until watercolor was all buttoned up, but what with all the attention that Open Street Map has been getting, we decided to just bite the bullet and go for it.

We'll follow up this week with some posts on how everything works and how the sausage is made, and I've got a lot more to say about what I think this implies for what can be done with online maps and data visualization.

In the meantime, have you seen how awesome Los Angeles, Washington, D.C., the Forbidden City, Massachusetts Bay, Key West, London, New Orleans, New York, Versailles, and every other city in the cotton-pickin' world look when you point this thing at it? Holy heck.

[Map previews: Los Angeles, Washington, D.C., the Forbidden City, Massachusetts Bay, Key West, London, New Orleans, New York, San Francisco, Tokyo, Versailles]
