Tumblelog by Soup.io
Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

April 30 2012

11:58

Working With Excel Spreadsheet Files Without Using Excel…

One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV;-) The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?

There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…

(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)

Firstly, Google Refine. Google Refine is a cross-platform, browser based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconcilation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):

Once loaded, you get a preview view..

You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.

If I then create a project around this dataset, I can trivially export it again using a format of my own preference:

So that’s one way of using Google Refine as a simple file converter service that allows you to preview and to a certain extent shape the data in XLS spreadsheet, as well as converting it to other file types.

The second approach I want to mention is to use a really handy Python software library (xlrd – Escel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.

import scraperwiki
import xlrd

#cribbing https://scraperwiki.com/docs/python/python_excel_guide/
def cellval(cell):
    if cell.ctype == xlrd.XL_CELL_EMPTY:    return None
    return cell.value

def dropper(table):
    if table!='':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

def reGrabber():
    #dropper('GPpracticeLookup')
    url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls'
    xlbin = scraperwiki.scrape(url)
    book = xlrd.open_workbook(file_contents=xlbin)

    sheet = book.sheet_by_index(0)        

    keys = sheet.row_values(8)           
    keys[1] = keys[1].replace('.', '')
    print keys

    for rownumber in range(9, sheet.nrows):           
        # create dictionary of the row values
        values = [ cellval(c) for c in sheet.row(rownumber) ]
        data = dict(zip(keys, values))
        #print data
        scraperwiki.sqlite.save(table_name='GPpracticeLookup',unique_keys=['Practice Code'], data=data)

#Uncomment the next line if you want to regrab the data from the original spreadsheet
reGrabber()

You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.

(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)

So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools other there you can appropriate to help you get the data out of the file and into a form you can work with:-)

PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile onto Max OS/X?

PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…


11:58

Working With Excel Spreadsheet Files Without Using Excel…

One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV;-) The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?

There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…

(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)

Firstly, Google Refine. Google Refine is a cross-platform, browser based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconcilation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):

Once loaded, you get a preview view..

You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.

If I then create a project around this dataset, I can trivially export it again using a format of my own preference:

So that’s one way of using Google Refine as a simple file converter service that allows you to preview and to a certain extent shape the data in XLS spreadsheet, as well as converting it to other file types.

The second approach I want to mention is to use a really handy Python software library (xlrd – Escel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.

import scraperwiki
import xlrd

#cribbing https://scraperwiki.com/docs/python/python_excel_guide/
def cellval(cell):
    if cell.ctype == xlrd.XL_CELL_EMPTY:    return None
    return cell.value

def dropper(table):
    if table!='':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

def reGrabber():
    #dropper('GPpracticeLookup')
    url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls'
    xlbin = scraperwiki.scrape(url)
    book = xlrd.open_workbook(file_contents=xlbin)

    sheet = book.sheet_by_index(0)        

    keys = sheet.row_values(8)           
    keys[1] = keys[1].replace('.', '')
    print keys

    for rownumber in range(9, sheet.nrows):           
        # create dictionary of the row values
        values = [ cellval(c) for c in sheet.row(rownumber) ]
        data = dict(zip(keys, values))
        #print data
        scraperwiki.sqlite.save(table_name='GPpracticeLookup',unique_keys=['Practice Code'], data=data)

#Uncomment the next line if you want to regrab the data from the original spreadsheet
reGrabber()

You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.

(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)

So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools other there you can appropriate to help you get the data out of the file and into a form you can work with:-)

PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile onto Max OS/X?

PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…


April 27 2012

21:09

Exploring GP Practice Level Prescribing Data

Some posts I get a little bit twitchy about writing. Accessing and Visualising Sentencing Data for Local Courts was one, and this is another: exploring practice level prescription data (get the data).

One of the reasons it feels “dangerous” is that the rationale behind the post is to demonstrate some of the mechanics of engaging with the data at a context free level, devoid of any real consideration about what the data represents, whilst using a data set that does have meaning, the interpretation of which can be used as the basis of making judgements about various geographical areas, for example.

The datasets that are the focus of this post relate to GP practice level prescription data. One datafile lists GP practices (I’ve uploaded this to Google Fusion tables), and includes practice name, identifier, and address. I geocoded the Google Fusion tables version of the data according to practice postcode, so we can see on a map how the practices are distributed:

(There are a few errors in the geocoding that could probably be fixed by editing the correspond data rows, and adding something like “, UK” to the postcode. (I’ve often thought it would be handy if you could force Google Fusion Table’s geocoder to only return points within a particular territory…))

The prescription data includes data at the level of item counts by drug name or prescription item per month for each practice. Trivially, we might do something like take the count of methadone prescriptions for each practice, and plot a map sizing points at the location of each practice by the number of methadone prescriptions by that practice. All well and good if we bear in mind the fact the the data hasn’t been normalised by the size of the practice, doesn’t take into account the area over which the patients are distributed, doesn’t take into account the demographics of the practices constituency (or recognise that a particular practice may host a special clinic, or the sample month may have included an event that drew in a large transient population with a particular condition, or whatever). A good example to illustrate this taken from another context might be “murder density” in London. It wouldn’t surprise me if somewhere like Russell Square came out as a hot spot – not because there are lots of murders there, but because a bomb went off on a single occasion killing multiple people… Another example of “crime hot spots” might well be courts or police stations, places that end up being used as default/placeholder locations if the actual location of crime isn’t known. And so on.

The analyst responsible for creating quick and dirty sketch maps will hopefully be mindful of the factors that haven’t been addressed in the construction of a sketch, and will consequently treat with suspicion any result unless they’ve satisfied themselves that various factors have been taken into account, or discount particular results that are not the current focus of the question they are asking themselves of the data in a particular way.

So when it comes to producing a post like this looking at demonstrating some practical skills, care needs to be taken not to produce charts or maps that appear to say one thing when indeed they say nothing… So bear that in mind: this post isn’t about how to generate statistically meaningful charts and tables; it’s about mechanics of getting rows of data out of big files and into a form we can start to try to make sense of them

Another reason I’m a little twitchy about this post relates to describing certain skills in an open and searchable/publicly discoverable forum. (This is one reason why folk often demonstrate core skills on “safe” datasets or randomly generated data files.) In the post Googling Nasties and Oopses on University and Public Sector Websites, a commenter asked: “is it really ethical to post that information?” in the context of an example showing how to search for confidential spreadsheet information using a web search engine. I could imagine a similar charge being leveled at a post that describes certain sorts of data wrangling skills. Maybe some areas of knowledge should be limited to the priesthood..?

To mitigate against any risks of revealing things best left undiscovered, I could draw on the NHS Information Centre’s Evaluation and impact assessment – proposal to publish practice-level prescribing data[PDF] as well as the risks acknowledged by the recent National Audit Office report on Implementing transparency (risks to privacy, of fraud, and other possible unintended consequences). But I won’t, for now…. (dangerrrrrroussssssssss…;-)

(Academically speaking, it might be interesting to go through the NHS Info Centre’s risk assessment and see just how far we can go in making those risks real using the released data set as a “white hat data hacker”, for example! I will go through the risk assessment properly in another post.)

So… let the journey into the data begin, and the reason why I felt the need to have a play with this data set:

Note: Due to the large file size (over 500MB) standard spreadsheet applications will not be able to handle the volumes of data contained in the monthly datasets. Data users will need to analyse the information using specialist data-handling software.

Hmmm… that’s not very accessible is it?!

However, if you’ve read my previous posts on Playing With Large (ish) CSV Files or Postcards from a Text Processing Excursion, or maybe even the aforementioned local sentencing data post, you may have some ideas about how to actually work with this file…

So fear not – if you fancy playing along, you should already be set up tooling wise if you’re on a Mac or a Linux computer. (If you’re on a Windows machine, I cant really help – you’ll probably need to install something like gnuwin or Cygwin – if any Windows users could add support in the comments, please do:-)

Download the data (all 500MB+ of it – it’s published unzipped/uncompressed (a zipped version comes in at a bit less than 100MB)) and launch a terminal.

>

I downloaded the December 2011 files as nhsPracticesDec2011.csv and nhsPrescribingDataDec2011.CSV so those are the filenames I’ll be using.

To look at the first few lines of each file we can use the head command:

head nhsPrescribingDataDec2011.CSV
head nhsPracticesDec2011.csv

Inspection of the practices data suggests that counties for each practice are specified, so I can generate a subset of the practices file listing just practices on the ISLE OF WIGHT by issuing a grep (search) command and sending (>) the result to a new file:

grep WIGHT nhsPracticesDec2011.CSV > wightPracDec2011.csv

The file wightPracDec2011.csv should now contain details of practices (one per row) based on the Isle of Wight. We can inspect the first few lines of the file using the head command, or use more to scroll through the data one page at a time (hit space bar to move on a page, ESCape to exit).

head wightPracDec2011.csv
more wightPracDec2011.csv

Hmmm.. there’s a rogue practice in there from the Wirral – let’s refine the grep a little:

grep 'OF WIGHT' nhsPracticesDec2011.CSV > wightPracDec2011.csv
more wightPracDec2011.csv

From looking at the data file itslef, along with the prescribing data release notes/glossary, we can see that each practice has a unique identifier. From previewing the head of the prescription data itself, as well as from the documentation, we know that the large prescription data file contains identifiers for each practice too. So based on the previous steps, can you figure out how to pull out the rows from the prescriptions file that relate to drugs issued by the Ventnor medical centre, which has code J84003? Like this, maybe?

grep J84003 nhsPrescribingDataDec2011.CSV > wightPrescDec2011_J84003.csv
head wightPrescDec2011_J84003.csv

(It may take a minute or two, so be patient…)

We can check how many rows there actually are as follows:

wc -l wightPrescDec2011_J84003.csv

I was thinking it would be nice to be able to get prescription data from all the Isle of Wight practices, so how might we go about that. From reviewing my previous text mining posts, I noticed that I could pull out data from a file by column:

cut -f 2 -d ',' wightPracDec2011.csv

This lists column two of the file wightPracDec2011.csv where columns are comma delimited.

We can send this list of codes to the grep command to pull out records from the large prescriptions file for each of the codes we grabbed using the cut command (I asked on Twitter for how to do this, and got a reply back that seemed to do the trick pretty much by return of tweet from @smelendez):

cut -d ',' -f 2 wightPracDec2011.csv | grep nhsPrescribingDataDec2011.CSV -f - > iwPrescDec2011.csv
more iwPrescDec2011.csv

We can sort the result by column – for example, in alphabetic order by column 5 (-k 5), the drugs column:

sort -t ',' -k 5 iwPrescDec2011.csv | head

Or we can sort by decreasing (-r) total ingredient cost:

sort -t ',' -k 7 -r iwPrescDec2011.csv | head

Or in decreasing order of the largest number of items:

sort -t ',' -k 6 -r iwPrescDec2011.csv | head

One problem with looking at those results is that we can’t obviously recognise the practice. (That might be a good thing, especially if we looked at item counts in increasing order… Whilst we don’t know how many patients were in receipt of one or more items of drug x if 500 or so items were prescribed in the reporting period across several practices, if there is only one item of a particular drug prescribed for one practice, then we’re down to one patient in receipt of that item across the island, which may be enough to identify them…) I leave it as an exercise for the reader to work out how you might reconcile the practice codes with practice names (Merging Datasets with Common Columns in Google Refine might be one way? Merging Two Different Datasets Containing a Common Column With R and R-Studio another..?).

Using the iwPrescDec2011.csv file, we can now search to see how many items of a particular drug are prescribed across island practices using searches of the form:

grep Aspirin iwPrescDec2011.csv
grep 'Peppermint Oil' iwPrescDec2011.csv

And this is where we now start to need taking a little care… Scanning through that data by eye, a bit of quick mental arithmetic (divide column 7 by column 6) suggests that the unit price for peppermint oil is different across practices. So is there a good reason for this? I would guess that the practices may well be describing different volumes of peppermint oil as single prescription items, which makes a quick item cost calculation largely meaningless? I guess we need to check the data glossary/documentation to confirm (or deny) this?

Okay – enough for now… maybe I’ll see how we can do a little more digging around this data in another post…


21:09

Exploring GP Practice Level Prescribing Data

Some posts I get a little bit twitchy about writing. Accessing and Visualising Sentencing Data for Local Courts was one, and this is another: exploring practice level prescription data (get the data).

One of the reasons it feels “dangerous” is that the rationale behind the post is to demonstrate some of the mechanics of engaging with the data at a context free level, devoid of any real consideration about what the data represents, whilst using a data set that does have meaning, the interpretation of which can be used as the basis of making judgements about various geographical areas, for example.

The datasets that are the focus of this post relate to GP practice level prescription data. One datafile lists GP practices (I’ve uploaded this to Google Fusion tables), and includes practice name, identifier, and address. I geocoded the Google Fusion tables version of the data according to practice postcode, so we can see on a map how the practices are distributed:

(There are a few errors in the geocoding that could probably be fixed by editing the correspond data rows, and adding something like “, UK” to the postcode. (I’ve often thought it would be handy if you could force Google Fusion Table’s geocoder to only return points within a particular territory…))

The prescription data includes data at the level of item counts by drug name or prescription item per month for each practice. Trivially, we might do something like take the count of methadone prescriptions for each practice, and plot a map sizing points at the location of each practice by the number of methadone prescriptions by that practice. All well and good if we bear in mind the fact the the data hasn’t been normalised by the size of the practice, doesn’t take into account the area over which the patients are distributed, doesn’t take into account the demographics of the practices constituency (or recognise that a particular practice may host a special clinic, or the sample month may have included an event that drew in a large transient population with a particular condition, or whatever). A good example to illustrate this taken from another context might be “murder density” in London. It wouldn’t surprise me if somewhere like Russell Square came out as a hot spot – not because there are lots of murders there, but because a bomb went off on a single occasion killing multiple people… Another example of “crime hot spots” might well be courts or police stations, places that end up being used as default/placeholder locations if the actual location of crime isn’t known. And so on.

The analyst responsible for creating quick and dirty sketch maps will hopefully be mindful of the factors that haven’t been addressed in the construction of a sketch, and will consequently treat with suspicion any result unless they’ve satisfied themselves that various factors have been taken into account, or discount particular results that are not the current focus of the question they are asking themselves of the data in a particular way.

So when it comes to producing a post like this looking at demonstrating some practical skills, care needs to be taken not to produce charts or maps that appear to say one thing when indeed they say nothing… So bear that in mind: this post isn’t about how to generate statistically meaningful charts and tables; it’s about mechanics of getting rows of data out of big files and into a form we can start to try to make sense of them

Another reason I’m a little twitchy about this post relates to describing certain skills in an open and searchable/publicly discoverable forum. (This is one reason why folk often demonstrate core skills on “safe” datasets or randomly generated data files.) In the post Googling Nasties and Oopses on University and Public Sector Websites, a commenter asked: “is it really ethical to post that information?” in the context of an example showing how to search for confidential spreadsheet information using a web search engine. I could imagine a similar charge being leveled at a post that describes certain sorts of data wrangling skills. Maybe some areas of knowledge should be limited to the priesthood..?

To mitigate against any risks of revealing things best left undiscovered, I could draw on the NHS Information Centre’s Evaluation and impact assessment – proposal to publish practice-level prescribing data[PDF] as well as the risks acknowledged by the recent National Audit Office report on Implementing transparency (risks to privacy, of fraud, and other possible unintended consequences). But I won’t, for now…. (dangerrrrrroussssssssss…;-)

(Academically speaking, it might be interesting to go through the NHS Info Centre’s risk assessment and see just how far we can go in making those risks real using the released data set as a “white hat data hacker”, for example! I will go through the risk assessment properly in another post.)

So… let the journey into the data begin, and the reason why I felt the need to have a play with this data set:

Note: Due to the large file size (over 500MB) standard spreadsheet applications will not be able to handle the volumes of data contained in the monthly datasets. Data users will need to analyse the information using specialist data-handling software.

Hmmm… that’s not very accessible is it?!

However, if you’ve read my previous posts on Playing With Large (ish) CSV Files or Postcards from a Text Processing Excursion, or maybe even the aforementioned local sentencing data post, you may have some ideas about how to actually work with this file…

So fear not – if you fancy playing along, you should already be set up tooling wise if you’re on a Mac or a Linux computer. (If you’re on a Windows machine, I cant really help – you’ll probably need to install something like gnuwin or Cygwin – if any Windows users could add support in the comments, please do:-)

Download the data (all 500MB+ of it – it’s published unzipped/uncompressed (a zipped version comes in at a bit less than 100MB)) and launch a terminal.

>

I downloaded the December 2011 files as nhsPracticesDec2011.csv and nhsPrescribingDataDec2011.CSV so those are the filenames I’ll be using.

To look at the first few lines of each file we can use the head command:

head nhsPrescribingDataDec2011.CSV
head nhsPracticesDec2011.csv

Inspection of the practices data suggests that counties for each practice are specified, so I can generate a subset of the practices file listing just practices on the ISLE OF WIGHT by issuing a grep (search) command and sending (>) the result to a new file:

grep WIGHT nhsPracticesDec2011.CSV > wightPracDec2011.csv

The file wightPracDec2011.csv should now contain details of practices (one per row) based on the Isle of Wight. We can inspect the first few lines of the file using the head command, or use more to scroll through the data one page at a time (hit space bar to move on a page, ESCape to exit).

head wightPracDec2011.csv
more wightPracDec2011.csv

Hmmm.. there’s a rogue practice in there from the Wirral – let’s refine the grep a little:

grep 'OF WIGHT' nhsPracticesDec2011.CSV > wightPracDec2011.csv
more wightPracDec2011.csv

From looking at the data file itslef, along with the prescribing data release notes/glossary, we can see that each practice has a unique identifier. From previewing the head of the prescription data itself, as well as from the documentation, we know that the large prescription data file contains identifiers for each practice too. So based on the previous steps, can you figure out how to pull out the rows from the prescriptions file that relate to drugs issued by the Ventnor medical centre, which has code J84003? Like this, maybe?

grep J84003 nhsPrescribingDataDec2011.CSV > wightPrescDec2011_J84003.csv
head wightPrescDec2011_J84003.csv

(It may take a minute or two, so be patient…)

We can check how many rows there actually are as follows:

wc -l wightPrescDec2011_J84003.csv

I was thinking it would be nice to be able to get prescription data from all the Isle of Wight practices, so how might we go about that. From reviewing my previous text mining posts, I noticed that I could pull out data from a file by column:

cut -f 2 -d ',' wightPracDec2011.csv

This lists column two of the file wightPracDec2011.csv where columns are comma delimited.

We can send this list of codes to the grep command to pull out records from the large prescriptions file for each of the codes we grabbed using the cut command (I asked on Twitter for how to do this, and got a reply back that seemed to do the trick pretty much by return of tweet from @smelendez):

cut -d ',' -f 2 wightPracDec2011.csv | grep nhsPrescribingDataDec2011.CSV -f - > iwPrescDec2011.csv
more iwPrescDec2011.csv

We can sort the result by column – for example, in alphabetic order by column 5 (-k 5), the drugs column:

sort -t ',' -k 5 iwPrescDec2011.csv | head

Or we can sort by decreasing (-r) total ingredient cost:

sort -t ',' -k 7 -r iwPrescDec2011.csv | head

Or in decreasing order of the largest number of items:

sort -t ',' -k 6 -r iwPrescDec2011.csv | head

One problem with looking at those results is that we can’t obviously recognise the practice. (That might be a good thing, especially if we looked at item counts in increasing order… Whilst we don’t know how many patients were in receipt of one or more items of drug x if 500 or so items were prescribed in the reporting period across several practices, if there is only one item of a particular drug prescribed for one practice, then we’re down to one patient in receipt of that item across the island, which may be enough to identify them…) I leave it as an exercise for the reader to work out how you might reconcile the practice codes with practice names (Merging Datasets with Common Columns in Google Refine might be one way? Merging Two Different Datasets Containing a Common Column With R and R-Studio another..?).

Using the iwPrescDec2011.csv file, we can now search to see how many items of a particular drug are prescribed across island practices using searches of the form:

grep Aspirin iwPrescDec2011.csv
grep 'Peppermint Oil' iwPrescDec2011.csv

And this is where we now start to need taking a little care… Scanning through that data by eye, a bit of quick mental arithmetic (divide column 7 by column 6) suggests that the unit price for peppermint oil is different across practices. So is there a good reason for this? I would guess that the practices may well be describing different volumes of peppermint oil as single prescription items, which makes a quick item cost calculation largely meaningless? I guess we need to check the data glossary/documentation to confirm (or deny) this?

Okay – enough for now… maybe I’ll see how we can do a little more digging around this data in another post…


16:09

Daily Must Reads, April 27, 2012

The best stories across the web on media and technology, curated by Lily Leung.

1. Rupert Murdoch apologizes for hacking scandal (NYT)



2. Providence may sell its stake in Hulu for $2 billion (Bloomberg)



3. Redbox revenue grows 39 percent in the first quarter (LAT)



4. Gawker still embraces anonymous commenters as other media orgs push them away (Gawker)



5. Free data-journalism handbook to launch Saturday (Online Journalism Blog)



6. Why flying drones may be a big part of the future of journalism (Fast Company)




Subscribe to our daily Must Reads email newsletter and get the links in your in-box every weekday!



Subscribe to Daily Must Reads newsletter

This is a summary. Visit our site for the full post ».

13:38

April 23 2012

13:10

NextDrop: Water Utilities in India Need Good Data

In places like the United States, we have access to more data than we ever know what to do with. We measure everything from what the average historical temperature is on a certain day for a city, to how good a restaurant is, to how much energy we consume. Because of this access, we base many of our critical decisions on this data (or at least that's the hope). Essentially, because we have had access, we know how to use this data.

nextdrop.jpg.jpg

However, this isn't the case everywhere.

Fact: Just because you have access to data, it does not guarantee that you will use it appropriately. Using it appropriately requires behavior change, something that, any person will tell you, is incredibly difficult.

This is the hard part about data -- not the production of it, but the usage. This means that simply providing technology is not a solution. It is technology and the realization of the potential results that will produce meaningful change.

making data-driven decisions

This is similar to the situation water utilities are facing in India. There's no real incentive to get good data, and it makes sense. They have many things to worry about -- mainly, the reduction of non-revenue water. Data is tricky, because the results are more of the intangible kind. You need initial buy-in, and lots of time, in order to build your case for making data-driven decisions.

We know that these data-driven decisions will, in time, reduce non-revenue water, but it will take some time. And unfortunately, in a world that wants sexy solutions along with fast results, this does not come easily.

We're hoping that in the future, other stakeholders will promote the acquisition of quality data, and will push the utilities to make data-driven decisions. From academics, to other government agencies, we see a need from other stakeholders to push this agenda and create this water data market for water utilities.

And when that happens, NextDrop will be there to provide that quality data to the utilities to help them become more efficient.

April 21 2012

15:36

Insights into data journalism in Argentina

Angelica Peralta Ramos, multimedia development manager, La Nación in Argentina, gave an insight into the challenges of doing data journalism.

In her ISOJ talk, she explained how La Nacion started doing data visualisations with few resources and in a less than friendly government environment.

Peralta pointed out that Argentina ranks 100 out of 180 in corruption index. The country does not have a freedom of information law and it not part of the open government initiative.

But there is hope said Peralta. La Nacion wanted to do data journalism but didn’t have any programmers so they adopted tools for non programmers such as Tableau Public and Excel.

One of its initiatives involved gathering data on inflation to try to reveal more accurate inflation levels.

The newspaper has been taking public data and seeking to derive meaning from masses of figures.

For example, La Nacion took 400 PDFs with tables of 235,000 rows that recorded subsidies to bus companies to figure out who was getting what.

It is using software to keep track of updates to the PDFs to show how subsidies to the companies are on the rise.

Peralta’s short presentation showed how some media organisations are exploring data journalism in circumstances which are very different to the US or UK.

La Nacion have a data blog and will be posting links to the examples mentioned by Peralta.

15:09

Making data visualisation useful for audiences

At ISOJ, Alberto Cairo, lecturer in visual journalism, University of Miami, raised some critical questions about the visualisation of data in journalism.

Cairo explained that an information graphic is a tool for presenting information and for exploring information.

In the past, info graphics were about editing data down and summarising it. But this worries me, he says, as it is just presenting information but does not allow readers to explore the data.

Today we have the opposite trend and often ends up as data art which doesn’t help readers understand the data.

Cairo cited a New York Times project mapping neighbourhoods which he said forced readers to become their own reporters and editors to understand the data.

We have to create layers, he said. We have the presentation layer and we have the exploration layer, and these are complementary.

But readers need help to navigate the data, he said. Part of the task is giving clues to readers to understand the complexity of data.

Cairo quoted a visualistion mantra by Ben Shneiderman: “Overview first, zoom and filter, then details-on-demand.”

His approached echoed earlier comments by Brian Boyer, news applications editor, Chicago Tribune Media Group. Boyer said that we should make data beautiful, inspirational but make it useful to the audience.

 

April 20 2012

06:26

Programming and journalism students: A conversation

I think it’s pretty cool to use Storify to sort out the threads of a bunch of simultaneous conversations on Twitter:

[View the story "Programming and journalism students: A conversation" on Storify]

Please join in — on Twitter, on Facebook, or here.

06:26

Programming and journalism students: A conversation

I think it’s pretty cool to use Storify to sort out the threads of a bunch of simultaneous conversations on Twitter:

[View the story "Programming and journalism students: A conversation" on Storify]

Please join in — on Twitter, on Facebook, or here.

April 12 2012

16:49

‘Hypothesis generator’ helps mine huge datasets

A tool created through a collaboration with Harvard and MIT could soon help journalists find relationships in massive amounts of data — even if they don’t know what they’re looking for. Read More »

March 28 2012

14:00

Colorful City Tracking Maps Launch Under Creative Commons

Maps.stamen.com, the second installment of the City Tracking project funded by the Knight News Challenge, is live.

These unique cartographic styles and tiles, based on data from Open Street Map, are available for the entire world, downloadable for use under a under a Creative Commons Attribution 3.0 license, and free.

takes deep breath

There are three styles available: toner, terrain, and watercolor:

  • Toner is about stripping online cartography down to its absolute essentials. It uses just black and white, describing a baseline that other kinds of data can be layered on. Stripping out any kind of color or image makes it easier to focus on the interactive nature of online cartography: When do different labels show up for different cities? What should the thickness of freeways be at different zoom levels? And so forth. This project is the one that Nathaniel Vaughn Kelso is hacking on at all hours, and it's great to be seeing Natural Earth data get more tightly integrated into the project over time.
  • Terrain occupies a middle ground: "shaded hills, nice big text, and green where it belongs." In keeping with City Tracking's mandate to make it easier for people to tell stories about cities, this is an open-source alternative to Google's terrain maps, and it uses all open-source software like Skeletron to improve on the base line cartographic experience. Mike Migurski has been heading up this design, with help from Gem Spear and Nelson Minar.
  • Watercolor pushes through to the other side of normal, bending the rules of traditional legibility in order to explore some new terrain. It incorporates hand-painted textures and algorithmic rule sets into a design that looks like it's been done by 10,000 slaves in the basement, but is rendered on the fly. Geraldine Sarmiento and Zach Watson did the lion's share of the design and development on this one. This design is a mixed bag for me: I'm delighted to see it out in the world, but it's the thing that's pretty much kept me from looking at anything else for the last month and a half.

The code that runs Toner and Terrain is available for download and use at the City Tracking GitHub repository; we're going to wait on watercolor a little while until we can get some of the kinks ironed out. We talked about waiting to launch until watercolor was all buttoned up, but what with all the attention that Open Street Map has been getting, we decided to just bite the bullet and go for it.

We'll follow up this week with some posts on how everything works and how the sausage is made, and I've got a lot more to say about what I think this implies for what can be done with online maps and data visualization.

In the meantime, have you seen how awesome Los Angeles, Washington, D.C., the Forbidden City, Massachusetts Bay, Key West, London, New Orleans, New York, Versailles, and every other city in the cotton-pickin' world look when you point this thing at it? Holy heck.

Los Angeles

Washington, D.C.

The Forbidden City

Massachusetts Bay

Key West

London

New Orleans

New York

San Francisco

Tokyo

Versailles

February 28 2012

16:12

How a conference taught me I know nothing

The 2012 Computer-Assisted Reporting conference in St. Louis provided journalists with plenty of new reporting tools. Here's our top-15 list of applications and websites from the weekend. Read More »

February 10 2012

18:00

Still shaping the way people think about news innovation? A few reflections on the new KNC 2.0

As someone who probably has spent more time thinking about the Knight News Challenge than anyone outside of Knight Foundation headquarters — doing a dissertation on the subject will do that to you! — I can’t help but follow its evolution, even after my major research ended in 2010. And evolve it has: from an initial focus on citizen journalism and bloggy kinds of initiatives (all the rage circa 2007, right?) to a later emphasis on business models, visualizations, and data-focused projects (like this one) — among a whole host of other projects including news games, SMS tools for the developing world, crowdsourcing applications, and more.

Now, after five years and $27 million in its first incarnation, Knight News Challenge 2.0 has been announced for 2012, emphasizing speed and agility (three contests a year, eight-week turnarounds on entries) and a new topical focus (the first round is focused on leveraging existing networks). While more information will be coming ahead of the February 27 launch, here are three questions to chew on now.

Does the Knight News Challenge still dominate this space?

The short answer is yes (and I’m not just saying that because, full disclosure, the Knight Foundation is a financial supporter of the Lab). As I’ve argued before, in the news innovation scene, at this crossroads of journalism and technology communities, the KNC has served an agenda-setting kind of function — perhaps not telling news hipsters what to think regarding the future of journalism, but rather telling them what to think about. So while folks might disagree on the Next Big Thing for News, there’s little question that the KNC has helped to shape the substance and culture of the debate and the parameters in which it occurs.

Some evidence for this comes from the contest itself: Whatever theme/trend got funded one year would trigger a wave of repetitive proposals the next. (As Knight said yesterday: “Our concern is that once we describe what we think we might see, we receive proposals crafted to meet our preconception.”)

And yet the longer answer to this question is slightly more nuanced. When the KNC began in 2006, with the first winners named in 2007, it truly was the only game in town — a forum for showing “what news innovation looks like” unlike any other. Nowadays, a flourishing ecosystem of websites (ahem, like this one), aggregators (like MediaGazer), and social media platforms is making the storyline of journalism’s reboot all the more apparent. It’s easier than ever to track who’s trying what, which experiments are working, and so on — and seemingly in real time, as opposed to a once-a-year unveiling. Hence the Knight Foundation’s move to three quick-fire contests a year, “as we try to bring our work closer to Internet speed.”

How should we define the “news” in News Challenge?

One of the striking things I found in my research (discussed in a previous Lab post) was that Knight, in its overall emphasis, has pivoted away from focusing mostly on journalism professionalism (questions like “how do we train/educate better journalists?”) and moved toward a broader concern for “information.” This entails far less regard for who’s doing the creating, filtering, or distributing — rather, it’s more about ensuring that people are informed at the local community level. This shift from journalism to information, reflected in the Knight Foundation’s own transformation and its efforts to shape the field, can be seen, perhaps, like worrying less about doctors (the means) and more about public health (the ends) — even if this pursuit of health outcomes sometimes sidesteps doctors and traditional medicine along the way.

This is not to say that Knight doesn’t care about journalism. Not at all. It still pours millions upon millions of dollars into clearly “newsy” projects — including investigative reporting, the grist of shoe-leather journalism. Rather, this is about Knight trying to rejigger the boundaries of journalism: opening them up to let other fields, actors, and ideas inside.

So, how should you define “news” in your application? My suggestion: broadly.

What will be the defining ethos of KNC 2.0?

This is the big, open, and most interesting question to me. My research on the first two years of KNC 1.0, using a regression analysis, found that contest submissions emphasizing participation and distributed knowledge (like crowdsourcing) were more likely to advance, all things being equal. My followup interviews with KNC winners confirmed this widely shared desire for participation — a feeling that the news process not only could be shared with users, but in fact should be.

I called this an “ethic of participation,” a founding doctrine of news innovation that challenges journalism’s traditional norm of professional control. But perhaps, to some extent, that was a function of the times, during the roughly 2007-2010 heyday of citizen media, with the attendant buzz around user-generated content as the hot early-adopter thing in news — even if news organizations then, as now, struggled to reconcile and incorporate a participatory audience. Even while participation has become more mainstream in journalism, there are still frequent flare-ups, like this week’s flap over breaking news on Twitter, revealing enduring tensions at the “collision of two worlds — when a hierarchical media system in the hands of the few collides with a networked media system open to all,” as Alfred Hermida wrote.

So what about this time around? Perhaps KNC 2.0 will have an underlying emphasis on Big Data, algorithms, news apps, and other things bubbling up at the growing intersection of computer science and journalism. It’s true that Knight is already underwriting a significant push in this area through the (also just-revised) Knight-Mozilla OpenNews project (formerly called the Knight-Mozilla News Technology Partnership — which Nikki Usher and I have written about for the Lab). To what extent is there overlap or synergy here? OpenNews, for 2012, is trying to build on the burgeoning “community around code” in journalism — leveraging the momentum of Hacks/Hackers, NICAR, and ONA with hackfests, code-swapping, and online learning. KNC 2.0, meanwhile, talks about embracing The Hacker Way described by Mark Zuckerberg — but at the same time backs away a bit from its previous emphasis on open source as a prerequisite. It’ll be interesting to see how computational journalism — explained well in this forthcoming paper (PDF here) by Terry Flew et al. in Journalism Practice — figures into KNC 2.0.

Regardless, the Knight News Challenge is worth watching for what it reveals about the way people — journalists and technologists, organizations and individuals, everybody working in this space — talk about and make sense of “news innovation”: what it means, where it’s taking us, and why that matters for the future of journalism.

February 05 2012

06:04

Python Web Scraping 101 (Hacks/Hackers)

Python Web Scraping 101 resources, links, etc. How to do it.

February 03 2012

19:10

Google Privacy Policy And What It Means For The Nptech World

Last week Google announced their new privacy policy to the world. The changes in the way that Google combines and uses information one shares with its services is effective in less than a month, on March the 1st. There is a few absolutely basic facts that every Internet user (be it a Google ID user or not) should be aware of in the context of the change, and I will try to brief them here. I would love to learn and understand how exactly non-profit organizations will be affected by the new policy -- I understand that this is a very complex issue, and it is still hard to distill how this situation will be different and unique for the civil sector in particular. It doesn’t make the questions any less important or urging for an answer though. The new Google Privacy Policy run about 10,000 words, and I strongly recommend the read.

Starting March 1st any information that Google engines tracked so far, and used for customizing a specific tool of your use (e.g. you must have noted the search results being differently positioned based on how you used the engine before) will be now available almost across the entire spectrum of Google products: “If you're signed in, we may combine information you've provided from one service with information from other services (...). In short, we'll treat you as a single user across all our products, which will mean a simpler, more intuitive Google experience.”-- Google's director of privacy, product and engineering, Alma Whitten wrote in a blog post. 

 

Intuitive Experience vs. Privacy Violation

 

Whitten’s creativity goes further and can be very specific: “Google will be able to provide reminders that you’re going to be late for a meeting based on your location, your calendar and an understanding of what traffic is like that day” -- she wrote. The policy will obviously apply to mobile Internet use, particularly in case of any Android phones, and e.g. a new Kindle Fire. Because you have to sign in to your Google account to do anything except for browse the Web and make phone calls, Google will be able to track practically anything you do on your phone using Google services. 

For many life might become simpler with all the data as well as big data being processed and customized for them by Google -- there is an upside of the change that will add to a Google user experience. However, the levels of your somewhat enthusiastic attitude towards the changes differ according to how much you are willing to share with Google, and how strongly you are inclined to believe in their “don’t be evil” motto. Google pointed out that cookies and 'identifiers' will not be tagged to sensitive categories, such as those based on race, religion, sexual orientation or health. Google has done a great job explaining the change through articles, blog posts and various following the announcement.

 

Coming This March -- Steady?

 

The policy will come to life starting March 1. If you already are a Google ID user you can’t really opt -out. Google can only integrate your information if you are signed in. For example, if you’re signed in to your Gmail account on one tab, and then decide to look up a clip on YouTube on another tab without signing out of your e-mail, the data will be integrated. If you sign out or look up a YouTube clip on a different browser, the data won’t be integrated. 

Another thing, that I would strongly encourage you to do is to take a closer look at your Google privacy settings. As the policy itself advices you can:

  • Review and control certain types of information tied to your Google Account by using Google Dashboard.
  • View and edit your ads preferences, such as which categories might interest you, using the Ads Preferences Manager. 
  • Use Google editor to see and adjust how your Google Profile appears to particular individuals.
  • Control who you share information with.
  • Take information out of many of Google services.

 

In the end, last but not least, you can always pull out your data from the Google Services. To learn more about liberating your data check out the Data Liberation Front manual

 

What Does It Mean For NGOs?

For these who have been observing Google development, and their struggle to monopolize the Internet, the policy change shouldn’t come as a surprise. Nevertheless, it does come as a statement, and should be re-thought by these who wish to be informed Internet users. Apart from asking ourselves questions about how to navigate through the Google changes, we should also take time to decide what should be a stand of a non-profit organization we represent -- work or collaborate with. How will the organizations stakeholders be affected by us using Google tools? How to avoid Google when working on a cloud? Is there a way back?

 

For these who fear the changes there are always, admitteddly less user-friendly but privacy sensitive, open source collaboration tools and platforms such as Etherpads, Zoho (commercial solutions) andalternative social media tools like Diaspora or Identica. In addition, in the shadow of the policy change Microsoft sensed an opportunity to fish for new clients, and claims their products are safer and treat your data with more respect. To see how subtly Microsoft wants to convince you to leave Google for them check this, ironically, youtube video

 

What Is Next?

It is probably too early to know how Google privacy policy will drive the change in how we use the Internet, and probably the majority of the users will stay with Google nevertheless.

What will you do? Will this change affect you? Were you heavily  relying on Google product.

Share your thoughts, emotions and questions in the comments or poke us via our social media channels.

 

 

Learn More:

January 18 2012

15:20

OpenCourt Coaxes Out More Data with Cooperative Coverage Day

A version of this post first appeared on the OpenCourt blog.

A man charged with selling drugs inside the courthouse. A woman said to have shoplifted $5 worth of barbeque chicken wings. A man charged with multiple counts of raping a child with force. A longtime Drug Court participant booted from the program for taking a non-narcotic pill (still against the rules). Everyone brought back to court owing fees or victim restitution in previously dismissed cases. A man on psychotropic medication charged with shoplifting a Stop and Shop cart full of meat and pulling a knife when confronted in the parking lot. A naked hiker in the Blue Hills whose defense to lewd behavior is being raised as a naturist. OUIs. Restraining order hearings. A wife sectioning her husband for alcoholism.

opencourt.jpg

OpenCourt has been streaming public court hearings from the First Session courtroom in the Quincy District Court in Massachusetts since May of 2011. We've received feedback about how our viewers use and value the footage, and we realize it would be useful to show more of the court's daily business -- not just the cross-section that comes through the First Session.

While holding to our goal to carve a plausible model for other courthouses, we've often asked ourselves how we and other journalists around the country could do a better job shedding light on a bigger portion of the iceberg's tip, and not necessarily using as much expensive technology.

In other words, just how much business does all of Quincy District Court do in a single day? How can we more fully capture the breadth of cases heard, day in and day out?

Cooperative Coverage

To help answer these questions, last month we hosted a cooperative coverage event at the court, an open invitation to citizen and traditional journalists alike to help us gather notes about everything that transpires in the building's six public courtrooms.

Our combined notes, which you can read here on our blog -- gathered between myself, our producer Val Wang, two Patriot Ledger reporters, two Harvard Berkman interns, one State House News reporter, and three citizen journalists -- are inevitably incomplete. But we hope this coming together shows more fully the wide array of hearings before the court, the sheer volume of cases, and the fact that this is all happening every day, outside of normal public view.

We realized it's easier than you might think for loosely affiliated citizens to collaborate on a one-off project (read: Twitter + Google Docs).

There were unsurprisingly a wide variety of cases. Some rough tallies: Assault & Battery (15, of which 3 were labeled as domestic violence), disorderly conduct (4), trespassing (3), resisting arrest (1), uninsured and/or unlicensed motor vehicle operation (5), speeding (3), shoplifting (5), larceny over $250 (1), receiving stolen property (2), distribution of an illegal substance (3), section 35 (1), sealed record request (3), interpreter needed (1).

This day was exceptional not by any standard of caseload or substance, only that more of us were there to see it and relay stories. For me and Val, the longer we're in court streaming, the clearer it is that we're sitting on a relatively unchecked sociological goldmine.

Opening Court Data

These notes from last month's experiment are, at the least, a compelling glance at the river of data flowing through our local courts every single day.

At best they offer a new angle on approaching larger questions: How do we get to a place where public court data is more accessible? Why aren't the stats being tracked more extensively and automatically in the name of scientifically diagnosing societal ills?

The Boston Globe recently published an extensive three-part series on Massachusetts drunken driving prosecutions, which undoubtedly required massive reporting energy. While that energy will always be required for strong narrative journalism, shouldn't reporters and the public at large alike have easier access to court proceedings to begin with? Wouldn't the state be better off if tracking the operation of its courts didn't require the Herculean effort of a crack, paid investigative team?

Thanks again to everyone who helped make this possible. Our aim at this point, as always, is to provide a window into the everyday landscape of our legal system. Beyond that, we hope efforts like this lead to smarter methods to inform and awaken the public -- to be a better radar for a community's prevalent crimes.

What do you think? What do you see? What should we do differently if we host another event like this?

January 09 2012

20:10

NYT Districts API helps Fractured Atlas help artists

The arts, and the benefits to the public they provide, sometimes gets lost, barely noticed by government. Fractured Atlas, a New York City-based multi-disciplinary arts service organization, is finding that by creating information and data services for its members and for city arts communities, it can also provide more effective advocacy for the arts to government. Fractured Atlas uses the New York Times' District API to create and run these services and to link them to its arts advocacy mission.

January 05 2012

15:20

Feed Your PANDA With New APIs and Excel Import

PANDA_reasonably_small.jpg

Last time I wrote it was to solicit ideas for PANDA's API. We've since implemented those ideas, and we've just released our third alpha, which includes a complete writable API, demo scripts showing how to import from three different data sources, and the ability to import data from Excel spreadsheets.

The PANDA project aims to make basic data analysis quick and easy for news organizations, and make data sharing simple.

Try Alpha 3 now.

Hello, Write API

Our new write API is designed to be as simple and consistent as possible. We've gone to great lengths to illustrate how it works in our new API documentation. We've also provided three example scripts showing how to populate PANDA with data from:

Using these scripts as a starting point, any programmer with a little bit of Python knowledge should be able to easily import data from an SQL database, local file or any other arcane data source they can conjure up in the newsroom.

Excel support

Also included in this release is support for importing from Excel .xls and .xlsx files. It's difficult to promise that this support will work for every Excel file anyone can find to throw at it, but we've had good results with files produced from both Windows and Mac versions of Excel, as well as from OpenOffice on Mac and Linux.

Our Alpha 4 release will be coming at the end of January, followed quickly by Beta 1 around the time of NICAR. To see what we have planned, check out our Release schedule.

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.