
April 11 2011

18:22

Scrape it – Save it – Get it

I imagine I’m talking to a load of developers. Which is odd seeing as I’m not a developer. In fact, I decided to lose my coding virginity by riding the ScraperWiki digger! I’m a journalist interested in data as a beat so all I need to do is scrape. All my programming will be done on ScraperWiki, as such this is the only coding home I know. So if you’re new to ScraperWiki and want to make the site a scraping home-away-from-home, here are the basics for scraping, saving and downloading your data:

With these three simple steps you can take advantage of what ScraperWiki has to offer – writing, running and debugging code in an easy to use editor; collaborative coding with chat and user viewing functions; a dashboard with all your scrapers in one place; examples, cheat sheets and documentation; a huge range of libraries at your disposal; a datastore with API callback; and email alerts to let you know when your scrapers break.
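By way of illustration, here's a minimal sketch of that scrape–save–get loop in plain Python, using only the standard library. (This is not ScraperWiki's own API – its `scraperwiki` library wraps the same pattern with its own fetch and datastore-save helpers – and the table here is inline sample data rather than a live page.)

```python
import sqlite3
from html.parser import HTMLParser

# A stand-in for a fetched page; a real scraper would download this
# (e.g. with urllib, or ScraperWiki's own fetch helper).
PAGE = """
<table>
  <tr><td>Alice</td><td>120.50</td></tr>
  <tr><td>Bob</td><td>75.00</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

# Scrape: parse the rows out of the HTML.
parser = TableParser()
parser.feed(PAGE)

# Save: write them into a local SQLite datastore.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE claims (name TEXT, amount REAL)")
db.executemany("INSERT INTO claims VALUES (?, ?)",
               [(name, float(amount)) for name, amount in parser.rows])

# Get: query the saved data back out.
total = db.execute("SELECT SUM(amount) FROM claims").fetchone()[0]
print(total)  # 195.5
```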

So give it a go and let us know what you think!


June 25 2010

12:51

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it; (you also need to be logged in to a Google account to view the data, both as a spreadsheet, and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a value-add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, just pasting the spreadsheet key and authorisation token into the cut down Guardian datastore explorer script I used in Using CSV Docs As a Database generates an explorer for the expenses data.

So, for example, we can run a report to group expenses by category and MP:

MP expenses explorer

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A, sum(E) where E>0, grouping by A, ordering by sum(E) asc, and viewing as a column chart:

Datastore exploration

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with the lowest claims to date (the datastore blog post explains why the data is incomplete and should be treated warily).

Limiting results in datastore explorer

Changing the asc order to desc in the above query gives a possibly more interesting result: the MPs who have the largest claims to date (presumably because they have got round to filing their claims! ;-)

Datastore exploring
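The select/group/order/limit operations in these explorer queries are expressed in the Google Visualization API query language. Here's a hedged sketch of how such a query URL might be put together in Python – the endpoint path and the SPREADSHEET_KEY, AUTH_TOKEN and authkey parameter names are placeholder assumptions for illustration, not the Guardian's actual values:

```python
from urllib.parse import urlencode

# Placeholder values -- the real key and token come from the shared
# Guardian link and are deliberately not reproduced here.
KEY = "SPREADSHEET_KEY"
TOKEN = "AUTH_TOKEN"

def datastore_query(tq, key=KEY, token=TOKEN):
    """Build a (hypothetical) Visualization API query URL for a spreadsheet."""
    params = urlencode({"key": key, "authkey": token, "tq": tq})
    return "http://spreadsheets.google.com/tq?" + params

# Group expenses (column E) by MP (column A), smallest totals first,
# capped at ten rows -- the same shape as the explorer queries above.
url = datastore_query(
    "select A, sum(E) where E > 0 group by A order by sum(E) asc limit 10")
print(url)
```

Swapping `asc` for `desc` in the `tq` string gives the largest-claims-first variant discussed above.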

Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data? Does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of the data that is otherwise publicly available, one that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of license under which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap license conditions to the data that I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?


June 08 2010

13:21

Liberating Data from the Guardian… Has it Really Come to This?

When the data is the story, should a news organisation make it available? When the Telegraph started trawling through MPs’ expenses data it had bought from a source, industry commentators started asking questions around whether it was the Telegraph’s duty to release that data (e.g. Has Telegraph failed by keeping expenses process and data to itself?).

Today, the Guardian released its University guide 2011: University league table, as a table:

Guardian university tables, sort of

Yes, this is data, sort of (though the javascript applied to the table means that it’s hard to just select and copy the data from the page – unless you turn javascript off, of course):

Data grab

but it’s not as if the Guardian are republishing the data in their datastore, as they did with these league tables…:

Guardian datastore

…which was actually a republication of data from the THES… ;-)

I’ve been wondering for some time when this sort of apparent duplicity was going to occur… the Guardian datastore has been doing a great job of making data available (as evidenced by its award from the Royal Statistical Society last week, which noted: “there was commendable openness with data, providing it in easily accessible ways”) but when the data is “commercially valuable” data to the Guardian, presumably in terms of being able to attract eyeballs to Guardian Education web pages, there seems to be some delay in getting the data onto the datastore… (at least, it isn’t there yet/wasn’t published contemporaneously with the original story…)

I have to admit I’m a bit wary about writing this post – I don’t want to throw any spanners in the works as far as harming the work being done by the Datastore team goes – but I can’t not…

So what do we learn from this about the economics of data in a news environment?

- data has creation costs;
- there may be a return to be had from maintaining limited, privileged or exclusive access to the data as data OR as information, where information is interpreted, contextualised or visualised data, or is valuable in the short term (as, for example, in the case of financial news). By withholding access to data, publishers maintain the ability to generate views or analyses of the data that they can create stories, or attractive website content, around. (Just by the by, I noticed that an interactive Many Eyes widget was embedded in a Guardian Datablog post today :-)
- if you’ve incurred the creation cost, maybe you have a right to a limited period of exclusivity with respect to profiting from that content. This is what intellectual property rights try to guarantee, at least until the Mickey Mouse lawyers get upset about losing their exclusive right to profit from the content.

I think (I think) what the Guardian is doing is not so different to what the Telegraph did. A cost was incurred, and now there is a (hopefully limited) period in which an attempt is being made to generate some sort of return. But there’s a problem, I think, with the way it looks, especially given the way the Guardian has been championing open data access. Maybe the data should have been posted to the datablog, but with access permissions denied until a stated date, so that at least people could see the data was going to be made available.

What this has also thrown up, for me at least, is the question as to what sort of “contract” the datablog might have, implied or otherwise, with third parties who develop visualisations based on data in the Guardian Datastore, particularly if those visualisations are embeddable and capable of generating traffic (i.e. eyeballs, = ad impressions, = income…).

It also gets me wondering; does there need to be a separate datastore? Or is the ideal case where the stories themselves are linking out to datasets directly? (I suppose that would make it hard to locate the data? On second thoughts, the directory datastore approach is much better…)

Related: Time for data.ac.uk? Or a local data.open.ac.uk?

PS I toyed with the idea of republishing all the data from the Guardian Education pages in a spreadsheet somewhere, and then taking my chances with the lawyers in the court of public opinion, but instead, here’s a howto:

Scraping data from the Grauniad

So just create a Google spreadsheet (you don’t even need an account: just go to docs.google.com/demo), double click on cell A1 and enter:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table", "table", 1)

and then you’ll be presented with the data, in a handy spreadsheet form, from:
http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table

For the subject pages – e.g. Agriculture, Forestry and Food – paste in something like:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-guide-agriculture-forestry-and-food", "table", 1)

You can probably see the pattern… ;-)

(You might want to select all the previously filled cells and clear them first so you don’t get the data sets messed up. If you’ve got your own spreadsheet, you could always create a new sheet for each table. (It is also possible to automate the scraping of all the tables using Google Apps script: Screenscraping With Google Spreadsheets App Script and the =importHTML Formula gives an example how…))
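If you’d rather generate the formulas than type them, the URL pattern lends itself to a few lines of Python. (The subject slugs below are assumptions read off the example URLs above, not a verified list of the Guardian’s subject pages.)

```python
# Build =ImportHtml formulas for each subject table, following the URL
# pattern visible in the examples above.
BASE = "http://www.guardian.co.uk/education/table/2010/jun/04/"

# Assumed slugs, taken from the two example URLs quoted in the post.
subjects = [
    "university-league-table",
    "university-guide-agriculture-forestry-and-food",
]

def import_html_formula(slug, table_index=1):
    """Return a spreadsheet formula that pulls table `table_index` from the page."""
    return '=ImportHtml("{}{}", "table", {})'.format(BASE, slug, table_index)

for slug in subjects:
    print(import_html_formula(slug))
```

Paste each generated formula into cell A1 of its own sheet and the tables import themselves.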

An alternative route to the data is via YQL:

Scraping HTML table data in YQL

Enjoy…;-) And if you do grab the data and produce some interesting visualisations, feel free to post a link back here… ;-) To give you some ideas, here are a few examples of education data related visualisations I’ve played around with previously.

PPS it’ll be interesting to see if this post gets picked up by the Datablog, or popped into the Guardian Technology newsbucket… ;-) Heh heh…
