
January 06 2012

15:48

Let's try ScraperWiki

Guest post by Makoto Inoue, a Japanese ScraperWiki user

Introduction

Are you familiar with the word "scrape"?

The job of pulling specific data out of a web page is called "scraping".

Many websites nowadays offer an API (Application Programming Interface) as an easy way to serve up their data, so you may well wonder why scraping is still needed. But when the recent Great East Japan Earthquake struck, bulletins on the earthquake and the power supply, and the government statistics needed to grasp the damage in each region, were not available through APIs, so quite a few developers presumably ended up writing their own scraper programs. It is a great shame that programs built through the goodwill of so many developers end up scattered across all sorts of sites and eventually go unmaintained.

That is where ScraperWiki comes in.

What is ScraperWiki?

ScraperWiki is a UK startup that runs a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapers can also be run on a schedule, and the data they collect is stored on ScraperWiki; since ScraperWiki provides an API, that data can be reused on other sites.

True to the "wiki" in its name, publicly shared code can be edited by other people, or copied and reused for other scraping jobs. There is a mechanism for checking whether scheduled scrapers are throwing errors, and features for "managing scraping together" are everywhere.

ScraperWiki has its origins in the UK: around 2003 one of its founders scraped the parliamentary website to find out which MPs had voted for or against which bills.

In Japan, the equivalent would be a page something like this one.

Today, major news organizations such as the Guardian use it to investigate the influence of corporate lobbyists in parliament, and the UK government itself reportedly uses ScraperWiki on alpha.gov.uk, its prototype site, as a mechanism for providing unified access to data scattered across government departments.

As for ScraperWiki's business model: publicly shared code is free, while charges apply for keeping code private and scale with, among other things, how much scheduled scraping you do.

That is enough preamble; let's actually try it out.

Taking a look at an existing scraper

Googling "ScraperWiki" turned up a Japanese user who is already on the site.

Here it is being used to scrape data on members of the House of Representatives.

You can see that the data is set to be collected once a month, and that the scraper has multiple contributors.

Toward the bottom of the page you can browse the data in spreadsheet form, but on its own that makes reuse on other sites difficult. In that case, click the "Explore with API" button. At the end of that page you should find a URL like the following.

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=members_of_the_house_of_representatives_of_japan&query=select%20*%20from%20%60swdata%60%20limit%2010

Accessing this URL returns the data above as JSON (JavaScript Object Notation). Other output formats such as CSV, RSS, and HTML tables are also supported, and you can even filter the data with SQL statements, for example:

select * from `swdata` where party = '民主'
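As a minimal Python sketch of calling that API with the filter above (the query selects members whose party is '民主', the Democratic Party of Japan): the endpoint and scraper name are copied from the URL shown earlier, and whether the classic ScraperWiki API still responds today is not guaranteed.

import json
import urllib.parse
import urllib.request

# Sketch: fetch filtered rows from the ScraperWiki datastore API.
BASE = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
params = {
    "format": "jsondict",
    "name": "members_of_the_house_of_representatives_of_japan",
    "query": "select * from `swdata` where party = '民主' limit 10",
}
url = BASE + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    rows = json.load(resp)  # jsondict format: one dict per row

for row in rows:
    print(row)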

Making your own scraper with Copy

Press your browser's "Back" button and look just below the spreadsheet on the previous page. Under "This Scraper in Context" there is a "Copied To" section, which shows that this source code has been copied and put to other uses.

There you will see "makoto / Members of the House of Councillors of Japan"; click it. This is actually a scraper I built to extract the roster of members of the House of Councillors. The House of Representatives and the House of Councillors each have their own website, but their member-roster pages looked quite similar, so I suspected the code could be adapted without much work.

Making one is easy: just click the Copy link. You can take a copy without logging in, but I recommend using the opportunity to create an account.

Opening the "Edit" page brings up an online editor for changing the code on the spot. Press the "Run" button below it and you can watch the scraper actually pulling data from the site.

http://www.screenr.com/embed/BgQs

Below are the differences between the original code and mine.

Because the House of Representatives and House of Councillors pages differ, there were three points that needed changing:

  • The encoding (how the characters are represented) differs: UTF vs. Shift-JIS
  • The House of Representatives roster spans several pages, while the House of Councillors roster is a single page
  • The House of Representatives page attaches the honorific "-kun" to members' names, while the House of Councillors page lists both stage names and real names

The pages' HTML also differed in subtle ways, so I slightly changed the XPath expressions (XPath is a notation for accessing HTML by its structure).
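For readers who have not met XPath before, here is a minimal, hypothetical sketch of the kind of change involved, using Python's lxml library; the URL and both expressions are made-up stand-ins, not the actual paths from either scraper.

import lxml.html

# Hypothetical example: parse a member-roster page and pull out names.
doc = lxml.html.parse("http://example.com/members.html").getroot()

# A small difference in HTML structure means a different XPath:
lower_house_names = doc.xpath("//table[@class='giin']//td[1]/text()")
upper_house_names = doc.xpath("//table//tr/td/a/text()")
print(lower_house_names or upper_house_names)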

Of course, making these changes takes some programming knowledge, but customizing a working sample for your own purposes makes ScraperWiki an excellent learning resource for anyone who wants to study programming. I had hardly ever used XPath myself, but by referring to the original program I picked it up fairly quickly.

In closing

I hope this quick ScraperWiki tutorial has given you a feel for what it offers.

Public institutions, the media, and government agencies are increasingly releasing information over the internet, but few sites yet take "data reuse with mashups in mind" into account. ScraperWiki is working to change that, and awareness of it is gradually growing among journalists and government officials in Europe. ScraperWiki is currently planning workshops in the US, and preparations are under way to start workshops in Japan as well. If you are interested, please feel free to get in touch via the contact page.


September 16 2011

13:16

Driving the Digger Down Under

G’day,

Henare here from the OpenAustralia Foundation – Australia's open data, open government and civic hacking charity. You might have heard that we were planning to have a hackfest here in Sydney last weekend. We decided to focus on writing new scrapers to add councils to our PlanningAlerts project, which lets you find out what is being built or knocked down in your local community. During the two afternoons over the weekend, seven of us were able to write nineteen new scrapers, which covers an additional 1,823,124 Australians – a huge result.

There are a number of reasons why we chose to work on new scrapers for PlanningAlerts. ScraperWiki lowers the barrier to entry for new contributors by allowing them to get up and running quickly with no setup – just visit a web page. New scrapers are also relatively quick to write, which is perfect for a weekend hackfest. And finally, because we have a number of working examples and ScraperWiki's documentation, it's conceivable that someone with no programming experience can come along and get started.

It’s also easy to support people writing scrapers in different programming languages using ScraperWiki. PlanningAlerts has always allowed people to write scrapers in whatever language they choose by using an intermediate XML format. With ScraperWiki this is even simpler because as far as our application is concerned it’s just a ScraperWiki scraper – it doesn’t even know what language the original scraper was written in.

Once someone has written a new scraper and formatted the data according to our needs, it’s a simple process for us to add it to our site. All they need to do is let us know, we add it to our list of planning authorities and then we automatically start to ask for the data daily using the ScraperWiki API.
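As a rough sketch of what that daily request might look like in Python (the scraper name below is a placeholder, not a real PlanningAlerts authority, and the real polling code is of course more involved):

import csv
import io
import urllib.parse
import urllib.request

# Hypothetical daily poll against the ScraperWiki datastore API.
BASE = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
params = {
    "format": "csv",
    "name": "example_council_development_applications",  # placeholder
    "query": "select * from `swdata` where date_scraped = date('now')",
}
url = BASE + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    for application in reader:
        print(application)  # hand each new application to the site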

Another issue is maintenance of these scrapers after the hackfest is over. Lots of volunteers only have the time to write a single scraper, maybe to support their local community. What happens when there’s an issue with that scraper but they’ve moved on? With ScraperWiki anyone can now pick up where they left off and fix the scraper – all without us ever having to get involved.

It was a really fun weekend and hopefully we’ll be doing this again some time. If you’ve got friends or family in Australia, don’t forget to tell them to sign up for PlanningAlerts.

Cheers,

Henare
OpenAustralia Foundation volunteer


May 15 2011

17:14

Finding bulk IRS 990 data

Are there good sources for downloading bulk IRS 990 data? I'm looking for files from a specific geographic area (Michigan), but would be happy to have a wider dataset. My end product will involve extracting specific datapoints, including largest program service areas and executive officers.

The Foundation Center's search tool is the best source I've found so far. They have OCR'ed PDFs and it looks scrapeable.

The IRS provides scanned copies of returns on DVD. In theory, one organization could buy the data and distribute it, or we could take up a collection to make it freely available if someone hasn't already. However, these are image files that would need to be recombined into PDFs and OCRed (quite possible using pdftk/docsplit/documentcloud), which is still a hassle.
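To make "recombine and OCR" concrete, here is a rough Python sketch of such a pipeline. The directory layout and filenames are assumptions, and ImageMagick's convert stands in for the image-to-PDF step (pdftk itself only merges existing PDFs); docsplit's text command does the OCR.

import glob
import subprocess

# Hypothetical layout: one directory of page images per return,
# e.g. irs990/<ein>/page01.tif ... (not the IRS's actual structure).
for return_dir in sorted(glob.glob("irs990/*/")):
    pages = sorted(glob.glob(return_dir + "*.tif"))
    pdf_path = return_dir.rstrip("/") + ".pdf"

    # Combine the page images into a single PDF (ImageMagick).
    subprocess.run(["convert", *pages, pdf_path], check=True)

    # Extract text with docsplit, forcing OCR since these are scans.
    subprocess.run(["docsplit", "text", "--ocr", pdf_path], check=True)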

Guidestar provides much of this data, but I don't think I am able to legally repackage it in bulk, and it still requires manually downloading files.

Tags: irs opendata data

May 02 2011

15:03

Link love from BarCamp News Innovation Philly #BCNIPhilly

An actual session at an unconference! #bcniphilly (photo on Twitpic)

Before I hit the road back to Toronto today, I wanted to share a little link love from BarCamp News Innovation Philly.

The event drew more than 100 people from around Philadelphia, and as far away as Washington, DC, New York, and Sacramento, California. There were roughly 25 sessions with a keynote by Zach Seward, Outreach Editor for the Wall Street Journal. There’s a good summary of the event on Philly.com.

Here are some resources from the event:

Many thanks to Sean Blanda and the other folks at Technically Philly and Temple University for organizing a great event. Other cities that want to explore ‘news innovation’ should look at this event as a template for what to do.

As soon as I’m back at my desk, I’ll post a little video summary of the event. Stay tuned.

April 14 2011

07:44

The Open Data Guidelines

Open Data: everyone is talking about it, but how do you actually do it?

We have tried to give a first answer to that question by writing a short manual: http://tinyurl.com/pendataitalia

Our hope is that this humble piece of work, to which several of our members contributed, can become a reference for public administrators, managers, and all those decision-makers who, convinced of the soundness of the philosophy behind Open Government Data, have not yet found the toolbox for moving from theory to concrete action.

Like any toolbox, this one can be filled with new tools and, with the help of new contributions, become a reference for finally giving Italy, too, a strategy for "digital government".

What does Open Data mean? Why is Open Data a path toward Open Government, and why is Open Government an instrument of development? What are the main problems to face when you want to "do" Open Data? What legal issues need to be considered? What are the technical aspects and the organizational impacts? We set out to give a first answer to these questions (and a few more), so that everyone can begin to understand why this topic is central to the country's development.

These guidelines follow the Manifesto for Open Government that our association published in November of last year. The next initiatives we plan to pursue, with the help of an ever-larger group of experts, will be announced in the coming days at several events we are working on.

In the meantime, anyone who wants to help improve this version can comment on this post. As always, we promise the fullest attention to every observation, criticism, and proposed improvement, which will be incorporated into the next version.

Our thanks to the magazine eGov, which printed the first copies of this manual to hand out to everyone attending today's award ceremony at Palazzo Marino.

Come Si Fa Open Data ("How to Do Open Data") – Version 1.0

March 11 2011

16:23

Help wanted: Canadian media organization to lead on open data.

"Freedom of information is often thought to be about "the Press." Open Data, however, is about citizens"

Here's the 30-second version of this post:

  • The Guardian UK played an important role in pushing the open data and transparency agenda;
  • They did this, in large part, by simply being a meeting point for the open data & transparency conversation;
  • Ultimately, it was probably timing -- an election -- that helped most to put open data on the UK government's radar;
  • With elections coming in Canada, what can open-data advocates of all stripes -- individuals, grassroots groups, and media organizations -- do to push this critical issue into the spotlight?

Last Friday morning, Emily Bell spoke to a small group gathered at the Samara offices on Prince Arthur Avenue. Over breakfast, she explained why the Guardian UK has invested so much energy into being the meeting point for the Open Data conversation.

Specifically, she described how the Guardian partnered early on with the pioneering open data efforts of Tom Steinberg and his merry band of open-data hackers at MySociety, and also how the Guardian was quick to adopt the idea of organizing "hack days," which sought to bring outside ideas inside. Early efforts like these led, in part, to the Guardian being invited to Downing Street to meet with the likes of Tim Berners-Lee and to discuss the benefits of open data with the UK government. "You have to do it," Emily implored those gathered at Samara; ultimately, she suggested that "Wikileaks data would not have gone to the Guardian if not for their demonstrated skills in working with data."

It certainly left me asking, who will play the Guardian's role here in Canada? Who will be the lightning rod for the open data conversation?

Interestingly, most major Canadian news outlets already have at least one software developer working in the newsroom. More than that, David Skok shared that GlobalNews.ca had recently taken part in the Random Hacks of Kindness event at the University of Toronto -- an event that aimed to bring together "developers, geeks and tech-savvy do-gooders around the world, working to develop software solutions that respond to the challenges facing humanity today." However, it still feels like the most tangible open data efforts in Canada are coming from citizens like David Eaves, Russell McOrmond, and groups like Civic Access.

Ryan Merkley -- currently Mozilla Foundation's director of programs & strategy, previously an advisor to the City of Toronto -- was quick to point out that most of the open data initiatives in Canada are coming from the municipal level, either through official efforts like www.toronto.ca/open or data.vancouver.ca, or through grassroots initiatives like Open Data Ottawa Hackfest and Montréal Ouvert. And, while there are challenges to getting provincial and federal data, that's not to say it doesn't happen -- one recent example is OpenFile's "Baby File" story, which asked the province of Ontario to hand over years of birth records. Where there's a will, there's a way, it would seem (at least if you're Patrick Cain).

Back to Emily Bell: asked about the launch of data.gov.uk in 2010, she was quick to point out that elections present an opportunity for movement toward greater transparency. (However, it's becoming ever-more clear that you have to hold elected officials accountable, or they'll actually do the opposite of what they campaigned on.) Often an upcoming election is incentive for the incumbents to make a bold move to win support, or for the challengers to make commitments that the incumbent refuses to address, and -- let's face it -- open data is an inherently non-partisan issue. So this all begs the question: how does open data become an election issue in Canada?

Even though Emily believes that the jury is still out on data.gov.uk, it's clearly a move in the right direction. It sounds like the big push for data.gov.uk came before the 2010 general election, and it came from people like Tim Berners-Lee and Tom Steinberg, both of whom have been campaigning for open data for more than a decade. In Steinberg's case, he's taken the pragmatic hacker approach of continuing to innovate and demonstrate what's possible -- standing on the virtual Speaker's Corner and shouting "Hey, look at what I can do with this open data!" So the next question for Canada is: who is our Steinberg or Berners-Lee, constantly banging the drum at the federal level for more open data and more transparency?

The movement for open data in the UK appears strong and vibrant, and it's likely that the Guardian played an important role by investing resources, providing space, and convening ideas and people around the issue. According to Emily, the first step was simply to set up what is now known as the Data Blog; it became a gathering point for the broader conversation, and made it possible for disparate voices to find each other. The Guardian has called Canada an "open data and journalism powerhouse," but Canada still lacks this simple piece of the puzzle -- one visionary media organization to pick up the flag and say "We care about open data, we're going to convene the conversation."

Some will say that it's not the media's place to play a role here. However, at the end of breakfast last Friday, Emily Bell pointed out, in her perfect British accent, "The public gives the media permission to act."

So, let's give Canadian media permission to act on open data.


November 26 2010

20:37

"Open Data: data, knowledge, value" – 6th Annual Conference of the TOP-IX Consortium, and Turin Cloud Camp

Friday 3 December, 9:30 am – 5 pm: "Open Data: data, knowledge, value"; Environment Park, via Livorno 58, Turin

Thursday 2 December, 2 pm – 6 pm: "Turin Cloud Camp"; Environment Park, via Livorno 58, Turin

Open Data means data that is available, accessible, and reusable, free of restrictions from copyright, patents, or other forms of control that limit its reproduction.

Most of this data belongs to everyone: it is produced by public bodies in the course of their duties – from the land registry to healthcare, from environmental data to road traffic – and it represents an immense store of information that, thanks to the internet, can easily be shared.

To appreciate the value of data and of its circulation, just think of Google, which built its fortune and success on collecting and reorganizing data and information and returning it as an "open" service on the web.

Having contributed, together with Regione Piemonte, Centro Nexa, and CSI Piemonte, to the launch of the web platform dati.piemonte.it – the first Italian example of regional Open Government – the TOP-IX Consortium is devoting its 6th annual conference, titled "Open Data: data, knowledge, value", to Open Data on Friday 3 December 2010.

The conference will be preceded, on Thursday 2 December 2010, by the first Turin Cloud Camp, focused on managing large quantities of data within cloud computing infrastructures.

Many open questions and topics will be tackled in the now-classic BarCamp format: there is no agenda fixed in advance; instead, the topics of greatest interest are chosen and discussed following the participants' own suggestions, in a spirit of participation and involvement.

The Turin Cloud Camp – organized in collaboration with the international CloudCamp community – will feature Reuven Cohen, founder and CTO of Toronto-based Enomaly Inc. (a leader in software products and solutions focused on cloud computing) and one of the original creators of the Cloud Camp. Yosu Cadilla, facilitator for Cloud Camp events across Europe, will guide the audience of experts through the discussion.

Camp participants will also be able to attend a workshop devoted to Open Data, examining the current state of the art and presenting projects and solutions. The workshop is conceived as a bridge between the "unconference" and the institutional TOP-IX conference, which includes a session on the relationship between Open Data and cloud computing.

On Friday 3 December the conference will unfold across four sessions, each focusing on a different facet of Open Data. The first session will look at the international picture, with particular attention to Europe and to the approach taken in the UK, which concentrated on Linked Open Data from the outset. An overview of the Italian situation will follow – the topic has gathered considerable momentum in recent months – with a closer look at the Piedmont case. The session after that pairs Open Data with journalism, before the conference closes with projects exploring the business opportunities that data reuse offers.

Info:

http://www.top-ix.org

http://conferenza.top-ix.org

http://conferenza.top-ix.org/cloudcamp

http://www.cloudcamp.org/turin

November 24 2010

10:47

We need standards, international standards

by Lorenzo Benussi, Federico Morando, and Michele Barbera

"We need standards, international standards." This is how Tim Berners-Lee closed his talk today at the Open Government Data Camp in London (http://opengovernmentdata.org/camp2010/). Standards for opening up data, standards for classifying it, standards for building platforms that can talk to one another and let us link information together. A clear message from the father of the Web – and, as someone at the Camp put it, something of a father to all the "people of the net".
Before him, Prime Minister Cameron, in a video message, clearly expressed his determination to make the right to data a reality. Complementary to the right to information, it concerns the freedom to see and, above all, to use public data in order to understand and manage public administration ever better – to give anyone the chance to scrutinize it and, above all, to collaborate in finding solutions. Concretely, as Francis Maude, Minister for the Cabinet Office, explained in his speech, the government will publish on the internet all spending above £25,000, openly challenging the web community to find the problems and propose the solutions.

The UK, Norway, Finland, Sweden, France, Germany, Italy, Mexico, Chile, Lithuania, China, Brazil, and the US are just some of the countries represented at the OGDC – countries where Open Data projects are emerging from both governments and civil society, from the top down and from the bottom up. It is a wave that is building fast.

Tim Berners-Lee suggested going a step further and turning Open Data into Linked Open Data. Linking together data produced by different actors, and describing its semantics with suitable vocabularies, would make it easy to compare and cross-reference data from different sources – for example, research spending in the UK and in China, or the cost of the national health service in Italy and in France.
Berners-Lee stressed that this further step can be taken simultaneously at the local, national, and international levels, from the top down and from the bottom up, in the public sector and in the private sector. Building a network of Linked Open Data means increasing the value of data by placing it in a broader context; it means being able to build applications that cross government data with data from every segment of civil society, from the scientific community and the private sector, from schools, culture, and sport.

A good part of the debate over the two days in London made clear that the impact of Open Data is not limited to making administrations more transparent: it can create a genuine ecosystem capable of producing value. And it is precisely on this point that, regardless of political orientation, the governments of many countries have decided to embrace the vision of "Open Data conceived, like the electricity grid and the road network, as a public infrastructure, open and shared".

Much work has already been done, and the community is open and ready for further collaboration. Definitions and manuals on open data exist, along with classifications of the technical quality of data and guidelines for building platforms to collect and distribute it. There is no need to dream up something entirely new or reinvent the wheel; it is enough to join what is happening internationally, adopt the standards, and adapt the models to national particularities. This does not mean flattening ourselves onto other people's solutions, but following a modular, open approach – open both to reusing others' solutions and to sharing our own. The advantage is that in doing so we are not left alone: we can count on global support and build on the experience of others.

This holds for Italy too, where, perhaps more than in other countries, we need to be clear-headed, determined, and concrete. In recent months – and very probably with even more force next year – the open data model has finally become an important topic. Groups and associations, public administrations and civil society have begun working to "free" the data, but we must keep pace with the international community, which is well ahead, if only because of the effort other governments are putting into this issue. Let us not make the mistake of reinventing the wheel, at the risk of building one that fits only our own roads.

Lorenzo Benussi, Consorzio TOP-IX (www.top-ix.org)
Federico Morando, NEXA Center for Internet & Society (http://nexa.polito.it/staff#morando)
Michele Barbera, Net7 (www.netseven.it)

October 24 2010

19:23

Reliable Open Database of Zipcodes mapped to City/County/State?

I'm looking for a database of zipcodes mapped to cities, counties and states, but most of the data sets I've found seem either very out of date, proprietary, or both.

Any suggestions for an open data set that is solid in this area?

June 28 2010

09:22

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the "websites run by ministerial and non-ministerial government departments", detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards and guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB], available at http://coi.gov.uk/websitemetricsdata/websitemetrics2009-10.pdf), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES

[The three tables are reproduced as images in the original post.]

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is a still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs (the cost column headings are reproduced as images in the original post).

There are also columns headed SEO/SIO and HEO, for example, which may or may not relate to costs. (To see all the headings, see the CSV doc on Google spreadsheets.)

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, have little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by "context" I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or "discovered" by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived ("According to the Guardian/Times/Telegraph/FT", etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

So what would I like to see? Taking the example of the COI websites report, what I’d like to be able to see would be links from each of the tables to a page that “shows the working”.

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here's a summary of the satisfaction levels, and here's one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.
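For instance, a "show your working" link can be as simple as a Google Visualization API query URL over the spreadsheet copy of the CSV. A minimal sketch in Python, reusing the summed-costs query above (the spreadsheet key is a placeholder, not the real document's):

import urllib.parse

key = "SPREADSHEET_KEY"  # placeholder for the spreadsheet's key
tq = "select A,B,EL+EN+EP+ER+ET"  # the summed-costs query shown above
url = ("https://spreadsheets.google.com/tq?tqx=out:html"
       "&key=" + key +
       "&tq=" + urllib.parse.quote(tq))
print(url)  # opening this in a browser renders the summary table "live"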

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports, to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)


May 06 2010

18:26

Resources for Finding and Sharing Data?

"I'd love to know about a good resource for journalists and other people who already FOIA big data sets that will facilitate sharing and re-use of gov't data sets." --Amanda

Repositories: (Places to upload, share and visualize data)

Data sources/aggregators:

May 04 2010

15:22

Resources for Big Data Sets?

"I'd love to know about a good resource for journalists and other people who already FOIA big data sets that will facilitate sharing and re-use of gov't data sets." --Amanda

Repositories:

Data sources list:

April 30 2010

02:54

Resources for Big Government Data?

"I'd love to know about a good resource for journalists and other people who already FOIA big data sets that will facilitate sharing and re-use of gov't data sets." --Amanda

Data sources list:

