
April 20 2012

15:56

Announcing ScraperWiki Premium Accounts!

The most exciting bit about ScraperWiki is how it forms a link between two very different worlds.

On the one hand, we love the public good that data liberation enables, and we’re used by everyone from journalists (did you see us on the Guardian front page last week?) to activists (like the guys behind Australian planning alerts).

But we also love the value that businesses create using data. They use ScraperWiki in many ways – like pulling customised marketing leads from the web, and extracting and cleaning old proprietary data so it can be sold anew – something we’ll be blogging about a lot more in the next few weeks.

Today, we’re really excited to announce that anyone (be they journalists, businessmen or anything else!) can now use ScraperWiki in private with the click of a button. Our new premium accounts range from $9 per month for individuals, to $299 for corporates with lots of collaborators – all you need is a credit card.

For that monthly fee you get to make ScraperWiki vaults (secure, private areas, which you can share with precisely who you want) and you also get the ability to schedule any scraper to run hourly (for data feeds that update more often than once a day).

This will let journalists keep their scrapers secret – embargoed until they write their story. It will let businesses scrape websites without revealing to their competitors the advantage they’ve found. It will let anyone scrape their own private data, in private, to repurpose it and do wonderful things that nobody had ever intended.

We’re quite excited to hear about what you do. Since vaults are private we won’t know, so please get in touch. We’d love to write about it here, if you’ll let us.


January 20 2012

09:27

How to stop missing the good weekends

Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to play on my dive kayaks — an activity which is somewhat weather dependent.

Luckily, help is at hand in the form of the ScraperWiki email alert system.

As you may have noticed, when you do any work on ScraperWiki, you start to receive daily emails that go:

Dear Julian_Todd,

Welcome to your personal ScraperWiki email update.

Of the 320 scrapers you own, and 157 scrapers you have edited, we
have the following news since 2011-12-01T14:51:34:

Histparl MP list - https://scraperwiki.com/scrapers/histparl_mp_list :
  * ran 1 times producing 0 records from 2 pages
  * with 1 exceptions, (XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<!DOCTYP')

...Lots more of the same

This concludes your ScraperWiki email update till next time.

Please follow this link to change how often you get these emails,
or to unsubscribe: https://scraperwiki.com/profiles/edit/#alerts

The idea behind this is to attract your attention to matters you may be interested in — such as fixing those poor dear scrapers you have worked on in the past and are now neglecting.

As with all good features, this was implemented as a quick hack.

I thought: why design a whole email alert system, with special options for daily and weekly emails, when we already have a scraper scheduling system which can do just that?

With the addition of a single flag to designate a scraper as an emailer (plus a further 20 lines of code), a new fully fledged extensible feature was born.

Of course, this is not counting the code that is in the Wiki part of ScraperWiki.

The default code in your emailer looks roughly like so:

import scraperwiki
emaillibrary = scraperwiki.utils.swimport("general-emails-on-scrapers")
subjectline, headerlines, bodylines, footerlines = emaillibrary.EmailMessageParts("onlyexceptions")
if bodylines:
    print "\n".join([subjectline] + headerlines + bodylines + footerlines)

As you can see, it imports the 138 lines of Python from general-emails-on-scrapers, which I am not here to talk about right now.

Using ScraperWiki emails to watch the weather

Instead, what I want to explain is how I inserted my Good Weather Weekend Watcher by polling the weather forecast for Holyhead.

My extra code goes like this:

import datetime
import urllib
import lxml.html

weatherlines = [ ]
if datetime.date.today().weekday() == 2:  # Wednesday
    url = "http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html"
    html = urllib.urlopen(url).read()
    root = lxml.html.fromstring(html)
    rows = root.cssselect("div.tableWrapper table tr")
    for row in rows:
        #print lxml.html.tostring(row)
        metweatherline = row.text_content().strip()
        if metweatherline[:3] == "Sat":
            subjectline += " With added weather"
            weatherlines.append("*** Weather warning for the weekend:")
            weatherlines.append("   " + metweatherline)
            weatherlines.append("")

What this does is check if today is Wednesday (day of the week #2 in Python land), then it parses through the Met Office Weather Report table for my chosen location, and pulls out the row for Saturday.

Finally we have to handle producing the combined email message, the one which can contain either a set of broken scraper alerts, or the weather forecast, or both.

if bodylines or weatherlines:
    if not bodylines:
        headerlines, footerlines = [ ], [ ]   # kill off cruft surrounding no message
    print "
".join([subjectline] + weatherlines + headerlines + bodylines + footerlines)

The current state of the result is:

*** Weather warning for the weekend:
  Mon 5Dec
  Day

  7 °C
  W
  33 mph
  47 mph
  Very Good

This was a very quick low-level implementation of the idea with no formatting and no filtering yet.

Email alerts can quickly become sophisticated and complex. Maybe I should only send a message out if the wind is below a certain speed. Should I monitor previous days’ weather to predict whether the sea will be calm? Should I check the wave heights on the offshore buoys? Perhaps my calendar should be consulted for prior engagements, so I don’t get frustrated by being told I am missing out on a good weekend when I had promised to go to a wedding.
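For instance, here is a minimal sketch of a wind filter, assuming (as in the sample output above) that wind speeds appear in the row text as figures like “33 mph”; the 25 mph threshold is just an arbitrary choice:

import re

def weekend_looks_calm(metweatherline, max_mph=25):
    # Pull every "<number> mph" figure out of the Met Office row text
    speeds = [int(s) for s in re.findall(r"(\d+)\s*mph", metweatherline)]
    # Only call it calm if we found some speeds and none exceed the threshold
    return bool(speeds) and max(speeds) <= max_mph

The Saturday row would then only be appended to weatherlines when weekend_looks_calm(metweatherline) returns True.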

The possibilities are endless and so much more interesting than if we’d implemented this email alert feature in the traditional way, rather than taking advantage of the utterly unique platform that we happened to already have in ScraperWiki.


January 06 2012

15:48

Let’s try out ScraperWiki

Guest post by Makoto Inoue, a Japanese ScraperWiki user

Introduction

Have you come across the word “scrape”?

Pulling specific data out of a web page is called scraping.

These days many websites provide an API (Application Programming Interface) to make their data easy to consume, so you may wonder why scraping is still needed. But during the recent Great East Japan Earthquake, the government statistics needed to follow earthquake and electricity bulletins and to grasp the damage in each region were not published as APIs, and many developers ended up writing their own scraper programs. Sadly, those programs, built through the goodwill of many developers, end up scattered across different sites and eventually stop being maintained.

That is where ScraperWiki comes in.

What is ScraperWiki?

ScraperWiki is a UK startup that runs a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapers can be scheduled to run regularly, and the data they collect is stored on ScraperWiki; because ScraperWiki also provides an API, that data can be reused on other sites.

As the “Wiki” in the name suggests, publicly shared code can be edited by other people, or copied and adapted for other scraping jobs. There is a mechanism for checking whether regularly scheduled scrapers have started failing, and features for “managing scraping together” are everywhere.

ScraperWiki has its origins in the UK, where around 2003 one of the founders scraped the Parliament website to find out which MPs had voted for or against which bills.

In Japan, the equivalent would be a page much like this one.

Today, major news organisations such as the Guardian use it to investigate the influence of corporate lobbyists in Parliament, and the UK government itself apparently uses ScraperWiki on its prototype site alpha.gov.uk as a way to give unified access to data scattered across different ministries.

As for ScraperWiki’s business model, publicly shared code is free, while charges apply for keeping code private and for how much scraping you schedule.

That was a long preamble, so let’s actually try it out.

Looking at an existing scraper

A Google search for “ScraperWiki” turned up a Japanese user who is already using it.

Here it is being used to scrape data on members of the House of Representatives.

You can see that the scraper is set to run once a month and that it has several contributors.

Towards the bottom of the page you can browse the data in spreadsheet form, but that alone makes it hard to reuse on another site. In that case, click the “Explore with API” button. At the end of that page you should find a URL like this:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=members_of_the_house_of_representatives_of_japan&query=select%20*%20from%20%60swdata%60%20limit%2010

Accessing this URL returns the data we just saw as JSON (JavaScript Object Notation). The output also supports other formats such as CSV, RSS and HTML tables, and you can filter the data with SQL statements, for example:

select * from `swdata` where party = '民主'
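As a minimal sketch of reusing this feed from your own code (the scraper name and query are the same ones as in the URL above), something like this should work:

import urllib, json

# Ask the ScraperWiki external API for the first ten rows in jsondict format
url = ("https://api.scraperwiki.com/api/1.0/datastore/sqlite"
       "?format=jsondict"
       "&name=members_of_the_house_of_representatives_of_japan"
       "&query=" + urllib.quote("select * from `swdata` limit 10"))
rows = json.load(urllib.urlopen(url))   # a list of row dictionaries
for row in rows:
    print row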

Copying it to make your own scraper

Press your browser’s back button and look below the spreadsheet on the previous page. Under “This Scraper in Context” there is a section called “Copied To”. This shows that the source code has been copied and used for other purposes.

There you will see “makoto / Members of the House of Councillors of Japan”; click on it. This is actually a scraper I made to extract the list of members of the House of Councillors. The House of Representatives and the House of Councillors each have their own websites, but their member list pages looked quite similar, so I thought the code could easily be reused.

Making a copy is simple: just click the Copy link. You can copy without logging in, but I recommend taking this opportunity to create an account.

Opening the “Edit” page brings up an online editor where you can change the code in place. Pressing the “Run” button below lets you watch it actually fetching data from the site.

http://www.screenr.com/embed/BgQs

Below are the differences between the original code and mine.

Because the House of Representatives and House of Councillors pages differ, there were roughly three things I had to change:

  • The encoding (how characters are represented) differs: UTF-8 versus Shift-JIS
  • The House of Representatives list spans several pages, while the House of Councillors list is a single page
  • On the House of Representatives page members’ names carry the honorific “kun”; the House of Councillors page lists both stage names and real names

The HTML markup also differed slightly, so I adjusted the XPath expressions, a notation for accessing HTML by its structure.

Of course, making these changes needs some programming knowledge, but ScraperWiki, where you customise a working sample for your own purposes, is an excellent learning resource for anyone who wants to study programming. I had hardly used XPath myself, but by referring to the original program I was able to pick it up fairly easily.

In closing

I hope this short ScraperWiki tutorial has given you an idea of what it can do.

Public institutions, the media and government agencies are publishing more and more information over the internet, but few sites yet consider “data reuse with mashups in mind”. ScraperWiki is working to change that, and awareness is gradually growing among journalists and government officials in Europe. ScraperWiki is currently planning workshops in the US, and preparations are under way to start workshops in Japan as well. If you are interested, please feel free to get in touch via the contact page.


December 07 2011

14:50

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the irresistible force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia API site doesn’t accept.
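For comparison, here is a quick check of what the standard encoder does to that parameter (the title is the same one used above):

import urllib

title = "Aquamole Pot"
# urlencode percent-encodes the pipe, which the API does not accept
print urllib.urlencode({"titles": "API|" + title})    # titles=API%7CAquamole+Pot
# the hand-rolled join above leaves the '|' alone
print "titles=API|%s" % urllib.quote(title)           # titles=API|Aquamole%20Pot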

Running the scraper gives the result:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England wih which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
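A sketch of that last conversion step, continuing from the snippet above, might look like this (the caveinfo table name and the name key are chosen to match the query below; the details are mine rather than lifted from the actual scraper):

# keep only the named infobox parameters (drop the positional 0: 'Infobox ukcave' entry)
cave = dict((k, v) for k, v in infobox_ukcave.items() if k != 0)
cave["name"] = cave.get("name") or title
# convert the numeric fields from strings to floats, leaving blanks as None
for field in ("length_metres", "depth_metres"):
    value = cave.get(field)
    cave[field] = float(value) if value else None
scraperwiki.sqlite.save(["name"], cave, "caveinfo")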

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is then inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves that extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length to create this table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres desc:

name                   location_area                   length_metres  depth_metres
Ease Gill Cave System  United Kingdom Yorkshire Dales  66000.0        137.0
Dan-yr-Ogof            Wales                           15500.0
Gaping Gill            United Kingdom Yorkshire Dales  11600.0        105.0
Swildon’s Hole         Somerset                        9144.0         167.0
Charterhouse Cave      Somerset                        4868.0         228.0

If I were being smart I could make the scraping adaptive, that is, only update the pages that have changed since the last scrape by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t mind your updates coming no more often than every 6 months, which prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundary between entities is a simple, straightforward and general concept, rather than what it really is: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.


October 06 2011

14:46

New backend now fully rolled out

The new faster, safer sandbox that powers ScraperWiki is now fully rolled out to all users.

You should find running and developing scrapers and views faster than before, and that you’re using much more recent versions of Ruby, Python and associated libraries.

Thank you to everyone, and there were lots of you, who helped us beta test it!

Now, Ross and Julian are fighting for the right to delete all the old code we don’t need any more…


September 16 2011

13:16

Driving the Digger Down Under

G’day,

Henare here from the OpenAustralia Foundation – Australia’s open data, open government and civic hacking charity. You might have heard that we were planning to have a hackfest here in Sydney last weekend. We decided to focus on writing new scrapers to add councils to our PlanningAlerts project that allows you to find out what is being built or knocked down in your local community. During the two afternoons over the weekend seven of us were able to write nineteen new scrapers, which cover an additional 1,823,124 Australians – a huge result.

There are a number of reasons why we chose to work on new scrapers for PlanningAlerts. ScraperWiki lowers the barrier of entry for new contributors by allowing them to get up and running quickly with no setup – just visit a web page. New scrapers are also relatively quick to write which is perfect for a hackfest over the weekend. And finally, because we have a number of working examples and ScraperWiki’s documentation, it’s conceivable that someone with no programming experience can come along and get started.

It’s also easy to support people writing scrapers in different programming languages using ScraperWiki. PlanningAlerts has always allowed people to write scrapers in whatever language they choose by using an intermediate XML format. With ScraperWiki this is even simpler because as far as our application is concerned it’s just a ScraperWiki scraper – it doesn’t even know what language the original scraper was written in.

Once someone has written a new scraper and formatted the data according to our needs, it’s a simple process for us to add it to our site. All they need to do is let us know, we add it to our list of planning authorities and then we automatically start to ask for the data daily using the ScraperWiki API.

Another issue is maintenance of these scrapers after the hackfest is over. Lots of volunteers only have the time to write a single scraper, maybe to support their local community. What happens when there’s an issue with that scraper but they’ve moved on? With ScraperWiki anyone can now pick up where they left off and fix the scraper – all without us ever having to get involved.

It was a really fun weekend and hopefully we’ll be doing this again some time. If you’ve got friends or family in Australia, don’t forget to tell them to sign up for PlanningAlerts.

Cheers,

Henare
OpenAustralia Foundation volunteer


May 25 2011

01:02

‘Documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing’

You may have noticed that the design of the ScraperWiki site has changed substantially.

As part of that, we made a few improvements to the documentation. Lots of you told us we had to make our documentation easier to find, more reliable and complete.

We’ve reorganised it all under one contents page, called Documentation throughout the site, including within the code editor. All the documentation is listed there. (The layout is shamelessly inspired by Django.)

Of course, everyone likes different kinds of documentation – talk to a teacher and they’ll tell you all about different learning styles. Here’s what we have on offer, all available in Ruby, Python and PHP (thanks Tom and Ross!).

  • New style tutorials – very directed recipes, that show you exactly how to make something specific in under 30 minutes. More on these in a future blog post.
  • Live tutorials – these are what we now call the ScraperWiki special sauce tutorials. Self contained chunks of code with commentary that you fork and edit and run entirely in your browser. (thanks Anna and Mark!)
  • Copy and paste guides – a new type of reference to a library, which gives you code snippets you can quickly copy into your scraper. With one click. (thanks Julian!)
  • Interactive API documentation – for how to get data out of ScraperWiki. More on that in a later blog post. (thanks Zarino!)
  • Reference documentation – we’ve gone through it to make sure it covers exactly what we support.
  • Links for further help – an FAQ and our Google Group. And for more gnarly questions, ask on the Stack Overflow scraperwiki tag.

We’ve got more stuff in the works – screencasts and copy & paste guides to specific view/scraper libraries (lxml, Nokogiri, Google Maps…). Let us know what you want.

Finally, none of the above is what really matters about this change.

The most important thing is our new Documentation Policy (thanks Ross). Our promise to keep documentation up to date, and available alike for all the languages that we support.

Normally in websites it is much more important to have a user interface that doesn’t need documentation. Of course, you need it for when people get stuck, and it has to be good quality. But you really do want to get rid of it.

But programming is fundamentally about language. Coders need some documentation, even if it is just the quickest answer they can get Googling for an error message.

We try hard to make it so as little as possible is needed, but what’s left isn’t an add on. It is a core part of ScraperWiki.

(The quote in the title of this blog post is attributed to Dick Brandon on lots of quotation sites on the Internet, but none very reliably)


May 18 2011

10:13

All recipes 30 minutes to cook

The other week we quietly added two tutorials of a new kind to the site, snuck in behind a radical site redesign.

They’re instructive recipes, which anyone with a modicum of programming knowledge should be able to easily follow.

1. Introductory tutorial

For programmers new to ScraperWiki, to get an idea of what it does.

It runs through the whole cycle of scraping a page, parsing it, then outputting the data in a new form, using the simplest possible example.

Available in Ruby, Python and PHP.

2. Views tutorial

Find out how to output data from ScraperWiki in exactly the format you want – i.e. write your own API functions on our servers.

This could be a KML file, an iCal file or a small web application. This tutorial covers the basics of what a ScraperWiki View is.

Available in Ruby, Python and PHP.

Hopefully these tutorials won’t take as long as Jamie Oliver’s recipes to make. Get in touch with feedback and suggestions!


May 16 2011

11:09

It’s SQL. In a URL.

Squirrelled away amongst the other changes to ScraperWiki’s site redesign, we made substantial improvements to the external API explorer.

We’re going to concentrate on the SQLite function here as it is the most important, but there are also other functions for getting out scraper metadata.

Zarino and Julian have made it a little bit slicker to find out the URLs you can use to get your data out of ScraperWiki.

1. As you type into the name field, ScraperWiki now does an incremental search to help you find your scraper, like this.

2. After you select a scraper, it shows you its schema. This makes it much easier to know the names of the tables and columns while writing your query.

3. When you’ve edited your SQL query, you can run it as before. There’s also now a button to quickly and easily copy the URL that you’ve made for use in your application.

You can get to the explorer with the “Explore with ScraperWiki API” button at the top of every scraper’s page. This makes it quite useful for quick and dirty queries on your data, as well as for finding the URLs for getting data into your own applications.
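By way of illustration, the URL you end up copying looks much like this (the scraper name here is made up; the endpoint and parameters follow the same pattern the explorer produces):

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=my_scraper&query=select%20*%20from%20%60swdata%60%20limit%2010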

Let us know when you do something interesting with data you’ve sucked out of ScraperWiki!


May 05 2011

10:06

ScraperWiki: A story about two boys, web scraping and a worm

Spectrum game

“It’s like a buddy movie,” she said.

Not quite the kind of story lead I’m used to. But what do you expect if you employ journalists in a tech startup?

“Tell them about that computer game of his that you bought with your pocket money.”

She means the one with the risqué name.

I think I’d rather tell you about screen scraping, and why it is fundamental to the nature of data.

About how Julian spent almost a decade scraping himself to death until deciding to step back out and build a tool to make it easier.

I’ll give one example.

Two boys

In 2003, Julian wanted to know how his MP had voted on the Iraq war.

The lists of votes were there, on the www.parliament.uk website. But buried behind dozens of mouse clicks.

Julian and I wrote some software to read the pages for us, and created what eventually became TheyWorkForYou.

We could slice and dice the votes, mix them with some knowledge from political anoraks, and create simple sentences. Mini computer-generated stories.

“Louise Ellman voted very strongly for the Iraq war.”

You can see it, and other stories, there now. Try the postcode of the ScraperWiki office, L3 5RF.

I remember the first lobbyist I showed it to. She couldn’t believe it. Decades of work done in an instant by a computer. An encyclopedia of data there in a moment.

Web Scraping

It might seem like a trick at first, as if it was special to Parliament. But actually, everyone does this kind of thing.

Google search is just a giant screen scraper, with one secret sauce algorithm guessing its ranking data.

Facebook uses scraping as a core part of its viral growth to let users easily import their email address book.

There’s lots of messy data in the world. Talk to a geek or a tech company, and you’ll find a screen scraper somewhere.

Why is this?

It’s Tautology

On the surface, screen scrapers look just like devices to work round incomplete IT systems.

Parliament used to publish quite rough HTML, and certainly had no database of MP voting records. So yes, scrapers are partly a clever trick to get round that.

But even if Parliament had published it in a structured format, their publishing would never have been quite right for what we wanted to do.

We still would have had to write a data loader (search for ‘ETL’ to see what a big industry that is). We still would have had to refine the data, linking to other datasets we used about MPs. We still would have had to validate it, like when we found the dead MP who voted.

It would have needed quite a bit of programming, that would have looked very much like a screen scraper.

And then, of course, we still would have had to build the application, connecting the data to the code that delivered the tool that millions of wonks and citizens use every year.

Core to it all is this: When you’re reusing data for a new purpose, a purpose the original creator didn’t intend, you have to work at it.

Put like that, it’s a tautology.

A journalist doesn’t just want to know what the person who created the data wanted them to know.

Scrape Through

So when Julian asked me to be CEO of ScraperWiki, that’s what went through my head.

Secrets buried everywhere.

The same kind of benefits we found for politics in TheyWorkForYou, but scattered across a hundred countries of public data, buried in a thousand corporate intranets.

If only there was a tool for that.

A Worm

And what about my pocket money?

Nicola was talking about Fat Worm Blows a Sparky.

Julian’s boss’s wife gave it its risqué name while blowing bubbles in the bath. It was 1986. Computers were new. He was 17.

Fat Worm cost me £9.95. I was 12.

[Loading screen]

I was on at most £1 a week, so that was ten weeks of savings.

Luckily, the 3D graphics were incomprehensibly good for the mid 1980s. Wonder who the genius programmer is.

I hadn’t met him yet, but it was the start of this story.

April 14 2011

15:25

DonorsChoose.Org Competition: Enter to use technology to improve education

Are you a data analyst or a developer interested in improving public education in the United States? Submit your ideas and you can get to work with the DonorsChoose.org Hacking Education Competition!

To get a chance to work with DonorsChoose.org you would have to be chosen from the top finishers in each of the seven categories: Data Analysis, Javascript, .Net, PHP, Python, Ruby, Wildcard.

To participate:


The deadline for submissions is June 30, 2011, midnight PST.

Even if you don’t become The Big Winner, DonorsChoose.org still offers awards in each category. Remember to focus on apps or analyses that have the greatest potential to engage the public and impact education.


Since DonorsChoose.org was founded, it has enabled more than 165,000 teachers at 43,000 public schools to post over 300,000 classroom project requests, inspiring $80 million in giving from 400,000 donors. To further improve public education in the United States, DonorsChoose.org wants to build a community of data crunchers and developers. Launching this contest is one way to do that.

Learn more about the competition here: http://www.donorschoose.org/hacking-education

April 12 2011

08:07

Hacks & Hackers Glasgow: the BBC College of Journalism video

Last month we celebrated the final leg of our UK & Ireland Hacks & Hackers tour in Glasgow, at an event hosted by BBC Scotland and supported by BBC College of Journalism and Guardian Open Platform. You can read more about it here. Other coverage includes:

The BBC College of Journalism kindly filmed the whole thing and the videos are now available to watch. The whole playlist can be viewed here, or watch each segment in the clips below:


April 11 2011

18:22

Scrape it – Save it – Get it

I imagine I’m talking to a load of developers. Which is odd seeing as I’m not a developer. In fact, I decided to lose my coding virginity by riding the ScraperWiki digger! I’m a journalist interested in data as a beat so all I need to do is scrape. All my programming will be done on ScraperWiki, as such this is the only coding home I know. So if you’re new to ScraperWiki and want to make the site a scraping home-away-from-home, here are the basics for scraping, saving and downloading your data:
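Roughly speaking, the whole cycle fits in a few lines of Python like the sketch below (the URL and field names are placeholders, not a real dataset):

import scraperwiki
import lxml.html

# Scrape it: fetch a page (placeholder URL)
html = scraperwiki.scrape("http://example.com/somepage")

# Parse out something simple, here just the page title
root = lxml.html.fromstring(html)
record = {"url": "http://example.com/somepage", "title": root.findtext(".//title")}

# Save it: write the record into the ScraperWiki datastore
scraperwiki.sqlite.save(["url"], record)

# Get it: download as CSV from the scraper page, or pull the data out
# through the "Explore with ScraperWiki API" button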

With these three simple steps you can take advantage of what ScraperWiki has to offer – writing, running and debugging code in an easy to use editor; collaborative coding with chat and user viewing functions; a dashboard with all your scrapers in one place; examples, cheat sheets and documentation; a huge range of libraries at your disposal; a datastore with API callback; and email alerts to let you know when your scrapers break.

So give it a go and let us know what you think!


April 04 2011

18:07

#newschallenge: ScraperWiki honoured to be invited to the Knight’s roundtable!

Last Thursday we were very excited to find out that we have been shortlisted by the Knight Foundation for its prestigious #newschallenge competition.

This is a very big deal for us as it is such a highly respected organisation internationally. Over 1600 applications were entered in the first round. This was whittled down in the fourth round to a final shortlist of about two dozen, including ScraperWiki.  Ten to twelve of the semi-finalists will receive awards so the odds are reasonable…

..and what exactly do we plan to do with the money?
Our application is in two parts. The aim is for ScraperWiki to become even more media friendly and to make an impact in the US – by actually being there.

In Part I we want to improve the ScraperWiki platform for journalists directly, by providing: an ability to embargo stories; a ‘data on demand’ request facility; and email alerts that advise on a potential story… plus more.

Part II is about a series of ‘Journalism Data Camps’ that will liberate local, state and federal data. We want to run these across 12 states with the top schools of journalism, and in partnership with a host of wonderful organisations. We plan to work with: Sunlight Foundation, CodeforAmerica, Online News Association, ProPublica, Hacks and Hackers, Hacks for Democracy, Chicago Tribune, The New York Times, IRE, the Center for Investigative Reporting, the Center for Public Integrity and Spot.Us.

…and our very worthy competitors?
The competition is tough and we are pleased to be in the company of organisations like Newscloud, Overview, Recast, The Tiziano Project, and PANDA. This is not an exhaustive list and we wish everyone the best. We know that the next phase will be very hard as we have to do more to prove our worth, but we are up for it at ScraperWiki and we intend to put our best ‘truck’ forward! Game on!

When will we find out if we are good enough?
At the end of June 2011 winners will be announced at the Knight/MIT conference in the US. Needless to say that all fingers and toes at ScraperWiki will be crossed!  Thank you to everyone from our hacks and hackers community who lent weight to our application – your contribution is highly valued. And a big thank you to the KNC judges for all their work reviewing applications and for helping us get this far. Roll on June!


March 25 2011

11:25

OpenCorporates partners with ScraperWiki & offers bounties for open data scrapers

This is a guest post by Chris Taggart, co-founder of OpenCorporates

When we started OpenCorporates it was to solve a real need that we and a number of other people in the open data community had: whether it’s Government spending, subsidy info or court cases, we needed a database of corporate entities to match against, and not just for one country either.

But we knew from the first that we didn’t want this to be some heavily funded monolithic project that threw money at the problem in order to create a walled garden of new URIs unrelated to existing identifiers. It’s also why we wanted to work with existing projects like OpenKvK, rather than trying to replace them.

So the question was, how do we make this scale, and at the same time do the right thing – that is work with a variety of different people using different solutions and different programming languages. The answer to both, it turns out, was to use open data, and the excellent ScraperWiki.

How does it work? Well, the basics we need in order to create a company record at OpenCorporates is the company number, the jurisdiction and the company’s name. (If there’s a status field — e.g. dissolved/active — company type or url for more data, that’s a bonus). So, all you need to do is write a scraper for a country we haven’t got data for, name the fields in a standard way (CompanyName, CompanyNumber, Status, EntityType, RegistryUrl, if the url of the company page can’t be worked out from the company number), and bingo, we can pull it into OpenCorporates, with just a couple of lines of code.
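In a Python scraper, for example, the save step might look something like this sketch (the registry values are invented; only the heading names follow the convention just described):

import scraperwiki

# One record per company, using the standard headings OpenCorporates expects
company = {
    "CompanyNumber": "012345C",                  # invented example number
    "CompanyName": "Example Trading Ltd",
    "Status": "Live",
    "EntityType": "Limited Company",
    "RegistryUrl": "http://example-registry.gov/companies/012345C",
}
scraperwiki.sqlite.save(["CompanyNumber"], company)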

Let’s have a look at one we did earlier: the Isle of Man (there are also ones for Gibraltar, Ireland and, in the US, the District of Columbia). It’s written in Ruby, because that’s what we at OpenCorporates code in, but ScraperWiki allows you to write scrapers in Python or PHP too, and the important thing here is the data, not the language used to produce it.

The Isle of Man company registry website is a .Net system which uses all sorts of hidden fields and other nonsense in its forms and navigation. This is normally a bit of a pain, but because you can use the Ruby Mechanize library to submit forms found on the pages (there’s even a tutorial scraper which shows how to do it), it becomes fairly straightforward.

The code itself should be fairly readable to anyone familiar with Ruby or Python, but essentially it tackles the problem by doing multiple searches for companies beginning with two letters, starting with ‘aa’ then ‘ab’ and so on. For each letter pair it iterates through each page of results in turn, and each page is scraped to extract the data, which is saved under the standardised headings. That’s it.

In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported into OpenCorporates.

However, that’s not all. In order to kickstart the effort OpenCorporates (technically Chrinon Ltd, the micro start-up that’s behind OpenCorporates) is offering a bounty for new jurisdictions opened up.

It’s not huge (we’re a micro-startup remember): £100 for any jurisdiction that hasn’t been done yet, £250 for those territories we want to import sooner rather than later (Australia, France, Spain), and £500 for Delaware (there’s a captcha there, so not sure it’s even possible), and there’s an initial cap of £2500 on the bounty pot (details at the bottom of this post).

However, often the scrapers can be written in a couple of hours, and it’s worth stressing again that neither the code nor the data will belong to OpenCorporates, but to the open data community, and if people build other things on it, so much the better. Of course we think it would make sense for them to use the OpenCorporates URIs to make it easy to exchange data in a consistent and predictable way, but, hey, it’s open data ;-)

Small, simple pieces, loosely connected, to build something rather cool. So now you can do a search for, oh say Barclays, and get this:

The bounty details: how it works

Find a country/company registry that you fancy opening up the data for (here are a couple of lists of registries). Make sure it’s from the official registry, and not a commercial reseller. Check too that no-one has already written one, or is in the middle of writing one, by checking the scrapers tagged with opencorporates (be nice, and respect other people’s attempts, but feel free to start one if it looks as if someone’s given up on a scraper).

All clear? Go ahead and start a new scraper (useful tutorials here). Call it something like trial_fr_company_numbers (until it’s done and been OK’d) and get coding, using the headings detailed above for the CompanyNumber, CompanyName etc. When it’s done, and it’s churning away pulling in data, email us info@opencorporates.com, and assuming it’s OK, we’ll pay you by Paypal, or by bank transfer (you’ll need to give us an invoice in that case). If it’s not we’ll add comments to the scraper. Any questions, email us at info@opencorporates.com, and happy scraping.

February 22 2011

07:01

New event! Hacks & Hackers Glasgow (#hhhglas)

Calling journalists, bloggers, programmers and designers in Scotland!

Scraperwiki is pleased to announce another hacks & hackers hack day: in Glasgow. BBC Scotland is hosting and sponsoring the one day event, with support from BBC College of Journalism. As with our other UK hack days, Guardian Open Platform is providing the prizes.

Web developers and designers will pair up with journalists and bloggers to produce a number of projects and stories based on public data. It’s completely free (food provided) and open to both BBC and non BBC staff. It will take place at the Viewing Theatre, Pacific Quay, Glasgow on Friday 25 March 2011.

Any questions? Please email judith@scraperwiki.com.


February 10 2011

17:07

Spot and Normalize Inconsistent Measures

Here’s an example of why you have to be very careful when scraping, and why your normal run-of-the-mill technology that makes assumptions won’t cut it:

One of our super-users, Julian Todd, decided to scrape the Vehicle Certification Agency (VCA) website on new car fuel consumption and exhaust emissions figures. And he spotted this:

And another search resulted in this:

Yes, that’s a change from milligrams per km to grams per km, noted only in the header.

In ScraperWiki we can normalize this in standard Python code:

# Convert any "... mg km" columns into "... g km" by dividing the value by 1000
for key in data.keys():
    if key[-6:] == " mg km":
        nkey = key[:-6] + " g km"
        v = data.pop(key)
        if v == None:
            data[nkey] = None
        else:
            data[nkey] = float(v) / 1000

This is from the scraper:
http://scraperwiki.com/scrapers/vca-car-fuel-data/

February 09 2011

08:11

New event! Hacks and Hackers Hack Day Cardiff (#hhhCar)

The UK Hacks & Hackers tour carries on – into 2011. Our first stop: Wales.

ScraperWiki, which provides award-winning tools for screen scraping, data mining and visualisation, will hold a one-day practical hack day* at the Atrium in Cardiff on Friday 11 March 2011.

Web developers and designers will pair up with journalists and bloggers to produce a number of projects and stories based on public data.

We would like to thank our main sponsor Skillset Cymru, our hosts the Atrium and our prize sponsors Guardian Local, Guardian Open Platform and Cardiff School of Media, Journalism and Cultural Studies for making the event possible.

“Skillset Cymru is very pleased to be supporting the Cardiff Scraperwiki Hacks and Hackers Hack Day this March,” says Gwawr Hughes, director, Skillset Cymru.

“This exciting event will bring journalists and computer programmers and designers together to explore the scraping, storage, aggregation, and distribution of public data in more useful, structured formats.

“It is at the forefront of data journalism and should be of great interest to the media industry across the board here in Wales.”

More details

Who’s it for? We hope to attract ‘hacks’ and ‘hackers’ from all different types of backgrounds: people from big media organisations, as well as individual online publishers and freelancers.

What will I get out of it?
The aim is to show journalists how to use programming and design techniques to create online news stories and features; and vice versa, to show programmers how to find, develop, and polish stories and features. To see what happened at our past events in Liverpool and Birmingham visit the ScraperWiki blog. Here’s a video showing what happened in Belfast.

How much? NOTHING! It’s absolutely free, thanks to our sponsors. Food and refreshments will be provided throughout the day. If you have special dietary requirements please email judith [at] scraperwiki.com.

What should I bring? We would encourage people to come along with ideas for local ‘datasets’ that are of interest. In addition we will create a list of suggested data sets at the introduction on the morning of the event but flexibility is key for this event. If you have a laptop, please bring this too.

So what exactly will happen on the day? Armed with their laptops and WIFI, journalists and developers will be put into teams of around four to develop their ideas, with the aim of finishing final projects that can be published and shared publicly. Each team will then present their project to the whole group. Winners will receive prizes at the end of the day.

*Not sure what a hack day is? Let’s go with the Wikipedia definition: it’s “an event where developers, designers and people with ideas gather to build ‘cool stuff’”…

With thanks to our sponsors:

Keep an eye on the ScraperWiki blog for details about Scraperwiki events. Hacks & Hackers Hack Day Glasgow is scheduled for March 25 2011. For additional information please contact judith [at] scraperwiki.com.

January 28 2011

16:44

Ruby screen scraping tutorials

Mark Chapman has been busy translating our Python web scraping tutorials into Ruby.

They now cover three tutorials on how to write basic screen scrapers, plus extra ones on using .ASPX pages, Excel files and CSV files.

We’ve also installed some extra Ruby modules – spreadsheet and FasterCSV – to make them possible.

These Ruby scraping tutorials are made using ScraperWiki, so you can of course do them from your browser without installing anything.

Thanks Mark!


January 04 2011

20:48

Views part 2 – Lincoln Council committees

(This is the second of two posts announcing ScraperWiki “views”, a new feature that Julian, Richard and Tom worked away on and quietly launched a couple of months ago. Once you’ve scraped your data, how can you get it out again in just the form you want? See also: Views part 1 – Canadian weather stations.)

Lincoln Council committee updates

Sometimes you don’t want to output a visualisation, but instead some data in a specific form for use by another piece of software. You can think of this as using the ScraperWiki code editor to write the exact API you want on the server where the data is. This saves the person providing the data having to second guess every way someone might want to access it.

Andrew Beekan, who works at Lincoln City Council, has used this to make an RSS feed for their committee meetings. Their CMS software doesn’t have this facility built in, so he has to use a scraper to do it.

First he wrote a scraper in ScraperWiki for a “What’s new” search results page from Lincoln Council’s website. This creates a nice dataset containing the name, date and URL of each committee meeting. Next Andrew made a ScraperWiki view and wrote some Python to output exactly the XML that he wants.
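A rough sketch of what such a view can look like (the scraper name and field names here are placeholders, not Andrew’s actual code): it pulls the committee data back out through the external API and prints RSS:

import urllib, json

# Fetch the committee rows from the datastore via the external API (placeholder scraper name)
api = ("https://api.scraperwiki.com/api/1.0/datastore/sqlite"
       "?format=jsondict&name=lincoln_committee_meetings"
       "&query=" + urllib.quote("select name, date, url from `swdata` order by date desc limit 20"))
meetings = json.load(urllib.urlopen(api))

# A ScraperWiki view simply prints its output, so print the RSS document
print '<?xml version="1.0" encoding="UTF-8"?>'
print '<rss version="2.0"><channel><title>Lincoln committee meetings</title>'
for m in meetings:
    print "<item><title>%s (%s)</title><link>%s</link></item>" % (m["name"], m["date"], m["url"])
print "</channel></rss>"

(In practice the text would need XML-escaping, but the shape of the view is the same.)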

Andrew then wraps the RSS feed in Feedburner for people who want email updates. This is all documented in the Council’s data directory. They used to use Yahoo Pipes to do this, but Andrew is finding ScraperWiki easier to maintain, even though some knowledge of programming is required.

Since then, Andrew has gone on to make a map for the Lincoln decent homes scheme, also using ScraperWiki views – he’s written a blog post about it.

