January 20 2012


How to stop missing the good weekends

[The BBC's Michael Fish presenting the weather in the 80s, with a ScraperWiki tractor superimposed over Liverpool]

Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to play on my dive kayaks, an activity which is somewhat weather dependent.

Luckily, help is at hand in the form of the ScraperWiki email alert system.

As you may have noticed, when you do any work on ScraperWiki, you start to receive daily emails that go:

Dear Julian_Todd,

Welcome to your personal ScraperWiki email update.

Of the 320 scrapers you own, and 157 scrapers you have edited, we
have the following news since 2011-12-01T14:51:34:

Histparl MP list - https://scraperwiki.com/scrapers/histparl_mp_list :
  * ran 1 times producing 0 records from 2 pages
  * with 1 exceptions, (XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<!DOCTYP')

...Lots more of the same

This concludes your ScraperWiki email update till next time.

Please follow this link to change how often you get these emails,
or to unsubscribe: https://scraperwiki.com/profiles/edit/#alerts

The idea behind this is to attract your attention to matters you may be interested in — such as fixing those poor dear scrapers you have worked on in the past and are now neglecting.

As with all good features, this was implemented as a quick hack.

I thought: why design a whole email alert system, with special options for daily and weekly emails, when we already have a scraper scheduling system which can do just that?

With the addition of a single flag to designate a scraper as an emailer (plus a further 20 lines of code), a new fully fledged extensible feature was born.

Of course, this is not counting the code that is in the Wiki part of ScraperWiki.

The default code in your emailer looks roughly like so:

import scraperwiki
emaillibrary = scraperwiki.utils.swimport("general-emails-on-scrapers")
subjectline, headerlines, bodylines, footerlines = emaillibrary.EmailMessageParts("onlyexceptions")
if bodylines:
    print "\n".join([subjectline] + headerlines + bodylines + footerlines)

As you can see, it imports the 138 lines of Python from general-emails-on-scrapers, which I am not here to talk about right now.

Using ScraperWiki emails to watch the weather

Instead, what I want to explain is how I inserted my Good Weather Weekend Watcher by polling the weather forecast for Holyhead.

My extra code goes like this:

import datetime
import urllib

import lxml.html

weatherlines = [ ]
if datetime.date.today().weekday() == 2:  # Wednesday
    url = "http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html"
    html = urllib.urlopen(url).read()
    root = lxml.html.fromstring(html)
    # one table row per forecast day
    rows = root.cssselect("div.tableWrapper table tr")
    for row in rows:
        #print lxml.html.tostring(row)
        metweatherline = row.text_content().strip()
        if metweatherline[:3] == "Sat":
            # subjectline comes from the default emailer code above
            subjectline += " With added weather"
            weatherlines.append("*** Weather warning for the weekend:")
            weatherlines.append("   " + metweatherline)

What this does is check if today is Wednesday (day of the week #2 in Python land), then it parses through the Met Office Weather Report table for my chosen location, and pulls out the row for Saturday.
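The two checks in that snippet (is today Wednesday, and does a row start with "Sat") can be tried out offline. The following is a minimal sketch written for Python 3, with placeholder strings standing in for the Met Office table rows; `is_wednesday` and `saturday_rows` are hypothetical helper names, not part of ScraperWiki.

```python
import datetime


def is_wednesday(day):
    """Return True when the given date falls on a Wednesday (weekday() == 2)."""
    return day.weekday() == 2


def saturday_rows(row_texts):
    """Keep only the rows whose stripped text starts with 'Sat'."""
    return [text.strip() for text in row_texts if text.strip().startswith("Sat")]


# Placeholder rows, not real forecast data.
rows = ["  Fri 9Dec ...", "  Sat 10Dec ...", "  Sun 11Dec ..."]

if is_wednesday(datetime.date(2011, 12, 7)):  # 7 December 2011 was a Wednesday
    print(saturday_rows(rows))
```

Passing the date in as an argument, rather than calling `datetime.date.today()` inside the helper, makes the check easy to test on any day of the week.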

Finally we have to handle producing the combined email message, the one which can contain either a set of broken scraper alerts, or the weather forecast, or both.

if bodylines or weatherlines:
    if not bodylines:
        headerlines, footerlines = [ ], [ ]   # kill off cruft surrounding no message
    print "\n".join([subjectline] + weatherlines + headerlines + bodylines + footerlines)

The current state of the result is:

*** Weather warning for the weekend:
  Mon 5Dec

  7 °C
  33 mph
  47 mph
  Very Good

This was a very quick low-level implementation of the idea with no formatting and no filtering yet.

Email alerts can quickly become sophisticated and complex. Maybe I should only send a message out if the wind is below a certain speed. Should I monitor previous days’ weather to predict whether the sea will be calm? Or I could check the wave heights on the off-shore buoys? Perhaps my calendar should be consulted for prior engagements so I don’t get frustrated by being told I am missing out on a good weekend when I had promised to go to a wedding.
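As an illustration of the first of those ideas, here is a rough Python 3 sketch of a wind-speed filter. It assumes the row text looks like the sample email above, with figures written as "NN mph"; the helper names and the 20 mph limit are made up for the example.

```python
import re


def wind_speeds_mph(row_text):
    """Return every 'NN mph' figure found in a forecast row, as integers."""
    return [int(value) for value in re.findall(r"(\d+)\s*mph", row_text)]


def calm_enough(row_text, limit_mph):
    """True only if the row has wind figures and none exceeds the limit."""
    speeds = wind_speeds_mph(row_text)
    return bool(speeds) and max(speeds) <= limit_mph


# The row text shown in the sample email above.
sample = "Mon 5Dec\n\n7 °C\n33 mph\n47 mph\nVery Good"

print(wind_speeds_mph(sample))   # the mph figures found in the row
print(calm_enough(sample, 20))   # 20 mph is an arbitrary illustrative limit
```

A row with no wind figures at all is treated as "not calm enough", so a change in the page layout would suppress the alert rather than send a misleading one.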

The possibilities are endless and so much more interesting than if we’d implemented this email alert feature in the traditional way, rather than taking advantage of the utterly unique platform that we happened to already have in ScraperWiki.

January 06 2012



Guest post by Makoto Inoue, a Japanese ScraperWiki user




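Many websites these days offer an API (Application Programming Interface) for serving up data easily, so plenty of people may wonder why anything like this is still needed. But during the recent Great East Japan Earthquake, the earthquake and electricity bulletins, and the government statistics needed to understand the damage in each area, were not available through APIs, and I suspect many developers ended up writing their own scraper programs. It is a great pity when programs created through the goodwill of so many developers end up scattered across various sites, or eventually stop being maintained.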



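ScraperWiki is a UK startup that provides a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapes can also be run on a schedule, and the data they collect is stored on ScraperWiki. Because ScraperWiki provides an API, that data can be reused on other sites through it.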











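Further down the page you can browse the data in spreadsheet form, but that alone makes it hard to reuse on other sites. When that is what you need, try clicking the "Explorer with API" button. At the end of that page you should find a URL like the following.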


Visiting this URL returns the same data as JSON (JavaScript Object Notation). Other output formats such as CSV, RSS and HTML tables are also supported, and you can filter the results with SQL statements.

select * from `swdata` where party = '民主'
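To see what a filter like this does without calling the ScraperWiki API, here is a small Python 3 sketch that runs the same kind of query against an in-memory SQLite table. The `swdata` table name and `party` column come from the query above; the `name` column and all the rows are placeholders invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (name TEXT, party TEXT)")
# Placeholder rows; the names are invented for illustration.
conn.executemany(
    "INSERT INTO swdata (name, party) VALUES (?, ?)",
    [("Member A", "民主"), ("Member B", "自民"), ("Member C", "民主")],
)

# Same filter as the query above, with the party passed as a parameter.
rows = conn.execute(
    "SELECT * FROM swdata WHERE party = ? ORDER BY name", ("民主",)
).fetchall()
print(rows)
```

Passing the party as a `?` parameter avoids quoting problems with non-ASCII values.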


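Press your browser's "Back" button and look further down the earlier page, below the spreadsheet. Under "This Scraper in Context" there is an item called "Copied To". It shows that this source code has been copied and is being used for other purposes.

There you will see "makoto / Members of the House of Councillors of Japan", so click on it. This is in fact a scraper I wrote to extract the list of members of the House of Councillors. The House of Representatives and the House of Councillors each have their own website, but their member list pages looked quite similar, so I thought the code could be reused without much trouble.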






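  • The encoding (how the characters are represented) differs: UTF versus Shift-JIS
  • The House of Representatives list is spread over several pages, while the House of Councillors list is on a single page
  • On the House of Representatives page, member names carry the honorific "kun"; the House of Councillors page lists both stage names and real names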
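The first difference in that list is the kind of thing that trips up a copied scraper. A minimal Python 3 sketch of the idea, assuming "UTF" means UTF-8 and using a hypothetical `decode_page` helper:

```python
def decode_page(raw, encoding):
    """Decode raw page bytes with the encoding that site is known to use."""
    return raw.decode(encoding)


text = "参議院"  # "House of Councillors"

utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

# The same text is a different byte sequence in each encoding.
print(utf8_bytes != sjis_bytes)

# Decoding with the matching codec recovers the original text.
print(decode_page(sjis_bytes, "shift_jis") == text)
```

Decoding Shift-JIS bytes as UTF-8 (or the reverse) either fails or produces garbage, so the encoding has to be set per site.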


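Making these changes does of course require some programming knowledge. Even so, because it lets you take a working sample and customise it a little for your own needs, ScraperWiki may well be ideal teaching material for people who want to learn programming. I had not used XPath much myself, but I was able to pick it up fairly easily by working from the original program.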



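Public bodies, the media and government agencies are making progress on publishing information over the internet, but at present not enough sites are designed with "reuse of data in mashups" in mind. ScraperWiki is working to shake up that situation, and awareness of it is gradually growing among journalists and government officials in Europe. ScraperWiki is currently planning workshops in the United States, and is also preparing to start workshops in Japan. If you are interested, please feel free to get in touch via the contact page.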


December 07 2011


How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the immovable force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia api site doesn’t accept.
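For readers on Python 3, one possible alternative to joining the query string by hand is `urllib.parse.urlencode`, which accepts a `safe` argument that can be used to leave `|` unescaped. This is a sketch rather than the code the scraper uses, and it only builds the URL; nothing is fetched.

```python
from urllib.parse import urlencode

params = {
    "format": "xml",
    "action": "query",
    "prop": "revisions",
    "rvprop": "timestamp|user|comment|content",
    "titles": "Aquamole Pot",
}

# safe="|" leaves the pipe characters unescaped; spaces still become "+".
qs = urlencode(params, safe="|")
url = "http://en.wikipedia.org/w/api.php?%s" % qs
print(url)
```

The rest of the escaping (spaces, ampersands, non-ASCII titles) is still handled by the library, which is the part that is easy to get wrong when building the string manually.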

The result is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England wih which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
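The string-to-number conversion mentioned above could look something like this. It is a minimal Python 3 sketch with a hypothetical `to_number` helper; blank or non-numeric fields become `None` rather than raising an error.

```python
def to_number(value):
    """Convert an infobox string such as '113' to a float, or None if blank or not numeric."""
    value = (value or "").strip()
    if not value:
        return None
    try:
        return float(value)
    except ValueError:
        return None


# A few fields copied from the infobox dictionary above.
infobox = {"name": "Aquamole Pot", "depth_metres": "113",
           "length_metres": "142", "coordinates": ""}

record = dict(infobox)
for key in ("depth_metres", "length_metres"):
    record[key] = to_number(infobox.get(key))

print(record["depth_metres"], record["length_metres"])
```

Infobox values are typed in by hand on Wikipedia, so a tolerant converter is safer than a bare `float()` call.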

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is then inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.
So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")
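The internals of GetWikipediaCategoryRecurse are not shown here, but the general idea of walking nested categories can be sketched as follows. This Python 3 example uses hypothetical in-memory dictionaries in place of the API calls, and `walk_categories` is a made-up name.

```python
def walk_categories(root, subcategories, pages):
    """Collect page titles under root, visiting each category only once."""
    seen, found, stack = set(), [], [root]
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # guards against categories that contain each other
        seen.add(category)
        found.extend(pages.get(category, []))
        stack.extend(subcategories.get(category, []))
    return sorted(set(found))


# Hypothetical stand-ins for what the API calls would return,
# using category names mentioned in this post.
subcategories = {
    "Caves_of_the_United_Kingdom": ["Caves_of_England"],
    "Caves_of_England": ["Caves_of_Yorkshire"],
    "Caves_of_Yorkshire": ["Caves_of_North_Yorkshire"],
}
pages = {"Caves_of_North_Yorkshire": ["Aquamole_Pot"]}

print(walk_categories("Caves_of_the_United_Kingdom", subcategories, pages))
```

An explicit stack plus a `seen` set keeps the walk safe even if two categories list each other as subcategories.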

All of this adds up to my current scraper wikipedia_longest_caves that extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length to create this table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres desc:

name                   location_area                   length_metres  depth_metres
Ease Gill Cave System  United Kingdom Yorkshire Dales  66000.0        137.0
Dan-yr-Ogof            Wales                           15500.0
Gaping Gill            United Kingdom Yorkshire Dales  11600.0        105.0
Swildon’s Hole         Somerset                        9144.0         167.0
Charterhouse Cave      Somerset                        4868.0         228.0
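The ordering step can be reproduced locally with Python 3's built-in sqlite3 module. This sketch loads the five rows shown above into an in-memory table; the column types are an assumption, since the scraper's actual schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE caveinfo (name TEXT, location_area TEXT, "
    "length_metres REAL, depth_metres REAL)"
)
# Rows copied from the table above; None marks the missing depth.
conn.executemany(
    "INSERT INTO caveinfo VALUES (?, ?, ?, ?)",
    [
        ("Gaping Gill", "United Kingdom Yorkshire Dales", 11600.0, 105.0),
        ("Charterhouse Cave", "Somerset", 4868.0, 228.0),
        ("Ease Gill Cave System", "United Kingdom Yorkshire Dales", 66000.0, 137.0),
        ("Dan-yr-Ogof", "Wales", 15500.0, None),
        ("Swildon's Hole", "Somerset", 9144.0, 167.0),
    ],
)

names = [row[0] for row in conn.execute(
    "SELECT name FROM caveinfo ORDER BY length_metres DESC")]
print(names)
```

Storing the lengths as numbers matters here: as text, "9144.0" would sort ahead of "66000.0" because the comparison goes character by character.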

If I was being smart I could make the scraping adaptive, that is, only updating the pages that have changed since the last scrape, by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.
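A rough Python 3 sketch of what adaptive scraping might involve: compare each page's last-changed timestamp with the one recorded on the previous run, and re-fetch only the pages that differ. The timestamps and the `pages_to_refresh` helper are hypothetical; the fields actually returned by GetWikipediaCategoryRecurse() are not shown in this post.

```python
def pages_to_refresh(current, previous):
    """Return titles that are new or whose timestamp differs from last time.

    Both arguments map page title -> last-changed timestamp string.
    """
    return sorted(
        title for title, stamp in current.items()
        if previous.get(title) != stamp
    )


# Hypothetical timestamps for illustration only.
previous = {"Aquamole_Pot": "2011-11-01T10:00:00Z",
            "Gaping_Gill": "2011-10-05T09:30:00Z"}
current = {"Aquamole_Pot": "2011-12-06T18:20:00Z",
           "Gaping_Gill": "2011-10-05T09:30:00Z",
           "Rowten_Pot": "2011-12-01T12:00:00Z"}

print(pages_to_refresh(current, previous))
```

Comparing for inequality rather than "newer than" means a page is also refreshed if its timestamp somehow moves backwards, which errs on the side of re-scraping.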

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t need updates more often than every 6 months; a delay like that prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t mind being stuck with the naïve semantic web notion that the boundary between entities is a simple, straightforward and general concept, rather than what it really is: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.

September 16 2011


Driving the Digger Down Under


Henare here from the OpenAustralia Foundation – Australia’s open data, open government and civic hacking charity. You might have heard that we were planning to have a hackfest here in Sydney last weekend. We decided to focus on writing new scrapers to add councils to our PlanningAlerts project that allows you to find out what is being built or knocked down in your local community. During the two afternoons over the weekend seven of us were able to write nineteen new scrapers, which cover an additional 1,823,124 Australians, a huge result.

There are a number of reasons why we chose to work on new scrapers for PlanningAlerts. ScraperWiki lowers the barrier of entry for new contributors by allowing them to get up and running quickly with no setup – just visit a web page. New scrapers are also relatively quick to write which is perfect for a hackfest over the weekend. And finally, because we have a number of working examples and ScraperWiki’s documentation, it’s conceivable that someone with no programming experience can come along and get started.

It’s also easy to support people writing scrapers in different programming languages using ScraperWiki. PlanningAlerts has always allowed people to write scrapers in whatever language they choose by using an intermediate XML format. With ScraperWiki this is even simpler because as far as our application is concerned it’s just a ScraperWiki scraper – it doesn’t even know what language the original scraper was written in.

Once someone has written a new scraper and formatted the data according to our needs, it’s a simple process for us to add it to our site. All they need to do is let us know, we add it to our list of planning authorities and then we automatically start to ask for the data daily using the ScraperWiki API.

Another issue is maintenance of these scrapers after the hackfest is over. Lots of volunteers only have the time to write a single scraper, maybe to support their local community. What happens when there’s an issue with that scraper but they’ve moved on? With ScraperWiki anyone can now pick up where they left off and fix the scraper – all without us ever having to get involved.

It was a really fun weekend and hopefully we’ll be doing this again some time. If you’ve got friends or family in Australia, don’t forget to tell them to sign up for PlanningAlerts.


OpenAustralia Foundation volunteer

July 04 2011


We Eat Data – ScraperWiki talk at Open Knowledge Conference 2011

Our tamed computer programmer, ‘The Julian’, recently gave a rare appearance at the Open Knowledge Conference in Berlin (if you want an appearance pay us or ask us!). The spectacle of such scraping royalty drew more people than the room could accommodate (‘The Julian’ is not related to any royals living or deceased). As such I have included the slides here:

We were honoured to be amongst an outstanding line-up of speakers. We also ran a workshop the week of the conference and you can see the German data we scraped into ScraperWiki on the OKCon2011 tag.

What was most interesting about the workshop is that we see the same types of data needed for similar projects wherever we go. Tobias Escher wants to do something similar to AlphaGov for Germany called Meine Demokratie. A lot of very simple little scrapers can go a long way and if there’s anyone looking to play around with scraping and ScraperWiki, or who would like to lend a coding hand to a worthy cause, please click the above link.

‘The Julian’ was also looking for a scraping challenge and the workshop gnomes found Berlin schools data. I showed those in attendance one of my favourite sites made from scrapers:  Schooloscope. So Julian is scraping the data for Berlin schools in various stages and the hope is to get all the data for schools in Germany to make a German schooloscope.

We have one lovely lady very interested in getting this project on its way so if you are willing, if you speak German and if you know where to find them maybe you can scrape German schools data.

So watch out, useful things to know in Germany, including schools – you’re being ScraperWikied!

(As ScraperWiki is being used for better and better things, this will just get harder for me…)

June 13 2011


Why the Government scraped itself

We wrote last month about Alphagov, the Cabinet Office’s prototype, more usable, central Government website. It made extensive use of ScraperWiki.

The question everyone asks – why was the Government scraping its own sites? Let’s take a look.

In total 56 scrapers were used. You can find them tagged “alphagov” on the ScraperWiki website. There are a few more not yet in use, making 66 in total. They were written by 14 different developers from both inside and outside the Alphagov team – more on that process another day.

The bulk of scrapers were there to migrate and combine content, like transcripts of ministerial speeches and details of government consultations. These were then imported into sections of alpha.gov.uk - speeches are here, and the consultations here.

This is the first time, that I know of, that the Government has organised a cross-government view of speeches and consultations. (Although third parties like TellThemWhatYouThink have covered similar ground before). This is vital to citizens who don’t fall into particular departmental categories, but want to track things based on topics that matter to them.

The rest of the scrapers were there to turn content into datasets. You need a dataset to make something more usable.

Two examples:

1. The list of DVLA driving test centres has been turned into the beginnings of a simple app to book a driving test. Compare to the original DfT site here.

2. The UK Bank Holiday data that ScraperWiki user Aubergene scraped last year was improved and used for the alpha.gov.uk Bank Holiday page.

It seems strange at first for a Government to scrape its own websites. It isn’t though. It lets them move quickly (agile!), and concentrate first on the important part – making the experience for citizens as good as possible.

And now, thanks to Alphagov using ScraperWiki, you can download and use all the data yourself – or repurpose the scraping scripts for something else.

Let us know if you do something with it!

May 13 2011


There’s More Than One Way to Scrape a Site

A request came in to ScraperWiki to scrape information on the Members of the European Parliament.  I put it out on Twitter and Facebook hoping a kind member of the ScraperWiki community will have spent so much time on the computer he/she has no life at all. I had to turn people away!

Within minutes, two tweeters wanted to give it a go and I got a reply on Facebook.  In fact, Tim Green had already scraped the names and URLs of MEPs by the time I got back to him saying it had already been claimed on twitter by Pall Hilmarsson.

Although both scrapers are looking at the same site, Tim’s is less than 20 lines of code and, with only 8 revisions, it’s a very quick scrape. Pall’s, by contrast, went for the full shebang, scraping opinions and speeches and generally drilling down into the data a whole lot more. Hence the nearly 200 lines of code!

So if you’re a code junky, take a look at what it takes to scrape and then scrape further by comparing scrapers/meps with scrapers/meps_2. Also, Tim kindly scraped the next request: National Historic Ships Register. To Tim and Pall I say: If the ScraperWiki digger were capable of emotion you would both be receiving a diesel greasy kiss!

European Parliament Members and National Historic Ships – you’ve been ScraperWikied! (with help from your friendly neighbourhood programmers)

May 05 2011


ScraperWiki: A story about two boys, web scraping and a worm

Spectrum game

“It’s like a buddy movie.” she said.

Not quite the kind of story lead I’m used to. But what do you expect if you employ journalists in a tech startup?

“Tell them about that computer game of his that you bought with your pocket money.”

She means the one with the risqué name.

I think I’d rather tell you about screen scraping, and why it is fundamental to the nature of data.

About how Julian spent almost a decade scraping himself to death until deciding to step back out and build a tool to make it easier.

I’ll give one example.

Two boys

In 2003, Julian wanted to know how his MP had voted on the Iraq war.

The lists of votes were there, on the www.parliament.uk website. But buried behind dozens of mouse clicks.

Julian and I wrote some software to read the pages for us, and created what eventually became TheyWorkForYou.

We could slice and dice the votes, mix them with some knowledge from political anoraks, and create simple sentences. Mini computer generated stories.

“Louise Ellman voted very strongly for the Iraq war.”

You can see it, and other stories, there now. Try the postcode of the ScraperWiki office, L3 5RF.

I remember the first lobbyist I showed it to. She couldn’t believe it. Decades of work done in an instant by a computer. An encyclopedia of data there in a moment.

Web Scraping

It might seem like a trick at first, as if it was special to Parliament. But actually, everyone does this kind of thing.

Google search is just a giant screen scraper, with one secret sauce algorithm guessing its ranking data.

Facebook uses scraping as a core part of its viral growth to let users easily import their email address book.

There’s lots of messy data in the world. Talk to a geek or a tech company, and you’ll find a screen scraper somewhere.

Why is this?

It’s Tautology

On the surface, screen scrapers look just like devices to work round incomplete IT systems.

Parliament used to publish quite rough HTML, and certainly had no database of MP voting records. So yes, scrapers are partly a clever trick to get round that.

But even if Parliament had published it in a structured format, their publishing would never have been quite right for what we wanted to do.

We still would have had to write a data loader (search for ‘ETL’ to see what a big industry that is). We still would have had to refine the data, linking to other datasets we used about MPs. We still would have had to validate it, like when we found the dead MP who voted.

It would have needed quite a bit of programming, that would have looked very much like a screen scraper.

And then, of course, we still would have had to build the application, connecting the data to the code that delivered the tool that millions of wonks and citizens use every year.

Core to it all is this: When you’re reusing data for a new purpose, a purpose the original creator didn’t intend, you have to work at it.

Put like that, it’s a tautology.

A journalist doesn’t just want to know what the person who created the data wanted them to know.

Scrape Through

So when Julian asked me to be CEO of ScraperWiki, that’s what went through my head.

Secrets buried everywhere.

The same kind of benefits we found for politics in TheyWorkForYou, but scattered across a hundred countries of public data, buried in a thousand corporate intranets.

If only there was a tool for that.

A Worm

And what about my pocket money?

Nicola was talking about Fat Worm Blows a Sparky.

Julian’s boss’s wife gave it its risqué name while blowing bubbles in the bath. It was 1986. Computers were new. He was 17.

Fat Worm cost me £9.95. I was 12.

[Loading screen]

I was on at most £1 a week, so that was ten weeks of savings.

Luckily, the 3D graphics were incomprehensibly good for the mid 1980s. Wonder who the genius programmer is.

I hadn’t met him yet, but it was the start of this story.

May 03 2011


It’s all a matter of trust

According to the latest Ipsos MORI poll on trust in people, only 1 in 5 people think journalists tell the truth. They’re still more trustworthy than politicians generally and government ministers! Phew.

But telling the truth and being trustworthy are not the same thing. There’s not believing what they say and then there’s knowing that what they say is wrong and doing something about it. Which is why we have the Press Complaints Commission.

Here at ScraperWiki we also have a group of developers that don’t just complain when sites don’t work, they do something about it. That’s what Ben Campbell did for the Press Complaints Commission. He scraped the PCC to produce this site (pictured above) for the Media Standards Trust.

‘Trying to work out basic stuff, like which newspapers are the most complained about, is virtually impossible on the existing PCC site. So we scraped the data to make it easier (oh, and it’s the Daily Mail)’
- Martin Moore (Media Standards Trust)

Just as a news story can be presented in a myriad of ways, so too can data. Some representations are more useful than others. Many have different purposes, a different audience. Others are so buried behind web forms and coding, they can’t reveal a story unless liberated.

Scraping creates a data wire service. And our developers are showing how even creating a simple league table (with realtime updates) can tell a completely different story.

Press Complaints Commission – you’ve been ScraperWikied!

March 25 2011


OpenCorporates partners with ScraperWiki & offers bounties for open data scrapers

This is a guest post by Chris Taggart, co-founder of OpenCorporates

When we started OpenCorporates it was to solve a real need that we and a number of other people in the open data community had: whether it’s Government spending, subsidy info or court cases, we needed a database of corporate entities to match against, and not just for one country either.

But we knew from the first that we didn’t want this to be some heavily funded monolithic project that threw money at the project in order to create a walled garden of new URIs unrelated to existing identifiers. It’s also why we wanted to work with existing projects like OpenKvK, rather than trying to replace them.

So the question was, how do we make this scale, and at the same time do the right thing – that is work with a variety of different people using different solutions and different programming languages. The answer to both, it turns out, was to use open data, and the excellent ScraperWiki.

How does it work? Well, the basics we need in order to create a company record at OpenCorporates are the company number, the jurisdiction and the company’s name. (If there’s a status field, e.g. dissolved/active, a company type or a URL for more data, that’s a bonus.) So, all you need to do is write a scraper for a country we haven’t got data for, name the fields in a standard way (CompanyName, CompanyNumber, Status, EntityType, and RegistryUrl if the URL of the company page can’t be worked out from the company number), and bingo, we can pull it into OpenCorporates, with just a couple of lines of code.

Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia). It’s written in Ruby, because that’s what we at OpenCorporates code in, but ScraperWiki allows you to write scrapers in Python or PHP too, and the important thing here is the data, not the language used to produce it.

The Isle of Man company registry website is a .Net system which uses all sorts of hidden fields and other nonsense in the forms and navigation. This is normally a bit of a pain, but because you can use the Ruby Mechanize library to submit forms found on the pages (there’s even a tutorial scraper which shows how to do it), it becomes fairly straightforward.

The code itself should be fairly readable to anyone familiar with Ruby or Python, but essentially it tackles the problem by doing multiple searches for companies beginning with two letters, starting with ‘aa’ then ‘ab’ and so on, and for each letter pair iterating through each page of results in turn, which in turn is scraped to extract the data, using the standardised headings to save them in.  That’s it.
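The search-term loop described above is easy to sketch. The original scraper is written in Ruby with Mechanize; the following is only a Python 3 illustration of generating the letter pairs and shaping a record with the standard headings. It does no form submission, and the helper names and sample values are hypothetical.

```python
import itertools
import string


def letter_pairs():
    """Yield 'aa', 'ab', ... 'zz' in alphabetical order."""
    for first, second in itertools.product(string.ascii_lowercase, repeat=2):
        yield first + second


def to_record(number, name, status=None):
    """Build a row using the standard OpenCorporates headings listed earlier."""
    return {"CompanyNumber": number, "CompanyName": name, "Status": status}


pairs = list(letter_pairs())
print(pairs[:3], len(pairs))

# Placeholder values, not a real registry entry.
print(to_record("000001X", "Example Company Limited", "Active"))
```

Two-letter prefixes give 676 searches in total, which keeps each result set small enough to page through while still covering every company name that starts with a letter pair.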

In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.

However, that’s not all. In order to kickstart the effort OpenCorporates (technically Chrinon Ltd, the micro start-up that’s behind OpenCorporates) is offering a bounty for new jurisdictions opened up.

It’s not huge (we’re a micro-startup remember): £100 for any jurisdiction that hasn’t been done yet, £250 for those territories we want to import sooner rather than later (Australia, France, Spain), and £500 for Delaware (there’s a captcha there, so not sure it’s even possible), and there’s an initial cap of £2500 on the bounty pot (details at the bottom of this post).

However, often the scrapers can be written in a couple of hours, and it’s worth stressing again that neither the code nor the data will belong to OpenCorporates, but to the open data community, and if people build other things on it, so much the better. Of course we think it would make sense for them to use the OpenCorporates URIs to make it easy to exchange data in a consistent and predictable way, but, hey, it’s open data ;-)

Small, simple pieces, loosely connected, to build something rather cool. So now you can do a search for, oh say Barclays, and get this:

The bounty details: how it works

Find a country/company registry that you fancy opening up the data for (here are a couple of lists of registries). Make sure it’s from the official registry, and not a commercial reseller. Check too that no-one has already written one, or is in the middle of writing one, by checking the scrapers tagged with opencorporates (be nice, and respect other people’s attempts, but feel free to start one if it looks as if someone’s given up on a scraper).

All clear? Go ahead and start a new scraper (useful tutorials here). Call it something like trial_fr_company_numbers (until it’s done and been OK’d) and get coding, using the headings detailed above for the CompanyNumber, CompanyName etc. When it’s done, and it’s churning away pulling in data, email us at info@opencorporates.com, and assuming it’s OK, we’ll pay you by Paypal, or by bank transfer (you’ll need to give us an invoice in that case). If it’s not, we’ll add comments to the scraper. Any questions, email us at info@opencorporates.com, and happy scraping.

March 15 2011


Cardiff Hacks and Hackers Hacks Day

What’s occurin’? Loads in fact, at our first Welsh Hacks and Hackers Hack Day! From schools from space to catering colleges with a Food Safety Standard of 2, we had an amazing day.

We got five teams:

Co-Ordnance – This project aimed to be a local business tracker. They wanted to make the London Stock Exchange code into meaningful data, but alas, the stock exchange prevents scraping. So they decided to use company data from registers like the LSE and Companies House to extract business information and structure it for small businesses who need to know the best place to set up and for local business activists.

The team consisted of 3 hacks (Steve Fossey, Eva Tallaksen from Intrafish and Gareth Morlais from BBC Cymru) and 3 hackers (Carey Hiles, Craig Marvelley and Warren Seymour, all from Box UK).

It’s a good thing they had some serious hackers as they had a serious hack on their hands. Here’s a scraper they did for the London Stock Exchange ticker. And here’s what they were able to get done in just one day!

This was just a locally hosted site but the map did allow users to search for types of businesses by region, see whether they’d been dissolved and by what date.

Open Senedd – This project aimed to be a Welsh version of TheyWorkForYou: a way for people in Wales to find out how assembly members voted in plenary meetings. It tackles the worthy task of making assembly members’ voting records accessible and transparent.

The team consisted of 2 hacks (Daniel Grosvenor from CLIConline and Hannah Waldram from Guardian Cardiff) and 2 hackers (Nathan Collins and Matt Dove).

They spent the day hacking away and drew up an outline for www.opensenedd.org.uk. We look forward to the birth of their project! Which may or may not look something like this (left). Minus Coke can and laptop hopefully!

They took on a lot for a one day project but devolution will not stop the ScraperWiki digger!

There’s no such thing as a free school meal – This project aimed to extract information on Welsh schools from inspection reports. This involved getting unstructured Estyn reports on all 2698 Welsh schools into ScraperWiki.

The team consisted of 1 hack (Izzy Kaminski) and 2 astronomer hackers (Edward Gomez and Stuart Lowe from LCOGT).

This small team managed to scrape Welsh schools data (which the next team stole!) and had time to make a heat map of schools in Wales. This was done using some sort of astronomical tool. Their longer term aim is to overlay the map with information on child poverty and school meals. A worthy venture and we wish them well.

Ysgoloscope – This project aimed to be a Welsh version of Schooloscope. Its aim was to make information about schools accessible and interactive for parents to explore. It used Edward’s scraper of horrible PDF Estyn inspection reports. These used a different rating methodology from Ofsted (devolution is not good for data journalism!).

The team consisted of 6 hacks (Joni Ayn Alexander, Chris Bolton, Bethan James from the Stroke Association, Paul Byers, Geraldine Nichols and Rachel Howells), 1 hacker (Ben Campbell from Media Standards Trust) and 1 troublemaker (Esko Reinikainen).

Maybe it was a case of too many hacks or just trying to narrow down what area of local government to tackle but the result was a plan. Here is their presentation and I’m sure parents all over Wales are hoping to see Ysgoloscope up and running.

Blasus – This project aimed to map food hygiene ratings across Wales. They wanted to correlate this information with deprivation indices. They noticed that the Food Standards Agency site does not work, at least not for this purpose, which is the most useful one.

The team consisted of 4 hacks (Joe Goodden from the BBC, Alyson Fielding, Charlie Duff from HRZone and Sophie Paterson from the ATRiuM) and 1 hacker (Dafydd Vaughan from CF Labs).

As you can see below they created something which they presented on the day. They used this scraper and made an interactive map with food hygiene ratings, symbols and local information. Amazing for just a day’s work!

And the winners are… (drum roll please)

  • 1st Prize: Blasus
  • 2nd Prize: Open Senedd
  • 3rd Prize: Co-Ordnance
  • Best Scoop: Blasus for finding a catering college in Merthyr with a Food Hygiene Standard rating of just 2
  • Best Scraper: Co-Ordnance

A big shout out

To our judges Glyn Mottershead from Cardiff School of Journalism, Media and Cultural Studies, Gwawr Hughes from Skillset and Sean Clarke from The Guardian.

And our sponsors Skillset, Guardian Platform, Guardian Local and Cardiff School of Journalism, Media and Cultural Studies.

Schools, businesses and eating places of Wales – you’ve been ScraperWikied!

March 08 2011


600 Lines of Code, 748 Revisions = A Load of Bubbles

When Channel 4’s Dispatches came across 1,100 pages of PDFs, known as the National Asset Register, they knew they had a problem on their hands. All that data, caged in a pixelated prison.

So ScraperWiki let loose ‘The Julian’. What ‘The Stig’ is to Top Gear, ‘The Julian’ is to ScraperWiki. That and our CTO.

‘The Julian’ did not like the PDFs. After scraping 10 pages of Defence assets, he got angry. The register may as well have been glued together by trolls. The 5 year old data, copied and pasted by Luddites from the previous Government, was worse than useless.

So the ScraperWiki team set about rebuilding the register. Using good old-fashioned man power (i.e. me) and a PDF cropper we built a database of names, values and hierarchies that link directly to the PDFs.

Then Julian set about coding; 600 lines and 748 revisions! He made the bubbles the size of the asset values and got them to orbit around their various parent bubbles. This required such functions as ‘MakeOtherBranchAggregationsRecurse(cluster)’.
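For the curious, the general idea of rolling asset values up a hierarchy can be sketched in a few lines. This is not Julian's code: the nested-dictionary layout, the function names aggregate and radius, the sample figures, and the choice to make bubble area (rather than radius) proportional to value are all assumptions made for illustration.

```python
import math


def aggregate(node):
    """Return a node's total: its own value plus the totals of all descendants."""
    total = node.get("value", 0) + sum(
        aggregate(child) for child in node.get("children", [])
    )
    node["total"] = total  # cache the roll-up on the node for drawing later
    return total


def radius(total, scale=1.0):
    """Bubble radius chosen so that the bubble's area is proportional to total."""
    return scale * math.sqrt(total / math.pi)


# Made-up hierarchy; the real register has far more levels and entries.
register = {
    "name": "Register",
    "children": [
        {"name": "Department A", "children": [
            {"name": "Asset 1", "value": 300},
            {"name": "Asset 2", "value": 100},
        ]},
        {"name": "Department B", "value": 50},
    ],
}
grand_total = aggregate(register)
```

Scaling by area rather than radius stops a few very large assets from visually swamping everything else, though either choice is defensible.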

This scared our designer Zarino a little, who nevertheless made it much more user-friendly. This is where ScraperWiki’s powers of viewing live edits, chatting and collaboration became useful. The result was rounds of debugging interspersed with a healthy dose of cursing.

We then tried using it. We wanted the source of the data to hold provenance. We wanted to give the users the ability to explore the data. We wanted them to be able to see the bubbles that were too small. We prodded ‘The Julian’.

He hard coded the smaller bubbles to get into a ‘More…’ bubble orbit. This made the whole thing a lot clearer and changed the navigation from jumping to orbits to drilling down and finding out which assets are worth a similar amount.
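A rough sketch of the ‘More…’ idea follows. The real version was hard coded, so the cutoff rule here (anything under 5% of the biggest sibling gets grouped) is purely an assumption, as are the function name group_small and the sample values.

```python
def group_small(children, min_fraction=0.05):
    """Fold siblings much smaller than the largest one into a single 'More…' bubble."""
    if not children:
        return []
    cutoff = max(c["value"] for c in children) * min_fraction
    large = [c for c in children if c["value"] >= cutoff]
    small = [c for c in children if c["value"] < cutoff]
    if len(small) < 2:
        # Grouping a single bubble gains nothing, so leave the list alone.
        return list(children)
    more = {
        "name": "More…",
        "value": sum(c["value"] for c in small),
        "children": small,  # kept so the user can still drill down
    }
    return large + [more]


# Made-up sibling bubbles.
siblings = [
    {"name": "A", "value": 1000},
    {"name": "B", "value": 400},
    {"name": "C", "value": 20},
    {"name": "D", "value": 10},
    {"name": "E", "value": 5},
]
grouped = group_small(siblings)
```

Because the small bubbles are kept as children of the new bubble, nothing is lost: they simply move one level down.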

He then got it to drill down to the source PDFs. ‘The Julian’ outdid himself and stayed up all night making a PDF annotator of the data. We have plans for this.

Oh, and we also made a brownfield map. The scraper can be found here. And the code for the visual here. The 25,000 data points were in Excel form and so were much easier to work with. This was nice data with lots of fields. Francis and Zarino created a very friendly visual application that allows a user to type in a post code and to see what is going on with their local authority. But due to the new government coming in, the Homes and Communities Agency have not yet finished collecting the 2009 data.

NAR and NLUD – you’ve been ScraperWikied!

May 12 2010


What are the best tools for "scraping" data off a Web page for analysis in Excel or other software?

My former student Michelle Minkoff answered this question, at least in part, on Poynter.org today: http://www.poynter.org/column.asp?id=31&aid=183176. Her post includes links to two wonderful tutorials. I'm interested in other suggestions, and also in approaches for someone who's not too afraid of coding to write/adapt their own scraper.

Part of the reason I ask this question is that I've been thinking that writing a scraper might be an interesting final project for a course introducing programming to journalists. The rationale (along the lines of how I've taught computer-assisted reporting in the past) is that it's the kind of project a journalist would immediately see the utility/value of. So in addition to suggested tools/approaches, I'd be interested in feedback on this idea.
