
March 25 2011

13:50

All the news that’s fit to scrape

Channel 4/Scraperwiki collaboration

There have been quite a few scraping-related stories that I’ve been meaning to blog about, so many that I’ve decided to write a round-up instead. It demonstrates the increasing role that scraping is playing in journalism, and the possibilities for those who don’t yet know about them:

Scraping company information

Chris Taggart explains how he built a database of corporations which will be particularly useful to journalists and anyone looking at public spending:

“Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia) … In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.”

OpenCorporates are also offering a bounty for programmers who can scrape company information from other jurisdictions.

Scraperwiki on the front page of The Guardian…

The Scraperwiki blog gives the story behind a front page investigation by James Ball on lobbyist influence in the UK Parliament:

“James Ball’s story is helped and supported by a ScraperWiki script that took data from registers across parliament that is located on different servers and aggregates them into one source table that can be viewed in a spreadsheet or document.  This is now a living source of data that can be automatically updated.  http://scraperwiki.com/scrapers/all_party_groups/

“Journalists can put down markers that run and update automatically and they can monitor the data over time with the objective of holding ‘power and money’ to account. The added value  of this technique is that in one step the data is represented in a uniform structure and linked to the source thus ensuring its provenance.  The software code that collects the data can be inspected by others in a peer review process to ensure the fidelity of the data.”
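To make the idea concrete, here is a minimal sketch of that kind of aggregation: several differently shaped registers merged into one uniform table, with every row linked back to its source. It is not the actual all_party_groups scraper. The URLs, column names and sample rows are invented for illustration, and the registers are inlined as text rather than fetched from live servers.

```python
import csv
import io

# Hypothetical sample inputs standing in for registers held on different
# servers; the URLs, columns and rows are invented for illustration.
SOURCES = {
    "https://example.org/register-a.csv": "Group,Member\nBeer Group,A. Smith\n",
    "https://example.org/register-b.csv": "name;person\nCycling Group;B. Jones\n",
}

# Per-source description of how its columns map onto the uniform schema.
COLUMN_MAPS = {
    "https://example.org/register-a.csv": {"delimiter": ",", "group": "Group", "member": "Member"},
    "https://example.org/register-b.csv": {"delimiter": ";", "group": "name", "member": "person"},
}

def aggregate(sources, column_maps):
    """Merge differently shaped registers into one list of uniform rows."""
    table = []
    for url, text in sources.items():
        spec = column_maps[url]
        reader = csv.DictReader(io.StringIO(text), delimiter=spec["delimiter"])
        for row in reader:
            table.append({
                "group": row[spec["group"]].strip(),
                "member": row[spec["member"]].strip(),
                "source_url": url,  # provenance: every row points at its origin
            })
    return table

rows = aggregate(SOURCES, COLUMN_MAPS)
```

Because each row carries its own source_url, anyone inspecting the table can trace a figure back to the register it came from, which is the provenance point the ScraperWiki blog is making.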

…and on Channel 4’s Dispatches

From the Open Knowledge Foundation blog (more on Scraperwiki’s blog):

“ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns … The first is a bubble chart of what central Government owns. The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.

“The second is a map of brownfield land owned by local councils in England … The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. It’s quite interesting that a dataset gathered for purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist of use of data that really requires understanding of the source of the data.”

Which chiropractors were making “bogus” claims?

This is an example from last summer. Following the Simon Singh case, Simon Perry wrote a script to check which chiropractors were making the same “bogus claims” that Singh was being sued over:

“The BCA web site lists all its 1029 members online, including for many of them, about 400 web site URLs. I wrote a quick computer program to download the member details, record them in a database and then download the individual web sites. I then searched the data for the word “colic” and then manually checked each site to verify that the chiropractors were either claiming to treat colic, or implying that chiropractic was an efficacious treatment for it. I found 160 practices in total, with around 500 individual chiropractors.

“The final piece in the puzzle was a simple mail-merge. Not wanting to simultaneously report several quacks to the same Trading Standards office, I limited the mail-merge to one per authority and sent out 84 letters.

“On the 10th, the science blogs went wild when Le Canard Noir published a very amusing email from the McTimoney Chiropractic Association, advising their members to take down their web site. It didn’t matter, I had copies of all the web sites.”
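The two mechanical steps Perry describes, searching the downloaded pages for a keyword and then limiting the mail-merge to one letter per authority, can be sketched roughly as follows. This is not his program. The records, practice names and authorities are invented, and the page text is inlined rather than downloaded.

```python
import re

# Hypothetical records: (practice name, local authority, downloaded page text).
PAGES = [
    ("Practice A", "Authority 1", "We treat back pain and infant Colic."),
    ("Practice B", "Authority 1", "Chiropractic may help with colic."),
    ("Practice C", "Authority 2", "Sports injuries and posture advice."),
    ("Practice D", "Authority 3", "Ask us about colic, asthma and more."),
]

def find_mentions(pages, keyword):
    """Return records whose page text contains the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [page for page in pages if pattern.search(page[2])]

def one_per_authority(records):
    """Keep only the first matching practice for each authority."""
    chosen = {}
    for name, authority, _text in records:
        chosen.setdefault(authority, name)
    return chosen

# The keyword search only produces candidates; as in the original account,
# each hit would still need to be checked by hand.
candidates = find_mentions(PAGES, "colic")
letters = one_per_authority(candidates)
```

Keeping a local copy of every page is what made the later takedown advice irrelevant: the evidence had already been archived.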

March 08 2011

16:22

600 Lines of Code, 748 Revisions = A Load of Bubbles

When Channel 4’s Dispatches came across 1,100 pages of PDFs, known as the National Asset Register, they knew they had a problem on their hands. All that data, caged in a pixelated prison.

So ScraperWiki let loose ‘The Julian’. What ‘The Stig’ is to Top Gear, ‘The Julian’ is to ScraperWiki. That and our CTO.

‘The Julian’ did not like the PDFs. After scraping 10 pages of Defence assets, he got angry. The register may as well have been glued together by trolls. The five-year-old data copied and pasted by Luddites from the previous Government was worse than useless.

So the ScraperWiki team set about rebuilding the register. Using good old-fashioned man power (i.e. me) and a PDF cropper we built a database of names, values and hierarchies that link directly to the PDFs.

Then Julian set about coding; 600 lines and 748 revisions! He made the bubbles the size of the asset values and got them to orbit around their various parent bubbles. This required such functions as ‘MakeOtherBranchAggregationsRecurse(cluster)’.
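For readers wondering what a recursive aggregation over a hierarchy of assets looks like, here is a minimal sketch. It is not Julian's code and does not reproduce ‘MakeOtherBranchAggregationsRecurse’. The tree, names and values are invented, and scaling bubble area (rather than radius) to the value is an assumption about how "the size of the asset values" might be drawn.

```python
import math

# Hypothetical asset hierarchy; names and values are invented for illustration.
tree = {
    "name": "Central Government",
    "children": [
        {"name": "Department A", "children": [
            {"name": "Building", "value": 9.0},
            {"name": "Vehicles", "value": 16.0},
        ]},
        {"name": "Department B", "value": 25.0},
    ],
}

def aggregate(node):
    """Recursively set each node's total to its own value plus its children's totals."""
    total = node.get("value", 0.0)
    for child in node.get("children", []):
        total += aggregate(child)
    node["total"] = total
    return total

def radius(node, scale=1.0):
    """Bubble radius chosen so that the bubble's area is proportional to its total."""
    return scale * math.sqrt(node["total"] / math.pi)

aggregate(tree)
```

With the totals in place, each parent bubble can be sized from the sum of everything orbiting it, so drilling down never changes the overall picture.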

This scared our designer Zarino a little, who nevertheless made it much more user-friendly. This is where ScraperWiki’s powers of viewing live edits, chatting and collaboration became useful. The result was rounds of debugging interspersed with a healthy dose of cursing.

We then tried using it. We wanted the source of the data to hold provenance. We wanted to give the users the ability to explore the data. We wanted them to be able to see the bubbles that were too small. We prodded ‘The Julian’.

He hard coded the smaller bubbles to get into a ‘More…’ bubble orbit. This made the whole thing a lot clearer and changed the navigation from jumping to orbits to drilling down and finding out which assets are worth a similar amount.
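One way to get that ‘More…’ behaviour is to fold every bubble below some cut-off into a single grouped bubble that can itself be drilled into. The sketch below assumes a simple value threshold, which the post does not confirm (it only says the grouping was hard coded), and the asset names and values are invented.

```python
def group_small(children, threshold):
    """Keep children at or above the threshold; fold the rest into one 'More…' node."""
    large = [c for c in children if c["value"] >= threshold]
    small = [c for c in children if c["value"] < threshold]
    if small:
        large.append({
            "name": "More…",
            "value": sum(c["value"] for c in small),
            "children": small,  # drilling into 'More…' reveals the small assets
        })
    return large

# Hypothetical sibling assets with very different values.
assets = [
    {"name": "Land", "value": 120.0},
    {"name": "Buildings", "value": 80.0},
    {"name": "Furniture", "value": 1.5},
    {"name": "Stationery", "value": 0.5},
]
grouped = group_small(assets, threshold=5.0)
```

The small assets are not thrown away, only moved one level down, so a user can still reach bubbles that would otherwise be too small to see.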

He then got it to drill down to the source PDFs. ‘The Julian’ outdid himself and stayed up all night making a PDF annotator of the data. We have plans for this.

Oh, and we also made a brownfield map. The scraper can be found here. And the code for the visual here. The 25,000 data points were in Excel form and so much easier to work with. This was nice data with lots of fields. Francis and Zarino created a very friendly visual application that allows a user to type in a post code and to see what is going on with their local authority. But due to the new government coming in, the Homes and Communities Agency have not yet finished collecting the 2009 data.

NAR and NLUD – you’ve been ScraperWikied!

October 06 2010

14:46

‘We do want journalists to break the rules’, says former prosecutions chief

Society needs journalists who are prepared to break the law in order to serve the public interest, argued the former director of public prosecutions Sir Ken Macdonald last night.

Speaking at a debate at City University on the News of the World phone-hacking case and the lengths to which reporters can go to get information, Macdonald said: “There are bound to be cases where journalists will want to break the law, and for good reason (…) We do want journalists to break the rules.”

Macdonald did not condone the phone-hacking at NotW, and stressed that it was only under certain public interest circumstances that journalists might be forgiven for breaking the law.

He was joined by key players in the phone hacking scandal: Nick Davies of the Guardian, ex-News of the World journalist Paul McMullan and defamation lawyer Mark Lewis, as well as Max Mosley, Roy Greenslade and libel barrister Caldecott QC.

Mark Lewis, who is currently suing the Metropolitan Police and the Press Complaints Commission for libel, echoed Macdonald, saying that in certain circumstances illegal activity is acceptable.

“If you know something is of public interest then you can use certain methods to corroborate it,” he said. However, he stressed that these methods should not be used to obtain a story.

Macdonald also cautioned against increasing privacy laws, warning it could create a “contagion of caution” among newspapers, and pointed out that a culture of deference has developed in France due to its strict privacy rules.

However, Macdonald conceded that it is nearly impossible to define what is and isn’t in the public interest.

As former Daily Mirror editor and journalism professor Greenslade pointed out, “the public interest for the Guardian’s audience is very different to the public interest of the News of the World readers.

“There is no easy way of drafting a public interest definition that would give journalists clear guidance on what they should and shouldn’t publish.”

More from Journalism.co.uk:

Former News of the World journalist defends phone-hacking at lively debate

PCC claims it did respond to Dispatches with phone-hacking statement

Phone-hacking on Dispatches: a good documentary but not enough new evidence



October 05 2010

12:06

Phone-hacking on Dispatches: a good documentary but not enough new evidence

Following the Twitter conversation around last night’s Channel 4 Dispatches on phone-hacking, Andy Coulson and the News of the World, it seems that for those already following the story there was insufficient new evidence.

But for those less aware of the ongoing claims and the series of investigations that have been conducted, the programme did a great job of putting the most recent claims – sparked by the New York Times’ reports in September – into context with what has gone before, starting with Clive Goodman and Glenn Mulcaire’s arrests in 2006.

Dispatches had comments from Paul McMullan, a journalist working at the News of the World when Coulson joined, and an unidentified source who worked under former editor Coulson while he was deputy editor.

Both alleged that phone-hacking did not begin and end with Goodman and Mulcaire. McMullan told the programme that there was surprise in the newsroom following Goodman’s arrest and sentencing that no one else had been charged.

Of 13 people who worked at the paper during Coulson’s editorship or time as deputy editor and have spoken to Dispatches, not one believes that Goodman was a lone “bad apple”.

Questioning Coulson’s “collective amnesia” and rulings by the Met Police and other industry groups that Goodman and Mulcaire were the only people involved in the practice may not be new, but Dispatches did a good job of raising some new points that the mainstream media has so far largely left uncovered. In particular, the programme spoke with a non-celebrity potential victim of phone-hacking, who explained how difficult it has been to get information from the police and her mobile phone operator to check if she had been hacked.

Interviewees raised concerns that, whatever the truth behind the allegations about the extent of the practice, the way in which investigations by government and the Metropolitan Police have been conducted suggests that the News of the World may be “above the law”. They included Brian Paddick, who is calling for a judicial review of the Met’s 2006 inquiry, and DCMS select committee member Adam Price, who had suggested that News International’s Rebekah Brooks should be made to give evidence to its phone-hacking inquiry.

Tom Watson MP, who worked on the department for culture, media and sport’s select committee inquiry into allegations against the NOTW, told Dispatches that he considered giving up politics after a senior News International journalist told him that he would be pursued by its titles after he called for Tony Blair’s resignation in 2006 because of the support of News International for the then PM.

Watson has now published a letter on his website written to the Prime Minister and asking him to make a statement in parliament this week about the allegations against his communications director Coulson.

Coulson has repeatedly denied knowledge of phone-hacking at the News of the World and told Dispatches he had nothing to add in response to its broadcast.

Lack of press coverage at the time of Goodman’s arrest suggested similar goings-on at other papers, said Dispatches’ host Peter Oborne last night. But given the Daily Mail columnist’s involvement and the featured commentary from former News of the World journalists, Channel 4 and the Guardian, has last night’s broadcast created a more united front amongst the press to investigate its own state of affairs?



October 04 2010

10:45

Phone-hacking: Dispatches source claims Coulson listened to recordings

Tonight’s Channel 4 Dispatches documentary, Tabloids, Tories and Telephone Hacking, will reveal new phone tapping allegations against Andy Coulson, Channel 4 News revealed yesterday.

In a breaking news announcement, presenter Krishnan Guru-Murthy reported that a past colleague of Coulson’s will claim in tonight’s broadcast that the former editor of the News of the World, and now communications director for the Prime Minister, not only knew about phone hacking at the tabloid but also asked for recordings to be played to him. Coulson has always claimed that he had no knowledge of hacking at the paper.

The Dispatches programme, which features an investigation by political journalist Peter Oborne into the tabloid’s relationship with police and the government, will be aired on Channel 4 tonight at 8pm. The programme follows fresh allegations of phone hacking at the tabloid made by the New York Times last month, sparking emergency debates in the House of Commons, a new police investigation and a series of lawsuits.


