
April 11 2011


Data for journalists: understanding XML and RSS

If you are working with data, chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.

There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.

What is XML?

XML is a language which is used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.

If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you.

(That said, this post doesn’t show you how to do any of that – it is mainly aimed at explaining how XML works so that you can begin to think about those possibilities.)

XML stands for “eXtensible Markup Language”. It’s the ‘markup’ bit which is key: XML ‘marks up’ information as being something in particular: relating to a particular date, for example; or a particular person; or referring to a particular location.

For example, a snippet of XML like this -


- tells you that the ‘Paris’ in this instance is a city, rather than a celebrity. And that it’s in France, not Texas.
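The exact markup depends on whoever publishes the data, but a snippet along the following lines would carry that meaning. The tag and attribute names (`city`, `country`) are illustrative assumptions rather than a real schema, and the Python sketch simply shows how a program could read the markup with the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names are assumptions
# chosen for illustration, not taken from a real data source.
snippet = '<city country="France">Paris</city>'

element = ET.fromstring(snippet)

print(element.tag)             # the kind of thing being described
print(element.text)            # the content itself
print(element.get("country"))  # extra information attached to the tag
```

Because the markup says what ‘Paris’ is, a program can tell it apart from a person called Paris without guessing from context.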

That makes it easier for you to filter out information that isn’t relevant, or combine particular bits of information with data from elsewhere.

For example, if an XML file contains information on authors, you can filter out all but those by the person you’re interested in; if it contains publication dates, you can use that to plot associated content on a timeline.
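As a rough sketch of that idea, the following Python keeps only the documents by one author and then orders them by date, ready for a timeline. The XML structure, names and dates are all invented for illustration; a real file would use its own tag names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical document list: structure and values are invented for illustration.
xml_text = """
<documents>
  <document><title>Budget briefing</title><author>A. Smith</author><published>2011-03-02</published></document>
  <document><title>Transport review</title><author>B. Jones</author><published>2011-01-15</published></document>
  <document><title>Health memo</title><author>A. Smith</author><published>2010-11-30</published></document>
</documents>
"""

root = ET.fromstring(xml_text)

# Filter: keep only the documents written by the author we care about.
by_author = [doc for doc in root.findall("document")
             if doc.findtext("author") == "A. Smith"]

# Timeline: sort the remaining documents by their publication date.
timeline = sorted(
    by_author,
    key=lambda doc: datetime.strptime(doc.findtext("published"), "%Y-%m-%d"),
)

titles = [doc.findtext("title") for doc in timeline]
print(titles)
```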

Most usefully, if you have a set of data yourself such as a spreadsheet, you can pull related data from a relevant XML file. If your spreadsheet contains football teams and the XML provides locations, images, and history for each, then you can pull that in to create a fuller picture. If it contains addresses, there are services that will give you XML files with the constituency for those postcodes.
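A minimal sketch of that kind of combination, assuming the spreadsheet has been exported as CSV and the XML file lists a location for each team. The team names, figures, locations and tag names below are all made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for a spreadsheet exported as CSV (invented figures).
spreadsheet = io.StringIO("team,points\nRovers,52\nUnited,61\n")

# Hypothetical XML reference file; the tag names are assumptions.
teams_xml = """
<teams>
  <team><name>Rovers</name><location>Northtown</location></team>
  <team><name>United</name><location>Southport Bay</location></team>
</teams>
"""

# Build a lookup from team name to location using the XML.
locations = {
    team.findtext("name"): team.findtext("location")
    for team in ET.fromstring(teams_xml).findall("team")
}

# Add the location to each spreadsheet row where the team name matches.
rows = []
for row in csv.DictReader(spreadsheet):
    row["location"] = locations.get(row["team"], "")
    rows.append(row)

print(rows)
```

The match depends on the names being written identically in both sources, which in practice is often the fiddly part.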

What is RSS?

RSS is a whole family of formats which are essentially based on XML – so they are structured in the same way, containing ‘markup’ that might tell you the author, publication date, location or other details about the information it relates to.

There is a lot of variation between different versions of RSS, but the main thing for the purposes of this post is that the various versions of RSS, and XML, share a structure which journalists can use if they know how to.

Which version isn’t particularly important: as long as you understand the principles, you can adapt what you do to suit the document or feed you’re working with.

Looking at XML and RSS

XML documents (for simplicity’s sake I’ll mostly just refer to ‘XML’ for the rest of this post, although I’m talking about both XML and RSS) contain two things that are of interest to us: content, and information about the content (‘markup’).

Information about the content is contained within tags in angle brackets (also known as chevrons): ‘<’ and ‘>’

For example: <name> or <pubDate> (publication date).

The tag is followed by the content itself, and a closing tag that has a forward slash, e.g. </name> or </pubDate>, so one line might look like this:

<name>Paul Bradshaw</name>

At this point it’s useful to have some XML or RSS in front of you. For a random example go to the RSS feed for the Scottish Government News.

To see the code right-click on that page and select View Source or similar – Firefox is worth using if another browser does not work; the Firebug extension also helps. (Note: if the feed is generated by Feedburner this won’t work: look for the ‘View Feed XML‘ button in the middle right area or add ?format=xml to the feed URL).

What you should see will include the following:

<title>Manufactured Exports Q4 2010</title>
<description>A National Statistics publication for Scotland.</description>
<guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid>
<pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate>

In the RSS feed itself this doesn’t start until line 14 (the first 13 lines are used to provide information about the feed as a whole, such as the version of RSS, title, copyright etc).

But from line 14 onwards this pattern repeats itself for a number of different ‘items’.

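As you can see, each item has a title, a link, a description, a permalink, and a publication date. These are known as child elements, and the item is their parent. (The ‘root element’ is the outermost tag that wraps the whole document: in an RSS 2.0 feed that is <rss>, which contains a <channel>, which in turn contains the items.)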
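To see the parent and child relationship from a program's point of view, here is a short Python sketch that reads one item and lists its children. The item is the one quoted above; the surrounding `<rss>` and `<channel>` wrapper follows the usual RSS 2.0 layout, and the feed title is a placeholder rather than the real feed's header.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# One item from the feed quoted above, wrapped in a minimal RSS 2.0 shell.
# The channel title is a placeholder, not the real feed's header.
feed = """<rss version="2.0"><channel>
<title>Example feed</title>
<item>
<title>Manufactured Exports Q4 2010</title>
<description>A National Statistics publication for Scotland.</description>
<guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid>
<pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate>
</item>
</channel></rss>"""

root = ET.fromstring(feed)
items = root.findall("./channel/item")

# Each item is a parent; loop over its child elements.
for item in items:
    for child in item:
        print(child.tag, "=>", child.text)

# RSS dates use the email date format, which the standard library can parse.
published = parsedate_to_datetime(items[0].findtext("pubDate"))
print(published.date())
```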

More journalistic examples can be found at Mercedes GP’s XML file of the latest F1 Championship Standings (see the PS at the end of Tony Hirst’s post for an explanation of how this is structured), and MySociety’s Parliament Parser, which provides XML files on all parts of government, from MPs and peers to debates and constituencies, going back over a decade. Look at the Ministers XML file in Firefox and scroll down until you get to the first item tagged <ministerofficegroup>. Within each of those are details on ministerial positions. As the Parliament Parser page explains:

“Each one has a date range, the MP or Lord became a minister at some time on the start day, and stopped being one at some time on the end day. The matchid field is one sample MP or Lord office which that person also held. Alternatively, use the people.xml file to find out which person held the ministerial post.”

You’ll notice from that quote that some parts of the XML require cross-referencing to provide extra details. That’s where XML becomes very useful.
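Cross-referencing of this kind can be sketched in a few lines of Python. The two documents below are simplified stand-ins, not the real Parliament Parser files: the tag names, attribute names, ids and people are assumptions made up to show the technique of looking up an id from one file in another.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified files: names and ids are invented for illustration.
offices_xml = """<offices>
  <office id="office/1" position="Minister for Example" fromdate="2009-06-01" todate="2010-05-11"/>
  <office id="office/2" position="Secretary for Samples" fromdate="2010-05-12" todate="2011-04-01"/>
</offices>"""

people_xml = """<people>
  <person name="Person A"><held office="office/1"/></person>
  <person name="Person B"><held office="office/2"/></person>
</people>"""

# Index the second file: which person held which office id?
holder_by_office = {}
for person in ET.fromstring(people_xml).findall("person"):
    for held in person.findall("held"):
        holder_by_office[held.get("office")] = person.get("name")

# Walk the first file and attach the person's name to each post.
posts = [
    (office.get("position"), holder_by_office.get(office.get("id")),
     office.get("fromdate"), office.get("todate"))
    for office in ET.fromstring(offices_xml).findall("office")
]

for post in posts:
    print(post)
```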

Using it in practice: working with XML in Yahoo! Pipes

Yahoo! Pipes provides a good introduction to working with data in XML or RSS. You’ll need to sign up at Pipes.Yahoo.com and click on ‘Create a Pipe‘.

You’ll now be editing a new project. On the left hand column are various ‘modules’ you can use. Click on ‘Sources‘ to expand it, and click and drag ‘Fetch Feed’ onto the graph paper-style canvas.

The 'Fetch Feed' module

Copy the address of your RSS feed and paste it into the ‘Fetch Feed’ box. I’m using this feed of Health information from the UK government.

If you now click on the module so that it turns orange, you should be able (after a few moments) to see that feed in the Debugger window at the bottom of the screen.

Click on the handle in the middle to pull it up and see more, and click on the arrows on the left to drill down to the ‘nested’ data within each item.

Drilling down into the data within an RSS feed

As you drill down you can see elements of data you can filter. In this case, we’ll use ‘region‘.

To filter the feed based on this we need the Filter module. On the left hand side click on ‘Operators‘ to expand that, and then drag the ‘Filter‘ module into the canvas.

Now drag a pipe from the circle at the bottom of the ‘Fetch Feed’ module to the top of the ‘Filter’ module.

Drag a pipe from Fetch Feed to Filter

Wait a moment for the ‘Filter’ module to work out what data the RSS feed contains. Then use the drop down menus so that it reads “Permit items that match all of the following”.

The next box determines which piece of data you will filter on. If you click on the drop-down here you should see all the pieces of data that are associated with each item.

Select the data you are filtering on

We’re going to select ‘region’, and say that we only want to permit items where ‘region’ contains ‘North West’. If any of these don’t make any sense, look at the original RSS feed again to see what they contain.

Now drag a final pipe from the bottom of the ‘Filter’ module to the top of ‘Pipe output‘ at the bottom of the canvas. If you click on either you should be able to see in the Debugger that now only those items relating specifically to the North West are displayed.

If you wanted to you could now save this and click ‘Run Pipe‘ to see the results. Once you do you should notice options to ‘Get as RSS‘ – this would allow you to subscribe to this feed yourself or publish it on a website or Twitter account. There’s also ‘Get as JSON’ which is a whole other story – I’ll cover JSON in a future post.

You can see this pipe in action – and clone it yourself – here.
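The same ‘permit items where region contains North West’ logic can also be sketched outside Pipes, which may help show what the Filter module is doing. This is a minimal Python sketch under stated assumptions: the sample feed, its `<region>` element and the item titles are invented, and the helper name `permit_items` is hypothetical rather than part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the <region> element and the item values are assumptions.
feed = """<rss version="2.0"><channel>
<item><title>Hospital funding update</title><region>North West</region></item>
<item><title>Clinic opening</title><region>London</region></item>
<item><title>Vaccination campaign</title><region>North West, Yorkshire</region></item>
</channel></rss>"""


def permit_items(feed_xml, field, needle):
    """Return the items whose child element `field` contains `needle`."""
    root = ET.fromstring(feed_xml)
    return [
        item for item in root.iter("item")
        if needle in (item.findtext(field) or "")
    ]


north_west = permit_items(feed, "region", "North West")
for item in north_west:
    print(item.findtext("title"))
```

Like the ‘contains’ option in the Filter module, this matches any item whose region text includes the phrase, so an item tagged with several regions is still permitted.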

Oh, and a sidenote: if you wanted to grab an XML file in Yahoo! Pipes rather than an RSS feed, you would use ‘Fetch Data’ instead of ‘Fetch Feed’.

Just the start

There’s much more you can do here. Some suggestions for next steps:

Those are for future posts. For now I just want to demonstrate how XML works to add information-about-information which you can then use to search, filter, and combine data.

And it’s not just an esoteric language that is used by a geeky few as part of their newsgathering: journalists at Sky News, The Guardian and The Financial Times – to name just a few – all use this as a routine part of publishing, because it provides a way to dynamically update elements within a larger story without having to update the whole thing from scratch – for example by updating casualty numbers or new dates on a timeline.

And while I’m at it, if you have any examples of XML being used in journalism for either newsgathering or publishing, let me know.


October 04 2010


Open data meets FOI via some nifty automation

OpenlyLocal generated FOI request

Now this is an example of what’s possible with open data and some very clever thinking. Chris Taggart blogs about a new tool on his OpenlyLocal platform that allows you to send a Freedom of Information (FOI) request based on a particular item of spending. “This further lowers the barriers to armchair auditors wanting to understand where the money goes, and the request even includes all the usual ‘boilerplate’ to help avoid specious refusals.”

It takes around a minute to generate an FOI request.

The function is limited to items of spending above £10,000. Cleverly, it’s also all linked so you can see if an FOI request has already been generated and answered.

Although the tool sits on OpenlyLocal, Francis Irving at WhatDoTheyKnow gets enormous credit for making their side of the operation work with it.

Once again you have to ask why a media organisation isn’t creating these sorts of tools to help generate journalism beyond the walls of its newsroom.


July 06 2010


Democracy site MySociety to receive $575,000 from US investment firm

Democracy website MySociety has announced it will receive a $575,000 donation from US-based Omidyar Network, an investment firm that aims to support organisations that provide opportunities for people to improve their lives.

The money will be used to help develop MySociety and its online community.

Founder and director Tom Steinberg says the grants will benefit both the business and its users:

We’re really delighted because these grants help us do two things we really need to – share our knowledge and skills more widely, and improve our ability to run ourselves as a mature organization, better able than before to look after our legal and financial affairs on the one hand, and our community and users on the other.

Read the full post here…

June 28 2010


#Tip of the day from Journalism.co.uk – local map widgets

Hyperlocal Google maps: mySociety has a simple step-by-step guide showing you how to display the most recent reports from its FixMyStreet service, on a local map widget. This can then be embedded on your blog or site. Tipster: Judith Townend. To submit a tip to Journalism.co.uk, use this link - we will pay a fiver for the best ones published.

June 03 2010


The Future of News: Not So Bleak, Not So Rosy

What's the future of news? I'm tempted to say "not very much" since no one really knows too much about the future of news right now. You know this is true because senior news folk have given up on the doom and gloom stuff and are starting to talk about "the golden age of journalism" and how it's a "bright dawn" and that sort of thing. This would make sense if there had been any structural change in the economics of news, but there hasn't; so their optimism has the hollow twang of hope over reason.

Still, the optimists have got it half right. As Stewart Kirkpatrick, founder of the Caledonian Mercury, said at a #futureofnews conference a week or so back (I paraphrase): "This is a great time to do journalism. It's just not a great time to earn your living as a journalist."

What I Know

But, in these turbulent times, as I earnestly make my way from one news conference to another, a few things are starting to become clear. So this much I know:

  • Even if pay walls provide a secure financial future for news organizations -- which right now seems unlikely -- they will reduce the pool of shared information, and cut those news organizations' content off from the openness, sharing and linking that characterizes the web. "You cannot control distribution or create scarcity," Alan Rusbridger said in his January Hugh Cudlipp lecture, "without becoming isolated from this new networked world."
  • The pay wall is not the only way to sustain the digital newsroom. Advertising, though much maligned by many, could yet make online non-pay wall newspaper content viable within five years. Peter Kirwan did the math in Wired, calculating that if Guardian News Media manages a 20 percent annualized growth of digital revenues (it estimates growth will be 30 percent this year) it will be able to maintain a £100m digital newsroom seven days a week by 2015.
  • There are other revenue models for online news -- ones that allow you to keep your news open, linked and shared, and make money. For example, there is what I call the "carrier pigeon model." In this model you let people share, link to, recommend, search, aggregate, and even re-use your content -- you just make sure it's properly marked up and credited so you can keep track of it and develop revenue models off the back of it. You do this with -- excuse the geek terminology -- "metadata." Embedded metadata has all sorts of potential benefits we're only just starting to take advantage of (hence why we've spent so much time on hNews and linked data). I call it the carrier pigeon model because the news doesn't just go out, it comes back.
  • The cost base is still going to have to go down. The cost of producing news will necessarily have to be a lot lower than it has been historically. This doesn't have to mean cutting journalists' jobs or getting out of print. There are lots of ways to rethink costs in a digital world. One of the most inventive is Roman Gallo's Czech model. Gallo opened cafés in the centre of towns across the Czech Republic. He then put his news teams in the cafés. Not only does this mean they have very low office overhead (the café covers basic costs), but it means the journalists are working in amongst the local community and getting readers directly involved in production.
  • There will need to be accessible, re-usable public data provided regularly and in a consistent format. Without this it will be much harder to keep costs low because of the amount of time it takes to coax information out of public authorities and then analyze that data. This is why the launch of data.gov.uk was such an important development, and why we need to join Sir Tim Berners-Lee's quest for "raw data now" (as he shouts in his wonderfully quirky TED appearance).
  • Whether or not pay walls work or online news makes money, there will be a public interest gap. Some newsgathering and reporting will almost certainly never again be commercially profitable in an open market. Online news is highly unlikely ever to pay for a journalist to sit in a local court for days on end, for example. This was one of the most important things to come out of Michael Schudson and Leonard Downie's report, "The Reconstruction of American Journalism." Schudson and Downie could not find a market solution to some of the news problems they were exploring, and so settled instead on a mixture of tax breaks, subsidies, foundation grants, and donations.
  • We will rely, for aspects of watchdog journalism, on a combination of journalists, NGOs, and motivated members of the public. Note the use of the word "motivated." News organisations will need to find ways -- other than money -- to motivate and sustain people to help them scour data, dig through school and healthcare records, and alert them to corruption and injustice.
  • As well as motivating people, news organizations will need to build the tools that help the non-professional journalists be watchdogs -- tools like whatdotheyknow.com, a site built by MySociety that makes it relatively easy for people to make freedom of information requests and share the results of those requests with a wider community. Or the way the Guardian got the public to search through the millions of MPs' expenses claims.
  • News organizations and journalists will need to form and re-form partnerships with other organizations, journalism co-operatives, NGOs and members of the public. We're seeing this start to happen with sites like the Bay Citizen in San Francisco (see a good post by Mallary Jean Tenore on Poynter) and OpenFile, the beta site just launched by MediaShift managing editor Craig Silverman et al in Canada.

Even taking all this into account there's a good chance that, without some tweaking of the market, a few tax breaks here, maybe a start-up fund there, there will be a lot of public interest news blackspots.

So there it is. Not so bleak, but not so rosy, either. And take it with a big pinch of salt since the only ones who seem to know about a profitable business model for news just now are those running #futureofnews conferences.

April 08 2010


Review: Heather Brooke – The Silent State

The Silent State

In the week that a general election is called, Heather Brooke’s latest book couldn’t have been better timed. The Silent State is a staggeringly ambitious piece of work that pierces through the fog of the UK’s bureaucracies of power to show how they work, what is being hidden, and the inconsistencies underlying the way public money is spent.

Like her previous book, Your Right To Know, Brooke structures the book into chapters looking at different parts of the power system in the UK – making it a particularly usable reference work when you want to get your head around a particular aspect of our political systems.

Chapter by chapter

Chapter 1 lists the various databases that have been created to maintain information on citizens - paying particular focus to the little-publicised rack of databases holding subjective data on children. The story of how an old unpopular policy was rebranded to ride into existence on the back of the Victoria Climbie bandwagon is particularly illustrative of government’s hunger for data for data’s sake.

Picking up that thread further, Chapter 2 explores how much public money is spent on PR and how public servants are increasingly prevented from speaking directly to the media. It’s this trend which made The Times’ outing of police blogger Nightjack particularly loathsome and why we need to ensure we fight hard to protect those who provide an insight into their work on the ground.

Chapter 3 looks at how the misuse of statistics led to the independence of the head of the Office for National Statistics – but not the staff that he manages – and how the statistics given to the media can differ quite significantly from those provided when requested by a Select Committee (the lesson being that these can be useful sources to check). It’s a key chapter for anyone interested in the future of public data and data journalism.

Bureaucracy itself is the subject of the fourth chapter. Most of this is a plea for good bureaucracy and the end of unnamed sources, but there is still space for illustrative and useful anecdotes about acquiring information from the Ministry of Defence.

And in Chapter 5 we get a potted history of MySociety’s struggle to make politicians accountable for their votes, and an overview of how data gathered with public money – from The Royal Mail’s postcodes to Ordnance Survey – is sold back to the public at a monopolistic premium.

The justice system and the police are scrutinised in the 6th and 7th chapters – from the twisted logic that decreed audio recordings are more unreliable than written records to the criminalisation of complaint.

Then finally we end with a personal story in Chapter 8: a reflection on the MPs’ expenses saga that Brooke is best known for. You can understand the publishers – and indeed, many readers – wanting to read the story first-hand, but it’s also the least informative of all the chapters for journalists (which is a credit to all that Brooke has achieved on that front in wider society).

With a final ‘manifesto’ section Brooke summarises the main demands running across the book and leaves you ready to storm every institution in this country demanding change. It’s an experience reminiscent of finishing Franz Kafka’s The Trial – we have just been taken on a tour through the faceless, logic-deprived halls of power. And it’s a disconcerting, disorientating feeling.

Journalism 2.0

But this is not fiction. It is great journalism. And the victims caught in expensive paper trails and logical dead ends are real people.

Because although the book is designed to be dipped into as a reference work, it is also written as an eminently readable page-turner – indeed, the page-turning gets faster as the reader gets angrier. Throughout, Brooke illustrates her findings with anecdotes that not only put a human face on the victims of bureaucracy, but also pass on the valuable experience of those who have managed to get results.

For that reason, the book is not a pessimistic or sensationalist piece of writing. There is hope – and the likes of Brooke, and MySociety, and others in this book are testament to the fact that this can be changed.

The Silent State is journalism 2.0 at its best – not just exposing injustice and waste, but providing a platform for others to hold power to account. It’s not content for content’s sake, but a tool. I strongly recommend not just buying it – but using it. Because there’s some serious work to be done.

December 18 2009


‘A non-profit is a business as well,’ says mySociety’s senior developer

Francis Irving, senior developer at mySociety – an organisation that runs some of the biggest democracy projects in the UK – has shared some of his thoughts about online transparency and citizen collaboration in a Q&A for Journalism.co.uk’s news:rewired site.

What advice would he give to people going down the non-profit publishing route, we asked. Irving answers:

A non-profit is a business as well – it still has to make a surplus, it is just that that surplus is used to do more of the charitable work, rather than as personal profit.

I would advise people to go one of two ways – either have some good ideas for business models from the start (take a look at Patient Opinion for an example) or work out how to run it entirely on philanthropic donations and volunteer work.

It’s going to be as hard to start a sustainably funded non-profit as it is to start a successful for-profit business.

Francis Irving will be talking at Journalism.co.uk’s digital journalism event news:rewired, 14 January 2010.

Tickets still available at this link…
