Tumblelog by Soup.io
Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

April 11 2011


Data for journalists: understanding XML and RSS

If you are working with data chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.

There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.

What is XML?

XML is a language which is used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.

If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you.

(That said, this post doesn’t show you how to do any of that – it is mainly aimed at explaining how XML works so that you can begin to think about those possibilities.)

XML stands for “eXtensible Markup Language”. It’s the ‘markup’ bit which is key: XML ‘marks up’ information as being something in particular: relating to a particular date, for example; or a particular person; or referring to a particular location.

For example, a snippet of XML like this -


- tells you that the ‘Paris’ in this instance is a city, rather than a celebrity. And that it’s in France, not Texas.

That makes it easier for you to filter out information that isn’t relevant, or combine particular bits of information with data from elsewhere.

For example, if an XML file contains information on authors, you can filter out all but those by the person you’re interested in; if it contains publication dates, you can use that to plot associated content on a timeline.

Most usefully, if you have a set of data yourself such as a spreadsheet, you can pull related data from a relevant XML file. If your spreadsheet contains football teams and the XML provides locations, images, and history for each, then you can pull that in to create a fuller picture. If it contains addresses, there are services that will give you XML files with the constituency for those postcodes.

What is RSS?

RSS is a whole family of formats which are essentially based on XML – so they are structured in the same way, containing ‘markup’ that might tell you the author, publication date, location or other details about the information it relates to.

There is a lot of variation between different versions of RSS, but the main thing for the purposes of this post is that the various versions of RSS, and XML, share a structure which journalists can use if they know how to.

Which version isn’t particularly important: as long as you understand the principles, you can adapt what you do to suit the document or feed you’re working with.

Looking at XML and RSS

XML documents (for simplicity’s sake I’ll mostly just refer to ‘XML’ for the rest of this post, although I’m talking about both XML and RSS) contain two things that are of interest to us: content, and information about the content (‘markup’).

Information about the content is contained within tags in angle brackets (also known as chevrons): ‘<’ and ‘>’

For example: <name> or <pubDate> (publication date).

The tag is followed by the content itself, and a closing tag that has a forward slash, e.g. </name> or </pubDate>, so one line might look like this:

<name>Paul Bradshaw</name>

At this point it’s useful to have some XML or RSS in front of you. For a random example go to the RSS feed for the Scottish Government News.

To see the code right-click on that page and select View Source or similar – Firefox is worth using if another browser does not work; the Firebug extension also helps. (Note: if the feed is generated by Feedburner this won’t work: look for the ‘View Feed XML‘ button in the middle right area or add ?format=xml to the feed URL).

What you should see will include the following:

<title>Manufactured Exports Q4 2010</title>
<description>A National Statistics publication for Scotland.</description>
<guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid>
<pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate>

In the RSS feed itself this doesn’t start until line 14 (the first 13 lines are used to provide information about the feed as a whole, such as the version of RSS, title, copyright etc).

But from line 14 onwards this pattern repeats itself for a number of different ‘items’.

As you can see, each item has a title, a link, a description, a permalink, and a publication date. These are known as child elements (the item is the parent, or the ‘root element’).

More journalistic examples can be found at Mercedes GP’s XML file of the latest F1 Championship Standings (see the PS at the end of Tony Hirst’s post for an explanation of how this is structured), and MySociety’s Parliament Parser, which provides XML files on all parts of government, from MPs and peers to debates and constituencies, going back over a decade. Look at the Ministers XML file in Firefox and scroll down until you get to the first item tagged <ministerofficegroup>. Within each of those are details on ministerial positions. As the Parliament Parser page explains:

“Each one has a date range, the MP or Lord became a minister at some time on the start day, and stopped being one at some time on the end day. The matchid field is one sample MP or Lord office which that person also held. Alternatively, use the people.xml file to find out which person held the ministerial post.”

You’ll notice from that quote that some parts of the XML require cross-referencing to provide extra details. That’s where XML becomes very useful.

Using it in practice: working with XML in Yahoo! Pipes

Yahoo! Pipes provides a good introduction in working with data in XML or RSS. You’ll need to sign up at Pipes.Yahoo.com and click on ‘Create a Pipe‘.

You’ll now be editing a new project. On the left hand column are various ‘modules’ you can use. Click on ‘Sources‘ to expand it, and click and drag ‘Fetch Feed’ onto the graph paper-style canvas.

The 'Fetch Feed' module
The ‘Fetch Feed’ module

Copy the address of your RSS feed and paste it into the ‘Fetch Feed’ box. I’m using this feed of Health information from the UK government.

If you now click on the module so that it turns orange, you should be able (after a few moments) see that feed in the Debugger window at the bottom of the screen.

Click on the handle in the middle to pull it up and see more, and click on the arrows on the left to drill down to the ‘nested’ data within each item.

Drilling down into the data within an RSS feed
Drilling down into the data within an RSS feed

As you drill down you can see elements of data you can filter. In this case, we’ll use ‘region‘.

To filter the feed based on this we need the Filter module. On the left hand side click on ‘Operators‘ to expand that, and then drag the ‘Filter‘ module into the canvas.

Now drag a pipe from the circle at the bottom of the ‘Fetch Feed’ module to the top of the ‘Filter’ module.

Drag a pipe from Fetch Feed to Filter
Drag a pipe from Fetch Feed to Filter

Wait a moment for the ‘Filter’ module to work out what data the RSS feed contains. Then use the drop down menus so that it reads “Permit items that match all of the following”.

The next box determines which piece of data you will filter on. If you click on the drop-down here you should see all the pieces of data that are associated with each item.

Select the data you are filtering on
Select the data you are filtering on

We’re going to select ‘region’, and say that we only want to permit items where ‘region’ contains ‘North West’. If any of these don’t make any sense, look at the original RSS feed again to see what they contain.

Now drag a final pipe from the bottom of the ‘Filter’ module to the top of ‘Pipe output‘ at the bottom of the canvas. If you click on either you should be able to see in the Debugger that now only those items relating specifically to the North West are displayed.

If you wanted to you could now save this and click ‘Run Pipe‘ to see the results. Once you do you should notice options to ‘Get as RSS‘ – this would allow you to subscribe to this feed yourself or publish it on a website or Twitter account. There’s also ‘Get as JSON’ which is a whole other story – I’ll cover JSON in a future post.

You can see this pipe in action – and clone it yourself – here.

Oh, and a sidenote: if you wanted to grab an XML file in Yahoo! Pipes rather than an RSS feed, you would use ‘Fetch Data’ instead of ‘Fetch Feed’.

Just the start

There’s much more you can do here. Some suggestions for next steps:

Those are for future posts. For now I just want to demonstrate how XML works to add information-about-information which you can then use to search, filter, and combine data.

And it’s not just an esoteric language that is used by a geeky few as part of their newsgathering: journalists at Sky News, The Guardian and The Financial Times – to name just a few – all use this as a routine part of publishing, because it provides a way to dynamically update elements within a larger story without having to update the whole thing from scratch – for example by updating casualty numbers or new dates on a timeline.

And while I’m at it, if you have any examples of XML being used in journalism for either newsgathering or publishing, let me know.


October 21 2010


Review: Yahoo! Pipes tutorial ebook

Pipes Tutorial ebook

I’ve been writing about Yahoo! Pipes for some time, and am consistently surprised that there aren’t more books on the tool. Pipes Tutorial – an ebook currently priced at $14.95 – is clearly aiming to address that gap.

The book has a simple structure: it is, in a nutshell, a tour around the various ‘modules’ that you combine to make a pipe.

Some of these will pull information from elsewhere – RSS feeds, CSV spreadsheets, Flickr, Google Base, Yahoo! Local and Yahoo! Search, or entire webpages.

Some allow the user to input something themselves – for example, a search phrase, or a number to limit the type of results given.

And others do things with all the above – combining them, splitting them, filtering, converting, translating, counting, truncating, and so on.

When combined, this makes for some powerful possibilities – unfortunately, its one-dimensional structure means that this book doesn’t show enough of them.

Modules in isolation

While the book offers a good introduction into the functionality of the various parts of Yahoo! Pipes, it rarely demonstrates how those can be combined. Typically, tutorial books will take you through a project that utilises the power of the tools covered, but Pipes Tutorial lacks this vital element. Sometimes modules will be combined in the book but this is mainly done because that is the only way to show how a single module works, rather than for any broader pedagogical objective.

At other times a module is explained in isolation and it is not explained how the results might actually be used. The Fetch Page module, for example – which is extremely useful for scraping content from a webpage – is explained without reference to how to publish the results, only a passing mention that the reader will have to use ‘other modules’ to assign data to types, and that Regex will be needed to clean it up.

Regex itself – possibly one of the most useful parts of Yahoo! Pipes – is cursorily tackled, and the reader pointed to resources elsewhere. The same applies to YQL – the language that allows you to interrogate data sources. Likewise, the Web Service module which allows you to connect with an API, isn’t illustrated with any practical guidance on how to use it.

The book makes no mention of the ability to clone pipes published by others on Yahoo! Pipes, and misses a big opportunity to provide links to working pipes that the user can clone and play with themselves – or indeed any online support that I can see other than a blog that currently has 2 instructional posts.

Despite all the above omissions, the lack of similar books mean this is still a useful resource for aspiring data journalists. It provides an insight into the possibilities of Pipes, even if it doesn’t quite take you through how to exploit those.

PS: If you’ve read any other books on Yahoo! Pipes (including this one) let me know whether they’re any use.

May 04 2010


Data journalism pt5: Mashing data (comments wanted)

This is a draft from a book chapter on data journalism (part 1 looks at finding data; part 2 at interrogating datapart 3 at visualisation, and 4 at visualisation tools). I’d really appreciate any additions or comments you can make – particularly around tips and tools.

Mashing data

Wikipedia defines a mashup particularly succinctly, as “a web page or application that uses or combines data or functionality from two or many more external sources to create a new service.” Those sources may be online spreadsheets or tables; maps; RSS feeds (which could be anything from Twitter tweets, blog posts or news articles to images, video, audio or search results); or anything else which is structured enough to ‘match’ against another source.

This ‘match’ is typically what makes a mashup. It might be matching a city mentioned in a news article against the same city in a map; or it may be matching the name of an author with that same name in the tags of a photo; or matching the search results for ‘earthquake’ from a number of different sources. The results can be useful to you as a journalist, to the user, or both.

Why make a mashup?

Mashups can be particularly useful in providing live coverage of a particular event or ongoing issue – mashing images from a protest march, for example, against a map. Creating a mashup online is not too dissimilar from how, in broadcast journalism, you might set up cameras at key points around a physical location in anticipation of an event from which you will later ‘pull’ live feeds: in a mashup you are effectively doing exactly the same thing – only in a virtual space rather than a physical one. So, instead of setting up a feed at the corner of an important junction, you might decide to pull a feed from Flickr of any images that are tagged with the words ‘protest’ and ‘anti-fascist’.

Some web developers have built entire sites that are mashups. Twazzup (twazzup.com) for example, will show you a mix of Twitter tweets, images from Flickr, news updates and websites – all based on the search term you enter. And Friendfeed (friendfeed.com) pulls in data that you and your social circle post to a range of social networking sites, and displays them in one place.

Mashups also provide a different way for users to interact with content – either by choosing how to navigate (for instance by using a map), or by inviting them to input something (for instance, a search term, or selecting a point on a slider). The Super Tuesday YouTube/Google Maps mashup, for instance, provided an at-a-glance overview of what election-related videos were being uploaded where across the US.

Finally, mashups offer an opportunity for juxtaposing different datasets to provide fresh, sometimes ongoing, insights. The MySociety/Channel 4 project Mapumental, for example, combines house price data with travel information and data on the ’scenicness’ of different locations to provide an interactive map of a location which the user can interrogate based on their individual preferences.

Mashup tools

Like so many aspects of online journalism, the ease with which you can create a mashup has increased significantly in recent years. An increase in the number and power of online tools, combined with the increasing ‘mashability’ of websites and data, mean that journalists can now create a basic mashup through the simple procedures of drag-and-drop or copy-and-paste.

A simple RSS mashup, which combines the feeds from a number of different sources into one, for example, can now be created using an RSS aggregator such as xFruits (xfruits.com) or Jumbra (jumbra.com).

Likewise, you can mix two maps together using the website MapTube (maptube.org) which also contains a number of maps for you to play with.

And if you want to mix two sources of data into one visualisation the site DataMasher (datamasher.org) will let you do that – although you’ll have to make do with the US data that the site provides. Google Public Data Explorer (google.com/publicdata) is a similar tool which allows you to play with global data.

But perhaps the most useful tool for news mashups is Yahoo! Pipes (pipes.yahoo.com).

Yahoo! Pipes allows you to choose a source of data – it might be an RSS feed, an online spreadsheet or something that the user will input – and do a variety of things with it. Here are just some of the basic things you might do:

  • Add it to other sources
  • Combine it with other sources – for instance, matching images to text
  • Filter it
  • Count it
  • Annotate it
  • Translate it
  • Create a gallery from the results
  • Place results on a map

You could write a whole book on how to use Yahoo! Pipes – indeed, people have – so we will not cover the practicalities of using all of those features here. There are also dozens of websites and help files devoted to the site (which you should explore). Below, however, is a short tutorial to introduce you to the website and how it works – this is a good way to understand how basic mashups work, and how easily they can be created.

Mashups and APIs

Although there are a number of easy-to-use mashup creators listed above, really impressive mashups tend to be written by people with knowledge of programming languages, and use APIs. APIs (Application Programming Interface) allow websites to interact with other websites. The launch of the Google Maps API in 2005, for example, has been described as a ‘huge tipping point’ in mashup history (Duvander, 2008) as it allowed web developers to ‘mash’ countless other sources of data with maps. Since then it has become commonplace for new websites, particularly in the social media arena, to launch their own APIs in order to allow web developers to do interesting things with their feeds and data – not just mashups, but applications and services too.

If you want to develop a particularly ambitious mashup it is likely that you will need to teach yourself some programming skills, and familiarise yourself with some APIs (the APIs of Twitter, Google Maps and Flickr are good places to start).

Box-out: Anatomy of a feed

The image below from ReadWriteWeb shows the code behind a simple Twitter update. It includes information about the author, their location, whether the update was a reply to someone else, what time and where it was created, and lots more besides. Each of these values can be used by a mashup in various ways – for example, you might match the author of this tweet with the author of a blog or image; you might match its time against other things being published at that moment; or you might use their location to plot this update on a map.

While the code can be intimidating, you do not need to understand programming in order to be able to do things with it. Of course, it will help if you do…

Anatomy of a Twitter feed

March 12 2010


Online Journalism lesson #10: RSS and mashups

This was the final session in my undergraduate Online Journalism module (the other classes can be found here), taught last May. It’s a relatively brief presentation, just covering some of the possibilities of mashups and RSS, and some tools. The majority of the class is taken up with students using Yahoo! Pipes to aggregate a number of feeds.

I didn’t know how students would cope with Yahoo! Pipes but, surprisingly, every one completed the task.

As a side note, this year I kicked off the module with students setting up Twitter, Delicious and Google Reader – and synchronising them, so the RSS feed from one could update another (e.g. bookmarks being published to Twitter). This seems to have built a stronger understanding of RSS in the group, which they are able to apply elsewhere (they also have widgets on their blogs pulling the RSS feeds from Twitter & Delicious; and their profile page on the news website – built by Kasper Sorensen – pulls the latest updates from their Twitter, Delicious and blog feeds).

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.
Get rid of the ads (sfw)

Don't be the product, buy the product!