
July 29 2011

08:24

SFTW: How to scrape webpages and ask questions with Google Docs and =importXML

XML puzzle cube

Image by dullhunk on Flickr

Here’s another Something for the Weekend post. Last week I wrote a post on how to use the =importFeed formula in Google Docs spreadsheets to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns. Another formula which performs a similar function more powerfully is =importXML.

There are at least two distinct journalistic uses for =importXML:

  1. You have found information that is only available in XML format and need to put it into a standard spreadsheet to interrogate it or combine it with other data.
  2. You want to extract some information from a webpage – perhaps on a regular basis – and put that in a structured format (a spreadsheet) so you can more easily ask questions of it.

The first task is the easiest, so I’ll explain how to do that in this post. I’ll use a separate post to explain the latter.

Converting an XML feed into a table

If you have some information in XML format it helps if you have some understanding of how XML is structured. A backgrounder on how to understand XML is covered in this post explaining XML for journalists.

It also helps if you are using a browser which is good at displaying XML pages: Chrome, for example, not only staggers and indents different pieces of information, but also allows you to expand or collapse parts of that, and colours elements, values and attributes (which we’ll come on to below) differently.

Say, for example, you wanted a spreadsheet of UK council data, including latitude, longitude, CIPFA code, and so on – and you found the data, but it was in XML format at a page like this:  http://openlylocal.com/councils/all.xml

To pull that into a neatly structured spreadsheet in Google Docs, type the following into the cell where you want the import to begin (try typing in cell A2, leaving the first row free for you to add column headers):

=ImportXML("http://openlylocal.com/councils/all.xml", "//council")

The formula (or, more accurately, function) needs two pieces of information, which are contained in the parentheses and separated by a comma: a web address (URL), and a query. Or, put another way:

=importXML("theURLinQuotationMarks", "theBitWithinTheURLthatYouWant")

The URL is relatively easy – it is the address of the XML file you are reading (it should end in .xml). The query needs some further explanation.

The query tells Google Docs which bit of the XML you want to pull out. It uses a language called XPath – but don’t worry, you will only need to note down a few queries for most purposes.

Here’s an example of part of that XML file shown in the Chrome browser:

XML from OpenlyLocal

The indentation and triangles indicate the way the data is structured. So, the <councils> tag contains at least one item called <council> (if you scrolled down, or clicked on the triangle to collapse <council> you would see there are a few hundred).

And each <council> contains an <address>, <authority-type>, and many other pieces of information.

If you wanted to grab every <council> from this XML file, then, you use the query “//council” as shown above. Think of the // as a replacement for the < in a tag – you are saying: ‘grab the contents of every item that begins <council>’.
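To see what a query like //council actually matches, here's a minimal sketch in Python using the standard library's ElementTree. The sample XML is invented to mimic the shape of the OpenlyLocal feed, not taken from it:

```python
import xml.etree.ElementTree as ET

# Invented sample data in the shape of the OpenlyLocal feed (not real values).
sample = """
<councils>
  <council>
    <name>Aberdeen City Council</name>
    <authority-type>Unitary</authority-type>
  </council>
  <council>
    <name>Birmingham City Council</name>
    <authority-type>Metropolitan Borough</authority-type>
  </council>
</councils>
"""

root = ET.fromstring(sample)
# ".//council" is ElementTree's spelling of the XPath "//council":
# grab every <council> element anywhere beneath the root.
councils = root.findall(".//council")
print(len(councils))  # 2
```

Each element in that list corresponds to one row in the resulting spreadsheet.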

You’ll notice that in the spreadsheet where you typed the formula above, it gathers the contents (called a value) of each tag within <council>, with each tag’s value going into its own column – giving you dozens of columns.

You can continue this logic to look for tags within tags. For example, if you wanted to grab the <name> value from within each <council> tag, you could use:

=ImportXML("http://openlylocal.com/councils/all.xml", "//council//name")

You would then only have one column, containing the names of all the councils – if that’s all you wanted. You could of course adapt the formula again in cell B2 to pull another piece of information. However, you may end up with a mismatch of data where that information is missing – so it’s always better to grab all the XML once, then clean it up on a copy.
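The same query can be sketched in Python with ElementTree, again with invented sample data in the shape of the OpenlyLocal feed:

```python
import xml.etree.ElementTree as ET

# Invented sample data in the shape of the OpenlyLocal feed (not real values).
sample = """
<councils>
  <council><name>Aberdeen City Council</name></council>
  <council><name>Birmingham City Council</name></council>
</councils>
"""

root = ET.fromstring(sample)
# The spreadsheet query "//council//name" drills from each <council> down to
# its <name>; since <name> is a direct child here, ".//council/name" finds it.
names = [n.text for n in root.findall(".//council/name")]
print(names)  # ['Aberdeen City Council', 'Birmingham City Council']
```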

If the XML is more complex then you can ask more complex questions – which I’ll cover in the second part of this post. You can also put the URL and/or query in other cells to simplify matters, e.g.

=ImportXML(A1, B1)

Where cell A1 contains http://openlylocal.com/councils/all.xml and B1 contains //council (note the lack of quotation marks). You then only need to change the contents of A1 or B1 to change the results, rather than having to edit the formula directly.

If you’ve any other examples, ideas or corrections, let me know. Meanwhile, I’ve published an example spreadsheet demonstrating all the above techniques here.


July 25 2011

17:34

The style challenge

Spot the odd one out. Image by Cliff Muller

Time was when a journalist could learn one or two writing styles and stick with them. They might command enormous respect for being the best at what they did. But sometimes, when that journalist moved to another employer, their style became incongruous. And they couldn’t change.

This is the style challenge, and it’s one that has become increasingly demanding for journalists in an online age.

Because not only must they be able to adapt their style for different types of reporting; not only must they be able to adapt for different brands; not only must they be able to adapt their style within different brands across multiple media; but they must also be able to adapt their style within a single medium, across multiple platforms: Twitter, Facebook, blogs, Flickr, YouTube, or anywhere else that their audiences gather.

Immersion and language

Style is a fundamental skill in journalism. It is difficult to teach, because it relies on an individual immersing themselves in media, and doing so in a way that goes beyond each message to the medium itself. This is why journalism tutors urge their students so strongly to read as many newspapers as they can; to watch the news and listen to it, obsessively. Without immersion it is difficult to speak any language.

Now, some people do immerse themselves and have a handle on current affairs. That’s useful, but not the point.

Some do it and gain an understanding of institutions and audiences (that one is left-leaning; this one is conservative with a small c, etc.).

This is also useful, but also not the point.

The point is about how each institution addresses each audience, and when.

Despite journalists and editors often having an intuitive understanding of this difference in print or broadcast, over the last decade they’ve often demonstrated an inability to apply the same principles when it comes to publishing online.

And so we’ve had shovelware: organisations republishing print articles online without any changes. We’ve had opinion columns published as blogs because ‘blogs are all about opinion’. And we’ve had journalists treating Twitter as just another newswire to throw out headlines.

This is like a first-time radio broadcaster opening with “Hey, all you out there” as if they were a Balearic DJ. Good journalists should know better.

Style serves communication

Among many other things, a good journalism or media degree should teach not just the practical skills of journalism but an intellectual understanding of communication and, by extension, style.

Because style is, at its base, about communication. It is about register: understanding what tone to adopt based on who you are talking to, what you are talking about, the relationship you seek to engender, and the history behind that.

As communication channels and tools proliferate, we probably need to pay more attention to that.

Journalists are being asked to adapt their skills from print to video; from formal articles to informal blog posts; from Facebook Page updates to tweets.

They are having to learn new styles of liveblogging, audio slideshows, mapping and apps; to operate within the formal restrictions of XML or SEO.

For freelance journalists, commissioning briefs increasingly ask for that flexibility even within the same piece of work, offering extra payments for an online version, a structured version, a podcast, and so on.

These requests are often quite basic – requiring a list of links for an online version, for example – but as content management systems become more sophisticated, those conditions will become more stringent: supplying an XML file with data on a product being reviewed, for example, or a version optimised for search.

What complicates things further is that, for many of these platforms, we are inventing the language as we speak it.

For those new to the platform, it can be intimidating. But for those who invest time in gaining experience, it is an enormous opportunity.

Because those who master the style of a blog, or Facebook, or Twitter, or addressing a particular group on Flickr, or a YouTube community, put themselves in an incredible position, building networks that a small magazine publisher would die for.

That’s why style is so important – now more than ever, and in the future more than now.


April 11 2011

13:00

Data for journalists: understanding XML and RSS

If you are working with data, chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.

There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.

What is XML?

XML is a language which is used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.

If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you.

(That said, this post doesn’t show you how to do any of that – it is mainly aimed at explaining how XML works so that you can begin to think about those possibilities.)

XML stands for “eXtensible Markup Language”. It’s the ‘markup’ bit which is key: XML ‘marks up’ information as being something in particular: relating to a particular date, for example; or a particular person; or referring to a particular location.

For example, a snippet of XML like this -

<city>Paris</city>
<country>France</country>

- tells you that the ‘Paris’ in this instance is a city, rather than a celebrity. And that it’s in France, not Texas.
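You can check that reading of the snippet with any XML parser. A quick Python sketch – the <place> wrapper is my own addition, since a fragment needs a single root element to parse:

```python
import xml.etree.ElementTree as ET

# The two tags from the post, wrapped in an invented <place> root so the
# fragment parses as a well-formed document.
snippet = "<place><city>Paris</city><country>France</country></place>"
place = ET.fromstring(snippet)

print(place.findtext("city"))     # Paris
print(place.findtext("country"))  # France
```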

That makes it easier for you to filter out information that isn’t relevant, or combine particular bits of information with data from elsewhere.

For example, if an XML file contains information on authors, you can filter out all but those by the person you’re interested in; if it contains publication dates, you can use that to plot associated content on a timeline.

Most usefully, if you have a set of data yourself such as a spreadsheet, you can pull related data from a relevant XML file. If your spreadsheet contains football teams and the XML provides locations, images, and history for each, then you can pull that in to create a fuller picture. If it contains addresses, there are services that will give you XML files with the constituency for those postcodes.

What is RSS?

RSS is a whole family of formats which are essentially based on XML – so they are structured in the same way, containing ‘markup’ that might tell you the author, publication date, location or other details about the information it relates to.

There is a lot of variation between different versions of RSS, but the main thing for the purposes of this post is that the various versions of RSS, and XML, share a structure which journalists can use if they know how to.

Which version isn’t particularly important: as long as you understand the principles, you can adapt what you do to suit the document or feed you’re working with.

Looking at XML and RSS

XML documents (for simplicity’s sake I’ll mostly just refer to ‘XML’ for the rest of this post, although I’m talking about both XML and RSS) contain two things that are of interest to us: content, and information about the content (‘markup’).

Information about the content is contained within tags in angle brackets (also known as chevrons): ‘<’ and ‘>’

For example: <name> or <pubDate> (publication date).

The tag is followed by the content itself, and a closing tag that has a forward slash, e.g. </name> or </pubDate>, so one line might look like this:

<name>Paul Bradshaw</name>

At this point it’s useful to have some XML or RSS in front of you. For a random example go to the RSS feed for the Scottish Government News.

To see the code, right-click on that page and select View Source or similar – Firefox is worth using if another browser does not work, and the Firebug extension also helps. (Note: if the feed is generated by Feedburner this won’t work – look for the ‘View Feed XML‘ button in the middle right area, or add ?format=xml to the feed URL).

What you should see will include the following:

<item>
<title>Manufactured Exports Q4 2010</title>
<link>http://www.scotland.gov.uk/News/Releases/2011/04/06100351</link>
<description>A National Statistics publication for Scotland.</description>
<guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid>
<pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate>
</item>

In the RSS feed itself this doesn’t start until line 14 (the first 13 lines are used to provide information about the feed as a whole, such as the version of RSS, title, copyright etc).

But from line 14 onwards this pattern repeats itself for a number of different ‘items’.

As you can see, each item has a title, a link, a description, a permalink, and a publication date. These are known as child elements, with <item> as their parent element.
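To make that concrete, here’s a short Python sketch that parses the <item> quoted above and lists its child elements – the same structure you can see in the browser:

```python
import xml.etree.ElementTree as ET

# The <item> from the Scottish Government feed, pasted in verbatim.
item_xml = """<item>
<title>Manufactured Exports Q4 2010</title>
<link>http://www.scotland.gov.uk/News/Releases/2011/04/06100351</link>
<description>A National Statistics publication for Scotland.</description>
<guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid>
<pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate>
</item>"""

item = ET.fromstring(item_xml)
# Iterating over an element yields its child elements; .tag is the name
# inside the angle brackets, .text the value between opening and closing tags.
children = [child.tag for child in item]
print(children)  # ['title', 'link', 'description', 'guid', 'pubDate']

# Attributes such as isPermaLink="true" live in a dictionary on the element.
print(item.find("guid").attrib["isPermaLink"])  # true
```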

More journalistic examples can be found at Mercedes GP’s XML file of the latest F1 Championship Standings (see the PS at the end of Tony Hirst’s post for an explanation of how this is structured), and MySociety’s Parliament Parser, which provides XML files on all parts of government, from MPs and peers to debates and constituencies, going back over a decade. Look at the Ministers XML file in Firefox and scroll down until you get to the first item tagged <ministerofficegroup>. Within each of those are details on ministerial positions. As the Parliament Parser page explains:

“Each one has a date range, the MP or Lord became a minister at some time on the start day, and stopped being one at some time on the end day. The matchid field is one sample MP or Lord office which that person also held. Alternatively, use the people.xml file to find out which person held the ministerial post.”

You’ll notice from that quote that some parts of the XML require cross-referencing to provide extra details. That’s where XML becomes very useful.

Using it in practice: working with XML in Yahoo! Pipes

Yahoo! Pipes provides a good introduction to working with data in XML or RSS. You’ll need to sign up at Pipes.Yahoo.com and click on ‘Create a Pipe‘.

You’ll now be editing a new project. On the left hand column are various ‘modules’ you can use. Click on ‘Sources‘ to expand it, and click and drag ‘Fetch Feed’ onto the graph paper-style canvas.

The ‘Fetch Feed’ module

Copy the address of your RSS feed and paste it into the ‘Fetch Feed’ box. I’m using this feed of Health information from the UK government.

If you now click on the module so that it turns orange, you should be able (after a few moments) to see that feed in the Debugger window at the bottom of the screen.

Click on the handle in the middle to pull it up and see more, and click on the arrows on the left to drill down to the ‘nested’ data within each item.

Drilling down into the data within an RSS feed

As you drill down you can see elements of data you can filter. In this case, we’ll use ‘region‘.

To filter the feed based on this we need the Filter module. On the left hand side click on ‘Operators‘ to expand that, and then drag the ‘Filter‘ module into the canvas.

Now drag a pipe from the circle at the bottom of the ‘Fetch Feed’ module to the top of the ‘Filter’ module.

Drag a pipe from Fetch Feed to Filter

Wait a moment for the ‘Filter’ module to work out what data the RSS feed contains. Then use the drop down menus so that it reads “Permit items that match all of the following”.

The next box determines which piece of data you will filter on. If you click on the drop-down here you should see all the pieces of data that are associated with each item.

Select the data you are filtering on

We’re going to select ‘region’, and say that we only want to permit items where ‘region’ contains ‘North West’. If any of this doesn’t make sense, look at the original RSS feed again to see what the items contain.

Now drag a final pipe from the bottom of the ‘Filter’ module to the top of ‘Pipe output‘ at the bottom of the canvas. If you click on either you should be able to see in the Debugger that now only those items relating specifically to the North West are displayed.
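If you prefer code to pipes, the Filter module boils down to a single test on each item. Here’s a rough Python equivalent, with the items reduced to invented dictionaries rather than a real parsed feed (all titles and field names are illustrative):

```python
# Invented stand-ins for parsed feed items; field names are illustrative.
items = [
    {"title": "Hospital upgrade announced", "region": "North West"},
    {"title": "New clinic opens", "region": "London"},
    {"title": "Vaccination drive extended", "region": "North West"},
]

# "Permit items that match all of the following: region contains North West"
north_west = [item for item in items if "North West" in item["region"]]

print([item["title"] for item in north_west])
# ['Hospital upgrade announced', 'Vaccination drive extended']
```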

If you wanted to, you could now save this and click ‘Run Pipe‘ to see the results. Once you do, you should notice options to ‘Get as RSS‘ – this would allow you to subscribe to this feed yourself or publish it on a website or Twitter account. There’s also ‘Get as JSON’, which is a whole other story – I’ll cover JSON in a future post.

You can see this pipe in action – and clone it yourself – here.

Oh, and a sidenote: if you wanted to grab an XML file in Yahoo! Pipes rather than an RSS feed, you would use ‘Fetch Data’ instead of ‘Fetch Feed’.

Just the start

There’s much more you can do here – but suggestions for next steps are for future posts. For now I just want to demonstrate how XML works to add information-about-information which you can then use to search, filter, and combine data.

And it’s not just an esoteric language that is used by a geeky few as part of their newsgathering: journalists at Sky News, The Guardian and The Financial Times – to name just a few – all use this as a routine part of publishing, because it provides a way to dynamically update elements within a larger story without having to update the whole thing from scratch – for example by updating casualty numbers or new dates on a timeline.

And while I’m at it, if you have any examples of XML being used in journalism for either newsgathering or publishing, let me know.


December 19 2010

18:00

Games, systems and context in journalism at News Rewired

I went to News Rewired on Thursday, along with dozens of other journalists and folk concerned in various ways with news production. Some threads that ran through the day for me were discussions of how we publish our data (and allow others to do the same), how we link our stories together with each other and the rest of the web, and how we can help our readers to explore context around our stories.

One session focused heavily on SEO for specialist organisations, but included a few sharp lessons for all news organisations. Frank Gosch spoke about the importance of ensuring your site’s RSS feeds are up to date and allow other people to easily subscribe to and even republish your content. Instead of clinging tight to content, it’s good for your search rankings to let other people spread it around.

James Lowery echoed this theme, suggesting that publishers, like governments, should look at providing and publishing their data in re-usable, open formats like XML. It’s easy for data journalists to get hung up on how local councils, for instance, are publishing their data in PDFs, but to miss how our own news organisations are putting out our stories, visualisations and even datasets in formats that limit or even prevent re-use and mashup.

Following on from that, in the session on linked data and the semantic web, Martin Belam spoke about the Guardian’s API, which can be queried to return stories on particular subjects and which is starting to use unique identifiers – MusicBrainz IDs and ISBNs, for instance – to allow lists of stories to be pulled out not simply by text string but using a meaningful identification system. He added that publishers have to licence content in a meaningful way, so that it can be reused widely without running into legal issues.

Silver Oliver said that semantically tagged data, linked data, creates opportunities for pulling in contextual information for our stories from all sorts of other sources. And conversely, if we semantically tag our stories and make it possible for other people to re-use them, we’ll start to see our content popping up in unexpected ways and places.

And in the long term, he suggested, we’ll start to see people following stories completely independently of platform, medium or brand. Tracking a linked data tag (if that’s the right word) and following what’s new, what’s interesting, and what will work on whatever device I happen to have in my hand right now and whatever connection I’m currently on – images, video, audio, text, interactives; wifi, 3G, EDGE, offline. Regardless of who made it.

And this is part of the ongoing move towards creating a web that understands not only objects but also relationships, a world of meaningful nouns and verbs rather than text strings and many-to-many tables. It’s impossible to predict what will come from these developments, but – as an example – it’s not hard to imagine being able to take a photo of a front page on a newsstand and use it to search online for the story it refers to. And the results of that search might have nothing to do with the newspaper brand.

That’s the down side to all this. News consumption – already massively decentralised thanks to the social web – is likely to drift even further away from the cosy silos of news brands (with the honourable exception of paywalled gardens, perhaps). What can individual journalists and news organisations offer that the cloud can’t?

One exciting answer lies in the last session of the day, which looked at journalism and games. I wrote some time ago about ways news organisations were harnessing games, and could do in the future – and the opportunities are now starting to take shape. With constant calls for news organisations to add context to stories, it’s easy to miss the possibility that – as Philip Trippenbach said at News Rewired – you can’t explain a system with a story:

Stories can be a great way of transmitting understanding about things that have happened. The trouble is that they are actually a very bad way of transmitting understanding about how things work.

Many of the issues we cover – climate change, government cuts, the deficit – at macro level are systems that could be interestingly and interactively explored with games. (Like this climate change game here, for instance.) Other stories can be articulated and broadened through games in a way that allows for real empathy between the reader/player and the subject because they are experiential rather than intellectual. (Like Escape from Woomera.)

Games allow players to explore systems, scenarios and entire universes in detail, prodding their limits and discovering their flaws and hidden logic. They can be intriguing, tricky, challenging, educational, complex like the best stories can be, but they’re also fun to experience, unlike so much news content that has a tendency to feel like work.

(By the by, this is true not just of computer and console games but also of live, tabletop, board and social games of all sorts – there are rich veins of community journalism that could be developed in these areas too, as the Rochester Democrat and Chronicle is hoping to prove for a second time.)

So the big things to take away from News Rewired, for me?

  • The systems within which we do journalism are changing, and the semantic web will most likely bring another seismic change in news consumption and production.
  • It’s going to be increasingly important for us to produce content that both takes advantage of these new technologies and allows others to use these technologies to take advantage of it.
  • And by tapping into the interactive possibilities of the internet through games, we can help our readers explore complex systems that don’t lend themselves to simple stories.

Oh, and some very decent whisky.

Cross-posted at Metamedia.

August 11 2010

17:10

Importing XML / RSS feeds into Wordpress from another cms

Have you exported stories via XML from a cms and then imported them to Wordpress using FTP access? If you have, how did you do it? Thanks for any help you can provide!
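(For anyone attempting the same thing: one common route is to transform the other CMS’s export into an RSS/WXR-style file that WordPress’s importer understands, then upload it via the admin import tool. The sketch below assumes hypothetical source element names – `<story>`, `<headline>`, `<body>` – so substitute whatever your CMS actually emits.)

```python
# Sketch: map stories exported from another CMS into a minimal
# RSS-style file that a WordPress importer can read.
# Source element names are hypothetical placeholders.
import xml.etree.ElementTree as ET

def cms_to_wxr(source_xml: str) -> str:
    src = ET.fromstring(source_xml)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for story in src.findall("story"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = story.findtext("headline", "")
        ET.SubElement(item, "description").text = story.findtext("body", "")
    return ET.tostring(rss, encoding="unicode")

sample = ("<export><story><headline>Hello</headline>"
          "<body>World</body></story></export>")
print(cms_to_wxr(sample))
```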

May 28 2010

15:41

Trying to get Kenyan Parliament to export XML. Need to clarify what we want.

My friend Ory (who you may know as founder of Ushahidi) is trying to talk the Kenyan Parliament web developer into exporting XML instead of PDFs for their official transcripts (a document called the Hansard).

The good news is, they're receptive. The bad news is they speak heavy geek, and we can't communicate well. She got this answer, which is pretty opaque to her and me both:


Hi ory,

Sorry for responding late. Well on the PROTOCOL REXML, When two processes located at different nodes communicate with each other, the interface of the communication can be implemented in two ways:

1) as a protocol; the data are defined as program structures and sent finally as binaries. 2) as exchange of XML files; the data are defined in text files by means of XML syntax and sent as strings.

How could you compare the performance of these two approaches? I would go for XML files but I'm afraid that this can slow down the communication. Thus the time out if not increased counters the reXML time out.

Read http://code.google.com/apis/protocolbuffers/docs/faq.html


Huh? Can someone translate this, and suggest the best-practice answer (or some kind of documentation) to respond with?
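My rough paraphrase (not an official answer): the developer is weighing two ways for two systems to exchange data. Option one packs the data into a compact binary format defined by program structures (what Protocol Buffers formalise); option two writes the same data out as self-describing XML text, which is bigger and slower to parse but readable by any tool – which is exactly why XML is the right ask for a public record like the Hansard. A toy contrast, using an invented Hansard entry:

```python
# The same (hypothetical) Hansard entry serialised both ways.
import struct
import xml.etree.ElementTree as ET

speaker, text = "Hon. Member", "Point of order, Mr. Speaker."

# Option 1: a binary "protocol" -- compact, but opaque without
# the matching decoder at the other end.
binary = struct.pack(f"<H{len(speaker)}sH{len(text)}s",
                     len(speaker), speaker.encode(),
                     len(text), text.encode())

# Option 2: XML exchange -- larger, but self-describing and
# usable by anyone with a standard parser.
entry = ET.Element("entry")
ET.SubElement(entry, "speaker").text = speaker
ET.SubElement(entry, "text").text = text
xml_text = ET.tostring(entry, encoding="unicode")

print(len(binary), "bytes as binary vs", len(xml_text), "bytes as XML")
```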

Cheers, Jonathan Eyler-Werve

Tags: xml

November 25 2009

20:33

Could Mashlogic be the answer to infoglut in the Web 2.0 world?

Combating information overload in the Internet age can be a tricky thing. The reader is often overwhelmed with the plethora of Web sites and news portals, and the publisher has to come up with a way to retain loyal users who will stick to their brand even while they are taken from hyperlink to hyperlink through an endless loop of news stories on a singular topic of interest.

Consumer version

Mashlogic, a tool that allows users to personalize their Web searches and define information on their own terms, promises to change that. The site assures readers that it can bring relief to their “RSS indigestion” woes in the Internet age. In addition to allowing the user to choose his or her most trusted sources of news on the Web, the consumer version of Mashlogic, which can be downloaded as a plugin for the Firefox or IE browser, permits readers to outline topics of interest in order to adapt Web-surfing to their needs.

“Mashlogic adds a layer of contextual information to casual viewing experience on a Web site,” says John Bryan, vice president of business development.

Users can go to the Mashlogic site and build their own “mashes.” Here, they can customize source feeds, which may include everything from brand names such as the Guardian or the New York Times, to aggregate mixes, which may incorporate celebrity news and sports teams they follow, and content from bloggers and tweeters. Everything from Wikipedia definitions to LinkedIn profiles of people mentioned in articles can be tracked based on a user’s interest. Mashlogic also allows readers to highlight and choose sources and order them based on their priorities. Little wonder then, that Techcrunch is calling it a “Swiss Army Knife for hyperlinks.” Behind the scenes, the tool scans RSS and XML feeds from the chosen sites for “strings of words” in Web pages based on the user’s pre-selected choices.
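That behind-the-scenes matching step can be pictured something like this – the feed structure and user terms below are illustrative, not Mashlogic’s actual code:

```python
# A guess at the matching step: pull titles from a chosen RSS/XML
# feed and flag those containing the user's pre-selected terms.
import xml.etree.ElementTree as ET

feed = """<rss><channel>
  <item><title>Peyton Manning leads comeback win</title></item>
  <item><title>Health care bill stalls in Senate</title></item>
</channel></rss>"""

user_terms = {"peyton manning", "health care"}

matches = []
for item in ET.fromstring(feed).iter("item"):
    title = item.findtext("title", "")
    for term in user_terms:
        if term in title.lower():
            matches.append((term, title))

print(matches)
```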

To Stay Tool

Internet readers trying to distill information overload on the Web aren’t the only ones who can take advantage of Mashlogic. Companies and news sites that are interested in preserving their brand, retaining readers and generating page views and revenue can utilize the company’s more recent tool, aptly named, “To Stay.”

Here, the publisher takes a few lines of JavaScript and embeds it on a page. When the tool looks for matching terms on a site that has this embedded script, a branded box alerting the reader to relevant articles from the site itself will pop up as the user drags his cursor over specific terms. It gives site owners a way to let users navigate news on their site without having to rely on search engines, which can often turn up irrelevant information from untrusted sources. The technology works on two levels – it looks at direct tags, which would redirect the reader to articles based solely on words or phrases, and also contextually scans tags around a term, yielding associated tags, and hence secondary stories. This not only prompts the user to stay on a site longer, but also directs traffic to more popular – and hence, more profitable – parts of a Web site.
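One way to read that two-level matching: a direct pass returns stories filed under the term itself, while a contextual pass follows the tags that co-occur with the term to surface secondary stories. The tag data below is invented for the example:

```python
# Illustrative direct vs contextual tag matching (invented data).
story_tags = {
    "Manning seals the win":        {"peyton manning", "colts"},
    "Colts defense steps up":       {"colts", "defense"},
    "Quarterback rankings updated": {"peyton manning", "nfl"},
}

def related(term):
    # Direct: stories tagged with the term itself.
    direct = [s for s, tags in story_tags.items() if term in tags]
    # Contextual: tags that appear alongside the term...
    associated = set().union(*(tags for tags in story_tags.values()
                               if term in tags)) - {term}
    # ...then stories carrying those tags but not the term itself.
    secondary = [s for s, tags in story_tags.items()
                 if tags & associated and term not in tags]
    return direct, secondary

print(related("peyton manning"))
```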

“It keeps people on the site for longer and allows people to navigate around a site. It’s a way of drilling down archival content,” says Bryan. “What’s really cool about it from the publisher’s perspective is that we have the ability to drive people from a low cpm area to a high cpm area.”

When I ask him how this is different from the “most popular” or “most commented” articles that most sites showcase, Bryan reminds me that it’s not a contest, “We don’t see Mashlogic as being a replacement to any of the other tools that you have on your site.”

Nevertheless, he is quick to point out that a lot of such lists are usually buried at the end of an article on conventional Web sites, or that they often take a reader through a maze of related stories, without the option of going back to the original article. The Mashlogic tool, on the other hand, opens up relevant stories in different tabs, aiding the horizontal reading experience, literally.

“What we offer the user is a way of quickly finding the associated article without leaving the page.” The tool is also intuitive in the sense that it recognizes terms that would be of interest to the user, and the longer time one spends on a site, the deeper it starts to reference buried content.

One of the places this technology works best, according to Bryan, is in the case of celebrity news. As if to reinforce this point he shows me how you can follow stories tagged with Indianapolis football star Peyton Manning on the citizen sports site, Bleacher Report. Merely moving the cursor over the quarterback’s name prompts a callout, which gleans Manning stories from all around the site – a list that includes everything from his team’s latest victory to his place on the NFL power rankings.

But could this excess of Peyton Manning news, so characteristic of niche information and fragmented audiences in the online world, carry with it the very real danger of obscuring the more important news items? Would this entice readers to spend too much time on Manning and too little on the health care bill, for instance?

“I’d like to think they’d use it for both,” says Bryan. In the age of democratization of the Web, the user should indeed be able to choose what he reads and where he reads it. And Mashlogic allows him to do this well. If, in fact, a user were interested in healthcare, the technology would allow him to access the leading magazines, sites, blogs, forums and even tweets on the topic, to create a 360-degree view. “Mashlogic does that better than anybody else because we would scour all the sources that you said you trusted or wanted to reference,” Bryan says.

To Go Tool

The company’s third product, “To Go” is for the ultimate brand fanatic. The brand can be anything from a preferred site to a favorite sports team or celebrity, or even a topic of interest. Readers would be required to download a button from their chosen sites, which would offer one-click access from anywhere on the Web.

Hence, To Go is for the reader what To Stay is for the publisher. “As a user, I have opted in to have the ability to jump back, to never be more than one click away from my favorite site,” explains Bryan.

Sure enough, as we traverse the ESPN site for news, a Bleacher Report-branded callout pops up, with related stories on B/R, ready to take the B/R fan back to his preferred source with one click. Mashlogic is currently in negotiations with about ten companies to install this tool, and according to Bryan, it’s being pretty well received.

Thus, what the three technologies being offered collectively do is adapt a reader’s experience to his preferences while allowing publishers to retain their most loyal users on their sites. “Mashlogic does not affect the way a site works, in any shape or form, the site works just the way it works,” Bryan says, as he closes an annoying popup ad.

The company has developed a pretty savvy e-commerce strategy for revenue generation. Any references to books or music in articles can directly take the user to the Amazon or iTunes site to purchase a specific item. The technology is also cleverly using third party sites to play sample music for the user, before he chooses to buy it. The feature can reference video, audio and text URLs. Hence, an NPR callout can jump the reader straight to a podcast from their broadcasts. Bryan also envisions having the callouts sponsored by advertisers. What would be more apt than having a Clorox callout advising a reader about environmentally-friendly Green Works products as he reads about the H1N1 virus, he reasons.

Too much distraction, perhaps? In an Internet age where readers are already in danger of encountering endlessly tantalizing hyperlinks, one too many sidebars, and interactive rich-media advertising, do they need more? But, on the other hand, don’t you want to be alerted to that contextual piece on Sarah Palin, as you glimpse through an article about her latest gaffe on a news show?

“A lot of the content, which is still very relevant tends to fall off the radar due to breaking news stories; it’s still pretty relevant, it’s just not current,” as Bryan points out. Mashlogic has the potential to combat the low attention span of the Internet age and bring that content to readers’ attention. In addition, it can provide them with the hundred and seventy-sixth article on Jon and Kate that they may have missed. What’s not to love about that?
