August 08 2012

12:29

Olympics Swimming Lap Charts from the New York Times

Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the graphical devices widely used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:

Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:

(Great examples of how to read a race history chart can be found on the IntelligentF1 blog. For the general case, see The IntelligentF1 model.)
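By way of illustration, here’s a minimal sketch of one common formulation of the race history calculation – the gap between a car’s cumulative race time and the winner’s average lap time multiplied by the lap count. It assumes lap times are already to hand; the driver codes and figures are invented:

```python
# Sketch: race history chart values from raw lap times.
# Positive values mean the car is ahead of the winner's average race pace.
# Driver codes and lap times below are invented for illustration.
from itertools import accumulate

lap_times = {
    "HAM": [92.1, 90.8, 90.5, 90.9],
    "VET": [92.4, 90.6, 90.7, 90.4],
}

winner = "VET"
winner_avg = sum(lap_times[winner]) / len(lap_times[winner])

race_history = {}
for driver, laps in lap_times.items():
    cumulative = list(accumulate(laps))
    race_history[driver] = [
        winner_avg * (lap + 1) - total for lap, total in enumerate(cumulative)
    ]

print(race_history)
```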

Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.

I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:

One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.

A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):

(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)

All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
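As a rough illustration of that point, here’s a sketch deriving end-of-lap positions – the raw material of a lap chart – and the gaps to the car ahead from nothing more than lap times. Again, the drivers and times are invented:

```python
# Sketch: end-of-lap running order (lap chart) and gap to the car ahead,
# derived purely from per-lap times. All data below is invented.
from itertools import accumulate

lap_times = {
    "ALO": [91.9, 90.7, 90.6],
    "BUT": [92.2, 90.9, 90.4],
    "WEB": [92.5, 91.1, 90.8],
}

cumulative = {d: list(accumulate(t)) for d, t in lap_times.items()}
n_laps = len(next(iter(cumulative.values())))

for lap in range(n_laps):
    order = sorted(cumulative, key=lambda d: cumulative[d][lap])
    for pos, driver in enumerate(order, start=1):
        gap_ahead = (cumulative[driver][lap] - cumulative[order[pos - 2]][lap]
                     if pos > 1 else 0.0)
        print(f"lap {lap + 1}: P{pos} {driver} (+{gap_ahead:.3f}s)")
```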

A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised, data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:

Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.

The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.

Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:

(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)

Of course, without the data, it can be difficult trying to generate these charts…

…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)


August 06 2012

21:07

London Olympics 2012 Medal Tables At A Glance?

Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.

Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)

Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.

Whilst there is still a lot more thinking I’d like to do around using treemaps to visualise Olympics medal data, here are a handful of quick sketches constructed using the Google visualisation treemap chart component, and data scraped from NBC.

The data I have scraped is represented using rows of the form:

Country, Event, Gold, Silver, Bronze

where Event is at the level of “Swimming”, “Cycling” and so on, rather than at finer levels of detail (it’s surprisingly hard to find data at even this level in an easily grabbable form).

I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:

  • MedalType, Country, Event
  • MedalType, Event, Country
  • Event, MedalType, Country
  • Event, Country, MedalType
  • Country, MedalType, Event
  • Country, Event, MedalType

Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.
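As a rough sketch of what that re-ordering involves – the sample rows and helper function here are purely illustrative, not the scraped NBC data – the rows can be rolled up into the parent/node/size records that most treemap components expect:

```python
# Sketch: rolling up rows of (Country, Event, Gold, Silver, Bronze) into
# parent/node/size records for a chosen three-level hierarchy ordering.
# The sample rows are invented for illustration.
from collections import defaultdict

rows = [
    ("USA", "Swimming", 16, 9, 6),
    ("USA", "Athletics", 9, 13, 7),
    ("CHN", "Diving", 6, 3, 1),
]

def rollup(rows, order=("MedalType", "Country", "Event")):
    """Aggregate medal counts along the given three-level hierarchy."""
    counts = defaultdict(int)
    for country, event, gold, silver, bronze in rows:
        for medal, n in (("Gold", gold), ("Silver", silver), ("Bronze", bronze)):
            values = {"Country": country, "Event": event, "MedalType": medal}
            path = tuple(values[level] for level in order)
            # accumulate at every depth so each treemap node has a size
            for depth in range(1, len(path) + 1):
                counts[path[:depth]] += n
    return [("/".join(p[:-1]) or None, "/".join(p), size)
            for p, size in sorted(counts.items())]

for parent, node, size in rollup(rows, ("MedalType", "Country", "Event")):
    print(parent, node, size)
```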

First up, ordered by Medal, Country, Event:

This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:

The colouring is a bit off – the Google component is not as directly scriptable as a d3.js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.

The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (ie we’re happy to take a view based on the award of any medal type):

The Country-Event-Medal view gives us a view of the relative strength in depth of each country in each event:

and the Country-Medal-Event view allows us to then tunnel in on the gold winning events:

I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments :-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3.js seems like it’d be a good bet, although I need to do a quick review of R-based treemap libraries too to see if there is anything there that may be appropriate.

It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as exploring, just as quick doodles with the Google chart component, simpler treemaps that don’t reveal lower-level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)

Data allowing, it would also be interesting to visualise more detailed data (for example, down to the level of actual events – 100m and Long Jump, say, rather than Track and Field – as well as the names of individual medallists).

PS For another Olympics-related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances.

PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship


April 21 2012

15:36

Insights into data journalism in Argentina

Angelica Peralta Ramos, multimedia development manager at La Nación in Argentina, gave an insight into the challenges of doing data journalism.

In her ISOJ talk, she explained how La Nacion started doing data visualisations with few resources and in a less than friendly government environment.

Peralta pointed out that Argentina ranks 100th out of 180 in the corruption index. The country does not have a freedom of information law and is not part of the open government initiative.

But there is hope, said Peralta. La Nacion wanted to do data journalism but didn’t have any programmers, so it adopted tools for non-programmers such as Tableau Public and Excel.

One of its initiatives involved gathering data on inflation to try to reveal more accurate inflation levels.

The newspaper has been taking public data and seeking to derive meaning from masses of figures.

For example, La Nacion took 400 PDFs with tables of 235,000 rows that recorded subsidies to bus companies to figure out who was getting what.

It is using software to keep track of updates to the PDFs to show how subsidies to the companies are on the rise.

Peralta’s short presentation showed how some media organisations are exploring data journalism in circumstances very different from those in the US or UK.

La Nacion have a data blog and will be posting links to the examples mentioned by Peralta.

15:09

Making data visualisation useful for audiences

At ISOJ, Alberto Cairo, lecturer in visual journalism, University of Miami, raised some critical questions about the visualisation of data in journalism.

Cairo explained that an information graphic is a tool for presenting information and for exploring information.

In the past, infographics were about editing data down and summarising it. But this worries him, he said, as it just presents information and does not allow readers to explore the data.

Today we have the opposite trend, which often ends up as data art that doesn’t help readers understand the data.

Cairo cited a New York Times project mapping neighbourhoods which he said forced readers to become their own reporters and editors to understand the data.

We have to create layers, he said. We have the presentation layer and we have the exploration layer, and these are complementary.

But readers need help to navigate the data, he said. Part of the task is giving clues to readers to understand the complexity of data.

Cairo quoted a visualisation mantra by Ben Shneiderman: “Overview first, zoom and filter, then details-on-demand.”

His approach echoed earlier comments by Brian Boyer, news applications editor at Chicago Tribune Media Group. Boyer said that we should make data beautiful and inspirational, but also useful to the audience.

 

July 14 2011

15:35

In Spanish: The inverted pyramid of data journalism part 2

Mauro Accurso has followed up his rapid translation of last week’s inverted pyramid of data journalism with a Spanish version of part 2: the 6 C’s of communicating data journalism. It’s copied in full below.

La semana pasada les traduje la primera parte de La Pirámide Invertida del Periodismo de Datos de Paul Bradshaw que prometió extender en el aspecto de comunicación del extenso proceso que significa el periodismo de datos.

comunicar periodismo de datos

En esta segunda parte Paul recorre 6 formas diferentes de comunicar en periodismo de datos que pueden ver en el cuadro de arriba y al final encontrarán un gráfico que resume toda la teoría (la cual está en desarrollo todavía y Bradshaw pide aportes, comentarios y sugerencias):

“El periodismo de datos moderno ha crecido junto con un gran aumento en visualización y esto puede llevarnos algunas veces a dejar de lado diferentes formas de contar historias que involucren grandes números. La intención de lo siguiente es funcionar como un manual para asegurar que todas las opciones sean consideradas:

1. VISUALIZACIÓN

La visualización es la forma más rápida de comunicar los resultados del periodismo de datos: herramientas gratuitas como Google Docs lo permiten con un sólo click y herramientas más poderosas como Many Eyes sólo requieren que el usuario pegue la data cruda y seleccione de un grupo de opciones de visualización.

Pero facilidad no es igual a efectividad. El surgimiento de cuadros-basura demuestra que la visualización no es inmune al churnalism o al espectáculo sin profundidad. Hay una rica historia de visualizaciones en gráfica que se mantiene relevante para la generación de las infografías online: enfocarse en no más de 4 puntos de datos, evitar el 3D y asegurarse que el gráfico es autosuficiente son sólo algunas.

No es un proceso simple pero, sin embargo, la visualización tiene una gran ventaja que hace que ese esfuerzo valga la pena: puede hacer que la comunicación sea increíblemente efectiva. Puede proveer de un método de distribución de tu contenido que no puede ser igualado por otros tipos de comunicación listado acá.

Pero su mayor fortaleza es también su mayor debilidad: la naturaleza instantánea de las infografías también significa que las personas a menudo no pasan demasiado tiempo mirándolas. Las hace muy efectivas para la distribución pero no para el engagement, así que es importante pensar estratégicamente acerca de 1) asegurarse que la imagen contenga un enlace a la fuente; y 2) asegurarse que haya algo más en la fuente cuando la gente llegue.

2. NARRACIÓN

Un artículo tradicional puede luchar para contener la clase de números que el periodismo de datos suele recorrer, pero aún así provee una forma accesible para que las personas entiendan la historia, si está hecho bien.

Como con la visualización, menos suele ser más. Pero también, como en la mayoría de la narrativa, necesitas pensar en el significado y tus objetivos en comunicar esos números.

Las cifras abstractas pueden ser impresionantes, pero sin sentido e inútiles. ¿Qué significa que 10 millones hayan sido gastados en algo? ¿Eso es más o menos que lo usual? ¿Más o menos que algo similar? Traten de bajar los montos a cantidades manejables: las sumas por persona o por día, por ejemplo. Finalmente, usen la edición para enfocarse en las cuestiones principales y asegúrense de enlazar al conjunto.

3. COMUNICACIÓN SOCIAL

La comunicación es un arte social y el éxito de infografías a través de medios sociales es un testamento de eso. Pero no son sólo las infografías que son sociales, la información también lo es. The Guardian ha demostrado eso de forma exitosa con la rica comunidad del Data Blog y alrededor de su API. Iniciativas de Crowdsourcing con el objetivo de recolectar data también pueden brindar una dimensión social a la información (ejemplos que remarca Paul: proyecto “Investigate your MP’s expenses” del Guardian que liberó más de 450 mil documentos para que los revisen los usuarios y cuando hicieron un crowdsource de las supuestas especificaciones del iPad). Hay otros ejemplos también, especialmente cuando no hay otra forma de conseguir la información.

La conectividad de la web ofrece nuevas oportunidades para presentar al periodismo de datos en una forma social. La aplicación de ProPublica que provee resultados basados en tu perfil de Facebook (escuelas a las que fueron; amigos que usaron la aplicación) es un ejemplo de como el periodismo de datos puede aprovechar la data social y, al mismo tiempo, como comunicar los resultados del periodismo de datos puede ser orientado alrededor de dinámicas sociales usando elementos como concursos, compartir, competiciones, campañas y colaboración. Estamos recién en el comienzo de este aspecto del periodismo online.

4. HUMANIZAR

Los programas de noticias a menudo utilizan casos de estudio para tratar el problema de presentar historias basadas en números en televisión o radio. Si los tiempos de espera en hospitales han aumentado, hablan con alguien que ha tenido que esperar un montón de tiempo por una operación. En otras palabras, humaniza los números.

Más recientemente, el crecimiento de gráficos en movimiento generados por computadora ha bajado la presión de cierta forma, ya que los presentadores pueden utilizar animaciones poderosas para ilustrar una historia.

Pero una vez más, surge el punto de hacer historias relevantes para las personas. Como escribí en “One ambassador’s embarrassment is a tragedy, 15,000 civilian deaths is a statistic” (resumen del post en español: periodismo de datos y filtraciones masivas – cuando la muerte es una estadística): cuando te mueves más allá de escalas que podamos manejar a un nivel humano, luchas para enganchar a la gente en el tema que estás cubriendo, no importa cuán impresionante sea el gráfico.

Así que después de estar enterrado en información abstracta necesitamos recordar que salir y grabar una entrevista con una persona cuya vida haya sido afectada por la data puede hacer una gran diferencia para propulsar nuestra historia.

5. PERSONALIZAR

Uno de los grandes cambios del periodismo online es que abre toda clase de posibilidades alrededor de la interactividad. En cuanto al periodismo de datos eso significa que los usuarios pueden, potencialmente, controlar qué información es presentada a ellos en varias entradas.

Hay algunas formas relativamente bien establecidas de esto. Por ejemplo, cuando un gobierno presenta su último presupuesto, los sitios web de noticias muchas veces invitan al usuario a ingresar sus propios detalles y averiguar cómo el presupuesto los afecta. Una variante reciente de esto fueron esos sitios interactivos donde invitaban al usuario a hacer sus propias decisiones de cómo recortarían el deficit (la versión del Financial Times llevó eso más allá agregando estrategias de partidos y políticas).

Otra forma común es personalización geográfica: el usuario es invitado a entrar su código postal y otra información geográfica para descubrir como un tema en particular está resultando en su lugar de residencia. Una tercera es simplemente “tus intereses”, como demostraron los acercamientos de Popvox a engagement político y el Newsmatch de LA Times.

Mientras más y más data personal está en manos de sitios de terceros, las posibilidades de personalización se expanden. El ejemplo de ProPublica de arriba demuestra como la información de perfil de Facebook puede ser usada para personalizar automáticamente la experiencia de una historia. Y existen varias aplicaciones que ofrecen presentar información basada en localización vía GPS.

Esto también indica que puede haber varias formas en las cuales la personalización y estrategias sociales pueden combinarse. Las noticias personalizadas pueden, de muchas maneras, ser usadas como una expresión de nuestra identidad: acá es donde vivo, de esta forma me afecta, en esto estoy interesado. El COO de Facebook predijo que todos los medios van a ser personalizados en 3-5 años; está claro que eso es algo donde las redes sociales nos van a llevar.

6. UTILIZAR

La forma más compleja de comunicar los resultados del periodismo de datos es crear algún tipo de herramienta basada en la información. Las calculadoras son opciones populares, así como herramientas con GPS, pero hay un montón de amplitud para aplicaciones más complejas mientras más información está disponible del publisher y el usuario.

Una vez más, hay un entrecruzamiento acá con personalización, pero es posible proveer utilidad sin personalización. Y muy a menudo, la complejidad y barrera consiguiente con respecto a los competidores presenta también oportunidades comerciales.

En Reed Business Information, por ejemplo, su modelo está orientado hacia este tipo de utilidad: atraer usuarios en varios puntos de la cadena de comunicación (actualizaciones online, revistas impresas, noticias móviles) y direccionarlos hacia el punto donde están más cercanos a una decisión de compra. La idea es que mientras más cerca tu información está de su acción, más valiosa es para el usuario.

Crear utilidad de la información es ahora relativamente costoso, pero esos costos están bajando como resultado de la competencia y la estandarización.

UN MEDIO PARA EXPLORAR

Lo que todo lo anterior hace evidente es que hay áreas enteras de periodismo online que todavía faltan ser adecuadamente exploradas, y de hecho en la mayoría todavía falta establecer convenciones claras o ideas de buenas prácticas. Esto trata de ser un resumen de aquellas convenciones que están surgiendo pero sería genial agregar más. Mientras tanto acá tienen ambas partes del modelo juntas”:


July 13 2011

14:00

6 ways of communicating data journalism (The inverted pyramid of data journalism part 2)

Last week I published an inverted pyramid of data journalism which attempted to map the process from the initial compilation of data through cleaning, contextualising and combining it. The final stage – communication – needed a post of its own, so here it is.

Below is a diagram illustrating 6 different types of communication in data journalism. (I may have overlooked others, so please let me know if that’s the case.)

Communicate: visualise, narrate, socialise, humanise, personalise, utilise

Modern data journalism has grown up alongside an enormous growth in visualisation, and this can sometimes lead us to overlook different ways of telling stories involving big numbers. The intention of the following is to act as a primer for ensuring all options are considered.

1. Visualisation

Visualisation is the quickest way to communicate the results of data journalism: free tools such as Google Docs allow it with a single click; more powerful tools like Many Eyes only require the user to paste their raw data and select from a range of visualisation options.

But ease does not equal effectiveness. The rise of chartjunk illustrates that visualisation is not immune to churnalism or spectacle without insight.

There is a rich history of print visualisation which remains relevant to the generation of online infographics: focusing on no more than 4 data points; avoiding 3D and ensuring the graphic is self-sufficient are just some.

Kaiser Fung’s trifecta is one useful reference point for ensuring a visualisation is effective, and this explanation of how a chart was transformed into something that could be used in a newspaper is also instructive (summarised by Kaiser Fung here).

In short: it’s not a simple process.

Visualisation has one major advantage which makes that effort worthwhile, however: it can make communication incredibly effective. And it can provide a method of distributing your content which cannot be matched by the other types of communication listed here.

But its major strength is also its main weakness: the instant nature of infographics also means that people often do not spend much time looking at it. It makes it very effective for distribution, but not for engagement, and so it is worth thinking strategically about 1) making sure the image contains a link back to its source; and 2) making sure that there is something more at the source when people arrive.

2. Narration

A traditional article can struggle to contain the sort of numbers that data journalism tends to turf up, but it still provides an accessible way for people to understand the story – if done well.

There are books providing useful guidance on how to write with numbers most clearly – and some guidance for web writing too (you should use numerals rather than words, as this helps people who are scanning the page).

As with visualisation, less is often more. But also, as in most narrative, you need to think about meaningfulness and your objectives in communicating these numbers.

Abstract amounts can be impressive, but meaningless and useless. What does it mean that £10m has been spent on something? Is that more or less than usual? More or less than something similar?

Try to bring down amounts to manageable quantities – the amount per person, or per day, for example.
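(A trivial worked example, with invented figures, of the sort of scaling that helps:)

```python
# Scaling an abstract total into per-person and per-day terms.
# All figures are invented for illustration.
total_spend = 10_000_000   # £10m
population = 62_000_000    # rough population figure
days = 365

print(f"per person: £{total_spend / population:.2f}")   # about 16p each
print(f"per day:    £{total_spend / days:,.0f}")        # about £27,000 a day
```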

Finally, use editing to focus in on the essentials: and make sure you link to the whole.

3. Social communication

Communication is a social act, and the success of infographics across social media is a testament to that. But it’s not just infographics that are social – data is too. The Guardian has demonstrated this particularly successfully with the cultivation of a healthy community around its Data Blog (which enjoys higher stickiness than the average Guardian article), and around its API.

Crowdsourcing initiatives aimed at gathering data can also provide a social dimension to the data. The Guardian are, again, pioneers here, with their MPs’ expenses project and Charles Arthur’s attempt to crowdsource predictions about the specifications of the iPad. But there are other examples, too – especially when it is difficult to obtain the data any other way.

The connectivity of the web presents new opportunities to present data journalism in a social way. ProPublica’s app that provides results based on your Facebook profile (schools attended; friends who have used the app) is one example of how data journalism can leverage social data, and, equally, how communicating the results of data journalism can be geared around social dynamics, using elements such as quizzes, sharing, competition, campaigning and collaboration. We are barely at the start of this aspect of online journalism.

4. Humanise

Broadcast news reports often use case studies to get around the problem of presenting numbers-based stories on television and radio. If waiting times have increased, speak to someone who had to wait a long time for an operation. In other words, humanise the numbers.

More recently the growth of computer-generated motion graphics has relaxed that pressure somewhat, as presenters can call on powerful animation to illustrate a story.

But once again, the point of making stories relevant to people comes through. As I wrote in One ambassador’s embarrassment is a tragedy, 15,000 civilian deaths is a statistic: when you move beyond scales we can deal with on a human level, you struggle to engage people in the issue you are covering – no matter how impressive the motion graphics (that post outlines some other considerations in humanising stories, such as ensuring that case studies are representative).

So after being buried in abstract data we need to remember that going out and recording an interview with a person whose life has been affected by that data can make a big difference to the power of our story.

5. Personalise

One of the biggest changes in journalism’s move online is that it opens up all sorts of possibilities around interactivity. When it comes to data journalism that means that the user can, potentially, control what information is presented to them based on various inputs.

There are some relatively well-established forms of this. For example, when a government presents its latest budget, news websites often invite the user to input their own details (for example, their earnings, or their family make up) to find out how the budget affects them. Recent variants of this are those interactives which invite the user to make their own decisions on how they might cut the deficit (the FT’s version took this further, adding in party strategies and policies).

Another common form is geographical personalisation: the user is invited to enter their postcode, zip code or other geographical information to find out how a particular issue is playing out in their home town.

A third is simply ‘your interests’, as demonstrated by Popvox’s approach to political engagement and the LA Times’ Newsmatch.

As more and more personal data is held by third party sites, the possibilities for personalisation expand. The ProPublica example given above, for example, demonstrates how Facebook profile information can be used to automatically personalise the experience of a story. And there are various apps that offer to present information based on location data provided via GPS.

This also indicates that there may be various ways in which personalisation and social strategies might be combined. Personalised stories can, in many ways, be used as an expression of our identity: this is where I live; this is how I am affected; this is what I’m interested in.

And when the COO of Facebook is predicting that all media will be personalised in 3-5 years, it’s clear that this is something the social networks are going to drive towards too.

6. Utilise

The most complex way of communicating the results of data journalism is to create some sort of tool based on the data. Calculators are popular choices, as are GPS-driven tools, but there is a lot of scope for more complex applications as more data becomes available both from the publisher and the user.

Again, there is overlap here with personalisation – but it is possible to provide utility without personalisation. And quite often, the complexity and consequent barrier to competitors presents commercial opportunities too.

At Reed Business Information, for example, their model is geared towards this sort of utility: attracting users at various points of the communication chain – online updates, printed magazines, mobile news – and steering them towards the point where they are closest to a purchasing decision. The idea is that the closer your information is to their action, the more valuable it is to the user.

Creating utility from data is currently relatively costly – but those costs are going down as a result of competition and standardisation. For example, as increasing numbers of news organisations adopt standard ways of storing story data (e.g. XML files), it is easier to create apps that pull data from datasets. Meanwhile, app creation becomes increasingly templated (in many ways you can see the process following a similar path to that of web design) and platform independent.

A medium up for grabs

What all of the above makes apparent – and I may have missed other methods of communicating data journalism (please let me know if you can think of any) – is that there are whole areas of online journalism that have yet to be properly explored, and certainly most have yet to establish clear conventions or ideas of best practice.

I’ve tried to scope out an overview of those conventions that are emerging, and the best practice that’s currently available, but it would be great if you could add more. What makes for good humanisation? Utility? What are great examples of personalisation or data journalism that involves a social dimension? Comments below please.

Meanwhile, here are both parts of the model shown together (click to magnify):

The inverted pyramid of data journalism and data journalism communication pyramid


April 10 2011

11:16

UK Journalists on Twitter

A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about (Journalists on Twitter: how do Britain’s news organisations tweet?).

Tweetminster maintains separate lists of tweeting journalists for several different media groups, so it was easy to grab the names on each list, use the Twitter API to pull down the names of people followed by each person on the list, and then graph the friend connections between folk on the lists. The result shows that the hacks follow each other quite closely:

UK Media Twitter echochamber (via tweetminster lists)

Nodes are coloured by media group/Tweetminster list, and sized by PageRank, as calculated over the network using the Gephi PageRank statistic.

The force directed layout shows how folk within individual media groups tend to follow each other more intensely than they do people from other groups, but that said, inter-group following is still high. The major players across the media tweeps as a whole seem to be @arusbridger, @r4today, @skynews, @paulwaugh and @BBCLauraK.
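For anyone who wants to replicate the graph-building step outside Gephi, here’s a minimal sketch using networkx; it assumes the follow relationships and list memberships have already been pulled down via the Twitter API, and the handful of edges shown are invented stand-ins:

```python
# Sketch: build the journalist follow graph, tag nodes with their media group
# (for colouring) and rank them by PageRank (for sizing).
# The edges and list memberships below are invented placeholders; in practice
# they would come from the Tweetminster lists and the Twitter API.
import networkx as nx

follows = [
    ("paulwaugh", "arusbridger"),
    ("BBCLauraK", "r4today"),
    ("arusbridger", "BBCLauraK"),
]
list_of = {"arusbridger": "Guardian", "paulwaugh": "PoliticsHome",
           "BBCLauraK": "BBC", "r4today": "BBC"}

g = nx.DiGraph()
g.add_edges_from(follows)
for node, group in list_of.items():
    if node in g:
        g.nodes[node]["media_group"] = group   # used for colouring

# The post uses Gephi's own PageRank statistic, so figures will differ slightly.
for node, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{node:15s} {score:.3f}")
```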

I can generate an SVG version of the chart, and post a copy of the raw Gephi GDF data file, if anyone’s interested…


December 01 2010

21:06

Visualising data with the Datapress WordPress plugin

Note: This post contains an interactive data presentation that may not show up in your feed reader. For the full experience, visit this article in your web browser.

Here’s a useful plugin for bloggers working with data: Datapress allows you to quickly visualise a dataset as a table, timeline, scatter plot, bar chart, ‘intelligent list’ (allowing you to sort by more than one value at once – see this example) or map.

Once installed, the plugin adds a new button to the ‘Upload/Insert’ row in the post edit view which you can click to link to a dataset in the same way as you would embed an image or video.

The plugin is in beta at the moment and takes a bit of getting used to. There’s a convention you have to follow in naming Google spreadsheet columns, for example – this Glasgow Vegan Guide spreadsheet has quite a few of them – but this could add some new visualisation possibilities. It seems particularly nice for lists and maps (if you have lat-long values), although Google spreadsheet’s built-in charts options will obviously be quicker for simple graphs and charts.

The plugin has a demo site with some impressive examples and the developers are happy to help with any problems. It’s also up for the Knight News Challenge if you want to support it.

07:31

Data journalism training – some reflections

OpenHeatMap - Percentage increase in fraud crimes in London since 2006_7

I recently spent 2 days teaching the basics of data journalism to trainee journalists on a broadsheet newspaper. It’s a pretty intensive course that follows a path I’ve explored here previously – from finding data and interrogating it to visualizing it and mashing it up – and I wanted to record the results.

My approach was both practical and conceptual. Conceptually, the trainees need to be able to understand and communicate with people from other disciplines, such as designers putting together an infographic, or programmers, statisticians and researchers.

They need to know what semantic data is, what APIs are, the difference between a database and open data, and what is possible with all of the above.

They need to know what design techniques make a visualisation clear, and the statistical quirks that need to be considered – or looked for.

But they also need to be able to do it.

The importance of editorial drive

The first thing I ask them to do (after a broad introduction) is come up with a journalistic hypothesis they want to test (a process taken from Mark E Hunter’s excellent ebook Story Based Inquiry). My experience is that you learn more about data journalism by tackling a specific problem or question – not just the trainees but, in trying to tackle other people’s problems, me as well.

So one trainee wants to look at the differences between supporters of David and Ed Miliband in that week’s Labour leadership contest. Another wants to look at authorization of armed operations by a police force (the result of an FOI request following up on the Raoul Moat story). A third wants to look at whether ethnic minorities are being laid off more quickly, while others investigate identity fraud, ASBOs and suicides.

Taking those as a starting point, then, I introduce them to some basic computer assisted reporting skills and sources of data. They quickly assemble some relevant datasets – and the context they need to make sense of them.

For the first time I have to use Open Office’s spreadsheet software, which turns out to be not too bad. The data pilot tool is a worthy free alternative to Excel’s pivot tables, allowing journalists to quickly aggregate & interrogate a large dataset.

Formulae like concatenate and ISNA turn out to be particularly useful in cleaning up data or making it compatible with similar datasets.

The ‘Text to columns’ function comes in handy in breaking up full names into title, forename and surname (or addresses into constituent parts), while find and replace helped in removing redundant information.
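(For anyone working outside a spreadsheet, here’s a rough pandas equivalent of those cleanup steps – the column names and sample rows are invented for illustration:)

```python
# Rough pandas equivalents of the spreadsheet cleanup described above:
# CONCATENATE, ISNA, 'Text to columns' and find-and-replace.
# Column names and sample rows are invented.
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Mr John Smith", "Ms Jane Doe"],
    "area": ["Ward of Addiscombe", "Bridgend"],
    "count_2009": [12, None],
})

# CONCATENATE: build a composite key for matching against another dataset
df["key"] = df["full_name"].str.lower() + "|" + df["area"].str.strip().str.lower()

# ISNA: flag rows with missing values before aggregating
df["missing_2009"] = df["count_2009"].isna()

# 'Text to columns': split full names into title / forename / surname
df[["title", "forename", "surname"]] = df["full_name"].str.split(" ", n=2, expand=True)

# Find and replace: strip redundant text
df["area"] = df["area"].str.replace("Ward of ", "", regex=False)

print(df)
```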

It’s not long before the journalists raise statistical issues – which is reassuring. The trainee looking into ethnic minority unemployment, for example, finds some large increases – but the numbers in those ethnicities are so small as to undermine the significance.

Scraping the surface of statistics

Still, I put them through an afternoon of statistical training. Notably, not one of them has studied for a maths or science-related degree. History, English and Law dominate – and their educational history is pretty uniform. At a time when newsrooms need diversity to adapt to change, this is a little worrying.

But they can tell a mean from a mode, and deal well with percentages, which means we can move on quickly to standard deviations, distribution, statistical significance and regression analysis.

Even so, I feel like we’ve barely scraped the surface – and that there should be ways to make this more relevant in actively finding stories. (Indeed, a fortnight later I come across a great example of using Benford’s law to highlight problems with police reporting of drug-related murder.)

One thing I do is ask one trainee to toss a coin 30 times and the others to place bets on the largest number of heads to fall in a row. Most plump for around 4 – but the longest run is 8 heads in a row.

The point I’m making is about small sample sizes and clusters. (By eerie coincidence, one of them has a map of Bridgend on her screen – a town which made the news after a cluster of suicides.)
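(The exercise is easy to simulate if you want a feel for how often long runs of heads actually turn up – a quick sketch:)

```python
# Simulating the coin-toss exercise: the length of the longest run of heads
# in 30 tosses, repeated 10,000 times.
import random
from collections import Counter

def longest_run(n_tosses=30):
    best = current = 0
    for _ in range(n_tosses):
        current = current + 1 if random.random() < 0.5 else 0
        best = max(best, current)
    return best

results = Counter(longest_run() for _ in range(10_000))
for run_length in sorted(results):
    print(f"longest run of {run_length:2d}: {results[run_length] / 10_000:.1%} of trials")
```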

That’s about as engaging as this section got – so if you’ve any ideas for bringing statistical subjects to life and making them relevant to journalists, particularly as a practical tool for spotting stories, I’m all ears.

Visualisation – bringing data to life, quickly

Day 2 is rather more satisfying, as – after an overview of various chart types and their strengths and limitations – the trainees turn their hands to visualization tools – Many Eyes, Wordle, Tableau Public, Open Heat Map, and Mapalist.

Suddenly the data from the previous day comes to life. Fraud crime in London boroughs is shown on a handy heat map. A pie chart, and then bar chart, shows the breakdown of Labour leadership voters; and line graphs bring out new possible leads in suicide data (female suicide rates barely change in 5 years, while male rates fluctuate more).

It turns out that Mapalist – normally used for plotting points on Google Maps from a Google spreadsheet – now also does heat maps based on the density of occurrences. ManyEyes has also added mapping visualizations to its toolkit.

Looking through my Delicious bookmarks I rediscover a postcodes API with a hackable URL to generate CSV or XML files with the lat/long, ward and other data from any postcode (also useful on this front is Matthew Somerville’s project MaPit).

Still a print culture

Notably, the trainees bring up the dominance of print culture. “I can see how this works well online,” says one, “but our newsroom will want to see a print story.”

One of the effects of convergence on news production is that a tool traditionally left to designers after the journalist has finished their role in the production line is now used by the journalist as part of their newsgathering role – visualizing data to see the story within it, and possibly publishing that online to involve users in that process too.

A print news story – in this instance – may result from the visualization process, rather than the other way around.

More broadly, it’s another symptom of how news production is moving from a linear process involving division of labour to a flatter, more overlapping organization of processes and roles – which involves people outside of the organization as well as those within.

Mashups

The final session covers mashups. This is an opportunity to explore the broader possibilities of the technology, how APIs and semantic data fit in, and some basic tools and tutorials.

Clearly, a well-produced mashup requires more than half a day and a broader skillset than exists in journalists alone. But by using tools like Mapalist the trainees have actually already created a mashup. Again, like visualization, there is a sliding scale between quick and rough approaches to find stories and communicate them – and larger efforts that require a bigger investment of time and skill.

As the trainees are already engrossed in their own projects, I don’t distract them too much from that course.

You can see what some of the trainees produced at the links below:

  • Matt Holehouse: Rate of deaths in industrial accidents in the EU (per 100k) (Many Eyes)
  • Raf Sanchez
  • Rosie Ensor: Places with the highest rates for ASBOs
  • Sarah Rainey

October 21 2010

12:11

A template for ’100 percent reporting’

progress bar for 100 percent reporting

Last night Jay Rosen blogged about a wonderful framework for networked journalism – what he calls the ‘100 percent solution’:

“First, you set a goal to cover 100 percent of… well, of something. In trying to reach the goal you immediately run into problems. To solve those problems you often have to improvise or innovate. And that’s the payoff, even if you don’t meet your goal”

In the first example, he mentions a spreadsheet. So I thought I’d create a template for that spreadsheet that tells you just how far you are in achieving your 100% goal, makes it easier to organise newsgathering across a network of actors, and introduces game mechanics to make the process more pleasurable.

The spreadsheet contains fields for

  • the ‘objects’ that you want to cover (these might be events, areas, people or others)
  • the author who is either assigned to it or has already written it
  • the outlet, or publisher
  • a link if it has been published
  • geolocation information (this can be used to create a heatmap visualising gaps in coverage)
  • an importance rating (the scale is up to the creator – again, this can be used to order results or colour a heatmap)

There may be other fields that you can think of that could be added – let me know what you think.

In a separate column are some calculations to work out how close you are to achieving ‘The 100 percent solution’:

  • How many objects need covering (this uses the =counta formula to see how many cells contain text)
  • How many have been covered (this uses the =countif formula to count how many cells say ‘yes’)
  • The percentage of objects covered (based on the above figures)
  • Percentage not covered

These calculations form the basis for a ‘progress bar’ chart which gives an instant visualisation of the job in hand (shown above), and incentivises participants to get involved. The chart can be embedded on any webpage, and updated dynamically.
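For what it’s worth, here are the same calculations expressed as a few lines of Python – the sample ‘objects’ are invented, and in the spreadsheet itself the =counta/=countif formulas above do the work:

```python
# The coverage calculations behind the progress bar, outside the spreadsheet.
# The objects and their 'covered' flags are invented sample data.
objects = {
    "Ward A hustings": "yes",
    "Ward B hustings": "no",
    "Candidate survey": "yes",
    "Count night": "no",
}

total = len(objects)                                        # =counta equivalent
covered = sum(1 for v in objects.values() if v == "yes")    # =countif equivalent
pct_covered = 100 * covered / total

print(f"{covered}/{total} covered ({pct_covered:.0f}%), {100 - pct_covered:.0f}% still to do")
```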

The idea is that this progress bar forms a starting point for people to get involved in your coverage – helping ‘complete’ the job. This might be, for example, via a Google Form also generated from this spreadsheet to allow potential contributors to add ‘objects’ – or mark existing objects as complete.

Other ideas welcome.

October 19 2010

12:16

Practical steps for improving visualisation

Here’s a useful resource for anyone involved in data journalism and visualising the results. ‘Dataviz’ – a site for “improving data visualisation in the public sector” – features a step-by-step guide to good visualisation, as well as case studies and articles.

Although it’s aimed at public sector workers, the themes in the step-by-step guide provide a good starting point for journalists: “What do we need to do?”, “How do we do it?” and “How did we do?” Each provides a potential story angle. Clicking through those themes takes you through some of the questions to ask of the data, taking you to a gallery of visualisation possibilities. Even if you never get that far, it’s a good way to narrow the question you’re asking – or find other questions that might result in interesting stories and insights.

October 14 2010

14:43

Manchester police tweets – live data visualisation by the MEN

Manchester police tweets - live data visualisation

Greater Manchester Police (GMP) have been experimenting today with tweeting every incident they deal with. The novelty value of the initiative has been widely reported – but local newspaper the Manchester Evening News has taken the opportunity to ask some deeper questions of the data generated by experimenting with data visualisation.

A series of bar charts – generated from Google spreadsheets and updated throughout the day – provides a valuable and instant insight into the sort of work that police are having to deal with.

In particular, the newspaper is testing the police’s claim that they spend a great deal of time dealing with “social work” as well as crime. At the time of writing, it certainly does take up a significant proportion – although not the “two-thirds” mentioned by GMP chief Peter Fahy. (Statistical disclaimer: the data does not yet even represent 24 hours, so is not yet going to be a useful guide. Fahy’s statistics may be more reliable).

Also visualised are the areas responsible for the most calls, the social-crime breakdown of incidents by area, and breakdowns of social incidents and serious crime incidents by type.

I’m not sure how much time they had to prepare for this, but it’s a good quick hack.

That said, I’m going to offer some advice on how the visualisation could be improved: 3D bars are never a good idea, for instance, and the divisional breakdown showing serious crime versus “social work” is difficult to interpret visually (percentages of the whole would be easier to compare directly). The breakdowns of serious crimes and “social work”, meanwhile, should be ranked from most frequent downwards, with labelling used rather than colour.

Head of Online Content Paul Gallagher says that it’s currently a manual exercise that requires a page refresh to see updated visuals. But he thinks “the real benefit of this will come afterwards when we can also plot the data over time”. Impressively, the newspaper plans to publish the raw data and will be bringing it to tomorrow’s Hacks and Hackers Hackday in Manchester.

More broadly, the MEN is to be commended for spotting this more substantial angle to what could easily be dismissed as a gimmick by the GMP. Although that doesn’t stop me enjoying the headlines in coverage elsewhere (shown below).

Manchester police twitter headlines

October 13 2010

07:47

Stories hidden in the data, stories in the comments

the tax gap

My attention was drawn this week by David Hayward to a visualisation by David McCandless of the tax gap (click on image for larger version). McCandless does some beautiful stuff, but what was particularly interesting in this graphic was how it highlighted areas that are rarely covered by the news agenda.

Tax avoidance and evasion, for example, account for £7.4bn each, while benefit fraud and benefit system error account for £1.5bn and £1.6bn respectively.

Yet while the latter dominate the news agenda, and benefit cheats are subject to regular exposure, tax avoidance and evasion are rare guests on the pages of newspapers.

In other words, the data is identifying a news hole of sorts. There are many reasons for this – Galtung & Ruge would have plenty of ideas, for example – but still: there it is.

The comments

But that’s only part of what makes this so interesting. By publishing the data and having built the healthy community that exists around the data blog, McCandless and The Guardian benefit from some very useful comments (aside from the odd political one) on how to improve both the data and the visualisation.

This is a great example of how the newspaper is stealing an enormous march on its rivals in working beyond its newsroom in collaboration with users – benefiting from what Clay Shirky would call cognitive surplus. Data is not just an informational object, but a social one too.

October 04 2010

07:41

Where should an aspiring data journalist start?

In writing last week’s Guardian Data Blog piece on How to be a data journalist I asked various people involved in data journalism where they would recommend starting. The answers are so useful that I thought I’d publish them in full here.

The Telegraph’s Conrad Quilty-Harper:

Start reading:

http://www.google.com/reader/bundle/user%2F06076274130681848419%2Fbundle%2Fdatavizfeeds

Keep adding to your knowledge and follow other data journalists/people who work with data on Twitter.

Look for sources of data:

The ONS stats release calendar is a good start (http://www.statistics.gov.uk/hub/release-calendar/index.html), and look at the Government data stores (Data.gov, Data.gov.uk, Data.london.gov.uk etc).

Check out What do they know, Freebase, Wikileaks, Manyeyes, Google Fusion charts.

Find out where hidden data is and try and get hold of it: private companies looking for publicity, under-appreciated research departments, public bodies that release data but not in a granular form (e.g. Met Office).

Test out cleaning/visualisation tools:

You want to be able to collect data, clean it, visualise it and map it.

Obviously you need to know basic Excel skills (pivot tables are how journalists efficiently get headline numbers from big spreadsheets).

For publishing just use Google Spreadsheets graphs, or ManyEyes or Timetric. Google MyMaps coupled with http://batchgeo.com is a great beginner mapping combo.

Further on from that you want to try out Google Spreadsheets importURL service, Yahoo Pipes for cleaning data, Freebase Gridworks and Dabble DB.

For more advanced stuff you want to figure out query languages and be able to work with relational databases, Google BigQuery, the Google Visualisation API (http://code.google.com/apis/charttools/), Google code playgrounds (http://code.google.com/apis/ajax/playground/?type=visualization#org_chart) and other JavaScript tools. The advanced mapping equivalents are ArcGIS or GeoConcept, allowing you to query geographical data and find stories.

You could also learn some Ruby for building your own scrapers, or Python for ScraperWiki.

Get inspired:

Get the data behind some big data stories you admire, try and find a story, visualise it and blog about it. You’ll find that the whole process starts with the data, and your interpretation of it. That needs to be newsworthy/valuable.

Look to the past!

Edward Tufte’s work is very inspiring: http://www.edwardtufte.com/tufte/ His favourite data visualisation is from 1869! Or what about John Snow’s Cholera map? http://www.york.ac.uk/depts/maths/histstat/snow_map.htm

And for good luck here’s an assorted list of visualisation tutorials.

The Times’ Jonathan Richards:

I’d say a couple of blogs.

Others that spring to mind are:

If people want more specific advice, tell them to come to the next London Hack/Hackers and track me down!

The Guardian’s Charles Arthur:

Obvious thing: find a story that will be best told through numbers. (I’m thinking about quizzing my local council about the effects of stopping free swimming for children. Obvious way forward: get numbers for number of children swimming before, during and after free swimming offer.)

If someone already has the skills for data journalism (which I’d put at (1) understanding statistics and relevance (2) understanding how to manipulate data (3) understanding how to make the data visual) the key, I’d say, is always being able to spot a story that can be told through data – and only makes sense that way, and where being able to manipulate the data is key to extracting the story. It’s like interviewing the data. Good interviewers know how to get what they want out from the conversation. Ditto good data journalists and their data.

The New York Times’ Aron Pilhofer:

I would start small, and start with something you already know and already do. And always, always, always remember that the goal here is journalism. There is a tendency to focus too much on the skills for the sake of skills, and not enough on how those skills help enable you to do better journalism. Be pragmatic about it, and resist the tendency to think you need to know everything about the techy stuff before you do anything — nothing could be further from the truth.

Less abstractly, I would start out learning some basic computer-assisted reporting skills and then moving from there as your interests/needs dictate. A lot of people see the programmer/journalism thing as distinct from computer-assisted reporting, but I don’t. I see it as a continuum. I see CAR as a “gateway drug” of sorts: Once you start working with small data sets using tools like Excel, Access, MySQL, etc., you’ll eventually hit limits of what you can do with macros and SQL.

Soon enough, you’ll want to be able to script certain things. You’ll want to get data from the web. You’ll want to do things you can only do using some kind of scripting language, and so it begins.

But again, the place to start isn’t thinking about all these technologies. The place to start is thinking about how these technologies can enable you to tell stories you would never be able to tell otherwise. And you should start small. Look for little things to start, and go from there.

September 20 2010

10:50

The BBC and missed data journalism opportunities

Bar chart: UN progress on eradication of world hunger

I’ve tweeted a couple of times recently about frustrations with BBC stories that are based on data but treat it poorly. As any journalist knows, two occasions of anything in close proximity warrant an overreaction about a “worrying trend”. So here it is.

“One in four council homes fails ‘Decent Homes Standard’”

This is a great piece of newsgathering, but a frustrating piece of online journalism. “Almost 100,000 local authority dwellings have not reached the government’s Decent Homes Standard,” it explained. But according to what? Who? “Government figures seen by BBC London”. Ah, right. Any more detail on that? No.

The article is scattered with random statistics from these figures: “In Havering, east London, 56% of properties do not reach Decent Homes Standard – the highest figure for any local authority in the UK … In Tower Hamlets the figure is 55%.”

It’s a great story – if you live in those two local authorities. But it’s a classic example of narrowing a story to fit the space available. This story-centric approach serves readers in those locations, and readers who may be titillated by the fact that someone must always finish bottom in a chart – but the majority of readers will not live in those areas, and will want to know what the figures are for their own area. The article does nothing to help them do this. There are only 3 links, and none of them are deep links: they go to the homepages for Havering Council, Tower Hamlets Council, and the Department of Communities and Local Government.

In the world of print and broadcast, narrowing a story to fit space was a regrettable limitation of the medium; in the online world, linking to your sources is a fundamental quality of the medium. Not doing so looks either ignorant or arrogant.

“Uneven progress of UN Millennium Development Goals”

An impressive piece of data journalism that deserves credit, this looks at the UN’s goals and how close they are to being achieved, based on a raft of stats, which are presented in bar chart after bar chart (see image above). Each chart gives the source of the data, which is good to see. However, that source is simply given as “UN”: there is no link either on the charts or in the article (there are 2 links at the end of the piece – one to the UN Development Programme and the other to the official UN Millennium Development Goals website).

This lack of a link to the specific source of the data raises a number of questions: did the journalist or journalists (in both of these stories there is no byline) find the data themselves, or was it simply presented to them? What is it based on? What was the methodology?

The real missed opportunity here, however, is around visualisation. The relentless onslaught of bar charts makes this feel like a UN report itself, and leaves a dry subject still looking dry. This needed more thought.

Off the top of my head, one option might have been an overarching visualisation of how funding shortfalls overall differ between different parts of the world (allowing you to see that, for example, South America is coming off worst). This ‘big picture’ would then draw in people to look at the detail behind it (with an opportunity for interactivity).

Had they published a link to the data someone else might have done this – and other visualisations – for them. I would have liked to try it myself, in fact.

Compare this article, for example, with the Guardian Datablog’s treatment of the coalition agreement: a harder set of goals to measure, and they’ve had to compile the data themselves. But they’re transparent about the methodology (it’s subjective) and the data is there in full for others to play with.

It’s another dry subject, but The Guardian have made it a social object.

No excuses

The BBC is not a print outlet, so it does not have the excuse of these stories being written for print (although I will assume they were researched with broadcast as the primary outlet in mind).

It should also, in theory, be well resourced for data journalism. Martin Rosenbaum, for example, is a pioneer in the field, and the team behind the BBC website’s Special Reports section does some world class work. The corporation was one of the first in the world to experiment with open innovation with Backstage, and runs a DataArt blog too. But the core newsgathering operation is missing some basic opportunities for good data journalism practice.

In fact, it’s missing just one basic opportunity: link to your data. It’s as simple as that.

On a related note, the BBC Trust wants your opinions on science reporting. On this subject, David Colquhoun raises many of the same issues: absence of links to sources, and anonymity of reporters. This is clearly more a cultural issue than a technical one.

Of all the UK’s news organisations, the BBC should be at the forefront of transparency and openness in journalism online. Thinking politically, allowing users to access the data they have spent public money to acquire also strengthens their ideological hand in the Big Society bunfight.

September 17 2010

16:18

A First – Not Very Successful – Look at Using Ordnance Survey OpenLayers…

What’s the easiest way of creating a thematic map, that shows regions coloured according to some sort of measure?

Yesterday, I saw a tweet go by from @datastore about Carbon emissions in every local authority in the UK, detailing those emissions for a list of local authorities (whatever they are… I’ll come on to that in a moment…)

Carbon emissions data table

The dataset seemed like a good opportunity to try out the Ordnance Survey’s OpenLayers API, which I’d noticed allows you to make use of OS boundary data and maps in order to create thematic maps for UK data:

OS thematic map demo

So – what’s involved? The first thing was to try and get codes for the authority areas. The ONS make various codes available (download here) and the OpenSpace website also makes available a list of boundary codes that it can render (download here), so I had a poke through the various code files and realised that the Guardian emissions data seemed to identify regions that were coded in different ways. So I stalled there and looked at another part of the jigsaw…

…specifically, OpenLayers. I tried the demo – Creating thematic boundaries – got it to work for the sample data, then tried to put in some other administrative codes to see if I could display boundaries for other area types… hmmm…. No joy:-) A bit of digging identified this bit of code:

boundaryLayer = new OpenSpace.Layer.Boundary("Boundaries", {
    // fetch boundary features for the area of the map currently in view
    strategies: [new OpenSpace.Strategy.BBOX()],
    // the type of administrative area to draw - "EUR" is European Region
    area_code: ["EUR"],
    // the styling to apply to the boundary shapes
    styleMap: styleMap });

which appears to identify the type of area codes/boundary layer required, in this case “EUR”. So two questions came to mind:

1) does this mean we can’t plot layers that have mixed region types? For example, the emissions data seemed to list names from different authority/administrative area types?
2) what layer types are available?

A bit of digging on the OpenLayers site turned up something relevant on the Technical FAQ page:

OS OpenSpace boundary DESCRIPTION, (AREA_CODE) and feature count (number of boundary areas of this type)

County, (CTY) 27
County Electoral Division, (CED) 1739
District, (DIS) 201
District Ward, (DIW) 4585
European Region, (EUR) 11
Greater London Authority, (GLA) 1
Greater London Authority Assembly Constituency, (LAC) 14
London Borough, (LBO) 33
London Borough Ward, (LBW) 649
Metropolitan District, (MTD) 36
Metropolitan District Ward, (MTW) 815
Scottish Parliament Electoral Region, (SPE) 8
Scottish Parliament Constituency, (SPC) 73
Unitary Authority, (UTA) 110
Unitary Authority Electoral Division, (UTE) 1334
Unitary Authority Ward, (UTW) 1464
Welsh Assembly Electoral Region, (WAE) 5
Welsh Assembly Constituency, (WAC) 40
Westminster Constituency, (WMC) 632

so presumably all those code types can be used as area_code arguments in place of “EUR”?
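
If so, swapping in a different code from that list ought to be enough to pull back a different boundary set. Here's an untested tweak of the earlier snippet, asking for County (CTY) boundaries instead of European Regions – whether several codes can be mixed in the one array is exactly the open question in (1) above:

// Untested sketch: the boundary layer from above, re-parameterised
// with a different AREA_CODE from the FAQ list.
boundaryLayer = new OpenSpace.Layer.Boundary("Boundaries", {
    strategies: [new OpenSpace.Strategy.BBOX()],
    area_code: ["CTY"],
    styleMap: styleMap });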

Back to one of the other pieces of the jigsaw: the OpenLayers API is called using official area codes, but the emissions data just provides the names of areas. So somehow I need to map from the area names to an area code. This requires: a) some sort of lookup table to map from name to code; b) a way of doing that.

Normally, I’d be tempted to use a Google Fusion table to try to join the emissions table with the list of boundary area names/codes supported by OpenSpace, but then I recalled a post by Paul Bradshaw on using the Google spreadsheets VLOOKUP formula (to create a thematic map, as it happens: Playing with heat-mapping UK data on OpenHeatMap), so thought I’d give that a go… no joy:-( For some reason, the VLOOKUP just kept giving rubbish. Maybe it was happy with really crappy best matches, even when I tried to force exact matches. It almost felt like the formula was working on a differently ordered column to the one it should have been; I have no idea. So I gave up trying to make sense of it (something to return to another day maybe; I was in the wrong mood for it, and now I am just downright suspicious of the VLOOKUP function!)…

…and instead thought I’d give the openheatmap application Paul had mentioned a go… After a few false starts (I thought I’d be able to just throw a spreadsheet at it and then specify the data columns I wanted to bind to the visualisation, cf. Semantic reports, but it turns out you have to use particular column names: value for the data value, and one of the specified locator labels), I managed to upload some of the data as uk_council data (quite a lot of it was thrown away) and get some sort of map out:

openheatmap demo

You’ll notice there are a few blank areas where council names couldn’t be identified.

So what do we learn? Firstly, the first time you try out a new recipe, it rarely, if ever, “just works”. When you know what you’re doing, and “all you have to do is…”, all is a little word. When you don’t know what you’re doing, all is a realm of infinite possibilities of things to try that may or may not work…

We also learn that I’m not really that much closer to getting my thematic map out… but I do have a clearer list of things I need to learn more about. Firstly, a few hello world examples using the various different OpenLayers boundary layers. Secondly, a better understanding of the differences between the various authority types, and what sorts of mapping there might be between them. Thirdly, I need to find a more reliable way of reconciling data from two tables, and in particular looking up area codes from area names (in two ways: code and area type from area name; code from area name and area type). VLOOKUP didn’t work for me this time, so I need to find out if that was my problem, or an “issue”.
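
On the VLOOKUP point, one gotcha worth noting is that in both Excel and Google spreadsheets VLOOKUP defaults to approximate matching unless its final argument is set to FALSE, which on unsorted data can return exactly the sort of junk described above – though I can't say whether that was the problem here. The reconciliation itself is simple enough to sketch outside the spreadsheet; here's a rough illustration in JavaScript, with invented names, codes and field labels standing in for the real OpenSpace and Guardian columns:

// Rough sketch of a name -> code reconciliation. The names, codes and
// field labels below are invented for illustration only.
var boundaryCodes = [
  { name: "Isle of Wight", code: "UTA_EXAMPLE_1" },
  { name: "Havering", code: "LBO_EXAMPLE_1" }
];
var emissions = [
  { council: "Isle of Wight", value: 123.4 },
  { council: "Somewhere Else", value: 56.7 }
];

// Build a lookup table, normalising case and whitespace to avoid the
// near-miss matches that make exact lookups fail.
var lookup = {};
boundaryCodes.forEach(function (b) {
  lookup[b.name.trim().toLowerCase()] = b.code;
});

// Join the two lists, flagging any rows that cannot be matched.
var joined = emissions.map(function (row) {
  var code = lookup[row.council.trim().toLowerCase()] || null;
  return { council: row.council, code: code, value: row.value };
});
console.log(joined);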

Something else that comes to mind is this: the datablog asks: “Can you do something with this data? Please post your visualisations and mash-ups on our Flickr group”. If the data had included authority codes, I would have been more likely to persist in trying to get them mapped using OpenLayers. But my lack of understanding about how to get from names to codes meant I stumbled at this hurdle. There was too much friction in going from area name to OpenLayers boundary code. (I have no idea, for example, whether the area names relate to one administrative class, or several.)

Although I don’t think the following is the case, it is possible to imagine a scenario where the Guardian do have a table that includes the administrative codes as well as the names for this data – or an environment/application/tool for rapidly and reliably generating such a table – and that they know this makes the data more valuable, because it means they can easily map it but others can’t. The lack of codes means work needs to be done in order to create a compelling map from the data that may attract web traffic; if it were that easy to create the map, a “competitor” might make it and get the traffic for no real effort.

The idea I’m fumbling around here is that there is a spectrum of stuff around a data set that makes it more or less easy to create visualisations. In the current example, we have area name, area code, map. Given an area code, it’s presumably (?) easy enough to map using e.g. OpenLayers because the codes are unambiguous. Given an area name, if we can reliably look up the area code, it’s presumably easy to generate the map from the name via the code. Now, if we want to give the appearance of publishing the data but make it hard for people to use, we can make it hard for them to map from names to codes – either by messing around with the names, or by using a mix of names that map on to area codes of different types. So we can taint the data to make it hard for folk to use easily, whilst still being seen to publish it.

Now I’m not saying the Guardian do this, but a couple of things follow: firstly, obfuscating or tainting data can help you prevent casual use of it by others (it can also help you track the data; e.g. mapping agencies that put false artefacts in their maps to help reveal plagiarism); secondly, if you are casual with the way you publish data, you can make it hard for people to make effective use of that data. For a long time, I used to hassle folk into publishing RSS feeds. Some of them did… or at least thought they did. For as soon as I tried to use their feeds, they turned out to be broken. No-one had ever tried to consume them. Same with data. If you publish your data, try to do something with it. So for example, the emissions data is illustrated with a Many Eyes visualisation of it; it works as data in at least that sense. From the place names, it would be easy enough to vaguely place a marker on a map showing a data value roughly in the area of each council. But for identifying exact administrative areas – the data is lacking.

It might seem as if I’m angling against the current advice to councils and government departments to just “get their data out there” even if it is a bit scrappy, but I’m not… What I am saying (I think) is that folk should just try to get their data out, but also:

- have a go at trying to use it for something themselves, or at least just demo a way of using it. This can have a payoff in at least three ways I can think of: a) it may help you spot a problem with the way you published the data that you can easily fix, or at least post a caveat about; b) it helps you develop your own data handling skills; c) you might find that you can encourage reuse of the data you have just published in your own institution…

- be open to folk coming to you with suggestions for ways in which you might be able to make the data more valuable/easier to use for them for little effort on your own part, and that in turn may help you publish future data releases in an ever more useful way.

Can you see where this is going? Towards Linked Data… ;-)

PS Just by the by, a related post on the Telegraph blogs (one that just happens to mention OUseful.info :-) – “Open data ‘rights’ require responsibility from the Government” – led me to a quick chat with Telegraph data hack @coneee and the realisation that the Telegraph too are starting to explore the release of data via Google spreadsheets. So for example, a post on “Councils spending millions on website redesigns as job cuts loom” also links to the source data here: Data: Council spending on websites.

