Jun 23, 2008 12:00 PM

Visualizing Big Data: Bar Charts for Words

A visualization of thousands of Wikipedia edits that were made by a single software bot. Each color corresponds to a different page.
Image: Fernanda B. Viégas, Martin Wattenberg, and Kate Hollenbach

The biggest challenge of the Petabyte Age won't be storing all that data, it'll be figuring out how to make sense of it. Martin Wattenberg, a mathematician and computer scientist at IBM's Watson Research Center in Cambridge, Massachusetts, is a pioneer in the art of visually representing and analyzing complex data sets. He and his partner at IBM, Fernanda Viégas, created Many Eyes, a collaborative site where users can share their own dynamic, interactive representations of big data. He spoke with Wired's Mark Horowitz:

Wired: How do you define "big" data?

Wattenberg: You can talk about terabytes and exabytes and zettabytes, and at a certain point it becomes dizzying. The real yardstick to me is how it compares with a natural human limit, like the sum total of all the words you'll hear in your lifetime. That's surely less than a terabyte of text. Any more than that and it becomes incomprehensible by a single person, so we have to turn to other means of analysis: people working together, or computers, or both.

Wired: Why is a numbers guy like you so interested in large textual data sets?

Wattenberg: Language is one of the best data-compression mechanisms we have. The information contained in literature, or even email, encodes our identity as human beings. The entire literary canon may be smaller than what comes out of particle accelerators or models of the human brain, but the meaning coded into words can't be measured in bytes. It's deeply compressed. Twelve words from Voltaire can hold a lifetime of experience.

Wired: What will happen when we have digital access to everything, like all of English literature or all the source code ever written?

Wattenberg: There's something about completeness that's magical. The idea that you can have everything at your fingertips and process it in ways that were impossible before is incredibly exciting. Even simple algorithms become more effective when trained on big sets. Perhaps we'll find out more about plagiarism and literary borrowing when we have the spread of literature before us. We think of our current age as one of intellectual remixing and mashups, but maybe it's always been that way. You can only do that kind of analysis when you have the full spectrum of data.

Wired: Is that why, on Many Eyes, you have visualizations of Wikipedia using simple word trees and tag clouds?

Wattenberg: Wikipedia also has this idea of completeness. The information there again probably totals less than a terabyte, but it's huge in terms of encompassing human knowledge. Today, if you're analyzing numbers, there are a million ways to make a bar chart. If you're analyzing text, it's hard. I think the only way to understand a lot of this data is through visualization.