Sunday, June 20, 2010

Getting a handle on data

When I teach system dynamics, I emphasize the importance of starting with a reference mode or reference behavior pattern: a graph that describes the problem being addressed over time. That's vital for grounding one's thinking in the real world.

I've written about the importance of looking at data here, too. Still, much system dynamics work seems to emphasize the importance of good inferences from the data we have over the quality of the data. I largely agree with that notion, and yet I think there are two areas in which we can enhance our contribution and help make better decisions.

First, we can do a better job of understanding what the data is telling us. I have been and remain a fan of J as a tool for making sense of data. It extracts data from spreadsheets and databases easily, and it analyzes that data quickly and concisely.

Over the last year, I've increasingly been using R for data analysis. While it's nowhere near as concise or elegant (IMHO) as J, it connects as well to a broad range of data sources, and it has an enormous library of data analysis and graphics packages to help with any sort of analysis I've been able to think of.

In fact, one of the challenges is figuring out which of the 2,430 currently available packages is best for the task at hand (CRAN Task Views can help). You may have your own favorite way to get started; if you lack a good starting point, I'd recommend checking out Gelman and Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models as a way to learn R in a very practical manner. You might also check out R bloggers, an aggregation of 73 blogs that talk about R.

It might be worth mentioning exploratory data analysis (EDA), too, for that seems particularly aligned with the notion of creating a reference mode graph. John Tukey's Exploratory Data Analysis is the seminal book in that approach, but there are a number of others; I started with Understanding Robust and Exploratory Data Analysis. Of course, the world has progressed since Tukey first wrote that book; Andrew Gelman has written A Bayesian formulation of exploratory data analysis and goodness-of-fit testing and Exploratory and confirmatory data analysis, which might prove interesting, and software such as GGobi and Mondrian has provided technological updates to some of Tukey's ideas.
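To give a flavor of what EDA looks like in practice, here is a minimal sketch of a five-number summary, one of the basic summaries in the Tukey tradition. It is written in Python rather than J or R only to keep it dependency-free; the function name and the data are my own inventions for illustration, and the hinge convention (each half includes the middle value when the count is odd) is one common reading of Tukey's definition, not the only one in use.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max).

    Hinges are computed as the medians of the lower and upper halves of
    the sorted data; when the number of observations is odd, the middle
    value is included in both halves (an assumed convention).
    """
    data = sorted(values)
    if not data:
        raise ValueError("need at least one observation")
    n = len(data)
    half = (n + 1) // 2          # size of each half, middle value shared if n is odd
    lower = data[:half]
    upper = data[n - half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical monthly backlog counts, invented purely for illustration.
backlog = [12, 15, 14, 18, 21, 19, 25, 30, 28, 35, 90]
print(five_number_summary(backlog))
```

A summary like this is only a starting point: the one large value in the made-up series stands out against the hinges, which is the kind of observation that should send you back to a plot of the data over time.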

Second, we can work on the way we present graphical results. Over the years, I've read The Visual Display of Quantitative Information, 2nd edition and other books by Edward Tufte, and I've found them quite helpful (perhaps you have, too; certainly I've seen his work quoted in numerous places online, and I sense that he's reached a large number of people through his excellent day-long seminars).

I've also tried to keep my eyes open for research into what sort of graphical presentations are most likely to convey information accurately to viewers. I've recently finished William Cleveland's The Elements of Graphing Data and found it highly useful in helping me to think about graphics more effectively. It combines research results that point towards more effective approaches with practical tips and new graph types I hadn't used. His Visualizing Data promises to be good, too, although I haven't yet read it.

Speaking of graphics, colleague Chris Soderquist pointed me to Leland Wilkinson's The Grammar of Graphics (Statistics and Computing) some time ago. I quite enjoyed the book and felt a bit frustrated that I didn't have a way to use what I had learned.

Now Hadley Wickham has implemented the grammar of graphics for R in a package called ggplot2. He's got a smaller book, ggplot2: Elegant Graphics for Data Analysis (Use R), which seems to serve as a very handy reference to the current state of ggplot2, although you can go a long way with the online reference manual and mailing list.



Blogger Bill Harris said...

There are two more online resources for graphics that I've found useful. Perhaps they can help you.

Frank Harrell's Statistical Presentation Graphics is heavily influenced by Cleveland's work.

Rafe Donahue's Principles for Constructing Better Graphics also seems interesting. Thanks to Andrew Gelman's Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics for that reference.

20 June, 2010 15:13  
Blogger Bill Harris said...

... and here's a seemingly quite useful R resource called Rtips from Paul Johnson.

20 June, 2010 15:18  
