Computational Social Science: The Deluge: Twitter and Hurricane Sandy

Information is often valuable, but in a crisis it is crucial. As a broad, basic first step toward harvesting social media to gauge a population's mood, my group looked at the use of language on Twitter as Hurricane Sandy bore down on the East Coast of the United States in late 2012. What I show here is a basic visualization of tweets by time, location, and valence (here calculated by a hilariously rough positivity/negativity measure).

The tweets were scraped by team member Jacek Radzikowski (who is also available on Twitter here), searching on the terms "Sandy", "hurricane", and "frankenstorm". A scraped tweet consists of much more than a 140-character string and a timestamp: if the user has included a description of their location in their profile or allowed Twitter to use their phone's GPS, the tweet can carry some very specific locational information. For reasons that will become apparent below, we wanted to make use of the user-provided location information rather than relying exclusively on GPS-derived geotagged tweets.

In this post, I distinguish between what I call "geotagged" tweets (tweets that arrive with GPS-derived coordinates attached) and "geocoded" tweets (tweets whose user-provided location text was run through the Yahoo PlaceFinder to produce a set of coordinates).

We collected 1,060,915 tweets in all, of which 692,376 have some geographic designator and 10,237 have GPS-derived coordinates. For those of us who understand better in scientific notation, that's

1.1x10^6 total
6.9x10^5 with some info (~65%)
1.0x10^4 with coordinates (~1%)

(...so our motivation for investigating the geocoder is pretty obvious, right?)
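
For anyone who wants to check the arithmetic, here is a minimal sketch that rebuilds the rounded figures from the raw counts quoted above. The labels and the function name are my own for illustration; they are not part of our pipeline.

```python
# Raw tweet counts quoted in the text above.
COUNTS = {
    "total": 1060915,
    "some location info": 692376,
    "GPS coordinates": 10237,
}


def summarize(counts):
    """Return (label, scientific notation, percent of total) for each count."""
    total = counts["total"]
    rows = []
    for label, n in counts.items():
        # ".1e" keeps one digit after the decimal point, e.g. 1.1e+06.
        rows.append((label, f"{n:.1e}", round(100 * n / total)))
    return rows


for label, sci, pct in summarize(COUNTS):
    print(f"{sci}  {label} (~{pct}%)")
```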

To derive some basic measure of how positively people were talking about their impending doom, we used a hand-coded list of words with valence scores (AFINN, available here). For each tweet, the text is stripped of punctuation, converted to lower case, and broken into individual words. The tweet's valence is the sum of the valence values of its words that appear in the AFINN word list, divided by the number of such words in the tweet; in other words, it is the mean valence of the scored words. In the following images, the tweets are colored by this normalized quantity, with more positive tweets in green and more negative tweets in red.
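
The scoring steps described above can be sketched in a few lines of Python. This is a rough illustration rather than the code we actually ran: the tiny lexicon is a made-up stand-in for the real AFINN list (the scores are invented for the example), the function name is hypothetical, and returning 0.0 for tweets with no scored words is my assumption, since the description above does not say how that case is handled.

```python
import string

# Toy stand-in for the AFINN word list. These words and scores are
# illustrative only; the real list is much longer.
TOY_LEXICON = {"safe": 1, "love": 3, "scared": -2, "terrible": -3}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tweet_valence(text, lexicon=TOY_LEXICON):
    """Mean valence of the words in `text` that appear in `lexicon`."""
    # Strip punctuation, lower-case, and split on whitespace.
    words = text.translate(_PUNCT_TABLE).lower().split()
    # Keep only the words that carry a valence score.
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        # Assumption: a tweet with no scored words is treated as neutral.
        return 0.0
    # Sum of valences, normalized by the number of valence-having words.
    return sum(scores) / len(scores)


print(tweet_valence("Scared of #Sandy, but we are SAFE!"))
```

Dividing by the number of scored words, rather than by the tweet's total length, means that a short tweet and a long tweet containing the same emotional words get the same score.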