Saturday, June 29, 2013

Trawling Social Media Part 2: Flickr

My adventures in Instagram came to a sudden and tragic halt once I encountered the bugginess of Instagram's media/search function. Other people have been documenting a decline in the number of responses they get to this API call for a few months now (see, for example, here), and at this point I get nothing but Error Code 400 in response to every request. Instagram's locations/search function does work correctly, but it returns only 33 Colorado Springs-geocoded images over an eight-day period encompassing the evacuation of more than 20,000 people from that city. Disappointing!

Bereft of data, I've been investigating Flickr today. I found this tutorial helpful in getting started. However, for roughly the same period of time I'm seeing only 96 images being shot within a 5km radius of the center of Colorado Springs. I've posted the bulk of the code here, in case anyone else wants to give it a try.
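
For anyone who wants to see the shape of such a query without opening the linked code, here is a minimal Python sketch (a stand-in, not the script linked above). It only builds the request URL for Flickr's flickr.photos.search method with a 5 km radius around a point. The coordinates are an approximate city center, and the API key and date range are placeholders I've made up for illustration, not the values behind the 96 images.

```python
from urllib.parse import urlencode

FLICKR_REST_URL = "https://api.flickr.com/services/rest/"


def build_flickr_search_url(api_key, lat, lon, radius_km,
                            min_taken, max_taken, page=1):
    """Build a flickr.photos.search URL for photos taken near a point."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km",
        "min_taken_date": min_taken,
        "max_taken_date": max_taken,
        "extras": "geo,date_taken",  # ask for coordinates and capture time
        "per_page": 250,             # geo queries return at most 250 per page
        "page": page,
        "format": "json",
        "nojsoncallback": 1,         # plain JSON instead of a JSONP wrapper
    }
    return FLICKR_REST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder key and dates; approximate center of Colorado Springs.
    url = build_flickr_search_url(
        api_key="YOUR_API_KEY",
        lat=38.8339,
        lon=-104.8214,
        radius_km=5,
        min_taken="2012-06-23",
        max_taken="2012-07-01",
    )
    print(url)
```

Fetching that URL with a real key should return a JSON document whose "photos" entry reports the total number of matches, which makes it easy to compare counts across different radii or date ranges.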

A sample of the images drawn from my Flickr data pull

If anyone has had different experiences with this, I'd be thrilled to hear it, but I'm willing to tentatively say that if you're going back more than a few months and trying to extract data, you should expect to encounter a lot of bad behavior from APIs. This makes a lot of sense given the focus of these companies, but the takeaway is: caveat emptor.

Wednesday, June 26, 2013

Trawling Social Media

Lately, I've been digging into using the APIs of various social media platforms as tools to help explore the spread of information, sentiment, and so forth, specifically focusing on Twitter and Instagram. It's been interesting and sometimes challenging, as I'm going over old tricks (PHP) and learning new ones (authorization and so forth). There have been a number of resources I've found helpful, and I thought I'd post something about my experiences here as a guide to others who are also just getting started with using social media APIs for research.

OAuth

In order to access the APIs, it's necessary to authenticate with the API itself. Both Twitter and Instagram use the OAuth protocol to provide users with access to their data. This involves having an account, registering an application, and generating and providing access codes in the appropriate places. I found the following to be very helpful in understanding and interacting with OAuth:

  • 140 Dev Twitter OAuth Programming - a tutorial on using OAuth in the context of Twitter applications. You have to sign up as a member to get access to the text, but I highly recommend it.
  • tmhOAuth - An OAuth library used in the 140 Dev tutorial, which with minimal modification can be used to access Instagram data as well (specifically, by changing 'api.twitter.com' on line 40 to 'api.instagram.com').
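
To give a feel for what a library like tmhOAuth handles on your behalf, here is a rough Python sketch of the OAuth 1.0a HMAC-SHA1 signing step that Twitter's REST 1.1 API expects. It is not tmhOAuth's code, and it covers only the signature, not the Authorization header or the request itself. All keys, secrets, the nonce, and the timestamp below are made-up placeholders; a real client generates a fresh nonce and timestamp for every request.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode a value, leaving only unreserved characters as-is."""
    return quote(str(value), safe="")


def signature_base_string(method, url, params):
    """Join method, URL, and sorted, encoded parameters into one string."""
    encoded = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), pct(url), pct(param_string)])


def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    """Sign the base string with the consumer and token secrets."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode("ascii"),
                      base_string.encode("ascii"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials and a hypothetical target account.
    params = {
        "oauth_consumer_key": "YOUR_CONSUMER_KEY",
        "oauth_nonce": "replace-with-random-string",
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": "1372464000",
        "oauth_token": "YOUR_ACCESS_TOKEN",
        "oauth_version": "1.0",
        "screen_name": "some_user",
        "count": 100,
    }
    base = signature_base_string(
        "GET",
        "https://api.twitter.com/1.1/statuses/user_timeline.json",
        params,
    )
    print(sign_hmac_sha1(base, "YOUR_CONSUMER_SECRET", "YOUR_TOKEN_SECRET"))
```

The resulting signature is sent as the oauth_signature parameter alongside the other oauth_ values. In practice it is much less error-prone to let a tested library do this than to roll your own.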

The App Itself

I've been using PHP to do my data extraction, but other sites discuss using JavaScript, etc., to do something similar. If you're comfortable with the command line, using PHP is incredibly simple and I highly recommend it. I've included a simple script which, if you edit it to include your target username, will pull the last 100 tweets from a user. To use it, you can drop the tmhOAuth.php and cacert.pem files into a directory with a copy of your application tokens, put a simple PHP script in that same directory, and type

commandline> php myScript.php > outputfile.txt

and voila! You're done. Until you run into 429 Errors, aka rate limiting.

Rate Limiting

Rate limiting will probably make you want to tear out your hair. Twitter and Instagram have different limits on how many times you can query the API within a given window of time. Twitter even breaks rate limiting down by the kind of query you send: for example, under the current REST 1.1 API policies, a user can submit only 15 "GET lists" queries, compared with 180 "GET statuses/user_timeline" queries, within a 15-minute period. So be aware of this, and factor it into your applications.
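
One way to factor it in is to wait and retry whenever a 429 comes back. The Python sketch below is only an illustration of that idea, not code from the sample files. It assumes a hypothetical fetch function that returns a (status, headers, body) tuple with lowercase header names, and it relies on the x-rate-limit-reset header (an epoch timestamp) that Twitter's REST 1.1 API sends. Other APIs signal their limits differently, so the fallback simply backs off for longer each time.

```python
import time


def fetch_with_rate_limit(fetch, max_attempts=5,
                          now=time.time, sleep=time.sleep):
    """Call fetch() until it returns something other than a 429.

    fetch is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point waiting if we will not try again
        reset_at = headers.get("x-rate-limit-reset")
        if reset_at is not None:
            # Wait until the window resets, plus a one-second cushion.
            wait = max(0.0, float(reset_at) - now()) + 1.0
        else:
            # No hint from the server: back off 5s, 10s, 20s, ... up to 15 min.
            wait = min(900.0, 5.0 * 2.0 ** attempt)
        sleep(wait)
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)


if __name__ == "__main__":
    # Fake responses so the demo runs without a network or real waiting.
    responses = [(429, {"x-rate-limit-reset": "1010"}, None),
                 (200, {}, "tweets")]
    waits = []
    result = fetch_with_rate_limit(lambda: responses.pop(0),
                                   now=lambda: 1000.0,
                                   sleep=waits.append)
    print(result, waits)
```

Passing in the clock and sleep functions keeps the waiting logic easy to try out without actually sitting through a 15-minute window.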

Anyway, hope this is helpful to someone!

Sample PHP files