Details of Work Product 1 – Sustainability – “Improve long term sustainability and accuracy of Twapper Keeper archives.”


To ensure the Twapper Keeper service could continue to be a viable platform for archiving tweets, numerous operational / infrastructure issues were addressed.

The issues and a plan for addressing them were summarized in the PowerPoint presentation published on the project blog, titled Twapper Keeper Operations / Infrastructure Enhancements, on 4/25/2009 (https://twapperkeeper.wordpress.com/2010/04/27/planned-ops-infrastructure-enhancements/).

The following outlines the details of each action item:

Upgrade the Primary Twapper Keeper server VPS configuration to include larger CPU and RAM configuration

The primary server was upgraded to the next VPS level available from the host.  Load continues to be monitored, and the server will most likely be upgraded again in the future as growth continues.

Procure additional disk space for Primary and Backup server

The primary and backup servers had disk space added (roughly 40 GB) to support continued growth.  This added breathing room to the overall infrastructure (which is important, as log files can grow quickly in certain situations), and the disks will most likely be upgraded again in the short term as the number of records increases.

Incorporate automated export / import routine for backup server and rsync Apache and jobs files to ensure secondary server can be brought online more rapidly

The primary server now performs a nightly “export” to the backup server.  The backup server has been set up with similar LAMP components installed, as well as an rsync’ed codebase, so that in the case of a major failure on the primary server, the database export can be imported and the system restarted on the backup server.  (Any tweets missed between backups would be filled in via the extensive back search archive process that is used to find missing tweets.)
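As a rough illustration of what such a nightly job might run, the sketch below only builds the command lines rather than executing them. It assumes a MySQL database (suggested by the LAMP stack, but not stated explicitly) and standard mysqldump and rsync usage; the host name, database name, and all paths are hypothetical placeholders, not the actual Twapper Keeper configuration.

```python
import shlex

# All values below are hypothetical placeholders for illustration only.
BACKUP_HOST = "backup.example.org"
DB_NAME = "twapperkeeper"
DUMP_PATH = "/var/backups/tk_nightly.sql.gz"
SYNC_DIRS = ["/var/www/", "/etc/apache2/", "/opt/tk/jobs/"]


def export_command(db_name=DB_NAME, dump_path=DUMP_PATH):
    """Shell pipeline that dumps the database to a compressed file."""
    return "mysqldump --single-transaction {} | gzip > {}".format(
        shlex.quote(db_name), shlex.quote(dump_path)
    )


def rsync_commands(host=BACKUP_HOST, dirs=SYNC_DIRS, dump_path=DUMP_PATH):
    """rsync argument lists that mirror code, config, and the nightly dump."""
    cmds = [["rsync", "-az", "--delete", d, "{}:{}".format(host, d)] for d in dirs]
    cmds.append(["rsync", "-az", dump_path, "{}:{}".format(host, dump_path)])
    return cmds


if __name__ == "__main__":
    # Print the commands a cron job could run each night.
    print(export_command())
    for cmd in rsync_commands():
        print(" ".join(cmd))
```

On failover, the backup server would import the most recent dump and start the same jobs from its synced copy of the codebase.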

Implement a layer of monitoring for administrator and provide a “system health” page for users for transparency

A new monitoring process has been established to help the administrator and end users keep a better eye on the system.

The monitoring process includes a job that polls the various archiving / export processes approximately every 10-15 minutes to ensure they are still alive, and records the findings to a database table.

If any issues are found, SMS and email alerts are sent out so that I can resolve the issue.

End users also have access to the current status and the status history (roughly 7 days back) by accessing the SYSTEM HEALTH link in the Twapper Keeper system.
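The polling, recording, and history steps can be sketched roughly as follows. This is a minimal illustration in Python using an in-memory SQLite table; the table layout, function names, and the `alert` callback (standing in for the SMS and email alerts) are assumptions, not the actual Twapper Keeper implementation.

```python
import sqlite3
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # history window shown to end users, in seconds


def open_store():
    """Create a hypothetical health table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE health (checked_at REAL, process TEXT, alive INTEGER)"
    )
    return conn


def record_checks(conn, checks, alert, now=None):
    """Run each liveness check, store the result, and alert on failures."""
    now = time.time() if now is None else now
    for name, is_alive in checks.items():
        alive = bool(is_alive())
        conn.execute(
            "INSERT INTO health VALUES (?, ?, ?)", (now, name, int(alive))
        )
        if not alive:
            alert("{} is not responding".format(name))
    conn.commit()


def history(conn, now=None, window=SEVEN_DAYS):
    """Return the status rows from the last `window` seconds, newest first."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT checked_at, process, alive FROM health "
        "WHERE checked_at >= ? ORDER BY checked_at DESC, process",
        (now - window,),
    ).fetchall()
```

A scheduler such as cron would call `record_checks` every 10-15 minutes, and a status page would render whatever `history` returns.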

Provide a user feature to “reset” archives to kick off the requery process if they are in a hurry to see tweets (will force re-call of /search instead of waiting)

A new “reset archive” button has been added to every archive in the system.  This button raises the priority level of the archive so that it is once again treated as a new archive.  This allows the back search archiving process for new archives (which is under light load most of the time) to quickly try to pull tweets into the archive.  This really helps when a user finds a missed tweet, as the process will quickly reach back into the Twitter cache for data and try to “fill in the holes.”
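The effect of the reset can be pictured with a small sketch. The tier labels and function names here are hypothetical, chosen only to show how flagging an archive as new again hands it back to the lightly loaded new-archive process.

```python
NEW = "new"                  # hypothetical tier label for new archives
ESTABLISHED = "established"  # hypothetical tier label for older archives


def reset_archive(archives, archive_id):
    """Mark an archive as new again so the new-archive back search picks it up."""
    archives[archive_id]["tier"] = NEW


def new_archive_queue(archives):
    """Archive IDs the new-archive back search process would work through."""
    return sorted(a for a, meta in archives.items() if meta["tier"] == NEW)
```

For example, calling `reset_archive(archives, "jisc")` on an established archive makes `"jisc"` appear in `new_archive_queue(archives)` on the next pass.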

Implement OAuth for all REST API calls

In response to Twitter’s plans to deprecate Basic Authentication for REST API services at the end of June 2010, all of the “back search” archiving processes had to be re-implemented (4 processes for keywords / hashtags, and 2 for person timelines) to leverage OAuth 1.0a.  (An upgrade to OAuth 2.0 may be required in the future as the Twitter platform evolves.)

(NOTE:  This also resulted in an initial OAuth client implementation being established in Twapper Keeper to obtain the access tokens.  It will be exposed to the end user shortly in preparation for further UI improvements.)
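For readers unfamiliar with what moving from Basic Authentication to OAuth 1.0a involves, the sketch below shows the standard HMAC-SHA1 request signing step defined by the OAuth 1.0 specification (RFC 5849), using only the Python standard library. It is illustrative only: the post does not say whether Twapper Keeper used a library or its own signing code, and the function names are assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode per RFC 3986 (only unreserved characters pass through)."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """Build the OAuth 1.0a signature base string.

    `url` is the base URL without a query string; `params` holds the
    oauth_* parameters plus any query or form parameters.
    """
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join("{}={}".format(k, v) for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(param_string)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = "{}&{}".format(pct(consumer_secret), pct(token_secret))
    base = signature_base_string(method, url, params)
    digest = hmac.new(
        key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")
```

The resulting value is sent as the `oauth_signature` parameter alongside the consumer key, access token, nonce, and timestamp, so every one of the back search processes has to sign each request instead of sending a username and password.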

Implement an approach to reduce contention on RAWSTREAM table to reduce latency on inbound Streaming API

This was a VERY BIG issue after going live with the Streaming API in March, and it was important to address because it caused significant lag before tweets appeared in their appropriate archive.

It also resulted in 30-40 minute “latency” in the ingestion of Twitter’s Streaming API, which would result in lost tweets during any disconnects / reconnects with the Twitter Streaming API.

As a result, a new approach was implemented that moved all the binning processes from being heavily database dependent to memory dependent.

The basic process we are now following is:

1)    Tweets from the Streaming API are immediately dumped into a temp table

2)    Five workers (PHP CLI jobs) run constantly, looking for tweets in the temp table.  If they find any, they grab as many as they can (up to a max of 5000, but on average only a few hundred) and put them into memory

3)    The worker then queries the complete list of “predicates” Twapper Keeper is monitoring (hashtags, keywords) and places that into memory

4)    The worker then loops through the tweets comparing against the predicates, and inserts matches into the proper archives
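The steps above can be sketched in a few lines. The real workers are PHP CLI jobs reading from a database temp table; this Python sketch models the temp table as a plain list, and the matching rule (case-insensitive substring) is an assumption, since the post does not say exactly how tweets are compared against predicates.

```python
from collections import defaultdict

MAX_BATCH = 5000  # upper bound on tweets grabbed per pass, as described above


def take_batch(temp_table, limit=MAX_BATCH):
    """Remove and return up to `limit` tweets from the front of the temp table."""
    batch = temp_table[:limit]
    del temp_table[:limit]
    return batch


def bin_tweets(batch, predicates):
    """Map each predicate to the IDs of tweets whose text contains it.

    Matching is a case-insensitive substring test (an assumption).
    """
    archives = defaultdict(list)
    lowered = [p.lower() for p in predicates]
    for tweet in batch:
        text = tweet["text"].lower()
        for original, needle in zip(predicates, lowered):
            if needle in text:
                archives[original].append(tweet["id"])
    return dict(archives)
```

Because both the batch and the predicate list are held in memory, the only database work per pass is one read of the temp table, one read of the predicates, and the inserts for actual matches.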

This new process is fast enough that, when there is no backlog (backlogs now occur only in very rare cases), tweets typically take only 30 seconds to go from “creation” to landing in the proper bin.  Contention on the inbound temp table has also been reduced, so tweets now travel from Twitter to the Twapper Keeper temp table “almost instantly.”   (Basically: NO CONTENTION AND NO LATENCY!)


