Archive for May, 2010

Details of Work Product 1 – Sustainability – “Improve long term sustainability and accuracy of Twapper Keeper archives.”

May 21, 2010

To ensure the Twapper Keeper service could continue to be a viable platform for archiving tweets, numerous operational / infrastructure issues were addressed.

The summary of the issues and a plan were outlined in the PowerPoint presentation titled Twapper Keeper Operations / Infrastructure Enhancements, published on the project blog on 4/27/2010 (https://twapperkeeper.wordpress.com/2010/04/27/planned-ops-infrastructure-enhancements/).

The following outlines the details of each action item:

Upgrade the Primary Twapper Keeper server VPS configuration to include larger CPU and RAM configuration

The primary server was upgraded to the next level VPS available from the host.  Load continues to be monitored, and the server sizing will most likely be upgraded again in the future as growth continues.

Procure additional disk space for Primary and Backup server

The primary and backup servers had disk space added (roughly 40 GB) to support continued growth.  This added breathing room to the overall infrastructure (which is important, as log files can grow quickly in certain situations), and the disk space will most likely be upgraded again in the short term as the number of records increases.

Incorporate automated export / import routine for backup server and rsync Apache and jobs files to ensure secondary server can be brought online more rapidly

The primary server now does a nightly “export” to the backup server.  The backup server has been set up with similar LAMP components installed, as well as an rsync’ed codebase, so that in the case of a major failure on the primary server, the database export can be imported and the system can be restarted on the backup server.  (Any tweets missed between backups would be filled in via the extensive back search archive process that is used to find missing tweets.)
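As a rough illustration only, a nightly routine of this kind could be driven by a small script like the one below. It assumes a MySQL database (the "M" in LAMP) exported with mysqldump and files copied with rsync; the host name, paths, and database name are hypothetical placeholders, not the actual Twapper Keeper configuration.

```python
import shlex

# Hypothetical placeholders: the real host, paths, and database name are not given in this post.
BACKUP_HOST = "backup.example.com"
DB_NAME = "twapperkeeper"
CODE_DIR = "/var/www/twapperkeeper/"
DUMP_FILE = "/var/backups/twapperkeeper.sql"


def nightly_backup_commands():
    """Return the commands for one nightly run, in execution order."""
    # 1) Export the database to a local file.
    dump = ["mysqldump", f"--result-file={DUMP_FILE}", DB_NAME]
    # 2) Copy the export to the backup server.
    ship_dump = ["rsync", "-az", DUMP_FILE, f"{BACKUP_HOST}:{DUMP_FILE}"]
    # 3) Mirror the codebase, removing files that no longer exist on the primary.
    sync_code = ["rsync", "-az", "--delete", CODE_DIR, f"{BACKUP_HOST}:{CODE_DIR}"]
    return [dump, ship_dump, sync_code]


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    for cmd in nightly_backup_commands():
        print(shlex.join(cmd))
```

In practice these commands would be run from a scheduler such as cron, and the import on the backup server would only happen when a failover is actually needed.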

Implement a layer of monitoring for administrator and provide a “system health” page for users for transparency

A new monitoring process has been established to help the administrator and end users keep a better eye on the system.

The monitoring process includes a job that polls the various archiving / export processes approximately every 10-15 minutes to ensure they are still alive, and records the findings to a database table.

If any issues are found, SMS and email alerts are sent out so that I can resolve the issue.

End users also have access to the current status and the status history (roughly 7 days back) by accessing the SYSTEM HEALTH link in the Twapper Keeper system.
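A minimal sketch of how such a monitor might work is shown below. It assumes each archiving / export job writes a "heartbeat" timestamp that the monitor then checks; the table names, the 15 minute threshold, and the alert callback are illustrative assumptions, not the actual Twapper Keeper schema.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 15 * 60  # assumed threshold, in line with the 10-15 minute polling interval


def open_db():
    """Create an in-memory database with hypothetical heartbeat and log tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE heartbeat (process TEXT PRIMARY KEY, last_seen REAL)")
    db.execute("CREATE TABLE health_log (checked_at REAL, process TEXT, status TEXT)")
    return db


def record_heartbeat(db, process, now):
    """Called by each archiving / export job to show it is still alive."""
    db.execute("INSERT OR REPLACE INTO heartbeat VALUES (?, ?)", (process, now))


def check_health(db, now, send_alert):
    """Log the status of every known process and alert on stale ones."""
    rows = db.execute("SELECT process, last_seen FROM heartbeat ORDER BY process").fetchall()
    for process, last_seen in rows:
        status = "OK" if now - last_seen <= STALE_AFTER_SECONDS else "DOWN"
        db.execute("INSERT INTO health_log VALUES (?, ?, ?)", (now, process, status))
        if status == "DOWN":
            send_alert(f"{process} has not reported since {last_seen:.0f}")


def history(db, process, now, days=7):
    """Status history for one monitor, newest first, limited to the last `days` days."""
    cutoff = now - days * 86400
    return db.execute(
        "SELECT checked_at, status FROM health_log "
        "WHERE process = ? AND checked_at >= ? ORDER BY checked_at DESC",
        (process, cutoff),
    ).fetchall()


if __name__ == "__main__":
    db = open_db()
    now = time.time()
    record_heartbeat(db, "stream_consumer", now - 60)
    record_heartbeat(db, "export_job", now - 3600)
    check_health(db, now, send_alert=print)  # print stands in for SMS / email
    print(history(db, "export_job", now))
```

The same log table can serve both audiences: the alert path for the administrator and the history query for a public status page.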

Provide a user feature to “reset” archives to kick off the requery process if they are in a hurry to see tweets (will force re-call of /search instead of waiting)

A new “reset archive” button has been added to every archive in the system.  This button moves the priority level of the archive up so that it is once again treated as a new archive.  This allows the back search archiving process for new archives (which is under light load most of the time) to quickly try to pull tweets into the archive.  This really helps when a user finds a missed tweet, as the process will quickly reach back into the Twitter cache for data and try to “fill in the holes.”
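The idea can be sketched as follows, assuming a simple two-level priority scheme in which a lower number is crawled first. The class, constants, and function names are hypothetical and only illustrate the behavior described above.

```python
from dataclasses import dataclass

NEW_ARCHIVE = 0   # assumed convention: lowest number is crawled first
ESTABLISHED = 1


@dataclass
class Archive:
    name: str
    priority: int = NEW_ARCHIVE


def mark_crawled(archive):
    """After the initial back search, the archive drops to the normal queue."""
    archive.priority = ESTABLISHED


def reset_archive(archive):
    """What the reset button is described as doing: treat the archive as new again."""
    archive.priority = NEW_ARCHIVE


def next_to_crawl(archives):
    """Pick the archive the back search should visit next."""
    return min(archives, key=lambda a: a.priority)


if __name__ == "__main__":
    first, second = Archive("#one"), Archive("#two")
    mark_crawled(first)
    mark_crawled(second)
    reset_archive(second)
    print(next_to_crawl([first, second]).name)
```

Because the queue for new archives is lightly loaded, a reset archive is reached much sooner than it would be in the regular crawl.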

Implement OAuth for all REST API calls

In response to Twitter’s plans to deprecate Basic Authentication for REST API services at the end of June 2010, all of the “back search” archiving processes had to be re-implemented (4 processes for keywords / hashtags, and 2 for person timelines) to leverage OAuth 1.0a.  (An upgrade to OAuth 2.0 may be required in the future as the Twitter platform evolves.)

(NOTE:  This also resulted in an initial implementation of an OAuth client being established in Twapper Keeper to grab the access tokens.  It will be exposed to the end user shortly in preparation for further UI improvements.)
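For readers unfamiliar with OAuth 1.0a, the core difference from Basic Authentication is that each request carries a signature instead of a password. The sketch below shows the HMAC-SHA1 signing step as defined in RFC 5849; it is not the Twapper Keeper code (which is not shown in this post), and the URL, keys, and secrets are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode as OAuth 1.0a requires: only unreserved characters stay as-is."""
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    """Build the string that gets signed: METHOD & URL & sorted, encoded parameters.

    `url` is the request URL without its query string; `params` holds the query
    parameters plus all oauth_* parameters except oauth_signature.
    """
    encoded = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), pct(url), pct(param_string)])


def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature, keyed by both secrets."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # All values below are made up for illustration.
    params = {
        "q": "#hashtag",
        "oauth_consumer_key": "hypothetical-key",
        "oauth_token": "hypothetical-token",
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": "1273650000",
        "oauth_nonce": "abc123",
        "oauth_version": "1.0",
    }
    base = signature_base_string("GET", "https://api.example.com/search", params)
    print(sign_hmac_sha1(base, "consumer-secret", "token-secret"))
```

The resulting value is sent as the oauth_signature parameter, and the access token and token secret are what an OAuth client has to obtain before any signed call can be made.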

Implement an approach to reduce contention on RAWSTREAM table to reduce latency on inbound Streaming API

This was a VERY BIG issue after going live with the Streaming API in March, and was important to address as it resulted in significant lag time for tweets to be realized in their appropriate archive.

It also resulted in 30-40 minute “latency” in the ingestion of Twitter’s Streaming API which would result in lost tweets during any disconnects / reconnects with the Twitter Streaming API.

As a result a new approach was implemented that moved all the binning processes from being heavily database dependent to memory dependent.

The basic process we are now following is:

1)    Tweets from the Streaming API are immediately dumped into a temp table

2)    Five workers (PHP CLI jobs) are constantly running looking for tweets in the temp table.  If they find a tweet they grab as many as they can (up to a max of 5000, but on average only a few hundred) and put them into memory

3)    The worker then queries the complete list of “predicates” Twapper Keeper is monitoring (hashtags, keywords) and places that into memory

4)    The worker then loops through the tweets comparing against the predicates, and inserts matches into the proper archives

This new process is fast enough that, when there is no backlog (and backlogs are now very rare), tweets typically take only 30 seconds to go from “creation” to landing in the proper bin.  Contention on the inbound temp table has also been reduced, so tweets now go from Twitter to the Twapper Keeper temp table “almost instantly.”   (Basically – NO CONTENTION AND NO LATENCY!)
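The in-memory binning in steps 2 through 4 can be sketched roughly as below. The real workers are PHP CLI jobs reading from a database table; here a plain Python list stands in for the temp table, and the case-insensitive substring match is an assumed matching rule, since the post does not say how predicates are compared.

```python
from collections import defaultdict

MAX_BATCH = 5000  # upper bound per grab, as described in step 2


def take_batch(temp_table, limit=MAX_BATCH):
    """Step 2: remove up to `limit` tweets from the temp table and hold them in memory."""
    batch = temp_table[:limit]
    del temp_table[:limit]
    return batch


def bin_tweets(batch, predicates):
    """Steps 3-4: compare every tweet against every predicate, entirely in memory."""
    archives = defaultdict(list)
    for tweet in batch:
        text = tweet["text"].lower()
        for predicate in predicates:
            if predicate.lower() in text:  # assumed rule: case-insensitive substring
                archives[predicate].append(tweet["id"])
    return dict(archives)


if __name__ == "__main__":
    temp_table = [
        {"id": 1, "text": "Loving #JISC today"},
        {"id": 2, "text": "nothing relevant here"},
    ]
    matches = bin_tweets(take_batch(temp_table), ["#jisc", "archive"])
    print(matches)
```

With several workers running at once, each batch would also need to be claimed atomically so that two workers never process the same tweets; that detail is left out of the sketch.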

TwapperKeeper is currently migrating to new servers

May 20, 2010

We are currently in the process of migrating to new servers. Sorry for the short notice, but we were also notified late about when the changeover would happen.

Once the system is back online, we will begin crawling back through the Twitter SEARCH cache for any tweets missed during the downtime.

Find out if Twapper Keeper is healthy or sick…

May 19, 2010

As of this evening, a new SYSTEM HEALTH page has been added to Twapper Keeper that provides our end users a way to see if the system is “healthy.”

This page provides visibility into the monitoring that is running in the back end to ensure that the numerous archiving / exporting processes are running smoothly.  (These are the same monitors that are used to contact me if there are any system issues.)

You can also look back over 7 days by clicking on the history link on each monitor.

NOTE:  While Twapper Keeper can typically “heal itself” by reaching back to find tweets in the Twitter Search cache as it continues to crawl, this page does provide a glimpse into potential short term blips in the system that may result in missed tweets for a time period.  However, if you see missed tweets I highly recommend “resetting the archive” to force a back search, and contacting us at support@twapperkeeper.com if you continue to see an issue.

(This enhancement aligns with one of the enhancements outlined in the Ops / Infrastructure work product https://twapperkeeper.wordpress.com/2010/04/27/planned-ops-infrastructure-enhancements/ and is based upon a request from Martin Hawksey https://twapperkeeper.wordpress.com/2010/04/19/user-enhancements-to-twapper-keeper/)

Now you can reset your archive!

May 14, 2010

One comment I often get from users is that an archive is missing tweets.

While I am confident we have started stabilizing in this area especially with the performance improvements of the Streaming API, there are still cases where the Twitter API could have glitched or we have a small blip and something can be missed.

To proactively find any missed tweets we run a background process against every archive hashtag/keyword/person on file periodically scanning the Twitter Search API and comparing against our database.

If we find a tweet we already have, we dump it.

If we find one we don’t, we store it in our database.
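In code, the comparison amounts to a membership check on tweet IDs. The following is only a sketch of that idea, assuming an in-memory set of stored IDs; the function and field names are hypothetical, and the real process works against the database.

```python
def merge_search_results(stored_ids, search_results):
    """Store tweets we do not have yet and drop the ones we already hold.

    Returns (found_again, newly_stored) counts for one crawl of one archive.
    """
    found_again = 0
    newly_stored = 0
    for tweet in search_results:
        if tweet["id"] in stored_ids:
            found_again += 1             # already archived: dump it
        else:
            stored_ids.add(tweet["id"])  # missed earlier: store it now
            newly_stored += 1
    return found_again, newly_stored


if __name__ == "__main__":
    stored = {101, 102}
    results = [{"id": 102}, {"id": 103}]
    print(merge_search_results(stored, results), sorted(stored))
```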

However, this can be a time consuming process since we have so many archives to crawl through.

Therefore, as of today we have introduced a small “Reset your archive” button on every archive page that will bump the priority of the archive so that it is crawled more quickly.

So if you see an archive missing tweets go ahead and press the button – and if you have any issues, contact us on our GetSatisfaction page.

Don’t forget though – if you can’t find the tweet in search.twitter.com, TwapperKeeper can’t find it either.

(NOTE:  This enhancement aligns with one of the enhancements outlined in the Ops / Infrastructure work product https://twapperkeeper.wordpress.com/2010/04/27/planned-ops-infrastructure-enhancements/)

Migration to OAuth

May 12, 2010

Over the coming weeks Twapper Keeper will be transitioning ALL of the backend archiving routines to authenticate via OAuth (vs. Basic Authentication, which Twitter is removing as of June 30 -> http://www.countdowntooauth.com/).

While TwapperKeeper only consumes “public” data, it still must authenticate to the Twitter API as a user to be allowed a higher than normal number of queries to the Twitter REST APIs, and thus must make this change to keep operating in the future.

As of this morning, the routines that archive persons have been migrated.  We will continue to monitor as these are rolled into production to minimize any impacts on end user archives.

[Update: 05/13/2010 – 06:20 am] Everything appears stable on the OAuth-based person archiving routines.  This morning the keyword / hashtag archiving routines were migrated.  This includes the archiving routines responsible for (1) initially filling an archive, when it is first created, with all tweets available in Twitter’s search cache and (2) continuing to crawl back through Twitter’s search cache to find any tweets potentially missed by the Streaming connection.

[Update: 05/14/2010 – 06:37 am] As of this morning, all of the remaining backend OAuth updates have been made in support of the various outbound messages Twapper Keeper sends out when archives / exports are created.  Everything continues to seem stable.

Let us know if you see any issues.

Plans for Updates to Twapper Keeper Functionality and APIs

May 11, 2010

We recently invited suggestions for developments to the end user functionality provided by Twapper Keeper and the APIs provided by the service (with some additional suggestions being made on the UK Web Focus blog).

We were pleased to receive a number of responses for developments to the service – and even more pleased that we will be able to implement the majority of the requests.  In brief the developments we will be implementing are:

  • Update / create new feeds to be compliant with Atom standard (requested by Andy Powell).
  • New API endpoints to align with Twitter standard input / output (requested by Andy Powell).
  • Advanced / refined search capabilities (requested by Martin Hawksey).
  • Additional filtering capabilities (requested by Tony Hirst and Kirsty McGill).
  • Process for user to opt out of archiving (requested by Liam Green-Hughes and Jeremy Ottevanger).
  • Additional opt in for @Person archives (requested by Jeremy Ottevanger).
  • Collection of archives (requested by Cameron Neylon).
  • Additional analytics, with integration with Eduserv’s Summarizr service (requested by Brian Kelly).
  • Tagging of archives (requested by Gary Green).
  • Infrastructure developments (including status alerts requested by Martin Hawksey).

There are two significant requests which we are unable to implement. Tony Hirst requested the ability to (1) archive a Twitter list and (2) grab snapshots of friends/followers of a set of individuals over a short period in order to chart how a network grows. The first request is in conflict with the need to provide control for an individual user over their entire Twitter stream (as described above). The second request is out-of-scope for the Twapper Keeper service, which has a core focus on the archiving of tweets rather than on monitoring trends for a Twitter community.

We appreciate the time spent by users of Twapper Keeper in letting us know of ways in which the service can be improved.  We hope that Tony Hirst will understand the reasons why we cannot implement his requests.

The following presentation outlines the details of the enhancements:

We welcome further feedback and comments.

Study of Missed Tweets

May 5, 2010

Archiving tweets via the API has always been a race against time, especially with high velocity hashtags / keywords that might pass through the Twitter /search API in record time.

I have always worked to include redundancy in the archiving process, especially as I released Version 2 which began to consume the streaming API from Twitter.

However, while the streaming API is great, gaps can be introduced during disconnects / reconnects, especially if there is latency in the data being transferred between Twitter and TwapperKeeper.  Basically, if the latency is high, data in flight could be lost.  (NOTE:  This latency was much more pronounced prior to last week’s backend post-processing changes, which reduced contention on the inbound stream table.  Basically there is only ‘seconds’ of latency now, where before this approached 30 minutes at times.)

As a result, TwapperKeeper still runs a “background” process that uses the REST /search API to check whether we have missed any tweets and then tries to fill in the holes.

However, one thing I have NEVER measured is “how often are we missing tweets?”

As I often get questions about missed tweets or comparisons with other archiving services, I wanted to start to get a handle on how often TwapperKeeper finds out it has missed something.

This will begin to give me a sense of other potential issues by leveraging this background process, which continues to compare “Twitter” to “TwapperKeeper,” and grabbing statistics off this process.

No doubt this isn’t perfect, as 1) some people can be removed by Twitter from the /search API at a moment’s notice for potential spam / low quality and 2) /search and /stream don’t always return the same tweets – but it should give me a sense of the percentage that gets missed.
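One simple way to express that percentage, assuming the background process counts how many tweets each /search crawl returned and how many of those were already archived, is sketched below. The function name and the choice of denominator are my own assumptions for illustration, not a finalized measurement method.

```python
def missed_percentage(found_by_search, already_archived):
    """Share of /search results that the archive did not hold before the crawl."""
    if not 0 <= already_archived <= found_by_search:
        raise ValueError("already_archived must be between 0 and found_by_search")
    if found_by_search == 0:
        return 0.0  # nothing returned by /search, so nothing measurably missed
    return 100.0 * (found_by_search - already_archived) / found_by_search


if __name__ == "__main__":
    # Example: a crawl returns 200 tweets and 190 were already in the archive.
    print(missed_percentage(200, 190))
```

Note that this only measures misses that /search can still see, so the two caveats above apply to the number it produces.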

Will publish findings as they become available.