Archive for October, 2010

Help – please log in to TwapperKeeper!

October 15, 2010

As noted yesterday, we are beginning to use the rate limits of the owner of each archive to help us increase our ability to query Twitter for tweets in our crawling process.  This crawling process is important as it also prioritizes which tweets are being watched by our streaming connection.

If you have created an archive and have not logged into the site since Oct 15, we ask that you log in at least once so we can properly capture your login information (of course, just the OAuth tokens, per Twitter security)!

Your help is much appreciated!   And if you have any questions please contact us at

We need your rates and your help if you see missing tweets.

October 14, 2010

In the coming days we are going to be testing some new archiving processes that leverage the rate limit of the user who created each archive, rather than the single account TwapperKeeper has historically used.

Therefore, we are going to start storing your OAuth tokens (most 3rd party apps do this; we simply haven’t to date) and leverage your credentials to crawl your personal timeline (if you have an @person archive) or to search the hashtags / keywords of the archives you have created.

This is in response to increased rate limiting we are seeing – and will be tested periodically for feasibility.
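The per-owner approach described above can be sketched as a simple scheduler: each crawl is charged against the archive owner's own hourly budget instead of one shared account, so adding users adds capacity. This is an illustrative sketch, not TwapperKeeper's actual code — the class, the round-robin policy, and the 350-calls-per-hour figure (Twitter's per-user OAuth limit at the time) are assumptions for demonstration:

```python
from collections import defaultdict

# Illustrative per-user limit: Twitter allowed roughly 350
# OAuth-authenticated REST calls per hour per user at the time.
CALLS_PER_HOUR = 350

class CrawlScheduler:
    """Charge each archive crawl against its owner's own rate budget."""

    def __init__(self, calls_per_hour=CALLS_PER_HOUR):
        self.calls_per_hour = calls_per_hour
        self.used = defaultdict(int)   # owner -> calls used this hour
        self.archives = []             # (owner, archive_name) pairs

    def add_archive(self, owner, archive_name):
        self.archives.append((owner, archive_name))

    def next_crawl(self):
        """Return the next (owner, archive) whose owner still has budget."""
        for owner, archive in self.archives:
            if self.used[owner] < self.calls_per_hour:
                self.used[owner] += 1
                # Rotate this archive to the back so others get a turn.
                self.archives.remove((owner, archive))
                self.archives.append((owner, archive))
                return owner, archive
        return None  # every owner is rate-limited; wait for the next window

    def reset_hour(self):
        """Twitter's rate windows reset hourly; clear the counters."""
        self.used.clear()
```

With two owners and a budget of two calls each, the scheduler alternates between their archives until both budgets are exhausted, then returns `None` until the hourly reset.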

Also, if you see tweets missing from your archives, let us know ASAP at so we can try to fix them before they disappear from Twitter search.  Time is of the essence when it comes to missing tweets.

6 months later – and we are a very different TwapperKeeper

October 13, 2010

In this blog post, I want to take a few moments and share some of the highlights of our partnership with JISC and UKOLN and set the stage for where we are going in the future.

First, it is hard to believe that it was only 8 months ago that I attended dev8D and met with David Flanders (JISC) and Brian Kelly (UKOLN) to discuss a potential partnership – as I feel like they have been helping guide TwapperKeeper from the very beginning.

During that event we laid the plans for a JISC / UKOLN partnership and drafted a ~6 month schedule that focused on 1) stabilization, 2) capability evolution, and 3) sustainability / openness of the platform.

As it happened, we released the news about the partnership on April 16, 2010 – the same week as Twitter’s Chirp conference, at which they announced that the Library of Congress (LoC) and Google would have archives available.

Initially this looked like a potential setback to the partnership, but after discussion we felt it was important to press on, since there was still a desire for crowd-sourced tweet archiving and the capabilities of, and access to, the LoC and Google archives were unclear.  (Even today they seem unclear.)


After announcing the partnership in April, my initial focus was on stabilizing the platform.  I had released Version 2 of the platform just a month earlier – expanding the archiving capabilities from #hashtags to also include keywords and @person timelines – and we were growing like crazy.

To get a sense of the growth during that period, from March to April the volume of tweets on file doubled from 50 million to 100 million.

Plans to implement a larger VM were set in motion, and an additional VM was procured to act as a hot backup.  All was good for a short period of time.

Unfortunately, the volume of tweets continued to grow, and the growing load on the host’s “shared” VMs became a point of contention between the host and me, resulting in back-end archiving processes being shut down on occasion.

I’ll be honest: my plan to rely on VMs quickly proved foolish – and over the last 6 months I have been in an ongoing battle with growth, new servers, tuning, and infrastructure changes.

As a result, TwapperKeeper has made many infrastructure changes: migrating from a small VM to a larger VM, then to a single dedicated box, and finally to its current state of two dedicated boxes (with a third just around the corner).

And while this was a struggle that included many sleepless nights, I am happy to announce that the application’s architecture has now been refactored to take advantage of the multiple servers.  We can now grow horizontally across N database servers, which is increasingly important to support the ever-growing number of archives.
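One common way to spread archives across N database servers is deterministic hashing of the archive name, so every front-end node agrees on where an archive lives without a central lookup table. A minimal sketch under that assumption (the server names are hypothetical, and this is not necessarily how TwapperKeeper routes its data):

```python
import hashlib

# Hypothetical pool of database servers; a new box is appended to grow.
DB_SERVERS = ["db1.example.com", "db2.example.com"]

def shard_for(archive_name, servers=DB_SERVERS):
    """Deterministically map an archive to one database server.

    Hashing the (case-normalized) archive name yields the same shard
    for every caller, so no central index is needed for lookups.
    """
    digest = hashlib.md5(archive_name.lower().encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

A caveat with plain modulo hashing: adding a server remaps most archives, so data must be migrated; consistent hashing reduces that churn, at the cost of a slightly more involved lookup.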

Capability Evolution

Following the partnership announcement, Brian began to solicit feedback from the Higher Education (HE) community to gain input on evolving the capabilities of the TwapperKeeper platform.

Enhancement requests predominantly centered on two areas: improving and standardizing the API / RSS endpoints so that 3rd party application developers (such as Andy Powell’s Eduserv Summarizr and Martin Hawksey’s iTitle) can tap the TwapperKeeper archives, and improving end users’ ability to filter and view tweet archives and to group archives together into collections.

Enhancements were rolled into production on an ongoing basis during the last 6 months and continue to be tweaked as issues and bug reports are raised by the HE community.

One request that caught us all off guard was around privacy: some users were concerned about their public tweets being archived.

The discussion that followed on user privacy rights resulted in two important enhancements: 1) restricting @person public timeline archiving to the person who owns the timeline, and 2) allowing users to opt out of archiving.
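Those two rules amount to a small permission check at archive-creation time. The sketch below is illustrative only — the function and field names are invented, not TwapperKeeper's actual code:

```python
def may_create_archive(archive_type, target, requesting_user, opted_out):
    """Apply the two privacy rules described above.

    archive_type    -- "person", "hashtag", or "keyword"
    target          -- the timeline, hashtag, or keyword to archive
    requesting_user -- screen_name of the logged-in user
    opted_out       -- set of screen_names who opted out of archiving
    """
    if archive_type == "person":
        owner = target.lstrip("@").lower()
        # Rule 2: respect opt-outs, even for the owner themselves.
        if owner in opted_out:
            return False
        # Rule 1: only the timeline's owner may create an @person archive.
        return owner == requesting_user.lower()
    # Hashtag and keyword archives remain open to everyone.
    return True
```

Note the design choice here: an opt-out blocks @person archiving outright, while @person archives of non-opted-out timelines are limited to their owners.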

The findings from this discussion were also presented at iPres 2010 in the paper Twitter Archiving Using Twapper Keeper: Technical and Policy Challenges.

Sustainability / Openness

During our initial partnership discussions, David, Brian, and I talked about open sourcing part of the platform.  Initially I was hesitant, committing at a minimum to outlining the strategy by which TwapperKeeper archives tweets (a hybrid approach of crawling plus tweet stream ingestion / processing).

As the 6 month period continued, I came to the realization that this service cannot be the only archiving platform – especially in cases where people want quicker archiving times – and I knew the right direction was releasing the code.

Therefore, I took the best pieces of TwapperKeeper and rewrote them from the ground up into a simple, self-managed web application.

yourTwapperKeeper and its code were released on August 25, 2010, and to date over 100 people have downloaded the application.

As a result of the release, we caught the attention of Ross Gardler at OSS Watch who had some important advice on how best to license and manage the project to facilitate traction and growth.   We are now working with his team to ensure we have all of the appropriate licensing and governance models – so that yourTwapperKeeper can grow and continue to be used by the HE community in other projects.

Where do we go from here?

The huge growth over the last 6 months means we have gone from 150 MILLION to 1+ BILLION tweets on file.  The server infrastructure is becoming increasingly complex and costly.  We have to constantly battle and tune around Twitter API rate limits to try to get the tweets requested by users.  Frankly, the 24/7 maintenance, operations, and support is really too much for one person.

So with that said, how do we keep the primary service sustainable?

I know this has been a concern of JISC and UKOLN leadership from the beginning – and it is also important now as the partnership winds down.

It concerns me as well – and means we have to start to monetize.

Whilst the service will continue to be free to the HE sector, new paid-for services will begin to be rolled out to others.

Premium services being considered include 1) sponsored archives with priority archiving processes and increased reach-back capability, 2) a small fee for exports, and 3) possibly charging for access to API endpoints.


In closing I want to once again thank David, Brian, and all of the HE community that participated in input, testing, bug fixing, etc.

As a team we have taken social media archiving a step forward and have set the foundation for the future growth of the service, both in the hosted platform and in yourTwapperKeeper – and without your help this would not have been possible.

– John

1 BILLION Tweets Archived!

October 8, 2010

Back in June 2009, when I started TwapperKeeper as a fun weekend hack, I had no idea that 16 months later I would still be running the site.

Heck, I was just having some fun with the Twitter API while my family was out of town.

However, the last 16 months have been crazy.  User demand for raw Twitter data sets has continued to increase.  Major world events were captured, exported, and analyzed.  Small businesses, major brands, market researchers, conference leaders, and academia continued to ask for more data.  And the data sets continued to grow…

And as of this morning I am proud to announce that TwapperKeeper  has passed the  1 BILLION TWEET milestone!

All I can say is WOW!

I want to thank our users for your continued support, praise, suggestions, and thanks – as your feedback helps drive us to be better.

I also want to sincerely thank JISC and UKOLN for stepping in and partnering with TwapperKeeper to help drive growth, enhancement, and sustainability of the platform.  Their support came at a critical time and has been instrumental in our ability to support the heavy growth in users and archives over the last 6 months!

So now the question is… when do we hit 10 BILLION? 🙂

Should TwapperKeeper request READ / WRITE access?

October 7, 2010

Recently, members of the UK HE community asked why TwapperKeeper was requesting READ / WRITE access when they were simply trying to opt out of being archived.

Recognizing that many people are concerned about 3rd party web applications having WRITE access to their Twitter accounts, we paused to think a little about our implementation of OAuth – and why we were asking for READ / WRITE vs. simply READ.

Stepping back in history, early users of TwapperKeeper may remember that originally we did not require logins to create archives.  This was done to allow for a frictionless way of creating archives, and minimized the need to create an account management solution or try to validate users with basic auth (pre-OAuth days).  It simply wasn’t important to know who the user was.

However, as we began to roll out new features in partnership with JISC and UKOLN, it became more and more important to identify the user (for example, to confirm who was creating an @person archive, or who was requesting to opt out of archiving).

That led us to implement an application-wide OAuth login for the user – which we happened to set to a READ / WRITE request.

However, upon review of our current features, we are technically only using OAuth to identify the user (i.e., to get your screen_name) – and we never use the OAuth tokens to READ or WRITE anything on behalf of the user (honestly, we don’t even store the tokens, because we don’t need them at this time).

Therefore, we have decided to go ahead and reduce the permission to READ at this time.  (The lowest level.)
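At the protocol level this change is small: Twitter lets an application request a lower access level per request-token call via the `x_auth_access_type` parameter on `oauth/request_token`. A minimal sketch of building that request URL with the Python standard library (OAuth signing and tokens are omitted for brevity; only the access-type parameter is shown):

```python
from urllib.parse import urlencode

REQUEST_TOKEN_URL = "https://api.twitter.com/oauth/request_token"

def request_token_url(access_type="read"):
    """Build a request-token URL asking for a specific access level.

    Passing x_auth_access_type=read caps the resulting token at READ,
    even if the application is registered as READ / WRITE.
    """
    if access_type not in ("read", "write"):
        raise ValueError("access_type must be 'read' or 'write'")
    return REQUEST_TOKEN_URL + "?" + urlencode({"x_auth_access_type": access_type})
```

Because the cap is applied per token request, an application can default to READ and only escalate to WRITE for the specific flows that genuinely need it.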

In the future, this decision may need to be revisited as features requiring WRITE access are added (for instance, the cool features of Twitter @anywhere require READ / WRITE to be turned on) – but it is best to keep the permission at the lowest level required for operations at this time.