Ironically, we released the news about the partnership on April 16, 2010, which happened to be the same week as Twitter’s Chirp conference, where they announced that the Library of Congress (LoC) and Google would have archives available.
Initially this looked like a potential setback to the partnership, but after discussions we felt it was important to press on: there was still a desire for crowdsourced tweet archiving, and the capabilities of, and access to, the LoC and Google archives were unclear. (Even today they remain unclear.)
After we announced the partnership in April, my initial focus was on stabilizing the platform. I had released Version 2 of the platform a month earlier, which expanded the archiving capabilities from #hashtags to also include keywords and @person timelines, and we were growing like crazy.
To get a sense of the growth during that period, from March to April the volume of tweets on file doubled from 50 million to 100 million.
Plans to implement a larger VM were set in motion, and an additional VM was procured to act as a hot backup. All was good for a short period of time.
Unfortunately, the volume of tweets continued to grow, and the growing load on the host’s “shared” VMs became a point of contention between the host and me, resulting in back-end archiving processes being shut down on occasion.
I’ll be honest: my plans to use VMs quickly proved foolish, and over the last 6 months I have been in an ongoing battle with growth, new servers, tuning, and infrastructure changes.
As a result, TwapperKeeper has gone through many infrastructure changes, including migrating from a small VM to a larger VM, migrating to a single dedicated box, and migrating to its current state of two dedicated boxes (with a third just around the corner).
And while this was a struggle that included many sleepless nights, I am happy to announce that the application’s architecture has now been refactored to take advantage of multiple servers. We can now grow horizontally across any number of database servers, which is increasingly important for supporting the ever-growing number of archives.
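As a rough illustration only (this is not the actual TwapperKeeper code, and the post does not describe how archives are assigned to servers), spreading archives across N database servers could be sketched by hashing each archive's identifier. The server names and the `server_for_archive` helper below are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical server names; the real hosts are not named in the post.
DB_SERVERS = ["db1", "db2", "db3"]


def server_for_archive(archive_id: str, servers: Sequence[str]) -> str:
    """Pick a database server for an archive by hashing its identifier.

    The same archive_id always maps to the same server for a given list.
    """
    if not servers:
        raise ValueError("at least one database server is required")
    digest = hashlib.sha1(archive_id.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


if __name__ == "__main__":
    for archive in ["#jisc", "@someuser", "keyword:archiving"]:
        print(archive, "->", server_for_archive(archive, DB_SERVERS))
```

One caveat with this simple modulo scheme is that adding a server changes where most existing archives map, so a real deployment would more likely keep an explicit archive-to-server lookup table or use consistent hashing.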
Capability / Evolution
Following the partnership announcement, Brian began to solicit feedback from the Higher Education (HE) community to gather input on evolving the capabilities of the TwapperKeeper platform.
Enhancement requests predominantly centered on two areas: improving and standardizing the API / RSS endpoints so that 3rd party application developers (such as Andy Powell’s Eduserv Summarizr and Martin Hawksey’s iTitle) can tap the TwapperKeeper archives, and increasing the ability of end users to filter and view tweet archives and to group archives together into collections.
Enhancements were rolled into production on an ongoing basis during the last 6 months and continue to be tweaked as issues and bug fixes are raised by the HE community.
One request that caught us all off guard was around privacy – where users were concerned about their public tweets being archived.
The discussion that followed on user privacy rights resulted in two important enhancements: 1) restricting @person public timeline archiving to only the person who owns the timeline and 2) allowing users to opt out of archiving.
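The two privacy rules can be pictured as a pair of simple checks. This is a minimal sketch under stated assumptions, not the service's real implementation: the function names, the handle normalization, and the in-memory opt-out set are all hypothetical.

```python
from typing import Set


def normalize(handle: str) -> str:
    """Compare Twitter handles without the leading '@' and ignoring case."""
    return handle.lstrip("@").lower()


def can_create_person_archive(requester: str, timeline_owner: str) -> bool:
    """Rule 1: only the owner of a timeline may archive that @person timeline."""
    return normalize(requester) == normalize(timeline_owner)


def should_store_tweet(author: str, opted_out: Set[str]) -> bool:
    """Rule 2: skip tweets from users who have opted out of archiving.

    `opted_out` is assumed to hold already-normalized handles.
    """
    return normalize(author) not in opted_out


if __name__ == "__main__":
    opted_out = {"privateperson"}
    print(can_create_person_archive("@alice", "Alice"))       # owner request
    print(can_create_person_archive("@bob", "@alice"))        # third-party request
    print(should_store_tweet("@PrivatePerson", opted_out))    # opted-out author
```

In this sketch the opt-out check runs per tweet at storage time, so it would apply to hashtag and keyword archives as well as timelines.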
The findings from this discussion were also presented at iPres 2010 in the paper Twitter Archiving Using Twapper Keeper: Technical and Policy Challenges.
Sustainability / Openness
During our initial partnership discussions, David, Brian, and I talked about open sourcing part of the platform. Initially I was hesitant, and committed only to outlining, at a minimum, the strategy TwapperKeeper uses to archive tweets (a hybrid approach of crawling and tweet stream ingestion / processing).
As the 6 month period continued, I came to the realization that the TwapperKeeper.com service cannot be the only archiving platform, especially in special cases such as when people want quicker archiving times, and I knew the right direction was to release the code.
Therefore, I took the best pieces of TwapperKeeper and rewrote them from the ground up into a simple, self-managed web application.
The application, yourTwapperKeeper, was released along with its code on August 25, 2010, and to date over 100 people have downloaded it.
As a result of the release, we caught the attention of Ross Gardler at OSS Watch who had some important advice on how best to license and manage the project to facilitate traction and growth. We are now working with his team to ensure we have all of the appropriate licensing and governance models – so that yourTwapperKeeper can grow and continue to be used by the HE community in other projects.
Where do we go from here?
The huge growth over the last 6 months means we have now grown from 150 MILLION to 1+ BILLION tweets on file. The server infrastructure is becoming increasingly complex and costly. We constantly have to battle and tune around Twitter API rate limits to try to get the tweets requested by users. Frankly, the 24/7 maintenance, operations, and support is really too much for one person.
So with that said, how do we keep the primary TwapperKeeper.com service sustainable?
I know this has been a concern of JISC and UKOLN leadership from the beginning, and it is also important now as the partnership winds down.
It concerns me as well – and means we have to start to monetize.
Whilst the service will continue to be free to the HE sector, new paid-for services will begin to be rolled out to others.
Premium services being considered include 1) sponsored archives that will have priority archiving processes and increased reach-back capability, 2) charging a small fee for exports, and 3) possibly charging for access to API endpoints.
In closing, I want to once again thank David, Brian, and everyone in the HE community who participated by providing input, testing, bug fixing, etc.
As a team we have taken TwapperKeeper.com and social media archiving a step forward and have set the foundation for the future growth of the service, both in TwapperKeeper.com and yourTwapperKeeper – and without your help this would not have been possible.