Archiving tweets via the API has always been a race against time, especially with high-velocity hashtags / keywords that might bypass the Twitter /search API in record time.
I have always worked to include redundancy in the archiving process, especially since releasing Version 2, which began consuming the streaming API from Twitter.
However, while the streaming API is great, gaps can be introduced during disconnects / reconnects, especially if there is latency in the data being transferred between Twitter and TwapperKeeper. Basically, if the latency is high, data in flight could be lost. (NOTE: This latency was much more pronounced prior to last week's backend post-processing changes, which now reduce contention on the inbound stream table. Basically, there is only ‘seconds’ of latency now, where before it approached 30 minutes at times.)
As a result, TwapperKeeper still runs a “background” process that uses the REST /search API to check whether we have missed any tweets and then tries to fill in the holes.
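To make the idea concrete, here is a minimal sketch of that hole-finding check. It assumes tweets are identified by numeric IDs and that both the /search results and the archived IDs are already in hand. The function name and the sample IDs are hypothetical and for illustration only; this is not TwapperKeeper's actual code.

```python
def find_missing_ids(search_ids, archived_ids):
    """Return IDs present in the /search results but absent from the archive."""
    # Set difference keeps only the IDs the archive never received.
    return sorted(set(search_ids) - set(archived_ids))


# Hypothetical sample data standing in for a /search response and the archive table.
search_ids = [101, 102, 103, 104, 105]
archived_ids = [101, 102, 104]

# These are the "holes" a backfill step would then try to fetch and insert.
print(find_missing_ids(search_ids, archived_ids))
```

Anything the function returns is a candidate for backfilling; an empty list means /search showed nothing the archive did not already have.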
However, one thing I have NEVER measured is “how often are we missing tweets?”
As I often get questions about missed tweets or comparisons with other archiving services, I wanted to start getting a handle on how often TwapperKeeper finds out it has missed something.
By grabbing statistics off this background process, which continuously compares “Twitter” to “TwapperKeeper”, I should also begin to get a sense of other potential issues.
No doubt this isn’t perfect, as 1) some people can be removed by Twitter from the /search API at a moment’s notice for potential spam / low quality and 2) /search and /stream don’t always return the same tweets, but it should give me a sense of the percentage that get missed.
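One way the percentage could be computed is sketched below. It assumes the denominator is the set of tweets /search returned for a given check and the numerator is the subset the archive did not already hold. The function name and that choice of denominator are my assumptions for illustration, not a description of the real implementation.

```python
def miss_rate(search_ids, archived_ids):
    """Percentage of /search tweets that were not yet in the archive."""
    seen = set(search_ids)
    if not seen:
        # Nothing returned by /search, so there is nothing to have missed.
        return 0.0
    missed = seen - set(archived_ids)
    return 100.0 * len(missed) / len(seen)


# Hypothetical check: /search returned four tweets, the archive already had three.
print(miss_rate([1, 2, 3, 4], [1, 2, 3]))
```

Note that tweets sitting in the archive but missing from /search (for example, accounts Twitter has since filtered out) do not affect this number, which matches the first caveat above.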
Will publish findings as they become available.