SpaceSaving algorithm (heavy hitters): computing real-time data in streams

I recently discovered the SpaceSaving algorithm, which is very interesting.

The purpose of this algorithm is to approximate a computation over an infinite stream using a data structure of finite size. The typical problem it solves is counting the number of occurrences of items coming from an infinite stream.

The obvious approach is to keep a map of (item, count) pairs somewhere (memory, a database) and to increment the count for each incoming item (or create the entry if it is not present).

Depending on the distribution, the map can grow as big as the number of distinct items in the feed (so, potentially unbounded!).
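For reference, the naive exact approach is just a plain Hash (a quick sketch; the small stream here is made up):

```ruby
# Naive exact counting: one hash entry per distinct item,
# so memory grows with the number of distinct items.
stream = %w[a b a c a b d]
counts = Hash.new(0)
stream.each { |item| counts[item] += 1 }
counts # => {"a"=>3, "b"=>2, "c"=>1, "d"=>1}
```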

Thanks to this heavy hitter algorithm, you maintain only a small subset of items, with an error count associated with each of them.

If an item is not yet present in the map (and the map is full), you remove the item with the minimum count, replace it with the new one, and set the new item's error count to the total count of the evicted one.

So basically, an item can appear in the map but with a high error count.

The quality of the result depends on the chosen map size and on the distribution of the data, but I wanted to try it myself, so I wrote a small implementation in Ruby and a sample using the Twitter real-time stream to count URL occurrences in the feed.

class SpaceSaving
  attr_accessor :counts, :errors

  def initialize(in_max_entries = 1_000)
    @in_max_entries = in_max_entries
    @counts = {}
    @errors = {}
  end

  # Add a value to the map, evicting the current minimum when the map is full
  def add_entry(value, inc_count = 1)
    count = @counts[value]
    if count.nil? # new entry
      if @counts.size >= @in_max_entries
        min = @counts.values.min
        old = @counts.key(min)
        @counts.delete(old)
        @errors.delete(old)
        @errors[value] = min # the new entry may overestimate by up to min
        count = min
      else
        @errors[value] = 0
        count = 0
      end
    end
    @counts[value] = count + inc_count
  end
end
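Here is how the class behaves with a tiny map of size 2 (a self-contained sketch: the fence repeats a condensed copy of the class, with errors initialised to 0 for fresh entries, so the eviction and error bookkeeping are visible):

```ruby
# Condensed copy of the SpaceSaving class, for a self-contained example.
class SpaceSaving
  attr_accessor :counts, :errors

  def initialize(in_max_entries = 1_000)
    @in_max_entries = in_max_entries
    @counts = {}
    @errors = {}
  end

  def add_entry(value, inc_count = 1)
    count = @counts[value]
    if count.nil? # new entry
      if @counts.size >= @in_max_entries
        min = @counts.values.min
        old = @counts.key(min)
        @counts.delete(old)
        @errors.delete(old)
        @errors[value] = min
        count = min
      else
        @errors[value] = 0
        count = 0
      end
    end
    @counts[value] = count + inc_count
  end
end

# A map of size 2: the third distinct item evicts the current minimum.
ss = SpaceSaving.new(2)
ss.add_entry('a') # counts: {"a"=>1}
ss.add_entry('a') # counts: {"a"=>2}
ss.add_entry('b') # counts: {"a"=>2, "b"=>1}
ss.add_entry('c') # "b" (min = 1) is evicted; "c" starts at 1 + 1 = 2, error 1
ss.counts # => {"a"=>2, "c"=>2}
ss.errors # => {"a"=>0, "c"=>1}
```

Note how "c" is credited with "b"'s whole count: its real count is 1, but the stored count is 2 and the error of 1 records exactly how much of an overestimate is possible.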

I just wanted to keep it as simple as possible for now, but in a real-world example, it would be good to encapsulate access to the @counts field.
By default, the size of the map is set to 1,000, but this can be modified.
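For instance, a hypothetical `top` method (the name and shape are my own, not part of the repo) could hide the internal hashes behind a single query:

```ruby
class SpaceSaving
  # Hypothetical accessor: the k largest entries as [item, count, error] rows,
  # so callers never touch @counts or @errors directly.
  def top(k = 10)
    @counts.sort_by { |_item, count| -count }
           .first(k)
           .map { |item, count| [item, count, @errors[item] || 0] }
  end
end
```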

And here is sample.rb:

require 'bundler/setup'
require 'tweetstream'
require './SpaceSaving.rb'

TweetStream.configure do |config|
  config.consumer_key       = <CONSUMER_KEY>
  config.consumer_secret    = <CONSUMER_SECRET>
  config.oauth_token        = <OAUTH_TOKEN>
  config.oauth_token_secret = <OAUTH_SECRET>
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new

urls=SpaceSaving.new

total=0

@client.track('http') do |status|
  begin
    # add each URL in the tweet, and count it once
    status.urls.each do |url|
      urls.add_entry(url.expanded_url, 1)
    end
    total += 1
    if total % 1000 == 0
      puts "Treated:#{total}"
      # print roughly the top 30 entries; the guard handles fewer than 30
      limit = urls.counts.values.sort[-30] || 0
      urls.counts.each do |u, c|
        if c > limit
          puts "Counts #{c} #{u} errors:#{urls.errors[u]}"
        end
      end
    end
  rescue => e
    puts "Error: #{e}"
  end
end

The complete sourcecode is available on Github: https://github.com/tomsoft1/SpaceSaving

So I ran a few tests and comparisons against an exact approach, and the top 20 results were similar, even after 600,000 tweets parsed, with a "small" map of 1,000 entries compared to around 300,000 entries for the exact algorithm.
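That kind of comparison is easy to reproduce on synthetic skewed data (a sketch; the condensed class copy and the Zipf-like generator below are my own, not part of the repo, and the exact overlap depends on the random seed):

```ruby
# Condensed copy of the SpaceSaving class, for a self-contained comparison.
class SpaceSaving
  attr_reader :counts, :errors

  def initialize(max = 1_000)
    @max = max
    @counts = {}
    @errors = {}
  end

  def add_entry(value, inc = 1)
    count = @counts[value]
    if count.nil?
      if @counts.size >= @max
        min = @counts.values.min
        old = @counts.key(min)
        @counts.delete(old)
        @errors.delete(old)
        @errors[value] = min
        count = min
      else
        @errors[value] = 0
        count = 0
      end
    end
    @counts[value] = count + inc
  end
end

# Skewed synthetic stream: item i is drawn with probability proportional to 1/i.
rng = Random.new(42)
weights = (1..500).map { |i| 1.0 / i }
total_w = weights.sum
stream = Array.new(50_000) do
  r = rng.rand * total_w
  i = 0
  while i < weights.size - 1 && r > weights[i]
    r -= weights[i]
    i += 1
  end
  i
end

# Exact counts (the hash grows with the number of distinct items)...
exact = Hash.new(0)
stream.each { |x| exact[x] += 1 }

# ...versus a SpaceSaving map capped at 50 entries.
ss = SpaceSaving.new(50)
stream.each { |x| ss.add_entry(x) }

top_exact = exact.sort_by { |_, c| -c }.first(10).map(&:first)
top_ss    = ss.counts.sort_by { |_, c| -c }.first(10).map(&:first)
puts "exact top 10:       #{top_exact.inspect}"
puts "SpaceSaving top 10: #{top_ss.inspect}"
```

On a skewed distribution like this, the heaviest items agree while the SpaceSaving map stays at a fixed 50 entries; only the borderline ranks near the bottom of the top 10 may differ, which matches the behavior I saw on the Twitter feed.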