# Bloom filters

Suppose Google Chrome's website blacklist (for malware sites) contains ten million URLs, averaging 100 characters in length. How much memory does it take to represent the blacklist? (8 Gbit = 1 GB.) How long does it take to look up an entry in the blacklist?

If we just store a sequence of entries, then lookup is a looong linear scan. What if we want fast lookups? We could use a hash table, which lets us look up a URL in constant time, and the size of the hash table will be about 1 GB.

1 GB is a lot to download and hold in memory, so in fact Google stores the hash table on a server, and browsers query it remotely. But that implies a network request and response to check the blacklist for each web address that we load, which makes each load slower. What can we do?

Store a *summary of the blacklist* on the client. We don't need to store any information with each URL, so imagine the following: make a hash table where each bucket is just a bit, which is set if occupied and clear if unoccupied. What's the problem with this approach? (False positives.)

False positives aren't necessarily fatal, though. We can look up a URL, and if its bit isn't set then we know it's not in the blacklist. If the bit is set, then we perform the remote query to confirm that the URL is in the blacklist. We still have to perform a remote query sometimes, but not in the common case, which is that the URL isn't on the blacklist.

What's the probability of a false positive? (It depends on how many bits we use.) Let *n* be the number of set elements. (In our example, *n* = 10,000,000.) Let *m* be the number of bits. Then we expect approximately *n/m* (ignoring collisions) of the bits to be set. So when we look up a URL that isn't in the set, the probability of a false positive is *n/m*.

In our example, let's say we want the probability to be 10%. Then we need *n/m* = 0.1, which means *m* = 100,000,000 bits = 12.5 MB. That's one-eightieth as much space as the whole blacklist. It's perfectly reasonable to download it, and we only need to go to the server for 10% of requests, which means 90% of requests are faster.

A Bloom filter does somewhat better than this by using multiple hash functions. Let the hash functions be *h1*, *h2*, etc. To add an element *s* to the set, we set the bits indicated by *h1(s)*, *h2(s)*, and so on. To look up an element, we check whether all the bits indicated by the hash functions are set. If all the bits are set, then the element is possibly (probably) in the set; if any of the bits is clear, then the element definitely isn't in the set. Note that we can't remove elements, because multiple elements may share some of the same bits.
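To make the add and lookup operations concrete, here is a minimal sketch of a Bloom filter in Python. It is illustrative only: the `BloomFilter` class, the salted-SHA-256 family of hash functions, and the example URLs are assumptions, not how Chrome actually does it; any *k* reasonably independent hash functions would work.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an array of m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, packed into bytes

    def _bit_positions(self, s):
        # Derive k bit positions h1(s), ..., hk(s) by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, s):
        # Set every bit indicated by the hash functions.
        for pos in self._bit_positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, s):
        # True  -> s is possibly (probably) in the set.
        # False -> s is definitely not in the set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._bit_positions(s))

# Tiny demo (not the 100,000,000-bit blacklist filter): 1000 bits, 7 hashes.
bf = BloomFilter(m=1000, k=7)
bf.add("http://malware.example.com/bad")
print(bf.might_contain("http://malware.example.com/bad"))   # True
print(bf.might_contain("http://example.com/fine"))          # almost surely False
```

With *k* = 1 this degenerates into the one-bit-per-bucket summary table described above.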
## The math

How likely are false positives when using multiple hash functions? Consider:

- The probability of one hash function setting one particular bit is 1/m. Equivalently, the probability of a bit not being set by that hash function is 1 - 1/m.
- If there are k hash functions, then the probability of a bit not being set is (1 - 1/m)^k.
- If we insert n elements, then the probability of a bit not being set is (1 - 1/m)^(kn), so the probability of a bit being set is 1 - (1 - 1/m)^(kn).
- Now suppose we look up an element that's not in the set. That means we check k bits, and we return true only if all k bits are set. So the probability that all k bits are set is [1 - (1 - 1/m)^(kn)]^k.

Let's try a particular instance of this with our example. So n = 10,000,000. Let's use m = 100,000,000 as before, and let the number of hash functions k = 7. Plugging in the numbers, we get about 0.008, or roughly 1%.

The optimal number of hash functions is k = (m/n) ln 2. In that case, 4.8 bits per entry gets us a false positive rate of 10%, 9.6 bits per entry gets us a false positive rate of 1%, and so on: every additional 4.8 bits per entry eliminates 90% of the remaining false positives.
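As a quick numerical check, here is a short Python sketch (an illustration, not a definitive implementation) that evaluates the false positive formula above for the example numbers and the bits-per-entry rule of thumb; the constant 0.6185 is (1/2)^(ln 2), the approximate per-bit-per-entry factor at the optimal k.

```python
import math

def false_positive_rate(n, m, k):
    """Probability that all k probed bits are set when looking up an element
    that was never inserted: [1 - (1 - 1/m)^(kn)]^k."""
    return (1 - (1 - 1/m) ** (k * n)) ** k

n = 10_000_000                    # elements in the set (blacklisted URLs)
m = 100_000_000                   # bits in the filter (10 bits per entry)
k = round((m / n) * math.log(2))  # optimal number of hash functions -> 7

print(k, false_positive_rate(n, m, k))  # 7 0.008... -- about 1%

# With the optimal k, the rate is roughly (1/2)^k = 0.6185^(m/n), so every
# additional ~4.8 bits per entry removes 90% of the remaining false positives.
for bits_per_entry in (4.8, 9.6, 14.4):
    print(bits_per_entry, 0.6185 ** bits_per_entry)  # ~0.1, ~0.01, ~0.001
```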