bup-margin - figure out your deduplication safety margin

bup margin [options...]

`bup margin` iterates through all objects in your bup repository, calculating the largest number of prefix bits shared between any two entries. This number, `n`, identifies the longest subset of SHA-1 you could use and still encounter a collision between your object ids.

For example, one system that was tested had a collection of 11 million objects (70 GB), and `bup margin` returned 45. That means a 46-bit hash would be sufficient to avoid all collisions among that set of objects; each object in that repository could be uniquely identified by its first 46 bits.
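The measurement described above can be sketched in a few lines of Python. This is only an illustration, not bup's actual implementation: the helper names `shared_prefix_bits` and `longest_shared_prefix` are hypothetical, and the sample ids are generated on the spot rather than read from a repository. It relies on the fact that, once the ids are sorted, the pair sharing the longest prefix is always adjacent.

```python
import hashlib


def shared_prefix_bits(a: bytes, b: bytes) -> int:
    """Count the leading bits that two equal-length byte strings share."""
    bits = 0
    for x, y in zip(a, b):
        diff = x ^ y
        if diff == 0:
            bits += 8
            continue
        # 8 - bit_length() is the number of leading zero bits in the XOR byte.
        bits += 8 - diff.bit_length()
        break
    return bits


def longest_shared_prefix(ids) -> int:
    """Largest number of prefix bits shared by any two distinct ids."""
    ordered = sorted(set(ids))  # drop duplicates, then sort
    return max(
        (shared_prefix_bits(a, b) for a, b in zip(ordered, ordered[1:])),
        default=0,
    )


# Hypothetical sample data: SHA-1 digests of small strings.
sample_ids = [hashlib.sha1(str(i).encode()).digest() for i in range(1000)]
n = longest_shared_prefix(sample_ids)
print(n, "matching prefix bits;", n + 1, "bits would identify every object")
```

With `n` shared bits, `n + 1` bits are enough to tell every object in the set apart, which is how 45 becomes 46 in the example above.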

The number of bits needed seems to increase by about 1 or 2 for every doubling of the number of objects. Since SHA-1 hashes have 160 bits, the example above leaves 115 bits of margin (160 - 45). Of course, because SHA-1 hashes are essentially random, it's theoretically possible to use many more bits with far fewer objects.

If you're paranoid about the possibility of SHA-1 collisions, you can monitor your repository by running `bup margin` occasionally to see if you're getting dangerously close to 160 bits.

- --predict
- Guess the offset into each index file where a particular object will appear, and report the maximum deviation of the correct answer from the guess. This is potentially useful for tuning an interpolation search algorithm.
- --ignore-midx
- Don't use `.midx` files; use only `.idx` files. This is only really useful when used with `--predict`.
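To make the idea behind `--predict` concrete, here is a rough sketch of an interpolation guess. It is an assumption about the general technique, not a copy of bup's code: because SHA-1 ids are roughly uniformly distributed, the leading bytes of an id, read as a fraction of the hash range, suggest where it should sit in a sorted index. The function name `max_prediction_error` and the sample ids are hypothetical.

```python
import hashlib


def max_prediction_error(sorted_ids) -> int:
    """Largest distance between an id's guessed and actual index position."""
    total = len(sorted_ids)
    worst = 0
    for actual, oid in enumerate(sorted_ids):
        # Read the first 8 bytes as a fraction of the full hash range.
        fraction = int.from_bytes(oid[:8], "big") / 2**64
        guess = int(fraction * total)
        worst = max(worst, abs(actual - guess))
    return worst


# Hypothetical sample data standing in for the contents of an index file.
ids = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(10000))
worst = max_prediction_error(ids)
print(f"{worst} of {len(ids)} ({100 * worst / len(ids):.3f}%)")
```

A small maximum deviation means an interpolation search could start very close to the right position and would only need to examine a narrow window around its guess.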

```
$ bup margin
Reading indexes: 100.00% (1612581/1612581), done.
40
40 matching prefix bits
1.94 bits per doubling
120 bits (61.86 doublings) remaining
4.19338e+18 times larger is possible
Everyone on earth could have 625878182 data sets
like yours, all in one repository, and we would
expect 1 object collision.
$ bup margin --predict
PackIdxList: using 1 index.
Reading indexes: 100.00% (1612581/1612581), done.
915 of 1612581 (0.057%)
```
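The figures in the first sample run can be reproduced with some simple arithmetic. The sketch below is a reconstruction from the numbers shown, not bup's source: the function name `margin_summary` is hypothetical, and the world population of 6.7e9 is an assumption inferred from the ratio between the last two figures in the output.

```python
import math


def margin_summary(n_objects, matching_bits, hash_bits=160, population=6.7e9):
    """Derive the margin figures from an object count and a prefix-bit count."""
    doublings_so_far = math.log2(n_objects)
    bits_per_doubling = matching_bits / doublings_so_far
    remaining_bits = hash_bits - matching_bits
    remaining_doublings = remaining_bits / bits_per_doubling
    growth_factor = 2 ** remaining_doublings
    return {
        "bits_per_doubling": bits_per_doubling,
        "remaining_bits": remaining_bits,
        "remaining_doublings": remaining_doublings,
        "growth_factor": growth_factor,
        # Assumed population figure; see the note above.
        "data_sets_per_person": growth_factor / population,
    }


# Values taken from the sample run: 1612581 objects, 40 matching prefix bits.
summary = margin_summary(1612581, 40)
for name, value in summary.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions, the bits-per-doubling estimate is the number of matching bits divided by log2 of the object count, and the remaining margin is extrapolated at that same rate.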

`bup-midx`(1), `bup-save`(1)

Part of the `bup`(1) suite.