Reading, Parsing and Unpacking
Reading and parsing
The BitStream and ConstBitStream classes contain a number of methods for reading the bitstring as if it were a file or stream. Depending on how it was constructed, the bitstream might actually be contained in a file rather than stored in memory, but these methods work in either case.
In order to behave like a file or stream, every bitstream has a property pos which is the current position from which reads occur. pos can range from zero (its value on construction) to the length of the bitstream, a position from which all reads will fail as it is past the last bit. Note that the pos property isn’t considered a part of the bitstream’s identity; this allows it to vary for immutable ConstBitStream objects and means that it doesn’t affect equality or hash values.
The property bytepos is also available, and is useful if you are only dealing with byte data and don’t want to always have to divide the bit position by eight. Note that if you try to use bytepos and the bitstring isn’t byte aligned (i.e. pos isn’t a multiple of 8) then a ByteAlignError exception will be raised.
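The relationship between the two properties can be sketched in plain Python. This is only a model of the rule described above: the helper name bytepos_from_pos is hypothetical, and ValueError stands in for the library's ByteAlignError.

```python
def bytepos_from_pos(pos: int) -> int:
    """Convert a bit position to a byte position (hypothetical helper).

    Only byte-aligned bit positions have a byte position. ValueError is an
    assumption of this sketch; bitstring raises ByteAlignError instead.
    """
    if pos % 8 != 0:
        raise ValueError(f"bit position {pos} is not byte aligned")
    return pos // 8


print(bytepos_from_pos(16))  # 2

try:
    bytepos_from_pos(20)     # 20 is not a multiple of 8
except ValueError as exc:
    print("error:", exc)
```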
read / readlist
For simple reading of a number of bits you can use read with an integer argument. A new bitstring object gets returned, which can be interpreted using one of its properties or used for further reads. The following example does some simple parsing of an MPEG-1 video stream (the stream is provided in the test directory if you downloaded the source archive).
>>> s = ConstBitStream(filename='test/test.m1v')
>>> print(s.pos)
0
>>> start_code = s.read(32).hex
>>> width = s.read(12).uint
>>> height = s.read(12).uint
>>> print(start_code, width, height, s.pos)
000001b3 352 288 56
>>> s.pos += 37
>>> flags = s.read(2)
>>> constrained_parameters_flag = flags.read(1)
>>> load_intra_quantiser_matrix = flags.read(1)
>>> print(s.pos, flags.pos)
95 2
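To make the bit arithmetic behind those first three reads concrete, here is a plain-Python sketch that does not use the library. The header bytes are built by hand to match the values printed above (they are not read from test/test.m1v), and read_bits is a hypothetical helper.

```python
# Hand-built header: 32-bit start code, 12-bit width, 12-bit height.
# These seven bytes are an assumption chosen to reproduce the values above.
header = bytes.fromhex("000001b3160120")
bits = "".join(f"{byte:08b}" for byte in header)  # 56 characters of '0'/'1'

pos = 0  # current bit position, as with the pos property


def read_bits(n):
    """Return the next n bits as a '0'/'1' string and advance pos."""
    global pos
    chunk = bits[pos:pos + n]
    pos += n
    return chunk


start_code = f"{int(read_bits(32), 2):08x}"  # interpret 32 bits as hex
width = int(read_bits(12), 2)                # interpret 12 bits as unsigned
height = int(read_bits(12), 2)
print(start_code, width, height, pos)        # 000001b3 352 288 56
```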
If you want to read multiple items in one go you can use readlist. This takes an iterable of bit lengths and returns a list of bitstring objects. So for example instead of writing:
a = s.read(32)
b = s.read(8)
c = s.read(24)
you can equivalently use just:
a, b, c = s.readlist([32, 8, 24])
Reading using format strings
The read / readlist methods can also take a format string similar to that used in the auto initialiser. Only one token should be provided to read and a single value is returned. To read multiple tokens use readlist, which unsurprisingly returns a list.
The format string consists of comma separated tokens that describe how to interpret the next bits in the bitstring. The tokens are:
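| int:n | n bits as a signed integer. |
| uint:n | n bits as an unsigned integer. |
| intbe:n | n bits as a byte-wise big-endian signed integer. |
| uintbe:n | n bits as a byte-wise big-endian unsigned integer. |
| intle:n | n bits as a byte-wise little-endian signed integer. |
| uintle:n | n bits as a byte-wise little-endian unsigned integer. |
| intne:n | n bits as a byte-wise native-endian signed integer. |
| uintne:n | n bits as a byte-wise native-endian unsigned integer. |
| float:n | n bits as a big-endian floating point number (same as floatbe). |
| floatbe:n | n bits as a big-endian floating point number (same as float). |
| floatle:n | n bits as a little-endian floating point number. |
| floatne:n | n bits as a native-endian floating point number. |
| hex:n | n bits as a hexadecimal string. |
| oct:n | n bits as an octal string. |
| bin:n | n bits as a binary string. |
| bits:n | n bits as a new bitstring. |
| bytes:n | n bytes as a bytes object. |
| ue | next bits as an unsigned exponential-Golomb code. |
| se | next bits as a signed exponential-Golomb code. |
| uie | next bits as an interleaved unsigned exponential-Golomb code. |
| sie | next bits as an interleaved signed exponential-Golomb code. |
| bool | next bit as a boolean (True or False). |
| pad:n | next n bits will be skipped. |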
So in the earlier example we could have written:
start_code = s.read('hex:32')
width = s.read('uint:12')
height = s.read('uint:12')
and we also could have combined the three reads as:
start_code, width, height = s.readlist('hex:32, 12, 12')
where here we are also taking advantage of the default uint interpretation for the second and third tokens.
You are allowed to use one ‘stretchy’ token in a readlist. This is a token without a length specified, which will stretch to encompass as many bits as possible. This is often useful when you just want to assign something to ‘the rest’ of the bitstring:
a, b, everything_else = s.readlist('intle:16, intle:24, bits')
In this example the bits token will consist of everything left after the first two tokens are read, and could be empty.
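One way to picture how a stretchy token gets its length: the fixed-length tokens claim their bits and the stretchy token receives whatever remains. The sketch below is a plain-Python model of that rule, not the library's implementation; token_lengths is a hypothetical helper and None marks the stretchy token.

```python
def token_lengths(lengths, bits_available):
    """Resolve one stretchy token (given as None) to a concrete bit length.

    Hypothetical helper: fixed tokens keep their length and the single
    stretchy token receives whatever is left over.
    """
    if lengths.count(None) > 1:
        raise ValueError("only one stretchy token is allowed")
    fixed = sum(n for n in lengths if n is not None)
    if fixed > bits_available:
        raise ValueError("not enough bits for the fixed-length tokens")
    return [bits_available - fixed if n is None else n for n in lengths]


# 'intle:16, intle:24, bits' with 64 bits remaining in the stream
print(token_lengths([16, 24, None], 64))  # [16, 24, 24]

# With exactly 40 bits remaining, the stretchy token is empty
print(token_lengths([16, 24, None], 40))  # [16, 24, 0]
```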
It is an error to use more than one stretchy token, or to use a ue, se, uie or sie token after a stretchy token. The reason you can’t use exponential-Golomb codes after a stretchy token is that the codes can only be read forwards; that is, you can’t ask “if this code ends here, where did it begin?” as there could be many possible answers.
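The forward-only nature of these codes can be seen in a small decoder. The following is a plain-Python sketch of unsigned exponential-Golomb decoding over a string of '0' and '1' characters; read_ue is a hypothetical helper written for illustration, not the library's code.

```python
def read_ue(bits, pos):
    """Decode one unsigned exponential-Golomb code starting at pos.

    bits is a string of '0' and '1'. Returns (value, new_pos).
    The code is k zeros, a '1', then k further bits; its value is
    2**k - 1 plus the integer value of those k bits.
    """
    leading_zeros = 0
    while pos + leading_zeros < len(bits) and bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    start = pos + leading_zeros + 1  # step over the terminating '1'
    end = start + leading_zeros
    if end > len(bits):
        raise ValueError("ran off the end while reading the code")
    suffix = bits[start:end]
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, end


stream = "1" + "010" + "00101"  # the codes for 0, 1 and 4
pos = 0
values = []
while pos < len(stream):
    value, pos = read_ue(stream, pos)
    values.append(value)
print(values)  # [0, 1, 4]

# Reading backwards is ambiguous: two different codes end at bit 9.
print(read_ue(stream, 4))  # (4, 9)
print(read_ue(stream, 8))  # (0, 9)
```

Both of the last two calls finish at position 9, so knowing only where a code ends does not tell you where it started.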
The pad token is a special case in that it just causes bits to be skipped over without anything being returned. This can be useful, for example, if parts of a binary format are uninteresting:
a, b = s.readlist('pad:12, uint:4, pad:4, uint:8')
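The effect of pad can be modelled in a few lines of plain Python: every token consumes bits, but only the non-pad tokens contribute a value. read_tokens is a hypothetical helper that handles only uint and pad, and the input bits are made up for illustration.

```python
def read_tokens(bits, tokens):
    """Read (name, length) tokens from a '0'/'1' string, skipping pads.

    Only 'uint' and 'pad' are handled in this sketch.
    """
    pos = 0
    values = []
    for name, length in tokens:
        chunk = bits[pos:pos + length]
        pos += length
        if name == "pad":
            continue  # bits are consumed, but nothing is returned
        if name != "uint":
            raise ValueError(f"unsupported token: {name}")
        values.append(int(chunk, 2))
    return values


# 28 bits of made-up data: hex digits a b c | 5 | d | 1 2
bits = "".join(f"{int(digit, 16):04b}" for digit in "abc5d12")

# Equivalent in spirit to 'pad:12, uint:4, pad:4, uint:8'
a, b = read_tokens(bits, [("pad", 12), ("uint", 4), ("pad", 4), ("uint", 8)])
print(a, b)  # 5 18
```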
Peeking
In addition to the read methods there are matching peek methods. These are identical to the read methods except that they do not advance the position in the bitstring past the elements read.
s = ConstBitStream('0x4732aa34')
if s.peek(8) == '0x47':
    t = s.read(16) # t is first 2 bytes '0x4732'
else:
    s.find('0x47')
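The difference between the two kinds of method comes down to whether the position moves. A minimal plain-Python model, assuming a hypothetical BitCursor class that stores the bits as a '0'/'1' string:

```python
class BitCursor:
    """Hypothetical stand-in for a bitstream: a '0'/'1' string plus a position."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def peek(self, n):
        """Return the next n bits without moving the position."""
        return self.bits[self.pos:self.pos + n]

    def read(self, n):
        """Return the next n bits and advance the position past them."""
        chunk = self.peek(n)
        self.pos += n
        return chunk


s = BitCursor(f"{0x4732AA34:032b}")
first_byte = s.peek(8)
pos_after_peek = s.pos       # still 0
first_two_bytes = s.read(16)
pos_after_read = s.pos       # now 16
print(first_byte, pos_after_peek, pos_after_read)  # 01000111 0 16
```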
Unpacking
The unpack method works in a very similar way to readlist. The major difference is that it interprets the whole bitstring from the start, and takes no account of the current pos. It’s a natural complement of the pack function.
s = pack('uint:10, hex, int:13, 0b11', 130, '3d', -23)
a, b, c, d = s.unpack('uint:10, hex, int:13, bin:2')
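To see what that round trip means at the bit level, the layout can be worked out by hand in plain Python. This sketch only reproduces the arithmetic for the example's values; it does not call pack or unpack, and the variable names are illustrative.

```python
# Bit layout of 'uint:10, hex, int:13, 0b11' with values 130, '3d', -23
fields = [
    f"{130:010b}",                # uint:10
    f"{0x3d:08b}",                # hex '3d': two hex digits, so 8 bits
    f"{-23 & (2**13 - 1):013b}",  # int:13 stored as two's complement
    "11",                         # the literal 0b11
]
packed = "".join(fields)

# Unpack by slicing from bit 0, ignoring any notion of a current position
a = int(packed[0:10], 2)
b = f"{int(packed[10:18], 2):02x}"
raw = int(packed[18:31], 2)
c = raw - (1 << 13) if raw >= (1 << 12) else raw  # undo two's complement
d = packed[31:33]
print(len(packed), a, b, c, d)  # 33 130 3d -23 11
```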
Seeking
The properties pos and bytepos are available for getting and setting the position, which is zero on creation of the bitstring.
Note that you can only use bytepos if the position is byte aligned, i.e. the bit position is a multiple of 8. Otherwise a ByteAlignError exception is raised.
For example:
>>> s = BitStream('0x123456')
>>> s.pos
0
>>> s.bytepos += 2
>>> s.pos # note pos versus bytepos
16
>>> s.pos += 4
>>> print(s.read('bin:4')) # the final nibble '0x6'
0110
Finding and replacing
find / rfind
To search for a sub-string use the find method. If the find succeeds it will set the position to the start of the next occurrence of the searched-for string and return a tuple containing that position, otherwise it will return an empty tuple. By default the sub-string will be found at any bit position - to allow it to only be found on byte boundaries set bytealigned=True.
>>> s = ConstBitStream('0x00123400001234')
>>> found = s.find('0x1234', bytealigned=True)
>>> print(found, s.bytepos)
(8,) 1
>>> found = s.find('0xff', bytealigned=True)
>>> print(found, s.bytepos)
() 1
The reason for returning the bit position in a tuple is so that the return value is True in a boolean sense if the sub-string is found, and False if it is not (if just the bit position were returned there would be a problem with finding at position 0). The effect is that you can use if s.find(...): and have it behave as you’d expect.
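The point about position 0 is easy to check in plain Python. The sketch below uses a hypothetical find_tuple helper over ordinary strings to mimic the return convention; it is not the library's find.

```python
def find_tuple(haystack, needle):
    """Hypothetical helper: return (position,) if needle is found, else ()."""
    index = haystack.find(needle)
    return (index,) if index != -1 else ()


hit_at_zero = find_tuple("1100", "11")
miss = find_tuple("1100", "111")

print(hit_at_zero, bool(hit_at_zero))  # (0,) True
print(miss, bool(miss))                # () False
print(bool(0))                         # False: a bare position 0 would look like a miss
```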
rfind does much the same as find, except that it will find the last occurrence, rather than the first.
>>> t = BitArray('0x0f231443e8')
>>> found = t.rfind('0xf') # Search all bit positions in reverse
>>> print(found)
(31,) # Found within the 0x3e near the end
For all of these finding functions you can optionally specify a start and / or end to narrow the search range. Note though that because it’s searching backwards rfind will start at end and end at start (so you always need start < end).
findall
To find all occurrences of a bitstring inside another (even overlapping ones), use findall. This returns a generator for the bit positions of the found strings.
>>> r = BitArray('0b011101011001')
>>> ones = r.findall([1])
>>> print(list(ones))
[1, 2, 3, 5, 7, 8, 11]
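What "even overlapping ones" means can be shown with a plain-Python generator over a '0'/'1' string. findall_bits is a hypothetical helper for illustration; after each match the search resumes one bit later, so matches may share bits.

```python
def findall_bits(haystack, needle):
    """Yield every position where needle occurs, including overlapping ones."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)  # advance by one bit only


# Same data as the example above: single set bits
print(list(findall_bits("011101011001", "1")))  # [1, 2, 3, 5, 7, 8, 11]

# Overlapping matches: '11' occurs at 1 and again at 2
print(list(findall_bits("0111", "11")))         # [1, 2]
```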
replace
To replace all occurrences of one BitArray with another use replace. The replacements are done in-place, and the number of replacements made is returned. This method changes the contents of the bitstring and so isn’t available for the Bits or ConstBitStream classes.
>>> s = BitArray('0b110000110110')
>>> s.replace('0b110', '0b1111')
3 # The number of replacements made
>>> s.bin
'111100011111111'
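The numbers in this example can be checked with ordinary string operations. This is only a model of the result for this particular input, where the matches do not overlap and are taken left to right; it says nothing about how the library performs the replacement.

```python
before = "110000110110"

count = before.count("110")            # non-overlapping matches, left to right
after = before.replace("110", "1111")  # each 3-bit match becomes 4 bits

print(count, after)             # 3 111100011111111
print(len(before), len(after))  # 12 15: the bitstring grows by one bit per match
```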
Working with byte aligned data
The emphasis with the bitstring module is always towards not worrying if things are a whole number of bytes long or are aligned on byte boundaries. Internally the module has to worry about this quite a lot, but the user shouldn’t have to care. To this end methods such as find, findall, split and replace by default aren’t concerned with looking for things only on byte boundaries and provide a parameter bytealigned which can be set to True to change this behaviour.
This works fine, but it’s not uncommon to be working only with whole-byte data and all the bytealigned=True can get a bit repetitive. To solve this it is possible to change the default throughout the module by setting bitstring.bytealigned. For example:
>>> s = BitArray('0xabbb')
>>> s.find('0xbb') # look for the byte 0xbb
(4,) # found, but not on byte boundary
>>> s.find('0xbb', bytealigned=True) # try again...
(8,) # found on a byte boundary
>>> bitstring.bytealigned = True # change the default behaviour
>>> s.find('0xbb')
(8,) # now only finds byte aligned
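The two results in this example differ only in which start positions are tried. A plain-Python sketch of that idea, assuming a hypothetical find_bits helper over '0'/'1' strings (not the library's search routine):

```python
def find_bits(haystack, needle, bytealigned=False):
    """Return (position,) of the first match, or () if there is none.

    With bytealigned=True only positions that are multiples of 8 are tried.
    """
    step = 8 if bytealigned else 1
    for pos in range(0, len(haystack) - len(needle) + 1, step):
        if haystack.startswith(needle, pos):
            return (pos,)
    return ()


bits = f"{0xABBB:016b}"  # 1010101110111011
needle = f"{0xBB:08b}"   # 10111011

print(find_bits(bits, needle))                    # (4,)
print(find_bits(bits, needle, bytealigned=True))  # (8,)
```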