Introduction¶
Getting Started¶
Most of the functionality that one will use is contained in the main module, which is imported with
import hdf5storage
Lower level functionality, needed mostly for extending this package to work with more datatypes, is in its submodules.
The main functions in this module are write() and read(), which write a single Python variable to an HDF5 file or read the specified contents at one location in an HDF5 file and convert them to Python types.
HDF5 files are structured much like a Unix filesystem, so everything can be referenced with a POSIX style path, which looks like '/pyth/hf'. Unlike a Windows path, forward slashes ('/') are used as directory separators instead of back slashes ('\'), and the base of the file system is just '/' instead of something like 'C:\'. In the language of HDF5, what we call directories and files in filesystems are called groups and datasets.
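For example, here is a minimal sketch of writing to a hypothetical nested path '/experiment/run1/foo' (a location made up for illustration, assuming write() creates the intermediate groups as needed, much like nested directories):
import hdf5storage
# Hypothetical nested location; '/experiment' and '/experiment/run1'
# become groups and 'foo' becomes the dataset at the end of the path.
hdf5storage.write(42, path='/experiment/run1/foo', filename='data.h5')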
write() has many options for controlling how the data is stored, and what metadata is stored, but we can ignore that for now. If we have a variable named foo that we want to write to an HDF5 file named data.h5, we would write it by
hdf5storage.write(foo, path='/foo', filename='data.h5')
And then we can read it back from the file with the read() function, which returns the read data. Here, we will put the data we read back into the variable bar
bar = hdf5storage.read(path='/foo', filename='data.h5')
Writing And Reading Several Python Variables at Once¶
To write and read more than one Python variable, one could use write() and read() for each variable individually. This can incur a major performance penalty, especially for large HDF5 files, since each call opens and closes the HDF5 file (sometimes more than once). Version 0.1.10 added a way to do this without incurring this performance penalty by adding two new functions: writes() and reads(). They can write and read more than one Python variable at once, though they can still work with a single variable. In fact, write() and read() are now wrappers around them. savemat() and loadmat() currently use them for the improved performance.
New in version 0.1.10: Ability to write and read more than one Python variable at a time without opening and closing the HDF5 file each time.
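For illustration, a minimal sketch of writing and reading two variables in a single call each, assuming (per the API reference) that writes() takes a dict mapping HDF5 paths to the values to store and reads() takes an iterable of paths, returning the data in the same order:
import hdf5storage
# Both variables are written with the file opened and closed only once.
hdf5storage.writes({'/foo': 2.3, '/bar': [1, 2, 3]}, filename='data.h5')
# Both variables are read back in the order the paths were given.
foo, bar = hdf5storage.reads(['/foo', '/bar'], filename='data.h5')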
Main Options Controlling Writing/Reading Data¶
There are many individual options that control how data is written and read to/from file. These can be set by passing an Options object to write() and read() by
options = hdf5storage.Options(...)
hdf5storage.write(... , options=options)
hdf5storage.read(... , options=options)
or passing the individual keyword arguments used by the Options constructor to write() and read(). The two methods cannot be mixed (the functions will give precedence to the given Options object).
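For example, a sketch of the two equivalent ways of enabling MATLAB compatibility when writing the foo variable from earlier (use one or the other, since a given Options object takes precedence over keywords):
options = hdf5storage.Options(matlab_compatible=True)
hdf5storage.write(foo, path='/foo', filename='data.h5', options=options)
# ... or equivalently, pass the constructor keyword directly:
hdf5storage.write(foo, path='/foo', filename='data.h5', matlab_compatible=True)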
Note
Functions in the various submodules only support the Options object method of passing options.
The two main options are Options.store_python_metadata and Options.matlab_compatible. A more minor option is Options.oned_as.
New in version 0.1.9: Support for the transparent compression of data has been added. It is enabled by default, compressing all Python objects that result in HDF5 Datasets larger than 16 KB with the GZIP/Deflate algorithm.
store_python_metadata¶
bool
Setting this option causes metadata to be written so that the written objects can be read back into Python accurately. As HDF5 does not natively support many Python data types (essentially only Numpy types), most Python data types have to be converted before being written. If metadata isn't also written, the data cannot be read back to its original form and will instead be read back as the Python type most closely resembling how it is stored, which will be a Numpy type of some sort.
matlab_compatible¶
bool
Setting this option causes the writing of HDF5 files to be done in a way compatible with MATLAB v7.3 MAT files. This consists of writing some file metadata so that MATLAB recognizes the file, adding specific metadata to every stored object so that MATLAB recognizes them, and transforming the data to be in the form that MATLAB expects for certain types (for example, MATLAB expects everything to be at least a 2D array and strings to be stored in UTF-16 but with no doublets).
Note
There are many individual small options in the Options class that this option sets to specific values. Setting matlab_compatible automatically sets them, while changing their values to something else automatically turns matlab_compatible off.
action_for_matlab_incompatible¶
{'ignore', 'discard', 'error'}
The action to perform when doing MATLAB compatibility (matlab_compatible == True) but a type being written is not MATLAB compatible. The actions are to write the data anyway ('ignore'), not write the incompatible data ('discard'), or throw a lowlevel.TypeNotMatlabCompatibleError exception ('error'). The default is 'error'.
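As a hedged sketch, the 'error' action can be caught and the write retried with 'discard'; the exception class is the one named above in the lowlevel submodule, and data here is just a stand-in for whatever object turned out not to be MATLAB compatible.
import hdf5storage
import hdf5storage.lowlevel
data = {'x': 1}  # stand-in; substitute the value that is not MATLAB compatible
try:
    hdf5storage.write(data, path='/data', filename='data.mat',
                      matlab_compatible=True,
                      action_for_matlab_incompatible='error')
except hdf5storage.lowlevel.TypeNotMatlabCompatibleError:
    # Retry, silently dropping whatever MATLAB cannot represent.
    hdf5storage.write(data, path='/data', filename='data.mat',
                      matlab_compatible=True,
                      action_for_matlab_incompatible='discard')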
oned_as¶
{‘row’, ‘column’}
This option is only actually relevant when matlab_compatible == True. MATLAB only supports 2D and higher dimensionality arrays, but Numpy supports 1D arrays. So, 1D arrays have to be made two dimensional, becoming either row vectors or column vectors. This option sets which they become when imported into MATLAB.
compress¶
New in version 0.1.9.
bool
Whether to use compression when writing data. Enabled (True) by default. See Compression for more information.
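Putting the options in this section together, here is a minimal sketch of an Options object that could be passed to write() or read(); the keyword names mirror the option names above, but check the Options reference for the full list and defaults.
import hdf5storage
options = hdf5storage.Options(store_python_metadata=True,
                              matlab_compatible=True,
                              action_for_matlab_incompatible='discard',
                              oned_as='column',
                              compress=False)
hdf5storage.write([1.5, 2.5], path='/foo', filename='data.h5', options=options)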
Convenience Functions for MATLAB MAT Files¶
Two functions are provided for reading and writing to MATLAB MAT files in a convenient way. They are savemat() and loadmat(), which are modelled after the SciPy functions of the same name (scipy.io.savemat() and scipy.io.loadmat()), which work with non-HDF5 based MAT files. They take not only the same options, but dispatch calls automatically to the SciPy versions when instructed to write to a non-HDF5 based MAT file, or read a MAT file that is not HDF5 based. SciPy must be installed to take advantage of this functionality.
savemat() takes a dict having data (values) and the names to give each piece of data (keys), and writes them to a MATLAB compatible MAT file. The format keyword sets the MAT file format, with '7.3' being the HDF5 based format supported by this package and '5' and '4' being the non-HDF5 based formats supported by SciPy. If you want the data to be able to be read accurately back into Python, you should set store_python_metadata=True. Writing a couple of variables to a file looks like
hdf5storage.savemat('data.mat', {'foo': 2.3, 'bar': (1+2j)}, format='7.3', oned_as='column', store_python_metadata=True)
Then, to read variables back, we can either explicitly name the variables we want
out = hdf5storage.loadmat('data.mat', variable_names=['foo', 'bar'])
or grab all variables by either not giving the variable_names option or setting it to None.
out = hdf5storage.loadmat('data.mat')
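Either way, loadmat() returns a dict mapping variable names to their values (mirroring scipy.io.loadmat()), so the values written above can be pulled out by name:
foo = out['foo']   # 2.3
bar = out['bar']   # (1+2j)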
Example: Write And Readback Including Different Metadata¶
Making The Data¶
Make a dict containing many different types that we want to store to disk in an HDF5 file. The initialization method depends on the Python version.
Python 3¶
The dict keys must be str (the unicode string type).
>>> import numpy as np
>>> import hdf5storage
>>> a = {'a': True,
... 'b': None,
... 'c': 2,
... 'd': -3.2,
... 'e': (1-2.3j),
... 'f': 'hello',
... 'g': b'goodbye',
... 'h': ['list', 'of', 'stuff', [30, 2.3]],
... 'i': np.zeros(shape=(2,), dtype=[('bi', 'uint8')]),
... 'j':{'aa': np.bool_(False),
... 'bb': np.uint8(4),
... 'cc': np.uint32([70, 8]),
... 'dd': np.int32([]),
... 'ee': np.float32([[3.3], [5.3e3]]),
... 'ff': np.complex128([[3.4, 3], [9+2j, 0]]),
... 'gg': np.array(['one', 'two', 'three'], dtype='str'),
... 'hh': np.bytes_(b'how many?'),
... 'ii': np.object_(['text', np.int8([1, -3, 0])])}}
Python 2¶
The same thing, but in Python 2, where the dict keys must be unicode. The other datatypes are translated from the Python 3 example appropriately. The rest of the examples on this page run identically in Python 2 and 3, but the outputs are listed as returned in Python 3.
>>> import numpy as np
>>> import hdf5storage
>>> a = {u'a': True,
... u'b': None,
... u'c': 2,
... u'd': -3.2,
... u'e': (1-2.3j),
... u'f': u'hello',
... u'g': 'goodbye',
... u'h': [u'list', u'of', u'stuff', [30, 2.3]],
... u'i': np.zeros(shape=(2,), dtype=[('bi', 'uint8')]),
... u'j':{u'aa': np.bool_(False),
... u'bb': np.uint8(4),
... u'cc': np.uint32([70, 8]),
... u'dd': np.int32([]),
... u'ee': np.float32([[3.3], [5.3e3]]),
... u'ff': np.complex128([[3.4, 3], [9+2j, 0]]),
... u'gg': np.array([u'one', u'two', u'three'], dtype='unicode'),
... u'hh': np.str_('how many?'),
... u'ii': np.object_([u'text', np.int8([1, -3, 0])])}}
Using No Metadata¶
Write it to a file at the path '/a', but include no Python or MATLAB metadata. Then, read it back and notice that many objects come back quite different from what was written. Namely, everything was converted to Numpy types. This even includes the dictionaries, which were converted to structured np.ndarrays. This happens because all other types (other than dict) must be converted to these types before being written to the HDF5 file, and without metadata, the conversion cannot be reversed (while dict isn't converted, it has the same form and thus cannot be extracted reversibly).
>>> hdf5storage.write(data=a, path='/a', filename='data.h5',
... store_python_metadata=False,
... matlab_compatible=False)
>>> hdf5storage.read(path='/a', filename='data.h5')
array([ (True,
[],
2,
-3.2,
(1-2.3j),
b'hello',
b'goodbye',
[array(b'list', dtype='|S4'),
array(b'of', dtype='|S2'),
array(b'stuff', dtype='|S5'),
array([array(30), array(2.3)], dtype=object)],
[(0,), (0,)],
[(False,
4,
array([70, 8], dtype=uint32),
array([], dtype=int32),
array([[ 3.29999995e+00], [ 5.30000000e+03]], dtype=float32),
array([[ 3.4+0.j, 3.0+0.j], [ 9.0+2.j, 0.0+0.j]]),
array([111, 110, 101, 0, 0, 116, 119, 111, 0, 0, 116, 104, 114,
101, 101], dtype=uint32),
b'how many?',
array([array(b'text', dtype='|S4'),
array([ 1, -3, 0], dtype=int8)],
dtype=object))])],
dtype=[('a', '?'),
('b', '<f8', (0,)),
('c', '<i8'),
('d', '<f8'),
('e', '<c16'),
('f', 'S5'),
('g', 'S7'), ('h', 'O', (4,)),
('i', [('bi', 'u1')], (2,)),
('j', [('aa', '?'),
('bb', 'u1'),
('cc', '<u4', (2,)),
('dd', '<i4', (0,)),
('ee', '<f4', (2, 1)),
('ff', '<c16', (2, 2)),
('gg', '<u4', (15,)),
('hh', 'S9'),
('ii', 'O', (2,))],
(1,))])
Including Python Metadata¶
Do the same thing, but now include Python metadata (store_python_metadata == True). This time, everything is read back the same as it was written (or at least, it should be).
>>> hdf5storage.write(data=a, path='/a', filename='data_typeinfo.h5',
... store_python_metadata=True,
... matlab_compatible=False)
>>> hdf5storage.read(path='/a', filename='data_typeinfo.h5')
{'a': True,
'b': None,
'c': 2,
'd': -3.2,
'e': (1-2.3j),
'f': 'hello',
'g': b'goodbye',
'h': ['list', 'of', 'stuff', [30, 2.3]],
'i': array([(0,), (0,)],
dtype=[('bi', 'u1')]),
'j': {'aa': False,
'bb': 4,
'cc': array([70, 8], dtype=uint32),
'dd': array([], dtype=int32),
'ee': array([[ 3.29999995e+00],
[ 5.30000000e+03]], dtype=float32),
'ff': array([[ 3.4+0.j, 3.0+0.j],
[ 9.0+2.j, 0.0+0.j]]),
'gg': array(['one', 'two', 'three'],
dtype='<U5'),
'hh': b'how many?',
'ii': array(['text', array([ 1, -3, 0], dtype=int8)], dtype=object)}}
Including MATLAB Metadata¶
Do the same thing, but this time include only MATLAB metadata (matlab_compatible == True). This time, the data that is read back is different from what was written, but in a different way than when no metadata was used. The biggest differences are that everything was turned into an array that is at least 2D, all arrays are transposed, and all string types got converted to numpy.str_. This happens because MATLAB can only work with 2D and higher arrays, uses Fortran array ordering instead of the C ordering that Python uses, and strings are stored in a subset of UTF-16 (no doublets) in version 7.3 MAT files.
>>> hdf5storage.write(data=a, path='/a', filename='data.mat',
... store_python_metadata=False,
... matlab_compatible=True)
>>> hdf5storage.read(path='/a', filename='data.mat')
array([ ([[True]],
[[]],
[[2]],
[[-3.2]],
[[(1-2.3j)]],
[['hello']],
[['goodbye']],
[[array([['list']], dtype='<U4'),
array([['of']], dtype='<U2'),
array([['stuff']], dtype='<U5'),
array([[array([[30]]), array([[ 2.3]])]], dtype=object)]],
[[(array([[0]], dtype=uint8),)],
[(array([[0]], dtype=uint8),)]],
[(array([[False]], dtype=bool),
array([[4]], dtype=uint8),
array([[70, 8]], dtype=uint32),
array([], shape=(1, 0), dtype=int32),
array([[ 3.29999995e+00], [ 5.30000000e+03]], dtype=float32),
array([[ 3.4+0.j, 3.0+0.j], [ 9.0+2.j, 0.0+0.j]]),
array([['one\x00\x00two\x00\x00three']], dtype='<U15'),
array([['how many?']], dtype='<U9'),
array([[array([['text']], dtype='<U4'),
array([[ 1, -3, 0]], dtype=int8)]], dtype=object))])],
dtype=[('a', '?', (1, 1)),
('b', '<f8', (1, 0)),
('c', '<i8', (1, 1)),
('d', '<f8', (1, 1)),
('e', '<c16', (1, 1)),
('f', '<U5', (1, 1)),
('g', '<U7', (1, 1)),
('h', 'O', (1, 4)),
('i', [('bi', 'u1', (1, 1))], (2, 1)),
('j', [('aa', '?', (1, 1)),
('bb', 'u1', (1, 1)),
('cc', '<u4', (1, 2)),
('dd', '<i4', (1, 0)),
('ee', '<f4', (2, 1)),
('ff', '<c16', (2, 2)),
('gg', '<U15', (1, 1)),
('hh', '<U9', (1, 1)),
('ii', 'O', (1, 2))],
(1,))])
Including both Python And MATLAB Metadata¶
Do the same thing, but now include both Python metadata (store_python_metadata == True) and MATLAB metadata (matlab_compatible == True). This time, everything is read back the same as it was written (or at least, it should be). The Python metadata makes the transformations done to store the data in a MATLAB compatible way reversible.
>>> hdf5storage.write(data=a, path='/a', filename='data_typeinfo.mat',
... store_python_metadata=True,
... matlab_compatible=True)
>>> hdf5storage.read(path='/a', filename='data_typeinfo.mat')
{'a': True,
'b': None,
'c': 2,
'd': -3.2,
'e': (1-2.3j),
'f': 'hello',
'g': b'goodbye',
'h': ['list', 'of', 'stuff', [30, 2.3]],
'i': array([(0,), (0,)],
dtype=[('bi', 'u1')]),
'j': {'aa': False,
'bb': 4,
'cc': array([70, 8], dtype=uint32),
'dd': array([], dtype=int32),
'ee': array([[ 3.29999995e+00],
[ 5.30000000e+03]], dtype=float32),
'ff': array([[ 3.4+0.j, 3.0+0.j],
[ 9.0+2.j, 0.0+0.j]]),
'gg': array(['one', 'two', 'three'],
dtype='<U5'),
'hh': b'how many?',
'ii': array(['text', array([ 1, -3, 0], dtype=int8)], dtype=object)}}