Table Indexing

Once a Table has been created, it is possible to create indices on one or more columns of the table. An index internally sorts the rows of a table based on the index column(s), allowing for element retrieval by column value and improved performance for certain table operations.

Creating an Index

To create an index on a table, use the add_index() method:

>>> from astropy.table import Table
>>> t = Table([(2, 3, 2, 1), (8, 7, 6, 5)], names=('a', 'b'))
>>> t.add_index('a')

The optional argument unique may be specified to create an index with uniquely valued elements.

To create a composite index on multiple columns, pass a list of columns instead:

>>> t.add_index(['a', 'b'])

In particular, the first index created using the add_index() method is considered the default index or the “primary key.” To retrieve an index from a table, use the indices property:

>>> t.indices['a']
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  1    3
  2    0
  2    2
  3    1>>
>>> t.indices['a', 'b']
<SlicedIndex original=True index=<Index columns=('a', 'b') data=<SortedArray length=4>
 a   b  rows
--- --- ----
  1   5    3
  2   6    2
  2   8    0
  3   7    1>>

Row Retrieval using Indices

Row retrieval can be accomplished using two table properties: loc and iloc. The loc property can be indexed either by column value, range of column values (including the bounds), or a list or numpy.ndarray of column values:

>>> t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=['i8', 'i8'])
>>> t.add_index('a')
>>> t.loc[2]  # the row(s) where a == 2
<Row index=1>
  a     b
int64 int64
----- -----
    2     1
>>> t.loc[[1, 4]]  # the row(s) where a in [1, 4]
<Table length=2>
  a     b
int64 int64
----- -----
    1    10
    4     9
>>> t.loc[1:3]  # the row(s) where a in [1, 2, 3]
<Table length=3>
  a     b
int64 int64
----- -----
    1    10
    2     1
    3     9
>>> t.loc[:]
<Table length=4>
  a     b
int64 int64
----- -----
    1    10
    2     1
    3     9
    4     9

Note that by default, loc uses the primary index, which here is column 'a'. To use a different index, pass the indexed column name before the retrieval data:

>>> t.add_index('b')
>>> t.loc['b', 8:10]
<Table length=3>
  a     b
int64 int64
----- -----
    3     9
    4     9
    1    10

The property iloc works similarly, except that the retrieval information must be either an integer or a slice, and relates to the sorted order of the index rather than column values. For example:

>>> t.iloc[0] # smallest row by value 'a'
<Row index=0>
  a     b
int64 int64
----- -----
    1    10
>>> t.iloc['b', 1:] # all but smallest value of 'b'
<Table length=3>
  a     b
int64 int64
----- -----
    3     9
    4     9
    1    10

Effects on Performance

Table operations change somewhat when indices are present, and there are a number of factors to consider when deciding whether the use of indices will improve performance. In general, indexing offers the following advantages:

  • Table grouping and sorting based on indexed column(s) both become faster.

  • Retrieving values by index is faster than custom searching.

There are certain caveats, however:

  • Creating an index requires time and memory.

  • Table modifications become slower due to automatic index updates.

  • Slicing a table becomes slower due to index relabeling.

See here for an IPython notebook profiling various aspects of table indexing.

Index Modes

The index_mode() method allows for some flexibility in the behavior of table indexing by allowing the user to enter a specific indexing mode via a context manager. There are currently three indexing modes: 'freeze', 'copy_on_getitem', and 'discard_on_copy'.

The 'freeze' mode prevents automatic index updates whenever a column of the index is modified, and all indices refresh themselves after the context ends:

>>> with t.index_mode('freeze'):
...    t['a'][0] = 0
...    print(t.indices['a']) # unmodified
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  1    0
  2    1
  3    2
  4    3>>
>>> print(t.indices['a']) # modified
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  0    0
  2    1
  3    2
  4    3>>

The 'copy_on_getitem' mode forces columns to copy and relabel their indices upon slicing. In the absence of this mode, table slices will preserve indices while column slices will not:

>>> ca = t['a'][[1, 3]]
>>> ca.info.indices
[]
>>> with t.index_mode('copy_on_getitem'):
...     ca = t['a'][[1, 3]]
...     print(ca.info.indices)
[<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=2>
 a  rows
--- ----
  2    0
  4    1>>]

The 'discard_on_copy' mode prevents indices from being copied whenever a column or table is copied:

>>> t2 = Table(t)
>>> t2.indices['a']
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  0    0
  2    1
  3    2
  4    3>>
>>> with t.index_mode('discard_on_copy'):
...    t2 = Table(t)
...    print(t2.indices)
[]

Updating Rows using Indices

Row updates can be accomplished by assigning the table property loc a complete row or a list of rows:

>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc['x']
<Row index=1>
 a     b
str1 int64
---- -----
   x     1
>>> t.loc['x'] = ['a', 12]
>>> t
<Table length=4>
 a     b
str1 int64
---- -----
   w    10
   a    12
   y     9
   z     9
>>> t.loc[['w', 'y']]
<Table length=2>
 a     b
str1 int64
---- -----
   w    10
   y     9
>>> t.loc[['w', 'z']] = [['b', 23], ['c', 56]]
>>> t
<Table length=4>
 a     b
str1 int64
---- -----
   b    23
   a    12
   y     9
   c    56

Retrieving the Location of Rows using Indices

Retrieval of the location of rows can be accomplished using a table property: loc_indices. The loc_indices property can be indexed either by column value, range of column values (including the bounds), or a list or numpy.ndarray of column values:

>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc_indices['x']
1

Engines

When creating an index via add_index(), the keyword argument engine may be specified to use a particular indexing engine. The available engines are:

The SCEngine depends on the sortedcontainers dependency. The most important takeaway is that SortedArray (the default engine) is usually best, although SCEngine may be more appropriate for an index created on an empty column since adding new values is quicker.

The BST engine demonstrates a simple pure Python implementation of a search tree engine, but the performance is poor for larger tables. This is available in the code largely as an implementation reference.