Datasets and Containers

MDSynthesis is not an analysis code. On its own, it does not produce output data given raw simulation data as input. Its scope is limited to the tedious but necessary task of data management and storage. It is intended to bring value to analysis results by making them easily accessible now and later.

As such, the basic functionality of MDSynthesis is condensed into only two objects, sometimes referred to as Containers in the documentation. These are the Sim and Group objects.

In brief, a Sim is designed to manage and give access to the data corresponding to a single simulation (the raw trajectory or trajectories, as well as analysis results); a Group gives access to any number of Sim or Group objects it has as members (including, perhaps, itself), and can store analysis results that pertain to these members collectively. Both types of Container store their underlying data persistently to disk on the fly. The file locking needed for each transaction is handled automatically, so more than one Python process can work with any number of instances of the same Container at the same time.
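
As a minimal sketch of how the two relate, something like the following should work; the names here are made up, and the members keyword is an assumption about the Group constructor rather than something documented in this section

>>> import mdsynthesis as mds
>>> s1 = mds.Sim('apo')
>>> s2 = mds.Sim('holo')
>>> # assumed: Group takes its member Sims via a members keyword
>>> g = mds.Group('ensemble', members=[s1, s2])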

Warning

File locking is generally process safe, but not thread safe. Don't try to modify elements of the same Container from multiple threads at once. Multiprocessing, however, should work just fine.
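
As a sketch of the process-parallel pattern (assuming a Unix-like system where multiprocessing forks its workers; the Sim name, handles, and values are made up), each worker simply regenerates its own instance of the same Container and writes through it

>>> import multiprocessing as mp
>>> import numpy as np
>>> import mdsynthesis as mds
>>> s = mds.Sim('scratch')                    # create the Sim once, up front
>>> def store(handle):
...     sim = mds.Sim('scratch')              # each worker regenerates its own instance
...     sim.data.add(handle, np.arange(10))
...
>>> procs = [mp.Process(target=store, args=('block{}'.format(i),))
...          for i in range(4)]
>>> for p in procs:
...     p.start()
...
>>> for p in procs:
...     p.join()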

Persistence as a feature

Containers store their data as directory structures in the file system. Generating a new Sim, for example, with the following

>>> # python session 1
>>> import mdsynthesis as mds
>>> s = mds.Sim('marklar')

creates a directory called marklar in the current working directory. It contains a single file at the moment

> # shell
> ls marklar
Sim.2b4b5800-48a7-4814-ba6d-1e631a09a199.h5

The name of this file includes the type of Container (Sim) it corresponds to, as well as the uuid of the Container, which is its unique identifier. This is the state file containing all the information needed to regenerate an identical instance of this Sim. In fact, we can open a separate Python session (go ahead!) and regenerate this Sim immediately there

>>> # python session 2
>>> import mdsynthesis as mds
>>> s = mds.Sim('marklar')

Making a modification to the Sim in one session, perhaps by adding a tag, will be reflected in the Sim in the other session

>>> # python session 1
>>> s.tags.add('TIP4P')

>>> # python session 2
>>> s.tags
<Tags(['TIP4P'])>

This is because both objects pull their identifying information from the same file on disk; they store almost nothing in memory.

Note

The uuid of the Sim in this example will differ from that of any Sims you generate; it is what distinguishes Sims from one another. Unexpected and broken behavior will result from changing the names of state files!

Storing arbitrary datasets

More on things like tags later; what we really care about is storing datasets that are potentially large and time-consuming to produce. Using our Sim marklar as the example here, say we have generated a numpy array of shape (10^6, 3) that gives the minimum distance between the sidechains of three residues and those of a fourth, for each frame in a trajectory

>>> a.shape
(1000000, 3)
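
If you want to follow along without real simulation data, any array of this shape will do as a stand-in; random numbers are fine for the purpose

>>> import numpy as np
>>> a = np.random.random((1000000, 3))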

We can store this easily

>>> s.data.add('distances', a)
>>> s.data
<Data(['distances'])>

and recall it

>>> s.data['distances'].shape
(1000000, 3)

Looking at the contents of the directory marklar, we see it has a new subdirectory corresponding to the name of our stored dataset

> # shell
> ls marklar
distances  Sim.2b4b5800-48a7-4814-ba6d-1e631a09a199.h5

which has its own contents

> ls marklar/distances
npData.h5

This is the data we stored, serialized to disk in the efficient HDF5 data format. Containers will also store pandas objects using this format. For other data structures, the Container will pickle them if it can.
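
For instance, a DataFrame goes into HDF5 just like the array above, while something like a plain dictionary gets pickled (the handles and values here are only illustrative)

>>> import pandas as pd
>>> df = pd.DataFrame(a, columns=['res1', 'res2', 'res3'])
>>> s.data.add('distances_df', df)      # stored via HDF5
>>> s.data.add('notes', {'temperature': 300, 'ensemble': 'NPT'})   # pickled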

Datasets can be nested however you like. For example, say we had several pandas DataFrames, each giving the distance over time of every cation in the simulation to a particular residue of interest on our protein. We could make it clear to ourselves that these are all similar datasets by grouping them together

>>> s.data.add('cations/residue1', df1)
>>> s.data.add('cations/residue2', df2)
>>> # we can also use setitem syntax
>>> s.data['cations/residue3'] = df3
>>> s.data
<Data(['cations/residue1', 'cations/residue2', 'cations/residue3',
       'distances'])>

and their locations in the filesystem reflect this structure.
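
Listing the cations directory, for example, shows a subdirectory for each of these datasets

> # shell
> ls marklar/cations
residue1  residue2  residue3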

Minimal blobs

Individual datasets get their own place in the filesystem instead of all being shoved into a single file on disk. This is by design: it generally means better performance, since there is less waiting on file locks held by other Container instances. It also gives a space to put other files related to the dataset itself, such as figures produced from it.

You can get the location on disk of a dataset with

>>> s.data.locate('cations/residue1')
'/home/bob/marklar/cations/residue1'

which is particularly useful for outputting figures.
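
For example, a quick matplotlib plot of a dataset can be saved alongside it (the filename is arbitrary, and this assumes the stored object is something plottable, like the DataFrames above)

>>> import os
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.plot(s.data['cations/residue1'])
>>> fig.savefig(os.path.join(s.data.locate('cations/residue1'),
...                          'distance_vs_time.pdf'))

The make_filepath method described in the reference below builds such paths directly.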

Another advantage of organizing Containers at the filesystem level is that datasets can also be handled directly from the shell. Removing a dataset with

> # shell
> rm -r marklar/cations/residue2

is immediately reflected by the Container

>>> s.data
<Data(['cations/residue1', 'cations/residue3', 'distances'])>

Datasets can likewise be moved within the Container’s directory tree and they will still be found, with names matching their location relative to the state file.

Reference: Data

The class mdsynthesis.core.aggregators.Data is the interface used by Containers to access their stored datasets. It is not intended to be used on its own, but is shown here to give a detailed view of its methods.

class mdsynthesis.core.aggregators.Data(container, containerfile, logger)

Interface to stored data.

add(handle, *args, **kwargs)

Store data in Container.

A data instance can be a pandas object (Series, DataFrame, Panel), a numpy array, or a pickleable python object. If the dataset doesn’t exist, it is added. If a dataset already exists for the given handle, it is replaced.

Arguments:
handle

name given to data; needed for retrieval

data

data structure to store

append(handle, *args, **kwargs)

Append rows to an existing dataset.

The object must be of the same pandas class (Series, DataFrame, Panel) as the existing dataset, and it must have exactly the same columns (names included).

Arguments:
handle

name of data to append to

data

data to append
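
A short sketch, reusing a couple of rows of an existing dataset so that the pandas class and columns are guaranteed to match

>>> old = s.data['cations/residue1']
>>> s.data.append('cations/residue1', old.tail(2))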

locate(handle)

Get directory location for a stored dataset.

Arguments:
handle

name of data to retrieve location of

Returns:
datadir

absolute path to directory containing stored data

make_filepath(handle, filename)

Return a full path for a file stored in a data directory, whether the file exists or not.

This is useful if preparing plots or other files derived from the dataset, since these can be stored with the data in its own directory. This method does the small but annoying work of generating a full path for the file.

This method doesn’t care whether or not the path exists; it simply returns the path it’s asked to build.

Arguments:
handle

name of dataset file corresponds to

filename

filename of file

Returns:
filepath

absolute path for file
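
For example, assuming the same Sim as in the tutorial above (the filename is arbitrary)

>>> s.data.make_filepath('cations/residue1', 'distance_vs_time.pdf')
'/home/bob/marklar/cations/residue1/distance_vs_time.pdf'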

remove(handle, **kwargs)

Remove a dataset, or some subset of a dataset.

Note: in the case that the whole dataset is removed, the directory containing the dataset file (Data.h5) will NOT be removed if it still contains other files after the dataset file is deleted.

For pandas objects (Series, DataFrame, or Panel) subsets of the whole dataset can be removed using keywords such as start and stop for ranges of rows, and columns for selected columns.

Arguments:
handle

name of dataset to delete

Keywords:
where

conditions for what rows/columns to remove

start

row number to start selection

stop

row number to stop selection

columns

columns to remove
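
A sketch of both modes of use; the exact row-selection semantics follow those of the retrieve keywords below, so treat the start/stop values as illustrative

>>> # drop an entire dataset
>>> s.data.remove('cations/residue3')
>>> # drop only a range of rows from a pandas-backed dataset
>>> s.data.remove('cations/residue1', start=0, stop=1000)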

retrieve(handle, *args, **kwargs)

Retrieve stored data.

The stored data structure is read from disk and returned.

If dataset doesn’t exist, None is returned.

For pandas objects (Series, DataFrame, or Panel) subsets of the whole dataset can be returned using keywords such as start and stop for ranges of rows, and columns for selected columns.

Also for pandas objects, the where keyword takes a string as input and can be used to filter out rows and columns without loading the full object into memory. For example, given a DataFrame with handle 'mydata' and columns (A, B, C, D), one could return all rows for columns A and C for which column D is greater than .3 with:

retrieve('mydata', where='columns=[A,C] & D > .3')

Or, if we wanted all rows with index = 3 (there could be more than one):

retrieve('mydata', where='index = 3')

See pandas.HDFStore.select() for more information.

Arguments:
handle

name of data to retrieve

Keywords:
where

conditions for what rows/columns to return

start

row number to start selection

stop

row number to stop selection

columns

list of columns to return; all columns returned by default

iterator

if True, return an iterator [False]

chunksize

number of rows to include in iteration; implies iterator=True

Returns:
data

stored data; None if nonexistent
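
As a sketch of chunked reading (the chunk handling follows pandas.HDFStore.select, so the exact behavior of the returned iterator is an assumption here)

>>> n_rows = 0
>>> for chunk in s.data.retrieve('cations/residue1', chunksize=100000):
...     n_rows += len(chunk)
...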