Datasets and Containers¶
MDSynthesis is not an analysis code. On its own, it does not produce output data given raw simulation data as input. Its scope is limited to the tedious but essential task of data management and storage. It is intended to bring value to analysis results by making them easily accessible, both now and later.
As such, the basic functionality of MDSynthesis is condensed into only two objects, sometimes referred to as Containers in the documentation. These are the Sim and Group objects.
In brief, a Sim is designed to manage and give access to the data corresponding to a single simulation (the raw trajectory or trajectories, as well as analysis results); a Group gives access to any number of Sim or Group objects it has as members (including, perhaps, itself), and can store analysis results that pertain to these members collectively. Both types of Container store their underlying data persistently to disk on the fly. The file locking needed for each transaction is handled automatically, so more than one python process can work with any number of instances of the same Container at the same time.
Warning
File locking is generally process safe, but not thread safe. Don't modify Container elements from multiple threads at the same time. Multiprocessing, however, should work just fine.
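There is no need to manage these locks yourself, but as a rough illustration of the kind of advisory locking involved (a hypothetical, Unix-only sketch using fcntl, not MDSynthesis's actual implementation):

```python
# Hypothetical sketch of advisory file locking between processes.
# MDSynthesis handles this internally; this only illustrates the mechanism.
import fcntl
import tempfile

def locked_append(path, text):
    """Append text to a file while holding an exclusive advisory lock."""
    with open(path, 'a') as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until other holders release
        try:
            f.write(text)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

statefile = tempfile.NamedTemporaryFile(delete=False)
statefile.close()
locked_append(statefile.name, 'TIP4P\n')
locked_append(statefile.name, 'production\n')
with open(statefile.name) as f:
    print(f.read().splitlines())   # ['TIP4P', 'production']
```

Each transaction acquires the lock, performs its read or write, and releases it, which is why several processes can safely share one state file.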
Persistence as a feature¶
Containers store their data as directory structures in the file system. Generating a new Sim, for example, with the following
>>> # python session 1
>>> import mdsynthesis as mds
>>> s = mds.Sim('marklar')
creates a directory called marklar in the current working directory. At the moment it contains a single file
> # shell
> ls marklar
Sim.2b4b5800-48a7-4814-ba6d-1e631a09a199.h5
The name of this file includes the type of Container (Sim) it corresponds to, as well as the uuid of the Container, which is its unique identifier.
This is the state file containing all the information needed to regenerate an
identical instance of this Sim. In fact, we can open a separate python
session (go ahead!) and regenerate this Sim immediately there
>>> # python session 2
>>> import mdsynthesis as mds
>>> s = mds.Sim('marklar')
Making a modification to the Sim in one session, perhaps by adding a tag, will be reflected in the Sim in the other session
>>> # python session 1
>>> s.tags.add('TIP4P')
>>> # python session 2
>>> s.tags
<Tags(['TIP4P'])>
This is because both objects pull their identifying information from the same file on disk; they store almost nothing in memory.
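As a toy illustration of this pattern (hypothetical code, not MDSynthesis internals), two objects backed by the same file see each other's changes immediately because every operation goes through the disk:

```python
# Toy stand-in for disk-backed state: nothing is cached in memory,
# so two instances sharing one file always agree.
import json
import os
import tempfile

class DiskBackedTags:
    def __init__(self, statefile):
        self.statefile = statefile
        if not os.path.exists(statefile):
            with open(statefile, 'w') as f:
                json.dump([], f)

    def add(self, tag):
        tags = self.list()                  # read current state from disk
        tags.append(tag)
        with open(self.statefile, 'w') as f:
            json.dump(tags, f)              # write it straight back

    def list(self):
        with open(self.statefile) as f:
            return json.load(f)

state = os.path.join(tempfile.mkdtemp(), 'state.json')
s1 = DiskBackedTags(state)   # "session 1"
s2 = DiskBackedTags(state)   # "session 2"
s1.add('TIP4P')
print(s2.list())             # ['TIP4P'] -- s2 sees s1's change
```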
Note
The uuid of the Sim in this example will certainly differ from any Sims you generate; it is what differentiates Sims from each other. Unexpected and broken behavior will result from changing the names of state files!
Storing arbitrary datasets¶
More on things like tags later; what we really care about is storing datasets that are potentially large and time-consuming to produce. Using our Sim marklar as the example here, say we have generated a numpy array of dimension (10^6, 3) that gives the minimum distance between the sidechains of three residues and those of a fourth for each frame in a trajectory
>>> a.shape
(1000000, 3)
We can store this easily
>>> s.data.add('distances', a)
>>> s.data
<Data(['distances'])>
and recall it
>>> s.data['distances'].shape
(1000000, 3)
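Under the hood the array is written to disk; as a rough stand-in for this round trip (using numpy's own .npy format rather than the HDF5 backend MDSynthesis actually uses), the idea is:

```python
# Illustrative round trip of a numpy array to disk and back.
# Stand-in only: MDSynthesis serializes to HDF5, not to .npy files.
import os
import tempfile
import numpy as np

a = np.zeros((1000, 3))              # smaller stand-in for the (10^6, 3) array
datadir = tempfile.mkdtemp()
target = os.path.join(datadir, 'distances.npy')

np.save(target, a)                   # analogous to s.data.add('distances', a)
recalled = np.load(target)           # analogous to s.data['distances']
print(recalled.shape)                # (1000, 3)
```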
Looking at the contents of the directory marklar, we see it has a new subdirectory corresponding to the name of our stored dataset
> # shell
> ls marklar
distances Sim.2b4b5800-48a7-4814-ba6d-1e631a09a199.h5
which has its own contents
> ls marklar/distances
npData.h5
This is the data we stored, serialized to disk in the efficient HDF5 data format. Containers will also store pandas objects using this format. For other data structures, the Container will pickle them if it can.
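The pickle fallback can be sketched as follows (hypothetical helper functions and illustrative filenames, not the real API):

```python
# Hypothetical sketch of the pickle fallback for objects that are not
# pandas or numpy structures. Names and filenames are illustrative.
import os
import pickle
import tempfile

def store(datadir, obj):
    """Serialize an arbitrary picklable object into a dataset directory."""
    path = os.path.join(datadir, 'pyData.pkl')   # illustrative filename
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return path

def recall(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

datadir = tempfile.mkdtemp()
path = store(datadir, {'residues': [1, 2, 3], 'cutoff': 4.5})
print(recall(path))   # {'residues': [1, 2, 3], 'cutoff': 4.5}
```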
Datasets can be nested however you like. For example, say we had several pandas DataFrames each giving the distance with time of each cation in the simulation with respect to some residue of interest on our protein. We could just as well make it clear to ourselves that these are all similar datasets by grouping them together
>>> s.data.add('cations/residue1', df1)
>>> s.data.add('cations/residue2', df2)
>>> # we can also use setitem syntax
>>> s.data['cations/residue3'] = df3
>>> s.data
<Data(['cations/residue1', 'cations/residue2', 'cations/residue3',
'distances'])>
and their locations in the filesystem reflect this structure.
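The mapping from a nested handle to its directory can be sketched like this (illustrative only; MDSynthesis performs this internally):

```python
# Illustrative mapping of a nested handle to a directory tree.
import os
import tempfile

def dataset_dir(container_dir, handle):
    """Map a handle like 'cations/residue1' to its on-disk directory."""
    datadir = os.path.join(container_dir, *handle.split('/'))
    os.makedirs(datadir)
    return datadir

marklar = tempfile.mkdtemp()
d = dataset_dir(marklar, 'cations/residue1')
print(os.path.relpath(d, marklar))   # cations/residue1
```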
Minimal blobs¶
Individual datasets get their own place in the filesystem instead of all being shoved into a single file on disk. This is by design: it generally means better performance, since there is less waiting for file locks to release from other Container instances. It also gives a place to put other files related to the dataset itself, such as figures produced from it.
You can get the location on disk of a dataset with
>>> s.data.locate('cations/residue1')
'/home/bob/marklar/cations/residue1'
which is particularly useful for outputting figures.
Another advantage of organizing Containers at the filesystem level is that datasets can be handled with ordinary filesystem tools. Removing a dataset with a
> # shell
> rm -r marklar/cations/residue2
is immediately reflected by the Container
>>> s.data
<Data(['cations/residue1', 'cations/residue3', 'distances'])>
Datasets can likewise be moved within the Container’s directory tree and they will still be found, with names matching their location relative to the state file.
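A sketch of this discovery-by-walking behavior (illustrative, not the actual implementation): the dataset list is simply whatever directories currently contain a data file, so deletions and moves are picked up on the next look:

```python
# Illustrative dataset discovery by walking the Container's tree.
import os
import shutil
import tempfile

DATAFILE = 'npData.h5'   # marker file, as in the numpy example above

def discover(container_dir):
    found = []
    for root, dirs, files in os.walk(container_dir):
        if DATAFILE in files:
            found.append(os.path.relpath(root, container_dir))
    return sorted(found)

marklar = tempfile.mkdtemp()
for handle in ('distances', 'cations/residue1', 'cations/residue2'):
    d = os.path.join(marklar, handle)
    os.makedirs(d)
    open(os.path.join(d, DATAFILE), 'w').close()

print(discover(marklar))   # ['cations/residue1', 'cations/residue2', 'distances']
shutil.rmtree(os.path.join(marklar, 'cations/residue2'))
print(discover(marklar))   # ['cations/residue1', 'distances']
```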
Reference: Data¶
The class mdsynthesis.core.aggregators.Data is the interface used by Containers to access their stored datasets. It is not intended to be used on its own, but is shown here to give a detailed view of its methods.
class mdsynthesis.core.aggregators.Data(container, containerfile, logger)¶
Interface to stored data.
add(handle, *args, **kwargs)¶
Store data in Container.
A data instance can be a pandas object (Series, DataFrame, Panel), a numpy array, or a pickleable python object. If the dataset doesn't exist, it is added. If a dataset already exists for the given handle, it is replaced.
Arguments:
- handle: name given to data; needed for retrieval
- data: data structure to store
append(handle, *args, **kwargs)¶
Append rows to an existing dataset.
The object must be of the same pandas class (Series, DataFrame, Panel) as the existing dataset, and it must have exactly the same columns (names included).
Arguments:
- handle: name of data to append to
- data: data to append
locate(handle)¶
Get directory location for a stored dataset.
Arguments:
- handle: name of data to retrieve location of
Returns:
- datadir: absolute path to directory containing stored data
make_filepath(handle, filename)¶
Return a full path for a file stored in a data directory, whether the file exists or not.
This is useful when preparing plots or other files derived from the dataset, since these can be stored with the data in its own directory. This method does the small but annoying work of generating a full path for the file.
This method doesn't care whether or not the path exists; it simply returns the path it's asked to build.
Arguments:
- handle: name of dataset the file corresponds to
- filename: filename of file
Returns:
- filepath: absolute path for file
remove(handle, **kwargs)¶
Remove a dataset, or some subset of a dataset.
Note: in the case that the whole dataset is removed, the directory containing the dataset file (Data.h5) will NOT be removed if it still contains file(s) after the removal of the dataset file.
For pandas objects (Series, DataFrame, or Panel), subsets of the whole dataset can be removed using keywords such as start and stop for ranges of rows, and columns for selected columns.
Arguments:
- handle: name of dataset to delete
Keywords:
- where: conditions for what rows/columns to remove
- start: row number to start selection
- stop: row number to stop selection
- columns: columns to remove
retrieve(handle, *args, **kwargs)¶
Retrieve stored data.
The stored data structure is read from disk and returned. If the dataset doesn't exist, None is returned.
For pandas objects (Series, DataFrame, or Panel), subsets of the whole dataset can be returned using keywords such as start and stop for ranges of rows, and columns for selected columns.
Also for pandas objects, the where keyword takes a string as input and can be used to filter out rows and columns without loading the full object into memory. For example, given a DataFrame with handle 'mydata' and columns (A, B, C, D), one could return all rows for columns A and C for which column D is greater than .3 with:
retrieve('mydata', where='columns=[A,C] & D > .3')
Or, if we wanted all rows with index = 3 (there could be more than one):
retrieve('mydata', where='index = 3')
See pandas.HDFStore.select() for more information.
Arguments:
- handle: name of data to retrieve
Keywords:
- where: conditions for what rows/columns to return
- start: row number to start selection
- stop: row number to stop selection
- columns: list of columns to return; all columns returned by default
- iterator: if True, return an iterator [False]
- chunksize: number of rows to include in iteration; implies iterator=True
Returns:
- data: stored data; None if nonexistent