This content refers to the previous stable release of PyMVPA.


Module: datasets.base

Inheritance diagram for mvpa.datasets.base:

Dataset container


class mvpa.datasets.base.Dataset(data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)

Bases: object

The Dataset.

This class provides a container to store all necessary data to perform MVPA analyses. These are the data samples, as well as the labels associated with the samples. Additionally, samples can be grouped into chunks.

Groups :
  • Creators: __init__, selectFeatures, selectSamples, applyMapper
  • Mutators: permuteLabels

Important: labels are assumed to be immutable, i.e. no one should modify them externally by accessing indexed items; something like dataset.labels[1] += 100 should not be used. If a label has to be modified, a full copy of the labels should be obtained, operated on, and assigned back to the dataset; otherwise dataset.uniquelabels would not work. The same applies to any other attribute which has a corresponding unique* access property.
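The copy-modify-reassign pattern described above can be sketched with a plain NumPy array standing in for dataset.labels (PyMVPA itself is not required here):

```python
import numpy as np

# Stand-in for dataset.labels: in-place edits such as labels[1] += 100
# would silently invalidate any cached unique-value bookkeeping.
labels = np.array([1, 2, 2, 3])

# Recommended pattern: copy, modify, then assign the whole array back.
new_labels = labels.copy()
new_labels[1] += 100
# dataset.labels = new_labels   # reassignment lets the dataset refresh uniques

print(new_labels.tolist())  # [1, 102, 2, 3]
print(labels.tolist())      # [1, 2, 2, 3] -- original untouched
```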

Initialize dataset instance

There are basically two different ways to create a dataset:

  1. Create a new dataset from samples and sample attributes. In this mode a two-dimensional ndarray has to be passed to the samples keyword argument and the corresponding sample attributes are provided via the labels and chunks arguments.

  2. Copy constructor mode

    The second way is used internally to perform quick copying of datasets, e.g. when performing feature selection. In this mode the two dictionaries (data and dsattr) are required. For performance reasons this mode bypasses most of the sanity checks performed by the previous mode, as data integrity is assumed for internal operations.

  • data (dict) – Dictionary with an arbitrary number of entries. The value for each key in the dict has to be an ndarray with the same length as the number of rows in the samples array. A special entry in this dictionary is ‘samples’, a 2d array (samples x features). A shallow copy is stored in the object.
  • dsattr (dict) – Dictionary of dataset attributes. An arbitrary number of arbitrarily named and typed objects can be stored here. A shallow copy of the dictionary is stored in the object.
  • dtype (type | None) – If None – do not change data type if samples is an ndarray. Otherwise convert samples to dtype.
Keywords :
samples : ndarray

2d array (samples x features)


labels : array or scalar

An array or scalar value defining labels for each sample. Generally labels should be numeric, unless labels_map is used.

labels_map : None or bool or dict

Map original labels onto numeric labels. If True, the mapping is computed if labels are literal. If False, no mapping is computed. If a dict instance is given, the provided mapping is verified and applied. If you want labels_map to be present for already numeric labels, just assign a labels_map dictionary to an existing dataset instance.


chunks : array or scalar

An array or scalar value defining chunks for each sample.

Each of the keyword arguments overwrites what is (or might be) already in the data container.
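A minimal sketch of construction mode 1, using hypothetical toy arrays; the actual Dataset call is shown as a comment since it requires PyMVPA:

```python
import numpy as np

# Hypothetical toy input for mode 1: 4 samples x 3 features.
samples = np.random.randn(4, 3)
labels = np.array([0, 0, 1, 1])   # one label per sample (row)
chunks = np.array([0, 1, 0, 1])   # one chunk id per sample

# The constructor expects one attribute value per samples row:
assert samples.shape[0] == len(labels) == len(chunks)

# With PyMVPA installed this would be (not executed here):
# from mvpa.datasets.base import Dataset
# ds = Dataset(samples=samples, labels=labels, chunks=chunks)
```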

aggregateFeatures(dataset, fx=<function mean>)

Apply a function to each row of the samples matrix of a dataset.

The functor given as fx has to honour an axis keyword argument in the way NumPy uses it (e.g. numpy.mean, numpy.var).

Return type: a new Dataset object with the aggregated feature(s).
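The effect of the aggregation is equivalent to applying fx along the feature axis, collapsing each sample's features to a single value; a plain-NumPy sketch:

```python
import numpy as np

# 3 samples x 4 features; aggregating with a mean collapses each row
# (sample) to one value, as aggregateFeatures does with an fx that
# honours NumPy's axis keyword.
samples = np.array([[1.0, 2.0, 3.0, 4.0],
                    [0.0, 0.0, 0.0, 0.0],
                    [2.0, 2.0, 2.0, 2.0]])
aggregated = np.mean(samples, axis=1)
print(aggregated.tolist())  # [2.5, 0.0, 2.0]
```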
applyMapper(featuresmapper=None, samplesmapper=None, train=True)

Obtain new dataset by applying mappers over features and/or samples.

While featuresmappers leave the sample attributes unchanged, since the number of samples in the dataset is invariant, samplesmappers are also applied to the sample attributes themselves!

Applying a featuresmapper will destroy any feature grouping information.

  • featuresmapper (Mapper) – Mapper to somehow transform each sample’s features
  • samplesmapper (Mapper) – Mapper to transform each feature across samples
  • train (bool) – Flag whether to train the mapper with this dataset before applying it.
TODO: selectFeatures is pretty much
coarsenChunks(source, nchunks=4)

Change chunking of the dataset

Group chunks to match the desired number of chunks. Makes sense if originally there was no strong grouping into chunks, or if each sample was independent and thus belonged to its own chunk.

  • source (Dataset or list of chunk ids) – dataset or list of chunk ids to operate on. If Dataset, then its chunks get modified
  • nchunks (int) – desired number of chunks
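One way the regrouping can be pictured with plain NumPy (the coarsen helper below is a hypothetical stand-in, not the PyMVPA implementation):

```python
import numpy as np

def coarsen(chunks, nchunks=4):
    """Map original chunk ids onto nchunks roughly equal groups of
    consecutive ids (illustrative sketch only)."""
    uniques = np.unique(chunks)
    mapping = {}
    for group_id, group in enumerate(np.array_split(uniques, nchunks)):
        for u in group:
            mapping[u] = group_id
    return np.array([mapping[c] for c in chunks])

chunks = np.arange(8)               # 8 original chunks, one sample each
print(coarsen(chunks, 4).tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```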

Returns a boolean mask with all features in ids selected.

Parameters: ids (list or 1d array) – Ids of the features to be selected.
Return type: ndarray
Returns: All selected features are set to True; False otherwise.

Returns feature ids corresponding to non-zero elements in the mask.

Parameters: mask (1d ndarray) – Feature mask.
Return type: ndarray
Returns: Ids of non-zero (non-False) mask elements.
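The two conversions above can be sketched with plain NumPy (semantics only, not the PyMVPA implementation):

```python
import numpy as np

nfeatures = 6
ids = [1, 3, 4]

# ids -> boolean feature mask (the ids-to-mask direction)
mask = np.zeros(nfeatures, dtype=bool)
mask[ids] = True
print(mask.tolist())       # [False, True, False, True, True, False]

# mask -> ids of non-zero elements (the mask-to-ids direction)
recovered = mask.nonzero()[0]
print(recovered.tolist())  # [1, 3, 4]
```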

Create a copy (clone) of the dataset by fully copying the current one.

Keywords :
deep : bool

The deep flag is provided to __init__ for copy_{samples,data,dsattr}. By default a full copy is done.


Assign definition to featuregroups

XXX Feature-groups support was never finished to the point of being useful.

detrend(dataset, perchunk=False, model='linear', polyord=None, opt_reg=None)

Given a dataset, detrend the data in-place, either entirely or per chunk.

  • dataset (Dataset) – dataset to operate on
  • perchunk (bool) – whether to operate on the whole dataset at once or on each chunk separately
  • model – Type of detrending model to run. If ‘linear’ or ‘constant’, scipy.signal.detrend is used to perform a linear or demeaning detrend. Polynomial detrending is activated when ‘regress’ is used or when polyord or opt_reg are specified.
  • polyord (int or list) – Order of the Legendre polynomial to remove from the data. This will remove every polynomial up to and including the provided value. For example, 3 will remove 0th, 1st, 2nd, and 3rd order polynomials from the data. N.B.: The 0th polynomial is the baseline shift, the 1st is the linear trend. If you specify a single int and perchunk is True, then this value is used for each chunk. You can also specify a different polyord value for each chunk by providing a list or ndarray of polyord values the length of the number of chunks.
  • opt_reg (ndarray) – Optional ndarray of additional information to regress out from the dataset. One example would be to regress out motion parameters. As with the data, time is on the first axis.
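A plain-NumPy sketch of what a linear, whole-dataset detrend amounts to (the linear_detrend helper is hypothetical; the actual model='linear' path uses scipy.signal.detrend):

```python
import numpy as np

def linear_detrend(data):
    """Remove a least-squares linear trend from each column (feature);
    illustrative stand-in for model='linear', whole-dataset case."""
    t = np.arange(data.shape[0])
    out = data.copy()
    for col in range(data.shape[1]):
        slope, intercept = np.polyfit(t, data[:, col], 1)
        out[:, col] -= slope * t + intercept
    return out

# One feature carrying a pure linear trend detrends to (numerically) zero.
data = np.arange(10, dtype=float).reshape(10, 1) * 2.0 + 5.0
detrended = linear_detrend(data)
print(np.allclose(detrended, 0.0))  # True
```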

Stored labels map (if any)


Number of features per pattern.


Currently available number of patterns.


Select a random set of samples.

If ‘nperlabel’ is an integer value, the specified number of samples is randomly chosen from the group of samples sharing a unique label value (total number of selected samples: nperlabel x len(uniquelabels)).

If ‘nperlabel’ is a list, its length has to match the number of unique label values. In this case ‘nperlabel’ specifies the number of samples that shall be selected from the samples with the corresponding label.

The method returns a Dataset object containing the selected samples.
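The integer-nperlabel case can be sketched with plain NumPy (a hypothetical stand-in, not the PyMVPA implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
labels = np.array([0, 0, 0, 1, 1, 1, 1])
nperlabel = 2

# For each unique label, pick nperlabel sample indices at random.
selected = np.concatenate(
    [rng.choice(np.where(labels == u)[0], nperlabel, replace=False)
     for u in np.unique(labels)])

print(len(selected))                      # nperlabel * len(uniquelabels) = 4
print(sorted(labels[selected].tolist()))  # [0, 0, 1, 1]
```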


Returns an array with the number of samples per label in each chunk.

Array shape is (chunks x labels).

Parameters: dataset (Dataset) – Source dataset.

To verify whether the dataset is in the same state as when something else was done, e.g. whether a classifier was trained on the same dataset as the one in question.

idsonboundaries(prior=0, post=0, attributes_to_track=['labels', 'chunks'], affected_labels=None, revert=False)

Find samples which are on the boundaries of the blocks

Such samples might need to be removed. By default (with prior=0, post=0) ids of the first samples in a ‘block’ are reported

  • prior (int) – how many samples prior to transition sample to include
  • post (int) – how many samples post the transition sample to include
  • attributes_to_track (list of basestring) – which attributes to track to decide on the boundary condition
  • affected_labels (list of basestring) – for which labels to perform selection. If None - for all
  • revert (bool) – whether to revert the meaning and provide ids of samples which are found not to be boundary samples
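The default prior=0, post=0 case can be sketched with plain NumPy: a boundary sample is the first sample after any change in a tracked attribute (and index 0 always starts a block):

```python
import numpy as np

# Samples labelled in blocks; the first sample of each block is a boundary.
labels = np.array([1, 1, 1, 2, 2, 3, 3, 3])

# A change in the tracked attribute marks a transition.
transitions = np.where(np.diff(labels) != 0)[0] + 1
boundary_ids = np.concatenate(([0], transitions))
print(boundary_ids.tolist())  # [0, 3, 5]
```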
index(*args, **kwargs)

Universal indexer to obtain indexes of interesting samples/features. See .select() for more information

Return :tuple of (samples indexes, features indexes). Each item could be also None, if no selection on samples or features was requested (to discriminate between no selected items, and no selections)

Stored labels map (if any)


Number of features per pattern.


Currently available number of patterns.

permuteLabels(status, perchunk=True, assure_permute=False)

Permute the labels.

TODO: rename status into something closer in semantics.

  • status (bool) – If called with status set to True, the labels are permuted among all samples. If ‘status’ is False, the original labels are restored.
  • perchunk (bool) – If True, permutation is limited to samples sharing the same chunk value. Therefore only the association of a certain sample with a label is permuted, while the absolute number of occurrences of each label value within a certain chunk is kept constant.
  • assure_permute (bool) – If True, assures that labels are actually permuted, i.e. each one differs from the original.
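The perchunk=True behaviour can be sketched with plain NumPy: labels are shuffled only within each chunk, so per-chunk label counts stay constant:

```python
import numpy as np

rng = np.random.RandomState(42)
labels = np.array([0, 1, 0, 1])
chunks = np.array([0, 0, 1, 1])

# Shuffle labels only among samples sharing a chunk value, preserving the
# number of occurrences of each label within every chunk.
permuted = labels.copy()
for c in np.unique(chunks):
    idx = np.where(chunks == c)[0]
    permuted[idx] = labels[rng.permutation(idx)]

print(sorted(permuted[chunks == 0].tolist()))  # [0, 1]
print(sorted(permuted[chunks == 1].tolist()))  # [0, 1]
```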

Returns a new dataset with all invariant features removed.

select(*args, **kwargs)

Universal selector

WARNING: if you need to select duplicate samples (e.g. samples=[5,5]), or the order of selected samples or features is important and must not be sorted (e.g. samples=[3,2,1]), please use the selectFeatures or selectSamples functions directly.


Mimic plain selectSamples:[1,2,3])

Mimic plain selectFeatures:, [1,2,3])'all', [1,2,3])
dataset[:, [1,2,3]]

Mixed (select features and samples):[1,2,3], [1, 2])
dataset[[1,2,3], [1, 2]]

Select samples matching some attributes:[1,2], chunks=[2,4])'labels', [1,2], 'chunks', [2,4])
dataset['labels', [1,2], 'chunks', [2,4]]

Mixed – out of first 100 samples, select only those with labels 1 or 2 and belonging to chunks 2 or 4, and select features 2 and 3:,100), [2,3], labels=[1,2], chunks=[2,4])
dataset[:100, [2,3], 'labels', [1,2], 'chunks', [2,4]]
selectFeatures(ids=None, sort=True, groups=None)

Select a number of features from the current set.

  • ids – iterable container to select ids
  • sort (bool) – whether to sort ids. Order matters: selectFeatures assumes incremental order, and if that is not the case, non-optimized code will verify the order and sort.
Returns a new Dataset object with a copy of corresponding features from the original samples array.

WARNING: The order of ids determines the order of features in the returned dataset. This might be useful sometimes, but can also cause major headaches! The order is verified when running non-optimized code (if __debug__).


Choose a subset of samples defined by samples IDs.

Returns a new dataset object containing the selected sample subset.

TODO: yoh, we might need to sort the mask if the mask is a list of ids and is not ordered. Clarify with Michael what is our intent here!


Set labels map.

Checks the validity of the mapping – values should cover all existing labels in the dataset.


Set the data type of the samples array.

summary(uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)

String summary over the object

  • uniq (bool) – Include summary over data attributes which have corresponding unique* properties
  • idhash (bool) – Include idhash value for dataset and samples
  • stats (bool) – Include some basic statistics (mean, std, var) over dataset samples
  • lstats (bool) – Include statistics on chunks/labels
  • maxc (int) – Maximal number of chunks for which to provide details on labels/chunks
  • maxl (int) – Maximal number of labels for which to provide details on labels/chunks
summary_labels(maxc=30, maxl=20)

Provide summary statistics over the labels and chunks

  • maxc (int) – Maximal number of chunks for which to provide details
  • maxl (int) – Maximal number of labels for which to provide details
where(*args, **kwargs)

Obtain indexes of interesting samples/features. See select() for more information

XXX somewhat obsoletes idsby...

zscore(dataset, mean=None, std=None, perchunk=True, baselinelabels=None, pervoxel=True, targetdtype='float64')

Z-Score the samples of a Dataset (in-place).

mean and std can be used to pass custom values to the z-scoring. Both may be scalars or arrays.

All computations are done in place. Data upcasting into targetdtype is done automatically if necessary.

If baselinelabels are provided and mean or std aren’t, the corresponding measure is computed based only on samples with labels in baselinelabels.

If perchunk is True, samples within the same chunk are z-scored independently of samples from other chunks, i.e. mean and standard deviation are calculated individually per chunk.
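A plain-NumPy sketch of the perchunk=True, pervoxel=True case (the zscore_perchunk helper is hypothetical, not the PyMVPA implementation):

```python
import numpy as np

def zscore_perchunk(samples, chunks):
    """Z-score each chunk's samples with that chunk's own mean and
    standard deviation, per feature (illustrative sketch)."""
    out = samples.astype('float64').copy()
    for c in np.unique(chunks):
        idx = chunks == c
        out[idx] = (out[idx] - out[idx].mean(axis=0)) / out[idx].std(axis=0)
    return out

samples = np.array([[1.0], [3.0], [10.0], [20.0]])
chunks = np.array([0, 0, 1, 1])
z = zscore_perchunk(samples, chunks)
print(np.allclose(z.ravel(), [-1.0, 1.0, -1.0, 1.0]))  # True
```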


Decorator to easily bind functions to a Dataset class