
Module: datasets.base

Inheritance diagram for mvpa.datasets.base (figure omitted): Dataset derives directly from object.

Dataset container

Dataset

class mvpa.datasets.base.Dataset(data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)

Bases: object

The Dataset.

This class provides a container to store all necessary data to perform MVPA analyses. These are the data samples, as well as the labels associated with the samples. Additionally, samples can be grouped into chunks.

Groups :
  • Creators: __init__, selectFeatures, selectSamples, applyMapper
  • Mutators: permuteLabels

Important: labels are assumed to be immutable, i.e. nobody should modify them externally by accessing indexed items; something like dataset.labels[1] += 100 must not be used. If a label has to be modified, a full copy of the labels should be obtained, operated on, and assigned back to the dataset; otherwise dataset.uniquelabels would not work. The same applies to any other attribute that has a corresponding unique* access property.

Initialize dataset instance

There are basically two different ways to create a dataset:

  1. Create a new dataset from samples and sample attributes. In this mode a two-dimensional ndarray has to be passed to the samples keyword argument, and the corresponding sample attributes are provided via the labels and chunks arguments (see the sketch after the keyword list below).

  2. Copy constructor mode

    The second way is used internally to perform quick copying of datasets, e.g. when performing feature selection. In this mode the two dictionaries (data and dsattr) are required. For performance reasons this mode bypasses most of the sanity checks performed by the first mode, since data integrity is assumed for internal operations.

Parameters:
  • data (dict) – Dictionary with an arbitrary number of entries. The value for each key in the dict has to be an ndarray with the same length as the number of rows in the samples array. A special entry in this dictionary is ‘samples’, a 2d array (samples x features). A shallow copy is stored in the object.
  • dsattr (dict) – Dictionary of dataset attributes. An arbitrary number of arbitrarily named and typed objects can be stored here. A shallow copy of the dictionary is stored in the object.
  • dtype (type | None) – If None – do not change data type if samples is an ndarray. Otherwise convert samples to dtype.
Keywords :
samples : ndarray

2d array (samples x features)

labels

An array or scalar value defining the label for each sample. Generally labels should be numeric, unless labels_map is used.

labels_map : None or bool or dict

Map original labels onto numeric labels. If True, the mapping is computed if labels are literal. If False, no mapping is computed. If a dict instance, the provided mapping is verified and applied. If you want labels_map to simply be present for already numeric labels, just assign a labels_map dictionary to an existing dataset instance.

chunks

An array or scalar value defining the chunk for each sample.

Each of the keyword arguments overwrites what is (or might be) already in the data container.
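
To make the first mode concrete, here is a minimal sketch (toy data; it uses only the constructor arguments documented above):

import numpy as N
from mvpa.datasets.base import Dataset

# 6 samples x 4 features; one label and one chunk id per sample
ds = Dataset(samples=N.random.normal(size=(6, 4)),
             labels=[0, 1, 0, 1, 0, 1],
             chunks=[0, 0, 1, 1, 2, 2])

# ds.nsamples -> 6, ds.nfeatures -> 4
# ds.uniquelabels -> array([0, 1]); ds.uniquechunks -> array([0, 1, 2])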

Shortcut properties: C, I, L, S, UC, UL (single-letter aliases for the corresponding sample attributes: C for chunks, L for labels, S for samples, UC for uniquechunks, UL for uniquelabels, I for origids)
aggregateFeatures(dataset, fx=numpy.mean)

Apply a function to each row of the samples matrix of a dataset.

The functor given as fx has to honour an axis keyword argument in the way that NumPy uses it (e.g. numpy.mean, numpy.var).

Return type: a new Dataset object with the aggregated feature(s).
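
For instance, a minimal sketch collapsing all features of each sample into their mean (toy data):

import numpy as N
from mvpa.datasets.base import Dataset

ds = Dataset(samples=N.random.normal(size=(5, 10)), labels=0, chunks=0)

# fx must accept an `axis` keyword, as numpy.mean and numpy.var do
ds_mean = ds.aggregateFeatures(fx=N.mean)
# ds_mean.nfeatures -> 1 (one aggregated feature per sample)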
applyMapper(featuresmapper=None, samplesmapper=None, train=True)

Obtain new dataset by applying mappers over features and/or samples.

While featuresmappers leave the sample attributes unchanged, since the number of samples in the dataset is invariant, samplesmappers are also applied to the sample attributes themselves!

Applying a featuresmapper will destroy any feature grouping information.

Parameters:
  • featuresmapper (Mapper) – Mapper to somehow transform each sample’s features
  • samplesmapper (Mapper) – Mapper to transform each feature across samples
  • train (bool) – Flag whether to train the mapper with this dataset before applying it.
TODO: selectFeatures is pretty much applyMapper(featuresmapper=MaskMapper(...))
chunks
coarsenChunks(source, nchunks=4)

Change chunking of the dataset

Group chunks together to match the desired number of chunks. This makes sense if originally there was no strong grouping into chunks, or if each sample was independent and thus belonged to its own chunk.

Parameters:
  • source (Dataset or list of chunk ids) – dataset or list of chunk ids to operate on. If Dataset, then its chunks get modified
  • nchunks (int) – desired number of chunks
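
A small sketch (toy data; when called as a bound method, the dataset itself is the source and its chunks are modified in place):

import numpy as N
from mvpa.datasets.base import Dataset

# each sample starts out in its own chunk
ds = Dataset(samples=N.random.normal(size=(12, 2)),
             labels=0, chunks=range(12))

# regroup the 12 original chunks into 4
ds.coarsenChunks(nchunks=4)
# len(ds.uniquechunks) -> 4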
convertFeatureIds2FeatureMask(ids)

Returns a boolean mask with all features in ids selected.

Parameters: ids (list or 1d array) – Ids of the features to be selected.
Return type: ndarray
Returns: All selected features are set to True; False otherwise.
convertFeatureMask2FeatureIds(mask)

Returns feature ids corresponding to non-zero elements in the mask.

Parameters: mask (1d ndarray) – Feature mask.
Return type: ndarray
Returns: Ids of non-zero (non-False) mask elements.
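
A small sketch of the round trip between feature ids and a boolean mask (toy data):

import numpy as N
from mvpa.datasets.base import Dataset

ds = Dataset(samples=N.random.normal(size=(4, 5)), labels=0, chunks=0)

mask = ds.convertFeatureIds2FeatureMask([1, 3])
# mask -> array([False, True, False, True, False])
ids = ds.convertFeatureMask2FeatureIds(mask)
# ids -> array([1, 3])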
copy(deep=True)

Create a copy (clone) of the dataset by fully copying the current one.

Keywords :
deep : bool

The deep flag is passed to __init__ as copy_{samples,data,dsattr}. By default a full copy is made.

defineFeatureGroups(definition)

Assign definition to featuregroups

XXX Feature-groups support was never finished to the point of being useful

detrend(dataset, perchunk=False, model='linear', polyord=None, opt_reg=None)

Given a dataset, detrend the data in-place, either for the entire dataset or separately for each chunk

Parameters:
  • dataset (Dataset) – dataset to operate on
  • perchunk (bool) – whether to operate on the whole dataset at once or on each chunk separately
  • model – Type of detrending model to run. If ‘linear’ or ‘constant’, scipy.signal.detrend is used to perform a linear or demeaning detrend. Polynomial detrending is activated when ‘regress’ is used or when polyord or opt_reg are specified.
  • polyord (int or list) – Order of the Legendre polynomial to remove from the data. This will remove every polynomial up to and including the provided value. For example, 3 will remove 0th, 1st, 2nd, and 3rd order polynomials from the data. N.B.: The 0th polynomial is the baseline shift, the 1st is the linear trend. If you specify a single int and perchunk is True, then this value is used for each chunk. You can also specify a different polyord value for each chunk by providing a list or ndarray of polyord values the length of the number of chunks.
  • opt_reg (ndarray) – Optional ndarray of additional information to regress out from the dataset. One example would be to regress out motion parameters. As with the data, time is on the first axis.
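
A minimal sketch of per-chunk polynomial detrending (toy data; only the documented arguments are used):

import numpy as N
from mvpa.datasets.base import Dataset

# 20 time points x 2 features, split into two chunks (e.g. two runs)
ds = Dataset(samples=N.random.normal(size=(20, 2)),
             labels=0,
             chunks=[0] * 10 + [1] * 10)

# remove baseline shift and linear trend within each chunk
# (polyord=1 removes the 0th and 1st order Legendre polynomials)
ds.detrend(perchunk=True, polyord=1)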
getLabelsMap()

Stored labels map (if any)

getNFeatures()

Number of features per pattern.

getNSamples()

Currently available number of patterns.

getRandomSamples(nperlabel)

Select a random set of samples.

If ‘nperlabel’ is an integer value, the specified number of samples is randomly chosen from the group of samples sharing a unique label value (total number of selected samples: nperlabel x len(uniquelabels)).

If ‘nperlabel’ is a list, its length has to match the number of unique label values. In this case ‘nperlabel’ specifies the number of samples that shall be selected from the samples with the corresponding label.

The method returns a Dataset object containing the selected samples.
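
For example (toy data; both documented call forms):

import numpy as N
from mvpa.datasets.base import Dataset

ds = Dataset(samples=N.random.normal(size=(10, 2)),
             labels=[0] * 5 + [1] * 5, chunks=0)

sub = ds.getRandomSamples(2)        # 2 per label -> 4 samples in total
sub = ds.getRandomSamples([2, 3])   # 2 with label 0, 3 with label 1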

getSamplesPerChunkLabel(dataset)

Returns an array with the number of samples per label in each chunk.

Array shape is (chunks x labels).

Parameters:dataset (Dataset) – Source dataset.
idhash

To verify whether the dataset is in the same state as when something else was done with it

For example, whether a classifier was trained on this very dataset

idsbychunks(x)
idsbylabels(x)
idsonboundaries(prior=0, post=0, attributes_to_track=['labels', 'chunks'], affected_labels=None, revert=False)

Find samples which are on the boundaries of the blocks

Such samples might need to be removed. By default (with prior=0, post=0) the ids of the first samples in each ‘block’ are reported

Parameters:
  • prior (int) – how many samples prior to transition sample to include
  • post (int) – how many samples post the transition sample to include
  • attributes_to_track (list of basestring) – which attributes to track to decide on the boundary condition
  • affected_labels (list of basestring) – for which labels to perform selection. If None - for all
  • revert (bool) – whether to invert the meaning and provide ids of samples which are found not to be boundary samples
index(*args, **kwargs)

Universal indexer to obtain indexes of interesting samples/features. See .select() for more information

Returns: tuple of (sample indexes, feature indexes). Either item can also be None, if no selection on samples or features was requested (to discriminate between no selected items and no selection at all)
labels
labels_map

Stored labels map (if any)

nfeatures

Number of features per pattern.

nsamples

Currently available number of patterns.

origids
permuteLabels(status, perchunk=True, assure_permute=False)

Permute the labels.

TODO: rename status into something closer in semantics.

Parameters:
  • status (bool) – If True, the labels are permuted among all samples. If False, the original labels are restored.
  • perchunk (bool) – If True, permutation is limited to samples sharing the same chunk value. Therefore only the association of a certain sample with a label is permuted, while the absolute number of occurrences of each label value within a certain chunk is kept constant.
  • assure_permute (bool) – If True, assures that the labels are actually permuted, i.e. that the permuted labeling differs from the original one.
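
A sketch of the typical permute-then-restore pattern, e.g. for building a null distribution (toy data):

import numpy as N
from mvpa.datasets.base import Dataset

ds = Dataset(samples=N.random.normal(size=(8, 3)),
             labels=[0, 1] * 4,
             chunks=[0] * 4 + [1] * 4)

ds.permuteLabels(True, perchunk=True)   # shuffle labels within each chunk
# ... train/test a classifier on the permuted labels ...
ds.permuteLabels(False)                 # restore the original labels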
removeInvariantFeatures(dataset)

Returns a new dataset with all invariant features removed.

samples
samplesperchunk
samplesperlabel
select(*args, **kwargs)

Universal selector

WARNING: if you need to select duplicate samples (e.g. samples=[5,5]), or the order of selected samples or features matters and is not ascending (e.g. samples=[3,2,1]), please use the selectFeatures or selectSamples functions directly

Examples:

Mimic plain selectSamples:

dataset.select([1,2,3])
dataset[[1,2,3]]

Mimic plain selectFeatures:

dataset.select(slice(None), [1,2,3])
dataset.select('all', [1,2,3])
dataset[:, [1,2,3]]

Mixed (select features and samples):

dataset.select([1,2,3], [1, 2])
dataset[[1,2,3], [1, 2]]

Select samples matching some attributes:

dataset.select(labels=[1,2], chunks=[2,4])
dataset.select('labels', [1,2], 'chunks', [2,4])
dataset['labels', [1,2], 'chunks', [2,4]]

Mixed – out of first 100 samples, select only those with labels 1 or 2 and belonging to chunks 2 or 4, and select features 2 and 3:

dataset.select(slice(0,100), [2,3], labels=[1,2], chunks=[2,4])
dataset[:100, [2,3], 'labels', [1,2], 'chunks', [2,4]]
selectFeatures(ids=None, sort=True, groups=None)

Select a number of features from the current set.

Parameters:
  • ids – iterable container of ids of the features to select
  • sort (bool) – whether to sort ids. Order matters, and selectFeatures assumes increasing order; if the ids are not ordered, selectFeatures verifies the order and sorts in non-optimized code.
Returns a new Dataset object with a copy of corresponding features
from the original samples array.

WARNING: The order of ids determines the order of features in the returned dataset. This might be useful sometimes, but can also cause major headaches! The order is verified when running in non-optimized code (if __debug__).

selectSamples(ids)

Choose a subset of samples defined by samples IDs.

Returns a new dataset object containing the selected sample subset.

TODO: yoh, we might need to sort the mask if the mask is a list of ids and is not ordered. Clarify with Michael what is our intent here!

setLabelsMap(lm)

Set labels map.

Checks for the validity of the mapping – values should cover all existing labels in the dataset

setSamplesDType(dtype)

Set the data type of the samples array.

summary(uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)

String summary over the object

Parameters:
  • uniq (bool) – Include summary over data attributes which have unique values
  • idhash (bool) – Include idhash value for dataset and samples
  • stats (bool) – Include some basic statistics (mean, std, var) over dataset samples
  • lstats (bool) – Include statistics on chunks/labels
  • maxc (int) – Maximal number of chunks when providing details on labels/chunks
  • maxl (int) – Maximal number of labels when providing details on labels/chunks
summary_labels(maxc=30, maxl=20)

Provide summary statistics over the labels and chunks

Parameters:
  • maxc (int) – Maximal number of chunks when providing details
  • maxl (int) – Maximal number of labels when providing details
uniquechunks
uniquelabels
where(*args, **kwargs)

Obtain indexes of interesting samples/features. See select() for more information

XXX somewhat obsoletes idsby...

zscore(dataset, mean=None, std=None, perchunk=True, baselinelabels=None, pervoxel=True, targetdtype='float64')

Z-Score the samples of a Dataset (in-place).

mean and std can be used to pass custom values to the z-scoring. Both may be scalars or arrays.

All computations are done in place. Data is automatically upcast to targetdtype if necessary.

If baselinelabels is provided, and mean or std are not, the corresponding measure is computed based only on samples with labels in baselinelabels.

If perchunk is True, samples within the same chunk are z-scored independently of samples from other chunks, i.e. mean and standard deviation are calculated per chunk.
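
A minimal sketch of per-chunk z-scoring (toy data; relies only on the arguments documented above):

import numpy as N
from mvpa.datasets.base import Dataset

ds = Dataset(samples=N.random.normal(10, 5, size=(20, 3)),
             labels=[0, 1] * 10,
             chunks=[0] * 10 + [1] * 10)

# z-score each feature within each chunk separately (in place);
# samples are upcast to float64 if necessary
ds.zscore(perchunk=True)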

mvpa.datasets.base.datasetmethod(func)

Decorator to easily bind functions to the Dataset class
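
A minimal sketch of such a binding (the helper countSamples is hypothetical, introduced only for illustration):

from mvpa.datasets.base import Dataset, datasetmethod

@datasetmethod
def countSamples(dataset):
    """Hypothetical helper: return the number of samples."""
    return dataset.nsamples

# after decoration the function should also be available as a method:
# ds.countSamples() == countSamples(ds)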