Package mvpa :: Package datasets :: Module base :: Class Dataset

Class Dataset



The Dataset.

This class provides a container to store all necessary data to perform MVPA analyses. These are the data samples, as well as the labels associated with the samples. Additionally, samples can be grouped into chunks.

Important: labels are assumed to be immutable, i.e. no one should modify them externally by accessing indexed items; something like dataset.labels[1] += 100 should not be used. If a label has to be modified, a full copy of the labels should be obtained, operated on, and assigned back to the dataset, otherwise dataset.uniquelabels would not work. The same applies to any other attribute which has a corresponding unique* access property.

Instance Methods

__init__(self, data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)
Initialize dataset instance

idhash(self)
Verify if the dataset is in the same state as when something else was done

_resetallunique(self, force=False)
Set to None all unique* attributes of the corresponding dictionary

_getuniqueattr(self, attrib, dict_)
Provide common facility to return unique attributes

_setdataattr(self, attrib, value)
Provide common facility to set attributes

_getNSamplesPerAttr(self, attrib='labels')
Returns the number of samples per unique label.

_getSampleIdsByAttr(self, values, attrib="labels", sort=True)
Return indices of samples given a list of attributes

idsonboundaries(self, prior=0, post=0, attributes_to_track=['labels','chunks'], affected_labels=None, revert=False)
Find samples which are on the boundaries of the blocks

_shapeSamples(self, samples, dtype, copy)
Adapt different kinds of samples

_checkData(self)
Checks _data members to have the same number of samples.

_expandSampleAttribute(self, attr, attr_name)
If a sample attribute is given as a scalar, expand/repeat it to a length matching the number of samples in the dataset.

__str__(self)
String summary over the object

__repr__(self)
repr(x)

summary(self, uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)
String summary over the object

summary_labels(self, maxc=30, maxl=20)
Provide summary statistics over the labels and chunks

__iadd__(self, other)
Merge the samples of one Dataset object into another (in-place).

__add__(self, other)
Merge the samples of two Dataset objects.

copy(self, deep=True)
Create a copy (clone) of the dataset by fully copying the current one

selectFeatures(self, ids=None, sort=True, groups=None)
Select a number of features from the current set.

applyMapper(self, featuresmapper=None, samplesmapper=None, train=True)
Obtain new dataset by applying mappers over features and/or samples.

selectSamples(self, ids)
Choose a subset of samples defined by sample IDs.

index(self, *args, **kwargs)
Universal indexer to obtain indexes of interesting samples/features. See .select() for more information

select(self, *args, **kwargs)
Universal selector

where(self, *args, **kwargs)
Obtain indexes of interesting samples/features. See select() for more information

__getitem__(self, *args)
Convenience dataset parts selection

permuteLabels(self, status, perchunk=True, assure_permute=False)
Permute the labels.

getRandomSamples(self, nperlabel)
Select a random set of samples.

getNSamples(self)
Currently available number of patterns.

getNFeatures(self)
Number of features per pattern.

getLabelsMap(self)
Stored labels map (if any)

setLabelsMap(self, lm)
Set labels map.

setSamplesDType(self, dtype)
Set the data type of the samples array.

defineFeatureGroups(self, definition)
Assign definition to featuregroups

convertFeatureIds2FeatureMask(self, ids)
Returns a boolean mask with all features in ids selected.

convertFeatureMask2FeatureIds(self, mask)
Returns feature ids corresponding to non-zero elements in the mask.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __subclasshook__

Class Methods

_registerAttribute(cls, key, dictname="_data", abbr=None, hasunique=False)
Register an attribute for any Dataset class.

Static Methods

_checkCopyConstructorArgs(**kwargs)
Common sanity check for Dataset copy constructor calls.
Class Variables
  _uniqueattributes = []
Unique attributes associated with the data
  _registeredattributes = []
Registered attributes (stored in _data)
  _requiredattributes = ['samples', 'labels']
Attributes which have to be provided to __init__; no default values are assumed for them, so construction of the instance fails without them
  __doc__ = enhancedDocString('Dataset', locals())
  nsamples = property(fget= getNSamples)
  nfeatures = property(fget= getNFeatures)
  labels_map = property(fget= getLabelsMap, fset= setLabelsMap)
Instance Variables
  _data
What makes a dataset.
  _dsattr
Dataset attributes.
Properties

Inherited from object: __class__

Method Details

__init__(self, data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)
(Constructor)


Initialize dataset instance

There are basically two different ways to create a dataset:

  1. Create a new dataset from samples and sample attributes. In this mode a two-dimensional ndarray has to be passed to the samples keyword argument and the corresponding sample attributes are provided via the labels and chunks arguments.

  2. Copy constructor mode

    The second way is used internally to perform quick copying of datasets, e.g. when performing feature selection. In this mode the two dictionaries (data and dsattr) are required. For performance reasons this mode bypasses most of the sanity checks performed by the previous mode, as data integrity is assumed for internal operations.

Each of the keyword arguments overwrites what is/might be already in the data container.

Parameters:
  • data (dict) - Dictionary with an arbitrary number of entries. The value for each key in the dict has to be an ndarray with the same length as the number of rows in the samples array. A special entry in this dictionary is 'samples', a 2d array (samples x features). A shallow copy is stored in the object.
  • dsattr (dict) - Dictionary of dataset attributes. An arbitrary number of arbitrarily named and typed objects can be stored here. A shallow copy of the dictionary is stored in the object.
  • dtype (type or None) - If None -- do not change data type if samples is an ndarray. Otherwise convert samples to dtype.
  • samples (ndarray) - 2d array (samples x features)
  • labels - An array or scalar value defining labels for each sample. Generally labels should be numeric, unless labels_map is used
  • labels_map (None or bool or dict) - Map original labels into numeric labels. If True, the mapping is computed if labels are literal. If False, no mapping is computed. If a dict instance -- the provided mapping is verified and applied. If you want labels_map to be present given already numeric labels, just assign a labels_map dictionary to the existing dataset instance
  • chunks - An array or scalar value defining chunks for each sample
Overrides: object.__init__
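Construction mode 1 can be illustrated with a plain-NumPy sketch (the arrays here are made up for illustration; only the shape relationships mirror what the constructor checks):

```python
import numpy as np

# Illustrative inputs for construction mode 1: a 2d samples array
# plus one label and one chunk value per sample row.
samples = np.zeros((6, 4))             # 6 samples x 4 features
labels = np.repeat([0, 1], 3)          # [0, 0, 0, 1, 1, 1]
chunks = np.array([0, 0, 1, 1, 2, 2])

# The kind of sanity check performed in mode 1 (see _checkData):
# every per-sample attribute must match the number of sample rows.
for attr in (labels, chunks):
    assert len(attr) == samples.shape[0]
```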

idhash(self)


To verify whether the dataset is in the same state as when something else was done

For example, whether a classifier was trained on the very same dataset in question.

Decorators:
  • @property

_getuniqueattr(self, attrib, dict_)


Provide common facility to return unique attributes

XXX dict_ can be simply replaced now with self._dsattr

idsonboundaries(self, prior=0, post=0, attributes_to_track=['labels','chunks'], affected_labels=None, revert=False)


Find samples which are on the boundaries of the blocks

Such samples might need to be removed. By default (with prior=0, post=0) ids of the first samples in a 'block' are reported

Parameters:
  • prior (int) - how many samples prior to transition sample to include
  • post (int) - how many samples post the transition sample to include
  • attributes_to_track (list of basestring) - which attributes to track to decide on the boundary condition
  • affected_labels (list of basestring) - for which labels to perform selection. If None - for all
  • revert (bool) - whether to revert the meaning and provide ids of samples which are found not to be boundary samples
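With prior=0 and post=0, the boundary detection amounts to finding samples where any tracked attribute changes relative to the previous sample. A NumPy sketch (data values are made up):

```python
import numpy as np

labels = np.array([1, 1, 2, 2, 2, 1, 1])
chunks = np.array([0, 0, 0, 0, 1, 1, 1])

# A sample starts a new block whenever any tracked attribute
# differs from the previous sample; sample 0 always starts one.
tracked = np.vstack([labels, chunks])
changed = np.any(tracked[:, 1:] != tracked[:, :-1], axis=0)
boundary_ids = np.concatenate(([0], np.nonzero(changed)[0] + 1))
# boundary_ids -> [0, 2, 4, 5]
```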

_shapeSamples(self, samples, dtype, copy)


Adapt different kinds of samples

Handle all possible input values for 'samples' and transform them into a 2d (samples x features) representation.
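The 1d-to-2d part of that adaptation can be sketched as follows (the function name is hypothetical, not the actual method):

```python
import numpy as np

def shape_samples(samples):
    # Hypothetical sketch: sequences and 1d arrays are treated as a
    # single sample and promoted to a 2d (samples x features) array.
    samples = np.asarray(samples)
    if samples.ndim < 2:
        samples = np.atleast_2d(samples)
    return samples

shape_samples([1, 2, 3]).shape    # -> (1, 3)
```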

_registerAttribute(cls, key, dictname="_data", abbr=None, hasunique=False)
Class Method


Register an attribute for any Dataset class.

Creates a property, assigning getters/setters depending on the availability of corresponding _get, _set functions.

__str__(self)
(Informal representation operator)

String summary over the object
Overrides: object.__str__

__repr__(self)
(Representation operator)

repr(x)
Overrides: object.__repr__
(inherited documentation)

summary(self, uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)

String summary over the object
Parameters:
  • uniq (bool) - Include summary over data attributes which have corresponding unique* properties
  • idhash (bool) - Include idhash value for dataset and samples
  • stats (bool) - Include some basic statistics (mean, std, var) over dataset samples
  • lstats (bool) - Include statistics on chunks/labels
  • maxc (int) - Maximal number of chunks when providing details on labels/chunks
  • maxl (int) - Maximal number of labels when providing details on labels/chunks

summary_labels(self, maxc=30, maxl=20)

Provide summary statistics over the labels and chunks
Parameters:
  • maxc (int) - Maximal number of chunks when providing details
  • maxl (int) - Maximal number of labels when providing details

__iadd__(self, other)


Merge the samples of one Dataset object to another (in-place).

No dataset attributes, besides labels_map, will be merged! Additionally, a new set of unique origids will be generated.

__add__(self, other)
(Addition operator)


Merge the samples of two Dataset objects.

All data of both datasets is copied, concatenated and a new Dataset is returned.

NOTE: This can be a costly operation (both memory and time). If performance is important consider the '+=' operator.

copy(self, deep=True)

Create a copy (clone) of the dataset by fully copying the current one
Parameters:
  • deep (bool) - deep flag is provided to __init__ for copy_{samples,data,dsattr}. By default full copy is done.

selectFeatures(self, ids=None, sort=True, groups=None)


Select a number of features from the current set.

Returns a new Dataset object with a copy of corresponding features
from the original samples array.

WARNING: The order of ids determines the order of features in the returned dataset. This might be useful sometimes, but can also cause major headaches! The order is verified when running non-optimized code (if __debug__).

Parameters:
  • ids - iterable container to select ids
  • sort (bool) - whether to sort ids. Order matters and selectFeatures assumes incremental order. If that is not the case, in non-optimized code selectFeatures would verify the order and sort
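On the samples array itself this selection is ordinary NumPy column indexing, which is why the order of ids carries through to the result (data values are made up):

```python
import numpy as np

samples = np.arange(12).reshape(3, 4)   # 3 samples x 4 features

# The order of ids determines the feature order in the result,
# exactly as the WARNING above describes.
sub = samples[:, [3, 1]]
# sub[0] -> [3, 1], i.e. features appear in the requested order
```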

applyMapper(self, featuresmapper=None, samplesmapper=None, train=True)


Obtain new dataset by applying mappers over features and/or samples.

While featuresmappers leave the sample attributes information unchanged, as the number of samples in the dataset is invariant, samplesmappers are also applied to the samples attributes themselves!

Applying a featuresmapper will destroy any feature grouping information.

TODO: selectFeatures is pretty much
applyMapper(featuresmapper=MaskMapper(...))
Parameters:
  • featuresmapper (Mapper) - Mapper to somehow transform each sample's features
  • samplesmapper (Mapper) - Mapper to transform each feature across samples
  • train (bool) - Flag whether to train the mapper with this dataset before applying it.

selectSamples(self, ids)


Choose a subset of samples defined by samples IDs.

Returns a new dataset object containing the selected sample subset.

TODO: yoh, we might need to sort the mask if the mask is a list of ids and is not ordered. Clarify with Michael what is our intent here!

index(self, *args, **kwargs)

Universal indexer to obtain indexes of interesting samples/features. See .select() for more information
Returns:
tuple of (samples indexes, features indexes). Each item could be also None, if no selection on samples or features was requested (to discriminate between no selected items, and no selections)

select(self, *args, **kwargs)


Universal selector

WARNING: if you need to select duplicate samples (e.g. samples=[5,5]), or the order of selected samples or features matters and is not ascending (e.g. samples=[3,2,1]), please use the selectFeatures or selectSamples functions directly

Examples:

Mimic plain selectSamples:

dataset.select([1,2,3])
dataset[[1,2,3]]

Mimic plain selectFeatures:

dataset.select(slice(None), [1,2,3])
dataset.select('all', [1,2,3])
dataset[:, [1,2,3]]

Mixed (select features and samples):

dataset.select([1,2,3], [1, 2])
dataset[[1,2,3], [1, 2]]

Select samples matching some attributes:

dataset.select(labels=[1,2], chunks=[2,4])
dataset.select('labels', [1,2], 'chunks', [2,4])
dataset['labels', [1,2], 'chunks', [2,4]]

Mixed -- out of first 100 samples, select only those with labels 1 or 2 and belonging to chunks 2 or 4, and select features 2 and 3:

dataset.select(slice(0,100), [2,3], labels=[1,2], chunks=[2,4])
dataset[:100, [2,3], 'labels', [1,2], 'chunks', [2,4]]

where(self, *args, **kwargs)


Obtain indexes of interesting samples/features. See select() for more information

XXX somewhat obsoletes idsby...

__getitem__(self, *args)
(Indexing operator)


Convenience dataset parts selection

See select for more information

permuteLabels(self, status, perchunk=True, assure_permute=False)


Permute the labels.

TODO: rename status into something closer in semantics.

Parameters:
  • status (bool) - If True, the labels are permuted among all samples. If False, the original labels are restored.
  • perchunk (bool) - If True, permutation is limited to samples sharing the same chunk value. Therefore only the association of a certain sample with a label is permuted while keeping the absolute number of occurrences of each label value within a certain chunk constant.
  • assure_permute (bool) - If True, assures that labels are permuted, i.e. each one is different from the original one
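The perchunk=True behaviour can be sketched with NumPy alone: permute labels only within each chunk, so per-chunk label counts stay constant (data and seed are made up):

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed for repeatability
labels = np.array([1, 1, 2, 2, 1, 2])
chunks = np.array([0, 0, 0, 1, 1, 1])

# Shuffle labels only among samples sharing the same chunk value.
permuted = labels.copy()
for c in np.unique(chunks):
    idx = np.nonzero(chunks == c)[0]
    permuted[idx] = rng.permutation(labels[idx])

# Per-chunk label counts are unchanged by construction.
for c in np.unique(chunks):
    assert sorted(labels[chunks == c]) == sorted(permuted[chunks == c])
```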

getRandomSamples(self, nperlabel)


Select a random set of samples.

If 'nperlabel' is an integer value, the specified number of samples is randomly chosen from the group of samples sharing a unique label value (total number of selected samples: nperlabel x len(uniquelabels)).

If 'nperlabel' is a list, its length has to match the number of unique label values. In this case 'nperlabel' specifies the number of samples that shall be selected from the samples with the corresponding label.

The method returns a Dataset object containing the selected samples.
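The integer-nperlabel case can be sketched in NumPy: draw the same number of sample ids from each unique label's group (data and seed are made up):

```python
import numpy as np

rng = np.random.RandomState(42)
labels = np.array([0, 0, 0, 1, 1, 1])
nperlabel = 2

# Draw nperlabel ids, without replacement, per unique label value.
picked = []
for label in np.unique(labels):
    idx = np.nonzero(labels == label)[0]
    picked.extend(rng.choice(idx, nperlabel, replace=False))

# len(picked) == nperlabel * number of unique labels
```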

setLabelsMap(self, lm)


Set labels map.

Checks for the validity of the mapping -- values should cover all existing labels in the dataset

defineFeatureGroups(self, definition)


Assign definition to featuregroups

XXX Feature-groups was not finished to be useful

convertFeatureIds2FeatureMask(self, ids)

Returns a boolean mask with all features in ids selected.
Parameters:
  • ids (list or 1d array) - Feature ids to be selected.
Returns:
ndarray: dtype='bool'
All selected features are set to True; False otherwise.

convertFeatureMask2FeatureIds(self, mask)

Returns feature ids corresponding to non-zero elements in the mask.
Parameters:
  • mask (1d ndarray) - Feature mask.
Returns:
ndarray: integer
Ids of non-zero (non-False) mask elements.
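Both conversions are thin wrappers around standard NumPy idioms, sketched here (nfeatures and ids are made up):

```python
import numpy as np

nfeatures = 5
ids = [1, 3]

# ids -> boolean mask (convertFeatureIds2FeatureMask semantics)
mask = np.zeros(nfeatures, dtype=bool)
mask[ids] = True
# mask -> [False, True, False, True, False]

# mask -> ids (convertFeatureMask2FeatureIds semantics)
recovered = np.nonzero(mask)[0]
# recovered -> [1, 3]
```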