Package mvpa :: Package datasets :: Module base :: Class Dataset

Class Dataset

The Dataset.

This class provides a container to store all necessary data to perform MVPA analyses. These are the data samples, as well as the labels associated with the samples. Additionally, samples can be grouped into chunks.

Important: labels are assumed to be immutable, i.e. no one should modify them externally by accessing indexed items; something like dataset.labels[1] += 100 should not be used. If a label has to be modified, a full copy of the labels should be obtained, operated on, and assigned back to the dataset; otherwise dataset.uniquelabels would not work. The same applies to any other attribute that has a corresponding unique* access property.
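
The caching problem behind this warning can be illustrated with a small stand-in class. This is a sketch only, not PyMVPA code: TinyDataset and its attributes are hypothetical, and it merely assumes that unique values are cached when the attribute is assigned as a whole, as the note above implies.

```python
class TinyDataset(object):
    """Hypothetical stand-in that caches unique labels on assignment."""

    def __init__(self, labels):
        self.labels = labels

    @property
    def labels(self):
        return self._labels

    @labels.setter
    def labels(self, value):
        self._labels = list(value)
        # The cache is refreshed only when labels are assigned as a whole.
        self._uniquelabels = sorted(set(self._labels))

    @property
    def uniquelabels(self):
        return self._uniquelabels


ds = TinyDataset([0, 1, 1, 0])

# Discouraged: an in-place edit bypasses the setter, so the cache goes stale.
ds.labels[1] += 100
stale = ds.uniquelabels

# Recommended: copy the labels, modify the copy, assign it back.
new_labels = list(ds.labels)
new_labels[2] += 100
ds.labels = new_labels
fresh = ds.uniquelabels
```

After the in-place edit the cached unique values no longer describe the labels; after the copy-and-assign step they do.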

Instance Methods

__init__(self, data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)
Initialize dataset instance

idhash(self)
To verify that the dataset is in the same state as when something else was done

_resetallunique(self, force=False)
Set to None all unique* attributes of corresponding dictionary

_getuniqueattr(self, attrib, dict_)
Provide common facility to return unique attributes

_setdataattr(self, attrib, value)
Provide common facility to set attributes

_getNSamplesPerAttr(self, attrib='labels')
Returns the number of samples per unique label.

_getSampleIdsByAttr(self, values, attrib="labels", sort=True)
Return indices of samples given a list of attributes

idsonboundaries(self, prior=0, post=0, attributes_to_track=['labels','chunks'], affected_labels=None, revert=False)
Find samples which are on the boundaries of the blocks

_shapeSamples(self, samples, dtype, copy)
Adapt different kinds of samples

_checkData(self)
Checks _data members to have the same # of samples.

_expandSampleAttribute(self, attr, attr_name)
If a sample attribute is given as a scalar expand/repeat it to a length matching the number of samples in the dataset.

__str__(self)
String summary over the object

__repr__(self)
repr(x)

summary(self, uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)
String summary over the object

summary_labels(self, maxc=30, maxl=20)
Provide summary statistics over the labels and chunks

__iadd__(self, other)
Merge the samples of one Dataset object to another (in-place).

__add__(self, other)
Merge the samples of two Dataset objects.

copy(self, deep=True)
Create a copy (clone) of the dataset, by fully copying current one

selectFeatures(self, ids=None, sort=True, groups=None)
Select a number of features from the current set.

applyMapper(self, featuresmapper=None, samplesmapper=None, train=True)
Obtain new dataset by applying mappers over features and/or samples.

selectSamples(self, ids)
Choose a subset of samples defined by samples IDs.

index(self, *args, **kwargs)
Universal indexer to obtain indexes of interesting samples/features. See .select() for more information

select(self, *args, **kwargs)
Universal selector

where(self, *args, **kwargs)
Obtain indexes of interesting samples/features. See select() for more information

__getitem__(self, *args)
Convenience selection of dataset parts

permuteLabels(self, status, perchunk=True, assure_permute=False)
Permute the labels.

getRandomSamples(self, nperlabel)
Select a random set of samples.

getNSamples(self)
Currently available number of patterns.

getNFeatures(self)
Number of features per pattern.

getLabelsMap(self)
Stored labels map (if any)

setLabelsMap(self, lm)
Set labels map.

setSamplesDType(self, dtype)
Set the data type of the samples array.

defineFeatureGroups(self, definition)
Assign definition to featuregroups

convertFeatureIds2FeatureMask(self, ids)
Returns a boolean mask with all features in ids selected.

convertFeatureMask2FeatureIds(self, mask)
Returns feature ids corresponding to non-zero elements in the mask.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __subclasshook__

Class Methods

_registerAttribute(cls, key, dictname="_data", abbr=None, hasunique=False)
Register an attribute for any Dataset class.

Static Methods

_checkCopyConstructorArgs(**kwargs)
Common sanity check for Dataset copy constructor calls.

Class Variables

_uniqueattributes = []
Unique attributes associated with the data

_registeredattributes = []
Registered attributes (stored in _data)

_requiredattributes = ['samples', 'labels']
Attributes which have to be provided to __init__, or otherwise no default values would be assumed and construction of the instance would fail

__doc__ = enhancedDocString('Dataset', locals())

nsamples = property(fget= getNSamples)

nfeatures = property(fget= getNFeatures)

labels_map = property(fget= getLabelsMap, fset= setLabelsMap)

Instance Variables

_data
What makes a dataset.

_dsattr
Dataset attributes.

Properties

Inherited from object: __class__

Method Details

__init__(self, data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)
(Constructor)

Initialize dataset instance

There are basically two different ways to create a dataset:

1. Create a new dataset from samples and sample attributes. In this mode a two-dimensional ndarray has to be passed to the samples keyword argument, and the corresponding sample attributes are provided via the labels and chunks arguments.
2. Copy constructor mode. In this mode the two dictionaries (data and dsattr) are required.

The second way is used internally to perform quick copying of datasets, e.g. when performing feature selection. For performance reasons this mode bypasses most of the sanity checks performed by the first mode, since data integrity is assumed for internal operations.

Each of the keyword arguments overwrites what is, or might already be, in the data container.

Parameters:

data (dict) - Dictionary with an arbitrary number of entries. The value for each key in the dict has to be an ndarray with the same length as the number of rows in the samples array. A special entry in this dictionary is 'samples', a 2d array (samples x features). A shallow copy is stored in the object.
dsattr (dict) - Dictionary of dataset attributes. An arbitrary number of arbitrarily named and typed objects can be stored here. A shallow copy of the dictionary is stored in the object.
dtype (type or None) - If None, the data type is left unchanged if samples is an ndarray. Otherwise samples are converted to dtype.
samples (ndarray) - 2d array (samples x features)
labels - An array or scalar value defining labels for each sample. Generally labels should be numeric, unless labels_map is used
labels_map (None or bool or dict) - Map original labels into numeric labels. If True, the mapping is computed if labels are literal. If False, no mapping is computed. If a dict instance, the provided mapping is verified and applied. If you only want a labels_map to be present for labels that are already numeric, assign the labels_map dictionary to an existing dataset instance
chunks - An array or scalar value defining chunks for each sample

Overrides: object.__init__
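
Two of the conveniences described in the parameter list, scalar sample attributes and literal labels, can be sketched in plain Python. The helper names below are hypothetical and are not part of the Dataset API; in particular, assigning numeric codes in sorted order of the literal labels is an assumption made for illustration, since the text above does not say how the mapping is computed.

```python
def expand_sample_attribute(attr, nsamples):
    """Repeat a scalar attribute so that there is one value per sample."""
    if isinstance(attr, (list, tuple)):
        if len(attr) != nsamples:
            raise ValueError("expected %d values, got %d"
                             % (nsamples, len(attr)))
        return list(attr)
    # Scalar: repeat it once per sample.
    return [attr] * nsamples


def map_literal_labels(labels):
    """Build a literal -> numeric mapping and apply it (assumed ordering)."""
    labels_map = {}
    for index, name in enumerate(sorted(set(labels))):
        labels_map[name] = index
    return [labels_map[name] for name in labels], labels_map


samples = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
chunks = expand_sample_attribute(1, len(samples))
numeric, labels_map = map_literal_labels(["rest", "task", "rest"])
```

Here a single chunk value is repeated for all three samples, and the literal labels are replaced by integer codes together with the mapping that produced them.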

idhash(self)

To verify that the dataset is in the same state as when something else was done

For example, to check whether a classifier was trained on the same dataset as the one in question.

Decorators:

@property

_getuniqueattr(self, attrib, dict_)

Provide common facility to return unique attributes

XXX dict_ can be simply replaced now with self._dsattr

idsonboundaries(self, prior=0, post=0, attributes_to_track=['labels','chunks'], affected_labels=None, revert=False)

Find samples which are on the boundaries of the blocks

Such samples might need to be removed. By default (with prior=0, post=0) ids of the first samples in a 'block' are reported

Parameters:

prior (int) - how many samples prior to transition sample to include
post (int) - how many samples post the transition sample to include
attributes_to_track (list of basestring) - which attributes to track to decide on the boundary condition
affected_labels (list of basestring) - for which labels to perform selection. If None, all labels are used
revert (bool) - whether to invert the meaning and return the ids of samples that are found not to be boundary samples
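
The boundary search described above can be sketched with plain lists. This is an illustration under stated assumptions, not the library implementation: the function name is hypothetical, the very first sample is treated as the start of a block, and affected_labels is left out for brevity.

```python
def ids_on_boundaries(attributes, prior=0, post=0, revert=False):
    """Return ids of samples at which any tracked attribute changes.

    `attributes` is a list of equally long sequences, e.g. [labels, chunks].
    """
    nsamples = len(attributes[0])
    selected = set()
    previous = None
    for i in range(nsamples):
        current = tuple(attr[i] for attr in attributes)
        if current != previous:
            # Transition sample: also include `prior` samples before it
            # and `post` samples after it, clipped to the valid range.
            first = max(0, i - prior)
            last = min(nsamples - 1, i + post)
            selected.update(range(first, last + 1))
        previous = current
    if revert:
        # Report the samples that are NOT on a boundary instead.
        selected = set(range(nsamples)) - selected
    return sorted(selected)


labels = [0, 0, 1, 1, 0, 0]
chunks = [0, 0, 0, 1, 1, 1]
boundaries = ids_on_boundaries([labels, chunks])
widened = ids_on_boundaries([labels, chunks], prior=1)
inner = ids_on_boundaries([labels, chunks], revert=True)
```

With the defaults only the first sample of each block is reported; prior=1 additionally marks the sample just before each transition, and revert=True returns the remaining samples.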

_shapeSamples(self, samples, dtype, copy)

Adapt different kinds of samples

Handle all possible input values for 'samples' and transform them into a 2d (samples x features) representation.

_registerAttribute(cls, key, dictname="_data", abbr=None, hasunique=False)
Class Method

Creates property assigning getters/setters depending on the availability of corresponding _get, _set functions.

__str__(self)
(Informal representation operator)

String summary over the object

Overrides: object.__str__

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__: (inherited documentation)

summary(self, uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)

String summary over the object

Parameters:

uniq (bool) - Include summary over data attributes which have unique
idhash (bool) - Include idhash value for dataset and samples
stats (bool) - Include some basic statistics (mean, std, var) over dataset samples
lstats (bool) - Include statistics on chunks/labels
maxc (int) - Maximal number of chunks when providing details on labels/chunks
maxl (int) - Maximal number of labels when providing details on labels/chunks

summary_labels(self, maxc=30, maxl=20)

Provide summary statistics over the labels and chunks

Parameters:

maxc (int) - Maximal number of chunks when providing details
maxl (int) - Maximal number of labels when providing details

__iadd__(self, other)

Merge the samples of one Dataset object to another (in-place).

No dataset attributes, besides labels_map, will be merged! Additionally, a new set of unique origids will be generated.

__add__(self, other)
(Addition operator)

Merge the samples of two Dataset objects.

All data of both datasets is copied, concatenated and a new Dataset is returned.

NOTE: This can be a costly operation (both memory and time). If performance is important consider the '+=' operator.

copy(self, deep=True)

Create a copy (clone) of the dataset, by fully copying current one

Parameters:

deep (bool) - deep flag is provided to __init__ for copy_{samples,data,dsattr}. By default full copy is done.

selectFeatures(self, ids=None, sort=True, groups=None)

Select a number of features from the current set.

Returns a new Dataset object with a copy of the corresponding features from the original samples array.

WARNING: The order of ids determines the order of features in the returned dataset. This might be useful sometimes, but can also cause major headaches! The order is verified when running non-optimized code (if __debug__).

Parameters:

ids - iterable container to select ids
sort (bool) - whether to sort ids. Order matters, and selectFeatures assumes incremental order. If the ids are not in incremental order, non-optimized code verifies the order and sorts them
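
The effect of the sort flag on feature order can be sketched with a list-of-rows sample matrix. select_features is a hypothetical helper written for this illustration; it is not the method itself and ignores the groups argument.

```python
def select_features(samples, ids, sort=True):
    """Return a copy of `samples` restricted to the feature columns in `ids`."""
    ids = list(ids)
    if sort:
        # Bring the ids into incremental order before selecting.
        ids = sorted(ids)
    return [[row[i] for i in ids] for row in samples]


samples = [[10, 11, 12, 13],
           [20, 21, 22, 23]]
ordered = select_features(samples, [3, 1])
as_given = select_features(samples, [3, 1], sort=False)
```

With sorting enabled the columns come back in their original order; without it the returned features follow the order of ids, which is the situation the warning above describes.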

applyMapper(self, featuresmapper=None, samplesmapper=None, train=True)

Obtain new dataset by applying mappers over features and/or samples.

While featuresmappers leave the sample attributes information unchanged, as the number of samples in the dataset is invariant, samplesmappers are also applied to the samples attributes themselves!

Applying a featuresmapper will destroy any feature grouping information.

TODO: selectFeatures is pretty much: applyMapper(featuresmapper=MaskMapper(...))

Parameters:

featuresmapper (Mapper) - Mapper to somehow transform each sample's features
samplesmapper (Mapper) - Mapper to transform each feature across samples
train (bool) - Flag whether to train the mapper with this dataset before applying it.

selectSamples(self, ids)

Choose a subset of samples defined by samples IDs.

Returns a new dataset object containing the selected sample subset.

TODO: yoh, we might need to sort the mask if the mask is a list of ids and is not ordered. Clarify with Michael what is our intent here!

index(self, *args, **kwargs)

Universal indexer to obtain indexes of interesting samples/features. See .select() for more information

Returns: tuple of (samples indexes, features indexes). Each item can also be None if no selection on samples or features was requested (to distinguish between no selected items and no selection)

select(self, *args, **kwargs)

Universal selector

WARNING: if you need to select duplicate samples (e.g. samples=[5,5]), or if the order of the selected samples or features matters and is not ascending (e.g. samples=[3,2,1]), please use the selectFeatures or selectSamples functions directly

Examples:

Mimic plain selectSamples:

dataset.select([1,2,3])
dataset[[1,2,3]]

Mimic plain selectFeatures:

dataset.select(slice(None), [1,2,3])
dataset.select('all', [1,2,3])
dataset[:, [1,2,3]]

Mixed (select features and samples):

dataset.select([1,2,3], [1, 2])
dataset[[1,2,3], [1, 2]]

Select samples matching some attributes:

dataset.select(labels=[1,2], chunks=[2,4])
dataset.select('labels', [1,2], 'chunks', [2,4])
dataset['labels', [1,2], 'chunks', [2,4]]

Mixed -- out of first 100 samples, select only those with labels 1 or 2 and belonging to chunks 2 or 4, and select features 2 and 3:

dataset.select(slice(0,100), [2,3], labels=[1,2], chunks=[2,4])
dataset[:100, [2,3], 'labels', [1,2], 'chunks', [2,4]]
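
The attribute-based part of these selections can be sketched without the library. where_samples below is a hypothetical helper, not the select() implementation; it only shows the matching logic (a sample is kept if every named attribute has one of the accepted values) and returns sample ids in ascending order without duplicates, in line with the warning above.

```python
def where_samples(attributes, ids=None, **criteria):
    """Return sorted ids of samples whose attributes match all criteria.

    `attributes` maps an attribute name to a per-sample sequence;
    each criterion lists the accepted values for one attribute.
    """
    nsamples = len(next(iter(attributes.values())))
    # Optional pre-selection of sample ids, deduplicated and sorted.
    candidates = range(nsamples) if ids is None else sorted(set(ids))
    return [i for i in candidates
            if all(attributes[name][i] in accepted
                   for name, accepted in criteria.items())]


attributes = {
    "labels": [1, 2, 3, 1, 2, 3],
    "chunks": [2, 2, 2, 4, 5, 4],
}
matched = where_samples(attributes, labels=[1, 2], chunks=[2, 4])
restricted = where_samples(attributes, ids=range(0, 3),
                           labels=[1, 2], chunks=[2, 4])
```

The first call corresponds to the labels/chunks example; the second additionally restricts the search to a leading range of samples, as in the mixed example.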

where(self, *args, **kwargs)

Obtain indexes of interesting samples/features. See select() for more information

XXX somewhat obsoletes idsby...

__getitem__(self, *args)
(Indexing operator)

Convenience selection of dataset parts

See select for more information

permuteLabels(self, status, perchunk=True, assure_permute=False)

Permute the labels.

TODO: rename status into something closer in semantics.

Parameters:

status (bool) - If True, the labels are permuted among all samples. If False, the original labels are restored.
perchunk (bool) - If True, permutation is limited to samples sharing the same chunk value. Therefore only the association of a certain sample with a label is permuted, while the absolute number of occurrences of each label value within a certain chunk is kept constant.
assure_permute (bool) - If True, assures that labels are permuted, i.e. any one is different from the original one
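
The effect of perchunk can be sketched with plain lists and the standard random module. permute_labels is a hypothetical function, not the method itself: it returns a permuted copy instead of modifying a dataset, so keeping the original list plays the role of status=False, and assure_permute is not modelled.

```python
import random


def permute_labels(labels, chunks, perchunk=True, rng=None):
    """Return a permuted copy of `labels`; the input is left untouched."""
    if rng is None:
        rng = random.Random()
    permuted = list(labels)
    if perchunk:
        # Group sample ids by chunk so labels never cross chunk borders.
        by_chunk = {}
        for i, chunk in enumerate(chunks):
            by_chunk.setdefault(chunk, []).append(i)
        groups = list(by_chunk.values())
    else:
        groups = [list(range(len(labels)))]
    for ids in groups:
        shuffled = [labels[i] for i in ids]
        rng.shuffle(shuffled)
        for i, label in zip(ids, shuffled):
            permuted[i] = label
    return permuted


labels = [0, 0, 1, 1, 2, 2, 3, 3]
chunks = [0, 0, 0, 0, 1, 1, 1, 1]
permuted = permute_labels(labels, chunks, rng=random.Random(0))
```

Because shuffling happens within each chunk, the number of occurrences of every label value per chunk stays the same, while the assignment of labels to individual samples may change.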

getRandomSamples(self, nperlabel)

Select a random set of samples.

If 'nperlabel' is an integer value, the specified number of samples is randomly chosen from the group of samples sharing a unique label value (total number of selected samples: nperlabel x len(uniquelabels)).

If 'nperlabel' is a list, its length has to match the number of unique label values. In this case 'nperlabel' specifies the number of samples that shall be selected from the samples with the corresponding label.

The method returns a Dataset object containing the selected samples.
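
The two forms of 'nperlabel' can be sketched as follows. random_sample_ids is a hypothetical helper that returns sample ids rather than a Dataset object, and pairing list entries with the unique labels in sorted order is an assumption made for this illustration.

```python
import random


def random_sample_ids(labels, nperlabel, rng=None):
    """Pick `nperlabel` random sample ids for each unique label."""
    if rng is None:
        rng = random.Random()
    unique = sorted(set(labels))
    if isinstance(nperlabel, int):
        # Same number of samples for every unique label.
        nperlabel = [nperlabel] * len(unique)
    if len(nperlabel) != len(unique):
        raise ValueError("need one count per unique label")
    selected = []
    for label, count in zip(unique, nperlabel):
        ids = [i for i, value in enumerate(labels) if value == label]
        selected.extend(rng.sample(ids, count))
    return sorted(selected)


labels = [0, 0, 0, 1, 1, 1, 1]
same_count = random_sample_ids(labels, 2, rng=random.Random(1))
per_label = random_sample_ids(labels, [1, 3], rng=random.Random(1))
```

In the integer form the total number of selected samples is nperlabel times the number of unique labels; in the list form each label contributes its own count.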

setLabelsMap(self, lm)

Set labels map.

Checks for the validity of the mapping -- values should cover all existing labels in the dataset
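
The validity check described here can be sketched in a few lines. check_labels_map is a hypothetical function, and raising ValueError is an assumption; the text above only states that the values of the mapping should cover all existing labels.

```python
def check_labels_map(labels_map, labels):
    """Raise ValueError unless the map's values cover all existing labels."""
    missing = set(labels) - set(labels_map.values())
    if missing:
        raise ValueError("labels without a mapping: %s" % sorted(missing))
    return True


labels = [0, 1, 1, 0]
ok = check_labels_map({"rest": 0, "task": 1}, labels)
try:
    check_labels_map({"rest": 0}, labels)
    rejected = False
except ValueError:
    # Label 1 exists in the data but has no entry in the map.
    rejected = True
```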

defineFeatureGroups(self, definition)

Assign definition to featuregroups

XXX Feature-groups was not finished to be useful

convertFeatureIds2FeatureMask(self, ids)

Returns a boolean mask with all features in ids selected.

Parameters:

ids (list or 1d array) - Ids of the features to be selected.

Returns:

ndarray: dtype='bool': All selected features are set to True; False otherwise.

convertFeatureMask2FeatureIds(self, mask)

Returns feature ids corresponding to non-zero elements in the mask.

Parameters:

mask (1d ndarray) - Feature mask.

Returns:

ndarray: integer: Ids of non-zero (non-False) mask elements.
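
The relationship between the two conversions can be sketched with plain Python lists in place of ndarrays. Both helper names are hypothetical, and the number of features is passed explicitly because there is no dataset to take it from.

```python
def feature_ids_to_mask(ids, nfeatures):
    """Return a boolean list with True at every position listed in `ids`."""
    mask = [False] * nfeatures
    for i in ids:
        mask[i] = True
    return mask


def feature_mask_to_ids(mask):
    """Return the positions of the non-zero (truthy) mask elements."""
    return [i for i, flag in enumerate(mask) if flag]


mask = feature_ids_to_mask([0, 3], nfeatures=5)
ids = feature_mask_to_ids(mask)
```

Converting ids to a mask and back yields the same ids in ascending order, so the two methods are inverses of each other for sorted, duplicate-free input.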
