Package mvpa :: Package clfs :: Module stats

Module stats


Estimator for classifier error distributions.
Classes
Non-parametric 1d distribution -- derives cdf based on stored values.
Base class for null-hypothesis testing.
Null-hypothesis distribution is estimated from randomly permuted data labels.
Proxy/Adaptor class for SciPy distributions.
Adaptive distribution which adjusts parameters according to the data
Adaptive rdist: params are (nfeatures-1, 0, 1)
Adaptive Normal Distribution: params are (0, sqrt(1/nfeatures))
Helper proxy-class to fit distribution when some parameters are known
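
The permutation-based null distribution and the non-parametric cdf summarized above can be sketched roughly as follows. This is a standalone illustration using only the standard library, not the module's own code: `empirical_cdf`, `permutation_null`, and the toy `mean_difference` statistic are hypothetical names introduced here, and a real classifier error would take the place of the toy statistic.

```python
import bisect
import random


def empirical_cdf(stored):
    """Return a cdf function derived from stored 1d values."""
    ordered = sorted(stored)

    def cdf(x):
        # Fraction of stored values that are <= x
        return bisect.bisect_right(ordered, x) / float(len(ordered))

    return cdf


def mean_difference(values, labels):
    """Toy statistic (assumption): mean of class 1 minus mean of class 0."""
    ones = [v for v, l in zip(values, labels) if l == 1]
    zeros = [v for v, l in zip(values, labels) if l == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def permutation_null(values, labels, statistic, permutations=200, seed=0):
    """Collect the statistic under randomly permuted labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_values = []
    for _ in range(permutations):
        rng.shuffle(shuffled)
        null_values.append(statistic(values, shuffled))
    return null_values


values = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 2.2, 2.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed = mean_difference(values, labels)
null_values = permutation_null(values, labels, mean_difference)
null_cdf = empirical_cdf(null_values)
right_tail_p = 1.0 - null_cdf(observed)
```

With only a few hundred permutations the estimated tail probability is coarse, so the number of permutations bounds the smallest p-value that can be resolved.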
Functions
_pvalue(x, cdf_func, tail, return_tails=False, name=None)
Helper function to return p-value(x) given cdf and tail
matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)
Determine best matching distribution.
plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')
Plot best matching distributions
Cheater for human beings -- wraps dist if needed with some NullDist
_chk_asarray(a, axis)
nanmean(x, axis=0)
Compute the mean over the given axis ignoring nans.

Imports: N, externals, warning, ClassWithCollections, StateVariable, debug, kstest, scipy, P

Function Details

_pvalue(x, cdf_func, tail, return_tails=False, name=None)

Helper function to return p-value(x) given cdf and tail
  • cdf_func (callable) - Function used to derive cdf values for x
  • tail (str ('left', 'right', 'any', 'both')) - Which tail of the distribution to report. For 'any' and 'both', the tail is chosen according to which side of p=0.5 the value falls on. With 'any', significance is assessed as in a one-tailed test.
  • return_tails (bool) - If True, return a tuple (pvalues, tails), where tails contains 1 if the value was from the right tail and 0 if it was from the left tail.
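
The tail handling described above might look roughly like the following for a single scalar value. This is a sketch, not the actual `_pvalue` implementation: `pvalue_sketch` and `normal_cdf` are hypothetical names, and doubling the one-tailed value for 'both' is an assumption inferred from the note that 'any' is treated as a one-tailed test.

```python
import math


def normal_cdf(x):
    """Standard normal cdf, standing in for any cdf_func."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pvalue_sketch(x, cdf_func, tail, return_tails=False):
    """Illustrative p-value for a scalar x given a cdf and a tail."""
    cdf = cdf_func(x)
    is_right = cdf > 0.5  # which tail x falls into
    if tail == 'left':
        p = cdf
    elif tail == 'right':
        p = 1.0 - cdf
    elif tail in ('any', 'both'):
        # Pick the tail that x belongs to
        p = 1.0 - cdf if is_right else cdf
        if tail == 'both':
            # Assumption: two-tailed test doubles the one-tailed value
            p = min(1.0, 2.0 * p)
    else:
        raise ValueError("unknown tail: %r" % (tail,))
    if return_tails:
        return p, int(is_right)
    return p


p_both = pvalue_sketch(1.96, normal_cdf, 'both')
p_any, tail_flag = pvalue_sketch(-1.96, normal_cdf, 'any', return_tails=True)
```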

matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)

Determine best matching distribution.

Can be used for 'smelling' the data, as well as to choose a
parametric distribution for data obtained from non-parametric
testing (e.g. `MCNullDist`).

WiP: use with caution, API might change

  data : N.ndarray
    Array of the data for which to deduce the distribution. It has
    to be sufficiently large to make a reliable conclusion
  nsamples : int or None
    If None -- use all samples in data to estimate the parametric
    distribution. Otherwise use only the specified number of samples,
    randomly selected from data.
  loc : float or None
    Loc for the distribution (if known)
  scale : float or None
    Scale for the distribution (if known)
  test : basestring
    What kind of testing to do. Choices:
     'p-roc' : detection power for a given ROC. Needs two
       parameters: `p=0.05` and `tail='both'`
     'kstest' : 'full-body' distribution comparison. The best
       choice is made by minimal reported distance after estimating
       parameters of the distribution. Parameter `p=0.05` sets
       threshold to reject null-hypothesis that distribution is the
       same.
       WARNING: older versions (e.g. 0.5.2 in etch) of scipy have
                incorrect kstest implementation and do not function
                properly
  distributions : None or list of basestring or tuple(basestring, dict)
    Distributions to check. If None, all distributions known in
    scipy.stats are tested. If a distribution is specified as a tuple,
    it must contain the distribution name and a dictionary of
    additional parameters (name, loc, scale, args). The entry 'scipy'
    adds all distributions known in scipy.stats.
  **kwargs
    Additional arguments which are needed for each particular test
    (see above)

  import numpy as N
  from mvpa.clfs.stats import matchDistribution

  data = N.random.normal(size=(1000, 1))
  matches = matchDistribution(
    data,
    distributions=[('rdist', {'name': 'rdist_fixed',
                              'loc': 0.0,
                              'args': (10,)})],
    nsamples=30, test='p-roc', p=0.05)
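
To make the 'kstest' strategy more concrete, the following sketch fits a few named scipy.stats distributions and ranks them by Kolmogorov-Smirnov distance, smallest first. It is an illustration only and does not reproduce `matchDistribution`: `rank_by_kstest` is a hypothetical helper, and it assumes a reasonably recent NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def rank_by_kstest(data, names):
    """Fit each named scipy.stats distribution and sort by KS distance."""
    results = []
    for name in names:
        dist = getattr(stats, name)
        params = dist.fit(data)  # maximum-likelihood parameter estimate
        distance, p = stats.kstest(data, name, args=params)
        results.append((distance, name, params))
    results.sort(key=lambda r: r[0])  # smallest distance first
    return results


rng = np.random.RandomState(0)
sample = rng.normal(size=500)
ranking = rank_by_kstest(sample, ['norm', 'expon'])
best_name = ranking[0][1]
```

Because the parameters are estimated from the same data that is then tested, the p-value reported by `kstest` in this sketch tends to be optimistic; the distance is used here only for ranking.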

plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')

Plot best matching distributions
  • data (N.ndarray) - Data which was used to obtain the matches
  • matches (list of tuples) - Sorted matches as provided by matchDistribution
  • nbins (int) - Number of bins in the histogram
  • nbest (int) - Number of top matches to plot
  • expand_tails (int) - How many bins away to add to parametrized distributions plots
  • legend (int) - Whether to provide a legend, and which statistics to include in it: 1 lists just the distributions; 2 adds the distance measure; 3 adds tp/fp/fn if p is provided
  • plot_cdf (bool) - Whether to plot the cdf for data using a non-parametric distribution
  • p (float or None) - If not None, visualize null-hypothesis testing (given p). Bars in the histogram which fall under the given p are colored red. False positives and false negatives are marked with up and down triangle symbols, respectively
  • tail (('left', 'right', 'any', 'both')) - If p is not None, the choice of tail for null-hypothesis testing
Returns: tuple(histogram, list of lines)



Cheater for human beings -- wraps dist if needed with some NullDist

The tail and other arguments are assumed to take their defaults, as in NullDist/MCNullDist.

nanmean(x, axis=0)

Compute the mean over the given axis ignoring nans.
  • x (ndarray) - input array
  • axis (int) - axis along which the mean is computed.
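
A NaN-ignoring mean of this kind can be sketched with plain NumPy as below. `nanmean_sketch` is a hypothetical name used for illustration; it is not the module's implementation, and recent NumPy releases provide `numpy.nanmean` for the same purpose.

```python
import numpy as np


def nanmean_sketch(x, axis=0):
    """Mean along axis, skipping NaN entries."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    # Replace NaNs by zero so they do not contribute to the sum
    total = np.where(valid, x, 0.0).sum(axis=axis)
    # Divide by the number of non-NaN entries only
    count = valid.sum(axis=axis)
    return total / count


a = np.array([[1.0, np.nan], [3.0, 4.0]])
column_means = nanmean_sketch(a, axis=0)
row_means = nanmean_sketch(a, axis=1)
```

A slice consisting entirely of NaNs has a count of zero, so this sketch yields NaN there (with a NumPy runtime warning) rather than raising an error.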