Package mvpa :: Package clfs :: Module stats

Module stats


Estimator for classifier error distributions.
Classes
  Nonparametric
Non-parametric 1d distribution -- derives cdf based on stored values.
  NullDist
Base class for null-hypothesis testing.
  MCNullDist
Null-hypothesis distribution is estimated from randomly permuted data labels.
  FixedNullDist
Proxy/Adaptor class for SciPy distributions.
  AdaptiveNullDist
Adaptive distribution which adjusts parameters according to the data
  AdaptiveRDist
Adaptive rdist: params are (nfeatures-1, 0, 1)
  AdaptiveNormal
Adaptive Normal Distribution: params are (0, sqrt(1/nfeatures))
  rv_semifrozen
Helper proxy-class to fit distribution when some parameters are known
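The idea behind MCNullDist (a null distribution estimated from randomly permuted labels) can be illustrated with a minimal sketch. It uses plain NumPy rather than the module's classes, and the names `permutation_null` and `mean_difference`, the toy statistic, and the right-tail p-value formula are all assumptions made for illustration, not the library's implementation.

```python
import numpy as np

def permutation_null(statistic, samples, labels, n_permutations=200, seed=0):
    """Collect values of `statistic` computed on randomly permuted labels.

    Hypothetical helper: permuting the labels breaks any real association
    between samples and labels, so the collected values approximate the
    distribution of the statistic under the null hypothesis.
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        null[i] = statistic(samples, rng.permutation(labels))
    return null

def mean_difference(samples, labels):
    # Toy statistic standing in for a classifier error or accuracy
    return samples[labels == 1].mean() - samples[labels == 0].mean()

rng = np.random.RandomState(1)
labels = np.repeat([0, 1], 50)
samples = rng.normal(size=100) + 2.0 * labels   # class 1 is shifted by 2

observed = mean_difference(samples, labels)
null = permutation_null(mean_difference, samples, labels)

# Right-tail p-value; the +1 terms count the observed value itself
p_right = (np.sum(null >= observed) + 1.0) / (len(null) + 1.0)
```

The stored `null` values play the role of the values from which a non-parametric cdf could be derived.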
Functions
 
_pvalue(x, cdf_func, tail, return_tails=False, name=None)
Helper function to return p-value(x) given cdf and tail
 
matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)
Determine best matching distribution.
 
plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')
Plot best matching distributions
 
autoNullDist(dist)
Cheater for human beings -- wraps dist if needed with some NullDist
 
_chk_asarray(a, axis)
 
nanmean(x, axis=0)
Compute the mean over the given axis ignoring nans.

Imports: N, externals, warning, ClassWithCollections, StateVariable, debug, kstest, scipy, P


Function Details

_pvalue(x, cdf_func, tail, return_tails=False, name=None)

Helper function to return p-value(x) given cdf and tail
Parameters:
  • cdf_func (callable) - Function to be used to derive cdf values for x
  • tail (str ('left', 'right', 'any', 'both')) - Which tail of the distribution to report. For 'any' and 'both', the tail is chosen by comparing the cdf value of x to 0.5. With 'any', significance is computed as in a one-tailed test.
  • return_tails (bool) - If True, return a tuple (pvalues, tails), where tails contains 1 where the value came from the right tail and 0 where it came from the left tail.
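
The tail handling described above can be sketched as follows. This is an illustrative reimplementation, not the module's source: the name `pvalue_sketch` is hypothetical, and doubling the one-tailed value for 'both' is an assumption about how the two-tailed case is handled.

```python
import numpy as np
from scipy import stats

def pvalue_sketch(x, cdf_func, tail, return_tails=False):
    """Illustrative p-value computation from a cdf and a tail choice."""
    cdf = np.asarray(cdf_func(np.asarray(x, dtype=float)), dtype=float)
    if tail == 'left':
        p = cdf
        right = np.zeros(cdf.shape, dtype=int)
    elif tail == 'right':
        p = 1.0 - cdf
        right = np.ones(cdf.shape, dtype=int)
    elif tail in ('any', 'both'):
        # Pick the tail each value belongs to by comparing its cdf to 0.5
        right_mask = cdf > 0.5
        p = np.where(right_mask, 1.0 - cdf, cdf)
        if tail == 'both':
            # Assumption: two-tailed significance doubles the one-tailed value
            p = 2.0 * p
        right = right_mask.astype(int)
    else:
        raise ValueError("unknown tail %r" % (tail,))
    return (p, right) if return_tails else p

# Usage: two-tailed p-values under a standard normal cdf
pvalues, tails = pvalue_sketch([-1.96, 1.96], stats.norm.cdf, 'both',
                               return_tails=True)
```

Here `tails` marks the first value as coming from the left tail (0) and the second from the right tail (1).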

matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)

Determine best matching distribution.

Can be used for 'smelling' the data, as well as for choosing a
parametric distribution for data obtained from non-parametric
testing (e.g. `MCNullDist`).

WiP: use with caution, API might change

:Parameters:
  data : N.ndarray
    Array of the data for which to deduce the distribution. It must
    be large enough to support a reliable conclusion.
  nsamples : int or None
    If None, use all samples in data to estimate the parametric
    distribution. Otherwise use only the specified number of samples,
    randomly selected from data.
  loc : float or None
    Loc for the distribution (if known)
  scale : float or None
    Scale for the distribution (if known)
  test : basestring
    What kind of testing to do. Choices:
     'p-roc' : detection power for a given ROC. Needs two
       parameters: `p=0.05` and `tail='both'`
     'kstest' : 'full-body' distribution comparison. The best
       choice is the one with the minimal reported distance after
       estimating the parameters of the distribution. Parameter
       `p=0.05` sets the threshold for rejecting the null hypothesis
       that the distributions are the same.
       WARNING: older versions (e.g. 0.5.2 in etch) of scipy have
                incorrect kstest implementation and do not function
                properly
  distributions : None or list of basestring or tuple(basestring, dict)
    Distributions to check. If None, all distributions known in
    scipy.stats are tested. If a distribution is specified as a
    tuple, the dictionary must contain the name and any additional
    parameters (name, loc, scale, args). The entry 'scipy' adds all
    distributions known in scipy.stats.
  **kwargs
    Additional arguments which are needed for each particular test
    (see above)

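:Example:
  import numpy as N
  from mvpa.clfs.stats import matchDistribution

  # 1000 samples drawn from a standard normal distribution
  data = N.random.normal(size=(1000, 1))
  matches = matchDistribution(
    data,
    distributions=['rdist',
                   ('rdist', {'name': 'rdist_fixed',
                              'loc': 0.0,
                              'args': (10,)})],
    nsamples=30, test='p-roc', p=0.05)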
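
The 'kstest' selection rule (estimate parameters, then prefer the smallest Kolmogorov-Smirnov distance) can be sketched directly with scipy.stats. This is an independent illustration rather than the code behind `matchDistribution`; the function name `rank_by_ks` and the three candidate distributions are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

def rank_by_ks(data, names=('norm', 'laplace', 'uniform')):
    """Fit each named scipy.stats distribution and sort by KS distance."""
    data = np.ravel(data)
    results = []
    for name in names:
        dist = getattr(stats, name)
        params = dist.fit(data)                 # maximum-likelihood estimates
        distance, pvalue = stats.kstest(data, name, args=params)
        results.append((distance, name, params))
    # Smallest distance first: the best-matching candidate leads the list
    results.sort(key=lambda r: r[0])
    return results

rng = np.random.RandomState(0)
sample = rng.normal(size=1000)
ranking = rank_by_ks(sample)
for distance, name, params in ranking:
    print("%-8s D=%.3f" % (name, distance))
```

Because the parameters are estimated from the same data that is then tested, the p-values reported by `kstest` are optimistic; the sketch therefore ranks by distance only.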

plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')

Plot best matching distributions
Parameters:
  • data (N.ndarray) - Data which was used to obtain the matches
  • matches (list of tuples) - Sorted matches as provided by matchDistribution
  • nbins (int) - Number of bins in the histogram
  • nbest (int) - Number of top matches to plot
  • expand_tails (int) - Number of extra bins by which to extend the plots of the parametrized distributions
  • legend (int) - Whether to show a legend and which statistics to include in it: 1 lists the distributions only; 2 adds the distance measure; 3 adds tp/fp/fn if p is provided
  • plot_cdf (bool) - Whether to plot the cdf of the data using a non-parametric distribution
  • p (float or None) - If not None, visualize null-hypothesis testing for the given p. Histogram bars that fall under the given p are colored red. False positives and false negatives are marked with upward- and downward-pointing triangles, respectively
  • tail (('left', 'right', 'any', 'both')) - If p is not None, the choice of tail for null-hypothesis testing
Returns:
tuple(histogram, list of lines)

autoNullDist(dist)


Cheater for human beings -- wraps dist if needed with some NullDist

The tail and other arguments are assumed to take their default values, as in NullDist/MCNullDist.

nanmean(x, axis=0)

Compute the mean over the given axis ignoring nans.
Parameters:
  • x (ndarray) - input array
  • axis (int) - axis along which the mean is computed.
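
A NaN-ignoring mean of this kind can be sketched in a few lines of NumPy. The function name `nanmean_sketch` is hypothetical and the code is an illustration of the behavior described above, not the module's source.

```python
import numpy as np

def nanmean_sketch(x, axis=0):
    """Mean along `axis`, counting only the entries that are not NaN."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    total = np.where(valid, x, 0.0).sum(axis=axis)  # NaNs contribute zero
    count = valid.sum(axis=axis)                    # non-NaN entries per slice
    # A slice consisting only of NaNs has count 0 and yields NaN
    return total / count

a = np.array([[1.0, np.nan],
              [3.0, 4.0]])
col_means = nanmean_sketch(a, axis=0)
row_means = nanmean_sketch(a, axis=1)
```

Recent NumPy releases provide `numpy.nanmean`, which gives the same result for these inputs.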