mvpa.clfs.stats

_pvalue(x, cdf_func, tail, return_tails=False, name=None)

Helper function that returns the p-value of x given a cdf and a tail

Parameters:

cdf_func (callable) - Function to be used to derive cdf values for x
tail (str ('left', 'right', 'any', 'both')) - Which tail of the distribution to report. For 'any' and 'both', the tail that x belongs to is chosen by comparison to p=0.5. For 'any', significance is computed as in a one-tailed test.
return_tails (bool) - If True, a tuple (pvalues, tails) is returned, where tails contains 1 if the value came from the right tail and 0 if it came from the left tail.
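
The tail handling described above can be illustrated with a minimal sketch. It is not the library's implementation: `pvalue_sketch` and `normal_cdf` are hypothetical names, only scalar x is handled, and doubling the p-value for 'both' is an assumption based on the usual two-tailed convention.

```python
import math

def normal_cdf(x, loc=0.0, scale=1.0):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))

def pvalue_sketch(x, cdf_func, tail, return_tails=False):
    """Hypothetical stand-in for _pvalue; handles a single scalar x."""
    # Clip to [0, 1] to guard against small numerical overshoot in the cdf
    cdf = min(max(cdf_func(x), 0.0), 1.0)
    if tail == 'left':
        p, right = cdf, 0
    elif tail == 'right':
        p, right = 1.0 - cdf, 1
    elif tail in ('any', 'both'):
        # Pick the tail x belongs to by comparing its cdf value with 0.5
        right = 1 if cdf >= 0.5 else 0
        p = 1.0 - cdf if right else cdf
        if tail == 'both':
            # Assumption: two-tailed significance doubles the one-tailed value
            p *= 2.0
    else:
        raise ValueError("Unknown tail %r" % (tail,))
    return (p, right) if return_tails else p

# One-tailed ('any') versus two-tailed ('both') p-value for x = 1.96
p_any = pvalue_sketch(1.96, normal_cdf, 'any')
p_both, tail_flag = pvalue_sketch(1.96, normal_cdf, 'both', return_tails=True)
```

Here `tail_flag` is 1 because 1.96 lies in the right tail of the standard normal distribution.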

matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)

Determine best matching distribution.

Can be used for 'smelling' the data, as well as to choose a
parametric distribution for data obtained from non-parametric
testing (e.g. `MCNullDist`).

WiP: use with caution, API might change

:Parameters:
  data : N.ndarray
    Array of the data for which to deduce the distribution. It has
    to be sufficiently large to make a reliable conclusion
  nsamples : int or None
    If None -- use all samples in data to estimate the parametric
    distribution. Otherwise use only the specified number of samples,
    randomly selected from data.
  loc : float or None
    Loc for the distribution (if known)
  scale : float or None
    Scale for the distribution (if known)
  test : basestring
    What kind of testing to do. Choices:
     'p-roc' : detection power for a given ROC. Needs two
       parameters: `p=0.05` and `tail='both'`
     'kstest' : 'full-body' distribution comparison. The best
       choice is made by minimal reported distance after estimating
       parameters of the distribution. Parameter `p=0.05` sets
       threshold to reject null-hypothesis that distribution is the
       same.
       WARNING: older versions (e.g. 0.5.2 in etch) of scipy have
                incorrect kstest implementation and do not function
                properly
  distributions : None or list of basestring or tuple(basestring, dict)
    Distributions to check. If None, all known in scipy.stats
    are tested. If distribution is specified as a tuple, then
    it must contain name and additional parameters (name, loc,
    scale, args) in the dictionary. Entry 'scipy' adds all known
    in scipy.stats.
  **kwargs
    Additional arguments which are needed for each particular test
    (see above)

:Example:
  import numpy as N
  from mvpa.clfs.stats import matchDistribution
  data = N.random.normal(size=(1000,1))
  matches = matchDistribution(
    data,
    distributions=['rdist',
                   ('rdist', {'name':'rdist_fixed',
                              'loc': 0.0,
                              'args': (10,)})],
    nsamples=30, test='p-roc', p=0.05)
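
The 'kstest' selection rule described above (estimate the parameters of each candidate, then rank by the smallest distance) can be sketched with the standard library alone. This is an illustration, not the code behind matchDistribution: `ks_distance`, the hand-written cdfs, and the candidate names are hypothetical, and the `p` rejection threshold is omitted.

```python
import math
import random

def ks_distance(samples, cdf):
    """Kolmogorov-Smirnov distance between the empirical cdf and `cdf`."""
    xs = sorted(samples)
    n = float(len(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical cdf jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def normal_cdf(x, loc, scale):
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))

def uniform_cdf(x, lo, hi):
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Estimate parameters from the data
mean = sum(data) / len(data)
std = math.sqrt(sum((v - mean) ** 2 for v in data) / len(data))
lo, hi = min(data), max(data)

# Hypothetical candidates: (name, cdf with its parameters bound)
candidates = [
    ('norm', lambda x: normal_cdf(x, mean, std)),
    ('uniform', lambda x: uniform_cdf(x, lo, hi)),
    # A candidate with a "known" but wrong loc, analogous to fixing loc
    ('norm_fixed', lambda x: normal_cdf(x, 1.0, std)),
]

# Rank candidates by distance; the smallest distance is the best match
matches = sorted((ks_distance(data, cdf), name) for name, cdf in candidates)
best_distance, best_name = matches[0]
```

Because the samples are drawn from a normal distribution, the fitted normal candidate ends up first in `matches`.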

plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')

Plot best matching distributions

Parameters:

data (N.ndarray) - Data which was used to obtain the matches
matches (list of tuples) - Sorted matches as provided by matchDistribution
nbins (int) - Number of bins in the histogram
nbest (int) - Number of top matches to plot
expand_tails (int) - Number of extra bins by which to extend the plots of the parametrized distributions into the tails
legend (int) - Whether to show a legend and which statistics to include in it. 1 -- just lists the distributions. 2 -- adds the distance measure. 3 -- adds tp/fp/fn if p is provided
plot_cdf (bool) - Whether to plot the cdf of the data using a non-parametric distribution
p (float or None) - If not None, visualize null-hypothesis testing (given p). Bars in the histogram which fall under the given p are colored red. False positives and false negatives are marked with upward- and downward-pointing triangles, respectively
tail (('left', 'right', 'any', 'both')) - If p is not None, the choice of tail for null-hypothesis testing

Returns:

tuple(histogram, list of lines)

Module stats

_pvalue(x, cdf_func, tail, return_tails=False, name=None)

matchDistribution(data, nsamples=None, loc=None, scale=None, args=None, test='kstest', distributions=None, **kwargs)

plotDistributionMatches(data, matches, nbins=31, nbest=5, expand_tails=8, legend=2, plot_cdf=True, p=None, tail='both')

autoNullDist(dist)

nanmean(x, axis=0)