The org.bzdev.math.stats package
This package contains several statistics-related classes. The class hierarchy is shown in the following diagram:
One will typically use the stats package as follows:
- In the simplest case, a program such as a simulation
will produce data providing a mean value and standard
deviation for quantities such as queue lengths. The class
BasicStats
or BasicStatsMV
can be used to record this data and then compute the mean and standard deviation. A table of means and standard deviations can then be used as the input for a least-squares fit.
- If an application produces an array of non-deterministic
values, computing this array multiple times, the class
CovarianceMatrix
can be used to compute the covariance matrix. One of two subclasses will be used:
- CovarianceMatrix.Population when the data sets represent the full population.
- CovarianceMatrix.Sample when the data sets are a representative sample.
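As a rough illustration of the population/sample distinction, the following stand-alone sketch computes a covariance matrix from a few repeated computations of a two-element array. It does not use the bzdev classes: CovarianceSketch and its methods are hypothetical names introduced here, and the divisors (n for a full population, n - 1 for a representative sample) are the usual textbook convention, assumed rather than taken from the library's documentation.

```java
// Stand-alone sketch; does not use org.bzdev. All names are hypothetical.
class CovarianceSketch {
    /** Column means; each row is one computation of the array. */
    static double[] means(double[][] rows) {
        int n = rows.length, m = rows[0].length;
        double[] mean = new double[m];
        for (double[] row : rows)
            for (int j = 0; j < m; j++) mean[j] += row[j];
        for (int j = 0; j < m; j++) mean[j] /= n;
        return mean;
    }

    /**
     * Covariance matrix of the columns. Divides by n when the rows are
     * the full population and by n - 1 when they are a sample.
     */
    static double[][] covariance(double[][] rows, boolean sample) {
        int n = rows.length, m = rows[0].length;
        double[] mean = means(rows);
        double divisor = sample ? n - 1 : n;
        double[][] cov = new double[m][m];
        for (double[] row : rows)
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++)
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) cov[i][j] /= divisor;
        return cov;
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 2.0}, {2.0, 4.0}, {3.0, 6.0}};
        double[][] pop = covariance(rows, false);
        double[][] smp = covariance(rows, true);
        System.out.println("population cov[0][1] = " + pop[0][1]);
        System.out.println("sample cov[0][1]     = " + smp[0][1]);
        // The square roots of the diagonal are the standard deviations.
        System.out.println("sample sdev[0]       = " + Math.sqrt(smp[0][0]));
    }
}
```

The diagonal entries are the variances of the individual array elements, so the same calculation also yields the standard deviations one would record for each quantity.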
- For hypothesis testing, the standard terminology refers to
a null hypothesis (denoted H0) and an alternate
hypothesis (denoted HA, or possibly H1,
etc., as there may be several). The null hypothesis is a
hypothesis one attempts to disprove (the word "null" is used
because "nullify" is used as a synonym for "disprove"). For
example, the null hypothesis might be that the mean value of
some quantity is zero, while the alternate hypothesis may be
that the mean is larger than some value μ. The test makes
use of a statistic - a real-valued function of the data. Often
the statistic is chosen so that its value is zero when the
probability of rejecting the null hypothesis is at its minimum
value. A statistic has an associated probability distribution:
the distribution that its value, treated as a random variable,
follows when the null hypothesis is in fact true. The following
quantities are used with respect to the null hypothesis:
- The p value is the probability, assuming the null hypothesis is true, that the random variable associated with a statistic will have a value at least as extreme (e.g., as far from zero) as the value actually computed from the data. The term "extreme" can refer to deviations from zero in one or both directions.
- The symbol α is typically used to indicate a cutoff value for the p value: the null hypothesis is rejected when the p value is at most α. For example, if α = 0.01, there is a one percent chance of the random variable generating a value at least as extreme as the critical value when the null hypothesis is true. The value of α is the probability of a type I error, that is, of the test rejecting the null hypothesis when the null hypothesis is in fact true.
- For a given value of α, the statistic's critical value is the value of the statistic for which the p value would be numerically equal to α. Depending on the situation, one may use one or two critical values (two values allow for deviations in both directions from zero).
- The value β is the integral of the probability density of a statistic for the probability distribution associated with an alternate hypothesis, with the limits of integration equal to the critical values for the null hypothesis. This measures the probability of the null hypothesis being accepted when the alternate hypothesis is in fact true. Such an outcome is referred to as a type II error.
- The statistical power is defined as 1 - β and indicates the probability that the null hypothesis is correctly rejected when the alternate hypothesis is in fact true.
- The alternate hypothesis is sometimes characterized by providing the mean value of some quantity or its difference from the value assumed by the null hypothesis. The value of β is a function of this quantity.
- To perform a test, create an instance of a subclass of Statistic, choosing the subclass that matches the statistic to be used.
- Add data to the statistic, either when the constructor is called or by calling various methods whose name starts with the string "add".
- The method Statistic.getValue() will return the value of the statistic given the data provided.
- The method Statistic.getPValue(org.bzdev.math.stats.Statistic.PValueMode) will determine the statistic's p value. This method's argument will determine if a one-sided or two-sided value is desired. There are options for the one-sided case that specify whether more extreme values are positive or negative.
- One may also use methods to obtain critical values, a distribution for an alternate hypothesis, and to compute the quantity β for a given alternate hypothesis. These methods are described in the documentation for Statistic and its subclasses.
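To make the quantities above concrete, here is a minimal stand-alone sketch for one particular statistic: a z statistic for a sample mean with a known standard deviation, whose distribution under the null hypothesis is the standard normal. It does not call the bzdev Statistic API. The class ZTestSketch, the enum Tail, and every method name are hypothetical, and the normal distribution function is approximated with Simpson's rule purely to keep the example self-contained.

```java
// Stand-alone sketch; does not use org.bzdev. All names are hypothetical.
class ZTestSketch {
    /** Which values count as "more extreme" (hypothetical enum). */
    enum Tail { BOTH, UPPER, LOWER }

    /** Standard normal probability density. */
    static double pdf(double x) {
        return Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI);
    }

    /** Standard normal CDF, using Simpson's rule on [0, |x|]. */
    static double cdf(double x) {
        double a = Math.min(Math.abs(x), 10.0); // the tail past 10 is negligible
        int n = 2000;                            // even number of intervals
        double h = a / n;
        double sum = pdf(0.0) + pdf(a);
        for (int i = 1; i < n; i++) sum += pdf(i * h) * (i % 2 == 1 ? 4 : 2);
        double half = sum * h / 3.0;             // integral from 0 to |x|
        return x >= 0 ? 0.5 + half : 0.5 - half;
    }

    /** Value of the statistic: zero when the sample mean equals mu0. */
    static double statistic(double mean, double mu0, double sigma, int n) {
        return (mean - mu0) / (sigma / Math.sqrt(n));
    }

    /** p value: probability under H0 of a value at least as extreme as z. */
    static double pValue(double z, Tail tail) {
        switch (tail) {
            case UPPER: return 1.0 - cdf(z);
            case LOWER: return cdf(z);
            default:    return 2.0 * (1.0 - cdf(Math.abs(z)));
        }
    }

    /** Two-sided critical value c, found by bisection: P(|Z| >= c) = alpha. */
    static double criticalValue(double alpha) {
        double lo = 0.0, hi = 10.0;
        for (int i = 0; i < 60; i++) {
            double mid = 0.5 * (lo + hi);
            if (2.0 * (1.0 - cdf(mid)) > alpha) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    /**
     * beta: probability under the alternate hypothesis that the statistic
     * lies between -c and c. Under HA the statistic is normal with unit
     * variance and mean delta = (muA - mu0) / (sigma / sqrt(n)).
     */
    static double beta(double alpha, double delta) {
        double c = criticalValue(alpha);
        return cdf(c - delta) - cdf(-c - delta);
    }

    public static void main(String[] args) {
        double z = statistic(0.5, 0.0, 1.0, 25);
        double alpha = 0.05;
        System.out.println("statistic       = " + z);
        System.out.println("two-sided p     = " + pValue(z, Tail.BOTH));
        System.out.println("critical value  = " + criticalValue(alpha));
        double b = beta(alpha, 2.5);
        System.out.println("beta, power     = " + b + ", " + (1.0 - b));
    }
}
```

When the shift delta is zero the alternate and null distributions coincide, so beta reduces to 1 - α; as delta grows, beta shrinks and the power approaches one.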
- For testing if a data set follows a particular distribution,
one may use the class KSStat, which provides a Kolmogorov-Smirnov test to see if a data set can be described by a specified probability distribution. This test requires that the distribution does not make use of parameters that were calculated from the data set. A good use is to test new random number generators to verify that they follow the desired probability distribution.
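The idea behind the Kolmogorov-Smirnov test can be sketched without the library: the statistic is the largest gap between the empirical distribution function of the data and the specified distribution function. The code below is a stand-alone illustration, not the KSStat implementation. KSSketch and its methods are hypothetical names, and the p value uses the asymptotic Kolmogorov series, which is only an approximation for large data sets.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Stand-alone sketch; does not use org.bzdev. All names are hypothetical.
class KSSketch {
    /** Largest gap between the empirical CDF of the data and the given CDF. */
    static double statistic(double[] data, DoubleUnaryOperator cdf) {
        double[] x = data.clone();
        Arrays.sort(x);
        int n = x.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(x[i]);
            // The empirical CDF jumps from i/n to (i + 1)/n at x[i].
            d = Math.max(d, Math.max((i + 1.0) / n - f, f - (double) i / n));
        }
        return d;
    }

    /** Asymptotic (large-n) p value from the Kolmogorov series. */
    static double asymptoticPValue(double d, int n) {
        double lambda = Math.sqrt(n) * d;
        if (lambda < 0.2) return 1.0; // the p value is within rounding of 1 here
        double sum = 0.0;
        for (int k = 1; k <= 100; k++) {
            double term = Math.exp(-2.0 * k * k * lambda * lambda);
            sum += (k % 2 == 1) ? term : -term;
        }
        return Math.min(1.0, Math.max(0.0, 2.0 * sum));
    }

    public static void main(String[] args) {
        // Check java.util.Random against the uniform distribution on [0, 1).
        Random rng = new Random(12345L);
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) data[i] = rng.nextDouble();
        DoubleUnaryOperator uniformCdf = v -> Math.min(1.0, Math.max(0.0, v));
        double d = statistic(data, uniformCdf);
        System.out.println("D = " + d);
        System.out.println("approximate p value = "
                           + asymptoticPValue(d, data.length));
    }
}
```

Note that the distribution function passed in is fixed in advance, which matches the requirement that its parameters not be estimated from the data being tested.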