Descriptive statistics are summary values that describe
features of the distribution
based on the data sample. These
include statistics of location,
statistics of scale, statistics of distributional shape
(skewness
and heavy-tailedness),
quantiles (order statistics)
and counts of the data.
Users of descriptive statistics often make implicit assumptions about
the underlying distribution. When reporting a measure of location
such as the mean or median, we usually think of the underlying
distribution as having a single "center" or "middle," such as
the center "hump" in a
normal distribution.
Or we may assume that the distribution is
continuous.
The definitions of the statistics may be perfectly
valid without those assumptions, so we must be careful in
interpreting the numbers.
Descriptive statistics are estimates, and are more accurate
for larger sample sizes than for smaller ones.
Although descriptive statistics can provide a few pieces of information
about data or their underlying distribution, they seldom give as
good an overall picture of the distribution as a
boxplot,
histogram,
normal probability plot,
or other graph of the data. One or two graphs may give you a much
better idea of what your data "look like" than a raft of numeric statistics.
At the very least, graphs will help you interpret the descriptive
statistics.
The most common descriptive statistics are those that
measure location, or central tendency--the generalized
concept of the "average" value of a distribution.
The sample arithmetic mean, also known simply
as the mean or average, is the sum of all the sample values
divided by the sample size. It is the
best estimate of the expectation (mean) of the underlying population.
It is also the center of gravity of the histogram of the sample--
if the histogram were constructed out of cardboard or sheet metal,
the mean value would be the fulcrum point where the histogram
would balance horizontally.
Because the mean is calculated from all the sample values, it
makes the maximum possible use of the available data. On the other
hand, it can be influenced by any extreme value; i.e.,
it is not resistant. In using the
mean, you should always check for the presence of outliers in the sample.
One method of dealing with the problem of outlying values is to
use weighting or trimming in calculating the mean.
In the usual mean calculation, all the sample values are given
the same weight (1/sample size) in the sum. This can be
adjusted to use any collection of weights that sum to 1.
For a trimmed mean
a proportion (e.g., 10%) of the
data at each end of the sample is trimmed off, and
then the arithmetic mean is calculated for the remaining
values. The 10% trimmed mean for a sample size of 20
would be the average of the middle 16 values. This is
equivalent to weighting those 16 values equally and
assigning weights of 0 to the other 4 trimmed values.
Other weighted means, such as the biweight
or Winsorized means, can be calculated by using a weighting function.
The weighting function may depend on the size of
the individual values.
Chapters 10 and 11 of Hoaglin et al.
discuss trimming and weighting in detail.
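The weighting and trimming ideas above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the calculation Prophet performs; the function names and the sample data are made up for the example, and the Winsorized mean shown is the usual textbook version (extreme values are pulled in to the nearest retained value rather than dropped).

```python
from statistics import fmean

def weighted_mean(values, weights):
    """Weighted mean; weights are rescaled so that they sum to 1."""
    total = sum(weights)
    return sum(v * w / total for v, w in zip(values, weights))

def trimmed_mean(values, proportion):
    """Discard `proportion` of the sorted data at EACH end, then average."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))      # number of values dropped per end
    return fmean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion):
    """Replace the k most extreme values at each end by the nearest kept value."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    if k > 0:
        ordered[:k] = [ordered[k]] * k
        ordered[-k:] = [ordered[-k - 1]] * k
    return fmean(ordered)

# Hypothetical sample of size 20: the values 1..19 plus one wild value.
sample = list(range(1, 20)) + [1000]

print(fmean(sample))                  # ordinary mean, dragged upward by 1000
print(trimmed_mean(sample, 0.10))     # average of the middle 16 values
print(winsorized_mean(sample, 0.10))  # extremes pulled in, then averaged
```

Giving the four trimmed values a weight of 0 and the other 16 equal weights in weighted_mean reproduces the 10% trimmed mean.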
The confidence interval for the sample mean as reported in Prophet
is the half-width of the 95% confidence interval for the mean
of a normal distribution, calculated as for the
one-sample t test.
The sample median is the "middle"
value of the sample. There are as many sample values above the
sample median as below it. If the sample size is odd (say, 2N + 1),
then the median is the (N+1)st largest data value. If the sample size
is even (say, 2N), then the median is defined as the
average of the Nth and (N+1)st largest data values.
The sample median will divide the histogram into two pieces
with equal areas.
The sample median is the best estimate of the median of the
underlying population.
Because the median is calculated from only one or two data values,
it is highly resistant, and
may be preferred to the mean when dealing with
skewed data.
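As a rough sketch of the definition and of what resistance means in practice, the Python below computes the median for odd and even sample sizes and then replaces one value with a wild one. The function name and the data are hypothetical; the standard library's statistics.median gives the same answers.

```python
from statistics import fmean, median

def sample_median(values):
    """Middle order statistic, or the average of the two middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle value (odd n)
    return (ordered[mid - 1] + ordered[mid]) / 2  # two middle values (even n)

odd = [7, 1, 5, 3, 9]
even = [7, 1, 5, 3, 9, 11]
contaminated = odd[:-1] + [900]   # the 9 replaced by a wild value

print(sample_median(odd), sample_median(even))
print(sample_median(contaminated), fmean(contaminated))  # median unchanged, mean is not
```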
For skewed distributions, the sample mean will be further toward
the direction of skew than the median: above the median for distributions
skewed to the right, and below the median for distributions skewed to the
left.
For symmetric distributions, the mean and median will be the same,
and the sample mean and sample median will be estimating the same value.
Since the sample mean is generally the better estimator in this case, especially
if the population distribution is normal, the mean is generally
preferred unless there is some reason to suspect nonnormality,
especially asymmetry.
The confidence interval for the sample median as reported in Prophet
is the half-width of a robust 95% confidence interval
for the median of a symmetric but possibly heavy-tailed distribution, as described in
Chapter 12 of Hoaglin et al.
The sample geometric mean is designed
for averaging ratio or proportion data. It is equivalent to taking
logarithms of the sample values
(i.e., transforming
the sample), finding the arithmetic mean of the logs, and
then retransforming back to the original scale (by taking antilogs).
It can only be used when all the sample values are greater than 0.
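The transform, average, and back-transform recipe can be sketched as follows, together with the same recipe using reciprocals in place of logarithms (which yields the harmonic mean). The function names are illustrative only; Python's statistics module already provides geometric_mean and harmonic_mean, which are used here as a cross-check.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean

def geometric_mean_via_logs(values):
    """Take logs, average them, then take the antilog."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires all values > 0")
    return math.exp(fmean(math.log(v) for v in values))

def harmonic_mean_via_reciprocals(values):
    """Take reciprocals, average them, then take the reciprocal again."""
    return 1 / fmean(1 / v for v in values)

ratios = [1.0, 2.0, 4.0]   # hypothetical ratio data

print(geometric_mean_via_logs(ratios), geometric_mean(ratios))
print(harmonic_mean_via_reciprocals(ratios), harmonic_mean(ratios))
```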
If the logarithm transformation used for the geometric mean is replaced by the reciprocal
transformation, the result is the sample harmonic mean,
which is sometimes used to average rates.
The sample mode is the single most frequently
occurring data value. Samples from a continuous distribution may
not have any repeated data values, so the mode is generally more
informative with samples from
discrete distributions.
A mode looks like a hump in a graph of the frequency distribution
of the sample or population. A sample (or the underlying distribution)
may have more than one mode, although Prophet will only report a mode
if there is a single one. If the distribution is unimodal, like the
normal distribution, and also symmetric, then the sample mean,
the sample mode, and the sample median are all estimates of the
same value, the population mean.
The sample mode is less sensitive to skewness than either the sample mean
or the sample median, but it is more subject to sample variation than
either the sample mean or the sample median.
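A minimal sketch of a "report the mode only if it is unique" rule is shown below. It assumes that a tie for the most frequent value means no mode is reported; whether Prophet applies exactly this rule is an assumption, and the function name is made up for the example.

```python
from statistics import multimode

def single_mode(values):
    """Return the mode if exactly one value is most frequent, else None."""
    modes = multimode(values)          # all values tied for the highest count
    return modes[0] if len(modes) == 1 else None

print(single_mode([1, 2, 2, 3]))       # one clear mode
print(single_mode([1, 1, 2, 2, 3]))    # two modes, so nothing is reported
print(single_mode([1.5, 2.5, 3.5]))    # no repeated values at all
```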
The sample midrange is the midpoint
of the sample--the average of the smallest and largest data values in the sample.
Like the sample median, it uses only a small portion of the data, but
can be heavily affected by outliers,
even more so than the sample mean. The mean daily temperature
reported in newspapers is usually in fact a midrange.
The letter values display
includes the midrange, as well as midpoints between other
quantiles.
A series of such midpoints can provide information about the
skewness and heavy-tailedness of the distribution, but the midrange
by itself does not provide much information.
The sample sum is simply the sum
of all the sample data values. It is identical to the mean
multiplied by the sample size.
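The midrange and the sum are simple enough to check by hand, but a short sketch makes the sensitivity of the midrange concrete. The temperatures below are invented for illustration.

```python
from statistics import fmean

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

temps = [12, 14, 15, 19, 25]    # hypothetical temperature readings
hot = temps + [95]              # the same readings plus one aberrant value

print(midrange(temps), fmean(temps), sum(temps))
print(midrange(hot) - midrange(temps))   # shift in the midrange
print(fmean(hot) - fmean(temps))         # shift in the mean (smaller)
```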
Scale statistics measure the variability or
dispersion of the sample data, how scattered (or, conversely,
clustered) the data are about the center of the distribution.
The sample variance is the
average of the squared deviations of each sample value
from the sample mean, except that instead of dividing the
sum of the squared deviations by the sample size N, the
sum is divided by N-1. This is done to make the
sample variance an unbiased
estimator of the population variance.
The sample standard deviation is the
square root of the sample variance. This means that
it has the same linear units as the original data values
or a measure of central tendency, instead of the squared
units of the sample variance.
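The N-1 divisor can be written out directly. The sketch below uses made-up data and an illustrative function name, and compares the result with the standard library's variance and stdev, which use the same divisor.

```python
import math
from statistics import fmean, stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

print(sample_variance(data), variance(data))
print(math.sqrt(sample_variance(data)), stdev(data))
```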
Like the sample mean, the sample variance and sample standard
deviation make use of all the available sample data,
and can be heavily influenced by an extreme value,
or by skewed data.
Because the sample variance and standard deviation are
based on squared deviations, a single aberrant value
can make a huge difference in the calculated sample statistic.
In using these sample statistics,
you should always check for the presence of outliers in the sample.
A related statistic, the mean absolute deviation,
is the mean of the absolute values of the
deviations of each value from the mean (or, sometimes,
from the sample median). Like the variance and standard
deviation, it can be influenced by even a single outlier,
but because the deviations are not squared, the effect
is not as pronounced.
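To see that an outlier inflates the mean absolute deviation less than it inflates the standard deviation, consider the following sketch. The data and the function name are hypothetical, and the center defaults to the mean but can be set to the median.

```python
from statistics import fmean, median, stdev

def mean_absolute_deviation(values, center=None):
    """Mean of |value - center|; center defaults to the sample mean."""
    c = fmean(values) if center is None else center
    return fmean(abs(v - c) for v in values)

clean = [2, 4, 4, 4, 5, 5, 7, 9]
dirty = clean + [50]              # one outlier added

# Inflation factor caused by the outlier, for each statistic.
print(mean_absolute_deviation(dirty) / mean_absolute_deviation(clean))
print(stdev(dirty) / stdev(clean))
print(mean_absolute_deviation(clean, center=median(clean)))
```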
The sample standard error of the mean is the
sample standard deviation divided by the square root of the sample size.
It is simply the estimate of the standard deviation of the sample mean,
and shares both the advantages and lack of resistance of the sample standard deviation.
The sample coefficient of variation is the
sample standard deviation divided by the sample mean, sometimes
multiplied by 100 to give a percentage. It measures relative
variability by correcting for the magnitude of the data values,
and thus giving a measure that has no units. It is a
biased estimator of the
population coefficient of variation.
If two populations are identical except for a change of scale, then the
coefficients of variation will be the same. Thus the coefficient
of variation is often used to compare the variability of populations
that are somehow related, but have different orders of magnitude,
such as body weights of elephants vs shrews.
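The standard error of the mean and the coefficient of variation can both be sketched from the sample standard deviation. The weights below are invented, and the second sample is simply the first multiplied by a constant, to illustrate that a change of scale leaves the coefficient of variation unchanged.

```python
import math
from statistics import fmean, stdev

def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return stdev(values) / math.sqrt(len(values))

def coefficient_of_variation(values, percent=False):
    """Sample standard deviation divided by the sample mean."""
    cv = stdev(values) / fmean(values)
    return 100 * cv if percent else cv

shrews = [4.0, 5.0, 6.0]                       # hypothetical weights in grams
elephants = [w * 1_000_000 for w in shrews]    # same shape, very different scale

print(standard_error(shrews))
print(coefficient_of_variation(shrews), coefficient_of_variation(elephants))
```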
The sample sum of squares is the
sum of the
squared deviations of each sample value
from the sample mean, and is simply the
sample variance multiplied by one less than the sample size.
The sample range is the
difference between the maximum and minimum values in the
sample. Like the sample midrange, it uses
only a small portion of the data, but
can be heavily affected by outliers.
It is also not a very good estimator of the population range,
since it is biased and
highly variable. Its best use is in conjunction with another
scale statistic like the sample standard deviation.
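The relationship between the sum of squares and the variance, and the definition of the range, can be checked with a short sketch. The data and function names are illustrative.

```python
import math
from statistics import fmean, variance

def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def sample_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

print(sum_of_squares(data), variance(data) * (len(data) - 1))
print(sample_range(data))
```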
The sample interquartile range is the
difference between the upper (75th percentile) and lower
(25th percentile) quartiles of the data sample,
which are the upper and lower bounds of the center half of the
data values. It does not use all the available data, but
is based only on the central half of the data. It
is less likely to be heavily affected by
outliers
or skewness (which mostly affects values in the tails)
than either the range or the standard deviation, but
is not the best estimator when the population is
known to be normal or nearly so.
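The following sketch computes an interquartile range and shows that a single outlier leaves it unchanged while the range explodes. Quartile conventions differ between packages; this example uses the default ("exclusive") method of Python's statistics.quantiles, which is not necessarily the convention Prophet uses.

```python
from statistics import quantiles

def interquartile_range(values):
    """Upper quartile minus lower quartile (statistics.quantiles defaults)."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1

clean = list(range(1, 12))        # hypothetical data: 1..11
dirty = clean[:-1] + [1000]       # largest value replaced by an outlier

print(interquartile_range(clean), max(clean) - min(clean))
print(interquartile_range(dirty), max(dirty) - min(dirty))
```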
Shape statistics measure how the shape of
the underlying population differs from the shape of a
normal distribution with the same mean and variance.
Boxplots,
histograms, and
normal probability plots
often help in interpreting shape statistics.
The sample skewness measures
asymmetry. A symmetric distribution has 0 skewness,
a distribution skewed to the right (long righthand tail)
has positive skewness, and
a distribution skewed to the left (long lefthand tail)
has negative skewness.
Outliers in a sample
from a symmetric distribution can produce a non-zero sample skewness statistic.
A boxplot or
normal probability plot of the sample can
provide information as to whether this might be the case.
The sample kurtosis measures
heavy-tailedness or light-tailedness relative to the normal distribution.
A light-tailed distribution like the uniform distribution has
fewer values in the tails (away from the center of the distribution)
than the normal distribution, and will have negative kurtosis.
A heavy-tailed distribution like the Cauchy distribution has
more values in the tails (away from the center of the distribution)
than the normal distribution, and will have positive kurtosis.
Outliers in a sample
from a distribution with normal tails
can produce a non-zero sample kurtosis statistic.
A boxplot or
normal probability plot of the sample can
provide information as to whether this might be the case.
A sample from a distribution with long tails (positive kurtosis) may
also have a sizeable non-zero skewness statistic, even if the
underlying distribution is symmetric. Both the sample skewness
and sample kurtosis statistics make use of all the data values,
and, like the mean and standard deviation, are sensitive to outliers.
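One common way to compute these shape statistics is from the central moments of the sample, as sketched below. This is an assumption for illustration: many packages apply small-sample corrections to these formulas, and the exact formulas Prophet uses are not stated here. Kurtosis is reported as excess kurtosis, so that a normal distribution corresponds to 0.

```python
from statistics import fmean

def central_moment(values, k):
    """Mean of (value - sample mean) ** k."""
    m = fmean(values)
    return fmean((v - m) ** k for v in values)

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (no small-sample correction)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical, evenly spread data
with_outlier = symmetric + [50]        # the same data plus one outlier

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(with_outlier), excess_kurtosis(with_outlier))
```

The single added value turns both statistics positive, even though the rest of the sample is symmetric and light-tailed.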
The normality test gives a
P value for the Shapiro-Wilk omnibus test of normality. (If the sample
size is greater than 2000, Stephens' test of normality is
performed.) This test
detects departures from normality, but will not indicate
the type of nonnormality (e.g., skewness vs heavy-tailedness).
Quantiles are order statistics,
or averages of two
order statistics, chosen so that a certain proportion of the sorted data
values fall below the quantile. The median is a quantile: it is the
50th percentile, because 50% of the data values fall below it.
The minimum and
maximum sample values are also quantiles, as the 0th and
100th percentiles, respectively. They are also known as the extremes.
The letter values
display is made of a specific set of quantiles, such that the
proportion that falls below each quantile is a power of ½. The median is
the first such quantile. The next such quantiles are the lower and upper
quartiles. The lower quartile,
Q1, is the 25th percentile.
The upper quartile,
Q3, is the 75th percentile. Q3-Q1 is the
interquartile range.
If the distance between the median and Q3 is greater than
that between the median and Q1, the distribution may be
skewed to the right.
If the distance between the median and Q3 is less than
that between the median and Q1, the distribution may be
skewed to the left.
The center box of a boxplot
is constructed from Q1 and Q3, along with the median.
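A letter values display can be sketched with the depth rule used in exploratory data analysis: the median has depth (n + 1)/2, and each further letter value has depth (floor of the previous depth + 1)/2, counted in from both ends of the sorted sample. Whether Prophet computes its display in exactly this way is an assumption; the function name and data are hypothetical. Each row also carries the midpoint of the lower and upper values, which drifts upward for a right-skewed sample.

```python
import math

def letter_values(values, levels=3):
    """Rows of (depth, lower, upper, midpoint) for the median, quartiles, eighths, ..."""
    x = sorted(values)
    n = len(x)
    depth = (n + 1) / 2                      # depth of the median
    rows = []
    for _ in range(levels):
        lo, hi = math.floor(depth), math.ceil(depth)   # 1-based positions
        lower = (x[lo - 1] + x[hi - 1]) / 2            # counted up from the minimum
        upper = (x[n - lo] + x[n - hi]) / 2            # counted down from the maximum
        rows.append((depth, lower, upper, (lower + upper) / 2))
        if depth <= 1:
            break
        depth = (math.floor(depth) + 1) / 2
    return rows

skewed = [1, 2, 3, 4, 6, 9, 14, 22, 40]      # hypothetical right-skewed sample
for row in letter_values(skewed):
    print(row)
```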
The sample size is the number of
(non-empty) values in the sample.
The number missing is the number of
empty (missing) values in the sample. Prophet calculates
the number of missing values on a per-column basis, so
that empty values are not counted as missing in a column
if they occur after the row with the last non-empty
value in that column.
The number of unique values is the number of
different values in the sample. This is useful for checking
for incorrectly entered values in a sample from a discrete distribution,
or as a very crude indication of clumpiness in a sample from a
continuous distribution.
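The per-column counting rule can be sketched as follows. The representation of a column as a Python list with None for an empty cell is an assumption made for this example, as are the function name and data; it is not Prophet's internal format.

```python
def column_counts(column):
    """Counts for one column, where None stands for an empty cell."""
    # Index of the last non-empty cell; trailing empties are ignored.
    last = max((i for i, v in enumerate(column) if v is not None), default=-1)
    used = column[:last + 1]
    values = [v for v in used if v is not None]
    return {
        "sample_size": len(values),
        "missing": len(used) - len(values),
        "unique": len(set(values)),
    }

column = [3.1, None, 4.7, 3.1, None, 5.0, None, None]   # two trailing empties
print(column_counts(column))
```

Here the two empty cells after the last value are not counted as missing, while the two empty cells within the data are.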
If you are not familiar with descriptive statistics, you are advised to
consult with a statistician. Failure to understand descriptive statistics
may result in drawing erroneous conclusions from your data.
You may also want to consult the following references:
Brownlee, K. A. 1965. Statistical Theory and Methodology
in Science and Engineering. New York: John Wiley & Sons.
Daniel, Wayne W. 1995. Biostatistics. 6th ed.
New York: John Wiley & Sons.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. (eds.). 1983.
Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons.