Measures of Location and Dispersion
Measures of Location
Measures of location are used to summarize the data with just one value. They try to capture with a single number what is typical of the data. What single number is most representative of an entire list of numbers? We will study three common measures of location: the mean, the median, and the mode.
Mean or Average
The (arithmetic) mean, or average, of n observations, written \(\bar{x}\) (pronounced "x bar"), is simply the sum of the observations divided by the number of observations:
$$\bar{x} = \frac{\text{Sum of all sample values}}{\text{Sample size}} = \frac{\sum x_i}{n}$$
The major advantage of the mean is that it uses all the data values, and is, in a statistical sense, efficient.
The main disadvantage of the mean is that it is vulnerable to outliers. Outliers are single observations that, if excluded from the calculation, would have a noticeable influence on the result. For example, if we had entered '21' instead of '2.1' in the calculation of the mean in Example 1, we would find the mean changed from 1.50 kg to 7.98 kg. It does not necessarily follow, however, that outliers should be excluded from the final data summary, or that they always result from an erroneous measurement.
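As a minimal sketch in Python (using a made-up list of birthweights, not the data from Example 1), the mean can be computed directly, and swapping a single value for an outlier shows how strongly it shifts:

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(values) / len(values)

weights = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1]  # hypothetical birthweights (kg)
print(mean(weights))                           # about 1.54

# Entering 21 instead of 2.1 drags the mean sharply upwards.
with_outlier = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 21]
print(mean(with_outlier))                      # about 4.24
```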
Median
The median is defined as the middle point of the ordered data. It is estimated by first ordering the data from smallest to largest, and then counting upwards for half the observations. The estimate of the median is either the observation at the center of the ordering in the case of an odd number of observations, or the simple average of the middle two observations if the total number of observations is even. More specifically, if there is an odd number of observations, it is the \([\frac{n+1}{2}]^{\text{th}}\) observation, and if there is an even number of observations, it is the average of the \([\frac{n}{2}]^{\text{th}}\) and the \([\frac{n}{2}+1]^{\text{th}}\) observations.
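As a minimal sketch in Python, assuming a simple list of numbers, the rule above translates directly into code:

```python
def median(values):
    """Middle point of the ordered data; for an even number of
    observations, the average of the two central observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of n/2-th and (n/2+1)-th

print(median([1.2, 1.3, 1.4, 1.5, 2.1]))        # 1.4
print(median([1.2, 1.3, 1.4, 1.5, 1.6, 2.1]))   # 1.45
```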
The median has the advantage that it is not affected by outliers; for example, the median in the example above would be unaffected by replacing '2.1' with '21'. However, it is not statistically efficient, as it does not make use of all the individual data values.
Mode
A third measure of location is the mode. This is the value that occurs most frequently, or, if the data are grouped, the grouping with the highest frequency. It is not used much in statistical analysis, since its value depends on the accuracy with which the data are measured, although it may be useful for categorical data to describe the most frequent category. The expression 'bimodal distribution' is used to describe a distribution with two peaks. This can be caused by mixing populations. For example, height might appear bimodal if one had men and women in the population. Some illnesses may raise a biochemical measure, so in a population containing healthy and ill people one might expect a bimodal distribution. However, some illnesses are defined by the measure (e.g. obesity or high blood pressure), and in this case the distributions are usually unimodal.
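For categorical data, the mode is easy to compute; here is a minimal sketch using Python's standard-library Counter (the blood groups are hypothetical illustrative data):

```python
from collections import Counter

def mode(values):
    """The value that occurs most frequently (one of them, if there are ties)."""
    return Counter(values).most_common(1)[0][0]

blood_groups = ["A", "B", "A", "O", "A", "B", "O"]
print(mode(blood_groups))  # 'A' is the most frequent category
```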
Measures of Dispersion or Variability
A measure of dispersion is a statistical value that quantifies the spread or variability of data points in a dataset. Common measures include range (the difference between the highest and lowest values), variance (the average of the squared differences from the mean), and standard deviation (the square root of the variance). These measures are used with a measure of central tendency (like the mean or median) to provide a complete picture of a distribution.
Range
The range is given as the smallest and largest observations, and is the simplest measure of variability. Note that in statistics (unlike physics) a range is given by two numbers, not the difference between the smallest and largest. For some data it is very useful, because one would want to know these numbers, for example the ages of the youngest and oldest participants in a sample. However, if outliers are present it may give a distorted impression of the variability of the data, since only two observations are included in the estimate.
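A small illustration, with hypothetical ages, of reporting the range as two numbers:

```python
ages = [23, 31, 35, 42, 57, 68]  # hypothetical participant ages
print(min(ages), max(ages))      # the range, reported as 23 to 68
```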
Quartiles and Interquartile Range
The quartiles, namely the lower quartile, the median and the upper quartile, divide the data into four equal parts; that is, there will be approximately equal numbers of observations in the four sections (and exactly equal if the sample size is divisible by four and the measures are all distinct). Note that there are in fact only three quartiles, and these are points, not proportions. The quartiles are calculated in a similar way to the median; first arrange the data in size order and determine the median, using the method described above. Now split the data in two (the lower half and upper half, based on the median). The first quartile is the middle observation of the lower half, and the third quartile is the middle observation of the upper half.
The interquartile range is a useful measure of variability and is given by the lower and upper quartiles. The interquartile range is not vulnerable to outliers and, whatever the distribution of the data, we know that approximately 50% of observations lie within the interquartile range.
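A minimal sketch of the procedure described above: split the ordered data at the median and take the median of each half. Note that conventions differ on whether the middle observation is included in the halves when n is odd; this sketch excludes it.

```python
def median(values):
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Lower quartile, median and upper quartile, computed as the medians
    of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    half = len(ordered) // 2  # excludes the middle observation when n is odd
    return median(ordered[:half]), median(ordered), median(ordered[-half:])

q1, q2, q3 = quartiles([2, 4, 4, 5, 6, 7, 8, 9])
print(q1, q2, q3)    # 4.0 5.5 7.5
print(q1, "to", q3)  # the interquartile range, given as two numbers
```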
Standard Deviation and Variance
The standard deviation of a sample (s) is calculated as follows: $$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} $$
The expression \( \sum(x_i - \bar{x})^2 \) is interpreted as: from each individual observation \(x_i\), subtract the mean \(\bar{x}\), then square this difference. Next, add each of the n squared differences. This sum is then divided by \(n-1\). This expression is known as the sample variance \(s^2\). The variance is expressed in squared units, so we take the square root to return to the original units, which gives the standard deviation, s. Examining this expression, it can be seen that if all the observations were the same (i.e. \(x_1 = x_2 = \dots = x_n\)), then they would each equal the mean, and so s would be zero. If the x's were widely scattered about the mean, then s would be large. In this way, s reflects the variability in the data. Like the mean, the standard deviation is vulnerable to outliers.
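The formula maps directly onto code; here is a minimal sketch in Python, reusing the hypothetical birthweights from earlier:

```python
import math

def sample_sd(values):
    """Sample standard deviation: the squared deviations from the mean are
    summed, divided by (n - 1) to give the variance s^2, then square-rooted."""
    n = len(values)
    xbar = sum(values) / n
    variance = sum((x - xbar) ** 2 for x in values) / (n - 1)  # s squared
    return math.sqrt(variance)

print(sample_sd([1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1]))  # about 0.30
print(sample_sd([1.5, 1.5, 1.5]))                      # 0.0: identical observations
```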
Why is the standard deviation useful?
It turns out that in many situations about 95% of observations will be within two standard deviations of the mean; this interval is known as a reference interval. It is this characteristic of the standard deviation which makes it so useful, and it holds for a large number of measurements commonly made in medicine. In particular, it holds for data that follow a Normal distribution. Standard deviations should not be used for highly skewed data, such as counts or bounded data, since they do not then provide a meaningful measure of variation; an interquartile range or range should be used instead. In particular, if the standard deviation is of a similar size to the mean, then the SD is not an informative summary measure, save to indicate that the data are skewed.
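A quick simulation sketch (with an arbitrary mean of 100 and SD of 15) illustrates the rule for Normally distributed data:

```python
import random

random.seed(1)
data = [random.gauss(100, 15) for _ in range(10_000)]  # simulated Normal data

n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

# Proportion of observations within two standard deviations of the mean.
inside = sum(mean - 2 * sd <= x <= mean + 2 * sd for x in data)
print(inside / n)  # roughly 0.95
```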
Mean Signed Deviation
In statistics, the mean signed difference (MSD), also known as the mean signed deviation, mean signed error, or mean bias error, is a sample statistic that summarizes how well a set of estimates \(\hat{\theta}_i\) match the quantities \(\theta_i\) that they are supposed to estimate. It is one of a number of statistics that can be used to assess an estimation procedure, and it would often be used in conjunction with a sample version of the mean square error.
The mean signed difference is derived from a set of n pairs \((\hat{\theta}_i, \theta_i)\), where \(\hat{\theta}_i\) is an estimate of the parameter \(\theta\) in a case where it is known that \(\theta = \theta_i\). In many applications, all the quantities \(\theta_i\) will share a common value. When applied to forecasting in a time series context, a forecasting procedure might be evaluated using the mean signed difference, with \(\hat{\theta}_i\) being the predicted value of the series at a given lead time and \(\theta_i\) being the value of the series eventually observed for that time point. The mean signed difference is defined to be
$$ \text{MSD}(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \left( \hat{\theta}_i - \theta_i \right) $$
The mean signed difference is often useful when the estimates \(\hat{\theta}_i\) are biased away from the true values \(\theta_i\) in a particular direction. If the estimator that produces the \(\hat{\theta}_i\) values is unbiased, then \(\text{MSD}(\hat{\theta}) = 0\) in expectation; in any particular sample it will typically be close to, rather than exactly, zero. However, if the estimates are produced by a biased estimator, then the mean signed difference is a useful tool for understanding the direction of the estimator's bias.
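A minimal sketch of the definition, with hypothetical forecasts and observations:

```python
def mean_signed_difference(estimates, truths):
    """Average of (estimate - true value); the sign indicates the
    direction of the estimator's bias."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(estimates)

forecasts = [21.0, 19.5, 23.0, 22.5]  # hypothetical predicted values
observed  = [20.0, 19.0, 22.0, 21.0]  # values eventually observed
print(mean_signed_difference(forecasts, observed))  # 1.0: forecasts run high
```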