Variance is a statistical measure that describes the spread of data points in a dataset around the mean. It quantifies how much the values in the dataset differ from the average (mean) value. The variance is particularly useful because it gives a more complete picture of data variability compared to simpler measures like the range.
Formula for Variance
The variance of a dataset is calculated differently for a population and a sample:
Variance for a Population (( \sigma^2 )):
[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 ]
Where:
- ( \sigma^2 ) is the population variance.
- ( N ) is the total number of data points in the population.
- ( x_i ) is each individual data point.
- ( \mu ) is the population mean.
- ( (x_i - \mu)^2 ) represents the squared difference between each data point and the mean.
Variance for a Sample (( s^2 )):
[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 ]
Where:
- ( s^2 ) is the sample variance.
- ( n ) is the number of data points in the sample.
- ( x_i ) is each individual data point.
- ( \bar{x} ) is the sample mean.
- ( (x_i - \bar{x})^2 ) represents the squared difference between each data point and the mean.
- ( n - 1 ) replaces ( n ) in the denominator because deviations measured from the sample mean ( \bar{x} ) are, on average, smaller than deviations from the true population mean ( \mu ), so dividing by ( n ) would underestimate the population variance. Dividing by ( n - 1 ) gives an unbiased estimate (this is known as Bessel's correction).
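To make the difference between the two denominators concrete, here is a minimal sketch using Python's standard `statistics` module, where `pvariance` divides by the number of points and `variance` divides by one less. The dataset is the list of five ages from the worked example in this article; the variable names are illustrative only.

```python
import math
import statistics

ages = [22, 29, 31, 34, 38]  # the five ages from the worked example
n = len(ages)

population_var = statistics.pvariance(ages)  # divides the sum of squares by n
sample_var = statistics.variance(ages)       # divides the sum of squares by n - 1

# Bessel's correction is equivalent to rescaling the divide-by-n result
# by the factor n / (n - 1), so the sample variance is always the larger one.
assert math.isclose(sample_var, population_var * n / (n - 1))

print(f"population variance: {population_var:.2f}")
print(f"sample variance:     {sample_var:.2f}")
```

The gap between the two results shrinks as ( n ) grows, since ( n / (n - 1) ) approaches 1.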
Steps to Calculate Variance
- Find the Mean: Calculate the mean (average) of the data points.
- Calculate the Differences from the Mean: Subtract the mean from each data point.
- Square the Differences: Square each difference so that positive and negative deviations do not cancel each other out.
- Find the Average of the Squared Differences: Sum the squared differences, then divide by ( N ) for a population or by ( n - 1 ) for a sample.
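The steps listed above can be sketched as a small function. This is only an illustration: the name `variance_by_steps` and its `sample` flag are hypothetical helpers introduced here, not part of any library.

```python
def variance_by_steps(data, sample=True):
    """Compute variance by following the steps one at a time.

    Hypothetical helper for illustration. With sample=True the sum of
    squares is divided by n - 1; with sample=False it is divided by n.
    """
    n = len(data)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points to compute the variance")

    mean = sum(data) / n                         # find the mean
    differences = [x - mean for x in data]       # subtract the mean from each point
    squared = [d ** 2 for d in differences]      # square each difference
    divisor = n - 1 if sample else n             # choose the denominator
    return sum(squared) / divisor                # average the squared differences


# Usage: sample variance (default) and population variance of the same data
values = [22, 29, 31, 34, 38]
print(variance_by_steps(values))
print(variance_by_steps(values, sample=False))
```

Keeping each step as its own intermediate list is less efficient than a single pass, but it mirrors the hand calculation and makes each stage easy to inspect.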
Example (Sample Variance Calculation)
Consider the following dataset representing the ages of 5 individuals:
[ 22, 29, 31, 34, 38 ]
- Step 1: Calculate the mean:
[ \bar{x} = \frac{22 + 29 + 31 + 34 + 38}{5} = \frac{154}{5} = 30.8 ]
- Step 2: Subtract the mean from each data point:
[ 22 - 30.8 = -8.8, \quad 29 - 30.8 = -1.8, \quad 31 - 30.8 = 0.2, \quad 34 - 30.8 = 3.2, \quad 38 - 30.8 = 7.2 ]
- Step 3: Square each difference:
[ (-8.8)^2 = 77.44, \quad (-1.8)^2 = 3.24, \quad (0.2)^2 = 0.04, \quad (3.2)^2 = 10.24, \quad (7.2)^2 = 51.84 ]
- Step 4: Sum the squared differences:
[ 77.44 + 3.24 + 0.04 + 10.24 + 51.84 = 142.8 ]
- Step 5: Divide by ( n - 1 = 5 - 1 = 4 ) to get the sample variance:
[ s^2 = \frac{142.8}{4} = 35.7 ]
So, the sample variance is 35.7.
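One way to double-check the arithmetic in this example is to redo it with exact fractions, which avoids any floating-point rounding. The sketch below uses Python's standard `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

ages = [22, 29, 31, 34, 38]
n = len(ages)

mean = Fraction(sum(ages), n)                     # 154/5, i.e. 30.8
squared_diffs = [(x - mean) ** 2 for x in ages]   # exact squared differences
total = sum(squared_diffs)                        # 714/5, i.e. 142.8
sample_var = total / (n - 1)                      # 357/10, i.e. 35.7

print(float(mean), float(total), float(sample_var))  # prints: 30.8 142.8 35.7
```

Because every intermediate value is an exact rational number, the result confirms that the sample variance is exactly 35.7 and not a rounded approximation.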
Advantages of Variance
- Comprehensive Measure of Variability: Unlike the range, which only considers the extremes, variance takes all data points into account, giving a more detailed measure of spread.
- Foundation for Standard Deviation: The variance leads directly to the calculation of the standard deviation, which is another important measure of dispersion.
Disadvantages of Variance
- Squared Units: Since variance squares the differences, its unit is the square of the original unit of measurement (e.g., if the data is in meters, the variance will be in square meters), making it harder to interpret in real-world terms.
- Sensitive to Outliers: Variance gives more weight to outliers because it squares the differences from the mean, which can distort the measure of variability in datasets with extreme values.
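The sensitivity to outliers can be illustrated with a small experiment. In the sketch below, the second dataset is hypothetical: it is the list of ages from the worked example with the largest value, 38, replaced by 95. The calculation uses the standard `statistics` module.

```python
import statistics

ages = [22, 29, 31, 34, 38]               # dataset from the worked example
ages_with_outlier = [22, 29, 31, 34, 95]  # hypothetical: 38 replaced by 95

baseline = statistics.variance(ages)
inflated = statistics.variance(ages_with_outlier)

# Share of the total squared deviation contributed by the single outlier
mean = statistics.mean(ages_with_outlier)
squared = [(x - mean) ** 2 for x in ages_with_outlier]
outlier_share = squared[-1] / sum(squared)

print(f"variance without outlier: {baseline:.1f}")
print(f"variance with outlier:    {inflated:.1f}")
print(f"outlier's share of squared deviations: {outlier_share:.0%}")
```

In this sketch, changing one value out of five raises the sample variance by more than an order of magnitude, and that single point accounts for most of the total squared deviation.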
Variance vs. Standard Deviation
- The standard deviation is the square root of the variance and is often preferred because it is in the same units as the original data, making it easier to interpret.
- Variance remains a fundamental concept in statistical theory, especially in probability distributions and inferential statistics. For example, the variances of independent random variables add, whereas their standard deviations do not.
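The relationship between the two measures is easy to check numerically. The following sketch uses the standard `statistics` module and assumes the data are the ages (in years) from the worked example, so the variance is in years squared and the standard deviation is in years.

```python
import math
import statistics

ages = [22, 29, 31, 34, 38]  # ages in years, from the worked example

sample_var = statistics.variance(ages)  # units: years squared
sample_sd = statistics.stdev(ages)      # units: years

# The standard deviation is the square root of the variance
assert math.isclose(sample_sd, math.sqrt(sample_var))

print(f"variance:           {sample_var:.2f} years^2")
print(f"standard deviation: {sample_sd:.2f} years")
```

A standard deviation of roughly 6 years can be compared directly with the ages themselves, which is not possible with a variance expressed in years squared.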
Summary
- Variance measures how spread out the data points are from the mean by averaging the squared differences.
- It is an essential measure of variability and helps in understanding the overall distribution of the data.
- While it provides a comprehensive look at dispersion, its squared units make it harder to interpret; the standard deviation addresses this by returning to the original units. Both measures remain sensitive to outliers.