Lab 2-3: More Hypothesis Testing#


# import libraries we'll need
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

T-Test for small sample sizes (n<30)#

We have instantaneous monthly observations of dissolved organic carbon (DOC) in two streams over the course of one water year (October-September). Use a two-sample, two-sided, t-test to determine:

  1. Using data for all 12 months, with what confidence can we say that the annual mean DOC concentrations are different between the two streams?

  2. Compare the two streams again, but this time perform two tests, one for the first 6 months of the water year (October-March), and a second test for the last 6 months (April-September).

  3. Can we say that the DOC concentrations between the two streams are different in the first half and/or second half of the water year? With what level of confidence could we say that they are different?

wy_month_labels = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
wy_month_numbers = np.arange(12)+1
# DOC for the first stream, mg/L
doc_1 = [65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9]
# DOC for the second stream, mg/L
doc_2 = [62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5]

Chi-Squared Test for a Change in the Standard Deviation#

Test for statistical significance of a change in the standard deviation. Note that the standard deviation does not benefit from the Central Limit Theorem. Even though it is not strictly true, assume for the moment that the sample data are derived from a normally distributed population. Use a single sample test (with rejection region based on the Chi Squared distribution). Assume that the sample standard deviation from the 1929-1974 data is close to the true population standard deviation of the earlier data set. Test that the more recent sample is different from this.

Use \({t} = \frac{(n-1)s^2}{\sigma^2}\) with n-1 degrees of freedom.

# Read the excel file
skykomish_data_file = '../data/Skykomish_peak_flow_12134500_skykomish_river_near_gold_bar.xlsx'
skykomish_data = pd.read_excel(skykomish_data_file)
# Preview our data
skykomish_data.head(3)
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py:329: UserWarning: Unknown extension is not supported and will be removed
  warn(msg)
date of peak water year peak value (cfs) gage_ht (feet)
0 1928-10-09 1929 18800 10.55
1 1930-02-05 1930 15800 10.44
2 1931-01-28 1931 35100 14.08
# Divide the data into the early period (before 1975) and late period (after and including 1975).
skykomish_before = skykomish_data[ skykomish_data['water year'] < 1975 ] 
skykomish_after = skykomish_data[ skykomish_data['water year'] >= 1975 ] 
# first calculate the test statistic
sd1 = skykomish_before['peak value (cfs)'].std() #we pretend this is the "true population standard deviation)
sd2 = skykomish_after['peak value (cfs)'].std()
m = len(skykomish_after['peak value (cfs)'])
t = (m-1)*sd2**2/sd1**2
print(t)
65.86640081438826

Now, we know from the lecture notes that this test statistic is a chi-squared distributed with n-1 degrees of freedom. Let’s choose that we want 95% confidence that there is a change, and therefore alpha = 0.05. In this example we are just going to test for an increase in the standard deviation (we are doing a one-sided test). We can look up our critical value in a chi-squared distribution table using our degrees of freedom and chosen alpha.

How can we look this up in python?

stats.chi2.ppf?
alpha = 0.05
vals = stats.chi2.ppf(alpha, m-1)
print(vals)
29.787477080861958

Our t statistic is larger than the cut-off value from the chi-squared distribution, so we determine that yes, with 95% confidence, a change has occurred.