
H2Uh-Oh, a Black Hawk County and Iowa Water Quality Analysis: Exploring time-series using one-way analysis of variance and Mann-Kendall/Sen-Theil Regression

Salil Kalghatgi

12/31/13

Intro

Global warming and pollution risk a disturbing shift away from a Holocene Earth. Scientists struggle to quantify the major explanatory variables because the task is dauntingly large and vividly complex (Rockström et al., 2009). Interwoven fabrics of nature connect Iowa fertilizer runoff with increasing global temperatures, so postulating global risk calls for local analysis (Groffman et al., 2006). Furthermore, pollution degrades local health for both humans and other creatures; humans must treat the Earth better, and monitoring ecology is vital to properly allocated policy. Water's centrality to life makes water quality an important variable of interest. To contribute and learn, I statistically analyze Black Hawk County and Iowa water quality, gaining preliminary insight into trends between wells (interwell) and within wells (intrawell) through the STORET/WQX Water Quality Database (Iowa DNR). My limited familiarity with chemical interactions directed my focus towards the Iowa Water Quality Index (IWQI). Other indexes exist, but geographical variations make the IWQI much more relevant. Seventy currently monitored well testing locations across Iowa make up the IWQI, which rates water quality on a scale of 0 to 100 (best). No data were collected for consecutive months in 2008, presumably due to floods. Special thanks go to Richard Langel, a geologist at the Iowa Department of Natural Resources, who provided database queries. Fortunately, several statistical resources exist for amateur statisticians, including the EPA's Unified Guidance (2009), which helps determine acceptable statistical methods, and Pazdernik's analysis in Murky Waters (2012), which analyzes Iowa's water quality. I believe it is within my scope to present a preliminary analysis that first validates Pazdernik's state-wide conclusions and then uses these models to analyze Black Hawk County waters.
Our state-wide sample sites consist of:

Site Name                                 STORET ID
Volga River near Elkport                  10220002
Soldier River near Pisgah                 10430002
Cedar River Downstream of Cedar Rapids    10570001
South Skunk River near Oskaloosa          10620001
East Nodaway near Clarinda                10730002
North River near Norwalk                  10910002

The Black Hawk sample sites are:

Site Name                 STORET ID
Beaver Creek              10070001
Wolf Creek                10070002
West Fork Cedar River     10070003
Black Hawk Creek          10070004
Cedar River Upstream      10070005
Cedar River Downstream    10070006

Iowa's Water Quality Index is calculated using nine common water parameters: dissolved oxygen, E. coli bacteria, 5-day BOD, total phosphorus, nitrate + nitrite as N, total detected pesticides, pH, total dissolved solids, and total suspended solids. The state-wide sample populations were chosen from the 70 high quality Iowa well sites using a random number generator. The data structure makes multiple contrasts an attractive goal for judging interwell aspects, where analysis may either determine the magnitude of difference between water quality across Iowa or identify the most desperate situations for allocating policy resources (or, optimistically, good water quality models). Water's role in life is so vital that it is clearly a long-term resource (Spiceland, 2010), subject to, and providing for, essentially all observable life. To ensure the long-term health of this resource, continued monitoring is essential, which raises a set of questions regarding water quality trends. Admittedly, I did not originally intend to perform time series analysis, but the data's nature necessitates temporal exploration.

One major missing component of water quality in this report is 'water flow'. Water flow is very important in determining water quality, and further analysis should better incorporate it. Along with water flow, I have omitted chemical composition analysis and the creation of water quality subindexes in intrawell tests (Pazdernik, 2012), due to time limitations. Other equally important variables excluded from this analysis include precipitation, temperature, and cultural practices (exogenous variables; Helsel & Hirsch, 2011). Additionally, this research does not incorporate well-specific background data to identify naturally occurring groundwater constituents, which would help assess human and natural forces. Background data is noted as being very important to water quality analysis (EPA, 2009).

Matters of space and time are a common theme in ecological analysis, where different dimensions of life dictate data patterns. Environmental statistical analysis compensates for these data distortions by reshaping data through transformations, or by performing robust tests. Non-parametric tests are used when data do not follow a normal curve and, because of concepts such as seasonality, are used often in water analysis.
Each well location has been tested monthly for the past thirteen years, and I am comfortable with the sample size and quality of the data. However, due to the variations involved in water quality, we cannot assume strict independence, which is certainly one of the largest reasons for decreases in statistical power. For example, in Black Hawk Creek at Waterloo, an index rating of 54 in August 2013 is tied to the scores of the previous and following months, 28 and 85 respectively (autocorrelation), and is therefore not the result of an independent test. Similarly, from a spatial perspective, some well locations are closer to others, and water quality data is non-stationary because its mean and variance change with time and space. Because these independence assumptions can be difficult to swallow (although transformations certainly help), our multiple comparison tests (ANOVA and Kruskal-Wallis) primarily identify spatial and temporal variations, while our trend analysis (Mann-Kendall and Seasonal Mann-Kendall) is specific to each well population (Harrison, 2013; EPA, 2009). Some arrangements of our data yield normality or equality of variances, but overall the tests conducted suggest non-normal, heteroskedastic data. It is important to note that while our tests may identify either spatial or temporal variations more concretely, it is difficult to completely separate out their interactions.

Monthly Site Populations

Certain well locations seemingly share similar distributions, and distribution similarities (and differences) testify in some part to space, time, randomness, or other variables. After spatial-temporal variations are removed, distributions may explain water quality on a statewide basis, in a manner that makes trend detection easier. Each well location has at least 151 observations, so we are comfortable with the accuracy of the Shapiro-Francia test for normality, and comfortable rejecting the null hypothesis of normality within wells.
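The independence concern raised above (each monthly score being tied to its neighbours) can be checked numerically with a lag-1 autocorrelation. The following is a minimal sketch, not the method used in this report: the function name `lag_autocorrelation` and the index values are hypothetical and were not taken from STORET/WQX.

```python
from statistics import mean

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a time-ordered series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    centre = mean(series)
    denom = sum((x - centre) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    numer = sum(
        (series[t] - centre) * (series[t + lag] - centre)
        for t in range(n - lag)
    )
    return numer / denom

# Hypothetical monthly index values for one site (invented for illustration).
iwqi = [30, 35, 45, 60, 70, 65, 50, 40, 32, 36, 48, 62]
r1 = lag_autocorrelation(iwqi, lag=1)
print(f"lag-1 autocorrelation: {r1:.3f}")
```

A value well away from zero suggests that consecutive months are not independent draws, which is the situation the rank-based trend tests later in the report are meant to tolerate better than ANOVA.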
While we can certainly continue our exploration into equality of variances, I believe it is more beneficial to first introduce concepts of seasonality. From the boxplot above, we see different distribution patterns, outliers, large ranges, and mostly skewed data. The combination of these characteristics is not wholly inviting, but it is a fairly common reality in data and environmental analysis. Specifically, because many questions involve the 'end-result' and prediction, time series analyses are useful but add complexity. Whether one is trying to judge movements of a NYSE stock price - the combination of psychological attitudes derived from a multitude of variables - or the equally difficult task of analyzing water quality, we must at least attempt to uncover significance in patterns.

Annual Site Populations

Seasonality is perhaps best communicated through the following decomposition graph of the Volga River (note: due to missing data during 2008, I have enabled an approximation function, visualized by the straight line in the 'data' section of the graph). Our seasonal graph indicates a consistently recurring annual pattern with a frequency of 12 months. This seasonality is caused mainly by environmental and human patterns, including non-point pollution sources such as fertilizer run-off. Seasonality complicates trend analysis by obscuring long-term trends behind patterns that exist at smaller time frames (e.g. quarterly, monthly, daily). Our goal is to use relevant data collected during the past 13 years to best characterize long-term trends. To do so, we isolate only the Aprils (randomly chosen) from each of the six well locations in our preliminary tests. By isolating one month across the 13 years (reducing our observation points per well from 151 to 13), and under the assumption of seasonality, we reason that our newer data set better showcases long-term patterns with less short-term temporal variation; we are essentially blocking on a single month. Further research should analyze seasonality for each of the twelve months when determining trend existence.

Annual data (our new data set)

                 Well
IWQI APRIL_07    y
IWQI APRIL_08    y

Monthly data (our original data set)

                 Well
IWQI APRIL_07    y
IWQI MAY_07      y

Note, the labeling of 'annual' and 'monthly' may seem counterintuitive, but this nomenclature strives to explain the purpose of choosing data from one month. The boxplot below describes increased standardization, and Shapiro-Wilk normality tests indicate increased, but still relatively weak, normality. Due to smaller sample sizes we will instead test residuals for normality using the Shapiro-Francia test, which narrowly fails to reject normality at the 0.05 level (p=0.05743). Using the ladder of power transformations (Helsel & Hirsch, 2011), transforming the data to the -6th power (achieved through trial and error, and coincidentally the same transformation used in the annual analysis) maximizes equality of variance (Brown-Forsythe-Levene's p-value). For both annual and monthly data, we achieve equal variances but not normality; rejecting the Kruskal-Wallis null hypothesis subsequently leads to interwell multiple comparison tests using paired tests under a Bonferroni alpha adjustment (future non-parametric post-hoc tests also incorporate Bonferroni alpha adjustments).

IOWA ANALYSIS

                            Monthly^-6    Annual^-6
Equality of Variance        P=0.1723      P=0.3617
Normality                   P<2.2e-16     P=4.339e-12
Detection of differences    P<2.2e-16     P=0.0409

Multiple comparisons
Monthly^-6: Volga & Nodaway are similar to each other, and different from other wells (which are similar)
Annual^-6:  Some signs of significant difference between Volga vs. Cedar and Skunk (which are similar)

After isolating some time variation, our annual analysis is apparently much more stringent in declaring differences, possibly suggesting greater similarities among Iowa water quality.
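To make the Kruskal-Wallis step and the Bonferroni alpha adjustment concrete, here is a minimal sketch in pure Python. It computes the H statistic without a tie correction and the per-comparison alpha for all pairwise contrasts; the function names and the sample values are hypothetical, and a real analysis would compare H against a chi-square distribution with (number of groups - 1) degrees of freedom using a statistics package.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a dict of samples."""
    pooled = [x for sample in groups.values() for x in sample]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for sample in groups.values():
        size = len(sample)
        rank_sum = sum(ranks[start:start + size])
        total += rank_sum ** 2 / size
        start += size
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

def bonferroni_alpha(alpha, n_groups):
    """Per-comparison alpha when every pair of groups is compared."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs

# Hypothetical index values for two sites (invented for illustration).
h_stat = kruskal_wallis_h({"site_a": [1, 2, 3], "site_b": [4, 5, 6]})
alpha_six_wells = bonferroni_alpha(0.05, 6)  # 15 pairwise comparisons
print(round(h_stat, 4), round(alpha_six_wells, 5))
```

With six wells there are 15 pairwise comparisons, so each paired test is judged against 0.05 / 15 rather than 0.05, which is one reason the adjusted comparisons declare fewer differences.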

April Populations

One final data technique I believe helps describe the data is simply transposing the April data:

        IWQI APRIL_07    IWQI APRIL_08
Well    y                y

Now we can trend across the April years, hopefully better understanding the temporal aspects of state-wide Iowa water quality. Because each sample now has only six observations, we use the residuals to perform normality tests, and find no evidence against a normal distribution (p=0.4351). Because of the extreme sensitivity of Hartley's F-test to departures from normality, we continue using the BFL test, in which we find some equality of variance (p=0.07944). At this point, the choice between ANOVA and the Kruskal-Wallis test rests on one's alpha level (equal variance is not rejected at 0.05 but is rejected at 0.10), and both can certainly be run to test under both equal and non-equal variance conditions. I am inclined to suggest not using an ANOVA, mainly due to questions of independence; however, an experienced hydrologist may feel the assumptions are correctly met. Most of these methods are superseded by prediction limits and control charts, but we will not extend our analysis into these realms. In fact, regulatory restrictions on per-constituent alpha levels using ANOVA make it difficult to adequately control site-wide false positive rates (EPA, 2009). While the variances are similar, the ratio between the largest and smallest year's standard deviation is greater than 3 (approx. 4.3), so the F-test will severely lose power (EPA, 2009). To best compare April populations to our monthly and annual analysis, I performed both an F-test and a Kruskal-Wallis test, leading respectively to TukeyHSD and Wilcoxon rank-sum multiple comparison tests.

IOWA ANALYSIS

                                          April Untransformed    April Transformed (^-3)
Equality of Variance                      P=0.07944              P=0.9195
Normality                                 P=0.4351               P=1.8e-05
Detection of differences in means/medians P=0.000673             P=0.0003512

Multiple comparisons
April Untransformed:     2002, 2005, & 2011 vs 2006
April Transformed (^-3): 2002, 2003, 2005, 2011, 2013 vs 2006, 2007
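The variance checks discussed above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for this report: it computes the ratio of the largest to smallest group standard deviation, the Brown-Forsythe statistic (a one-way ANOVA F statistic on absolute deviations from each group's median), and a simple power transformation. The function names and numbers are hypothetical.

```python
from statistics import mean, median, stdev

def sd_ratio(groups):
    """Largest group standard deviation divided by the smallest."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)

def brown_forsythe_w(groups):
    """ANOVA F statistic on absolute deviations from each group's median."""
    z = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(z)
    n = sum(len(g) for g in z)
    grand = mean([x for g in z for x in g])
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in z) / (k - 1)
    within = sum((x - mean(g)) ** 2 for g in z for x in g) / (n - k)
    # Note: 'within' is zero if every group has identical deviations.
    return between / within

def power_transform(values, power):
    """Raise each value to a power (values must be non-zero if power < 0)."""
    return [v ** power for v in values]

# Hypothetical samples (invented for illustration).
samples = [[1, 2, 3], [2, 4, 6]]
ratio = sd_ratio(samples)
w_stat = brown_forsythe_w(samples)
transformed = power_transform([2.0, 4.0], -3)
print(ratio, round(w_stat, 3), transformed)
```

In practice one would repeat `brown_forsythe_w` over several candidate powers and keep the transformation whose test gives the least evidence against equal variances, which mirrors the trial-and-error search described in the text.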

Iowa Exploration Test Summaries

Transformations equalize variances and let us detect significant differences between higher and lower quality wells, but these differences are less pronounced when using annual data. Significant differences also exist between years, as we see low water quality across the board during 2006 and 2007, due to floods. Some suggest excluding outliers in water quality analysis, but with data only existing for the past thirteen years, I believe excluding the possibility of flood events occurring periodically is premature (Skopec, 2010). Notable differences in centrality when categorizing samples by site versus categorizing samples by year allude to existing spatial and temporal effects: events occurring across time significantly affect water quality throughout the state, and events occurring at different locations create significant differences between well water quality results. Dissecting the data differently, and analyzing a greater number of sample sites, will help us better understand temporal and spatial effects on Iowa water quality. At this stage, we have separated our Iowa data into three distinct forms:

Characteristics/Name    Monthly Site Populations             Annual Site Populations             April Populations
Column Names e.g.       Volga, Soldier                       Volga, Soldier                      April_07, April_08
Row Names e.g.          Apr_07, May_07                       April_07, April_08                  Volga, Soldier
Description             12 observations per year, 13 years   1 observation per year, 13 years    6 observations per year, 13 years
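The three data forms summarized above can be illustrated with a small reshaping sketch. The record layout `(site, year, month, IWQI)`, the function names, and every value below are hypothetical; the actual STORET/WQX export format is not shown in this report.

```python
# Hypothetical records: (site, year, month, IWQI). Values are invented.
records = [
    ("Volga", 2007, 4, 62), ("Volga", 2007, 5, 55), ("Volga", 2008, 4, 48),
    ("Soldier", 2007, 4, 41), ("Soldier", 2007, 5, 37), ("Soldier", 2008, 4, 30),
]

def monthly_site_populations(rows):
    """One sample per site containing every monthly observation."""
    out = {}
    for site, _year, _month, iwqi in rows:
        out.setdefault(site, []).append(iwqi)
    return out

def annual_site_populations(rows, month=4):
    """One sample per site containing only the chosen month of each year."""
    out = {}
    for site, _year, m, iwqi in rows:
        if m == month:
            out.setdefault(site, []).append(iwqi)
    return out

def april_populations(rows, month=4):
    """Transpose of the annual form: one sample per year, one value per site."""
    out = {}
    for _site, year, m, iwqi in rows:
        if m == month:
            out.setdefault(year, []).append(iwqi)
    return out

monthly = monthly_site_populations(records)
annual = annual_site_populations(records)
april = april_populations(records)
print(monthly, annual, april, sep="\n")
```

The first two forms group by site, so tests on them compare wells; the third groups by year, so the same tests compare years instead.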

Because the data exhibit extreme observations, whether viewed through the lens of time or location, we find them generally hard to read using ANOVA or Kruskal-Wallis tests and are somewhat limited in our ability to perform interwell tests. However, using transformations, we achieve equal variance and can perform paired multiple comparison tests identifying differences between well locations and time. Our annual site population tests primarily display different distribution patterns among a spectrum of site water quality. If equal variance is found, a two-way ANOVA may help analysis, but a major cause of the skewed data is the particularly low water quality during the summer months, and some suggest analyzing the summer months separately from the other months (Pazdernik, 2012). Similarly, our April population tests show that while centrality is often higher than 40, four years exhibit severely low water quality ratings across the board, attributable to heavy floods during those years (Skopec, 2010). We may be inclined to remove these flood years, but we must remember that floods seem to be part of a trend and must be involved in water quality discussion. It can be tempting to extrapolate from time-series analysis (do flood data exhibit patterned characteristics?), but as the time frame of a cyclical pattern increases, more data is necessary before existing flood patterns can be properly described.

Mann-Kendall & Theil-Sen tests

Commonly cited in water quality analysis, the Mann-Kendall and Theil-Sen tests are nonparametric, robust against heterogeneity in variance, resistant to outliers, and most importantly can handle paired observations (Kendall's tau-b). Mann-Kendall tests the existence of a trend by analyzing randomness about a constant mean through a comparison between IWQI and time rankings. If no trend exists, then the fluctuations between pairs of observations will not follow any discernible pattern across time. Tau measures the difference between the probability that a pair of observations is concordant (water quality and time ranked in the same order) and the probability that it is discordant (Stevenson, 2012). If the Mann-Kendall test suggests a trend exists, the Theil-Sen trend line is then used to estimate the magnitude of the trend; it uses the median of pairwise slopes, but does not compensate for physical dependence (Butler, 2013). Tests show that across our two data sets, trends are not common. There are signs that the Cedar and Skunk rivers are both trending positively, but the lower bounds of the Theil-Sen confidence intervals (which include zero) show there may not be any magnitude in trend. This analysis mirrors that found in Murky Waters, and highlights the need for improvement, as the state's current water quality is fairly poor. We cannot use these tests with the April samples because those samples treat time as a categorical factor, not as a covariate.

Characteristic/Name         Trend Existence             Trend Magnitude
Monthly Site Populations    Skunk p=0.0076              Skunk trend=0.0323
Annual Site Populations     No significant existence
April Populations           N/A
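The two trend procedures described above can be sketched in a few lines. This is a simplified illustration, not the CRAN routines cited in the bibliography: the Mann-Kendall function below uses the normal approximation with a continuity correction and no tie correction (so the tau it reports is tau-a rather than tau-b), and the Theil-Sen function returns only the median pairwise slope without a confidence interval. The function names and data are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def mann_kendall(values):
    """Return (S, tau, two-sided p) for a time-ordered series.

    Normal approximation, no tie correction; best suited to roughly
    ten or more observations.
    """
    n = len(values)
    s = 0
    for i, j in combinations(range(n), 2):
        diff = values[j] - values[i]
        s += (diff > 0) - (diff < 0)  # +1 concordant, -1 discordant, 0 tied
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return s, tau, p

def theil_sen_slope(times, values):
    """Median of all pairwise slopes between observations."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i, j in combinations(range(len(values)), 2)
        if times[j] != times[i]
    ]
    return median(slopes)

# Hypothetical series (invented for illustration).
s_stat, tau, p_value = mann_kendall([1, 2, 3, 4, 5])
slope = theil_sen_slope([0, 1, 2, 3, 4], [1, 3, 5, 7, 100])
print(s_stat, tau, round(p_value, 4), slope)
```

In the second example the final observation is an extreme value, yet the median slope stays at the slope of the other points, which illustrates why the estimator is described as resistant to outliers.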

Black Hawk County Analysis

Performing two separate analyses explores spatial variation and its significance for local populations. In accordance with our research concerning time series, we find both sets of data lack normality. Through our analysis of variance we find the Iowa stations do not share equal distributions, whereas Black Hawk County stations do share equal distributions (without transformation), allowing us to perform a Kruskal-Wallis test. The test results are not significant, indicating Black Hawk County wells are similar. April samples only obtain equal variance through a power transformation to the -3rd. It is important that Black Hawk County take further steps to analyze local water as a group, to discern how the chemical and spatial variations indicated by the unequal variances in our state-wide analysis specifically impact local water quality. We mold the data into the same groupings as the Iowa analysis:

Characteristics/Name    Monthly Site Populations^-6          Annual Site Populations^-3          April Populations
Column Names e.g.       Beaver, Wolf                         Beaver, Wolf                        April_07, April_08
Row Names e.g.          Apr_07, May_07                       April_07, April_08                  Beaver, Wolf
Description             12 observations per year, 13 years   1 observation per year, 13 years    6 observations per year, 13 years

Monthly Site Populations^-6
  Purpose:               Find differences in wells; analyze long-term trends
  Normality:             P<2.2e-16
  Equality of Variance:  P=0.8694
  Kruskal-Wallis:        P=8.375e-12
  Multiple Comparisons:  Beaver vs DownCedar; Wolf vs West, UpCedar; West vs BH, DownCedar; BH vs UpCedar; UpCedar vs DownCedar
  Trend Existence:       Beaver p=0.044142; Black Hawk p=0.093634
  Trend Magnitude:       Beaver trend=0.061; Black Hawk trend=0.041

Annual Site Populations^-3
  Purpose:               Compensate seasonality; find differences in wells
  Normality:             P=9.265e-06
  Equality of Variance:  P=0.9919
  Kruskal-Wallis:        P=0.2415
  Trend Existence:       No significant trends

April Populations
  Purpose:               Compensate seasonality; find differences in years
  Normality:             P=5.492e-08
  Equality of Variance:  P=0.1517
  Kruskal-Wallis:        P=0.0002011
  Multiple Comparisons:  2001 vs 2002, 2010; 2006 vs 2002, 2010, 2011; 2008 vs 2002, 2003, 2010, 2011, 2013
  Trend Existence:       N/A

These tests highlight the benefit of approaching water quality data from different angles, as we see no significant difference in centrality when comparing annual data, but very significant differences in centrality for monthly data. Without analyzing other annual data, it is hard to identify how time affects Black Hawk County wells; future research may hypothesize how different wells are treated over the course of the entire year, or how different wells react to events correlated with time. Dendrograms may help distinguish some of the similarities, using different months and locations as different variables and using median measures, since water quality data exhibits large variability:

Similar to the Iowa analysis, we see positive trends in two wells, but this is not especially comforting considering that all trend lower bounds include zero or negative numbers.

Conclusion

The validity and soundness of our assumptions (lack of stationarity, autocorrelation), and therefore the power of our tests, raise issues for our results. Nevertheless, the analysis sheds light on water quality analysis in general, and on locally actionable progress. Identifying characteristics of water quality at a regional and local level reconfirms the importance of spatial variations. For instance, the time-series decomposition combined with the April analysis paints a picture of seasonality and temporal variations such as floods. Separating the data along all twelve months (two-way ANOVA) is a simple procedure, and may be the next logical step for this research. Our goals for analysis are important to remember: as Hirsch believes, we need to move away from strict hypothesis testing and instead identify the nature and magnitude of change, and newer models are being developed (e.g. Weighted Regressions on Time, Discharge, and Season, WRTDS). Future research should also explore chemical compositions, as they provide additional avenues of insight. We also did not utilize control charts or prediction limits, which the EPA strongly suggests (2009); these techniques should certainly be considered in future analysis. Ultimately, with the loss of statistical power and pollution heavily impacting water quality, I am inclined to believe a zero trend with a slight negative or positive relation is very concerning, with water inequality a potentially frightful situation.

Bibliography

Environmental

Prior, J. C. (2003). Iowa's groundwater basics (1st ed.). Iowa City, Iowa: Iowa Dept. of Natural Resources.
Rockström, J., Steffen, W., Noone, K., Persson, Å., Chapin, F. S., Lambin, E. F., Schellnhuber, H. J. (2009). A safe operating space for humanity. Nature, 461(7263), 472–475.
Skopec, M. (2010). Iowa floods: the new normal. Iowa Natural Heritage. Retrieved from http://www.inhf.org/pdfs/protect/pages8-10_inhf_fall_mag_pp1-16_final.pdf
Spiceland, J. D. (2011). Intermediate accounting (6th ed., combined ed.). New York: McGraw-Hill Irwin.
Water supply. (2008). New York: H.W. Wilson Co.
Water supply and pollution control. (2009) (8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Statistical

Bronaugh, D., & Werner, A. (2013, September 19). Zhang + Yue-Pilon trends package. CRAN Repository.
Butler, K. (2013, February 26). Assignment 10. Statistics for the Life and Social Sciences. Course. Retrieved December 26, 2013, from http://www.utsc.utoronto.ca/~butler/d29/a10.html
Cox, C., Hug, A., & Pazdernik, K. (2012). Murky Waters: Farm Pollution Stalls Cleanup of Iowa Streams. Environmental Working Group. Retrieved from static.ewg.org/reports/2012/murky_waters/Murky_Waters.pdf
Crichton, N. (2001). Kendall's Tau. Journal of Clinical Nursing, (10). Retrieved from http://arizona.openrepository.com/arizona/handle/10150/194407
Environmental Protection Agency. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities - Unified Guidance (No. EPA 530/R-09-007).
Gross, J., & Ligges, M. U. (2012, 29). Nortest Package. CRAN Repository. Retrieved from http://cran.uvigo.es/web/packages/nortest/nortest.pdf
Harrison, J. (2013, May 27). The heat is on, or is it? Trend Analysis of Toronto Climate Data. R-bloggers. Retrieved December 14, 2013, from http://www.r-bloggers.com/the-heat-is-on-or-is-it-trend-analysis-of-toronto-climate-data/
Helsel, D. R., & Hirsch, R. M. (2011). Statistical methods in water resources (Vol. 49). Elsevier.
Iowa Department of Natural Resources. (2013). STORET/WQX Iowa Water Quality Index. Retrieved from https://programs.iowadnr.gov/iastoret/
McLeod, A. I. (2011, 16). Kendall rank correlation and Mann-Kendall trend test. CRAN Repository. Retrieved from http://btr0x2.rz.uni-bayreuth.de/math/statlib/R/CRAN/doc/packages/Kendall.pdf
Mozejko, J. (2012). Detecting and Estimating Trends of Water Quality Parameters. InTech. Retrieved from http://cdn.intechopen.com/pdfs/35048/InTech-Detecting_and_estimating_trends_of_water_quality_parameters.pdf
State of Oregon Department of Environmental Quality. (n.d.). Trend Analysis and Presentation. Retrieved from www.deq.state.or.us/lab/wqm/docs/TrendAnalysisCD.pdf
Statistical Analysis for Monotonic Trends. (2011). National Nonpoint Source Monitoring Program, TechNotes(6). Retrieved from http://www.bae.ncsu.edu/bae/programs/extension/wqg/issues/notes135_monotonic_trends.pdf
Stevenson, W. (2012, September 5). Kendall-tau. Statistical-Research.com. Retrieved from http://statistical-research.com/wp-content/uploads/2012/09/kendall-tau1.pdf
Tian, J., & Fernandez, G. (2000a). Seasonal trend analysis of monthly water quality data. University of Nevada, Reno. Retrieved from http://www.ag.unr.edu/gf/pdf/joyce.pdf
Tian, J., & Fernandez, G. (2000b). Seasonal trend analysis of monthly water quality data. University of Nevada, Reno. Retrieved from http://www.ag.unr.edu/gf/pdf/joyce.pdf
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems: Data exploration. Methods in Ecology and Evolution, 1(1), 3–14. doi:10.1111/j.2041-210X.2009.00001.x
