-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_statistics.qmd
402 lines (281 loc) · 11.3 KB
/
02_statistics.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
# Statistical processing of hydrological dataset
A dataset is a collection of measurements which is used to create or evaluate assumptions about a population. Within this session, only **univariate** data are studied.
#### At the end of the lecture you should be familiar with following concepts: {.unnumbered}
- hydrological year
- data screening and cleaning
- basic data summation and aggregation
- descriptive statistics
- visualization of statistical data
## Exploratory Data Analysis (EDA)
EDA is not a strictly formalized set of procedures, rather it is an general approach on how to study, describe, extract, evaluate and report about the data. Hence the workflow is specific to data in process.\
When studying data, we will look at certain statistical features, which represent esteemed characteristcs for the whole data set.
### Measures of location
The farthest one can reduce data, while still retaining any information at all is by a single value. They describe a central tendency of the data.
#### Mean
The arithmetic mean (or average) is simply the sum of observations devided by the number of observations.
$$
\text{mean} = \dfrac{\text{sum of data}}{\text{number of data}}, \qquad \bar{x} = \dfrac{1}{n}\sum\limits_{i=1}^{n} x_i
$$
```{r, collapse=TRUE}
set.seed(123) # using the set seed will ensure that we will get the samme generated numbers every time
x <- rnorm(n = 30, mean = 50, sd = 10) # some random-generated numbers from the normal distribution
x
```
#### Median
The **median** is another central tendency based on the *sorted* data set. It is the $50$th quantile in the data. A half of the values is smaller than median, and a half is larger.
$$
\tilde{x} =
$$
```{r, collapse=TRUE}
median(x)
(sort(x)[length(x)/2] + sort(x)[length(x)/2 + 1])/2
```
#### Mode
For continuous variables in real numbers, computation of **modus** does make sense when applied on bins of values.
```{r}
x
cut(x, breaks = 10)
y <- cut(x, breaks = seq(from = -10, to = 100, by = 10))
table(y)
barplot(height = table(y))
```
#### $n-$th quantile
```{r}
quantile(x, probs = c(0.1, 0.9))
```
### Measures of spread (variability)
With **measures of variability** we describe the spread of the data around its central tendency. Some degree of variability is a natural occurring phenomenon.
#### Variation range
The simples measure of spread is the **variation range**. It is computed as the difference of both end extremes of the data.
$$
R = \max{x} - \min{x}
$$
```{r}
max(x) - min(x)
```
#### Variance
We calculate the **variance** as the average of the squares of the deviations of individual character values from their arithmetic mean.
$$
\sigma^2 = \frac{1}{n}(x_i - \bar{x})^2
$$
#### Standard deviation
The **standard deviation** is defined as sqruare root of variance.
```{r, collapse=TRUE}
sd(x)
var(x)^0.5
```
#### Coefficient of variation
The **coefficient of variation** is given by the ratio of the standard deviation and the arithmetic mean, and it follows from the definition of this ratio that it is a dimensionless indicator.
$$
\nu = \frac{s}{\bar{x}}
$$
#### Interquartile range
The $\text{IQR}$ is the **upper quartile** ($75$th percentile) minus the **lower quartile** ($25$th percentile)
$$
\text{IQR} = Q_{\text{III}} - Q_{\text{I}}
$$
```{r}
IQR(x)
```
## Hydrological data
Data sets containing hydrological data are most commonly, although not exclusively, in tabular (rectangular) shape.
Let's take a look at sample data from **MOOPEX**. It is a simple curated large sample data set.
This dataset covers the same 438 catchments in daily step with measured discharge. The dataset is publicly available at <https://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily/>
```{r, collapse=TRUE}
url <- "./data/01138000.dly"
mpx <- read.fwf(file = url,
widths = c(8, rep(10, times = 5)),
header = FALSE)
names(mpx) <- c("date", "prec", "pet", "r", "tmax", "tmin")
mpx$date <- gsub(x = mpx$date, pattern = " ", replacement = "0")
mpx$date <- as.Date(mpx$date, format = "%Y%m%d")
mpx[which(mpx$r < 0), "r"] <- NA
head(mpx)
```
Let's add some features to the data. For the purpose of the analysis it would be beneficial to add `year`, `month`, `quarters`, `decade` and `hyear` as **hydrological year**.
## Aggregation and summation
```{r}
mpx$year <- as.integer(format(mpx$date, "%Y"))
mpx$month <- as.integer(format(mpx$date, "%m"))
```
Try to add `quart` variable using `quarters()` function.
```{r}
mpx$hyear <- ifelse(mpx$month > 10, as.integer(mpx$year + 1), as.integer(mpx$year))
```
Now exercise some basic calculations
- Fraction of days with no precipitation
- How many freezing days occur?
- How many freezing days in late spring (May) and summer?
- 10 highest and 10 lowest temperatures
These functions are based on **grouping**. In hydrology, the natural groups involve - *stations/basins*, *decades/years/months*, *groundwater regions*, etc.
### Box-plot
Come handy, when we want to visualize the important quantiles related to any categorical variable.
```{r}
station <- read.delim("./data/6242910_Q_Day.Cmd.txt", skip = 36, header = TRUE, sep = ";", col.names = c("date", "time", "value"))
station$date <- as.Date(station$date, format = "%Y-%m-%d")
station_agg <- aggregate(station$value ~ as.factor(data.table::month(station$date)),
FUN = "quantile",
probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
names(station_agg) <- c("Month", "Discharge")
par(font.main = 1,
adj = 0.1,
xaxs = "i",
yaxs = "i")
boxplot(data = station_agg,
Discharge ~ Month,
main = "Distribution of discharge in months",
frame = FALSE,
ylim = c(0,20),
xlab = "",
ylab = bquote(Discharge), font.main = 1)
```
### Q-Q plot
The so called **rankit graph** produces a quantile-quantile graph of the values
from selected data. This one can compare if the distribution of the data fit
with the assumed distribution, e.q. Normal. `qqline` then adds the theoretical line, which outputs
```{r}
par(font.main = 1,
adj = 0.1,
xaxs = "i",
yaxs = "i")
qqnorm(mpx$tmax,
pch = 21,
col = "black",
bg = "lightgray", cex = 0.5,
main = "")
qqline(mpx$tmax)
```
<!-- ```{r} -->
<!-- plot(quantile(rnorm(1000), probs = 1:100/100), -->
<!-- quantile(rnorm(1000), probs = 1:100/100), frame = FALSE, pch = 20) -->
<!-- abline(0, 1, col = "red") -->
<!-- ``` -->
<!-- ### P-P plot -->
<!-- ```{r} -->
<!-- n <- 100 -->
<!-- data <- rweibull(n, shape = 2, scale = 1) -->
<!-- # Estimated parameters for Pareto -->
<!-- shape_est <- 2 -->
<!-- scale_est <- 1 -->
<!-- # Calculate the theoretical quantiles for the Pareto distribution -->
<!-- theoretical_quantiles <- qweibull(p = seq(0, 1, 0.01), shape = shape_est, scale = scale_est) -->
<!-- # Create the Q-Q plot -->
<!-- qqplot(theoretical_quantiles, data, -->
<!-- xlab = "Theoretical Quantiles", -->
<!-- ylab = "Sample Quantiles", ylim = c(0,2.5), xlim = c(0, 2.5)) -->
<!-- abline(0, 1, col = "red") # Add a 45-degree reference line -->
<!-- ``` -->
### Empirical distribution function
Large portion of the data which is processed in Hydrology originates from time series
of streamflow or runoff. This enables us to construct empirical probability of exceeding
certain value in the data $P(X\geq x_k)$, simply using the well know definition
$$
P = \dfrac{m}{n}
$$
where $m$ is the number of reaching or exceeding of value $x_k$ and $n$
is the length of the series. This equation is valid strictly for $n \rightarrow \infty$.\
There are several empirical formulas in use to calculate the empirical exceedance
probability like the one from Čegodajev
$$
P = \dfrac{m - 0.3}{n + 0.4}
$$
In R we can utilize a highe order function called `ecdf()`
```{r, include=FALSE}
plot(cumsum(1/rank(mpx$r)/length(mpx$r)))
```
```{r}
ecdf_data <- ecdf(na.omit(mpx$r)) # <1>
max(mpx$r, na.rm = TRUE)
ecdf_data(35) # percent of data lower than 35
```
1. # Create the empirical cumulative distribution function (ECDF) for the data
<!-- ```{r, eval=FALSE} -->
<!-- library(tidyverse) -->
<!-- # ddd <- read_fwf("data/6242910_Q_Day.Cmd.txt", -->
<!-- # skip = 36, -->
<!-- # col_positions = fwf_widths(widths = c(8, 4, 4, 8, 2)), col_types = "cccdc") -->
<!-- ddd <- read_csv2("data/6242910_Q_Day.Cmd.txt", -->
<!-- skip = 36, -->
<!-- col_types = "ccc") -->
<!-- ddd |> -->
<!-- set_names(c("Date", "D", "Discharge")) |> -->
<!-- mutate(Date = as.Date(Date, format = "%Y-%m-%d"), -->
<!-- Discharge = as.numeric(Discharge)) |> -->
<!-- select(Date, Discharge) -> ddd -->
<!-- ddd |> -->
<!-- mutate(ri = rank(Discharge), -->
<!-- s = sd(Discharge), -->
<!-- q = pweibull(Discharge, shape = 2, scale = 1)) -->
<!-- pdfunc <- function(x, b = 0.44) { -->
<!-- n <- length(x) -->
<!-- m <- rank(x) -->
<!-- (m - b)/(n + 1 - 2*b) -->
<!-- } -->
<!-- plot(pdfunc(x <- ddd$Discharge), type = "l") -->
<!-- ddd |> -->
<!-- ggplot(aes(Date, Discharge)) + -->
<!-- geom_line() -->
<!-- ddd |> -->
<!-- ggplot(aes(Discharge, ri = rank(Discharge / length(Discharge)))) + -->
<!-- geom_line() -->
<!-- ddd |> -->
<!-- mutate(pi = rank(Discharge - 0.5)/length(Discharge)) -->
<!-- x <- c(1, 6 ,6, 1, 2, 0) -->
<!-- rank(x) -->
<!-- ``` -->
### Exceedance curve
```{r}
r_sorted <- sort(mpx$r, decreasing = TRUE)
# Empirical
n <- length(r_sorted)
exceedance_prob <- 1:n / n
plot(exceedance_prob, r_sorted, type = "l", xlab = "Exceedance Probability", ylab = "Runoff Value", main = "Empirical Exceedance Curve")
```
### Return period
We calculate the return period as inverse to the exceedance probability
```{r}
return_period <- 1/exceedance_prob
plot(return_period, r_sorted, type = "l", xlab = "Return Period", ylab = "Runoff Value", main = "Empirical Return Period")
```
<!-- ### Frequency distribution curve -->
<!-- ## Correlation and autocorrelation -->
<!-- $$ -->
<!-- \rho_{X, Y} = \dfrac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} -->
<!-- $$ $$ -->
<!-- \rho_{X,X} -->
<!-- $$ -->
## Hydrological indices
Originated from the combination of aggregation and summation methods and the measures
of location and dispersion.
### Runoff coefficient
The concept of runoff coefficient comes from the presumption, that over long-time period
$$
\varphi [-] = \dfrac{R}{P}
$$
where $R$ represents runoff and $P$ precipitation in long term, typically 30 year
so called **Normal period**, which is a decadal moving 30 window; the current being
1991--2020.
```{r}
R <- mean(aggregate(r ~ year, FUN = mean, data = mpx)[, "r"])
P <- mean(aggregate(prec ~ year, FUN = mean, data = mpx)[, "prec"])
R/P
```
### Flow duration curve
```{r}
M <- c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)
m_day <- function(streamflow) {
quantile(na.omit(streamflow), 1 - M/365.25)
}
plot(M, m_day(mpx$r),
type = "b",
pch = 21,
bg = "darkgray",
xlab = "",
ylab = "Discharge",
main = "M-day discharge")
```
### Baseflow
The **Total runoff** comprises from **direct runoff**, **hypodermic runoff** and **baseflow**. From the baseflow separation process usually the **baseflow index** is calculated $\text{BFI}=\dfrac{\text{baseflow}}{\text{total runoff}}$
<!-- #### Eckhart-filter -->
<!-- #### Lyne-Hollick -->
<!-- ## Exercise -->