-
Notifications
You must be signed in to change notification settings - Fork 0
/
16-Categorical-variables-association.Rmd
331 lines (276 loc) · 15.8 KB
/
16-Categorical-variables-association.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
# Association between categorical variables
We'll start our little journey on statistical tests in R with the ones used to evaluate the association between categorical variables.
<br>
There are two tests used to test if two categorical variables are independent or not: <strong>Fisher's exact test</strong> and <strong>Chi-Squared test</strong>. They interpretation of the p-value of these tests is the same, as in both cases a p-value < threshold (e.g. 0.05) means that we reject the null hypothesis that the two variables are independent.
<br>
The first one (Fisher's) is used <u>only</u> when we have a small sample size 2x2 contingency matrix (so when the two categorical variables have 2 categories each) and at least one expected value is < 5. In the same scenario, but with <u>all expected values > 5</u>, Chi-Squared test with Yates correction is used, while in all other scenarios it is used Chi-Squared test without correction.
<br>
Here, we will see some examples on when to use these tests. Let's load the required libraries and our data first:
```{r, warning=FALSE}
# 1. Load packages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggmosaic))
suppressPackageStartupMessages(library(vcd))
# 2. Load data
df <- read.csv("data/Stat-test-dataset.csv")
# 3. Change come column types
df <- df %>%
mutate("sex" = factor(sex),
"treatment" = factor(treatment, levels = c("T1", "Untreated")),
"Task1" = factor(Task1, levels = c(1, 0)),
"Task2" = factor(Task2, levels = c(1, 0)),
)
str(df)
```
This time, I also loaded the `ggmosaic` package, we will use it to create mosaic plots of our data. You can install it through `install.packages("ggmosaic")`.
<br>
<p class="plist">Now that we have our data loaded, here are the comparisons we want to do:</p>
<ol>
<li>In Male at age 3 that have successfully done Task2, is there an association between the treatment and the success on Task1?</li>
<li>In Female at age 15 that have successfully done Task1, is there an association between the treatment and the success on Task2?</li>
<li>In Untreated mice, is there any association between the age and the success on Task1?</li>
</ol>
Now, for each comparison, we have to decide which test we want to use. To evaluate sample size and number of categories we will use `table`:
```{r}
# 1. Filter data
male_3_task2yes_df <- df %>%
filter(sex == "Male", age == 3, Task2 == 1)
female_15_task1yes_df <- df %>%
filter(sex == "Female", age == 15, Task1 == 1)
untreated_df <- df %>%
filter(treatment == "Untreated")
# 2. Create contingency tables
male_3_task2yes_table <- table(male_3_task2yes_df$treatment, male_3_task2yes_df$Task1)
female_15_task1yes_table <- table(female_15_task1yes_df$treatment, female_15_task1yes_df$Task2)
untreated_table <- table(untreated_df$age, untreated_df$Task1)
# 3. Get sample size
male_3_task2yes_sample_size <- sum(male_3_task2yes_table)
female_15_task1yes_sample_size <- sum(female_15_task1yes_table)
untreated_sample_size <- sum(untreated_table)
# 4. Get max number of categories
male_3_task2yes_max_cat <- max(dim(male_3_task2yes_table))
female_15_task1yes_max_cat <- max(dim(female_15_task1yes_table))
untreated_max_cat <- max(dim(untreated_table))
```
Now that we have all the info, let's check them:
<em>In Male at age 3 that have successfully done Task2, is there an association between the treatment and the success on Task1?</em>
```{r}
print(male_3_task2yes_table)
cat("Sample size:", male_3_task2yes_sample_size,
"\nMax number of categories:", male_3_task2yes_max_cat, "\n")
```
<p class="plist">What can we say?<p>
<ul>
<li>First of all, it is <strong>always</strong> recommended to look at the contingency table as starting point</li>
<li>Sample size is low, it is a 2x2 contingency matrix. So we have to look at the expected values to decide whether to use Fisher's exact test or Chi-Squared test with Yates correction</li>
</ul>
To do so, we will use `chisq.test` function, extrapolating the expected values from the results. I know, it is "silly" that we have to actually perform chi-squared test to get the expected values...
```{r, warning=FALSE}
male_3_task2yes_expected <- chisq.test(male_3_task2yes_table)$expected
cat("Minimum expected value is:", min(male_3_task2yes_expected))
```
As it is < 5, we will use Fisher's exact test in this scenario.
<em>In Female at age 15 that have successfully done Task1, is there an association between the treatment and the success on Task2?</em>
```{r}
print(female_15_task1yes_table)
cat("Sample size:", female_15_task1yes_sample_size,
"\nMax number of categories:", female_15_task1yes_max_cat, "\n")
```
<p class="plist">What can we say?<p>
<ul>
<li>Sample size is low, it is a 2x2 contingency matrix. So we have to look at the expected values to decide whether to use Fisher's exact test or Chi-Squared test with Yates correction</li>
</ul>
```{r, warning=FALSE}
female_15_task1yes_expected <- chisq.test(female_15_task1yes_table)$expected
cat("Minimum expected value is:", min(female_15_task1yes_expected))
```
As it is > 5, we will use Chi-Squared test with Yates correction in this scenario.
<em>In Untreated mice, is there any association between the age and the success on Task1?</em>
```{r}
print(untreated_table)
cat("Sample size:", untreated_sample_size,
"\nMax number of categories:", untreated_max_cat, "\n")
```
<p class="plist">What can we say?<p>
<ul>
<li>Sample size is low, it is a 4x2 contingency matrix. So we have to use Chi-Squared test</li>
</ul>
## Fisher exact test {-}
Now that we have decided which test to use, let's start evaluating if there is an association between the treatment and the success on Task1 in Male at age 3 that have successfully done Task2.
<br>
We already have the contingency table, and we know we should use the Fisher's exact test; but, let's do a mosaic plot that can be the then updated with the statistics and can be used a figure to show the results:
```{r fisher_mosaic, warning=FALSE}
male_3_task2yes_mosaic <- ggplot(male_3_task2yes_df) +
geom_mosaic(aes(x = product(treatment), fill = Task1), alpha = 1) +
scale_y_discrete(expand = expansion(mult = c(0,0))) +
labs(y = "", x = "Treatment", title = "Task1 success rate") +
scale_fill_manual(values = c("1" = "#76B041", "0" = "#D9481C")) +
theme_classic() +
theme(axis.line = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.07, size = 14),
axis.ticks.x = element_blank())
male_3_task2yes_mosaic
```
Great, let's have a look again at the contingency table, and then perform Fisher's exact test:
```{r}
male_3_task2yes_table
```
```{r}
male_3_task2yes_fisher_res <- fisher.test(male_3_task2yes_table)
male_3_task2yes_fisher_res
```
<p class="plist">How to interpret these data:</p>
<ul>
<li><em>odds ratio</em>: it indicates the odds of success in Task1 of T1-treated mice compared to untreated, so it seems that the former perform worst than the latter in this task</li>
<li><em>p-value</em>: in this case, it indicates that the two variables are independent, so that there is <u>no influence</u> of treatment in the outcome of Task1 in Male at P3 that have successfully performed in Task2</li>
</ul>
### Phi (φ) Coefficient {-}
The Phi (φ) Coefficient is used as a measure of association, indicating the strength and direction of association between two binary categorical variables in a 2x2 contingency table. It ranges from -1 to 1, with values closer to -1 or 1 indicating stronger associations.
<br>
Our results have already indicated that there is no association of the two variables, but we will calculate it as it may be useful in other cases, where the Fisher's exact test is significant. To do so, we will use the `assocstats` function from package `vcd` (this function returns multiple scores, we will take only phi):
```{r}
male_3_task2yes_fisher_phi <- assocstats(male_3_task2yes_table)$phi
male_3_task2yes_fisher_phi
```
As expected, Phi Coefficient is very low. Let's update our plot with these info:
```{r fisher_mosaic2}
n_untreated <- sum(male_3_task2yes_df$treatment == "Untreated")
n_t1 <- sum(male_3_task2yes_df$treatment == "T1")
pvalue <- case_when(male_3_task2yes_fisher_res$p.value > 0.05 ~ as.character(round(male_3_task2yes_fisher_res$p.value, 3)),
male_3_task2yes_fisher_res$p.value < 0.0001 ~ "< 0.0001",
male_3_task2yes_fisher_res$p.value < 0.001 ~ "< 0.001",
male_3_task2yes_fisher_res$p.value < 0.01 ~ "< 0.01",
male_3_task2yes_fisher_res$p.value < 0.05 ~ "< 0.05",
)
caption <- paste0("Task1 success rate (1: success; 0: fail) of Male P3 mice who have successfully performed in Task2 (n = ",
male_3_task2yes_sample_size, "), in the two treatment conditions (Untreated n = ",
n_t1, ", T1-treated n = ", n_untreated, "). Statistics calculated through Fisher's exact test.")
male_3_task2yes_mosaic <- male_3_task2yes_mosaic +
labs(subtitle = paste0("φ: ", round(male_3_task2yes_fisher_phi, 3),
", OR: ", round(male_3_task2yes_fisher_res$estimate, 3),
", p-value: ", pvalue),
caption = str_wrap(caption, width = 125)) +
theme(plot.subtitle = element_text(size = 10, hjust = 0.07),
plot.caption = element_text(hjust = 0, debug = F, margin = margin(t= 20)),
plot.caption.position = "plot")
male_3_task2yes_mosaic
```
Great! We have a nice figure to present!
## Chi-Squared test with Yates correction {-}
Now, let's evaluate the scenario that has to be be assessed through Chi-Squared test with Yates correction. <br>
As in Fisher's exact test example, we'll make the mosaic plot first:
```{r chi_squared_yates_mosaic}
female_15_task1yes_mosaic <- ggplot(female_15_task1yes_df) +
geom_mosaic(aes(x = product(treatment), fill = Task2), alpha = 1) +
scale_y_discrete(expand = expansion(mult = c(0,0))) +
labs(y = "", x = "Treatment", title = "Task2 success rate") +
scale_fill_manual(values = c("1" = "#76B041", "0" = "#D9481C")) +
theme_classic() +
theme(axis.line = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.07, size = 14),
axis.ticks.x = element_blank())
female_15_task1yes_mosaic
```
Wow, here the difference is striking. Let's see the contingency matrix and perform the test:
```{r}
female_15_task1yes_table
```
```{r}
female_15_task1yes_chisq_res <- chisq.test(female_15_task1yes_table, correct = T)
female_15_task1yes_chisq_res
```
Compared to Fisher's exact test, here we look just at p-value as odds-ratio are not calculated by default (this is because we can calculate OR only in 2x2 contingency matrix, and chisq.test can accept greater matrices). In this case, we can say that the outcome of Task1 can be due to treatment (<strong>remember</strong>: correlation or absence of independence is not causation!).
<br>
As this is a 2x2 contingency matrix and the odds ratio, we can use Phi Coefficient to measure the association between the two variables:
```{r}
female_15_task1yes_phi <- assocstats(female_15_task1yes_table)$phi
female_15_task1yes_phi
```
There is quite a good positive association!
```{r}
female_15_task1yes_or <- (female_15_task1yes_table[1, 1] * female_15_task1yes_table[2, 2]) / (female_15_task1yes_table[1, 2] * female_15_task1yes_table[2, 1])
female_15_task1yes_or
```
Wow! The OR is huge!<br>
This has to be inserted in the plot:
```{r chi_squared_yates_mosaic2}
n_untreated <- sum(female_15_task1yes_df$treatment == "Untreated")
n_t1 <- sum(female_15_task1yes_df$treatment == "T1")
pvalue <- case_when(female_15_task1yes_chisq_res$p.value > 0.05 ~ as.character(round(female_15_task1yes_chisq_res$p.value, 3)),
female_15_task1yes_chisq_res$p.value < 0.0001 ~ "< 0.0001",
female_15_task1yes_chisq_res$p.value < 0.001 ~ "< 0.001",
female_15_task1yes_chisq_res$p.value < 0.01 ~ "< 0.01",
female_15_task1yes_chisq_res$p.value < 0.05 ~ "< 0.05",
)
caption <- paste0("Task2 success rate (1: success; 0: fail) of Female P15 mice who have successfully performed in Task1 (n = ",
female_15_task1yes_sample_size, "), in the two treatment conditions (Untreated n = ",
n_t1, ", T1-treated n = ", n_untreated, "). Statistics calculated through Fisher's exact test.")
female_15_task1yes_mosaic <- female_15_task1yes_mosaic +
labs(subtitle = paste0("φ: ", round(female_15_task1yes_phi, 3),
", OR: ", round(female_15_task1yes_or, 3),
", p-value: ", pvalue),
caption = str_wrap(caption, width = 125)) +
theme(plot.subtitle = element_text(size = 10, hjust = 0.07),
plot.caption = element_text(hjust = 0, debug = F, margin = margin(t= 20)),
plot.caption.position = "plot")
female_15_task1yes_mosaic
```
That's another cool picture to show at your PI!
## Chi-Squared test without correction {-}
Finally, let's see if there is any association between the age and the success on Task1 in untreated mice through Chi-Squared test. <br>
As always, here is the mosaic plot:
```{r chi_squared_mosaic}
untreated_mosaic <- ggplot(untreated_df) +
geom_mosaic(aes(x = product(age), fill = Task1), alpha = 1) +
scale_y_discrete(expand = expansion(mult = c(0,0))) +
labs(y = "", x = "Age", title = "Task1 success rate") +
scale_fill_manual(values = c("1" = "#76B041", "0" = "#D9481C")) +
theme_classic() +
theme(axis.line = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.07, size = 14),
axis.ticks.x = element_blank())
untreated_mosaic
```
It seems that the success rate in Task1 increases over time in untreated mice. Let's see the contingency matrix and perform the test:
```{r}
untreated_table
```
```{r}
untreated_chisq_res <- chisq.test(untreated_table, correct = T)
untreated_chisq_res
```
From this test, it is evident that the association between age and success in Task1 is significant. Let's quantify it.
### Cramer's V {-}
For contingency matrix that are bigger than 2x2 we cannot use Phi Coefficient to calculate the association, but we have to use <strong>Cramer's V</strong>. The possible values are the same as Phi, and also the interpretation. To calculate it, the function is the same as for Phi, but here we take the value of Cramer's V.
```{r}
untreated_cramer <- assocstats(untreated_table)$cramer
untreated_cramer
```
There is quite a good positive association also here!
Let's update our plot:
```{r chi_squared_mosaic2}
n_3 <- sum(untreated_df$age == 3)
n_7 <- sum(untreated_df$age == 7)
n_15 <- sum(untreated_df$age == 15)
n_30 <- sum(untreated_df$age == 30)
pvalue <- case_when(untreated_chisq_res$p.value > 0.05 ~ as.character(round(untreated_chisq_res$p.value, 3)),
untreated_chisq_res$p.value < 0.0001 ~ "< 0.0001",
untreated_chisq_res$p.value < 0.001 ~ "< 0.001",
untreated_chisq_res$p.value < 0.01 ~ "< 0.01",
untreated_chisq_res$p.value < 0.05 ~ "< 0.05",
)
caption <- paste0("Task1 success rate (1: success; 0: fail) of untreated mice (n = ",
untreated_sample_size, "), over time (P3 n = ",
n_3, ", P7 n = ", n_7, ", P15 n = ", n_15, ", P30 n = ", n_30,
"). Statistics calculated through Fisher's exact test.")
untreated_mosaic <- untreated_mosaic +
labs(subtitle = paste0("Cramer's V: ", round(untreated_cramer, 3),
", p-value: ", pvalue),
caption = str_wrap(caption, width = 125)) +
theme(plot.subtitle = element_text(size = 10, hjust = 0.07),
plot.caption = element_text(hjust = 0, debug = F, margin = margin(t= 20)),
plot.caption.position = "plot")
untreated_mosaic
```
Amazing, we have ansewred all the questions through these 3 test. Now you are able to use them in your personal analysis!