---
title: "Exercise 2 - Working with Data"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
1. Working with data structures in R
a. Execute and examine the following object:
```{r}
obj1_1 <- read.table(text = "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
")
```
Was this what you were expecting? Why not?
**Probably not: without `header=TRUE`, the names `a` to `d` were read as values in the first data row rather than as variable names.**
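A quick way to see what happened is to inspect the column names and classes of the result. The following is a minimal sketch that rereads the same text; the object name `demo_noheader` is made up for illustration.

```{r}
# Illustrative only: read the same text without declaring a header row
demo_noheader <- read.table(text = "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
")

# Default names are assigned because no header row was detected
names(demo_noheader)

# Every column holds text (character, or factor in R before 4.0),
# because the first row contains the letters a to d
sapply(demo_noheader, class)
```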
b. Modify the above command and rerun it with the `header=TRUE` argument, assigning
the result to a new object `obj2_1`.
Examine the object's structure using `str(obj2_1)`. Was this what you were expecting?
Try correcting the input by specifying a `stringsAsFactors` argument to `read.table`.
```{r}
obj2_1 <- read.table(text = "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
", header=TRUE, stringsAsFactors=FALSE)
```
**`stringsAsFactors=FALSE` reads in the non-numeric data as type `character` rather
than creating factors from them.**
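The effect of the argument can be checked by reading the same text both ways and comparing column classes. This is a sketch only; `txt`, `as_factors` and `as_strings` are hypothetical names not used elsewhere in the exercise.

```{r}
# Same input text as in the exercise, stored once so it can be read twice
txt <- "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
"

as_factors <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
as_strings <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Columns b and d differ: factor in the first case, character in the second
sapply(as_factors, class)
sapply(as_strings, class)
```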
c. Modify the object so that:
* `b` is integer
* `d` is a factor
For this you can use `as.integer` -- but be careful that this results in the conversion that you were expecting -- and `factor`.
```{r}
obj3_1 <- read.table(text = "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
", header=TRUE, stringsAsFactors=TRUE)
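# Caution: with stringsAsFactors=TRUE, b is a factor, so as.integer() returns
# the level codes (1, 2) rather than the values 2 and 4
obj3_1$b <- as.integer(obj3_1$b)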
obj3_1$d <- factor(obj3_1$d)
str(obj3_1)
```
d. If you had trouble getting `b` to coerce to an integer, try first removing the "L"
using `gsub()` to replace the `"L"` with `""`. Get help on this using `?gsub`.
```{r}
obj4_1 <- read.table(text = "
a b c d
1 2 4.3 Yes
3 4L 5.1 No
", header=TRUE, stringsAsFactors=FALSE)
tmp <- gsub("L", "", obj4_1$b)
obj4_1$b <- as.integer(tmp)
str(obj4_1)
```
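Why the conversion needs care can be seen on a small stand-alone vector. This is a minimal sketch; the vector `f` and the result names are made up for illustration and are not part of the assignment data.

```{r}
# Hypothetical stand-alone vector mimicking column b when read as a factor
f <- factor(c("2", "4L"))

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(f)

# Going through character first still fails for "4L", which is not a number
direct <- suppressWarnings(as.integer(as.character(f)))

# Stripping the "L" before converting gives the intended integers
cleaned <- as.integer(gsub("L", "", as.character(f)))

codes
direct
cleaned
```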
e. Finally, make this object into a data.frame, using `data.frame`. Print the output. Does it look correct?
```{r}
obj5_1 <- data.frame(obj4_1)
str(obj5_1)
```
**Actually, it was already a `data.frame`.**
2. Working with the `dplyr` package
For this part and the next, you should work with the file `dail2002.dta` from the article Kenneth Benoit and Michael Marsh. 2008. "[The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.](http://www.kenbenoit.net/pdfs/ajps_348.pdf)" *American Journal of Political Science* 52(4, October): 874-890.
a. Load the Stata dataset used in this paper, available [here](http://www.kenbenoit.net/files/dail2002.dta). To load this into R, you will need the `read.dta` command from the `foreign` package. (Note that you can load straight from the URL using this command.) Call this data object `dail2002`. What sort of object is this? How can you tell what sort of object it is?
```{r}
library(foreign)
dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta")
# check what sort of object this is; read.dta() returns a data.frame
class(dail2002)
```
b. Filtering: Select only the Fianna Fail candidates using `filter()`, and assign the filtered `data.frame` to `dail2002FF`. Note that you might want to first find out what are the labels for party by using `summary()` on the `party` variable.
```{r}
library(dplyr)
dail2002FF <- filter(dail2002, party=="ff")
summary(dail2002FF$party)
```
How many FF candidates were there in the 2002 election? **106**
c. Summarizing FF candidates per constituency. On the new data frame `dail2002FF`, summarize the median spending (`spend_total`) for FF candidates using the `dplyr` function `summarise`. Use "pipes" for extra credit!
```{r}
FFspend <- select(dail2002FF, spend_total, constituency) %>%
group_by(constituency) %>%
summarise(medspend = median(spend_total))
```
```{r}
dailgroup <- group_by(dail2002FF, constituency)
summarise(dailgroup, median(spend_total))
```
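The piped and step-by-step versions give the same grouped medians. The sketch below shows this on a tiny made-up data frame; `toy` and its values are hypothetical stand-ins for `dail2002FF`, chosen so the medians are easy to check by hand.

```{r}
library(dplyr)

# Hypothetical toy data standing in for dail2002FF (values are made up)
toy <- data.frame(
  constituency = c("A", "A", "A", "B", "B"),
  spend_total  = c(10, 30, 20, 5, 15)
)

# Piped version: group, then take the median within each group
piped <- toy %>%
  group_by(constituency) %>%
  summarise(medspend = median(spend_total))

# Step-by-step version without pipes
stepwise <- summarise(group_by(toy, constituency),
                      medspend = median(spend_total))

piped
```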
Sort and plot the 42 median spending values using an index plot.
```{r}
plot(sort(FFspend$medspend), ylab="Median constituency spending for FF")
```
For extra credit, do the same using `aggregate` instead of dplyr.
```{r}
FFspend2 <- aggregate(dail2002FF$spend_total,
list(constituency=dail2002FF$constituency),
median)
```
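When `aggregate` is given a bare vector, as above, the result column is named `x`. A formula call keeps the original variable name instead. The following is a sketch on made-up data; `toy` and its values are hypothetical.

```{r}
# Hypothetical toy data (values are made up)
toy <- data.frame(
  constituency = c("A", "A", "A", "B", "B"),
  spend_total  = c(10, 30, 20, 5, 15)
)

# Formula interface: median of spend_total within each constituency
agg <- aggregate(spend_total ~ constituency, data = toy, FUN = median)
agg
```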
3. Working with the `reshape2` package
The `count2 - count16` variables are currently in "wide" format. Use `melt` to create a candidate-count unit dataset, and then produce a table of the 42 constituencies by their maximum count.
Hint: First rename the `votes1st` variable to `count1`, so that it will be consistent with the others.
Then `melt` the data using `reshape2`, creating a new variable called `count` for the new value. Then `filter` to remove any rows where the count value is zero. Then `group_by` constituency, and `summarise` a count using `n()`.
You will probably need to consult both the package vignettes and the help pages to accomplish this. It seems complicated but it's well worth the effort to master these reshaping and summarizing skills -- this sort of manipulation and summary of the data is a core part of the activities of data mining and data analysis.
```{r}
library(reshape2)
library(dplyr)
# rename votes1st
names(dail2002)[which(names(dail2002)=="votes1st")] <- "count1"
dail2002melted <- melt(select(dail2002, wholename, district, count1, count2:count16, m),
id.vars = c("wholename", "district", "m"),
variable.name= "count",
value.name = "votes")
# strip off the number after "count" in the count variable
dail2002melted$ncount <- as.numeric(gsub("count", "", as.character(dail2002melted$count)))
dail2002maxcount <- filter(dail2002melted, votes>0) %>%
group_by(district, m) %>%
summarise(maxcount = max(ncount))
# clear relationship between constituency size and number of counts
with(dail2002maxcount, table(m, maxcount))
```
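The same wide-to-long steps can be followed on a tiny made-up data frame, where the result is easy to verify by hand. This is only a sketch: `toy_wide`, the candidate names and the vote values are hypothetical and not taken from `dail2002`.

```{r}
library(reshape2)
library(dplyr)

# Hypothetical wide data: one row per candidate, one column per count
toy_wide <- data.frame(
  wholename = c("Cand 1", "Cand 2", "Cand 3"),
  district  = c("X", "X", "Y"),
  count1    = c(100, 80, 120),
  count2    = c(110, 0, 0),
  count3    = c(0, 0, 0)
)

# Long format: one row per candidate-count pair
toy_long <- melt(toy_wide,
                 id.vars = c("wholename", "district"),
                 variable.name = "count",
                 value.name = "votes")

# Extract the count number from labels such as "count2"
toy_long$ncount <- as.numeric(gsub("count", "", as.character(toy_long$count)))

# Drop zero-vote rows, then find the highest count reached in each district
toy_max <- filter(toy_long, votes > 0) %>%
  group_by(district) %>%
  summarise(maxcount = max(ncount))
toy_max
```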