day2/ME414_assignment2_solution.Rmd

---
title: "Exercise 2 - Working with Data"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---

1.  Working with data structures in R

    a.  Execute and example the following object:
        ```{r}
        obj1_1 <- read.table(text = "
                             a  b    c    d 
                             1  2  4.3  Yes
                             3 4L  5.1   No
                             ")
         ```
        Was this what you were expecting?  Why not?
        
        **Probably not, since the `a - d` were values rather than variable names.**


    b.  Modify the above command and rerun it with the `header=TRUE` argument, assigning
        the result to a new object `obj2_1`.
        Examine the object's structure using `str(obj2_1)`.  Was this what you were expecting?
        Try correcting the input by specifying a `stringsAsFactors` argument to `read.table`.
        
        ```{r}
        obj2_1 <- read.table(text = "
                             a  b    c    d 
                             1  2  4.3  Yes
                             3 4L  5.1   No
                             ", header=TRUE, stringsAsFactors=FALSE)
         ```
        
        **`stringsAsFactors=TRUE` reads in the non-numeric data as type `character` rather
        than creating factors from them.**

    c.  Modify the object so that:
        *  `b` is integer
        *  `d` is a factor
        
        For this you can use `as.integer` -- but be careful that this results in the conversion that you were expecting -- and `factor`.
        
        ```{r}
        obj3_1 <- read.table(text = "
                             a  b    c    d 
                             1  2  4.3  Yes
                             3 4L  5.1   No
                             ", header=TRUE, stringsAsFactors=TRUE)
        obj3_1$b <- as.integer(obj3_1$b)
        obj3_1$d <- factor(obj3_1$d)
        str(obj3_1)
        ```

    d.  Did you have trouble getting `b` to coerce to an integer, try first removing the "L"
        using `gsub()` to replace the `"L"` with `""`.  Get help on this using `?gsub`.
       
        ```{r}
        obj4_1 <- read.table(text = "
                             a  b    c    d 
                             1  2  4.3  Yes
                             3 4L  5.1   No
                             ", header=TRUE, stringsAsFactors=FALSE)
        tmp <- gsub("L", "", obj4_1$b)
        obj4_1$b <- as.integer(tmp)
        str(obj4_1)
        ```
 

    e.  Finally, make this object into a data.frame, using `data.frame`.  Print the output.  Does it look correct?
    
        ```{r}
        obj5_1 <- data.frame(obj4_1)
        str(obj5_1)
        ```
        
        **Actually, it was already a `data.frame`.**


2.  Working with the `dplyr` package

    For this part and the next, you should work with the file `dail2002.dta` from the article Kenneth Benoit and Michael Marsh. 2008. "[The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.](http://www.kenbenoit.net/pdfs/ajps_348.pdf)" *American Journal of Political Science* 52(4, October): 874-890.   
    
    a.  Load the Stata dataset used in this paper, available [here](http://www.kenbenoit.net/files/dail2002.dta).  To load this into R, you will need the `read.dta` command from the `foreign` package.  (Note that you can load straight from the URL using this command.)  Call this data object `dail2002`.  What sort of object is this?  How can you tell what sort of object it is?
    
        ```{r}
        require(foreign)
        dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta")
        ```
    
    b.  Filtering:  Select only the Fianna Fail candidates using `filter()`, and assign the filtered `data.frame` to `dail2002FF`.  Note that you might want to first find out what are the labels for party by using `summary()` on the `party` variable.
    
```{r}
        require(dplyr)
        dail2002FF <- filter(dail2002, party=="ff")
        summary(dail2002FF$party)
```

        How many FF candidates were there in the 2002 election?  ** 106**
        
    
    c.  Summarizing FF candidates per constituency.  On the new data frame `dail2002FF`, summarize the median spending (`spend_total`) for FF candidates using the `dplyr` function `summarise`.  Use "pipes" for extra credit!

```{r}
        FFspend <- select(dail2002FF, spend_total, constituency) %>%
                      group_by(constituency) %>% 
                          summarise(medspend = median(spend_total))
                          
```

```{r}
dailgroup <- group_by(dail2002FF, constituency)
summarise(dailgroup, median(spend_total))
```


        Sort and plot the 42 median spending values using an index plot. 

```{r}
plot(sort(FFspend$medspend), ylab="Median constituency spending for FF")
```

        For extra credit, do the same using `aggregate` instead of dplyr.

```{r}
        FFspend2 <- aggregate(dail2002FF$spend_total, 
                              list(constituency=dail2002FF$constituency), 
                              median)
```


3.  Working with the `reshape2` package
    
    The `count2 - count16` variables are currently in "wide" format.  Use `melt` to create a candidate-count unit dataset, and then produce a table of the 42 constituencies by their maximum count.
    
    Hint: First rename the votes1st variable to `count1`, so that it will be consistent with the others.
    Then `melt` the data using `reshape2`, creating a new variable called `count` for the new value.  Then `filter` to remove any count variable that is zero.  Then `group_by` constituency, and `summarise` a count using `n()`. 
    
    You will probably need to consult both the package vignettes and the help pages to accomplish this.  It seems complicated but it's well worth the effort to master these reshaping and summarizing skills -- this sort of manipulation and summary of the data is a core part of the activities of data mining and data analysis.

    ```{r}
    library(reshape2)
    library(dplyr)
    # rename votes1st
    names(dail2002)[which(names(dail2002FF)=="votes1st")] <- "count1"
    dail2002melted <- melt(select(dail2002, wholename, district, count1, count2:count16, m), 
                           id.vars = c("wholename", "district", "m"), 
                           variable.name= "count", 
                           value.name = "votes")
    # strip off the number after "count" in the count variable
    dail2002melted$ncount <- as.numeric(gsub("count", "", as.character(dail2002melted$count)))
    dail2002maxcount <- filter(dail2002melted, votes>0) %>%
                            group_by(district, m) %>% 
                                summarise(maxcount = max(ncount))
    # clear relationship between constituency size and number of counts
    with(dail2002maxcount, table(m, maxcount))
    ```