-
Notifications
You must be signed in to change notification settings - Fork 0
/
pr1.Rmd
467 lines (338 loc) · 23.3 KB
/
pr1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
---
title: 'Project 1: Explore and Prepare Data'
subtitle: |-
CSE6242 - Data and Visual Analytics - Fall 2017
Due: Sunday, October 15, 2017 at 11:59 PM UTC-12:00 on T-Square
output:
html_document: default
html_notebook: default
pdf_document: default
username: aamster3
---
_Note: This project involves getting data ready for analysis and doing some preliminary investigations. Project 2 will involve modeling and predictions on the same dataset, and will be released at a later date. Both projects will have equal weightage towards your grade. You may reuse some of the preprocessing/analysis steps from Project 1 in Project 2._
# Data
In this project, you will explore a dataset that contains information about movies, including ratings, budget, gross revenue and other attributes. It was prepared by Dr. Guy Lebanon, and here is his description of the dataset:
> The file [`movies_merged`](https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged) contains a dataframe with the same name that has 40K rows and 39 columns. Each row represents a movie title and each column represents a descriptor such as `Title`, `Actors`, and `Budget`. I collected the data by querying IMDb’s API (see [www.omdbapi.com](http://www.omdbapi.com/)) and joining it with a separate dataset of movie budgets and gross earnings (unknown to you). The join key was the movie title. This data is available for personal use, but IMDb’s terms of service do not allow it to be used for commercial purposes or for creating a competing repository.
# Objective
Your goal is to investigate the relationship between the movie descriptors and the box office success of movies, as represented by the variable `Gross`. This task is extremely important as it can help a studio decide which titles to fund for production, how much to bid on produced movies, when to release a title, how much to invest in marketing and PR, etc. This information is most useful before a title is released, but it is still very valuable after the movie is already released to the public (for example it can affect additional marketing spend or how much a studio should negotiate with on-demand streaming companies for “second window” streaming rights).
# Setup
## Load data
Make sure you've downloaded the [`movies_merged`](https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged) file and it is in the current working directory. Now load it into memory:
```{r}
load('movies_merged')
cat("Dataset has", dim(movies_merged)[1], "rows and", dim(movies_merged)[2], "columns", end="\n", file="")
```
This creates an object of the same name (`movies_merged`). For convenience, you can copy it to `df` and start using it:
```{r}
df = movies_merged
cat("Column names:", end="\n", file="")
colnames(df)
```
## Load R packages
Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.
```{r}
library(ggplot2)
library(GGally)
library(plyr)
library(reshape2)
library(tm)
```
If you are loading any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular `install.packages('<pkg name>')` command.
**Non-standard packages used**: tm
# Tasks
Each task below is worth **10** points, and is meant to be performed sequentially, i.e. do step 2 after you have processed the data as described in step 1. Total points: **100**
Complete each task by implementing code chunks as described by `TODO` comments, and by responding to questions ("**Q**:") with written answers ("**A**:"). If you are unable to find a meaningful or strong relationship in any of the cases when requested, explain why not by referring to appropriate plots/statistics.
It is okay to handle missing values below by omission, but please omit as little as possible. It is worthwhile to invest in reusable and clear code as you may need to use it or modify it in project 2.
## 1. Remove non-movie rows
The variable `Type` captures whether the row is a movie, a TV series, or a game. Remove all rows from `df` that do not correspond to movies.
```{r}
# TODO: Remove all rows from df that do not correspond to movies
df = df[df$Type == "movie",]
dim(df)
```
**Q**: How many rows are left after removal? _Enter your response below._
**A**: 40000
## 2. Process `Runtime` column
The variable `Runtime` represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace `df$Runtime` with the new numeric column.
```{r}
# TODO: Replace df$Runtime with a numeric column containing the runtime in minutes
convertToMinutes = function(runtime) {
if(runtime == 'N/A') {
return(NA)
}
split = strsplit(runtime, ' ')[[1]]
split = lapply(split, function(x) gsub(',', '', x))
minutes = 0
if (length(split) == 2) {
minutes = as.integer(split[1])
} else if (length(split) == 4) {
minutes = as.integer(split[1]) * 60 + as.integer(split[3])
}
return(minutes)
}
df$Runtime = sapply(df$Runtime, convertToMinutes)
```
Now investigate the distribution of `Runtime` values and how it changes over years (variable `Year`, which you can bucket into decades) and in relation to the budget (variable `Budget`). Include any plots that illustrate.
```{r}
# TODO: Investigate the distribution of Runtime values and how it varies by Year and Budget
ggplot(subset(df, !is.na(Runtime) & Runtime < 300), aes("", Runtime)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("")
```
```{r}
df$Decade = floor(df$Year / 10) * 10
ggplot(subset(df, !is.na(Runtime) & Runtime < 300), aes(as.factor(-Decade), Runtime)) + geom_boxplot() +
coord_flip() +
scale_x_discrete("Decade")
```
```{r}
qplot(Runtime, Budget, data = subset(df, !is.na(Runtime) & !is.na(Budget) & Runtime < 300)) + stat_smooth(color = "red", se = F)
```
_Feel free to insert additional code chunks as necessary._
**Q**: Comment on the distribution as well as relationships. Are there any patterns or trends that you can observe?
**A**: Note: I only visualized runtimes < 300 as there were many outliers > 300 which made the visualizations less easy to read.
The distribution of runtimes has a median around 90 minutes with an IQR ranging from ~75 to ~100 minutes. The whiskers extend from ~60 minutes to ~120 minutes.
The median runtime has in general increased over time. The movies were very short in the early history of movies as can be seen in the boxplot. The spread of runtimes has decreased over time with a maximum spread in 1940 of ~70 minutes difference between 25th percentile and 75th percentile, as well as a long tail reaching to ~200 minutes. The spread in 2010 is ~30 minutes from the 25th percentile to 75th percentile.
Budget in general starts increasing around 80 minutes, with a peak at ~150 minutes and then starts decreasing. The spread of budgets is also at a maximum around 150 minutes.
## 3. Encode `Genre` column
The column `Genre` represents a list of genres associated with the movie in a string format. Write code to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence, and add it to the dataframe as additional columns. Then remove the original `Genre` column.
For example, if there are a total of 3 genres: Drama, Comedy, and Action, a movie that is both Action and Comedy should be represented by a binary vector <0, 1, 1>. Note that you need to first compile a dictionary of all possible genres and then figure out which movie has which genres (you can use the R `tm` package to create the dictionary).
```{r}
# TODO: Replace Genre with a collection of binary columns
corpus = VCorpus(VectorSource(df$Genre))
dtm = DocumentTermMatrix(corpus, control = list(removePunctuation = list(preserve_intra_word_dashes = T)))
df = df[, !names(df) == 'Genre']
df = data.frame(df, as.matrix(dtm))
```
Plot the relative proportions of movies having the top 10 most common genres.
```{r}
# TODO: Select movies from top 10 most common genres and plot their relative proportions
mostFreqGenresProportions = sort(colSums(as.matrix(dtm)) / dim(as.matrix(dtm))[1], decreasing = T)[1:10]
mostFreqGenreNames = factor(names(mostFreqGenresProportions), levels = names(mostFreqGenresProportions))
qplot(mostFreqGenreNames, mostFreqGenresProportions, xlab = "Genre", ylab = "Proportion")
```
Examine how the distribution of `Runtime` changes across genres for the top 10 most common genres.
```{r}
# TODO: Plot Runtime distribution for top 10 most common genres
moviesInMostPopGenres = df[any(df == 1), c(names(mostFreqGenresProportions), "Runtime", "Title")]
moviesMelted = melt(moviesInMostPopGenres, id = c("Title", "Runtime"))
moviesMelted = subset(moviesMelted, value == 1)
ggplot(subset(moviesMelted, !is.na(Runtime) & Runtime < 300, c("variable", "Runtime")), aes(reorder(variable, -Runtime, median), Runtime, ylab = "Genre")) + geom_boxplot() +
coord_flip() +
scale_x_discrete("Genre")
```
**Q**: Describe the interesting relationship(s) you observe. Are there any expected or unexpected trends that are evident?
**A**: The results are pretty expected. Drama is the most frequently occuring genre at around 40% with comedy next most popular at around 32%. Then there is a pretty big jump in frequency to short at ~15%, which is quite surprising that it is the 3rd most popular genre, I thought it would be more fringe.
The genres vs runtime are also pretty expected. Animation has the smallest median runtime at around 5 minutes, which is actually smaller than the median runtime for the genre called "short" which has a slightly larger median runtime. Documentary has the largest spread of runtimes with the IQR ranging from about 25 to 90 minutes and with 1.5 times IQR extending to about 1 minute and 145 minutes. The rest of the genres have a small spread and about the same median of around 90 minutes.
## 4. Eliminate mismatched rows
The dataframe was put together by merging two different sources of data and it is possible that the merging process was inaccurate in some cases (the merge was done based on movie title, but there are cases of different movies with the same title). There are 3 columns that contain date information: `Year` (numeric year), `Date` (numeric year), and `Released` (string representation of the release date).
Find and remove all rows where you suspect a merge error occurred based on a mismatch between these variables. To make sure subsequent analysis and modeling work well, avoid removing more than 10% of the rows that have a `Gross` value present.
_Note: Do not remove the rows with `Gross == NA` at this point, just use this a guideline._
```{r}
# TODO: Remove rows with Year/Date/Released mismatch
mismatch = function(row) {
year = as.numeric(row['Year'])
date = as.numeric(row['Date'])
released = as.numeric(format(as.Date(row['Released']), '%Y'))
if (!is.na(year) && !is.na(date)) {
if(abs(year - date) > 1) {
return(TRUE)
}
}
if (!is.na(year) && !is.na(released)) {
if(abs(year - released) > 1) {
return(TRUE);
}
}
if (!is.na(date) && !is.na(released)) {
if(abs(date - released) > 1) {
return(TRUE);
}
}
return(FALSE)
}
df = df[!apply(df, 1, mismatch), ]
```
**Q**: What is your precise removal logic, and how many rows remain in the resulting dataset?
**A**:
Iterate over the rows
parse year from released and convert to numeric
if year is not na and date is not na, and they differ by > 1 year remove the row
if date is not na and released is not na, and they differ by > 1 year remove the row
if year is not na and released is not na, and they differ by > 1 year remove the row
38099 rows are left
## 5. Explore `Gross` revenue
For the commercial success of a movie, production houses want to maximize Gross revenue. Investigate if Gross revenue is related to Budget, Runtime or Genre in any way.
_Note: To get a meaningful relationship, you may have to partition the movies into subsets such as short vs. long duration, or by genre, etc._
```{r}
# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre
binnedRuntime = function(runtime) {
if (is.na(runtime)) {
return(NA)
}
if (runtime < 80) {
return("short")
} else if(runtime < 120) {
return("medium")
} else {
return("long")
}
}
df$BinnedRuntime = as.factor(sapply(df$Runtime, binnedRuntime))
```
```{r}
ggplot(subset(df, Gross < 100000000 & !is.na(BinnedRuntime)), aes(BinnedRuntime, Gross)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Runtime")
```
```{r}
moviesMelted = melt(df[, c(gsub('-', '.', Terms(dtm)), "Title", "Gross")], id = c("Title", "Gross"), variable.name = "Genre")
moviesMelted = moviesMelted[moviesMelted$value == 1, ]
ggplot(moviesMelted[moviesMelted$Gross < 1000000000 & !is.na(moviesMelted$Gross), ], aes(reorder(as.factor(Genre), -Gross, median), Gross)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Genre")
```
```{r}
qplot(x = log(Budget), y = log(Gross), data = df[!is.na(df$Budget) & !is.na(df$Gross) & df$Gross != 0, ]) + stat_smooth(color = "red", se = F)
```
**Q**: Did you find any observable relationships or combinations of Budget/Runtime/Genre that result in high Gross revenue? If you divided the movies into different subsets, you may get different answers for them - point out interesting ones.
**A**: From the runtime vs gross revenue plot, after breaking down runtimes into short (< 80 minutes), medium(less than 120 minutes), and long (> 120 minutes) we can see an increase in gross revenue on average as the runtime increases. The shorter duration movies have a shorter tail, whereas medium and long have a lot of overalp, however the median gross revenue for long movies is higher.
Not much can be gleaned from the genre vs gross revenue plot because there is a lot of overlap in gross revenue between the genres. We can see that some genres have smaller spreads in gross revenue (film noir) while somewhat surprisingly animation has the highest median gross revenue and longest whisker.
From the budget vs gross revenue plot we can see an almost linear relationship when plotted on the log-log scale.
From these observations we might expect a long, animated movie with a big budget to have a high gross revenue.
```{r}
# TODO: Investigate if Gross Revenue is related to Release Month
ggplot(df[!is.na(df$Gross) & !is.na(df$Released) & df$Gross < 1000000000, ], aes(reorder(as.factor(format(Released, "%m")), -Gross, median), Gross)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Released")
```
## 6. Process `Awards` column
The variable `Awards` describes nominations and awards in text format. Convert it to 2 numeric columns, the first capturing the number of wins, and the second capturing nominations. Replace the `Awards` column with these new columns, and then study the relationship of `Gross` revenue with respect to them.
_Note: The format of the `Awards` column is not standard; you may have to use regular expressions to find the relevant values. Try your best to process them, and you may leave the ones that don't have enough information as NAs or set them to 0s._
```{r}
# TODO: Convert Awards to 2 numeric columns: wins and nominations
parseNominations = function(awards) {
pattern = "(\\S+)\\s*nomination*"
match = regexec(pattern, awards)
words = regmatches(awards, match)
return(sapply(words, function(words) as.integer(words[2])))
}
parseWins = function(awards) {
pattern = "(\\S+)\\s*win*"
match = regexec(pattern, awards)
words = regmatches(awards, match)
return(sapply(words, function(words) as.integer(words[2])))
}
df$Awards = tolower(df$Awards)
df$Nominations = parseNominations(df$Awards)
df$Nominations[is.na(df$Nominations)] = 0
df$Wins = parseWins(df$Awards)
df$Wins[is.na(df$Wins)] = 0
length(df$Nominations[df$Nominations != 0])
length(df$Wins[df$Wins != 0])
```
**Q**: How did you construct your conversion mechanism? How many rows had valid/non-zero wins or nominations?
**A**: To construct the wins and nomination columns I first lowercased the awards column string. Then I used regex to search for the substrings "win" and "nomination". I then used the regex backreference to select the word before the match.
10678 rows had nominations and 10180 rows had wins
```{r}
# TODO: Plot Gross revenue against wins and nominations
qplot(log(df$Nominations), log(df$Gross)) + stat_smooth(color = "red", se = F)
```
```{r}
qplot(log(df$Wins), log(df$Gross)) + stat_smooth(color = "red", se = F)
```
**Q**: How does the gross revenue vary by number of awards won and nominations received?
**A**: The gross revenue increases slightly linearly with both an increase in nominations and wins as can be seen on the log-log scale. I plotted it on the log-log scale since most data points were clustered around low nominations and low gross so it was difficult to see any trends when plotted.
## 7. Movie ratings from IMDb and Rotten Tomatoes
There are several variables that describe ratings, including IMDb ratings (`imdbRating` represents average user ratings and `imdbVotes` represents the number of user ratings), and multiple Rotten Tomatoes ratings (represented by several variables pre-fixed by `tomato`). Read up on such ratings on the web (for example [rottentomatoes.com/about](https://www.rottentomatoes.com/about) and [ www.imdb.com/help/show_leaf?votestopfaq](http:// www.imdb.com/help/show_leaf?votestopfaq)).
Investigate the pairwise relationships between these different descriptors using graphs.
```{r}
# TODO: Illustrate how ratings from IMDb and Rotten Tomatoes are related
qplot(x = tomatoRating, y = imdbRating, data = df[df$tomatoRating != 0, ]) + stat_smooth(color = "red", se = F)
```
```{r}
qplot(x = tomatoMeter, y = imdbRating, data = df[df$tomatoUserRating != 0, ]) + stat_smooth(color = "red", se = F, method = 'loess')
```
```{r}
qplot(x = tomatoReviews, y = imdbVotes, data = df) + stat_smooth(color = "red", se = F, method = 'loess')
```
**Q**: Comment on the similarities and differences between the user ratings of IMDb and the critics ratings of Rotten Tomatoes.
**A**: The user ratings of IMDb and the critics ratings of rotten tomatoes are in generally positively linearly correlated. As the rotten tomatoes ratings approaches the maximum, the ratings tend to agree more with the imdb ratings. There are some rotten tomatoes ratings that were given a relatively high rotten tomatoes rating but a relatively low imdb rating and vice versa.
The tomato meter is also positively linearly correlated with the imdb ratings. However, there are cases where the tomato meter is relatively low and the imdb rating relatively high and vice versa.
As the number of rotten tomatoes reviews increases so does the number of imdb votes. However there are some movies with relatively few rotten tomato reviews but a realitvely high number of imdb votes and vice versa. We can see that as the number of rotten tomatoes reviews increases, the imdb votes tend to non-linearly increase, which makes sense because there are many more imdb users than there are critics who write reviews.
## 8. Ratings and awards
These ratings typically reflect the general appeal of the movie to the public or gather opinions from a larger body of critics. Whereas awards are given by professional societies that may evaluate a movie on specific attributes, such as artistic performance, screenplay, sound design, etc.
Study the relationship between ratings and awards using graphs (awards here refers to wins and/or nominations).
```{r}
# TODO: Show how ratings and awards are related
iqr = quantile(df$Wins, .75) - quantile(df$Wins, .25)
ggplot(df[df$Wins < quantile(df$Wins, .75) + 20 * iqr, ], aes(reorder(Wins, -tomatoRating, median), tomatoRating)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Wins")
```
```{r}
ggplot(df[df$Wins < quantile(df$Wins, .75) + 20 * iqr, ], aes(reorder(Wins, -imdbRating, median), imdbRating)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Wins")
```
```{r}
iqr = quantile(df$Nominations, .75) - quantile(df$Nominations, .25)
ggplot(df[df$Nominations < quantile(df$Nominations, .75) + 20 * iqr, ], aes(reorder(Nominations, -tomatoRating, median), tomatoRating)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Nominations")
```
```{r}
ggplot(df[df$Nominations < quantile(df$Nominations, .75) + 20 * iqr, ], aes(reorder(Nominations, -imdbRating, median), imdbRating)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Nominations")
```
**Q**: How good are these ratings in terms of predicting the success of a movie in winning awards or nominations? Is there a high correlation between two variables?
**A**: In general, as the tomatoRating or imdbRating increases, the number of wins and nominations also increases. We can also see that as the number of wins or nominations increases, the ranges that the tomatoRatings or imdbRatings take on becomes smaller, meaning that there is more consensus that the movie is good. Movies with few wins or nominations take on a wide range of ratings.
## 9. Expected insights
Come up with two new insights (backed up by data and graphs) that is expected. Here “new” means insights that are not an immediate consequence of one of the above tasks. You may use any of the columns already explored above or a different one in the dataset, such as `Title`, `Actors`, etc.
```{r}
# TODO: Find and illustrate two expected insights
```
```{r}
# Insight 1
iqr = quantile(df$Budget, .75, na.rm = T) - quantile(df$Budget, .25, na.rm = T)
ggplot(df[df$Budget < quantile(df$Budget, .75, na.rm = T) + 1.5 * iqr & !is.na(df$BinnedRuntime) & !is.na(df$Budget), ], aes(BinnedRuntime, Budget)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Runtime")
```
```{r}
# Insight 2
iqr = quantile(df$Wins, .75, na.rm = T) - quantile(df$Wins, .25, na.rm = T)
ggplot(df[!is.na(df$BinnedRuntime) & df$Wins < quantile(df$Wins, .75, na.rm = T) + 20 * iqr, ], aes(BinnedRuntime, Wins)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("Runtime")
```
**Q**: Expected insight #1.
**A**: Short movies (< 80 minutes) in general have a smaller budget than medium length movies (< 120 minutes), which have a smaller budget than long movies.
**Q**: Expected insight #2.
**A**: Short movies do not win many of the awards collected in this dataset (I'm not sure what they are in this dataset, but they win awards at more niche film festivals). Longer movies tend to win the awards collected in this dataset.
## 10. Unexpected insight
Come up with one new insight (backed up by data and graphs) that is unexpected at first glance and do your best to motivate it. Same instructions apply as the previous task.
```{r}
# TODO: Find and illustrate one unexpected insight
iqr = quantile(df$Budget, .75, na.rm = T) - quantile(df$Budget, .25, na.rm = T)
tomatoRatingLevels = function(ratings) {
return(as.factor(round(ratings / .5) * .5))
}
ggplot(df[!is.na(df$tomatoRating) & df$Budget < quantile(df$Budget, .75, na.rm = T) + 1.5 * iqr & !is.na(df$Budget), ], aes(reorder(tomatoRatingLevels(tomatoRating), -Budget, median), Budget)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("tomatoRating")
```
**Q**: Unexpected insight.
**A**: Budget does not necessarily correlate to higher critics ratings. In fact the correlation seems to be negative. This could indicate that in order to create a successful movie, a large budget is not required. Furthermore a bigger budget may actually lead to a worse movie.