A fun fact: the header picture was actually generated by TensorFlow from the original movie poster :)))
oscar <- oscar %>%
mutate_at(c('budget','gross'), ~as.numeric(format(., scientific=F))) %>%
select(nominees,details,year,winner,metabase,rating,genres,budget,gross,minute,`American Cinema Editors`,BAFTA,`Chicago Film Critics`,`Critics Choice`,`Golden Globes`,Satellite,date,score,review) %>%
mutate(genre=str_extract(genres, "([A-Z])\\w+")) %>%
mutate(genre=as.factor(genre)) %>%
mutate(winner=ifelse(nominees=='Parasite',1,winner)) %>%
mutate(winner=ifelse(nominees=='1917',0,winner))
summary(oscar)
## nominees details year winner
## Length:19875 Length:19875 Min. :1997 Min. :0.0000
## Class :character Class :character 1st Qu.:2010 1st Qu.:0.0000
## Mode :character Mode :character Median :2014 Median :0.0000
## Mean :2013 Mean :0.1384
## 3rd Qu.:2018 3rd Qu.:0.0000
## Max. :2020 Max. :1.0000
##
## metabase rating genres budget
## Min. : 46.00 Min. :6.800 Length:19875 Min. : 145352
## 1st Qu.: 78.00 1st Qu.:7.500 Class :character 1st Qu.: 15000000
## Median : 84.00 Median :7.800 Mode :character Median : 25000000
## Mean : 82.27 Mean :7.794 Mean : 48125919
## 3rd Qu.: 89.00 3rd Qu.:8.100 3rd Qu.: 61000000
## Max. :100.00 Max. :8.900 Max. :237000000
## NA's :176
## gross minute American Cinema Editors BAFTA
## Min. : 323382 Min. : 91 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 92991835 1st Qu.:115 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median : 177243185 Median :127 Median : 0.0000 Median : 0.0000
## Mean : 281873536 Mean :129 Mean :-0.1841 Mean :-0.2518
## 3rd Qu.: 329398046 3rd Qu.:139 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :2790439000 Max. :209 Max. : 1.0000 Max. : 1.0000
##
## Chicago Film Critics Critics Choice Golden Globes Satellite
## Min. :-1.000 Min. :-1.00000 Min. :-1.000 Min. :-1.0000
## 1st Qu.:-1.000 1st Qu.: 0.00000 1st Qu.:-1.000 1st Qu.:-1.0000
## Median :-1.000 Median : 0.00000 Median : 0.000 Median :-1.0000
## Mean :-0.919 Mean : 0.03884 Mean :-0.155 Mean :-0.8142
## 3rd Qu.:-1.000 3rd Qu.: 0.00000 3rd Qu.: 0.000 3rd Qu.:-1.0000
## Max. : 1.000 Max. : 1.00000 Max. : 1.000 Max. : 1.0000
##
## date score review genre
## Length:19875 Min. :0.0000 Length:19875 Drama :5970
## Class :character 1st Qu.:0.7500 Class :character Biography:5532
## Mode :character Median :0.8000 Mode :character Comedy :2814
## Mean :0.8179 Action :2267
## 3rd Qu.:1.0000 Adventure:1477
## Max. :1.8000 Crime :1305
## (Other) : 510
## [1] 19875 20
The median review score from the Rotten Tomatoes website is relatively stable across years. However, we also notice that the variability is more pronounced in recent years than before. This may indicate that the Oscar award standards have changed and that critic ratings are no longer as important.
oscar %>%
filter(as.integer(year)>2000) %>%
ggplot()+
geom_boxplot(aes(x=as.factor(year),y=score,color=as.factor(year)),alpha=0.4)+
theme(axis.text.x = element_text(angle = 30, hjust = 1))+
theme(legend.position="none")+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())+
labs(title = 'Relationship b/w Year & Review Score')+
ylim(0,1)
A function has been created to clean the corpus.
review_source <- VectorSource(oscar$review) # interprets each element as a DSI.
review_corpus <- VCorpus(review_source) # creates volatile Corpus object.
# create a function to clean the corpus
clean_corpus <- function(corpus){
# http://bit.ly/2lfOfG2. require instead of library w/in function call.
require(tm)
require(qdap)
require(magrittr)
require(textstem)
# manual replacement with spaces. removePunctuation() will not do this.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, to_space, "\\.") # sometimes no space b/w sentences and period.
corpus <- corpus %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, c(stopwords("en"),tolower((unique(oscar$nominees))),"movie","film","spanish")) %>%
tm_map(stripWhitespace) %>%
tm_map(removeNumbers) %>% # I noticed numbers are messy to work with.
tm_map(content_transformer(replace_symbol)) %>% # qdap. e.g. % = 'percent'
tm_map(removePunctuation) %>% # including curly {} and round () brackets.
tm_map(content_transformer(replace_contraction)) %>% # qdap. e.g. shouldn't replaced by 'should not'
tm_map(content_transformer(replace_abbreviation)) %>% # qdap. data(abbreviations)
tm_map(content_transformer(tolower)) %>%
tm_map(content_transformer(str_squish)) %>% # stringr; collapse repeated whitespace.
tm_map(content_transformer(lemmatize_strings)) # textstem; lemmatize every review.
return(corpus)
}
review_corpus_clean <- clean_corpus(review_corpus)
review_corpus[[65]][[1]];review_corpus_clean[[65]][[1]]
## [1] "There have been many (so, so many) films about WW1 before, but never one quite like this, Sir!"
## [1] "many many film ww never one quite like sir"
We can compare the reviews before and after cleaning.
Then we bind the cleaned reviews back to the original data frame.
review_clean <- vector("character", nrow(oscar))
for (text in 1:nrow(oscar)) {
review_clean[text] <- review_corpus_clean[[text]][[1]]
}
oscar1 <- bind_cols(oscar,data.frame(review_clean, stringsAsFactors = FALSE))
# remove tm corpus source and original corpus.
remove(review_corpus_clean, review_clean,review_corpus, review_source)
Parasite won the 2020 Oscar for Best Picture. We first want to explore this specific movie.
oscar1 %>%
filter(nominees=='Parasite') %>%
unnest_tokens(., word, review_clean) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n > 5) %>%
na.omit() %>%
wordcloud2(shape = "diamond",size=1,backgroundColor = 'black',color = "random-light")
It is not surprising that ‘Bong Joonho’, the name of the director, is mentioned so many times; he deserves much of the credit for this successful movie. In addition, we can see some very positive words, such as ‘masterpiece’, ‘masterful’ and ‘perfect’, so the film seems to have an excellent reputation among reviewers. Moreover, there are also some words relevant to the themes of the movie, such as ‘satire’, ‘inequality’ and ‘class’.
library(dplyr)
library(tidytext)
df <- data.frame()
for (i in unique(oscar1$nominees)){
temp <- oscar1 %>%
filter(nominees == i) %>%
pull(review_clean) # one cleaned review per element for this movie
temp1 = data.frame(movie = i,
review = temp,
stringsAsFactors = FALSE)
df <- rbind(df,temp1)
}
songTF = df %>%
split(., .$movie) %>%
lapply(., function(x) {
songTokens = tm::MC_tokenizer(x$review)
tokenCount = as.data.frame(summary(as.factor(songTokens), maxsum = 1000))
total = length(songTokens)
tokenCount = data.frame(count = tokenCount[[1]],
word = row.names(tokenCount),
total = total,
song = unique(x$movie), # one movie name per split
row.names = NULL)
return(tokenCount)
})
songTF = do.call("rbind", songTF)
songTF$tf = songTF$count/songTF$total
idfDF = songTF %>%
group_by(word) %>%
count() %>%
mutate(idf = log((length(unique(songTF$song)) / n)))
tfidfData = merge(songTF, idfDF, by = "word")
tfidfData$tfIDF = tfidfData$tf * tfidfData$idf
tfidfData %>%
group_by(song) %>%
filter(song == 'Parasite') %>%
arrange(song, desc(tfIDF)) %>%
filter(word!= '(Other)') %>%
slice(1:15) %>%
ggplot(aes(x=reorder(word,-tfIDF),y=tfIDF,fill=-tfIDF))+
geom_col()+
theme(axis.text.x = element_text(angle = 30, hjust = 1))+
theme(legend.position="none")+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())
The TF-IDF plot also emphasizes what makes this movie unique and successful: it tells a story about inequality and reflects real social conditions.
Next, I conducted a sentiment analysis. Using the afinn dictionary, I was able to quantify each critic review and examine the relationship between the sentiment score and the rating.
# sentiment analysis
oscar2 <- oscar1
oscar2$label <- seq.int(nrow(oscar2))
senti <- oscar2 %>%
unnest_tokens(output=word,input=review_clean) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(label) %>%
summarize(meanSentiment = mean(value)) %>%
left_join(oscar2)
senti %>%
group_by(year,winner) %>%
summarise(senti=mean(meanSentiment)) %>%
ggplot(aes(year,senti,color=as.factor(winner)))+
geom_smooth(se=FALSE)+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())
senti %>%
group_by(year,winner) %>%
summarise(score=mean(score)) %>%
ggplot(aes(year,score,color=as.factor(winner)))+
geom_smooth(se=FALSE)+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())
We found that 1) in earlier years, the sentiment of the reviews was much higher than it is now; 2) from roughly 2005 to 2015, the movies toward which critics did not hold a particularly positive attitude tended to win the Oscar. However, there seems to be a new trend in which the movie with the higher critic sentiment takes the lead and wins the award again.
Also, we need to keep in mind that if a movie has a sad ending, the reviews may mention it, which would lower the sentiment score.
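The Welch t-test below checks whether the sentiment gap between winners and non-winners is statistically meaningful; the original chunk is not echoed, but output of this form comes from a call along these lines (a minimal sketch):
# compare mean review sentiment between winners (1) and non-winners (0)
t.test(meanSentiment ~ winner, data = senti)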
##
## Welch Two Sample t-test
##
## data: meanSentiment by winner
## t = -0.23525, df = 2939.1, p-value = 0.814
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08702369 0.06837880
## sample estimates:
## mean in group 0 mean in group 1
## 1.144123 1.153446
However, the overall t-test is not significant here. In other words, there is no significant difference between the mean sentiment scores of the two groups.
senti %>%
group_by(genre,winner) %>%
summarise(senti=mean(meanSentiment)) %>%
ggplot(aes(genre,senti,fill=as.factor(winner)))+
geom_bar(stat = "identity", position = 'dodge')+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())
Our findings were: 1) no horror or animation movie has won an Oscar; 2) for movies in the Action and Adventure genres, the winners have noticeably higher sentiment scores, while in Biography, Comedy, Crime and Drama the winners have slightly lower sentiment scores. This may result from the fact that crime and drama stories tend to be more complicated and can evoke more depressing thoughts. The sentiment scores of Action and Adventure movies may therefore have more predictive power for who will win the Oscar.
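To probe this genre-level claim a bit more formally, one could rerun the Welch t-test within a single genre. The snippet below is only an illustrative sketch (the genre ‘Action’ is chosen arbitrarily) and is not part of the original analysis.
# sentiment gap between winners and non-winners within one genre (illustration only)
t.test(meanSentiment ~ winner, data = dplyr::filter(senti, genre == "Action"))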
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
sp_n <- senti %>%
select(-label) %>%
group_by(nominees) %>%
mutate(score=mean(score), meanSentiment=mean(meanSentiment)) %>%
keep(is.numeric) %>%
distinct() %>%
mutate_all(~normalize(.)) %>%
discard(~all(is.na(.x))) %>%
select(-year)
library(corrplot)
corrplot.mixed(cor(sp_n), lower = "number", upper = "square", tl.pos='lt',order = "FPC",
tl.cex=0.7,tl.srt=45,number.cex=0.9,diag='l')
The score and metabase variables are highly correlated, which makes sense because one is the critic review score from the Rotten Tomatoes website while the other is the critic review score from IMDB. We can also see that some of the awards are related to each other, which means we may be able to use the results of some awards to predict the others.
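For instance, the overlap between two award indicators can be inspected directly. This is just a sketch assuming the award columns keep the -1/0/1 coding seen in the summary above; it is not part of the original analysis.
# cross-tabulate two award indicators and check their correlation
table(senti$BAFTA, senti$`Critics Choice`)
cor(senti$BAFTA, senti$`Critics Choice`, use = "complete.obs")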
# na.action = na.exclude keeps the predictions aligned with the rows of senti,
# since rows with missing predictors are otherwise dropped by lm() and lmer().
lmMod <- lm(data=senti, na.action=na.exclude,
score~meanSentiment+gross+budget+minute+BAFTA+`Critics Choice`+`Golden Globes`+`Chicago Film Critics`+`American Cinema Editors`+Satellite)
riMod <- lmer(data=senti, na.action=na.exclude,
score~meanSentiment+gross+budget+minute+BAFTA+`Critics Choice`+`Golden Globes`+`Chicago Film Critics`+`American Cinema Editors`+Satellite+(1|genre))
mixedPred <- predict(riMod)
slimPred <- predict(lmMod)
allPred <- cbind(actual = senti$score,
mixed = mixedPred,
slim = slimPred)
par(mfrow=c(1,2))
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
##
## Call:
## lm(formula = score ~ meanSentiment + gross + budget + minute +
## BAFTA + `Critics Choice` + `Golden Globes` + `Chicago Film Critics` +
## `American Cinema Editors` + Satellite, data = senti)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83958 -0.07814 0.01667 0.12414 0.91480
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.918893290328980 0.009807841414484 93.690
## meanSentiment 0.015389484517425 0.000746528038737 20.615
## gross -0.000000000038765 0.000000000005089 -7.617
## budget 0.000000000329895 0.000000000037881 8.709
## minute -0.000577885024305 0.000063868525328 -9.048
## BAFTA 0.021405008845163 0.002272546658631 9.419
## `Critics Choice` 0.036645567253833 0.002942601040340 12.453
## `Golden Globes` -0.014253426004486 0.002317814611806 -6.150
## `Chicago Film Critics` 0.021344094265376 0.004224256234173 5.053
## `American Cinema Editors` 0.011238820828753 0.002044276173565 5.498
## Satellite 0.028705682868236 0.003059752601824 9.382
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## meanSentiment < 0.0000000000000002 ***
## gross 0.0000000000000275 ***
## budget < 0.0000000000000002 ***
## minute < 0.0000000000000002 ***
## BAFTA < 0.0000000000000002 ***
## `Critics Choice` < 0.0000000000000002 ***
## `Golden Globes` 0.0000000007958947 ***
## `Chicago Film Critics` 0.0000004403553818 ***
## `American Cinema Editors` 0.0000000390727406 ***
## Satellite < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1578 on 15823 degrees of freedom
## (138 observations deleted due to missingness)
## Multiple R-squared: 0.05952, Adjusted R-squared: 0.05892
## F-statistic: 100.1 on 10 and 15823 DF, p-value: < 0.00000000000000022
All of the coefficients are statistically significant here. However, the R-squared is only about 0.06, which means the current combination of predictors may not be the optimal one; we need to explore further to find better predictors. We also fit a mixed model to see whether the genre variable can explain some of the variability in the outcome, but the result does not look very promising either.
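One way to quantify how much the genre random intercept actually adds is to compare the marginal (fixed effects only) and conditional (fixed plus random effects) R-squared of the mixed model. This is a sketch that assumes the MuMIn package is installed; it was not part of the original analysis.
# marginal vs. conditional R-squared for the random-intercept model
MuMIn::r.squaredGLMM(riMod)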
Instead of score, the critic review score, we now use winner as our outcome variable. We want to test how much predictive power the review texts have for predicting the winner of the Oscar for Best Picture.
senti1 <- senti %>%
dplyr::select(review_clean,winner) %>%
mutate(winner=normalize(winner))
splits = initial_split(senti1, .6, "winner")
trainingDataWhole = training(splits)
testingDataWhole = testing(splits)
trainingLabel = as.vector(trainingDataWhole$winner)
trainingData = c(trainingDataWhole[, -c(2)],recursive=T)
testingLabel = as.vector(testingDataWhole$winner)
testingData = c(testingDataWhole[, -c(2)],recursive=T)
tokenizerTrain = text_tokenizer(num_words = 10000)
fit_text_tokenizer(tokenizerTrain, trainingData)
trainingData = texts_to_sequences(tokenizerTrain, trainingData)
# reuse the tokenizer fitted on the training data so the word indices stay
# consistent across the training, testing and whole data sets
testingData = texts_to_sequences(tokenizerTrain, testingData)
wholeLabel = as.vector(senti1$winner)
wholeData = c(senti1[, -c(2)],recursive=T)
wholeData = texts_to_sequences(tokenizerTrain, wholeData)
vectorize_sequences <- function(sequences, dimension = 10000) {
# Creates an all-zero matrix of shape (length(sequences), dimension)
results <- matrix(0, nrow = length(sequences), ncol = dimension)
for (i in 1:length(sequences))
# Sets specific indices of results[i] to 1s
results[i, sequences[[i]]] <- 1
results
}
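# note: vectorize_sequences() is defined above but not used in the chunks shown;
# the sequences are padded with pad_sequences() below instead.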
trainingData = pad_sequences(trainingData, value = 0,
padding = "post", maxlen = 400)
testingData = pad_sequences(testingData, value = 0,
padding = "post", maxlen = 400)
wholeData = pad_sequences(wholeData, value = 0,
padding = "post", maxlen = 400)
vocabSize = 50000
#continuous output
model <- keras_model_sequential() %>%
layer_embedding(input_dim = vocabSize, output_dim = 16) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1) %>%
compile(
optimizer = "rmsprop",
loss = "mse",
metrics = c("mae")
)
xValidation = trainingData[1:500, ]
xTraining = trainingData[501:nrow(trainingData), ]
yValidation = trainingLabel[1:500]
yTraining = trainingLabel[501:length(trainingLabel)]
history = model %>%
keras::fit(xTraining, yTraining,
epochs = 120, batch_size = 20,
validation_data = list(xValidation, yValidation),
verbose = 3,
callbacks = list(
callback_early_stopping(patience = 3),
callback_reduce_lr_on_plateau()
))
It is somewhat odd that the loss on the training set is higher than the loss on the validation set. We might want to adjust the loss function, learning rate and regularization of the model to improve it (one possible variant is sketched after the evaluation output below).
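The held-out loss reported below was presumably produced by evaluating the fitted model on the test split, along these lines (a sketch, since the original chunk is not echoed):
model %>% evaluate(testingData, testingLabel, verbose = 0)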
## $loss
## [1] 0.1203287
##
## $mae
## [1] 0.2711006
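As mentioned above, one possible variant is to treat winner as a binary outcome and add some regularization. The sketch below is only an illustration of such an adjustment, not the model used for the results that follow; depending on the keras version, the learning-rate argument may be called lr instead of learning_rate.
model2 <- keras_model_sequential() %>%
layer_embedding(input_dim = vocabSize, output_dim = 16) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dropout(rate = 0.3) %>% # regularization via dropout
layer_dense(units = 1, activation = "sigmoid") %>% # predicted probability of winning
compile(
optimizer = optimizer_rmsprop(learning_rate = 0.0005), # smaller learning rate
loss = "binary_crossentropy",
metrics = c("accuracy")
)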
test1 <- model %>%
predict(wholeData)
senti2 <- cbind(senti,test1)
senti3 <- senti2 %>%
filter(year>1997) %>%
group_by(year,nominees,winner) %>%
summarise(test=mean(test1)) %>%
group_by(year) %>%
top_n(1, test) %>% # nominee with the highest predicted value in each year
mutate(V5 = 1) # flag the predicted winner
senti4 <- senti2 %>%
filter(year>1997) %>%
group_by(year,nominees,winner) %>%
summarise(test=mean(test1)) %>%
left_join(senti3) %>%
mutate(V5=ifelse(is.na(V5),0,V5))
n <- senti4 %>%
filter(winner==1 & V5==1) %>%
nrow()
n1 <- n/length(unique(senti4$year))
n1
## [1] 0.7826087
senti4 %>%
filter(winner==1) %>%
mutate(ref=1) %>%
ggplot(aes(year,ref,color=V5))+
geom_count()+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank(),
legend.position="none",
axis.line=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank())+
scale_x_continuous(breaks=seq(1998,2020,1))+
theme(axis.text.x = element_text(angle = 30, hjust = 1))+
scale_color_gradient(low="red", high="darkgreen")
Our prediction model correctly picks the Best Picture winner about 78% of the time over the last 25 years of Oscar awards. Also, we notice that most of the mistakes occurred before 2005. The conclusion is that reviews are playing an increasingly important role in predicting the Oscar winner.
log <- senti %>%
select(-label,-genres,-date,-review,-review_clean,-details) %>%
keep(is.numeric) %>%
mutate(winner=as.factor(winner)) %>%
na.omit()
set.seed(1234)
sample.set <- createDataPartition(log$winner, p = 0.6, list = FALSE)
log_train <- log[sample.set, ]
log_train <- DMwR::SMOTE(winner ~ ., as.data.frame(log_train), perc.over = 100, perc.under = 200)
log_test <- log[-sample.set, ]
logit_mod <-
glm(winner ~ ., family = binomial(link = 'logit'), data = log_train)
summary(logit_mod)
##
## Call:
## glm(formula = winner ~ ., family = binomial(link = "logit"),
## data = log_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1522 -0.5499 0.0136 0.5247 3.3559
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 49.6586372230063 17.8336634495554 2.785
## meanSentiment -0.0241256831057 0.0241644376059 -0.998
## year -0.0314396473889 0.0089146053064 -3.527
## metabase 0.0365768284223 0.0055633877293 6.575
## rating 1.6179806760195 0.1255443668636 12.888
## budget -0.0000000382098 0.0000000022983 -16.626
## gross 0.0000000007218 0.0000000002894 2.494
## minute -0.0106008374197 0.0031693620269 -3.345
## `American Cinema Editors` 0.3283984667586 0.0724644855813 4.532
## BAFTA 0.9570017778571 0.0815569593938 11.734
## `Chicago Film Critics` 0.4842014925009 0.1170391594657 4.137
## `Critics Choice` 2.5539286660878 0.0979722250964 26.068
## `Golden Globes` 0.4443311071630 0.0707091275989 6.284
## Satellite 0.2344877638146 0.1087651066818 2.156
## score 0.5400046911469 0.2756340925988 1.959
## Pr(>|z|)
## (Intercept) 0.005360 **
## meanSentiment 0.318087
## year 0.000421 ***
## metabase 0.0000000000488 ***
## rating < 0.0000000000000002 ***
## budget < 0.0000000000000002 ***
## gross 0.012646 *
## minute 0.000823 ***
## `American Cinema Editors` 0.0000058468264 ***
## BAFTA < 0.0000000000000002 ***
## `Chicago Film Critics` 0.0000351738746 ***
## `Critics Choice` < 0.0000000000000002 ***
## `Golden Globes` 0.0000000003301 ***
## Satellite 0.031091 *
## score 0.050097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7436.1 on 5363 degrees of freedom
## Residual deviance: 3758.2 on 5349 degrees of freedom
## AIC: 3788.2
##
## Number of Fisher Scoring iterations: 6
like_pred <- predict(logit_mod, log_test, type = "response")
ideal_cutoff <- InformationValue::optimalCutoff(
actuals = log_test$winner,
predictedScores = like_pred,
optimiseFor = "Both")
logit_pred <- as.factor(ifelse(like_pred > ideal_cutoff, 1, 0))
caret::confusionMatrix(logit_pred, log_test$winner, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4656 154
## 1 783 740
##
## Accuracy : 0.852
## 95% CI : (0.8431, 0.8607)
## No Information Rate : 0.8588
## P-Value [Acc > NIR] : 0.9411
##
## Kappa : 0.5284
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.8277
## Specificity : 0.8560
## Pos Pred Value : 0.4859
## Neg Pred Value : 0.9680
## Prevalence : 0.1412
## Detection Rate : 0.1168
## Detection Prevalence : 0.2405
## Balanced Accuracy : 0.8419
##
## 'Positive' Class : 1
##
The accuracy and kappa for the logistic regression are quite high, and we also notice that the sentiment score is not significant here. Whether a movie wins may not be decided by review sentiment; it is more related to whether the movie has won other awards and to some of its features, such as budget, box-office revenue and runtime.
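A threshold-free check could complement the confusion matrix above; the line below is a sketch that assumes the pROC package is available, and its value is not reported in the original analysis.
# area under the ROC curve for the logistic model on the test set
pROC::auc(log_test$winner, like_pred)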
A work by Yun Yan