


A fun fact: the header picture was actually generated by TensorFlow from the original movie poster :)

Load the data


##    nominees           details               year          winner      
##  Length:19875       Length:19875       Min.   :1997   Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:2010   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :2014   Median :0.0000  
##                                        Mean   :2013   Mean   :0.1384  
##                                        3rd Qu.:2018   3rd Qu.:0.0000  
##                                        Max.   :2020   Max.   :1.0000  
##                                                                       
##     metabase          rating         genres              budget         
##  Min.   : 46.00   Min.   :6.800   Length:19875       Min.   :   145352  
##  1st Qu.: 78.00   1st Qu.:7.500   Class :character   1st Qu.: 15000000  
##  Median : 84.00   Median :7.800   Mode  :character   Median : 25000000  
##  Mean   : 82.27   Mean   :7.794                      Mean   : 48125919  
##  3rd Qu.: 89.00   3rd Qu.:8.100                      3rd Qu.: 61000000  
##  Max.   :100.00   Max.   :8.900                      Max.   :237000000  
##                                                      NA's   :176        
##      gross                minute    American Cinema Editors     BAFTA        
##  Min.   :    323382   Min.   : 91   Min.   :-1.0000         Min.   :-1.0000  
##  1st Qu.:  92991835   1st Qu.:115   1st Qu.:-1.0000         1st Qu.:-1.0000  
##  Median : 177243185   Median :127   Median : 0.0000         Median : 0.0000  
##  Mean   : 281873536   Mean   :129   Mean   :-0.1841         Mean   :-0.2518  
##  3rd Qu.: 329398046   3rd Qu.:139   3rd Qu.: 0.0000         3rd Qu.: 0.0000  
##  Max.   :2790439000   Max.   :209   Max.   : 1.0000         Max.   : 1.0000  
##                                                                              
##  Chicago Film Critics Critics Choice     Golden Globes      Satellite      
##  Min.   :-1.000       Min.   :-1.00000   Min.   :-1.000   Min.   :-1.0000  
##  1st Qu.:-1.000       1st Qu.: 0.00000   1st Qu.:-1.000   1st Qu.:-1.0000  
##  Median :-1.000       Median : 0.00000   Median : 0.000   Median :-1.0000  
##  Mean   :-0.919       Mean   : 0.03884   Mean   :-0.155   Mean   :-0.8142  
##  3rd Qu.:-1.000       3rd Qu.: 0.00000   3rd Qu.: 0.000   3rd Qu.:-1.0000  
##  Max.   : 1.000       Max.   : 1.00000   Max.   : 1.000   Max.   : 1.0000  
##                                                                            
##      date               score           review                genre     
##  Length:19875       Min.   :0.0000   Length:19875       Drama    :5970  
##  Class :character   1st Qu.:0.7500   Class :character   Biography:5532  
##  Mode  :character   Median :0.8000   Mode  :character   Comedy   :2814  
##                     Mean   :0.8179                      Action   :2267  
##                     3rd Qu.:1.0000                      Adventure:1477  
##                     Max.   :1.8000                      Crime    :1305  
##                                                         (Other)  : 510
## [1] 19875    20

The median review score from the Rotten Tomatoes website is relatively stable across years. However, we also notice that the variability is noticeably larger in recent years than before. This may indicate that the Oscar selection standard has changed and that the rating is no longer as important.

Clean the Text


A function has been created to clean the corpus.

## [1] "There have been many (so, so many) films about WW1 before, but never one quite like this, Sir!"
## [1] "many many film ww never one quite like sir"

We can compare the reviews before and after the cleaning.

Then, we bind the cleaned pro and con reviews back to the original dataframe.
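The before-and-after pair shown above can be reproduced with a function along these lines. This is a minimal base-R sketch: the abbreviated stopword list is hypothetical, and a full pipeline would also apply stemming (e.g. `films` → `film`), for instance via the tm and SnowballC packages.

```r
# Sketch of a review-cleaning function in base R (hypothetical stopword list;
# a real pipeline would use tm::stopwords("en") plus SnowballC stemming)
clean_review <- function(x,
                         stopwords = c("there", "have", "been", "so",
                                       "about", "but", "this", "before")) {
  x <- tolower(x)                # lowercase everything
  x <- gsub("[^a-z ]", " ", x)   # drop punctuation and digits
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nchar(words) > 0 & !(words %in% stopwords)]
  paste(words, collapse = " ")
}

clean_review("There have been many (so, so many) films about WW1 before!")
# → "many many films ww"  (stemming would further reduce "films" to "film")
```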

Parasite


Parasite won the 2020 Oscar Best Picture. We first want to explore this specific movie.

Word Cloud


It is not surprising that ‘Bong Joonho’, the director’s name, has been mentioned so many times; he deserves much of the credit for this successful movie. In addition, we can see some very positive words, such as ‘masterpiece’, ‘masterful’ and ‘perfect’, so the film seems to have an excellent reputation among reviewers. Moreover, there are also some words relevant to the themes of the movie, such as ‘satire’, ‘inequality’ and ‘class’.

Ratings

Sentiment Analysis


Next, I conducted a sentiment analysis. Using the AFINN dictionary, I was able to quantify each critic review and examine the relationship between the sentiment score and the rating.
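In outline, AFINN scoring assigns each word an integer valence and averages over the scored words in a review. A self-contained sketch with a hypothetical four-word slice of the lexicon (the real AFINN list, available via `tidytext::get_sentiments("afinn")`, has roughly 2,477 terms):

```r
# Hypothetical miniature AFINN lexicon; the real entries come from
# tidytext::get_sentiments("afinn")
afinn <- c(masterpiece = 4, perfect = 3, boring = -2, bad = -3)

score_review <- function(text) {
  text <- gsub("[^a-z ]", "", tolower(text))      # normalize before matching
  words <- strsplit(text, "\\s+")[[1]]
  hits <- afinn[words[words %in% names(afinn)]]   # keep only lexicon words
  if (length(hits) == 0) return(NA_real_)
  mean(hits)                                      # mean valence per scored word
}

score_review("a perfect masterpiece, never boring")  # → mean(c(3, 4, -2)) ≈ 1.67
```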

We found that 1) in earlier years, the sentiment of reviews was much higher than it is nowadays; 2) from roughly 2005 to 2015, movies toward which critics did not hold a positive attitude could still win the Oscar. However, there seems to be a new trend in which the movie with the higher sentiment score takes the lead and wins the award again.

Also, we need to keep in mind that if a movie has a sad ending, the reviews may mention it, which would pull down the sentiment score.

## 
##  Welch Two Sample t-test
## 
## data:  meanSentiment by winner
## t = -0.23525, df = 2939.1, p-value = 0.814
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08702369  0.06837880
## sample estimates:
## mean in group 0 mean in group 1 
##        1.144123        1.153446

However, the overall t-test is not significant here. In other words, there is no significant difference between the mean sentiment scores of the two groups.
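The Welch test above was presumably produced by a call of the form `t.test(meanSentiment ~ winner, data = senti)`. A self-contained version on simulated data (the frame and effect sizes here are made up for illustration):

```r
set.seed(7)
# Simulated stand-in for the real senti frame: two groups with nearly
# identical mean sentiment, mirroring the non-significant result above
toy <- data.frame(winner = rep(0:1, each = 40),
                  meanSentiment = c(rnorm(40, 1.14, 0.9), rnorm(40, 1.15, 0.9)))

tt <- t.test(meanSentiment ~ winner, data = toy)
tt$p.value   # a large p-value is expected: no real group difference here
```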

Our findings were: 1) no horror or animation movie has won an Oscar; 2) for movies in the Action and Adventure genres, the winners have significantly higher sentiment scores, while winners have slightly lower sentiment scores in Biography, Comedy, Crime and Drama. This may result from the fact that stories in crime or drama tend to be more complicated and can evoke some depressive thoughts. The sentiment scores of Action and Adventure movies therefore carry more predictive power for who will win the Oscar.

Correlation

The score and metabase variables are highly correlated, which makes sense because one is the critic review score from the Rotten Tomatoes website while the other is the critic review score from IMDB. We can also see that some awards are related to each other, which means we may be able to use the results of some awards to predict the others.
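With the numeric columns in hand, the correlation matrix is a single call in base R. A toy sketch, with made-up values standing in for the real `score` and `metabase` columns:

```r
# Toy frame: critic scores from two hypothetical sources that track each other
toy <- data.frame(score    = c(0.70, 0.80, 0.90, 1.00),
                  metabase = c(70, 80, 88, 95))

# use = "pairwise.complete.obs" tolerates NAs such as the missing budgets above
cor(toy, use = "pairwise.complete.obs")
```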

Regression

## 
## Call:
## lm(formula = score ~ meanSentiment + gross + budget + minute + 
##     BAFTA + `Critics Choice` + `Golden Globes` + `Chicago Film Critics` + 
##     `American Cinema Editors` + Satellite, data = senti)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83958 -0.07814  0.01667  0.12414  0.91480 
## 
## Coefficients:
##                                     Estimate         Std. Error t value
## (Intercept)                0.918893290328980  0.009807841414484  93.690
## meanSentiment              0.015389484517425  0.000746528038737  20.615
## gross                     -0.000000000038765  0.000000000005089  -7.617
## budget                     0.000000000329895  0.000000000037881   8.709
## minute                    -0.000577885024305  0.000063868525328  -9.048
## BAFTA                      0.021405008845163  0.002272546658631   9.419
## `Critics Choice`           0.036645567253833  0.002942601040340  12.453
## `Golden Globes`           -0.014253426004486  0.002317814611806  -6.150
## `Chicago Film Critics`     0.021344094265376  0.004224256234173   5.053
## `American Cinema Editors`  0.011238820828753  0.002044276173565   5.498
## Satellite                  0.028705682868236  0.003059752601824   9.382
##                                       Pr(>|t|)    
## (Intercept)               < 0.0000000000000002 ***
## meanSentiment             < 0.0000000000000002 ***
## gross                       0.0000000000000275 ***
## budget                    < 0.0000000000000002 ***
## minute                    < 0.0000000000000002 ***
## BAFTA                     < 0.0000000000000002 ***
## `Critics Choice`          < 0.0000000000000002 ***
## `Golden Globes`             0.0000000007958947 ***
## `Chicago Film Critics`      0.0000004403553818 ***
## `American Cinema Editors`   0.0000000390727406 ***
## Satellite                 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1578 on 15823 degrees of freedom
##   (138 observations deleted due to missingness)
## Multiple R-squared:  0.05952,    Adjusted R-squared:  0.05892 
## F-statistic: 100.1 on 10 and 15823 DF,  p-value: < 0.00000000000000022

All of the coefficients are highly significant here. However, the R-squared is only 0.06, which means the current combination of predictors may not be the optimal one; we need to explore further to decide on the best predictors. We also fitted a mixed model to see whether the genre variable could explain some of the variability in the outcome, but the result did not seem very promising.

Oscar Prediction

Deep Learning

Instead of score, the critic review score, we now set winner as our outcome variable. We want to test how much predictive power the review texts have for predicting the winner of the Oscar for Best Picture.

senti1 <- senti %>% 
  dplyr::select(review_clean, winner) %>% 
  mutate(winner = normalize(winner))   # put the outcome on a common scale

# 60/40 train/test split, stratified on the outcome
splits = initial_split(senti1, prop = .6, strata = "winner")

trainingDataWhole = training(splits)
testingDataWhole = testing(splits)

trainingLabel = as.vector(trainingDataWhole$winner)
trainingData = c(trainingDataWhole[, -c(2)], recursive = T)  # flatten reviews to a character vector
testingLabel = as.vector(testingDataWhole$winner)
testingData = c(testingDataWhole[, -c(2)], recursive = T)

# Fit the tokenizer on the training texts only, then reuse it for the test
# texts so that word indices stay consistent between the two sets
tokenizerTrain = text_tokenizer(num_words = 10000)
fit_text_tokenizer(tokenizerTrain, trainingData)
trainingData = texts_to_sequences(tokenizerTrain, trainingData)
testingData = texts_to_sequences(tokenizerTrain, testingData)



wholeLabel = as.vector(senti1$winner)
wholeData = c(senti1[, -c(2)], recursive = T)
# Reuse the training tokenizer here as well, so the whole-data sequences
# share the same word indices as the data the model was trained on
wholeData = texts_to_sequences(tokenizerTrain, wholeData)



vectorize_sequences <- function(sequences, dimension = 10000) {
  # Creates an all-zero matrix of shape (length(sequences), dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension) 
  for (i in 1:length(sequences))
    # Sets specific indices of results[i] to 1s
    results[i, sequences[[i]]] <- 1 
  results
}


trainingData = pad_sequences(trainingData, value = 0,
                             padding = "post", maxlen = 400)
testingData = pad_sequences(testingData, value = 0,
                            padding = "post", maxlen = 400)
wholeData = pad_sequences(wholeData, value = 0,
                            padding = "post", maxlen = 400)

It is somewhat strange that the loss on the training dataset is higher than the loss on the validation dataset. We might want to adjust the loss function, learning rate and regularization of the model to improve it.

## $loss
## [1] 0.1203287
## 
## $mae
## [1] 0.2711006
## [1] 0.7826087

Our prediction model correctly predicts 78% of the Oscar Best Picture winners over the last 25 years. Also, we notice that most of the mistakes occur before 2005. The conclusion is that reviews are playing an increasingly important role in predicting the Oscar.

Logistic

## 
## Call:
## glm(formula = winner ~ ., family = binomial(link = "logit"), 
##     data = log_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1522  -0.5499   0.0136   0.5247   3.3559  
## 
## Coefficients:
##                                   Estimate       Std. Error z value
## (Intercept)               49.6586372230063 17.8336634495554   2.785
## meanSentiment             -0.0241256831057  0.0241644376059  -0.998
## year                      -0.0314396473889  0.0089146053064  -3.527
## metabase                   0.0365768284223  0.0055633877293   6.575
## rating                     1.6179806760195  0.1255443668636  12.888
## budget                    -0.0000000382098  0.0000000022983 -16.626
## gross                      0.0000000007218  0.0000000002894   2.494
## minute                    -0.0106008374197  0.0031693620269  -3.345
## `American Cinema Editors`  0.3283984667586  0.0724644855813   4.532
## BAFTA                      0.9570017778571  0.0815569593938  11.734
## `Chicago Film Critics`     0.4842014925009  0.1170391594657   4.137
## `Critics Choice`           2.5539286660878  0.0979722250964  26.068
## `Golden Globes`            0.4443311071630  0.0707091275989   6.284
## Satellite                  0.2344877638146  0.1087651066818   2.156
## score                      0.5400046911469  0.2756340925988   1.959
##                                       Pr(>|z|)    
## (Intercept)                           0.005360 ** 
## meanSentiment                         0.318087    
## year                                  0.000421 ***
## metabase                       0.0000000000488 ***
## rating                    < 0.0000000000000002 ***
## budget                    < 0.0000000000000002 ***
## gross                                 0.012646 *  
## minute                                0.000823 ***
## `American Cinema Editors`      0.0000058468264 ***
## BAFTA                     < 0.0000000000000002 ***
## `Chicago Film Critics`         0.0000351738746 ***
## `Critics Choice`          < 0.0000000000000002 ***
## `Golden Globes`                0.0000000003301 ***
## Satellite                             0.031091 *  
## score                                 0.050097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7436.1  on 5363  degrees of freedom
## Residual deviance: 3758.2  on 5349  degrees of freedom
## AIC: 3788.2
## 
## Number of Fisher Scoring iterations: 6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4656  154
##          1  783  740
##                                              
##                Accuracy : 0.852              
##                  95% CI : (0.8431, 0.8607)   
##     No Information Rate : 0.8588             
##     P-Value [Acc > NIR] : 0.9411             
##                                              
##                   Kappa : 0.5284             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.8277             
##             Specificity : 0.8560             
##          Pos Pred Value : 0.4859             
##          Neg Pred Value : 0.9680             
##              Prevalence : 0.1412             
##          Detection Rate : 0.1168             
##    Detection Prevalence : 0.2405             
##       Balanced Accuracy : 0.8419             
##                                              
##        'Positive' Class : 1                  
## 

The accuracy and kappa for the logistic regression are quite high, and we also notice that the sentiment score here is not significant. Whether a movie wins may not be decided by review sentiment; it is more related to whether it has won other awards and to some of the movie’s features, such as budget, box-office revenue and duration.
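The model and confusion matrix above can be reproduced in outline with base R alone (caret's `confusionMatrix` adds the kappa and rate statistics on top of the raw table). The data here are simulated, standing in for the real `log_train` frame:

```r
set.seed(42)
# Simulated predictor and outcome with a genuine logistic relationship
toy <- data.frame(x = rnorm(200))
toy$winner <- rbinom(200, 1, plogis(2 * toy$x))

fit  <- glm(winner ~ x, family = binomial(link = "logit"), data = toy)
pred <- as.integer(predict(fit, type = "response") > 0.5)

table(Prediction = pred, Reference = toy$winner)   # base-R confusion matrix
mean(pred == toy$winner)                           # raw accuracy
```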

 

A work by Yun Yan