Dataset - Winners of the Academy Awards (Oscars)
Description
This dataset covers the winners in four categories of the Academy Awards:
- Best Picture
- Best Director
- Best Lead Actor
- Best Lead Actress
A scraper written in R collects the following information (from imdb.com and filmaffinity.com) about the winning movies since 1928 in each of the above categories:
- name - Name of the movie
- year - Year of Release of the movie
- nomination - Number of nominations received by the movie
- rating - Users' rating of the movie
- duration - Duration of the movie in minutes
- genre1 - First Genre of the movie
- genre2 - Second Genre of the movie
- release - Month of release of the movie
- metacritic - Metacritic rating (MCR) of the movie
- synopsis - Synopsis of the plot of the movie
Challenges
- The dataset is small because it covers only the past 87 years.
- The data has many missing values, e.g. MCR values are not available for every movie.
- The data combines numeric and text values; therefore, it is mixed-type data.
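The missing-value and mixed-type challenges can be illustrated with a minimal sketch. The original implementation is in R; Python is used here only for illustration. The sketch assumes the CSV column names match the field names listed above and that a missing value appears as an empty cell. The sample rows are hypothetical placeholders, not values from the real dataset, and the use of the sample standard deviation is also an assumption.

```python
import csv
import io
import statistics

# Hypothetical placeholder rows in the column layout described above.
# These are NOT values from the real dataset.
SAMPLE_CSV = """name,year,nomination,rating,duration,genre1,genre2,release,metacritic,synopsis
Movie A,1950,8,7.9,120,Drama,Romance,December,,A placeholder synopsis.
Movie B,1990,10,8.1,140,Drama,Biography,November,84,Another placeholder synopsis.
Movie C,2010,6,7.5,100,Drama,,December,76,A third placeholder synopsis.
"""

# Columns assumed to be numeric; every other column is kept as text.
NUMERIC_COLUMNS = {"year": int, "nomination": int, "rating": float,
                   "duration": int, "metacritic": int}


def parse_row(row):
    """Convert numeric columns and map empty cells to None (missing)."""
    parsed = {}
    for column, value in row.items():
        value = (value or "").strip()
        if value == "":
            parsed[column] = None
        elif column in NUMERIC_COLUMNS:
            parsed[column] = NUMERIC_COLUMNS[column](value)
        else:
            parsed[column] = value
    return parsed


def load_movies(text):
    """Parse CSV text into a list of typed dictionaries."""
    return [parse_row(row) for row in csv.DictReader(io.StringIO(text))]


def summarize(movies, column):
    """Mean, sample std. deviation, max and min over non-missing values."""
    values = [m[column] for m in movies if m[column] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "max": max(values),
        "min": min(values),
    }


movies = load_movies(SAMPLE_CSV)
mcr = summarize(movies, "metacritic")
print(mcr)
```

Skipping missing values, as done here, is only one option; which handling suits the real files depends on how much of each column is missing.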
Download
The following .csv files contain the dataset for best pictures, directors, actors, actresses, and all categories together.
Implementation
The R implementation is available on GitHub.
Statistical Analysis
We can perform two types of analysis on this data: qualitative and quantitative. In the qualitative analysis, we can ask questions such as "Which movie received all four awards?", "Which winning movie had the lowest IMDb rating?", "Which winning movie has the longest duration?" and so on.
P.S. If you perform a qualitative analysis on this data and want to make it public, please contact me and I will post it here with full credit to you.
We wish to use this data for predictive purposes; therefore, we compute the following summary statistics for each category:
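The qualitative questions above reduce to simple set operations and lookups. The following Python sketch is one possible way to answer them. It assumes a movie is identified by its name and release year; the records are hypothetical placeholders, not rows from the real dataset, and the helper names are illustrative.

```python
# Hypothetical placeholder records; fields follow the column list above.
MOVIE_A = {"name": "Movie A", "year": 1950, "rating": 7.9, "duration": 120}
MOVIE_B = {"name": "Movie B", "year": 1990, "rating": 8.1, "duration": 140}
MOVIE_C = {"name": "Movie C", "year": 2010, "rating": 7.5, "duration": 100}

winners_by_category = {
    "Best Picture": [MOVIE_A, MOVIE_B],
    "Best Director": [MOVIE_A, MOVIE_C],
    "Best Lead Actor": [MOVIE_A, MOVIE_B],
    "Best Lead Actress": [MOVIE_A, MOVIE_C],
}


def movie_key(movie):
    """Identify a movie by (name, year); an assumption of this sketch."""
    return (movie["name"], movie["year"])


def won_every_category(by_category):
    """Movies that appear in the winner list of every category."""
    key_sets = [{movie_key(m) for m in movies}
                for movies in by_category.values()]
    return set.intersection(*key_sets)


def extreme(by_category, column, pick):
    """Apply pick (min or max) to one column over all distinct winners,
    ignoring movies where that column is missing."""
    unique = {movie_key(m): m
              for movies in by_category.values() for m in movies}
    known = [m for m in unique.values() if m.get(column) is not None]
    return pick(known, key=lambda m: m[column])


all_four = won_every_category(winners_by_category)
lowest_rated = extreme(winners_by_category, "rating", min)
longest = extreme(winners_by_category, "duration", max)
print(all_four, lowest_rated["name"], longest["name"])
```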
Statistic | Best Picture | Best Director | Best Actor | Best Actress |
---|---|---|---|---|
Mean Number of Nominations | 9.16 | 8.83 | 6.78 | 6.36 |
Std. Deviation of Number of Nominations | 2.42 | 2.65 | 2.91 | 3.19 |
Maximum Number of Nominations | 14 | 14 | 13 | 13 |
Minimum Number of Nominations | 3 | 3 | 2 | 2 |
Mean Users' Rating | 7.86 | 7.91 | 7.77 | 7.55 |
Std. Deviation of Users' Rating | 0.59 | 0.54 | 0.51 | 0.46 |
Maximum Users' Rating | 9.2 | 9 | 9.2 | 8.7 |
Minimum Users' Rating | 6 | 6.1 | 5.8 | 6.4 |
Mean Duration (min) | 138.63 | 137.52 | 122.41 | 115.57 |
Std. Deviation of Duration (min) | 31.57 | 33.85 | 26.18 | 22.35 |
Maximum Duration (min) | 238 | 238 | 212 | 238 |
Minimum Duration (min) | 90 | 85 | 85 | 69 |
Mean MCR | 83.75 | 83.87 | 80.71 | 76.6 |
Std. Deviation of MCR | 9.26 | 8.11 | 11.06 | 9.88 |
Maximum MCR | 100 | 100 | 100 | 91 |
Minimum MCR | 64 | 65 | 56 | 53 |
Most Common Month of Release | December | December | December | December |
Two Most Frequent Genres | Drama, Romance | Drama, Romance | Drama, Biography | Drama, Romance |
Mean Sentiment* | -0.60 | -0.47 | -1.12 | -0.50 |
Std. Deviation of Sentiment | 3.04 | 3.09 | 3.36 | 2.94 |
Maximum Sentiment | 6 | 6 | 8 | 6 |
Minimum Sentiment | -11 | -11 | -11 | -11 |
* Sentiment indicates the valence (pleasantness) of the synopsis text. It is calculated as the sum of the valence values of the words in the synopsis, based on the word list available here.
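The sentiment score described above, a sum of per-word valence values, can be sketched in a few lines of Python. The valence entries below are hypothetical stand-ins for the referenced word list, and the tokenization (lowercasing, then splitting on non-letter characters) is an assumption; the original R code may tokenize differently.

```python
import re

# Hypothetical valence values for illustration only; the word list
# referenced above is not reproduced here.
VALENCE = {"love": 3, "win": 2, "war": -2, "death": -3}


def sentiment(synopsis, valence=VALENCE):
    """Sum the valence of every word in the synopsis.

    Words that are not in the word list contribute 0.
    """
    words = re.findall(r"[a-z']+", synopsis.lower())
    return sum(valence.get(word, 0) for word in words)


# love (3) + death (-3) + war (-2) = -2
score = sentiment("A story of love and death during the war.")
print(score)
```

Because the score is an unnormalized sum, longer synopses can reach larger absolute values than shorter ones.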
Improvements
The quality of this data can be improved by:
- Collecting information about all the movies nominated in these four categories, not only the winners.
- Collecting more information about the winning movies from other websites such as Rotten Tomatoes, Metacritic, Box Office Mojo, etc.
- Extracting useful features that can be used for predictive purposes.
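As one possible reading of the feature-extraction item above, the Python sketch below turns a mixed-type record into a purely numeric feature dictionary. The encoding choices (month as a number, genres as 0/1 indicators, an explicit missing-MCR flag) and the function name `to_features` are assumptions for illustration, not part of the original implementation; the example record is a hypothetical placeholder.

```python
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
MONTH_NUMBER = {name: index for index, name in enumerate(MONTHS, start=1)}


def to_features(movie, genres):
    """Encode one movie record as numeric features.

    `genres` is the list of genre names to encode as 0/1 indicators.
    """
    features = {
        "nomination": movie["nomination"],
        "rating": movie["rating"],
        "duration": movie["duration"],
        # None if the month is missing or not an English month name.
        "release_month": MONTH_NUMBER.get(movie["release"]),
        # Flag missing MCR values instead of silently dropping the movie.
        "metacritic_missing": int(movie["metacritic"] is None),
    }
    present = {movie["genre1"], movie["genre2"]}
    for genre in genres:
        features["genre_" + genre] = int(genre in present)
    return features


# Hypothetical placeholder record, not a row from the real dataset.
example = {"name": "Movie A", "nomination": 8, "rating": 7.9,
           "duration": 120, "genre1": "Drama", "genre2": None,
           "release": "December", "metacritic": None}
features = to_features(example, ["Drama", "Romance", "Biography"])
print(features)
```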