Skip to content

AI and Machine Learning

With the full data pipeline already in place, a natural next step is to focus on extracting deeper insights from the data using Artificial Intelligence (AI) and Machine Learning (ML).

The current pipeline transforms raw movie data through Bronze, Silver, and Gold layers, ensuring clean and structured datasets. Building upon this, we can integrate AI/ML models to discover hidden patterns and generate predictions that go beyond traditional BI dashboards.

For instance, instead of only reporting on top-rated movies or box office trends, we could use ML to predict future popularity, recommend content, or classify movies by themes and audience engagement.

Architecture Vision

The underlying architecture remains mostly unchanged. However, a new ML workflow would be integrated after the Silver Layer, where structured and cleaned data is available. This data would feed ML models, and the predictions or insights generated would be added to the Gold Layer for final consumption.

This allows us to:

  • Enrich the dataset with predicted values (e.g., popularity score)
  • Generate features for advanced dashboards
  • Automate analytics and content discovery

Initial Exploration

To validate this concept, I started a small proof-of-concept using the TMDB 5000 Movie Dataset from Kaggle.

TMDB 5000 Movie Dataset

The dataset contains metadata for 5000 movies, including fields such as budget, genres, release dates, cast, crew, and popularity score. Many of these columns overlap with our enriched Silver Layer, making it a suitable training base for model development.

First POC

The idea was to build a machine learning model capable of predicting the popularity of a movie, using metadata as input features.

To test this, I used a local Jupyter notebook and trained a basic regression model (RandomForestRegressor) to predict the popularity column.

Notebook & Code

You can access the notebook here:

Dataset Summary

  • Size: ~5000 movies
  • Features: title, release_date, genres, vote_average, vote_count, runtime, language, cast, crew, etc.
  • Target: popularity

Modeling Process

  • Performed basic Exploratory Data Analysis (EDA)
  • Applied Feature Engineering (genre encoding, date extraction, numeric transformations)
  • Trained a Random Forest Regressor

Results

These early results are promising, especially considering that the data was not deeply explored—only lightly processed and used in a simple proof of concept to validate the idea.

  • RMSE: 17.99
  • R² (Train): 0.934
  • R² (Test): 0.681

Visualizations

  • Real vs. Predicted Popularity

This scatter plot displays most points aligned near the ideal line, indicating the model captures the general trend. However, dispersion increases for higher popularity values, suggesting difficulty in predicting extreme values.

TMDB 5000 Movie Dataset

  • Residuals vs. Predicted Popularity

This plot shows the errors (residuals) against the predicted popularity. The residuals are mostly centered around zero for lower predicted values, but a noticeable spread emerges for higher predictions, with some significant positive and negative errors.

This pattern suggests that the model's prediction accuracy decreases as movie popularity increases.

TMDB 5000 Movie Dataset

  • Top 5 Best Predictions
TitleReal PopularityPredicted PopularityAbsolute Error
Three Kingdoms: Resurrection of the Dragon4.1271554.1211390.006016
The Last Legion12.28291112.2744500.008461
Ong Bak 29.0294899.0156880.013801
Jason Goes to Hell: The Final Friday10.34198210.3278490.014133
Sands of Iwo Jima3.8510003.8352270.015773
  • Top 10 Worst Predictions
TitleReal PopularityPredicted PopularityAbsolute Error
Interstellar724.247784336.452864387.794920
Gravity110.153618305.401820195.248202
Toy Story73.640445259.727407186.086962
The Lion King90.457886274.710194184.252308
Pirates of the Caribbean: The Curse of the Black Pearl271.972889108.266098163.706791
Iron Man 377.682080218.344458140.662378
Django Unchained82.121691210.649875128.528184
The Hunger Games68.550698186.539071117.988373
Whiplash192.528841102.47595790.052884
Captain America: Civil War198.372395121.71300876.659387

This chart highlights significant prediction errors, with the model notably underestimating highly popular films like "Interstellar". Conversely, it overestimated others such as "Gravity", indicating challenges in consistent accuracy across the popularity spectrum.

These specific cases, especially those with very large absolute errors, may represent outliers that require further investigation.

TMDB 5000 Movie Dataset

Conclusion

This proof-of-concept was developed entirely in a local environment using Jupyter notebooks, and its main purpose was to validate the feasibility of applying machine learning to our enriched movie dataset. It’s important to note that this exploration was conducted outside the core project scope, solely as an experimental step to test potential directions.

That said, the results clearly demonstrate that predicting movie popularity based on metadata is achievable and can offer significant value when integrated into a production-ready pipeline.

If explored further, this concept could be fully integrated into the existing serverless architecture using native AWS services. For example:

  • A machine learning model could be trained and deployed using Amazon SageMaker
  • A new Lambda function in the pipeline could invoke the SageMaker endpoint
  • Predicted values could be stored alongside other transformed data in the Gold Layer

This would enable automated predictions at scale, unlocking even deeper insights and analytics possibilities from the movie data.

Released under the MIT License.