
Evolving the Architecture: Unlocking More Value

With a solid, minimal core pipeline in place, it became clear there was room to go further: not by adding complexity for its own sake, but by introducing well-justified features that bring long-term value, scalability, and analytical capability.

Each improvement below was designed with a clear purpose: to make the system more usable, more insightful, and more future-proof.

Parameterized Batch Processing

To improve flexibility and control, the ingestion function was enhanced to support two configurable parameters:

  • top_n: defines how many top movies to select (e.g., top 10, 20, 50)
  • batch_size: defines how many movies to group into each message sent to the SQS queue

For example, top_n = 20 and batch_size = 5 will result in 4 messages in the queue, each containing 5 movies.

This batching mechanism allows the pipeline to handle larger workloads efficiently while keeping downstream processing manageable and scalable.
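
The batching step itself is small. Below is a minimal sketch of how the ingestion handler could read both parameters from the event and fan batches out to SQS; the QUEUE_URL variable, the fetch_top_movies helper, and the handler name are illustrative assumptions, not the project's actual code.

python
import json
import os
import boto3

sqs = boto3.client("sqs")


def fetch_top_movies(top_n):
    # Placeholder for the real ranking lookup; returns the N best-ranked movies.
    movies = [{"imdb_id": f"tt{i:07d}", "rank": i} for i in range(1, 51)]
    return sorted(movies, key=lambda m: m["rank"])[:top_n]


def lambda_handler(event, context):
    top_n = int(event.get("top_n", 10))
    batch_size = int(event.get("batch_size", 5))

    movies = fetch_top_movies(top_n)

    # Slice the top-N list into fixed-size batches: top_n=20 with
    # batch_size=5 yields 4 messages of 5 movies each.
    for start in range(0, len(movies), batch_size):
        batch = movies[start : start + batch_size]
        sqs.send_message(
            QueueUrl=os.environ["QUEUE_URL"],
            MessageBody=json.dumps(
                {"batch_number": start // batch_size, "movies": batch}
            ),
        )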

Value:

  • Fine-grained control over the data volume processed in each execution.
  • Improved fault isolation: fewer items per message make retry logic and error tracing easier.
  • Efficient parallel processing: each Lambda handles a batch, reducing invocation overhead.

This configuration can be adjusted via event input, making the pipeline adaptable to different scenarios — from testing small subsets to processing the full dataset at scale.

Medallion Architecture: Structuring the Data Lifecycle

One of the first improvements was organizing the S3 storage using the Medallion Architecture pattern (Bronze → Silver → Gold), which brings clarity and governance to data processing.

Layer  | Description                                          | Value Delivered
Bronze | Raw enriched data fetched from OMDb (JSON per movie) | Immutable, replayable source
Silver | Cleaned, validated, and normalized dataset           | Reliable, schema-consistent data
Gold   | Aggregated, analytics-ready outputs (e.g., by genre) | Supports BI and machine learning
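
As a sketch, these layers can map to simple S3 prefixes within one data-lake bucket; the key names below are illustrative assumptions, not taken from the project:

python
# Illustrative key layout for a single data-lake bucket (names are assumptions).
BRONZE_KEY = "bronze/movies/{imdb_id}.json"  # raw OMDb payload, written once
SILVER_KEY = "silver/movies/{imdb_id}.json"  # validated, normalized schema
GOLD_KEY = "gold/aggregates/by_genre.json"   # analytics-ready aggregates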

Value:

  • Clear data lineage: Easily track how data evolves across layers.
  • Scalable architecture: Supports growing complexity without changing the core logic.
  • Simplified debugging: Allows partial reprocessing without restarting the entire pipeline.
  • Audit & governance ready: Makes it easy to document and explain data transformations.
  • BI & ML friendly: The Silver layer feeds ML pipelines; the Gold layer powers dashboards.

Fault Tolerance with Retry Logic

To increase system robustness, all external operations were made fault-tolerant using a configurable retry mechanism. Instead of failing on the first error, each interaction now attempts to recover from transient issues gracefully.

This retry system is controlled via two environment variables:

MAX_RETRIES=3
BASE_DELAY_SECONDS=2
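
The mechanism itself isn't shown here, but a minimal sketch driven by those two variables, assuming exponential backoff, could look like this:

python
import logging
import os
import time

logger = logging.getLogger()

MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))
BASE_DELAY_SECONDS = float(os.environ.get("BASE_DELAY_SECONDS", "2"))


def with_retries(operation, *args, **kwargs):
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return operation(*args, **kwargs)
        except Exception as exc:
            if attempt == MAX_RETRIES:
                raise  # out of attempts: surface the error to the caller
            delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, MAX_RETRIES, exc, delay,
            )
            time.sleep(delay)

Each of the operations listed below can then be wrapped in the same way, e.g. with_retries(s3.get_object, Bucket=bucket, Key=key).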

Where Retry Logic Is Applied

  • Reading or writing files in Amazon S3
  • Sending messages to Amazon SQS
  • Fetching data from the OMDb API
  • Accessing secrets from AWS Secrets Manager

Value:

  • Increased reliability: Recovers from intermittent network or service issues.
  • Fails gracefully: Prevents entire pipeline failures due to one bad request.
  • Flexible control: Easily adjustable via environment variables, per environment or workload.

Structured Logging for Observability

As the architecture evolved, it became essential to maintain clear visibility across the growing number of moving parts. To support this, structured and centralized logging was implemented throughout all Lambda functions.

Logs are emitted using Python’s built-in logging system:

python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

imdb_id = "tt0111161"  # example ID; in practice it comes from the SQS message

# Context passed via `extra` is attached to the log record as extra fields.
logger.info("Movie enrichment started", extra={"movie_id": imdb_id})
logger.warning("OMDb API returned incomplete data")
logger.error("Failed to store enriched movie", exc_info=True)  # attaches traceback

Logging Design

  • Logs are emitted in a structured, consistent format
  • Contextual information is always included: movie ID, batch number, retry count, function name
  • Supports standard log levels: DEBUG, INFO, WARNING, ERROR
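
The stock logging setup above prints plain text, so for the contextual fields passed via extra to survive as queryable structure in CloudWatch, a JSON formatter along these lines is needed; this formatter is an assumption for illustration, not code shown in the project:

python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    CONTEXT_FIELDS = ("movie_id", "batch_number", "retry_count")

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "function": record.funcName,
        }
        # Copy over any contextual fields passed via `extra=...`.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

Attaching it once at startup (in Lambda, for example, logger.handlers[0].setFormatter(JsonFormatter())) makes every record above come out as one JSON object per line.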

Value:

  • Faster debugging and root cause analysis.
  • Better monitoring for operational health.
  • Improved developer experience during development and maintenance.

Scheduled Execution with AWS EventBridge

To automate data refresh, the pipeline is scheduled to run once daily using Amazon EventBridge:

  • EventBridge rule triggers the ingestion Lambda every 24 hours.
  • No manual execution required.
  • Easily adjustable cron-like schedule.
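
In SAM template terms (see the IaC section below), the schedule is just an event on the ingestion function; the resource name, handler, and runtime here are illustrative assumptions:

yaml
Resources:
  IngestionFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.12
      Events:
        DailyTrigger:
          Type: Schedule
          Properties:
            Schedule: rate(1 day)  # cron expressions also work, e.g. cron(0 6 * * ? *)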

Value:
Keeps data fresh with an automatic daily refresh, delivering up-to-date insights with no manual operational effort.

Infrastructure as Code (IaC) with AWS SAM

All infrastructure — from Lambda functions to queues and permissions — is defined using AWS SAM (Serverless Application Model).

  • YAML-based templates.
  • Full reproducibility of environments.
  • Easy deployment with a single command (sam deploy).
  • Supports CI/CD and collaboration.

Value:

  • Faster setup.
  • Versioned infrastructure.
  • Seamless team onboarding and environment replication.

Secure Secrets Management

Access to the OMDb API required storing a private key securely. Instead of hardcoding it, we used AWS Secrets Manager, which offers:

  • Encrypted, secure storage for API keys.
  • Role-based access management.
  • Integration with Lambda environment via runtime fetch.
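
A minimal runtime-fetch sketch, assuming a secret name and JSON shape that are hypothetical here:

python
import json
import boto3

SECRET_NAME = "omdb/api-key"  # hypothetical secret name

secrets = boto3.client("secretsmanager")
_cached_key = None


def get_omdb_api_key():
    """Fetch the OMDb API key once and cache it across warm invocations."""
    global _cached_key
    if _cached_key is None:
        response = secrets.get_secret_value(SecretId=SECRET_NAME)
        _cached_key = json.loads(response["SecretString"])["api_key"]
    return _cached_key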

Value:

  • Compliance with security best practices.
  • Credentials can be rotated or updated without redeploying code.

Automated Testing Strategy

To ensure the pipeline remains reliable, maintainable, and safe to evolve, a robust testing strategy was implemented across two levels:

Unit Tests

Designed to validate individual components and business logic in isolation:

  • Extracting the top N movies based on ranking.
  • Validating the structure and content of the OMDb API response.
  • Correctly merging and formatting enriched movie data.
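
For example, the top-N extraction can be pinned down with a plain pytest case; extract_top_movies is an assumed helper name standing in for the real function:

python
def extract_top_movies(movies, top_n):
    """Stand-in for the real helper: best-ranked first, truncated to top_n."""
    return sorted(movies, key=lambda m: m["rank"])[:top_n]


def test_extract_top_movies_returns_best_ranked():
    movies = [
        {"title": "B", "rank": 2},
        {"title": "A", "rank": 1},
        {"title": "C", "rank": 3},
    ]
    top = extract_top_movies(movies, top_n=2)
    assert [m["title"] for m in top] == ["A", "B"]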

Integration Tests

Simulate the entire pipeline workflow to verify components interact correctly:

  • End-to-end flow: S3 → Lambda → SQS → Lambda → S3.
  • Mocked OMDb API responses to test enrichment logic under realistic conditions.
  • Ensured proper batching, message metadata, and final storage in S3.
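
A sketch of what such a test can look like using moto's in-memory AWS stand-ins; the bucket, queue, and payload names are illustrative, and the inline batching stands in for the real ingestion handler:

python
import json

import boto3
from moto import mock_aws  # moto >= 5; older releases expose mock_s3 / mock_sqs


@mock_aws
def test_ingestion_batches_reach_the_queue():
    # In-memory stand-ins: no real AWS resources are touched.
    s3 = boto3.client("s3", region_name="us-east-1")
    sqs = boto3.client("sqs", region_name="us-east-1")
    s3.create_bucket(Bucket="movies-bronze")
    queue_url = sqs.create_queue(QueueName="movie-batches")["QueueUrl"]

    # Seed a rankings file, then run a stand-in for the ingestion step.
    movies = [{"title": f"Movie {i}", "rank": i} for i in range(1, 11)]
    s3.put_object(Bucket="movies-bronze", Key="rankings.json", Body=json.dumps(movies))

    data = json.loads(
        s3.get_object(Bucket="movies-bronze", Key="rankings.json")["Body"].read()
    )
    for start in range(0, len(data), 5):
        sqs.send_message(
            QueueUrl=queue_url, MessageBody=json.dumps(data[start : start + 5])
        )

    # Two batches of five movies should be waiting in the queue.
    received = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    assert len(received["Messages"]) == 2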

Value:

  • Boosted confidence during refactoring and scaling.
  • Enabled faster development cycles through early bug detection.
  • Reduced risk of regressions with every deployment.

BI with Amazon QuickSight

To extract insights from the enriched dataset, the project was extended with data visualization dashboards built in Amazon QuickSight.

  • Data loaded directly from the Gold layer.
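
As an illustration of the kind of Gold-layer dataset a dashboard consumes, a per-genre aggregate might be produced like this (the field names are assumptions):

python
from collections import defaultdict


def aggregate_by_genre(movies):
    """Roll enriched movies up into a per-genre summary for dashboards."""
    stats = defaultdict(lambda: {"count": 0, "rating_sum": 0.0})
    for movie in movies:
        for genre in movie.get("genres", []):
            stats[genre]["count"] += 1
            stats[genre]["rating_sum"] += float(movie.get("imdb_rating", 0.0))
    return [
        {
            "genre": genre,
            "movies": s["count"],
            "avg_rating": round(s["rating_sum"] / s["count"], 2),
        }
        for genre, s in stats.items()
    ]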

Value:

  • Non-technical users gain access to movie insights without needing to query raw data or use code.

Summary of Advanced Features

Feature                    | Purpose                                  | Value Delivered
Batch Processing & Control | Parameterized ingestion and batching     | Scalability, flexibility
Medallion Architecture     | Structured multi-layer data lifecycle    | Reusability, lineage, auditability
Fault Tolerance            | Retry logic for unstable operations      | Reliability, graceful failure recovery
Structured Logging         | Consistent logging across all functions  | Observability, simplified debugging
Daily Scheduling           | Automation using EventBridge             | Fresh data, zero manual effort
Infrastructure as Code     | Define all AWS resources via SAM         | Reproducibility, CI/CD readiness
Secrets Management         | Secure handling of API keys              | Compliance, credential isolation
Automated Testing          | Unit and integration test coverage       | Stability, confidence in code quality
BI Dashboards              | Analytics-ready outputs with QuickSight  | Business insights, data-driven decisions
