Building a Modern Data Pipeline for NYC Taxi Data with dbt
Introduction
In data engineering, turning raw data into actionable insights requires robust pipelines and well-structured workflows. In this post, I'll share a project I've been working on that uses dbt (data build tool) to transform NYC taxi ride data into a reliable analytics foundation.
The Project: NYC Taxi Rides Analytics
This project creates a complete analytics engineering workflow for NYC taxi data using dbt Cloud and Google BigQuery. The pipeline transforms raw taxi trip data into well-modeled, tested, and documented datasets ready for analysis.
Key Components
1. Data Source & Staging
The project starts with raw NYC taxi trip data for yellow taxis, green taxis, and for-hire vehicles (FHV), stored in BigQuery. Using dbt's staging models, we:
Transform raw data into consistent, cleaned views
Apply proper data typing and field standardization
Create surrogate keys for reliable record identification
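To make the staging steps above concrete, here is a minimal Python sketch of what a staging model does conceptually: rename fields, cast them to proper types, and derive a surrogate key by hashing the identifying columns. In the project itself this logic lives in dbt SQL models, so treat this as an illustration only. The raw column names (VendorID, lpep_pickup_datetime, fare_amount), the use of MD5, the null placeholder, and the helper names are all assumptions, not taken from the project.

```python
import hashlib
from datetime import datetime

# Hypothetical placeholder so that a missing value and an empty string
# do not hash to the same key.
NULL_PLACEHOLDER = "_null_"


def surrogate_key(*fields):
    """Hash the identifying fields into one deterministic key."""
    parts = [NULL_PLACEHOLDER if f is None else str(f) for f in fields]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


def stage_trip(raw):
    """Rename, type-cast, and key one raw record (all values arrive as strings)."""
    vendor_id = int(raw["VendorID"]) if raw.get("VendorID") else None
    pickup = datetime.strptime(raw["lpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "trip_id": surrogate_key(vendor_id, pickup.isoformat(sep=" ")),
        "vendor_id": vendor_id,
        "pickup_datetime": pickup,
        "fare_amount": float(raw["fare_amount"]),
    }


# Made-up sample row, for illustration only.
raw_row = {
    "VendorID": "2",
    "lpep_pickup_datetime": "2019-01-01 00:10:16",
    "fare_amount": "7.5",
}
staged = stage_trip(raw_row)
```

Because the key depends only on the input fields, re-running the staging step on the same raw row always yields the same trip_id, which is what makes it usable for uniqueness tests downstream.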
2. Core Data Models
The heart of the project lies in its core models:
fact_trips: A unified view of all taxi trips with additional metadata
dim_zones: A dimension table for NYC taxi zone information
dm_monthly_zone_revenue: A data mart for analyzing revenue patterns
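The relationship between these three models can be sketched in plain Python. This is a rough analogy under stated assumptions, not the project's SQL: the field names, the sample rows, the choice to drop trips whose pickup zone is unknown, and the grouping by zone, month, and service type are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Stand-in for dim_zones, keyed by location id (made-up rows).
zones = {
    1: {"borough": "Manhattan", "zone": "Zone A"},
    2: {"borough": "Queens", "zone": "Zone B"},
}

# Stand-ins for the staged trip data (made-up rows).
staged_trips = {
    "green": [
        {"pickup_locationid": 1, "pickup_datetime": datetime(2019, 1, 5, 8, 0), "total_amount": 12.5},
    ],
    "yellow": [
        {"pickup_locationid": 1, "pickup_datetime": datetime(2019, 1, 9, 9, 30), "total_amount": 7.5},
        {"pickup_locationid": 1, "pickup_datetime": datetime(2019, 1, 20, 18, 0), "total_amount": 2.5},
        {"pickup_locationid": 2, "pickup_datetime": datetime(2019, 2, 1, 7, 15), "total_amount": 20.0},
        {"pickup_locationid": 99, "pickup_datetime": datetime(2019, 2, 2, 7, 15), "total_amount": 5.0},
    ],
}


def build_fact_trips(trips_by_service, zones):
    """Union all services, tag each trip with its service type, and attach zone info."""
    fact = []
    for service_type, trips in trips_by_service.items():
        for trip in trips:
            zone = zones.get(trip["pickup_locationid"])
            if zone is None:
                continue  # assumption: trips with an unknown zone are dropped
            fact.append({**trip, "service_type": service_type, "pickup_zone": zone["zone"]})
    return fact


def monthly_zone_revenue(fact):
    """Sum revenue per (zone, month, service type), like a small data mart."""
    revenue = defaultdict(float)
    for trip in fact:
        month = trip["pickup_datetime"].strftime("%Y-%m")
        revenue[(trip["pickup_zone"], month, trip["service_type"])] += trip["total_amount"]
    return dict(revenue)


fact_trips = build_fact_trips(staged_trips, zones)
revenue = monthly_zone_revenue(fact_trips)
```

The data mart is built only from the fact model, never from the raw sources, which mirrors how dbt models reference each other in layers.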
3. Testing & Documentation
Data quality and documentation are handled through:
Column-level tests for uniqueness, null values, and data ranges
Custom tests for business logic validation (e.g., positive_values test)
Comprehensive documentation generated automatically with dbt docs
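A dbt test is essentially a query that returns the rows violating a rule, and the test passes when no rows come back. The Python sketch below mimics that idea for the kinds of checks listed above. It is not the project's test code: the sample rows are made up, and treating zero as a failure in positive_values is an assumption about how that custom test is defined.

```python
from collections import Counter


def not_null(rows, column):
    """Return the rows where the column is missing."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return the rows whose non-null value appears more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [r for r in rows if counts.get(r.get(column), 0) > 1]


def positive_values(rows, column):
    """Return the rows where the value is zero or negative (assumed rule)."""
    return [r for r in rows if r.get(column) is not None and r[column] <= 0]


# Made-up sample data containing one duplicate key, one null, and one bad fare.
trips = [
    {"trip_id": "a1", "fare_amount": 7.5},
    {"trip_id": "a1", "fare_amount": 12.0},
    {"trip_id": None, "fare_amount": 3.0},
    {"trip_id": "c3", "fare_amount": -4.0},
]

# Each check maps to its failing rows; an empty list means the check passed.
failures = {
    "unique_trip_id": unique(trips, "trip_id"),
    "not_null_trip_id": not_null(trips, "trip_id"),
    "positive_fare_amount": positive_values(trips, "fare_amount"),
}
failed_checks = sorted(name for name, rows in failures.items() if rows)
```

Returning the failing rows rather than a bare true/false makes it easy to see exactly which records broke a rule.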
4. Deployment & CI/CD
The project incorporates CI/CD practices with GitHub integration and scheduled job runs in dbt Cloud. This allows for:
Version-controlled data transformations
Automated testing on pull requests
Scheduled refreshes of production data
Why This Matters
This project demonstrates how modern data teams can:
Move beyond ad hoc SQL scripts to version-controlled, modular data transformations
Implement testing that catches data quality issues before they impact analysis
Create self-documenting data assets that make analytics accessible
Bridge the gap between data engineering and analytics
Next Steps
To expand this project, I'm considering the following:
Adding machine learning models for trip duration prediction
Creating additional data marts for specific business questions
Implementing dbt metrics for standardized KPI tracking
Conclusion
Using dbt for NYC taxi data analytics showcases the power of treating data transformations as code. The resulting pipeline is maintainable, testable, and produces trusted data assets ready for business intelligence tools and further analysis.
Have you worked with dbt or similar tools for your data pipelines? I'd love to hear your experiences in the comments!
This project was built with dbt Cloud, BigQuery, and inspiration from the DataTalksClub community.