A Guide to Transformational Modeling with a BigQuery Dataset
Norapath Arjanurak
data-modeling
Part 1: BigQuery Information Schema: Primary Key and Foreign Key
Part 3: A Guide to Transformational Modeling with a BigQuery Dataset
An overview of transformational modeling
In Part 2, we showed how to define tables and constraints using BigQuery DDL and how to visualize a schema using the Datascale platform. In Part 3, we will continue building on the ER diagram from Part 2 and dive into data transformation, focusing on the different modeling layers: base, staging, intermediate, and fact/dimension tables. We will also show how to illustrate the lineage of data as it flows through these transformations.
Based on the diagram above, transformational modeling involves creating different layers of models to process raw data into meaningful business insights. In this post, we will cover the transformation using modular data modeling with dbt. The main layers are:
1. Staging Layer: the staging layer is responsible for cleaning and filtering the raw data to prepare it for further processing.
2. Intermediate Layer: the intermediate layer aggregates and standardizes data from the staging models, preparing it for the dimension and fact models.
3. Fact and Dimension Layer: the fact tables contain quantitative data and the foreign keys that link to the dimension tables. The dimension tables contain descriptive data that put the facts in context from various perspectives (dimensions).
How each layer is implemented differs from organization to organization, depending on the specific business questions the data needs to answer.
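In a dbt project, these layers usually map onto separate model folders. The layout below is one common convention; the model and folder names are illustrative, and the structure in the example repository linked later may differ:

```
models/
├── staging/          # base and staging models, roughly one per source table
│   ├── base_orders.sql
│   └── stg_orders.sql
├── intermediate/     # joins and aggregations across staging models
│   └── int_user_orders.sql
└── marts/            # fact and dimension tables used for reporting
    ├── dim_users.sql
    └── fct_sales.sql
```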
Learn more: efficient ways to organize data models.
Benefits of the modular data modeling
With a modular approach, it is easier for the team to select and extract data at specific points in the pipeline. It gives analytics engineers a strong foundation, because every step of the transformation workflow is easy to locate and reuse. By dividing the process into these four layers, we can cover business usage and analytical tasks across various dimensions.
Modeling TheLook E-commerce dataset
Before we start, we have provided the GitHub repository for the following implementation here: https://github.com/datascalehq/example-thelook-ecommerce. All of the following transformations are based on TheLook E-commerce dataset, as illustrated in the ER diagram below.
Now, let's go through the implementation of each layer using TheLook E-commerce dataset. 🎉
1. Staging layer
The staging layer consists of the base and staging models. The base models select all columns from the data source (blue columns in the diagram), while the staging models select specific columns and filter certain rows (green columns).
1.1. Base models
The base models are thin references to the raw tables ingested directly from the source, for example:
order_items:
bigquery-public-data.thelook_ecommerce.order_items
events:
bigquery-public-data.thelook_ecommerce.events
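A base model is typically just a thin pass-through over the raw table. The sketch below assumes a dbt source named thelook_ecommerce that points at the public dataset; the source and model names are illustrative:

```sql
-- models/staging/base_order_items.sql
-- Thin pass-through over the raw table; no transformation applied here.
select *
from {{ source('thelook_ecommerce', 'order_items') }}
```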
1.2. Staging models
The staging models standardize the raw data:
Distribution center: data about distribution centers, including their ID, name, latitude, and longitude.
Inventory items: data on inventory items, including item IDs, product IDs, creation and sale dates, and costs.
Orders & Order Items: order information such as order ID, user ID, status, and key timestamps (creation, shipping, delivery, return); a sample staging model is sketched after this list.
Products: product data, including product ID, name, category, brand, retail price, cost, department, SKU, and associated distribution center ID.
Users: user information, including user ID, first and last names, email, age, gender, location details (state, city, country), traffic source, and creation date.
Events: event data like event ID, user ID, session information, timestamps, IP address, location (city, state, postal code), browser, traffic source, URI, and event type.
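As an example, a staging model for orders could select only the columns needed downstream and apply a light filter. This is a minimal sketch: the column names follow the public thelook_ecommerce dataset, and the upstream base model name base_orders is assumed:

```sql
-- models/staging/stg_orders.sql
-- Keep only the columns we need and drop rows without a status (illustrative filter).
select
    order_id,
    user_id,
    status,
    created_at,
    shipped_at,
    delivered_at,
    returned_at
from {{ ref('base_orders') }}
where status is not null
```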
In this phase, we took all 7 tables contained in the source database and created model files that pull the data directly from the sources without any transformation. This acts as a defensive layer that protects the pipeline against changes made directly to the main database, which will be beneficial in the following phase, where heavier transformations occur.
2. Mart layer
The mart phase is an important stage of the data transformation, where data is prepared for analysis and reporting. It is divided into the intermediate and fact/dimension layers.
2.1. Intermediate models
The intermediate layer aggregates and standardizes data from the staging models. We build the following models:
1. int_order_sales: It contains order details combined with user information, useful for delivery-related analysis.
2. int_product_inventory: It contains information about the products that the company has, including price and storage location.
3. int_user_orders: It contains user details along with historical data on what and when orders were made, useful for marketing purposes like identifying peak purchasing times.
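For instance, int_user_orders could join users to their orders and aggregate the order history per user. This is a minimal sketch that assumes staging models named stg_users and stg_orders with the columns described above:

```sql
-- models/intermediate/int_user_orders.sql
-- One row per user with basic order history, built from the staging models.
select
    u.user_id,
    u.first_name,
    u.last_name,
    count(o.order_id)  as total_orders,
    min(o.created_at)  as first_order_at,
    max(o.created_at)  as latest_order_at
from {{ ref('stg_users') }} as u
left join {{ ref('stg_orders') }} as o
    on o.user_id = u.user_id
group by 1, 2, 3
```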
2.2. Fact and Dimension models
Building on the base, staging, and intermediate models above, this layer organizes the data to facilitate analysis and reporting. By combining these layers, we can query, filter, and aggregate only the data needed to derive business insight. The following is our implementation of the fct/dim tables.
1. dim_orders: It tracks the order details for every order from users, useful for marketing purposes to identify which products generate the most transactions.
2. dim_products: It provides product details which can be used to track product information, such as engagement among users and delivery times based on location and past shipping data.
3. dim_inventory_items: It contains information about the inventory, including product prices and storage locations.
4. dim_users: It contains user details along with historical data on orders, useful for marketing purposes to determine peak purchasing times.
5. dim_event_user: It tracks events and participating users, helping to identify which events engage users the most and which products sell best during events.
6. fct_sales: A fact table that tracks every piece of sales information, useful for customer service and parcel tracking.
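To make this concrete, fct_sales could be modeled as one row per order line, carrying the measures (such as sale price) and the foreign keys that join to the dimension tables. The sketch below assumes a staging model named stg_order_items and illustrative column names; the SQL in the repository may differ:

```sql
-- models/marts/fct_sales.sql
-- One row per order line, with foreign keys into the dimension tables.
select
    oi.order_item_id,
    oi.order_id,
    oi.user_id,
    oi.product_id,
    oi.inventory_item_id,
    oi.sale_price,
    oi.created_at as sold_at,
    oi.status
from {{ ref('stg_order_items') }} as oi
```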
By implementing these fact and dimension tables, we create a data transformation workflow that effectively supports detailed business analysis. It also gives data-related roles specific, well-defined sources to extract from and work with, making data management easier.
To conclude, the transformational modeling process we discussed ensures that data is organized and processed efficiently through the various layers: base, staging, intermediate, and fact/dimension tables. This approach creates an effective structure that allows for meaningful analysis, driving business insights and decisions.
Visualize the lineage in data model
To visualize each data model and understand the relationships between different tables, we use the Datascale platform (the same as in Part 2). The ER & Lineage diagrams generated on Datascale provide a clear view of how data flows through the models and how the tables relate to each other.
In the Datascale platform, you can add a new SQL model and paste the DDL/SQL statements of your tables into the workspace. Then, click on the “Diagram” button on the right panel. It will automatically generate the data lineage diagram from this data model.
We will visualize the dim_products table as an example, using the SQL code provided in the GitHub repository.
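For reference, the statement pasted into the workspace would look roughly like the following. This is a simplified sketch of a dim_products definition, not the exact SQL from the repository; the schema and column names are illustrative:

```sql
-- Simplified dim_products definition pasted into the Datascale workspace.
create table marts.dim_products as
select
    p.product_id,
    p.product_name,
    p.category,
    p.brand,
    p.retail_price,
    p.cost,
    p.department,
    p.distribution_center_id,
    dc.distribution_center_name
from staging.stg_products as p
left join staging.stg_distribution_centers as dc
    on dc.distribution_center_id = p.distribution_center_id
```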
This visualization clearly illustrates the relationships between the tables referenced in the query. The display is also at column level, which makes the data dependencies easy to trace.
Conclusion
By following this transformational modeling process, we ensure that the data is clean, organized, and ready for comprehensive analysis and reporting. The ER diagram generated in the Datascale platform provides a clear visualization of the relationships and structure of the query. These implementations are good practices for analytics engineers to create an effective data transformation workflow.
What's next?
This concludes Part 3 of our blog series on transformational modeling and data lineage. In the next part, we will explore how to create actionable business insights using these transformed data models and implement a real business workflow.