English
AnalyticsCreator Congress 2021 - Data Vault 2 0 by Exxeta AG
This webinar shows how Exxeta used AnalyticsCreator to build a Data Vault-based Azure analytics showcase for New York taxi data. The demo covers Azure Blob Storage, Azure Data Factory, Azure SQL Database, Power BI, Parquet files, Data Vault modelling, automatic pipeline generation, and automated tabular model creation.
Questions
- How can AnalyticsCreator support Data Vault modelling on Azure?
- How does AnalyticsCreator generate Azure Data Factory pipelines?
- Can AnalyticsCreator process CSV and Parquet files from Azure Blob Storage?
- How are hash keys created in a Data Vault model?
- How does AnalyticsCreator create a Power BI tabular model?
- What analytics can be built from New York taxi data?
Key Takeaways
- Exxeta uses AnalyticsCreator as a project enabler for Azure analytics projects.
- The showcase uses New York taxi data stored in Azure Blob Storage.
- Source data includes CSV dimension files and Parquet fact files.
- AnalyticsCreator automatically creates Azure SQL Database structures.
- AnalyticsCreator automatically generates Azure Data Factory ETL pipelines.
- AnalyticsCreator automatically creates the Power BI tabular model.
- The showcase uses Data Vault modelling for scalability and parallel loading.
- Hash keys are generated automatically for Data Vault tables and relationships.
- The core layer moves from Data Vault structures to a star schema.
- The data mart layer is used as the basis for the Power BI model.
- The Power BI dashboard analyses revenue, trip count, locations, and trip patterns.
- The demo highlights the time and cost savings of automated data warehouse development.
Transcript
Hello everyone. My name is Timo Strauch, and I work as a manager at Adesso. We are a mid-sized consulting company based in Germany, and today we will show the Data Vault showcase we implemented with AnalyticsCreator.
We will briefly introduce Adesso, explain the Data Vault and Azure architecture, and then show the Power BI dashboard we created.
Adesso was founded in 2005 in Karlsruhe and now has more than 1,000 employees across Germany, Switzerland, and Slovakia.
Our Microsoft Platform and Technology team works across the Microsoft technology stack, including data engineering, data science, software development, .NET, Microsoft 365, SharePoint, Teams, SQL Server, Azure, Azure Synapse, analytics, and machine learning.
We use AnalyticsCreator as a project enabler because it helps us build structured solutions quickly for our customers.
We created this showcase to demonstrate which Azure technologies we use and which best practices we apply in modern data warehouse projects.
The goal was to pull structured data from Azure Blob Storage, integrate it through Azure Data Factory, store it in Azure SQL Database, and visualise the results in Power BI.
Because we expect to process larger data volumes in future, we chose Data Vault as the modelling approach.
Data Vault is a scalable data warehousing technique. It allows a data warehouse to grow incrementally and reduces the effort needed when adding new data sources.
The main components are hubs, links, satellites, and reference tables. Hubs store stable business keys. Links connect hubs and represent relationships or transactions. Satellites store descriptive attributes and metadata.
Data Vault is strong for integration and lineage, but it is not usually ideal for direct reporting. That is why we build a data mart on top, where hubs and satellites can become dimensions, and links and link satellites can become facts.
For our showcase, we used the New York taxi dataset. It contains yellow taxi trip data, including trip duration, cost, pickup and drop-off locations, date and time, passenger count, distance, total amount, and tips.
We wanted to analyse peak times, popular pickup locations, revenue, and trip behaviour. Predictive analytics was also explored, but today we focus on the descriptive analytics part.
My name is Natalia, and I will walk through the implementation.
We used Azure Blob Storage for the source files, Azure Data Factory for the ETL process, Azure SQL Database for the data warehouse, and Power BI Services for the tabular model and dashboard.
The important point is that AnalyticsCreator created the database structure, ETL pipelines, and tabular model automatically.
In AnalyticsCreator, we first created a connector to Azure Blob Storage. Then we defined the sources by selecting the connector, file paths, and file structures.
The dimension data is stored in CSV files, and the fact data is stored in Parquet files. AnalyticsCreator imported the file structures automatically. We only had to define the primary keys.
Using the Data Warehouse Wizard, AnalyticsCreator created the import tables, views, and hash keys required for the Data Vault model.
The staging layer loads the source data quickly into the database. The persisted staging layer adds historization using Valid From and Valid To fields.
In the core layer, we move from the Data Vault model towards a star schema. We use views on top of the Data Vault structures and define the keys used to join facts and dimensions.
The data mart layer becomes the tabular model for Power BI. We created separate pickup and drop-off location dimensions, as well as date and time dimensions. We also hid technical key columns and created measures such as revenue and revenue per distance.
During deployment, AnalyticsCreator created the Azure SQL Database structure automatically. It also generated the Azure Data Factory pipelines.
The main pipeline loads data from the source files into the import tables and then runs historization. Because we use Data Vault, dimension and fact processes can run in parallel.
AnalyticsCreator also created the Power BI dataset. On top of that dataset, we built a dashboard showing revenue, trip count, revenue per distance, daily trip patterns, popular pickup locations, and map-based location analysis.
In my view, AnalyticsCreator is powerful and relatively easy to use. It provides many predefined functions and automatically creates the database, ETL pipelines, and tabular model. This saves time and reduces manual effort.
During the Q&A, we discussed data cleaning. In this showcase, we did not perform much cleaning because the dimension files were simple, and the fact table was used mainly for demonstration. Cleaning unusual distances, unusual rates, or other outliers would be a useful extension for a future version.