Visualizing Data Lineage & Automating Documentation for Data Products
2024 (English)Independent thesis Basic level (university diploma), 10 credits / 15 HE credits
Student thesis
Abstract [en]
Scania's energy domain data analytics team is responsible for delivering and ensuring the quality of energy control-related time series data (TSD) from its battery electric vehicles. The team develops dashboards that visualize the TSD, which are used by various users to continuously monitor vehicle performance in real-world scenarios.
The TSD originates from the data warehouse and flows to the dashboard. If dashboards reference any non-existent database, schema, table, or column in their queries, the corresponding SQL queries fail, causing disruptions. Similar issues can arise from upstream changes, such as removing or modifying a column name that is used in a dashboard query. Additionally, when dashboard queries are updated, the associated documentation often becomes outdated, resulting in stakeholders not having access to accurate or up-to-date information. The team lacks tools to track changes in the data warehouse database or the SQL queries used in dashboards.
The problem is solved by implementing a system that parses SQL queries from various dashboards to generate JSON-formatted lineage data in a hierarchical structure. The data lineage is visualized using this JSON-formatted data, helping users gain a better understanding of dependencies and the impact of upstream changes on downstream data products and users. Additionally, the JSON-formatted data is compared with the data warehouse's database structure to identify failed SQL queries in dashboards. The failed parts of the dashboard queries are highlighted in the visualized data lineage, enabling users to easily pinpoint the problematic sections of the queries. Moreover, whenever changes are detected in the JSON-formatted lineage data, the corresponding dashboard documentation is automatically updated, ensuring that different stakeholders always have access to up-to-date information.
In this project, a Python backend is developed that parses SQL queries from various dashboard panels and generates an abstract syntax tree (AST) for each query. The ASTs are used to create a JSON file representing the lineage data in a hierarchical structure. This JSON file provides information about which column is used from which table, which table is used from which schema, and which schema is used from which database. The JSON-formatted data is compared with the data warehouse's database structure to identify failed SQL queries in dashboards. In addition, a frontend is developed, primarily using Vue.js and D3.js, to visualize the data lineage, highlight the failed parts of the dashboard queries, and host documentation for the dashboards. Whenever an AST is modified, the corresponding documentation page is automatically updated to reflect the changes.
Place, publisher, year, edition, pages
2024. , p. 34
Keywords [en]
SQL query parser, Abstract Syntax Tree, data lineage, vue, D3
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:bth-27405OAI: oai:DiVA.org:bth-27405DiVA, id: diva2:1945957
External cooperation
Scania CV AB
Subject / course
PA1438 Självständigt arbete Webbprogrammering
Educational program
PAGWG Webbprogrammering
Supervisors
Examiners
2025-03-242025-03-192025-03-24Bibliographically approved