Digitala Vetenskapliga Arkivet

Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Visualizing Data Lineage & Automating Documentation for Data Products
Blekinge Institute of Technology, Faculty of Computing.
2024 (English)Independent thesis Basic level (university diploma), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

Scania's energy domain data analytics team is responsible for delivering and ensuring the quality of energy control-related time series data (TSD) from its battery electric vehicles. The team develops dashboards that visualize the TSD, which are used by various users to continuously monitor vehicle performance in real-world scenarios.

The TSD originates from the data warehouse and flows to the dashboard. If dashboards reference any non-existent database, schema, table, or column in their queries, the corresponding SQL queries fail, causing disruptions. Similar issues can arise from upstream changes, such as removing or modifying a column name that is used in a dashboard query. Additionally, when dashboard queries are updated, the associated documentation often becomes outdated, resulting in stakeholders not having access to accurate or up-to-date information. The team lacks tools to track changes in the data warehouse database or the SQL queries used in dashboards.

The problem is solved by implementing a system that parses SQL queries from various dashboards to generate JSON-formatted lineage data in a hierarchical structure. The data lineage is visualized using this JSON-formatted data, helping users gain a better understanding of dependencies and the impact of upstream changes on downstream data products and users. Additionally, the JSON-formatted data is compared with the data warehouse's database structure to identify failed SQL queries in dashboards. The failed parts of the dashboard queries are highlighted in the visualized data lineage, enabling users to easily pinpoint the problematic sections of the queries. Moreover, whenever changes are detected in the JSON-formatted lineage data, the corresponding dashboard documentation is automatically updated, ensuring that different stakeholders always have access to up-to-date information.

In this project, a Python backend is developed that parses SQL queries from various dashboard panels and generates an abstract syntax tree (AST) for each query. The ASTs are used to create a JSON file representing the lineage data in a hierarchical structure. This JSON file provides information about which column is used from which table, which table is used from which schema, and which schema is used from which database. The JSON-formatted data is compared with the data warehouse's database structure to identify failed SQL queries in dashboards. In addition, a frontend is developed, primarily using Vue.js and D3.js, to visualize the data lineage, highlight the failed parts of the dashboard queries, and host documentation for the dashboards. Whenever an AST is modified, the corresponding documentation page is automatically updated to reflect the changes.

Place, publisher, year, edition, pages
2024. , p. 34
Keywords [en]
SQL query parser, Abstract Syntax Tree, data lineage, vue, D3
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:bth-27405OAI: oai:DiVA.org:bth-27405DiVA, id: diva2:1945957
External cooperation
Scania CV AB
Subject / course
PA1438 Självständigt arbete Webbprogrammering
Educational program
PAGWG Webbprogrammering
Supervisors
Examiners
Available from: 2025-03-24 Created: 2025-03-19 Last updated: 2025-03-24Bibliographically approved

Open Access in DiVA

fulltext(735 kB)69 downloads
File information
File name FULLTEXT01.pdfFile size 735 kBChecksum SHA-512
fb1311a5dbd7f16417c65d086827a18999c3724bec54ee4efa52a984926d259e31dfd7952883eec0fe919794df29c3a22d74f41462c861d164bf8dd5cbc6a3b4
Type fulltextMimetype application/pdf

By organisation
Faculty of Computing
Computer and Information Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 69 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 50 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf