Application of data warehousing and data mining in forecasting cancer diseases threats
Independent thesis Advanced level (degree of Master (One Year)), Student thesis
Multidimensional analysis, trend analysis, summaries and drill-downs, the data warehousing methods of choice, provided a rich and detailed perspective on cancer threats along virtually any dimension covered by the data. They made it possible to model cancer risk by age, race, sex and survival chances, among other factors, and to identify the most dangerous and highest-incidence cancers. They revealed how little survival chances and treatment efficiency have increased over the last 30 years and how little early diagnosis has improved, showed trends and shifts in those trends as well as changes in cancer risk related to place of residence, and emphasized the importance of mitigating risk through screening and a healthy lifestyle. These methods also turned out to be easy to use, requiring less computer science knowledge than one might expect. With little support from IT staff, oncology professionals can readily benefit from vast data sets and the analytical power applied to them. Data mining algorithms evaluated on data for melanoma of the skin managed to extract what is already known in the domain. Therefore, when applied by oncology professionals to less generic data, data mining can be expected to have the potential to extend experts' knowledge. Neural networks, decision trees and clustering showed higher prediction accuracy than Naive Bayes classifiers and association rules, but it is advisable to merge results from many algorithms. Findings of individual algorithms are often disjoint and, when combined, reveal more despite the algorithms' varying predictive performance. Analysis of the caCORE system and a system integration experiment showed that building a large-scale oncological data system that integrates distributed data is extremely complex. Integrating with it requires considerable effort to understand its structures, prepare data mappings and implement integration procedures. Close cooperation between IT and oncology professionals is mandatory.
Suggestions were made to simplify the generic caCORE data model (ontology) or split it into smaller parts, and to expose as much integration functionality as possible through web interfaces or encapsulated classes, in order to decrease the complexity of the process. With these adjustments, caCORE would be fully feasible and could be considered the future of applying data warehousing and data mining techniques in oncology, providing a distributed dataset compliant with a common model and leveraging the power of the research community.
The thesis evaluates the application of data warehousing and data mining analysis to the SEERStat surveillance and epidemiology oncological database, as well as aspects of the future development of integrated and extensible data systems for the oncology domain, based on an integration experiment with the caCORE project. The thesis presents the results of the analysis of cancer data with conclusions and advice, the potential of this specific analytical application, and conclusions and guidelines on how future, more powerful oncological analytical systems could be built.
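The abstract's advice to merge results from several algorithms can be read in more than one way. The sketch below illustrates one possible reading, a simple majority vote over the class labels predicted by different models. It is not taken from the thesis: the function name `majority_vote`, the input layout and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions_by_model):
    """Combine per-record class labels from several models by majority vote.

    predictions_by_model: dict mapping a (hypothetical) model name to a list
    of predicted labels, one label per record. All lists must have the same
    length. Ties are broken in favour of the label predicted by the model
    that appears first in the dict.
    """
    models = list(predictions_by_model)
    if not models:
        return []
    lengths = {len(predictions_by_model[m]) for m in models}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of records")

    combined = []
    # zip(*...) walks the records, yielding one label per model each time
    for labels in zip(*(predictions_by_model[m] for m in models)):
        counts = Counter(labels)
        top = max(counts.values())
        # first label, in model order, whose count equals the highest count
        combined.append(next(label for label in labels if counts[label] == top))
    return combined
```

A vote like this only merges predictions. Combining the disjoint findings themselves (for example rules from one algorithm and clusters from another) would still require interpretation by a domain expert.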
Place, publisher, year, edition, pages
2008. 113 p.
cancer, risk, assessment, data, analysis, warehouse, warehousing, mining, SEER, SEERStat, caCORE, surveillance, epidemiology
Computer Science Human Computer Interaction
Identifiers
URN: urn:nbn:se:bth-2943
Local ID: oai:bth.se:arkivexBCB1B8215A8E597FC12574F300458677
OAI: oai:DiVA.org:bth-2943
DiVA: diva2:830238