Digitala Vetenskapliga Arkivet

1 - 50 of 86 hits
  • 1.
    Asker, Lars
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Karlsson, Isak
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Papapetrou, Panagiotis
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Mining Candidates for Adverse Drug Interactions in Electronic Patient Records (2014). In: PETRA '14: Proceedings of the 7th International Conference on Pervasive Technologies Related to Assistive Environments, New York: ACM Press, 2014. Conference paper (Refereed)
    Abstract [en]

    Electronic patient records provide a valuable source of information for detecting adverse drug events. In this paper, we explore two different but complementary approaches to extracting useful information from electronic patient records with the goal of identifying candidate drugs, or combinations of drugs, to be further investigated for suspected adverse drug events. We propose a novel filter-and-refine approach that combines sequential pattern mining and disproportionality analysis. The proposed method is expected to identify groups of possibly interacting drugs suspected of causing certain adverse drug events. We perform an empirical investigation of the proposed method using a subset of the Stockholm electronic patient record corpus. The data used in this study consist of all diagnoses and medications for a group of patients diagnosed with at least one heart-related diagnosis during the period 2008–2010. The study shows that the method indeed is able to detect combinations of drugs that occur more frequently for patients with cardiovascular diseases than for patients in a control group, providing opportunities for finding candidate drugs that cause adverse drug effects through interaction.
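    Disproportionality analysis, mentioned in the abstract above, compares how often a drug (or drug combination) co-occurs with an adverse event against what independence would predict. As an illustration only, here is the proportional reporting ratio (PRR), one standard disproportionality statistic; the abstract does not say which specific measure the paper employs:

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 contingency table of patient records:
       a: exposed to the drug (combination), event observed
       b: exposed, no event
       c: not exposed, event observed
       d: not exposed, no event
    A PRR well above 1 flags the combination for follow-up."""
    return (a / (a + b)) / (c / (c + d))

# Toy numbers: the event occurs in 30 of 100 exposed patients
# but only 10 of 100 unexposed patients.
print(proportional_reporting_ratio(30, 70, 10, 90))  # 3.0
```

    In a filter-and-refine setting, a score like this would be computed for the drug combinations surviving the sequential-pattern-mining filter.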

  • 2.
    Asker, Lars
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Papapetrou, Panagiotis
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Persson, Hans
    Identifying Factors for the Effectiveness of Treatment of Heart Failure: A Registry Study (2016). In: IEEE 29th International Symposium on Computer-Based Medical Systems: CBMS 2016, IEEE Computer Society, 2016, pp. 205-206. Conference paper (Refereed)
    Abstract [en]

    An administrative health register containing health care data for over 2 million patients will be used to search for factors that can affect the treatment of heart failure. In the study, we will measure the effects of employed treatment for various groups of heart failure patients, using different measures of effectiveness. Significant deviations in effectiveness of treatments of the various patient groups will be reported, and factors that may help explain the effect of treatment will be analyzed. Identification of the most important factors that may help explain the observed deviations between the different groups will be derived through generation of predictive models, for which variable importance can be calculated. The findings may affect recommended treatments as well as highlight deviations from national guidelines.

  • 3.
    Asker, Lars
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Papapetrou, Panagiotis
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Learning from Swedish Healthcare Data (2016). In: Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Association for Computing Machinery (ACM), 2016, article id 47. Conference paper (Refereed)
    Abstract [en]

    We present two ongoing projects aimed at learning from health care records. The first project, DADEL, is focusing on high-performance data mining for detecting adverse drug events in healthcare, and uses electronic patient records covering seven years of patient record data from the Stockholm region in Sweden. The second project is focusing on heart failure and on understanding the differences in treatment between various groups of patients. It uses a Swedish administrative health register containing health care data for over two million patients.

  • 4.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Estimating Class Probabilities in Random Forests (2007). In: Proceedings of the Sixth International Conference on Machine Learning and Applications, IEEE, 2007, pp. 211-216. Conference paper (Refereed)
    Abstract [en]

    For both single probability estimation trees (PETs) and ensembles of such trees, commonly employed class probability estimates correct the observed relative class frequencies in each leaf to avoid anomalies caused by small sample sizes. The effect of such corrections in random forests of PETs is investigated, and the use of the relative class frequency is compared to using two corrected estimates, the Laplace estimate and the m-estimate. An experiment with 34 datasets from the UCI repository shows that estimating class probabilities using relative class frequency clearly outperforms both using the Laplace estimate and the m-estimate with respect to accuracy, area under the ROC curve (AUC) and Brier score. Hence, in contrast to what is commonly employed for PETs and ensembles of PETs, these results strongly suggest that a non-corrected probability estimate should be used in random forests of PETs. The experiment further shows that learning random forests of PETs using relative class frequency significantly outperforms learning random forests of classification trees (i.e., trees for which only an unweighted vote on the most probable class is counted) with respect to both accuracy and AUC, but that the latter is clearly ahead of the former with respect to Brier score.
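    The three probability estimates compared in this abstract have simple closed forms. A minimal sketch follows; the prior and the weight m are illustrative choices, not values taken from the paper:

```python
def relative_frequency(k, n):
    """Uncorrected estimate: fraction of the n examples in a leaf
    that belong to the class (k of them)."""
    return k / n

def laplace_estimate(k, n, num_classes):
    """Laplace correction: one pseudo-count added per class."""
    return (k + 1) / (n + num_classes)

def m_estimate(k, n, prior, m):
    """m-estimate: shrink the observed frequency towards a class
    prior with strength m (m pseudo-examples split by the prior)."""
    return (k + m * prior) / (n + m)

# A small leaf: 3 of 4 examples positive, two classes in total.
print(relative_frequency(3, 4))            # 0.75
print(laplace_estimate(3, 4, 2))           # 4/6 = 0.666...
print(m_estimate(3, 4, prior=0.5, m=2.0))  # equals Laplace here
```

    Note that the Laplace estimate is the m-estimate with a uniform prior and m equal to the number of classes, which is why the two coincide in this example.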

    Download full text (pdf)
  • 5.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Feature vs. Classifier Fusion for Predictive Data - a Case Study in Pesticide Classification (2007). In: Proceedings of the 10th International Conference on Information Fusion, IEEE, 2007, pp. 1-7. Conference paper (Refereed)
    Abstract [en]

    Two strategies for fusing information from multiple sources when generating predictive models in the domain of pesticide classification are investigated: i) fusing different sets of features (molecular descriptors) before building a model and ii) fusing the classifiers built from the individual descriptor sets. An empirical investigation demonstrates that the choice of strategy can have a significant impact on the predictive performance. Furthermore, the experiment shows that the best strategy is dependent on the type of predictive model considered. When generating a decision tree for pesticide classification, a statistically significant difference in accuracy is observed in favor of combining predictions from the individual models compared to generating a single model from the fused set of molecular descriptors. On the other hand, when the model consists of an ensemble of decision trees, a statistically significant difference in accuracy is observed in favor of building the model from the fused set of descriptors compared to fusing ensemble models built from the individual sources.

    Download full text (pdf)
  • 6.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Forests of probability estimation trees (2012). In: International Journal of Pattern Recognition and Artificial Intelligence, ISSN 0218-0014, Vol. 26, no. 2, p. 1251001. Journal article (Refereed)
    Abstract [en]

    Probability estimation trees (PETs) generalize classification trees in that they assign class probability distributions instead of class labels to examples that are to be classified. This property has been demonstrated to allow PETs to outperform classification trees with respect to ranking performance, as measured by the area under the ROC curve (AUC). It has further been shown that the use of probability correction improves the performance of PETs. This has led to the use of probability correction also in forests of PETs. However, it was recently observed that probability correction may in fact deteriorate performance of forests of PETs. A more detailed study of the phenomenon is presented and the reasons behind this observation are analyzed. An empirical investigation is presented, comparing forests of classification trees to forests of both corrected and uncorrected PETs on 34 data sets from the UCI repository. The experiment shows that a small forest (10 trees) of probability corrected PETs gives a higher AUC than a similar-sized forest of classification trees, hence providing evidence in favor of using forests of probability corrected PETs. However, the picture changes when increasing the forest size, as the AUC is no longer improved by probability correction. For accuracy and squared error of predicted class probabilities (Brier score), probability correction even leads to a negative effect. An analysis of the mean squared error of the trees in the forests and their variance shows that although probability correction results in trees that are more correct on average, the variance is reduced at the same time, leading to an overall loss of performance for larger forests. The main conclusions are that probability correction should only be employed in small forests of PETs, and that for larger forests, classification trees and PETs are equally good alternatives.

  • 7.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Maximizing the Area under the ROC Curve with Decision Lists and Rule Sets (2007). In: Proceedings of the 2007 SIAM International Conference on Data Mining, 2007, pp. 27-34. Conference paper (Refereed)
    Download full text (pdf)
  • 8.
    Boström, Henrik
    et al.
    Asker, Lars
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Gurung, Ram B.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Karlsson, Isak
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Lindgren, Tony
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Papapetrou, Panagiotis
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Conformal prediction using random survival forests (2017). In: 16th IEEE International Conference on Machine Learning and Applications: Proceedings / [ed] Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani, IEEE, 2017, pp. 812-817. Conference paper (Refereed)
    Abstract [en]

    Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. As with most standard predictive models, no guarantee on the prediction error is provided for this model; instead, the error is typically evaluated empirically. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user-specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications: predicting component failure in trucks using operational data, and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no-event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.

  • 9.
    Boström, Henrik
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    De-identifying health records by means of active learning (2012). In:  , 2012. Conference paper (Refereed)
    Abstract [en]

    An experiment on classifying words in Swedish health records as belonging to one of eight protected health information (PHI) classes, or to the non-PHI class, by means of active learning has been conducted, in which three selection strategies were evaluated in conjunction with random forests: the commonly employed approach of choosing the most uncertain examples, choosing randomly, and choosing the most certain examples. Surprisingly, random selection outperformed choosing the most uncertain examples with respect to ten considered performance metrics. Moreover, choosing the most certain examples outperformed random selection with respect to nine out of ten metrics.
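    The three selection strategies compared in the abstract can be sketched in a few lines. Here certainty is taken as the probability of the top-ranked class, which is one common choice; the paper's exact uncertainty measure is not stated in the abstract:

```python
import random

def select_batch(class_probs, k, strategy):
    """Pick k pool indices to label next.
    class_probs: one class-probability vector per unlabeled example."""
    certainty = [max(p) for p in class_probs]
    order = sorted(range(len(class_probs)), key=lambda i: certainty[i])
    if strategy == "most_uncertain":
        return order[:k]          # lowest top-class probability first
    if strategy == "most_certain":
        return order[-k:]         # highest top-class probability
    if strategy == "random":
        return random.sample(range(len(class_probs)), k)
    raise ValueError(f"unknown strategy: {strategy}")

pool = [(0.5, 0.5), (0.9, 0.1), (0.6, 0.4), (0.99, 0.01)]
print(select_batch(pool, 2, "most_uncertain"))  # [0, 2]
print(select_batch(pool, 2, "most_certain"))    # [1, 3]
```

    In each active-learning round, the selected examples would be labeled and added to the training set before the forest is retrained.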

  • 10.
    Boström, Henrik
    et al.
    KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science, Sweden.
    Gurung, Ram Bahadur
    Dept. of Computer and System Sciences, Stockholm University, Sweden.
    Lindgren, Tony
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Johansson, Ulf
    Dept. of Computer Science and Informatics, Jönköping University, Sweden.
    Explaining Random Forest Predictions with Association Rules (2018). In: Archives of Data Science, E-ISSN 2363-9881, Vol. 5, no. 1, pp. 1-20. Journal article (Refereed)
    Abstract [en]

    Random forests frequently achieve state-of-the-art predictive performance. However, the logic behind their predictions cannot be easily understood, since they are the result of averaging often hundreds or thousands of, possibly conflicting, individual predictions. Instead of presenting all the individual predictions, an alternative is proposed, by which the predictions are explained using association rules generated from itemsets representing paths in the trees of the forest. An empirical investigation is presented, in which alternative ways of generating the association rules are compared with respect to explainability, as measured by the fraction of predictions for which there is no applicable rule and by the fraction of predictions for which there is at least one applicable rule that conflicts with the forest prediction. For the considered datasets, it can be seen that most predictions can be explained by the discovered association rules, which have a high level of agreement with the underlying forest. The results do not single out a clear winner of the considered alternatives in terms of unexplained and disagreement rates, but show that they are associated with substantial differences in computational cost.

  • 11.
    Boström, Henrik
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Knobbe, Arno
    Soares, Carlos
    Papapetrou, Panagiotis
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Advances in Intelligent Data Analysis XV: 15th International Symposium, IDA 2016, Stockholm, Sweden, October 13-15, 2016, Proceedings (2016). Proceedings (editorship) (Refereed)
    Abstract [en]

    This book constitutes the refereed conference proceedings of the 15th International Conference on Intelligent Data Analysis, which was held in October 2016 in Stockholm, Sweden. The 36 revised full papers presented were carefully reviewed and selected from 75 submissions. The traditional focus of the IDA symposium series is on end-to-end intelligent support for data analysis. The symposium aims to provide a forum for inspiring research contributions that might be considered preliminary in other leading conferences and journals, but that have a potentially dramatic impact.

  • 12.
    Boström, Henrik
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Linusson, Henrik
    Löfström, Tuve
    Johansson, Ulf
    Accelerating difficulty estimation for conformal regression forests (2017). In: Annals of Mathematics and Artificial Intelligence, ISSN 1012-2443, E-ISSN 1573-7470, Vol. 81, no. 1-2, pp. 125-144. Journal article (Refereed)
    Abstract [en]

    The conformal prediction framework allows for specifying the probability of making incorrect predictions by a user-provided confidence level. In addition to a learning algorithm, the framework requires a real-valued function, called nonconformity measure, to be specified. The nonconformity measure does not affect the error rate, but the resulting efficiency, i.e., the size of output prediction regions, may vary substantially. A recent large-scale empirical evaluation of conformal regression approaches showed that using random forests as the learning algorithm together with a nonconformity measure based on out-of-bag errors normalized using a nearest-neighbor-based difficulty estimate, resulted in state-of-the-art performance with respect to efficiency. However, the nearest-neighbor procedure incurs a significant computational cost. In this study, a more straightforward nonconformity measure is investigated, where the difficulty estimate employed for normalization is based on the variance of the predictions made by the trees in a forest. A large-scale empirical evaluation is presented, showing that both the nearest-neighbor-based and the variance-based measures significantly outperform a standard (non-normalized) nonconformity measure, while no significant difference in efficiency between the two normalized approaches is observed. The evaluation moreover shows that the computational cost of the variance-based measure is several orders of magnitude lower than when employing the nearest-neighbor-based nonconformity measure. The use of out-of-bag instances for calibration does, however, result in nonconformity scores that are distributed differently from those obtained from test instances, questioning the validity of the approach. An adjustment of the variance-based measure is presented, which is shown to be valid and also to have a significant positive effect on the efficiency. For conformal regression forests, the variance-based nonconformity measure is hence a computationally efficient and theoretically well-founded alternative to the nearest-neighbor procedure.
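    A variance-based nonconformity measure of the kind described can be sketched as below; the exact formula used in the paper, including the smoothing term beta, is an assumption here:

```python
def variance_nonconformity(y_true, y_pred, tree_preds, beta=1.0):
    """|y - y_hat| normalized by the spread of the individual trees'
    predictions: an error counts as less nonconforming where the
    forest itself is uncertain. beta avoids division by zero and
    damps the normalization."""
    mean = sum(tree_preds) / len(tree_preds)
    var = sum((p - mean) ** 2 for p in tree_preds) / len(tree_preds)
    return abs(y_true - y_pred) / (var ** 0.5 + beta)

# Same absolute error, but the second forest's trees disagree more,
# so the error is considered less surprising (lower score).
print(variance_nonconformity(1.0, 0.0, [0.0, 0.0, 0.0]))   # 1.0
print(variance_nonconformity(1.0, 0.0, [-1.0, 0.0, 1.0]))  # < 1.0
```

    Unlike a nearest-neighbor difficulty estimate, this quantity comes for free with the forest's prediction, which is the source of the computational savings reported in the abstract.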

  • 13.
    Boström, Henrik
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Linusson, Henrik
    Löfström, Tuve
    Johansson, Ulf
    Evaluation of a Variance-Based Nonconformity Measure for Regression Forests (2016). In: Conformal and Probabilistic Prediction with Applications: 5th International Symposium, COPA 2016, Madrid, Spain, April 20-22, 2016, Proceedings / [ed] Alexander Gammerman, Zhiyuan Luo, Jesús Vega, Vladimir Vovk, Springer, 2016, pp. 75-89. Conference paper (Refereed)
    Abstract [en]

    In a previous large-scale empirical evaluation of conformal regression approaches, random forests using out-of-bag instances for calibration together with a k-nearest-neighbor-based nonconformity measure were shown to obtain state-of-the-art performance with respect to efficiency, i.e., average size of prediction regions. However, the nearest-neighbor procedure not only requires that all training data be retained in conjunction with the underlying model, but also incurs a significant computational overhead during both training and testing. In this study, a more straightforward nonconformity measure is investigated, where the difficulty estimate employed for normalization is based on the variance of the predictions made by the trees in a forest. A large-scale empirical evaluation is presented, showing that both the nearest-neighbor-based and the variance-based measures significantly outperform a standard (non-normalized) nonconformity measure, while no significant difference in efficiency between the two normalized approaches is observed. Moreover, the evaluation shows that state-of-the-art performance is achieved by the variance-based measure at a computational cost that is several orders of magnitude lower than when employing the nearest-neighbor-based nonconformity measure.

  • 14.
    Carlsson, Lars
    et al.
    Ahlberg, Ernst
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Johansson, Ulf
    Linusson, Henrik
    Modifications to p-Values of Conformal Predictors (2015). In: Statistical Learning and Data Sciences: Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, pp. 251-259. Conference paper (Refereed)
    Abstract [en]

    The original definition of a p-value in a conformal predictor can sometimes lead to too conservative prediction regions when the number of training or calibration examples is small. The situation can be improved by using a modification to define an approximate p-value. Two modified p-values are presented that converge to the original p-value as the number of training or calibration examples goes to infinity.

    Numerical experiments empirically support the use of a p-value we call the interpolated p-value for conformal prediction. The interpolated p-value seems to produce prediction sets whose error rate corresponds well to the prescribed significance level.
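    For reference, the original (unmodified) conformal p-value that these modifications approximate can be computed as follows; the interpolated p-value itself is defined in the paper and is not reproduced here:

```python
def conformal_p_value(score_new, calibration_scores):
    """Fraction of calibration nonconformity scores at least as large
    as the new example's score, counting the new example itself."""
    n_ge = sum(1 for s in calibration_scores if s >= score_new)
    return (n_ge + 1) / (len(calibration_scores) + 1)

# Two of four calibration scores are >= 0.3, so p = (2 + 1) / 5.
print(conformal_p_value(0.3, [0.1, 0.2, 0.4, 0.8]))  # 0.6
```

    The coarse granularity of this step function with few calibration examples is exactly what makes small-sample prediction regions conservative, motivating the modified p-values.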

  • 15.
    Dalianis, Hercules
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests (2012). In: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 2012, pp. 45-48. Conference paper (Refereed)
    Abstract [en]

    Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive, since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment with two state-of-the-art machine learning methods, conditional random fields and random forests, is presented, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark test. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.

  • 16.
    Deegalla, Sampath
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods (2007). In: Intelligent Data Engineering and Automated Learning - IDEAL 2007 / [ed] Hujun Yin, Peter Tino, Emilio Corchado, Will Byrne, Xin Yao, Berlin, Heidelberg: Springer Verlag, 2007, pp. 800-809. Conference paper (Refereed)
    Abstract [en]

    Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (kNN) for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain (IG), are compared on eight microarray data sets. It is observed that all dimensionality reduction methods result in more accurate classifiers than what is obtained from using the raw attributes. Furthermore, it is observed that both PCA and PLS reach their best accuracies with fewer components than the other two methods, and that RP needs far more components than the others to outperform kNN on the non-reduced dataset. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks, but the main conclusion from the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
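    One of the compared pipelines (PCA followed by kNN, here with k = 1) can be sketched with plain NumPy; the component count and the data below are illustrative, not the paper's:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components
    (SVD of the mean-centered data matrix)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def one_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbor classification by Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[int(np.argmin(dists))])
    return preds

# Four samples with three correlated features, reduced to one component.
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
              [10.0, 10.0, 10.0], [10.1, 10.0, 10.0]])
Z = pca_reduce(X, 1)
print(one_nn_predict(Z[:3], [0, 0, 1], Z[3:]))  # [1]
```

    For microarray data the number of features would be in the thousands, which is where reducing to a handful of components helps kNN most.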

    Download full text (pdf)
  • 17.
    Deegalla, Sampath
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Walgama, Keerthi
    Choice of Dimensionality Reduction Methods for Feature and Classifier Fusion with Nearest Neighbor Classifiers (2012). In: 15th International Conference on Information Fusion, IEEE Computer Society Digital Library, 2012, pp. 875-881. Conference paper (Refereed)
    Abstract [en]

    High-dimensional data often cause problems for currently used learning algorithms in terms of efficiency and effectiveness. One solution to this problem is to apply dimensionality reduction, by which the original feature set can be reduced to a small number of features while gaining improved accuracy and/or efficiency of the learning algorithm. We have investigated multiple dimensionality reduction methods for nearest neighbor classification in high dimensions. In previous studies, we have demonstrated that fusion of different outputs of dimensionality reduction methods, either by combining classifiers built on reduced features, or by combining reduced features and then applying the classifier, may yield higher accuracies than when using individual reduction methods. However, none of the previous studies have investigated what dimensionality reduction methods to choose for fusion, when outputs of multiple dimensionality reduction methods are available. Therefore, we have empirically investigated different combinations of the output of four dimensionality reduction methods on 18 medicinal chemistry datasets. The empirical investigation demonstrates that fusion of nearest neighbor classifiers obtained from multiple reduction methods in all cases outperforms the use of individual dimensionality reduction methods, while fusion of different feature subsets is quite sensitive to the choice of dimensionality reduction methods.

  • 18.
    Dudas, Catarina
    et al.
    University of Skövde, Sweden.
    Ng, Amos
    University of Skövde, Sweden.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Information extraction from solution set of simulation-based multi-objective optimization using data mining (2009). In: Proceedings of Industrial Simulation Conference (ISC) 2009, 2009, pp. 65-69. Conference paper (Refereed)
    Abstract [en]

    In this work, we investigate ways of extracting information from simulations, in particular from simulation-based multi-objective optimisation, in order to acquire information that can support human decision makers that aim for optimising manufacturing processes. Applying data mining for analyzing data generated using simulation is a fairly unexplored area. With the observation that the obtained solutions from a simulation-based multi-objective optimisation are all optimal (or close to the optimal Pareto front) so that they are bound to follow and exhibit certain relationships among variables vis-à-vis objectives, it is argued that using data mining to discover these relationships could be a promising procedure. The aim of this paper is to provide the empirical results from two simulation case studies to support such a hypothesis.

  • 19.
    Dudas, Catarina
    et al.
    Ng, Amos H. C.
    Pehrsson, Leif
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Integration of data mining and multi-objective optimisation for decision support in production systems development (2014). In: International Journal of Computer Integrated Manufacturing (Print), ISSN 0951-192X, E-ISSN 1362-3052, Vol. 27, no. 9, pp. 824-839. Journal article (Refereed)
    Abstract [en]

    Multi-objective optimisation (MOO) is a powerful approach for generating a set of optimal trade-off (Pareto) design alternatives that the decision-maker can evaluate in order to choose the most suitable configuration, based on some high-level strategic information. Nevertheless, in practice, choosing among a large number of solutions on the Pareto front is often a daunting task if proper analysis and visualisation techniques are not applied. Recent research advancements have shown the advantages of using data mining techniques to automate the post-optimality analysis of Pareto-optimal solutions for engineering design problems. Nonetheless, it is argued that the existing approaches are inadequate for generating high-quality results when the set of Pareto solutions is relatively small and the solutions close to the Pareto front have almost the same attributes as the Pareto-optimal solutions, both of which are commonly found in many real-world system problems. The aim of this paper is therefore to propose a distance-based data mining approach for the solution sets generated from simulation-based optimisation, in order to address these issues. Such an integrated data mining and MOO procedure is illustrated with the results of an industrial cost optimisation case study. Particular emphasis is placed on showing how the proposed procedure can be used to assist decision-makers in analysing and visualising the attributes of the design alternatives in different regions of the objective space, so that informed decisions can be made in production systems development.

  • 20. Dudas, Catarina
    et al.
    Ng, Amos H. C.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Post-analysis of multi-objective optimization solutions using decision trees (2015). In: Intelligent Data Analysis, ISSN 1088-467X, E-ISSN 1571-4128, Vol. 19, no. 2, pp. 259-278. Journal article (Refereed)
    Abstract [en]

    Evolutionary algorithms are often applied to solve multi-objective optimization problems. Such algorithms effectively generate solutions of wide spread and have good convergence properties. However, they do not provide any characteristics of the found optimal solutions, something which may be very valuable to decision makers. By performing a post-analysis of the solution set from multi-objective optimization, relationships between the input space and the objective space can be identified. In this study, decision trees are used for this purpose. It is demonstrated that they may effectively capture important characteristics of the solution sets produced by multi-objective optimization methods. It is furthermore shown that the discovered relationships may be used for improving the search for additional solutions. Two multi-objective problems are considered in this paper: a well-studied benchmark function problem with a known optimal Pareto front, which is used for verification purposes, and a multi-objective optimization problem of a real-world production system. The results show that useful relationships may be identified by employing decision tree analysis of the solution sets from multi-objective optimizations.
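The post-analysis described above first needs the solution set labeled by Pareto-optimality, after which a decision tree can be fitted to the labels. A minimal sketch of the labeling step, assuming minimisation of all objectives and a made-up set of objective vectors:

```python
def dominates(a, b):
    """True if solution a dominates b (minimisation of all objectives)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def label_solutions(solutions):
    """Label each solution 1 if non-dominated within the set, else 0."""
    return [0 if any(dominates(o, s) for o in solutions if o != s) else 1
            for s in solutions]

# Objective vectors from a hypothetical optimisation run (both minimised).
objs = [(1, 4), (2, 3), (3, 2), (4, 1), (3, 3), (4, 4)]
print(label_solutions(objs))  # → [1, 1, 1, 1, 0, 0]
```

A decision tree trained on the input (decision) variables against these labels would then expose which input-space regions yield (near-)Pareto solutions.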

  • 21.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Learning Decision Trees from Histogram Data (2015). In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, CSREA Press, 2015, pp. 139-145. Conference paper (Refereed)
    Abstract [en]

    When applying learning algorithms to histogram data, the bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information, as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from applying the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by the number of nodes in the decision tree. These gains are, however, associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.
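To illustrate why treating bins jointly can pay off, the sketch below compares threshold splits on a single bin against splits on a simple joint function of two bins (their difference). The difference is only one illustrative joint function; the paper's algorithm searches over partitions using the binned variables more generally, and the toy histograms are invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_gini(values, labels):
    """Lowest weighted Gini over all threshold splits of a numeric feature."""
    best = float("inf")
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        best = min(best, w)
    return best

# Toy two-bin histograms where the class depends on bin0 - bin1,
# so no single bin admits a clean split.
hists = [[3, 2], [2, 3], [5, 4], [4, 5]]
labels = [0, 1, 0, 1]
single = best_split_gini([h[0] for h in hists], labels)       # > 0: impure
joint = best_split_gini([h[0] - h[1] for h in hists], labels)  # 0.0: pure split
```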

    Download full text (pdf)
    fulltext
  • 22.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins (2016). In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press, 2016, pp. 430-435. Conference paper (Refereed)
    Abstract [en]

    The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
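The candidate-generation idea above can be sketched as follows: instead of the exponentially many bin subsets, only contiguous windows of a fixed size are considered, together with the individual bins (the standard approach's candidates). Window size and bin count here are arbitrary illustrative values:

```python
def candidate_bin_subsets(num_bins, window):
    """Candidate bin subsets for split evaluation: every contiguous window
    of the given size, plus each individual bin."""
    windows = [tuple(range(i, i + window)) for i in range(num_bins - window + 1)]
    singles = [(i,) for i in range(num_bins)]
    return windows + singles

cands = candidate_bin_subsets(10, 3)
print(len(cands), 2 ** 10 - 1)  # 18 candidate subsets vs 1023 possible subsets
```

The reduction from 2^k - 1 to roughly 2k candidates is what allows substantially larger histograms to be handled.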

  • 23.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Learning Random Forest from Histogram Data Using Split Specific Axis Rotation (2018). In: International Journal of Machine Learning and Computing, ISSN 2010-3700, Vol. 8, no. 1, pp. 74-79. Journal article (Refereed)
    Abstract [en]

    Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming an identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of bins. The standard approach of representing bins as separate variables may lead the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, a sliding window of fixed size is employed by the proposed algorithm to constrain the sets of bins that are considered together. A small number of all possible sets of bins are randomly selected, and principal component analysis (PCA) is applied locally to all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests together with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
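The local PCA step can be sketched as below: the bins selected by a window are projected onto their first principal component, and threshold splits would then be evaluated on the projected scores. The data matrix is invented for illustration:

```python
import numpy as np

def pc1_projection(bins):
    """Project examples (rows) onto the first principal component of the
    selected histogram bins (columns)."""
    centered = bins - bins.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return centered @ eigvecs[:, -1]        # scores along the top component

# Two strongly correlated bins: PC1 captures more variance than either bin.
window = np.array([[1.0, 1.1], [2.0, 2.2], [3.0, 2.9], [4.0, 4.1]])
scores = pc1_projection(window)
```

Because the top eigenvalue of the covariance matrix is at least as large as any of its diagonal entries, a split axis along PC1 can never capture less variance than any single bin in the window.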

  • 24.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests (2017). In: International Journal of Prognostics and Health Management, E-ISSN 2153-2648, Vol. 8, no. 1, article id 008. Journal article (Refereed)
    Abstract [en]

    Being able to accurately predict impending failures of truck components is often associated with significant cost savings, customer satisfaction and flexibility in maintenance service plans. However, because of the diversity in how trucks are typically configured and used under different conditions, creating accurate prediction models is not an easy task. This paper describes an effort to create such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified random forest learning algorithm is described, along with various challenges encountered during this process. The operational data consist of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments conducted using the updated random forest algorithm clearly show that the modified version is indeed beneficial when compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising and may be adopted for the benefit of operators of heavy trucks.

  • 25. Henelius, Andreas
    et al.
    Puolamaki, Kai
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    A peek into the black box: exploring classifiers by randomization (2014). In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 28, no. 5-6, pp. 1503-1529. Journal article (Refereed)
    Abstract [en]

    Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm are demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
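The core randomization idea can be sketched as follows: break the association of an attribute (or attribute group) with the rest of the data and measure how much the black-box predictions change. This simplified probe shifts a column across the whole dataset, whereas the paper's algorithm randomizes within classes and searches over attribute groupings; the classifier and data here are invented:

```python
def agreement_after_shift(predict, X, col):
    """Fraction of predictions unchanged after breaking the association of
    one column with the rest (here: cyclically shifting the column)."""
    original = [predict(row) for row in X]
    shifted_col = [X[(i - 1) % len(X)][col] for i in range(len(X))]
    X_shifted = [row[:col] + [shifted_col[i]] + row[col + 1:]
                 for i, row in enumerate(X)]
    perturbed = [predict(row) for row in X_shifted]
    return sum(a == b for a, b in zip(original, perturbed)) / len(X)

# Black-box "classifier" that uses only an interaction of columns 0 and 1.
predict = lambda row: row[0] ^ row[1]
X = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
print(agreement_after_shift(predict, X, 2))  # column 2 unused → 1.0
print(agreement_after_shift(predict, X, 0))  # column 0 used → drops
```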

  • 26. Henelius, Andreas
    et al.
    Puolamäki, Kai
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Clustering with Confidence: Finding Clusters with Statistical Guarantees (2016). In: Journal article (Refereed)
    Abstract [en]

    Clustering is a widely used unsupervised learning method for finding structure in data. However, the resulting clusters are typically presented without any guarantees on their robustness: slightly changing the data sample used, or re-running a clustering algorithm involving some stochastic component, may lead to completely different clusters. There is, hence, a need for techniques that can quantify the instability of the generated clusters. In this study, we propose a technique for quantifying the instability of a clustering solution and for finding robust clusters, termed core clusters, which correspond to clusters where the co-occurrence probability of each data item within a cluster is at least 1 − α. We demonstrate how solving the core clustering problem is linked to finding the largest maximal cliques in a graph. We show that the method can be used with both clustering and classification algorithms. The proposed method is tested on both simulated and real datasets. The results show that the obtained clusters indeed meet the guarantees on robustness.
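The co-occurrence quantity at the heart of the method can be sketched as below: given the label vectors from repeated clustering runs, estimate how often each pair of items lands in the same cluster, and keep the pairs meeting the 1 − α threshold. The paper links the full problem to maximal cliques in the resulting graph; this sketch stops at the pairwise edges, and the run labels are invented:

```python
from itertools import combinations

def cooccurrence(runs, i, j):
    """Fraction of clustering runs in which items i and j share a cluster."""
    return sum(labels[i] == labels[j] for labels in runs) / len(runs)

def core_pairs(runs, alpha):
    """Pairs whose co-occurrence probability is at least 1 - alpha."""
    n = len(runs[0])
    return [(i, j) for i, j in combinations(range(n), 2)
            if cooccurrence(runs, i, j) >= 1 - alpha]

# Cluster labels of 4 items from 3 hypothetical (re)clustering runs.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(core_pairs(runs, alpha=0.1))  # → [(0, 1)]: only items 0 and 1 always co-occur
```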

  • 27. Henelius, Andreas
    et al.
    Puolamäki, Kai
    Karlsson, Isak
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    GoldenEye++: a Closer Look into the Black Box (2015). In: Statistical Learning and Data Sciences: Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, pp. 96-105. Conference paper (Refereed)
    Abstract [en]

    Models with high predictive performance are often opaque, i.e., they do not allow for direct interpretation, and are hence of limited value when the goal is to understand the reasoning behind predictions. A recently proposed algorithm, GoldenEye, allows detection of groups of interacting variables exploited by a model. We employed this technique in conjunction with random forests generated from data obtained from electronic patient records for the task of detecting adverse drug events (ADEs). We propose a refined version of the GoldenEye algorithm, called GoldenEye++, utilizing a more sensitive grouping metric. An empirical investigation comparing the two algorithms on 27 datasets related to detecting ADEs shows that the new version of the algorithm in several cases finds groups of medically relevant interacting attributes, corresponding to prescribed drugs, undetected by the previous version. This suggests that the GoldenEye++ algorithm can be a useful tool for finding novel (adverse) drug interactions.

  • 28.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection (2015). In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, pp. 343-350. Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so.
Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.
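The role of the context window, and the early (feature) fusion of spaces built with different window sizes, can be sketched with raw co-occurrence counts. Real distributional-semantics pipelines use learned dense vectors rather than counts, and the event sequence here is invented:

```python
def cooccurrence_vector(seq, target, vocab, window):
    """Count how often each vocabulary item appears within `window`
    positions of occurrences of `target` in an event sequence."""
    counts = dict.fromkeys(vocab, 0)
    for i, e in enumerate(seq):
        if e != target:
            continue
        for j in range(max(0, i - window), min(len(seq), i + window + 1)):
            if j != i:
                counts[seq[j]] += 1
    return [counts[v] for v in vocab]

seq = ["a", "b", "a", "c"]  # a toy sequence of clinical event codes
vocab = ["a", "b", "c"]

# Early fusion: concatenate the representations from two window sizes.
early_fused = (cooccurrence_vector(seq, "a", vocab, window=1)
               + cooccurrence_vector(seq, "a", vocab, window=2))
print(early_fused)  # → [0, 2, 1, 2, 2, 1]
```

A larger window folds more distant events into the representation, which is exactly the design dimension the ensemble of semantic spaces varies.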

  • 29.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection (2015). In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, IEEE, 2015. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

  • 30.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ensembles of randomized trees using diverse distributed representations of clinical events (2016). In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, Vol. 16, pp. 85-95, article id 69. Journal article (Refereed)
    Abstract [en]

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling. Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size. Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. 
Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.

  • 31. Jansson, Karl
    et al.
    Sundell, Håkan
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    gpuRF and gpuERT: Efficient and Scalable GPU Algorithms for Decision Tree Ensembles (2014). In: Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, IEEE Computer Society, 2014, pp. 1612-1621. Conference paper (Refereed)
  • 32. Johansson, Ulf
    et al.
    Ahlberg, Ernst
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Carlsson, Lars
    Linusson, Henrik
    Sönströd, Cecilia
    Handling Small Calibration Sets in Mondrian Inductive Conformal Regressors (2015). In: Statistical Learning and Data Sciences: Third International Symposium, SLDS 2015, Egham, UK, April 20–23, 2015, Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Cham: Springer, 2015, pp. 271-280. Conference paper (Refereed)
    Abstract [en]

    In inductive conformal prediction, calibration sets must contain an adequate number of instances to support the chosen confidence level. This problem is particularly prevalent when using Mondrian inductive conformal prediction, where the input space is partitioned into independently valid prediction regions. In this study, Mondrian conformal regressors, in the form of regression trees, are used to investigate two problematic aspects of small calibration sets. If there are too few calibration instances to support the significance level, we suggest using either extrapolation or altering the model. In situations where the desired significance level is between two calibration instances, the standard procedure is to choose the more nonconforming one, thus guaranteeing validity but producing conservative conformal predictors. The suggested solution is to use interpolation between calibration instances. All proposed techniques are empirically evaluated and compared to the standard approach on 30 benchmark data sets. The results show that while extrapolation often results in invalid models, interpolation works extremely well and provides increased efficiency with preserved empirical validity.
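The contrast between the conservative standard quantile and the interpolated one can be sketched as follows; the interval half-width returned here would be applied as ŷ ± q, and the calibration scores are invented:

```python
import math

def standard_quantile(scores, eps):
    """Standard ICP: the ceil((1-eps)(n+1))-th smallest calibration score."""
    s = sorted(scores)
    k = math.ceil((1 - eps) * (len(s) + 1))
    return s[k - 1] if k <= len(s) else float("inf")

def interpolated_quantile(scores, eps):
    """Interpolate between the two neighbouring calibration scores."""
    s = sorted(scores)
    r = (1 - eps) * (len(s) + 1)
    lo = int(math.floor(r))
    if lo >= len(s):
        return float("inf")
    frac = r - lo
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # nonconformity scores, e.g. |y - yhat|
print(standard_quantile(scores, eps=0.15))      # conservative: 9
print(interpolated_quantile(scores, eps=0.15))  # interpolates between 8 and 9
```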

  • 33. Johansson, Ulf
    et al.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Löfström, Tuve
    Conformal Prediction Using Decision Trees (2013). In: IEEE 13th International Conference on Data Mining (ICDM): Proceedings, IEEE, 2013, pp. 330-339. Conference paper (Refereed)
    Abstract [en]

    Conformal prediction is a relatively new framework in which predictive models output sets of predictions with a bound on the error rate, i.e., in a classification context, the probability of excluding the correct class label is lower than a predefined significance level. An investigation of the use of decision trees within the conformal prediction framework is presented, with the overall purpose of determining the effect of different algorithmic choices, including the split criterion, pruning scheme and way of calculating the probability estimates. Since the error rate is bounded by the framework, the most important property of conformal predictors is efficiency, which concerns minimizing the number of elements in the output prediction sets. Results from one of the largest empirical investigations to date within the conformal prediction framework are presented, showing that in order to optimize efficiency, the decision trees should be induced using no pruning and with smoothed probability estimates. The choice of split criterion used for the actual induction of the trees did not turn out to have any major impact on efficiency. Finally, the experimentation also showed that when using decision trees, standard inductive conformal prediction was as efficient as the recently suggested method cross-conformal prediction. This is an encouraging result, since cross-conformal prediction uses several decision trees, thus sacrificing the interpretability of a single decision tree.
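A minimal sketch of how a prediction set is formed from a tree's class probability estimates in inductive conformal classification: the nonconformity of a tentative label is 1 minus its estimated probability, and a label enters the set when its (unsmoothed) p-value exceeds the significance level. The calibration scores and probabilities below are invented:

```python
def p_value(cal_scores, score):
    """Conformal p-value of a tentative label (unsmoothed)."""
    return (sum(s >= score for s in cal_scores) + 1) / (len(cal_scores) + 1)

def prediction_set(cal_scores, class_probs, eps):
    """Include every label whose p-value exceeds the significance level.
    Nonconformity = 1 - estimated probability of the tentative label."""
    return {c for c, p in class_probs.items()
            if p_value(cal_scores, 1 - p) > eps}

cal = [0.1, 0.2, 0.3, 0.9]    # calibration nonconformity scores
probs = {"A": 0.8, "B": 0.2}  # a tree's class probability estimates
print(prediction_set(cal, probs, eps=0.5))  # only the likely label
print(prediction_set(cal, probs, eps=0.3))  # stricter level: both labels
```

Smaller significance levels demand more confidence, so the prediction sets grow; an efficient conformal predictor keeps them as small as possible.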

  • 34. Johansson, Ulf
    et al.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Löfström, Tuve
    Linusson, Henrik
    Regression conformal prediction with random forests (2014). In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 97, no. 1-2, pp. 155-176. Journal article (Refereed)
    Abstract [en]

    Regression conformal prediction produces prediction intervals that are valid, i.e., the probability of excluding the correct target value is bounded by a predefined confidence level. The most important criterion when comparing conformal regressors is efficiency; the prediction intervals should be as tight (informative) as possible. In this study, the use of random forests as the underlying model for regression conformal prediction is investigated and compared to existing state-of-the-art techniques, which are based on neural networks and k-nearest neighbors. In addition to their robust predictive performance, random forests allow for determining the size of the prediction intervals by using out-of-bag estimates instead of requiring a separate calibration set. An extensive empirical investigation, using 33 publicly available data sets, was undertaken to compare the use of random forests to existing state-of-the-art conformal predictors. The results show that the suggested approach, on almost all confidence levels and using both standard and normalized nonconformity functions, produced significantly more efficient conformal predictors than the existing alternatives.

  • 35. Johansson, Ulf
    et al.
    König, Rikard
    Linusson, Henrik
    Löfström, Tuve
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rule Extraction with Guaranteed Fidelity (2014). In: Artificial Intelligence Applications and Innovations: Proceedings, Springer Berlin/Heidelberg, 2014, pp. 281-290. Conference paper (Refereed)
  • 36. Johansson, Ulf
    et al.
    König, Rikard
    Löfström, Tuve
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Evolved decision trees as conformal predictors (2013). In: 2013 IEEE Congress on Evolutionary Computation (CEC), IEEE Computer Society, 2013, pp. 1794-1801. Conference paper (Refereed)
    Abstract [en]

    In conformal prediction, predictive models output sets of predictions with a bound on the error rate. In classification, this means that, in the long run, the probability of excluding the correct class is lower than a predefined significance level. Since the error rate is guaranteed, the most important criterion for conformal predictors is efficiency. Efficient conformal predictors minimize the number of elements in the output prediction sets, thus producing more informative predictions. This paper presents one of the first comprehensive studies in which evolutionary algorithms are used to build conformal predictors. More specifically, decision trees evolved using genetic programming are evaluated as conformal predictors. In the experiments, the evolved trees are compared to decision trees induced using standard machine learning techniques on 33 publicly available benchmark data sets, with regard to predictive performance and efficiency. The results show that the evolved trees are generally more accurate, and the corresponding conformal predictors more efficient, than their induced counterparts. One important result is that the probability estimates of decision trees used as conformal predictors should be smoothed, here using the Laplace correction. Finally, using the more discriminating Brier score instead of accuracy as the optimization criterion produced the most efficient conformal predictions.
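
    The combination of Laplace-corrected probability estimates and conformal prediction sets can be illustrated as follows. The sketch uses an induced scikit-learn tree rather than an evolved one, and the helper name `laplace_proba` is our own, not from the paper:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_train, y_train)

def laplace_proba(model, X):
    # Laplace-smoothed leaf probabilities: (count_c + 1) / (n_leaf + K)
    leaves = model.apply(X)
    n = model.tree_.n_node_samples[leaves].reshape(-1, 1).astype(float)
    return (model.predict_proba(X) * n + 1.0) / (n + model.n_classes_)

# Inductive conformal prediction: nonconformity = 1 - smoothed prob of the label
cal_scores = 1.0 - laplace_proba(tree, X_cal)[np.arange(len(y_cal)), y_cal]

eps = 0.1
test_p = laplace_proba(tree, X_test)
pred_sets = []
for i in range(len(X_test)):
    s = set()
    for c in range(tree.n_classes_):
        score = 1.0 - test_p[i, c]
        p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
        if p_value > eps:  # keep every class that cannot be rejected
            s.add(c)
    pred_sets.append(s)
```

    Without the Laplace correction, pure leaf frequencies of 0 or 1 produce many tied nonconformity scores, which is exactly the problem the smoothing addresses.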

  • 37.
    Johansson, Ulf
    et al.
    University of Borås, Sweden.
    Löfström, Tuve
    University of Borås, Sweden.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Overproduce-and-Select: The Grim Reality (2013). In: 2013 IEEE Symposium on Computational Intelligence and Ensemble Learning (CIEL), IEEE conference proceedings, 2013, pp. 52-59. Conference paper (Refereed)
    Abstract [en]

    Overproduce-and-select (OPAS) is a frequently used paradigm for building ensembles. In static OPAS, a large number of base classifiers are trained, before a subset of the available models is selected to be combined into the final ensemble. In general, the selected classifiers are supposed to be accurate and diverse for the OPAS strategy to result in highly accurate ensembles, but exactly how this is enforced in the selection process is not obvious. Most often, either individual models or ensembles are evaluated, using some performance metric, on available and labeled data. Naturally, the underlying assumption is that an observed advantage for the models (or the resulting ensemble) will carry over to test data. In the experimental study, a typical static OPAS scenario, using a pool of artificial neural networks and a number of very natural and frequently used performance measures, is evaluated on 22 publicly available data sets. The discouraging result is that although a fairly large proportion of the ensembles obtained higher test set accuracies, compared to using the entire pool as the ensemble, none of the selection criteria could be used to identify these highly accurate ensembles. Despite only investigating a specific scenario, we argue that the settings used are typical for static OPAS, thus making the results general enough to question the entire paradigm.
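
    A minimal static OPAS loop looks like the following. For speed, this sketch uses bootstrap-trained decision trees instead of the paper's artificial neural networks, and training-set accuracy as the (typical but, per the study, unreliable) selection criterion:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
# Overproduce: a pool of bootstrap-trained base classifiers
pool = []
for i in range(25):
    idx = rng.integers(0, len(X_tr), len(X_tr))
    pool.append(DecisionTreeClassifier(max_depth=5, random_state=i).fit(X_tr[idx], y_tr[idx]))

def vote(members, X):
    # Majority vote over the member predictions
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)

# Select: rank members by accuracy on the labeled training data, keep the top 10
acc = [m.score(X_tr, y_tr) for m in pool]
top = [pool[i] for i in np.argsort(acc)[::-1][:10]]

full_acc = np.mean(vote(pool, X_te) == y_te)
sel_acc = np.mean(vote(top, X_te) == y_te)
```

    Comparing `sel_acc` with `full_acc` over many pools and criteria is essentially the experiment the abstract describes; the paper's finding is that no such criterion reliably identifies the better ensembles on unseen data.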

  • 38. Johansson, Ulf
    et al.
    Löfström, Tuve
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Random brains (2013). In: The 2013 International Joint Conference on Neural Networks (IJCNN): Proceedings, IEEE conference proceedings, 2013, pp. 1-8. Conference paper (Refereed)
    Abstract [en]

    In this paper, we introduce and evaluate a novel method, called random brains, for producing neural network ensembles. The suggested method, which is heavily inspired by the random forest technique, produces diversity implicitly by using bootstrap training and randomized architectures. More specifically, for each multilayer perceptron base classifier, a number of randomly selected links between the input layer and the hidden layer are removed prior to training, resulting in potentially weaker but more diverse base classifiers. The experimental results on 20 UCI data sets show that random brains obtained significantly higher accuracy and AUC, compared to standard bagging of similar neural networks not utilizing randomized architectures. The analysis shows that the main reason for the increased ensemble performance is the ability to produce effective diversity, as indicated by the increase in the difficulty diversity measure.

  • 39. Johansson, Ulf
    et al.
    Sönströd, Cecilia
    Linusson, Henrik
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Regression Trees for Streaming Data with Local Performance Guarantees (2014). Conference paper (Refereed)
    Abstract [en]

    Online predictive modeling of streaming data is a key task for big data analytics. In this paper, a novel approach for efficient online learning of regression trees is proposed, which continuously updates, rather than retrains, the tree as more labeled data become available. A conformal predictor outputs prediction sets instead of point predictions, which for regression translate into prediction intervals. The key property of a conformal predictor is that it is always valid, i.e., the error rate, on novel data, is bounded by a preset significance level. Here, we suggest applying Mondrian conformal prediction on top of the resulting models, in order to obtain regression trees where not only the tree, but also each and every rule, corresponding to a path from the root node to a leaf, is valid. Using Mondrian conformal prediction, it becomes possible to analyze and explore the different rules separately, knowing that their accuracy, in the long run, will not fall below the preset confidence level. An empirical investigation, using 17 publicly available data sets, confirms that the resulting rules are independently valid, but also shows that the prediction intervals are smaller, on average, than when only the global model is required to be valid. All in all, the suggested method provides a data miner or a decision maker with highly informative predictive models of streaming data.
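
    The Mondrian idea of calibrating each rule (root-to-leaf path) separately can be sketched as follows. This is an offline simplification with a scikit-learn regression tree, not the online tree-updating algorithm of the paper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=15.0, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=50, random_state=0).fit(X_tr, y_tr)

eps = 0.1
# Mondrian taxonomy: one category per leaf, i.e., per root-to-leaf rule
cal_leaves = tree.apply(X_cal)
cal_resid = np.abs(y_cal - tree.predict(X_cal))

leaf_q = {}
for leaf in np.unique(cal_leaves):
    r = np.sort(cal_resid[cal_leaves == leaf])
    k = int(np.ceil((1 - eps) * (len(r) + 1))) - 1
    leaf_q[leaf] = r[min(k, len(r) - 1)]  # per-rule conformal quantile

def predict_interval(Xq):
    # Each rule carries its own interval width, valid for that rule alone
    leaves = tree.apply(Xq)
    preds = tree.predict(Xq)
    q = np.array([leaf_q[l] for l in leaves])
    return preds - q, preds + q
```

    A single global quantile would make only the whole tree valid; the per-leaf quantiles are what let each rule be analyzed and trusted separately.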

  • 40. Johansson, Ulf
    et al.
    Sönströd, Cecilia
    Löfström, Tuve
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Obtaining accurate and comprehensible classifiers using oracle coaching (2012). In: Intelligent Data Analysis, ISSN 1088-467X, E-ISSN 1571-4128, Vol. 16, no. 2, pp. 247-263. Article in journal (Refereed)
    Abstract [en]

    While ensemble classifiers often reach high levels of predictive performance, the resulting models are opaque and hence do not allow direct interpretation. When employing methods that do generate transparent models, predictive performance typically has to be sacrificed. This paper presents a method of improving predictive performance of transparent models in the very common situation where instances to be classified, i.e., the production data, are known at the time of model building. This approach, named oracle coaching, employs a strong classifier, called an oracle, to guide the generation of a weaker, but transparent model. This is accomplished by using the oracle to predict class labels for the production data, and then applying the weaker method on this data, possibly in conjunction with the original training set. Evaluation on 30 data sets from the UCI repository shows that oracle coaching significantly improves predictive performance, measured by both accuracy and area under ROC curve, compared to using training data only. This result is shown to be robust for a variety of methods for generating the oracles and transparent models. More specifically, random forests and bagged radial basis function networks are used as oracles, while J48 and JRip are used for generating transparent models. The evaluation further shows that significantly better results are obtained when using the oracle-classified production data together with the original training data, instead of using only oracle data. An analysis of the fidelity of the transparent models to the oracles shows that performance gains can be expected from increasing oracle performance rather than from increasing fidelity. Finally, it is shown that further performance gains can be achieved by adjusting the relative weights of training data and oracle data.
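
    A minimal oracle-coaching sketch, using a random forest as the oracle and a scikit-learn decision tree standing in for J48/JRip as the transparent learner:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# "Production" instances are known at model-building time, but their labels are not
X_tr, X_prod, y_tr, y_prod = train_test_split(X, y, test_size=0.3, random_state=0)

# Strong, opaque oracle predicts labels for the production data
oracle = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
y_oracle = oracle.predict(X_prod)

# Coached transparent model: original training data plus oracle-labeled production data
coached = DecisionTreeClassifier(random_state=0).fit(
    np.vstack([X_tr, X_prod]), np.concatenate([y_tr, y_oracle]))

# Baseline transparent model: training data only
baseline = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

coached_acc = coached.score(X_prod, y_prod)
baseline_acc = baseline.score(X_prod, y_prod)
```

    The combined fit corresponds to the setting the evaluation found strongest: oracle-classified production data used together with, not instead of, the original training set.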

  • 41. Johansson, Ulf
    et al.
    Sönströd, Cecilia
    Norinder, Ulf
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Trade-off between accuracy and interpretability for predictive in silico modeling (2011). In: Future Medicinal Chemistry, ISSN 1756-8919, E-ISSN 1756-8927, Vol. 3, no. 6, pp. 647-663. Article in journal (Refereed)
    Abstract [en]

    Background: Accuracy concerns the ability of a model to make correct predictions, while interpretability concerns the degree to which the model allows for human understanding. Models exhibiting the former property are often more complex and opaque, while interpretable models may lack the necessary accuracy. The trade-off between accuracy and interpretability for predictive in silico modeling is investigated. Method: A number of state-of-the-art methods for generating accurate models are compared with state-of-the-art methods for generating transparent models. Conclusion: Results on 16 biopharmaceutical classification tasks demonstrate that, although the opaque methods generally obtain higher accuracies than the transparent ones, one often has to pay only a quite limited penalty in terms of predictive performance when choosing an interpretable model.

  • 42.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records (2014). In: IEEE International Conference on Healthcare Informatics (ICHI): Proceedings, IEEE Computer Society, 2014, pp. 17-22. Conference paper (Refereed)
    Abstract [en]

    When using electronic health record (EHR) data to build models for predicting adverse drug events (ADEs), one is typically facing the problem of data sparsity, i.e., drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability of the employed machine learning technique to effectively handle sparsity is crucial. The state-of-the-art random forest algorithm is frequently employed to handle this type of data. It has, however, recently been demonstrated that the algorithm is biased towards the majority class, which may result in low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handle this problem are empirically evaluated using 14 ADE datasets and three performance metrics: F1-score, AUC and Brier score. Two resampling-based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches when considering F1-score, which is consistent with the metric being affected by class bias. The approaches perform on a similar level with respect to AUC, which can be explained by the metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach and that this outweighs the expected loss from a lower variance. The main conclusion is that the suggested choice of approach to handle sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling-based approach is recommended. If the task is to rank patients according to risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.
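
    The metric-dependence the abstract describes can be probed with a small experiment of this kind. Here scikit-learn's `class_weight="balanced_subsample"` serves only as a stand-in for the paper's resampling-based techniques, and the binarized synthetic data is a rough proxy for EHR sparsity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; binarize so most features are absent for most rows
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X = (X > 1.0).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "baseline": RandomForestClassifier(n_estimators=200, random_state=0),
    # Reweighting within each bootstrap sample mimics resampling against class bias
    "balanced": RandomForestClassifier(n_estimators=200, random_state=0,
                                       class_weight="balanced_subsample"),
}
results = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    p = m.predict_proba(X_te)[:, 1]
    results[name] = {"f1": f1_score(y_te, m.predict(X_te)),
                     "auc": roc_auc_score(y_te, p),
                     "brier": brier_score_loss(y_te, p)}
```

    Comparing the two rows of `results` per metric mirrors the study's point: F1 is sensitive to class bias, AUC largely is not, and the Brier score can favor the plain baseline.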

  • 43.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Predicting Adverse Drug Events using Heterogeneous Event Sequences (2016). In: 2016 IEEE International Conference on Healthcare Informatics (ICHI), IEEE Computer Society, 2016, pp. 356-362. Conference paper (Refereed)
    Abstract [en]

    Adverse drug events (ADEs) are known to be severely under-reported in electronic health record (EHR) systems. One approach to mitigate this problem is to employ machine learning methods to detect and signal for potentially missing ADEs, with the aim of increasing reporting rates. There are, however, many challenges involved in constructing prediction models for this task, since data present in health care records is heterogeneous, high dimensional, sparse and temporal. Previous approaches typically employ bag-of-items representations of clinical events that are present in a record, ignoring the temporal aspects. In this paper, we study the problem of classifying heterogeneous and multivariate event sequences using a novel algorithm building on the well known concept of ensemble learning. The proposed approach is empirically evaluated using 27 datasets extracted from a real EHR database with different ADEs present. The results indicate that the proposed approach, which explicitly models the temporal nature of clinical data, can be expected to outperform, in terms of the trade-off between precision and specificity, models that do not consider the temporal aspects.

  • 44.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Persson, Hans E.
    Mining disproportional itemsets for characterizing groups of heart failure patients from administrative health records (2017). In: Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, Association for Computing Machinery (ACM), 2017, pp. 394-398. Conference paper (Refereed)
    Abstract [en]

    Heart failure is a serious medical condition involving decreased quality of life and an increased risk of premature death. A recent evaluation by the Swedish National Board of Health and Welfare shows that Swedish heart failure patients are often undertreated and do not receive basic medication as recommended by the national guidelines for treatment of heart failure. The objective of this paper is to use registry data to characterize groups of heart failure patients, with an emphasis on basic treatment. Towards this end, we explore the applicability of frequent itemset mining and disproportionality analysis for finding interesting and distinctive characterizations of a target group of patients, e.g., those who have received basic treatment, against a control group, e.g., those who have not received basic treatment. Our empirical evaluation is performed on data extracted from administrative health records from Stockholm County covering the years 2010–2016. Our findings suggest that frequency is not always the most appropriate measure of importance for frequent itemsets, while itemset disproportionality against a control group provides alternative rankings of the extracted itemsets, leading to some medically intuitive characterizations of the target groups.
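
    Disproportionality ranking of itemsets can be sketched with a toy example. The transactions and item names below are entirely hypothetical, and the smoothing constant in the ratio is an arbitrary choice:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: items could be drugs or diagnoses per patient
target = [{"ACE", "beta", "statin"}, {"ACE", "beta"}, {"ACE", "statin"},
          {"beta", "statin"}, {"ACE", "beta", "diuretic"}]
control = [{"diuretic"}, {"ACE"}, {"diuretic", "statin"},
           {"beta"}, {"diuretic", "ACE"}]

def itemset_support(transactions, max_size=2):
    # Support of every itemset up to max_size, as a fraction of transactions
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items()}

sup_t = itemset_support(target)
sup_c = itemset_support(control)

# Disproportionality: support ratio against the control group (smoothed)
ranked = sorted(sup_t, key=lambda s: sup_t[s] / (sup_c.get(s, 0) + 0.01), reverse=True)
```

    Ranking by the ratio rather than by raw support is the point of the abstract: itemsets frequent in both groups drop down the list, while itemsets distinctive of the target group rise to the top.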

  • 45.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Early Random Shapelet Forest (2016). In: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings / [ed] Toon Calders, Michelangelo Ceci, Donato Malerba, Springer, 2016, pp. 261-276. Conference paper (Refereed)
    Abstract [en]

    Early classification of time series has emerged as an increasingly important and challenging problem within signal processing, especially in domains where timely decisions are critical, such as medical diagnosis in health-care. Shapelets, i.e., discriminative sub-sequences, have been proposed for time series classification as a means to capture local and phase independent information. Recently, forests of randomized shapelet trees have been shown to produce state-of-the-art predictive performance at a low computational cost. In this work, they are extended to allow for early classification of time series. An extensive empirical investigation is presented, showing that the proposed algorithm is superior to alternative state-of-the-art approaches when predictive performance is considered more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.

  • 46.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Forests of Randomized Shapelet Trees (2015). In: Statistical Learning and Data Sciences: Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, pp. 126-136. Conference paper (Refereed)
    Abstract [en]

    Shapelets have recently been proposed for data series classification, due to their ability to capture phase independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is however computationally costly, and although several techniques for speeding up the technique have been proposed, the computational cost is still in many cases prohibitive. In this work, an ensemble based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.
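
    The two core building blocks, phase-independent shapelet matching and randomized shapelet sampling, can be sketched as follows (function names are our own; this is not the authors' implementation):

```python
import numpy as np

def shapelet_distance(shapelet, series):
    # Minimum Euclidean distance between the shapelet and any equal-length
    # subsequence of the series (phase-independent matching)
    w = len(shapelet)
    return min(np.linalg.norm(series[i:i + w] - shapelet)
               for i in range(len(series) - w + 1))

rng = np.random.default_rng(0)

def sample_shapelets(X, n_shapelets=10, min_len=5, max_len=20):
    # Randomized extraction: random series, random length, random start,
    # as in the per-node sampling of a randomized shapelet tree
    out = []
    for _ in range(n_shapelets):
        row = X[rng.integers(len(X))]
        w = rng.integers(min_len, max_len + 1)
        start = rng.integers(len(row) - w + 1)
        out.append(row[start:start + w])
    return out

# Transform each series into shapelet-distance features a tree could split on
X = rng.normal(size=(20, 100))
shapelets = sample_shapelets(X)
features = np.array([[shapelet_distance(s, x) for s in shapelets] for x in X])
```

    Sampling shapelets instead of enumerating them is what removes the prohibitive cost the abstract mentions; the forest then averages over many such randomized trees.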

  • 47.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Generalized random shapelet forests (2016). In: Data Mining and Knowledge Discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 30, no. 5, pp. 1053-1085. Article in journal (Refereed)
    Abstract [en]

    Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which, together with the inherent difficulty of the decision tree learning algorithm in effectively handling high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly, for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.

  • 48.
    Karlsson, Isak
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Predicting Adverse Drug Events by Analyzing Electronic Patient Records (2013). In: Artificial Intelligence in Medicine: 14th Conference on Artificial Intelligence in Medicine, AIME 2013. Proceedings / [ed] Niels Peek, Roque Marín Morales, Mor Peleg, Springer Berlin/Heidelberg, 2013, Vol. 7885, pp. 125-129. Conference paper (Refereed)
    Abstract [en]

    Diagnosis codes for adverse drug events (ADEs) are sometimes missing from electronic patient records (EPRs). In the worst case, this may affect not only patient safety but also the number of reported ADEs, resulting in incorrect risk estimates of prescribed drugs. Large databases of electronic patient records are potentially valuable sources of information to support the identification of ADEs. This study investigates the use of machine learning for predicting one specific ADE based on information extracted from EPRs, including age, gender, diagnoses and drugs. Several predictive models are developed and evaluated using different learning algorithms and feature sets. The highest observed AUC is 0.87, obtained by the random forest algorithm. The resulting model can be used for screening EPRs that are not, but possibly should be, assigned a diagnosis code for the ADE under consideration. Preliminary results from using the model are presented.

  • 49.
    Karunaratne, Thashmee
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Can frequent itemset mining be efficiently and effectively used for learning from graph data? (2012). In: 11th International Conference on Machine Learning and Applications (ICMLA) / [ed] M. Arif Wani, Taghi Khoshgoftaar, Xingquan (Hill) Zhu, Naeem Seliya, IEEE, 2012, Vol. 1, pp. 409-414. Conference paper (Refereed)
    Abstract [en]

    Standard graph learning approaches are often challenged by the computational cost involved when learning from very large sets of graph data. One approach to overcome this problem is to transform the graphs into less complex structures that can be more efficiently handled. One obvious potential drawback of this approach is that it may degrade predictive performance due to loss of information caused by the transformations. An investigation of the tradeoff between efficiency and effectiveness of graph learning methods is presented, in which state-of-the-art graph mining approaches are compared to representing graphs by itemsets, using frequent itemset mining to discover features to use in prediction models. An empirical evaluation on 18 medicinal chemistry datasets is presented, showing that employing frequent itemset mining results in significant speedups, without sacrificing predictive performance for both classification and regression.

  • 50.
    Karunaratne, Thashmee
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    DIFFER: A Propositionalization Approach for Learning from Structured Data (2006). In: Proceedings of World Academy of Science, Engineering and Technology, ISSN 2010-376X, Vol. 15, pp. 49-51. Article in journal (Refereed)
    Abstract [en]

    Logic-based methods for learning from structured data are limited with respect to handling large search spaces, preventing large substructures from being considered by the resulting classifiers. A novel approach to learning from structured data is introduced that employs a structure transformation method, called finger printing, to address these limitations. The method, which generates features corresponding to arbitrarily complex substructures, is implemented in a system called DIFFER. The method is demonstrated to perform comparably to an existing state-of-the-art method on some benchmark data sets without requiring restrictions on the search space. Furthermore, learning from the union of features generated by finger printing and the previous method outperforms learning from each individual set of features on all benchmark data sets, demonstrating the benefit of developing complementary, rather than competing, methods for structure classification.
