Similarity assessment for removal of noisy end user license agreements
Blekinge Institute of Technology, School of Computing
2012 (English)
In: Knowledge and Information Systems, ISSN 0219-1377, Vol. 32, no 1, 167-189 p.
Article in journal (Refereed) Published
In previous work, we showed that it is possible to automatically discriminate between legitimate software and spyware-associated software by performing supervised learning on end user license agreements (EULAs). However, the number of false positives (spyware classified as legitimate software) was too large for practical use. In this study, the false positive problem is addressed by removing noisy EULAs, which are identified by performing similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods for this purpose are experimentally compared: cosine similarity assessment in conjunction with latent semantic analysis (LSA), and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve performance even further and to decrease the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD greatly reduced the number of attributes while at the same time increasing classification performance on the original data set, as well as on the LSA- and NCD-based data sets.
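As a rough illustration of the two similarity measures named in the abstract, the following sketch computes a normalized compression distance and a cosine similarity between two texts. It is not the authors' implementation: the choice of zlib as the compressor, the whitespace tokenization, and the function names `compressed_size`, `ncd`, and `cosine_similarity` are all assumptions made for this example, and the LSA projection that the study applies before the cosine comparison is omitted here.

```python
import math
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar texts; values near 1 suggest dissimilar texts.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cosine_similarity(x: str, y: str) -> float:
    """Cosine similarity of raw term-count vectors.
    Simplification: no LSA dimensionality reduction is applied before comparing.
    """
    vx, vy = Counter(x.lower().split()), Counter(y.lower().split())
    dot = sum(vx[t] * vy[t] for t in vx.keys() & vy.keys())
    norm_x = math.sqrt(sum(c * c for c in vx.values()))
    norm_y = math.sqrt(sum(c * c for c in vy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_x * norm_y)
```

Under these assumptions, two EULAs with a low `ncd` value or a high `cosine_similarity` value would be treated as near-duplicates. How such scores are turned into a decision about which EULAs count as noise is specific to the study and is not reproduced here.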
Place, publisher, year, edition, pages
Springer, 2012. Vol. 32, no 1, 167-189 p.
Keywords
End user license agreement, Latent semantic analysis, Normalized compression distance, Spyware
Identifiers
URN: urn:nbn:se:bth-7176
DOI: 10.1007/s10115-011-0438-9
ISI: 000305692000007
Local ID: oai:bth.se:forskinfoFBDEF2128A7A7A8AC12578DE000AEEB5
OAI: oai:DiVA.org:bth-7176
DiVA: diva2:834758