Using similarity network analysis to improve text similarity calculations
2025 (English)
In: Applied Network Science, E-ISSN 2364-8228, Vol. 10, no 1, article id 8
Article in journal (Refereed), Published
Abstract [en]
Similarity-based analysis is a powerful and intuitive tool for exploring large data sets, for instance, for revealing patterns by grouping items by similarity or for recommending items based on selected samples. However, similarity is an abstract and subjective property which makes it hard to evaluate by a purely computational approach. Furthermore, there are usually several possible computational models that could be applied to the data, each with its own strengths and weaknesses. With this in mind, we aim to extend the research frontier regarding what impact the choice of a computational model may have on the results. In this paper, we target the scope of embedding-based similarity calculations on text documents and seek to answer the research question: “How can a better understanding of the continuous similarity distribution captured by different models lead to better similarity calculations on document sets?”. We propose a new and generic methodology based on similarity network comparison, and based on this approach, we have developed a computational pipeline together with a prototype visual analytics tool that allows the user to easily assess the level of model agreement/disagreement. To demonstrate the potential of our method, as well as showing its application to real world scenarios, we apply it in an experimental setup using three state-of-the-art text embedding models and three different text corpora. In view of the surprisingly low level of model agreement regarding the data, we also discuss strategies for handling model disagreement.
Place, publisher, year, edition, pages
Springer Nature, 2025. Vol. 10, no 1, article id 8
Keywords [en]
Embeddings, Similarity networks, Text similarity calculations, Visual analytics, Network embeddings, Computational modelling, Item-based, Large datasets, Property, Similarity calculation, Similarity network, Text similarity, Text similarity calculation
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:bth-27685
DOI: 10.1007/s41109-025-00699-7
ISI: 001467943200001
Scopus ID: 2-s2.0-105000480934
OAI: oai:DiVA.org:bth-27685
DiVA, id: diva2:1950315
Projects
Rekryteringar 21
Funder
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Knowledge Foundation, 20210077
2025-04-07 2025-04-07 2025-05-05
Bibliographically approved