Digitala Vetenskapliga Arkivet

Probabilistic metric space for machine learning: data and model spaces
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0003-4233-3519
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Probabilistiska metriska rum för maskininlärning : data- och modellrum (Swedish)
Abstract [en]

Machine learning models are inherently shaped by the data used to train them. Understanding the relationship between datasets and the models they generate is essential for tasks such as model selection, privacy metrics, and robustness evaluation. This thesis presents a rigorous mathematical framework for comparing machine learning models and algorithms by formalizing the interaction between two fundamental spaces: the database space, which captures possible datasets, and the model space, which contains the models or hypotheses derived from those datasets. A central motivation stems from the observation that different datasets can lead to the same or highly similar models. Such recurrent models—which arise frequently across diverse data sources—are particularly significant in privacy-sensitive applications. Their recurrence suggests reduced dependence on any specific data point or subgroup, thus offering inherent privacy and generalization benefits. By quantifying the relationship between models and their generating data, this work enables principled evaluation of a model’s robustness and disclosure risk.

To formalize relationships between the two spaces, the thesis develops a family of probabilistic metric space constructions tailored to different aspects of the data–model interaction. The first contribution models database evolution as a Markov process and defines probabilistic distances between models based on the likelihood of transitioning between their generating datasets. The second contribution introduces F-space, a framework based on fuzzy measures that captures richer structural properties of the data—such as redundancy, synergy, and overlap among subsets. Building on this, the third contribution applies the F-space theory in practical machine learning scenarios. It demonstrates how fuzzy measures can be used to compare different linear regression algorithms trained over structured subsets of real datasets. The final contribution further generalizes the framework through Generalized F-spaces, where the model space itself is endowed with probabilistic structure—allowing uncertainty in both the datasets and the model outputs to be captured simultaneously.

Together, these constructions offer a principled alternative to traditional model comparison metrics. Rather than relying solely on pointwise loss or accuracy, the proposed framework incorporates the diversity, dynamics, and internal structure of the data that underlies each model—enabling more robust and privacy-aware assessments.

Abstract [sv] (translated to English)

Machine learning models are fundamentally shaped by the data they are trained on. Understanding the relationship between datasets and the models generated from them is crucial for tasks such as model selection, privacy measurement, and robustness analysis. This thesis presents a rigorous mathematical framework for comparing machine learning models and algorithms by formalizing the interplay between two fundamental spaces: the database space, which represents possible datasets, and the model space, which contains the models or hypotheses derived from those datasets.

A central motivation is the observation that different datasets can lead to the same or very similar models. Such recurrent models, which often arise across varied data sources, are particularly significant in privacy-sensitive applications. Their recurrence suggests a reduced dependence on individual data points or subgroups, which brings benefits for both privacy and generalizability. By quantifying the relationship between models and their generating data, this work enables a principled evaluation of a model's robustness and disclosure risk.

To formalize the relationship between the two spaces, the thesis introduces a family of probabilistic metric spaces, adapted to different aspects of the interplay between data and models. The first contribution models the evolution of databases as a Markov process and defines probabilistic distances between models based on the probability of transitioning between their generating datasets. The second contribution introduces F-space, a framework based on fuzzy measures that captures richer structural properties of the data, such as redundancy, synergy, and overlap among subsets. The third contribution applies F-space theory in practical machine learning scenarios. It shows how fuzzy measures can be used to compare different linear regression algorithms trained on structured subsets of real datasets. The fourth and final contribution further generalizes the framework through Generalized F-spaces, in which the model space is also given a probabilistic structure, making it possible to capture uncertainty in both the dataset and the model outputs simultaneously.

Together, these constructions offer a principled alternative to traditional comparison measures for models. Instead of relying solely on pointwise errors or accuracy, the proposed framework takes into account the diversity, dynamics, and internal structure of the data, enabling more robust and privacy-aware analyses.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025, p. 50
Series
Report / UMINF, ISSN 0348-0542 ; 25.05
Keywords [en]
probabilistic metric space, space of data, space of models, fuzzy measures
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-238256
ISBN: 978-91-8070-681-0 (electronic)
ISBN: 978-91-8070-680-3 (print)
OAI: oai:DiVA.org:umu-238256
DiVA, id: diva2:1954823
Public defence
2025-05-23, Hörsal NAT.D. 360, Naturvetarhuset, Umeå, 09:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-04-30 Created: 2025-04-28 Last updated: 2025-04-28 Bibliographically approved
List of papers
1. The space of models in machine learning: using Markov chains to model transitions
2021 (English) In: Progress in Artificial Intelligence, ISSN 2192-6352, E-ISSN 2192-6360, Vol. 10, no 3, p. 321-332
Article in journal (Refereed) Published
Abstract [en]

Machine and statistical learning is about constructing models from data. Data is usually understood as a set of records, a database. Nevertheless, databases are not static but change over time. We can understand this as follows: there is a space of possible databases, and a database traverses this space during its lifetime. Therefore, we may consider transitions between databases, and the database space. NoSQL databases also fit this representation. In addition, when we learn models from databases, we can also consider the space of models. Naturally, there are relationships between the space of data and the space of models. Any transition in the space of data may correspond to a transition in the space of models. We argue that a better understanding of the space of data and the space of models, as well as of the relationships between these two spaces, is fundamental for machine and statistical learning. The relationship between these two spaces can be exploited in several contexts, e.g., in model selection and data privacy. We consider that this relationship between spaces is also fundamental to understanding generalization and overfitting. In this paper, we develop these ideas. Then, we consider a distance on the space of models based on a distance on the space of data. More particularly, we consider distance distribution functions and probabilistic metric spaces on the space of data and the space of models. Our modelling of changes in databases is based on Markov chains and transition matrices. This modelling is used in the definition of distances. We provide examples of our definitions.
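
As a rough illustration of how a Markov chain over databases can induce a distance distribution function, the sketch below uses first-passage probabilities: F(x) is the probability that the chain moves between two database states in fewer than x transitions, made symmetric by taking the minimum over both directions. This particular construction, the function names `hit_within` and `distance_distribution`, and the toy two-state transition matrix `P` are assumptions made for illustration; they are not taken from the paper, whose definitions may differ.

```python
import math

def hit_within(P, src, dst, steps):
    """Probability that a chain started in `src` visits `dst` within `steps` transitions."""
    if src == dst:
        return 1.0
    n = len(P)
    alive = [0.0] * n          # probability mass that has not reached dst yet
    alive[src] = 1.0
    hit = 0.0
    for _ in range(steps):
        nxt = [0.0] * n
        for s in range(n):
            for t in range(n):
                nxt[t] += alive[s] * P[s][t]
        hit += nxt[dst]        # mass arriving at dst for the first time
        nxt[dst] = 0.0         # drop it so it is not counted again
        alive = nxt
    return hit

def distance_distribution(P, a, b, x):
    """Illustrative F_ab(x): chance of passing between a and b in fewer than x steps."""
    if x <= 0:
        return 0.0
    steps = math.ceil(x) - 1   # largest whole number of steps strictly below x
    # Assumption: symmetrize with min so that F_ab == F_ba.
    return min(hit_within(P, a, b, steps), hit_within(P, b, a, steps))

# Hypothetical transition matrix over two database states.
P = [[0.5, 0.5],
     [0.2, 0.8]]

for x in (1, 2, 3, 10):
    print(x, round(distance_distribution(P, 0, 1, x), 4))
```

With this choice F(0) = 0, F is non-decreasing in x, and the distribution for a state compared with itself jumps to 1 immediately after 0, which are the basic properties expected of a distance distribution function. Whether a triangle-function condition holds depends on the chain and is not checked here.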

Place, publisher, year, edition, pages
Springer, 2021
Keywords
Hypothesis space, Machine and statistical learning models, Probabilistic metric spaces, Space of data, Space of models
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-183009 (URN)
10.1007/s13748-021-00242-6 (DOI)
000639627000001 ()
2-s2.0-85104447939 (Scopus ID)
Funder
Swedish Research Council, 2016-03346
Swedish Research Council, 2017-2020
Knut and Alice Wallenberg Foundation
Available from: 2021-05-18 Created: 2021-05-18 Last updated: 2025-04-30 Bibliographically approved
2. On the definition of probabilistic metric spaces by means of fuzzy measures
2023 (English) In: Fuzzy sets and systems (Print), ISSN 0165-0114, E-ISSN 1872-6801, Vol. 465, article id 108528
Article in journal (Refereed) Published
Abstract [en]

Metric spaces are defined in terms of a space and a metric, or distance. Probabilistic metric spaces are a useful extension of metric spaces where the distance is a distribution instead of a number. In this way, we can take into account uncertainty. Then, the triangle inequality is replaced by a condition based on triangle functions on the distributions. In this paper we introduce F-spaces, a new type of probabilistic metric space based on fuzzy measures (also known as non-additive measures and capacities). We prove some properties that describe which families of fuzzy measures are compatible with which types of triangle functions. Then, we show how Sugeno integrals, Choquet integrals, and, in general, any other fuzzy integral can be used as a tool for building these spaces. We show how these results can be used to compute distances between functions. We illustrate this with an example comparing three types of means applied to a set of databases. The example uses Sugeno λ-measures to illustrate the theoretical results presented in the paper.
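
For readers unfamiliar with the tools named in this abstract, the following sketch shows the standard definitions of a Sugeno λ-measure on a finite set and of the discrete Choquet integral with respect to it. The function names, the bisection solver, and the example densities are assumptions made for illustration; the code assumes at least two densities, each strictly between 0 and 1, and non-negative integrands. It is not the paper's implementation.

```python
from math import prod

def solve_lambda(densities, iters=200):
    """Find lam > -1 with prod(1 + lam*v) == 1 + lam (lam = 0 when the densities sum to 1)."""
    assert len(densities) >= 2 and all(0.0 < v < 1.0 for v in densities)
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0                      # additive case: an ordinary probability
    g = lambda lam: prod(1.0 + lam * v for v in densities) - (1.0 + lam)
    if total < 1.0:
        lo, hi = 0.0, 1.0               # the root is positive
        while g(hi) <= 0.0:
            hi *= 2.0
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if g(mid) > 0.0:
                hi = mid
            else:
                lo = mid
    else:
        lo, hi = -1.0, 0.0              # the root lies in (-1, 0)
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if g(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    return (lo + hi) / 2.0

def sugeno_measure(subset, densities, lam):
    """mu(A) for a collection of element indices A."""
    if lam == 0.0:
        return sum(densities[i] for i in subset)
    return (prod(1.0 + lam * densities[i] for i in subset) - 1.0) / lam

def choquet(values, densities, lam):
    """Discrete Choquet integral of non-negative `values` with respect to mu."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        upper = order[k:]               # elements whose value is at least values[i]
        total += (values[i] - prev) * sugeno_measure(upper, densities, lam)
        prev = values[i]
    return total

# Hypothetical densities for three elements (for example, three databases).
densities = [0.2, 0.3, 0.1]
lam = solve_lambda(densities)
print(lam, sugeno_measure(range(3), densities, lam))
print(choquet([0.4, 0.9, 0.1], densities, lam))
```

When the densities sum to 1 the measure is additive and the Choquet integral reduces to a weighted mean; densities summing to less than 1 give λ > 0, and densities summing to more than 1 give -1 < λ < 0.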

Place, publisher, year, edition, pages
Elsevier, 2023
Keywords
Fuzzy integrals, Fuzzy measures, Probabilistic metric spaces
National Category
Computer Sciences; Computer Systems; Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-207877 (URN)
10.1016/j.fss.2023.108528 (DOI)
001012160000001 ()
2-s2.0-85153802533 (Scopus ID)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-05-04 Created: 2023-05-04 Last updated: 2025-04-30 Bibliographically approved
3. Measuring the distance between machine learning models using F-space
2023 (English) In: Fuzzy Logic and Technology, and Aggregation Operators: 13th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2023, and 12th International Summer School on Aggregation Operators, AGOP 2023. Palma de Mallorca, Spain, September 4–8, 2023, Proceedings / [ed] Sebastia Massanet; Susana Montes; Daniel Ruiz-Aguilera; Manuel González-Hidalgo, Springer Science+Business Media B.V., 2023, p. 307-319
Conference paper, Published paper (Refereed)
Abstract [en]

Probabilistic metric spaces are a natural generalization of metric spaces in which the function that computes the distance outputs a distribution on the real numbers rather than a single number. Such a function is called a distribution function. In this paper, we construct a distance for linear regression models using one type of probabilistic metric space called F-space. F-spaces use fuzzy measures to evaluate a set of elements under certain conditions. By using F-spaces to build a metric on machine learning models, we can represent more complex interactions among the databases that generate these models.
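
One way to picture this construction: take a finite collection of databases as the underlying set, treat each learning algorithm as a function from a database to a fitted model, and let the distribution function at x be the fuzzy measure of the set of databases on which the two fitted models are closer than x. The sketch below follows that reading for two one-parameter regressors (least squares through the origin, with and without a ridge penalty). The toy databases, the ridge penalty `alpha`, the absolute difference of slopes as the model distance, the strict inequality, and the distorted-cardinality measure `mu` are all illustrative assumptions, not the setup used in the paper.

```python
def fit_ols(db):
    """Slope of least squares through the origin for a list of (x, y) pairs."""
    return sum(x * y for x, y in db) / sum(x * x for x, _ in db)

def fit_ridge(db, alpha=1.0):
    """Same model with a ridge penalty alpha on the slope (hypothetical value)."""
    return sum(x * y for x, y in db) / (sum(x * x for x, _ in db) + alpha)

def mu(subset, universe_size):
    """Fuzzy measure mu(A) = (|A| / n)^2: monotone, mu(empty) = 0, mu(X) = 1, not additive."""
    return (len(subset) / universe_size) ** 2

def f_space_distribution(databases, algo_f, algo_g, x):
    """Measure of the databases on which the two fitted slopes differ by less than x."""
    close = [i for i, db in enumerate(databases)
             if abs(algo_f(db) - algo_g(db)) < x]
    return mu(close, len(databases))

# Hypothetical databases, each a small list of (x, y) records.
databases = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 1.0), (3.0, 3.0)],
    [(1.0, 3.0), (1.0, 3.0)],
    [(2.0, 2.0), (4.0, 4.0)],
]

for x in (0.05, 0.1, 0.5, 2.0):
    print(x, f_space_distribution(databases, fit_ols, fit_ridge, x))
```

Replacing `mu` with a different fuzzy measure changes how much weight groups of databases receive jointly, which is where interactions such as redundancy or synergy among databases would enter.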

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14069
Keywords
Fuzzy Measures, Machine Learning, Probabilistic Metric Space
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:umu:diva-214994 (URN)
10.1007/978-3-031-39965-7_26 (DOI)
2-s2.0-85172232932 (Scopus ID)
9783031399640 (ISBN)
978-3-031-39965-7 (ISBN)
Conference
13th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2023, and 12th International Summer School on Aggregation Operators, AGOP 2023, Palma de Mallorca, Spain, September 4–8, 2023.
Available from: 2023-10-16 Created: 2023-10-16 Last updated: 2025-04-30 Bibliographically approved
4. Generalized F-spaces through the lens of fuzzy measures
2025 (English) In: Fuzzy sets and systems (Print), ISSN 0165-0114, E-ISSN 1872-6801, Vol. 507, article id 109317
Article in journal (Refereed) Published
Abstract [en]

Probabilistic metric spaces are natural extensions of metric spaces, where the function that computes the distance outputs a distribution on the real numbers rather than a single value. Such a function is called a distribution function. F-spaces are constructions for probabilistic metric spaces, where the distribution functions are built for functions that map from a measurable space to a metric space. In this paper, we propose an extension of F-spaces, called Generalized F-space. This construction replaces the metric space with a probabilistic metric space and uses fuzzy measures to evaluate sets of elements whose distances are probability distributions. We present several results that establish connections between the properties of the constructed space and specific fuzzy measures under particular triangular norms. Furthermore, we demonstrate how the space can be applied in machine learning to compute distances between different classifier models. Experimental results based on Sugeno λ-measures are consistent with our theoretical findings.

Keywords
Fuzzy measures, Probabilistic metric space
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:umu:diva-235860 (URN)
10.1016/j.fss.2025.109317 (DOI)
001428707700001 ()
2-s2.0-85217744245 (Scopus ID)
Available from: 2025-02-24 Created: 2025-02-24 Last updated: 2025-04-30 Bibliographically approved

Open Access in DiVA

fulltext (680 kB), 43 downloads
File information
File name: FULLTEXT01.pdf
File size: 680 kB
Checksum SHA-512:
0f083a972db2101242cae4fb4bf7cb11062fce0506492a426f0dfd4ca5f842f4ab87bbb6ae0b0a754a3afe3af05545ce830693940295b8511887cc88297bcb7b
Type: fulltext
Mimetype: application/pdf
spikblad (114 kB), 22 downloads
File information
File name: SPIKBLAD01.pdf
File size: 114 kB
Checksum SHA-512:
910ff0fd137fc09511eac0edcdb02e15e19272a2a2060c90deb8be3d9e00d7889e0781ae271c5f37942c4cd6a994db628900ff3c345a6d86fba1859415fb59e1
Type: spikblad
Mimetype: application/pdf

Search in DiVA

By author/editor
Taha, Mariam
By organisation
Department of Computing Science
Computer Sciences

Total: 44 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 640 hits