Clustering User Behavior in Scientific Collections
This master's thesis examines how clustering techniques can be applied
to a collection of scientific documents. Approximately one year of server
logs from the CERN Document Server (CDS) is analyzed and preprocessed.
Based on the findings of this analysis, and a review of the
current state of the art, three different clustering methods are selected
for further work: Simple k-Means, Hierarchical Agglomerative Clustering
(HAC) and Graph Partitioning. In addition, a custom agglomerative
clustering algorithm is developed in an attempt to address some of the
problems encountered during the experiments with k-Means and HAC. The results
from k-Means and HAC are poor, but the graph partitioning method
yields some promising results.
The main conclusion of this thesis is that the inherent clusters within
the user-record relationship of a scientific collection are nebulous but
present. Furthermore, the most common clustering algorithms are not
suitable for this type of clustering.
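As an illustration of what clustering on the user-record relationship can look like, the following is a minimal sketch, not the thesis's own method or its custom algorithm. It assumes a hypothetical toy log of (user, record) pairs, measures record similarity by the Jaccard overlap of the user sets, and uses a plain single-link agglomerative merge with an arbitrary similarity threshold. All names, the data, and the threshold are assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def records_to_users(log):
    """Map each record id to the set of users that accessed it."""
    users = {}
    for user, record in log:
        users.setdefault(record, set()).add(user)
    return users


def single_link_hac(users_by_record, threshold):
    """Repeatedly merge the two most similar clusters until no pair
    of clusters reaches the similarity threshold."""
    clusters = [frozenset([r]) for r in sorted(users_by_record)]

    def sim(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(
            jaccard(users_by_record[x], users_by_record[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda pair: sim(*pair))
        if sim(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


# Hypothetical toy log of (user, record) accesses; not data from CDS.
log = [
    ("u1", "r1"), ("u1", "r2"), ("u2", "r1"), ("u2", "r2"),
    ("u3", "r3"), ("u3", "r4"), ("u4", "r3"), ("u4", "r4"),
    ("u1", "r3"),
]
clusters = single_link_hac(records_to_users(log), threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

In this toy log, one cross-group access (u1 reading r3) lowers the similarity between r3 and r4 without merging the two groups. Real access logs would presumably contain far more such overlap, which is one way clusters can become hard to separate.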
Place, publisher, year, edition, pages
Institutt for datateknikk og informasjonsvitenskap, 2014. 114 p.
ntnudaim:12121, MTDT Datateknologi, Data- og informasjonsforvaltning
Identifiers
URN: urn:nbn:no:ntnu:diva-27340
Local ID: ntnudaim:12121
OAI: oai:DiVA.org:ntnu-27340
DiVA: diva2:769314
Ramampiaro, Herindrasana, Associate Professor