Functionality Classification Filter for Websites
Linköping University, Department of Computer and Information Science, Statistics. Linköping University, The Institute of Technology.
2013 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

The objective of this thesis is to evaluate different models and methods for website classification. The websites are classified based on their functionality, in this case specifically whether they are forums, news sites or blogs. The analysis aims at solving a search engine problem: for the results of an information search, it is of interest to know which categories they come from.

The data consist of two data sets, extracted from the web in January and April 2013. Together these data sets contain approximately 40,000 observations, each observation being the text extracted from a website. Approximately 7,000 new word variables were subsequently created from this text, as were variables based on Latent Dirichlet Allocation. One variable (the number of links) was created from the HTML code of the website.

These data sets are used both to fit multinomial logistic regression models with Lasso regularization and to create a Naive Bayes classifier. The best classifier for the data studied was obtained with multinomial logistic regression, applying Lasso to all variables to reduce their number. The accuracy of this model is 99.70 %.
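
For reference, multinomial logistic regression with a Lasso (L1) penalty is commonly written as below. The notation is an assumption made here for illustration and is not taken from the thesis: K classes (here forum, news site, blog), N observations, coefficient vectors \beta_k with intercepts \beta_{0k}, and penalty weight \lambda. The L1 penalty shrinks some coefficients to exactly zero, which is how it reduces the number of variables.

```latex
\[
P(y = k \mid x) = \frac{\exp(\beta_{0k} + x^{\top}\beta_k)}{\sum_{j=1}^{K} \exp(\beta_{0j} + x^{\top}\beta_j)},
\qquad
\hat{\beta} = \arg\min_{\beta} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i) + \lambda \sum_{k=1}^{K} \lVert \beta_k \rVert_1 \right\}
\]
```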

When the time dependency of the models is considered, using the first data set to build the model and the second for testing, the accuracy is only 90.74 %. This indicates that the data are time dependent and that website topics change over time.
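
The time-split evaluation described above (fit on the first data set, test on the second) can be sketched with a small multinomial Naive Bayes classifier on word counts. This is a minimal illustration only: the abstract does not describe the thesis implementation, and the tokenizer, function names, smoothing parameter `alpha`, and toy documents below are all hypothetical stand-ins rather than the thesis data or code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case whitespace tokenizer; a stand-in for real text extraction."""
    return text.lower().split()


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "class_counts": class_counts,
        "word_counts": word_counts,
        "n_docs": len(docs),
    }


def predict(model, text):
    """Return the class with the highest log posterior for one document."""
    alpha, vocab = model["alpha"], model["vocab"]
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values()) + alpha * len(vocab)
        score = math.log(n_label / model["n_docs"])  # log prior
        for token in tokenize(text):
            if token in vocab:  # words unseen in training are ignored
                score += math.log((counts[token] + alpha) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Share of documents whose predicted class matches the true class."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Made-up toy documents standing in for the two extraction dates.
first_docs = [
    "reply quote thread posted by member",
    "breaking news reporter said today",
    "my diary post comments archive",
]
first_labels = ["forum", "news", "blog"]

second_docs = [
    "thread reply by new member",
    "reporter said officials announced today",
    "archive of my diary",
]
second_labels = ["forum", "news", "blog"]

# Fit on the earlier data only, then score on the later data.
model = train_naive_bayes(first_docs, first_labels)
print(accuracy(model, second_docs, second_labels))
```

Keeping the later data set entirely out of training is what makes the second accuracy figure a measure of how well a model carries over in time, as opposed to a random split within one pooled data set.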

Place, publisher, year, edition, pages
2013. 58 p.
Keyword [en]
Website classification, Functionality, Latent Dirichlet Allocation, Multinomial logistic regression
National Category
Probability Theory and Statistics
URN: urn:nbn:se:liu:diva-93702
ISRN: LIU-IDA/STAT-A--13/004--SE
OAI: diva2:635113
Subject / course
Program in Statistics and Data Analysis
Available from: 2013-07-03. Created: 2013-06-10. Last updated: 2013-07-03. Bibliographically approved

Open Access in DiVA

fulltext (546 kB), 2116 downloads
File information
File name: FULLTEXT01.pdf. File size: 546 kB. Checksum: SHA-512
Type: fulltext. Mimetype: application/pdf