Results 1 - 14 of 14
  • 1. Dalianis, H.
    et al.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS. Stockholm University, Sweden.
    Sneiders, E.
    Clustering e-mails for the Swedish social insurance agency - What part of the e-mail thread gives the best quality? 2010. In: Advances in Natural Language Processing: 7th International Conference on NLP, IceTAL 2010, Reykjavik, Iceland, August 16-18, 2010, Springer Berlin/Heidelberg, 2010, pp. 115-120. Conference paper (Refereed)
    Abstract [en]

    We need to analyse a large number of e-mails sent by the citizens to the customer services department of a governmental organisation based in Sweden. To carry out this analysis we clustered a large number of e-mails with the aim of automatic e-mail answering. One issue that came up was whether we should use the whole e-mail including the thread or just the original query for the clustering. In this paper we describe this investigation. Our results show that only the query and the answering part should be used, but not necessarily the whole e-mail thread. The results clearly show that the original question contains more useful information than only the answer, although a combination is even better. Using the full e-mail thread does not downgrade the result.

  • 2.
    Kann, Viggo
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Free Construction of a Free Swedish Dictionary of Synonyms. 2005. In: NoDaLiDa 2005, 2005, pp. 1-6. Conference paper (Refereed)
    Abstract [en]

    Building a large dictionary of synonyms for a language is a very tedious task. Hence there exist very few synonym dictionaries for most languages, and those that exist are generally not freely available due to the amount of work that has been put into them. The Lexin on-line dictionary is a very popular web-site for translations of Swedish words to about ten different languages. By letting users on this site grade automatically generated possible synonym pairs, a free dictionary of Swedish synonyms has been created. The lexicon reflects the users' intuitive definition of synonymity, and the amount of work put into the project is only as much as the participants want to.

  • 3.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Clustering in Swedish: The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method. 2005. Licentiate thesis, comprehensive summary (Other academic)
    Abstract [sv, translated to English]

    Text clustering divides a set of texts into groups, so that the texts within each group are similar in content. It can be used to uncover structure and content in unknown text sets and to give new perspectives on familiar ones. The contributions of this thesis are an investigation of text representations for Swedish texts and an evaluation method that uses two or more manual categorizations.

    Text clustering, at least as described here, uses the vector space model, which is commonly used in the field. In this model, texts are represented by the words that appear in them, and texts that share many words are considered similar in content. What counts as a word differs between languages. We have investigated the impact of some of the properties of Swedish on text clustering. Since Swedish has more morphological variation than, for instance, English, we have removed suffixes using a stemmer. This gives slightly better results and reduces the number of words in the representation.

    Swedish uses, and continually creates, solid compounds. Most constituents of compounds are used as words on their own and in many different compounds. Solid compounds in Swedish often correspond to phrases and open compounds in other languages. The constituents of compounds are not used in the similarity calculation of the vector space model. To make use of them, we have used a spell checking program to split compounds. The results clearly show that this is beneficial.

    The vector space model takes no account of word order. We have tried to extend the model with noun phrases in different ways. None of our experiments shows any improvement over the ordinary simple model.

    Evaluating text clustering results is very hard. What constitutes a good partition of a set of texts is inherently subjective. Automatic evaluation methods are either internal or external. Internal quality measures make use of the representation in some way. They are therefore not suitable for comparing different representations.

    External quality measures compare a clustering with a (manual) categorization of the same set of texts. The theoretical best values of the measures are known, but it is not obvious what a good value is: text sets differ in how difficult they are to cluster, and categorizations are more or less suitable for a particular text set. We describe an evaluation method that can be used when a text set has more than one categorization. In such cases the result for a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. We also describe the kappa coefficient as a quality measure for clustering under the same conditions.
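
    The vector space comparison described in this abstract, and the effect of splitting compounds before comparing texts, can be illustrated with a minimal sketch. It is not the thesis implementation: the whitespace tokenisation, the function names, and the hand-written `compound_parts` table (standing in for the spell checking program) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(text, compound_parts=None):
    """Split text on whitespace and optionally replace compounds by their parts.

    compound_parts is a hypothetical lookup table, e.g.
    {"fotbollsmatch": ["fotboll", "match"]}; it is an assumption of this sketch.
    """
    parts = compound_parts or {}
    result = []
    for word in text.lower().split():
        # Words missing from the table are kept unchanged.
        result.extend(parts.get(word, [word]))
    return result


def cosine_similarity(text_a, text_b, compound_parts=None):
    """Cosine of the angle between two bag-of-words term-frequency vectors."""
    vec_a = Counter(tokens(text_a, compound_parts))
    vec_b = Counter(tokens(text_b, compound_parts))
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

    Without the lookup table, "fotbollsmatch" and "fotboll" share no word and get similarity 0; with the compound split into "fotboll" and "match", the two texts share one word and the similarity becomes positive.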

  • 4.
    Rosell, Magnus
    KTH, Tidigare Institutioner, Numerisk analys och datalogi, NADA.
    Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting. 2003. Conference paper (Refereed)
    Abstract [en]

    The use of properties of the Swedish language when indexing newspaper articles improves clustering results. To show this, a clustering algorithm was implemented and language specific tools were used when building the representation of the articles. Since Swedish is an inflecting language, many words have different forms. Thus two documents compared based on word occurrence (i.e. the vector space model and cosine measure of Information Retrieval) do not necessarily become similar although containing the same word(s). To overcome this we have used a stemmer. Compounds are regularly formed as one word in Swedish. Hence indexing on words leaves the information in the components of compounds unused. We use the spell checking program Stava to split compounds into their components. Newspapers sort their articles into sections such as Economy, Domestic, Sports etc. Using these we calculate entropy for the clusterings and use it as a measure of quality. We have found that stemming improves clustering results on our collections by about 4 % compared to not using it. Compound splitting improves results by about 10 % (by 13 % in combination with stemming). Keeping the original compounds in the representation does not improve results.
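
    Entropy of a clustering against known newspaper sections can be sketched as below. The abstract does not give the exact formula, so this assumes the common size-weighted average of per-cluster entropies (lower is better); the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(cluster_ids, section_labels):
    """Size-weighted average entropy (in bits) of sections within clusters.

    cluster_ids[i] is the cluster assigned to article i and
    section_labels[i] is its newspaper section. Both names are assumptions.
    """
    if len(cluster_ids) != len(section_labels):
        raise ValueError("one cluster id and one section label per article")
    total = len(cluster_ids)
    if total == 0:
        return 0.0

    # Count how many articles of each section fall in each cluster.
    sections_in_cluster = {}
    for cluster, section in zip(cluster_ids, section_labels):
        sections_in_cluster.setdefault(cluster, Counter())[section] += 1

    entropy = 0.0
    for counts in sections_in_cluster.values():
        size = sum(counts.values())
        cluster_entropy = -sum(
            (n / size) * math.log2(n / size) for n in counts.values()
        )
        # Larger clusters contribute proportionally more.
        entropy += (size / total) * cluster_entropy
    return entropy
```

    A clustering whose clusters each contain a single section scores 0; two clusters that each mix two sections evenly score 1 bit.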

  • 5.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Infomat: Visualizing and Exploring Vector Space Model Data Matrixes. 2009. Article in journal (Refereed)
    Abstract [en]

    Infomat is a vector space visualization tool aimed at Information Retrieval. It presents information stored in a matrix, such as the term-document-matrix, as a rectangular picture. The opacity of each pixel is proportional to the weight(s) of the corresponding matrix element(s). Reordering the objects of the rows and columns makes different distributional patterns appear. These can be explored to understand the relations (similarities and differences) between the objects. Infomat allows the user to zoom in and out of the picture to obtain more detailed information, to remove objects and matrix elements, to reweight the matrix, and to cluster all, or a part of the objects. At the same time textual information is presented. Infomat provides an overview of the content of the entire data and parts of it. In particular, text clustering results become easier to grasp than when presented only in textual form.

  • 6.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Part of Speech Tagging for Text Clustering in Swedish. 2009. In: Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009, 2009. Conference paper (Refereed)
    Abstract [en]

    Text clustering could be very useful both as an intermediate step in a large natural language processing system and as a tool in its own right. The result of a clustering algorithm is dependent on the text representation that is used. Swedish has a fairly rich morphology and a large number of homographs. This possibly leads to problems in Information Retrieval in general. We investigate the impact on text clustering of adding the part-of-speech tag to all words in the common term-by-document matrix. The experiments are carried out on a few different text sets. None of them gives any evidence that part-of-speech tags improve results. However, representing texts using only nouns and proper names gives a smaller representation without worsening results. In a few experiments this smaller representation gives better results. We also investigate the effect of lemmatization and the use of a stoplist, both of which improve results significantly in some cases.

  • 7.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC).
    Text Cluster Trimming for Better Descriptions and Improved Quality. 2010. In: LREC 2010 - Seventh International Conference on Language Resources and Evaluation / [ed] Calzolari, N.; Choukri, K.; Maegaard, B.; Mariani, J.; Odijk, J.; Piperidis, S.; Rosner, M.; Tapias, D., European Language Resources Association (ELRA), 2010, pp. 3076-3083. Conference paper (Refereed)
    Abstract [en]

    Text clustering is potentially very useful for exploration of text sets that are too large to study manually. The success of such a tool depends on whether the results can be explained to the user. An automatically extracted cluster description usually consists of a few words that are deemed representative for the cluster. It is preferably short in order to be easily grasped. However, text cluster content is often diverse. We introduce a trimming method that removes texts that contain none, or only a few, of the words in the cluster description. The result is clusters that match their descriptions better. In experiments on two quite different text sets we obtain significant improvements in both internal and external clustering quality for the trimmed clustering compared to the original. The trimming thus has two positive effects: it forces the clusters to agree with their descriptions (resulting in better descriptions) and improves the quality of the trimmed clusters.
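
    The trimming idea can be sketched roughly as follows. The abstract does not say how descriptions are extracted or which threshold is used, so the description here is simply the words occurring in the most texts of the cluster, and `num_words` and `min_hits` are hypothetical parameters chosen for illustration.

```python
from collections import Counter


def describe(cluster_texts, num_words=3):
    """Pick the words that occur in the most texts of the cluster.

    This document-frequency ranking is an assumption of the sketch,
    not the description method of the paper.
    """
    doc_freq = Counter()
    for text in cluster_texts:
        # Count each word once per text.
        doc_freq.update(set(text.lower().split()))
    # Sort by falling frequency, then alphabetically for a stable result.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:num_words]]


def trim(cluster_texts, description, min_hits=1):
    """Keep only texts containing at least min_hits description words."""
    wanted = set(description)
    return [
        text
        for text in cluster_texts
        if len(wanted & set(text.lower().split())) >= min_hits
    ]
```

    For a cluster holding "goal match team", "match team coach" and "stock market", a two-word description is "match" and "team", and trimming removes the third text because it contains neither word.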

  • 8.
    Rosell, Magnus
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Text Clustering Exploration: Swedish Text Representation and Clustering Results Unraveled. 2009. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clustering as an exploration tool. We have also done some work on synonyms and evaluation of clustering results. Text clustering, at least such as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Swedish has more morphological variation than for instance English. We show that it is beneficial to use the lemma form of words rather than the word forms. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. Our experiments show that it is beneficial to split solid compounds into their parts when building the representation. The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. We have also tried to differentiate between homographs, words that look alike but mean different things, by augmenting all words with a tag indicating their part of speech. None of our experiments using phrases or part of speech information have shown any improvement over using the ordinary model. Evaluation of text clustering results is very hard. What is a good partition of a text set is inherently subjective. 
External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is – text sets differ in difficulty to cluster and categorizations are more or less adapted to a particular text set. We describe how evaluation can be improved for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. In some related work we have built a dictionary of synonyms. We use it to compare two different principles for automatic word relation extraction through clustering of words. Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration, and implemented it in a tool, called Infomat. It presents the representation matrix directly in two dimensions. When the order of texts and words are changed, by for instance clustering, distributional patterns that indicate similarities between texts and words appear. We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that “farmers smoke less than the average”, which we later could verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data that is restricted to a limited number of possible values.

  • 9.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Hassel, Martin
    KTH, Skolan för informations- och kommunikationsteknik (ICT), Data- och systemvetenskap, DSV.
    Kann, Viggo
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People’s Dictionary of Synonyms. 2009. In: Proceedings of the International Conference RANLP-2009, 2009, pp. 376-380. Conference paper (Refereed)
    Abstract [en]

    Evaluation of word space models is usually local in the sense that it only considers words that are deemed very similar by the model. We propose a global evaluation scheme based on clustering of the words. A clustering of high quality in an external evaluation against a semantic resource, such as a dictionary of synonyms, indicates a word space model of high quality. We use Random Indexing to create several different models and compare them by clustering evaluation against the People's Dictionary of Synonyms, a list of Swedish synonyms that are graded by the public. Most notably we get better results for models based on syntagmatic information (words that appear together) than for models based on paradigmatic information (words that appear in similar contexts). This is quite contrary to previous results that have been presented for local evaluation. Clusterings to ten clusters result in a recall of 83% for a syntagmatic model, compared to 34% for a comparable paradigmatic model, and 10% for a random partition.

  • 10.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Kann, Viggo
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Constructing a Swedish General Purpose Polarity Lexicon: Random Walks in the People's Dictionary of Synonyms. 2010. In: Proceedings of the Conference: Proceedings of the Conference, 2010, pp. 19-20. Conference paper (Refereed)
  • 11.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Kann, Viggo
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People’s Dictionary of Synonyms. 2010. Conference paper (Refereed)
    Abstract [en]

    Evaluation of word space models is usually local in the sense that it only considers words that are deemed very similar by the model. We propose a global evaluation scheme based on clustering of the words. A clustering of high quality in an external evaluation against a semantic resource, such as a dictionary of synonyms, indicates a word space model of high quality. We use Random Indexing to create several different models and compare them by clustering evaluation against the People’s Dictionary of Synonyms, a list of Swedish synonyms that are graded by the public. Most notably we get better results for models based on syntagmatic information (words that appear together) than for models based on paradigmatic information (words that appear in similar contexts). This is quite contrary to previous results that have been presented for local evaluation. Clusterings to ten clusters result in a recall of 83 % for a syntagmatic model, compared to 34% for a comparable paradigmatic model, and 10% for a random partition.

  • 12.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Kann, Viggo
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Litton, Jan-Eric
    MEB, Karolinska Institutet, Stockholm, Sweden.
    Comparing Comparisons: Document Clustering Evaluation Using Two Manual Classifications. 2004. Conference paper (Refereed)
    Abstract [en]

    “Describe your occupation in a few words” is a question answered by 44 000 Swedish twins. Each respondent was then manually categorized according to two established occupation classification systems. Would a clustering algorithm have produced satisfactory results? Usually, this question cannot be answered. The existing quality measures will tell us how much the algorithmic clustering deviates from the manual classification, not if this is an acceptable deviation. But in our situation, with two different manual classifications (in classification systems called AMSYK and YK80), we can indeed construct such quality measures. If the algorithmic result differs no more from the manual classifications than these differ from each other (comparing the comparisons) we have an indication of its being useful. Further, we use the kappa coefficient as a clustering quality measure. Using one manual classification as a coding scheme we assess the agreement of a clustering and the other. After applying both these novel evaluation methods we conclude that our clusterings are useful.
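
    The kappa coefficient mentioned in this abstract is, in its standard two-rater form (Cohen's kappa), the observed agreement corrected for the agreement expected by chance. The sketch below computes that standard form for two labelings of the same items. How the paper maps clusters onto the categories of a coding scheme is not stated in the abstract, so the sketch simply assumes both inputs already use the same label set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelings of the same items.

    kappa = (observed - expected) / (1 - expected), where expected is the
    chance agreement implied by the label frequencies of each labeling.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty labelings of equal length")
    n = len(labels_a)

    # Fraction of items on which the two labelings agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both labelings use one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

    Identical labelings give kappa 1, while agreement no better than chance gives kappa 0.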

  • 13.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Velupillai, Sumithra
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Revealing Relations between Open and Closed Answers in Questionnaires through Text Clustering Evaluation. 2008. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 2008, pp. 1-7. Conference paper (Refereed)
    Abstract [en]

    Open answers in questionnaires contain valuable information that is very time-consuming to analyze manually. We present a method for hypothesis generation from questionnaires based on text clustering. Text clustering is used interactively on the open answers, and the user can explore the cluster contents. The exploration is guided by automatic evaluation of the clusters against a closed answer regarded as a categorization. This simplifies the process of selecting interesting clusters. The user formulates a hypothesis from the relation between the cluster content and the closed answer categorization. We have applied our method on an open answer regarding occupation compared to a closed answer on smoking habits. With no prior knowledge of smoking habits in different occupation groups we have generated the hypothesis that farmers smoke less than the average. The hypothesis is supported by several separate surveys. Closed answers are easy to analyze automatically but are restricted and may miss valuable aspects. Open answers, on the other hand, fully capture the dynamics and diversity of possible outcomes. With our method the process of analyzing open answers becomes feasible.

  • 14.
    Rosell, Magnus
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    Velupillai, Sumithra
    KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA.
    The Impact of Phrases in Document Clustering for Swedish. 2005. In: Proceedings of the 15th NODALIDA conference, Joensuu 2005 / [ed] Werner, S., 2005, pp. 173-179. Conference paper (Refereed)
    Abstract [en]

    We have investigated the impact of using phrases in the vector space model for clustering documents in Swedish in different ways. The investigation is carried out on two text sets from different domains: one set of newspaper articles and one set of medical papers. The use of phrases does not improve results relative to the ordinary use of words. The results differ significantly between the text types. This indicates that one could benefit from different text representations for different domains, although a fundamentally different approach probably would be needed.
