Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Graph Algorithms for Large-Scale and Dynamic Natural Language Processing
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.ORCID iD: 0000-0003-1007-8533
2019 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In Natural Language Processing, researchers design and develop algorithms to enable machines to understand and analyze human language. These algorithms benefit multiple downstream applications including sentiment analysis, automatic translation, automatic question answering, and text summarization. Topic modeling is one such algorithm that solves the problem of categorizing documents into multiple groups with the goal of maximizing the intra-group document similarity. However, the manifestation of short texts like tweets, snippets, comments, and forum posts as the dominant source of text in our daily interactions and communications, as well as being the main medium for news reporting and dissemination, increases the complexity of the problem due to scalability, sparsity, and dynamicity. Scalability refers to the volume of the messages being generated, sparsity is related to the length of the messages, and dynamicity is associated with the ratio of changes in the content and topical structure of the messages (e.g., the emergence of new phrases). We improve the scalability and accuracy of Natural Language Processing algorithms from three perspectives, by leveraging on innovative graph modeling and graph partitioning algorithms, incremental dimensionality reduction techniques, and rich language modeling methods. We begin by presenting a solution for multiple disambiguation on short messages, as opposed to traditional single disambiguation. The solution proposes a simple graph representation model to present topical structures in the form of dense partitions in that graph and applies disambiguation by extracting those topical structures using an innovative distributed graph partitioning algorithm. Next, we develop a scalable topic modeling algorithm using a novel dense graph representation and an efficient graph partitioning algorithm. Then, we analyze the effect of temporal dimension to understand the dynamicity in online social networks and present a solution for geo-localization of users in Twitter using a hierarchical model that combines partitioning of the underlying social network graph with temporal categorization of the tweets. The results show the effect of temporal dynamicity on users’ spatial behavior. This result leads to design and development of a dynamic topic modeling solution, involving an online graph partitioning algorithm and a significantly stronger language modeling approach based on the skip-gram technique. The algorithm shows strong improvement on scalability and accuracy compared to the state-of-the-art models. Finally, we describe a dynamic graph-based representation learning algorithm that modifies the partitioning algorithm to develop a generalization of our previous work. A strong representation learning algorithm is proposed that can be used for extracting high quality distributed and continuous representations out of any sequential data with local and hierarchical structural properties similar to natural language text.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2019. , p. 42
Series
TRITA-EECS-AVL ; 2019:85
Keywords [en]
Natural Language Processing; Lexical Disambiguation; Topic Modeling; Representation Learning; Graph Partitioning; Distributed Algorithms; Dimensionality Reduction; Random Indexing;
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-263914ISBN: 978-91-7873-377-4 (print)OAI: oai:DiVA.org:kth-263914DiVA, id: diva2:1371129
Public defence
2019-12-17, Sal C, Electrum, Kistagången 16, Kista, 10:00 (English)
Opponent
Supervisors
Note

QC 20191125

Available from: 2019-11-25 Created: 2019-11-19 Last updated: 2019-11-25Bibliographically approved
List of papers
1. Semi-Supervised Multiple Disambiguation
Open this publication in new window or tab >>Semi-Supervised Multiple Disambiguation
2015 (English)In: IEEE Computer Society Conference Publishing Services / [ed] IEEE, IEEE Press, 2015Conference paper, Published paper (Refereed)
Abstract [en]

Determining the true entity behind an ambiguousword is an NP-Hard problem known as Disambiguation. Previoussolutions often disambiguate a single ambiguous mention acrossmultiple documents. They assume each document contains onlya single ambiguous word and a rich set of unambiguous contextwords. However, nowadays we require fast disambiguation ofshort texts (like news feeds, reviews or Tweets) with few contextwords and multiple ambiguous words. In this research we focuson Multiple Disambiguation (MD) in contrast to Single Disambiguation(SD). Our solution is inspired by a recent algorithm developed for SD. The algorithm categorizes documents by first,transferring them into a graph and then, clustering the graphbased on its topological structure. We changed the graph-baseddocument-modeling of the algorithm, to account for MD. Also,we added a new parameter that controls the resolution of theclustering. Then, we used a supervised sampling approach formerging the clusters when appropriate. Our algorithm, comparedwith the original model, achieved 10% higher quality in termsof F1-Score using only 4% sampling from the dataset.

Place, publisher, year, edition, pages
IEEE Press, 2015
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-184878 (URN)10.1109/Trustcom.2015.566 (DOI)000391000900013 ()
Conference
The 9th IEEE International Conference on Big Data Science and Engineering (IEEE BigDataSE-15)
Note

QC 20160407

Available from: 2016-04-06 Created: 2016-04-06 Last updated: 2019-11-19Bibliographically approved
2. DeGPar: Large Scale Topic Detection usingNode-Cut Partitioning on Dense Weighted Graphs
Open this publication in new window or tab >>DeGPar: Large Scale Topic Detection usingNode-Cut Partitioning on Dense Weighted Graphs
2017 (English)In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), IEEE conference proceedings, 2017, p. 775-785, article id 7980020Conference paper, Published paper (Refereed)
Abstract [en]

Topic Detection (TD) refers to automatic techniques for locating topically related material in web documents. Nowadays, massive amounts of documents are generated by users of Online Social Networks (OSNs), in form of very short text, tweets and snippets of news. While topic detection, in its traditional form, is applied to a few documents containing a lot of information, the problem has now changed to dealing with massive number of documents with very little information. The traditional solutions, thus, fall short either in scalability (due to huge number of input items) or sparsity (due to insufficient information per input item). In this paper we address the scalability problem by introducing an efficient and scalable graph based algorithm for TD on short texts, leveraging dimensionality reduction and clustering techniques. We first, compress the input set of documents into a dense graph, such that frequent co-occurrence patterns in the documents create multiple dense topological areas in the graph. Then, we partition the graph into multiple dense sub-graphs, each representing a topic. We compare the accuracy and scalability of our solution with two state-of-the-art solutions (including the standard LDA, and BiTerm). The results on two widely used benchmark datasets show that our algorithm not only maintains a similar or better accuracy, but also performs by an order of magnitude faster than the state-of-the-art approaches.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2017
Series
Proceedings - International Conference on Distributed Computing Systems, ISSN 1063-6927
Keywords
TopicDetection, Node-cut Graph Partitioning, Distributed Algorithms, Random Indexing, Dimensionality Reduction, Dense Weighted Graph Partitioning, Online Social Networks
National Category
Computer Systems
Research subject
Computer Science; Information and Communication Technology; Applied and Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-204406 (URN)10.1109/ICDCS.2017.19 (DOI)000412759500071 ()2-s2.0-85027258993 (Scopus ID)9781538617915 (ISBN)
Conference
37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, J.W. Marriott Hotel, Atlanta, United States, 5 June 2017 through 8 June 2017
Note

QC 20170407

Available from: 2017-04-05 Created: 2017-04-05 Last updated: 2019-11-19Bibliographically approved
3. Spatio-Temporal Multiple Geo-Location Identification on Twitter
Open this publication in new window or tab >>Spatio-Temporal Multiple Geo-Location Identification on Twitter
2018 (English)In: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 / [ed] Abe, N Liu, H Pu, C Hu, X Ahmed, N Qiao, M Song, Y Kossmann, D Liu, B Lee, K Tang, J He, J Saltz, J, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 3412-3421Conference paper, Published paper (Refereed)
Abstract [en]

Twitter Geo-tags that indicate the exact location of messages have many applications from localized opinion mining during elections to efficient traffic management in critical situations. However, less than 6% of Tweets are Geo-tagged, which limits the implementation of those applications. There are two groups of solutions: content and network-based. The first group uses location indicative factors like URLs and topics, extracted from the content of tweets, to infer Geo-location for non geoactive users, whereas the second group benefits from friendship ties in the underlying social network graph. Friendship ties are better predictors compared to content information because they are less noisy and often follow the natural human spatial movement patterns. However, their prediction's accuracy is still limited because they ignore the temporal aspects of human behavior and always assume a single location per user. This research aims to extend the current network-based approaches by taking users' temporal dimension into account. We assume multiple locations per user during different time-slots and hypothesize that location predictability varies depending on the time and the properties of the social membership group. Thus, we propose a hierarchical solution to apply temporal categorizations on top of social network partitioning for multiple location prediction for users in Online Social Networks (OSNs) like Twitter. Given a largescale Twitter dataset, we show that users' location predictability exhibits different behavior in different time-slots and different social groups. We find that there are specific conditions where users are more predictable in terms of Geo-location. Our solution outperforms the state-of-the-art by improving the prediction accuracy by 16:6% in terms of Median Error Distance (MED) over the same recall.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Series
IEEE International Conference on Big Data, ISSN 2639-1589
Keywords
Geo-Location Identification, Graph Partitioning, Social Network Analysis, Spatio-Temporal Analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-254147 (URN)10.1109/BigData.2018.8621997 (DOI)000468499303064 ()2-s2.0-85062605032 (Scopus ID)978-1-5386-5035-6 (ISBN)
Conference
2018 IEEE International Conference on Big Data, Big Data 2018; Seattle; United States; 10 December 2018 through 13 December 2018
Note

QC 20190624

Available from: 2019-06-24 Created: 2019-06-24 Last updated: 2019-11-19Bibliographically approved
4. GDTM: Graph-based Dynamic Topic Models
Open this publication in new window or tab >>GDTM: Graph-based Dynamic Topic Models
(English)In: Article in journal (Refereed) Submitted
Abstract [en]

Dynamic Topic Modeling (DTM) is the ultimate solution for extracting topics from short texts generated in Online Social Networks (OSNs) like Twitter. A DTM solution is required to be scalable and to be able to account for sparsity in short texts and dynamicity of topics. Current solutions combine probabilistic mixture models like Dirichlet Multinomial or PitmanYor Process with approximate inference approaches like Gibbs Sampling and Stochastic Variational Inference to, respectively, account for dynamicity and scalability in DTM. However, these solutions rely on weak probabilistic language models, which do not account for sparsity in short texts. In addition, their inference is based on iterative optimization algorithms, which have scalability issues when it comes to DTM. We present GDTM, a single-pass graph-based DTM algorithm, to solve the problem. GDTM combines a context-rich and incremental feature representation model, called Random Indexing (RI), with a novel online graph partitioning algorithm to address scalability and dynamicity. In addition, GDTM uses a rich language modeling approach based on the Skip-gram technique to account for sparsity. We run multiple experiments over a large-scale Twitter dataset to analyze the accuracy and scalability of GDTM and compare the results with four state-of-the-art approaches. The results show that GDTM outperforms the best approach by 11% on accuracy and performs by an order of magnitude faster while creating 4 times better topic quality over standard evaluation metrics.

Keywords
Topic Modeling, Dimensionality Reduction, Distributional Semantics, Language Modeling, Graph Partitioning
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-263828 (URN)
Note

QC 20191115

Available from: 2019-11-15 Created: 2019-11-15 Last updated: 2019-12-05Bibliographically approved
5. An Efficient Graph-Based Model for Learning Representations
Open this publication in new window or tab >>An Efficient Graph-Based Model for Learning Representations
2020 (English)Manuscript (preprint) (Other academic)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-263908 (URN)
Note

Submitted manuscript. QC 20191119

Available from: 2019-11-19 Created: 2019-11-19 Last updated: 2019-11-20Bibliographically approved

Open Access in DiVA

fulltext(3267 kB)50 downloads
File information
File name FULLTEXT01.pdfFile size 3267 kBChecksum SHA-512
7aabe20c6eb2187b7b6707bbd7908379861fcbafc7e1d1cf660917adaa59960d21588f81e51ddf853616db8167b45317048cd35830c627d1f5dc54e1e1718f6a
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Ghoorchian, Kambiz
By organisation
Software and Computer systems, SCS
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 50 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 279 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf