Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Methods and Algorithms for Data-Intensive Computing: Streams, Graphs, and Geo-Distribution
KTH, School of Electrical Engineering and Computer Science (EECS), Software and Computer systems, SCS.
2019 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Struggling with the volume and velocity of Big Data has attracted lots of interest towards stream processing paradigm, a paradigm in the area of data-intensive computing that provides methods and solutions to process data in motion. Today's Big Data includes geo-distributed data sources.In addition, a major part of today's Big Data requires exploring complex and evolving relationships among data, which complicates any reasoning on the data. This thesis aims at challenges raised by geo-distributed streaming data, and the data with complex and evolving relationships.

Many organizations provide global scale applications and services that are hosted on servers and data centers that are located in different parts of the world. Therefore, the data that needs to be processed are generated in different geographical locations. This thesis advocates for distributed stream processing in geo-distributed settings to improve the performance including better response time and lower network cost compared to centralized solutions. In this thesis, we conduct an experimental study of Apache Storm, a widely used open-source stream processing system, on a geo-distributed infrastructure made of near-the-edge resources. The resources that host the system's components are connected by heterogeneous network links. Our study exposes a set of issues and bottlenecks of deploying a stream processing system on the geo-distributed infrastructure. Inspired by the results, we propose a novel method for grouping of geo-distributed resources into computing clusters, called micro data centers, in order to mitigate the effect of network heterogeneity for distributed stream processing applications. Next, we focus on the windowed aggregation of geo-distributed data streams, which has been widely used in stream analytics. We propose to reduce the bandwidth cost by coordinating windowed aggregations among near-the-edge data centers. We leverage intra-region links and design a novel low-overhead coordination algorithm that optimizes communication cost for data aggregation. Then, we propose a system, called SpanEdge, that provides an expressive programming model to unify programming stream processing applications on a geo-distributed infrastructure and provides a run-time system to manage (schedule and execute) stream processing applications across data centers. Our results show that SpanEdge can optimally deploy stream processing applications in a geo-distributed infrastructure, which significantly reduces the bandwidth consumption and response latency.

With respect to data with complex and evolving relationships, this thesis aims at effective and efficient processing of inter-connected data. There exist several domains such as social network analysis, machine learning, and web search in which data streams are modeled as linked entities of nodes and edges, namely a graph. Because of the inter-connection among the entities in graph data, processing of graph data is challenging. The inter-connection among the graph entities makes it difficult to distribute the graph among multiple machines to process the graph at scale. Furthermore, in a streaming setting, the graph structure and the graph elements can continuously change as the graph elements are streamed. Such a dynamic graph requires incremental computing methods that can avoid redundant computations on the whole graph. This thesis proposes incremental computing methods of streaming graph processing that can boost the processing time while still obtaining high quality results. In this thesis, we introduce HoVerCut, an efficient framework for boosting streaming graph partitioning algorithms. HoVerCut is Horizontally and Vertically scalable. Our evaluations show that HoVerCut speeds up the partitioning process significantly without degrading the quality of partitioning. Finally, we study unsupervised representation learning in dynamic graphs. Graph representation learning seeks to learn low dimensional vector representations for the graph elements, i.e. edges and vertices, and the whole graph.We propose novel and computationally efficient incremental algorithms. The computation complexity of our algorithms depends on the extent and rate of changes in a graph and on the graph density. The evaluation results show that our proposed algorithms can achieve competitive results to the state-of-the-art static methods while being computationally efficient.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2019. , p. 58
Series
TRITA-EECS-AVL ; 2019:13
Keywords [en]
stream processing, geo-distributed infrastructure, edge computing, streaming graph, dynamic graph
National Category
Computer and Information Sciences
Research subject
Information and Communication Technology; Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-243883ISBN: 978-91-7873-094-0 (print)OAI: oai:DiVA.org:kth-243883DiVA, id: diva2:1287149
Public defence
2019-03-15, Ka-Sal C (Sal Sven-Olof Öhrvik), Electrum, Kista, Stockholm, 13:30 (English)
Opponent
Supervisors
Note

QC 20190208

Available from: 2019-02-08 Created: 2019-02-08 Last updated: 2019-02-08Bibliographically approved
List of papers
1. Stream Processing in Community Network Clouds
Open this publication in new window or tab >>Stream Processing in Community Network Clouds
2015 (English)In: Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on, IEEE conference proceedings, 2015, p. 800-805Conference paper, Published paper (Refereed)
Abstract [en]

Community Network Cloud is an emerging distributed cloud infrastructure that is built on top of a community network. The infrastructure consists of a number of geographically distributed compute and storage resources, contributed by community members, that are linked together through the community network. Stream processing is an important enabling technology that, if provided in a Community Network Cloud, would enable a new class of applications, such as social analysis, anomaly detection, and smart home power management. However, modern stream processing engines are designed to be used inside a data center, where servers communicate over a fast and reliable network. In this work, we evaluate the Apache Storm stream processing framework in an emulated Community Network Cloud in order to identify the challenges and bottlenecks that exist in the current implementation. The community network emulation was performed using data collected from the Guifi.net community network, Spain. Our evaluation results show that, with proper configuration of the heartbeats, it is possible to run Apache Storm in a Community Network Cloud. The performance is sensitive to the placement of the Storm components in the network. The deployment of management components on wellconnected nodes improves the Storm topology scheduling time, fault tolerance, and recovery time. Our evaluation also indicates that the Storm scheduler and the stream groupings need to be aware of the network topology and location of stream sources in order to optimally place Storm spouts and bolts to improve performance.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2015
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-174851 (URN)10.1109/FiCloud.2015.95 (DOI)000378639200122 ()2-s2.0-84959041836 (Scopus ID)
Conference
The 4th International Workshop on Community Networks and Bottom-up-Broadband(CNBuB 2015), 24-26 Aug. 2015, Rome, Italy
Note

QC 20151113

Available from: 2015-10-07 Created: 2015-10-07 Last updated: 2019-02-08Bibliographically approved
2. Smart Partitioning of Geo-Distributed Resources to Improve Cloud Network Performance
Open this publication in new window or tab >>Smart Partitioning of Geo-Distributed Resources to Improve Cloud Network Performance
2015 (English)Conference paper, Published paper (Refereed)
Abstract [en]

Cloud Computing systems with geo-distributed re- sources are becoming more popular for enabling a new family of applications, which are latency sensitive or bandwidth intensive, e.g., Internet of Things and online video gaming services. The approach is to host the cloud services at the network edges to reduce the latency and bandwidth consumption. However, the topology of the existing networks is not necessarily optimal for hosting Cloud services. Moreover, how the services are placed on the nodes, can affect the performance of the applications and the whole network. Therefore, we propose a novel algorithm to partition a distributed infrastructure into a set of computing clusters, each called a Micro Data Center. Our proposed algorithm is a decentralized community detection algorithm that does not require any global knowledge of the network topology. We compare our solution with a geolocation based clustering solution and demonstrate our preliminary results based on a real world network data set. We show that micro data centers increase the minimum available bandwidth in the network to up to 62%. Likewise, the average latency can be reduced to 50%.

Keywords
geo-distributed cloud; community detection; cloud network performance; multiple data centers
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-174381 (URN)10.1109/CloudNet.2015.7335292 (DOI)000377207000022 ()2-s2.0-84960981479 (Scopus ID)
Conference
The 2015 4th IEEE International Conference on Cloud Networking (IEEE CloudNet 2015)5-7 October 2015, Niagara Falls, Canada
Note

QC 20160616

Available from: 2015-10-06 Created: 2015-10-06 Last updated: 2019-02-08Bibliographically approved
3. Optimizing Windowed Aggregation over Geo-Distributed Data Streams
Open this publication in new window or tab >>Optimizing Windowed Aggregation over Geo-Distributed Data Streams
2018 (English)In: 2018 IEEE International Conference on Edge Computing (EDGE), IEEE Computer Society Digital Library, 2018, p. 33-41Conference paper, Published paper (Refereed)
Abstract [en]

Real-time data analytics is essential since more and more applications require online decision making in a timely manner. However, efficient analysis of geo-distributed data streams is challenging. This is because data needs to be collected from all edge data centers, which aggregate data from local sources, in order to process most of the analytic tasks. Thus, most of the time edge data centers need to transfer data to a central data center over a wide area network, which is expensive. In this paper, we advocate for a coordinated approach of edge data centers in order to handle these analytic tasks efficiently and hence, reducing the communication cost among data centers. We focus on the windowed aggregation of data streams, which has been widely used in stream analytics. In general, aggregation of data streams among edge data centers in the same region reduces the amount of data that needs to be sent over cross-region communication links. Based on state-of-the-art research, we leverage intra-region links and design a low-overhead coordination algorithm that optimizes communication cost for data aggregation. Our algorithm has been evaluated using synthetic and Big Data Benchmark datasets. The evaluation results show that our algorithm reduces the bandwidth cost up to ~6x, as compared to the state-of-the-art solution.

Place, publisher, year, edition, pages
IEEE Computer Society Digital Library, 2018
Keywords
data analytics, stream processing, aggregation, WAN analytics
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-236051 (URN)10.1109/EDGE.2018.00012 (DOI)000447289500005 ()2-s2.0-85055625599 (Scopus ID)978-1-5386-7238-9 (ISBN)
Conference
2018 IEEE International Conference on Edge Computing (EDGE)San Francisco, CA, USA
Note

QC 20181022

Available from: 2018-10-14 Created: 2018-10-14 Last updated: 2019-02-08Bibliographically approved
4. SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers
Open this publication in new window or tab >>SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers
2016 (English)Conference paper, Published paper (Refereed)
Abstract [en]

In stream processing, data is streamed as a continuous flow of data items, which are generated from multiple sources and geographical locations. The common approach for stream processing is to transfer raw data streams to a central data center that entails communication over the wide-area network (WAN). However, this approach is inefficient and falls short for two main reasons: i) the burst in the amount of data generated at the network edge by an increasing number of connected devices, ii) the emergence of applications with predictable and low latency requirements. In this paper, we propose SpanEdge, a novel approach that unifies stream processing across a geodistributed infrastructure, including the central and near-theedge data centers. SpanEdge reduces or eliminates the latency incurred by WAN links by distributing stream processing applications across the central and the near-the-edge data centers. Furthermore, SpanEdge provides a programming environment, which allows programmers to specify parts of their applications that need to be close to the data source. Programmers can develop a stream processing application, regardless of the number of data sources and their geographical distributions. As a proof of concept, we implemented and evaluated a prototype of SpanEdge. Our results show that SpanEdge can optimally deploy the stream processing applications in a geo-distributed infrastructure, which significantly reduces the bandwidth consumption and the response latency.

Keywords
geo-distributed stream processing, geo-distributed infrastructure, edge computing, edge-based analytics
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-193581 (URN)10.1109/SEC.2016.17 (DOI)000391420900036 ()
Conference
The First IEEE/ACM Symposium on Edge Computing (SEC)
Note

QC 20161005

Available from: 2016-10-04 Created: 2016-10-04 Last updated: 2019-02-08Bibliographically approved
5. Boosting Vertex-Cut Partitioning For Streaming Graphs
Open this publication in new window or tab >>Boosting Vertex-Cut Partitioning For Streaming Graphs
Show others...
2016 (English)In: Big Data (BigData Congress), 2016 IEEE International Congress on, IEEE conference proceedings, 2016, p. 1-8Conference paper, Published paper (Refereed)
Abstract [en]

While the algorithms for streaming graph partitioning are proved promising, they fall short of creating timely partitions when applied on large graphs. For example, it takes 415 seconds for a state-of-the-art partitioner to work on a social network graph with 117 millions edges. We introduce an efficient platform for boosting streaming graph partitioning algorithms. Our solution, called HoVerCut, is Horizontally and Vertically scalable. That is, it can run as a multi-threaded process on a single machine, or as a distributed partitioner across multiple machines. Our evaluations, on both real-world and synthetic graphs, show that HoVerCut speeds up the process significantly without degrading the quality of partitioning. For example, HoVerCut partitions the aforementioned social network graph with 117 millions edges in 11 seconds that is about 37 times faster

Place, publisher, year, edition, pages
IEEE conference proceedings, 2016
Keywords
streaming graph, vertex-cut partitioning, graph partitioning, parallel scalability
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering Computer Systems
Identifiers
urn:nbn:se:kth:diva-189864 (URN)10.1109/BigDataCongress.2016.10 (DOI)000390212200001 ()2-s2.0-84994558691 (Scopus ID)978-1-5090-2622-7 (ISBN)
Conference
5th 2016 IEEE International Congress on Big Data (BigData Congress 2016)
Note

QC 20160923

Available from: 2016-07-20 Created: 2016-07-20 Last updated: 2019-02-08Bibliographically approved
6. Efficient Representation Learning Using RandomWalks for Dynamic Graphs
Open this publication in new window or tab >>Efficient Representation Learning Using RandomWalks for Dynamic Graphs
(English)Manuscript (preprint) (Other academic)
Abstract [en]

An important part of many machine learning workflows on graphs is vertex representation learning, i.e., learning a low-dimensional vector representation for each vertex in the graph. Recently, several powerful techniques for unsupervised representation learning have been demonstrated to give the state-of-the-art performance in downstream tasks such as vertex classification and edge prediction. These techniques rely on random walks performed on the graph in order to capture its structural properties. These structural properties are then encoded in the vector representation space. 

However, most contemporary representation learning methods only apply to static graphs while real-world graphs are often dynamic and change over time. Static representation learning methods are not able to update the vector representations when the graph changes; therefore, they must re-generate the vector representations on an updated static snapshot of the graph regardless of the extent of the change in the graph. In this work, we propose computationally efficient algorithms for vertex representation learning that extend random walk based methods to dynamic graphs. The computation complexity of our algorithms depends upon the extent and rate of changes (the number of edges changed per update) and on the density of the graph. We empirically evaluate our algorithms on real world datasets for downstream machine learning tasks of multi-class and multi-label vertex classification. The results show that our algorithms can achieve competitive results to the state-of-the-art methods while being computationally efficient.

Keywords
dynamic graph embedding, random walk, representation learning, information network, graph stream
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-243880 (URN)
Note

QC 20190208

Available from: 2019-02-08 Created: 2019-02-08 Last updated: 2019-03-04Bibliographically approved

Open Access in DiVA

doctoral_thesis_hps(1638 kB)65 downloads
File information
File name FULLTEXT01.pdfFile size 1638 kBChecksum SHA-512
f4c97068245a073bd1e0297d18678580fdc4768e7571d049a450b50e33d51a781b1f593e26b5f0ee26a3c48d5627f4c14f244505fdc45f79bfa1a1f080fea427
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Peiro Sajjad, Hooman
By organisation
Software and Computer systems, SCS
Computer and Information Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 65 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 860 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf