Data Stream Queries to Apache SPARK
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Many fields need to process and analyze data streams in real time. In industrial applications the data can come from large sensor networks, where processing the data streams can be used for performance monitoring and fault detection in real time. Another example is social media, where data stream processing can be used to detect and prevent spam. A data stream management system (DSMS) is a system that can be used to manage and query continuously received data streams. The queries a DSMS executes are called continuous queries (CQs). In contrast to regular database queries, they execute continuously until canceled. SCSQ is a DSMS developed at Uppsala University. Apache Spark is a large-scale, general-purpose data processing engine. Among other things, it has a component for data stream processing, Spark Streaming. In this project a system called SCSQ Spark Streaming Interface (SSI) was implemented that allows Spark Streaming applications to be called from CQs in SCSQ. It allows the Spark Streaming applications to receive input streams from SCSQ and to emit the resulting stream elements back to SCSQ. To demonstrate SSI, two examples are shown where it is used for stream clustering in CQs using the streaming k-means implementation in Spark Streaming.
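As a rough illustration of what stream clustering with streaming k-means involves, the sketch below applies one mini-batch update to a set of cluster centers, using the update rule with a decay factor described in the Spark MLlib documentation for streaming k-means. It is plain Python, not the thesis's SSI code and not the Spark API; the function name `update_centers` and its parameters are hypothetical and chosen only for this example.

```python
from math import dist


def update_centers(centers, weights, batch, decay=1.0):
    """Apply one mini-batch update to k-means centers (hypothetical helper).

    centers: list of k points (lists of floats)
    weights: list of k point counts accumulated so far
    batch:   list of new points arriving in this stream interval
    decay:   forgetting factor in [0, 1]; 1.0 keeps all history,
             0.0 uses only the newest batch
    """
    k = len(centers)
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Assign every point in the batch to its nearest current center.
    for point in batch:
        j = min(range(k), key=lambda i: dist(point, centers[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]

    new_centers, new_weights = [], []
    for j in range(k):
        m = counts[j]
        if m == 0:
            # No new points for this cluster: keep it unchanged.
            new_centers.append(list(centers[j]))
            new_weights.append(weights[j])
            continue
        old = weights[j] * decay  # discounted weight of the old center
        total = old + m
        # c_new = (c_old * n * decay + sum_of_new_points) / (n * decay + m)
        new_centers.append(
            [(centers[j][d] * old + sums[j][d]) / total for d in range(dim)]
        )
        new_weights.append(weights[j] + m)
    return new_centers, new_weights


# Two stream intervals: the second one uses decay=0.5 to favor recent data.
centers, weights = [[0.0, 0.0], [10.0, 10.0]], [0.0, 0.0]
centers, weights = update_centers(
    centers, weights, [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]
)
centers, weights = update_centers(centers, weights, [[4.0, 4.0]], decay=0.5)
print(centers, weights)
```

In a setting like the one the abstract describes, each batch would correspond to the stream elements received during one interval, and the updated centers would be the elements emitted back to the continuous query. That mapping is an assumption made for illustration only.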
Place, publisher, year, edition, pages
2016. 51 p.
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-301326
OAI: oai:DiVA.org:uu-301326
DiVA: diva2:954003
Master Programme in Computer Science
Risch, Tore
Ngai, Edith