Introduction to Spark

BY: Kelyan | Cornelia

If you need to stream or process large amounts of data at lightning speed, you can't get past Spark. Apache Spark is an open source unified analytics engine, widely used in industries such as finance, healthcare, travel, e-commerce, and media and entertainment due to its speed, efficiency, and performance. Spark covers a wide range of workloads such as batch applications, iterative computations, interactive queries, and streaming through efficient in-memory cluster computing. It also provides a set of Scala, Python, Java, and R high-level APIs for application development.

A Brief History of Spark

Spark was born with the idea of creating something better than MapReduce, which was originally conceptionalised by Google as a processing model for fault tolerant big data processing and open sourced by Nutch developers in 2004, culminating in the Hadoop family of tools.

The Apache community realised that the implementation of MapReduce and NDFS could be used for other tasks as well. "Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures." Apache Hadoop

For the first time, a broad mass of developers could carry out intensive computing processes with large amounts of data (Big Data, petabyte range) on computer clusters. Hadoop made distributed computing and replication possible at a fraction of the usual cost using an extremely simple computing model. Back then, companies like Yahoo and Facebook deployed MapReduce on a large scale.

However, despite its ease of use, MapReduce proved inadequate in many ways. Limited to two very static phases (Map > Reduce), the application areas were limited to splitting, sorting and merging. It was a simple model that only allowed a simple reduction of datasets. Circumventing those limitations required a lot of flushing to disk, which took increased processing time and cost.

Furthermore, there was a need for solutions for more complex operations and functions such as Filter, Join Data, Aggregate, Fold or Map. Thus, the limitation of the MapReduce programming model led to the adoption of flow oriented frameworks such as Cascading, which improved coding but didn’t improve execution. The reduce output still had to be written to disk. It was inherently very slow.

Then Spark came about. It was developed at UC Berkeley’s AMPLab in 2009 and donated to Apache in 2013. It was a different model - based on the concept of direct acyclic graph execution - that improved massively on the runtime behavioural flaws of MapReduce. Spark essentially brings a lot of useful algorithms for data mining, data analysis, machine learning, and algorithms on graphs. While MapReduce was a processing oriented framework, Spark offers an abstract computation model based on resilient distributed dataset (RDD), Dataframes/Datasets and iterative transformation. With RDDs you can have an acyclic graph of consecutive computational stages. RDDs allow a flow through serialization and in memory computation and do not need to be flushing to disk at every stage as with MapReduce. Despite the fact that Spark is completely built in Scala, it also provides high level APIs for Java, Python and R. It runs on multiple cluster technologies, namely Kubernetes, Mesos, YARN (Hadoop) and of course its own clusters.

Spark provides four additional APIs on top of the Spark Core:

Spark ML

The increasingly popular one is Spark ML, which a dedicated implementation of distributed machine learning algorithms. You'll not find that much variety as with the Python ML libraries. But the most important ones are present and there's one big advantage: They are inherently distributed.

GraphX

Then there's GraphX, which provides an API for leveraging graphs in a fast and robust development environment. It uses the Spark RDD concept to simplify graph analysis tasks and enable operations on a directed multigraph with properties associated with each node and edge.

Spark Streaming

One of the most important components of the Big Data ecosystem is Spark Streaming. It was added to Apache Spark back in 2013 and is an extension to the core Spark API that enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Live input data streams are received by Spark Streaming and split into batches. These batches are then buffered and processed by the latency-optimised Spark engine to output the results to other systems.

SparkSQL

Last but not least, there's SparkSQL, a Dataset API with SQL that provides state-of-the-art SQL performance.

The Spark Framework

Spark can be used for both batch and near real-time processing. The Apache Spark framework consists of a driver that runs as a master node and many executors (tasks, or closures) that run as worker nodes in the cluster. The worker nodes communicate through the Cluster Manager with the driver, which again works with the Cluster Manager to manage various other tasks. The Cluster Manager allocates the available resources (Worker Nodes) and chunks the job into several smaller tasks, which are then distributed among the Worker Nodes.

Another responsibility of the Driver is to ascertain that stages are completed. The Driver always keeps track and orchestrates every single stage of the Worker Nodes and makes sure that tasks are getting scheduled properly in coordination with the Cluster Manager. That allows Spark to be quite resilient. If, for instance, a worker fails, the Driver reschedules(reassigns?) that same unit of work on a different node.

Spark RDDs

The RDDs are the fundamental data structure in Spark. RDD stands for resilient distributed DataSet:

Resilient: If data in memory is lost, it will be recreated.

Distributed: Data is stored in memory across the cluster.

DataSet: Initial data comes from a data source (File, Stream, etc ..)

RDDs are immutable and partitioned collections of elements. Data will be copied from a data source. Spark will always try to shard the data as much as possible so that the executions can be run in parallel. It's always a collection of different elements of the same type that's being processed by Spark. Using RDDs, Spark offers transformation (map, filter, ..) and aggregation (reduce, fold, ..) operations similar to Scala collections.

Spark DataFrames and DataSets

RDDs do not handle and output data in an optimised and structured way. This is where DataFrames and DataSets come into play. DataFrames give us the ability to organise data in named columns (similar to the Python pandas framework). With data organised in a table like format, we can easily execute SQL commands to modify, select and filter the data. The big advantage here is, that Spark's catalyst supports optimisation. It has an execution plan optimiser, very similar to how relational databases work. In conjunction with specific formats such as JSON, Parquet, CSV Spark spins up the resources that are necessary for the query by optimising data structure or by filtering data in the map phase using predicate pushdown.

Datasets are an extension of dataframes. They are statically typed Dataframes, thus combining the flexibility of Dataframes with the power of types in Scala. Datasets are by default a collection of strongly typed JVM objects, unlike dataframes. Moreover, they make use of Spark's Catalyst, the Spark SQL optimizer based on Scala functional programming (You can find more info about this topic here: https://blog.bi-geek.com/en/spark-sql-optimizador-catalyst/). To make that more understandable, a Dataframe is a Dataset of type Row, which elegantly bridges those Python concepts with Scala.

Dataframes is integrated in the Spark ML library. So instead of having RDDs that need to be converted into some table presenting format you can basically just take a DataFrame and then operate the machine learning algorithm on top of the DataFrames, which makes it much easier and more pythonesque to run. For everyone in the functional programming cloud that may sound like an insult, but it is actually a very convenient way to generate and use those ML algorithms. Often pandas examples can simply be carried over from Python to Spark to get an idea how stuff works. And of course, DataFrames and more specifically their statically typed implementation Dataset also offer a decent amount of functional transformations like map, flatMap, filter, etc.

Data Sources

Spark can read data from and write it back to pretty much anything and anywhere:

Files: HDFS, S3, Google Buckets as JSON, XML, Parquet, ORC, CSV

Streams: JMS, Twitter, Kinesis, Kafka

Datastores: Elasticsearch, Cassandra, RDBMs (JDBC)

Spark offers schema discovery to assume the data type. It also supports the use of case classes. Common file formats can automatically be converted to case classes and DataSets.

Testing and Coding

Test-driven development is possible with Spark and we would encourage everyone to use this option (in general). Spark gives you the ability to choose your execution environment and change cluster types such as Kubernetes, YARN, Mesos, Spark standalone by configuration. Local mode lets you run your Spark application locally.

In a Nutshell

Spark removes the boilerplate around distribution and data sourcing. It makes things like reading to and writing from different data types or dealing with distributed computing a lot easier. That allows you to just focus on your application, which is a flow of functional collection transformations, with data structures similar to collections in Scala or the Java Stream API. You have lazy evaluation and runtime optimisation that makes Spark so popular and vastly superior to MapReduce.

Try Spark yourself!

You can find a training repository with exercises and solutions parts here: https://github.com/HivemindTechnologies/spark-workshop