Spark tutorial

Install Spark

  1. Install Java
sudo apt install default-jdk
java -version
  2. Download Apache Spark from https://spark.apache.org/downloads.html, unpack it, and move it into place
sudo mv spark-3.2.0-bin-hadoop3.2/ /opt/spark
  3. Install PySpark
pip install pyspark
  4. Set the Spark environment variables
vi ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
source ~/.bashrc

Spark & Jupyter notebook

To set up Spark in a Jupyter notebook, do the following:

  1. add the following lines into ~/.bashrc

    • local access
      export PYSPARK_DRIVER_PYTHON=jupyter
      export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    • remote access
      export PYSPARK_DRIVER_PYTHON=jupyter
      export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=<port> --ip='*'"
    • Windows Subsystem for Linux (WSL)
      export PYSPARK_DRIVER_PYTHON=jupyter
      export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
  2. run from terminal:

    pyspark

Note that remote access to the Jupyter notebook requires an SSH tunnel. On Windows machines, you can use PuTTY to set it up. In Linux environments, the following command can be used:

ssh -N -L localhost:<local_port>:localhost:<port> <user>@<remote_host>

Finally, you can run the notebook in your browser:

http://localhost:<local_port>

PySpark Python API

PySpark can be used from standalone Python scripts by creating a SparkContext. You can set configuration properties by passing a SparkConf object to SparkContext.

Documentation: pyspark package
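
A minimal sketch of what this can look like in a standalone script (the app name "demo" and the local[*] master are placeholder choices, not fixed by this tutorial):

    from pyspark import SparkConf, SparkContext

    # configure the application name and run Spark locally on all available cores
    conf = SparkConf().setAppName("demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)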

Parallelism demo
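
As an illustration (a sketch, assuming an active SparkContext sc such as the one pyspark creates), parallelize() distributes a local collection across a chosen number of partitions, and transformations on it then run in parallel:

    # distribute a local range over 4 partitions
    rdd = sc.parallelize(range(10), 4)
    print(rdd.getNumPartitions())              # 4
    print(rdd.map(lambda x: x * x).collect())  # [0, 1, 4, 9, ..., 81]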

RDD - Resilient Distributed Datasets

resilient: fault-tolerant; lost partitions can be recomputed from the RDD's lineage.

Spark is RDD-centric!

RDD Actions

Some useful actions:
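
For instance (a sketch, assuming a SparkContext sc), actions such as collect(), count(), first(), and take() trigger computation and return results to the driver:

    rdd = sc.parallelize([3, 1, 2, 1])
    rdd.collect()   # [3, 1, 2, 1]
    rdd.count()     # 4
    rdd.first()     # 3
    rdd.take(2)     # [3, 1]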

Demo files

file1.txt:
    Apple,Amy
    Butter,Bob
    Cheese,Chucky
    Dinkel,Dieter
    Egg,Edward
    Oxtail,Oscar
    Anchovie,Alex
    Avocado,Adam
    Apple,Alex
    Apple,Adam
    Dinkel,Dieter
    Doughboy,Pilsbury
    McDonald,Ronald

file2.txt:
    Wendy,
    Doughboy,Pillsbury
    McDonald,Ronald
    Cheese,Chucky

Note: the following produces output on the Jupyter notebook server, not in the notebook itself!
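
For example (a sketch, assuming the demo files sit in the notebook's working directory), foreach() runs on the worker processes, so anything it prints appears in the terminal where the notebook server was started rather than in the notebook cell:

    rdd1 = sc.textFile("file1.txt").map(lambda line: tuple(line.split(",")))
    rdd1.foreach(print)   # printed by the workers, i.e. on the server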

RDD Operations

map()

Return a new RDD by applying a function to each element of this RDD.
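
A minimal sketch (assuming a SparkContext sc):

    rdd = sc.parallelize([1, 2, 3])
    rdd.map(lambda x: (x, x ** 2)).collect()
    # [(1, 1), (2, 4), (3, 9)]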

flatMap()

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
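
For example (same assumption of a live sc), splitting lines produces one element per word rather than one list per line:

    lines = sc.parallelize(["Apple,Amy", "Butter,Bob"])
    lines.flatMap(lambda line: line.split(",")).collect()
    # ['Apple', 'Amy', 'Butter', 'Bob']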

mapValues()

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.

Only works with pair RDDs.
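
A sketch with pairs in the style of the demo files (assuming sc):

    pairs = sc.parallelize([("Apple", "Amy"), ("Butter", "Bob")])
    pairs.mapValues(lambda v: v.upper()).collect()
    # [('Apple', 'AMY'), ('Butter', 'BOB')]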

flatMapValues()

Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.

Only works with pair RDDs.
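
A sketch (assuming sc); each produced value keeps its original key:

    pairs = sc.parallelize([("Apple", "Amy,Alex"), ("Butter", "Bob")])
    pairs.flatMapValues(lambda v: v.split(",")).collect()
    # [('Apple', 'Amy'), ('Apple', 'Alex'), ('Butter', 'Bob')]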

filter()

Return a new RDD containing only the elements that satisfy a predicate.
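
For example (assuming sc):

    rdd = sc.parallelize(range(10))
    rdd.filter(lambda x: x % 2 == 0).collect()
    # [0, 2, 4, 6, 8]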

groupByKey()

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.

Only works with pair RDDs.
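
A sketch (assuming sc); mapValues(list) is only used here to make the grouped values printable:

    pairs = sc.parallelize([("Apple", "Amy"), ("Apple", "Alex"), ("Butter", "Bob")])
    pairs.groupByKey().mapValues(list).collect()
    # [('Apple', ['Amy', 'Alex']), ('Butter', ['Bob'])]   (key order may vary)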

reduceByKey()

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
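
For example (assuming sc):

    pairs = sc.parallelize([("Apple", 1), ("Apple", 1), ("Butter", 1)])
    pairs.reduceByKey(lambda a, b: a + b).collect()
    # [('Apple', 2), ('Butter', 1)]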

sortBy()

Sorts this RDD by the given keyfunc.
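
A sketch (assuming sc), sorting pairs by their second field in descending order:

    rdd = sc.parallelize([("Cheese", 2), ("Apple", 3), ("Butter", 1)])
    rdd.sortBy(lambda kv: kv[1], ascending=False).collect()
    # [('Apple', 3), ('Cheese', 2), ('Butter', 1)]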

sortByKey()

Sorts this RDD, which is assumed to consist of (key, value) pairs.
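
For example (assuming sc):

    pairs = sc.parallelize([("Butter", "Bob"), ("Apple", "Amy")])
    pairs.sortByKey().collect()
    # [('Apple', 'Amy'), ('Butter', 'Bob')]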

subtract()

Return each value in self that is not contained in other.
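
A sketch (assuming sc); note that whole elements are compared, not just keys:

    a = sc.parallelize([("Cheese", "Chucky"), ("Apple", "Amy")])
    b = sc.parallelize([("Cheese", "Chucky")])
    a.subtract(b).collect()
    # [('Apple', 'Amy')]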

join()

Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
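
For example (assuming sc):

    a = sc.parallelize([("Cheese", "Chucky"), ("Apple", "Amy")])
    b = sc.parallelize([("Cheese", "Cheddar")])
    a.join(b).collect()
    # [('Cheese', ('Chucky', 'Cheddar'))]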

MapReduce demo

We will now count the occurrences of each word. Word count is the typical "Hello, world!" app for Spark applications. The map/reduce model is particularly well suited to tasks like counting words in a document.

The flatMap() operation first converts each line into an array of words, and then makes each of the words an element in the new RDD.

The map() operation replaces each word with a tuple of that word and the number 1. The result is a pair RDD where the word is the key and every value is the number 1.

The reduceByKey() operation keeps adding values together until only a single total remains for each key (word).
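
Putting the three steps together (a sketch, assuming sc and the demo file1.txt from above; here the comma-separated fields play the role of words):

    text = sc.textFile("file1.txt")
    words = text.flatMap(lambda line: line.split(","))
    pairs = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.collect()
    # e.g. [('Apple', 3), ('Dinkel', 2), ('Amy', 1), ...]   (order may vary)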

Simplify chained transformations

The code above can also be written more compactly by chaining the transformations:
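
For example, the same word count as a single chained expression (same assumptions as above):

    counts = (sc.textFile("file1.txt")
                .flatMap(lambda line: line.split(","))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b)
                .collect())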