
Spark

This repo is a concise summary and replacement of the tutorials by IntelliPath and Coursera. Using the hyperlinks below is optional.

Tutorial from IntelliPath

Architecture

Spark is an open-source, scalable tool for parallel data processing.

Spark is polyglot: code can be written in Scala (most popular), Java, Python, R, and Spark SQL. It provides high-level APIs for these languages.

Spark Architecture

  • Driver program - The code you write behaves as a "driver program". The interactive shell you write code in is an example of a driver program.
  • Cluster manager - manages resources and schedules the various jobs. Sample cluster managers: Spark Standalone Cluster, Apache Mesos, Hadoop YARN, Kubernetes
  • Worker nodes - execute tasks and return results to the Spark context. They provide in-memory storage for cached RDDs (explained below)

Resilient Distributed Dataset (RDD)

RDDs are the fundamental data structure of Spark.

3 ways to create RDDs

1) Parallelize a collection

val myFirstRDD = sc.parallelize(List("spark", "scala", "hadoop"))

2) Use a data set in an external storage system

val textRDD = sc.textFile("/user/cloudera/data.txt")

3) Create an RDD from already existing RDDs

Using textRDD from above:

val newRDD = textRDD.filter(x => x.contains("spark"))

Transformations on RDDs

Function: map

val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"))
val y = x.map(x => (x, 1))
y.collect
// Output
Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

Function: flatMap - map returns exactly 1 element per input, while flatMap can return a list of elements per input

sc.parallelize(List(1, 2, 3)).flatMap(x=>List(x, x, x)).collect
// Output
Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

Function: filter

val numbers = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
numbers.filter(_ % 2 == 0).collect
// Output
Array(2, 4, 6, 8, 10)

Function: intersection

val parallel = sc.parallelize(1 to 9)
val par2 = sc.parallelize(5 to 15)
parallel.intersection(par2).collect
// Output
Array(6, 8, 7, 9, 5)

Actions on RDDs

Actions are Spark RDD operations that return non-RDD values

Function: reduce

val a = sc.parallelize(1 to 10)
a.reduce(_ + _)
// Output
Int = 55

Function: first

val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
names2.first
// Output
String = apple

Function: take

val nums = sc.parallelize(List(1, 5, 3, 9, 4, 0, 2))
nums.take(4)
// Output
Array[Int] = Array(1, 5, 3, 9)

Function: foreachPartition

val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
b.foreachPartition(x => println(x.reduce(_ + _)))
// Output
6
15
24

Creating Data Frames

Create a Data Frame from a List

List((1, "mobile", 50000), (2, "shoes", 4500), (3, "TV", 70000))

val productDF = product.toDF("pid", "product", "value") // column names
productDF.show()
// Output

+---+-------+-----+
|pid|product|value|
+---+-------+-----+
|  1| mobile|50000|
|  2|  shoes| 4500|
|  3|     TV|70000|
+---+-------+-----+

Create a Data Frame from a JSON file

val df = spark.read.json("/student1.json")
df.show()
// Output

+----+------+
| age|  name|
+----+------+
|null|   Sam|
|  17|  Mick|
|  18|Jennet|
|  19|Serena|
+----+------+

Let's print the schema:

df.printSchema()
// Output

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

Let's select a column:

df.select("name").show()
// Output

+------+
|  name|
+------+
|   Sam|
|  Mick|
|Jennet|
|Serena|
+------+

Let's filter for ages of 18 or greater:

df.filter($"age" >= 18).show()
// Output

+----+------+
| age|  name|
+----+------+
|  18|Jennet|
|  19|Serena|
+----+------+

Tutorial from Coursera

Introduction

Spark keeps all data immutable and in-memory.

All operations on data are just functional transformations, like regular Scala collections.

Fault tolerance is achieved by replaying functional transformations over the original dataset. This makes Spark up to 100x faster than Hadoop MapReduce, which uses disk writes to achieve fault tolerance.

Resilient Distributed Datasets (RDDs)

Most operations on RDDs are higher-order functions

abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
  ...
}

Example: Count of a specific word in Spark

Given encyclopedia: RDD[String], we can count how many pages of encyclopedia mention "EPFL":

val result = encyclopedia.filter(page => page.contains("EPFL")).count()

Example: Word count in Spark

val rdd = spark.textFile("hdfs://...")

val count = rdd.flatMap(line => line.split(" ")) // separate lines into words
               .map(word => (word, 1))           // include something to count
               .reduceByKey(_ + _)               // sum up the 1s in the pairs

Transformations and Actions

  • Transformations return new RDDs as results. Examples: map, filter, flatMap, groupBy
  • Actions return a result based on an RDD, which is either returned to the driver program or saved to an external storage system. Examples: reduce, fold, aggregate

Transformations are lazy (delayed execution), and actions are eager (immediate execution). So none of the transformations happen until there is an action.

To know if a function is a transformation or an action, we look at its return type. If the return type is an RDD, it's a transformation, otherwise it's an action.

Lazy evaluation resulting in efficiency

Spark will analyze and optimize a chain of operations before executing it. This is a benefit of lazy evaluation. In the code below, as soon as 10 elements of the filtered RDD have been computed, firstLogsWithErrors is done.

val lastYearsLogs: RDD[String] = ...
val firstLogsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).take(10)

Spark (unlike Scala) can also combine the below map and filter so that it doesn't have to iterate through the list twice:

val lastYearsLogs: RDD[String] = ...
val numErrors = lastYearsLogs.map(_.toLowerCase)
                             .filter(_.contains("error"))
                             .count()

Evaluation in Spark

Caching and Persistence

By default, RDDs are recomputed each time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once. To tell Spark to cache an RDD in memory, simply call persist() or cache() on it:

val lastYearsLogs: RDD[String] = ...
val logsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).persist()
val firstLogsWithErrors = logsWithErrors.take(10)
val numErrors = logsWithErrors.count() // faster since we used .persist() above

The persist() method offers 5 ways to customize how data is persisted (a short sketch follows the list):

  1. in memory as regular Java objects - has a shorthand function for it: cache() instead of persist()
  2. on disk as regular Java objects
  3. in memory as serialized Java objects (more compact)
  4. on disk as serialized Java objects (more compact)
  5. both in memory and on disk (spill over to disk to avoid re-computation)
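
These options correspond to Spark's StorageLevel constants. A minimal sketch (the errorLogs RDD is hypothetical; each RDD can only be assigned one storage level):

import org.apache.spark.storage.StorageLevel

val errorLogs = lastYearsLogs.filter(_.contains("ERROR"))
errorLogs.persist(StorageLevel.MEMORY_AND_DISK) // option 5: in memory, spilling to disk when needed
// Other levels: MEMORY_ONLY (what cache() uses), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER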

Scala Collections and Spark RDDs have similar-looking APIs. However, Spark RDDs use lazy evaluation while Scala Collections do not (by default)

Common pitfall: println in a cluster

What happens in this scenario?

case class Person(name: String, age: Int)
val people: RDD[Person] = ...
people.foreach(println)

Since foreach is an action with return type Unit, the println calls are executed on the worker nodes of the cluster (instead of in the driver program), so the output is never seen by the user.
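
If the goal is to inspect a few records on the driver, a simple workaround (a sketch, not from the original course) is to bring the data back to the driver first with an action such as take or collect:

// take returns an Array to the driver, so println runs locally and its output is visible
people.take(10).foreach(println)
// collect() also works, but only for RDDs small enough to fit in driver memory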

Reduction Operations

foldLeft and foldRight are not parallelizable, so they do not exist for Spark's RDDs. We use fold, reduce, and aggregate instead.

The Aggregate function has a signature of aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B
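
As a small illustrative sketch of aggregate (the data below is made up): count the total number of characters in an RDD of strings. seqop folds each element into a per-partition accumulator, and combop merges the per-partition results.

val words = sc.parallelize(List("spark", "rdd", "aggregate"))

val totalChars = words.aggregate(0)(
  (acc, word) => acc + word.length, // seqop: (B, A) => B
  (acc1, acc2) => acc1 + acc2       // combop: (B, B) => B
)
// totalChars: Int = 17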

Pair RDDs

Pair RDDs are just another name for distributed key-value pairs.

In distributed systems, Pair RDDs are used more often than arrays and lists.

Creating a Pair RDD from a JSON record

"definitions": {
  "firstname": "string",
  "lastname": "string",
  "address": {
    "type": "object",
    "properties": {
      "type": "object",
      "street": {
        "type": "string"
      },
      "city": {
        "type": "string"
      },
      "state": {
        "type": "string"
      }
    },
    "required": [
      "street_address",
      "city",
      "state"
    ]
  }
}

If we only care about the "address" part of the above record, we can create an RDD for just that part:

RDD[(String, Property)] // String is a key representing a city, 'Property' is its corresponding value.

case class Property(street: String, city: String, state: String)

We used the city as the key. This would be useful if we wanted to group these RDDs by their city, so we can do computations on these properties by city.
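
A minimal sketch of building that Pair RDD, assuming a hypothetical properties: RDD[Property] already parsed from the JSON:

case class Property(street: String, city: String, state: String)

// Key each Property by its city, giving an RDD[(String, Property)]
val propertiesByCity = properties.map(p => (p.city, p))

// Now per-city computations are possible, e.g. grouping all properties by city
val propertiesGroupedByCity = propertiesByCity.groupByKey()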

Creating a Pair RDD from an RDD

If given val rdd: RDD[WikipediaPage], we can create a pair RDD:

val pairRdd = rdd.map(page => (page.title, page.text))

Unlike a standard RDD, when you have a Pair RDD such as RDD[(K, V)], you get new methods such as:

def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Transformations and Actions on Pair RDDs

Pair RDD Transformation: groupByKey

groupBy from Scala:

def groupBy[K](f: A => K): Map[K, Traversable[A]]

Let's group by various ages:

val ages = List(2, 52, 44, 23, 17, 14, 12, 82, 51, 64)
val grouped = ages.groupBy { age =>
  if (age >= 18 && age < 65) "adult"
  else if (age < 18) "child"
  else "senior"
}
// Output
grouped: scala.collection.immutable.Map[String, List[Int]] =
  Map(senior -> List(82),
      adult -> List(52, 44, 23, 51, 64),
      child -> List(2, 17, 14, 12))

groupByKey for Pair RDDs in Spark

case class Event(organizer: String, name: String, budget: Int)
val eventsRdd = sc.parallelize(...) // "..." represents some data
                  .map(event => (event.organizer, event.budget))
val groupedRdd = eventsRdd.groupByKey()
groupedRdd.collect().foreach(println)
// Output is something like:

(Prime Sound, CompactBuffer(42000))
(Sportorg, CompactBuffer(23000, 12000, 1400))

Pair RDD Transformation: reduceByKey

We can use reduceByKey, which can be thought of as a combination of groupByKey and reduce-ing on all values per key.

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
case class Event(organizer: String, name: String, budget: Int)
val eventsRdd = sc.parallelize(...) // "..." represents some data
                  .map(event => (event.organizer, event.budget))
val budgetsRdd = eventsRdd.reduceByKey(_ + _)
budgetsRdd.collect().foreach(println)
// Output is something like:

(Prime Sound, 42000)
(Sportorg, 36400)

Joins

Provided Sample Data

data called "abos":

(101, ("Ruetli", AG)),
(102, ("Brelaz", DemiTarif)),
(103, ("Gress", DemiTarifVisa)),
(104, ("Schatten", Demitarif))

data called "locations":

(101, "Bern"),
(101, "Thun"),
(102, "Lausanne"),
(102, "Geneve"),
(102, "Nyon"),
(103, "Zurich"),
(103, "St-Gallen"),
(103, "Chur")

Join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Doing abos.join(locations) (also known as an inner join) gives us:

// Output

(101, ((Ruetli, AG), Bern))
(101, ((Ruetli, AG), Thun))
(102, ((Brelaz, DemiTarif), Nyon))
(102, ((Brelaz, DemiTarif), Lausanne))
(102, ((Brelaz, DemiTarif), Geneve))
(103, ((Gress, DemiTarifVisa), St-Gallen))
(103, ((Gress, DemiTarifVisa), Chur))
(103, ((Gress, DemiTarifVisa), Zurich))

Left Outer Joins, Right Outer Joins

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

Using a left outer join:

val abosWithOptionalLocations = abos.leftOuterJoin(locations)
abosWithOptionalLocations.collect().foreach(println)
// Output

(101, ((Ruetli, AG), Some(Thun)))
(101, ((Ruetli, AG), Some(Bern)))
(102, ((Brelaz, DemiTarif), Some(Geneve)))
(102, ((Brelaz, DemiTarif), Some(Nyon)))
(102, ((Brelaz, DemiTarif), Some(Lausanne)))
(103, ((Gress, DemiTarifVisa), Some(Zurich)))
(103, ((Gress, DemiTarifVisa), Some(St-Gallen)))
(103, ((Gress, DemiTarifVisa), Some(Chur)))
(104, ((Schatten, DemiTarif), None))  // notice the None

Shuffling

Shuffling is when data is moved between nodes. This can happen when we do a groupByKey(). Moving data around the network like this is extremely slow.

// slow
val purchasesPerMonthSlowLarge = purchasesRddLarge.map(p => (p.customerId, p.price))
  .groupByKey()
  .map(p => (p._1, (p._2.size, p._2.sum)))
  .count()

By reducing the data set first, we can reduce the amount of data that's sent over the network during a shuffle.

// fast
val purchasesPerMonthFastLarge = purchasesRddLarge.map(p => (p.customerId, (1, p.price)))
  .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  .count()

Partitioning

Partitioning can bring substantial performance gains, especially if you can prevent or lower the number of shuffles.

Properties of partitions

  • The data within an RDD is split into several partitions.
  • Partitions never span multiple machines.
  • Each machine in the cluster contains 1+ partitions.
  • The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes.

Customizing partitioning is only possible when working with Pair RDDs (since partitioning is done based on keys)
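
As a quick illustration of inspecting and changing the number of partitions (the values below are made up):

val rdd = sc.parallelize(1 to 100, 8) // explicitly ask for 8 partitions
rdd.getNumPartitions                  // Int = 8

// repartition redistributes the data into a new number of partitions (causes a shuffle)
rdd.repartition(16).getNumPartitions  // Int = 16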

Hash partitioning

Attempts to spread data evenly across partitions based on the key
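
A minimal sketch of hash partitioning a Pair RDD (the data is hypothetical):

import org.apache.spark.HashPartitioner

val samplePairs = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// Each key is assigned to a partition based on key.hashCode % numPartitions
val hashPartitioned = samplePairs.partitionBy(new HashPartitioner(8)).persist()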

Range partitioning

This is for keys that can have an ordering. Tuples with keys in the same range appear in the same machine. For example, if our numbers are 1 to 800, we can have 4 partitions of: [1, 200], [201, 400], [401, 600], [601, 800]

Invoking partitionBy creates an RDD with a specified partitioner.

val pairs = purchasesRdd.map(p => (p.customerId, p.price))

// 8 partitions. pairs will be sampled to create appropriate ranges.
val tunedPartitioner = new RangePartitioner(8, pairs)

val partitioned = pairs.partitionBy(tunedPartitioner).persist()

Each time the partitioned RDD is used, the partitioning is re-applied, resulting in unnecessary shuffling. By using persist we are telling Spark that once you move the data around in the network and re-partition it, persist it where it is and keep it in memory. The results of partitionBy should always be persisted.

2 ways partitioners can be passed around by transformations

1) Partitioner from parent RDD

Pair RDDs that are the result of a transformation on a partitioned Pair RDD are typically configured to use the hash partitioner that was used to construct the parent.

Operations on Pair RDDs that hold to (and propagate) a partitioner:

cogroup          foldByKey
groupWith        combineByKey
join             partitionBy
leftOuterJoin    sort
rightOuterJoin   mapValues (if parent has a partitioner)
groupByKey       flatMapValues (if parent has a partitioner)
reduceByKey      filter (if parent has a partitioner)

All other operations will produce a result without a partitioner.

Notice that map and flatMap are not on the above list. This is because map and flatMap can change the keys in an RDD. For this reason, use mapValues instead of map whenever possible to avoid unnecessary shuffling.
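
A small sketch of the difference, assuming the partitioned pair RDD from the RangePartitioner example above:

// mapValues can only touch the values, so Spark keeps the existing partitioner
val pricesDoubled = partitioned.mapValues(price => price * 2)

// map could change the keys, so the resulting RDD has no partitioner
val remapped = partitioned.map { case (id, price) => (id, price * 2) }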

Operations that may cause a shuffle: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce

2) Automatically-set partitioners

Some operations on RDDs automatically result in an RDD with a known partitioner, when it makes sense. Examples:

  • RangePartitioner is used when using sortByKey
  • HashPartitioner is used when using groupByKey

Wide vs Narrow Dependencies

Narrow Dependency

Each partition of the parent RDD is used by at most 1 partition of the child RDD

Narrow Dependencies

Transformations with narrow dependencies: map, mapValues, flatMap, filter, mapPartitions, mapPartitionsWithIndex

Wide dependency

Each partition of the parent RDD may be depended on by multiple child partitions

Wide Dependencies

Transformations with wide dependencies (that may cause a shuffle): cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce

Finding dependencies

dependencies()

dependencies() returns a sequence of Dependency objects, which are the dependencies used by Spark's scheduler to know how this RDD depends on other RDDs.

  • dependencies() may return:
    • Narrow dependency objects: OneToOneDependency, PruneDependency, RangeDependency
    • Wide dependency objects: ShuffleDependency

val wordsRdd = sc.parallelize(largeList)
val pairs = wordsRdd.map(c => (c, 1))
                     .groupByKey()
                     .dependencies
// Output is something like:

pairs: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@4294a23d)

toDebugString()

val wordsRdd = sc.parallelize(largeList)
val pairs = wordsRdd.map(c => (c, 1))
                     .groupByKey()
                     .toDebugString
// Output is something like:

pairs: String =
(8) ShuffleRDD[219] at groupByKey at <console>:38 []
 +-(8) MapPartitionsRDD[218] at map at <console>:37 []
    |  ParallelCollectionRDD[217] at parallelize at <console>:36 []

The indentation in the above output shows how Spark groups these operations together.

Structure and Optimization

Optimizing Inner Join

If we have:

val demographics = sc.textFile(...) // Pair RDD of (id, demographic)
val finances = sc.textFile(...) // Pair RDD of (id, finances)

Solution 1: Inner Join then Filter

An inner join of demographics and finances will give us a type of: (Int, (Demographic, Finances)), which we then filter and count below:

demographics.join(finances)
            .filter { p =>
              p._2._1.country == "Switzerland" &&
              p._2._2.hasFinancialDependents &&
              p._2._2.hasDebt
            }.count

Solution 2: Filter then Inner Join

val filtered = finances.filter(p => p._2.hasFinancialDependents && p._2.hasDebt)

demographics.filter(p => p._2.country == "Switzerland")
            .join(filtered)
            .count

Solution 3: Cartesian Product, then filters

val cartesian = demographics.cartesian(finances)

cartesian.filter {
  case (p1, p2) => p1._1 == p2._1
}
.filter {
  case (p1, p2) => (p1._2.country == "Switzerland") &&
                   (p2._2.hasFinancialDependents) &&
                   (p2._2.hasDebt)
}.count

Comparing our 3 methods

Fastest to Slowest: Solution 2, Solution 1, Solution 3

  • Cartesian product (Solution 3) is extremely slow. Use inner join instead.
  • Filtering data first before join (Solution 2) is much faster than joining then filtering (Solution 1)

Types of Data

  • Structured: Database tables
  • Semi-Structured: JSON, XML - these types of data are self-describing. No rigid structure to them.
  • Unstructured: Log files, images

For structured data, Spark may be able to make optimizations for you (such as putting filters before inner joins). That is the whole point of Spark SQL. The only caveat is we've got to give up some of the freedom, flexibility, and generality of the functional collections API in order to give Spark some structure and thus more opportunities to optimize.

Spark SQL

Spark SQL is a library implemented on top of Spark.

Benefits

  • mix SQL queries with Scala - sometimes it's more desirable to express a computation in SQL syntax instead of functional APIs, and vice versa.
  • high performance - we get optimizations we're used to from databases, into our Spark jobs.
  • support new data sources such as semi-structured data and external databases

3 main APIs it adds

  • SQL literal syntax
  • DataFrames
  • Datasets

2 specialized backend components

  • Catalyst - a query optimizer.
  • Tungsten - off-heap serializer.

More info on all this later.

SparkSession

SparkSession is the newer entry point that replaces SparkContext for Spark SQL applications. This is how to create a SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("My App")
  //.config("spark.some.config.option", "some-value")
  .getOrCreate()

Creating DataFrames

A DataFrame is conceptually equivalent to a table in a relational database.

DataFrames are distributed collections of records, with a known schema.

There are 2 ways to create data frames:

  1. From an existing RDD - either with schema inference, or with an explicit schema
  2. Reading a data source from file - common structured or semi-structured formats such as JSON

Method 1: Use an existing RDD:

val tupleRDD = ... // Assume RDD[(Int, String, String, String)]
val tupleDF = tupleRDD.toDF("id", "name", "city", "country") // column names

If you don't pass column names to toDF, then Spark will assign default names (_1, _2, ...) as attributes (column names).

However, if you have an RDD containing some kind of case class instance, then Spark can infer the attributes from the case class's fields:

case class Person(id: Int, name: String, city: String)
val peopleRDD = ... // Assume RDD[Person]
val peopleDF = peopleRDD.toDF // Attributes (column names) will be inferred

Another option is to use an explicit schema; the full process is more involved, but a minimal sketch is shown below.
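
For reference, a minimal sketch of the explicit-schema route (the rowRDD name and columns are hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assume rowRDD: RDD[Row] whose values match the schema below
val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true)
))

val explicitDF = spark.createDataFrame(rowRDD, schema)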

Method 2: Use a data source from file

// 'spark' represents the SparkSession object
val df = spark.read.json("examples/src/main/resources/people.json")

Spark SQL can directly create DataFrames from the following semi-structured/structured data: JSON, CSV, Parquet (a serialized big data format), JDBC, + more using DataFrameReader
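
For example, reading a CSV file with a header row might look like this (the path is hypothetical):

val csvDF = spark.read
  .option("header", "true")      // first line contains column names
  .option("inferSchema", "true") // let Spark guess the column types
  .csv("examples/people.csv")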

Creating Temp Views

Assuming we have a DataFrame called peopleDF, we just have to register our DataFrame as a temporary SQL view first:

peopleDF.createOrReplaceTempView("people")

This registers the DataFrame as an SQL temporary view. It essentially gives a name to our DataFrame in SQL so we can refer to it in an SQL FROM statement:

val adultsDF = spark.sql("SELECT * FROM people WHERE age > 17")

The SQL statements available are basically what's available in HiveQL.

Data Frames

DataFrames API is similar to SQL, in that it has select, where, limit, orderBy, groupBy, join, etc.

Spark SQL vs Data Frames API

Given:

case class Employee(id: Int, fname: String, lname: String, age: Int, city: String)
val employeeDF = sc.parallelize(...).toDF

we can use Spark SQL as:

// assuming we have an "employees" table registered, we can do:
val sydneyEmployeesDF = spark.sql("""SELECT id, lname
                                       FROM employees
                                      WHERE city = "Sydney"
                                   ORDER BY id""")

or we can use the DataFrames API as:

val sydneyEmployeesDF = employeeDF.select("id", "lname")
                                  .where("city == 'Sydney'")
                                  .orderBy("id")

Seeing our data

  • show() pretty-prints DataFrame in tabular form, showing first 20 elements
  • printSchema() - prints the schema of your DataFrame in tree format

3 ways to select a column

  1. Use $-notation as df.filter($"age" > 18). Requires import spark.implicits._ to use $-notation.
  2. Refer to the Dataframe: df.filter(df("age") > 18)
  3. Use SQL query string: df.filter("age > 18")

Working with missing values

Dropping records with unwanted values:

  • drop() drops rows that contain null or NaN values in any column and returns a new DataFrame
  • drop("all") drops rows that contain null or NaN values in all columns and returns a new DataFrame
  • drop(Array("id", "name")) drops rows that contain null or NaN values in the specified columns and returns a new DataFrame

Replacing unwanted values (a combined code sketch follows this list):

  • fill(0) replaces all occurrences of null or NaN in numeric columns with a specified value and returns a new DataFrame
  • fill(Map("minBalance" -> 0)) replaces all occurrences of null or NaN in specified column with specified value and returns a new DataFrame
  • replace(Array("id"), Map(1234 -> 8923)) replaces specified value (1234) in specified column (id) with specified replacement value (8923) and returns a new DataFrame

Common actions on DataFrames

Like RDDs, DataFrames also have their own set of actions:

collect(): Array[Row] // returns an array that contains all rows in this DataFrame
count(): Long // returns number of rows in DataFrame
first(): Row // returns the first row in the DataFrame
head(): Row  // same as first()
show(): Unit // displays the top 20 rows of DataFrame in a tabular form
take(n: Int): Array[Row] // returns the first n rows in the DataFrame

Joins on DataFrames

Joins on DataFrames are similar to those on Pair RDDs, with 1 major difference: since DataFrames aren't key/value pairs, we must specify which columns to join on.

Examples of joins - inner, outer, left_outer, right_outer, leftsemi:

df1.join(df2, $"df1.id" === $"df2.id")                // inner join
df1.join(df2, $"df1.id" === $"df2.id", "right_outer") // right_outer join

Optimizations on DataFrames: Catalyst

Compiles Spark SQL programs down to an RDD.

  • Reorders operations - for example, tries to do filters as early as possible.
  • Reduces the amount of data we must read - skips reading in, serializing, and sending around parts of the data that aren't needed for our computation (Example: a Scala object with many fields - Catalyst will only send around the relevant columns of the object).
  • Pruning unneeded partitions - Analyzes DataFrame and filter operations to figure out and skip partitions that aren't needed in our computation.

Optimizations on DataFrames: Tungsten

Tungsten is

  • highly-specialized data encoder - since our data types are restricted to Spark SQL data types, Tungsten can optimize encoding by using this schema information.
  • column-based storage - this is common for databases. Since most operations on tables are done on columns (instead of rows), it's more efficient to store data by grouping column data together.
  • encodes data off-heap - so it's free from garbage collection overhead.

Limitations of DataFrames

  1. DataFrames are untyped (unlike RDDs). Your code may compile, but you may get a runtime exception if you try to run a query on a column that doesn't exist.
  2. If your unstructured data cannot be reformulated to adhere to some kind of schema, it would be better to use RDDs

Datasets

DataFrames don't have type safety. Datasets resolve this problem.

type DataFrame = Dataset[Row] // DataFrames are actually Datasets of type: Row

  • Datasets can be thought of as typed distributed collections of data
  • Dataset API unifies the DataFrame and RDD APIs. We can freely mix these APIs, although the function signatures may be slightly different.
  • Datasets require structured or semi-structured data.

Datasets vs DataFrames: you get type information with Datasets, and you can use higher-order functions like map, flatMap, and filter, which Datasets get from RDDs.

Datasets vs RDDs: you get more optimizations than with RDDs, since Catalyst works on Datasets.

Mixing APIs example, assuming listingsDS is of type Dataset[Listing]:

listingsDS.groupByKey(l => l.zip)        // from RDD API: groupByKey
          .agg(avg($"price").as[Double]) // from our DataFrame API

The types match up since everything is a dataset.

Creating a Dataset

Create dataset from a DataFrame

import spark.implicits._
myDF.toDS // creates a new dataset from a dataframe

Create dataset from JSON

If we define a case class whose structure, names, and types all match up with "people.json", then we can read this file into a dataset, perfectly typed:

val myDS = spark.read.json("people.json").as[Person]

Create dataset from RDD

import spark.implicits._
myRDD.toDS

Create dataset from Scala type

import spark.implicits._
List("yay", "ohnoes", "hooray!").toDS

Typed Columns

Datasets use typed columns, so the following error could happen:

found   : org.apache.spark.sql.Column
required: org.apache.spark.sql.TypedColumn[...]
                .agg(avg($"price")).show

To create a TypedColumn, we can rewrite it as $"price".as[Double] to give the column a specific type (of Double)

Untyped and Typed Transformations

  • Untyped transformations - exist in DataFrames and DataSets
  • Typed transformations - exist in Datasets. Typed variants of many DataFrame transformations, and additional transformations such as RDD-like higher-order functions like map, flatMap, etc.

Aggregators

Aggregator is a class that helps you generically aggregate data, similar to the aggregate method on RDDs.

class Aggregator[-IN, BUF, OUT]

  • IN is the input type to the aggregator. When using an aggregator after groupByKey, this is the type that represents the value in the key/value pair.
  • BUF is the intermediate type during aggregation
  • OUT is the type of the output of the aggregation

To create an Aggregator, define the IN, BUF, OUT types and implement the below methods:

val myAgg = new Aggregator[IN, BUF, OUT] {
  def zero: BUF = ...                    // The initial value.
  def reduce(b: BUF, a: IN): BUF = ...   // Add an element to the running total.
  def merge(b1: BUF, b2: BUF): BUF = ... // Merge intermediate values.
  def finish(b: BUF): OUT = ...          // Return the final result.
}.toColumn // if we're going to pass this to an aggregation method, it needs to be of type column

Example of specific Aggregator:

val keyValues
  = List((3, "Me"), (1, "Thi"), (2, "Se"), (3, "ssa"), (1, "sIsA"), (3, "ge:"), (3, "-)", (2, "cre"), (2,"t"))

val keyValuesDS = keyValues.toDS

val strConcat = new Aggregator[(Int, String), String, String] {
  def zero: String = ""
  def reduce(b: String, a: (Int, String)): String = b + a._2
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}.toColumn

// pass it to our aggregator
keyValuesDS.groupByKey(pair => pair._1)
           .agg(strConcat.as[String]).show

The above solution now needs Encoders for it to work.

Encoders

Encoders convert your data between JVM objects and Spark SQL's specialized internal representation. Encoders are required by all Datasets. They generate custom bytecode for serialization and deserialization of your data.

Two ways to introduce encoders:

  1. Automatically (generally the case) via implicits from a SparkSession. Just do import spark.implicits._
  2. Explicitly via org.apache.spark.sql.Encoders which contains a large selection of methods for creating Encoders from Scala primitive types, Products, tuples.

We explicitly add encoders to our strConcat aggregator above by implementing these 2 methods:

override def bufferEncoder: Encoder[String] = Encoders.STRING
override def outputEncoder: Encoder[String] = Encoders.STRING

When to use Datasets vs DataFrames vs RDDs

  • Use Datasets when
    • you have structured/semi-structured data
    • you want typesafety
    • you need to work with functional APIs
    • you need good performance, but it doesn't have to be the best
  • Use DataFrames when
    • you have structured or semi-structured data
    • you want the best possible performance, automatically optimized for you
  • Use RDDs when
    • you have unstructured data
    • you need to fine-tune and manage low-level details of RDD computations
    • you have complex data types that cannot be serialized with Encoders

User Defined Functions (UDFs)

User Defined Functions (UDFs) are a feature of Spark SQL to define new Column-based functions for transforming Datasets

Instead of UDFs, use higher-level standard Column-based functions whenever possible since Spark SQL performs optimizations on them. Spark SQL does not perform optimizations on UDFs.

Example of UDF:

val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")

val upper: String => String = _.toUpperCase // regular Scala function

// Define a UDF that wraps the upper Scala function defined above.
// You could instead define the function inside the udf but separating
// Scala functions from Spark SQL's UDFs allows for easier testing.
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)

// Apply the UDF to change the source dataset
dataset.withColumn("upper", upperUDF('text)).show

gives output of:

+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+

Alternatively you could have defined the UDF like this:

val upper: String => String = _.toUpperCase
val upperUDF = udf { s: String => s.toUpperCase }

or like this:

val upper: String => String = _.toUpperCase
val upperUDF = udf[String, String](_.toUpperCase)

You can also register UDFs so you can use them in SQL queries:

val spark: SparkSession = ...
spark.udf.register("myUpper", (input: String) => input.toUpperCase)
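
Once registered, the UDF can be called from SQL, assuming the DataFrame has been registered as a temporary view (the view name here is hypothetical):

dataset.createOrReplaceTempView("sentences")
spark.sql("SELECT id, text, myUpper(text) AS upper FROM sentences").show()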

SparkException: Task not serializable

org.apache.spark.SparkException: Task not serializable exception occurs when you use a reference to an instance of a non-serializable class inside a transformation.

Functions on RDDs (such as map), Dataframes, Datasets, etc. need to be serialized so they can be sent to worker nodes. Serialization happens for you, but if the function makes a reference to a field in another object, the entire other object must be serialized.

Example 1

object Example {
  val num = 1
  def myFunc = testRdd.map(_ + num)
}

This code fails because num is defined outside the scope of myFunc(): since the function makes a reference to a field in another object, the entire other object must be serialized, and object Example is not serializable.

The code is fixed by adding extends Serializable to the object:

object Example extends Serializable {
  val num = 1
  def myFunc = testRdd.map(_ + num)
}

Example 2

object Example {
  val num = 1
  def myFunc = {
    val enclosedNum = num
    testRdd.map(_ + enclosedNum)
  }
}

Instead of using extends Serializable to serialize the entire object, this code works since we added val enclosedNum = num. Now the entire object doesn't need to be serialized since enclosedNum is in the scope of myFunc()

However, if we used lazy val enclosedNum = num instead, it wouldn't work. When enclosedNum is referenced, it still requires knowledge of num so it will still try to serialize object Example.

