- Maven needs to be set up on the local machine
- PySpark version 3.0.0 is required
The goals of the programming exercise are to:
- Read data from a JSON file without inferring the schema
- Explode array columns
- Unwrap nested structures
- Write the DataFrame to CSV
- Create a Delta table
- Write to the table
- Read from the table
- Spark cluster components and deployment modes
- Caching - cache(), persist(), unpersist(), and storage levels
- Partitioning
- Initial DataFrame partitioning when reading from data source
- Repartitioning via coalesce() vs repartition()
- Controlling number of shuffle partitions
- Performance
- Catalyst optimizer
- Identifying performance bottlenecks in Spark applications
- Transformations, actions, and other operations
  - Wide vs narrow transformations
- Joins
- Broadcast Joins
- Cross Joins
- Defining and using User Defined Functions (UDFs)
- Window functions
- Streaming
- Checkpoints
- Aggregation using time windows
- Watermarking