- Session 1: Introduction to Spark and ShARC (HPC)
- Session 2: RDD, DataFrame, ML pipeline, & parallelization
- Session 3: Scalable matrix factorisation for collaborative filtering recommender systems
- Session 4: Scalable K-means clustering
- Session 5: Scalable PCA for dimensionality reduction (and data types in Spark)
- Session 6: Decision trees
- Session 7: Advanced decision trees
- Session 8: Scalable logistic regression
- Session 9: Scalable generalized linear models
- Session 10: (TBC) Apache Spark in the Cloud (invited lecture by Dr Michael Smith)
The materials draw on the following sources:
- The PySpark tutorial by Wenqiang Feng: Learning Apache Spark with Python, Release v1.0 (PDF and GitHub project page)
- The official Apache Spark documentation
- The Introduction to Apache Spark course by Prof. Anthony D. Joseph, University of California, Berkeley
- The book Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. There is also a repository of code from the book.
Many thanks to
- Mike Croucher, Neil Lawrence, Will Furnass, Twin Karmakharm, and Vamsi Sai Turlapati for their input and inspiration.
- Our teaching assistants (demonstrators) and students who have contributed in various ways.