
This project forked from gu-dsan6000/2018-spring

Main repository for Georgetown University Massive Data Analytics course (ANLY502), Spring 2018

This syllabus is effective as of Tuesday, January 09, 2018 at 11:03 AM

ANLY502
Massive Data Fundamentals
Georgetown University
Spring 2018

Course Information

  • Instructors: Marck Vaisman (mv559 at georgetown.edu), Irina Vayndiner (iv95 at georgetown.edu)
  • Classroom: St. Mary’s 126
  • Time: Monday 6:30-9:00pm (except 1/10 which meets on a Wednesday)

Course Description

Data is everywhere! Often it is simply too big to handle with traditional tools. This is a hands-on, practical, workshop-style course about using cloud computing resources to analyze and manipulate datasets that are too large to fit on a single machine or to be analyzed with traditional tools. The course will focus on Spark, MapReduce, the Hadoop ecosystem and other tools.

You will understand how to acquire and/or ingest the data, and then massage, clean, transform, analyze, and model it within the context of big data analytics. You will be able to think more programmatically and logically about your big data needs, tools and issues.

Credit Hours

This is a 3-credit graduate course. You will spend approximately 3 hours per week in class. You are expected to spend approximately 2-3 hours on out-of-class activities (required readings, homework problems, labs, quizzes, etc.) for each hour of class time. In total, you will spend 36 hours in instructional time and approximately 100 hours in out-of-classroom time.

Course Objectives

  • Operate big data tools and cloud infrastructure, including Spark, MapReduce, Hadoop and other tools in the big data ecosystem
  • Recognize and use ancillary tools that support big data processing, including git and the Linux command line
  • Set up and manage big data infrastructure and tools in the cloud on Amazon Web Services
  • Identify resources and documentation to remain current with big data tools and developments
  • Execute a big data analytics exercise from start to finish: ingest, wrangle, clean, analyze and store
  • Be aware of the responsibilities that are associated with performing analysis of large datasets

Prerequisites

  • Experience with the command line and terminal shell in Linux/OSX to navigate the file system and manipulate files and directories (create, move, delete, etc.). Understand file permissions.
  • Understand programming concepts (flow control, input/output, variable assignment).
  • Experience with R, Python, SQL or another programming language for reading files and for manipulating and analyzing data. Note: We will use Python as the primary interface to Apache Spark, through PySpark.
  • Experience with remote computing via ssh
  • Understand shell executables
  • Experience with version control tools such as git

Course Materials

We have chosen several reference books that cover different parts of the material, and we will assign readings from them for each class. All of these books are available on Safari Books Online, which, to our understanding, you can access as a Georgetown student. For details, visit the Georgetown Library e-book information page and click on “Safari Books Online”.

We may also provide supplemental materials to complement the books. Articles, links, etc. will be posted on Canvas.

Books (for assigned readings)

  • Benjamin Bengfort, Jenny Kim (2016). Data Analytics with Hadoop: An Introduction for Data Scientists. O’Reilly Media. ISBN: 9781491913703.
  • Ofer Mendelevitch, Casey Stella, Douglas Eadline (2016). Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale. Addison-Wesley Professional. ISBN: 9780134024141.
  • Matei Zaharia, Bill Chambers (2017). Spark: The Definitive Guide (Early Release). O’Reilly Media. ISBN: 9781491912157.

Additional Recommended Books

  • Tomasz Drabas, Denny Lee (2017). Learning PySpark. Packt Publishing. ISBN: 9781786463708.
  • Krishna Sankar (2016). Fast Data Processing with Spark 2 - Third Edition. Packt Publishing. ISBN: 9781784392574.

Learning Activities and Evaluation

This is a hands-on, practical, workshop-style course that provides opportunities to use the tools and techniques discussed in class. Although this is not a programming course per se, there is programming involved.

Lectures and In-Class Labs

Every class session will have a lecture portion, and many sessions will also have an in-class lab portion. The lab exercises are designed to get you familiar with the tools discussed in class by working through simple examples. They are not individually graded, but completing them counts toward the participation portion of your grade.

Quizzes

There will be a total of 6 online quizzes about the topics/ideas discussed in class and from the readings. The purpose of the quizzes is to reinforce your knowledge about the tools and platform and also to help you remember the nomenclature and terms used in class. The quizzes will be online through Canvas and you can take them at your convenience within the established time window.

Assignments

You will be given problem sets as homework assignments. The goal of these problem sets is to use the big data tools to answer some questions about large datasets. The problem sets will build on the labs and will be much more elaborate. Deliverables from the problem sets will usually include code written for your programs and the output produced.

We will be using GitHub Classroom for problem sets and assignment submissions. When an assignment is created, we will email you a link that will clone the assignment and create a private repository for you. You will perform your work within the repository and then push back to GitHub for submission. If you do not have a GitHub account, please create one.

Grading

  • Problem Sets: 60% (6 problem sets, 10% each)
  • Quizzes: 30% (6 online quizzes, 5% each)
  • Participation: 10% (in-class discussion, completion of in-class labs, active participation in online forums)
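
As an illustration of how the weights above combine, here is a minimal Python sketch. It assumes every item is scored on a 0-100 scale and that items within a category count equally (as "10% each" and "5% each" suggest). The function name `final_grade` and the sample scores are hypothetical and are not part of any official grading tool.

```python
# Category weights copied from the grading breakdown in this syllabus.
WEIGHTS = {"problem_sets": 0.60, "quizzes": 0.30, "participation": 0.10}


def final_grade(problem_sets, quizzes, participation):
    """Return a weighted course grade on a 0-100 scale.

    problem_sets, quizzes: lists of six scores each (assumed 0-100).
    participation: a single score (assumed 0-100).
    """
    if len(problem_sets) != 6 or len(quizzes) != 6:
        raise ValueError("expected 6 problem sets and 6 quizzes")
    # Equal weighting within a category means the category average
    # multiplied by the category weight gives that category's contribution.
    ps_avg = sum(problem_sets) / len(problem_sets)
    quiz_avg = sum(quizzes) / len(quizzes)
    return (WEIGHTS["problem_sets"] * ps_avg
            + WEIGHTS["quizzes"] * quiz_avg
            + WEIGHTS["participation"] * participation)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(round(final_grade([90] * 6, [80] * 6, 100), 2))
```

Under these assumptions, each problem set moves the final grade by up to 10 points and each quiz by up to 5 points.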

Course Calendar

This calendar is subject to change. We will make any changes known in advance.

| Date | Session | Title | Topics | Lab | Reading | Assignment | Quiz |
|---|---|---|---|---|---|---|---|
| Jan 10 | 1 | Welcome to Massive Data Analytics | Course Overview, What is Big Data, Distributed Computing, Cloud Computing, High Performance Computing | | | | |
| Jan 22 | 2 | The Infrastructure of the Cloud | Overview of major Cloud Computing providers (AWS and Azure), IAAS, PAAS and SAAS, Other Cloud providers | Set up your environment, create SSH keys, start and connect to an instance in the cloud | | A1 released - due Feb 2 | |
| Jan 29 | 3 | Database Systems | Overview of scalable database systems, Massively Parallel Processing databases, Netezza, Greenplum, Redshift | TBD | | | Q1 |
| Feb 05 | 4 | Introduction to Hadoop & MapReduce | Hadoop, Distributed filesystems, MapReduce programming model | Start and connect to a cluster, Run built-in Hadoop examples on cluster, Examine the different user interfaces | | A2 released - due Feb 18 | |
| Feb 12 | 5 | Hadoop Streaming | Hadoop Streaming | Run the “Hello World” of Hadoop, the word count using Hadoop Streaming | | | Q2 |
| Feb 26 | 6 | Higher Level APIs | Pig and Hive | Store a dataset in a Hive table, Run an example Pig job | | A3 released - due Mar 11 | Q3 |
| Mar 12 | 7 | Introduction to Spark | What is Spark, Resilient Distributed Datasets, PySpark | Start a PySpark session, Create RDDs, Operate on RDDs | | A4 released - due Mar 25 | |
| Mar 19 | 8 | SparkSQL | SQL Review, Intro to SparkSQL | Perform operations on Spark dataframes using SparkSQL | | | Q4 |
| Mar 26 | 9 | Machine Learning on Big Data | SparkML, Issues with ML algorithms on large datasets | Build a model | | A5 released - due Apr 15 | |
| Apr 09 | 10 | Working with Streaming Datasets | Spark Streaming | TBD | | | Q5 |
| Apr 16 | 11 | NoSQL | NoSQL | Store data in a NoSQL data store | | A6 released - due Apr 22 | |
| Apr 23 | 12 | Working with Graph Datasets | GraphX API for Spark | Analyze a large graph | | | Q6 |
| Apr 30 | 13 | Other tools of interest | Apache Drill, other topics | TBD | | | |

Class will not meet on Jan 15 (MLK Holiday), Feb 19 (President’s Day), Mar 05 (Spring Break), Apr 02 (Easter Break).

The topics for the last 4 sessions are placeholders. We will very likely cover them, but we have some room for flexibility depending on other topics of interest to the class. We may also have a guest lecturer from time to time.
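
The calendar refers to word count as the “Hello World” of Hadoop. As a rough preview of the MapReduce programming model behind that lab, the sketch below runs the three phases (map, shuffle, reduce) in plain Python on a single machine. It is not Hadoop Streaming code and not the lab material itself. The function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the whitespace-based tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in memory over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    print(word_count(["big data", "Big tools"]))
```

On a real cluster the framework handles the shuffle step and distributes the map and reduce work across machines. Here everything runs in one process so the data flow is easy to follow.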

Policies & Expectations

  • Attendance: Given the technical nature of this course, and the breadth of topics discussed, it is expected that you attend every class session. Please contact us in advance if you are not able to attend class.
  • E-mail: We will try to respond to email within 24 to 36 hours. Please use email for personal discussions and not for course questions.
  • Online Discussion Boards: Please use the discussion board on Canvas for questions about the course, homework assignments, technical issues, etc. Individual questions submitted by email do not scale, and the likelihood of many students having the same question is high. The forums are a great resource for everyone.
  • Name Tents: You will be given a name tent. Please use it every class session and place it in front of you so we can learn your name more quickly.

Open Door Policy

Please approach or get in touch with us if something is not working for you regarding the class, methods, etc. Our pledge to you is to provide the best learning experience possible.

Academic Integrity

You must perform all of your own work on problem sets. You may collaborate with other students, though all submitted work must be your own. Please refer to the Georgetown University Honor Council site for additional information.
