GithubHelp home page GithubHelp logo

w205-project-2's Introduction

The description of the project materials in this repo is in my notebook.

Project 2: Tracking User Activity

In this project, you work at an ed tech firm. You've created a service that delivers assessments, and now lots of different customers (e.g., Pearson) want to publish their assessments on it. You need to get ready for data scientists who work for these customers to run queries on the data.

Tasks

Prepare the infrastructure to land the data in the form and structure it needs to be to be queried. You will need to:

  • Publish and consume messages with Kafka
  • Use Spark to transform the messages.
  • Use Spark to transform the messages so that you can land them in HDFS

You should include your docker-compose.yml used for spinning the pipeline. One example is enough!

Also, include the history of your console by running

history > <user-name>-history.txt

Do not alter the history file, submit it as is! Select the lines you seem relevant and paste them into your report.

You can either run Spark through the command line or use a notebook. If you use the command line, add an annotated markdown file where you present your results, tools used, and explanations of the commands that you use. To get the history from Spark run

docker-compose exec spark cat /root/.python_history

You should be able to query your data at the end.

In order to show the data scientists at these other companies the kinds of data that they will have access to, decide on 1 to 3 basic business questions that you believe they might need to answer about these data.

What you turn in

  • Your history file (see above)

  • A report either as a markdown file or a jupyter notebook. The report should describe your queries and spark SQL to answer business questions that you select. Describe your assumptions, your thinking, what the parts of the pipeline do. What is given? What do you set up yourself? Talk about the data. What issues do you find with the data? Do you find ways to solve it? How? If not describe what the problem is.

  • Any other files needed for the project, e.g., your docker-compose.yml, scripts used, etc

  • Make sure you have a good structure in your git repo and explain your repo in the README.

Comment you code with explanations of what you do, especially the steps to spin up your pipeline.

Data

To get the data, run

curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp`

Note on the data: This dataset is much more complicated than the one we'll cover in the live sessions. It is a nested JSON file, where you need to unwrap it carefully to understand what's really being displayed. There are many fields that are not that important for your analysis, so don't be afraid of not using them. The main problem will be the multiple questions field. Think about what the problem is here and give your thoughts. We recommend for you to read schema implementation in Spark Here is the documenation from Apache. It is NOT necessary to get to that data.

Here are some example questions that would be useful to know, but again, you choose your own (1 to 3).These examples are meant to get you started/thinking, they are optional.

  1. How many assesstments are in the dataset?
  2. What's the name of your Kafka topic? How did you come up with that name?
  3. How many people took Learning Git?
  4. What is the least common course taken? And the most common?
  5. Add any query(ies) you think will help the data science team

Files you submit

  • -history.txt
  • One example of your docker-compose.yml files
  • Either:
    • your Jupyter notebook or
    • the spark history file and an annotations file in markdown format that serves as your report

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.