
seby-sbirna / computational-data-processing-using-spark-pandas-and-data-streaming


This repository contains a collection of three Data Engineering capstone projects made for the DTU course 02807: Computational Tools for Data Science.

Jupyter Notebook 100.00%
data-science data-streaming pandas spark continuous-streams database sql web-traffic web-traffic-forecasting trend-detection


Computational Data Processing using Spark, Pandas and Data Streaming

by Sebastian Sbirna, Yingrui Li and Aijie Shu


This repository contains a set of three full Data Science projects, created with a strong focus on tools and methods for working with data at scale.

The course focuses on tools for analyzing Big Data and other large-scale datasets with high computational demands, which are normally processed on a distributed cluster of machines or through statistical approximations.

The course's objective is to enable us to develop and implement parallel and distributed algorithms for data science applications, and to apply database technologies, models, and the relevant literature on computational tools and techniques for massive datasets.

The Project Assignments presented here consolidate the skills we learned throughout the course by applying them to specific company case-study problems.

In particular, we built a database evaluation of anonymous customer data using Pandas and SQL.

Afterwards, we analyzed a continuous stream of web traffic data using the HyperLogLog and Count-Min Sketch probabilistic data structures.
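For readers unfamiliar with these two structures, here is a minimal sketch of how they can work in plain Python. It is an illustration under stated assumptions, not the code from the project notebooks: the class names, the `_hash64` helper, the BLAKE2b-based hashing, the default sizes, and the example page paths are all hypothetical. A Count-Min Sketch approximates how often each item occurs (it can overestimate but never underestimate), while HyperLogLog approximates how many distinct items were seen.

```python
import hashlib
import math


def _hash64(item: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash; the seed selects a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class CountMinSketch:
    """Approximate per-item frequencies in fixed memory (hypothetical sizes)."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        # Each row uses its own hash function and increments one counter.
        for row in range(self.depth):
            self.table[row][_hash64(item, row) % self.width] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][_hash64(item, row) % self.width]
            for row in range(self.depth)
        )


class HyperLogLog:
    """Approximate number of distinct items using 2**p small registers."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = _hash64(item)
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Example usage on a made-up stream of page requests.
cms = CountMinSketch()
hll = HyperLogLog()
for visitor_id, page in [(1, "/home"), (2, "/home"), (1, "/cart"), (3, "/home")]:
    cms.add(page)
    hll.add(f"visitor-{visitor_id}")
print("estimated hits for /home:", cms.estimate("/home"))
print("estimated distinct visitors:", round(hll.count()))
```

Both structures trade exactness for a fixed, small memory footprint, which is what makes them suitable for an unbounded stream: the table and the register array never grow, no matter how many events arrive.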

Lastly, we used Spark on Airbnb's massive database to assess the popularity of certain cities' neighbourhoods and their lodging prices over time, and to perform an overall sentiment analysis of the text reviews of lodgings in each city.

