konnomiya / dataengineering

This project forked from cornelldatascience/dataengineering

The Data Engineering subteam of Cornell Data Science

Home Page: http://cornelldata.science

Python 4.90% Java 38.02% Scala 0.99% CSS 9.60% XSLT 46.12% Shell 0.36%

dataengineering's Introduction

Data Engineering

Who we are:

The CDS Data Engineering subteam exists to provide analysis and processing support to CDS project teams, and to develop institutional knowledge in high-throughput computing.

Advisor: Professor Immanuel Trummer
Team Leads: Haram Kim (A&S CS 2020)

Team objectives:

  • Improve on existing high-throughput computing frameworks
  • Develop solutions for data analysis problems in CDS projects
  • Provide a reservoir of reference information in data engineering
  • Research and publish means of improving existing DE frameworks

Current Projects:

  • Spark ML Optimization: Apache Spark's machine learning modules are not as well studied as those of other platforms. This project seeks to identify empirically the settings for Spark's ML modules that make the best use of the platform's unique capabilities.
  • SkinnerDB Parallelization: This project experiments with parallelism in SkinnerDB, a database engine recently developed by Professor Trummer. SkinnerDB takes a machine learning approach to query optimization, in contrast to the heuristic model used by most current database engines, but it has not yet been extended to support multi-core execution.
  • Deterministic Query Approximation: Several recent publications have outlined methods for high-speed query approximation with deterministic bounds, but these methods have not yet been applied to a wide range of queries. The objective of this project is to apply several of these techniques to the TPC-H benchmark queries to demonstrate broader applicability.
  • GPU Acceleration: This project addresses distributed deep learning, which is currently well optimized for multiple GPUs but not necessarily across multiple machines. Our goal is to research and optimize tools currently in development so that they can be adopted by CDS teams deploying large DL models.
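
To make the idea of "deterministic bounds" in the query approximation project concrete, here is a minimal sketch in Python. It is not one of the methods from the publications mentioned above. It only illustrates the general principle: if every value in a column is known to lie in a range [lo, hi], a SUM over the column can be bracketed exactly after reading only part of the data. The function name `bounded_sum`, the sample data, and the value range are all hypothetical.

```python
from typing import Sequence, Tuple


def bounded_sum(
    values: Sequence[float], scanned: int, lo: float, hi: float
) -> Tuple[float, float]:
    """Return (lower, upper) bounds on sum(values) after reading `scanned` rows.

    Assumption: every value lies in [lo, hi], e.g. from column statistics.
    The bounds are guaranteed, not probabilistic: each unread row contributes
    at least `lo` and at most `hi` to the total.
    """
    if not 0 <= scanned <= len(values):
        raise ValueError("scanned must be between 0 and len(values)")
    partial = sum(values[:scanned])      # exact sum of the rows read so far
    remaining = len(values) - scanned    # number of rows not yet read
    return partial + remaining * lo, partial + remaining * hi


# Hypothetical column of prices, assumed to lie in [0.0, 20.0].
prices = [12.0, 7.5, 3.0, 9.5, 20.0, 1.0]
lower, upper = bounded_sum(prices, scanned=3, lo=0.0, hi=20.0)
print(lower, upper)
```

The interval narrows as more rows are scanned and collapses to the exact answer once every row has been read, which is the trade-off between speed and precision that such techniques exploit.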

Previous Projects:

  • Data streaming: Profiling of real-time data streaming through Apache Kafka

  • Server monitoring: Real-time visualization and monitoring of compute server resource utilization through Cockpit

  • File format optimization and profiling: Comparative analysis of a variety of file formats typically used in data science, focusing on CSVs and Apache Parquet

  • Spark diagnostics: Deliberate attempts to produce errors while running Apache Spark, both locally and on our servers. Problem specifics and solutions were recorded in case similar issues develop in the future.

Members (SP2018):

dataengineering's People

Contributors

haramkim-1, thebutlah, dwkwvss, josephch405, t40tds, kevluo
