Name: Shuo Tian
Type: User
Company: University of Waterloo
Bio: Interest: Data Science, Data Engineering, Machine Learning, Data Analyst, Deep Learning, Software Development
Location: Toronto, Ontario, CA
Blog: [email protected]
Shuo Tian's Projects
CS 451/651, CS 431/631: Data-Intensive Distributed Computing (Winter 2019) at the University of Waterloo https://aroegies.github.io/bigdata-2019w/
Customer Segmentation Based on Cannabis Consumer Reviews
Chatbot-Raw Reddit Comments-Data Clean-Seq2Seq-Tensorflow-Attention-Bidirectional GRU
Data modeling with PostgreSQL and building an ETL pipeline using Python. Define fact and dimension tables for a star schema for a particular analytic focus, and write an ETL pipeline that transfers data from files in two local directories into these tables in PostgreSQL using Python and SQL.
Building out an ETL pipeline: extracting data from S3 buckets, processing it with Spark, and transforming it into a star schema stored in S3 buckets in Parquet format with efficient partitioning.
Automating ETL pipelines using Airflow, Python, and Amazon Redshift. Transforming data from various sources into a star schema optimized for the analytics team's use cases. Writing custom operators to perform tasks such as staging data, filling the data warehouse, and validating the results through data quality checks.
Building out an ETL pipeline using the AWS SDK, Redshift, Python, and PostgreSQL. Developing a pipeline that connects to a Redshift cluster and uses COPY to load data from S3 buckets into Redshift staging tables. Creating a database with tables designed to optimize queries for song play analysis.
Code for deep learning tutorials that I have posted on my blog: https://hareeshbahuleyan.github.io/blog/
Linear Feature Extraction using PCA, LDA. Nonlinear Dimensionality Reduction using LLE and ISOMAP. Naive Bayes classifier, kNN, SVM
Data Wrangling-statistics-ANOVA-parametric assumption-Regression-Multiple Regression-Logistic Regression-Poisson Regression-Validity-non-parametric
Vertex Cover problem-Optimization-Multi-thread-Multi-process-MiniSAT
Handwritten digit dataset with 5 classes: digits 0, 1, 2, 3, and 4. Dimensionality reduction approaches (PCA, LDA, LLE, and Isomap) were applied.
Design patterns implemented in Java
Building out an ETL pipeline using Python. Creating a database schema and ETL pipeline for this analysis. Creating an Apache Cassandra database with denormalized tables designed to optimize queries on event data. Defining robust partition keys, clustering columns, and composite primary keys.
Optimization-Controllers-Numerical simulation-Matlab
Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.