Welcome to Xinyao's GitHub 😊

Hi, I'm Xinyao, a proactive Gemini with the ISFP personality type. I'm intensely curious about fresh and challenging endeavors. I love keeping up with the latest technological advancements, and I pride myself on my strong ability to act upon and replicate them. My planning preference leans towards setting short-term goals for the next three days and long-term visions spanning five years. I cherish the dynamic nature of plans as they progress and am passionate about harnessing my creativity to guide new directions in planning.

Lately, I've been engrossed in Kaggle competitions. I've participated in three so far: ICR, LLM, and CAFA. Concurrently, I'm studying advanced solutions from past competitions, two of which are Ubiquant prediction and AI Games, to hone my predictive techniques.

In 2020, I graduated from Columbia University with a Master's in Biostatistics. Pinned on my homepage are the projects I undertook during my MS, many of which relate to healthcare. My primary focus at Columbia was on theory and decision-making, encompassing fundamental knowledge in computational statistics like data mining, optimization, and more. My thesis centered on causal inference, an intriguing direction in statistics that extends well into the realm of Machine Learning.

I'll keep updating my GitHub with new projects. Stay tuned!

🔭 My Kaggle Competition Adventure

Kaggle - LLM Science Exam

Built and evaluated a science question-answering model on top of a pre-trained BERT model, achieving a MAP of 0.6.
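To make the MAP figure concrete, here is a minimal sketch of how such a score could be computed. It assumes the metric is mean average precision over the top 3 ranked guesses with exactly one correct option per question; the function names and the toy predictions are hypothetical and are not taken from the project.

```python
def average_precision_at_k(ranked_answers, correct_answer, k=3):
    """Return 1/rank if the correct answer is among the top k guesses, else 0."""
    for rank, answer in enumerate(ranked_answers[:k], start=1):
        if answer == correct_answer:
            return 1.0 / rank
    return 0.0


def mean_average_precision_at_k(all_ranked_answers, all_correct_answers, k=3):
    """Average the per-question scores over the whole question set."""
    scores = [
        average_precision_at_k(ranked, correct, k)
        for ranked, correct in zip(all_ranked_answers, all_correct_answers)
    ]
    return sum(scores) / len(scores)


# Hypothetical ranked guesses for four multiple-choice questions (options A-E).
predictions = [["A", "B", "C"], ["D", "A", "E"], ["B", "C", "A"], ["E", "D", "B"]]
truth = ["A", "A", "A", "A"]  # assumed answer key, for illustration only
score = mean_average_precision_at_k(predictions, truth)
```

Under this definition a correct first guess earns full credit, a correct second guess earns half, and a correct third guess earns a third, so the ordering of the guesses matters as much as their content.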

ICR - Identifying Age-Related Conditions

Initially, I used LightGBM with Optuna for hyperparameter tuning, running 100 trials. After analyzing the feature importance plot, I trimmed the feature list and ensembled five LightGBM models. Cross-validation showed an impressive AUC of 0.99, yet the private leaderboard (LB) score was worse than expected (0.22).

Recognizing this, I delved into diagnosing the issue and improving the model. I discovered:

  • The task was disease detection, thereby making recall a priority over AUC.
  • The training data size was small (<700 rows), increasing the risk of overfitting with LightGBM or Neural Networks.
  • The data was imbalanced, with fewer than 200 positive instances, only 17% of the total.
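The findings above can be illustrated with a small sketch. It uses hypothetical labels with 17 positives per 100 rows to mirror the 17% figure, and compares accuracy against recall for a trivial model that never predicts the disease. The helper names are assumptions made for this example.

```python
def recall(y_true, y_pred):
    """Share of actual positives that the model flagged as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total_pos = true_pos + false_neg
    return true_pos / total_pos if total_pos else 0.0


def accuracy(y_true, y_pred):
    """Share of all rows where the prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: 17 positives in 100 rows, matching the 17% rate.
y_true = [1] * 17 + [0] * 83
always_negative = [0] * 100  # a model that never predicts the condition

baseline_accuracy = accuracy(y_true, always_negative)
baseline_recall = recall(y_true, always_negative)
```

The always-negative model looks strong on accuracy while missing every positive case, which is why recall is the more informative metric for a detection task on imbalanced data.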

In light of these insights, I initiated several improvements:

  • Applied KNN Imputer for missing value treatment.
  • Utilized Synthetic Minority Over-sampling Technique (SMOTE) to balance the positive and negative instances.
  • Shifted my ensemble strategy from averaging 5 LightGBM models to a voting system encompassing logistic regression, random forest, and SVM models.
  • Changed the cross-validation metric from AUC to recall.
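As a rough illustration of the oversampling step listed above, here is a minimal pure-Python sketch of the SMOTE idea: each synthetic positive row is placed at a random point between an existing positive row and one of its nearest positive neighbors. This is not the project's code; the `smote` function, its parameters, and the toy data are hypothetical, and a real pipeline would more likely rely on an existing library implementation.

```python
import math
import random


def smote(minority_rows, n_synthetic, k_neighbors=5, seed=0):
    """Create synthetic minority rows by interpolating between near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority_rows)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for row in minority_rows if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k_neighbors])
        # Place the new row at a random point on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: 3 positive rows against 8 negative rows.
positives = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
negatives = [[float(i), float(-i)] for i in range(8)]

new_rows = smote(positives, n_synthetic=len(negatives) - len(positives), k_neighbors=2)
balanced_positives = positives + new_rows
```

Oversampling of this kind is usually applied to the training folds only, so that the validation recall is measured on real rows rather than synthetic ones.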

These updates improved the public leaderboard score from 0.64 to 0.53.

Google - American Sign Language Fingerspelling Recognition

Current: Top 35% -- 386 / 1120

🛠️ Technologies & Tools

Programming:

Python, SQL, R, Spark, Git, Bazel, Airflow, AWS, MLflow, Databricks, CI/CD, Snowflake, Docker, Jupyter, PyTorch, TensorFlow, MySQL, MongoDB

Statistics & Data Mining:

A/B Testing, ANOVA, LLM, NLP, Deep Learning, Hyperparameter Tuning (Optuna), Supervised Learning (LightGBM), Unsupervised Learning, Data Mining (quantitative prediction)

Industries I've Worked In:

Tech, Advertisement (audience prediction), E-commerce (funds flow forecasting, fraudulent activity detection), Healthcare (cancer detection, medical text classification, insurance beneficiary risk adjustment)

🌱 My Journey So Far

VideoAmp, CA

As a Machine Learning Engineer, I led the design and development of personification systems, optimized data warehousing, implemented viewership prediction models, and facilitated extensive feature engineering.

Acumen, CA

As a Data Engineer, I optimized data pipelines for large datasets, automated manual tasks, integrated validation processes, and investigated data anomalies.

⚡ A Glimpse into My Projects

💡 A Glance into My Articles

Alternative link if you don't have access to Medium

This article is inspired by a post written by a Databricks engineer. It is aimed at company engineers who use the Databricks ecosystem but are unclear about why they chose it or its advantages. With this piece, we hope to demystify the underlying concepts and benefits of Databricks, specifically in comparison to Data Warehouses and Data Lakes.

Alternative link if you don't have access to Medium

A Large Language Model (LLM) refers to a type of artificial intelligence model designed to understand and generate human-like text. These models are trained on vast amounts of text data and utilize deep learning techniques, typically based on neural networks, to generate coherent and contextually relevant responses to textual prompts.

📫 How to reach me

Portfolio 🌕

LinkedIn ✨

Email 📧

😎 Fun Fact

Welcome our fluffy friend --> 🐱 Severus 🐱

Severus is my speechless friend, a two-year-old male Ragdoll. He loves running around the house after pooping.

My other friend --> 🎻 Violin 🎻

I was a second violinist in the Columbia University Irving Medical Center Symphony Orchestra.

Once, at a rehearsal, we switched conductors; the old one became my stand partner and sat next to me.

Then he finally knew I was the one who played out of tune.

Xinyao Wu's Projects

breast_cancer_diagnosis

Compare the performance of the full logistic-LASSO model, the Newton-Raphson method model, and the optimal logistic-LASSO model with coordinate-wise updates

causal-inference-practicum

[R][3D plot visualization][Causal inference] Expand the methodology of causal effect estimation for multiple continuous exposures

cs-notes

📚 Essential fundamentals for technical interviews: LeetCode, operating systems, computer networks, system design, Java, Python, C++

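cs-notes-1

My self-study notes. I'm currently studying operating systems (MIT 6.004) and software construction (MIT 6.031) and organizing notes on Java, algorithms, and operating systems, with databases to follow. Updated for life.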

dm_hw_svm

[R][SVM][AdaBoost] Comparison of a linear support vector classifier, a support vector classifier with a radial kernel, and an AdaBoost classifier with decision stumps as weak learners

human_disease_prediction

[XGBoost][Data Mining] Improve the bootstrap process and visualize the comparison of new and existing methods

hurrican-prediction

[R][Bayesian inference][Markov Chain Monte Carlo] Develop a model to predict hurricane trajectories with MCMC and Bayesian inference

ipo-analysis

[Python][SVM][Random Forest][AdaBoost][Tableau visualization] 2018 IPO prices in China & U.S.

llm_science_exam

Kaggle - LLM Science Exam: use LLMs to answer difficult science questions
