Ideally, a new data science hire would work through these resources in their first 30-60 days. I wrote this with a junior data scientist in mind, working on a data science team that uses Python and GCP. The list is roughly ordered.

Computer & Environment Setup

Productivity

Rectangle: A window management app for macOS, Rectangle lets you quickly resize and organize your windows using keyboard shortcuts or by dragging windows to screen edges. I used to use ShiftIt, which did something similar, but Rectangle works on the latest versions of macOS.

Stats: An open-source system monitor for macOS, Stats provides you with detailed information on your CPU, memory, disk, network, and battery usage, all accessible from your menu bar. I used to pay for iStat Menus, but Stats is an open-source alternative.

Amphetamine: Keep your Mac awake and prevent it from sleeping with Amphetamine, a powerful and customizable app that allows you to set rules based on applications, time, or power source. Similar to the Caffeine app.

Be Focused: A productivity-enhancing time management app, Be Focused uses the Pomodoro Technique to help you break work into manageable intervals, maintain focus, and stay on track. I find Pomodoros (25-minute timers of focused work) incredibly helpful.

Hidden Bar: A minimalist app that allows you to declutter your Mac's menu bar by hiding icons you don't need to see all the time, Hidden Bar lets you access these icons with a simple click whenever needed.

Developer Tools

Homebrew: A must-have package manager for macOS, Homebrew makes it easy to install, update, and manage software packages, including command-line tools and graphical applications.

Visual Studio Code: A versatile and free source code editor developed by Microsoft, Visual Studio Code supports a wide range of programming languages and comes with built-in support for Git, intelligent code completion, and a plethora of extensions to customize your coding environment.

iTerm2: A highly customizable and feature-rich terminal emulator for macOS, iTerm2 improves upon the default Terminal app with features like split panes, search functionality, and extensive customization options.

Anaconda/Miniconda: Anaconda is a powerful Python and R distribution that simplifies package management and deployment, while Miniconda is its lightweight counterpart. Both options provide you with the essential tools to set up and manage your data science and machine learning environments.

zsh: zsh has become my bash replacement.

Oh My Zsh: Makes zsh more useful with a bunch of plugins.

Sublime Text: A sophisticated and lightning-fast text editor designed for code, markup, and prose, Sublime Text offers a sleek interface, multiple selections, and a highly extensible plugin API.

Technical

Reading through Foundations sections of madewithml:

🛠 Toolkit	🔥 Machine Learning	🤖 Deep Learning
Notebooks	Linear Regression
Python	Logistic Regression
NumPy	Neural Network
Pandas	Data Quality
PyTorch	Utilities

I left out madewithml's material on Attention, Embeddings and Transformers because Jay Alammar's blog posts cover those topics better.

Command Line

There are many tutorials, but this one is decent: https://www.freecodecamp.org/news/command-line-for-beginners/

Cloud - GCP

gcloud
gsutil
BigQuery - Read through and run the examples in these BigQuery documentation pages:
- Introduction: https://cloud.google.com/bigquery/docs/query-overview
- Query BigQuery data (15 subpages) https://cloud.google.com/bigquery/docs/running-queries
- Query data with SQL (10 subpages) https://cloud.google.com/bigquery/docs/introduction-sql

Git & Bitbucket/Github

Read through Reproducibility section of madewithml's MLOps course

♻️ Reproducibility

Git

Pre-commit

Versioning

Sometimes you will run into merge conflicts. Read this guide from GitHub on how to resolve them.

Docker

https://valohai.com/blog/docker-for-data-science/
madewithml Docker guide: https://madewithml.com/courses/mlops/docker/

Kubeflow

Introduction to Vertex AI Pipelines

Data

Hadley Wickham's tidy data paper introduced me to the idea of tidy data. I read it around 2017, when I was first getting into data analytics. It totally changed how I thought about representing data and gave names to shapes of data like "wide" and "long".

Read the original paper: https://vita.had.co.nz/papers/tidy-data.pdf
Work through some Python examples: https://byuidatascience.github.io/python4ds/tidy-data.html
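To make the "wide" versus "long" distinction concrete, here is a minimal pandas sketch. The table, column names and numbers are made up for illustration and do not come from the paper.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame(
    {
        "city": ["Austin", "Boston"],
        "2021": [10, 20],
        "2022": [12, 18],
    }
)

# Tidy ("long") form: each row is a single observation (city, year, sales).
long = wide.melt(id_vars="city", var_name="year", value_name="sales")
long["year"] = long["year"].astype(int)

# pivot() reshapes the long table back into a wide one.
wide_again = long.pivot(index="city", columns="year", values="sales")

print(long)
print(wide_again)
```

The long form is usually easier to filter, group and plot, because every variable lives in exactly one column.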

SQL

This is one of the most important skills for a data scientist because most data lives in databases, so being able to extract and manipulate it with SQL is crucial. Mode Analytics provides a good tutorial. Start with the intermediate one. https://mode.com/sql-tutorial/
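As a rough illustration of the kind of query worth being comfortable with, the sketch below runs an aggregate query against an in-memory SQLite database from Python. The orders table and its rows are invented for the example, and SQLite's dialect differs in places from BigQuery's, so treat it as practice rather than a template.

```python
import sqlite3

# In-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30.0), ('alice', 20.0), ('bob', 15.0), ('carol', 5.0);
    """
)

# Aggregate per customer, keep customers who spent at least 10,
# and sort by total spend (largest first).
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 10
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```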

madewithml (parts 1, 2 and 3)

🎨 Design

🔢 Data

📈 Modeling

Python

scikit-learn

Understand the sklearn API through this example notebook.

fit
transform
predict
fit_transform
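The four methods above fit together roughly as follows. This is a minimal sketch on made-up toy data. StandardScaler and LogisticRegression are my choice of example estimators and are not necessarily the ones used in the linked notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: one feature, two classes.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[2.5], [10.5]])

# Transformers: fit() learns parameters, transform() applies them.
scaler = StandardScaler()
scaler.fit(X_train)  # learns the mean and standard deviation
X_train_scaled = scaler.transform(X_train)

# fit_transform() does both steps in a single call.
same_result = np.allclose(X_train_scaled, StandardScaler().fit_transform(X_train))

# Predictors: fit() learns from features and labels, predict() labels new data.
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)
preds = clf.predict(scaler.transform(X_new))  # reuse the scaler fitted on train

print(same_result, preds)
```

Note that new data is only ever passed to transform, never to fit, so the scaling learned on the training set is reused unchanged.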

torch

Work through Steps 0-7 in the official PyTorch guide: https://pytorch.org/tutorials/beginner/basics/intro.html
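For a taste of what those steps build up to, here is a minimal training-loop sketch on made-up regression data. It is not taken from the official guide, which uses an image dataset. The data, model size, learning rate and number of epochs are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data (made up): y = 2x + 1 plus a little noise.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * X + 1 + 0.05 * torch.randn_like(X)

model = nn.Linear(1, 1)  # a single weight and a single bias
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    pred = model(X)          # forward pass
    loss = loss_fn(pred, y)  # how far off the predictions are
    optimizer.zero_grad()    # clear gradients from the previous step
    loss.backward()          # autograd computes the gradients
    optimizer.step()         # update the weight and bias
    losses.append(loss.item())

print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
print(f"weight={model.weight.item():.2f}, bias={model.bias.item():.2f}")
```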

Packaging

Use the cookiecutter data science project template for new projects: https://drivendata.github.io/cookiecutter-data-science/
Read through the developing section of madewithml:

💻 Developing

Jupyter

Similar to @radekosmulski, I use VS Code exclusively in order to use Github Copilot in Jupyter
You can use Remote SSH to connect to a server (bare metal, cloud VMs, etc.), run Jupyter there, and still use Copilot
Learn hotkeys

Streamlit

Learn by working through the official example
Streamlit Cheatsheet

@karpathy also recommends spending a couple hours learning Streamlit

Gradio is a similar library from Hugging Face.

Github Copilot

Use Github Copilot for all coding (Python, Jupyter, SQL, etc.). I estimate it makes me 20% more productive for all programming tasks.
Use Jupyter in VS Code to use Copilot in Jupyter notebooks

madewithml (parts 5 and 6)

📦 Serving

✅ Testing

NLP

Embeddings

http://jalammar.github.io/illustrated-word2vec/
Read through Vicki Boykis' embeddings guide

Transformers

http://jalammar.github.io/illustrated-transformer/

Language Models / Gen AI

Prompt Engineering

madewithml (parts 8 and 9)

🚀 Production

⎈ Data engineering

Data stack

Orchestration

Feature store

Extras

MIT: The Missing Semester of Your CS Education
Great example of a full ML project (Part 1, Part 2, Part 3) from Will Koehrsen. Steps 1-3 are in Part 1, Steps 4-6 are in Part 2, and Steps 7-8 are in Part 3.
1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model to optimize it for the problem
6. Evaluate the best model on the testing set
7. Interpret the model results to the extent possible
8. Draw conclusions and write a well-documented report
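Steps 4-6 above can be sketched with scikit-learn as follows. This is an illustration only: it uses synthetic data rather than the dataset from the series, and the two candidate models, their parameter grids and the mean absolute error metric are my assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (assumption, not the dataset from the series).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "ridge": Ridge(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
param_grids = {
    "ridge": {"alpha": [0.1, 1.0, 10.0]},
    "forest": {"max_depth": [4, 8, None]},
}

# Step 4: compare models on one metric using cross-validation on the training set.
cv_mae = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()
best_name = min(cv_mae, key=cv_mae.get)

# Step 5: tune hyperparameters of the best model only.
search = GridSearchCV(
    candidates[best_name],
    param_grids[best_name],
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# Step 6: evaluate once on the held-out test set.
test_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))
print(best_name, search.best_params_, round(test_mae, 2))
```

The test set is touched only in the last step, so the reported error is not biased by model selection or tuning.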

Continue to Learn

Remember, the field of data science is vast and constantly evolving. The most important skill to develop is the ability to learn and adapt to new tools, technologies, and techniques. Here are some resources to help you continue to learn:

YouTube

@lexfridman - and associated transcripts
@AndrejKarpathy
@jamesbriggs
@ai-explained-

lawwu / data-science-toolkit
