
data-science-toolkit's Introduction

Ideally, a new data science hire would go through these resources in their first 30-60 days. I started this list with a junior data scientist in mind, working on a data science team that uses Python and GCP. The sections are roughly ordered.

Computer & Environment Setup

Productivity

Rectangle: A window management app for macOS, Rectangle lets you quickly resize and organize your windows using keyboard shortcuts or by dragging windows to screen edges. I used to use ShiftIt, which did something similar, but Rectangle works on the latest versions of macOS.

Stats: An open-source system monitor for macOS, Stats provides you with detailed information on your CPU, memory, disk, network, and battery usage, all accessible from your menu bar. I used to pay for iStat Menus, but Stats is an open-source alternative.

Amphetamine: Keep your Mac awake and prevent it from sleeping with Amphetamine, a powerful and customizable app that allows you to set rules based on applications, time, or power source. Similar to the Caffeine app.

Be Focused: A productivity-enhancing time management app, Be Focused uses the Pomodoro Technique to help you break work into manageable intervals, maintain focus, and stay on track. I find Pomodoros (25-minute timers of focused work) incredibly helpful.

Hidden Bar: A minimalist app that allows you to declutter your Mac's menu bar by hiding icons you don't need to see all the time, Hidden Bar lets you access these icons with a simple click whenever needed.

Developer Tools

Homebrew: A must-have package manager for macOS, Homebrew makes it easy to install, update, and manage software packages, including command-line tools and graphical applications.

Visual Studio Code: A versatile and free source code editor developed by Microsoft, Visual Studio Code supports a wide range of programming languages and comes with built-in support for Git, intelligent code completion, and a plethora of extensions to customize your coding environment.

iTerm2: A highly customizable and feature-rich terminal emulator for macOS, iTerm2 improves upon the default Terminal app with features like split panes, search functionality, and extensive customization options.

Anaconda/Miniconda: Anaconda is a powerful Python and R distribution that simplifies package management and deployment, while Miniconda is its lightweight counterpart. Both options provide you with the essential tools to set up and manage your data science and machine learning environments.

zsh: zsh has become my bash replacement.

Oh My Zsh: Makes zsh more useful with a bunch of plugins.

Sublime Text: A sophisticated and lightning-fast text editor designed for code, markup, and prose, Sublime Text offers a sleek interface, multiple selections, and a highly extensible plugin API.

Technical

Read through the Foundations section of madewithml:

🛠 Toolkit
Notebooks
Python
NumPy
Pandas
PyTorch
🔥 Machine Learning
Linear Regression
Logistic Regression
Neural Network
Data Quality
Utilities
🤖 Deep Learning (no lessons listed)

I left out madewithml's material on Attention, Embeddings and Transformers because Jay Alammar's blog posts are better.

Command Line

There are many tutorials, but this one is decent: https://www.freecodecamp.org/news/command-line-for-beginners/

Cloud - GCP

Git & Bitbucket/GitHub

Read through the Reproducibility section of madewithml's MLOps course

♻️ Reproducibility
Git
Pre-commit
Versioning

Sometimes you will run into merge conflicts. Read this guide from GitHub for how to resolve them.
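To make merge conflicts less abstract: Git marks each conflicting region in a file with `<<<<<<<`, `=======` and `>>>>>>>` lines, and resolving the conflict means editing the file until none of them remain. Below is a minimal sketch of a check for leftover markers. The helper name `find_conflict_markers`, the sample file contents, and the branch name are all assumptions made up for illustration.

```python
def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like Git conflict markers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # "<<<<<<<" opens a conflict, "=======" separates the two sides,
        # and ">>>>>>>" closes it.
        if line.startswith(("<<<<<<<", ">>>>>>>")) or line == "=======":
            hits.append((number, line))
    return hits


# Hypothetical file contents left behind by an unresolved merge.
example = """\
threshold = 0.5
<<<<<<< HEAD
learning_rate = 0.01
=======
learning_rate = 0.001
>>>>>>> feature/tune-lr
epochs = 10
"""

for number, line in find_conflict_markers(example):
    print(f"line {number}: {line}")
```

Everything between `<<<<<<<` and `=======` is your side of the change, and everything between `=======` and `>>>>>>>` is the incoming side. Keep the version you want (or combine them), delete the three marker lines, then stage and commit the file.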

Docker

Kubeflow

Data

Hadley Wickham's tidy data paper introduced me to the idea of tidy data. I first read it around 2017, when I was getting into data analytics. It totally changed how I thought about representing data and gave names to shapes of data such as "wide" and "long".

Read the original paper: https://vita.had.co.nz/papers/tidy-data.pdf

Work through some Python examples: https://byuidatascience.github.io/python4ds/tidy-data.html
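As a quick illustration of "wide" versus "long" data, here is a minimal, dependency-free sketch that reshapes a small wide table into long form, where every row is a single observation. The table contents and the helper `to_long` are hypothetical and exist only for this example.

```python
# Hypothetical "wide" table: one row per student, one column per exam.
wide = [
    {"student": "ana", "exam_1": 90, "exam_2": 85},
    {"student": "ben", "exam_1": 70, "exam_2": 95},
]


def to_long(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows: one (id, variable, value) per row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on each long row instead
            long_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return long_rows


long = to_long(wide, id_key="student", var_name="exam", value_name="score")
for row in long:
    print(row)
```

In practice you would probably do this with pandas: `DataFrame.melt(id_vars=..., var_name=..., value_name=...)` performs the same wide-to-long reshaping.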

SQL

This is one of the most important skills for a data scientist: most data lives in databases, so being able to extract and manipulate it with SQL is crucial. Mode Analytics provides a good tutorial; start with the intermediate one: https://mode.com/sql-tutorial/
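To get a feel for extracting and manipulating data with SQL from Python, the sketch below runs an aggregate query against an in-memory SQLite database using the standard-library `sqlite3` module. The `orders` table and its rows are made up for illustration. Your team's warehouse will use a different client and SQL dialect.

```python
import sqlite3

# In-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 5.0), ("ana", 30.0), ("cy", 12.5)],
)

# Aggregate per customer, keep customers who spent more than 10 in total,
# and sort by total spend, largest first.
query = """
SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
FROM orders
GROUP BY customer
HAVING SUM(amount) > 10
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

`GROUP BY`, `HAVING` and `ORDER BY` are the kind of intermediate-level clauses worth getting comfortable with early.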

madewithml (parts 1, 2 and 3)

🎨 Design
Product
Engineering
Project
🔢 Data
Exploration
Labeling
Preprocessing
Splitting
Augmentation
📈 Modeling
Baselines
Evaluation
Experiment tracking
Optimization

Python

scikit-learn

Understand the sklearn API through this example notebook.

  • fit
  • transform
  • predict
  • fit_transform
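To show how these four methods relate, here is a dependency-free sketch of two toy classes that mimic the scikit-learn method names. `ToyScaler` and `ToyThresholdClassifier` are hypothetical stand-ins written for this example, not scikit-learn classes. The real estimators (for example `StandardScaler`) work on arrays and take many more options.

```python
from statistics import mean, pstdev


class ToyScaler:
    """Toy transformer: learns a mean and a scale in fit, applies them in transform."""

    def fit(self, X):
        self.mean_ = mean(X)
        self.scale_ = pstdev(X) or 1.0  # avoid dividing by zero on constant data
        return self

    def transform(self, X):
        return [(x - self.mean_) / self.scale_ for x in X]

    def fit_transform(self, X):
        # Shorthand for fit followed by transform on the same data.
        return self.fit(X).transform(X)


class ToyThresholdClassifier:
    """Toy predictor: learns a cut-off halfway between the two class means."""

    def fit(self, X, y):
        positives = [x for x, label in zip(X, y) if label == 1]
        negatives = [x for x, label in zip(X, y) if label == 0]
        self.threshold_ = (mean(positives) + mean(negatives)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


train_X, train_y = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
test_X = [0.5, 3.5]

scaler = ToyScaler()
train_scaled = scaler.fit_transform(train_X)  # learn parameters on training data
test_scaled = scaler.transform(test_X)        # reuse them on new data, no refit

model = ToyThresholdClassifier().fit(train_scaled, train_y)
predictions = model.predict(test_scaled)
print(predictions)
```

The pattern to remember is `fit_transform` on training data and plain `transform` on anything new, so that statistics learned from the training set are never recomputed from test data.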

torch

Work through Steps 0-7 in the official PyTorch guide: https://pytorch.org/tutorials/beginner/basics/intro.html

Packaging

💻 Developing
Packaging
Organization
Logging
Documentation
Styling
Makefile

Jupyter

  • Similar to @radekosmulski, I use VS Code exclusively in order to use GitHub Copilot in Jupyter
  • You can also use Remote SSH to connect to a server, run Jupyter there, and use Copilot as well (bare-metal machines, cloud VMs, etc.)
  • Learn hotkeys

Streamlit

@karpathy also recommends spending a couple hours learning Streamlit

Gradio is a similar library from Hugging Face.

GitHub Copilot

madewithml (parts 5 and 6)

📦 Serving
Command-line
RESTful API
✅ Testing
Code
Data
Models

NLP

Embeddings

Transformers

Language Models / Gen AI

Prompt Engineering

madewithml (parts 8 and 9)

🚀 Production
Dashboard
CI/CD
Monitoring
Systems design
⎈ Data engineering
Data stack
Orchestration
Feature store

Extras

  • MIT: The Missing Semester of Your CS Education

  • Great example of a full ML project (Part 1, Part 2, Part 3) from Will Koehrsen. Steps 1-3 are in Part 1, Steps 4-6 are in Part 2, and Steps 7-8 are in Part 3.

    1. Data cleaning and formatting
    2. Exploratory data analysis
    3. Feature engineering and selection
    4. Compare several machine learning models on a performance metric
    5. Perform hyperparameter tuning on the best model to optimize it for the problem
    6. Evaluate the best model on the testing set
    7. Interpret the model results to the extent possible
    8. Draw conclusions and write a well-documented report
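Steps 4 and 6 above can be sketched in a few lines: compare candidate models on a held-out validation set using one metric, then score only the winner on the test set. This is not Will Koehrsen's code. The synthetic data, the two toy models, and the choice of mean absolute error are all assumptions made for illustration.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error: the single metric used to compare models."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def fit_mean_baseline(xs, ys):
    """Baseline model: always predict the average training target."""
    average = mean(ys)
    return lambda x: average


def fit_linear(xs, ys):
    """Least-squares line, fitted in closed form."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def split_xy(part):
    return [x for x, _ in part], [y for _, y in part]


# Synthetic data (made up): y = 2x + 1 plus a small deterministic wobble.
rows = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]
random.Random(0).shuffle(rows)
train, val, test = rows[:12], rows[12:16], rows[16:]
train_x, train_y = split_xy(train)
val_x, val_y = split_xy(val)
test_x, test_y = split_xy(test)

# Step 4: fit every candidate on the training set, compare on validation.
models = {
    "mean baseline": fit_mean_baseline(train_x, train_y),
    "linear": fit_linear(train_x, train_y),
}
val_scores = {
    name: mae(val_y, [model(x) for x in val_x]) for name, model in models.items()
}
best_name = min(val_scores, key=val_scores.get)

# Step 6: touch the test set once, and only with the chosen model.
test_mae = mae(test_y, [models[best_name](x) for x in test_x])
print(val_scores, best_name, test_mae)
```

Keeping the test set out of model selection is the point of the three-way split: the validation score picks the model, and the test score estimates how it will do on unseen data.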

Continue to Learn

Remember, the field of data science is vast and constantly evolving. The most important skill to develop is the ability to learn and adapt to new tools, technologies, and techniques. Here are some resources to help you continue to learn:

YouTube

Twitter

Blogs

data-science-toolkit's People

Contributors

lawwu
