ds-100 / textbook

Learning Data Science, a textbook.

Home Page: https://learningds.org/


textbook's Introduction

Learning Data Science

By Sam Lau, Joey Gonzalez, and Deb Nolan.

Front cover of textbook

Learning Data Science is an introductory textbook for data science published by O'Reilly Media in 2023. It covers foundational skills in programming and statistics that encompass the data science lifecycle. The reader's assumed background is detailed in the Preface.

The contents of this book are licensed for free consumption under the following license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

textbook's People

Contributors

aaronkh, allenshen5, ananthagarwal, andrewjkim, ashleychien, ashleychien1, chrispyles, debnolan, jegonzal, junseo-park, ryanlovett, samlau95, sonajeswani, tianxiaohu, tjann, yuvipanda, zhunation


textbook's Issues

Topics to eventually cover

Last updated: Aug 20, 2018

Here's a list of topics that we would like to include in the textbook. Each of these topics will likely become one section.

  • HTML and XML
  • Extracting features for modeling from text
  • Linear regression + one-hot encoding results in average of column
  • More types of feature engineering
  • Compare lasso vs ridge fit speed
  • Norm balls and elastic net
  • Regularization for logistic reg

Topics finished in Summer 2018

  • Linear regression viewed as a projection
  • SGD and gradient of logistic cost
  • Sensitivity / Specificity + AUC
  • Multi-class classification
  • More on random variables / expectation / variance / distributions
  • Hypothesis testing review
  • Studentized bootstrap
  • p-hacking

Small info issue in Sec 2.5

This phrase:

Without CO2, earth would be impossibly cold, but it’s a delicate balance.

I think it should be warm instead of cold.

pandas/seaborn/sklearn Quick Reference

We would like a page containing all the API pieces that we use in Data 100 from pandas, seaborn, and sklearn. It should be a set of tables with the function name, the location where it's mentioned, and a one-sentence description of what it does.

Feature Request, Search Bar

Is it possible to implement a search function in the textbook, similar to Data 8's? I find it hard to look up definitions/examples of key terms quickly. Thanks!!

Adding projection to vector space review appendix

Big oversight on my part -- I've been working with orthonormal vectors and forgot that projections aren't just multiplications. I've added a reference in the Linear Projection chapter, but the coefficient for the projection is not yet in the Appendix.
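
For concreteness, a minimal numpy sketch of the missing piece (the vectors here are made up for illustration): for a non-unit vector v, the projection of x onto v needs the coefficient (x · v) / (v · v), which reduces to x · v only when v has unit length.

```python
import numpy as np

# Hypothetical vectors chosen for illustration.
x = np.array([2.0, 1.0])
v = np.array([3.0, 0.0])   # not a unit vector

# Projection of x onto the line spanned by v.
# The coefficient is (x . v) / (v . v); it reduces to x . v only when v has unit length.
coef = (x @ v) / (v @ v)
proj = coef * v

print(coef)   # 0.666...
print(proj)   # [2. 0.]
```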

Fa20 Running List of TODOs

Last updated: 10/12/2020

Deb and I meet every few weeks to discuss the book. We'll leave a running list of TODOs in this issue (in no particular order).

Large scale TODOs (involves multiple pages or chapters of the book):

  • Write decision tree section before the class covers it on Nov 12, 2020

  • Write PCA section before the class covers it on Nov 24, 2020

  • Rework Data Cleaning, EDA, and Data Viz sections.

    • EDA will come first, then data cleaning (in the context of EDA), then data viz
  • Change datasets to more interesting ones.

  • Add worked examples and case studies to the textbook.

  • Integrate themes of the course throughout the book (Data lifecycle, Working with large datasets, Data design and generalizability)

Smaller TODOs (involves single pages of the book):

Preface or Introduction

  • Goals of the book (course) - prepare, enable, empower
  • What’s special about our approach: integrating computing, statistics, and data technologies; empirical loss minimization (optimization, bias-variance, prediction vs. inference, model selection, feature engineering, and regularization)
  • Organization - main case, small examples, other case studies
  • Introductory definition of data science

Overarching changes:

  • Remove Data8-specific references: add links to that text and/or more information about the referenced topic
  • Add section on Ethics.
  • Make an editing pass for sections that haven't gotten a fine-grained edit.

Ch 1 (Data Lifecycle)

  • Missing two “entry” points into the data science lifecycle — sometimes we start with data; sometimes we start with a question
  • Sam thinks the first example of the book should have at least one interesting CS and one interesting stat idea
  • Add canonical examples of different types of data science: estimation, prediction, data mining (?)
  • Talk about unique challenges of data science. Why is data science different from CS + stats put together?

Ch 2 (Data Design)

  • Flesh out SRS definition
  • Pair famous examples of Admin data with new examples, e.g., Hite report paper surveys to Internet survey, Phone calls for Dewey’s election to cell phone calls to Trump’s election
  • Add practice for basic probability (e.g. probability that A appears for SRS, cluster, stratified sampling)
  • Can use polling examples
  • Change SRS vs. Big Data section to use actual datasets
  • Add practice for evaluating data design

Ch 3 (Tabular data)

  • Bring back restaurant example with its 3 levels of granularity? Or make it a case study
  • Add practice problems for indexing / sorting
  • Add practice for grouping / pivoting
  • Some problems should be high-level (e.g. will grouping be helpful?) and others should be syntax-oriented.
  • Add practice for apply and strings
  • Add example of well-written pandas code (e.g. using pipes instead of multiple variables; a rough sketch follows below)
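
A minimal sketch of the pipe style for that last item, with made-up data and helper names:

```python
import pandas as pd

# Hypothetical toy data; the columns are made up for illustration.
df = pd.DataFrame({
    "name": ["Tea House", "Taco Spot", "Noodle Bar"],
    "score": [92, 85, 78],
    "city": ["Berkeley", "Berkeley", "Oakland"],
})

def filter_city(frame, city):
    return frame[frame["city"] == city]

def top_scores(frame, n):
    return frame.sort_values("score", ascending=False).head(n)

# One readable chain via .pipe instead of several intermediate variables.
result = df.pipe(filter_city, "Berkeley").pipe(top_scores, 2)
print(result)
```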

Ch 4 (Data Cleaning)

  • Swap out Police report example or create a question where we can help use statistics to answer
  • Missing values and how to think about them
    • Unit tests for the data
    • E.g. Check whether data lies within bounds (a rough sketch follows after this list)
  • What should the distribution look like?
  • Move data types into this chapter
  • Add practice on granularity manipulation
    • E.g. grouping time series
  • Add practice on wide vs. long
    • Convert data to tidy format
    • What questions do wide vs. long data facilitate?
  • Add practice on faithfulness
    • Sanity checks for data
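
A minimal sketch of the kind of data unit test mentioned above (bounds check); the column name and bounds are hypothetical:

```python
import pandas as pd

# Hypothetical data containing an out-of-bounds sentinel value.
co2 = pd.DataFrame({"ppm": [315.2, 316.1, 317.0, 9999.0]})

def check_bounds(frame, col, low, high):
    """Raise an AssertionError if any value falls outside [low, high]."""
    bad = frame[(frame[col] < low) | (frame[col] > high)]
    assert bad.empty, f"{len(bad)} rows of {col} outside [{low}, {high}]"

check_bounds(co2, "ppm", 200, 500)   # raises: the 9999.0 sentinel is out of bounds
```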

Ch 5 (EDA)

  • How to read a plot, in terms of distributions
  • Geographic data
  • Granularity and the wide vs tall form of tabular data
  • Add practice to match data types with plots
    • E.g. debugging a barchart of quantitative data
  • Add practice on critical thinking about common data issues revealed in viz
    • E.g. how to identify missing values, integers, outliers using viz

Ch 6 (Visualization)

  • Goals: understand the distribution, Comparisons and conditioning, Information rich plots with context
  • Bulge diagram
  • Histograms as smoothers (large n), when not to smooth
  • Reasons for transforming
  • Dimension reduction (large p)
  • Add practice on plot syntax and customization
  • Add practice on viz problem identification
  • Add practice on solving viz problems

Ch 7 (Web)

  • Write section for XML, HTML, and XPath
  • Write section for REST and scraping
  • Add worked example on using web technologies to collect and analyze data
    • E.g. weather website
  • Add practice on web scraping
  • Add practice on data analysis with HTML

Ch 8 (Regex)

  • Add back more use cases for regexes
  • Add philosophy for how pattern matching works
  • Greedy matching (avoid the newest metacharacters and explain why)
  • Text mining concepts
  • Add practice on when string method vs. regex is better
  • Add practice on regex construction
  • Add practice on regex debugging

Ch 9 (Databases, SQL)

  • Add conceptual questions about pros/cons of using databases
  • Add practice on SQL query construction
  • Add practice on query debugging

Ch 10 (Modeling)

  • Distinction between world, model, data
  • Introduce the data more
  • Reasons for L2, L1 and why the optimizers are different
  • Mention empirical approach - need to bring in statistical approach
  • Scientific example - a constant that we want to estimate about a natural phenomenon, e.g., whether the expansion of the universe is increasing
  • Social Science Example - We want to estimate the behavior of a population, there may be outliers with contaminated data so use L1 error
  • Economics example - population distribution has heavy tails, we want to estimate the center using Huber loss
  • Put chain rule and convex function definitions in an appendix

Ch 11 (Gradient Descent)

  • 11.1 Example where we write code to inspect the range of thetas.
  • 11.2 The intuition as to why we would use the derivative and the size of the steps is a bit unclear, e.g., why do we begin by using steps with alpha = 1? I think that a few small language changes can fix this.
  • Example: An example with an asymmetric curve would be good to see worked out with starting points on either side of the minimum.
  • 11.3 One example of extra material would be to look at the sum of two convex functions and determine whether the sum is convex. This is relevant to the problems that we are solving, and there was confusion over this in class

Ch 12 (Probability and Generalization)

  • Introduction: Put the estimation problem in a larger context: Population -> Data production -> Sample
  • We have one sample and one estimate, but we know that we may get other data in the future so we want to attach an error to the estimate that indicates how accurate it is.
  • Example: Possibly a new section: that introduces the theory by first running a simulation study where we have a population. In class we used the restaurant scores for the example.
  • Develop theory from this point. Use a small population of known values. Take a sample of size 1 and find the distribution of the values, the expected value, and the variance. Draw comparisons to the L2 loss on the sample and the L2 loss on the probability distribution. Use diagrams that match the ones from Chapter 10. Then take a sample of size 2 and figure out the joint distribution and the distribution of the average.
  • Formal development of expectation and linearity, variance and covariance.
  • Example: Binomial, first introduced via a real world application
  • Example: Hypergeometric, first introduced via a real world application
  • New section: Usefulness of Monte Carlo to study behavior of statistics (a rough simulation sketch follows after this list)
  • New section: We can talk about the difference between inference and prediction here. With inference we want to know the accuracy of the estimator of the tau parameter. With prediction we want to know the accuracy of the prediction.
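
A minimal simulation sketch of the two bullets above (small known population, exact expectation/variance, Monte Carlo distribution of the average of a sample of size 2); the population values are made up rather than the restaurant scores:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small population of known values.
population = np.array([2, 3, 3, 4, 8])

# Exact expectation and variance of a single uniform draw from the population.
mu = population.mean()
sigma2 = population.var()

# Monte Carlo: distribution of the average of a sample of size 2 (with replacement).
averages = rng.choice(population, size=(100_000, 2), replace=True).mean(axis=1)

print(mu, sigma2)                        # exact values
print(averages.mean(), averages.var())   # approx. mu and sigma2 / 2
```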

Ch 13 (Linear Regression)

  • Introduction: “In this chapter we will introduce linear models which will allow us to make use of our entire dataset to make predictions.” This sounds like we aren’t using all of the observations, which is not the case.
  • 13.1 “We treat … as the underlying function that generated the data.” - we want to say this differently. We know it doesn’t generate our data because the points clearly do not fall on a line. Here is where we might talk about “idealized models” and connect it to leaving a percentage tip.
  • Example: I think we should use an example where the error is much more clearly defined. And substitute the tips as another example, where we can discuss the kinds of approximations in modeling that we are making. Possible examples from nature:
  • Kleiber’s data on metabolic rate and mass.
  • Dungeness crab data with post-molt size and pre-molt size.
  • New section before multivariable regression: We need to discuss the error and residual plots for simple examples where there is no bias. The easiest example is where the errors are clearly normally distributed.
  • Example: predicting baby’s birthweight from mother’s height?
  • 13.3 It’s strange motivation that because more variables are available in a data set, we want to run multivariable regression.
  • Example: I think we should create a motivating example with two variables and explain what the model means, i.e., that for any value of variable1, the same linear relationship exists between y and variable2.
  • 13.4 Example: Compare simple regression coefficients to bivariate regression coefficients when the explanatory variables are correlated. Show pairwise scatterplots of the two regressors and y.
  • Example: Compare simple regression coefficients to bivariate regression coefficients when the explanatory variables are uncorrelated. Show pairwise scatterplots of two regressors and y.
  • 13.5 Motivation needs fixing. Need to add pairwise scatterplots; explanation of the model components; treatment of indicator variables.

Ch 14 (Feature Engineering)

  • Write section on feature extraction for text
  • Add a section with other types of feature engineering
  • 14.1 The motivation for one-hot encoding needs fixing. It needs to reflect the purpose of creating dummy variables. We should also mention the various terms used to describe this, dummy variables, indicators, and one-hot encoding.
    • I think that this section might deserve to be its own chapter. It is not feature engineering in the same sense as polynomial regression (14.2). It’s not optional.
    • Example: Rather than start with a complex regression, start with a simple problem that only has one categorical variable.
    • Example: Combine one or two numeric variables with one categorical variable to understand the model.
    • Example: Many variables are categorical and there are no numeric variables, e.g., a simple spam filter, text mining
  • 14.2 We should include two regressors so that we can look at interactions between them.
    • Example: We should discuss the high correlation between polynomials of high degree and mention orthogonal polynomials.
    • Example: run time as a function of age: We should add an example of bent lines that have more flexibility

Ch 15 (Bias-Variance Tradeoff)

  • 15.1 In the Risk definition it is unclear what the expectation is being taken over.
    • Example: A very simple theoretical example would be good here.
    • Example: Bring back the donkey example and examine the estimated %-error. This has some meaning in the field.
  • 15.2 It would be helpful to clearly spell out all of the various f’s so we know what is being compared.
    • Example: use a hypothetical example where the various pieces are identified.
  • 15.3 Missing the mathematical formulation of the cross-validation as an approximation to Risk.

Ch 16 (Regularization)

Ch 17 (Classification)

  • Publish section on regularization for logistic regression (Jun Seo started this)

Ch 18 (Statistical Inference)

Fix issue in 20.2.3 code

In the 20.2 bias_modeling notebook, there is an issue with the "Example: Linear Regression and Sine Waves" code: when you run it, you get "Expected 2D array, got scalar array instead".

To fix this issue, make the following changes to line 23:

  • from: `return Line(x_start, x_end, clf.predict(x_start)[0], clf.predict(x_end)[0])`
  • to: `return Line(x_start, x_end, clf.predict([[x_start]])[0], clf.predict([[x_end]])[0])`
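
A minimal self-contained sketch of why the wrapping matters; the model and data below are stand-ins, not the notebook's actual setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data; sklearn expects X with shape (n_samples, n_features).
X = np.linspace(0, 6, 50).reshape(-1, 1)
y = np.sin(X).ravel()

clf = LinearRegression().fit(X, y)

x_start = 0.0
# clf.predict(x_start)                  # raises "Expected 2D array, got scalar array instead"
y_start = clf.predict([[x_start]])[0]   # wrap the scalar as a 1x1 array instead
print(y_start)
```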


Small info issue in Sec 2.5

This phrase:

 Clinton’s upset victory came as a surprise.

I think it should be Trump instead of Clinton.

Convexity (full section)

We need a section on convexity for gradient descent. Outline to come.

Tiff is currently assigned (we need one more).

This will become section 11.3, which comes right after section 11.2.

Outline:

  • Will gradient descent always find the theta that generates the lowest cost?
    • No; show an example of a degree 4 polynomial for which gradient descent finds a local minimum
  • When will gradient descent find the best theta?
    • Only when the function doesn't have local minima
    • These functions are called convex functions
  • Define convex function
  • Conclusion: when we can, we will choose a convex loss function because gradient descent will find the lowest cost instead of a local minimum. (A rough sketch of the degree-4 example follows below.)
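
A minimal sketch of the degree-4 example in the outline (the polynomial and step size are made up for illustration): started on the wrong side of the hump, gradient descent settles in the local rather than the global minimum.

```python
# Hypothetical non-convex cost: a degree-4 polynomial with two basins.
def cost(theta):
    return theta**4 - 4 * theta**2 + theta

def grad(theta):
    return 4 * theta**3 - 8 * theta + 1

def gradient_descent(theta, alpha=0.01, steps=1000):
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

theta_right = gradient_descent(theta=2.0)    # ends near 1.35 (local minimum)
theta_left = gradient_descent(theta=-2.0)    # ends near -1.47 (global minimum)
print(theta_right, cost(theta_right))
print(theta_left, cost(theta_left))
```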

appendices: some small issues

In the "additional material". Perhaps it would make more sense to refactor this section, and add pieces to the end of each chapter, in a section called "Further reading". This would be more modular, and would put the references in context.
(Also some of the sources you cite are arguably not the most useful to a beginning DS student...)

The correct reference for ISLP is below (you omitted two authors)

@book{James2023,
  author = "G. James and D. Witten and T. Hastie and R. Tibshirani and J. Taylor",
  title = "An Introduction to Statistical Learning (with Applications in Python)",
  year = 2023,
  publisher = "Springer",
  url = "https://www.statlearning.com/"
}

You might want to add a list of topics not covered, with pointers on where to learn more.
(The above ISLP book may be a good next step.)

https://learningds.org/ch/a01/prob_review.html has some missing figures.

ch 20 (optimization): a few small issues

In sec 20.1, the comment about scipy.minimize where you say "we don’t even need to compute the gradient" may be misleading. As you know, by default it uses numerical differentiation to compute the gradient if the grad function is not specified by the user, so this is likely to be slow. You may want to mention automatic differentiation libraries like jax and pytorch, which can solve this problem for you. (Also, scipy.minimize defaults to BFGS, not GD, and chooses the step size automagically :) Since this book is trying to demonstrate "best practice" for DS (e.g. the nice way you use dataframe.pipe for reproducible wrangling), maybe you should show how to use scipy.minimize on your example problem?
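
In that spirit, a minimal sketch of calling scipy.optimize.minimize on a stand-in problem (the data and the constant-model MSE loss below are hypothetical, not the book's example):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data; we fit a single constant theta by minimizing mean squared error.
y = np.array([12.0, 15.0, 13.0, 17.0, 16.0])

def mse_loss(theta):
    return np.mean((y - theta) ** 2)

# With no gradient supplied, minimize (default method: BFGS) falls back to
# numerical differentiation, which can be slow for high-dimensional problems.
result = minimize(mse_loss, x0=np.array([0.0]))
print(result.x)        # approx. [14.6], the sample mean
print(result.success)
```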

In sec 20.2, the first two paragraphs need rewriting to avoid repetition/redundancy.

[Screenshot of the first two paragraphs of Sec 20.2]

In sec 20.3, maybe mention that (for twice-differentiable functions) convexity implies the second derivative is non-negative, so the function has a bowl shape.
This condition is easier to check in practice than the definition of convexity. It's probably also worth mentioning some examples of convex and non-convex loss functions encountered in the book.

Maybe mention SAGA and other variance-reduced SGD methods, since SAGA is used in 21.4.1?

ch 10: replace sns with px if possible

I noticed that a few of the examples in ch 10 (eg sec 10.2 and sec 10.3.2) use seaborn. Is this required, or can plotly be used? Since this chapter is about data viz, it would be ideal if the reader only had to learn one library :) (Also it may be worth mentioning that go is a graph object (I assume?), which is part of plotly.)
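
For reference, a minimal sketch of the kind of swap meant here, with a made-up column name (assuming sns.histplot-style calls are what ch 10 uses):

```python
import pandas as pd
import seaborn as sns
import plotly.express as px

# Hypothetical data standing in for the chapter's examples.
df = pd.DataFrame({"bill_length": [39.1, 40.3, 36.7, 46.5, 50.0, 44.1]})

# seaborn version:
sns.histplot(data=df, x="bill_length")

# plotly express equivalent, so readers only need one plotting library:
fig = px.histogram(df, x="bill_length")
fig.show()
```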

Linear Regression case study

Smaller issue, but we can consider a slight reordering in this lesson: we introduce code with get_dummies and fit the model on dummy variables, and then show how the dummy variables work under Transforming Variables.
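
For context, a minimal sketch of what get_dummies produces, using a hypothetical categorical column:

```python
import pandas as pd

# Hypothetical data; the column is made up for illustration.
df = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sat"]})

# One indicator (dummy) column per category value: Fri, Sat, Thur.
dummies = pd.get_dummies(df["day"])
print(dummies)
```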

sec 17.5 (prediction intervals) has issues

The equation in Sec 17.5.2 (Predicting Crab Size) is likely very obscure to most readers.

[Screenshot of the equation in Sec 17.5.2]

Maybe it would be helpful to explain that this is computing the square root of the variance of the predictive distribution:

Var(Y; D) = Var(Y | theta_hat(D)) + Var(theta_hat(D)) 
               ~ sigma_hat^2(D)  + sigma_hat^2(D)/n

(where we treat D as random, since we are adopting the frequentist paradigm).

The equation in sec 17.5.3 (Predicting the Incremental Growth of a Crab) likely looks even more obscure to most readers

[Screenshot of the equation in Sec 17.5.3]

It would help to refer back to sec 15.5.1 (A Geometric Problem), where you have a (very elegant) derivation of theta_hat = (X' X)^{-1} X' y, and also mention that sigma_hat = SD(e). Then maybe explain that what you are computing is the square root of the variance of the conditional predictive distribution:

Var(Y0|x0; D) = Var(Y0 | x0; theta_hat(D)) + Var(theta_hat(D) | x0) 
                     ~ sigma_hat^2(D) + sigma_hat^2(D) x_0' (X' X)^{-1} x_0

(where we treat D=(X,y) as random, and x0 as fixed). (I have assumed x0 is a column vector, rather than a row vector, so that the second term looks like an inner product, as it should.)

You can then explain that these equations are derived in sec 17.6 (I assume...)

Time to change `pd.set_option('precision', 2)` to `pd.set_option('display.precision', 2)`?

I noticed a couple of places at least where y'all have:

pd.set_option('precision', 2)

or equivalent - see e.g.:

https://ds.lis.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A//github.com/lisds/ds-100-sql&subPath=content/ch/07/sql_exercises.ipynb

With current Pandas, this gives:

OptionError: Pattern matched multiple keys

I assume you intend this to mean display.precision instead of styler.format.precision? But in either case, is now the time to specify the option explicitly? Happy to make a PR if so.

sec 21.4.3 (fake news): a few small issues

In Sec 21.4.3, the table comparing the 3 models' performance says "test error" but it should be "test accuracy".

It might be useful to do a little exploratory analysis on the tf-idf transformed data before fitting :)

You say the model has "23,804 features" but there are 23812 unique tokens.

Maybe explain how to handle "out of vocabulary" words (like new names of politicians) so the model can actually be used on new data?
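
For reference, a minimal sketch of that last point, assuming an sklearn TfidfVectorizer-style pipeline (the documents are made up): tokens unseen at fit time are silently dropped at transform time, which is worth stating explicitly in the section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
train_docs = ["the senator spoke today", "the governor spoke yesterday"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
print(len(vec.vocabulary_))   # the vocabulary is fixed at fit time

# "mayor" never appeared in training, so it is silently ignored here.
X_new = vec.transform(["the mayor spoke today"])
print(X_new.shape)            # same number of columns as X_train
```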

Cross-Validation (full section)

I would like a section on CV at the end of chapter 13 on Feature Engineering. This will be section 13.3 which comes right after section 13.2 on polynomial regression.

Outline:

  • In the previous section, we saw that adding many features to the data lowers the training cost but can result in poor models. (Show short example.)
  • How do we choose the best set of features for the data?
    • We need to evaluate the model based on data that it hasn't used for training.
  • Cross validation
    • We split the data into a training set, validation set, and test set.
    • We train the model on the training set, then check its accuracy on the validation set.
    • Show example using ice cream regression from last section
    • We use the training set for fitting the model; we use CV for model selection.
  • Test set accuracy
    • Still, the validation set accuracy will usually be too high since we picked the model with the highest validation accuracy.
    • So, we use the test set (which the model has never seen at all) to report the final accuracy of the model. We cannot use the test set accuracy to pick a different model.
    • Show test set accuracy on ice cream regression
  • Conclusion: We use CV to perform feature selection and model selection. (A rough sketch of the split-and-validate workflow follows below.)
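
A minimal sketch of the train/validation/test workflow from the outline, with synthetic data standing in for the ice cream regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data standing in for the ice cream regression.
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# Split into train / validation / test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit on the training set, compare candidate models on the validation set,
# and touch the test set only once, to report the final error.
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_val, model.predict(X_val)))    # used for model selection
print(mean_squared_error(y_test, model.predict(X_test)))  # reported once at the end
```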

sec 17.2 has some issues

In sec 17.2.1 (rank test), the calculation (1+200)/2 = 100.5 might seem obscure to some.
Later you write this more clearly as shown below. Maybe move here?

[Screenshot of the later, clearer explanation of the rank calculation]

In sec 17.2.2 (vaccines), the paragraph below is hard to understand:

[Screenshot of the hard-to-understand paragraph in Sec 17.2.2]

The numbers in table 17.1 are obscure - where does 293 come from? (I assume it's ceil(585/2), but this needs to be explained.)

Finally you say that the multivariate hypergeometric distribution is explained in ch 3, but that only covers the univariate case, IIUC. So the code may not be very clear to some readers.

Small spelling mistake in Sec 2.4

This phrase:

The measurements of CO2 are the number of CO2 molecules per million molecules of dry air so the unit of measurement is parts per million of ppm for short.)

I think it should be (ppm for short.) instead of of ppm for short.)
