ds-100 / textbook

Learning Data Science, a textbook.

Home Page: https://learningds.org/


textbook's Introduction

Learning Data Science

By Sam Lau, Joey Gonzalez, and Deb Nolan.

Front cover of textbook

Learning Data Science is an introductory textbook for data science published by O'Reilly Media in 2023. It covers foundational skills in programming and statistics that encompass the data science lifecycle. The reader's assumed background is detailed in the Preface.

The contents of this book are licensed for free consumption under the following license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

textbook's People

Contributors

aaronkh, allenshen5, ananthagarwal, andrewjkim, ashleychien, ashleychien1, chrispyles, debnolan, jegonzal, junseo-park, ryanlovett, samlau95, sonajeswani, tianxiaohu, tjann, yuvipanda, zhunation


textbook's Issues

Topics to eventually cover

Last updated: Aug 20, 2018

Here's a list of topics that we would like to include in the textbook. Each of these topics will likely become one section.

  • HTML and XML
  • Extracting features for modeling from text
  • Linear regression + one-hot encoding results in average of column
  • More types of feature engineering
  • Compare lasso vs ridge fit speed
  • Norm balls and elastic net
  • Regularization for logistic reg

Topics finished in Summer 2018

  • Linear regression viewed as a projection
  • SGD and gradient of logistic cost
  • Sensitivity / Specificity + AUC
  • Multi-class classification
  • More on random variables / expectation / variance / distributions
  • Hypothesis testing review
  • Studentized bootstrap
  • p-hacking

Small info issue in Sec 2.5

This phrase:

Without CO2, earth would be impossibly cold, but it’s a delicate balance.

I think it should be warm instead of cold.

pandas/seaborn/sklearn Quick Reference

We would like a page containing all the API pieces that we use in Data 100 from pandas, seaborn, and sklearn. It should be a set of tables with the function name, the location where it's mentioned, and a one-sentence description of what it does.

Feature Request, Search Bar

Is it possible to implement a search function in the textbook, similar to Data 8's? I find it hard to look up definitions/examples of key terms quickly. Thanks!!

Adding projection to vector space review appendix

Big oversight on my part -- I've been working with orthonormal vectors and forgot that projections aren't just multiplications. I've added a reference in the Linear Projection chapter, but the coefficient for the projection is not yet in the Appendix.
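
For concreteness, a minimal numpy sketch of the missing piece (the vectors here are made up for illustration): for a non-unit vector v, the projection of x onto v needs the coefficient (x · v) / (v · v), which reduces to x · v only when v has unit length.

```python
import numpy as np

# Hypothetical vectors chosen for illustration.
x = np.array([2.0, 1.0])
v = np.array([3.0, 0.0])   # not a unit vector

# Projection of x onto the line spanned by v.
# The coefficient is (x . v) / (v . v); it reduces to x . v only when v has unit length.
coef = (x @ v) / (v @ v)
proj = coef * v

print(coef)   # 0.666...
print(proj)   # [2. 0.]
```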

Fa20 Running List of TODOs

Last updated: 10/12/2020

Deb and I meet every few weeks to discuss the book. We'll leave a running list of TODOs in this issue (in no particular order).

Large scale TODOs (involves multiple pages or chapters of the book):

  • Write decision tree section before the class covers it on Nov 12, 2020

  • Write PCA section before the class covers it on Nov 24, 2020

  • Rework Data Cleaning, EDA, and Data Viz sections.

    • EDA will come first, then data cleaning (in the context of EDA), then data viz
  • Change datasets to more interesting ones.

  • Add worked examples and case studies to the textbook.

  • Integrate themes of the course throughout the book (Data lifecycle, Working with large datasets, Data design and generalizability)

Smaller TODOs (involves single pages of the book):

Preface or Introduction

  • Goals of the book (course) - prepare, enable, empower
  • What’s special about our approach: integrating computing, statistics, and data technologies; empirical loss minimization (optimization, bias-variance, prediction vs. inference, model selection, feature engineering, and regularization)
  • Organization - main case, small examples, other case studies
  • Introductory definition of data science

Overarching changes:

  • Remove Data8-specific references: add links to that text and/or more information about the referenced topic
  • Add section on Ethics.
  • Make an editing pass for sections that haven't gotten a fine-grained edit.

Ch 1 (Data Lifecycle)

  • Missing two “entry” points into the data science lifecycle — sometimes we start with data; sometimes we start with a question
  • Sam thinks the first example of the book should have at least one interesting CS and one interesting stat idea
  • Add canonical examples of different types of data science: estimation, prediction, data mining (?)
  • Talk about unique challenges of data science. Why is data science different from CS + stats put together?

Ch 2 (Data Design)

  • Flesh out SRS definition
  • Pair famous examples of Admin data with new examples, e.g., Hite report paper surveys to Internet survey, Phone calls for Dewey’s election to cell phone calls to Trump’s election
  • Add practice for basic probability (e.g. probability that A appears for SRS, cluster, stratified sampling)
  • Can use polling examples
  • Change SRS vs. Big Data section to use actual datasets
  • Add practice for evaluating data design

Ch 3 (Tabular data)

  • Bring back restaurant example with its 3 levels of granularity? Or make it a case study
  • Add practice problems for indexing / sorting
  • Add practice for grouping / pivoting
  • Some problems should be high-level (e.g. will grouping be helpful?) and others should be syntax-oriented.
  • Add practice for apply and strings
  • Add example of well-written pandas code (e.g. using pipes instead of multiple variables; a rough sketch follows below)
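
A minimal sketch of the pipe style for that last item, with made-up data and helper names:

```python
import pandas as pd

# Hypothetical toy data; the columns are made up for illustration.
df = pd.DataFrame({
    "name": ["Tea House", "Taco Spot", "Noodle Bar"],
    "score": [92, 85, 78],
    "city": ["Berkeley", "Berkeley", "Oakland"],
})

def filter_city(frame, city):
    return frame[frame["city"] == city]

def top_scores(frame, n):
    return frame.sort_values("score", ascending=False).head(n)

# One readable chain via .pipe instead of several intermediate variables.
result = df.pipe(filter_city, "Berkeley").pipe(top_scores, 2)
print(result)
```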

Ch 4 (Data Cleaning)

  • Swap out Police report example or create a question where we can help use statistics to answer
  • Missing values and how to think about them
    • Unit tests for the data
    • E.g. Check whether data lies within bounds (a rough sketch follows after this list)
  • What should the distribution look like?
  • Move data types into this chapter
  • Add practice on granularity manipulation
    • E.g. grouping time series
  • Add practice on wide vs. long
    • Convert data to tidy format
    • What questions do wide vs. long data facilitate?
  • Add practice on faithfulness
    • Sanity checks for data
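
A minimal sketch of the kind of data unit test mentioned above (bounds check); the column name and bounds are hypothetical:

```python
import pandas as pd

# Hypothetical data containing an out-of-bounds sentinel value.
co2 = pd.DataFrame({"ppm": [315.2, 316.1, 317.0, 9999.0]})

def check_bounds(frame, col, low, high):
    """Raise an AssertionError if any value falls outside [low, high]."""
    bad = frame[(frame[col] < low) | (frame[col] > high)]
    assert bad.empty, f"{len(bad)} rows of {col} outside [{low}, {high}]"

check_bounds(co2, "ppm", 200, 500)   # raises: the 9999.0 sentinel is out of bounds
```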

Ch 5 (EDA)

  • How to read a plot, in terms of distributions
  • Geographic data
  • Granularity and the wide vs tall form of tabular data
  • Add practice to match data types with plots
    • E.g. debugging a barchart of quantitative data
  • Add practice on critical thinking about common data issues revealed in viz
    • E.g. how to identify missing values, integers, outliers using viz

Ch 6 (Visualization)

  • Goals: understand the distribution, Comparisons and conditioning, Information rich plots with context
  • Bulge diagram
  • Histograms as smoothers (large n), when not to smooth
  • Reasons for transforming
  • Dimension reduction (large p)
  • Add practice on plot syntax and customization
  • Add practice on viz problem identification
  • Add practice on solving viz problems

Ch 7 (Web)

  • Write section for XML, HTML, and XPath
  • Write section for REST and scraping
  • Add worked example on using web technologies to collect and analyze data
    • E.g. weather website
  • Add practice on web scraping
  • Add practice on data analysis with HTML

Ch 8 (Regex)

  • Add back more use cases for regexes
  • Add philosophy for how pattern matching works
  • Greedy matching (avoid the newest metacharacters and explain why)
  • Text mining concepts
  • Add practice on when string method vs. regex is better
  • Add practice on regex construction
  • Add practice on regex debugging

Ch 9 (Databases, SQL)

  • Add conceptual questions about pros/cons of using databases
  • Add practice on SQL query construction
  • Add practice on query debugging

Ch 10 (Modeling)

  • Distinction between world, model, data
  • Introduce the data more
  • Reasons for L2, L1 and why the optimizers are different
  • Mention empirical approach - need to bring in statistical approach
  • Scientific example - a constant that we want to estimate about a natural phenomenon, e.g., whether the expansion of the universe is increasing
  • Social Science Example - We want to estimate the behavior of a population, there may be outliers with contaminated data so use L1 error
  • Economics example - population distribution has heavy tails, we want to estimate the center using Huber loss
  • Put chain rule and convex function definitions in an appendix

Ch 11 (Gradient Descent)

  • 11.1 Example where we write code to inspect the range of thetas.
  • 11.2 The intuition as to why we would use the derivative and the size of the steps is a bit unclear, e.g., why do we begin by using steps with alpha = 1? I think that a few small language changes can fix this.
  • Example: An example with an asymmetric curve would be good to see worked out with starting points on either side of the minimum.
  • 11.3 One example of extra material would be to look at the sum of two convex functions and determine whether the sum is convex. This is relevant to the problems that we are solving, and there was confusion over this in class

Ch 12 (Probability and Generalization)

  • Introduction: Put the estimation problem in a larger context: Population -> Data production -> Sample
  • We have one sample and one estimate, but we know that we may get other data in the future so we want to attach an error to the estimate that indicates how accurate it is.
  • Example: Possibly a new section: that introduces the theory by first running a simulation study where we have a population. In class we used the restaurant scores for the example.
  • Develop theory from this point. Use a small population of known values. Take a sample of size 1 and find the distribution of the values, the expected value, and the variance. Draw comparisons to the L2 loss on the sample and the L2 loss on the probability distribution. Use diagrams that match the ones from Chapter 10. Then take a sample of size 2 and figure out the joint distribution and the distribution of the average.
  • Formal development of expectation and linearity, variance and covariance.
  • Example: Binomial, first introduced via a real world application
  • Example: Hypergeometric, first introduced via a real world application
  • New section: Usefulness of Monte Carlo to study behavior of statistics (a rough simulation sketch follows after this list)
  • New section: We can talk about the difference between inference and prediction here. With inference we want to know the accuracy of the estimator of the tau parameter. With prediction we want to know the accuracy of the prediction.
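
A minimal simulation sketch of the two bullets above (small known population, exact expectation/variance, Monte Carlo distribution of the average of a sample of size 2); the population values are made up rather than the restaurant scores:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small population of known values.
population = np.array([2, 3, 3, 4, 8])

# Exact expectation and variance of a single uniform draw from the population.
mu = population.mean()
sigma2 = population.var()

# Monte Carlo: distribution of the average of a sample of size 2 (with replacement).
averages = rng.choice(population, size=(100_000, 2), replace=True).mean(axis=1)

print(mu, sigma2)                        # exact values
print(averages.mean(), averages.var())   # approx. mu and sigma2 / 2
```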

Ch 13 (Linear Regression)

  • Introduction: “In this chapter we will introduce linear models which will allow us to make use of our entire dataset to make predictions.” This sounds like we aren’t using all of the observations, which is not the case.
  • 13.1 “We treat … as the underlying function that generated the data.” - we want to say this differently. We know it doesn’t generate our data because the points clearly do not fall on a line. Here is where we might talk about “idealized models” and connect it to leaving a percentage tip.
  • Example: I think we should use an example where the error is much more clearly defined. And substitute the tips as another example, where we can discuss the kinds of approximations in modeling that we are making. Possible examples from nature:
  • Kleiber’s data on metabolic rate and mass.
  • Dungeness crab data with post-molt size and pre-molt size.
  • New section before multivariable regression: We need to discuss the error and residual plots for simple examples where there is no bias. The easiest example is where the errors are clearly normally distributed.
  • Example: predicting baby’s birthweight from mother’s height?
  • 13.3 It’s strange motivation that because more variables are available in a data set, we want to run multivariable regression.
  • Example: I think we should create a motivating example with two variables and explain what the model means, i.e., that for any value of variable1, the same linear relationship exists between y and variable2.
  • 13.4 Example: Compare simple regression coefficients to bivariate regression coefficients when the explanatory variables are correlated. Show pairwise scatterplots of the two regressors and y.
  • Example: Compare simple regression coefficients to bivariate regression coefficients when the explanatory variables are uncorrelated. Show pairwise scatterplots of two regressors and y.
  • 13.5 Motivation needs fixing. Need to add pairwise scatterplots; explanation of the model components; treatment of indicator variables.

Ch 14 (Feature Engineering)

  • Write section on feature extraction for text
  • Add a section with other types of feature engineering
  • 14.1 The motivation for one-hot encoding needs fixing. It needs to reflect the purpose of creating dummy variables. We should also mention the various terms used to describe this, dummy variables, indicators, and one-hot encoding.
    • I think that this section might deserve to be its own chapter. It is not feature engineering in the same sense as polynomial regression (14.2). It’s not optional.
    • Example: Rather than start with a complex regression, start with a simple problem that only has one categorical variable.
    • Example: Combine one or two numeric variables with one categorical variable to understand the model.
    • Example: Many variables are categorical and there are no numeric variables, e.g., a simple spam filter, text mining
  • 14.2 We should include two regressors so that we can look at interactions between them.
    • Example: We should discuss the high correlation between polynomials of high degree and mention orthogonal polynomials.
    • Example: run time as a function of age: We should add an example of bent lines that have more flexibility

Ch 15 (Bias-Variance Tradeoff)

  • 15.1 In the Risk definition it is unclear what the expectation is being taken over.
    • Example: A very simple theoretical example would be good here.
    • Example: Bring back the donkey example and examine the estimated %-error. This has some meaning in the field.
  • 15.2 It would be helpful to clearly spell out all of the various f’s so we know what is being compared.
    • Example: use a hypothetical example where the various pieces are identified.
  • 15.3 Missing the mathematical formulation of the cross-validation as an approximation to Risk.

Ch 16 (Regularization)

Ch 17 (Classification)

  • Publish section on regularization for logistic regression (Jun Seo started this)

Ch 18 (Statistical Inference)

Fix issue in 20.2.3 code

In the 20.2 bias_modeling notebook, there is an issue with the "Example: Linear Regression and Sine Waves" code: when you run it, you get "Expected 2D array, got scalar array instead".

To fix this issue, make the following changes to line 23:

  • from: `return Line(x_start, x_end, clf.predict(x_start)[0], clf.predict(x_end)[0])`
  • to: `return Line(x_start, x_end, clf.predict([[x_start]])[0], clf.predict([[x_end]])[0])`
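
A minimal self-contained sketch of why the wrapping matters; the model and data below are stand-ins, not the notebook's actual setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data; sklearn expects X with shape (n_samples, n_features).
X = np.linspace(0, 6, 50).reshape(-1, 1)
y = np.sin(X).ravel()

clf = LinearRegression().fit(X, y)

x_start = 0.0
# clf.predict(x_start)                  # raises "Expected 2D array, got scalar array instead"
y_start = clf.predict([[x_start]])[0]   # wrap the scalar as a 1x1 array instead
print(y_start)
```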


Small info issue in Sec 2.5

This phrase:

 Clinton’s upset victory came as a surprise.

I think it should be Trump instead of Clinton.

Convexity (full section)

We need a section on convexity for gradient descent. Outline to come.

Tiff is currently assigned (we need one more).

This will become section 11.3, which comes right after section 11.2.

Outline:

  • Will gradient descent always find the theta that generates the lowest cost?
    • No; show an example of a degree 4 polynomial for which gradient descent finds a local minimum
  • When will gradient descent find the best theta?
    • Only when the function doesn't have local minima
    • These functions are called convex functions
  • Define convex function
  • Conclusion: when we can, we will choose a convex loss function because gradient descent will find the lowest cost instead of a local minimum. (A rough sketch of the degree-4 example follows below.)
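
A minimal sketch of the degree-4 example in the outline (the polynomial and step size are made up for illustration): started on the wrong side of the hump, gradient descent settles in the local rather than the global minimum.

```python
# Hypothetical non-convex cost: a degree-4 polynomial with two basins.
def cost(theta):
    return theta**4 - 4 * theta**2 + theta

def grad(theta):
    return 4 * theta**3 - 8 * theta + 1

def gradient_descent(theta, alpha=0.01, steps=1000):
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

theta_right = gradient_descent(theta=2.0)    # ends near 1.35 (local minimum)
theta_left = gradient_descent(theta=-2.0)    # ends near -1.47 (global minimum)
print(theta_right, cost(theta_right))
print(theta_left, cost(theta_left))
```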

appendices: some small issues

In the "additional material". Perhaps it would make more sense to refactor this section, and add pieces to the end of each chapter, in a section called "Further reading". This would be more modular, and would put the references in context.
(Also some of the sources you cite are arguably not the most useful to a beginning DS student...)

The correct reference for ISLP is below (you omitted two authors)

@book{James2023,
  author = "G. James and D. Witten and T. Hastie and R. Tibshirani and J. Taylor",
  title = "An Introduction to Statistical Learning (with Applications in Python)",
  year = 2023,
  publisher = "Springer",
  url = "https://www.statlearning.com/"
}

You might want to add a list of topics not covered, with pointers on where to learn more.
(The above ISLP book may be a good next step.)

https://learningds.org/ch/a01/prob_review.html has some missing figures.

ch 20 (optimization): a few small issues

In sec 20.1, the comment about scipy.minimize where you say "we don’t even need to compute the gradient" may be misleading. As you know, by default it uses numerical differentiation to compute the gradient if the grad function is not specified by the user, so this is likely to be slow. You may want to mention automatic differentiation libraries like jax and pytorch, which can solve this problem for you. (Also, scipy.minimize defaults to BFGS, not GD, and chooses the step size automagically :) Since this book is trying to demonstrate "best practice" for DS (e.g. the nice way you use dataframe.pipe for reproducible wrangling), maybe you should show how to use scipy.minimize on your example problem?
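
In that spirit, a minimal sketch of calling scipy.optimize.minimize on a stand-in problem (the data and the constant-model MSE loss below are hypothetical, not the book's example):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data; we fit a single constant theta by minimizing mean squared error.
y = np.array([12.0, 15.0, 13.0, 17.0, 16.0])

def mse_loss(theta):
    return np.mean((y - theta) ** 2)

# With no gradient supplied, minimize (default method: BFGS) falls back to
# numerical differentiation, which can be slow for high-dimensional problems.
result = minimize(mse_loss, x0=np.array([0.0]))
print(result.x)        # approx. [14.6], the sample mean
print(result.success)
```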

In sec 20.2, the first two paragraphs need rewriting to avoid repetition/redundancy.

[Screenshot of the first two paragraphs of Sec 20.2]

In sec 20.3, maybe mention that (for twice-differentiable functions) convexity implies the second derivative is non-negative, so the function has a bowl shape.
This condition is easier to check in practice than the definition of convexity. It's probably also worth mentioning some examples of convex and non-convex loss functions encountered in the book.

Maybe mention SAGA and other variance-reduced SGD methods, since SAGA is used in 21.4.1?

ch 10: replace sns with px if possible

I noticed that a few of the examples in ch 10 (eg sec 10.2 and sec 10.3.2) use seaborn. Is this required, or can plotly be used? Since this chapter is about data viz, it would be ideal if the reader only had to learn one library :) (Also it may be worth mentioning that go is a graph object (I assume?), which is part of plotly.)
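
For reference, a minimal sketch of the kind of swap meant here, with a made-up column name (assuming sns.histplot-style calls are what ch 10 uses):

```python
import pandas as pd
import seaborn as sns
import plotly.express as px

# Hypothetical data standing in for the chapter's examples.
df = pd.DataFrame({"bill_length": [39.1, 40.3, 36.7, 46.5, 50.0, 44.1]})

# seaborn version:
sns.histplot(data=df, x="bill_length")

# plotly express equivalent, so readers only need one plotting library:
fig = px.histogram(df, x="bill_length")
fig.show()
```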

Linear Regression case study

Smaller issue, but we can consider a slight reordering in this lesson: we introduce code with get_dummies and fit the model on dummy variables, and then show how the dummy variables work under Transforming Variables.
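
For context, a minimal sketch of what get_dummies produces, using a hypothetical categorical column:

```python
import pandas as pd

# Hypothetical data; the column is made up for illustration.
df = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sat"]})

# One indicator (dummy) column per category value: Fri, Sat, Thur.
dummies = pd.get_dummies(df["day"])
print(dummies)
```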

sec 17.5 (prediction intervals) has issues

The equation in Sec 17.5.2 (Predicting Crab Size) is likely very obscure to most readers.

[Screenshot of the equation in Sec 17.5.2]

Maybe it would be helpful to explain that this is computing the square root of the variance of the predictive distribution:

Var(Y; D) = Var(Y | theta_hat(D)) + Var(theta_hat(D)) 
               ~ sigma_hat^2(D)  + sigma_hat^2(D)/n

(where we treat D as random, since we are adopting the frequentist paradigm).

The equation in sec 17.5.3 (Predicting the Incremental Growth of a Crab) likely looks even more obscure to most readers

[Screenshot of the equation in Sec 17.5.3]

It would help to refer back to sec 15.5.1 (A Geometric Problem), where you have a (very elegant) derivation of theta_hat = (X' X)^{-1} X' y, and also mention that sigma_hat = SD(e). Then maybe explain that what you are computing is the square root of the variance of the conditional predictive distribution:

Var(Y0|x0; D) = Var(Y0 | x0; theta_hat(D)) + Var(theta_hat(D) | x0) 
                     ~ sigma_hat^2(D) + sigma_hat^2(D) x_0' (X' X)^{-1} x_0

(where we treat D=(X,y) as random, and x0 as fixed). (I have assumed x0 is a column vector, rather than a row vector, so that the second term looks like an inner product, as it should.)

You can then explain that these equations are derived in sec 17.6 (I assume...)

Time to change `pd.set_option('precision', 2)` to `pd.set_option('display.precision', 2)`?

I noticed a couple of places at least where y'all have:

pd.set_option('precision', 2)

or equivalent - see e.g.:

https://ds.lis.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A//github.com/lisds/ds-100-sql&subPath=content/ch/07/sql_exercises.ipynb

With current Pandas, this gives:

OptionError: Pattern matched multiple keys

I assume you intend this to mean display.precision instead of styler.format.precision? But in either case, is now the time to specify the option explicitly? Happy to make a PR if so.

sec 21.4.3 (fake news): a few small issues

In Sec 21.4.3, the table comparing the 3 models' performance says "test error" but it should be "test accuracy".

It might be useful to do a little exploratory analysis on the tf-idf transformed data before fitting :)

You say the model has "23,804 features" but there are 23812 unique tokens.

Maybe explain how to handle "out of vocabulary" words (like new names of politicians) so the model can actually be used on new data?
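
For reference, a minimal sketch of that last point, assuming an sklearn TfidfVectorizer-style pipeline (the documents are made up): tokens unseen at fit time are silently dropped at transform time, which is worth stating explicitly in the section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
train_docs = ["the senator spoke today", "the governor spoke yesterday"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
print(len(vec.vocabulary_))   # the vocabulary is fixed at fit time

# "mayor" never appeared in training, so it is silently ignored here.
X_new = vec.transform(["the mayor spoke today"])
print(X_new.shape)            # same number of columns as X_train
```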

Cross-Validation (full section)

I would like a section on CV at the end of chapter 13 on Feature Engineering. This will be section 13.3 which comes right after section 13.2 on polynomial regression.

Outline:

  • In the previous section, we saw that adding many features to the data lowers the training cost but can result in poor models. (Show short example.)
  • How do we choose the best set of features for the data?
    • We need to evaluate the model based on data that it hasn't used for training.
  • Cross validation
    • We split the data into a training set, validation set, and test set.
    • We train the model on the training set, then check its accuracy on the validation set.
    • Show example using ice cream regression from last section
    • We use the training set for fitting the model; we use CV for model selection.
  • Test set accuracy
    • Still, the validation set accuracy will usually be too high since we picked the model with the highest validation accuracy.
    • So, we use the test set (which the model has never seen at all) to report the final accuracy of the model. We cannot use the test set accuracy to pick a different model.
    • Show test set accuracy on ice cream regression
  • Conclusion: We use CV to perform feature selection and model selection. (A rough sketch of the split-and-validate workflow follows below.)
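
A minimal sketch of the train/validation/test workflow from the outline, with synthetic data standing in for the ice cream regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data standing in for the ice cream regression.
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# Split into train / validation / test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit on the training set, compare candidate models on the validation set,
# and touch the test set only once, to report the final error.
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_val, model.predict(X_val)))    # used for model selection
print(mean_squared_error(y_test, model.predict(X_test)))  # reported once at the end
```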

sec 17.2 has some issues

In sec 17.2.1 (rank test), the calculation (1+200)/2 = 100.5 might seem obscure to some.
Later you write this more clearly as shown below. Maybe move here?

[Screenshot of the later, clearer explanation of the rank calculation]

In sec 17.2.2 (vaccines), the paragraph below is hard to understand:

[Screenshot of the hard-to-understand paragraph in Sec 17.2.2]

The numbers in table 17.1 are obscure - where does 293 come from? (I assume it's ceil(585/2), but this needs to be explained.)

Finally you say that the multivariate hypergeometric distribution is explained in ch 3, but that only covers the univariate case, IIUC. So the code may not be very clear to some readers.

Small spelling mistake in Sec 2.4

This phrase:

The measurements of CO2 are the number of CO2 molecules per million molecules of dry air so the unit of measurement is parts per million of ppm for short.)

I think it should be (ppm for short.) instead of of ppm for short.)
