Data Science from Scratch

Here's all the code and examples from the second edition of my book Data Science from Scratch. They require at least Python 3.6.

(If you're looking for the code and examples from the first edition, that's in the first-edition folder.)

If you want to use the code, you should be able to clone the repo and just do things like

In [1]: from scratch.linear_algebra import dot

In [2]: dot([1, 2, 3], [4, 5, 6])
Out[2]: 32

and so on and so forth.

Two notes:

  1. In order to use the library like this, you need to be in the root directory (that is, the directory that contains the scratch folder). If you are in the scratch directory itself, the imports won't work.

  2. It's possible that it will just work. It's also possible that you may need to add the root directory to your PYTHONPATH; if you are on Linux or OSX, this is as simple as

export PYTHONPATH=/path/to/where/you/cloned/this/repo

(substituting in the real path, of course).

If you are on Windows, it's potentially more complicated.
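For what it's worth, a sketch of the Windows equivalent for a cmd.exe session (substituting your real path, and session-only like the export above):

```
set PYTHONPATH=C:\path\to\where\you\cloned\this\repo
```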

Table of Contents

  1. Introduction
  2. A Crash Course in Python
  3. Visualizing Data
  4. Linear Algebra
  5. Statistics
  6. Probability
  7. Hypothesis and Inference
  8. Gradient Descent
  9. Getting Data
  10. Working With Data
  11. Machine Learning
  12. k-Nearest Neighbors
  13. Naive Bayes
  14. Simple Linear Regression
  15. Multiple Regression
  16. Logistic Regression
  17. Decision Trees
  18. Neural Networks
  19. Deep Learning
  20. Clustering
  21. Natural Language Processing
  22. Network Analysis
  23. Recommender Systems
  24. Databases and SQL
  25. MapReduce
  26. Data Ethics
  27. Go Forth And Do Data Science

data-science-from-scratch's People

Contributors

cclauss · cenuno · e9t · joelgrus · mandliya · pandeesh

data-science-from-scratch's Issues

Overflow error for logistic gradient descent example

I'm not 100% sure I had all of my code in correctly by this point, but I got an overflow error when I tried to fit the logistic model example using gradient descent on p. 194. This seemed to stem from an overflow problem with running the logistic_log_partial_ij function. I did not have this problem when fitting the model with stochastic gradient descent.

BackPropagation Algorithm

Hi,

It looks like in the algorithm for backprop the variable output_layer is used before it's defined in the hidden_deltas list comprehension.

Also, do you have any tips for extending the algorithm for multi-layer networks?

PCA example

Why does the PCA example return components with the opposite sign of the ones from sklearn's PCA? Also, when I standardize the data and rerun the code, the components obtained through PCA are the same, which doesn't make sense. Notebook with examples attached.
PrincipalComponentAnalysis.ipynb.zip

Typo on p. 84

On p. 84, you have a stray "=== p-values" at the end of a paragraph. I'm guessing this was meant to be a section heading and a carriage return was missing before it?

Error in Whitespace Formatting in Chapter 2

Sorry I encountered this in the book in Chapter 2:

One consequence of whitespace formatting is that it can be hard to copy and paste code
into the Python shell. For example, if you tried to paste the code:

for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print i

into the ordinary Python shell, you would get a:
IndentationError: expected an indented block
because the interpreter thinks the blank line signals the end of the for loop’s block.

But it actually works in my Python shell.

So I am confused about which is right.

Typo

Hey man, your book is seriously awesome. I just caught one typo I'd bring to your attention: on pg. 83, toward the top, you say the mean should be 50 and the sigma should be 15.8, when the mean should be 500. It wasn't hard to figure out, but I thought I might help the next reader.

Thanks

Question about the add-friends code

Sorry, I have a question about the comments in your code:

# and then populate the lists with friendships
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j])  # add i as a friend of j
    users[j]["friends"].append(users[i])  # add j as a friend of i

To be specific: on the line users[i]["friends"].append(users[j]), the comment "add i as a friend of j" should be changed to "add j as a friend of i", right?

Because

 >>> list1 = ['a', 'b']
 >>> list1.append('c')
 >>> list1
 ['a', 'b', 'c']

So we know we are adding users[j] into users[i]["friends"], right?

naive bayes missing dataset

Hi Joel, I figured out that the dataset referred to in naive_bayes.py is not in this GitHub repo. Can you give me a link to download it?

if __name__ == "__main__":
    # train_and_test_model(r"c:\spam\*\*")
    train_and_test_model(r"/home/joel/src/spam/*/*")

Thanks

estimate_gradient function type annotation has no output

Hi there,

Not sure if this is an issue, but in Chapter 8, Gradient Descent: shouldn't estimate_gradient have a return type of Vector? That is:

def estimate_gradient(f: Callable[[Vector], float],
                      v: Vector,
                      h: float = 0.0001) -> Vector:
    return [partial_difference_quotient(f, v, i, h)
            for i in range(len(v))]

Either way, thanks for the great book :)

error at line 219

File ".\Friends.py", line 219
key=lambda (user_id, num_friends): num_friends, # by number of friends
^
SyntaxError: invalid syntax

Missing code of the introduction chapter

Hello,
The following code is not present at the end of the FINDING KEY CONNECTORS section of https://github.com/joelgrus/data-science-from-scratch/blob/master/code/introduction.py:

num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
sorted(num_friends_by_id,
       key= lambda (user_id, num_friends): num_friends,
       reverse=True)

For the Python 3 version, I propose the following solution to take PEP 3113 into account:

num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
sorted(num_friends_by_id,
       key= lambda userid_numfriends: userid_numfriends[1],
       reverse=True)
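An equivalent Python 3 spelling of the proposed fix uses operator.itemgetter (the sample pairs below are made up, just to show the sort):

```python
from operator import itemgetter

# hypothetical (user_id, num_friends) pairs
num_friends_by_id = [(0, 2), (1, 3), (2, 1)]

# sort by the second element of each pair, descending
result = sorted(num_friends_by_id, key=itemgetter(1), reverse=True)
print(result)  # [(1, 3), (0, 2), (2, 1)]
```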

Update the weights in backpropagation after calculating the hidden_deltas? (neural_networks.py)

I'm new to neural networks. In your example of backpropagation, you update the weights pointing from the hidden layer to the output layer immediately after calculating the output deltas. After that, you calculate the hidden_deltas using the already-updated weights. In other documentation, all the deltas are calculated first, and only then are the network's weights updated.
So which is correct?

I tried to calculate all the deltas first, but then the result is not as good as in your version.

Bayesian Inference unclear

Hi Joel,

I am currently going through your book. I really like it, although some things go a little quickly for me, especially chapter 7 on Bayesian inference. I understand that with Bayesian inference you update your prior beliefs, as in the example you give: Beta(1,1) becomes Beta(4,8) after 10 flips with 3 heads, giving a probability of 0.33.

What confuses me is figures 7-1 and 7-2. How did you make them? Did you make them with the function "beta_pdf" from the page before (89)? Then is x the number of flips, and are alpha and beta the respective probabilities of heads and tails?

Best,
Tijl

Whitespace

The whitespace in git appears different from the whitespace in the book, and so far the book appears to have it right.

inputs of Chapter 15. Multiple Regression - Fitting the Model

In Chapter 15, in the section "Fitting the Model", the book estimates beta using the following line of code:

beta = least_squares_fit(inputs, daily_minutes_good, learning_rate, 5000, 25)

However, I cannot find the inputs that is used, other than the following:

inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

But that is obviously not going to work, because that line does not generate enough columns (2 vs. the 4 required for this model). Am I missing something? inputs is also used later in the chapter.

function calls mismatched dict names

I was reproducing your code and it failed here; I find that the code below has mismatched dict names:

from collections import defaultdict

# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user_id):
    return Counter(interested_user_id
                   for interest in interests_by_user["user_id"]
                   for interested_user_id in users_by_interest[interest]
                   if interested_user_id != user_id)

The problem is in most_common_interests_with(user_id): when it references the dicts defined above, it should use user_ids_by_interest, not users_by_interest, and interests_by_user_id, not interests_by_user.

But even after I rewrote the code, it produces no results.

There is no example for Stochastic Gradient Descent in Chapter 8

There is no example for Stochastic Gradient Descent in Chapter 8. I have tried to write one.

print("using minimize_stochastic_batch")

x = list(range(101))
y = [3*x_i + random.randint(-10, 20) for x_i in x]
theta_0 = random.randint(-10,10) 
v = minimize_stochastic(sum_of_squares, sum_of_squares_gradient, x, y, theta_0)

print("minimum v", v)
print("minimum value", sum_of_squares(v))

However, I run into: TypeError: sum_of_squares() takes 1 positional argument but 3 were given
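The TypeError arises because minimize_stochastic calls its target one data point at a time, as target_fn(x_i, y_i, theta), so sum_of_squares (which takes a single vector) has the wrong shape. A minimal self-contained sketch of the idea, with hypothetical per-point functions (not the book's exact code), fitting just a slope:

```python
import random

def squared_error(x_i, y_i, beta):
    """Per-point loss: the (x_i, y_i, theta) shape minimize_stochastic expects."""
    return (y_i - beta * x_i) ** 2

def squared_error_gradient(x_i, y_i, beta):
    """Derivative of squared_error with respect to beta."""
    return 2 * (beta * x_i - y_i) * x_i

random.seed(0)
xs = list(range(1, 101))
ys = [3 * x_i for x_i in xs]          # true slope is 3

beta = random.uniform(-10, 10)
learning_rate = 0.00001
for _ in range(5000):
    i = random.randrange(len(xs))     # one random point per step
    beta -= learning_rate * squared_error_gradient(xs[i], ys[i], beta)

print(round(beta, 2))  # close to 3.0
```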

minor notes

Minor notes for author revision (second edition). Thanks for the great book.

Web app is not working

When clicking the link provided in the README, the following error message showed up:


404 Not Found

Code: NoSuchKey
Message: The specified key does not exist.
Key: experiments/function-index/index.html
RequestId: AD1F5AB2228E5E80
HostId: Th07ZUUrIQtgVi8s0FD3zKMXu4KtfUSvbOpQaJmFXKMACJBDkTsm+GME6RZB6sSl9S49xXFxy0c=


Please add a new comment about `from __future__ import division`

When someone new to Python follows your directions step by step, he will find he is trapped by book code samples like this:

from __future__ import division # integer division is lame

This statement must be put at the head of the code:

num_users = len(users)                           # length of the users list
avg_connections = total_connections / num_users  # 2.4

That would be more useful and clear, and it would be convenient for people following your code step by step.

Jupyter notebooks?

I'm about to use the book in a course. I will likely copy the book code to a set of jupyter notebooks. Unless someone has already done that.

Typo in code on p. 79

I imagine you have already caught this since you've fixed the code on GitHub, but in the function definition for binomial on p. 79, you define it as binomial(n, p), and the next function (make_hist) won't work unless binomial is defined as binomial(p, n).

missing file stocks.csv

in file working_with_data.py there is code

with open("stocks.csv", "r") as f:

but in the project zip file I don't see the file stocks.csv anywhere.

Definition and implementation of the quantile

A generalization of the median is the quantile, which represents the value under which a certain percentile of the data lies (the median represents the value under which 50% of the data lies):

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in x"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

I think this means the median should be equal to quantile(xs, 0.5), which is not the case.
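The discrepancy shows up for even-length lists; a minimal sketch comparing the book's quantile with a conventional median (the median helper here is mine, not the book's):

```python
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs (the book's definition)."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

def median(xs: List[float]) -> float:
    """Conventional median: middle value, or mean of the two middle values."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

xs = [1, 2, 3, 4]
print(quantile(xs, 0.5))  # 3 -- picks index int(0.5 * 4) = 2 of the sorted list
print(median(xs))         # 2.5
# For odd-length lists the two agree: int(0.5 * 3) = 1 is the middle index.
```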

No example for minimize_stochastic

There's no example for the minimize_stochastic function, and it's not clear how to use it. In the batch functions you can pass anything as target_fn, but for minimize_stochastic you must pass a function that accepts 3+ params, one of which is y (which I believe is supposed to be the value of target_fn(x)). Where am I wrong, and how should I use it?

Missing title in plotting code

In the code that produces figure 1-3, the line of code that produces the title of the picture is missing.

You should add the line plt.title("Salary by Years Experience") to the code.

Typo pg 85 2nd edition

You say:

call make_hist(0.75, 100, 10000)

but you must mean:

call binomial_histogram(0.75, 100, 10000)

Crash Course in Python - defaultdict

Hi,

I ran into a problem in Chapter 2 (German edition) with the examples about defaultdict and also Counter.

What seems to be left out is how the variable "document" is defined.

Code:

from collections import defaultdict
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

I always get an error message like this:
"name 'document' is not defined"

Because of this error, the following code examples all fail as well:
- Counter - because word_counts cannot work without the preceding code
- Sets - same here...

For the subsection on Booleans, the code does not work either:

s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""

the following error appears:
NameError: name 'some_function_that_returns_a_string' is not defined

and the same holds for:
first_char = s and s[0]

Any help or hint would be great.
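Defining document as any list of words makes the Chapter 2 examples run; a minimal sketch (the sample words are made up):

```python
from collections import defaultdict, Counter

document = ["data", "science", "from", "scratch", "data"]  # any list of words

word_counts = defaultdict(int)    # missing keys default to 0
for word in document:
    word_counts[word] += 1
print(word_counts["data"])        # 2

print(Counter(document).most_common(1))  # [('data', 2)]
```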

Issue with spam_probability function (Ch 13, pp. 168-9)

Hi Joel,

I've run into a problem with the function that is called after the training process to evaluate a new message for spamminess. The algorithm as given goes through every word accumulated by the training process and updates the spam/not-spam probabilities based on that word appearing in the message (log(p)) or not appearing (log(1.0 - p)). All this, as best I can tell, is sound and correct according to the math.
Except, my post-training dictionary contains well over 80,000 words. So if you're accumulating probabilities, even if every one of those probabilities were 99%, by the time you combine 80,000 of them Python calculates the resulting probability as 0.0. Even with a dictionary 1/10 that size, the accumulated 99% probabilities would come out on the order of 1E-35. (Naturally, the overwhelming majority of the word-wise probabilities are far less than 99% -- most in fact are below 1%.)
But anyway, put in plain English, this means that as your training set grows, the chance of a given message being either spam or not spam, approaches zero! This can't possibly be right.

Could you let me know whether something's off with the math & algorithm given in the book? Or is my understanding of the methodology just off in the woods?

One suggestion for displaying Code examples

Hey @joelgrus,

I would like to thank you for all the work you are doing here. I would still like to recommend one or two improvements for your new book, or even just for your code examples, depending on which Python version someone is using: you should show whitespace and indentation explicitly.

For instance, in Python 2.7 you could write (2,10), but in Python 3.5 style you would add a space after the comma -> (2, 10). Beside that, it is not always clear from your side whether an indent is expected or not. It may depend on the IDE; Spyder, for instance, complains about a lot of indentation errors, while at the same time my console complains about syntax errors.

You can reproduce this issue by copy-pasting your own code into, for example, python.org, where you can find a live environment for Python 3.7. OK, this will obviously show you a lot of errors, but then you know what I mean.

In the O'Reilly book, it is also not always clear whether the code continues on the next page in the same "column", or how many spaces are expected, because the four-spaces rule does not always hold.

Thus it might be useful to work, in the book and also here, with some kind of syntax markers and visuals.

Just my few cents,

Greetings.

most_common_interests_with(user_id) from Python 3 Introductions file

I think that:

def most_common_interests_with(user_id):
    return Counter(interested_user_id
                   for interest in interests_by_user_id["user_id"]
                   for interested_user_id in user_ids_by_interest[interest]
                   if interested_user_id != user_id)

Should be:

def most_common_interests_with(user_id):
    return Counter(interested_user_id
                   for interest in interests_by_user_id[user_id]
                   for interested_user_id in user_ids_by_interest[interest]
                   if interested_user_id != user_id)

The quotes around user_id in the third quoted line are removed.
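As a sanity check, the corrected version runs end to end; a self-contained sketch with made-up sample interests:

```python
from collections import Counter, defaultdict

# hypothetical sample data, just to exercise the corrected function
interests = [(0, "python"), (0, "statistics"),
             (1, "python"), (2, "statistics"), (2, "python")]

user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user_id):
    return Counter(interested_user_id
                   for interest in interests_by_user_id[user_id]
                   for interested_user_id in user_ids_by_interest[interest]
                   if interested_user_id != user_id)

print(most_common_interests_with(0))  # Counter({2: 2, 1: 1})
```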

Surplus parameters

A wee nonfunctional problem: you reuse the same four params for precision(), recall(), and accuracy() in machine_learning.py. They should be:

def precision(tp, fp):
    return tp / (tp + fp)
def recall(tp, fn):
    return tp / (tp + fn)

The number calculating mistakes in introduction.py

The length of the users list is 11, not 10, so the average number of friends is about 2.18 rather than 2.4:

num_users = len(users)                           # 11
avg_connections = total_connections / num_users  # 2.1818181818181817

I have changed it and added a new comment on the line that calculates the length of users.

Chapter 4 vector_sum code not working

Hey, I'm working through some of the examples, and it seems the code below is not set up properly even though vector_add has been defined:

def vector_sum2(vectors):
    return reduce(vector_add, vectors)

When I try to run the line below in the IPython console, I get an error saying: "zip argument #1 must support iteration"

vector_sum([1,2,3,4])

Could you verify whether the code is correct or incorrect?

Many thanks!
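Two things seem to be going on (a sketch under those assumptions): in Python 3, reduce must be imported from functools, and the argument must be a list of vectors; vector_sum([1, 2, 3, 4]) hands reduce bare numbers, which is what triggers the zip complaint:

```python
from functools import reduce   # in Python 3, reduce lives in functools

def vector_add(v, w):
    """Adds two vectors componentwise."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    return reduce(vector_add, vectors)

# The argument must be a list of vectors:
print(vector_sum([[1, 2], [3, 4], [5, 6]]))  # [9, 12]

# Passing a single flat vector makes vector_add zip two numbers, hence the error:
try:
    vector_sum([1, 2, 3, 4])
except TypeError as e:
    print(e)
```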

Index for function definitions

An index giving the page where each custom function is defined would be very useful in navigating the book, particularly the later chapters.

User "Jen" does not exist in the Japanese edition

Hi, Joel

I started reading the Japanese edition yesterday.
In the code in this repo there is a user "Jen" in the list named "users", but that user IS NOT in the book.
Is it just a mistake? I guess it may be an intended difference between the original (written in English) and the Japanese edition.

ETA for second edition?

According to a July update of yours, the second edition of the book is forthcoming.

Is there an estimated completion date for the book?

Best regards,

TypeError: unsupported operand type(s) for -: 'dict' and 'dict'

While running the code below by following the example in Chapter 19 Clustering,

random.seed(0) # so you get the same results as me
clusterer = KMeans(3)
clusterer.train(inputs)
print clusterer.means

I got this error message, "TypeError:unsupported operand type(s) for -: 'dict' and 'dict'".


/repo/data-science-from-scratch/code/linear_algebra.py in squared_distance(v, w)
     44 
     45 def squared_distance(v, w):
---> 46     return sum_of_squares(vector_subtract(v, w))
     47 
     48 def distance(v, w):

/repo/data-science-from-scratch/code/linear_algebra.py in vector_subtract(v, w)
     17 def vector_subtract(v, w):
     18     """subtracts two vectors componentwise"""
---> 19     return [v_i - w_i for v_i, w_i in zip(v,w)]
     20 
     21 def vector_sum(vectors):

TypeError: unsupported operand type(s) for -: 'dict' and 'dict'

Could you help me with this issue?

Thanks
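The traceback points at vector_subtract receiving dicts, which suggests inputs was built as a list of dicts rather than a list of numeric vectors; KMeans here expects plain lists of numbers. A sketch reproducing both cases (vector_subtract copied from the traceback):

```python
def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

# With numeric vectors this works:
print(vector_subtract([1, 2, 3], [4, 5, 6]))  # [-3, -3, -3]

# Passing dicts reproduces the reported error:
try:
    vector_subtract([{"x": 1}], [{"x": 2}])
except TypeError as e:
    print(e)  # unsupported operand type(s) for -: 'dict' and 'dict'
```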

how do I install the linear_algebra library?

I've got your book. I'm working/skimming through it. I've hit a brick wall.

in your Readme.md you have the following snippet

from linear_algebra import distance, vector_mean
v = [1, 2, 3]
w = [4, 5, 6]
print distance(v, w)
print vector_mean([v, w])

This throws the following error:

ImportError: No module named linear_algebra

Where does the linear_algebra library come from? I've got a feeling it should have something to do with numpy.

For info I'm running python 2.7 using the anaconda stack

introduction coding for FINDING KEY CONNECTORS

Hi,
Should the code:
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j])  # add i as a friend of j
    users[j]["friends"].append(users[i])  # add j as a friend of i

be:

    users[i]["friends"].append(users[j]['id'])
    users[j]["friends"].append(users[i]['id'])

otherwise the friends list fills with dicts and becomes unwieldy.

thanks

andrew
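Andrew's suggestion can be sketched with a minimal reconstruction of the setup (the users and friendships below are made up, not the book's data):

```python
users = [{"id": i, "friends": []} for i in range(4)]
friendships = [(0, 1), (0, 2), (1, 3)]

for i, j in friendships:
    users[i]["friends"].append(users[j]["id"])  # store ids, not whole dicts
    users[j]["friends"].append(users[i]["id"])

print(users[0]["friends"])  # [1, 2]
```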
