joelgrus / data-science-from-scratch
code for Data Science From Scratch book
License: MIT License
I have reproduced your code and it failed here; I find that the code below has mismatched dict names.
from collections import defaultdict
# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
user_ids_by_interest[interest].append(user_id)
# keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
interests_by_user_id[user_id].append(interest)
def most_common_interests_with(user_id):
return Counter(interested_user_id
for interest in interests_by_user["user_id"]
for interested_user_id in users_by_interest[interest]
if interested_user_id != user_id)
Generally, the problem happens in the function most_common_interests_with(user_id):
when it looks up the dicts defined above, it should be user_ids_by_interest,
not users_by_interest,
and interests_by_user_id,
not interests_by_user.
But even after I rewrote the code, it produces no results.
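For what it's worth, here is a minimal self-contained sketch of the corrected function (the tiny interests list is made up for illustration). One likely reason a rewrite still produces no results is the quoted key in the snippet above: interests_by_user["user_id"] looks up the literal string "user_id", so the defaultdict returns an empty list and the Counter is empty.

```python
from collections import Counter, defaultdict

# Toy data assumed for illustration: (user_id, interest) pairs.
interests = [(0, "Python"), (0, "data"), (1, "Python")]

user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user_id):
    # note: user_id is NOT quoted -- [user_id], not ["user_id"]
    return Counter(interested_user_id
                   for interest in interests_by_user_id[user_id]
                   for interested_user_id in user_ids_by_interest[interest]
                   if interested_user_id != user_id)

most_common_interests_with(0)  # Counter({1: 1}) -- user 1 shares "Python"
```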
Why does the PCA example return components with the opposite sign of the ones from sklearn's PCA? Also, when I try to standardize the data and use the code, the components obtained through PCA are the same, which doesn't make sense. A notebook with examples is attached.
PrincipalComponentAnalysis.ipynb.zip
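One note on the sign question: a direction w and its negation -w explain exactly the same variance, so from-scratch PCA and sklearn's can legitimately disagree in sign while agreeing as directions. A quick sketch (directional_variance here is my own stand-in in the spirit of the book's function, with made-up data):

```python
def directional_variance(data, w):
    # variance of the data projected onto direction w (data assumed centered)
    norm = sum(w_i ** 2 for w_i in w) ** 0.5
    d = [w_i / norm for w_i in w]
    return sum(sum(x_i * d_i for x_i, d_i in zip(x, d)) ** 2 for x in data)

data = [[1.0, 2.0], [-1.0, -2.0], [2.0, 4.0], [-2.0, -4.0]]

# w and -w score identically, so either is a valid principal component:
va = directional_variance(data, [1.0, 2.0])
vb = directional_variance(data, [-1.0, -2.0])
```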
Hello,
The following code is not present at the end of the FINDING KEY CONNECTORS section of https://github.com/joelgrus/data-science-from-scratch/blob/master/code/introduction.py:
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
sorted(num_friends_by_id,
key= lambda (user_id, num_friends): num_friends,
reverse=True)
For the Python 3 version, I propose the following solution to take PEP 3113 into account:
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
sorted(num_friends_by_id,
key= lambda userid_numfriends: userid_numfriends[1],
reverse=True)
When clicking on the link provided in the README, the following error message showed up:
404 Not Found
Code: NoSuchKey
Message: The specified key does not exist.
Key: experiments/function-index/index.html
RequestId: AD1F5AB2228E5E80
HostId: Th07ZUUrIQtgVi8s0FD3zKMXu4KtfUSvbOpQaJmFXKMACJBDkTsm+GME6RZB6sSl9S49xXFxy0c=
Please add a new comment about from __future__ import division
When someone new to Python follows your directions step by step, they will find themselves trapped by book code samples like this:
from __future__ import division # integer division is lame
num_users = len(users) # length of the users list
avg_connections = total_connections / num_users # 2.4
That would be more useful and clear, and it would make it convenient for people to follow your code step by step.
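A sketch of why the import matters (toy numbers assumed, matching the book's 2.4 example):

```python
from __future__ import division  # a no-op in Python 3; changes / in Python 2

# In Python 2, / between two ints is floor division, so 24 / 10 would be 2,
# not 2.4. With the import (or in Python 3), / is true division and the
# book's expected average comes out.
total_connections = 24
num_users = 10
avg_connections = total_connections / num_users  # 2.4
```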
SimpleLinearRegression.ipynb.zip
I'm trying the example from the chapter on Simple Linear Regression using SGD, and the result does not converge. The value of the error function keeps getting bigger, and the gradient direction does not change, so the first theta, randomly generated, is returned.
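One common cause of this symptom (a guess, since I haven't run the notebook): a step size too large for the data's scale. Even on the simplest objective f(theta) = theta**2, a gradient step with too large a rate overshoots and the value grows every iteration, while a smaller rate converges:

```python
def gradient_step(theta, lr):
    # gradient of theta**2 is 2*theta; step downhill by lr times that
    return theta - lr * 2 * theta

theta = 10.0
for _ in range(20):
    theta = gradient_step(theta, 1.1)   # too large: iterates blow up
big = abs(theta)

theta = 10.0
for _ in range(20):
    theta = gradient_step(theta, 0.1)   # small enough: shrinks toward 0
small = abs(theta)
```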
The code to sort people from the most friends to the least friends is missing in introduction.py.
While running the code below by following the example in Chapter 19 Clustering,
random.seed(0) # so you get the same results as me
clusterer = KMeans(3)
clusterer.train(inputs)
print clusterer.means
I got this error message: "TypeError: unsupported operand type(s) for -: 'dict' and 'dict'".
/repo/data-science-from-scratch/code/linear_algebra.py in squared_distance(v, w)
44
45 def squared_distance(v, w):
---> 46 return sum_of_squares(vector_subtract(v, w))
47
48 def distance(v, w):
/repo/data-science-from-scratch/code/linear_algebra.py in vector_subtract(v, w)
17 def vector_subtract(v, w):
18 """subtracts two vectors componentwise"""
---> 19 return [v_i - w_i for v_i, w_i in zip(v,w)]
20
21 def vector_sum(vectors):
TypeError: unsupported operand type(s) for -: 'dict' and 'dict'
Could you help me with this issue?
Thanks
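A hedged guess at the cause: the book's vector helpers expect each vector to be a list of numbers. If the clustering inputs end up as lists of dicts (for example, raw user records instead of coordinates), the componentwise subtraction hits dict - dict and raises exactly this error:

```python
def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

vector_subtract([1, 2], [4, 6])              # fine: [-3, -4]

try:
    vector_subtract([{"x": 1}], [{"x": 2}])  # a "vector" of dicts
except TypeError as e:
    message = str(e)  # unsupported operand type(s) for -: 'dict' and 'dict'
```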
Hi Joel,
I have created this pull request after including the matplotlib import in introduction.py.
Please let me know if you have any questions.
A number-calculation mistake in introduction.py: the length of the users list is 11, not 10, so the average number of friends per user is 2.182 instead of 2.4.
num_users = len(users) # 11
avg_connections = total_connections / num_users # 2.1818181818181817
I have changed it and added a new comment on the line that calculates the length of users.
Hi Joel,
I've run into a problem with the function that is called after the training process, to evaluate a new message for spamminess. The algorithm as given goes through every word accumulated by the training process, and updates the spam/not-spam probabilities based on that word appearing in the message (log(p)), or based on it not appearing (log(1.0 - p)). All this, as best I can tell, is sound and correct according to the math.
Except, my post-training dictionary contains well over 80,000 words. So if you're accumulating probabilities, even if every one of those probabilities were 99%, by the time you combine 80,000 of them Python calculates the resulting probability as 0.0. Even with a dictionary 1/10 that size, the accumulated 99% probabilities would come out on the order of 1E-35. (Naturally, the overwhelming majority of the word-wise probabilities are far less than 99% -- most in fact are below 1%.)
But anyway, put in plain English, this means that as your training set grows, the chance of a given message being either spam or not spam, approaches zero! This can't possibly be right.
Could you let me know whether something's off with the math & algorithm given in the book? Or is my understanding of the methodology just off in the woods?
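A sketch of the numeric issue with toy numbers (not the book's exact code): the individual class probabilities do underflow to 0.0, but the *ratio* is still meaningful. The usual fix is to stay in log space until the very end and divide both exponentials by exp(max(...)) before forming the ratio:

```python
import math

# Hypothetical accumulated log-probabilities over tens of thousands of words:
log_spam = -40000.0
log_not_spam = -40010.0

underflowed = math.exp(log_spam)   # 0.0 -- the raw probability "vanishes"

# Stable ratio: divide numerator and denominator by exp(max(...)) first.
m = max(log_spam, log_not_spam)
p_spam = (math.exp(log_spam - m)
          / (math.exp(log_spam - m) + math.exp(log_not_spam - m)))
# p_spam is close to 1 even though both raw probabilities underflow
```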
An index giving the page where each custom function is defined would be very useful in navigating the book, particularly the later chapters.
I imagine you have already caught this since you've fixed the code on GitHub, but in the function definition for binomial on p. 79 you define it as binomial(n, p), and the next function (make_hist) won't work unless binomial is defined as binomial(p, n).
Looks like in Python 3 map returns an iterator rather than a list, so the code needs to look like:
plt.plot(x, list(map(derivative_estimate, x)), 'b+')
There is no example for Stochastic Gradient Descent in Chapter 8. I have tried to write one.
print("using minimize_stochastic_batch")
x = list(range(101))
y = [3*x_i + random.randint(-10, 20) for x_i in x]
theta_0 = random.randint(-10,10)
v = minimize_stochastic(sum_of_squares, sum_of_squares_gradient, x, y, theta_0)
print("minimum v", v)
print("minimum value", sum_of_squares(v))
However, I would run into a problem of TypeError: sum_of_squares() takes 1 positional argument but 3 were given
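The error message points at the mismatch: the book's minimize_stochastic expects per-data-point functions of the form target_fn(x_i, y_i, theta), not a whole-dataset sum_of_squares. A minimal stand-in (my own loop, not the book's exact implementation) shows the expected shapes:

```python
import random

def squared_error(x_i, y_i, theta):
    # per-point objective: (actual - predicted)**2 for a no-intercept line
    return (y_i - theta[0] * x_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    # derivative of squared_error with respect to theta[0]
    return [-2 * x_i * (y_i - theta[0] * x_i)]

def minimize_stochastic_sketch(target_fn, gradient_fn, x, y, theta_0,
                               alpha=0.001, epochs=200):
    theta = list(theta_0)
    data = list(zip(x, y))
    for _ in range(epochs):
        random.shuffle(data)
        for x_i, y_i in data:             # one gradient step per data point
            grad = gradient_fn(x_i, y_i, theta)
            theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

random.seed(0)
x = list(range(1, 21))
y = [3 * x_i for x_i in x]                # noiseless line with slope 3
theta = minimize_stochastic_sketch(squared_error, squared_error_gradient,
                                   x, y, [random.random()])
# theta[0] approaches the true slope 3
```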
File ".\Friends.py", line 219
key=lambda (user_id, num_friends): num_friends, # by number of friends
^
SyntaxError: invalid syntax
Sorry I encountered this in the book in Chapter 2:
One consequence of whitespace formatting is that it can be hard to copy and paste code into the Python shell. For example, if you tried to paste the code:
for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print i
into the ordinary Python shell, you would get a:
IndentationError: expected an indented block
because the interpreter thinks the blank line signals the end of the for loop's block.
But it actually works in my Python shell, so I am confused about whether this is right.
Sorry I have a problem concerning the comment in your code:
# and then populate the lists with friendships
for i, j in friendships:
# this works because users[i] is the user whose id is i
users[i]["friends"].append(users[j]) # add i as a friend of j
users[j]["friends"].append(users[i]) # add j as a friend of i
To be specific, in users[i]["friends"].append(users[j]) # add i as a friend of j,
the comment "add i as a friend of j" should be changed to "add j as a friend of i", right?
Because
>>> list1=['a','b']
>>> list1.append('c')
>>> list1
['a', 'b', 'c']
So we know we are adding users[j] into users[i]["friends"], right?
You say:
call make_hist(0.75, 100, 10000)
but you must mean:
call binomial_histogram(0.75, 100, 10000)
All is in the Title :).
https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/natural_language_processing.py
returns a 404 error.
By the way thank you for your awesome job !
Looks like O'Reilly changed the div name from entry-content to article-body at
http://radar.oreilly.com/2010/06/what-is-data-science.html
Running natural_language_processing.py yields a NoneType error.
To fix:
on line 48 of the program, change entry-content to article-body
Hi there,
Not sure if this is an issue, but in chapter 8, Gradient Descent: shouldn't the estimate_gradient function return a Vector type?
def estimate_gradient(f: Callable[[Vector], float],
                      v: Vector,
                      h: float = 0.0001) -> Vector:
    return [partial_difference_quotient(f, v, i, h) for i in range(len(v))]
either way thanks for the great book :)
I've got your book. I'm working/skimming through it. I've hit a brick wall.
in your Readme.md you have the following snippet
from linear_algebra import distance, vector_mean
v = [1, 2, 3]
w = [4, 5, 6]
print distance(v, w)
print vector_mean([v, w])
This throws the following error:
ImportError: No module named linear_algebra
Where is the linear_algebra library from? I've got a feeling it should have something to do with numpy.
For info, I'm running Python 2.7 using the Anaconda stack.
There's no example for the minimize_stochastic function, and it's not clear how to use it. In the batch functions you're able to pass anything as target_fn, but for minimize_stochastic you have to pass a function that accepts 3+ params, one of which is y (which I believe is supposed to be the value of target_fn(x)). Where am I wrong, and how should I use it?
Hi,
I got a problem in Chapter 2 (German edition) with the examples about defaultdict and also Counter.
What seems to be left out here is how the value document has been defined.
Code:
from collections import defaultdict
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
I always get an error message like this:
"name 'document' is not defined"
As a result of this error, the following code examples all fail as well:
- Counter - because word_counts cannot work without the code before
- Sets - same here...
For the subsection Booleans, the code does not work either:
s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""
the following error appears:
NameError: name 'some_function_that_returns_a_string' is not defined
and the same holds for:
first_char = s and s[0]
Any help or hint would be great.
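The book assumes document is some list of words; any stand-in makes the examples runnable (the words below are made up):

```python
# Minimal stand-in for the undefined name from the book's example:
document = ["data", "science", "from", "scratch", "data"]

word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
# word_counts["data"] == 2, and the later Counter/defaultdict examples
# can reuse the same document list
```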
In the code to produce Figure 1-3, the line of code to produce the title of the picture is missing. You should add the line plt.title("Salary by Years Experience") to the code.
In Chapter 15, section "Fitting the Model", the goal is to estimate beta using the following line of code from the book:
beta = least_squares_fit(inputs, daily_minutes_good, learning_rate, 5000, 25)
However, I cannot seem to find the inputs that is used, other than the following:
But that is obviously not going to work, because the above line of code does not generate enough columns (2 vs. the required 4 for this model). Am I missing something? inputs is also used further in the chapter.
I'm not 100% sure I had all of my code in correctly by this point, but I got an overflow error when I tried to fit the logistic model example using gradient descent on p. 194. This seemed to stem from an overflow problem with running the logistic_log_partial_ij
function. I did not have this problem when fitting the model with stochastic gradient descent.
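One common fix for this kind of overflow (my suggestion, not code from the book): the naive logistic 1 / (1 + math.exp(-x)) overflows when -x is very large. Branching so exp only ever sees a non-positive argument avoids the OverflowError:

```python
import math

def stable_logistic(x):
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)          # x < 0, so this underflows toward 0.0 safely
    return z / (1 + z)

try:
    1 / (1 + math.exp(1000))     # what the naive version computes at x = -1000
except OverflowError:
    naive_overflows = True

stable_logistic(-1000)  # 0.0, no exception
```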
The whitespace in git appears different from the whitespace in the book, and so far the book appears to have it right.
Hey man, your book is seriously awesome. I just caught one typo I would bring to your attention. On pg. 83, toward the top you say the mean should be 50 and the sigma should be 15.8, when the mean should be 500. It wasn't that hard to figure out, but I thought I might help the next guy.
Thanks
I'm about to use the book in a course. I will likely copy the book code to a set of jupyter notebooks. Unless someone has already done that.
Hey, working through some of the examples and it seems the below code is not set up properly even though vector_add has been defined:
def vector_sum2(vectors):
return reduce(vector_add, vectors)
When I try and run the below in the ipython console I get an error saying: "zip argument #1 must support iteration"
vector_sum([1,2,3,4])
Please could you verify that the code is correct/incorrect?
Many thanks!
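The error is because vector_sum expects a list of vectors (a list of lists), not one flat vector: with [1, 2, 3, 4], reduce hands the ints 1 and 2 to vector_add, and zip(1, 2) fails since ints aren't iterable. A self-contained sketch:

```python
from functools import reduce  # needed in Python 3; reduce is no longer a builtin

def vector_add(v, w):
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    return reduce(vector_add, vectors)

vector_sum([[1, 2], [3, 4], [5, 6]])   # [9, 12] -- a list of vectors works

try:
    vector_sum([1, 2, 3, 4])           # flat list: reduce calls vector_add(1, 2)
except TypeError as e:
    error_message = str(e)             # "zip argument #1 must support iteration"
```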
Hi, Joel
I started reading the Japanese edition yesterday.
In the code in this repo, there is a user "Jen" in the list named "users", but that user IS NOT in the book.
Is it just a mistake? I guess it may be an intended difference between the original (written in English) and the Japanese edition.
In the file working_with_data.py there is the code
with open("stocks.csv", "r") as f:
but in the project zip file I don't see the file stocks.csv anywhere.
A generalization of the median is the quantile, which represents the value under which a certain percentile of the data lies (the median represents the value under which 50% of the data lies):
def quantile(xs: List[float], p: float) -> float:
"""Returns the pth-percentile value in x"""
p_index = int(p * len(xs))
return sorted(xs)[p_index]
I think this means the median should be equal to quantile(xs, 0.5), which is not the case.
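The discrepancy is real for even-length data: the median averages the two middle values, while this quantile simply indexes one of them (they agree for odd lengths). A sketch, where median is my own helper rather than code quoted from the book:

```python
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

def median(xs: List[float]) -> float:
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    # odd length: middle element; even length: average of the two middles
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

quantile([1, 2, 3, 4], 0.5)   # 3: the element at index int(0.5 * 4) == 2
median([1, 2, 3, 4])          # 2.5: the average of 2 and 3
```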
Hey @joelgrus ,
I would like to thank you for all the work you are doing here. I would still like to recommend one or two improvements for your new book, or even for your code examples, depending on which Python version someone is using: you should show "whitespace" and "indentation".
For instance, in Python 2.7 you could write (2,10), but in Python 3.5 you have to add one whitespace after the comma -> (2, 10). Besides that, it is not always clear from your side whether an indentation is expected or not.
It might depend on the IDE; Spyder, for instance, complains about a lot of indentation errors, while at the same time my console complains about syntax errors.
You can reproduce this issue by copying and pasting your own code, for example into python.org, where you can find a live environment for Python 3.7. OK, this will pretty obviously show you a lot of errors... but then you know what I mean.
In the O'Reilly book it is also not always clear whether the code continues on the next page in the same "column", or how many spaces are expected, because the 4-spaces rule does not always hold.
Thus it might be useful to work, in the book and also here, with some kind of "syntax" markers and visuals, like:
Just my few Cents,
Greetings.
Hi Joel,
I am currently going through your book. I really like it, although some things go a little quickly for me, especially chapter 7 on Bayesian inference. I understand that with Bayesian inference you update your prior beliefs, like the example you give: Beta(1,1) becomes Beta(4,8) after 10 throws with 3 heads, giving a probability of 0.33.
What confuses me are figures 7-1 and 7-2. How did you make these figures? Did you make them with the function beta_pdf from the page before (89)? Then is x the number of throws, and are alpha and beta the respective probabilities of heads and tails?
Best,
Tijl
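If it helps: in those figures the x axis is the coin's unknown heads-probability (between 0 and 1), not the number of throws, and alpha and beta are counts (heads + 1 and tails + 1), not probabilities. A sketch evaluating beta_pdf (reproduced here from its standard definition so the snippet runs standalone) over a grid of x values, as the figures presumably do:

```python
import math

def B(alpha, beta):
    """normalizing constant so the density integrates to 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x <= 0 or x >= 1:          # no weight outside [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)

xs = [i / 100 for i in range(101)]          # grid over possible heads-probabilities
ys = [beta_pdf(x, 4, 8) for x in xs]        # the Beta(4, 8) curve from the example
# the curve peaks at the mode (alpha - 1) / (alpha + beta - 2) = 3/10
```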
Minor notes for author revision - second edition. Thanks for the great book.
Missing code for Chapter 2.
On p. 84, you have a random "=== p-values" at the end of a paragraph. I'm guessing this was meant to be a section heading and there was just a missing carriage return before it or something?
A wee nonfunctional problem: you reuse the same four params for precision(), recall() and accuracy() in machine_learning.py. Should be:
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)
File ".\Friends.py", line 82
print friends_of_friend_ids(users["3"]) # Counter({0: 2, 5: 1})
According to a July update of yours, the second edition of the book is forthcoming.
Is there an estimation date for the book to be completed?
Best regards,
Hi Joel, I figured it out in naive_bayes.py: the data that you refer to is not in this GitHub repo. Can you give me a link to download it?
if __name__ == "__main__":
    # train_and_test_model(r"c:\spam\*\*")
    train_and_test_model(r"/home/joel/src/spam/*/*")
Thanks
I think that:
def most_common_interests_with(user_id):
return Counter(interested_user_id
for interest in interests_by_user_id["user_id"]
for interested_user_id in user_ids_by_interest[interest]
if interested_user_id != user_id)
Should be:
def most_common_interests_with(user_id):
return Counter(interested_user_id
for interest in interests_by_user_id[user_id]
for interested_user_id in user_ids_by_interest[interest]
if interested_user_id != user_id)
Quotes around user_id in third quoted line removed.
Hi,
It looks like in the algorithm for backprop the variable output_layer is used before it's defined in the hidden_deltas list comprehension.
Also, do you have any tips for extending the algorithm for multi-layer networks?
Hi,
Should the code:
for i, j in friendships:
# this works because users[i] is the user whose id is i
users[i]["friends"].append(users[j]) # add i as a friend of j
users[j]["friends"].append(users[i]) # add j as a friend of i
be:
users[i]["friends"].append(users[j]['id'])
users[j]["friends"].append(users[i]['id'])
otherwise the friends list fills with dicts and becomes unwieldy
thanks
andrew
I'm new to neural networks. In your example of backpropagation, you update the weights pointing from the hidden to the output layer immediately after calculating the output deltas. After that, you calculate the hidden_deltas using the updated weights. In other documentation, all the deltas are calculated first, and only then are the weights of the network updated.
So what is correct?
I tried to calculate all the deltas first, but then the result is not as good as in your version.
This file is referenced from the README.md for Chapter 7