spark-mooc / mooc-setup

Information for setting up for the BerkeleyX Spark Intro MOOC, and lab assignments for the course

Jupyter Notebook 14.88% Python 74.44% Java 10.67%

mooc-setup's Issues

Incorrect assertion in cs120_lab3_ctr_df.dbc (5f)

Concerns 8252845

Following the discussion at https://piazza.com/class/iqfbu516yuj5t3?cid=653
it seems that "expected_test_baseline = 0.530363901139" used in the assertion comes from hash_test_df instead of hash_train_df

Small issue with cs120_lab2_linear_regression_df.py

I think that in this line parsed_points_df should be parsed_data_df instead, because we should plot the data with shifted labels. With parsed_points_df we can't see the labels on the x-axis because they are out of range.

Thank you very much.

Module 4 lab CTR Data URL No longer exists!

Hi Felix!

I love your pyspark course thus far!

I am going through your "Scalable Machine Learning" course and noticed that the link to the dataset in the Module 4 lab, "Click through Rate Prediction", is not working anymore. Do you have any advice on how to import the dataset for the Module 4 lab so that I can finish the module?

Thank you so much for your help,

Austen

Minor typos/grammar errors in ML_lab3_linear_reg_student.ipynb

Three changes for your consideration, where underscores mark insertions and dashes mark strikethroughs:

line 361: 'task involves split_ting_ it into training, validation and test sets'
line 490: 'Calculates the -the- squared error'
line 815: 'gradient` s_h_ould be a '

Login user name and password

Hi, I just downloaded sparkvm via vagrant but can't log in. What are the user name and password?

cs120_lab2_linear_regression_df.py - changed randomSplit results in Databricks may cause inconsistent test cases

Some change in Databricks has caused randomSplit to return different results since yesterday (02/08/2016).

The same test cases passed yesterday, but when I ran them again today they failed because the result of randomSplit in line 352 had changed.

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L384
should be
Test.assertEquals(round(float(n_train) / float(n_train + n_val + n_test), 1), .8, 'unexpected value for nTrain')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L385
should be
Test.assertEquals(round(float(n_val) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nVal')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L386
should be
Test.assertEquals(round(float(n_test) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nTest')
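The three proposed assertions compare each split's share of the total, rounded to one decimal place, against the requested weight instead of comparing exact row counts. A minimal sketch of that idea in plain Python follows. `split_fractions` is a hypothetical helper, not part of the lab, and the counts in the usage lines are the ones reported in the "Seed problem" issue on this page.

```python
def split_fractions(n_train, n_val, n_test, ndigits=1):
    """Return each split's share of the total, rounded to ndigits places."""
    total = float(n_train + n_val + n_test)
    return tuple(round(n / total, ndigits) for n in (n_train, n_val, n_test))


# Counts reported in the "Seed problem" issue on this page.
fractions = split_fractions(80115, 9955, 9930)
assert fractions == (0.8, 0.1, 0.1), 'unexpected split proportions'
```

A check like this tolerates small changes in the exact counts, which is the point of the proposed fix.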

Need to update to Spark 1.6.0

The Spark version is 1.3.1 in this VM:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/

I need to update to 1.6.0. How is Spark installed inside the VM, and are there instructions for updating it? Or do you plan to push an update soon?

Labs incompatibilities in certain circumstances

I realize that the course VM is a closed environment that is not friendly to change, but a search of Piazza shows that some students hit the same obstacles. If I'm incorrect, please close this issue.

: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/vagrant/Scalable-Machine-Learning/labs-progress/data/cs190/neuro.txt

The relative file import paths in the labs produce this error when the IPython working directory is changed to something other than the user's home directory. For convenience when using the shared folder, I changed the notebook profile to c.NotebookApp.notebook_dir = '/vagrant'. So maybe the notebooks should use an explicit path to the current user's home directory? Something like:

from os.path import expanduser
home = expanduser("~")
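Extending the two lines above, one way to build an absolute input path could look like the sketch below. It assumes the lab data sits under the user's home directory at data/cs190/neuro.txt (the tail of the path in the error message); the variable name `input_file` is illustrative, and the real layout in the VM may differ.

```python
from os.path import expanduser, isabs, join

# Assumption: the lab data lives under the user's home directory at
# data/cs190/neuro.txt (the tail of the path in the error message above).
home = expanduser("~")
input_file = join(home, "data", "cs190", "neuro.txt")

# An absolute path does not depend on the notebook's working directory.
assert isabs(input_file)
print(input_file)
```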

The labs are also incompatible with numpy 1.9.2. Is it worth making them forward compatible?

A small typo issue

"which is recommended when they key doesn't change"

ssh timeout error

I am trying to use your VM, which is similar to the one created here, but I am getting an ssh timeout error. Do you know why this is the case?

Seed problem

Hello

I'm trying to go through the week 3 lab, but there seems to be a problem with the proportions by which the data is partitioned into train, validation and test sets. I'm using the supplied seed along with the defined weights, and I get a different number of examples in each set. As a result, the tests that follow are bound to fail.

snippet:

weights = [.8, .1, .1]
seed = 42
raw_train_df, raw_validation_df, raw_test_df = raw_df.randomSplit(weights, seed)

n_train = raw_train_df.cache().count()
n_val = raw_validation_df.cache().count()
n_test = raw_test_df.cache().count()
print n_train, n_val, n_test, n_train + n_val + n_test
raw_df.show(1)

output:

80115 9955 9930 100000
+--------------------+
|                text|
+--------------------+
|0,1,1,5,0,1382,4,...|
+--------------------+
only showing top 1 row

The same thing happens in lab 2 (linear regression).
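One possible explanation, offered as an assumption and not a confirmed diagnosis: randomSplit treats the weights as sampling probabilities, so the split sizes are only approximately 80/10/10, and they may also depend on how the rows are partitioned and ordered even when the seed is fixed. The plain-Python sketch below is not Spark's implementation. `simulate_random_split` is a hypothetical stand-in that assigns each row to a split by one random draw, to show that a fixed seed gives repeatable but inexact proportions.

```python
import random


def simulate_random_split(n_rows, weights, seed):
    """Assign each of n_rows to a split by one random draw per row."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, roughly [0.8, 0.9, 1.0] for weights [.8, .1, .1].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    counts = [0] * len(weights)
    for _ in range(n_rows):
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any draw left over by float rounding.
            if draw < upper or i == len(bounds) - 1:
                counts[i] += 1
                break
    return counts


counts = simulate_random_split(100000, [.8, .1, .1], seed=42)
print(counts, sum(counts))
```

Running it twice with the same seed returns the same counts, but those counts will generally not be exactly 80000, 10000 and 10000, which is why a test on exact counts is fragile.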
