I'm a data scientist / machine learning engineer.
abhishekkrthakur / approachingalmost
Approaching (Almost) Any Machine Learning Problem
There must be a reason why you prefer df.loc[:, 'weekofyear'] = ... to the shorter df['weekofyear'] = ... Can you please give some hints?
Hi Abhishek,
I recently bought your book.
I get the following error when I run the code that is written on page 19.
clf.fit(df_train[cols], df_train.quality)
AttributeError: 'list' object has no attribute 'fit'
Can you help with the accurate code?
Thanks
Nakul
@abhishekkrthakur
Page 102. Second para. Fourth sentence. Change you to your.
Current: (Incorrect)
"You model pipeline in this case....
To be corrected to:
"Your model pipeline in this case....
Hi,
I'm trying to run the code from the chapter Approaching Categorical Variables. The code is run using XGBoost. I'm getting uneven results when I run the code, and I'm not sure if I'm missing something. Can you help me look into this issue?
This is the code that I used to run
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
import xgboost as xgb
def run(fold):
    df = pd.read_csv('../input/train_folds.csv')
    num_cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
    mapping = {
        '<=50K': 0,
        '>50K': 1
    }
    df.loc[:, 'income'] = df['income'].map(mapping)
    features = [x for x in df.columns if x not in ('kfold', 'income')]
    for col in features:
        if col not in num_cols:
            df.loc[:, col] = df[col].astype(str).fillna("NONE")
    for col in features:
        if col not in num_cols:
            lbl = preprocessing.LabelEncoder()
            lbl.fit(df[col])
            df.loc[:, col] = lbl.transform(df[col])
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    x_train = df_train[features].values
    x_valid = df_valid[features].values
    model = xgb.XGBClassifier(n_jobs=-1, max_depth=20, n_estimators=200)
    model.fit(x_train, df_train.income.values)
    print("model trained")
    y_valid_preds = model.predict_proba(x_valid)[:, 1]
    print(metrics.roc_auc_score(df_valid.income.values, y_valid_preds))
if __name__ == "__main__":
    for f in range(5):
        run(f)
And the output that I'm getting is:
fold 0: 0.7381942252913961
fold 1: 0.031021225996466548
fold 2: 0.1365583437561918
fold 3: 0.47537328796809986
fold 4: 0.8349179819075687
These values are inconsistent and far from the expected results. I don't know what I'm doing wrong here.
Hi,
Looks like the squaring is missing in the denominator of the formula for R2 (figure 10, Page 69 in Kindle edition). The code for R2 has squaring in the numerator and denominator and is working as expected.
Thank you,
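For reference, the usual definition has a squared term in both sums: R2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ymean)^2). A minimal plain-Python sketch of that definition follows. It is not the book's code, and the function name r2_sketch is made up for this illustration.

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination, R2 = 1 - SS_res / SS_tot.

    Both sums use squared differences. This is an illustrative helper,
    not the book's implementation. It divides by zero when every value
    in y_true is identical.
    """
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: squared error of the predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # total sum of squares: squared deviation from the mean target
    ss_tot = sum((yt - mean_true) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot
```

Predicting the mean of the targets for every sample gives a score of 0 under this definition, and perfect predictions give 1.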
Hi, I believe there is an error in the code to fit a decision tree on the red wine quality dataset.
# train the model on the provided features
# and mapped quality from before
clf.fit(df_train[cols], df_test.quality)
Shouldn't the target column come from df_train rather than df_test? This error is not repeated in the longer code section where you plot accuracies for different values of max_depth.
Cheers
I was getting the following error when running the code from your example:
single_image = pixel_values[1, :].reshape(28, 28)
plt.imshow(single_image, cmap='gray')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-45-a39d645e4ad8> in <module>
----> 1 single_image = pixel_values[1, :].reshape(28, 28)
2 plt.imshow(single_image, cmap='gray')
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3078 casted_key = self._maybe_cast_indexer(key)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
3082 raise KeyError(key) from err
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
TypeError: '(1, slice(None, None, None))' is an invalid key
Worked well when I changed it to:
single_image = pixel_values.iloc[1, :].values.reshape(28, 28)
plt.imshow(single_image, cmap='gray');
Loving the book so far! I did want to point out a small error on page 5, and I thought this might be the best way to report this to you.
In the last paragraph:
Next, please read and follow the instructions on your screen. If you installed everything correctly, you should be able to start the conda environment by typing conda init the terminal. We will create a conda environment that we will be using throughout this book. To create a conda environment, you can type:
I believe it should read:
you should be able to start the conda environment by typing conda init in the terminal.
Or perhaps:
you should be able to start the conda environment by typing conda init into the terminal.
Rather than:
you should be able to start the conda environment by typing conda init the terminal.
Thanks! Great work on this!
On page 11:
single_image = pixel_values[1, :].reshape(28, 28)
plt.imshow(single_image, cmap='gray')
but reshape is not a pd.DataFrame attribute, so to see the image I had to write:
single_image = pixel_values.iloc[1, :]
single_image = single_image.values.reshape(28,28)
plt.imshow(single_image, cmap='gray')
Hi,
environment.yml has package issues: on Ubuntu with Miniconda installed, a few packages cause the pip subprocess to fail.
Hi,
While running the TensorFlow code provided in the book, I faced this error:
import os
import gc
import joblib
import pandas as pd
import numpy as np
from sklearn import metrics, preprocessing
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model, load_model
from tensorflow.keras import callbacks
from tensorflow.keras import backend as k
from tensorflow.keras import utils
def create_model(data, catcols):
    '''
    this function returns a compiled tf.keras model for entity embeddings
    :param data: this is a pandas dataframe
    :param catcols: list of categorical column names
    :return: compiled tf.keras model
    '''
    # init the list of inputs for embedding
    inputs = []
    # init the list of outputs for embedding
    outputs = []
    # loop over all categorical columns
    for c in catcols:
        # find the number of unique values in the column
        num_unique_values = int(data[c].nunique())
        # simple dimension of embedding calculator
        # min size is half the number of unique values
        # max size is 50. max size depends on the number of values
        # categories too. 50 is quite sufficient most of the times
        # but if you have millions of unique values, you might need a larger dimension
        embed_dim = int(min(np.ceil((num_unique_values) / 2), 50))
        # simple keras input layer with size 1
        inp = layers.Input(shape=(1,))
        # add embedding layer to raw input
        # embedding size is always 1 more than unique values in input
        out = layers.Embedding(num_unique_values + 1, embed_dim, name=c)(inp)
        # 1-d spatial dropout is the standard for embedding layers
        # it can be used in nlp tasks as well
        out = layers.SpatialDropout1D(0.3)(out)
        # reshape the input to the dimensions of embedding
        # this becomes our output layer for current feature
        out = layers.Reshape(target_shape=(embed_dim,))(out)
        # add input to input list
        inputs.append(inp)
        # add output to output list
        outputs.append(out)
    # concatenate all output layers
    X = layers.Concatenate()(outputs)
    # add a batchnorm layer
    # from here, everything is up to you
    # you can try different architecture
    # add numerical features here or in concatenate layer
    X = layers.BatchNormalization()(X)
    # a bunch of dense layers with dropout
    # start with 1 or two layers only
    X = layers.Dense(300, activation='relu')(X)
    X = layers.Dropout(0.3)(X)
    X = layers.BatchNormalization()(X)
    # using softmax and treating it as a two class problem
    # sigmoid can also be used but then we need only 1 output class
    y = layers.Dense(2, activation='softmax')(X)
    model = Model(inputs=inputs, outputs=y)
    # compile the model
    # we use adam and binary cross entropy
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model
def run(fold):
    df = pd.read_csv('../input/cat_train_folds.csv')
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]
    # fill all NaN with NONE
    for col in features:
        df.loc[:, col] = df[col].astype(str).fillna("NONE")
    # encode all features with label encoder individually
    # in a live setting all label encoders need to be saved
    for feat in features:
        df.loc[:, feat] = df[feat].astype(str)
        lbl_enc = preprocessing.LabelEncoder()
        lbl_enc = lbl_enc.fit(df[feat].values)
        df.loc[:, feat] = lbl_enc.fit_transform(df[feat].astype(str).values)
    # get training data using folds
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    model = create_model(df, features)
    # our features are a list of lists
    Xtrain = [df_train[features].values[:, k] for k in range(len(features))]
    Xvalid = [df_valid[features].values[:, k] for k in range(len(features))]
    ytrain = df_train.target.values
    yvalid = df_train.target.values
    # convert target columns to categories
    # this is just binarization
    ytrain_cat = utils.to_categorical(ytrain)
    yvalid_cat = utils.to_categorical(yvalid)
    # fit the model
    model.fit(Xtrain, ytrain_cat, validation_data=(Xvalid, yvalid_cat), verbose=1, batch_size=1024, epochs=3)
    valid_preds = model.predict(Xvalid)[:, 1]
    print(metrics.roc_auc_score(yvalid, valid_preds))
    # clear session to free gpu memory
    k.clear_session()
if __name__ == "__main__":
    run(0)
    run(1)
    run(2)
    run(3)
    run(4)
The error:
(ml) sahand@sahand-System-Product-Name:~/ApproachingML/cat-in-the-dat/src$ python neural_embedding.py
2020-09-30 00:50:35.370138: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-30 00:50:42.254838: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-30 00:50:42.275733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-30 00:50:42.276275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2020-09-30 00:50:42.276297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-30 00:50:42.277275: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-30 00:50:42.278136: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-30 00:50:42.278320: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-30 00:50:42.279236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-30 00:50:42.279669: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-30 00:50:42.279787: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-09-30 00:50:42.279796: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-09-30 00:50:42.279962: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-30 00:50:42.283430: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3699850000 Hz
2020-09-30 00:50:42.283675: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5629cad878a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-30 00:50:42.283685: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-30 00:50:42.284397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-30 00:50:42.284406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
Epoch 1/3
465/469 [============================>.] - ETA: 0s - loss: 0.4713Traceback (most recent call last):
File "neural_embedding.py", line 135, in <module>
run(0)
File "neural_embedding.py", line 124, in run
model.fit(Xtrain,ytrain_cat, validation_data = (Xvalid, yvalid_cat), verbose = 1, batch_size =1024, epochs = 3)
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1110, in fit
self._eval_data_handler = data_adapter.DataHandler(
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1105, in __init__
self._adapter = adapter_cls(
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 282, in __init__
raise ValueError(msg)
ValueError: Data cardinality is ambiguous:
x sizes: 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000
y sizes: 480000
Please provide data which shares the same first dimension.
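The sizes in this message are consistent with the script above: yvalid is assigned from df_train.target.values, so the validation targets have the training length (480000) while every validation input has 120000 rows. Taking yvalid from df_valid instead would presumably resolve it. The same first-dimension check can be made before calling fit, as in the sketch below. The helper check_same_length and the toy data are hypothetical and are not part of Keras or the book.

```python
def check_same_length(inputs, targets):
    """Raise ValueError unless every input has as many rows as targets.

    `inputs` is a list of per-feature sequences and `targets` a sequence.
    Hypothetical helper for illustration only.
    """
    sizes = [len(x) for x in inputs]
    if any(size != len(targets) for size in sizes):
        raise ValueError(f"x sizes: {sizes}, y size: {len(targets)}")
    return True


# toy stand-ins for one fold: 8 training rows, 2 validation rows
train_targets = [0, 1] * 4
valid_targets = [1, 0]
valid_inputs = [[3, 4], [5, 6]]  # two features, two rows each

check_same_length(valid_inputs, valid_targets)  # passes

try:
    # mirrors the reported bug: validation inputs with training targets
    check_same_length(valid_inputs, train_targets)
except ValueError as err:
    print(err)
```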
blas-1.0 | 1 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Ran pip subprocess with arguments:
['/home/luttkule/miniconda3/envs/ml/bin/python', '-m', 'pip', 'install', '-U', '-r', '/home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt']
Pip subprocess output:
Collecting absl-py==0.9.0
Downloading https://files.pythonhosted.org/packages/1a/53/9243c600e047bd4c3df9e69cfabc1e8004a82cac2e0c484580a78a94ba2a/absl-py-0.9.0.tar.gz (104kB)
Collecting alabaster==0.7.12
Downloading https://files.pythonhosted.org/packages/10/ad/00b090d23a222943eb0eda509720a404f531a439e803f6538f35136cae9e/alabaster-0.7.12-py2.py3-none-any.whl
Collecting albumentations==0.4.3
Downloading https://files.pythonhosted.org/packages/f6/c4/a1e6ac237b5a27874b01900987d902fe83cc469ebdb09eb72a68c4329e78/albumentations-0.4.3.tar.gz (3.2MB)
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from -r /home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt (line 4)) (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1 (from -r /home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt (line 4))
CondaEnvException: Pip failed
From page 213:
for imgid in image_ids:
    files = glob.glob(os.path.join(TRAIN_PATH, imgid, "*.png"))
    self.data[counter] = {"img_path": os.path.join(TRAIN_PATH, imgid + ".png" .....
Neither counter nor files is defined above or below this snippet.
Solving environment: failed
ResolvePackageNotFound:
Changing it to python==3.7.6 works. Please update the environment file.
I'm trying to write the code from page 137 and I'm getting an error. I have no idea how to troubleshoot or perhaps it's just a simple typo. I've gone over the code multiple times and can't figure it out.
https://colab.research.google.com/drive/1CZXpt7xman0PL6lU9-HoL1oIphPHeADy?usp=sharing
Epoch 1/3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-da7c14b3ed65> in <module>()
1 if __name__ == "__main__":
----> 2 run(0)
3 run(1)
4 run(2)
5 run(3)
10 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
992 except Exception as e: # pylint:disable=broad-except
993 if hasattr(e, "ag_error_metadata"):
--> 994 raise e.ag_error_metadata.to_exception(e)
995 else:
996 raise
ValueError: in user code:
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:853 train_function *
return step_function(self, iterator)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:842 step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
return fn(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:835 run_step **
outputs = model.train_step(data)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:787 train_step
y_pred = self(x, training=True)
/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py:1020 __call__
input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py:202 assert_input_compatibility
' input tensors. Inputs received: ' + str(inputs))
ValueError: Layer model_4 expects 1 input(s), but it received 23 input tensors. Inputs received: [<tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_9:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_10:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_11:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_12:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_13:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_14:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_15:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_16:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_17:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_18:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_19:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_20:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_21:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_22:0' shape=(None, 1) dtype=int64>]
Hi,
Would you please clarify whether the code in the book for MAP@k (page 63 in the Kindle version) needs square brackets around y_true[i] and y_pred[i]? I was getting the error "'int' object is not subscriptable" while trying it out.
def mapk(y_true, y_pred, k):
    apk_values = []
    for i in range(len(y_true)):
        apk_values.append(
            apk([y_true[i]], [y_pred[i]], k=k)
        )
    return sum(apk_values) / len(apk_values)
And when I use the same y_true and y_pred examples mentioned on page 64, I get the error "unhashable type: 'list'". Would you please help me understand why? Thanks in advance.
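Both messages can be reproduced with a version of the metric in which every element of y_true and y_pred is itself a list of labels. The sketch below is an assumption about how pk and apk are defined, not a copy of the book's code. With flat lists of integers, y_pred[i] is an int and slicing it fails. With the extra brackets, [y_pred[i]] becomes a list containing a list, and building a set from it fails because lists are unhashable.

```python
def pk(y_true, y_pred, k):
    """Precision at k for one sample; y_true and y_pred are lists of labels."""
    if k == 0:
        return 0.0
    top_k = y_pred[:k]  # fails with TypeError if y_pred is an int
    if not top_k:
        return 0.0
    # set() fails with "unhashable type: 'list'" if the elements are lists
    common = set(top_k).intersection(set(y_true))
    return len(common) / len(top_k)


def apk(y_true, y_pred, k):
    """Average of precision at 1..k for one sample."""
    pk_values = [pk(y_true, y_pred, i) for i in range(1, k + 1)]
    if not pk_values:
        return 0.0
    return sum(pk_values) / len(pk_values)


def mapk(y_true, y_pred, k):
    """Mean of apk over all samples; no extra brackets around the elements."""
    apk_values = [apk(y_true[i], y_pred[i], k=k) for i in range(len(y_true))]
    return sum(apk_values) / len(apk_values)


# each sample is a list of labels
print(mapk([[1, 2], [3]], [[1, 5], [3, 4]], k=2))  # 0.75
```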
Is the calculation of weighted precision correct?
Please check.
I would use the average of all AUCs to see exactly which method is better, because a vector of per-fold values is more difficult to compare.
For example, on page 136 you use feature combinations to create new features.
You write "It seems like we have improved again". But we did not: without the new features the average AUC is 0.927, and with the new features it is 0.925.
As it turns out, Amazon doesn't have Kindle support for Linux Subsystems. The online kindle reader is almost useless and is unable to open this book.
I would love to know if someone has been successfully able to read this book on Linux.
Here are a few ways I had in mind:
Do comment if you found any way to address the issue.
Jan 2021
There is an implicit assumption of a Jupyter environment, but this is only mentioned on page 13, after you discuss several examples.
I installed Jupyter using conda install -c anaconda jupyter and then started the notebook using the command jupyter notebook.
That will give you a link to access Jupyter Notebook in your local machine's browser. FYI, I am running Ubuntu on a Windows 10 machine.
Could you please add a section in the introduction describing how to set up Jupyter to run the code examples in the book, for novices?
In the adult citizens problem you don't include education.num in the numerical columns. Is there some reason for this?
I get this error after running train.py from the command line
python train.py --fold 1 --model dt_gini
# rf_hyperopt.py
import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope
def optimize(params, x, y):
    """
    The main optimization function.
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs
    cross-validation and returns a negative accuracy score
    :param params: dict of params from hyperopt
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """
    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)
    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)
    .
    .
    .
    # return negative accuracy
    return -1 * np.mean(accuracies)
if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("../input/mobile_train.csv")
    # features are all columns without price_range
    # note that there is no id column in this dataset
    # here we have training features
    X = df.drop("price_range", axis=1).values
    # and the targets
    y = df.price_range.values
    # define a parameter space
    # now we use hyperopt
    param_space = {
        # quniform gives round(uniform(low, high) / q) * q
        # we want int values for depth and estimators
        "max_depth": scope.int(hp.quniform("max_depth", 1, 15, 1)),
        "n_estimators": scope.int(
            hp.quniform("n_estimators", 100, 1500, 1)
        ),
        # choice chooses from a list of values
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
        # uniform chooses a value between two values
        "max_features": hp.uniform("max_features", 0, 1)
    }
    # partial function
    optimization_function = partial(
        optimize,
        x=X,
        y=y
    )
    # initialize trials to keep logging information
    trials = Trials()
    # run hyperopt
    hopt = fmin(
        fn=optimization_function,
        space=param_space,
        algo=tpe.suggest,
        max_evals=15,
        trials=trials
    )
    print(hopt)
I might be thick, but I'm trying to understand what the variable accuracies represents in this code, and I'm failing. All the other objective functions I've seen other people write define the score variable and then use it. Or has this been defined before and I've missed it?
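Judging only from the docstring ("runs cross-validation and returns a negative accuracy score"), accuracies is presumably a list holding one accuracy value per fold, filled in the elided part of optimize. The sketch below shows that pattern without any dependencies. It is not the book's code: the folds are built by hand, a majority-class predictor stands in for the random forest, and the function name is made up.

```python
from collections import Counter


def negative_mean_accuracy(x, y, n_splits=5):
    """Collect one accuracy per fold and return the negated mean.

    `x` is accepted only to mirror the optimize(params, x, y) signature;
    the stand-in majority-class predictor does not use it.
    """
    accuracies = []
    indices = list(range(len(y)))
    for fold in range(n_splits):
        # every n_splits-th sample, starting at `fold`, is held out
        valid_idx = indices[fold::n_splits]
        train_idx = [i for i in indices if i % n_splits != fold]
        # "fit": remember the most common training label
        majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
        # "predict" and score the held-out fold
        correct = sum(1 for i in valid_idx if y[i] == majority)
        accuracies.append(correct / len(valid_idx))
    # negated because the optimizer minimizes its objective
    return -1 * sum(accuracies) / len(accuracies)


labels = [1] * 8 + [0] * 2
features = [[0]] * len(labels)
print(negative_mean_accuracy(features, labels))
```

The sign is flipped because hyperopt's fmin minimizes the value returned by the objective, so a higher mean accuracy has to map to a lower number.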
This is a fantastic book, and I am learning a lot working through it. One issue I am having is that it is really hard to differentiate the code from the comments sometimes visually. Expressions and comments both seem to be using nearly the same color.
For the next edition, would you consider changing this?
Hi,
I do understand that the code used in the book can't be shared, but could you tell us where the datasets used in the book can be found?
It would be good to have all the datasets used in the book in one repo; otherwise we have to google each dataset every time one is used in the book.
on page 75-76
Hi Abhishek ,
There is a minor typo in a docstring.
In the chapter on ensembling and stacking, the function max_voting has a docstring which says that it creates max predictions. It should be corrected to max voted predictions.
Also, please correct the param in the docstring. It says param probas. It should be corrected to preds.
Is an EPUB version of the book planned?
Should be: :param y_pred: list of predicted values
Instead of :param y_proba: list of predicted values
@abhishekkrthakur
Square term is missing in the denominator of the formula. The python implementation is accurate and correctly adds the square term.
In the code where you've shown how to apply stratified k-fold cross validation to a regression problem, I noticed a small bug.
# we create a new column called kfold and fill it with -1
data["kfold"] = -1
# the next step is to randomize the rows of the data
data = data.sample(frac=1).reset_index(drop=True)
# calculate the number of bins by Sturge's rule
# I take the floor of the value, you can also
# just round it
num_bins = np.floor(1 + np.log2(len(data)))
# bin targets
data.loc[:, "bins"] = pd.cut(
    data["target"], bins=num_bins, labels=False
)
# initiate the kfold class from model_selection module
kf = model_selection.StratifiedKFold(n_splits=5)
# fill the new kfold column
# note that, instead of targets, we use bins!
for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
    data.loc[v_, 'kfold'] = f
# drop the bins column
data = data.drop("bins", axis=1)
# return dataframe with folds
return data
The bug is in this line:
num_bins = np.floor(1 + np.log2(len(data)))
num_bins is of type numpy.float64.
When it is used to segregate targets into bins (in the next part of the code), it throws an error:
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Proposed solution:
num_bins = num_bins.astype(int)
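When bins is given to pd.cut as a count, it has to be an integer, so the result of Sturges' rule needs a conversion. A minimal standard-library sketch of the calculation follows; the function name sturges_bins is illustrative and does not come from the book.

```python
import math


def sturges_bins(n_samples):
    """Number of bins by Sturges' rule, floor(1 + log2(n)), as a plain int."""
    return int(math.floor(1 + math.log2(n_samples)))


print(sturges_bins(1024))  # 11
```

The returned value is a built-in int, so it can be used directly wherever a bin count is expected.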
Page 141: "you will notice that this approach gives the best results".
But this is the only approach for which you did not show the results.
I got np.average(aucs)=0.893, and this is the worst result, not the best.
I don't know whether that is my mistake (some error in my code) or something else.
I was looking at your GitHub page and saw that you have a section titled "AAMLP.pdf".
When I followed that link, I found a PDF copy of the entire book. I've already read your book, and I recall that you state at the very beginning that you wish to avoid pirated copies, so I just wanted to make sure that you weren't accidentally leaking the book via GitHub.
I get the following error when using pip install after activating my conda environment -
pip install -r requirements.txt
ERROR: Invalid requirement: '' (from line 6 of requirements.txt)
The environment.yml file is not working either; it returns the error below -
ruamel_yaml.scanner.ScannerError: mapping values are not allowed here
in "", line 128, column 34:
<span style="background-color: #79b8ff;width: 0%;" class="Pro ...
Datasets such as winequality-red.csv, used in the cross-validation section, are not available here. No datasets are available here at all.
I am aware that the datasets could be found elsewhere, but this place should be self-contained.
You have put references to the references. But why not place the datasets here? It ensures consistency, as the references may change. Every other book I have seen has the datasets it uses in the repo, except when those datasets are available in a package.
There are many spelling issues in environment.yml file that will cause pip to fail.
Hi, I am running the code for gp_minimize.py and in the following part of the code I get an error:
result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    n_calls=15,
    n_random_starts=10,
    verbose=10)
The error is the following: Exception has occurred: TypeError '<' not supported between instances of 'Version' and 'tuple'
Going through the gp_minimize info there is an example:
res = gp_minimize(f,                  # the function to minimize
                  [(-2.0, 2.0)],      # the bounds on each dimension of x
                  acq_func="EI",      # the acquisition function
                  n_calls=15,         # the number of evaluations of f
                  n_random_starts=5,  # the number of random initialization points
                  noise=0.1**2,       # the noise level (optional)
                  random_state=1234)  # the random seed
Where [(-2.0, 2.0)] is what we have as param_space in the book, but if I print the latter we get:
[Integer(low=3, high=15, prior='uniform', transform='identity'), Integer(low=100, high=1500, prior='uniform', transform='identity'), Categorical(categories=('gini', 'entropy'), prior=None), Real(low=0.01, high=1, prior='uniform', transform='identity')]
I wonder if the error is in the way we are passing the values, but I didn't find anything.
Hi Abhishek,
Your environment.yml file is platform-specific (for Linux) and a strict export, which means that I can't use this file to recreate the conda environment on OSX. Is there an easy fix for this? Maybe you could run conda env export --from-history -f environment.yml just to get the package and version information, and upload that?
Thanks.
Using Ubuntu 20.04
Conda environment failed
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from -r /home/servando/approachingalmost/condaenv.3ehyjrxh.requirements.txt (line 4)) (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1 (from -r /home/servando/approachingalmost/condaenv.3ehyjrxh.requirements.txt (line 4))
There is an issue with the code when I run the .fit on the first decision tree.
ValueError: Number of labels=599 does not match number of samples=1000
Do you have any idea what is causing this issue?
Warning while installing the requirements with pip:
-- ERROR: tensorflow 2.2.1 has requirement numpy<1.19.0,>=1.16.0, but you'll have numpy 1.19.0 which is incompatible.
For class 2,
there is a 1 in class 0, but there is no instance of class 2 which is predicted as 0 in the prediction lists, so there must be a 0.
So the column for class 2 is like
Let me know if I am wrong; I am learning.
Ubuntu : 16.4
I have encountered an error while creating stratified kfolds for regression problem
def create_folds(data):
    data["kfold"] = -1
    data = data.sample(frac=1).reset_index(drop=True)
    # calculate number of bins by Sturge's rule
    num_bins = np.floor(1 + np.log2(len(data)))
    # bin targets
    data.loc[:, "bins"] = pd.cut(data["target"],
                                 bins=num_bins,
                                 labels=False
                                 )
The type of num_bins must be changed from float to int:
num_bins = np.floor(1 + np.log2(len(data))).astype(np.int32)
Hope it helps
I am trying to load the Wine Quality dataset from sklearn,
data = datasets.load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
data.feature_names
However, the feature names that I am getting are completely different from the ones mentioned in the book in Chapter 2 Cross Validation.
Wondering from where can I get that 'winequality-red.csv' file?
Here're the features that I am getting from the sklearn dataset,
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline']
@abhishekkrthakur I couldn't find the train.csv file for the Approaching image classification & segmentation chapter. Even in the PNG dataset of pneumothorax there is no CSV file present.