
approachingalmost's Introduction

Hi there 👋

I'm a data scientist / machine learning engineer.

approachingalmost's People

Contributors

abhishekkrthakur, dependabot[bot]


approachingalmost's Issues

New column assignment in pandas

There must be a reason why you prefer df.loc[:, 'weekofyear'] = ... to the shorter df['weekofyear'] = .... Can you please give some hints?
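For context, a minimal sketch of the two assignment forms being compared (the toy DataFrame and the week-of-year computation are just for illustration):

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-01-06", "2020-07-01"])})

# form used in the book: label-based assignment via .loc
df.loc[:, "weekofyear"] = df["date"].dt.isocalendar().week

# shorter form the question asks about; for a fresh column both create "weekofyear"
df["weekofyear"] = df["date"].dt.isocalendar().week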

Page 19 Error

Hi Abhishek,
I recently bought your book.
I get the following error when I run the code that is written on page 19.

clf.fit(df_train[cols], df_train.quality)
AttributeError: 'list' object has no attribute 'fit'

Can you help with the correct code?
Thanks
Nakul

Grammatical error

@abhishekkrthakur
Page 102. Second para. Fourth sentence. Change "you" to "your".

Current: (Incorrect)
"You model pipeline in this case....
To be corrected to:
"Your model pipeline in this case....

Uneven results from Label Encoder problem

Hi,
I'm trying to run the code from the chapter Approaching Categorical Variables. The code uses XGBoost. I'm getting uneven results when I run it, and I'm not sure if I'm missing something. Can you help me look into this issue?
This is the code that I ran:

import pandas as pd 
from sklearn import preprocessing
from sklearn import metrics

import xgboost as xgb

def run(fold):

    df = pd.read_csv('../input/train_folds.csv')

    num_cols = ['age', 'fnlwgt', 'education.num' , 'capital.gain', 'capital.loss' , 'hours.per.week']

    mapping = {
        '<=50K' : 0,
        '>50K' : 1
    }

    df.loc[:,'income'] = df['income'].map(mapping)
    features = [x for x in df.columns if x not in ('kfold','income') ]

    for col in features:
        if col not in num_cols:
            df.loc[:, col] = df[col].astype(str).fillna("NONE")



    for col in features:
        if col not in num_cols:
            lbl = preprocessing.LabelEncoder()

            lbl.fit(df[col])

            df.loc[:,col] = lbl.transform(df[col])


    df_train = df[df.kfold != fold].reset_index(drop = True)
    df_valid = df[df.kfold == fold].reset_index(drop = True)

    x_train = df_train[features].values
    x_valid = df_valid[features].values

    model = xgb.XGBClassifier(n_jobs = -1, max_depth=20, n_estimators= 200)

    model.fit(x_train, df_train.income.values)

    print("model trained")
    y_valid_preds = model.predict_proba(x_valid)[:,1]


    print(metrics.roc_auc_score(df_valid.income.values, y_valid_preds))


if __name__ == "__main__" :
    for f in range(5):
        run(f)

And the output that I'm getting is:

fold 0: 0.7381942252913961
fold 1: 0.031021225996466548
fold 2: 0.1365583437561918
fold 3: 0.47537328796809986
fold 4: 0.8349179819075687

The values are inconsistent and far from the expected results. I do not know what I'm doing wrong here.

Fig 10: Formula for R2

Hi,
Looks like the squaring is missing in the denominator of the formula for R2 (figure 10, Page 69 in Kindle edition). The code for R2 has squaring in the numerator and denominator and is working as expected.
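For reference, a minimal sketch of R-squared with the squaring in both the numerator and the denominator, matching what the working code does (the function and variable names here are mine, not the book's):

import numpy as np

def r2(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # residual sum of squares (numerator) and total sum of squares (denominator)
    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - numerator / denominator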
Thank you,

Fitting decision tree on wrong data frame - cross validation intro section

Hi, I believe there is an error in the code to fit a decision tree on the red wine quality dataset.

# train the model on the provided features
# and mapped quality from before
clf.fit(df_train[cols], df_test.quality)

Shouldn't the quality column used for fitting come from df_train as opposed to df_test? This error was not repeated in the longer code section where you plot accuracies for different values of max_depth.

Cheers

Page 11 MNIST visualization.

I was getting the following error when running the code from your example:

single_image = pixel_values[1, :].reshape(28, 28)
plt.imshow(single_image, cmap='gray')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-a39d645e4ad8> in <module>
----> 1 single_image = pixel_values[1, :].reshape(28, 28)
      2 plt.imshow(single_image, cmap='gray')

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3078             casted_key = self._maybe_cast_indexer(key)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
   3082                 raise KeyError(key) from err

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(1, slice(None, None, None))' is an invalid key

Worked well when I changed it to:

single_image = pixel_values.iloc[1, :].values.reshape(28, 28)
plt.imshow(single_image, cmap='gray');

Typo on page 5

Loving the book so far! I did want to point out a small error on page 5, and I thought this might be the best way to report this to you.

In the last paragraph:

Next, please read and follow the instructions on your screen. If you installed everything correctly, you should be able to start the conda environment by typing conda init the terminal. We will create a conda environment that we will be using throughout this book. To create a conda environment, you can type:

I believe it should read:

you should be able to start the conda environment by typing conda init in the terminal.

Or perhaps:

you should be able to start the conda environment by typing conda init into the terminal.

Rather than:

you should be able to start the conda environment by typing conda init the terminal.

Thanks! Great work on this!

Visualizing samples in MNIST dataset

On page 11:

single_image = pixel_values[1, :].reshape(28, 28)
plt.imshow(single_image, cmap='gray')

but reshape is not a pd.DataFrame attribute, so to see the image I had to write:

single_image = pixel_values.iloc[1, :]
single_image = single_image.values.reshape(28,28)
plt.imshow(single_image, cmap='gray')

Tensorflow embedding function error

Hi,
While running the TensorFlow code provided in the book, I faced this error:

import os
import gc
import joblib
import pandas as pd 
import numpy as np
from sklearn import metrics, preprocessing
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model, load_model
from tensorflow.keras import callbacks
from tensorflow.keras import backend as k
from tensorflow.keras import utils


def create_model(data, catcols):

    '''
    this function returns a compiled tf.keras model for entity embeddings
    :param data: this is a pandas dataframe
    :param catcols: list of categorical column names
    :return: compiled tf.keras model
    '''
    #init the list of inputs for embedding
    inputs =[]
    #init the list of outputs for embedding
    outputs= []
    #loop over all categorical columns
    for c in catcols:

        #find the number of unique values in the column
        num_unique_values= int(data[c].nunique())
        #simple dimension of embedding calculator
        #min size is half the number of unique values
        #max size is 50. max size depends on the number of values
        #categories too. 50 is quite sufficient most of the times
        #but if you have millions of unique values, you might need a larger dimension

        embed_dim = int(min(np.ceil((num_unique_values)/2), 50))
        #simple keras input layer with size 1

        inp = layers.Input(shape = (1,))
        #add embedding layer to raw input
        #embedding size is always 1 more than unique values in input

        out = layers.Embedding(num_unique_values + 1, embed_dim, name = c)(inp)

        #1-d spatial dropout is the standard for embedding layers
        #it can be used in nlp tasks as well

        out = layers.SpatialDropout1D(0.3)(out)

        #reshape the input to the dimensions of embedding
        #this becomes our output layer for current feature

        out = layers.Reshape(target_shape = (embed_dim,))(out)

        #add input to input list
        inputs.append(inp)
        #add output to output list
        outputs.append(out)


    #concatenate all output layers
    X = layers.Concatenate()(outputs)
    # add a batchnorm layer
    # from here, everything is up to you
    # you can try different architectures
    # add numerical features here or in the concatenate layer
    X = layers.BatchNormalization()(X)

    # a bunch of dense layers with dropout
    # start with 1 or two layers only
    X = layers.Dense(300,activation = 'relu')(X)
    X = layers.Dropout(0.3)(X)
    X = layers.BatchNormalization()(X)

    #using softmax and treating it as a two class problem
    # sigmoid can also be used but then we need only 1 output class
    y = layers.Dense(2, activation = 'softmax')(X)

    model = Model(inputs = inputs ,outputs = y)
    #compile the model
    # we use adam and binary cross entropy
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam')

    return model 
def run(fold):
    df = pd.read_csv('../input/cat_train_folds.csv')
    features = [
        f for f in df.columns if f not in ("id","target","kfold")
    ]
    #fill all Na with NONE
    for col in features:
        df.loc[:,col] = df[col].astype(str).fillna("NONE")
    #encode all features with label encoder individually
    #in a live setting all label encoders need to be saved

    for feat in features:
        df.loc[:,feat] = df[feat].astype(str)
        lbl_enc = preprocessing.LabelEncoder()
        lbl_enc = lbl_enc.fit(df[feat].values)
        df.loc[:, feat] = lbl_enc.fit_transform(df[feat].astype(str).values)

    #get training data using folds

    df_train = df[df.kfold != fold].reset_index(drop = True)
    df_valid = df[df.kfold == fold].reset_index(drop = True)

    model = create_model(df, features)
    #our features are a list of lists
    Xtrain = [df_train[features].values[:,k] for k in range(len(features))]
    Xvalid = [df_valid[features].values[:,k] for k in range(len(features))]

    ytrain = df_train.target.values
    yvalid = df_train.target.values

    #convert target columns to categories
    #this is just binarization

    ytrain_cat = utils.to_categorical(ytrain)
    yvalid_cat = utils.to_categorical(yvalid)

    #fit the model

    model.fit(Xtrain, ytrain_cat, validation_data = (Xvalid, yvalid_cat), verbose = 1, batch_size = 1024, epochs = 3)

    valid_preds = model.predict(Xvalid)[:,1]
    print(metrics.roc_auc_score(yvalid, valid_preds))

    #clear session to free gpu memory

    k.clear_session()

if __name__ == "__main__":
    run(0)
    run(1)
    run(2)
    run(3)
    run(4)

The error:
(ml) sahand@sahand-System-Product-Name:~/ApproachingML/cat-in-the-dat/src$ python neural_embedding.py
2020-09-30 00:50:35.370138: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-30 00:50:42.254838: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-30 00:50:42.275733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-30 00:50:42.276275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2020-09-30 00:50:42.276297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-30 00:50:42.277275: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-30 00:50:42.278136: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-30 00:50:42.278320: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-30 00:50:42.279236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-30 00:50:42.279669: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-30 00:50:42.279787: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-09-30 00:50:42.279796: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-09-30 00:50:42.279962: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-30 00:50:42.283430: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3699850000 Hz
2020-09-30 00:50:42.283675: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5629cad878a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-30 00:50:42.283685: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-30 00:50:42.284397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-30 00:50:42.284406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
Epoch 1/3
465/469 [============================>.] - ETA: 0s - loss: 0.4713Traceback (most recent call last):
File "neural_embedding.py", line 135, in
run(0)
File "neural_embedding.py", line 124, in run
model.fit(Xtrain,ytrain_cat, validation_data = (Xvalid, yvalid_cat), verbose = 1, batch_size =1024, epochs = 3)
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1110, in fit
self._eval_data_handler = data_adapter.DataHandler(
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1105, in init
self._adapter = adapter_cls(
File "/home/sahand/anaconda3/envs/ml/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 282, in init
raise ValueError(msg)
ValueError: Data cardinality is ambiguous:
x sizes: 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000, 120000
y sizes: 480000
Please provide data which shares the same first dimension.

Pip error for the environment.yml

blas-1.0             | 1 KB      | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Ran pip subprocess with arguments:
['/home/luttkule/miniconda3/envs/ml/bin/python', '-m', 'pip', 'install', '-U', '-r', '/home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt']
Pip subprocess output:
Collecting absl-py==0.9.0
  Downloading https://files.pythonhosted.org/packages/1a/53/9243c600e047bd4c3df9e69cfabc1e8004a82cac2e0c484580a78a94ba2a/absl-py-0.9.0.tar.gz (104kB)
Collecting alabaster==0.7.12
  Downloading https://files.pythonhosted.org/packages/10/ad/00b090d23a222943eb0eda509720a404f531a439e803f6538f35136cae9e/alabaster-0.7.12-py2.py3-none-any.whl
Collecting albumentations==0.4.3
  Downloading https://files.pythonhosted.org/packages/f6/c4/a1e6ac237b5a27874b01900987d902fe83cc469ebdb09eb72a68c4329e78/albumentations-0.4.3.tar.gz (3.2MB)

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from -r /home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt (line 4)) (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1 (from -r /home/luttkule/Documents/condaenv.zcmb4rq8.requirements.txt (line 4))


CondaEnvException: Pip failed

"counter" variable not defined, variable "files" not used

From page 213:

for imgid in image_ids:
    files = glob.glob(os.path.join(TRAIN_PATH, imgid, "*.png"))
    self.data[counter] = {"img_path": os.path.join(TRAIN_PATH, imgid + ".png" .....

counter is not defined anywhere (neither above nor after), and files is defined but never used.

ResolvePackageNotFound error

Solving environment: failed
ResolvePackageNotFound:

  • python==3.7.6=h0371630_1

Changing it to python==3.7.6 works. Please update the environment file.

Can't get code on page 137 to work

I'm trying to write the code from page 137 and I'm getting an error. I have no idea how to troubleshoot it; perhaps it's just a simple typo. I've gone over the code multiple times and can't figure it out.

https://colab.research.google.com/drive/1CZXpt7xman0PL6lU9-HoL1oIphPHeADy?usp=sharing

Epoch 1/3
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-da7c14b3ed65> in <module>()
      1 if __name__ == "__main__":
----> 2   run(0)
      3   run(1)
      4   run(2)
      5   run(3)

10 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    992           except Exception as e:  # pylint:disable=broad-except
    993             if hasattr(e, "ag_error_metadata"):
--> 994               raise e.ag_error_metadata.to_exception(e)
    995             else:
    996               raise

ValueError: in user code:

    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:842 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:835 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:787 train_step
        y_pred = self(x, training=True)
    /usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py:1020 __call__
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    /usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py:202 assert_input_compatibility
        ' input tensors. Inputs received: ' + str(inputs))

    ValueError: Layer model_4 expects 1 input(s), but it received 23 input tensors. Inputs received: [<tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_9:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_10:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_11:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_12:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_13:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_14:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_15:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_16:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_17:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_18:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_19:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_20:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_21:0' shape=(None, 1) dtype=int64>, <tf.Tensor 'ExpandDims_22:0' shape=(None, 1) dtype=int64>]

MAP@k

Hi,

Would you please clarify whether the code in the book for MAP@k (page 63 in the Kindle version) needs square brackets added around y_true[i] and y_pred[i]? I was getting the error "'int' is not subscriptable" while trying it out.

def mapk(y_true, y_pred, k):
    apk_values = []
    for i in range(len(y_true)):
        apk_values.append(
            apk([y_true[i]], [y_pred[i]], k=k)
        )
    return sum(apk_values) / len(apk_values)

And when I use the same y_true and y_pred examples mentioned on page 64, I am getting the error "unhashable type 'list'". Would you please help me understand why? Thanks in advance.
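For what it's worth, a minimal sketch of the input shape I assumed: y_true and y_pred as lists of lists, one inner list per sample, so that y_true[i] is itself a list (the values below are made up):

# one inner list of relevant / predicted items per sample
y_true = [[1, 2, 3], [0, 2], [1]]
y_pred = [[0, 1, 2], [1], [1, 3]]

for i in range(len(y_true)):
    # with this shape, y_true[i] and y_pred[i] are already lists for each sample
    print(y_true[i], y_pred[i])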

Page number 11 conveys confusing and wrong info

Hello,
Small piece of wrong info in the description of the MNIST dataset.
[screenshot]
The highlighted text is not 784 data points; it is 784 features (columns), and the number of data points is 70000.
The array is 70000x784, so there are 70000 data points in a 784-dimensional space.
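For illustration, a minimal sketch of how the shape works out, assuming the data is fetched with scikit-learn's fetch_openml as in that chapter:

from sklearn import datasets

# MNIST as a flat table: one row per image, one column per pixel
pixel_values, targets = datasets.fetch_openml(
    "mnist_784", version=1, return_X_y=True
)
print(pixel_values.shape)  # (70000, 784): 70000 data points, each with 784 features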

Why don't you use the average AUC (instead of a list of AUCs)?

I would use the average of all AUCs to see exactly which method is better, because a vector of values is more difficult to compare.

For example, on page 136 you use feature combinations to create new features.

And it seems to you that "It seems like we have improved again". But we did not: without the new features the average AUC is 0.927, and with the new features it is 0.925.
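A minimal sketch of the comparison I have in mind (the per-fold numbers below are made up; only the idea matters):

import numpy as np

# per-fold AUCs from two runs of the same 5-fold setup
aucs_without_new_features = [0.926, 0.928, 0.927, 0.926, 0.928]
aucs_with_new_features = [0.924, 0.926, 0.925, 0.924, 0.926]

# one average per run is easier to compare than two lists of five numbers
print(np.mean(aucs_without_new_features))  # ~0.927
print(np.mean(aucs_with_new_features))     # ~0.925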

Unable to read the book on Ubuntu 20.04 LTS

As it turns out, Amazon doesn't have Kindle support for Linux Subsystems. The online kindle reader is almost useless and is unable to open this book.

I would love to know if someone has been successfully able to read this book on Linux.
Here are a few ways I had in mind:

  1. Using Wine to install the Kindle app for Windows, but the latest version of the Kindle app for Windows is not supported by Wine.
  2. Using something like a mobile screen sharer to read the book from mobile to Linux. A last resort, which I really don't want to resort to.

Do comment if you found any way to address the issue.

How to run the code?

Jan 2021
There is an implicit assumption of a Jupyter environment, but this is only mentioned on page 13, after you discuss several examples.

I installed Jupyter using conda install -c anaconda jupyter and then started the notebook using the command jupyter notebook.

That will give you a link to access the Jupyter notebook in your local machine's browser. FYI, I am running Ubuntu on a Windows 10 machine.

Could you please add a section in the introduction describing how to set up Jupyter to run the code examples in the book for novices?

rf_hyperopt.py in Hyperparameter tuning chapter - where does accuracies variable come from?

# rf_hyperopt.py
import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

def optimize(params, x, y):
  """
  The main optimization function.
  This function takes all the arguments from the search space
  and training features and targets. It then initializes
  the models by setting the chosen parameters and runs
  cross-validation and returns a negative accuracy score
  :param params: dict of params from hyperopt
  :param x: training data
  :param y: labels/targets
  :return: negative accuracy after 5 folds
  """
  # initialize model with current parameters
  model = ensemble.RandomForestClassifier(**params)
  # initialize stratified k-fold
  kf = model_selection.StratifiedKFold(n_splits=5)
  .
  .
  .
  # return negative accuracy
  return -1 * np.mean(accuracies)

if __name__ == "__main__":
  # read the training data
  df = pd.read_csv("../input/mobile_train.csv")

  # features are all columns without price_range
  # note that there is no id column in this dataset
  # here we have training features
  X = df.drop("price_range", axis=1).values
  # and the targets
  y = df.price_range.values

  # define a parameter space
  # now we use hyperopt
  param_space = {
    # quniform gives round(uniform(low, high) / q) * q
    # we want int values for depth and estimators
    "max_depth": scope.int(hp.quniform("max_depth", 1, 15, 1)),
    "n_estimators": scope.int(
    hp.quniform("n_estimators", 100, 1500, 1)
    ),
    # choice chooses from a list of values
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    # uniform chooses a value between two values
    "max_features": hp.uniform("max_features", 0, 1)
  }
  # partial function
  optimization_function = partial(
    optimize,
    x=X,
    y=y
  )

  # initialize trials to keep logging information
  trials = Trials()

  # run hyperopt
  hopt = fmin(
    fn=optimization_function,
    space=param_space,
    algo=tpe.suggest,
    max_evals=15,
    trials=trials
  )

  print(hopt)

I might be thick, but I'm trying to understand what the variable accuracies represents in this code, and I'm failing. All the other objective functions I've seen other people write define the score variable and then use it. Has this been defined before and have I missed it?

Hard to differentiate code from comments

This is a fantastic book, and I am learning a lot working through it. One issue I am having is that it is sometimes really hard to visually differentiate the code from the comments. Expressions and comments both seem to use nearly the same color.

For the next edition, would you consider changing this? 🙂

Sample from page 61: [screenshot]

data set repo

Hi,

I do understand that the code used in the book can't be shared, but could you tell us where the datasets used in the book can be found?
It's good to have all the datasets used in the book in one repo; otherwise we have to google each dataset used in the book every time.

Minor issue on doc string

Hi Abhishek,
There is a minor typo in a document string.
In the chapter on ensembling and stacking, the function max_voting has a document string which says that it creates max predictions. It should be corrected to max voted predictions.

Also, please correct the param in the doc string. It says param probas. It should be corrected to preds.

A small bug on page 28.

In the code where you've shown how to apply stratified k-fold cross validation to a regression problem, I noticed a small bug.

# we create a new column called kfold and fill it with -1
data["kfold"] = -1

# the next step is to randomize the rows of the data
data = data.sample(frac=1).reset_index(drop=True)

# calculate the number of bins by Sturge's rule
# I take the floor of the value, you can also
# just round it
num_bins = np.floor(1 + np.log2(len(data)))

# bin targets
data.loc[:, "bins"] = pd.cut(
    data["target"], bins=num_bins, labels=False
)

# initiate the kfold class from model_selection module
kf = model_selection.StratifiedKFold(n_splits=5)

# fill the new kfold column
# note that, instead of targets, we use bins!
for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
    data.loc[v_, 'kfold'] = f

# drop the bins column
data = data.drop("bins", axis=1)
# return dataframe with folds
return data

The bug is in this line:
num_bins = np.floor(1 + np.log2(len(data)))

num_bins is of type numpy.float64

And when this is used to segregate targets into bins (in the next part of the code), it throws an error:
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.

Proposed solution:

   num_bins = num_bins.astype(int)
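For completeness, a minimal self-contained sketch of the fix in context (toy data; only the cast to int matters):

import numpy as np
import pandas as pd

data = pd.DataFrame({"target": np.random.rand(100)})

# Sturge's rule gives a float; pd.cut needs an integer number of bins
num_bins = np.floor(1 + np.log2(len(data))).astype(int)

data.loc[:, "bins"] = pd.cut(
    data["target"], bins=num_bins, labels=False
)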

Is the Keras model on the adult census data really the best?

Page 141: "you will notice that this approach gives the best results".
But this is the only approach for which you did not show the results.

I've got np.average(aucs)=0.893, and this is the worst result, not the best.
And I don't know whether that was my mistake (some error in my code) or something else.

Book PDF is available freely.

Was looking at your github page, and saw that you have a section titled "AAMLP.pdf"

When I followed that link, I found a PDF copy of the entire book. I've already read your book, so I recall that you state at the very beginning that you wish to avoid pirated copies. I just wanted to make sure that you weren't accidentally leaking the book out via GitHub.

requirements.txt error

I get the following error when using pip install after activating my conda environment -

pip install -r requirements.txt

ERROR: Invalid requirement: '' (from line 6 of requirements.txt)

The environment.yml file is not working either, returning the error below:

ruamel_yaml.scanner.ScannerError: mapping values are not allowed here
in "", line 128, column 34:
<span style="background-color: #79b8ff;width: 0%;" class="Pro ...

Datasets missing here

Datasets such as winequality-red.csv, used in the cross-validation section, are not available here. No datasets are available here at all.

I am aware that the datasets could be found elsewhere, but this place should be self-contained.

You have put in references to the data sources. But why not place the datasets here? It ensures consistency, as the references may change. Every other book I have seen has the datasets used in the repo, except when those datasets are available in a package.

graphviz, PyYAML

There are many spelling issues in the environment.yml file that will cause pip to fail.

Error with `gp_minimize` in Hyper-parameter Optimization Chapter

Hi, I am running the code for gp_minimize.py and in the following part of the code I get an error:

result = gp_minimize(
        optimization_function,
        dimensions=param_space,
        n_calls=15,
        n_random_starts=10,
        verbose=10)

The error is the following: Exception has occurred: TypeError '<' not supported between instances of 'Version' and 'tuple'

Going through the gp_minimize info there is an example:

res = gp_minimize(f,                  # the function to minimize
                  [(-2.0, 2.0)],      # the bounds on each dimension of x
                  acq_func="EI",      # the acquisition function
                  n_calls=15,         # the number of evaluations of f
                  n_random_starts=5,  # the number of random initialization points
                  noise=0.1**2,       # the noise level (optional)
                  random_state=1234)   # the random seed

Where [(-2.0, 2.0)] is what we have as param_space in the book but if I print the latter we get:
[Integer(low=3, high=15, prior='uniform', transform='identity'), Integer(low=100, high=1500, prior='uniform', transform='identity'), Categorical(categories=('gini', 'entropy'), prior=None), Real(low=0.01, high=1, prior='uniform', transform='identity')]
Wondering if the error is due to the way we are passing the values, but I didn't find anything.

environment.yml does not work in OSX

Hi Abhishek,
Your environment.yml file is platform-specific (for Linux) and a strict export, which means that I can't use this file to recreate the conda environment in OSX. Is there an easy fix to this? Maybe you should run conda env export --from-history -f environment.yml just to get the package and version information and upload it?

Thanks.

Environment failed lsb - Ubuntu 20.04 LTS -

Using Ubuntu 20.04

Conda environment failed

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from -r /home/servando/approachingalmost/condaenv.3ehyjrxh.requirements.txt (line 4)) (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1 (from -r /home/servando/approachingalmost/condaenv.3ehyjrxh.requirements.txt (line 4))

Wine Quality Dataset

There is an issue with the code when I run the .fit on the first decision tree.

ValueError: Number of labels=599 does not match number of samples=1000

Do you have any idea what is causing this issue?

Page No. 59 - Confusion matrix

For class 2:
There is a 1 in the class-0 row, but there is no instance of class 2 predicted as 0 in the prediction lists, so it should be 0.
So the column for class 2 should be:

  • 0
  • 2
  • 1

Let me know if I am wrong; I am learning.

Sturge's rule number of bins creation type error

Ubuntu: 16.04
I have encountered an error while creating stratified k-folds for a regression problem.

def create_folds(data):
    data["kfold"] = -1
    data = data.sample(frac=1).reset_index(drop=True)
    
    #calculate number of bins by sturge's rule
    num_bins = np.floor(1 + np.log2(len(data)))
    
    #bins targets
    data.loc[:, "bins"] = pd.cut(data["target"],
                                 bins = num_bins,
                                 labels=False
                                ) 

The type of num_bins must be changed from float to int:
num_bins = np.floor(1 + np.log2(len(data))).astype(np.int32)

Hope it helps

Wine Quality Dataset

I am trying to load the Wine Quality dataset from sklearn,

data = datasets.load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
data.feature_names

However, the feature names that I am getting are completely different from the ones mentioned in the book in Chapter 2, Cross-validation.
Wondering where I can get that 'winequality-red.csv' file?

Here are the features that I am getting from the sklearn dataset:

 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']
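In case it helps, a minimal sketch of how I fetched the red wine CSV myself, assuming it is still hosted at this path in the UCI Machine Learning Repository (the file there is semicolon-separated):

import pandas as pd

url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)
df = pd.read_csv(url, sep=";")
print(df.columns.tolist())  # 'fixed acidity', ..., 'quality'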
