
mhoogen / ml4qs


Code belonging to the book Machine Learning for the Quantified Self.

Python 91.62% R 8.26% Batchfile 0.01% Shell 0.03% Dockerfile 0.07%

ml4qs's People

Contributors

annefischer, bobborsboomvu, buelentuendes, florisdenhengst, fohlen, mhoogen, rskeskin, rubenhorn, serafim179, tmaaiveld, xytreyum, yaaani85, yb123tcs


ml4qs's Issues

Method gower_distance does not exist

In the compute_distance_matrix_instances method on line 161 of Chapter5/Clustering.py the gower_distance method is referenced:

distances.iloc[i,j] = self.gower_distance(dataset.iloc[i:i+1,:], dataset.iloc[j:j+1,:])

This method does not exist. A similar method called gowers_similarity does exist and is referenced by k_means_over_instances. Unfortunately, simply substituting it seems to raise further errors: KeyError: 0.

Dividing INF/INF, producing NaN values (unexpected)

lrd_ratios_array[i] = neighbor_lrd / instance_lrd

This line does not handle the possibility that both lrd values are INF. One of the lines above assumes that an lrd can become INF, but what if both of them are? An if statement needs to be added above this line to catch that case and set a finite value (e.g. 1 or 0). Luckily, due to another mistake, this never happens at the moment, but once that mistake is fixed this line produces errors :(.
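A minimal sketch of the guard suggested above. The helper name lrd_ratio and the both_inf_value parameter are hypothetical and do not come from the repository; which finite value is appropriate (1 or 0) is left open, as in the issue.

```python
import math


def lrd_ratio(neighbor_lrd, instance_lrd, both_inf_value=1.0):
    """Ratio of two local reachability densities with an INF/INF guard.

    Hypothetical helper: the name and the default of 1.0 are assumptions.
    """
    # inf / inf evaluates to NaN for Python floats, so catch it explicitly.
    if math.isinf(neighbor_lrd) and math.isinf(instance_lrd):
        return both_inf_value
    return neighbor_lrd / instance_lrd


inf = float("inf")
unguarded = inf / inf          # NaN, the unexpected value from the issue
guarded = lrd_ratio(inf, inf)  # finite fallback instead of NaN
```

With the default fallback, two instances that both have infinite density are treated as equally dense, which keeps NaN out of the later averaging step.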

Wrong computing of LOF

reachability_distances_array[i] = self.reachability_distance(k, i, neighbor)

In this line, the variable i is redefined by the enumerate on the line above, which leads to computing the reachability distance between rows 0, 1, ..., k and neighbor (which is a neighbor of the main i from the function's arguments). It should be computed between i and neighbor. I suggest renaming the function argument from i to root_i.
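The shadowing problem can be reproduced in isolation. The sketch below is a toy, not the repository's code: reachability_distance here is a hypothetical stand-in that takes only two indices and returns their absolute difference, and both wrapper functions are invented for illustration.

```python
def reachability_distance(a, b):
    # Toy stand-in (assumption): absolute difference between two row indices.
    return abs(a - b)


def distances_shadowed(i, neighbors):
    result = [0] * len(neighbors)
    # The loop variable reuses the name i and overwrites the argument.
    for i, neighbor in enumerate(neighbors):
        result[i] = reachability_distance(i, neighbor)
    return result


def distances_fixed(root_i, neighbors):
    result = [0] * len(neighbors)
    # A separate loop variable keeps root_i intact.
    for pos, neighbor in enumerate(neighbors):
        result[pos] = reachability_distance(root_i, neighbor)
    return result


shadowed = distances_shadowed(10, [11, 12, 14])  # measured from rows 0, 1, 2
fixed = distances_fixed(10, [11, 12, 14])        # measured from row 10
```

Renaming the argument, as the issue proposes, makes the mistake impossible to reintroduce silently, because the loop variable can no longer collide with it.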

Can't use distance_metric except 'default' for k_medoids_over_instances

Calling the k_medoids_over_instances function from the NonHierarchicalClustering class will result in an error if the distance_metric parameter is different from 'default'.

Here calling idxmin(axis=1) will raise an error.

points_to_centroid = D[centers].idxmin(axis=1)

This is due to the dataframe containing multidimensional array objects instead of numerical values, which in turn is caused by dist.pairwise(X, Y) in the distance functions returning a multidimensional array instead of the distance value itself.

Changing the distance function to return dist.pairwise(X, Y)[0][0] instead solves the first issue, but the dataframe still considers its elements to be non-numerical.

D[centers] can be converted with D[centers] = D[centers].apply(pd.to_numeric, errors='coerce', axis=0), which at least allows idxmin(axis=1) to execute correctly; however, the examples still run into another error: pandas.core.indexing.IndexingError: Too many indexers.

As a result, practical exercise 5.9.2.4 is complicated, as the code accompanying the book cannot be used in its current form.
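One way to avoid the non-numerical cells is to extract the scalar at the point where each distance is computed and to store it in a float-typed dataframe from the start. The sketch below assumes SciPy's cdist as a stand-in for dist.pairwise (it also returns a 2D array for two single rows); scalar_distance and distance_matrix are hypothetical helpers, not functions from the repository, and this does not address the later IndexingError.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def scalar_distance(row_a, row_b, metric="cityblock"):
    """Distance between two single-row frames as a plain float (hypothetical)."""
    # cdist returns shape (1, 1) for two single rows; take the one element.
    return float(cdist(row_a.to_numpy(), row_b.to_numpy(), metric=metric)[0, 0])


def distance_matrix(dataset, metric="cityblock"):
    """Symmetric distance matrix stored as float64 (hypothetical)."""
    n = len(dataset)
    D = pd.DataFrame(np.zeros((n, n)), index=dataset.index, columns=dataset.index)
    for i in range(n):
        for j in range(i + 1, n):
            d = scalar_distance(dataset.iloc[i:i + 1], dataset.iloc[j:j + 1], metric)
            D.iloc[i, j] = d
            D.iloc[j, i] = d
    return D


data = pd.DataFrame({"x": [0.0, 1.0, 10.0, 11.0], "y": [0.0, 0.0, 0.0, 0.0]})
D = distance_matrix(data)
centers = [0, 3]
# Works because every cell is a float rather than an array object.
points_to_centroid = D[centers].idxmin(axis=1)
```

Because the frame is numeric from the moment it is created, no pd.to_numeric conversion is needed before calling idxmin.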

Distance metrics other than Euclidean fail for hierarchical clustering

Hierarchical clustering fails when using manhattan, or minkowski with a specified value of p, as the distance metric. See the following lines in the code:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/Chapter5/Clustering.py#L307-L310
This is due to:
a) pdist taking the string 'cityblock' for manhattan distance
b) linkage not taking additional arguments, so specifying p is not possible.
A possible workaround for b) would be:
from scipy.spatial.distance import pdist
self.link = linkage(temp_dataset.as_matrix(), method=link_function, metric=lambda x, y: pdist([x, y], 'minkowski', p=p)[0])
which is, however, significantly slower.
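An alternative that may avoid the slow lambda is to compute the condensed distance matrix with pdist first and hand that to linkage, which accepts a precomputed condensed matrix. This is a sketch under assumptions: hierarchical_link and its parameter names are hypothetical, and only the name mapping from point a) is handled. Note that SciPy's 'centroid', 'median' and 'ward' linkage methods are only well defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def hierarchical_link(values, distance_metric="euclidean", p=2, link_function="average"):
    """Linkage matrix for a metric name as used in the issue (hypothetical helper)."""
    # pdist calls the manhattan distance 'cityblock'.
    metric = {"manhattan": "cityblock"}.get(distance_metric, distance_metric)
    if metric == "minkowski":
        # p is a keyword argument of pdist, so it can be set here.
        condensed = pdist(values, metric="minkowski", p=p)
    else:
        condensed = pdist(values, metric=metric)
    # linkage treats a 1-D input as a condensed distance matrix.
    return linkage(condensed, method=link_function)


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
link_manhattan = hierarchical_link(X, "manhattan")
link_minkowski = hierarchical_link(X, "minkowski", p=3)
labels = fcluster(link_manhattan, t=2, criterion="maxclust")
```

Here the distances are computed once in vectorised form instead of through a Python callback per pair, which is where the lambda version loses its speed.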

Frequency Domains Incorrectly Added based on collist ordering

https://github.com/mhoogen/ML4QS/blob/master/Python3Code/Chapter4/FrequencyAbstraction.py#L48

Sorry if this is incorrect, but it seems that the calculated frequency features are always inserted at the beginning of the returned (or rather stored) object, while the real amplitudes of the FFT are added to the end of the dataset. The referenced calculated activities are added in the order in which they appear in the code, but always at the beginning, so they end up reversed. This means that the data being displayed would be incorrectly labeled, if I'm not wrong. I can open a PR with the fix we implemented if people agree that this is wrong. If I'm wrong, I would love to know that as well.
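The ordering hazard described above can be shown with a toy example. This is not the repository's code: the column name pattern acc_x_freq_... and all values are made up, and the sketch only illustrates how prepending names while appending values pairs each name with the wrong value.

```python
freqs = [0.0, 0.5, 1.0]
amplitudes = [3.0, 2.0, 1.0]  # made-up amplitude per frequency, same order as freqs

# Inserting each new name at position 0 reverses the name order.
prepended = []
for f in freqs:
    prepended.insert(0, f"acc_x_freq_{f}")

# Appending keeps the names aligned with the values.
appended = [f"acc_x_freq_{f}" for f in freqs]

mislabelled = dict(zip(prepended, amplitudes))
labelled = dict(zip(appended, amplitudes))
```

In the mislabelled mapping, the name for frequency 0.0 carries the amplitude that belongs to frequency 1.0, which is the kind of silent mismatch the issue suspects.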

Wrong path to Python requirements file in Dockerfile

Unable to build docker image due to wrong path to python requirements file. When running Python3Code/start_docker.sh there is an error:

$ sh start_docker.sh
[+] Building 0.3s (9/14)
 => [internal] load build definition from Dockerfile                                                                                                         0.1s
 => => transferring dockerfile: 356B                                                                                                                         0.0s
 => [internal] load .dockerignore                                                                                                                            0.0s
 => => transferring context: 2B                                                                                                                              0.0s
 => [internal] load metadata for docker.io/library/ubuntu:latest                                                                                             0.0s
 => [ 1/10] FROM docker.io/library/ubuntu:latest                                                                                                             0.1s
 => [internal] load build context                                                                                                                            0.1s
 => => transferring context: 2B                                                                                                                              0.0s
 => CACHED [ 2/10] RUN apt-get update                                                                                                                        0.0s
 => CACHED [ 3/10] RUN apt-get install sudo                                                                                                                  0.0s
 => CACHED [ 4/10] RUN apt-get install git -y                                                                                                                0.0s
 => ERROR [ 5/10] ADD Python3_requirements.txt /src/requirements.txt                                                                                         0.0s
------
 > [ 5/10] ADD Python3_requirements.txt /src/requirements.txt:
------
failed to compute cache key: "/Python3_requirements.txt" not found: not found
