
supernova-machine's People

Contributors

michellelochner, mkerrwinter

Forkers

maria-vincenzi

supernova-machine's Issues

Using GridSearch with KNN

Hey, I'm trying to optimise KNN with grid_search, but for some obscure reason grid_search can't handle KNN with the Mahalanobis distance metric. This problem is described here:
scikit-learn/scikit-learn#2609
and they say they've fixed it. So how do I get their new code with the fix in it?
Thanks!
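For what it's worth, one workaround that often sidesteps this class of problem is to construct the Mahalanobis metric explicitly before handing the classifier to GridSearchCV, so the search never has to build the metric itself. A minimal sketch (dataset and parameter grid are made up for illustration, not taken from the project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Mahalanobis needs the (inverse) covariance matrix passed explicitly via
# metric_params; here it is computed once on the full data for simplicity.
# Brute-force search avoids tree-construction issues with custom metrics.
knn = KNeighborsClassifier(
    metric='mahalanobis',
    metric_params={'VI': np.linalg.inv(np.cov(X.T))},
    algorithm='brute',
)
grid = GridSearchCV(knn, {'n_neighbors': [3, 5, 7]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Note that computing the covariance on all of X (rather than per training fold) is a slight leak; it keeps the sketch short.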

features module

tools.py imports modules called features and pywt. I currently don't have these files, could you send them to me? (Though pywt looks like it might be the PyWavelets package rather than a local file.) Thanks!

Something weird with gridsearch

I don't think grid_search (in the forest function in ml_algorithms.py) is working properly. As I understand it, it's meant to select the best parameter values from given ranges. One of these parameters is the number of decision trees in the Random Forest, and it was always choosing 10, giving AUC=0.905. I switched off grid_search, set the number of trees to 100 by hand, and got AUC=0.924. So I don't really know what grid_search is optimising. And it takes ages to run...
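One possibility worth checking: GridSearchCV can only ever return values that appear in the grid it was given, so if the grid in ml_algorithms.py stops at 10 trees, 100 is never even tried. A sketch of what an extended grid would look like (parameter names and values are assumptions, not the project's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# The search exhaustively tries every combination in param_grid and
# nothing else: if n_estimators only goes up to 10, it "chooses" 10
# simply because larger values were never candidates.
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)
print(search.best_params_)
```

The runtime complaint is also explained by this: the cost is (number of grid points) × (number of CV folds) full fits, so grids grow expensive quickly.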

Tcl installation

I ran learn_sncosmo.py and it was going fine up until the graph-plotting bit in run_ml in tools.py. Then it gives me an error saying 'This probably means that Tcl wasn't installed properly.' How should I go about fixing this?

SVM working, Tcl not working

Hey!

  1. I've got the SVM to work! (All it needed was for the data to be scaled: it now finishes in 0.1 s, as opposed to 16 hours and counting for unscaled data...) Its AUC scores are a bit rubbish atm but it's better than nothing.
  2. 'it looks like Tcl was not loaded properly' is back, so I can't plot anything again. I tried those commands John suggested that worked last time but no luck...
  3. I just got invited to interview at UCL!
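On point 1: scaling can be wired into the model as a pipeline step, so it's refit on each training fold automatically rather than applied by hand. A sketch (toy data, not the project's code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Unscaled features give the RBF kernel wildly different length scales
# per feature, which is what makes SVM training crawl; StandardScaler
# inside the pipeline fixes that without leaking test-fold statistics.
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)
print(scores.mean())
```

If the AUC is still poor after scaling, tuning C and gamma (e.g. over a log-spaced grid) is the usual next step.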

More GridSearch problems

I thought I'd figured out why I was getting different AUC values from GridSearch's internal ROC AUC function and our own one: my theory was that GridSearch was using the SVC.decision_function values as "probabilities". I've now realised that this is quite silly, as the decision function for SVC is the distance between a data point and the separating hyperplane, i.e. it is not bounded, so it definitely can't be used like a probability as is needed for a ROC curve.

So I went back to looking through the source code to find the bit where the 'scores' (as they call the probabilities, or probability-esque values, sent to the ROC-AUC-calculating function) are computed, but I've hit a bit of a dead end. GridSearchCV.fit uses a function called sklearn.metrics._score to get the score values. And sklearn.metrics._score (as far as I can see) calls 'scorer', which is a function defined by the user when they originally call GridSearchCV (AKA 'roc_auc' in our case), passing it the variables 'estimator' and 'X_test' (so in our case 'an SVM' and 'some feature data').

According to this website: http://scikit-learn.org/stable/modules/model_evaluation.html the input string 'roc_auc' (AKA the function 'scorer') corresponds to the code sklearn.metrics.roc_auc_score. But roc_auc_score takes as input parameters 'y_true' and 'y_score', not an estimator and some feature data! So I don't understand why this doesn't just give an error, and I also don't understand where in the code the probability scores are being calculated. Do you have any ideas/suggestions? (Sorry for the massive post.)
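For what it's worth, the missing piece is that the string 'roc_auc' doesn't map to roc_auc_score directly: it maps to a scorer object that wraps it. The scorer's signature is (estimator, X, y); internally it calls estimator.decision_function(X) to get y_score, and only then calls roc_auc_score(y_true, y_score). And the first theory wasn't silly at all: ROC AUC only needs a ranking of the points, not calibrated probabilities, so the unbounded hyperplane distances are perfectly valid inputs. A small demonstration on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svm = SVC().fit(X_tr, y_tr)

# The 'roc_auc' scorer takes (estimator, X, y): it fetches
# decision_function(X) itself and feeds that to roc_auc_score as y_score.
scorer = get_scorer('roc_auc')
auc_via_scorer = scorer(svm, X_te, y_te)

# Doing the two steps by hand gives exactly the same number.
auc_by_hand = roc_auc_score(y_te, svm.decision_function(X_te))
print(auc_via_scorer, auc_by_hand)
```

So any remaining discrepancy with the in-house AUC is more likely down to which data split each number is computed on (GridSearch reports cross-validated AUC on internal validation folds) than to the score values themselves.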

Thresholds and multiclass predict_proba

I've found another place where thresholding is important. When applying the AdaBoost predict_proba function (max_ml_algorithms line 197 and onwards) with multiple classes (currently in our case classes 1, 2 and 3), its probabilities are very close together. Typically the probabilities for [class1, class2, class3] are something like [0.35, 0.32, 0.33]. For the Random Forest the class1 probabilities are more in the range 0.7-0.99, so it's less of an issue there. But with 3 classes a P=0.5 threshold seems even more dodgy. I also have no idea how these decision-tree-based algorithms calculate probabilities. I'll look into that.
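With three classes the probabilities sum to 1, so a fixed P=0.5 cutoff can leave every class below threshold; taking the argmax (which is what predict does internally) sidesteps that entirely. On the second question, a Random Forest's predict_proba averages the class fractions in each tree's leaf, which is why its top probabilities can get close to 1. A sketch on toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
clf = AdaBoostClassifier(random_state=0).fit(X, y)

# Three-class probabilities sum to 1, so all can hover near 1/3;
# argmax picks the winner without needing any fixed threshold.
proba = clf.predict_proba(X[:5])
labels = clf.classes_[np.argmax(proba, axis=1)]
print(labels)
```

If a genuine confidence cut is still wanted (e.g. "only classify when fairly sure"), comparing the top probability against a class-count-aware threshold like 1/n_classes plus a margin is more sensible than a flat 0.5.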

Format of the data file

Hey! Three quick questions:

  1. In sncosmo_des_fit.txt, what do the different numbers in the 'type' column mean?
  2. Has this data been PCA-ed? Because it's low-dimensional, but seems to have physically meaningful column labels.
  3. In learn_sncosmo.py on line 7, why do we only use 5:10?
