
supernova-machine's People

Contributors

michellelochner, mkerrwinter

Forkers

maria-vincenzi

supernova-machine's Issues

Using GridSearch with KNN

Hey, I'm trying to optimise KNN with grid_search, but for some obscure reason grid_search can't handle KNN with the Mahalanobis distance metric. This problem is described here:
scikit-learn/scikit-learn#2609
and they say they've fixed it. So how do I get their new code with the fix in it?
Thanks!
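For what it's worth, one workaround that often sidesteps this class of problem is to construct the Mahalanobis metric explicitly before handing the classifier to GridSearchCV, so the search never has to build the metric itself. A minimal sketch (dataset and parameter grid are made up for illustration, not taken from the project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Mahalanobis needs the (inverse) covariance matrix passed explicitly via
# metric_params; here it is computed once on the full data for simplicity.
# Brute-force search avoids tree-construction issues with custom metrics.
knn = KNeighborsClassifier(
    metric='mahalanobis',
    metric_params={'VI': np.linalg.inv(np.cov(X.T))},
    algorithm='brute',
)
grid = GridSearchCV(knn, {'n_neighbors': [3, 5, 7]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Note that computing the covariance on all of X (rather than per training fold) is a slight leak; it keeps the sketch short.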

features module

tools.py imports modules called features and pywt. I currently don't have these files, could you send them to me? (Though pywt looks like it might be the PyWavelets package rather than a local file.) Thanks!

Something weird with gridsearch

I don't think grid_search (in the forest function in ml_algorithms.py) is working properly. As I understand it, it's meant to select the best parameter values from given ranges. One of these parameters is the number of decision trees in the Random Forest, and it was always choosing 10, giving AUC=0.905. I switched off grid_search, set the number of trees to 100 by hand, and got AUC=0.924. So I don't really know what grid_search is optimising. And it takes ages to run...
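One possibility worth checking: GridSearchCV can only ever return values that appear in the grid it was given, so if the grid in ml_algorithms.py stops at 10 trees, 100 is never even tried. A sketch of what an extended grid would look like (parameter names and values are assumptions, not the project's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# The search exhaustively tries every combination in param_grid and
# nothing else: if n_estimators only goes up to 10, it "chooses" 10
# simply because larger values were never candidates.
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)
print(search.best_params_)
```

The runtime complaint is also explained by this: the cost is (number of grid points) × (number of CV folds) full fits, so grids grow expensive quickly.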

Tcl installation

I ran learn_sncosmo.py and it was going fine up until the graph-plotting bit in run_ml in tools.py. Then it gives me an error saying 'This probably means that Tcl wasn't installed properly.' How should I go about fixing this?

SVM working, Tcl not working

Hey!

  1. I've got the SVM to work! (All it needed was for the data to be scaled: it now finishes in 0.1 s, as opposed to 16 hours and counting for unscaled data...) Its AUC scores are a bit rubbish atm but it's better than nothing.
  2. 'it looks like Tcl was not loaded properly' is back, so I can't plot anything again. I tried those commands John suggested that worked last time but no luck...
  3. I just got invited to interview at UCL!
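On point 1: scaling can be wired into the model as a pipeline step, so it's refit on each training fold automatically rather than applied by hand. A sketch (toy data, not the project's code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Unscaled features give the RBF kernel wildly different length scales
# per feature, which is what makes SVM training crawl; StandardScaler
# inside the pipeline fixes that without leaking test-fold statistics.
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)
print(scores.mean())
```

If the AUC is still poor after scaling, tuning C and gamma (e.g. over a log-spaced grid) is the usual next step.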

More GridSearch problems

I thought I'd figured out why I was getting different AUC values from GridSearch's internal ROC AUC function and our own one: my theory was that GridSearch was using the SVC.decision_function values as "probabilities". I've now realised that this is quite silly, as the decision function for SVC is the distance between a data point and the separating hyperplane, i.e. it is not bounded, so it definitely can't be used like a probability as is needed for a ROC curve.

So I went back to looking through the source code to find the bit where the 'scores' (as they call the probabilities, or probability-esque values, sent to the ROC-AUC-calculating function) are computed, but I've hit a bit of a dead end. GridSearchCV.fit uses a function called sklearn.metrics._score to get the score values. And sklearn.metrics._score (as far as I can see) calls 'scorer', which is a function defined by the user when they originally call GridSearchCV (AKA 'roc_auc' in our case), passing it the variables 'estimator' and 'X_test' (so in our case 'an SVM' and 'some feature data').

According to this website: http://scikit-learn.org/stable/modules/model_evaluation.html the input string 'roc_auc' (AKA the function 'scorer') corresponds to the code sklearn.metrics.roc_auc_score. But roc_auc_score takes as input parameters 'y_true' and 'y_score', not an estimator and some feature data! So I don't understand why this doesn't just give an error, and I also don't understand where in the code the probability scores are being calculated. Do you have any ideas/suggestions? (Sorry for the massive post.)
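For what it's worth, the missing piece is that the string 'roc_auc' doesn't map to roc_auc_score directly: it maps to a scorer object that wraps it. The scorer's signature is (estimator, X, y); internally it calls estimator.decision_function(X) to get y_score, and only then calls roc_auc_score(y_true, y_score). And the first theory wasn't silly at all: ROC AUC only needs a ranking of the points, not calibrated probabilities, so the unbounded hyperplane distances are perfectly valid inputs. A small demonstration on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svm = SVC().fit(X_tr, y_tr)

# The 'roc_auc' scorer takes (estimator, X, y): it fetches
# decision_function(X) itself and feeds that to roc_auc_score as y_score.
scorer = get_scorer('roc_auc')
auc_via_scorer = scorer(svm, X_te, y_te)

# Doing the two steps by hand gives exactly the same number.
auc_by_hand = roc_auc_score(y_te, svm.decision_function(X_te))
print(auc_via_scorer, auc_by_hand)
```

So any remaining discrepancy with the in-house AUC is more likely down to which data split each number is computed on (GridSearch reports cross-validated AUC on internal validation folds) than to the score values themselves.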

Thresholds and multiclass predict_proba

I've found another place where thresholding is important. When applying the AdaBoost predict_proba function (max_ml_algorithms line 197 and onwards) with multiple classes (currently in our case classes 1, 2 and 3), its probabilities are very close together. Typically the probabilities for [class1, class2, class3] are something like [0.35, 0.32, 0.33]. For the Random Forest the class1 probabilities are more in the range 0.7-0.99, so it's less of an issue there. But with 3 classes a P=0.5 threshold seems even more dodgy. I also have no idea how these decision-tree-based algorithms calculate probabilities. I'll look into that.
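With three classes the probabilities sum to 1, so a fixed P=0.5 cutoff can leave every class below threshold; taking the argmax (which is what predict does internally) sidesteps that entirely. On the second question, a Random Forest's predict_proba averages the class fractions in each tree's leaf, which is why its top probabilities can get close to 1. A sketch on toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
clf = AdaBoostClassifier(random_state=0).fit(X, y)

# Three-class probabilities sum to 1, so all can hover near 1/3;
# argmax picks the winner without needing any fixed threshold.
proba = clf.predict_proba(X[:5])
labels = clf.classes_[np.argmax(proba, axis=1)]
print(labels)
```

If a genuine confidence cut is still wanted (e.g. "only classify when fairly sure"), comparing the top probability against a class-count-aware threshold like 1/n_classes plus a margin is more sensible than a flat 0.5.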

Format of the data file

Hey! Three quick questions:

  1. In sncosmo_des_fit.txt, what do the different numbers in the 'type' column mean?
  2. Has this data been PCA-ed? Because it's low-dimensional, but seems to have physically meaningful column labels.
  3. In learn_sncosmo.py on line 7, why do we only use 5:10?
