triagemd / keras-eval

An evaluation abstraction for Keras models.

Home Page: https://triagemd.github.io/keras-eval/
Looks like with Keras version 2.1.4 we get an error when using `predict`:
```python
p = evaluator.predict('/data/datasets/psoriasis_dataset_clean/4way/validation/00007_sl/030205115400_693_jpg-395454.jpg')
```

```
   1290     else:
   1291         progbar = Progbar(target=num_samples,
-> 1292                           stateful_metrics=self.stateful_metric_names)
   1293
   1294     indices_for_conversion_to_dense = []

AttributeError: 'Model' object has no attribute 'stateful_metric_names'
```
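A workaround that may unblock `predict` is to set the missing attribute by hand on the underlying Keras model before predicting. `stateful_metric_names` is normally populated during `compile()`, so this is a guess at the cause rather than a confirmed fix, and it assumes no stateful metrics are actually in use (`model` below stands for whichever Keras `Model` the evaluator wraps):

```python
# Hedged workaround: give the loaded model the attribute that Keras'
# progress bar expects; safe only if no stateful metrics are in use.
if not hasattr(model, 'stateful_metric_names'):
    model.stateful_metric_names = []
```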
Compute the top-k sensitivity metric for each class.
This makes sense mathematically, since only positive-class information is involved in sensitivity:
`sensitivity = TP / (TP + FN)`
Note: average top-k sensitivity was removed in #27.
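A minimal sketch of the per-class computation, assuming an `(n_samples, n_classes)` probability array and integer ground-truth labels; the function name and signature are hypothetical, not keras-eval's API:

```python
import numpy as np

def top_k_sensitivity_per_class(probabilities, y_true, k):
    n_classes = probabilities.shape[1]
    # Indices of the k most probable classes for each sample.
    top_k_preds = np.argsort(probabilities, axis=1)[:, -k:]
    # A "hit" means the true label appears among the top-k predictions.
    hits = np.any(top_k_preds == y_true[:, None], axis=1)
    sensitivities = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            # TP / (TP + FN) restricted to class c, at top-k.
            sensitivities[c] = hits[mask].mean()
    return sensitivities
```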
Right now I think the probabilities get rounded off, and the sum of the probabilities ends up slightly more or less than one. We should try to normalize the probabilities to sum to one in `eval.py/_compute_probabilities_generator`.
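A one-line renormalization along the class axis would be enough, assuming `probabilities` is a `(n_samples, n_classes)` NumPy array:

```python
# Rescale each row so the class probabilities sum to exactly 1.0.
probabilities = probabilities / probabilities.sum(axis=1, keepdims=True)
```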
Right now we're just doing `np.mean(tpr)`.
When any of the `TN`, `FP`, `FN`, `TP` values from the `confusion_matrix` is zero:
```
Making predictions from model 0
Input image size: [299, 299, 3]
Found 426 images belonging to 117 classes.
14/14 [==============================] - 6s 399ms/step
Traceback (most recent call last):
  File "eval.py", line 37, in <module>
    evaluator.evaluate(data_dir=opts.data_dir, top_k=opts.top_k, save_confusion_matrix_path=opts.report_dir)
  File "/home/adria/Github/deepderm/.venv/lib/python3.5/site-packages/keras_eval/eval.py", line 142, in evaluate
    save_confusion_matrix_path=save_confusion_matrix_path)
  File "/home/adria/Github/deepderm/.venv/lib/python3.5/site-packages/keras_eval/eval.py", line 195, in get_metrics
    results = metrics.metrics_top_k(self.combined_probabilities, y_true, concepts=concept_labels, top_k=top_k)
  File "/home/adria/Github/deepderm/.venv/lib/python3.5/site-packages/keras_eval/metrics.py", line 74, in metrics_top_k
    np.float32).ravel()
ValueError: not enough values to unpack (expected 4, got 1)
```
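The unpacking fails because `sklearn.metrics.confusion_matrix` shrinks its output when some outcomes never occur. One possible fix, assuming `metrics_top_k` builds a per-class one-vs-rest matrix, is to pin the label set so the matrix is always 2x2:

```python
from sklearn.metrics import confusion_matrix

# Binarize ground truth and predictions against class c (one-vs-rest).
y_true_c = (y_true == c).astype(int)
y_pred_c = (y_pred == c).astype(int)
# labels=[0, 1] forces a 2x2 matrix even when a cell count is zero,
# so .ravel() always yields exactly four values.
tn, fp, fn, tp = confusion_matrix(y_true_c, y_pred_c, labels=[0, 1]).ravel()
```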
Right now the size of the figures is dependent on the number of images being plotted, as seen here.
My proposal is to plot all images at a fixed figure size of (20, 20), which seems to be a good size to view them in a Jupyter notebook.
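In matplotlib terms this just means fixing `figsize` instead of deriving it from the image count; a sketch (the function name and grid layout are assumptions):

```python
import math
import matplotlib.pyplot as plt

def plot_images(images, cols=5):
    # Fixed 20x20-inch canvas regardless of how many images are plotted;
    # only the grid layout changes with the image count.
    rows = math.ceil(len(images) / cols)
    fig, axes = plt.subplots(rows, cols, figsize=(20, 20))
    for ax, image in zip(fig.axes, images):
        ax.imshow(image)
        ax.axis('off')
    for ax in fig.axes[len(images):]:
        ax.axis('off')  # hide any unused grid cells
    plt.show()
```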
Types of predictions:
Given a model and a test set based on N classes, allow evaluation on sets of classes by providing a testing dictionary or similar.
E.g.
Training scenario:
[class_0]
[class_1]
[class_2]
[class_3]
Testing scenario:
[test_set_0] class_0 or class_1
[test_set_1] class_2 or class_3
So the way we combine probabilities is as below:
probability(test_set_0) = probability(class_0) + probability(class_1)
probability(test_set_1) = probability(class_2) + probability(class_3)
We would want the users to give us the mapping between the training and testing dictionary as a `.json` file. Given below is the format we expect:
```json
[
  {
    "class_index": 0,
    "class_name": "dog",
    "group": "land_animals"
  },
  {
    "class_index": 1,
    "class_name": "cat",
    "group": "land_animals"
  },
  {
    "class_index": 2,
    "class_name": "gold_fish",
    "group": "sea_creatures"
  }
]
```
So in the example above, the `group` field gives us the mapping between a single concept during training and the concepts we want to evaluate on at test time.
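A sketch of how the per-group probabilities could be assembled from such a mapping file; the helper name and signature are hypothetical:

```python
import json
import numpy as np

def group_probabilities(probabilities, mapping_path):
    # Load the training-to-testing mapping described above.
    with open(mapping_path) as f:
        mapping = json.load(f)
    groups = sorted({entry['group'] for entry in mapping})
    grouped = np.zeros((probabilities.shape[0], len(groups)))
    for entry in mapping:
        # Sum each training class's probability into its test group.
        grouped[:, groups.index(entry['group'])] += probabilities[:, entry['class_index']]
    return grouped, groups
```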
Rename things to use the same nomenclature across our repositories.
`ValueError: Number of concepts (4) and dimensions of confusion matrix do not coincide (2, 2)`
Example case (`C_8` and `C_15` with no samples):

class | precision | FP | TP | FDR | f1_score | FN | AUROC | sensitivity |
---|---|---|---|---|---|---|---|---|
C_0 | 0.6669999957084656 | 8 | 16 | 0.3330000042915344 | 0.64 | 10 | 0.668 | 0.6150000095367432 |
C_1 | 0.75 | 5 | 15 | 0.25 | 0.732 | 6 | 0.759 | 0.7139999866485596 |
C_2 | 1.0 | 0 | 3 | 0.0 | 0.6 | 4 | 0.714 | 0.42899999022483826 |
C_3 | 0.7860000133514404 | 3 | 11 | 0.21400000154972076 | 0.786 | 3 | 0.802 | 0.7860000133514404 |
C_4 | 0.5 | 1 | 1 | 0.5 | 0.667 | 0 | 1.0 | 1.0 |
C_5 | 0.800000011920929 | 1 | 4 | 0.20000000298023224 | 0.727 | 2 | 0.69 | 0.6669999957084656 |
C_6 | 0.8569999933242798 | 1 | 6 | 0.14300000667572021 | 0.923 | 0 | 0.722 | 1.0 |
C_7 | 0.6669999957084656 | 1 | 2 | 0.3330000042915344 | 0.8 | 0 | 0.833 | 1.0 |
C_8 | 0 | 0 | 0 | |||||
C_9 | 1.0 | 0 | 2 | 0.0 | 1.0 | 0 | 0.833 | 1.0 |
C_10 | 0.6669999957084656 | 3 | 6 | 0.3330000042915344 | 0.75 | 1 | 0.771 | 0.8569999933242798 |
C_11 | 0.7139999866485596 | 4 | 10 | 0.28600001335144043 | 0.769 | 2 | 0.765 | 0.8330000042915344 |
C_12 | 0.5 | 4 | 4 | 0.5 | 0.5 | 4 | 0.648 | 0.5 |
C_13 | 0.7369999885559082 | 5 | 14 | 0.2630000114440918 | 0.509 | 22 | 0.588 | 0.3889999985694885 |
C_14 | 1.0 | 0 | 1 | 0.0 | 1.0 | 0 | 1.0 | 1.0 |
C_15 | 0 | 0 | 2 | 0.429 | 0.0 |
model | precision | auroc | accuracy_top_1 | accuracy_top_2 | accuracy_top_3 | specificity | fdr | sensitivity | f1_score |
---|---|---|---|---|---|---|---|---|---|
fda-117-way-inception_v3-lr-0.001-batch-128_1GPU.hdf5 | 0.758 | 0.866 | 0.913 | 0.994 |
Note `precision`, `auroc`, `fdr`, `sensitivity` and `f1_score` are not being shown.
From #71, some tests (especially `test_ensemble_models`) are taking longer than expected, and for this reason the tests in Travis CI are failing.
oracle-style
Assign an `id` name to the evaluator object as an attribute.
Metrics to add:
We need to figure out whether we need a particular setuptools version in `setup.py`, as seen here.
The reason I added it was because of an error message similar to this.
Having a particular version of setuptools as a dependency is an issue, as every repository that depends on keras-eval will need to have that particular version of setuptools as a dependency, and it is not the most up-to-date version.
I think one way to validate the need for setuptools would be to remove the dependency and see if anything breaks.
What do you think @adriaromero @jsalbert?
I noticed that the evaluation results may be different when using `keras_eval.utils.ensemble_model` to manually ensemble different models. Here is an example:

f1_score | precision | top_1 | top_2 | top_3 | top_4 | top_5 |
---|---|---|---|---|---|---|
0.520981 | 0.64068 | 0.505455 | 0.643636 | 0.729091 | 0.772727 | 0.805454 |
In `compare_group_test_concepts`, if a `ValueError` is raised, provide the user with the error information, such as which classes are not matching.
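A sketch of what a more informative error could look like, assuming the check compares two lists of concept labels (the signature here is hypothetical):

```python
def compare_group_test_concepts(test_concepts, group_concepts):
    # Set differences identify exactly which classes fail to match.
    missing = set(test_concepts) - set(group_concepts)
    unexpected = set(group_concepts) - set(test_concepts)
    if missing or unexpected:
        raise ValueError(
            'Concept mismatch: present in the test set but not in the groups: %s; '
            'present in the groups but not in the test set: %s'
            % (sorted(missing), sorted(unexpected)))
```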
We must specify the `num_classes` to be `len(concepts)`.
No model specs here:
`/github_repos/keras-eval/tmp/fixtures/models/ensemble/mobilenet_1`
Model can't be loaded.
Read a dataset `dictionary.json` file with class information such as `class_index`, `class_name` and image `count`, e.g.:
```json
[
  {
    "class_index": 0,
    "class_name": "dog",
    "count": 500
  },
  {
    "class_index": 1,
    "class_name": "cat",
    "count": 300
  }
]
```
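A minimal loading sketch for such a file; the helper name and the sort-by-index convention are assumptions:

```python
import json

def load_dataset_dictionary(path='dictionary.json'):
    with open(path) as f:
        entries = json.load(f)
    # Sort by class_index so positions line up with the model's output vector.
    entries.sort(key=lambda e: e['class_index'])
    class_names = [e['class_name'] for e in entries]
    counts = [e['count'] for e in entries]
    return class_names, counts
```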
When I run:

```python
from keras_eval import utils
```

I get an error on this line:
https://github.com/triagemd/keras-eval/blob/master/keras_eval/utils.py#L13
I see there's a separate keras-applications package at:
https://github.com/keras-team/keras-applications
but on that page they recommend importing from keras.
Changing to this line seems to work for me:

```python
from keras.applications import mobilenet
```

I think this line would then have to be updated to `mobilenet.relu6`.
Is this just an old API that needs to be updated? I see that the tests seem to rely on an existing mobilenet model. Does this play into this at all?
Purpose: while evaluating, let the user set a different threshold for every class.
Example scenario:
Suppose you have three classes A, B and C. In many cases, when the classifier gives probability 0.3 to class A and 0.6 to class B, the predicted label is assigned to class B. But let's say that class A is very sensitive and does not always get high probability values. In this case you may want to say that for any probability greater than 0.25 for class A, the label is assigned to class A even if the probability of class B is higher.
So, in the above case, you give it a list of minimum probabilities for assigning each class, e.g. [0.25, 0.7, 0.5].
Adverse scenarios (both are handled in the sketch after this list):
What if the assigned probabilities are above the probability threshold for multiple classes?
Then set the class label to the class with the highest probability.
What if the probabilities assigned are lower than all of the thresholds?
Then set the class label to the class with the highest probability.
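A minimal sketch of this assignment rule, assuming row-wise class probabilities; the function name is hypothetical, and when several classes clear their thresholds the most probable of those is chosen:

```python
import numpy as np

def predict_with_thresholds(probabilities, thresholds):
    probabilities = np.asarray(probabilities)
    thresholds = np.asarray(thresholds)
    labels = np.empty(len(probabilities), dtype=int)
    for i, p in enumerate(probabilities):
        candidates = np.where(p >= thresholds)[0]
        if len(candidates) > 0:
            # Several classes may clear their thresholds: keep the most probable.
            labels[i] = candidates[np.argmax(p[candidates])]
        else:
            # Nothing clears its threshold: fall back to plain argmax.
            labels[i] = int(np.argmax(p))
    return labels
```

For the example above, `predict_with_thresholds([[0.3, 0.6, 0.1]], [0.25, 0.7, 0.5])` returns class A (index 0), since only A clears its threshold.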
Hi, I am a beginner in deep learning. I have read the paper "Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles" and I am interested in MCL. Can you show me a tutorial on training a diverse ensemble of deep networks in Keras?
Plotting histogram function
If a `concept_dictionary` is provided, we should check for dictionary errors (`compare_group_test_concepts` and `check_concept_unique`) before computing predictions in `_compute_probabilities_generator`.
I think when folder names were added automatically as concepts, as done in this issue: #33, it doesn't check whether the concept being added is a folder or a file. So it adds miscellaneous files like `.json` files to the list of concepts.
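A possible guard, assuming concepts are derived from the entries of a data directory (the helper name is hypothetical):

```python
import os

def list_concepts(data_dir):
    # Keep only subdirectories as concepts, so stray files such as
    # dictionary.json are never added to the concept list.
    return sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry)))
```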
Equivalent of `evaluate_generator` from https://keras.io/models/model/
Add `combination_mode` as a default attribute and return the ensembled probabilities for the model-ensemble case.