craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.

License: MIT License

mir_eval's Introduction

mir_eval

Python library for computing common heuristic accuracy scores for various music/audio information retrieval/signal processing tasks.

Documentation, including installation and usage information: http://craffel.github.io/mir_eval/

If you're looking for the mir_eval web service, which you can use to run mir_eval without installing anything or writing any code, it can be found here: http://labrosa.ee.columbia.edu/mir_eval/

Dependencies:

If you use mir_eval in a research project, please cite the following paper:

Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, "mir_eval: A Transparent Implementation of Common MIR Metrics", Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

mir_eval's People

Contributors

bmcfee, carlthome, craffel, daturkel, dpwe, ecmjohnson, ejhumphrey, f90, fdlm, hendriks73, justinsalamon, kukas, laubeee, leighsmith, nils-werner, pramoneda, rabitt, ssfrr, stefan-balke, urinieto

mir_eval's Issues

function to build a chord alphabet?

The chord module does reduction to a specific alphabet, but there doesn't seem to be a way to ask for a list of the chords in the chosen alphabet.

I'm thinking the following:

>>> alphabet = mir_eval.chords.alphabet('minmaj')
>>> print(alphabet)
    ['N', 'A:maj', 'A:min', 'A#:maj', 'A#:min', ... ]

and so on for all of the other supported alphabets. That way, it will be simple to construct and label the states of a chord model.

Ideally, the ordering of the chord alphabet should be well-defined, eg:

  • no-chord symbol always comes first
  • maj/min in alphabetical order
  • then remaining subsets (triads, 7ths, etc) each in alphabetical order

I'm not married to that exact ordering, but we should try to keep things consistent.
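
For concreteness, here is a minimal sketch of what such a helper could look like; the alphabet() function, the quality lists, and the root ordering are assumptions for illustration, not part of the current chord module:

    import itertools

    # Hypothetical helper -- not part of mir_eval.chords today.
    ROOTS = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
    QUALITIES = {'minmaj': ['maj', 'min']}

    def alphabet(name):
        """Return the no-chord symbol followed by every root:quality combination."""
        return ['N'] + ['{}:{}'.format(root, quality)
                        for root, quality in itertools.product(ROOTS, QUALITIES[name])]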

Watch out for arange

np.arange with a floating-point step may not give consistent results (the length of the output can change with rounding). This may be an issue when resampling the timebases in melody.
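
As an illustration (the duration and hop values here are arbitrary), building the grid from an explicit sample count avoids the floating-point endpoint ambiguity of np.arange:

    import numpy as np

    duration = 10.0   # seconds (example value)
    hop = 0.01        # seconds (example value)

    # np.arange(0, duration, hop) can gain or lose a final sample depending on
    # floating-point rounding; an explicit sample count is deterministic.
    n_samples = int(np.floor(duration / hop)) + 1
    times = np.linspace(0, hop * (n_samples - 1), n_samples)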

io.load_jams_range: obscure exception

I'm getting the not-so-helpful exception:

Error: could not open /home/bmcfee/git/olda/data/truth/JAMS/SALAMI_608.jams (JAMS module not installed?)

which obscures downstream errors from jams. I suggest redoing this with a decorator that first verifies that the jams import succeeded (raising the above exception if not); the loader itself then proceeds with no try/except, so that errors propagate upstream.
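
A sketch of what that decorator might look like (the name requires_jams and the message text are assumptions):

    import functools

    try:
        import jams
    except ImportError:
        jams = None

    def requires_jams(loader):
        """Fail early with a clear message if jams is missing; otherwise call the
        loader directly so its own exceptions propagate unchanged."""
        @functools.wraps(loader)
        def wrapper(*args, **kwargs):
            if jams is None:
                raise ImportError('The jams module is required for ' + loader.__name__)
            return loader(*args, **kwargs)
        return wrapper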

boundary import to include labels

util.import_segment_boundaries currently loads in data of the flavor:

0.00\t21.7\tINTRO
21.7\t32.8\tVERSE
...

Right now, it only processes the time information. We should extend this to return segment labels as well.

Maybe we should generalize the loader to support this kind of i/o for chord (or arbitrary) annotations as well?
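
A minimal sketch of such a generic loader (the function name and the fixed tab delimiter are assumptions):

    import numpy as np

    def load_labeled_intervals(filename, delimiter='\t'):
        """Parse lines of 'start<TAB>end<TAB>label' into an interval array and a label list."""
        starts, ends, labels = [], [], []
        with open(filename) as f:
            for line in f:
                start, end, label = line.strip().split(delimiter, 2)
                starts.append(float(start))
                ends.append(float(end))
                labels.append(label)
        return np.array([starts, ends]).T, labels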

segment unit test fails

(posted by Colin)

On my computer, the segment unit test matches exactly. On Dawen's and Eric's (and maybe others') computers, the segmentation unit test fails on the following metrics:

Pair-P 0.760466079298 0.759487707538 False
Pair-R 0.439344606659 0.439064910407 False
Pair-F 0.556932313401 0.55644516106 False
ARI 0.36752433328 0.366763678811 False
MI 0.69019595974 0.690003644248 False
AMI 0.414313462999 0.414259923205 False
NMI 0.509383741119 0.509459218087 False
S_Over 0.765096115476 0.765422031141 False
S_Under 0.505541891353 0.505574473705 False
S_F 0.608809330754 0.608936120723 False
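
Given that the disagreements are only in the third or fourth decimal place, one option (purely an illustration) is to compare against the stored values with an explicit tolerance instead of exact equality:

    import numpy as np

    # The two Pair-P values reported above; which one is the stored reference is
    # immaterial for this illustration.
    reference_score = 0.760466079298
    computed_score = 0.759487707538

    # Passes only if the two scores agree to within the chosen absolute tolerance.
    np.testing.assert_allclose(computed_score, reference_score, atol=1e-2)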

Regression tests and (actual) unit tests

Each task submodule should have a testing script in tests which performs unit/sanity tests and regression tests against the current stable implementation of the code. This is all primarily to prevent bugs from creeping into the code. We're not trying to unit test against an existing implementation anymore.

Regression tests

Unit tests

segment.frame*: gracefully handling end-of-track discrepancies

Given boundary time-stamps for the ground truth and prediction annotations, frame-based metrics operate by generating samples between each pair of boundaries.

Often, we have to compare two annotation sequences that don't line up exactly (due to sampling quantization, etc). For example, the ground truth may list the final segment boundary at 243.900, while the prediction may list it at 243.85.

I'd argue that we shouldn't assume that both annotations correctly cover the track duration. The current implementation assumes that the final timestamp in the ground truth is the end of the track. (This may be correct enough in practice.) But how should the prediction file be adjusted to match up? Add a final end-of-track boundary? This changes the total number of segments, and may skew the metrics.
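
One possible convention, sketched below under the assumption that the reference's final boundary defines the track end (the function name is hypothetical):

    import numpy as np

    def pad_to_end(boundaries, t_max):
        """Drop boundaries past t_max and place a final boundary exactly at t_max."""
        boundaries = np.asarray(boundaries)
        boundaries = boundaries[boundaries < t_max]
        return np.append(boundaries, t_max)

    # Example: force the estimated boundaries to end where the reference does.
    reference = np.array([0.0, 21.7, 243.9])
    estimated = np.array([0.0, 20.0, 243.85])
    estimated = pad_to_end(estimated, reference[-1])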

efficiency of structure.pairwise

structure.pairwise currently uses np.triu_indices to find all unique pairs of frames and evaluate their agreement. It turns out that this is woefully inefficient in memory, as np.triu_indices constructs an intermediate (int64) masking matrix.

This should be replaced with a more direct implementation with leaner memory requirements.
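
One memory-lean alternative (an illustrative sketch, not the current implementation) counts agreeing pairs from label counts instead of enumerating the pairs:

    import numpy as np
    from scipy import sparse

    def count_same_label_pairs(labels):
        """Number of frame pairs sharing a label, computed from label counts alone."""
        _, counts = np.unique(labels, return_counts=True)
        return int(np.sum(counts * (counts - 1) // 2))

    def count_jointly_agreeing_pairs(ref_labels, est_labels):
        """Number of frame pairs that agree in both annotations, via the joint
        contingency table (duplicate entries are summed by the COO -> CSR conversion)."""
        ref_ids = np.unique(ref_labels, return_inverse=True)[1]
        est_ids = np.unique(est_labels, return_inverse=True)[1]
        joint = sparse.coo_matrix((np.ones(len(ref_ids)), (ref_ids, est_ids))).tocsr()
        return int(np.sum(joint.data * (joint.data - 1) // 2))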

util.intervals_to_samples() -- return time array?

I'm thinking it might be more consistent to return an array of time points from this function (especially in light of rounding). Granted, it can be easily computed/created, but it (1) prevents users from screwing it up and (2) enables a pass-through loop that would make converting intervals -> samples -> intervals -> samples trivial, both for practice and testing purposes.
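
For illustration only, the returned time points could simply be the sample grid used internally (the function name and the 0.1 s default hop are assumptions):

    import numpy as np

    def interval_sample_times(intervals, sample_size=0.1):
        """Times of the samples covering the full interval span at a fixed hop."""
        intervals = np.asarray(intervals)
        n_samples = int(np.floor(intervals.max() / sample_size)) + 1
        return np.arange(n_samples) * sample_size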

Add chord detection evaluation

Add a submodule chord.py which implements common chord detection metrics. I think you have code for this already, right? All the sampling and overlap kind of stuff.

RFC: Empty reference and estimated

In all of the metrics (I believe), a special case must be made when the reference annotations are empty and/or the estimated annotations are empty. In the case where one is empty and the other isn't, I think we can all agree that the resulting metric should be 0 (please disagree if you can think of a good case). If both are empty, it gets a little fuzzier. The onset detection metric (f-measure) that I "translated" says that if both are empty (e.g. there are no onsets, and the detector says there are no onsets), then precision, recall, and f-measure are all 1. In contrast, the beat metrics and (all?) segment metrics return 0 if the estimated annotations are empty. I'm somewhat comfortable with both conventions, but I lean towards "if there are no reference annotations and no annotations were estimated, you're perfect".

Either way, I think all of the validators should throw warnings when things are empty.
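
A sketch of the special-case logic described above; the return value for the both-empty case is exactly the point under discussion:

    import warnings

    def handle_empty(reference, estimated, score_if_both_empty=1.0):
        """Return an early score for empty inputs, or None if the metric should run normally."""
        if len(reference) == 0 or len(estimated) == 0:
            warnings.warn('Reference and/or estimated annotations are empty.')
            if len(reference) == 0 and len(estimated) == 0:
                return score_if_both_empty
            return 0.0
        return None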

MIREX mode

There should be an environment variable that forces all metrics into "MIREX compatibility mode", because every submodule has proposed changes that deviate from MIREX.
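
For example (the variable name here is purely an assumption), the switch could be read once at import time:

    import os

    # Hypothetical flag: metric functions would check this and fall back to
    # MIREX-compatible behavior when it is set.
    MIREX_MODE = os.environ.get('MIR_EVAL_MIREX_MODE', '0') == '1'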

Data validation functions

In mir_eval.beat, there's a function _clean_beats (which I'm in the process of removing/refactoring) which essentially takes as input an array of beat locations and runs a series of assertions about its content - e.g., that it is a 1d np.ndarray, no beat times are negative, all beat times are less than 30,000 (largely to make sure the beat times are in seconds). It also sorts the beat times.

Given that the design doc https://github.com/craffel/mir_eval/wiki/Design requires that each metric function just takes in raw annotations and does not load files or provide error checking, it seems like it would be beneficial to have functions like this one which provide sanity checks for the input data, and which fail in a helpful way when the data is poorly formed. It also seems like all the tasks would need something like this. If that's the case, can we agree on some kind of convention as to where this functionality should go? Maybe in a function mir_eval.task.clean or something?
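
A sketch of what such a shared validator could look like, following the checks described above (the function name and the 30,000 s threshold come straight from that description):

    import numpy as np

    def validate_beats(beats, max_time=30000.0):
        """Sanity-check an array of beat times and return it sorted."""
        beats = np.asarray(beats)
        if beats.ndim != 1:
            raise ValueError('Beat times must be a 1-d array.')
        if beats.size and np.min(beats) < 0:
            raise ValueError('Beat times must be non-negative.')
        if beats.size and np.max(beats) > max_time:
            raise ValueError('Beat times exceed {} s; are they in seconds?'.format(max_time))
        return np.sort(beats)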

rename the repository/package?

Hyphens are bad mojo for package names:

import mir-evaluate

will cause a syntax error.

Can we either:

  • Move actual code into a sub-directory mir_evaluate, or
  • Rename the repository to mir_evaluate?
  • Or both?

Either way, the code should definitely be moved into a sub-directory to facilitate having an install script down the line.

Remove sklearn dependency

Ideally, we'd like to minimize dependencies as much as possible. sklearn is not unreasonable, but we want to lower the barrier to entry to its absolute minimum (numpy/scipy). This is low priority, but it would be nice eventually.

empty beat predictions throw exceptions

The beat eval helper method _clean_beats throws a ValueError when no detected beats remain (e.g., after discarding beats outside the allowed boundaries). This is probably not the correct behavior, as many of the metrics are well-defined for empty predictions.

Pylint

Make the pylint score better.

Sonification

Not exactly a metric, but it would be nice to have something that, e.g., synthesizes beeps at a list of times and mixes them back into the original audio to perform "evaluation by ear", because not all of us use Sonic Visualiser. Similarly, we could synthesize chord labels, chromagrams, etc. (a minimal sketch follows the list below):

  • Time indices (beats, onsets, segments)
  • Piano roll/time-frequency representation inversion
  • Chord labels
  • Chroma sonification
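
As referenced above, a minimal sketch of the beep-at-times idea using only numpy (sample rate, click frequency, and decay are arbitrary example values):

    import numpy as np

    def sonify_clicks(times, fs=44100, freq=1000.0, click_dur=0.05, total_dur=None):
        """Render a short decaying sine click at each time stamp; mix the result
        with the original audio to audition the annotations."""
        if total_dur is None:
            total_dur = max(times) + click_dur
        output = np.zeros(int(fs * total_dur))
        click = np.sin(2 * np.pi * freq * np.arange(int(fs * click_dur)) / fs)
        click *= np.exp(-np.arange(len(click)) / (0.01 * fs))
        for t in times:
            start = int(fs * t)
            end = min(start + len(click), len(output))
            output[start:end] += click[:end - start]
        return output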

Unit tests

Create unit tests, which should verify that the output is the same as reference implementations.

convention for empty intervals

I ran across this curious example in the SALAMI dataset this morning, and it got me thinking about how we should handle semantically questionable (but syntactically valid) inputs. The example in question comes from 988/parsed/textfile1_functions.txt, and the offending lines (events) are:

321.788979591   Instrumental
321.788979591   Outro
321.788979591   Solo

mir_eval happily loads these as events and, following the zipping convention, converts them to intervals without a problem:

(array([ 321.78897959,  321.78897959]), 'Instrumental'),
(array([ 321.78897959,  321.78897959]), 'Outro'),
(array([ 321.78897959,  397.66927438]), 'Solo'),

The problem here is that the event file listed three labels for a single boundary time. The interval converter slams these together, and makes two length-0 intervals. On our part, this is incorrect because the assumption of sequential unique timing is not satisfied.

This causes problems in segmentation metrics in a few ways: label equivalence is no longer well defined, and there are artificial boundaries in place.

It seems like we should settle on a convention for dealing with this type of problem, as it may be reasonable to expect similar things in other tasks. Anyone care to chime in?
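
One simple convention (a sketch; whether to drop or merge such entries is exactly the open question) would be to detect zero-length intervals after conversion:

    import numpy as np

    def drop_empty_intervals(intervals, labels):
        """Remove intervals with zero (or negative) duration along with their labels."""
        intervals = np.asarray(intervals)
        keep = np.diff(intervals, axis=-1).ravel() > 0
        return intervals[keep], [label for label, k in zip(labels, keep) if k]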

Demo evaluator scripts for all submodules

Each submodule should have a driver program that can be operated from the command line, and output each metric's score in a machine-readable format. These should follow the general formula of

./SUBMODULE_eval.py TRUTH.TXT PREDICTION.DATA

e.g.

./segment_eval.py ~/data/SALAMI/data/1000/parsed/textfile1_functions.txt predictions/1000.lab

I'll push the segment driver asap so you can use it as a template for your own submodule drivers.
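
A skeleton of such a driver (argument handling and output format only; the loader and metric calls are placeholders that each submodule would fill in):

    #!/usr/bin/env python
    '''Skeleton evaluator: ./SUBMODULE_eval.py TRUTH.TXT PREDICTION.DATA'''
    import argparse
    import json

    def main():
        parser = argparse.ArgumentParser(description='Evaluate a prediction against ground truth.')
        parser.add_argument('reference_file', help='Path to the ground-truth annotation')
        parser.add_argument('estimated_file', help='Path to the estimated annotation')
        args = parser.parse_args()

        # Placeholder: load both annotations and compute the submodule's metrics here.
        scores = {'example_metric': 0.0}

        # Machine-readable output.
        print(json.dumps(scores, indent=2))

    if __name__ == '__main__':
        main()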

Metrics should include example usage

Each metric in each task submodule should include an example usage in the docstring, which includes loading and preprocessing data from a file and calling the metric, e.g.

:usage:
    >>> reference_beats = mir_eval.beat.trim_beats(mir_eval.io.load_events('reference.txt'))
    >>> estimated_beats = mir_eval.beat.trim_beats(mir_eval.io.load_events('estimated.txt'))
    >>> f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)

Sphinx docs

Clean up docstrings and set up compilation

io module, refactor loaders

  • Split the annotation loader into separate functions for annotations with or without duration
  • Move i/o into a submodule io

Change "label_prefix" to "fill_value" in adjust_intervals?

While I can see how a prefix might help (filtering and whatnot), it'd be super helpful in chord-rec to just specify what the out-of-boundary label is / should be. I think this would be more elegant than writing chord parsing code to clean up the result after the fact.

Any major objections or justifications I'm not seeing? @bmcfee I think this is relevant to you?

numpy error when intervals is length 2

A bogus edge case, but when calling score, if len(reference_labels) == len(estimated_labels) == 1 and len(intervals) == 2, then

durations = np.abs(np.diff(intervals, axis=-1)).squeeze()

returns an array with shape (), and the line

durations = durations[valid_idx]

which indexes it with a boolean array of size 1, gives the following numpy error:

IndexError: 0-dimensional arrays can't be indexed
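
One possible fix (an illustration, not necessarily the one adopted) is to promote the squeezed result back to one dimension before the boolean indexing:

    import numpy as np

    intervals = np.array([[0.0, 1.5]])   # a single labeled interval, i.e. two boundary values
    durations = np.abs(np.diff(intervals, axis=-1)).squeeze()

    # squeeze() on a single row yields a 0-d array; promote it so boolean indexing works.
    durations = np.atleast_1d(durations)
    valid_idx = np.array([True])
    durations = durations[valid_idx]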

validate input data

Data import should throw an exception if the input is not well-formed.

Example:

0.00 5.12 INTRO
5.12 4.12 VERSE
...

should fail loudly upon import.
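
A sketch of the kind of check the importer could run; the example above should fail because the second interval ends (4.12) before it starts (5.12):

    import numpy as np

    def validate_intervals(intervals):
        """Raise if any interval ends before it starts, or if starts are unsorted."""
        intervals = np.asarray(intervals)
        if np.any(intervals[:, 1] < intervals[:, 0]):
            raise ValueError('Found an interval whose end time precedes its start time.')
        if np.any(np.diff(intervals[:, 0]) < 0):
            raise ValueError('Interval start times are not sorted.')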

RFC: rounding time-stamps on import and/or adjustment to ground truth?

Should we add a feature to automatically round (segment boundary) timestamps upon import?

This would provide more reliable behavior in util.adjust_segments when trimming a prediction file against a ground truth annotation. For example, a prediction file might list the end-of-track at 201.955, while the ground truth has it at 201.96. In this case, we probably don't want or need to append a final marker.

Note that the frame size parameter in the various frame clustering metrics already takes care of this, but the effects are somewhat opaque and hidden from the user.

Moreover, boundary-detection metrics currently just trim the first and last boundaries, assuming that they indicate start- and end- of track. Rounding and uniq'ing the times upon import would make this behavior a little more reliable.
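
A sketch of rounding-and-deduplication on import (the two-decimal precision is only an example):

    import numpy as np

    def round_boundaries(boundaries, decimals=2):
        """Round boundary times and drop the duplicates produced by the rounding."""
        return np.unique(np.round(np.asarray(boundaries), decimals=decimals))

    # Near-duplicate end-of-track times (e.g. 201.9551 vs. 201.96) collapse to one boundary.
    boundaries = round_boundaries([0.0, 201.9551, 201.96])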
