craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.

License: MIT License

mir_eval's Introduction

mir_eval

Python library for computing common heuristic accuracy scores for various music/audio information retrieval/signal processing tasks.

Documentation, including installation and usage information: http://craffel.github.io/mir_eval/

If you're looking for the mir_eval web service, which you can use to run mir_eval without installing anything or writing any code, it can be found here: http://labrosa.ee.columbia.edu/mir_eval/

Dependencies:

If you use mir_eval in a research project, please cite the following paper:

Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, "mir_eval: A Transparent Implementation of Common MIR Metrics", Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

mir_eval's People

Contributors

bmcfee, carlthome, craffel, daturkel, dpwe, ecmjohnson, ejhumphrey, f90, fdlm, hendriks73, justinsalamon, kukas, laubeee, leighsmith, nils-werner, pramoneda, rabitt, ssfrr, stefan-balke, urinieto

mir_eval's Issues

function to build a chord alphabet?

The chord module does reduction to a specific alphabet, but there doesn't seem to be a way to ask for a list of the chords in the chosen alphabet.

I'm thinking the following:

>>> alphabet = mir_eval.chords.alphabet('minmaj')
>>> print(alphabet)
    ['N', 'A:maj', 'A:min', 'A#:maj', 'A#:min', ... ]

and so on for all of the other supported alphabets. That way, it will be simple to construct and label the states of a chord model.

Ideally, the ordering of the chord alphabet should be well-defined, eg:

  • no-chord symbol always comes first
  • maj/min in alphabetical order
  • then remaining subsets (triads, 7ths, etc) each in alphabetical order

I'm not married to that exact ordering, but we should try to keep things consistent.
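
For concreteness, here is a minimal sketch of what such a helper could look like; the alphabet() function, the quality lists, and the root ordering are assumptions for illustration, not part of the current chord module:

    import itertools

    # Hypothetical helper -- not part of mir_eval.chords today.
    ROOTS = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
    QUALITIES = {'minmaj': ['maj', 'min']}

    def alphabet(name):
        """Return the no-chord symbol followed by every root:quality combination."""
        return ['N'] + ['{}:{}'.format(root, quality)
                        for root, quality in itertools.product(ROOTS, QUALITIES[name])]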

Watch out for arange

np.arange with a floating-point step may not give consistent results (the length of the output can change with rounding). This may be an issue when resampling the timebases in melody.
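
As an illustration (the duration and hop values here are arbitrary), building the grid from an explicit sample count avoids the floating-point endpoint ambiguity of np.arange:

    import numpy as np

    duration = 10.0   # seconds (example value)
    hop = 0.01        # seconds (example value)

    # np.arange(0, duration, hop) can gain or lose a final sample depending on
    # floating-point rounding; an explicit sample count is deterministic.
    n_samples = int(np.floor(duration / hop)) + 1
    times = np.linspace(0, hop * (n_samples - 1), n_samples)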

io.load_jams_range: obscure exception

I'm getting the not-so-helpful exception:

Error: could not open /home/bmcfee/git/olda/data/truth/JAMS/SALAMI_608.jams (JAMS module not installed?)

which obscures downstream errors from jams. I suggest redoing this with a decorator that first verifies that the jams import succeeded (raising the above exception if not); the loader itself then proceeds with no try/except, so that errors propagate upstream.
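
A sketch of what that decorator might look like (the name requires_jams and the message text are assumptions):

    import functools

    try:
        import jams
    except ImportError:
        jams = None

    def requires_jams(loader):
        """Fail early with a clear message if jams is missing; otherwise call the
        loader directly so its own exceptions propagate unchanged."""
        @functools.wraps(loader)
        def wrapper(*args, **kwargs):
            if jams is None:
                raise ImportError('The jams module is required for ' + loader.__name__)
            return loader(*args, **kwargs)
        return wrapper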

boundary import to include labels

util.import_segment_boundaries currently loads in data of the flavor:

0.00\t21.7\tINTRO
21.7\t32.8\tVERSE
...

Right now, it only processes the time information. We should extend this to return segment labels as well.

Maybe we should generalize the loader to support this kind of i/o for chord (or arbitrary) annotations as well?
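
A minimal sketch of such a generic loader (the function name and the fixed tab delimiter are assumptions):

    import numpy as np

    def load_labeled_intervals(filename, delimiter='\t'):
        """Parse lines of 'start<TAB>end<TAB>label' into an interval array and a label list."""
        starts, ends, labels = [], [], []
        with open(filename) as f:
            for line in f:
                start, end, label = line.strip().split(delimiter, 2)
                starts.append(float(start))
                ends.append(float(end))
                labels.append(label)
        return np.array([starts, ends]).T, labels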

segment unit test fails

(posted by Colin)

On my computer, the segment unit test matches exactly. On Dawen's and Eric's (and maybe others') computers, the segmentation unit test fails on the following metrics:

Pair-P 0.760466079298 0.759487707538 False
Pair-R 0.439344606659 0.439064910407 False
Pair-F 0.556932313401 0.55644516106 False
ARI 0.36752433328 0.366763678811 False
MI 0.69019595974 0.690003644248 False
AMI 0.414313462999 0.414259923205 False
NMI 0.509383741119 0.509459218087 False
S_Over 0.765096115476 0.765422031141 False
S_Under 0.505541891353 0.505574473705 False
S_F 0.608809330754 0.608936120723 False
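
Given that the disagreements are only in the third or fourth decimal place, one option (purely an illustration) is to compare against the stored values with an explicit tolerance instead of exact equality:

    import numpy as np

    # The two Pair-P values reported above; which one is the stored reference is
    # immaterial for this illustration.
    reference_score = 0.760466079298
    computed_score = 0.759487707538

    # Passes only if the two scores agree to within the chosen absolute tolerance.
    np.testing.assert_allclose(computed_score, reference_score, atol=1e-2)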

Regression tests and (actual) unit tests

Each task submodule should have a testing script in tests which performs unit/sanity tests and regression tests against the current stable implementation of the code. This is all primarily to prevent bugs from creeping into the code. We're not trying to unit test against an existing implementation anymore.

Regression tests

Unit tests

segment.frame*: gracefully handling end-of-track discrepancies

Given boundary time-stamps for the ground truth and prediction annotations, frame-based metrics operate by generating samples between each pair of boundaries.

Often, we have to compare two annotation sequences that don't line up exactly (due to sampling quantization, etc). For example, the ground truth may list the final segment boundary at 243.900, while the prediction may list it at 243.85.

I'd argue that we shouldn't assume that both annotations correctly cover the track duration. The current implementation assumes that the final timestamp in the ground truth is the end of the track. (This may be correct enough in practice.) But how should the prediction file be adjusted to match up? Add a final end-of-track boundary? This changes the total number of segments, and may skew the metrics.
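
One possible convention, sketched below under the assumption that the reference's final boundary defines the track end (the function name is hypothetical):

    import numpy as np

    def pad_to_end(boundaries, t_max):
        """Drop boundaries past t_max and place a final boundary exactly at t_max."""
        boundaries = np.asarray(boundaries)
        boundaries = boundaries[boundaries < t_max]
        return np.append(boundaries, t_max)

    # Example: force the estimated boundaries to end where the reference does.
    reference = np.array([0.0, 21.7, 243.9])
    estimated = np.array([0.0, 20.0, 243.85])
    estimated = pad_to_end(estimated, reference[-1])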

efficiency of structure.pairwise

structure.pairwise currently uses np.triu_indices to find all unique pairs of frames and evaluate their agreement. It turns out that this is woefully inefficient in memory, as np.triu_indices constructs an intermediate (int64) masking matrix.

This should be replaced with a more direct implementation with leaner memory requirements.
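
One memory-lean alternative (an illustrative sketch, not the current implementation) counts agreeing pairs from label counts instead of enumerating the pairs:

    import numpy as np
    from scipy import sparse

    def count_same_label_pairs(labels):
        """Number of frame pairs sharing a label, computed from label counts alone."""
        _, counts = np.unique(labels, return_counts=True)
        return int(np.sum(counts * (counts - 1) // 2))

    def count_jointly_agreeing_pairs(ref_labels, est_labels):
        """Number of frame pairs that agree in both annotations, via the joint
        contingency table (duplicate entries are summed by the COO -> CSR conversion)."""
        ref_ids = np.unique(ref_labels, return_inverse=True)[1]
        est_ids = np.unique(est_labels, return_inverse=True)[1]
        joint = sparse.coo_matrix((np.ones(len(ref_ids)), (ref_ids, est_ids))).tocsr()
        return int(np.sum(joint.data * (joint.data - 1) // 2))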

util.intervals_to_samples() -- return time array?

I'm thinking it might be more consistent to return an array of time points from this function (especially in light of rounding). Granted, it can be easily computed/created, but it (1) prevents users from screwing it up and (2) enables a pass-through loop that would make converting intervals -> samples -> intervals -> samples trivial, both for practice and testing purposes.
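
For illustration only, the returned time points could simply be the sample grid used internally (the function name and the 0.1 s default hop are assumptions):

    import numpy as np

    def interval_sample_times(intervals, sample_size=0.1):
        """Times of the samples covering the full interval span at a fixed hop."""
        intervals = np.asarray(intervals)
        n_samples = int(np.floor(intervals.max() / sample_size)) + 1
        return np.arange(n_samples) * sample_size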

Add chord detection evaluation

Add a submodule chord.py which implements common chord detection metrics. I think you have code for this already, right? All the sampling and overlap kind of stuff.

RFC: Empty reference and estimated

In all of the metrics (I believe), a special case must be made when the reference annotations are empty and/or the estimated annotations are empty. In the case where one is empty and the other isn't, I think we can all agree that the resulting metric should be 0 (please disagree if you can think of a good case). If both are empty, it gets a little fuzzier. The onset detection metric (f-measure) that I "translated" says that if both are empty (e.g. there are no onsets, and the detector says there are no onsets), then precision, recall, and f-measure are all 1. In contrast, the beat metrics and (all?) segment metrics return 0 if the estimated annotations are empty. I'm somewhat comfortable with both conventions, but I lean towards "if there are no reference annotations and no annotations were estimated, you're perfect".

Either way, I think all of the validators should throw warnings when things are empty.
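
A sketch of the special-case logic described above; the return value for the both-empty case is exactly the point under discussion:

    import warnings

    def handle_empty(reference, estimated, score_if_both_empty=1.0):
        """Return an early score for empty inputs, or None if the metric should run normally."""
        if len(reference) == 0 or len(estimated) == 0:
            warnings.warn('Reference and/or estimated annotations are empty.')
            if len(reference) == 0 and len(estimated) == 0:
                return score_if_both_empty
            return 0.0
        return None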

MIREX mode

There should be an environment variable that forces all metrics into "MIREX compatibility mode", because every submodule has proposed changes that deviate from MIREX.
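
For example (the variable name here is purely an assumption), the switch could be read once at import time:

    import os

    # Hypothetical flag: metric functions would check this and fall back to
    # MIREX-compatible behavior when it is set.
    MIREX_MODE = os.environ.get('MIR_EVAL_MIREX_MODE', '0') == '1'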

Data validation functions

In mir_eval.beat, there's a function _clean_beats (which I'm in the process of removing/refactoring) which essentially takes as input an array of beat locations and runs a series of assertions about its content - e.g., that it is a 1d np.ndarray, no beat times are negative, all beat times are less than 30,000 (largely to make sure the beat times are in seconds). It also sorts the beat times.

Given that the design doc https://github.com/craffel/mir_eval/wiki/Design requires that each metric function just takes in raw annotations and does not load files or provide error checking, it seems like it would be beneficial to have functions like this one which provide sanity checks for the input data, and which fail in a helpful way when the data is poorly formed. It also seems like all the tasks would need something like this. If that's the case, can we agree on some kind of convention as to where this functionality should go? Maybe in a function mir_eval.task.clean or something?
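
A sketch of what such a shared validator could look like, following the checks described above (the function name and the 30,000 s threshold come straight from that description):

    import numpy as np

    def validate_beats(beats, max_time=30000.0):
        """Sanity-check an array of beat times and return it sorted."""
        beats = np.asarray(beats)
        if beats.ndim != 1:
            raise ValueError('Beat times must be a 1-d array.')
        if beats.size and np.min(beats) < 0:
            raise ValueError('Beat times must be non-negative.')
        if beats.size and np.max(beats) > max_time:
            raise ValueError('Beat times exceed {} s; are they in seconds?'.format(max_time))
        return np.sort(beats)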

rename the repository/package?

Hyphens are bad mojo for package names:

import mir-evaluate

will cause a syntax error.

Can we either:

  • Move actual code into a sub-directory mir_evaluate, or
  • Rename the repository to mir_evaluate?
  • Or both?

Either way, the code should definitely be moved into a sub-directory to facilitate having an install script down the line.

Remove sklearn dependency

Ideally, we'd like to minimize dependencies as much as possible. sklearn is not unreasonable, but we want to lower the barrier to entry to its absolute minimum (numpy/scipy). This is low priority, but it would be nice eventually.

empty beat predictions throw exceptions

The beat eval helper method _clean_beats throws a ValueError when no detected beats remain (e.g., after discarding beats outside the allowed boundaries). This is probably not the correct behavior, as many of the metrics are well-defined for empty predictions.

Pylint

Make the pylint score better.

Sonification

Not exactly a metric, but it would be nice to have something that, e.g., synthesizes beeps at a list of times and mixes them back into the original audio to perform "evaluation by ear", because not all of us use Sonic Visualiser. Similarly, we could synthesize chord labels, chromagrams, etc. (a minimal sketch follows the list below):

  • Time indices (beats, onsets, segments)
  • Piano roll/time-frequency representation inversion
  • Chord labels
  • Chroma sonification
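
As referenced above, a minimal sketch of the beep-at-times idea using only numpy (sample rate, click frequency, and decay are arbitrary example values):

    import numpy as np

    def sonify_clicks(times, fs=44100, freq=1000.0, click_dur=0.05, total_dur=None):
        """Render a short decaying sine click at each time stamp; mix the result
        with the original audio to audition the annotations."""
        if total_dur is None:
            total_dur = max(times) + click_dur
        output = np.zeros(int(fs * total_dur))
        click = np.sin(2 * np.pi * freq * np.arange(int(fs * click_dur)) / fs)
        click *= np.exp(-np.arange(len(click)) / (0.01 * fs))
        for t in times:
            start = int(fs * t)
            end = min(start + len(click), len(output))
            output[start:end] += click[:end - start]
        return output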

Unit tests

Create unit tests, which should verify that the output is the same as reference implementations.

convention for empty intervals

I ran across this curious example in the SALAMI dataset this morning, and it got me thinking about how we should handle semantically questionable (but syntactically valid) inputs. The example in question comes from 988/parsed/textfile1_functions.txt, and the offending lines (events) are:

321.788979591   Instrumental
321.788979591   Outro
321.788979591   Solo

mir_eval happily loads these as events and, following the zipping convention, converts them to intervals without a problem:

(array([ 321.78897959,  321.78897959]), 'Instrumental'),
(array([ 321.78897959,  321.78897959]), 'Outro'),
(array([ 321.78897959,  397.66927438]), 'Solo'),

The problem here is that the event file listed three labels for a single boundary time. The interval converter slams these together, and makes two length-0 intervals. On our part, this is incorrect because the assumption of sequential unique timing is not satisfied.

This causes problems in segmentation metrics in a few ways: label equivalence is no longer well defined, and there are artificial boundaries in place.

It seems like we should settle on a convention for dealing with this type of problem, as it may be reasonable to expect similar things in other tasks. Anyone care to chime in?
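
One simple convention (a sketch; whether to drop or merge such entries is exactly the open question) would be to detect zero-length intervals after conversion:

    import numpy as np

    def drop_empty_intervals(intervals, labels):
        """Remove intervals with zero (or negative) duration along with their labels."""
        intervals = np.asarray(intervals)
        keep = np.diff(intervals, axis=-1).ravel() > 0
        return intervals[keep], [label for label, k in zip(labels, keep) if k]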

Demo evaluator scripts for all submodules

Each submodule should have a driver program that can be operated from the command line, and output each metric's score in a machine-readable format. These should follow the general formula of

./SUBMODULE_eval.py TRUTH.TXT PREDICTION.DATA

e.g.

./segment_eval.py ~/data/SALAMI/data/1000/parsed/textfile1_functions.txt predictions/1000.lab

I'll push the segment driver asap so you can use it as a template for your own submodule drivers.
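
A skeleton of such a driver (argument handling and output format only; the loader and metric calls are placeholders that each submodule would fill in):

    #!/usr/bin/env python
    '''Skeleton evaluator: ./SUBMODULE_eval.py TRUTH.TXT PREDICTION.DATA'''
    import argparse
    import json

    def main():
        parser = argparse.ArgumentParser(description='Evaluate a prediction against ground truth.')
        parser.add_argument('reference_file', help='Path to the ground-truth annotation')
        parser.add_argument('estimated_file', help='Path to the estimated annotation')
        args = parser.parse_args()

        # Placeholder: load both annotations and compute the submodule's metrics here.
        scores = {'example_metric': 0.0}

        # Machine-readable output.
        print(json.dumps(scores, indent=2))

    if __name__ == '__main__':
        main()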

Metrics should include example usage

Each metric in each task submodule should include an example usage in the docstring, which includes loading and preprocessing data from a file and calling the metric, e.g.

:usage:
    >>> reference_beats = mir_eval.beat.trim_beats(mir_eval.io.load_events('reference.txt'))
    >>> estimated_beats = mir_eval.beat.trim_beats(mir_eval.io.load_events('estimated.txt'))
    >>> f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)

Sphinx docs

Clean up docstrings and set up compilation

io module, refactor loaders

  • Split the annotation loader into separate functions for annotations with or without duration
  • Move i/o into a submodule io

Change "label_prefix" to "fill_value" in adjust_intervals?

While I can see how a prefix might help (filtering and whatnot), it'd be super helpful in chord-rec to just specify what the out-of-boundary label is / should be. I think this would be more elegant than writing chord parsing code to clean up the result after the fact.

Any major objections or justifications I'm not seeing? @bmcfee I think this is relevant to you?

numpy error when intervals is length 2

A bogus edge case, but when calling score, if len(reference_labels) == len(estimated_labels) == 1 and len(intervals) == 2, then

durations = np.abs(np.diff(intervals, axis=-1)).squeeze()

returns an array with shape (), and the line

durations = durations[valid_idx]

which indexes it with a boolean array of size 1, gives the following numpy error:

IndexError: 0-dimensional arrays can't be indexed
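
One possible fix (an illustration, not necessarily the one adopted) is to promote the squeezed result back to one dimension before the boolean indexing:

    import numpy as np

    intervals = np.array([[0.0, 1.5]])   # a single labeled interval, i.e. two boundary values
    durations = np.abs(np.diff(intervals, axis=-1)).squeeze()

    # squeeze() on a single row yields a 0-d array; promote it so boolean indexing works.
    durations = np.atleast_1d(durations)
    valid_idx = np.array([True])
    durations = durations[valid_idx]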

validate input data

Data import should throw an exception if the input is not well-formed.

Example:

0.00 5.12 INTRO
5.12 4.12 VERSE
...

should fail loudly upon import.
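
A sketch of the kind of check the importer could run; the example above should fail because the second interval ends (4.12) before it starts (5.12):

    import numpy as np

    def validate_intervals(intervals):
        """Raise if any interval ends before it starts, or if starts are unsorted."""
        intervals = np.asarray(intervals)
        if np.any(intervals[:, 1] < intervals[:, 0]):
            raise ValueError('Found an interval whose end time precedes its start time.')
        if np.any(np.diff(intervals[:, 0]) < 0):
            raise ValueError('Interval start times are not sorted.')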

RFC: rounding time-stamps on import and/or adjustment to ground truth?

Should we add a feature to automatically round (segment boundary) timestamps upon import?

This would provide more reliable behavior in util.adjust_segments when trimming a prediction file against a ground truth annotation. For example, a prediction file might list the end-of-track at 201.955, while the ground truth has it at 201.96. In this case, we probably don't want or need to append a final marker.

Note that the frame size parameter in the various frame clustering metrics already takes care of this, but the effects are somewhat opaque and hidden from the user.

Moreover, boundary-detection metrics currently just trim the first and last boundaries, assuming that they indicate start- and end- of track. Rounding and uniq'ing the times upon import would make this behavior a little more reliable.
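
A sketch of rounding-and-deduplication on import (the two-decimal precision is only an example):

    import numpy as np

    def round_boundaries(boundaries, decimals=2):
        """Round boundary times and drop the duplicates produced by the rounding."""
        return np.unique(np.round(np.asarray(boundaries), decimals=decimals))

    # Near-duplicate end-of-track times (e.g. 201.9551 vs. 201.96) collapse to one boundary.
    boundaries = round_boundaries([0.0, 201.9551, 201.96])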
