vocalpy / crowsetta

A tool to work with any format for annotating vocalizations

Home Page: https://crowsetta.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License


crowsetta's Introduction



A Python tool to work with any format for annotating animal vocalizations and bioacoustics data

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


crowsetta provides a Pythonic way to work with annotation formats for animal vocalizations and bioacoustics data. These formats are used, for example, by applications that enable users to annotate audio and/or spectrograms. Such annotations typically include the times when sound events start and stop, and labels that assign each sound to some set of classes chosen by the annotator. crowsetta has built-in support for many widely used formats, such as Audacity label tracks, Praat .TextGrid files, and Raven .txt files.
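crowsetta parses these formats for you, but to make the idea concrete, here is a rough stdlib-only sketch of reading one of them: an Audacity label track stores one annotation per line as tab-separated start time, end time, and label (times in seconds). `parse_label_track` is a hypothetical helper for illustration, not part of the crowsetta API.

```python
import csv
import io

# Hypothetical minimal parser for an Audacity label track:
# each line is "start\tend\tlabel", with times in seconds.
def parse_label_track(text):
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        {"onset_s": float(row[0]), "offset_s": float(row[1]), "label": row[2]}
        for row in reader
    ]

track = "0.50\t1.25\ta\n1.30\t2.00\tb\n"
annots = parse_label_track(track)
print(annots[0]["label"])  # label of the first annotated segment
```

In practice crowsetta handles details a sketch like this ignores, such as the extra frequency-range lines Audacity can write and the differences between formats.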


example spectrogram showing Bengalese finch song with Praat TextGrid annotations indicated as segments underneath

Spectrogram of the song of a Bengalese finch with syllables annotated as segments underneath. Annotations parsed by crowsetta from a file in the Praat .TextGrid format. Example song from Bengalese finch song dataset, Tachibana and Morita 2021, adapted under CC-By-4.0 License.




example spectrogram from field recording with Raven annotations of birdsong indicated as rectangular bounding boxes

Spectrogram of a field recording with annotations of songs of different bird species indicated as bounding boxes. Annotations parsed by crowsetta from a file in the Raven Selection Table format. Example song from "An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information", Chronister et al., 2021, adapted under CC0 1.0 License.


Who would want to use crowsetta? Anyone who works with animal vocalizations or other bioacoustics data that is annotated in some way. Maybe you are a neuroscientist trying to figure out how songbirds learn their song, or why mice emit ultrasonic calls. Maybe you are an ecologist studying dialects of finches distributed across Asia, a linguist studying accents in the Caribbean, or a speech pathologist looking for phonetic changes that indicate early-onset Alzheimer's disease. crowsetta makes it easier to work with your annotations in Python, regardless of the format.

Features

  • take advantage of built-in support for many widely used formats, such as Audacity label tracks, Praat .TextGrid files, and Raven .txt files.
  • work with any format by remembering just one class:
    annot = crowsetta.Transcriber(format='format').from_file('annotations.ext')
    • no need to remember different functions for different formats
  • when needed, use classes that represent the formats to write readable scripts and libraries
  • convert annotations to common file formats like .csv that anyone can work with
  • work with custom formats that are not built in to crowsetta by writing simple classes, leveraging abstractions that can represent a wide array of annotation formats
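The "remember just one class" idea amounts to dispatch over a registry of format parsers. Here is a hedged, stdlib-only sketch of that design; the registry, decorator, and `simple-seq` parser are all hypothetical names invented for illustration, not the real crowsetta internals.

```python
# Hypothetical sketch of "one class for any format":
# a registry maps format names to parsing functions, so a
# caller only needs a format name and a file path.
PARSERS = {}

def register(name):
    """Decorator that adds a parsing function to the registry."""
    def decorator(func):
        PARSERS[name] = func
        return func
    return decorator

@register("simple-seq")
def parse_simple_seq(path):
    return f"parsed {path} as simple-seq"

class Transcriber:
    def __init__(self, format):
        if format not in PARSERS:
            raise ValueError(f"unknown format: {format}")
        self._parse = PARSERS[format]

    def from_file(self, path):
        return self._parse(path)

annot = Transcriber(format="simple-seq").from_file("annotations.ext")
```

The payoff of this design is the last line: adding a new format means registering one parser, while user code keeps calling the same two names.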

For examples of these features, please see: https://crowsetta.readthedocs.io/en/latest/index.html#features

Getting Started

Installation

with pip

$ pip install crowsetta

with conda

$ conda install crowsetta -c conda-forge

Usage

If you are new to crowsetta, start with the tutorial.

For vignettes showing how to use crowsetta for various tasks, such as working with your own annotation format, please see the how-to section.

Project Information

Background

crowsetta was developed for two libraries:

Support

To report a bug or request a feature (such as a new annotation format), please use the issue tracker on GitHub:
https://github.com/vocalpy/crowsetta/issues

To ask a question about crowsetta, discuss its development, or share how you are using it, please start a new topic on the VocalPy forum with the crowsetta tag:
https://forum.vocalpy.org/

Contribute

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Contributing Guidelines

Below we provide some quick links, but you can learn more about how you can help and give feedback
by reading our Contributing Guide.

To ask a question about crowsetta, discuss its development, or share how you are using it, please start a new "Q&A" topic on the VocalPy forum with the crowsetta tag:
https://forum.vocalpy.org/

To report a bug, or to request a feature, please use the issue tracker on GitHub:
https://github.com/vocalpy/crowsetta/issues

CHANGELOG

You can see project history and work in progress in the CHANGELOG.

License

The project is licensed under the BSD license.

Citation

If you use crowsetta, please cite the DOI.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Tessa Rhinehart
📖 🐛 📓 🤔

Sylvain HAUPERT
💻 🤔 📓

Yannick Jadoul
🤔 🐛 📖 📓

sammlapp
🤔

This project follows the all-contributors specification. Contributions of any kind welcome!

crowsetta's Issues

add `header_segment_map` parameter to `csv2seq` function

so if user has a csv with header different from segment fields, can just provide mapping (i.e. a dict) that specifies which header fields (csv columns) correspond to Segment attributes

so with this header
Onsets, Offsets, Filename, SegmentLabel
you'd use

header_segment_map = {
    'Onsets': 'onsets_s',
    'Offsets': 'offsets_s',
    'Filename': 'file',
    'SegmentLabel': 'label'
    }
crowsetta.csv.csv2seq(csv_filename='my.csv', header_segment_map=header_segment_map)
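A sketch of what applying such a mapping could look like, using only the stdlib `csv` module; the mapping keys mirror the issue above, and none of this is the actual `csv2seq` implementation.

```python
import csv
import io

# Hypothetical sketch of the proposed header_segment_map:
# rename user-supplied CSV columns to Segment attribute names
# while reading rows.
header_segment_map = {
    "Onsets": "onsets_s",
    "Offsets": "offsets_s",
    "Filename": "file",
    "SegmentLabel": "label",
}

csv_text = "Onsets,Offsets,Filename,SegmentLabel\n0.5,1.2,0.wav,a\n"
reader = csv.DictReader(io.StringIO(csv_text))
segments = [
    {header_segment_map[key]: value for key, value in row.items()}
    for row in reader
]
print(segments[0]["label"])  # prints "a"
```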

add `formats` module; put info about each format in that format's module docstring?

in spirit of DRY, instead of having a separate dict in the data module,
the top-level docstring for each format's module should have this metadata,
and there should be a formats module that knows how to parse this

Better if this could be linked with the internal config.ini somehow.

Maybe a Makefile that generates the config.ini?

Or ... each formats module has its own config_dict at the top, and then that gets used through an entry point maybe?

rename `Annotation` to `Vocalization`

because it's not really an "annotation"

it's the high-level abstract object that lets us associate an annotation file with the sequence of annotated segments within that file, and the file that the annotation annotates, e.g. an audio file

so it should be something like:
Vocalization, with attributes `sequence`, `annot_path`, and (optionally) `source_path`

have to_annot functions only return single annot

  • less testing required for different returned types
  • clearer what expected type returned is for downstream user -- won't have to test all their code for e.g. Annot + list of Annots

users will be able to e.g. write a list comprehension so it's not actually that useful to include this extra functionality

add `unique_labels` function?

def unique_labels(seqs):
    """return the set of unique labels across a list of Sequences"""
    all_labels = [label for a_seq in seqs for label in a_seq.labels.tolist()]
    return set(all_labels)

allow for user-defined `tiers` for a Segment, like Praat?

Praat allows for multiple user-defined tiers per segment, e.g. "phoneme", "syllable", "word", "sentence".

http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html

Not sure if that would be easy to add for Crowsetta.
I was thinking it would require the ability to dynamically add attributes to the Segment class, but I guess there could be an optional tiers attribute that's a dict mapping an annotation to each tier for any instance of a Segment.
But even then seq2csv would have to be able to handle mapping these extra tiers. I guess that's not too painful though if we're iterating over Segments anyway. Just would have to make sure all Segments have the same tiers.
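The "optional tiers attribute that's a dict" idea from this issue could be sketched with a stdlib dataclass; this `Segment` is a stand-in invented for illustration, not crowsetta's attrs-based class.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: instead of dynamically adding attributes,
# each Segment carries a dict mapping tier names to labels,
# empty by default so existing usage is unaffected.
@dataclass
class Segment:
    onset_s: float
    offset_s: float
    label: str
    tiers: dict = field(default_factory=dict)

seg = Segment(
    onset_s=0.1,
    offset_s=0.3,
    label="d",
    tiers={"phoneme": "d", "syllable": "da", "word": "dada"},
)
print(seg.tiers["word"])
```

A `seq2csv` writer would then only need to require that every Segment in a Sequence has the same tier keys before emitting them as extra columns.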

fix circular import bug in .formats

  • should import within function show() when called
  • have a similar function load() that does this and then show() calls load() if formats not loaded
  • and then build these function calls into Transcriber
  • how to test?

Add utils module

with utility functions for labels and annotations
for labels just steal from vak
annotations utilities would be e.g. duration of all annotations

add logo

  • "crowsetta stone" image?
    • in doc/index.rst & README.md
  • maybe also image showing GUI with labeling | Filenames | Sequence objects | csv output

make Annotation class

that can have a Stack attribute or a Sequence attribute

mainly because it feels weird and counterintuitive to write

annot : crowsetta.Sequence

in docstrings. No-one will get why an annotation is a Sequence.

should have a mandatory `annot_file` attribute,
and optional `audio_file` and `spect_file` attributes

make `user_config` less fragile

module crashes with a relative path like ./mymodule.py

also, `to_csv` and `to_format` have to be the string 'None' (if not used), not the Python value None, which is annoying to type

have Sequence.segments return a "pretty printed" version?

Seems like __repr__ should be something like

Sequence(segments=15)

and then a pretty_print method would give something like

Sequence with 15 segments:
    Segment 1: label='a', onset_Hz=16000, offset_Hz=17500, onset_s=None, offset_s=None, file='0.wav'
    Segment 2: label='b', onset_Hz=18000, offset_Hz=19500, onset_s=None, offset_s=None, file='0.wav'
   ...
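The `__repr__` / `pretty_print` split proposed above could be sketched like this, using a plain class as a stand-in for the real Sequence:

```python
# Hypothetical sketch: terse __repr__ for interactive use,
# and a separate pretty_print method for the long form.
class Sequence:
    def __init__(self, segments):
        self.segments = segments

    def __repr__(self):
        return f"Sequence(segments={len(self.segments)})"

    def pretty_print(self):
        lines = [f"Sequence with {len(self.segments)} segments:"]
        for num, seg in enumerate(self.segments, start=1):
            lines.append(f"    Segment {num}: {seg}")
        return "\n".join(lines)

seq = Sequence(segments=["label='a' 0.5-1.2s", "label='b' 1.3-2.0s"])
print(repr(seq))  # Sequence(segments=2)
```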

add csv as a format?

so user of e.g. vak can specify 'csv' as format

this would be a place where `to_annot` would have to return a list of annotations, though (see #54)

change default value for `koumura2annot.Wave` parameter

causing vak to crash because the default is written relative to the current working directory, ./Wave

This only works if user is in the right place

Instead, the default should be written relative to the Annotation.xml path, which will always be in the parent directory of the Wave directory, unless someone was actually using the same format somewhere outside this dataset, in which case they could specify the correct location with the non-default Wave argument.
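Deriving the default from the annotation path is a one-liner with pathlib; `default_wave_dir` is a hypothetical helper sketching the proposed fix, not actual crowsetta code.

```python
from pathlib import Path

# Hypothetical sketch: derive the default Wave directory from
# the annotation file's location, not the working directory.
def default_wave_dir(annot_file):
    annot_file = Path(annot_file)
    # Annotation.xml sits in the parent directory of Wave/
    return annot_file.parent / "Wave"

wave_dir = default_wave_dir("/data/Bird0/Annotation.xml")
```

(The `/data/Bird0` path is made up for the example.)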

add `format2seq_func` parameter to `seq2csv`

so that user can avoid writing their own format2csv function

The argument will be a function such as notmat2seq, and if not None then seq2csv will take the seq argument and run it through this format2seq_func like so:

def seq2csv(seq, ..., format2seq_func=None):
     if format2seq_func is not None:
        seq = format2seq_func(seq)

`koumura2annot` throws an error when annot_file is a Path not a str

Traceback (most recent call last):
  File "/home/art/anaconda3/envs/vak-dev/bin/vak", line 11, in <module>
    load_entry_point('vak', 'console_scripts', 'vak')()
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/__main__.py", line 43, in main
    config_file=args.configfile)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/cli.py", line 18, in cli
    prep(toml_path=config_file)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/prep.py", line 162, in prep
    logger=logger)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/dataframe.py", line 124, in from_files
    annot_list = scribe.from_file(annot_file=annot_file)
  File "/home/art/anaconda3/envs/vak-dev/lib/python3.6/site-packages/crowsetta/koumura.py", line 53, in koumura2annot
    if not annot_file.endswith('.xml'):
AttributeError: 'PosixPath' object has no attribute 'endswith'
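The crash comes from calling the `str` method `endswith` on a `pathlib.Path`. One possible fix, sketched here with a hypothetical helper rather than the actual `koumura2annot` code, is to coerce the input to `Path` and check its suffix, so both `str` and `Path` are accepted:

```python
from pathlib import Path

# Hypothetical sketch of a fix: normalize to Path first, then
# validate via Path.suffix instead of str.endswith.
def check_annot_file(annot_file):
    annot_file = Path(annot_file)
    if annot_file.suffix != ".xml":
        raise ValueError(f"annot_file must be an .xml file, got: {annot_file}")
    return annot_file

checked = check_annot_file(Path("Annotation.xml"))
```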

add `from_excel` function / module

mainly as easier way to get stuff out of SAP?
Would be a convenience wrapper around csv2seq that knows to use Excel dialect and look for SAP field names

add Stack class

programmatically instantiated attrs class where each attribute is a Sequence.
A Stack is made up of 2 or more Sequences.

add `seqID` attribute to `Segment`

This will get used when one annotation file contains multiple sequences, and/or each sequence does not correspond to one audio file.
E.g., in the Koumura data set there are multiple sequences per audio file.
Similarly, canary song can be annotated by phrase and the user might want to preserve this annotation.

change annot_file / audio_file attributes of annotation to be Path objects

to not get

TypeError: ("'annot_file' must be <class 'str'> (got PosixPath('/home/ildefonso/Documents/repos/coding/birdsong/tweetynet/tests/test_data/mat/llb3_annot_subset.mat') that is a <class 'pathlib.PosixPath'>).", Attribute(name='annot_file', default=NOTHING, validator=<instance_of validator for type <class 'str'>>, repr=True, eq=True, order=True, hash=None, init=True, metadata=mappingproxy({}), type=None, converter=None, kw_only=False), <class 'str'>, PosixPath('/home/ildefonso/Documents/repos/coding/birdsong/tweetynet/tests/test_data/mat/llb3_annot_subset.mat'))
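One way to accept both `str` and `Path` is to convert on construction. The real class uses attrs (where a `converter=Path` on the attribute would do this); the stdlib-dataclass version below is only a sketch of the same idea:

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical sketch: convert whatever the caller passes into
# a Path in __post_init__, so str and pathlib.Path both work.
@dataclass
class Annotation:
    annot_file: Path
    audio_file: Path = None

    def __post_init__(self):
        self.annot_file = Path(self.annot_file)
        if self.audio_file is not None:
            self.audio_file = Path(self.audio_file)

annot = Annotation(annot_file="llb3_annot_subset.mat")
```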

add "why" and "how" at top of docs

  • why:

    • club project for ppl studying vocalizations
    • tool for munging datasets of vocalizations that have annotated segments
      • so that when working with the dataset, there is no need to be aware of where different files are,
        e.g., the annotation file or files, the audio files, etc.
    • assumes you care about the "segments" part
      • need to include illustration of annotated segments right at top of docs
  • how:

    • Python classes that facilitate representing these datasets
      • a Vocalization that consists of its annotation and the files associated with it
    • end product: a .csv / DataFrame where each row is an annotated segment
