This project forked from echonest/msd-examples

Example code for processing the Million Song Dataset and other big music datasets

msd-examples

This repository contains example code for processing the Million Song Dataset (MSD).

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. It is available at:

http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • To provide a shortcut alternative to creating a large dataset with The Echo Nest's API
  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide. Additional datasets have been attached to the Million Song Dataset; so far these contain lyrics and cover songs. The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA. It was supported in part by the NSF.

These examples depend on mrjob, a Python library for running MapReduce jobs on Hadoop or Amazon Web Services. See https://github.com/Yelp/mrjob and http://packages.python.org/mrjob/.

MSD Data on S3

These examples use MSD data that has been loaded onto S3 at s3://tbmmsd. There are around 330 files, each holding about 3000 track records (one record per line). Each line consists of 54 fields, as described here:

http://labrosa.ee.columbia.edu/millionsong/pages/field-list

except that in the flat file format, the 'track id' field has been moved from field 52 to the first field.
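As a rough illustration of the layout described above, the following sketch splits one flat-file line into its track id and the remaining fields. It assumes the fields are tab-separated (suggested by the .tsv file names, but not stated here), and parse_track_line is a hypothetical helper, not part of this repository.

```python
# Sketch only: assumes tab-separated fields (suggested by the .tsv file names).
EXPECTED_FIELDS = 54  # field count stated in the text above


def parse_track_line(line, delimiter="\t"):
    """Split one flat-file line into (track_id, remaining_fields).

    Returns None when the line does not have the expected number of fields.
    """
    fields = line.rstrip("\r\n").split(delimiter)
    if len(fields) != EXPECTED_FIELDS:
        return None
    # In the flat file format the track id is the first field.
    return fields[0], fields[1:]


if __name__ == "__main__":
    # Synthetic line: a made-up track id followed by 53 placeholder values.
    sample = "\t".join(["TRHYPOTHETICAL001"] + ["x"] * 53)
    track_id, rest = parse_track_line(sample)
    print(track_id, len(rest))
```

Rejecting lines with an unexpected field count is a design choice for the sketch; a real job might prefer to log or count such lines instead.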

In the repository you will find tiny.dat, which contains data for 20 tracks.

Map-reduce jobs

Density

Finds the most dense and the least dense songs

density.py
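The code of density.py is not reproduced here. As a loose sketch of the map-reduce pattern such a job could follow, the example below emits one density value per track in a map step and picks the extremes in a reduce step. The density definition (events per second), the function names, and the sample tracks are all assumptions made for illustration; the sketch uses plain Python rather than mrjob.

```python
# Sketch only: "density" here is a stand-in definition (events per second),
# not necessarily the one used by density.py.
def map_density(track_id, event_count, duration_seconds):
    """Map step: yield one (key, (density, track_id)) pair per usable track."""
    if duration_seconds > 0:
        yield "density", (event_count / float(duration_seconds), track_id)


def reduce_extremes(values):
    """Reduce step: return the (least dense, most dense) pairs."""
    values = list(values)
    return min(values), max(values)


if __name__ == "__main__":
    # Hypothetical tracks: (track_id, event_count, duration_seconds).
    tracks = [("A", 100, 50.0), ("B", 900, 100.0), ("C", 30, 60.0)]
    pairs = [value for t in tracks for _, value in map_density(*t)]
    least, most = reduce_extremes(pairs)
    print("least dense:", least)
    print("most dense:", most)
```

Emitting every value under a single key sends all tracks to one reducer, which keeps the sketch simple but would not scale as well as pre-aggregating extremes per mapper.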

Local Usage:

python density.py tiny.dat

EC2 Usage:

This runs the job on Amazon Elastic MapReduce (EMR) using 100 small EC2 instances. Note that you first have to package the track.py code into t.tar.gz with:

% tar cvfz t.tar.gz track.py

To run the job on 100 CPUs over all of the MSD, use:

% python density.py --num-ec2-instances 100 --python-archive t.tar.gz -r emr 's3://tbmmsd/*.tsv.*' > output.dat

(Of course, you will need to set up your Amazon credentials first. See http://packages.python.org/mrjob/writing-and-running.html#running-on-emr )
