
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models


Benno Weck*1, Ilaria Manco*2,3, Emmanouil Benetos2, Elio Quinton3, George Fazekas2, Dmitry Bogdanov1

1 UPF, 2 QMUL, 3 UMG

* equal contribution

This repository contains code and data for the paper MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models (ISMIR 2024).

[TODO]

Quick Links

Data

The dataset is available to download from Zenodo:

wget -P data https://zenodo.org/record/12709974/files/muchomusic.csv

or via the Hugging Face Hub, where it can be loaded with the 🤗 Datasets library:

from datasets import load_dataset
MuchoMusic = load_dataset("mulab-mir/muchomusic")
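
For a quick check that the download worked, a minimal sketch like the one below prints the available splits and the first example; split and field names are read from the loaded object rather than assumed here:

from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub
muchomusic = load_dataset("mulab-mir/muchomusic")

# List the splits exposed on the Hub and look at one example
print(muchomusic)
first_split = next(iter(muchomusic))
print(muchomusic[first_split][0])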

Code setup

To use this code, we recommend creating a new Python 3 virtual environment:

python -m venv venv 
source venv/bin/activate

Then, clone the repository and install the dependencies:

git clone https://github.com/mulab-mir/muchomusic.git
cd muchomusic
pip install -r requirements.txt

This codebase has been tested with Python 3.11.5.

Code Structure

muchomusic         
├── data            
│   └── muchomusic.csv
├── dataset_creation                # code to generate and validate the dataset
├── muchomusic_eval                 # evaluation code
│   ├── configs                     # folder to store the config files for evaluation experiments
│   └── ...    
├── evaluate.py                     # script to run the evaluation
└── prepare_prompts.py              # script to prepare the benchmark prompts

Prepare the model outputs for the benchmark

Inputs to the benchmark should be given as a JSON object with the following format:

{
    "id": 415600,
    "prompt": "Question: What rhythm pattern do the digital drums follow? Options: (A) Four on the floor. (B) Off-beat syncopation. (C) Scat singing. (D) E-guitar playing a  simple melody. The correct answer is: ",
    "answers": [
        "Pop music",
        "Reggae",
        "Latin rock",
        "Ska"
    ],
    "answer_orders": [
        3,
        0,
        2,
        1
    ],
    "dataset": "sdd",
    "genre": "Reggae",
    "reasoning": [
        "genre and style"
    ],
    "knowledge": [],
    "audio_path": "data/sdd/audio/00/415600.2min.mp3",
    "model_output": "A"
}

To generate this file, first run:

python prepare_prompts.py --output_path <path_to_json_file>

Then obtain the model's prediction for each (audio, text) pair, formed by prompt and the corresponding audio file at audio_path, and populate the model_output field accordingly.
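
As a rough illustration of this step, the sketch below reads the generated file, queries a model, and writes the predictions back. Here run_model is a hypothetical stand-in for the audio-language model under evaluation, the file paths are placeholders, and the file is assumed to be a JSON list of objects in the format shown above.

import json

def run_model(prompt: str, audio_path: str) -> str:
    # Hypothetical wrapper around the model under evaluation;
    # replace with a real inference call that returns the raw text answer.
    return "A"

# Read the prompts produced by prepare_prompts.py (assumed to be a JSON list)
with open("prompts.json") as f:
    examples = json.load(f)

# Query the model on each (audio, text) pair and record its answer
for example in examples:
    example["model_output"] = run_model(example["prompt"], example["audio_path"])

# Save the populated file for evaluation
with open("model_outputs.json", "w") as f:
    json.dump(examples, f, indent=4)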

Run the evaluation

python evaluate.py --output_dir <path_to_results_dir>

After running the code, the results will be stored in <path_to_results_dir>.

Citation

If you use the code in this repo, please consider citing our work:

@inproceedings{weck2024muchomusic,
   title = {MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models},
   author = {Weck, Benno and Manco, Ilaria and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Bogdanov, Dmitry},
   booktitle = {Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},
   year = {2024}
}

License

This repository is released under the MIT License. Please see the LICENSE file for more details. The dataset is released under the CC BY-SA 4.0 license.

Contact

If you have any questions, please get in touch: benno.weck01@estudiant.upf.edu, i.manco@qmul.ac.uk.

If you find a problem when using the code, you can also open an issue.
