State of the art: Audio tag classification (Genre)

In this post, I shall talk about state-of-the-art approaches to the Music Genre Recognition (MGR) task. Despite many published works, MGR remains a compelling problem for a machine to solve. The variety of approaches used for evaluating performance in MGR had, until recently, never been surveyed (Sturm 2012a [1]). How does one measure the capacity of a system (living or not) to recognise and discriminate between abstract characteristics of the human phenomenon of music?

Evaluation methods

Surprisingly little has been written about evaluation, i.e., experimental design, data, and figures of merit (FoMs), with respect to MGR (Sturm 2012a [1]). An experimental design is a method for testing a hypothesis; data are the material on which a system is tested; and an FoM reflects the confidence in the hypothesis after conducting an experiment. Of the three review articles devoted in large part to MGR (Aucouturier and Pachet 2003 [2]; Scaringella et al. 2006 [3]; Fu et al. 2011 [4]), only Aucouturier and Pachet (2003) [2] give a brief paragraph on evaluation. The work by Vatolkin (2012) [5] provides a comparison of various performance statistics for music classification. Figure 1 shows the annual number of publications in MGR, and the proportion that use formal statistical testing when comparing MGR systems.

Fig. 1: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig1_HTML.gif

Annual numbers of references in MGR, divided into those that do and do not use formal statistical tests for making comparisons (Sturm 2012a) [1]. Only about 12 % of references in MGR employ formal statistical testing, and only 19.4 % of the work (91 papers) appears at the Conference of the International Society for Music Information Retrieval.
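Formal statistical testing of the kind counted in Fig. 1 can be as simple as McNemar's exact test on the trials where two systems disagree. The sketch below is illustrative only: the labels and predictions are synthetic stand-ins, not results from any published system.

```python
# Minimal sketch: McNemar's exact test comparing two MGR systems that
# classify the same test excerpts. All data here are synthetic stand-ins.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 1000
y_true = rng.integers(0, 10, size=n)   # 10 hypothetical genre labels
pred_a = np.where(rng.random(n) < 0.80, y_true, rng.integers(0, 10, size=n))
pred_b = np.where(rng.random(n) < 0.75, y_true, rng.integers(0, 10, size=n))

# McNemar's test looks only at discordant trials: those one system gets
# right and the other gets wrong.
a_only = int(np.sum((pred_a == y_true) & (pred_b != y_true)))
b_only = int(np.sum((pred_a != y_true) & (pred_b == y_true)))

# Under the null hypothesis of equal accuracy, a_only is Binomial(m, 0.5)
# with m = a_only + b_only discordant trials.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(f"discordant trials: {a_only + b_only}, p = {result.pvalue:.4g}")
```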

Table 1

Ten experimental designs used in MGR, and the percentage of the 435 references with an experimental component in the survey of Sturm (2012a) [1] that employ them

Design | Description | %
Classify | To answer the question, “How well does the system predict the genres used by music?” The system applies genre labels to music, which the researcher then compares to a “ground truth” | 91
Features | To answer the question, “At what is the system looking to identify the genres used by music?” The system ranks and/or selects features, which the researcher then inspects | 33
Generalize | To answer the question, “How well does the system identify genre in varied datasets?” Classify with two or more datasets having different genres, and/or various amounts of training data | 16
Robust | To answer the question, “To what extent is the system invariant to aspects inconsequential for identifying genre?” The system classifies music that the researcher modifies or transforms in ways that do not harm its genre identification by a human | 7
Eyeball | To answer the question, “How well do the parameters make sense with respect to identifying genre?” The system derives parameters from music, which the researcher then visually compares | 7
Cluster | To answer the question, “How well does the system group together music using the same genres?” The system creates clusters of the dataset, which the researcher then inspects | 7
Scale | To answer the question, “How well does the system identify music genre with varying numbers of genres?” Classify with varying numbers of genres | 7
Retrieve | To answer the question, “How well does the system identify music using the same genres used by the query?” The system retrieves music similar to a query, which the researcher then inspects | 4
Rules | To answer the question, “What are the decisions the system is making to identify genres?” The researcher inspects the rules the system uses to identify genres | 4
Compose | To answer the question, “What are the internal genre models of the system?” The system creates music in specific genres, which the researcher then inspects | —
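Most of these designs build on Classify. A minimal sketch of the Classify design, assuming generic extracted features X and genre labels y (synthetic here) and an arbitrary classifier, is stratified 10-fold cross-validation scored against the ground truth:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for extracted audio features and genre labels.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

# The Classify design: predict labels for held-out music and compare them
# to the ground truth, here via stratified 10-fold cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```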

Table 2

Datasets used in MGR, the type of data they contain, and the percentage of the experimental work (435 references) that uses them

Dataset | Description | %
Private | Constructed for research but not made available | 58
GTZAN | Audio; http://marsyas.info/download/data_sets | 23
ISMIR2004 | Audio; http://ismir2004.ismir.net/genre_contest | 17
Latin (Silla et al. 2008) | Features; http://www.ppgia.pucpr.br/~silla/lmd/ | 5
Ballroom | Audio; http://mtg.upf.edu/ismir2004/contest/tempoContest/ | 3
Homburg (Homburg et al. 2005) | Audio; http://www-ai.cs.uni-dortmund.de/audio.html | 3
Bodhidharma | Symbolic; http://jmir.sourceforge.net/Codaich.html | 3
USPOP2002 (Berenzweig et al. 2004) | Audio; http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html | 2
1517-artists | Audio; http://www.seyerlehner.info | 1
RWC (Goto et al. 2003) | Audio; http://staff.aist.go.jp/m.goto/RWC-MDB/ | 1
SOMeJB | Features; http://www.ifs.tuwien.ac.at/~andi/somejb/ | 1
SLAC | Audio & symbolic; http://jmir.sourceforge.net/Codaich.html | 1
SALAMI (Smith et al. 2011) | Features; http://ddmal.music.mcgill.ca/research/salami | 0.7
Unique | Features; http://www.seyerlehner.info | 0.7
Million Song (Bertin-Mahieux et al. 2011) | Features; http://labrosa.ee.columbia.edu/millionsong/ | 0.7
ISMIS2011 | Features; http://tunedit.org/challenge/music-retrieval | 0.4

(All datasets listed after Private are public.)
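For concreteness, here is a hedged sketch of turning a GTZAN-style audio dataset into “bag of frames” features. The directory layout (genres/<genre>/<track>.au) and the feature choice (20 MFCCs per frame, summarized by mean and standard deviation) are assumptions for illustration, not the pipeline of any particular system above.

```python
from pathlib import Path
import numpy as np
import librosa

def track_features(path):
    # 30 s of audio -> one fixed-length "bag of frames" summary vector:
    # mean and standard deviation of 20 MFCCs over all frames.
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

root = Path("genres")                      # assumed GTZAN root directory
X, labels = [], []
for track in sorted(root.glob("*/*.au")):  # assumed <genre>/<track>.au layout
    X.append(track_features(track))
    labels.append(track.parent.name)       # the genre is the directory name
X = np.stack(X)
```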

Table 3

Figures of merit (FoMs) of MGR, their descriptions, and the percentage of work (467 references) that use them

FoM | Description | %
Mean accuracy | Proportion of the number of correct trials to the total number of trials | 82
Confusion table | Counts of labeling outcomes for each labeled input | 32
Recall | For a specific input label, proportion of the number of correct trials to the total number of trials with that label | 25
Confusions | Discussion of the confusions of the system, in general or with specifics | 24
Precision | For a specific output label, proportion of the number of correct trials to the total number of trials with that label | 10
F-measure | Twice the product of recall and precision divided by their sum | 4
Composition | Observations of the composition of clusters created by the system, and the distances within and between them | 4
Precision@k | Proportion of the number of correct items of a specific label among the k items retrieved | 3
ROC | Curves of operating characteristics (e.g., true positive rate vs. false positive rate, or precision vs. recall) for several systems, parameters, etc. | —
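Several of these FoMs follow directly from the confusion table. A minimal sketch, assuming a hypothetical 3-genre confusion matrix C in which C[i, j] counts inputs of true class i given output label j:

```python
import numpy as np

def foms_from_confusion(C):
    """Per-class FoMs from a confusion matrix C, where C[i, j] is the
    count of inputs with true label i given output label j."""
    tp = np.diag(C).astype(float)
    recall = tp / C.sum(axis=1)          # correct per input (true) label
    precision = tp / C.sum(axis=0)       # correct per output (predicted) label
    f_measure = 2 * recall * precision / (recall + precision)
    mean_accuracy = tp.sum() / C.sum()   # proportion of correct trials overall
    return mean_accuracy, recall, precision, f_measure

# A hypothetical 3-genre confusion table.
C = np.array([[45,  3,  2],
              [ 4, 40,  6],
              [ 1,  7, 42]])
acc, rec, prec, f1 = foms_from_confusion(C)
print(acc.round(3), rec.round(2), prec.round(2), f1.round(2))
```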

Three state-of-the-art systems for music genre recognition

Three MGR systems appear to perform well with respect to state-of-the-art classification accuracy on the GTZAN dataset:

  1. AdaBoost with decision trees and bags of frames of features (AdaBFFs)
  2. Sparse representation classification with auditory temporal modulations (SRCAM)
  3. Maximum a posteriori classification of scattering coefficients (MAPsCAT)
Table 4

Mean accuracies in GTZAN for each system and configuration, and the maximum p-value, max {p_i}, over all 10 CV runs

System | Configuration | Mean acc. ± std. dev. | Max {p_i}
AdaBFFs | Decision stumps | 0.776 ± 0.004 | > 0.024
AdaBFFs | Two-node trees | 0.800 ± 0.006 | > 0.024
SRCAM | Normalized features | 0.835 ± 0.005 | > 0.024
SRCAM | Standardized features | 0.802 ± 0.006 | > 0.024
MAPsCAT | Class-dependent covariances | 0.754 ± 0.004 | < 10^-6
MAPsCAT | Total covariance | 0.830 ± 0.004 | < 10^-6
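The two AdaBFFs rows of Table 4 differ only in the depth of the boosted trees. Here is a sketch of that comparison, not the authors' implementation: AdaBoost over decision stumps versus depth-2 trees (standing in for two-node trees), scored by repeated 10-fold CV to obtain a mean accuracy and standard deviation. The features are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholders for bag-of-frames features and genre labels.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

# 10 repetitions of 10-fold CV, as in the 10x10-fold CV used for Table 4.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for depth, name in [(1, "decision stumps"), (2, "depth-2 trees")]:
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),  # weak learner
        n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```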

Fig. 2: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig2_HTML.gif

Highest reported classification accuracies in GTZAN (Sturm 2013b) [6]. The legend shows the evaluation parameters. The top gray line is the estimated maximum accuracy possible in GTZAN given its repetitions and mislabelings. The five “x” marks are results that are disputed or known to be invalid. The dashed gray line is the accuracy we observe for SRCAM with normalized features and 2-fold CV using an artist-filtered GTZAN without repetitions.

Evaluating performance in particular classes

Figure 3 shows the recalls, precisions, and F-measures for AdaBFFs, SRCAM, and MAPsCAT. These FoMs, which appear infrequently in the MGR literature, can be more specific than mean accuracy, and they measure how a system performs for particular classes. Wu et al. (2011) [7] argue for the relevance of their features to MGR by observing that the empirical recalls for Classical and Rock in GTZAN are above those expected by chance; with respect to precision, Lin et al. (2004) [8] conclude that their system is better than another. In Fig. 3 we see that, for Disco, MAPsCAT using total covariance shows the highest recall (0.76 ± 0.01, std. dev.) of all systems. Since high recall can come at the price of many false positives, we also look at precision; MAPsCAT shows this trade-off for Country, where its high recall is accompanied by low precision. For Classical, MAPsCAT using class-dependent covariances has perfect recall and high precision (0.85 ± 0.01). The F-measure combines recall and precision to reflect class accuracy: AdaBFFs appears to be among the most accurate systems for Classical, and among the least accurate for Disco.
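The per-class FoMs summarized by the boxplots in Fig. 3 below can be gathered by computing recall, precision, and F-measure on every CV fold. A minimal sketch with placeholder data and a placeholder classifier (not SRCAM, MAPsCAT, or AdaBFFs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

recalls = []                      # one row of per-class recalls per fold
for train, test in StratifiedKFold(10, shuffle=True, random_state=0).split(X, y):
    clf = KNeighborsClassifier().fit(X[train], y[train])
    p, r, f, _ = precision_recall_fscore_support(y[test], clf.predict(X[test]))
    recalls.append(r)             # p, r, f are arrays with one entry per class

recalls = np.array(recalls)       # shape (n_folds, n_classes), boxplot-ready
print(recalls.mean(axis=0).round(2))
```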

Fig. 3: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig3_HTML.gif

Boxplots of recalls, precisions, and F-measures in 10×10-fold CV in GTZAN. Classes: Blues (bl), Classical (cl), Country (co), Disco (di), Hip hop (hi), Jazz (ja), Metal (me), Pop (po), Reggae (re), Rock (ro).

 

References

[1] Sturm, B.L. (2012a). A survey of evaluation in music genre recognition. In Proc. Adaptive Multimedia Retrieval.

[2] Aucouturier, J.J., & Pachet, F. (2003). Representing music genre: A state of the art. Journal of New Music Research, 32(1), 83–93.

[3] Scaringella, N., Zoia, G., & Mlynek, D. (2006). Automatic genre classification of music content: A survey. IEEE Signal Processing Magazine, 23(2), 133–141.

[4] Fu, Z., Lu, G., Ting, K.M., & Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2), 303–319.

[5] Vatolkin, I. (2012). Multi-objective evaluation of music classification. In W.A. Gaul, A. Geyer-Schulz, L. Schmidt-Thieme, & J. Kunze (Eds.), Challenges at the interface of data analysis, computer science, and optimization (pp. 401–410). Berlin: Springer.

[6] Sturm, B.L. (2013b). The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. http://arxiv.org/abs/1306.1461.

[7] Wu, M.J., Chen, Z.S., Jang, J.S.R., & Ren, J.M. (2011). Combining visual and acoustic features for music genre classification. In Proc. International Conference on Machine Learning and Applications and Workshops (pp. 124–129).

[8] Lin, C.R., Liu, N.H., Wu, Y.H., & Chen, A. (2004). Music classification using significant repeating patterns. In Y. Lee, J. Li, K.Y. Whang, & D. Lee (Eds.), Database systems for advanced applications (pp. 27–29). Berlin/Heidelberg: Springer.
