State of the art: Audio tag classification (Genre)

In this post, I shall talk about state-of-the-art approaches to the Music Genre Recognition (MGR) task. Despite many published works, MGR remains a compelling problem for a machine to solve. The variety of approaches used for evaluating performance in MGR had, until recently, never been surveyed (Sturm 2012a [1]). How does one measure the capacity of a system (living or not) to recognise and discriminate between abstract characteristics of the human phenomenon of music?

Evaluation methods

Surprisingly little has been written about evaluation, i.e., experimental design, data, and figures of merit (FoMs), with respect to MGR (Sturm 2012a [1]). An experimental design is a method for testing a hypothesis; data are the material on which a system is tested; and an FoM reflects the confidence in the hypothesis after conducting an experiment. Of the three review articles devoted in large part to MGR (Aucouturier and Pachet 2003 [2]; Scaringella et al. 2006 [3]; Fu et al. 2011 [4]), only Aucouturier and Pachet (2003) [2] give a brief paragraph on evaluation. The work by Vatolkin (2012) [5] provides a comparison of various performance statistics for music classification. Figure 1 shows the annual number of publications in MGR, and the proportion that use formal statistical testing when comparing MGR systems.

Fig. 1: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig1_HTML.gif

Annual numbers of references in MGR, divided into those that do and do not use formal statistical tests for making comparisons (Sturm 2012a) [1]. Only about 12 % of references in MGR employ formal statistical testing, and only 19.4 % of the work (91 papers) appears at the Conference of the International Society for Music Information Retrieval.
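Formal statistical testing of the kind counted in Fig. 1 can be as simple as McNemar's exact test on the trials where two systems disagree. The sketch below is illustrative only: the labels and predictions are synthetic stand-ins, not results from any published system.

```python
# Minimal sketch: McNemar's exact test comparing two MGR systems that
# classify the same test excerpts. All data here are synthetic stand-ins.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 1000
y_true = rng.integers(0, 10, size=n)   # 10 hypothetical genre labels
pred_a = np.where(rng.random(n) < 0.80, y_true, rng.integers(0, 10, size=n))
pred_b = np.where(rng.random(n) < 0.75, y_true, rng.integers(0, 10, size=n))

# McNemar's test looks only at discordant trials: those one system gets
# right and the other gets wrong.
a_only = int(np.sum((pred_a == y_true) & (pred_b != y_true)))
b_only = int(np.sum((pred_a != y_true) & (pred_b == y_true)))

# Under the null hypothesis of equal accuracy, a_only is Binomial(m, 0.5)
# with m = a_only + b_only discordant trials.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(f"discordant trials: {a_only + b_only}, p = {result.pvalue:.4g}")
```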

Table 1

Ten experimental designs used in MGR, and the percentage of the 435 references with an experimental component in the survey of Sturm (2012a) [1] that employ them

Design | Description | %
Classify | To answer the question, “How well does the system predict the genres used by music?” The system applies genre labels to music, which the researcher then compares to a “ground truth” | 91
Features | To answer the question, “At what is the system looking to identify the genres used by music?” The system ranks and/or selects features, which the researcher then inspects | 33
Generalize | To answer the question, “How well does the system identify genre in varied datasets?” Classify with two or more datasets having different genres, and/or various amounts of training data | 16
Robust | To answer the question, “To what extent is the system invariant to aspects inconsequential for identifying genre?” The system classifies music that the researcher modifies or transforms in ways that do not harm its genre identification by a human | 7
Eyeball | To answer the question, “How well do the parameters make sense with respect to identifying genre?” The system derives parameters from music, which the researcher then visually compares | 7
Cluster | To answer the question, “How well does the system group together music using the same genres?” The system creates clusters of the dataset, which the researcher then inspects | 7
Scale | To answer the question, “How well does the system identify music genre with varying numbers of genres?” Classify with varying numbers of genres | 7
Retrieve | To answer the question, “How well does the system identify music using the same genres used by the query?” The system retrieves music similar to a query, which the researcher then inspects | 4
Rules | To answer the question, “What are the decisions the system is making to identify genres?” The researcher inspects the rules the system uses to identify genres | 4
Compose | To answer the question, “What are the internal genre models of the system?” The system creates music in specific genres, which the researcher then inspects | —
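Most of these designs build on Classify. A minimal sketch of the Classify design, assuming generic extracted features X and genre labels y (synthetic here) and an arbitrary classifier, is stratified 10-fold cross-validation scored against the ground truth:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for extracted audio features and genre labels.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

# The Classify design: predict labels for held-out music and compare them
# to the ground truth, here via stratified 10-fold cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```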

Table 2

Datasets used in MGR, the type of data they contain, and the percentage of the experimental work (435 references) that uses them

Dataset | Description | %
Private | Constructed for research but not made available | 58
GTZAN | Audio; http://marsyas.info/download/data_sets | 23
ISMIR2004 | Audio; http://ismir2004.ismir.net/genre_contest | 17
Latin (Silla et al. 2008) | Features; http://www.ppgia.pucpr.br/~silla/lmd/ | 5
Ballroom | Audio; http://mtg.upf.edu/ismir2004/contest/tempoContest/ | 3
Homburg (Homburg et al. 2005) | Audio; http://www-ai.cs.uni-dortmund.de/audio.html | 3
Bodhidharma | Symbolic; http://jmir.sourceforge.net/Codaich.html | 3
USPOP2002 (Berenzweig et al. 2004) | Audio; http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html | 2
1517-artists | Audio; http://www.seyerlehner.info | 1
RWC (Goto et al. 2003) | Audio; http://staff.aist.go.jp/m.goto/RWC-MDB/ | 1
SOMeJB | Features; http://www.ifs.tuwien.ac.at/~andi/somejb/ | 1
SLAC | Audio & symbolic; http://jmir.sourceforge.net/Codaich.html | 1
SALAMI (Smith et al. 2011) | Features; http://ddmal.music.mcgill.ca/research/salami | 0.7
Unique | Features; http://www.seyerlehner.info | 0.7
Million Song (Bertin-Mahieux et al. 2011) | Features; http://labrosa.ee.columbia.edu/millionsong/ | 0.7
ISMIS2011 | Features; http://tunedit.org/challenge/music-retrieval | 0.4

(All datasets listed after Private are public.)
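For concreteness, here is a hedged sketch of turning a GTZAN-style audio dataset into “bag of frames” features. The directory layout (genres/<genre>/<track>.au) and the feature choice (20 MFCCs per frame, summarized by mean and standard deviation) are assumptions for illustration, not the pipeline of any particular system above.

```python
from pathlib import Path
import numpy as np
import librosa

def track_features(path):
    # 30 s of audio -> one fixed-length "bag of frames" summary vector:
    # mean and standard deviation of 20 MFCCs over all frames.
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

root = Path("genres")                      # assumed GTZAN root directory
X, labels = [], []
for track in sorted(root.glob("*/*.au")):  # assumed <genre>/<track>.au layout
    X.append(track_features(track))
    labels.append(track.parent.name)       # the genre is the directory name
X = np.stack(X)
```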

Table 3

Figures of merit (FoMs) of MGR, their descriptions, and the percentage of work (467 references) that use them

FoM | Description | %
Mean accuracy | Proportion of the number of correct trials to the total number of trials | 82
Confusion table | Counts of labeling outcomes for each labeled input | 32
Recall | For a specific input label, proportion of the number of correct trials to the total number of trials with that label | 25
Confusions | Discussion of the confusions of the system, in general or with specifics | 24
Precision | For a specific output label, proportion of the number of correct trials to the total number of trials with that label | 10
F-measure | Twice the product of recall and precision divided by their sum | 4
Composition | Observations of the composition of clusters created by the system, and the distances within and between them | 4
Precision@k | Proportion of the number of correct items of a specific label among the k items retrieved | 3
ROC | Curves of operating characteristics (e.g., true positive rate vs. false positive rate, or precision vs. recall) for several systems, parameters, etc. | —
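Several of these FoMs follow directly from the confusion table. A minimal sketch, assuming a hypothetical 3-genre confusion matrix C in which C[i, j] counts inputs of true class i given output label j:

```python
import numpy as np

def foms_from_confusion(C):
    """Per-class FoMs from a confusion matrix C, where C[i, j] is the
    count of inputs with true label i given output label j."""
    tp = np.diag(C).astype(float)
    recall = tp / C.sum(axis=1)          # correct per input (true) label
    precision = tp / C.sum(axis=0)       # correct per output (predicted) label
    f_measure = 2 * recall * precision / (recall + precision)
    mean_accuracy = tp.sum() / C.sum()   # proportion of correct trials overall
    return mean_accuracy, recall, precision, f_measure

# A hypothetical 3-genre confusion table.
C = np.array([[45,  3,  2],
              [ 4, 40,  6],
              [ 1,  7, 42]])
acc, rec, prec, f1 = foms_from_confusion(C)
print(acc.round(3), rec.round(2), prec.round(2), f1.round(2))
```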

Three state-of-the-art systems for music genre recognition

Three MGR systems appear to perform well with respect to state-of-the-art classification accuracy on the GTZAN dataset:

  1. AdaBoost with decision trees and bags of frames of features (AdaBFFs)
  2. Sparse representation classification with auditory temporal modulations (SRCAM)
  3. Maximum a posteriori classification of scattering coefficients (MAPsCAT)
Table 4

Mean accuracies in GTZAN for each system and configuration, and the maximum p-value, max {p_i}, over all 10 CV runs

System | Configuration | Mean acc. ± std. dev. | Max {p_i}
AdaBFFs | Decision stumps | 0.776 ± 0.004 | > 0.024
AdaBFFs | Two-node trees | 0.800 ± 0.006 | > 0.024
SRCAM | Normalized features | 0.835 ± 0.005 | > 0.024
SRCAM | Standardized features | 0.802 ± 0.006 | > 0.024
MAPsCAT | Class-dependent covariances | 0.754 ± 0.004 | < 10^-6
MAPsCAT | Total covariance | 0.830 ± 0.004 | < 10^-6
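The two AdaBFFs rows of Table 4 differ only in the depth of the boosted trees. Here is a sketch of that comparison, not the authors' implementation: AdaBoost over decision stumps versus depth-2 trees (standing in for two-node trees), scored by repeated 10-fold CV to obtain a mean accuracy and standard deviation. The features are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholders for bag-of-frames features and genre labels.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

# 10 repetitions of 10-fold CV, as in the 10x10-fold CV used for Table 4.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for depth, name in [(1, "decision stumps"), (2, "depth-2 trees")]:
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),  # weak learner
        n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```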

Fig. 2: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig2_HTML.gif

Highest reported classification accuracies in GTZAN (Sturm 2013b) [6]. The legend shows the evaluation parameters. The top gray line is the estimated maximum accuracy possible in GTZAN given its repetitions and mislabelings. The five “x” marks are results that are disputed or known to be invalid. The dashed gray line is the accuracy we observe for SRCAM with normalized features and 2-fold CV using an artist-filtered GTZAN without repetitions.

Evaluating performance in particular classes

Figure 3 shows the recalls, precisions, and F-measures for AdaBFFs, SRCAM, and MAPsCAT. These FoMs, which appear infrequently in the MGR literature, can be more specific than mean accuracy, and they measure how a system performs for particular classes. Wu et al. (2011) [7] argue for the relevance of their features to MGR by observing that the empirical recalls for Classical and Rock in GTZAN are above those expected by chance; with respect to precision, Lin et al. (2004) [8] conclude that their system is better than another. In Fig. 3 we see that, for Disco, MAPsCAT using total covariance shows the highest recall (0.76 ± 0.01, std. dev.) of all systems. Since high recall can come at the price of many false positives, we also look at precision; MAPsCAT shows this trade-off for Country, where its high recall is accompanied by low precision. For Classical, MAPsCAT using class-dependent covariances has perfect recall and high precision (0.85 ± 0.01). The F-measure combines recall and precision to reflect class accuracy: AdaBFFs appears to be among the most accurate systems for Classical, and among the least accurate for Disco.
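The per-class FoMs summarized by the boxplots in Fig. 3 below can be gathered by computing recall, precision, and F-measure on every CV fold. A minimal sketch with placeholder data and a placeholder classifier (not SRCAM, MAPsCAT, or AdaBFFs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=10, random_state=0)

recalls = []                      # one row of per-class recalls per fold
for train, test in StratifiedKFold(10, shuffle=True, random_state=0).split(X, y):
    clf = KNeighborsClassifier().fit(X[train], y[train])
    p, r, f, _ = precision_recall_fscore_support(y[test], clf.predict(X[test]))
    recalls.append(r)             # p, r, f are arrays with one entry per class

recalls = np.array(recalls)       # shape (n_folds, n_classes), boxplot-ready
print(recalls.mean(axis=0).round(2))
```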

Fig. 3: https://static-content.springer.com/image/art%3A10.1007%2Fs10844-013-0250-y/MediaObjects/10844_2013_250_Fig3_HTML.gif

Boxplots of recalls, precisions, and F-measures in 10×10-fold CV in GTZAN. Classes: Blues (bl), Classical (cl), Country (co), Disco (di), Hip hop (hi), Jazz (ja), Metal (me), Pop (po), Reggae (re), Rock (ro).

 

References

[1] Sturm, B.L. (2012a). A survey of evaluation in music genre recognition. In Proc. Adaptive Multimedia Retrieval.

[2] Aucouturier, J.J., & Pachet, F. (2003). Representing music genre: A state of the art. Journal of New Music Research, 32(1), 83–93.

[3] Scaringella, N., Zoia, G., & Mlynek, D. (2006). Automatic genre classification of music content: A survey. IEEE Signal Processing Magazine, 23(2), 133–141.

[4] Fu, Z., Lu, G., Ting, K.M., & Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2), 303–319.

[5] Vatolkin, I. (2012). Multi-objective evaluation of music classification. In W.A. Gaul, A. Geyer-Schulz, L. Schmidt-Thieme, & J. Kunze (Eds.), Challenges at the interface of data analysis, computer science, and optimization (pp. 401–410). Berlin: Springer.

[6] Sturm, B.L. (2013b). The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. http://arxiv.org/abs/1306.1461.

[7] Wu, M.J., Chen, Z.S., Jang, J.S.R., & Ren, J.M. (2011). Combining visual and acoustic features for music genre classification. In Proc. International Conference on Machine Learning and Applications and Workshops (pp. 124–129).

[8] Lin, C.R., Liu, N.H., Wu, Y.H., & Chen, A. (2004). Music classification using significant repeating patterns. In Y. Lee, J. Li, K.Y. Whang, & D. Lee (Eds.), Database systems for advanced applications (pp. 27–29). Berlin/Heidelberg: Springer.
