Post 0: Machines on Music

Nicole Cosme
Mar 7

Updated: Apr 6

The figure below shows three representations of the same sound wave: a brief prose description, a spectrogram, and a sequence of numbers. The prose hypothesizes a chord progression, a description of its writer’s perceptual experience of the sound wave. The spectrogram shows how acoustic energy in the sound wave is distributed across frequencies over time. Both are human-legible representations of music, each offering its own strategy for recovering semantic structures from a sound wave.

Figure 1. Three representations of a sound wave: (A) prose description; (B) spectrogram; (C) feature map

The third representation (a feature map produced by a convolutional filter) also recovers something from the sound wave, but it is not clear what this recovered thing is. Produced by a neural network, it is an internal representation that reflects what aspects of the sound wave the model has learned to treat as meaningful for its task. Therefore, it is not immediately human-legible. These examples make two broad points: (1) every representational system constructs its own ontology of musical sound (a theory about what musical sound is and how it is organized), and (2) the ontologies constructed by neural networks are generally opaque to humans.

0.1 Representation systems instantiate ontologies

To construct an ontology of musical sound, a representational system must choose the basic building blocks from which higher-order structures are composed. I will call these primitives. Traditional Western staff notation, for example, selects discrete pitch–time events as its primitives. These events are encoded as note symbols positioned on a staff, and they compose higher-order features, like chords and melodies. The choice of primitives, therefore, determines what kinds of representational distinctions a system can express.

This happens because representational systems carry built-in assumptions and constraints that shape the kinds of features the system can learn from its data. Such inductive biases imply a space of representational possibility: the range of distinctions that the system could express through its operators and their combinations. For example, Western staff notation can represent any combination of note symbols that fit into its staff space.

In practice, only a small subset of these possibilities is actually used when representing a dataset. For example, the representation of a single musical recording selectively instantiates the space of representational possibility, repeatedly realizing some subspaces while leaving others unused. This representation might make heavy use of a specific set of note symbols in specific places on the staff, while leaving other symbols and positions minimally expressed or wholly unexpressed. In doing so, it reveals which parameters are actively mobilized within the broader representational system for a particular piece of data.

0.2 Neural ontologies are opaque to humans

Many music-representational ontologies are closely aligned with acoustic physics. Consider a violin string: its back-and-forth motion when bowed can be well approximated as a combination of simpler motions (normal modes) with characteristic oscillation rates (frequencies). The same holds for the vibrations of most sound-producing systems. When these vibrations drive the surrounding air molecules, the resulting disturbance (a sound wave) inherits the same mixture of oscillations. [1] A sound wave is thus a record of air-pressure change over time, and every sufficiently long sound wave corresponds one-to-one to a mixture of narrowband oscillatory components. Those components have various names (overtones, harmonics, etc.); I will call them partials.

Figure 2. Sound waves can be represented as mixtures of oscillatory components
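
As a rough illustration of the idea that a sound wave corresponds to a mixture of partials, the sketch below synthesizes a wave from three sinusoids and recovers their frequencies with a Fourier transform. The sample rate, duration, partial frequencies, and the helper names `synthesize` and `strongest_partials` are all assumptions chosen for the example; real instrument tones are far less tidy.

```python
import numpy as np

def synthesize(partials, sample_rate=8000, duration=1.0):
    """Sum of sinusoids; `partials` is a list of (frequency_hz, amplitude) pairs."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in partials)

def strongest_partials(wave, sample_rate=8000, count=3):
    """Return the `count` frequencies (Hz) with the largest spectral magnitude."""
    magnitudes = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    top = np.argsort(magnitudes)[-count:]
    return sorted(float(f) for f in freqs[top])

# Hypothetical tone: a 220 Hz fundamental with two weaker partials above it.
partials = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)]
wave = synthesize(partials)
recovered = strongest_partials(wave)
print(recovered)  # the three synthesized frequencies, in Hz
```

The recovery is exact here only because the wave lasts a full second and every partial completes a whole number of cycles in that time; shorter or noisier signals blur the picture.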

Auditory perception research [2] suggests that the cochlea in our inner ear decomposes incoming sound waves into narrowband frequency regions that effectively isolate those narrowband oscillatory components. From these isolated components we build higher-level perceptual structures (e.g., pitch, melody, harmony) and assign semantic meaning to them (“that’s an A”). In sum, if a sound-producing system generates a waveform, then auditory perception (1) represents the waveform as an evolving oscillatory mixture, (2) groups pieces of this mixture into discrete representational objects, and (3) draws semantic meaning over those objects.

But this process is tricky and approximate. The ear must be selective enough to isolate closely spaced narrowband components, which engages a time-frequency uncertainty tradeoff: sharper frequency resolution requires a longer analysis window and therefore coarser time resolution. Even when the ear succeeds, most acoustic sounds are non-redundant: if the same violinist plays the same note in the same room twice in a row, the two notes will have slightly different waveforms and slightly different spectra. The mapping between the waveform of each note and its oscillatory representation may be one-to-one when the waveform is long enough, but the mapping from those oscillatory representations to a perceptual object is many-to-one, and the meaning of that object is highly variable (e.g., a note means different things in different keys).
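
The time-frequency tradeoff can be made concrete with a small numerical sketch. It analyzes two partials 10 Hz apart through a short and a long Hann window and counts the prominent spectral peaks in each case. The frequencies, window lengths, sample rate, peak threshold, and the helper `count_peaks` are assumptions for illustration, not a model of the cochlea.

```python
import numpy as np

SAMPLE_RATE = 8000  # assumed sample rate in Hz

def count_peaks(duration_s, freqs=(440.0, 450.0), pad_to=1 << 16):
    """Count spectral peaks above half the maximum for a windowed two-tone mix."""
    n = int(SAMPLE_RATE * duration_s)
    t = np.arange(n) / SAMPLE_RATE
    wave = sum(np.sin(2 * np.pi * f * t) for f in freqs)
    # Zero-padding only interpolates the spectrum; resolution is set by `n`.
    spectrum = np.abs(np.fft.rfft(wave * np.hanning(n), n=pad_to))
    mid = spectrum[1:-1]
    is_local_max = (mid > spectrum[:-2]) & (mid >= spectrum[2:])
    is_prominent = mid > 0.5 * spectrum.max()
    return int((is_local_max & is_prominent).sum())

short_window = count_peaks(0.02)  # 20 ms: resolution on the order of 50 Hz
long_window = count_peaks(0.5)    # 500 ms: resolution on the order of 2 Hz
print(short_window, long_window)
```

Under these settings the short window should merge the two partials into a single peak while the long window separates them, at the cost of averaging over half a second of signal.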

Human perception handles this variability by forming equivalence classes: similar but non-identical oscillatory mixtures are largely assigned to the same perceptual object, and then a semantic interpretation is rendered (e.g., instances of a C major chord). That is, musical semantics work because of our ability to discretize consistently and in alignment with other humans in our cultural orbit. When we do this, perceptual structures do exhibit semantic patterns similar to language: a chord or pitch can take on different meanings depending on context.

In end-to-end audio models (which currently dominate across generation, classification, and codecs), we want to be able to steer instances of these equivalence classes: recent interpretability work has focused on looking for such structures in the residual streams of the transformers in popular generative models (although they are rarely found). [3] [4] [5] [6] But audio models are not bound to handle the journey from waveform to semantics in the same way humans do. Therefore, neither the equivalence classes nor the objects within them are guaranteed to emerge. If objects like partials, pitches, and chords do emerge, they would have to do so prior to downstream components like sequence models, because these components are meant to learn semantic relationships between objects that already exist as discrete tokens. This points us to the encoder.

The encoder compresses short chunks of raw waveform data into latent embedding vectors, which are then discretized into input tokens for downstream components like sequence models. The sequence model then learns semantic relationships between these tokens. Ideally, the encoder would learn a reliable mapping from spectral structure to embedding vectors, and the sequence model would learn semantic relationships between the resulting tokens. But there's a catch. The encoder operates in the time domain, so it never explicitly represents oscillatory structure. Instead, it uses waveforms as a proxy by which oscillatory structure is implicitly manipulated during compression. This reflects two assumptions: similar waveforms always map to similar spectral structures, and similar spectral structures always map to similar semantic meanings.

The first assumption may hold locally, when waveform shapes are sufficiently similar. However, because the encoder is compressive, distinct waveform segments can also map to similar embeddings, making the mapping between waveform surface and latent vector imperfect. The second assumption is more demanding. In music, semantic structure exists at multiple levels (e.g., partials, pitches, chords) each of which can function independently or in combination. For spectral similarity to support semantic similarity, the representation must make these components available in a stable and reusable way.
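
The chunk-to-token pipeline and its compressive character can be sketched in a few lines. This is a toy stand-in under loud assumptions: a fixed random projection plays the role of the learned convolutional encoder, the codebook is random rather than trained, and the sizes `CHUNK`, `LATENT_DIM`, and `CODEBOOK_SIZE` are hypothetical. It is not how SampleCNN or EnCodec are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK, LATENT_DIM, CODEBOOK_SIZE = 320, 8, 4  # hypothetical sizes

# Stand-ins for learned parameters (assumption: random, not trained).
projection = rng.standard_normal((CHUNK, LATENT_DIM)) / np.sqrt(CHUNK)
codebook = rng.standard_normal((CODEBOOK_SIZE, LATENT_DIM))

def encode(wave):
    """Split a waveform into fixed-size chunks and map each to a latent vector."""
    usable = len(wave) - len(wave) % CHUNK
    chunks = wave[:usable].reshape(-1, CHUNK)
    return chunks, chunks @ projection

def quantize(latents):
    """Assign each latent vector the index of its nearest codebook entry."""
    diffs = latents[:, None, :] - codebook[None, :, :]
    return np.linalg.norm(diffs, axis=-1).argmin(axis=1)

wave = rng.standard_normal(CHUNK * 50)  # 50 distinct noise chunks
chunks, latents = encode(wave)
tokens = quantize(latents)
print(len(chunks), "distinct chunks ->", len(set(tokens.tolist())), "distinct tokens")
```

With 50 distinct chunks and only 4 codes, many different waveform segments necessarily share a token, which is the many-to-one behavior described above in its most extreme form.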

In principle, this could happen if the encoder learned something like a narrowband decomposition of the signal. If different filters consistently responded to specific frequency regions, then components such as partials could be represented as stable patterns of activation. Higher-level structures could then be formed by grouping these components. In such a representation, pitch would emerge as a distributed but consistent pattern across filters: the same subset of filters would activate for a given pitch, regardless of context. This would make pitch a stable direction in activation space, which could combine with others to form more complex structures such as chords.
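
To make this condition concrete, here is a minimal sketch of what a separable and stable narrowband representation would look like. The filter bank is hand-built (Hann-windowed complex sinusoids at assumed centre frequencies), not learned, and the notes, amplitudes, and threshold are hypothetical. It shows the target behavior, not what an encoder actually learns.

```python
import numpy as np

SAMPLE_RATE, N = 8000, 800  # assumed: 0.1 s chunks at 8 kHz
CENTERS = np.arange(110, 2201, 110)  # hypothetical filter centre frequencies (Hz)
t = np.arange(N) / SAMPLE_RATE
# One narrowband filter per centre frequency, shape (20, N).
FILTERS = np.hanning(N) * np.exp(-2j * np.pi * CENTERS[:, None] * t)

def note(f0, amps=(1.0, 0.5, 0.25)):
    """A toy harmonic tone: fundamental f0 plus two weaker partials."""
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t) for k, a in enumerate(amps))

def active_centers(wave, threshold=0.1):
    """Centre frequencies whose filter response exceeds a fraction of the maximum."""
    activation = np.abs(FILTERS @ wave)
    return {int(c) for c in CENTERS[activation > threshold * activation.max()]}

a4_alone = active_centers(note(440))
a4_in_mix = active_centers(note(440) + note(550))
print(sorted(a4_alone))
print(sorted(a4_in_mix))
```

In this idealized bank the filters that respond to the 440 Hz note alone also respond when a second note is added, so the note's activation pattern is a stable subset of the mixture's pattern. That stability across contexts is what compositionality would require.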

This provides a clear condition for compositionality: narrowband oscillatory components must be represented in a way that is both separable and stable across contexts. If such components and their higher-order compositions stabilize anywhere in the encoder and those stable representations persist into the latent embedding vectors, this would give downstream components (like sequence models or classifiers) an opportunity (though not a guarantee) to represent and reason over the kinds of structures favored by human perception and by perceptually derived systems of musical analysis. If they do not stabilize, downstream components must organize musical structure in a fundamentally different way.

That said, there is no guarantee that any of this will happen. And even if the model does recover narrowband components and group them into perceptually relevant objects, there is no guarantee the transformer will then learn human-aligned semantic relationships over these.

Therefore, audio interpretability has two challenges. The first follows the focus of most work in text: do transformers learn human-aligned semantic relationships over tokens? The second is new: do those tokens encode oscillatory structures that make such semantic relationships available in the first place? This blog post series focuses on the second question (although I also explore the first when studying generative models).

0.3 Neural ontologies != human-interpretable ontologies

Given an end-to-end audio model, there are three plausible places where narrowband oscillatory objects could stabilize in a way that would allow them to compose into higher-order representational structures for downstream components.

The first convolutional layer. It is natural to look for partials here: first-layer filters interact directly with the raw waveform, and their outputs are hierarchically transformed throughout the encoder, which could allow partials to compose vertically into more complex features.
The end of the convolutional stack. By this point receptive fields are generally long enough to exhibit spectral selectivity, and compressed waveform representations become latents for downstream modeling. If oscillatory primitives exist in the encoder, this is perhaps the most promising candidate.
The discretized latent space. Here the compressed latents are either mapped to vector-quantized codes (in generative models) or fed into linear layers (in discriminative models). If oscillatory structures have been stabilized earlier, this stage could preserve or recombine them into higher-level representations.
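
The point about receptive fields can be checked with simple arithmetic. The sketch below applies the standard receptive-field recurrence for stacked strided convolutions and converts the result into a rough Fourier resolution limit (about one over the window duration). The layer configuration and sample rate are hypothetical and do not describe the exact models studied in this series.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first.

    Returns (receptive_field, total_stride), both in input samples.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride             # spacing between adjacent outputs, in input samples
    return rf, jump

SAMPLE_RATE = 16000      # assumed sample rate in Hz
layers = [(3, 3)] * 6    # hypothetical stack of six kernel-3, stride-3 layers

for depth in (1, 3, 6):
    rf, hop = receptive_field(layers[:depth])
    # A window of rf samples cannot separate components much closer than fs / rf.
    print(depth, rf, hop, round(SAMPLE_RATE / rf, 1))
```

Under these assumptions a first-layer filter sees 3 samples and is limited to a resolution of several kilohertz, while the sixth layer sees 729 samples and could in principle resolve components roughly 22 Hz apart.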

The posts that follow explore these possibilities through a series of experiments on the encoders of a discriminative and a generative model (a proprietary toy version of SampleCNN [7], and EnCodec [8]) using the FMA small dataset [9]. I chose these models to show that the effects are architectural rather than model-specific: they persist across task and scale.

Ultimately, the analysis suggests that encoders across tasks and scale learn latent vectors that correspond to coarse, recurring patterns of local waveform shape. Each pattern represents some holistic texture: a summary of the entire oscillatory mixture within that waveform shape. But unlike latent auditory representations—which preserve access to the narrowband components within the holistic texture, so these can be grouped into objects—latent encoder representations do not, in general, behave compositionally. These representations contain internal narrowband components, but they cannot reliably expose them as independently accessible or stable substructures. As a result, those components are not reliably available for grouping into higher-order objects like pitches or chords.

Several factors likely contribute to this. First, repeated downsampling, convolution, and nonlinearities transform the spectral fingerprint of the signal in ways that degrade or mix spectral structure. Narrowband oscillatory components may be attenuated or aliased, and if they are aliased, distinct frequencies can collapse into shared representations. With that said, this problem is uneven: original narrowband components from different kinds of signals are preserved unequally under harsh aliasing (I show that this representability is predictable from the interaction of the signal and the architecture design).
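
The collapse of distinct frequencies under aliasing is easy to reproduce. The sketch below downsamples two cosine tones by keeping every fourth sample with no low-pass filtering, which is a deliberately harsh assumption. Real encoders interleave strided convolutions with learned filters and nonlinearities, so this isolates only the downsampling step. The sample rate, factor, and tone frequencies are chosen for illustration.

```python
import numpy as np

SAMPLE_RATE, FACTOR = 8000, 4             # assumed: 8 kHz reduced to 2 kHz
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of sample times

def tone(freq_hz):
    return np.cos(2 * np.pi * freq_hz * t)

def naive_downsample(wave):
    """Keep every FACTOR-th sample with no anti-aliasing filter."""
    return wave[::FACTOR]

low, high = tone(500), tone(1500)  # 1500 Hz exceeds the new Nyquist limit of 1000 Hz
same_before = bool(np.allclose(low, high))
same_after = bool(np.allclose(naive_downsample(low), naive_downsample(high)))
print(same_before, same_after)
```

The two tones are clearly different at the original rate but become sample-for-sample identical after downsampling, so any later layer receives the same input for both.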

Second, even if these components are well preserved, deep filters generally lack the selectivity to tease them apart, and this problem scales with signal density. Although deep convolutional filters can exhibit narrowband sensitivity in principle, their selectivity is limited relative to the effective bandwidth of the downsampled signal. As the representation is compressed, each filter spans a larger portion of the remaining spectrum, reducing its ability to isolate fine-grained components. Again, this becomes an increasingly large problem as the signal becomes more dense and more components are crowded into the same downsampled space.

Third, learned filters do not appear to form an orthogonal or evenly spaced basis. They are free to overlap and respond to correlated patterns, which encourages entangled representations rather than a clean decomposition into independent frequency-localized elements. This suggests that even if the space were not downsampled and the model had enough filters to encode a sufficient number of narrowband components, the model would need additional constraints to tile the frequency space efficiently.
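
One simple way to quantify how far a filter set is from an orthogonal basis is the largest cosine similarity between any two filters. The sketch below compares a discrete Fourier basis with a set of unconstrained filters. The random filters are only a stand-in for learned ones (an assumption), and `max_overlap` is a hypothetical helper rather than the measurement used in the later posts.

```python
import numpy as np

def max_overlap(filters):
    """Largest absolute cosine similarity between any two distinct filters (rows)."""
    unit = filters / np.linalg.norm(filters, axis=1, keepdims=True)
    gram = np.abs(unit @ unit.conj().T)
    np.fill_diagonal(gram, 0.0)  # ignore each filter's similarity with itself
    return float(gram.max())

K = 16  # assumed kernel length and filter count
n = np.arange(K)
dft_basis = np.exp(-2j * np.pi * np.outer(n, n) / K)  # orthogonal, evenly spaced

rng = np.random.default_rng(0)
unconstrained = rng.standard_normal((K, K))  # stand-in for learned filters

print(max_overlap(dft_basis))      # numerically zero
print(max_overlap(unconstrained))  # clearly nonzero
```

A value near zero means the filters tile the space without redundancy. Larger values mean that at least two filters respond to correlated patterns, which is the entanglement described above.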

These findings limit what downstream models can recover: if latent vectors (and their quantized codes) encode the holistic texture of each chunk as a token, any sequence model that operates over such tokens will learn semantic relationships over holistic textures. But that same model cannot learn semantic relationships between parts of those textures (e.g., single pitches). For semantic relationships between pitches to occur, for example, the data would need to contain chunks whose holistic representation is a single pitch. But even then, these could not be semantically equated to pitches within mixtures, because the model has no way of discretizing pieces of mixtures.

If true, this might point to a mismatch between neural and human musical representations. Human auditory perception tends to organize sound around composable oscillatory structures, while the end-to-end encoders I studied seem to converge to a waveform-based representation that does not easily admit those structures.

This mismatch ultimately leads to more questions than it answers, and it matters for applications that aim to interpret or steer traditional musical parameters as models scale: increasing model size and data diversity might help us better steer things that can be expressed as broad oscillatory mixtures (e.g., "make this brighter" or "make this more pop-like"), but we may continue struggling to steer seemingly basic things like partials, and the higher-order structures that compose from them.

References

[1] Kinsler, Lawrence E., Austin R. Frey, Alan B. Coppens and James V. Sanders. “Fundamentals of Acoustics, 4th Edition.” (1999).

[2] Moore, Brian CJ. An introduction to the psychology of hearing. Brill, 2012.

[3] Ma, Wenye, and Gus Xia. "Exploring the internal mechanisms of music llms: A study of root and quality via probing and intervention techniques." In ICML 2024 Workshop on Mechanistic Interpretability. 2024.

[4] Singh, Nikhil, Manuel Cherep, and Pattie Maes. "Discovering and Steering Interpretable Concepts in Large Generative Music Models." arXiv preprint arXiv:2505.18186 (2025).

[5] Vásquez, Marcel A. Vélez, Charlotte Pouw, John Ashley Burgoyne, and Willem H. Zuidema. "Exploring the Inner Mechanisms of Large Generative Music Models." In ISMIR, pp. 791-798. 2024.

[6] Wei, Megan, Michael Freeman, Chris Donahue, and Chen Sun. "Do Music Generation Models Encode Music Theory?." arXiv preprint arXiv:2410.00872 (2024).

[7] Lee, Jongpil, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. "Samplecnn: End-to-end deep convolutional neural networks using very small filters for music classification." Applied Sciences 8, no. 1 (2018): 150.

[8] Défossez, Alexandre, Jade Copet, Gabriel Synnaeve, and Yossi Adi. "High fidelity neural audio compression." arXiv preprint arXiv:2210.13438 (2022).

[9] Defferrard, Michaël, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. "FMA: A dataset for music analysis." arXiv preprint arXiv:1612.01840 (2016).

[10] Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.

[11] Mikolov, Tomáš, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 746-751. 2013.

[12] Olah, et al., "Zoom In: An Introduction to Circuits", Distill, 2020.
