
Post 0: What Kind of Brain Lives Inside A Neural Audio Model?

  • Writer: Nicole Cosme
  • 4 hours ago
  • 6 min read

The figure below shows three representations of the same sound wave: a brief prose description, a spectrogram, and a sequence of numbers. The prose describes a chord sequence, recording its writer’s perceptual experience of the sound wave. The spectrogram shows how acoustic energy in the sound wave is distributed across frequencies over time. Both are human-legible representations of music, each offering its own strategy for recovering semantic structures from a sound wave.



Figure 1. Three representations of a sound wave: (A) prose description; (B) spectrogram; (C) feature map


The third representation (a feature map produced by a convolutional filter) also recovers something from the sound wave, but this thing is not immediately human-legible. Produced by a neural network, it is a model-internal representation that reflects what aspects of the sound wave the model has learned to treat as meaningful for its task. These examples illustrate a general point: every representational system constructs its own ontology of musical sound (a theory about what musical sound is and how it is organized). 
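The three representations can be sketched in a few lines of code. The tone's frequencies and the convolutional kernel below are illustrative stand-ins (the kernel is random, standing in for a learned filter), not taken from any real model:

```python
import numpy as np

# A hypothetical sound wave: one second of an A3 tone with a few partials.
sr = 16000
t = np.arange(sr) / sr
wave = sum(a * np.sin(2 * np.pi * f * t)
           for f, a in [(220, 1.0), (440, 0.5), (660, 0.25)])

# (B) Spectrogram: magnitudes of short-time Fourier frames.
frame, hop = 512, 256
frames = np.stack([wave[i:i + frame] * np.hanning(frame)
                   for i in range(0, len(wave) - frame, hop)])
spectrogram = np.abs(np.fft.rfft(frames, axis=1))  # shape: (time, freq)

# (C) Feature map: response of one (here random) convolutional filter.
rng = np.random.default_rng(0)
kernel = rng.standard_normal(16)
feature_map = np.convolve(wave, kernel, mode="valid")

print(spectrogram.shape, feature_map.shape)
```

The spectrogram's peak sits at the bin nearest 220 Hz, which a human can read off directly; the feature map's values have no such ready-made interpretation.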


0.1 Representation systems instantiate ontologies

To construct such an ontology, a representational system must first choose the basic building blocks from which higher-order structures are composed. I will call these primitives. Traditional Western staff notation, for example, selects discrete pitch–time events as its primitives. These events are encoded as note symbols positioned on a staff, and they compose higher-order features, like chords and melodies. The choice of primitives, therefore, determines what kinds of representational distinctions a system can express.


This happens because all representational systems carry inductive biases—built-in architectural commitments that shape what kinds of features the system can learn from its data. Inductive biases imply a space of representational possibility: the full range of distinctions that the system could express through its operators and their combinations. I will call this space a latent ontology. For example, Western staff notation can represent any combination of note symbols that fits into its staff space. 


The representation of a musical recording then selectively instantiates this space, repeatedly realizing some subspaces while leaving others unused. I refer to this realized subset as an operational ontology. Continuing with the above example, a transcription of a single song features high use of a specific set of note symbols in specific places on the staff, while leaving other symbols and positions minimally expressed or wholly unexpressed. In doing so, it reveals which parameters are actively mobilized within the broader representational system for a particular piece of data. 


0.2 Neural ontologies are opaque

In many familiar music-representational systems, latent and operational ontologies are closely aligned with human perceptual organization, largely because they choose the same primitives. These primitives are oscillatory structures that follow from the physical properties of most sound-producing systems. 


Consider a violin string: its back-and-forth motion when bowed can be well approximated as a combination of simpler motions (normal modes) with characteristic oscillation rates (called frequencies). The same holds for the vibrations of most sound-producing systems. When these vibrations force the surrounding air molecules, the resulting disturbance (a sound wave) inherits the same mixture of oscillations (cite). Therefore, many sound waves can be usefully represented as mixtures of narrowband oscillatory components. Those components have various names––overtones, harmonics, etc. I will call them partials.
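This decomposition is easy to demonstrate numerically. A minimal sketch, with illustrative frequencies and amplitudes rather than a real instrument's modes: build a wave as a sum of sinusoids, then recover each partial's frequency and amplitude from the discrete Fourier transform:

```python
import numpy as np

# A sound wave built as a mixture of partials, then recovered by a DFT.
# Frequencies/amplitudes are illustrative, not from any real instrument.
sr, dur = 8000, 1.0
t = np.arange(int(sr * dur)) / sr
partials = [(200.0, 1.0), (400.0, 0.6), (600.0, 0.3)]
wave = sum(a * np.sin(2 * np.pi * f * t) for f, a in partials)

spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)  # amplitude scale
freqs = np.fft.rfftfreq(len(wave), d=1 / sr)

# The DFT isolates each partial: a sharp peak at each component frequency.
for f, a in partials:
    k = int(round(f * dur))  # bin index for frequency f (1 Hz resolution)
    print(f"{freqs[k]:.0f} Hz: amplitude {spectrum[k]:.2f}")
```

Because each frequency here falls exactly on a DFT bin, the recovered amplitudes match the mixture weights (1.0, 0.6, 0.3) exactly.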


Figure 2. Sound waves can be represented as mixtures of oscillatory components


Auditory perception research (cite) suggests that the cochlea in our inner ear decomposes incoming sound waves into narrowband frequency regions that effectively isolate partials (I’ll ignore phase here for simplicity). From these isolated components we build higher-level perceptual structures (e.g., pitch, melody, harmony) and assign subjective meaning to them (“that’s an A”). Therefore, sound-producing systems generate oscillatory mixtures, and perceptual systems organize around them. Many human music-representational systems (e.g., Western score notation) extend this process, so their representational structures can often be expressed in oscillatory terms. 
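The cochlea's narrowband decomposition can be caricatured as a bank of bandpass filters. A crude sketch (real cochlear filters are asymmetric and level-dependent; here each channel is just a windowed cosine FIR filter, and all frequencies are illustrative):

```python
import numpy as np

# Cochlea-like analysis, crudely: a bank of narrowband FIR filters, each a
# windowed cosine at one center frequency.
sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 600 * t)

def bandpass(center, taps=401):
    n = np.arange(taps)
    return np.hanning(taps) * np.cos(2 * np.pi * center * (n - taps // 2) / sr)

rms = {}
for fc in (200, 400, 600):
    out = np.convolve(wave, bandpass(fc), mode="same")
    rms[fc] = float(np.sqrt(np.mean(out ** 2)))
    print(fc, "Hz channel RMS:", round(rms[fc], 2))
```

The 200 Hz and 600 Hz channels respond strongly (the input contains those partials), while the 400 Hz channel stays near silent: each channel effectively isolates one partial.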


Modern neural audio models enact much the same epistemological process that other music-representational systems do. The mathematical operations that comprise each specific model introduce inductive biases. These biases delimit a latent ontology: a space of representational possibility. Data and training then act as a selective force within this space, stabilizing some subspaces while leaving others unrealized. The result is an operational ontology: a subset of representational distinctions that the trained model actually uses. However, unlike most music-representational systems, these models don’t first filter the sound through human perception, so they aren’t bound to any particular representational basis. And they are famously opaque. 


This raises natural interpretability questions about neural audio models. Do oscillatory structures also emerge as stable features inside modern neural audio models? And if not, what kinds of features do stabilize? 


This blog post series investigates these questions within the encoding stacks of a broad family of end-to-end neural audio systems, which currently dominates large-scale audio modeling and compression. These systems operate directly on raw waveform samples (end-to-end) and learn representations using hierarchically stacked one-dimensional convolutional layers. Despite differences in scale, design, and objective (classification, generation, compression), they share a common set of mathematical operators and arrangements of those operators. I refer to this family as convolutional waveform encoders.


0.3 Neural ontologies != human-interpretable ontologies

I launched a broad interpretability study of a discriminative and a generative model––a proprietary toy version of SampleCNN (cite) and EnCodec (Meta)––using the FMA small dataset (cite). And it quickly became apparent that data format is a complicating factor. 

Consider the first layer of a model like EnCodec. This layer is a bank of convolutional filters operating directly on the raw waveform. Each filter responds strongly to a family of tiny waveform fragments with similar shapes, producing an activation channel that measures the presence of that family in the input waveform. 
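A first layer of this kind can be sketched as a strided filter bank over the raw samples. The channel count, kernel size, and stride below are illustrative (not EnCodec's actual configuration), and the filters are random stand-ins for learned kernels:

```python
import numpy as np

# Sketch of a first conv layer: a bank of C filters slid over the raw
# waveform with stride S. Filter shapes are random, standing in for learned
# kernels; sizes are illustrative, not EnCodec's actual config.
rng = np.random.default_rng(0)
C, K, S = 8, 16, 4                      # channels, kernel size, stride
filters = rng.standard_normal((C, K))

wave = rng.standard_normal(1024)        # stand-in raw waveform

# Each output channel measures how strongly the waveform locally matches
# that filter's tiny fragment shape (a strided cross-correlation).
frames = np.lib.stride_tricks.sliding_window_view(wave, K)[::S]  # (T, K)
activations = frames @ filters.T                                 # (T, C)
print(activations.shape)
```

Each column of `activations` is one channel: a time series of match strengths between the input and one filter's fragment shape.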


These channels are the model’s explicit representations. They are directly accessible to the model: it can reuse them, combine them with other representations, and propagate them through deeper layers. 


But because the original wave has oscillatory sub-components owing to its physical mechanism of production, every explicit representation also carries implicit spectral structure. That is, any short waveform segment can also be represented as a combination of partials (recall Figure 2). These partials are present in the input signal even if the model doesn’t represent them directly. 
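One way to see this duality: any waveform-domain kernel, whatever fragment shape it matches explicitly, also has a frequency response, i.e. a fixed weighting over partials it passes implicitly. A sketch with a toy kernel (a windowed 1 kHz cosine standing in for a learned filter):

```python
import numpy as np

# The implicit spectral side of an explicit waveform filter: its frequency
# response, read off from a zero-padded DFT of the kernel.
sr, K = 8000, 64
n = np.arange(K)
kernel = np.hanning(K) * np.cos(2 * np.pi * 1000 * n / sr)  # toy "learned" kernel

response = np.abs(np.fft.rfft(kernel, 1024))
freqs = np.fft.rfftfreq(1024, d=1 / sr)
peak = float(freqs[np.argmax(response)])
print(f"kernel passes energy mainly near {peak:.0f} Hz")
```

The model only ever sees the kernel's waveform-domain activations, but the kernel nonetheless imposes this spectral weighting on everything that flows through it.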


This distinction becomes important when we try to understand what kinds of features can stabilize inside a model. Mechanistic interpretability literature (cite) largely takes features to be directions in activation space—patterns of activity across many units that track the same underlying variable across contexts. And we commonly try to map these features to known concepts to see if the model’s representational substrate shares classes with our own. Neural audio architectures complicate this process. 


If features are defined in waveform-activation space, they directly concern explicit representations. That is, features will be defined based on waveform-like activations. If we want to connect features to known musical concepts, though, we have to look at the spectral domain. Any spectral objects of interest, like oscillatory components, become controllable by the model only if features in waveform space capture them indirectly. When this happens, we might call those objects implicit features: reliable directions in spectral space that the model can control by leveraging relationships among explicit waveform-fragment activations. 


And there’s one more layer to the complexity: the relationship between waveform fragments and the oscillatory mixtures they contain is many-to-many. Many waveform fragments can correspond to the same spectral pattern, and many different spectral patterns can correspond to the same fragment. This means that stabilizing a feature in waveform-space does not guarantee that a single feature will stabilize in spectral-space. And even if it does, there is no guarantee that it maps onto a known musical concept. 
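The many-to-one direction is easy to demonstrate: shifting the phases of the partials changes the waveform fragment (the explicit object) while leaving the magnitude spectrum (the implicit object) untouched. A sketch with illustrative frequencies chosen to land exactly on DFT bins:

```python
import numpy as np

# Two visibly different waveform fragments with identical magnitude spectra:
# same partials, different phases.
sr = 8000
t = np.arange(256) / sr
a = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
b = np.cos(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t + 1.0)

waveform_distance = float(np.max(np.abs(a - b)))
spectral_distance = float(np.max(np.abs(np.abs(np.fft.rfft(a))
                                        - np.abs(np.fft.rfft(b)))))
print(waveform_distance, spectral_distance)
```

The two fragments differ substantially sample by sample, yet their magnitude spectra are numerically identical: a waveform-space feature that fires on one of them need not fire on the other, even though they contain the same partials.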


The central question for this project was therefore whether any stable directions in waveform activation space actually correspond to oscillatory structures (e.g., partials, pitches, chords). And to simplify this, I focused on asking if the most basic oscillatory structures––partials––stabilize, since most other oscillatory structures are composed from these. 


If partials stabilize as implicit but independent internal variables, they should appear as directions in waveform activation space that indirectly isolate narrowband oscillatory components across musical contexts. If not, oscillatory structures remain amorphous: distributed across waveform-fragment activations and inaccessible to the model for direct semantic reasoning. This would make neural audio ontologies somewhat misaligned with human perceptual ontologies of music, in the sense that the representation basis does not match the concept basis. And this matters for applications that benefit from understanding and steering human-interpretable musical parameters. 


If a model can stabilize narrowband oscillatory components early on (like the ear does), it could in principle compose higher-level oscillatory structures from them. This would give the model an opportunity (but not a guarantee) to represent and reason over the kinds of structures favored by human perception and perceptually derived analytical systems. 


There are three places in this architectural family where partials could plausibly stabilize:


  • First-layer convolutional filters, where the raw waveform is first transformed

  • The end of the convolutional stack, where compressed waveform representations become latents for downstream modeling

  • The discretized latent space, where those latents are mapped to vector-quantized codes (generative) or fed into linear layers (discriminative)
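The three probe sites can be located in a minimal stand-in encoder. Everything below is illustrative (layer count, widths, strides, and the nearest-codeword quantizer are toy choices, not any real model's configuration):

```python
import numpy as np

# A toy convolutional waveform encoder marking the three probe sites.
rng = np.random.default_rng(0)

def conv1d(x, w, stride):
    # x: (T, C_in), w: (C_out, K, C_in); strided cross-correlation + ReLU.
    K = w.shape[1]
    frames = np.lib.stride_tricks.sliding_window_view(x, K, axis=0)[::stride]
    return np.maximum(np.einsum("tck,okc->to", frames, w), 0.0)

x = rng.standard_normal((1024, 1))       # raw mono waveform

w1 = rng.standard_normal((8, 16, 1)) * 0.1
h = conv1d(x, w1, stride=4)              # probe 1: first-layer activations

w2 = rng.standard_normal((16, 8, 8)) * 0.1
z = conv1d(h, w2, stride=4)              # probe 2: end-of-stack latents

codebook = rng.standard_normal((32, 16))
codes = np.argmin(((z[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
                                         # probe 3: discretized latent codes
print(h.shape, z.shape, codes.shape)
```

In a real model, the interpretability question at each probe is the same: do any stable directions in these activations (or any codebook entries) isolate narrowband oscillatory components?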


The posts that follow examine each of these possibilities. Collectively, they suggest that oscillatory components may be coincidentally isolated for inputs where spectral structure is already sparse, but they largely do not stabilize as reusable internal variables across contexts. Features do stabilize in waveform-activation space, but these features overwhelmingly do not correspond to reliable directions in spectral space. Another way to say this is that the model’s learned features are largely monosemantic in some unknown waveform basis, but polysemantic in a known spectral basis: stabilized feature directions generally correspond to mixtures of oscillatory components, often with substantial variation across contexts.


Post 1
Post 2
Post 3
