Post 1: What Does the First Layer of a Neural Audio Model Learn?

Nicole Cosme
Mar 11

Updated: Mar 22

In the previous post, I argued that many of the structures humans use to organize sound are oscillatory (e.g., partials, pitch, melody, chords). If we want to understand more about the way neural networks organize sound, and how it aligns with our own organizational schemas, then we might look for these structures. This blog post series focuses on the most primitive such structure: partials (narrowband oscillatory components with preferred frequencies).

The first layer of a convolutional waveform encoder would be a natural place for narrowband oscillatory primitives to stabilize. First-layer filters act directly on the raw waveform, and their outputs become the inputs to deeper layers, where higher-level musical structures such as pitches or chords could be composed. However, simple signal-processing constraints make this kind of stabilization highly unlikely in practice.

Therefore, this post proceeds in two steps. First, I demonstrate why partials are not expected to stabilize in the first layer. Then I turn to a more interesting question: what stabilizes here instead? Partials serve as a useful reference point; understanding why they fail to appear helps clarify what kinds of representations this layer can support.

The analysis below presents a few results from my dissertation work, organized around this question. The results are far from comprehensive, but they point us toward a hypothesis that the first layer stabilizes a set of directions in activation space corresponding to recurring patterns of local waveform morphology (as we might hope), along with a sparse set of transitions between those directions. In this sense, the first layer may behave like a low-dimensional dynamical system over waveform primitives.

With that said, these findings are suggestive rather than definitive. I have observed enough to believe that the model forms coherent state and transition objects in the time domain, but I have not found evidence that these reliably correspond to primitive spectral objects such as partials or pitches.

1.1 Architectural constraints make partial stabilization unlikely

To stabilize a partial, filters in the first layer would need to somehow isolate that partial even when other partials are present. This kind of selectivity is fundamentally a spectral property: isolation happens when the first layer responds strongly to a narrow band of frequencies that encompasses that oscillation.

First-layer filters do not operate in the spectral domain, however. They operate directly on short windows of the waveform, in the time domain. For example, Toy-SampleCNN [1] uses 1D filters of length three, while EnCodec [2] uses filters of length seven. Figure 1 shows examples from EnCodec’s first layer.

Figure 1. Filter shapes from EnCodec's first convolutional layer.

To see what this implies in practice, we can compare these filter lengths to the time scales of the oscillations we are trying to isolate.

Take the simplest case, where the signal contains a single oscillation at 440 Hz (the typical fundamental of “concert A”). In this case, the waveform is shaped like that oscillation, which has a full period of 2.27 ms. But at a 24 kHz sampling rate, a 7-sample filter spans only ~0.29 ms (about an eighth of a cycle), so it becomes sensitive primarily to short local amplitude patterns in the oscillation (brief rises, falls, and jagged micro-structures). First-layer filters thus have enough temporal context to cover fragments of the full period but cannot become sensitive to the whole thing at once. And this holds for any oscillation with a period longer than the filter length.
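The arithmetic behind this comparison can be checked in a few lines. This is only a sketch of the numbers quoted above; the variable names are illustrative.

```python
# Compare one period of a 440 Hz oscillation with the span of a 7-tap filter.
sample_rate_hz = 24_000   # sampling rate used in the text
filter_length = 7         # first-layer kernel size used in the text
frequency_hz = 440.0      # concert A

period_ms = 1000.0 / frequency_hz                        # one full cycle
filter_span_ms = 1000.0 * filter_length / sample_rate_hz
samples_per_period = sample_rate_hz / frequency_hz       # samples in one cycle
fraction_of_cycle = filter_length / samples_per_period   # share of a cycle the filter sees

print(f"period: {period_ms:.2f} ms")
print(f"filter span: {filter_span_ms:.2f} ms")
print(f"samples per period: {samples_per_period:.1f}")
print(f"fraction of one cycle covered: {fraction_of_cycle:.1%}")
```

One cycle needs roughly 55 samples, so a 7-sample window covers only a small slice of it.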

This is not so much of a problem in the first layer, as later layers could learn how these fragments compose into a full cycle, thus representing the oscillation. The bigger problem is that most signals are not a single, pure oscillation (even locally).

In real musical signals, the waveform at any moment is often a superposition of many oscillations, each with different frequencies, amplitudes, and phases. The local waveform shape seen by a short filter therefore reflects the superposition of fragments of many partials rather than any single one. If the model wants to compose a full oscillatory cycle for any of these individually, it not only needs the distributed representation mentioned in the isolated case above, but it also needs some way to break apart the superposition.

In Post 0, I explained that waveform-domain convolution implicitly does multiplication in the spectral domain. In theory, this could break the superposition if different filters learn different multiplicative operations that isolate narrow parts of the spectrum. But can the first layer's short filters achieve narrowband selectivity in the spectral domain consistently across musical contexts? Classical signal processing suggests that they cannot (I'll show in future posts that very deep layers can achieve this selectivity, although they still don't seem to isolate narrowband partials).
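The convolution/multiplication correspondence is easy to verify numerically. The sketch below uses random stand-in data (both the waveform fragment and the 7-tap filter are hypothetical) and checks that convolving in the time domain matches multiplying zero-padded spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)   # stand-in waveform fragment (hypothetical data)
h = rng.standard_normal(7)     # stand-in 7-tap filter (hypothetical weights)

# Time domain: full linear convolution, length 256 + 7 - 1.
y_time = np.convolve(x, h, mode="full")

# Spectral domain: zero-pad both to the output length, multiply, invert.
n = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

max_abs_diff = np.max(np.abs(y_time - y_freq))
print(max_abs_diff)  # tiny; differences are floating-point rounding only
```

Convolutional layers in deep learning libraries typically compute cross-correlation (no kernel flip). For real-valued filters this changes the phase of the spectral multiplier but not its magnitude, so the argument about frequency selectivity is unaffected.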

Short filters are sharp tools in the waveform domain, but they are blunt tools in the spectral domain. For a filter of length N, sampled at rate fₛ, the approximate width of its main spectral lobe (i.e., its effective frequency resolution) scales on the order of Δf ≈ fₛ / N.

This inverse relationship between temporal support and frequency resolution is a manifestation of the time–frequency uncertainty principle [3]. Short filters can localize events in time, but they cannot simultaneously localize them in frequency. For example, a 7-sample filter at 24 kHz yields a delta on the order of 3400 Hz (24,000 / 7 ≈ 3,429 Hz), so it largely cannot distinguish partials spaced more closely than this. By comparison, the spacing between adjacent partials in common musical events is typically tens to hundreds of hertz (e.g., a chord starting on A4 (440 Hz) and ending on the E above (~659 Hz) spans roughly 220 Hz).
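To make the bluntness concrete, here is a minimal sketch that assumes the resolution scales as fₛ / N (which reproduces the ~3400 Hz figure) and then measures how a hypothetical best-case filter, seven samples of a 440 Hz cosine, responds at 440 Hz versus 659 Hz. The filter is an illustration, not one of EnCodec's learned filters.

```python
import numpy as np

fs = 24_000           # sampling rate in Hz, from the text
N = 7                 # filter length in samples, from the text
delta_f_hz = fs / N   # order-of-magnitude main-lobe width

# Hypothetical "tuned" filter: 7 samples of a 440 Hz cosine.
n = np.arange(N)
h = np.cos(2 * np.pi * 440.0 * n / fs)

# Zero-pad to fs points so that rfft bin k sits at exactly k Hz.
H = np.abs(np.fft.rfft(h, fs))

# Ratio of the magnitude response at E5 (~659 Hz) to that at A4 (440 Hz).
response_ratio = H[659] / H[440]

print(f"delta_f ~ {delta_f_hz:.0f} Hz")
print(f"|H(659 Hz)| / |H(440 Hz)| = {response_ratio:.3f}")
```

The ratio comes out close to 1: even a filter shaped like a fragment of the 440 Hz tone passes a 659 Hz tone almost equally well, because both frequencies sit well inside the same main lobe.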

Therefore, a filter that spans only a few samples is unlikely to respond selectively to narrowband oscillatory components in a mixture. It would be expected to respond broadly to whatever mixture of frequencies happens to occur within its temporal window. While deeper layers could in principle combine many such filters to approximate narrower selectivity, the first layer alone does not have the temporal support required to do this robustly across mixtures. So the filters in this layer likely cannot isolate a piece of a narrowband partial from the signal if it is mixed with pieces of other narrowband partials.

This implies a kind of basis-dependent semanticity for first-layer filters. In waveform space, first-layer filters can behave as reliable detectors of simple local waveform shapes that serve grammatical functions in the time domain. But in spectral space those same filters necessarily entangle many oscillatory components whenever mixtures are present. Partials (or coherent fragments of them) may be technically present across filter activations. But they will likely remain entangled in each piece of that distributed representation, unless the signal contains one pure oscillation. Therefore, partials (as a general feature category) would not be expected to stabilize cleanly or consistently via filter activations or some linear combination of these.

The more interesting question, then, is not whether the first layer stabilizes partials, but what it stabilizes instead.

1.2 Towards features and archetypes

Because the first layer depends so strongly on tiny changes in the waveform, I start by examining local properties.

At every time step the model produces a response vector: the set of activation values across all filters at that moment. If the first layer contains N filters, these response vectors live in an N-dimensional space where each axis corresponds to one filter’s activation. For example,

r_t = (0.1, 0.0, 0.3, 0.9, 0.0, ...)

tells us that filter 3 is strongly active, filter 2 is somewhat active, and the rest are mostly quiet (counting filters from 0).

In rare cases a response vector may align strongly with a single axis, indicating that one filter alone explains the fragment (e.g., 0.9, 0.0, 0.0, 0.0, 0.0, ...). But most response vectors fall at arbitrary points in this N-dim space, meaning the corresponding local waveform fragment is best explained by a combination of filters. This perspective gives us a geometric way to interrogate semantic similarity in the first layer.
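As a rough illustration of this geometry, the sketch below builds response vectors for a toy layer. Everything here is an assumption made for illustration: the filters are random rather than trained, the layer is a plain stride-1 linear filter bank with no bias or nonlinearity, and the input is a synthetic 440 Hz tone.

```python
import numpy as np

rng = np.random.default_rng(0)
num_filters, kernel_size = 32, 7   # hypothetical layer shape
filters = rng.standard_normal((num_filters, kernel_size))  # stand-in weights
waveform = np.sin(2 * np.pi * 440.0 * np.arange(2400) / 24_000)  # 0.1 s tone

# Every length-7 window of the waveform, shape (T, kernel_size).
windows = np.lib.stride_tricks.sliding_window_view(waveform, kernel_size)

# Row t is the response vector r_t: one activation per filter.
responses = windows @ filters.T    # shape (T, num_filters)

# How much of each r_t is explained by its single largest axis?
# Values near 1 mean one filter dominates; lower values mean a mixture.
norms = np.linalg.norm(responses, axis=1)
axis_alignment = np.abs(responses).max(axis=1) / np.maximum(norms, 1e-12)

print(responses.shape)
print(f"median axis alignment: {np.median(axis_alignment):.2f}")
```

An axis alignment well below 1 for most time steps corresponds to the "arbitrary points in N-dim space" case described above.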

1.3 Features

In Post 0, I argued that many perceptual features can be described as compositions of simpler oscillatory components. If such structure is present in the model, we might expect it to enlist two levels of representation.

First, individual channels. Each channel corresponds to a learned filter and defines an axis in activation space. If narrowband oscillatory components were to stabilize at this layer, we would expect to observe it here, as consistent sensitivity of individual channels to specific frequency regions across contexts.

Second, linear combinations of channels. These define directions in activation space and correspond to the feature concept commonly used in mechanistic interpretability (as explained in post 0) [4], [5], [6]. If oscillatory structure is represented compositionally, such directions could in principle encode partials (as a distributed representation) or combinations of partials.

The theoretical constraints discussed above apply directly at the channel level. Because first-layer filters operate over short temporal windows, they lack the frequency resolution required to isolate narrowband components in mixtures. Each channel therefore responds to local waveform fragments that reflect superpositions of oscillatory components rather than any single one. As a result, channel activations do not seem to provide a stable basis for representing individual partials across contexts.

This limitation extends to linear combinations of channels. While combinations of convolutional filters can approximate more selective responses, filters within the same layer are applied independently to the waveform and share the same temporal support. As a result, any combinations inherit the frequency resolution constraint from their components. In practice, this makes it unlikely that linear combinations in the first layer provide a stable basis for isolating narrowband oscillatory components across contexts.

But even if individual channels and their linear combinations do not reliably isolate narrowband oscillatory components, they still have semantic value. Linear combinations, in particular, provide an informative view of what actually stabilizes in the first layer. If the model does not represent oscillatory components as separable primitives, then its representational structure must instead be encoded in higher-level combinations. In the interest of space, I focus the analysis that follows on directions in activation space.

If filters respond to families of local waveform shapes, then linear combination features (hereafter, just features) may describe recurring combinations of those shapes. The question, then, is what kinds of shapes these features correspond to.

I learned several sets of candidate features via a sparse autoencoder (I tested a few feature-learning methods, and while the results from the SAE weren't perfect, they seemed to be the most robust). Larger sets seemed to explain more variance in the data, but it was hard to make claims here without a good summarization method. And even so, these sets often had directions to which only a handful of activations mapped, which made me feel like I was mostly capturing features for small, unique bits of the data (even though the dataset was sufficiently large that I was able to consider activations for millions of fragments).

Smaller sets were more generic and quicker to analyze comprehensively, but less informative. These candidates seemed to be meta-directions to which many activations could be mapped, rather than clean singular features.

I chose a representative set somewhere in the middle. This set seems to capture common waveform structures well while still allowing directions to generalize. Roughly 96% of activation vectors could be assigned to their nearest feature direction with a mean cosine similarity of 0.94 (suggesting most assigned vectors align reasonably with one of the sparse learned directions).
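A nearest-direction assignment of this kind can be sketched as follows. The post does not state the exact assignment rule, so the cosine cutoff (`min_cosine`) and the synthetic data are assumptions; the helper name `assign_to_directions` is hypothetical.

```python
import numpy as np

def assign_to_directions(vectors, directions, min_cosine=0.8):
    """Assign each vector to its most cosine-similar direction.

    Returns the index of the nearest direction, the cosine to it, and a
    mask of vectors whose best cosine reaches min_cosine (assumed cutoff).
    """
    v = vectors / np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    d = directions / np.maximum(np.linalg.norm(directions, axis=1, keepdims=True), 1e-12)
    cos = v @ d.T
    nearest = cos.argmax(axis=1)
    best = cos[np.arange(len(cos)), nearest]
    return nearest, best, best >= min_cosine

# Synthetic stand-in data: noisy, rescaled copies of 8 unit directions.
rng = np.random.default_rng(0)
directions = rng.standard_normal((8, 32))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
labels = rng.integers(0, 8, size=2000)
scales = rng.uniform(0.5, 2.0, size=(2000, 1))
vectors = scales * (directions[labels] + 0.05 * rng.standard_normal((2000, 32)))

nearest, best, assigned = assign_to_directions(vectors, directions)
assigned_fraction = assigned.mean()
mean_cosine = best[assigned].mean()
print(f"assigned: {assigned_fraction:.1%}, mean cosine: {mean_cosine:.2f}")
```

Cosine similarity ignores vector length, which is why the rescaling in the synthetic data does not affect the assignment.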

Many features were minimally active (perhaps these are variants of more active features), while others were very active. I'll present results from two features that were particularly active: let's call them Feature 0 and Feature 1.

Figure 2. (a) feature vectors from the two most active features, (b) representative waveform snippets from the two most active features.

Each feature vector in Figure 2a shows the deviation from the mean activation level for each filter, across all data. Positive values indicate filters that are more active than average when that feature is present, while negative values indicate filters that are suppressed relative to their baseline. For example, filters 11, 13, and 15 are much more active than average in Feature 0 but substantially less active than average in Feature 1.

Feature 0 appears to emphasize filters associated with sharp oscillatory peaks, while suppressing filters whose learned shapes resemble valleys or flatter oscillations. Roughly, therefore, Feature 0 responds to rapid increases in signal energy. This turned out to be musically interesting, as Feature 0 was one of the only features with reliable semantic musical associations.

Sharp increases in the waveform often arise during transients, such as note onsets, where energy rises quickly before decaying. Indeed, if we create maps for a few input samples, noting where Feature 0 is maximally activated, and we listen to those moments, strong transients are often present. That said, other musical features are also present, so this is not definitive proof. It may just be a nice coincidence. What's more, sharp waveform shapes can also arise incidentally for many reasons, including waveform interference patterns or broadband noise. In other words, this candidate feature likely responds to a family of sharp local waveform events, which includes some transients, rather than to a specific musical object.

We can see this in the associated waveform fragments that maximally activate Feature 0. I’ve plotted a small number of representatives in Figure 2b. The graph is somewhat noisy, but we can distinguish sharp upward motions in many of the lines, which perhaps triggered the filters whose strong responses characterize this feature.

Feature 1 is almost the inverse of Feature 0. Instead of emphasizing sharp peaks, it responds to moments where energy drops and remains low. These might look like descending contours or valley-like waveform fragments, as shown in Figure 2b. One possibility is that this feature responds to the decay portion of transients, where energy falls after an onset. But this feature is largely uninterpretable: I could not distinguish any reliable musical semantics from graphing or listening to representative examples.

Overall, this set of candidate features suggests that the first layer can be decomposed into some set of stable directions in activation space (features), and that these directions are reused across musical contexts. This is not guaranteed a priori. The model could have learned directions dominated by single filters or some other heuristic. Instead, the learned features combine multiple filters, suggesting that recurring joint activation patterns across filters are semantically meaningful to the model.

These candidate features appear to be coherent objects in activation space. But what do they mean in the spectral domain, where we tend to reason about music?

If a candidate feature corresponds to a specific musical object—like an isolated partial—we would expect to see traces of that object in the spectra of the waveform fragments that activate the feature. In particular, a feature that isolated a partial would produce a narrowband spectral peak, with frequencies outside that band strongly attenuated.

Figure 3. Spectra of representative waveform snippets for Feature 0 (left) and Feature 1 (right).

The spectra of fragments activating Features 0 and 1 do show peakiness at the low end of the spectrum. At first glance this can seem to resemble narrowband selectivity, but it actually reflects typical partial dynamics in musical sounds. The fundamentals of most musical notes live at the bottom of the spectrum, making low partials relatively loud while higher partials quickly decay in amplitude.

True narrowband selectivity would look different. We would expect to see something like a single narrow peak (for example, a 20 Hz-wide bump around 400 Hz) with most other frequencies heavily suppressed near 0. Neither feature exhibits this pattern. Instead, each is strongly activated by a wide range of spectral contexts. The features therefore do not appear to behave like narrowband detectors capable of isolating partials within dense signals.

More broadly, the candidate features I've studied often map onto broad textures ("a bright mixture," "a low-end-dominant guitar sound," etc.). These are perfectly valid musical objects, as we can use them to describe patterns of sound, but they largely entangle the perceptual objects that we tend to use to organize music (the few possible exceptions include things like the transient-like behavior of Feature 0).

This lack of musical clarity is frustrating but not surprising. Candidate features hypothesize about the organizational axes of the model’s internal computations. These axes are encouraged towards regularities in the time domain, but the time domain is not one-to-one with the spectral domain. So, the model's axes have no a priori reason to mirror the organizational axes of human musical perception. Recall the opening. Candidate features derive from filters that are much shorter than most perceptual objects, so we wouldn't expect the first layer to represent things like notes or chords in full. We might only expect to see pieces of these, like their onsets or fragments of their sustains. And even in the case where we see something like a piece of a sustain, filters are not spectrally selective enough to disentangle the pieces of that sustain. It makes sense that they seem to encode broadband mixtures of sounds.

With that said, there is more to this story. As the waveform evolves over time, the response vectors move through activation space, tracing a trajectory between points. This raises a new question: how and why do these trajectories tend to move? In other words, how does the model’s internal feature representation change when the input changes slightly?

1.4 Archetypes describe motion in time-domain activation space

Adjacent waveform windows at a small scale are typically highly correlated. Many models encode overlap between adjacent windows, and even if not, non-overlapping contiguous windows often reflect the same underlying signal conditions (e.g., a held note). This points to a smoothly varying and predictable transition between the vectors representing those windows. Exceptions happen in the case of abrupt events, such as transients or noise, that produce sharp changes in local waveform shape. In these cases, we might expect roughness in the transition.

We might ask, then, what the model does with this information: can activation space also encode dynamical regularities in waveform data?

The first layer is not parameterized to learn dynamical regularities directly. It does learn representations that support later tasks, like reconstruction or classification. To do this, it applies the same filters across all inputs. If the data contains sufficient regularity in its adjacent windows and the model represents this regularity in a consistent way across contexts, then an implicit set of local transitions may emerge between these representations (something akin to a low-dimensional dynamical system embedded in a high-dimensional space).

Perhaps this is fitting for music. One of my college professors once proclaimed that the lifeblood of music lives in reusable semantic transitions between its static objects. That is, the means by which one note becomes the next (e.g., a slide vs. an abrupt jump) often carries as much meaning as the notes themselves. Likewise, recurring patterns of motion in the activation space of a waveform model may hold some indirect semantic value.

Here's a set of guiding questions. Do recurring transformations in the input induce consistent, reusable paths through activation space, or are these paths highly variable and context-dependent? And looking more broadly, do longer similar strings in the input (e.g., two long oscillatory shapes) cause sequences of activation vectors that trace similar paths in activation space? The section below presents answers to sub-questions I've explored for the first layer, en route to these larger questions.

A small note: At the first layer, any results likely reflect an interaction between signal regularity and representation geometry, so finding structure here isn't necessarily surprising. But it becomes more interesting in deeper layers. Although filters remain small, their effective receptive fields grow after downsampling, allowing them to respond to larger regions of the signal. Therefore, very different waveform fragments may be mapped to very similar internal representations, as the model operates over more abstract summaries of the input. Understanding how these higher-level representations evolve may reveal whether similar low-dimensional dynamics persist, or whether new kinds of structure emerge. But this is material for later posts.

1.5 How do we quantify motion?

There are many ways to study activation space dynamics, but I will focus on this adjacency idea. I will call the difference between adjacent activation vectors a generator. Formally, if r_t is the response vector at time t, then the generator is:

g_t = r_{t+1} − r_t

Each generator therefore can be interpreted as a local velocity in activation space, indicating which filters are increasing or decreasing together, and by how much, between time steps.
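In code, generators are just first differences along the time axis. The sketch below uses a random stand-in trajectory (hypothetical data) in place of real first-layer responses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in trajectory of response vectors, shape (T, N); a random walk here.
responses = rng.standard_normal((1000, 32)).cumsum(axis=0)

# Generator at step t: r_{t+1} - r_t, shape (T - 1, N).
generators = np.diff(responses, axis=0)

# Speed of motion in activation space at each step.
speeds = np.linalg.norm(generators, axis=1)

# Sign pattern of one generator: which filters rise (+1) or fall (-1).
sign_pattern = np.sign(generators[0]).astype(int)
print(generators.shape, speeds.mean(), sign_pattern[:5])
```

Because generators are differences, summing them from the first response vector recovers the last one, which is what makes the "local velocity" reading natural.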

If generators simply define the instances of transitions, I will call recurring patterns of these instances archetypes. For example, if an activation vector has three dimensions, an archetype might look like [+f1,−f2,+f3], meaning filters 1 and 3 increase while filter 2 decreases. This move can occur from many different starting points in activation space. Archetypes therefore describe common directions of motion in activation space, independent of position (these motions are often regionally concentrated, as I will show below, but the definition doesn't require this).

One way to identify archetypes is to mirror the method for identifying features: learning directions of motion in activation space over generators. As with features, I employed sparse autoencoders. The SAE solutions with very good reconstruction under sparsity constraints tended to find a relatively small number of archetypes (increasing this number produced additional directions that were rarely activated).

To check whether these archetypes were stable, I trained a sparse model across multiple random seeds. While the full set of learned directions varied, the most frequently used archetypes showed consistent alignment across runs (mean cosine similarity ~0.7 after matching). Reconstruction error and overall usage patterns were also consistent. This suggests that while the exact sparse basis is not consistent (the tails were variable), the dominant patterns of motion in activation space are fairly reliable.
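The post does not say how directions were matched across seeds. One simple option, shown here purely as an assumption, is greedy one-to-one matching on cosine similarity (an optimal assignment method would be a stricter alternative). The helper `greedy_match` and the synthetic "runs" are hypothetical.

```python
import numpy as np

def greedy_match(a, b):
    """Pair rows of a with rows of b one-to-one, highest cosine first."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = a @ b.T
    work = cos.copy()
    pairs = []
    for _ in range(min(cos.shape)):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        pairs.append((int(i), int(j), float(cos[i, j])))
        work[i, :] = -np.inf   # row i is now taken
        work[:, j] = -np.inf   # column j is now taken
    return pairs

# Synthetic stand-in: run_b is a shuffled, slightly perturbed copy of run_a.
rng = np.random.default_rng(0)
run_a = rng.standard_normal((10, 32))
perm = rng.permutation(10)
run_b = run_a[perm] + 0.1 * rng.standard_normal((10, 32))

pairs = greedy_match(run_a, run_b)
mean_matched_cosine = float(np.mean([c for _, _, c in pairs]))
print(f"mean cosine after matching: {mean_matched_cosine:.2f}")
```

If sign flips between runs should count as the same direction, the matching could be done on the absolute value of the cosine instead.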

I'll provide a quick example here. I chose a popular archetype, asked which generators maximally aligned with that archetype, and then asked which waveform fragments were most representative of those generators.

Figure 4. Centered waveform fragments whose generators maximally align with archetype 93

Figure 4 plots a few waveform fragments whose generators align strongly with archetype 93 (cosine similarity ≈ 0.90, versus ≈ 0.20 for random generators). The fragments share a common structure: a decrease into a trough followed by a rise. Because this pattern can occur anywhere within the window, I aligned the troughs to make the shared shape more visible.
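A trough alignment like the one described can be sketched as below. The details are assumptions: the window half-width, the mean removal used for centering, and the synthetic V-shaped fragments are all illustrative, and `align_troughs` is a hypothetical helper.

```python
import numpy as np

def align_troughs(fragments, half_width):
    """Crop each fragment to a window centred on its minimum, then remove the mean."""
    aligned = []
    for frag in fragments:
        t = int(np.argmin(frag))
        if t - half_width < 0 or t + half_width + 1 > len(frag):
            continue  # trough too close to an edge to crop a full window
        window = frag[t - half_width : t + half_width + 1]
        aligned.append(window - window.mean())
    return np.array(aligned)

# Synthetic stand-in fragments: a fall into a trough, then a rise, at random offsets.
rng = np.random.default_rng(0)
n = np.arange(40)
fragments = []
for _ in range(25):
    pos = rng.integers(10, 30)        # trough location within the fragment
    slope = rng.uniform(0.1, 0.3)     # steepness varies between fragments
    fragments.append(slope * np.abs(n - pos) + 0.005 * rng.standard_normal(40))

aligned = align_troughs(fragments, half_width=7)
mean_shape = aligned.mean(axis=0)     # average shape after alignment
print(aligned.shape)
```

After alignment, every row has its minimum at the centre index, so averaging across rows exposes the shared shape despite differences in steepness.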

These instances are noisy. The fragments vary in steepness and exhibit local jitter (which is expected in real-world audio). But despite this variability, the archetype captures a consistent underlying shape. And notably, this shape extends over a longer temporal span than any single filter can capture (roughly two filter lengths). By following transitions, then, we recovered a reliable coarse waveform structure, capturing a generalized view of how the signal evolves locally, rather than only isolating fixed components.

The shape consistency in this quick example lends some credence to the idea that archetype 93 reflects a family of canonical transitions along some dominant direction in activation space. But we can't say for sure if this is just an isolated coincidence or a microcosm of broader dynamics. The questions below point us toward the latter.

1.5.1 Are archetypes truly sparse?

We can group the SAE's sparse candidates by how often they participate strongly in observed transitions. Figure 5a shows the fraction of transitions in which each archetype is highly active, sorted by rank.

If activation changes were highly variable and context-dependent, we would expect activity to be spread more evenly across many archetypes. Instead, most transitions are captured by a relatively small subset (roughly 20 out of 128), while the remaining directions are only rarely used. This suggests that the model’s local dynamics may be compressible, with many transitions occurring along a limited set of directions.
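A usage curve of this kind can be computed from the SAE codes as sketched below. What counts as "highly active" is not specified in the post, so the threshold is an assumption, and the codes are synthetic (20 of 128 directions are made to fire often by construction). `usage_by_rank` is a hypothetical helper.

```python
import numpy as np

def usage_by_rank(codes, threshold):
    """Fraction of transitions where each archetype's code exceeds threshold, sorted descending."""
    active = np.abs(codes) > threshold
    usage = active.mean(axis=0)
    order = np.argsort(usage)[::-1]
    return usage[order], order

# Synthetic stand-in codes, shape (transitions, archetypes).
rng = np.random.default_rng(0)
num_transitions, num_archetypes = 5000, 128
p_active = np.full(num_archetypes, 0.005)   # most archetypes rarely fire
p_active[:20] = 0.3                         # 20 archetypes fire often
fires = rng.random((num_transitions, num_archetypes)) < p_active
codes = rng.random((num_transitions, num_archetypes)) * fires

usage, order = usage_by_rank(codes, threshold=0.1)
print(np.round(usage[:5], 3), np.round(usage[20:25], 3))
```

A sharp drop between the frequently used group and the tail is the signature of concentrated usage; a flat curve would indicate activity spread evenly across directions.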

One alternative explanation is that these frequently used archetypes are redundant; that is, multiple learned directions may represent small variations of the same underlying pattern. Figure 5b shows some evidence against this with a simple cosine similarity matrix between archetype vectors. While there is some similarity between a few pairs, most off-diagonal entries remain relatively low, suggesting that the frequently used archetypes capture distinct patterns of coordinated change and are not collapsed variants of one another.
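The redundancy check amounts to a pairwise cosine similarity matrix. In the sketch below the archetype vectors are random stand-ins, with one near-duplicate planted deliberately so the check has something to find; `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_matrix(directions):
    """Pairwise cosine similarity between the rows of directions."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
archetypes = rng.standard_normal((20, 32))                      # stand-in vectors
archetypes[1] = archetypes[0] + 0.1 * rng.standard_normal(32)   # planted near-duplicate

sim = cosine_matrix(archetypes)
off_diagonal = np.abs(sim[~np.eye(len(sim), dtype=bool)])

print(f"planted pair: {sim[0, 1]:.2f}")
print(f"median |cos| off-diagonal: {np.median(off_diagonal):.2f}")
```

Redundant directions show up as off-diagonal entries close to 1 (or -1), while distinct directions stay near the low values typical of unrelated vectors.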

Figure 5. (a) sorted archetype activity; (b) cosine similarity between top 20 archetypes

Inspecting individual archetypes supports this interpretation. Figure 6 visualizes two frequently used archetypes (9 and 15), where each bar corresponds to a filter and indicates how that filter’s activation typically changes during the transition. Archetype 9 exhibits a characteristic pattern in which several filters at the beginning and end of the vector increase while a group in the middle decreases. Archetype 15 shows a different structure, with strong decreases across several filters and sharp increases in a smaller subset. These patterns differ both in which filters are involved and how they change, suggesting that the archetypes capture qualitatively distinct modes of coordinated activation change rather than small variations of a single pattern. I observed similar diversity across the other frequently used archetypes.

Figure 6. Activity vectors for two archetypes.

Taken together, these are votes in favor of the idea that the learned archetypes are sparse and non-redundant. However, this analysis does not yet tell us whether this small set of directions is sufficient to explain the full range of observed transitions.

1.5.2 How well do the archetypes represent the data?

To examine how well the SAE's sparse set could cover the data, I tested how faithfully individual transitions could be reconstructed using a small number of archetypes.

Using the full set of archetypes (128 directions) reduces reconstruction error to numerical precision (~1e-14), as expected. But sparse reconstruction also yields a low average error. Figure 7a shows that reconstruction error drops quickly as the number of archetypes increases. Reconstruction with the top 20 archetypes gives an error of ~1e-6 (reconstruction with the top 20 archetypes found via sparse dictionary learning also landed around 1e-6, while reconstruction with 20 random directions gave ~0.43, suggesting this is not just a basis artifact).

Figure 7. (a) reconstruction error vs. sparsity; (b) a single reconstructed transition

Figure 7b shows a representative reconstruction with just 4 archetypes. While the reconstructed signal is slightly more compressed in amplitude, the locations of peaks and troughs align closely with the original, and the relative structure across dimensions is largely preserved (e.g., the same coordinates tend to dominate in both). If activation changes were highly variable and high-dimensional, we would expect low-rank reconstructions to distort both the shape and relative structure of the transition. Instead, even a small number of archetypes captures most of the variation.
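The reconstruction test described above can be sketched with a least-squares projection. The error metric (relative squared error), the projection step, and the synthetic generators are assumptions, since the post does not specify how its errors were computed; `relative_error` is a hypothetical helper.

```python
import numpy as np

def relative_error(generators, directions):
    """Relative squared error after least-squares projection onto span(directions)."""
    coeffs, *_ = np.linalg.lstsq(directions.T, generators.T, rcond=None)
    residual = generators.T - directions.T @ coeffs
    return float((residual ** 2).sum() / (generators ** 2).sum())

# Synthetic stand-in: generators are sparse mixtures of 20 "true" directions.
rng = np.random.default_rng(0)
dim = 32
top = rng.standard_normal((20, dim))
codes = rng.standard_normal((2000, 20)) * (rng.random((2000, 20)) < 0.2)
generators = codes @ top

err_top = relative_error(generators, top)        # the 20 directions that built the data
err_few = relative_error(generators, top[:4])    # only 4 of them
err_random = relative_error(generators, rng.standard_normal((20, dim)))  # random baseline

print(f"top 20: {err_top:.1e}, first 4: {err_few:.2f}, random 20: {err_random:.2f}")
```

The random baseline is the important control: if 20 arbitrary directions reconstructed the data about as well, a low error would say more about dimensionality than about the learned basis.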

In total, the above results corroborate the idea that the model moves between activation vectors along a sparse set of preferred directions. If this is true, the next question is: why?

1.5.3 Is sparsity just a property of short filters?

To test whether sparsity arises from the architecture itself or from learned structure in the filters, I compared archetypes learned from three versions of the first layer: trained weights, random weights, and handcrafted weights.

The comparison between trained and random weights initially suggested that sparsity might be largely architectural. Both produced a similar number of archetypes and similar usage patterns. However, this result was somewhat misleading. Because first-layer filters are very short, they operate over highly local waveform fragments that follow a small number of generic geometric patterns (e.g., increases, decreases, simple curvatures). Random filters are also combinatorially limited, which means they tend to define the same kinds of generic shapes as trained filters, and this leads to similar behavior in activation space.

So, I constructed a set of handcrafted filters designed to respond to waveform shapes that are rare in real audio. This was intended to reduce the alignment between filter responses and the statistical structure of the data, testing the hypothesis that if a model learned filters mismatched to the data, the transition basis would become denser. And this held up.

Under the handcrafted filters, the activation dynamics differed. Activation vectors derived from handcrafted filters showed much weaker and less coherent motion, and thus a denser basis, than their trained and random counterparts. And this suggests that sparsity is not solely a consequence of filter length. While short filters impose some structural constraints that help prime the model towards a sparse transition basis, the degree of sparsity in activation dynamics also depends on how well the filters align with recurring patterns in the data.

1.5.4 Is sparsity just a property of data? Does training data force the model into a low-rank motion regime?

To test whether sparsity arises from the data distribution, I fixed the trained model weights and varied the input data. I compared three cases: real audio (FMA small), a dataset of noise samples, and a set of flat signals (constant-valued inputs), which serves as a baseline where no meaningful local changes should occur.

Figure 8 shows the fraction of transitions explained by each archetype (sorted by rank) for the three datasets. If sparsity were purely a consequence of the architecture, we would expect similar usage patterns across all inputs. Instead, the distributions differ substantially.

Figure 8. Fraction of archetypes that are active above a threshold

Real audio (consonant musical sounds) exhibits the most concentrated usage: a relatively small number of archetypes account for most transitions, with the active fraction dropping off quickly. Noise (which is also part of the training data) produces a more diffuse pattern, with activity spread across many more archetypes and a slower decay (extending to ~50 directions). Flat signals, by contrast, produce almost no motion at all, with activity collapsing entirely into the lowest-rank direction. This acts as a sanity check: when the input contains no meaningful variation, the representation does not exhibit structured dynamics.
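A usage curve of this kind can be sketched as follows. The sketch assumes each transition has already been assigned to a single archetype index upstream; that hard assignment, the function names, the 90% coverage level, and the toy data are illustrative assumptions, not the exact procedure behind Figure 8.

```python
import numpy as np

def usage_by_rank(assignments, n_archetypes):
    """Fraction of transitions assigned to each archetype, sorted from most to least used."""
    counts = np.bincount(assignments, minlength=n_archetypes)
    fractions = counts / counts.sum()
    return np.sort(fractions)[::-1]

def n_to_cover(fractions_sorted, mass=0.9):
    """Smallest number of top-ranked archetypes whose usage sums to at least `mass`."""
    cumulative = np.cumsum(fractions_sorted)
    index = int(np.searchsorted(cumulative, mass))
    return min(index + 1, len(cumulative))

rng = np.random.default_rng(0)
n_archetypes = 20

# Toy "concentrated" case: three archetypes absorb every transition.
concentrated_ids = np.repeat([0, 1, 2], [80, 15, 5])
# Toy "diffuse" case: transitions spread roughly evenly over all archetypes.
diffuse_ids = rng.integers(0, n_archetypes, size=2000)

concentrated = usage_by_rank(concentrated_ids, n_archetypes)
diffuse = usage_by_rank(diffuse_ids, n_archetypes)

print("archetypes needed (concentrated):", n_to_cover(concentrated))
print("archetypes needed (diffuse):", n_to_cover(diffuse))
```

A curve that decays quickly needs few archetypes to reach the coverage level, while a slowly decaying curve needs many. This is one simple way to put a number on the difference between the concentrated and diffuse patterns described above.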

These differences also suggest that sparsity depends strongly on the structure of the data. Real audio, which contains the recurring local patterns we want the model to find, induces more consistent and compressible transitions. Noise, which is weakly patterned, leads to more variable dynamics.

This pattern is corroborated by comparing the learned archetypes across these data types. For each input type, I computed how well its top archetypes align with those learned from the baseline model (again, using cosine similarity). Real audio shows strong alignment (mean top-1 cosine ≈ 0.88), while noise aligns more weakly (≈ 0.71) and requires more directions to capture its variability, even though it's part of the training data.
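The alignment score can be sketched as a mean top-1 cosine similarity: for each archetype in one set, take its best cosine match in the baseline set, then average. The function name and the synthetic directions below are hypothetical, and the sketch assumes signed cosine similarity, so opposite directions count as dissimilar.

```python
import numpy as np

def mean_top1_cosine(candidates, baseline):
    """Average, over candidate directions, of the best cosine match in the baseline set.

    Both arguments have shape (n_directions, dim), one archetype per row.
    """
    a = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    b = baseline / np.linalg.norm(baseline, axis=1, keepdims=True)
    similarities = a @ b.T              # (n_candidates, n_baseline) cosine matrix
    return float(similarities.max(axis=1).mean())

rng = np.random.default_rng(0)
baseline = rng.standard_normal((10, 32))            # stand-in for baseline archetypes

shuffled = baseline[rng.permutation(10)]            # same directions, different order
perturbed = baseline + 0.5 * rng.standard_normal((10, 32))
unrelated = rng.standard_normal((10, 32))

score_shuffled = mean_top1_cosine(shuffled, baseline)
score_perturbed = mean_top1_cosine(perturbed, baseline)
score_unrelated = mean_top1_cosine(unrelated, baseline)
print(score_shuffled, score_perturbed, score_unrelated)
```

Because the score takes a maximum over the baseline set, it does not depend on the order in which archetypes are listed. It does depend on how many baseline directions are available to match against.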

The above sections suggest that sparsity is not fully explained by architecture or data alone. Instead, it appears to emerge from their interaction during training. The data provides recurring local waveform patterns and transformations, the architecture constrains how these can be represented (e.g., through locality and translation invariance), and training selects representations that capture these regularities in a consistent way. When these factors align, activation dynamics become more structured and compressible, with many transitions concentrated along a relatively small set of directions. In this sense, the behavior we might observe is negotiated; different models may learn different dynamics.

1.5.5 Are archetypes local or global actors?

So far, we have some evidence for a sparse set of archetypes, but this evidence does not tell us where these archetypes land within the broader structure of activation space. Are they globally defined (canonical update patterns reused by many transitions starting from many different places) or locally defined (update patterns reused by transitions starting in a similar subspace of activation space)?

To test this, I asked whether the most active archetype could be predicted from the starting activation vector. I trained a simple logistic regression model to predict the top archetype at each step, restricting the task to the 20 most frequently used archetypes. The model achieved ~85% accuracy (compared to a chance baseline of ~5%), lending credence to the idea that the direction of motion is state-dependent.
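A probe of this kind can be sketched with a small multinomial logistic regression. The version below is written in plain NumPy with full-batch gradient descent, which is a stand-in rather than the tooling used for the reported result. The data are synthetic, with labels tied to location in a 16-dimensional space by construction, so the printed accuracy only shows the mechanics and says nothing about the ~85% figure. All names and hyperparameters are assumptions.

```python
import numpy as np

def fit_softmax_regression(X, y, n_classes, lr=0.5, epochs=300):
    """Multinomial logistic regression trained by full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                         # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                              # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)

rng = np.random.default_rng(0)
n_classes, dim, n = 20, 16, 4000

# Synthetic stand-in: each "top archetype" label is tied to a region of state space.
centroids = 2.0 * rng.standard_normal((n_classes, dim))
y = rng.integers(0, n_classes, size=n)
X = centroids[y] + rng.standard_normal((n, dim))

X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]

# Standardize with training statistics so the fixed learning rate behaves well.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

W, b = fit_softmax_regression(X_train, y_train, n_classes)
accuracy = float((predict(X_test, W, b) == y_test).mean())
chance = 1.0 / n_classes
print(f"held-out accuracy: {accuracy:.2f} (uniform chance: {chance:.2f})")
```

The 1/20 chance level assumes all 20 classes are equally likely. If some archetypes dominate, a majority-class baseline would be the stricter comparison.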

This result is supported by examining how far each archetype tends to spread. For each archetype, I measured the dispersion of its starting points in activation space, defined as the mean Euclidean distance from its centroid. All archetypes showed tighter dispersion than a random baseline (radius ≈ 4.1), which implies that they are not applied uniformly across the space.

Archetype Number	Size (n)	Dispersion Radius	Random Radius
0	162696	3.2	4.1
125	143724	3.2	4.1
23	20525	1.6	4.1
71	23165	2.5	4.1
104	11766	0.6	4.1

However, the degree of locality varies. The table above shows statistics for five representative archetypes. For example, archetype 104 is tightly concentrated (radius ≈ 0.6), while others (e.g., 0 and 125) span broader regions (≈ 3.2). This suggests that archetypes differ in how globally they act: some correspond to broadly applicable transitions, while others are tied to more specific regions of activation space.
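The dispersion measure can be sketched directly from its definition: the mean Euclidean distance of an archetype's starting points from their centroid. The random baseline below draws equally sized random subsets from all starting points, which is an assumption about how such a baseline might be built. The function names and synthetic data are hypothetical.

```python
import numpy as np

def dispersion_radius(points):
    """Mean Euclidean distance of the points from their centroid."""
    centroid = points.mean(axis=0)
    return float(np.linalg.norm(points - centroid, axis=1).mean())

def random_radius(all_points, size, rng, n_draws=20):
    """Average dispersion of random subsets of `size` points (assumed baseline)."""
    radii = []
    for _ in range(n_draws):
        idx = rng.choice(len(all_points), size=size, replace=False)
        radii.append(dispersion_radius(all_points[idx]))
    return float(np.mean(radii))

rng = np.random.default_rng(0)
starts = rng.standard_normal((5000, 16))   # stand-in for starting activation vectors
labels = rng.integers(0, 5, size=5000)     # stand-in for top-archetype assignments

# Make archetype 0 "local": squeeze its starting points into a small region.
mask = labels == 0
starts[mask] = 0.2 * starts[mask] + 1.0

local_radius = dispersion_radius(starts[mask])
baseline_radius = random_radius(starts, int(mask.sum()), rng)
print(f"archetype 0 radius: {local_radius:.2f}, random radius: {baseline_radius:.2f}")
```

A radius well below the random baseline indicates that an archetype's transitions start from a confined region, while a radius near the baseline indicates that the archetype is applied more or less everywhere.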

1.5.6 How does all of this fit together?

Figure 9 illustrates how these dynamics manifest in practice. It shows representative trajectories for three types of input: a pure sine wave, a real-world audio sample, and broadband noise. Note that the middle plot in each panel visualizes a projection of the high-dimensional activation space into two dimensions, so its geometry should be interpreted cautiously. However, the relative differences between these cases are sufficient to corroborate the quantitative results above.

Figure 9. Three representative examples: a sine wave (a one-frequency signal), a random sample from the dataset, and broadband noise.

To sum things up quickly, the sine wave produces a smooth trajectory through activation space, with gradual and coordinated changes across filters. The real-world audio sample introduces more complexity, but the trajectory remains largely smooth, with structured variation over time. In contrast, the noise input produces jagged and irregular motion, with many filters changing abruptly and much less evidence of coordinated structure.
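The contrast between smooth and jagged trajectories can be quantified with a rough sketch like the one below. It measures the average step between consecutive response vectors relative to their typical size. The random filters, the 16 kHz sample rate, the 220 Hz tone, and the omission of any nonlinearity are all assumptions made for illustration, not properties of the model analyzed here.

```python
import numpy as np

def response_trajectory(signal, filters):
    """Linear filter responses over time, shape (time_steps, n_filters)."""
    kernel_size = filters.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(signal, kernel_size)
    return windows @ filters.T

def relative_step_size(trajectory):
    """Mean step length between consecutive points, divided by the mean vector norm."""
    steps = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
    scale = np.linalg.norm(trajectory, axis=1).mean()
    return float(steps.mean() / scale)

rng = np.random.default_rng(0)
filters = rng.standard_normal((32, 8))     # hypothetical short first-layer filters

sample_rate = 16000                        # assumed sample rate in Hz
t = np.arange(2048) / sample_rate
sine = np.sin(2 * np.pi * 220.0 * t)       # single-frequency input
noise = rng.standard_normal(t.size)        # broadband noise input

sine_ratio = relative_step_size(response_trajectory(sine, filters))
noise_ratio = relative_step_size(response_trajectory(noise, filters))
print(f"sine: {sine_ratio:.3f}, noise: {noise_ratio:.3f}")
```

A small ratio means each step moves the representation only slightly relative to its overall magnitude, which is what a smooth trajectory looks like. A ratio near or above one means consecutive vectors are largely unrelated.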

The sine wave example also aligns with the earlier argument that a single oscillatory structure can be represented in a distributed way at the first layer, as a coordinated pattern across many filters. The representation in the first animation evolves smoothly across time, with slightly different clusters of filters activating for different parts of the oscillation. But as in the opening example, this only appears to encode the oscillatory structure as a distributed representation because the waveform is the oscillatory structure.

This breaks in the case of mixtures. For more complex signals (such as the second example, which is a sum of oscillatory components), I did not find evidence that clusters of filter activations responded to independent oscillatory components (whether looking at features or archetypes). Instead, the activation patterns appear to reflect the combined waveform pattern, as they do for the sine wave.

1.6 The ontology of the first layer is structured, but not in a way we can easily interpret

My analysis suggests that the first convolutional layer likely does not learn to stabilize partials or other primitive oscillatory structures, either via individual learned filters or via distributed patterns across them. If the model had learned partials, we would expect to observe raw activations or stable feature directions that consistently track narrowband oscillatory components across contexts. These activations or directions would vary smoothly with the strength of particular oscillations in the signal. But I found no evidence for this.

That said, the layer does seem to exhibit a coherent internal structure.

Convolutional filters in raw-audio models are expected to learn primitive waveform detectors. But less is known about how the activations of these filters structure themselves. The above analysis advances the idea that activation vectors tend to cluster along a set of recurring directions, and transitions between them tend to follow a relatively small number of patterns. Taken together, this suggests that the first layer behaves like a low-dimensional dynamical system over waveform primitives: it represents local structure as coordinated patterns of activity and change.

This structure appears to emerge from the interaction between the data and the architecture. The locality and parameter sharing of convolution constrain the space of possible representations, while the statistical regularities of real audio, together with training dynamics, encourage the model to reuse a consistent set of activation patterns and transitions. The result is a representation that is structured both in position (features) and in motion (archetypes), rather than one that moves arbitrarily through activation space.

At the same time, this structure is inherently approximate. The features are not perfectly discrete, and the archetypes are not deterministic. Instead, they reflect statistical tendencies: certain activation directions generally recur, and certain transitions between them generally recur.

Moreover, both seem to be coherent within the model’s own representational basis, but neither cleanly surfaces many of the structures that tend to organize human musical perception. When alignment with musical perception does occur, it often appears to be incidental. For example, features stabilize spectral features only when they happen to satisfy the same low-level waveform constraints. And even when this does happen, those features generally look like broadband noise features (e.g., transients) in the spectral domain.

This raises a natural next question. If oscillatory primitives are not stabilized at the first layer, do they emerge later, once the model has access to larger receptive fields and can achieve finer frequency selectivity? The most promising results I have so far point in this direction; they come from applying the feature and archetype analyses to deeper layers and discrete representations.

References:

[1] Lee, Jongpil, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. "SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification." Applied Sciences 8, no. 1 (2018): 150.

[2] Défossez, Alexandre, Jade Copet, Gabriel Synnaeve, and Yossi Adi. "High fidelity neural audio compression." arXiv preprint arXiv:2210.13438 (2022).

[3] Gabor, Dennis. "Theory of communication. Part 1: The analysis of information." Journal of the Institution of Electrical Engineers-part III: radio and communication engineering 93, no. 26 (1946): 429-441.
