The density of states is a central concept in solid state physics. It measures the number of possible states with a given energy per unit volume. Sard's theorem states that the set of critical values of a smooth function has measure zero, and regular values of t are those for which the pre-image is a sub-manifold. At a regular value, the density of states is calculated by integrating over the level set, a submanifold, with the integrand divided by the magnitude of the gradient, which controls the volume between two nearby level sets. As the level set approaches a singular one, the density of states can diverge to infinity, as it does for x squared minus y squared, in contrast to the sum of squares example.

The density of states, a concept from solid state physics, is a real valued function of a real parameter t. It is the number of possible states with a given energy E per unit volume. In a free electron gas in d dimensions, states are parameterized by wave vectors k, which give the frequency of the wave function in each direction, and the energy is proportional to the squared norm of k. As the maximum allowed energy increases, the number of available states increases, and the rate of that increase is the density of states. It is an important concept for understanding the effect of the RLCT, and a rapid change in the number of available states at some threshold energy is physically meaningful.

For a free electron gas, the number of states with energy at most E is the volume of a ball, which gives the density of states of a regular model; singularities can appear when the energy function is more complicated. A generic pre-image, or fiber, is a sub-manifold, and the density of states is the amount of new volume acquired as the level sets move. Sard's theorem states that the set of critical values of a function has measure zero, and regular values of t are those for which the pre-image is a sub-manifold. A heuristic is proposed to define the density of states by filling up the space U with level sets.

The density of states is obtained from the volume between two nearby level sets of a function p. This volume is calculated by integrating h divided by the magnitude of the gradient of p along S0. Taylor expanding p near a point x shows how far one must travel for the value of p to increase from t to t plus h, with larger gradients requiring a shorter distance. The distance between the two level sets at x is approximately h divided by the magnitude of the gradient of p at x. If the gradient of p is 0 at some point, the two level sets may meet there and the volume calculation breaks down.

The density of states is calculated by integrating a function over a submanifold of dimension n minus 1. In the singular example x squared minus y squared, as the level set approaches the zero level set the gradient becomes small at some points, so one over its magnitude becomes large and the integral defining the density of states diverges to infinity. This contrasts with the regular case of a sum of squares, where the density of states goes to zero as the level set approaches zero.

Density of states is an important concept in singular learning theory, which is related to solid state physics. It is a real valued function of a real parameter t, and its asymptotic behaviour is determined by the singularities of the set of true parameters. In a system with n discrete configurations, the density of states is the number of possible states with a given energy e per unit volume. For a continuous system, the density of states must be calculated differently. It is an important concept in understanding the effect of the rlct.
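
The discrete definition can be sketched in a few lines of Python. This is only an illustration: the energy list, the volume, and the function name `discrete_density_of_states` are hypothetical and not taken from the lecture. It counts states with a given energy and divides by the volume, which is what the Kronecker delta sum does.

```python
from collections import Counter


def discrete_density_of_states(energies, volume):
    """Return D(E) = (1/V) * sum_i delta(E, E_i) as a dict keyed by energy.

    `energies` lists the energy E_i of each of the n discrete states;
    the order of the states does not matter.
    """
    counts = Counter(energies)
    return {energy: count / volume for energy, count in counts.items()}


# Hypothetical toy system: five states, two of them degenerate at energy 1.0.
energies = [0.0, 1.0, 1.0, 2.0, 3.0]
dos = discrete_density_of_states(energies, volume=2.0)
```

Energies that no state attains are simply absent from the dictionary, which corresponds to a density of zero.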

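N(E) is the number, or measure, of the set of states with energy less than or equal to E, and the density of states is its rate of change with E per unit volume. A free electron gas in d dimensions is the simplest example, with states parameterized by wave vectors that give the frequency of the wave function in each direction and energy proportional to the squared norm of the wave vector. As the maximum allowed energy increases, the number of available states increases, and the density of states counts the newly accessible states per unit volume. This is an important concept in solid state physics, because a rapid change in the number of available states at a threshold energy is physically meaningful.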
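
As a rough numerical check of the free electron gas example, the sketch below counts integer wave vectors k with squared norm at most E and estimates how N(E) scales. Two assumptions are mine, not the lecture's: wave vectors are restricted to an integer lattice (as for a periodic box), and units are chosen so that the constant h bar squared on 2m equals 1. The function names are hypothetical.

```python
import numpy as np


def count_states(max_energy, dim):
    """Count integer wave vectors k in Z^dim with |k|^2 <= max_energy."""
    radius = int(np.floor(np.sqrt(max_energy)))
    axis = np.arange(-radius, radius + 1)
    grids = np.meshgrid(*([axis] * dim), indexing="ij")
    energy = sum(g.astype(float) ** 2 for g in grids)  # |k|^2 at each lattice point
    return int(np.count_nonzero(energy <= max_energy))


def loglog_slope(e_low, e_high, dim):
    """Estimate the exponent a in N(E) ~ E^a from two energies."""
    n_low, n_high = count_states(e_low, dim), count_states(e_high, dim)
    return np.log(n_high / n_low) / np.log(e_high / e_low)


slope_2d = loglog_slope(400.0, 1600.0, dim=2)  # expected to be close to d/2 = 1
slope_3d = loglog_slope(100.0, 400.0, dim=3)   # expected to be close to d/2 = 1.5
```

Since N(E) grows like E to the d on 2, its derivative, the density of states, grows like E to the d on 2 minus 1.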

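For the free electron gas, the number of states with energy at most E is proportional to the volume of a d-dimensional ball of radius E to the one half, so the density of states is proportional to E to the d on 2 minus 1; this is also the density of states of a regular model, where d is the number of parameters. In solid state physics, singularities appear for the same reason they appear in statistical learning theory, depending on the complexity of the energy function. For example, k squared is singular only in a very mild sense (it is a Morse function), but more interesting singularities can be found with higher powers of the norm of k. In a purely mathematical setting, one fixes an open subset U of R to the n and a smooth function p on U, partitions U into level sets of p, and studies the volume between level sets.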

A generic pre-image is called a fiber and is a sub-manifold in the plane. As the value is perturbed, the pre-image changes and the properties start to reflect the singularity. The density of states is the amount of new volume acquired as the level sets move across the page. A formal definition of this is given, with regular values and critical values discussed.

If the gradient of p is non-zero at every point of the pre-image of t, then the pre-image is a sub-manifold of dimension n-1, and this is the case for generic t. Sard's theorem states that the set of critical values of p, where the pre-image may fail to be a sub-manifold, has measure zero. Regular values of t are those whose pre-image does not meet the set of critical points. When filling up U by dragging level sets across the picture, attention should be paid to regular values of t. A heuristic is proposed to define the density.

The gradient of p is non-zero at regular values, and is larger at some points than others. This affects the level sets of p, with larger gradients resulting in shorter distances between level sets. Taylor expanding p near a point x, p can be approximated as p at x plus a sum of h multiplied by the partial derivatives of p. This allows for the value of p to increase from t to t plus h, with larger gradients making it easier to do so.

To measure the volume between two level sets, one can solve an equation involving the gradient of p at a point x. The distance between the two level sets is approximately h divided by the magnitude of the gradient of p at x. The volume is then calculated by integrating h over the magnitude of the gradient of p along S0, which leads to the definition of the density of states. If the gradient of p is 0 at some point, the two level sets may meet there and the volume calculation breaks down.
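
The board computation from the lecture can be written out as follows. The notation is an assumption on my part: d' is the step size along the (non-unit) gradient, d is the resulting distance, and d sigma is the area element on S_0. It is a first-order heuristic valid at a regular value t, with x in S_0 so that p(x) = t.

```latex
\begin{align*}
p\bigl(x + d'\,\nabla p(x)\bigr) &\approx p(x) + d'\,\lVert \nabla p(x) \rVert^{2} = t + h, \\
d' &= \frac{h}{\lVert \nabla p(x) \rVert^{2}}, \\
d &= d'\,\lVert \nabla p(x) \rVert = \frac{h}{\lVert \nabla p(x) \rVert}, \\
\operatorname{Vol}(t,\, t+h) &\approx \int_{S_0} \frac{h}{\lVert \nabla p \rVert}\, d\sigma .
\end{align*}
```

The last line sums the thin rectangles of width d over the level set, which is why a large gradient contributes little volume and a small gradient contributes a lot.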

Density of states can be identified with the volume between level sets of a function p, per unit change in the value. This volume is calculated as the integral over S0 of h on the magnitude of nabla p, weighted by a compactly supported function phi. The density of states is then defined, up to a volume factor, as the integral over the level set S0 of phi divided by the magnitude of nabla p. A more sophisticated form of the argument, using distributions, fills in the details.

Density of states is calculated by integrating a function over a submanifold of a certain dimension. The simplest example is a sum of squares, where the pre-image of t is a sphere of radius t to the one half. The integral defining the density of states is proportional to t to the n on 2 minus 1, the derivative of the volume of the ball. As the sphere collapses, the integrand grows, because the gradient is evaluated at points closer and closer to the origin, while the sphere over which one integrates shrinks.
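
For the sum of squares the two descriptions of the density of states can be compared directly. The sketch below assumes phi equal to 1 and drops the overall volume factor, and uses the standard formulas for the volume of a ball and the area of a sphere; the function names are hypothetical. It checks that the integral of one over the gradient norm over the sphere agrees with the derivative of the enclosed volume.

```python
import math


def ball_volume(t, n):
    """Volume of {x in R^n : |x|^2 <= t}, a ball of radius sqrt(t)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * t ** (n / 2)


def level_set_integral(t, n):
    """Integral of 1/|grad p| over the sphere |x|^2 = t, for p(x) = |x|^2.

    On that sphere |grad p| = 2 * sqrt(t) is constant, so the integral is
    (sphere area) / (2 * radius), which is proportional to t^(n/2 - 1).
    """
    radius = math.sqrt(t)
    sphere_area = 2 * math.pi ** (n / 2) / math.gamma(n / 2) * radius ** (n - 1)
    return sphere_area / (2 * radius)


def finite_difference(t, n, h=1e-6):
    """(Vol(t + h) - Vol(t)) / h, the heuristic definition with small h."""
    return (ball_volume(t + h, n) - ball_volume(t, n)) / h


checks = {n: (level_set_integral(0.3, n), finite_difference(0.3, n)) for n in (2, 3, 5)}
```

For n greater than 2 the exponent n on 2 minus 1 is positive, so the shrinking sphere wins over the growing integrand and the density of states goes to zero with t.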

As the level set approaches a singular one, either the growing integrand or the shrinking domain of integration can win. An example is given of the function p equal to x squared minus y squared, whose level set at t equals zero is a pair of crossing lines, not a sub-manifold, while nearby level sets are hyperbolas. At the points (plus or minus root h, 0) on the level set at height h, the gradient of p becomes smaller and smaller as h goes to zero, so one over its magnitude becomes larger and larger. An explicit phi is given to calculate the integral, which diverges, meaning the density of states approaches infinity as s approaches zero. This contrasts with the regular case, where the density of states goes to zero as the level set approaches zero.
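
The divergence can be seen numerically. The lecture does not record its explicit phi, so the sketch below makes its own assumptions: U is the closed unit disk, phi equals 1 on it, and the volume factor is dropped. Parameterizing each branch of the hyperbola x^2 - y^2 = s by y, the integrand (1/|grad p|) d sigma simplifies to dy / (2 sqrt(s + y^2)), which has the closed form used for comparison.

```python
import numpy as np


def hyperbola_dos(s, num_points=200_000):
    """Integral of 1/|grad p| over {x^2 - y^2 = s} inside the unit disk.

    Assumes 0 < s < 1 and phi = 1. Uses a midpoint rule in y on [0, y_max];
    the factor 4 accounts for two branches (x > 0, x < 0) and for y < 0.
    """
    y_max = np.sqrt((1.0 - s) / 2.0)         # where the hyperbola leaves the disk
    dy = y_max / num_points
    y = (np.arange(num_points) + 0.5) * dy   # midpoints
    return 4.0 * np.sum(0.5 / np.sqrt(s + y**2)) * dy


def hyperbola_dos_closed_form(s):
    """Same integral evaluated exactly: 2 * asinh(y_max / sqrt(s))."""
    return 2.0 * np.arcsinh(np.sqrt((1.0 - s) / 2.0) / np.sqrt(s))


levels = (1e-1, 1e-2, 1e-3, 1e-4)
values = [hyperbola_dos(s) for s in levels]
```

Under these assumptions the integral grows like minus the logarithm of s, so it is unbounded as s approaches zero, unlike the sum of squares case.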

The density of states is the rate of change of volume with a varying value. It can be made arbitrarily large for small h and s close to zero. The behavior of the density of states near a singular level set depends on the original function and its values near the pre-image of critical values. For x squared minus y squared, the density of states can behave differently than for the sum of squares. This can affect Bayesian posterior and learning.

so today i'm going to give a bit of an introduction to what's called the density of states which is an important piece of mathematics in singular learning theory which we haven't discussed fully until now it's an idea that appears particularly in physics i think it's maybe not super common in statistics as far as i know so just to introduce it in the language that watanabe uses uh the density of states which he writes i'm not sure if it's v or nu or maybe i'll just write v so i'll explain what it means it's perhaps the easiest way to understand the meaning of the rlct which we've discussed so the density of states is some real valued function of a real parameter t and its asymptotic behavior as t approaches zero is as we'll see in some examples controlled by the singularities of the set of true parameters so the density of states is a function associated to a triple p q phi and it has both a straightforward interpretation and illustrates the effect of the rlct so that's why it's a topic of interest all right so in solid state physics if you do a course on solid-state physics or condensed matter physics you'll encounter the density of states very early usually referred to as just dos and written as d rather than v so i'm going to stick to writing d just because that's what you'll encounter in the references so by the way the notes for this uh seminar are on the seminar webpage and they have references to textbooks and other sources for digging into this further so the density of states is defined as follows for a system with n discrete configurations i'll worry about the continuous case in a moment and a good example would be a particle in a box in quantum mechanics assuming some limit on the possible momentum the possible wave functions are quantized depending on the size of the box and so there's finitely many call them n discrete configurations and volume v the density of states is d of e e is an energy so it's per unit volume how many of 
the possible states have that energy e i is the energy of the ith state the order in which you i mean this the set of states is some finite set the order doesn't matter it's just enumerate all the possible states each one has an energy and this is the kronecker delta so this is one if e is equal to e i and zero otherwise right so this just counts the number of states with the given energy e per unit volume so for a continuous system we of course have to do something a little bit different uh just to foreshadow where we're going

to connect this back to statistics you can think of the loss function uh or kl divergence k as being something like energy right so if you want to count the number of states with a given energy that's a bit like counting the number of configurations of your neural network the number of weights which have a given value of the kl divergence k for example of course that's not an integer right you have to think about the measure of that set perhaps but we'll come back to that but that's the analogy we're going to make later right so for a continuous system instead you define the density of states at energy e to be one over the volume the number of states which have energy less than or equal to e plus delta e so i will define n in a moment so n e is the number or the measure of the set of states with energy less than or equal to e okay so this is clearly the rate of change of the number of states with energy less than or equal to e as you increase e that is d e is the number of new states newly accessible states per unit volume when the maximum allowed energy is increased so a typical example in a solid state physics textbook would be something like a conductor and you will have depending on the material certain energy bands that are occupied by electrons and as you increase the allowed energy well the number of available states for your system uh increases doesn't decrease of course because it's less than or equal to e if a state is accessible at energy e it's accessible at energy plus delta e so this n is a monotonically well it's a non-decreasing function of the energy and as you increase the energy more and more potential states are available and if that number is changing very rapidly at a given sort of threshold value of the energy then that's physically very meaningful so that's why it's an important central idea in solid state physics so i want to illustrate this with the simplest possible example which is a free electron gas in d dimensions so and if you don't 
have a background in physics it doesn't matter i think it's just useful to illustrate a place in which these concepts also appear so that they maybe don't seem so specific to singular learning theory so for a free electron gas in d dimensions states are parameterized by wave vectors which just give the frequency of the wave function in all the d directions and the energy of such a state so k is a vector and it's given by h bar squared on 2 m k squared k squared just here

meaning the norm so k dot with itself so the total number of states of energy less than or equal to e well that's just asking for the volume of a sphere i mean there's this there's this factor here right so it's not you have to adjust the radius accordingly but who cares that's just a constant factor in front so it's of a d sphere of radius e to the one half so that number is proportional to e to the d on two right so that means that d by the above formula since it's the derivative of uh n e with respect to e it is proportional to e to the d on two minus one and you might recognize that so this is going to be the density of states for a regular model right so if you think about the possible states of a neural network well not a neural network that's singular but if you have a regular model like just a simple gaussian parameterized by its say mean then the density of states of that looks like exactly this example where d is the number of parameters and more interesting systems so this came back to the fact the form of the density of states in that case came back to the formula for the energy in more complicated systems that's a more complicated function of k the more complicated the function of k the more interesting the density of states all right so let me now switch over to just purely mathematics well maybe i'll make one comment so one of the reasons to introduce this connection to physics is just to point out that singularities appear in solid state physics for exactly the same reason they appear in statistical learning theory this density of states depends on the complexity of this energy function k squared is a simple function it is singular but uh in a very mild sense so it's a morse function if you change k squared to k cubed or higher powers of k i mean by that i mean the norm of k to a higher power then you get more interesting singularities and more interesting density of states and more interesting physics and this is something that's a very active 
topic in solid-state physics um you can look up terms like van hove singularities and so on magic angles you can see references in the notes for that all right so to come back to a purely mathematical setting uh i want to talk about the volume between level sets so i'm going to fix an open subset of r to the n and a smooth function on u so in physics p would be the energy e and in statistical learning theory this would be k so you can of course partition the set u into level sets right so just take the pre image

under p of every value t so it's worth drawing a picture just to kind of make the point that not all level sets are the same there's a kind of generic behavior and special fibers so such a pre-image is sometimes called a fiber if it's thought of as having geometric structure okay so a generic fiber so let's call it t generic so this is r you pick a generic value and look at all the elements of u that have that value so this is p inverse t generic it looks like that picture that picture is meant to communicate that this thing here is a sub manifold that is it looks like r right in my picture it looks like r sitting inside r2 and in my picture this plane in which this fiber sits is u but there will be special values where the pre-image doesn't look like that so maybe p inverse t 0 has a point here so the fiber is not a sub manifold here or maybe maybe it looks like this so what i want you to think about now is the the pre-image of nearby values so if i were to take a value down here let's call it t 0 prime i guess and to take its pre image well generically if you perturb the value a little bit the pre image is now not singular that is it is a sub manifold as i've drawn it so it's sort of like this case except as you make these values closer and closer it becomes in some sense i mean it's always a sub manifold but its properties start to reflect the singularity this point that i've indicated where the fiber is not a sub manifold okay so that's part of the context of what i'm about to talk about now what's the relevance to the density of states well consider flooding the graph so flood the graph well not really flood the graph uh flood u so that is i want you to consider looking at the set u and kind of filling in everything to the left of a given level set the way i've drawn it increasing t goes this way so as i increase t i've got this sort of march of level sets right and they're moving across the page and as i do that i'm 
filling up more and more and how much new volume do i acquire at each step right so this is a small change in the value induces a change in the level sets and this is the intermediate volume so that's how we're going to define the general density of states okay so let me now do that i guess well i sort of need to talk about regular values and critical values first okay so now we want to give a formal definition of how much volume is covered by the flood to the level set at t plus delta t

but this doesn't really make sense at the singular values if you look at the at the picture the singular values were not um i guess it's hard to explain why it doesn't make sense before i've said what i'm about to say so maybe i'll skip that comment so just accept for the moment that there's a complication here when we try and make a formal definition which is that it won't really work at those fibers which are singular all right um okay so some observations to clarify what i've said on the previous board informally so for generic t the pre-image is a sub manifold of dimension n minus one don't remember if i gave a value to the dimension i said u is an open subset of rn right yeah okay so more precisely if the gradient of p is non-zero non-zero vector at every point x in u with p x equal t that is in the pre-image well then p has rank one in an open neighborhood of the closed set p inverse t and then it's a standard theorem that this must be a sub manifold i i'm not really justifying why this is true for a generic t right so the statement i've just made is that if this is non-zero then it's a sub-manifold so why why do i know this is the generic behavior well that's by sard's theorem so by sard's theorem the set of critical values that is the t for which the pre image is not a sub manifold so the set of critical values of p i.e the t in r such that p inverse t is not a sub manifold so t 0 and t 1 on the previous board are examples of critical values that set of critical values has measure zero okay so that's what i mean by generic right except for some uh measure zero set of exceptions like in the previous board where t0 and t1 were just individual points and everything else i was kind of indicating had a pre-image which was a sub-manifold uh so that's what we have so some language i'm going to write crit p for the set of all x in u where the gradient so all partial derivatives of p are zero t in r is a regular value if it is well one way of saying it is the 
pre-image doesn't meet the set of critical points i.e it is not a critical value so regular value and critical value are two terms i want to use so t 0 and t 1 are critical values and t generic and t 0 prime and t 0 double prime are regular values so when we talk about filling up u by dragging these level sets across the picture i want to pay attention just to regular values of t okay so let's give it a heuristic i'm going to start with this sort of heuristic motivation for how to define the density

of states and then i'll um if there's time talk about sort of proper you know differential form way of doing it so defining the density of states so u and p as above let t in r be a regular value well since the regular values are generic there's an open neighborhood of t in r consisting of regular values let's let h be sufficiently small that all these level sets i'll call them s h which means i take the nearby level set with value t plus h let's assume that they're all sub manifolds which i can do by the previous sentence so we're going to consider those level sets and i'm going to draw a picture of them so here's here's s0 which is just the pre-image of t and then here's s h okay so on the pre-image so on s0 by hypothesis t is a regular value so the gradient of p is non-zero everywhere and i'm going to draw those in as green lines so now i want you to notice that i'm drawing them of different sizes for some reason can you figure out why or do you agree that they first of all why should they point to this level set and secondly why are the ones up here longer so these are all nabla p why are they shorter down here anybody okay so this this has got the partial derivatives of p in it right so if you were to taylor expand p near this point you would be making use of the gradient of p at that point the larger the partial derivatives the more quickly p will increase and so here uh i guess this shouldn't be longer than the one to its left here the partial derivatives are large because that vector is large but that means p increases quickly so in order to get the value of p to increase from t to t plus h that's what this is i want to increase that value of p from t to t plus h and the partial derivatives are large it takes less it takes less to do that right so the level set is closer i need to change the value of x the input less in order to increase the value of p by that fixed amount h if the gradient of p is large okay so that's why down 
here you see that it's far away because the gradient is small it takes a long time to increase to that level set okay so let's do so the taylor series expansion looks like the following if i start at the point x so this is my x and i travel along the gradient at x that's approximately p at x plus sum over i of h right i get one of these partial derivatives just because it's already in there the second one from the taylor series expansion so that's p x plus h nabla p squared so what i'm trying to figure out is is

how to measure the the volume in between these two level sets right so to get a volume i need to add up squares and to get squares i'm going to sort of pick out something like that and worry about its volume so that's what i'm doing right now but i want to work out what this distance here is okay so how am i going to work out that distance well i can solve an equation involving this right i know that i get to here when the value of p increases to t plus h okay so if we assume x is in s0 and then set t plus h equal to the above and solve for solve for um well not quite h right because this is not a unit vector maybe maybe i'll write this slightly differently so let me write that as the magnitude of the gradient at x times a unit vector in the direction of the gradient at x so this here is is this distance right so this distance d is is that quantity and h is what we get to change there okay so if we solve for t plus h we get t plus h is equal to well that's t already right so it's t plus h nabla p squared that's nabla p at x right and that gives us h equals h nabla p squared uh what did i do from here um no i shouldn't have called this h right it's a bad choice of letter um maybe i should just call it let me rewrite this equation and um it's not it's not h i don't know what letter to call it now but i call it d prime it's related to d d prime d prime prime all right so we get d prime d prime is equal to h on nabla p squared but the actual distance d is equal to d prime times nabla p that means that the actual distance is h on nabla p okay that's what i was getting at right so that's that's this distance here so if we want to work out the volume in between these level sets we just have to take this distance multiply it by a line element here use that to get a rectangle and then sum over multiple such rectangles any questions all right so basically we're going to integrate h over nabla p along s 0 and that's the definition of the density of states okay so the 
picture we end up with is the following just straightening things out a little bit this is s0 here's our point x i've got delta x here this is my level set s h i make delta x sufficiently small and then everything looks kind of straight like this this is h on this and that's the area element so the volume in between the level sets s0 and s h now maybe it's worth commenting at this point what goes wrong if nabla p is 0 somewhere if nabla p was 0 somewhere then this level set might meet that one right because it just the

value doesn't move and then well this this square thing doesn't make any sense so let's write it vol from the level set associated to t to the level set associated to t plus h is the sum over all those rectangles which is to say it's the integral over s0 of h on nabla p which is a function a vector-valued function well that's vector-valued and that's scalar valued defined on s0 and with respect to some line element on s0 and and s0 may not be finite right so we can put some regularization phi with compact support to make that actually finite which means that what i've just defined is a distribution rather than a number right it's a gadget that takes in a compactly supported test function and it gives you a number okay but i hope that's pretty clear so that's how we get this volume between level sets that i was drawing before okay so it's coming back to the density of states uh it is reasonable to identify so now think about p as being the energy and dragging this level set to the right is increasing the available energy and you get more states and the states are the points in u if you count how many states have become newly accessible well that's exactly this volume in between these level sets so then the density of states at t can be reasonably defined to be 1 on v remember the density of states is per unit volume uh v you know you can sort of define however you like but if if phi is fixed then it would be the integral i mean that's some some measure and then one on h vol t t plus h well remember the density of states had a term in it that looked like n e plus delta e minus n e on delta e right so that's what i've just written it's just h instead of delta e and that volume is exactly that difference between the number of accessible states except now i'm integrating to count states using phi rather than just talking about a number and then from the definition as we've written it that is that 
cancels this h here cancels that h there so we end up with the final definition of the density of states which is one on some kind of volume doesn't matter too much what that is the integral over the level set so s0 is the pre-image of t of phi divided by nabla p along s0 okay so that's our definition of the density of states uh right so that's what i've given you so far is a motivation for that definition not really a derivation uh you can do a more sophisticated form of this argument and fill in the details about the

distributions i kind of hinted at that's all done in the notes but i guess i won't have time for that okay so that's the density of states i want to compute this in an example and show how it's different in a regular case and something like a singular case are there any questions okay uh maybe i'm going to keep that diagram because it's nice example well the simplest example is just to take a sum of squares so what are the regular values well if i'm looking at the pre-image of p i'm looking at a sphere of a certain radius right the sphere is a sub manifold of rn i mean i'm talking about the n minus one sphere that's a sub manifold of rn unless i look at the sphere of radius zero that's not a sub manifold well it's not a sub manifold of dimension one less right so that's that's the only um singular value it's the only place where the gradient is the zero vector for t not equal to zero the level set s zero is a sphere of radius t to the one-half and if i compute the integral in the density of states so the integral over s0 that's the sphere of one over the gradient of this function on the sphere you can do a little calculation and see that's proportional to t to the n on two minus one which is what we saw before right that was exactly the calculation uh i didn't really do the calculation i just stated that in the example of a free electron gas you would end up with a density of states which was e to the d on two minus one d is now n but that's the same calculation so that example in the physical sense in the statistical sense that's a sort of i don't know geometric sense that's a regular example right because this function is a sum of squares so you see something to do with the volume of a sphere and its derivative which is which is this okay um so one thing to note is that so that means the density of states is proportional to the same thing right the difference is just this volume factor so that means that as t approaches zero so that is as you have 
you've got the picture you should have in your mind when you see this is a sphere which is collapsing right because t is the square of the radius and as the sphere collapses and i'm doing this integral over the sphere well two things are happening right one thing is that the thing over which i'm integrating is becoming smaller but this is also changing right so as i'm making t smaller and smaller well i'm evaluating the gradient here so this is a point x and that point x is closer and closer to the origin so

that will be smaller in norm which means this quantity here is decreasing as t goes to zero which means that ratio is becoming very large but this set over which i'm integrating is decreasing so i have this kind of fight between the integrand which is growing very large and the space over which i'm integrating which is becoming very small in measure so uh depending on the example either can win in this case it happens that uh the sort of the trend towards zero wins but that's not always the case so let me give you an example where the opposite trend the trend of the integrand to increase to infinity wins all right and i guess that's what i'll finish on so let p be x squared minus y squared so t equals zero is a singular value and the level set at t equals zero is this thing here which is not a sub manifold of r two take u to be some ball around the origin it doesn't matter so this is s zero p inverse t well t is zero so for small h this is a hyperbola so it looks like looks like this and the points plus or minus root h comma zero are in s h and those points have the gradient of p of norm equal to two root h as you can easily compute so that means that as i was just saying before that ratio goes to infinity at these points as h goes to zero and so as you make as you make this level set approach the zero level set uh the gradient at those points i guess it points rather in the opposite direction right this is the gradient those vectors become smaller and smaller so one over them becomes larger and larger so that's the integrand okay so i'm going to give an explicit phi just so we can actually compute the integral i guess that doesn't have compact support uh well it depends how i choose u so choose u to be a closed disk at the origin and then that's what then as h goes to zero this integral that i'm doing of phi on nabla p what it looks like is the following integral do the calculation and that diverges now what that means is that the density of states i guess i 
use t already i'll use s the density of states approaches infinity as s approaches zero in the previous sort of regular case where p was the sum of squares the density of states went to zero now what does that mean i mean in terms of the interpretation of the number of accessible states or whatever well the density of states going to zero means that near that level set as you vary the level set the volume doesn't change very much right whereas in this case the closer you make that blue level set sh to the left zero level

set uh for a given change in the value you'll get more and more volume right i mean it's the rate of change of the volume with the with the value that you that you're um varying so it takes a little bit of thought to get that clear in your head maybe but um so that's a very different situation and it's actually one of the keys to understanding why singularities affect the bayesian posterior and learning and so on okay so i think i'll make that connection briefly and then stop so i haven't mentioned the rlct yet but i'm now going to bring that into the story okay so the divergence of d of s to infinity in this case means that for sufficiently small h and s sufficiently close to zero so this ratio which is what the density of states is one over h the volume between these level sets may be made arbitrarily large so relative to the size in the change that is h so in that example i had these two lines that's two parts of the hyperbola and then i've got two level sets say this in here is the volume so that's vol s s plus h so for a given uh maybe i should label the level sets so that's s s and that's s s plus h right so if i take s sufficiently close to zero so i take this level set sufficiently close to the hyperbola and for sufficiently small h that change in volume so the volume that i've colored in dark blue there can be much i mean for all sufficiently small values of h so maybe don't worry about the constraint on h that volume can be very large arbitrarily large if you make this level set on the left close to the hyperbola so that volume that i've colored in can be an arbitrary factor larger than the uh change in the value so like sort of moving along the the copy of r that was at the bottom of the other graph that controls which level sets i'm taking okay so we see that the behavior of the density of states as t goes to zero can be complicated and depends intricately on the original function that is on p and its 
values near um near singular level sets that is the pre-image of critical values it's important here that this this level set which was the zero level set was not a sub manifold if it were a sub manifold uh i would always get the behavior where the density of states goes to zero as as t or s goes to zero so near a singular level set the density of states can behave in the way that i've just described for x squared minus y squared it can behave very differently to this case in the sum of squares and indeed