WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


Encoding is the process of assigning sequences of symbols to signals so that they can be stored efficiently. Shannon's source coding theorem states that the minimum number of symbols per signal required to encode a block of signals converges to the entropy of the source as the block size increases. The Kraft-McMillan theorem has a constructive proof which shows how to build an encoding whose codeword lengths are approximately the logarithm of one over the probability of each signal, and this is essentially optimal. The KL divergence measures the gap between the expected number of symbols per signal achieved with an encoding built from an estimated distribution and the optimal number achievable with the true distribution.

Short Summary


Encoding is the process of assigning symbols to observations or signals. Shannon's source coding theorem justifies the entropy of a probability distribution in terms of the expected number of symbols it takes to record a sample from the distribution. This applies to a variety of signals, such as a chunk of audio, a bird's song, or the colours of a set of colourful birds, where one might assign red the string 01, blue 11 and green 00. One can then ask which encodings are better according to criteria such as shorter is better, and investigate what the optimal encoding is.
An encoding assigns a sequence of symbols to each signal so that signals can be stored efficiently. The best encoding is the one that uses the fewest resources, meaning it is the shortest in expectation. The expected cost of an encoding is the sum over signals of the probability of the signal multiplied by the length of its codeword, and the problem is to minimise this quantity, U, over all encodings. The Shannon source coding theorem determines the value of U per signal in the limit as the block size goes to infinity: it converges to the entropy of the source.
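As a concrete illustration of the expected cost (not taken from the seminar: the probabilities and the second encoding below are made up for the example), the quantity being minimised can be computed directly in a few lines of Python:

# Minimal sketch: expected code length U(M) = sum over x of q(x) * |M(x)|.
# The colour probabilities are hypothetical, chosen only for illustration.
q = {"red": 0.5, "blue": 0.3, "green": 0.2}
M = {"red": "01", "blue": "11", "green": "00"}   # the encoding from the summary

U = sum(q[x] * len(M[x]) for x in q)
print(U)   # 2.0 symbols per signal, since every codeword has length 2

# A shorter encoding (still assigning distinct sequences to distinct signals)
# does better in expectation:
M2 = {"red": "0", "blue": "10", "green": "11"}
print(sum(q[x] * len(M2[x]) for x in q))   # 1.5 symbols per signal
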
The Kraft-McMillan theorem is used to construct lower and upper bounds for the optimal value U. A length function L assigns a positive integer to each signal, and every encoding determines one. The theorem states that a function L is the length function of some encoding if and only if the sum over x in X of |Σ|^(-L(x)) is at most one. This gives a useful characterisation of which functions are length functions of some encoding: if L assigns large lengths the constraint is easy to satisfy, while small lengths make the terms large and the sum must still fit under one.
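The Kraft-McMillan condition is easy to check numerically; the following sketch (with invented length assignments, and binary or ternary alphabets assumed) illustrates which candidate length functions can come from an encoding:

# Kraft-McMillan: L is the length function of some encoding over an alphabet
# of size D if and only if sum over x of D**(-L(x)) <= 1.
def kraft_sum(lengths, D=2):
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 2]))       # 1.0  -> realisable, e.g. 0, 10, 11
print(kraft_sum([1, 1, 2]))       # 1.25 -> no uniquely decodable binary code has these lengths
print(kraft_sum([3, 3, 3], D=3))  # ~0.11 -> easily realisable, but wasteful
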
An optimization problem with a constraint was then set up. The variables were changed from L(x) to s(x) = |Σ|^(-L(x)), so that the Kraft-McMillan condition becomes the constraint that the s(x) sum to at most one. The lengths were relaxed to allow non-integer values and the inequality was replaced by an equality. Introducing a Lagrange multiplier and setting the derivatives to zero shows that the unique critical point is s(x) = Q(x), that is, L(x) = -log_|Σ| Q(x). This solves the relaxed optimization problem and provides a lower bound on U.
For the upper bound, the Kraft-McMillan theorem is applied to the integer-valued length function L(x) = ⌈log_|Σ|(1/Q(x))⌉, which shows that the minimum expected length of the encoding of a signal is at most the entropy of the source plus one. The source coding theorem states that the minimum number of symbols per signal required to encode a block of independent and identically distributed signals converges to the entropy of the source as the block size increases: encoding whole blocks amortizes the extra symbol to an overhead of one over the block length, so the per-signal cost is squeezed between the entropy H(Q) below and H(Q) plus one over the block length above. Arithmetic coding is a technique which achieves amortized code lengths that are not integers.
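These bounds can be checked numerically; the sketch below uses a made-up three-signal source and a binary alphabet:

import math

# Hypothetical source distribution; binary alphabet, so logs are base 2.
q = [0.5, 0.3, 0.2]

H = -sum(p * math.log2(p) for p in q)                 # entropy in bits
ceil_lengths = [math.ceil(-math.log2(p)) for p in q]  # lengths 1, 2, 3
EL = sum(p * l for p, l in zip(q, ceil_lengths))      # expected code length

print(H, EL)             # H ≈ 1.485, E[L] = 1.7
assert H <= EL <= H + 1  # the bounds from the argument above
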
Entropy measures how many symbols an observer needs, per signal and in expectation, to store signals from a distribution. The Kraft-McMillan theorem has a constructive proof which shows how to build an encoding whose codeword lengths are roughly the logarithm of one over the probability of each signal. The KL divergence is the gap between the expected number of symbols per signal when using an encoding based on an estimated probability distribution and the optimal number that could be achieved if the true distribution were known. It can therefore be used to compare two probability distributions and measure how different they are.

Long Summary


An observer is receiving signals from a set X, which is assumed to be finite, distributed according to an unknown probability distribution. The problem is to store the information in these signals. A cartoon is used to illustrate this: a collection of colourful birds is seen outside the window, and to record their colours one has to choose how to encode them. The goal is to justify the entropy of a probability distribution in terms of the expected number of symbols it takes to record a sample from the distribution; the technical content is summarized as Shannon's source coding theorem.
Encoding is the process of assigning symbols to observations or signals. One can divide up colours in a discrete way, for example assigning red the string 01, blue 11 and green 00. An alphabet is chosen and a sequence of symbols from it is associated to each signal; English itself is such a choice, with the words red, blue and green as codewords. Questions can then be asked about which encodings are better according to criteria such as shorter is better. This applies to a variety of situations, such as a chunk of audio or a bird's song, where no encoding is given in advance, or where an existing encoding seems wasteful. The goal is therefore to determine the optimal encoding.
An encoding is an assignment of a sequence of symbols to each signal in a way that assigns distinct sequences to distinct signals. It is assumed that the resources used in storing the information in a signal are proportional to the length of its codeword, and that not all signals are created equal: it is more important to give a short encoding to more common signals, as this saves resources over many samples. The best encoding is the one that uses the fewest resources, meaning the shortest in expectation.
To store signals efficiently it is necessary to consider the expected cost of an encoding, the sum over signals of the probability of the signal multiplied by the length of its codeword, and to find an encoding which minimises this quantity. An easy case is when there are no more signals than symbols in the alphabet, since then every signal can be given a codeword of length one, the shortest possible; the next case, with one extra signal, is illustrated in the sketch below. This line of thought eventually leads to the Shannon source coding theorem, which determines the minimum per-signal cost in the limit as the block size goes to infinity.
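The transcript also discusses the case where there is one more possible signal than there are symbols in the alphabet, so that one signal must receive a codeword of length two. A small sketch (with hypothetical probabilities and a binary alphabet) shows why the longer codeword should go to the least likely signal:

# Three signals, two symbols: one signal must get a length-2 codeword.
# Giving it to the least likely signal minimises the expected length.
q = {"red": 0.6, "blue": 0.3, "green": 0.1}

def expected_length(lengths):
    return sum(q[x] * lengths[x] for x in q)

print(expected_length({"red": 2, "blue": 1, "green": 1}))  # 1.6
print(expected_length({"red": 1, "blue": 2, "green": 1}))  # 1.3
print(expected_length({"red": 1, "blue": 1, "green": 2}))  # 1.1  <- smallest
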
An encoding M is optimal if it minimises the expected codeword length. The Kraft-McMillan theorem is used to construct lower and upper bounds for this quantity U. If two encodings assign the same length to each signal then their expected lengths agree, so U is really a minimum over the possible length functions. This is not a trivial problem, and the most non-trivial step is to reparameterize: to figure out which functions are possible length functions of encodings.
A length function L assigns a positive integer to each signal, and every encoding determines one. The theorem states that a function L is the length function of some encoding if and only if the sum over x in X of |Σ|^(-L(x)) is at most one. This is a useful way to characterize which functions are length functions of some encoding. If L assigns large numbers to signals the constraint is easy to satisfy, since there are many strings of each large length to fit the signals into. If L assigns small numbers, the terms on the left become large and it is hard to keep the sum below one.
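The constructive direction of the theorem is not proved in the seminar. For orientation, the following is a minimal sketch of one standard construction (assuming the lengths satisfy the Kraft inequality): codewords are read off from the base-|Σ| digits of cumulative Kraft sums, processed in order of increasing length.

from fractions import Fraction

def code_from_lengths(lengths, D=2):
    """Build a prefix-free code over digits 0..D-1 from lengths satisfying
    sum_i D**(-lengths[i]) <= 1.  One standard construction, sketched."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    acc = Fraction(0)                      # running sum of D**(-l) so far
    for i in order:
        l = lengths[i]
        n = int(acc * D**l)                # first l base-D digits of acc
        digits = []
        for _ in range(l):
            digits.append(str(n % D))
            n //= D
        codes[i] = "".join(reversed(digits))
        acc += Fraction(1, D**l)
    return codes

print(code_from_lengths([2, 1, 3, 3]))     # ['10', '0', '110', '111']
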
A minimum is then taken over functions L from X to R, so L is allowed to be real valued, subject to the same constraint. Since X is finite, the values of L can be bounded into an interval without changing the minimum, so the minimum is taken over a compact set and is achieved by some L*. If this L* made the constraint inequality strict, one of its values could be decreased slightly while keeping the sum below one, contradicting minimality; so the inequality can be replaced by an equality.
The speaker then restated the optimization problem with its constraint. The variables were changed from L(x) to s(x) = |Σ|^(-L(x)), the quantity appearing in the constraint, so that the Kraft-McMillan condition, after relaxing to non-integer lengths and passing from the inequality to an equality, says that the s(x) sum to one. The problem was then formulated as a Lagrange multiplier problem.
A lower bound on U is found by introducing a Lagrange multiplier λ and forming the Lagrangian as a function of s and λ. Setting the partial derivatives to zero gives s(x) = -Q(x)/(λ ln|Σ|), and summing over x and using the constraint gives λ = -1/ln|Σ|. Substituting back yields s(x) = Q(x), the unique critical point. So the relaxed optimization problem has the solution s(x) = Q(x), that is, L(x) = -log_|Σ| Q(x), and this provides the lower bound U ≥ H(Q).
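For reference, the derivation sketched above can be written out compactly as follows (this is a restatement in the transcript's notation, not a quotation from the seminar; the natural logarithm appears when differentiating log_|Σ|):

\[
\mathcal{L}(s,\lambda) = \sum_{x\in X} Q(x)\,\log_{|\Sigma|} s(x)
  + \lambda\Big(\sum_{x\in X} s(x) - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial s(x)}
  = \frac{Q(x)}{s(x)\,\ln|\Sigma|} + \lambda = 0
  \;\Rightarrow\; s(x) = -\frac{Q(x)}{\lambda \ln|\Sigma|}.
\]
\[
1 = \sum_{x} s(x) = -\frac{1}{\lambda\ln|\Sigma|}
  \;\Rightarrow\; \lambda = -\frac{1}{\ln|\Sigma|},
\quad\text{so}\quad s(x) = Q(x),\;\; L(x) = -\log_{|\Sigma|} Q(x),
\]
\[
U \;\ge\; -\sum_{x\in X} Q(x)\log_{|\Sigma|} Q(x) = H(Q).
\]
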
Upper and lower bounds on the expected length of encoding signals from the source into strings over the alphabet are then discussed. U, the minimum expected length over all encodings, is bounded below by the entropy H(Q). An upper bound is constructed by exhibiting a specific integer-valued length function, L(x) = ⌈-log_|Σ| Q(x)⌉, and evaluating the sum over x in X of Q(x)L(x). Arithmetic coding is a technique which achieves amortized code lengths that are not integers.
By the Kraft-McMillan theorem this integer-valued length function is the length function of an actual encoding, which gives the upper bound: the minimum expected length of the encoding of a signal is at most the entropy of the source plus one. By considering a block of signals, a sequence of B i.i.d. samples from the source, the extra symbol is amortized to an overhead of 1/B per signal, giving an upper bound of the entropy plus 1/B.
The source coding theorem states that the minimum number of symbols per signal required to encode a block of independent and identically distributed signals converges to the entropy of the source as the block size increases. This is related to the minimization for single signals: the expected length of writing down a sequence of B i.i.d. samples is B times the expected number of symbols required to write down a single sample.
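The convergence can be illustrated numerically by coding whole blocks of B i.i.d. signals with ceiling-of-log lengths (a sketch: the distribution is made up, and the brute-force enumeration is only feasible for tiny examples):

import math
from itertools import product

q = {"a": 0.5, "b": 0.3, "c": 0.2}          # hypothetical source
H = -sum(p * math.log2(p) for p in q.values())

def per_signal_cost(B):
    """Expected symbols per signal when each block of B signals gets a
    codeword of length ceil(log2(1 / Q(block)))."""
    total = 0.0
    for block in product(q, repeat=B):
        p_block = math.prod(q[x] for x in block)
        total += p_block * math.ceil(-math.log2(p_block))
    return total / B

for B in (1, 2, 4, 8):
    # Always between H and H + 1/B, so it approaches H ≈ 1.485 as B grows.
    print(B, per_signal_cost(B))
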
The entropy of a distribution measures how many symbols an observer needs, per signal and in expectation, to store signals from that distribution. This relates to the KL divergence: if an observer believes the source has a different distribution than it actually does, the observer can use that believed distribution to construct an encoding and use it to write down the signals.
The Kraft-McMillan theorem has a constructive proof which tells how to construct an encoding with lengths ⌈log(1/p(x))⌉. As the length of blocks goes to infinity, the gap between this ceiling and the actual logarithm becomes negligible, so one can think of using an encoding with lengths -log p(x). The observer believes this encoding is optimal, but a lower per-signal symbol cost can be achieved by actually knowing Q: since Q differs from p, some x's are more or less likely in reality than under p, so the assigned lengths are slightly off.
The KL divergence is the gap between the expected number of symbols per signal when using an encoding based on an estimated probability distribution, and the optimal number that could be achieved if the true distribution were known. It is non-negative, as it is not possible to do better than knowing the truth. It can be used to compare two probability distributions and measure how different they are.
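A minimal numeric illustration of this reading of the KL divergence (hypothetical distributions; code lengths are treated as -log2 p(x), ignoring integer rounding, as in the block limit):

import math

q = [0.5, 0.3, 0.2]     # true distribution of the source (hypothetical)
p = [0.25, 0.25, 0.5]   # distribution the observer believes (hypothetical)

cost_with_p = -sum(qi * math.log2(pi) for qi, pi in zip(q, p))  # what you actually pay
cost_with_q = -sum(qi * math.log2(qi) for qi in q)              # the entropy: the optimum
kl = sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p))       # KL(q || p)

print(cost_with_p - cost_with_q)   # expected extra symbols per signal
print(kl)                          # the same number, by definition
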

Raw Transcript


this is going to tackle the most basic points which is the motivation for the concept of the entropy of a probability distribution so we're going to justify the entropy in terms of the expected number of symbols it takes to record a sample from a probability distribution that the technical content will be summarized as Shannon's Source coding theorem so that's where we're working towards and then as a result of understanding that we'll be able to see what the meaning of the KL divergences at least from the point of view of information Theory there are other ways to motivate it but today I want to focus on the motivation which comes from this kind of point of view of signals and information and and so on okay so the setup is the following so we're going to consider an observer receiving or measuring signals or samples from some set X and for Simplicity I'm going to assume X is finite I'm not going to say it's completely trivial to justify in the way I'm about to the infinite case because actually I don't know completely how to justify it you can extend the formal definition of the entropy trivially to the general case but the motivation seems a little trickier to me anyway today we're going to assume that X is finite uh the signals or samples are distributed According to some probability distribution queue which is assumed unknown as usual Okay so usually in this seminar we would then proceed to say okay the problem is to try and infer what that distribution is or what the process which generates It Is by formulating a model and then talking about Which models are better than which other models and inferring parameters and so on today we're going to go in a somewhat different direction so rather than modeling the true distribution we're going to start with a much more basic problem and the basic problem is well before we can model something we have to be able to write down the information that we've received hahaha so how do we store the information of these signals that we're receiving from this source okay so my cartoon is going to be the following so here's you sitting at a desk and you're lucky because out the window are a collection of uh beautiful colorful birds the reason I'm choosing color is that you have to actually make a choice of how to encode this right so here's your birdies sitting on the tree and you want to maybe talk about or think about well what is the most likely kind of color for a bird okay so these colors are some uh there's some distribution out there
in nature of the color of birds we're going to assume and You by the nature of your perception sort of have to take as a starting point the idea that maybe you can divide up these colors in some discrete way not necessarily finite and then you're going to write down what those observations are right so maybe you're going to decide to record red 0 1 and blue is one one and green is zero zero or something like that so you get the idea but these are choices right there's a choice of alphabet here in this case it's sort of implicitly zero and one uh there's a choice of which sequence of symbols in that alphabet to associate to which observation or which signal and then you can ask questions about well which encodings are better According to some criteria all right so I guess we could imagine a variety of different kinds of situations maybe in this as in this case what comes to us is not something that's already in a natural alphabet say a color although we could use English right we could write down a red blue and green so we already have a sort of we're sort of already predisposed to use a particular alphabet the the English alphabet and a particular encoding of these observations into that alphabet uh but maybe you could imagine a measurement or a signal that doesn't say a chunk of a wave file or a uh you know some sound that comes to you you know I shouldn't say a wave file it's already encoded I suppose a chunk of audio a bird's song you don't have some already defined alphabet and encoding of that to consider so that's one situation you don't already know an encoding but you could also consider a situation where you know an encoding but you might think that it's wasteful right so in this case if red blue and green are the only Colors that appear in nature well I'm using three symbols to write down red and four symbols to write down blue and five for green but the encoding I have on the first board is cheaper right it uses less symbols and the alphabet is smaller so in some sense you could imagine that that's a better encoding so what we're going to be investigating is the optimal encoding where the criteria is that the shorter is better hey Dan yeah he said this is kind of a step that you have to be able to do before modeling um is that just like you mean that colloquially or you mean this is like somehow different from modeling because it I've also seen people say that compression and coding is just models yeah uh I don't know how to draw a very sharp line between them so I'm not really
asserting that uh I'm not asserting really any connection between them except that it does seem to be clearly at least from a practical point of view a necessary initial step I mean you can't uh well maybe operationally it's an initial step I don't know that mathematically you have to worry about this before you start talking about models okay okay so what is the best encoding of signals where best is going to mean shortest why shortest well I'll come back to that in a minute but let's just set up some terms first so we're going to suppose given the alphabet um so which is a non-empty finite set and by Sigma star we're going to mean non-empty sequences in Sigma usually this means arbitrary sequences but for today I don't want to count the empty string I'm also going to assume that there are no no signals X for which the probability is either zero or one and sometimes it's convenient to allow the empty string for technical reasons if you want to encode events that have zero probability and Edge things like that but I'm just it's not that interesting I'm just going to skip over that kind of thing we mean non-empty sequences in Sigma and as usual if I if I have a sequence its length is going to be denoted by These Bars all right so definition what do I mean by an encoding so an encoding is an assignment of a sequence of symbols to any signal in a way that assigns distinct secret sequences to distinct symbols right now it's a reasonable hypothesis that the use of resources uh in storing the information in a signal is proportional to the length and that gets into a bit tricky territory but you could say sort of at the limit you need to at least erase a certain number of squares or positions in order to store a symbol in the alphabet and that'll be proportional to the number of symbols you're storing so this is a sort of reasonable hypothesis and then by best I'm going to mean the fewest use of resources and therefore since that's proportional to the length what we want to do is take the shortest but for what I mean for all X or do some X are they more important to shorten than others so that's the next point I'm going to make suppose it should be pretty obvious that not all X's are created equal if we are concerned with receiving multiple signals from our source and then encoding them and writing them down it's much more important to get a short encoding for more common sequences because those will save us more resources over many samples okay so let me just make that point more
precisely so the resource cost of storing and samples IE signals received is by my hypothesis proportional 2. that obviously okay which I'll just rewrite in a pretty obvious way okay let's just sum of X PX times MX where BX is the empirical probability according to my samples and by this I mean the chronica Delta right so one of x equals x i zero otherwise so I've got my n samples and this is how much how long the stored information is okay so then yeah I can't really hear you sorry all right can you hear me now yep some kind of expected cost uh well right now I'm writing down the cost for this particular sample all right but yes that's right I mean when I uh it was that way but how come there was a probability that got introduced right I said empirical probability oh empirical probability just means the proportion of the sound that's right sure okay uh I see a question in the chat was that what you asked me earlier storing information and modeling information yeah yeah yeah I guess that's an interesting question we can come back to okay uh so for enlarge this converges to just the expected value of this random variable which is the the length of the encoding okay so if we want to do a good job of storing samples or signals from this Source what we want to do is choose an encoding M which minimizes this expectation so we have an optimization problem which is to minimize this quantity over all possible encodings okay so we're going to solve this problem uh well we're going to figure out as best as it's possible to do what the value of U is sort of as a function of Q of x it'll take some time so we're going to first produce a lower bound and then an upper bound and the eventual statement will be in terms of blocks of messages in the limit as the block size goes to infinity and that's the Shannon Source coding theorem but before that I think it's worth just staring at this for a minute and thinking about how we would do this in the cases where this is easy okay so what's an example of a case where it's easy to find a optimal encoding and encoding that minimizes this quantity well if there are if there are fewer possibilities then there are things in our alphabet we just get to assign I mean obviously I want to take the lengths of the encodings as short as possible right and the shortest possible thing I can give is length one that is I just pick a symbol actually in Sigma for every event so you know so if x is red green and blue birds and I literally have three distinct symbols in my
alphabet then I can just get away with M always being of length one and that's as small as I can make it okay so then it's easy uh what if there are what if there's one more thing one more possible signal than there are things in my alphabet so I have to pick one of them to be encoded by a length two string rather than a length one string well it's pretty clear that I can make that optimal by choosing the thing that gets the length to string to be among the least likely things right so that then when I look at this sum in the case where this is a 2 this is smallest does that make sense okay yeah um then you might go one case beyond that and think about well what if there are two things I have to give length two two well you should pick them to be you know the smallest probability of things but you can sort of see if you start thinking about it it's quickly going to become a bit of a difficult sort of trade-off right if you've got to choose between multiple things and um you know which ones to bump up to longer length and so on okay so this is this is not a not a trivial problem I wouldn't say it's very difficult um you know we'll be done with the proof by the end of this seminar but it's not obvious how to come up with an optimal encoding and the most non-trivial step is to reformulate the sort of parameterization here to figure out what what is a possible what is the what are the possible length functions for encodings so that's my next term it's called The Craft MacMillan theorem well okay so we're going to construct lower bounds for you lower and upper bounds for this quantity U H okay so the first observation is that if you have an encoding M which produces the same lengths as an encoding M Prime for all signals uh yeah I guess I wanted to introduce some notation it's a bit of a Pity it's on the other board set but um I'm going to call this quantity here um sorry Dan you said if this size of X is more than size of Sigma and then we didn't write anything oh I didn't write anything yeah then any m with MX always equal to 1. gives a minimal encoding I.E um is minimal yeah thanks okay I think I'll reproduce the um over here hmm so outside um means sum over x and x Q X MX all right so if an encoding assigns the same lengths maybe different sequences of symbols but they all have the same length for each input then u m is equal to u m Prime as is obvious from the formula hence this quantity U which we're interested in is the minimum you can think about it as the minimum
over length functions right so from any encoding you can derive a length function I'm writing it as just Bar M here that's clear so LX stands for this any encoding determines a length function and well the the quantity here that we're minimizing is only depends on the length function so this this is the same all right well uh the reason this is useful is we can characterize which functions taking values in positive integers are length functions of some encoding so that question is answered by the following theorem I won't have time to present it actually I think it would be a good idea for uh somebody here to to prepare a talk on this because it it's actually not in my opinion it's not possible to really understand the meaning of the KL Divergence until you understand what I'm explaining today plus how this procedure of constructing and encoding from a length function works that's at least that's how I how I tend to think about it um okay so given a function there exists an encoding for which it is the length function if and only if alcohol if and only if the sum of X and X yeah the proof isn't difficult it's a bit too long to give in this talk but I think it's instructive to see okay so you're given a function L it comes from an encoding if and only if this inequality holds so let's just think about this as a kind of Sanity check for a second uh if you make l x larger this is going to get smaller red so what's the constraint here well uh I suppose I mean it should be easy to for L to be the length function of an encoding if L assigns very large numbers to things right if LSI is very large numbers for things then you can clearly fit I mean relative to the size of X you should be able to fit X into that set of symbols of things with that length and that that tracks with the formula right if LX is very large uh relative to the size of X and somehow the left hand side is going to be easily less than one the constraint becomes tricky if you try and make the L axis small making them small makes these some ends on the left large and somehow you have to fit them in to one so that sort of makes sense and okay I'm going to suppose in all the rest I'm not saying this is necessary in this theorem but I'm just I'm remembering it now so I'm going to suppose that the probabilities of all these signals are in zero one otherwise in some of the following I'll have to make some special cases okay that basically comes by uh you can prove this by thinking about encoding things in terms
of trees okay so to come back to our original calculation so U was the minimum over these sums where L satisfies this constraint okay now I'm going to do some operations to the right hand side and it will turn into an inequality okay so what I'm going to do now is relax the class of functions so let X be real valued same constraint now the point here is that this is a bigger set than this so it's minimum must be smaller okay now I want to argue that I can replace this by an equality and what I'm going to do in a moment as you probably already guessed this to use a LaGrange multiplier argument you could probably do it already here with the more sophisticated versions that handle inequalities in this case it's easy enough to turn this into an egg quality and then use the simplest form of a LaGrange multiplier so we'll just do that um okay so first it's worth observing that this minimum is actually achieved uh let me think for a moment I'm not completely sure I actually need this oh okay since X is finite uh we can bound UL so this quantity is UL right that's a definition I guess I was writing u m before so we can but we can bound UL below by UL Prime for some L Prime that takes uh values in an interval the reason I can take it be non-negative you can just think about uh what would happen if it had a negative value and just shifting it up by a little bit and I can take it to be bounded above by a constant C that doesn't depend on l um yeah it's kind of easier to think about yourself than than for me to argue for it since X is finite and if you think about uh the size of Sigma you can convince yourself that if the values of L are bigger than you come up with a constant such that the the bigger than that constant you can always safely decrease them by some amount and maintain that inequality okay so that's not so important so we can replace we can place the minimum over all functions from X to R by a by a minimum which only considers functions L taking values in some bounded interval okay the point being that that's compact so the minimum is actually achieved by some L star I mean I guess I could have just I could have done this part when I made this first step right rather than going to R first so since this is compact the minimum is achieved by some L star now suppose I want to claim that this L star actually makes this constraint inequality so suppose for a contradiction that am I riding uh that l-star makes this a strict inequality okay you can see that all the L stars
must be positive again maybe I'll leave you to think about that so uh that means that I can pick one and decrease it by some amount sufficiently small uh and decreasing it will make that term larger in the sum but because I decreased it by a sufficiently small amount I'll still be strictly less than one because I'm bounded away from one but that then contradicts minimality foreign okay that means that this minimal guy actually makes this inequality so we can conclude that sorry sorry why can't you decrease that something yeah so if it's positive I can decrease it and still keep it away from zero and the point is that uh when I come here so each of the terms looks like looks like this right so if I decrease that exponent uh I'm going to make the denominator there smaller which makes the fraction larger which could in principle break this this strict inequality right I mean if it was less than or equals to it it might but since it's a strict inequality there's always some Gap there and if I give myself an Epsilon I can choose the delta in the size of the decrease here to keep it within that Epsilon so that it's still less than one does that make sense okay thanks okay uh that means that in the previous line where I minimized over things satisfying the inequality I can now minimize over things that satisfy the equality and I could put my compact interval here again but I don't care anymore okay so that's the first there's sort of two steps that have happened here the first was to apply craft MacMillan and the second was to switch from the inequality to inequality I guess you could say we also relaxed it to consider functions L that don't return integers and all of these are part of I mean at least this relaxation to not allow integers is part of the reason why it's a little difficult to think about uh the the content of the entropy um but we'll come back to that okay well now we've got something we can formulate as a LaGrange multiplier Okay so we've got a optimization problem subject to a constraint I'm going to simplify things a little bit by setting up some notation but I think I'm out of boards here so we're going to move over to the second we'll move over here and continue uh yeah that's right Matt that's the um it's a tree with branching Factor Sigma exactly okay so maybe I'll just repeat this foreign okay so we're going to set s of x to be the thing in the constraint so we're basically going to change variables from L of x to S of x okay so L of X is minus log s of x
in which case I'm continuing from here that is equal to the minimum of just substituting this into here minimum of minus some of our x q of X log Sigma s of x subject to this so that's Max so it's it's not clear we should think about the current optimization problems have anything to do with an encoding right because these L's may not be integers they don't necessarily come from an encoding anymore this constraint that we've set up is an analog of what's in the craft MacMillan theorem in order for this to be a lower bound on the thing we care about but we're not assigning any particular interpretation to to the optimal function ill in the current situation Okay so so what do we do we're going to introduce LaGrange multiplier Lambda and set L as a function of s and Lambda to be the thing we're trying to minimize maximize rather plus Lambda times the constraint so we have as many variables plus one as there are X as in X right so for each X we have a variable s x here and those derivatives are QX on SX log Sigma plus Lambda and here okay so where all those parts derivatives zero well we're gonna look at the first equation first [Music] um well I guess the second one is easier those will have to add to one don't think about them as probabilities right because um well I guess it is appropriate to think about them as probabilities isn't it there's a there's a non-negative the first equation gives uh what rearranging for ASX here gives minus QX on Lambda log okay substituting one sorry two into one gives that the sum over x in x of minus QX Lambda log Sigma has to be equal to one therefore minus sum of x in x q X is equal to Lambda log Sigma I guess we better know Lambda is positive before we do that um sorry I mean Lambda is non-zero um that's yeah but that's true okay um but this is one right so therefore Lambda is equal to um minus one on log Sigma which are now substituting back in here gives us SX is equal to QX this is what we're looking for so this shows that there's a unique solution a unique critical point of L I mean then technically speaking you have to check that that actually is uh the global minimum of the original optimization problem but you can do that so this shows that the the original optimization problem on the first board here has the solution s where s x is equal to QX and in the original coordinates so l l must be equal to minus log Sigma QX okay so now we've basically have our lower bound all right so U is greater than or equal to well the minimum
that I've formulated as a maximum which is by what I just performed equal to the following quantity the sum of X in xqx times LX LX is minus log Sigma QX so this is minus I'm over x and x q X log Sigma QX okay so this quantity which is appearing here for the first time it's called the entropy HQ of Q usually you'd take Sigma to uh be in some other base doesn't matter of course up to a multiplicative constant so we'll just stick with this okay so it's with respect to Sigma I guess it's not clear I took the minus sign in there I intend to have the minus sign okay so just to recap so U was the the minimum of all encodings of signals from our source into Strings in the alphabet Sigma of the expected length of those encodings and we've shown that that's bounded below by the entropy next we've got to do an upper bound and then kind of talk about how to how to somehow get that up about and lower bound to approach one another but are there any questions about this first part note that although you was talking about encodings right and therefore the the things that were the expected value was of lengths right so sum over QX sum over x q X and some length uh here the the thing that it looks a bit like this but the length now is I mean we could change this to one over Q X maybe that makes it even less makes it less unlike a length I mean U was the turned in things that was something like sum over x q X MX right so it's sort of clear we're supposed to think of this quantity as being somehow related to this uh foreign but well there's no reason for that logarithm to be an integer all right Dan have you heard about arithmetic coding uh nope okay we studied this in second year algorithms this is the technique by which you can um you can you can use discrete symbols and achieve an amortized code length that's not an integer for for particularly yeah that's right that's what I'm about to explain okay well I won't explain it I'll kind of leave it it's implicit in the way we're going to deal with the limit as we consider uh sequences of signals I guess that's what you mean by amortized yeah like that cool yeah I didn't know the name thanks it's a neat idea kind of um all right for an upper bound okay so just to recall U was the minimum well it's kind of easy to construct an upper bound right I just have to find something in the right hand side okay well given what I just discussed uh there's a kind of natural thing to try right maybe that logarithm is not an integer but if I take the
closest larger integer then maybe that will work so that's what we're going to do so I'm going to set LX to B and here now I'm back to this situation right this was the original thing where the L took values and integers and I'm going to take the logarithm and take the roof and that's going to be a non-negative integer by the the way I've set things up then well by definition this is greater than or equal to that logarithm okay so that's an integer valued function which satisfies the hypotheses of The Craft MacMillan theorem [Music] and therefore we get our upper bound hence you is less than or equal to well I guess maybe I'll do it on the next board okay so you for this particular L right because this L is in the set and U is the minimum of these quantities over all such L's in the set but LX was the roof and the roof is less than or equal to the quantity plus one so I'm just expanding that and this gives us a sum over Q X which is just one okay so that's HQ plus one okay so what we've proven so far is that with the above notation the entropy of Q is less than or equal to this minimum expected length of the encoding of a signal and it's bounded above by the entropy plus one okay so you can start to see what the entropy is trying to tell you we're not quite finished because well uh there's just a there's this gap of one right it's not really satisfactory but uh it's pretty obvious that if you instead talk about the uh expected length of encoding a sequence of signals from the same source that roughly speaking you're going to [Music] um get this one into a one over B where the B is the length of the sequence so that's what I'll now explain yeah any questions up to here comments okay so this is pretty straightforward so I'll be a little quick uh the block a block of signals is just a sequence of IID samples from Q I guess there's one observation which is that when you minimize the length of a stored um well I'll say it precisely okay so given an encoding I'm going to write uh this to just mean I guess it's a little bit of a tricky Point here to do with prefix free things right um so I get a sequence of signals from The Source I can encode each of them and then just write them down and that will have length the sum of the lengths as long as I don't need any punctuation red if there's punctuation involved like a comma or a space I I it seems like I might have to add a B minus one or something here uh I guess you could probably do that and not change the conclusion I'm about to
make I didn't check but in the limit it probably goes away um but you can also avoid it and probably this is more standard by by assuming that m is a prefix free code so that you don't need spaces or punctuation symbols when you write down a sequence of encodings but maybe that's a topic for another time all right so you then look at the expected length of writing down a sequence of IID samples from The Source that's this quantity and an easy calculation will convince you that that's B times the expected number of symbols required to write down a single sample right um was some of xqx MX okay so hence the point here is that the the minimization over encoding blocks is just related to the the minimization of encoding single signals right so let me call it u b that's the minimum of all encodings I guess I didn't think about the possibility that you could have an encoding for blocks that somehow worked better than using the same encoding for each sample but since their IID I guess you could argue that's not possible right so this would be the best you can do uh for encoding blocks but by what we just said I mean this quantity here is is um underline right but what we just said this is um B min over m u m which is b u okay so now we're ready for the source coding theorem so colloquially it says the minimum number of symbols per signal and in expectation to encode a block B signals from QX that's a function of B converges to HQ as B goes to Infinity proof okay what I mean is 1 over b u b converges to HQ B is on the previous board well but u b is BU and I had these upper and lower bounds for you so just multiplying them by B gives me this and therefore HQ uh sorry there's a one there wait no no I've done something wrong here doesn't the one of the pages go away yeah a bit confused actually um so I had this right a UBS bu h oh hmm I mean I mean I can't just multiply through by B this doesn't help me right gives me that uh what what if you can you put u b equals b u into the limit equation that you have just a second yeah um yeah what was the suggestion sorry can you put the [Music] um can you put u b equals b u into this if you put if you replace that with b u then you end up with u converges to HQ but U is just so you get U equals yeah so I must have screwed something up here somewhere um I think I mean I can just apply everything I said before to get to get this but with a one here um yeah I thought I didn't need to do that uh okay it's a bit unfortunate that I'll
have to just sketch this argument now I don't I think I think from what I've got on the boards here it's not actually okay so I'll explain how I can get to this formula directly um I thought I thought I did a cleverer thing but appears not okay so that as that's what I want is this you'll agree that if I can prove that that I divide through by B and I take the limit and I'm done right okay so why is that well um look at this I can just solve the optimization problem again but for this you see okay so look at this optimization problem um the sum on the right hand side unfortunately it's it's in here uh but there's some on the right hand side basically you can treat the components separately and so um you can set up this optimization problem just for each signal separately basically and then solve it as I did before and you'll just end up with bhq and for the same reasons as before you'll have a a one there um yeah so that's sorry I have to finish in this kind of soft way but um that does work and so then you end up with this formula here I think there must should be a way of deriving it from what I said but I don't really want to mess around with it too much right now okay so therefore sorry this makes sense yeah that was how I originally did it and then I switched it to this but okay so that's quite satisfactory right so this quantity this quantity the entropy that depends on the true distribution tells you the um that you can't do better than storing this many symbols per signal in expectation um when you consider blocks and the block length goes to Infinity I guess I'm a bit over time already but it's really only a few more minutes to explain the kale Divergence from here so um I'll do that over here okay so that tells you the meaning the information theoretic meaning of the entropy of a distribution it's about an observer receiving signals or samples from the distribution and trying to write them down and how many symbols do they need to do that so here's how that relates to the KL Divergence okay suppose you the Observer this is going to be a bit sketchy because it's it's not completely straightforward to Think Through completely but suppose The Observer believes the source has distribution PX rather than QX okay so what do I mean what does it mean to believe the source has distribution p x well if you think you know the distribution of the signals you can take what we just discussed and kind of operationalize it and turn it into a means of constructing and encoding
right so that was that's sort of in the proof of The Craft MacMillan theorem right the craft Millen theorem if we go if you think back through what we just discussed uh to prove the upper bound I said okay take as the lengths these logarithms of one over Q X and take the roof that's an integer that satisfies some constraints and the craft MacMillan theorem it's constructive it tells you how to construct an encoding which has those as its lengths okay so to believe the source has distribution PX is to then to take the roof of log p x get an encoding that has those integers and use it when somebody gives you a sample from the signal use that encoding to write it down now that's a roof and not really the logarithm itself but that's sort of relevant to what Matt was saying as the length of the blocks goes to Infinity you can sort of do better and better with that roof and it it gets closer to the actual logarithm so roughly speaking you can think about actually using an encoding which which has the length which is minus the log of p x all right so that's the part that I'm where there's a bit more thought really required to convince yourself but I'll just say for today that employing the craft MacMillan theorem what would you call it Matt arithmetic encoding that's right coding okay employing the craft MacMillan theorem uh construct an encoding and I'll just say with in scale quotes because this need not be an integer right so it's not um okay but that's the idea so this isn't a proof of The Craft MacMillan theorem uh the proof will actually construct an encoding for you given the integers LX yeah okay well that's why you're saying we should read the proof yeah I think it's interesting yeah oh okay now putting aside the difference between you know this and an integer and so on and just sort of thinking in the limit as B goes to Infinity um they think this is optimal right so they think all that we've done guarantees them that this is the best way to write down signals any encoding that's that is produced by this procedure okay so they think this is optimal but in fact achieve aloha sort of simple rate you can achieve a lower per signal symbol cost or just like the number of symbols in expectation um by actually knowing Q so roughly speaking uh you know you have p and but that's different to Q so there's there's some x's for which uh you think they're likely and Q actually I mean and in reality they're a little bit less likely or a little bit more likely which means that you're a bit off
in how you assign your encoding at least and you can you can see how this this probably shouldn't show up necessarily for Sim single signals right if there's a very small gap between the probabilities you'll only see it if you take sequences okay yeah so you can do better if you use the actual true distribution okay so there's three quantities that were sort of implicitly talking about here sum over xqx log 1 over PX all the logarithms of Base okay so this is what you actually get I mean remember we're sort of thinking about this as length of encoding right modular the caveats are discussed so this is the uh number of the expected number of symbols per signal that you actually get so you you're getting a signal you write it down you're getting a signal you write it down and what you will calculate is your empirical estimate of how many symbols per signal you're writing down will be an approximation of this with the Q There now what you expected to get was this right when you when you bought PX off the Shady dealer in the probability Market uh this is what you thought you were promised and then a third quantity is well what you could do if you really knew the true distribution all right and the kale Divergence is the difference of two of these quantities and that's its meaning so recall the KL Divergence is by definition following some is the reason I expected to get the second one because I thought that's right yeah you thought P was the truth yeah that's right okay so that is the first term here was what you got minus what is optimal and by what I mean the number of symbols per signal or really the number of symbols per like signal in a block as the block length goes to Infinity um minus what you could do if you actually knew the true distribution right so I.E the KL Divergence is the gap I mean this is positive right because you can't do better than the knowing the truth you can always in expectation shorten the number of symbols you use to record signals if you know the truth compared to thinking it's some other distribution so it's the expected number you will save expected number of symbols per signal you save as in you don't have to write down by switching from your encoding based on P to an encoding based on Q and from this definition you can sort of see how q and P play asymmetric roles right okay so that's a that's the meaning of the KL Divergence from an information theoretic point of view it's the expected number of symbols per signal you save by using an encoding based on