Learn how to calculate the probability of a sequence of heads and tails given a fixed data set. Explore two models, a binomial expression and a two-coin model, and see how Maximum Likelihood Estimation finds the parameter most likely to have produced the observed data. Understand that the frequency of heads is a random variable, so different data sets yield different estimates, and find out how to predict the accuracy of the procedure before observing any data.

The speaker presents a method to calculate the probability of a sequence of heads and tails given a fixed data set. Two models are introduced, a binomial expression and a more complex model involving two coins, along with Maximum Likelihood Estimation, which finds the parameter in a model most likely to have produced the observed data. The speaker also explains that the frequency of heads is a random variable, so a different data set gives a different result, and shows how to know whether the procedure will work well before going into the room and observing the data set.

The transcript introduces the statistical procedures and assumptions taught in a basic statistics course. It focuses on estimating the probability of a coin toss coming up heads and introduces the concepts of a parameter, a likelihood function, and the maximum likelihood method. It also explains that the inference method used to estimate the probability itself produces random results, and that quantifying this randomness is important for understanding the accuracy of the method.

The speaker describes a situation where someone is tossing a coin and sending out a set of observations, then asks how much one would be willing to bet that the probability of coming up heads is one half. A "stupid statistical method" is proposed: ignore the data and simply guess that the probability of heads is one half, which only works for a fair coin. The speaker then moves on to a more sophisticated method of estimating the probability.

The transcript describes how to calculate the probability of a sequence of heads and tails given a fixed data set. The parameter is the probability w of landing heads, which lies in the parameter space [0, 1]. The likelihood of a sequence is obtained by multiplying a factor of w for each head and (1 − w) for each tail, giving w^h (1 − w)^t, where the number of tails t equals the sequence length n minus the number of heads h.
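The per-sequence likelihood described above can be sketched in a few lines of Python; the function name and the example sequence are illustrative, not from the lecture:

```python
# Likelihood of an exact ordered sequence of tosses, assuming independent
# Bernoulli trials with P(heads) = w. A minimal sketch.

def sequence_likelihood(sequence: str, w: float) -> float:
    """P(this exact 'H'/'T' sequence | heads probability w) = w^h * (1-w)^t."""
    h = sequence.count("H")
    t = len(sequence) - h  # t = n - h
    return w**h * (1 - w)**t

# For a fair coin, every length-5 sequence has the same probability (1/2)^5.
print(sequence_likelihood("HTHTT", 0.5))  # 0.03125
```

Note that the ordered-sequence likelihood has no n-choose-h factor; that factor appears only when the data are treated as an unordered set.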

The transcript discusses two possible models for estimating the likelihood of observing a set of data. The first model is a binomial expression and the second is a more complex model involving two coins. The second model allows for degeneracy if the probability of landing heads is the same for both coins. The speaker then dives into the details of the first model and how it is usually used for statistical estimation.
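A minimal sketch of the two-coin model, assuming the mixing probability `t` (chance of drawing the fair coin) and bias `w` described above; the function names are hypothetical:

```python
from math import comb

# Two-coin mixture: with probability t draw the fair coin (heads prob 1/2),
# with probability 1 - t draw the biased coin (heads prob w).

def heads_probability(t: float, w: float) -> float:
    """Marginal probability of heads under the two-coin mixture."""
    return t * 0.5 + (1 - t) * w

def mixture_likelihood(h: int, n: int, t: float, w: float) -> float:
    """Likelihood of h heads in n unordered tosses under the mixture."""
    p = heads_probability(t, w)
    return comb(n, h) * p**h * (1 - p)**(n - h)

# Degeneracy: at w = 1/2 both coins are fair, so t drops out entirely
# and the model collapses back to the single-coin model.
p1 = mixture_likelihood(3, 5, 0.1, 0.5)
p2 = mixture_likelihood(3, 5, 0.9, 0.5)
print(abs(p1 - p2) < 1e-12)  # True
```

This degeneracy is exactly why the speaker sets the two-coin model aside: at w = 1/2 many different parameter values describe the same distribution.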

Maximum Likelihood Estimation is a method for finding the parameter in a model that is most likely to have produced the observed data. The likelihood function is treated like an objective function, and the log of the likelihood is maximized for numerical stability. An empirical KL divergence is related to the log-likelihood by a term that does not depend on the parameter (the entropy of the true distribution when the truth is at w0), so maximizing the log-likelihood is the same as minimizing the empirical KL divergence, whose expected value is the KL divergence, usually denoted K.
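A small numerical sketch of that equivalence; the grid resolution and the constant standing in for the entropy term `S_n` are assumptions for illustration:

```python
from math import log

# Maximizing the log-likelihood is the same as minimizing the empirical
# objective S_n - (1/n) * log-likelihood, because S_n does not depend on w.

h, n = 7, 10                               # observed heads / total tosses
grid = [i / 1000 for i in range(1, 1000)]  # w in {0.001, ..., 0.999}

def log_lik(w: float) -> float:
    return h * log(w) + (n - h) * log(1 - w)

S_n = 0.6931                               # any w-independent constant works

def K_n(w: float) -> float:
    return S_n - log_lik(w) / n            # empirical "training error"

best_by_lik = max(grid, key=log_lik)
best_by_K = min(grid, key=K_n)
print(best_by_lik, best_by_K)  # both 0.7: same optimizer either way
```

Because K_n differs from the (normalized, negated) log-likelihood only by a constant, the two objectives always share the same optimizer.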

The transcript derives the maximum likelihood estimate by setting the gradient of the log-likelihood to zero. The best guess of the probability of heads equals the number of heads divided by the total number of throws, i.e. the frequency of heads. This quantity is itself a random variable: a different data set gives a different estimate. Finally, it explains how to tell whether the procedure will work well before going into the room and observing the data set.
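The closed-form answer can be checked against brute-force maximization of the log-likelihood; the grid resolution is an assumed detail:

```python
from math import log

# The MLE for the Bernoulli model has the closed form w_hat = h / n.
# Sketch: verify it against a grid search over the log-likelihood.

def log_likelihood(w: float, h: int, n: int) -> float:
    return h * log(w) + (n - h) * log(1 - w)

def mle_closed_form(h: int, n: int) -> float:
    return h / n

def mle_grid(h: int, n: int, steps: int = 10000) -> float:
    grid = [i / steps for i in range(1, steps)]  # avoid w = 0 and w = 1
    return max(grid, key=lambda w: log_likelihood(w, h, n))

h, n = 3, 8
print(mle_closed_form(h, n), mle_grid(h, n))  # 0.375 0.375
```

The grid deliberately excludes the endpoints 0 and 1, which is where the closed form runs into trouble when h = 0 or h = n.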

The role of statistical learning theory is to quantify the uncertainty of any inference procedure. In the example, two tosses both come up heads, so the estimated probability of heads is one, but confidence in that estimate is low. To quantify performance, a normal approximation to the binomial distribution of the estimator is used: the histogram of the estimate is centered at the truth, which makes it an unbiased estimator, meaning that on average the estimate equals the truth.
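A simulation sketch of that claim: repeat the whole procedure on freshly drawn data sets and check that the estimates center on the truth. The truth `w0`, sample size, trial count, and seed below are illustrative choices:

```python
import random

# Simulate the inference procedure repeatedly: each run draws a fresh
# data set from a coin with true heads probability w0 and computes
# w_hat = h / n. The collection of w_hat values is centered at w0.

random.seed(0)
w0, n, trials = 0.3, 100, 5000

estimates = []
for _ in range(trials):
    h = sum(random.random() < w0 for _ in range(n))
    estimates.append(h / n)

mean_est = sum(estimates) / trials
print(mean_est)  # ~0.3: on average, the estimate is the truth (unbiasedness)
```

Plotting a histogram of `estimates` would reproduce the bell-shaped picture from the talk, centered at w0.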

The variance of the normal approximation follows from the binomial distribution: w0(1 − w0)/n. However, the approximation only works for large sample sizes and can be badly inaccurate when the sample is small or the truth is near 0 or 1. Asymptotic normality is a general phenomenon that extends to models of arbitrary dimension and to other inference procedures via the Fisher information matrix. It can be seen by plotting the log of the likelihood function between 0 and 1: the curvature of the profile determines the accuracy of the estimates.
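The variance formula and its small-sample caveat can be checked by simulation; all parameters here are illustrative:

```python
import random

# Compare the empirical variance of w_hat with the normal approximation's
# variance w0*(1 - w0)/n, and note the breakdown near the boundary.

random.seed(1)

def empirical_variance(w0: float, n: int, trials: int = 20000) -> float:
    ests = []
    for _ in range(trials):
        h = sum(random.random() < w0 for _ in range(n))
        ests.append(h / n)
    m = sum(ests) / trials
    return sum((e - m) ** 2 for e in ests) / trials

w0, n = 0.5, 50
theory = w0 * (1 - w0) / n            # 0.005
print(abs(empirical_variance(w0, n) - theory) < 0.001)  # True: close match

# Breakdown: with w0 = 0.05 and n = 10, sigma = sqrt(w0*(1-w0)/n) ~ 0.069,
# so the normal N(w0, sigma^2) puts roughly a quarter of its mass below
# zero, while the true estimator w_hat = h/n can never be negative.
```

The comment at the end is the transcript's point about the two problematic points 0 and 1: near the boundary, much more data is needed before the normal approximation concentrates correctly.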

Okay, welcome. On the first board here is a rough plan for this session. I haven't actually timed myself, so let's see how far we get. The main goal is to introduce, or at least re-familiarize, to bring some long-lost memory back into cache, the statistical procedures taught in a Statistics 101 course, but also to point out along the way some assumptions that were never clarified in those courses. So, true to a Statistics 101 course, we are going to do a coin flip.

This will be our task: to estimate the probability of a coin toss coming up heads. Because this is a Bernoulli random variable, we only need to specify the probability of a head, and, just so that everyone starts getting used to the notation we'll use throughout this seminar, I'm going to call that probability w, because it is going to be a parameter; that is the thing we want to estimate. Along the way we shall introduce the possible ways to do such statistical inference: we'll introduce models, parameters, and something called likelihood functions, and we are going to use, just for today, something called the maximum likelihood method. In this case we don't actually have degeneracy (depending on the model we construct), and in such a situation the maximum likelihood method is in some sense the best, in a specific sense we'll talk about. A bit of a preview: whatever inference method we use to predict the probability, that inference method itself produces random results, because the inference depends on a set of data points, previous observations of coin tosses, and therefore the result is itself random. Quantifying this randomness is important, because the quantification tells us how good our inference method is on average, and that will hopefully lead us to some important concepts that we will define later.

Any questions before we move to the next board? No questions, so let's start. (Some shuffling while the boards and the recording get sorted out.) By the way, the supplementary sessions are supposed to be more informal, so please interrupt me.

Okay, so the situation is this: imagine an actual person on the other side of the room tossing a coin, tails, heads, and sending out, say by mailbox, a set of n observations, maybe heads, heads, heads, tails, something like that; there are n of those. We call that set of observations a data set, and I'm going to use the notation D with subscript n for a data set of size n. Each observation is a random variable, in this case a Bernoulli random variable, each one either coming up heads or coming up tails. Our task is to come up with an estimate; I'm going to use a hat for anything that is an estimate. Basically the person is asking: if I flip this coin again, what is your estimate of the probability of heads? How much would you be willing to bet that it is a half? That bet is a function of your probability estimate, so this is a problem with some money on the line.

We can start with a rather stupid method, a stupid statistical method; let's call it Method Zero: ignore all the previous data and just guess that the probability of coming up heads is one half. This only works for a fair coin; there is only one kind of coin in the world for which this method works.

If the person is nefarious and uses a loaded coin or something like that, this method will never work. A more common method, the one taught in Statistics 101, is to set up a model. We can argue like this: assume that the probability of landing heads is equal to some number w, and because it is a probability, w lies in the interval [0, 1]. This is already an important concept for us: w is called a parameter, and [0, 1] is the parameter space, which we will frequently label with a capital W; it is a very simple parameter space. Under this assumption, we can ask: what is the probability of observing this data set? We are thinking of D_n as a data set fixed by that person; we only ever get this one data set at the end. So this is a function of w, and it is called the likelihood function: the probability of observing D_n given w, that is, assuming the probability of heads is w.

Rather than move to another board, let me just erase the pictures and first give the form of this likelihood. I hope you can agree that the likelihood of a particular sequence, say head, tail, head, tail, tail, is as follows: the likelihood of landing heads the first time is w, then tails is 1 − w, then w, 1 − w, 1 − w. If we assume that each toss is an independent trial, and the person doesn't get tired after a while or anything like that, so that the conditions of every trial are the same, then the probability of this particular sequence is all of those factors multiplied together. That means the probability of a particular sequence is given by multiplying h copies of w and t copies of (1 − w), that is, w^h (1 − w)^t, where h is the number of heads and t is the number of tails. Because the sequence has length n, t is just equal to n − h.

Just for later convenience, so that I don't need to sum over all possible sequences, I'm going to write the likelihood of observing the data as an unordered set, where the order doesn't matter. That puts an n-choose-h factor out front, and we get the binomial distribution: the likelihood is C(n, h) w^h (1 − w)^t. So the likelihood is just the binomial expression. Let's move to the next board.

Before I go into the details of estimation in that model, which I'll call Method One, I'm going to propose another model, Method Two. This is a more complex model, and it may answer some of the questions from just now about degeneracy and singularity. Suppose we don't believe the person is being fair to us: even though he gave us a sequence of heads and tails, he might be using multiple different coins. Say you guess that he is using two coins: he has a box, and inside the box there are two coins, one fair and one biased, and every time he throws a coin he actually reaches into the box and randomly picks one of them, choosing the fair coin with some probability t. In that case, given the mixing probability t and the bias w of the biased coin, the probability of landing heads is: with probability t he chooses the fair coin, in which case the probability of heads is one half, and with probability 1 − t he chooses the biased coin, in which case the probability of heads is w. The likelihood function then becomes a function of two variables, t and w. You might well use this model and observe some degeneracy: if for some reason w is one half, the model collapses back to the original model, because both coins are fair and it doesn't matter which one he takes out. This is an example of degeneracy, an example in which you have reason to believe the data is coming from a more complicated world than the one you first imagined. In the interest of time we won't look at this model for the moment; I'm just going to dive into Method One and tell the story of how people usually do statistical estimation in that situation.

The method I'm going to introduce is called maximum likelihood estimation. We have the likelihood function from just now; the idea is to treat it like an objective function. In what sense can we justify that? It takes a bit of mental gymnastics: you assume that the truth, the best parameter in your set of models, is the one most likely to produce the data set you observe, and you label the truth w_0. We don't actually know this; it is an assumption. The fact that w_0 exists, meaning the person really is using one coin with a fixed probability of landing heads, is the realizability assumption that Dan was talking about just now. With that assumption, the method says that the best estimate we can get, the one closest to the truth, is the one that maximizes the likelihood function. Due to time I'm going to skip the actual optimization, but one important point is that in practice we don't maximize the likelihood itself; we maximize the log of the likelihood, for numerical stability and because the calculus is easier to do, and since log is a monotonic function, maximizing one is the same as maximizing the other.

While I'm here, I'm going to introduce another quantity, very much related to the likelihood function, called the empirical KL divergence, or the empirical training error. This quantity is given by an entropy term minus the normalized log-likelihood (note the minus sign), where the entropy term is the entropy of the true distribution; we don't need to talk about what that is for now, it is just something that does not depend on w. So maximizing the log-likelihood is the same, because of the negative sign, as minimizing this quantity. And the expected value of this quantity, if the truth is at w_0, is something we call the KL divergence, usually denoted K.

This K is an integral with respect to dx; we will talk about it a lot more next week. But let's go ahead and maximize. If you try to find the argmax over w in [0, 1], you might differentiate the log-likelihood and find where the gradient vanishes. If you solve that equation, the estimate w-hat that makes the gradient vanish satisfies w-hat / (1 − w-hat) = h / t. People might recognize this as the odds, which is just another way of encoding probability: you say two-to-one instead of a number between 0 and 1. Here the odds equal the ratio of heads to tails, which implies that w-hat is equal to the number of heads divided by the number of data points. This is something everyone already knows intuitively: the best guess for the probability that the next coin toss turns out heads is just the number of heads divided by the total number of throws, the frequency of heads. Something to note, though: there is an issue with this solution when h = 0 or h = n, in which case you get w-hat equal to 0 or 1. We will see what problem that causes a little later on.

So we get this formula for the maximum likelihood estimate, and the procedure is: observe D_n, count the number of heads h, compute w-hat = h / n, and finally predict the probability of heads, the thing we actually want at the end, by evaluating the model at this estimate. This procedure turns up something that depends on how many heads there are in the data set. So this quantity is actually a random variable: it is a measurable function of the data set, and therefore itself random. To know how well this procedure will work, before you even go into the room and observe the data set, note that even though the person is using the same coin every time, a different data set will give you a different estimate.

Think about the case when the person is mean and gives you only two tosses, and those two tosses might both be heads, in which case, according to this procedure, you will estimate the probability of the next coin toss landing heads to be one, because h / n = 2 / 2 = 1. I don't think you would be very confident about that estimate. This is part of what it means to have a statistical learning theory: the role of statistical learning theory is to quantify the uncertainty of any inference procedure, so that when you use the procedure on a different data set, and for a possibly different truth, a different data-generating process, you know how well it will work.

For our case, without showing the calculation (I can post it later), I can tell you the following about the maximum likelihood procedure. If the truth is given by the assumption we made, the person throwing a single coin with a fixed probability w_0 of landing heads, then, remembering that w-hat is itself a random variable, the distribution of w-hat is given by: the probability of estimating h / n is the probability of getting h heads in n trials, which is again the binomial distribution, but this time with the true parameter w_0. One thing from elementary probability that we shall use without comment here is the very famous normal approximation to the binomial: if we have a really, really large data set and plot the histogram of w-hat, it will be well approximated by a normal distribution centered at w_0. The fact that it is centered at w_0 is another statistics concept we'll be using: this is an unbiased estimator, meaning that on average you actually expect your estimate, calculated like this, to be the truth.

The "on average" part of that expected value is important. The spread, given by the variance of the approximating normal distribution, is the variance of this distribution, and in this case we have an exact formula: using the binomial variance, it is w_0 (1 − w_0) / n. Notice something, though: this normal approximation only holds when n is really, really large. What happens when n is really small, like one or two? Then your normal approximation has a very large variance; sigma squared is on the order of one. Remember that the normal distribution does not confine itself to [0, 1], so this is actually a really terrible approximation, and it is even worse when the truth w_0 is very close to the two problematic points we talked about before, zero and one. For example, if w_0 is very close to zero, you need a lot of data before the normal distribution concentrates around that point, rather than putting mass below zero, which is nonsense, because your estimate should never be below zero.

The fact that we have a normal approximation to the distribution of our estimator here is a general phenomenon known as asymptotic normality. Because I have run out of time, I will just say that the generalization of this phenomenon to models of arbitrary dimension and to different inference procedures uses something called the Fisher information matrix, and I'll just mention the intuition here. Recall the method we were using just now, where we look at the log of the likelihood function. If we plot this function between zero and one, we could be unfortunate and have a very flat profile, or we could be fortunate and have a sharply peaked one. In the second case, nearby estimates of w, so if your data set is somehow a little bit different and causes you to estimate not