The following is a post that deals with some fairly basic probability theory, and addresses the question of whether it's fair to say the lottery is a 'tax on the stupid'. It's mostly aimed at anyone who hasn't done maths past GCSE, since the ideas won't be particularly surprising or interesting to anyone who's thought about probability beyond what they were forced to do at school.

-----------

I have often heard people refer to lotteries as a 'tax on the stupid', and on first glance it's hard to disagree. Even without knowing about the probabilities involved, like any gambling method the house always wins, and so as a player you're expected to lose.

We take as our motivating example the UK National Lottery, rebranded as 'Lotto' a few years ago. If you buy a Lotto ticket, what can you 'expect' to win? Before we calculate this, we take a very brief detour into some basic probability theory, to illustrate how to work out expected winnings. This will probably be familiar to most people reading this, so some skim reading might be in order for some of you.

First, let's just talk about the probability of something happening. Suppose I toss a coin and ask you to call heads or tails. You might say that your chances of getting it right are 50/50, 1 in 2, or 50%. I'd say your probability of getting it right is 0.5, and it's this way of saying it that we'll stick with.

When talking about probability in this sense, we use a scale from 0 to 1, where 0 means there's no chance of something happening (like the probability of rolling a die labelled 1 to 6 and getting a 7) and 1 means that the outcome is definitely going to happen. A probability of 0.5 is halfway between 0 and 1, and so indicates an outcome that is as likely to happen as it is to not happen.

Now, suppose we play a simple gambling game. I toss a coin and you stake £1 on the outcome being heads or tails. So if you are betting £1 on this outcome, what is a fair return if you win? You would probably say instinctively that it's fair if you profit £1 if you win, since you lose £1 if you lose. In other words, if you win I should pay you £2 (including the £1 you gave me to start with), and if you lose I give you nothing. We can check this instinctive guess is in fact right by doing some pretty simple algebra.

Let's say I give you £x if you win, and otherwise I keep your £1. There are two possible outcomes to the coin toss:

You win the toss with probability 0.5 and profit £x - £1

You lose the toss with probability 0.5 and lose £1 (in other words, you 'profit' -£1)

You can work out your 'expected' winnings by multiplying the probability of an event by what you profit, and then adding these up for all the possible outcomes. In this case there are just two outcomes, and so the expected return is 0.5*(x-1) + 0.5*(-1), corresponding to your 0.5 probability of profiting x-1 pounds, and the 0.5 probability of you losing 1 pound. If we expand the algebra we get 0.5*x - 0.5 - 0.5 = 0.5*x - 1.

A bet is 'fair' if your expected profit is zero, i.e. if in the long run you would expect to neither win nor lose money. So to choose x to make the bet fair we have to find an x so that 0.5*x - 1 = 0, and it's not too difficult to see that x = 2 satisfies this condition. As we guessed, the game is fair if I give you £2 back if you win.
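The same calculation can be written as a few lines of Python. This is only a sketch of the arithmetic above: the function names `expected_profit` and `coin_bet` are made up for illustration, and exact fractions are used so that the fair bet comes out as exactly zero.

```python
from fractions import Fraction

def expected_profit(outcomes):
    """Add up probability * profit over every possible outcome."""
    return sum(p * profit for p, profit in outcomes)

def coin_bet(payout, stake=1, p_win=Fraction(1, 2)):
    """The two outcomes of the coin-toss bet, as (probability, profit) pairs."""
    return [
        (p_win, payout - stake),   # you win: I hand over the payout, you are up payout - stake
        (1 - p_win, -stake),       # you lose: I keep your stake
    ]

# Try a few candidate payouts x for a £1 stake.
for payout in (1, 2, 3):
    print(payout, expected_profit(coin_bet(payout)))
```

Only a payout of 2 gives an expected profit of zero; anything less favours me, and anything more favours you.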

So anyway, back to the lottery. To calculate one's expected return from a lottery ticket we can just apply the above probability theory to the slightly more complicated lottery prize structure, right? Well, not really. Whilst it's easy(ish) to calculate the probability of each winning combination (matching 3, 4, 5, 5 and the bonus ball, or all 6 numbers), the prizes you get for each outcome are variable. The Wikipedia page about the lottery details the precise mechanism, as well as telling you the probabilities of each outcome, but the important point is that only the £10 prize for matching three numbers is fixed. The other amounts are determined by how much money is left in the prize fund once all the £10 winners are accounted for. (This is why when you see a draw on TV they talk about the 'estimated' jackpot; they don't know what the final jackpot will be until they know how many £10s have been won.) Another problem is that how much a ticket wins depends on how many other people win that prize, which complicates the expectations even more.

Fortunately, there is an easy way to work out the expected return for a ticket. From all the money made from ticket sales, Camelot set aside 45% for the prize fund, with the rest going to charities and tax (as well as a profit for the company). Every ticket has the same chance of winning as every other ticket, and so the expected return for every ticket must be the same. Because 45p from every ticket is then given back to the people buying tickets, this means that your expected return from a £1 ticket is that 45p. In other words, for every £1 ticket you buy, you can expect to lose 55p, in the long run at least. That's a pretty terrible return, so maybe that 'tax on the stupid' line isn't too inaccurate after all.

In fact, as gambling games go, Lotto is one of the more 'unfair', at least in terms of the punters' expected returns. For example, on an American roulette wheel, there are 18 red, 18 black and 2 green numbers (the 0 and 00). If the green numbers weren't there, then betting on red would be like betting on a coin toss, and so a fair profit on a winning £1 bet would be £1 (that is, £2 back including your stake), as we calculated earlier. Of course, in real roulette the payout isn't fair, and whilst you do get double your stake back if you bet on red and win, the two green numbers make winning slightly less likely than it should be for this to be a fair bet. More precisely, if you put £1 on red, then your expected return is (roughly) £0.95; on average you 'only' lose five pence per spin. Compared to the £0.45 you get from a Lotto ticket, the roulette wheel seems like a great deal.
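For anyone who wants to check the roulette figure, here is a minimal sketch of the sum. It uses only the numbers quoted above (18 red pockets out of 38, £2 back on a winning £1 bet, and 45p of every Lotto pound going to prizes); the helper name `expected_return` is just for illustration.

```python
from fractions import Fraction

def expected_return(p_win, amount_returned):
    """Expected amount handed back per £1 staked on a simple win/lose bet."""
    return p_win * amount_returned

# American roulette: 18 red pockets out of 38, and a winning £1 on red returns £2.
roulette_red = expected_return(Fraction(18, 38), 2)

# Lotto: 45p of every £1 ticket goes into the prize fund.
lotto = Fraction(45, 100)

print(float(roulette_red))        # roughly 0.947
print(float(1 - roulette_red))    # expected loss per £1 on red, roughly 0.053
print(float(1 - lotto))           # expected loss per £1 Lotto ticket, 0.55
```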

So anyway, if you gamble, you're expected (on average) to lose, so it's stupid to do it - is that a fair assessment? You can probably guess that I'm not convinced it is. Calculating 'expected' returns based purely on probabilities makes one fairly major assumption: the value of money is linear. That is, it assumes that the difference between £50 and £100 is the same as the difference between £1,000 and £1,050. You're probably thinking "well it is, it's £50 both times", but that's not quite the point. For instance, suppose someone calculated the most money you could ever possibly want or need in your lifetime, and then someone else offered you double this. Calculations about long-term expected returns assume that the second offer is worth twice as much as the first, and whilst that is obviously the case in terms of raw numbers, to an individual there isn't really any difference. If I've got as much money as I could ever possibly want, then any more money is worthless to me.

If this example is a little too fanciful for your tastes, then let's construct a moderately more realistic scenario. If someone offered you £50 or £100, you would take the £100 without question, and feel much better off for it. On the other hand, if someone offered you £1,000,000 or £1,000,050, you'd probably still take the larger amount, but that extra £50 seems far less valuable, because compared to £1,000,000 it's virtually nothing.

The point here is that the value of money is not simply how many zeroes there are on the end of a number, and so calculations of expected lottery returns are a bit meaningless if you don't take this into account.
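One way to make this concrete is to pick a 'value' function that flattens out as the amounts get bigger. The sketch below assumes a logarithmic one, which is a common textbook choice but purely an assumption here, not something the argument depends on; the names `log_value` and `gain_in_value` are made up for illustration.

```python
import math

def log_value(pounds):
    """Assumed 'value' of holding a sum of money: grows ever more slowly as the sum grows."""
    return math.log(pounds)

def gain_in_value(start, extra):
    """How much more valuable start + extra is than start, under the assumed value function."""
    return log_value(start + extra) - log_value(start)

small_case = gain_in_value(50, 50)           # £100 instead of £50
large_case = gain_in_value(1_000_000, 50)    # £1,000,050 instead of £1,000,000

print(small_case, large_case)
print(small_case / large_case)   # how many times more the same £50 is 'worth' in the small case
```

Under this assumed curve the same £50 is worth thousands of times more to someone choosing between £50 and £100 than to someone choosing between £1,000,000 and £1,000,050. A different curve would give different numbers, but the same direction.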

So if we can't calculate the value of a lottery ticket by just multiplying all of the possible winning amounts and the probability of attaining them, what can we do? Well we can still use this method, it's just that we have to be a bit more careful in how we define what the winnings are. Instead of calculating one's expected return in terms of pounds and pence, we have to instead think about the 'value' of each possible outcome. Investigating what money is worth to people is an area of psychology/economics which has received a fair bit of attention, and as you'd expect it's the sort of thing which can vary enormously from person to person. It's worth bearing this in mind, then, the next time you hear someone call lottery players (or even gamblers in general) 'stupid'; they might have thought about it a bit more than you think.

## Wednesday, 21 April 2010

## Thursday, 8 April 2010

### Football - Goal times in brief

I just had a quick look at the times goals were scored in the Premier League dataset:

If you'll excuse the somewhat poorly labelled x-axis (R is being fussy, and I'm not particularly inclined to try and fix it for something so trivial), there are a few interesting points.

Most obvious are the two huge bars at around half time and full time. Of course, as you might have already guessed, this is due to how goal times have been reported - goals in injury time in the first half are reported as a 45th minute goal, and in the second half as a 90th minute goal. The fact the 90th minute bar is so much taller than the 45th minute one demonstrates something most of us have probably observed: second half injury time is almost always longer than first half injury time.

If we filter out these two anomalies, and look at the goal times without the 46th or 91st minute goals, we get the following:

A couple of things to notice here. The first is that goals in the first minute really do seem quite rare, occurring in just 82 games (once in over 200 games). The second is that there looks like there might be a slight pattern to goal times - it seems goals become a bit more likely as the match wears on. Is this the case, or are our eyes just deceiving us?

Fortunately, we can test this hypothesis using a statistical technique called linear modelling. What this means is that we assume that the number of goals scored in a particular minute has a linear relationship with the time at which they're scored. In other words, if we just plotted the above bar graph but with points instead of bars, the points would lie on a roughly straight line. In fact, let's do this and see how it looks. One thing we'll change is to go from goal frequency to percentage; the data are the same, but it will make more sense to talk about percentages later on.

It looks a lot clearer when we plot the data this way, with most of the points seeming to lie (very roughly) on a straight line. We also note that each minute seems responsible for around 1% of goals. With 90 minutes in a game we'd expect something like this, so we've provided ourselves with a useful 'sanity check' - never a bad thing when playing with data.

Having observed this pattern, can we make any use of it? It might be nice to be able to fit a model that could tell us how likely a goal in, say, the 10th minute of a match would be, if it's true that there is a simple, straight line relationship between time in the game and likelihood of a goal. One option would be to just put a ruler on the graph and try and draw a straight line that seemed to best fit the data (indeed, I can remember doing this when I was at school, back when I would call this a 'line of best fit'). Fortunately, we can use statistical software to do this for us, and in arguably a much more reliable way.

To put our model mathematically, suppose the percentage of goals scored in the Xth minute of the game is G; then we'd assume that G = a + bX, where a and b are some numbers we want to find out. Using statistical software we can fit this model through a method called least squares. What this method does is a bit like what you do when you put a ruler on the graph and try and move it around until it looks about right, with the same number of points above and below the line you want to draw. What your eye is doing when you do this is probably trying to minimise the total distance all of your points are from the line you're drawing; what least squares does is calculate the line that minimises the sum of the squares of these distances. In other words, if you imagine a line drawn on the graph, and measure the distance from the line to each of the points, square these distances and add them all up, least squares will find you the line that makes this total the smallest.
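For the curious, here is a minimal sketch of what the software is doing behind the scenes. It uses the standard closed-form answer for a straight-line fit, measuring each 'distance' vertically from the point to the line. The data points are entirely made up for illustration (they are not the Premier League figures), and the function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: the fitted line passes through the point of means
    return a, b

def sum_of_squared_distances(xs, ys, a, b):
    """Square each point's vertical distance from the line a + b*x and add them all up."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up example data: minute of the match against percentage of goals.
minutes = [10, 20, 30, 40, 50, 60, 70, 80]
percent = [0.9, 1.1, 1.0, 1.2, 1.1, 1.3, 1.2, 1.4]

a, b = least_squares_line(minutes, percent)
best = sum_of_squared_distances(minutes, percent, a, b)

# Nudging the 'ruler' away from the fitted line only makes the total bigger.
shifted_up = sum_of_squared_distances(minutes, percent, a + 0.05, b)
tilted = sum_of_squared_distances(minutes, percent, a, b + 0.001)
print(a, b, best, shifted_up, tilted)
```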

Applying this method we find that a = 0.895 and b = 0.005, which we can then plug back into our model equation: instead of G = a + bX we now get G = 0.895 + 0.005X. This tells us that for every extra minute in the game, the percentage of goals scored in that minute goes up by 0.005 percentage points. Admittedly, this isn't very much at all, but over the course of the game that works out to around a 0.45 percentage point increase from start to finish, which, when you consider that the average minute will only have 1.11% of goals, seems a little more dramatic. For example, a goal in the 80th minute is roughly 40% more likely than a goal in the 10th.
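Plugging numbers into the fitted line is easy to check by hand or with a couple of lines of code. The sketch below uses the rounded coefficients quoted above, so its answers can differ slightly from anything computed with the unrounded fit; the name `goal_share` is made up for illustration.

```python
A = 0.895   # intercept of the fitted line, as quoted (rounded)
B = 0.005   # slope of the fitted line, as quoted (rounded)

def goal_share(minute):
    """Modelled percentage of all goals scored in a given minute."""
    return A + B * minute

rise_over_match = goal_share(90) - goal_share(0)
ratio_80_to_10 = goal_share(80) / goal_share(10)

print(rise_over_match)    # about 0.45 percentage points
print(ratio_80_to_10)     # about 1.37
```

With these rounded coefficients the 80th minute comes out around 37% more likely than the 10th, which is in the same ballpark as the 40% figure; the gap is plausibly just down to rounding.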

Finally, let's plot the line onto the previous graph, like so:

You might well think that it looks about as good as something you could have done by eye with a ruler, and you're probably right. However, another thing using a computational method can tell us is how 'significant' the effect of time on goal probability is. In other words, to what extent is there really an underlying relationship between the time in a game and the probability of a goal, and to what extent is this just the result of our data randomly falling into a pattern? We find that, mostly thanks to the sheer size of our dataset, this pattern is not likely to be down to chance; there really does seem to be a linear relationship between time in a game and the probability of a goal going in.

Perhaps something to bear in mind after a goalless first half.

## Wednesday, 7 April 2010

### Football - Exploratory Analyses

I recently got hold of a rather fun dataset: every English Premier League football match since it was established in 1992. With 22 teams for the first three seasons (a fact in itself which was news to me), and 20 from 1995/1996 season onwards, I have data on 6,706 football matches up to the end of last season. Awesome.
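As a quick sanity check on that total: in a league where every team plays every other team home and away, n teams produce n × (n - 1) matches per season. The sketch below assumes the dataset runs through to the end of the 2008/09 season, which would make fourteen 20-team seasons; that count is my reading of 'last season', not something stated in the data.

```python
def matches_per_season(teams):
    """Every team hosts every other team once, so n teams give n * (n - 1) matches."""
    return teams * (teams - 1)

SEASONS_WITH_22_TEAMS = 3    # 1992/93 to 1994/95
SEASONS_WITH_20_TEAMS = 14   # assumed: 1995/96 to 2008/09

total_matches = (
    SEASONS_WITH_22_TEAMS * matches_per_season(22)
    + SEASONS_WITH_20_TEAMS * matches_per_season(20)
)
print(total_matches)   # 6706
```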

So, as with any new data set, the first thing to do (after cleaning it up) is to think about what it is we are interested in, and go about some simple, exploratory analyses. I found myself with some free time this evening, so have gone through getting together some fairly basic statistics, which I'll share here through the controversial (yes, really) yet commonly used pie chart, a subject I'll probably do a separate post about later.

My initial idea was to look at results, and in particular the impact of home advantage. Everyone knows that playing at home is supposed to give your team a boost, be it through familiarity with the ground, less travelling, more fans, knowing where the booby traps are, and so on. Indeed, this is an idea well-trodden by sports statisticians; for instance, the guys behind Fink Tank assign a 0.5 goal advantage to the home team.

First up then, let's retread old ground and look just at match outcomes. Out of all 6706 games in the dataset, how many times does the home team win, how many times does the away team win, and how many times is it a draw?

To anyone with a passing interest in football, this won't be a particularly surprising chart. 46% (nearly half) of all games are won by the home team, with draws and away wins being about as likely as each other. This distribution of match results (or at least, a very similar one) has been used by the guys at the Winton programme for the public understanding of risk to look at to what extent the final standings in the Premier League table represent skill, and to what extent luck. If you're interested (and don't mind a little bit of maths), their work is well worth a read.

So what else can we do to explore the data? How about looking at what happens in each half of a game? The next two figures summarise the outcome of each half of a match, regardless of the overall match result.

Looking at each half individually, we see that the overall match outcome isn't obviously reflected. A draw suddenly seems much more likely, with 43% of first halves and 37% of second halves being tied. When you consider that this turns into just 27% of matches being drawn overall, it seems a little surprising.

The other thing we notice is that the second half is drawn considerably less often than the first; both teams are more likely to win the second half than the first half. Our exploratory analysis has thrown up our first potentially interesting question: why do second halves result in fewer draws? It could be the result of a greater drive to win a game, an inspirational substitution or just teams getting tired (and so sloppier) as the game wears on. We'll come back to this another time; for now we should do a bit more digging to see if any other interesting finds show up.

In the three figures on the left, I've looked at the final result of matches, depending on who (if anyone) is winning at half time. Firstly, it is unsurprising to see that if the away team is winning at half time, they're odds-on to go on and win the match outright, doing so 67% of the time. However, all is not lost if your team goes in losing at home at half time: they still have almost a 1 in 3 chance of salvaging something from the game, which seems pretty good consolation.

A draw at half time, meanwhile, suggests a draw at the final whistle is the most likely of the three possible outcomes, happening nearly 40% of the time. Again, though, we see the home advantage being demonstrated through the statistics. With the match drawn at half time, the away team goes on to win 1 in 4 times; a home win is much more likely.

Finally, we look at the opposite of our first chart, and see that if the home team is winning at half time they do a much better job of holding onto that lead. Whilst an away team leading at half time would go on to win two-thirds of such matches, a home team in a similar position goes on to win 80% (four-fifths) of the time. The contrast is even more stark when we look at the probability of a comeback. A home team losing at half time will come back and win 10% of the time, so once every ten games. On the other hand, an away team losing at half time will only manage to win 5%, just once in twenty, of such games.

I'll finish off (after that pie chart overload) with a bit of fun (for some value of 'fun'). The following is a bar chart showing the goal difference at the end of a match from the home team's perspective (so +1 means the home team won by one goal, -2 means the home team lost by two goals). Notice how when we look at the data this way, one might mistakenly conclude that a draw is the most likely outcome, as the highest bar is for a goal difference of zero. The real story, of course, requires us to consider all of the bars, and one can easily see how this graph has been produced from the same data that told us 46% of home teams win.

## Monday, 5 April 2010

### Bowling and Betting: The Power of Powers

So in a burst of productivity, I have created a blog, made an introductory post, and am now posting an actual document in the space of a few minutes.

This will be the first of my simple stats documents (that's a working title, before you worry), where I attempt to put together an accessible document about some aspect of probability or statistics that seems Just About Interesting Enough. It looks at what happens when you start multiplying probabilities, and how this ties in with accumulators - a type of bet that isn't as good as it might seem.

The first document is here, so go read, or don't, it's all cool.

### First

It seems sensible to have an introductory post, so here it is.

Welcome to stats is stats backwards (great name, I know), a blog where I plan to dump statistical ramblings, drafts of papers, any of my simple stats documents I put together, that sort of thing. I appreciate that I am rather talking into the aether, as I can't imagine many will find this corner of the Internet, let alone hang around to read about the 'fascinating' world of statistics, but if I ever let readership put me off writing, I'm not sure I'd ever get published.

With a few posts up now, I'm attempting a tagging system to try and make certain types of posts easier to find, as well as identifying the nature of a post quickly. In particular, an indication of the 'mathsy-ness' of an article will be given, namely basic, moderate or advanced. The aim is that most people can get through any post on here, but some will require a bit more effort than others. There will also be 'introductory' posts, where fairly basic concepts will be introduced, usually for the sake of being built on later.
