All models are wrong.: May 2010

Thursday 13 May 2010

Doing it by Degrees

When it comes to university education, many people do not have to think too hard about what they want to study; the bigger dilemma is where to study it. That said, university prospectuses will often tout various statistics to try and lure potential students into plumping for a particular course, with employability one of the more commonly seen figures. But is this a reliable metric of how 'valuable' a degree is? Or is it yet another example of STATISTICS ABUSE? (roll opening credits)

Every year AGCAS produces a report looking at destinations of graduates six months after graduation. The latest report, from 2009, is the result of questionnaires sent to all graduates from the 2007/8 academic year, and can be downloaded here. The report itself contains quite a lot of interesting data, with destination breakdown (how many graduates are employed, unemployed, or studying for further degrees) as well as stats on the types of job those in employment have found themselves in. With these statistics available for a number of subjects (or subject areas), we can get a feel for which subjects seem the most or least valuable.

The report provides details on the following subjects/subject areas:

Science

Biology; Chemistry; Environmental, Physical Geographical and Terrestrial Sciences; Physics; Sports Science

Mathematics, IT and Computing

Computer Science and Information Technology; Mathematics

Engineering and Building Management

Architecture and Building; Civil Engineering; Electrical and Electronic Engineering; Mechanical Engineering

Social Sciences

Economics; Geography; Law; Politics; Psychology; Sociology

Arts, Creative Arts and Humanities

Art and Design; English; History; Media Studies; Languages; Performing Arts

Business and Administrative Studies

Accountancy; Business and Management; Marketing

So let's start with employment, surely a perfectly good benchmark of how 'good' a degree is. The AGCAS report splits employed graduates into those in UK employment, those in overseas employment, and those working and studying. We add these three together to give us our employment figures:

Top 5 for Employment:

Civil Engineering (78.3% employed)
Marketing (74.6%)
Business and Management (73.6%)
Architecture and Building (73.4%)
Accountancy (73.0%)

Bottom 5 for Employment:

Law (35.2%)
Physics (37.9%)
Chemistry (44.0%)
Biology (58.0%)
History (58.7%)

I think it's fair to say there are some surprises here. Marketing, and Business and Management, two subject areas often cited as housing archetypal 'Mickey Mouse' degrees, make the top 5, whilst historically 'tough' subjects like chemistry and physics are at the opposite end. Are people really better off studying business over biology? Or is there something wrong with our metric?

Naturally, I'm inclined to believe the latter, and with good reason. As is so often the case, one statistic does not tell the whole story; whilst these numbers tell us what proportion of graduates were employed six months after graduation, it is not simply the case that everyone else was unemployed. AGCAS reports a number of 'studying' statistics as well, such as those studying for a higher degree, a PGCE, or professional qualifications. Perhaps then, unemployment is a better way of assessing degrees, as this takes people who are 'employed' with study into account. Let's see what happens:

Top 5 for Unemployment:

Law (5.5% unemployed)
Sports Science (5.6%)
Geography (6.4%)
Civil Engineering (7.0%)
Psychology (7.4%)

Bottom 5 for Unemployment:

Computer Science and Information Technology (13.7%)
Media Studies (12.3%)
Art and Design (12.2%)
Electrical and Electronic Engineering (11%)
Accountancy (10.9%)

Quite a big change. Law jumps from worst for employment to best for unemployment (as you might expect, they're all studying), and accountancy has done the opposite. There are still some surprises, such as Computer Science and IT having the highest rate of unemployment, and another 'Mickey Mouse' course in the form of Sports Science being second best. However, this seems a much less debatable statistic than employment, and so it seems reasonable to take these figures at face value.

There is, of course, an issue we have yet to discuss, which will be a rather pressing one for many new graduates: money. What good is being employed if you're only getting paid £5 an hour for those fancy letters after your name?

The salary data in the AGCAS report are a little harder to find, let alone digest. Whilst we get nice pie charts and percentage breakdowns for destinations, discussion of salaries is restricted to an introductory paragraph. If we trawl through these, however, we do get some numbers, and merging them all together we can do another top and bottom 5, this time based on the average salary of respondents.

Top 5 for Salary:

Economics (£24065)
Civil Engineering (£24006)
Architecture and Building (£23689)
Mechanical Engineering (£23683)
Electrical and Electronic Engineering (£22372)

Bottom 5 for Salary:

Art and Design (£15656)
Media Studies (£16295)
Psychology (£16500)
Sports Science (£16627)
English (£16642)

Once again, a rather marked change. Media Studies keeps the bottom 5 place it enjoyed under the unemployment stats, but it is joined by Sports Science, which was second best for unemployment. There are no real surprises in our top 5, however, all these subjects having a fairly substantial pedigree.

For the sake of argument, then, let's suppose that you are most interested in average salary. As I mentioned, the AGCAS report makes it much easier to find the employment/unemployment figures for a subject than it does to find average salaries. Do these provide an adequate indicator of average salary? Our top/bottom 5s above would suggest not, but these only cover 10 of 26 subjects. Let's plot some graphs!

First up, average salary against employment: is there a strong link between the two?

Hmm, no obvious pattern there, then. How about unemployment? Does that give us a better fit?

There doesn't seem to be any sort of pattern there either.

We can in fact calculate a number that gives us an idea of how closely related two sets of numbers are. The correlation coefficient for a set of (x,y) points (like the (employment %, salary) points on our graph) varies from -1 to 1. If it's close to 0 that means there is little or no straight-line relationship between our numbers, whilst if it is close to +1 or -1 it suggests a strong one. For example, if in our plot of employment against salary above all our points seemed to be on a straight line, this would suggest a correlation of around 1 or -1. The sign indicates the direction of the correlation. If it's positive this means as salary increases, so does employment. If it's negative, then as salary increases, employment decreases. A strong correlation doesn't mean that one thing causes the other - "correlation does not imply causation" is one of a statistician's many mantras - it just shows that these data happen to have an association (which we may go on to convince ourselves is a causal one).
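For anyone who wants to see the calculation itself, here is a minimal sketch of the usual (Pearson) correlation coefficient in Python. I'm assuming this is the coefficient meant above; the function name `pearson_r` and the numbers are made up for illustration and are not taken from the AGCAS report.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Made-up illustrative numbers, NOT data from the report
x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]      # points on a rising straight line
falling = [10 - 3 * v for v in x]    # points on a falling straight line
curved = [(v - 3) ** 2 for v in x]   # a symmetric U shape

print(pearson_r(x, rising))   # close to +1
print(pearson_r(x, falling))  # close to -1
print(pearson_r(x, curved))   # 0, despite an obvious pattern
```

The last example is worth a second look: the points follow a perfectly clear curve, yet the coefficient is zero, because it only measures how well a straight line describes the data.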

So that diversion aside, what correlations do we get in our two plots above? Looking at them, we'd expect it to be close to zero; there doesn't seem to be much of a pattern in either of them. For the first plot, of employment against salary, we find a correlation coefficient of 0.12 - so not much of a surprise there. For unemployment it's even worse: 0.06. In short, neither employment nor unemployment is a good indicator of average salary.

There is one area of the AGCAS report we haven't discussed, however, which might prove useful. Whilst each subject has a page of percentages of those in employment, studying, and so on, it also has a page showing what types of jobs are held by those who are employed. These range from a variety of 'Professionals' down to 'Numerical Clerks and Cashiers', and 'Retail, Catering, Waiting and Bar Staff'. This last one doesn't sound too glamorous; you've just spent 3 years earning a degree and you're still working in a bar? More to the point, these jobs are going to be low paying, so hopefully they're a better indicator of average salary. Let's see:

There definitely seems to be a pattern there, and the correlation between the two variables is -0.88 - that's a pretty strong negative correlation. The higher the proportion of those employed in retail, the lower the average salary. Not a surprising result, but it's always worth checking these things.

Is this at all useful, though? The salary data are in the document; you just have to dig for them a bit more. There is, however, one thing we've not mentioned. Because the report doesn't give average salaries the same prominent treatment as the employment data, some numbers are, in fact, missing. Whilst we can see what proportion of history graduates are studying in the UK for a teaching qualification, we can't find their average salary six months after graduation (and the same goes for performing arts). However, because we've identified the percentage of those working in retail as a useful indicator of average salary, we can use this knowledge to predict the average salaries of history and performing arts graduates. (In statistics, we'd call our retail statistic a 'proxy' for salary.)

So how do we turn our retail employment data into a prediction of salary? If you read my previous post about the times goals are scored in football matches, you should already know where I'm going with this. If not, then go and read it now, and come back when you're ready to apologise for such an oversight.

So anyway, it's time for some more linear regression. We're looking to fit the model S = a + bR, where S is salary, and R is the percentage of those employed who are employed in retail. If we can estimate a and b, then we can use this equation to estimate S when we only know R, as is the case for history and performing arts degrees. We can also plot a cool line on our graph to show the trend. Running the numbers, we find a = 25014 and b = -468, and plotting the line this generates onto our graph gives us:
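If you'd like to see where a and b come from, here is a minimal sketch of an ordinary least-squares fit in Python. I'm assuming the standard least-squares estimates are what "running the numbers" amounts to; the function name `fit_line` and the data points are invented for illustration (they are not the report's figures, so they won't reproduce the values of a and b quoted above).

```python
def fit_line(rs, ss):
    """Ordinary least-squares estimates (a, b) for the model S = a + b*R."""
    if len(rs) != len(ss) or len(rs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(rs)
    mean_r = sum(rs) / n
    mean_s = sum(ss) / n
    # Spread of R, and how R and S vary together
    srr = sum((r - mean_r) ** 2 for r in rs)
    if srr == 0:
        raise ValueError("cannot fit a slope when every R value is the same")
    srs = sum((r - mean_r) * (s - mean_s) for r, s in zip(rs, ss))
    b = srs / srr              # slope
    a = mean_s - b * mean_r    # intercept: the line passes through the means
    return a, b


# Invented (R, S) points lying exactly on the line S = 20000 - 300*R
retail_pct = [5.0, 10.0, 15.0, 20.0]
salary = [18500.0, 17000.0, 15500.0, 14000.0]

a, b = fit_line(retail_pct, salary)
print(a, b)
```

Because the invented points sit exactly on a line, the fit recovers that line's intercept and slope; with real, scattered data you get the line that minimises the squared vertical distances to the points.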

We can now either use our equation S = a + bR with a and b replaced with 25014 and -468, or read straight off the line on our graph. For both history and performing arts, retail employment was 17.4%, so plugging R = 17.4 into this equation gives S = 25014 - 468*17.4 = £16,870.80. Our model suggests that both subjects seem to lead to (relatively) low average salaries, something which would not have been easy to discern from the report alone.
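The plugging-in step is easy to check for yourself. This is a small sketch using the rounded coefficients quoted in the post; the function name `predict_salary` is mine, and a figure computed from unrounded coefficients could differ by a few pence.

```python
A = 25014   # intercept quoted in the post (rounded)
B = -468    # slope quoted in the post (rounded)


def predict_salary(retail_pct, a=A, b=B):
    """Predicted average salary (in pounds) from the model S = a + b*R."""
    return a + b * retail_pct


# Retail employment for history and performing arts, as given in the post
print(f"£{predict_salary(17.4):,.2f}")
```

Bear in mind that a prediction like this is only sensible for values of R inside the range the line was fitted on.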

Alas, this all assumes our model is accurate, and with a relatively small number of observations I wouldn't be inclined to place too much confidence in these conclusions. Here I've taken a single report to base rather a lot of analysis on. However, it does illustrate a couple of interesting points. Firstly, mere 'employability' figures seem a rather dubious metric on which to base the value of a degree. Perhaps more surprisingly, unemployment doesn't seem to be a particularly good one either, at least in terms of indicating average salary. Whilst this report did have salary data in it, they weren't as clearly laid out as the other data, and were in fact missing for some subjects. This has allowed us to demonstrate how you can use another variable (if you think it's a good enough surrogate) to estimate missing data. Whilst for this particular problem you're probably better off just trying to hunt down the data you want in another report, our way is clearly much more fun.

Wednesday 5 May 2010

How many horses?

So the Lib Dems sent me some election material this morning. Unfortunately for them, ours is a very safe Labour seat, as you can see from this bar chart of the last election:

Not particularly pretty, but fairly clear, I think. Labour have a big majority, the Lib Dems are a (relatively) distant second, and the Tories and Greens are pretty much just making up the numbers.

So, how did the Lib Dems choose to present these data in their election leaflet? Like this!

Crikey. They say it's a two-horse race, and it really does look like one, doesn't it? Except hang on, this graph should be showing the same data as my one, so why does it look so different? Surely they haven't been abusing statistics for political gain?

Well, before we accuse them of that, let's check a couple of common tricks people use when presenting bar charts to try and give a particular impression.

First up, it's the 'cut the y-axis above zero' method. Here that means rather than having the bottom of the graph equivalent to zero votes, having it equivalent to something larger. The Lib Dems can't have done this though, because that would only exaggerate the difference in votes. To demonstrate, if we dismiss the Tories and just plot the Lib Dem and Labour votes, and have a cut-off at 9,000 votes, it looks like this:

Wow, no point voting for anyone other than Labour here, they've got it wrapped up... (Obviously, were we making real propaganda, we'd leave off the y-axis; you can't have people reading that and working out what we're up to!)
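To put a number on how much a cut-off exaggerates things, here is a minimal Python sketch. The vote counts are hypothetical (I haven't reproduced the real figures here) and the function names are mine; only the 9,000 cut-off comes from the example above.

```python
def bar_height(votes, baseline=0):
    """Drawn height of a bar when the y-axis starts at `baseline` votes."""
    if votes <= baseline:
        raise ValueError("bar would have no height above this baseline")
    return votes - baseline


def apparent_ratio(bigger, smaller, baseline=0):
    """How many times taller the bigger bar looks than the smaller one."""
    return bar_height(bigger, baseline) / bar_height(smaller, baseline)


# Hypothetical vote counts, for illustration only
labour, lib_dem = 20_000, 10_000

print(apparent_ratio(labour, lib_dem))          # axis starts at zero
print(apparent_ratio(labour, lib_dem, 9_000))   # axis cut off at 9,000
```

With these made-up numbers the true ratio is 2 to 1, but once the axis starts at 9,000 the Labour bar is drawn 11 times taller than the Lib Dem one.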

So the Lib Dems can't have done that, so another option is a logarithmic y-axis. What this means is that rather than each mark on the y-axis indicating a constant increase of votes, each mark instead corresponds to an increase by a factor, maybe 10. In other words, whilst a standard axis will go 1,000, 2,000, 3,000, and so on, a logarithmic one would go 1,000, 10,000, 100,000, increasing by a factor of 10 each time. These scales are useful for when you're trying to show a graph with both very large and very small numbers. It would seem a bit silly to use one here, but can it explain the Lib Dem graph?
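As a rough sketch of what a logarithmic axis does to bar heights, here is some Python. Again the vote counts are hypothetical, the function name `log_bar_height` is mine, and the 1,000 cut-off is just a value picked for the illustration.

```python
from math import log10


def log_bar_height(votes, baseline=1):
    """Drawn height of a bar on a base-10 log axis starting at `baseline` votes."""
    if votes <= 0 or baseline <= 0:
        raise ValueError("a log axis only works for positive numbers")
    return log10(votes) - log10(baseline)


# Hypothetical vote counts, for illustration only
votes = {"Labour": 20_000, "Lib Dem": 10_000, "Tory": 2_000}

for party, n in votes.items():
    # Height on a plain log axis, then on a log axis cut off at 1,000 votes
    print(party, round(log_bar_height(n), 2), round(log_bar_height(n, 1_000), 2))
```

On the plain log axis all three made-up bars end up a similar height, even though one party has ten times the votes of another. Cutting the log axis off at 1,000 shrinks the smallest bar much more than the other two, which is the sort of effect the next step plays with.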

Encouraging? Maybe. Notice now how everyone seems much closer, and that the y-axis is increasing logarithmically; going up in multiples of 10. This still doesn't really look like the graph the Lib Dems produced (the Tories seem a lot closer than they should be), so let's tweak it a bit more, and go back to cutting the y-axis off somewhere suitable. We'll also drop the pesky marks on the y-axis that actually tell us what's going on:

Aha! That's much more like it. Not a perfect imitation, but certainly getting there. We've got the Tories down as an also-ran, and the Lib Dems really giving Labour a run for their money. We could probably pick a better logarithmic factor (we used 10 here) to get the Lib Dem and Labour bars a bit closer together, but I think by now we've established that the Lib Dems are really just playing Silly Buggers. I can't imagine they actually fished around for a good scale on which to make the graph look like that, instead they've just drawn some appropriately shaped bars and stuck the numbers on. Of course, they've told us the numbers (and even given a source for bonus authenticity!), so it's our own fault if we just look at the coloured rectangles and draw the wrong conclusion. Still, that's precisely what they're hoping people will do, and it's a great example of why people don't trust statistics.

Tuesday 4 May 2010

Practical Probability - Is insurance a 'tax on the stupid'?

In a previous post I talked about gambling, and specifically the value of lottery tickets. I opened with the line "lotteries are a tax on the stupid", which I have often heard people trot out when they feel it pertinent. When someone says this in my earshot, I have a simple question in reply: "Do you have home insurance?". Almost invariably, the answer is "yes...why?".

Suppose I've set up a lottery, let's call it Thundercracker. I quite like money, but I'm also a bit lazy, so my lottery isn't very complicated. Each week you pay me £1 and get a lottery ticket where you pick a number from 1 to 10. I'll then hold a draw where I pick a numbered ball out of a bag: if your number comes out I'll give you £5; if not, you win nothing. We can work out your 'expected' returns in the same way we did when talking about coin tosses. You have a one in ten chance of winning and profiting £4, and a nine in ten chance of losing and being £1 down (or, to put it another way, profiting -£1). To return to the vernacular from the previous post:

You win with probability 0.1 and profit £4
You lose with probability 0.9 and profit -£1

and so your expected profit is 0.1*£4 + 0.9*-£1 = £0.40 - £0.90 = -£0.50. On average you lose (and so I profit) 50p every week. Sounds good to me, and aren't you so stupid to keep playing when the odds are stacked against you?
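The same sum can be written as a few lines of Python. This is only a sketch: the function name `expected_profit` is mine, and I've used exact fractions so that rounding doesn't muddy the answer.

```python
from fractions import Fraction


def expected_profit(outcomes):
    """Expected profit from (probability, profit) pairs whose probabilities sum to 1."""
    outcomes = list(outcomes)
    if sum(p for p, _ in outcomes) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * profit for p, profit in outcomes)


# The Thundercracker lottery described above: profits are in pounds
thundercracker = [
    (Fraction(1, 10), 4),    # win: £5 prize minus the £1 ticket
    (Fraction(9, 10), -1),   # lose: down the £1 ticket
]

print(float(expected_profit(thundercracker)))  # -0.5, i.e. a 50p loss per week
```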

One week however, I get bored of the balls in a bag lark, and I decide to change the rules slightly. I happen to know you're a bit of a minimalist, and that the value of everything in your home is £5. Now, rather than giving you £5 if I pick your ball out of the bag, I'll give you £5 if instead everything in your house gets stolen. From your perspective nothing has changed (fiscally at least): if you 'lose' the lottery (that is, your stuff doesn't get stolen), you're down the £1 you paid to me for your lottery 'ticket'. On the other hand, if you 'win' the lottery (by having all your stuff nicked) then you win £5 from me. Because the lottery has nothing to do with whether your stuff got stolen or not, you would have been in that predicament anyway, so the £5 I give you is just like the £5 you get if you win the old lottery. In fact, I've decided the probability that you'll get burgled in any one week is 1 in 10, so I continue to make the same profit I did before, and you the same (expected) loss.

This is a bit of a silly example, but it illustrates the principle: paying however much money a week for insurance is doing exactly the same thing as playing the lottery is, at least in terms of financial loss or gain. The only difference is that in a lottery the probabilities are all easy(ish) to calculate, whereas things are a lot less clear for insurance.

However, one thing you do know about insurance companies is that, like casinos, they always win (otherwise they would go out of business). So overall they are going to be offering worse returns than they should given the true chances of bad things happening. You might find a policy which you individually are expected to profit from, but you would be very fortunate to do so.
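One way to make "worse returns than they should" concrete is to compare the premium you pay with the break-even premium, the amount at which neither side expects to profit. Here is a minimal sketch; the names `fair_premium` and `insurer_margin` are my own labels, and the numbers are the toy ones from the burglary example above.

```python
from fractions import Fraction


def fair_premium(p_claim, payout):
    """Break-even premium: the expected payout per period."""
    return p_claim * payout


def insurer_margin(premium, p_claim, payout):
    """Expected profit to the insurer (and expected loss to you) per period."""
    return premium - fair_premium(p_claim, payout)


# Toy numbers from the example: £1 a week, £5 payout, 1 in 10 chance of a claim
p = Fraction(1, 10)
print(float(fair_premium(p, 5)))        # 0.5: break-even would be 50p a week
print(float(insurer_margin(1, p, 5)))   # 0.5: the other 50p is the insurer's cut
```

In real life the hard part is that nobody hands you the probability of a claim, which is exactly why it is so much harder to tell how good a deal an insurance policy is.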

Of course, losing your house is perhaps as bad as winning millions of pounds is good. Indeed, when talking about lottery tickets I discussed how the 'value' of an outcome isn't necessarily simply the number of pounds you get from it. The same logic can be applied here. Fiscally speaking, insurance sets you up for a loss in the same way a lottery ticket does. However, many would argue the value they ascribe to the various possible outcomes means that insurance (to them, at least) is worth it overall. Others may feel the same about playing the lottery. Is either really a 'tax on the stupid'? It depends on where your values lie.
