Friday, January 12, 2024

On knowing a lot about something and something about a lot of things

The physicist Richard Feynman told a story about being on a panel of experts from a variety of academic fields.  The full details are in one of the "Surely You're Joking" books, which I read many years ago.  I'm paraphrasing from memory here because lazy.  The gist is that the panel was asked to look at someone's paper that pulled together ideas from a variety of fields and was generating a lot of buzz.  Just the sort of thing you'd want an interdisciplinary panel of experts to look at.

All the experts on the panel had a similar reaction: Overall, it looks very interesting, but the stuff in my area needs quite a bit of work -- this bit is a little bit off, they're mis-applying these terms and these parts are just wrong.  But there are some really interesting ideas and this is definitely worth further attention.

In Feynman's telling, at least, he was the one to offer a different take: If every expert is saying the part they know about is bad, that says it's just bad all the way through.  It doesn't really matter what an expert thinks of the area outside their expertise.


Relying on people's subjective impressions is risky.  What we need here is some way to objectively determine the value of a paper that crosses areas of knowledge.  Here's one way to do it: Have everyone rate the paper in each area on a scale of 0 to 100 and then average the numbers.

Let's say we have five people on the panel, specializing in music theory, physics, Thai cuisine, medieval literature and athletics, and someone has written a paper pulling together ideas from these fields into an exciting new synthesis.  Their ratings might be:

Panelist        Music  Physics  Thai food  Medieval lit  Athletics  Overall
Music theorist     25       75         80            65         85       66
Physicist          70       15         80            60         60       57
Thai chef          65       85          5            70         70       59
Medievalist        90       70         80            25         85       70
Athlete            85       90         95            90         30       78
Overall            67       67         68            62         66       66

Overall, the panel rates the paper 66 out of 100.  We don't have enough context here to know whether 66 is a good score or a mediocre score, but it certainly doesn't look horrible.  The highest-scoring field is Thai cuisine, at 68, and the highest rating there was the 95 from the athletics expert, so maybe the author has discovered some interesting contribution to Thai food by way of athletics.

But hang on a minute.  The highest overall score is in Thai cuisine, but the lowest rating from any expert is the 5 from the Thai chef.  Let's ask each of the experts how much they know about their fields and those outside their home turf:

Panelist        Music  Physics  Thai food  Medieval lit  Athletics
Music theorist     95        5         15            10          5
Physicist          20      100         10             5          5
Thai chef           5       10        100            10         15
Medievalist        10        5         10            95         10
Athlete            10       15          5            10         95

Everyone feels confident in their own field, as you might expect, and not particularly confident outside it, which also makes sense.  There's also quite a bit more variation outside the home fields, and that's not surprising either.  Maybe the physicist happens to have taken a couple of courses in music theory.  Maybe the athlete has only had Thai food once.  You can expect someone to have studied extensively in their field, but who knows what they've done outside it.

We should take this into account when looking at the ratings.  A Thai chef saying that the paper is weak in Thai cuisine means more than an athlete saying it's great.  If we take a weighted average by multiplying each rating by the panelist's confidence, adding those up and dividing by the total weight (that is, the total of the confidence numbers), we get a considerably different picture:

                 Music  Physics  Thai food  Medieval lit  Athletics  Overall
Weighted result     42       33         27            38         42       36

Overall, the paper rates 36 out of 100 rather than 66.  Its weakest area is Thai cuisine, and even its strongest area, athletics, is well below the previous score of 66.

This seems much more plausible.  The person who knows Thai food best rated it low, and now we're counting that ten times more heavily than the physicist's rating and twenty times more heavily than the judge who said they knew least about it.
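For anyone who wants to check the arithmetic, here's a minimal Python sketch of the calculation described above, using the numbers from the two tables.  The function and variable names are my own inventions for illustration.  One assumption to flag: for the overall figure, the sketch pools all 25 ratings into a single weighted average.  Averaging the five per-field results instead gives nearly the same number.

```python
# Ratings and self-reported confidence, copied from the two tables.
# Rows are panelists, columns are fields, in the same order as the tables.
FIELDS = ["Music", "Physics", "Thai food", "Medieval lit", "Athletics"]

RATINGS = [
    [25, 75, 80, 65, 85],  # Music theorist
    [70, 15, 80, 60, 60],  # Physicist
    [65, 85, 5, 70, 70],   # Thai chef
    [90, 70, 80, 25, 85],  # Medievalist
    [85, 90, 95, 90, 30],  # Athlete
]

CONFIDENCE = [
    [95, 5, 15, 10, 5],    # Music theorist
    [20, 100, 10, 5, 5],   # Physicist
    [5, 10, 100, 10, 15],  # Thai chef
    [10, 5, 10, 95, 10],   # Medievalist
    [10, 15, 5, 10, 95],   # Athlete
]

# Equal weights reproduce the plain, unweighted averages.
EQUAL = [[1] * len(FIELDS) for _ in RATINGS]


def weighted_average(values, weights):
    """Sum of value * weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def column(table, j):
    """Every panelist's entry for field number j."""
    return [row[j] for row in table]


def field_scores(ratings, weights):
    """One weighted average per field (per column)."""
    return [
        weighted_average(column(ratings, j), column(weights, j))
        for j in range(len(ratings[0]))
    ]


def overall_score(ratings, weights):
    """Assumption: pool every rating into one weighted average."""
    flat_ratings = [r for row in ratings for r in row]
    flat_weights = [w for row in weights for w in row]
    return weighted_average(flat_ratings, flat_weights)


plain = field_scores(RATINGS, EQUAL)
weighted = field_scores(RATINGS, CONFIDENCE)

for name, p, w in zip(FIELDS, plain, weighted):
    print(f"{name:<13} plain {p:5.1f}   weighted {w:5.1f}")

print(f"{'Overall':<13} plain {overall_score(RATINGS, EQUAL):5.1f}   "
      f"weighted {overall_score(RATINGS, CONFIDENCE):5.1f}")
```

Run on the figures in the tables, this gives weighted scores of about 41.8 for music, 32.8 for physics, 27.0 for Thai food, 37.9 for medieval literature and 42.1 for athletics, and about 36.2 overall, against a plain average of 66.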

I think there are a few lessons to be drawn here.  First, it's important to take context into account.  The medievalist's rating means a lot if it's about medieval literature and not much if it's about physics, unless they also happen to have a background there.  Second, just putting numbers on something doesn't make it any more or less rigorous.  The 66 rating and the 36 rating are both numbers, but one means a lot more than the other.

Third, when it comes specifically to averages, a weighted average can be a useful tool for expressing how much any particular data point should count for.  Just be sure to assign the weights independently from the numbers you're weighting.  Asking the panelists ahead of time how much they know about each field makes sense.  Looking at rating numbers and deciding how much to weight them is a classic example of data fiddling.

Finally, it's worth keeping in mind that people often give the benefit of the doubt to something that sounds plausible when they don't have anything better to go on.  As I understand it, this was the case in Feynman's example.  In that case, giving the paper to a panel of experts from different fields gave the author much more room to hide than if they'd, say, submitted a shortened version of the paper for each field.

The answer is not necessarily to actively distrust anything from outside one's own expertise, but it's important not to automatically trust something you don't know about just because it seems reasonable.  The better evaluation isn't "I don't believe it" but "I really can't say".

I'll leave it up to the reader how any of this might apply to, say, generative AI, LLMs and chatbots.
