Friday, January 12, 2024

On knowing a lot about something and something about a lot of things

The physicist Richard Feynman told a story about being on a panel of experts from a variety of academic fields.  The full details are in one of the Surely You're Joking books I read many years ago.  I'm paraphrasing from memory here because lazy.  The gist is that the panel was asked to look at someone's paper that pulled together ideas from several fields and was generating a lot of buzz.  Just the sort of thing you'd want an interdisciplinary panel of experts to look at.

All the experts on the panel had a similar reaction: Overall, it looks very interesting, but the stuff in my area needs quite a bit of work -- this bit is a little off, they're misapplying these terms and these parts are just wrong.  But there are some really interesting ideas and this is definitely worth further attention.

In Feynman's telling, at least, he was the one to offer a different take: If every expert is saying the part they know about is bad, that says it's just bad all the way through.  It doesn't really matter what an expert thinks of the area outside their expertise.


Relying on people's subjective impressions is risky.  What we need here is some way to objectively determine the value of a paper that crosses areas of knowledge.  Here's one way to do it: Have everyone rate the paper in each area on a scale of 0 to 100 and then combine the numbers.

Let's say we have five people on the panel, specializing in music theory, physics, Thai cuisine, medieval literature and athletics, and someone has written a paper pulling together ideas from these fields into an exciting new synthesis.  Their ratings might be:

                Music  Physics  Thai food  Medi. lit  Athletics  Overall
Music theorist     25       75         80         65         85       66
Physicist          70       15         80         60         60       57
Thai chef          65       85          5         70         70       59
Medievalist        90       70         80         25         85       70
Athlete            85       90         95         90         30       78
Overall            67       67         68         62         66       66

Overall, the panel rates the paper 66 out of 100.  We don't have enough context here to know whether 66 is a good score or a mediocre score, but it certainly doesn't look horrible.  The highest field average is in Thai cuisine, and the highest individual rating there came from the athletics expert, so maybe the author has discovered some interesting contribution to Thai food by way of athletics.
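If you'd like to check the arithmetic, here's a quick Python sketch that reproduces the simple averages (the layout and variable names are mine, not anything from the original story):

```python
# Ratings from the table above: one row per panelist, one column per field
# (music, physics, Thai food, medieval literature, athletics).
ratings = [
    [25, 75, 80, 65, 85],  # music theorist
    [70, 15, 80, 60, 60],  # physicist
    [65, 85,  5, 70, 70],  # Thai chef
    [90, 70, 80, 25, 85],  # medievalist
    [85, 90, 95, 90, 30],  # athlete
]

def mean(xs):
    return sum(xs) / len(xs)

# Average each column (field) across all five panelists, then average those
# field averages to get the panel's overall score.
per_field = [round(mean(col)) for col in zip(*ratings)]
overall = round(mean([mean(col) for col in zip(*ratings)]))

print(per_field)  # [67, 67, 68, 62, 66]
print(overall)    # 66
```

Every rating counts the same here, which is exactly the problem: the athlete's opinion of the Thai cuisine section carries as much weight as the chef's.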

But hang on a minute.  The highest overall score is in Thai cuisine, but the lowest rating from any expert is the 5 from the Thai chef.  Let's ask each of the experts how much they know about each field, both inside and outside their home turf:

                Music  Physics  Thai food  Medi. lit  Athletics
Music theorist     95        5         15         10          5
Physicist          20      100         10          5          5
Thai chef           5       10        100         10         15
Medievalist        10        5         10         95         10
Athlete            10       15          5         10         95

Everyone feels confident in their own field, as you might expect, and not particularly confident outside it, which also makes sense.  There's quite a bit more variation outside the home fields as well.  Maybe the physicist happens to have taken a couple of courses in music theory.  Maybe the athlete has only had Thai food once.  You can expect someone to have studied their own field extensively, but who knows what they've done outside it.

We should take this into account when looking at the ratings.  A Thai chef saying that the paper is weak in Thai cuisine means more than an athlete saying it's great.  If we take a weighted average by multiplying each rating by the panelist's confidence, adding those up and dividing by the total weight (that is, the total of the confidence numbers), we get a considerably different picture:

                 Music  Physics  Thai food  Medi. lit  Athletics  Overall
Weighted result     42       33         27         38         42       36

Overall, the paper rates 36 out of 100 rather than 66.  Its weakest area is Thai cuisine, and even its best score, 42, is well below the previous overall of 66.

This seems much more plausible.  The person who knows Thai food best rated it low, and now we're counting that rating ten times more heavily than the physicist's and twenty times more heavily than the athlete's, since the athlete said they knew least about it.
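To see the weighting in code, here's the same sketch extended with the confidence numbers as weights (again, the layout and names are my own, not anything from the original story):

```python
# Same ratings as before, now paired with each panelist's self-reported
# confidence (0-100) in each field.
ratings = [
    [25, 75, 80, 65, 85],  # music theorist
    [70, 15, 80, 60, 60],  # physicist
    [65, 85,  5, 70, 70],  # Thai chef
    [90, 70, 80, 25, 85],  # medievalist
    [85, 90, 95, 90, 30],  # athlete
]
confidence = [
    [95,   5,  15, 10,  5],
    [20, 100,  10,  5,  5],
    [ 5,  10, 100, 10, 15],
    [10,   5,  10, 95, 10],
    [10,  15,   5, 10, 95],
]

def weighted_mean(values, weights):
    # Multiply each value by its weight, add them up, and divide by the
    # total weight -- exactly the recipe described above.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weighted score per field: each column of ratings weighted by the
# matching column of confidence.
per_field = [weighted_mean(r, w)
             for r, w in zip(zip(*ratings), zip(*confidence))]

# Overall: one big weighted average over every (rating, confidence) pair.
overall = weighted_mean(
    [r for row in ratings for r in row],
    [w for row in confidence for w in row],
)

print(round(per_field[2]))  # Thai cuisine: 27
print(round(overall))       # 36
```

Dividing by the total weight rather than the number of ratings is what keeps the result on the same 0 to 100 scale.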

I think there are a few lessons to be drawn here.  First, it's important to take context into account.  The medievalist's rating means a lot if it's about medieval literature and not much if it's about physics, unless they also happen to have a background there.  Second, just putting numbers on something doesn't make it any more or less rigorous.  The 66 rating and the 36 rating are both numbers, but one means a lot more than the other.

Third, when it comes specifically to averages, a weighted average can be a useful tool for expressing how much any particular data point should count for.  Just be sure to assign the weights independently of the numbers you're weighting.  Asking the panelists ahead of time how much they know about each field makes sense.  Looking at the rating numbers and then deciding how much to weight them is a classic example of data fiddling.

Finally, it's worth keeping in mind that people often give the benefit of the doubt to something that sounds plausible when they don't have anything better to go on.  As I understand it, this was the case in Feynman's example.  In that case, giving the paper to a panel of experts from different fields gave the author much more room to hide than if they'd, say, submitted a shortened version of the paper for each field.

The answer is not necessarily to actively distrust anything from outside one's own expertise, but it's important not to automatically trust something you don't know about just because it seems reasonable.  The better evaluation isn't "I don't believe it" but "I really can't say".

I'll leave it up to the reader how any of this might apply to, say, generative AI, LLMs and chatbots.