Misunderstanding the replication crisis

Based on the abstract, it seems Alexander Bird’s Understanding the replication crisis as a base rate fallacy has it backwards. Is there reason to dig into the paper?

He notes a core feature of the crisis:

If most of the hypotheses under test are false, then there will be many false hypotheses that are apparently supported by the outcomes of well conducted experiments and null hypothesis significance tests with a type-I error rate (α) of 5%.

Then he says this solves the problem:

Failure to recognize this is to commit the fallacy of ignoring the base rate.

But it merely states the problem: Why most published research findings are false.

...much exaggerated.

Headlines about the death of theory are philosopher clickbait. Fortunately Laura Spinney’s article is more self-aware than the headline:


❝   But Anderson’s [2008] prediction of the end of theory looks to have been premature – or maybe his thesis was itself an oversimplification. There are several reasons why theory refuses to die, despite the successes of such theory-free prediction engines as Facebook and AlphaFold. All are illuminating, because they force us to ask: what’s the best way to acquire knowledge and where does science go from here?

(Note: Laura Spinney also wrote Pale Rider, a history of the 1918 flu.)

Big Data?

Forget Facebook for a moment. Image classification is the undisputed success of black-box AI: we don’t know how to write a program to recognize cats, but we can train a neural net on lots of picture and “automate the ineffable”.

But we’ve had theory-less ways to recognize cat images for millions of years. Heck, we have recorded images of cats from thousands of years ago. Automating the ineffable, in Kozyrkov’s lovely phrase, is unspeakably cool, but it has no bearing on the death of theory. It just lets machines do theory-free what we’ve been doing theory-free already.

Mis-understanding

The problem with black boxes is supposedly that we don’t understand what they’re doing. Hence DARPA’s “Third Wave” of Explainable AI. Kozyrkov thinks testing is better than explaining - after all we trust humans and they can’t explain what they’re doing.

I’m more with DARPA than Kozyrkov here: explainable is important because it tells us how to anticipate failure. We trust inexplicable humans because we basically understand their failure modes. We’re limited, but not fragile.

But theory doesn’t mean understanding anyway. That cat got out of the bag with quantum mechanics. Ahem.

Apparently the whole of quantum theory follows from startlingly simple assumptions about information. That makes for a fascinating new Argument from Design, with the twist that the universe was designed for non-humans, because humans neither grasp the theory nor the world it describes. Most of us don’t understand quantum. Well maybe Feynman, though even he suggested he might not really understand.

Though Feynman and others seem happy to be instrumentalist about theory. Maybe derivability is enough. It is a kind of understanding, and we might grant that to quantum.

But then why not grant it to black-box AI? Just because the final thing is a pile of linear algebra rather than a few differential equations?

Theory-free?

I think it was Wheeler or Penrose – one of those types anyway – who imagined we met clearly advanced aliens who also seemed to have answered most of our open mathematical questions.

And then imagined our disappointment when we discovered that their highly practical proofs amounted to using fast computers to show they held for all numbers tried so far. However large that bound was, we should be rightly disappointed by their lack of ambition and rigor.

Theory-free is science-free. A colleague (Richard de Rozario) opined that “theory-free science” is a category error. It confuses science with prediction, when science is also the framework where we test predictions, and the error-correction system for generating theories.

Back to the article

Three examples from the article:

Machines can predict better than professionals.

Certainly. Since the 1970s when Meehl showed that simple linear regressions could outpredict psychiatrists, clinicians, and other professionals. In later work he showed they could do that even if the parameters were random.

So beating these humans isn’t prediction trumping theory. It’s just showing disciplines with really bad theory.

Prospecting Gaps

I admire Tom Griffiths, and any work he does. He’s one of the top cognitive scientists around, and using neural nets to probe the gaps in prospect theory is clever; whether it yield epicycles or breakthroughs it should advance the field.

He’s right that more data means you can support more epicycles. But basic insights Wallace’s MML remain: if the sum of your theory + data is not smaller than the data, you don’t have an explanation.

Regularization

AlphaFold’s jumping-off point was the ability of human gamers to out-fold traditional models. The gamers intuitively discovered patterns – though they couldn’t fully articulate them. So this was just another case of automating the ineffable.

But the deep nets that do this are still fragile – they fail in surprising ways that humans don’t, and they are subject to bizarre hacks, because their ineffable theory just isn’t strong enough. Not yet anyway.

So we see that while half of success of Deep Nets is Moore’s law and Thank God for Gamers, the other half is tricks to regularize the model.

That is, to reduce its flexibility.

I daresay, to push it towards theory.

Model Testing, Mayo, Fisher, Bickel

Deborah Mayo has a new post on Model Testing and p-values vs. posteriors.

I haven’t read Bickel, but now want to. So thanks for the alert.

Fisher is surely right about facetiously adopting extreme priors – a “million to one” prior shouldn’t happen without some equivalent of a million experiences, and a “model check” on the priors makes sense if you want more than a hypothetical.

But surely, his paragraph about “not capable of finding expression in any calculation” is rhetorical fluff amounting you “you have the wrong model”. If priors are plucked from the air, they can be airily dismissed.

On this note I find his switch to parapsychology puzzling (follow her link to pp. 42-44). His Pleiades example shows that their unlikely clustering makes it hard to accept randomness despite the posterior odds still favoring randomness by 30:1. I think he is arguing that the sheer weight of evidence makes us doubt our prior assumption. And rightly so – how confident were we of that million:1 guesstimate anyway?

But his parapsychology example seems to make the opposite point. Here we (correctly) stick to our prior skepticism, and explain away the surprising results as fabrications, mistakes, etc. Here our prior is informed by many things including a century of failed parapsychology experiments – or more precisely a pattern of promising results falling apart under scrutiny. Like prospects for a new cancer drug, we expect it to let us down.

I like to think that “simple” Bayesian inference over all computable models wouldn’t suffer this problem, because all model classes are included. This being intractable, we typically calculate posteriors inside small convenient model families. If things look very wrong, hopefully we remember we may just have the wrong model. Though probably not before adding some epicycles.

What is a replication

In a recent Nature essay urging pre-registering replications, Brian Nosek and Tim Errington note:

Conducting a replication demands a theoretical commitment to the features that matter.

That draws on their paper What is a replication? and Nosek’s earlier UQ talk of the same name arguing that a replication is a test with “no prior reason to expect a different outcome.”

Importantly, it’s not about procedure. I wish I’d thought of that, because it’s obvious after it’s pointed out. Unless you are offering a case study, you should want your result to replicate when there are differences in procedure.

But psychology is a complex domain with weak theory. It’s hard to know what will matter. There is no prior expectation that the well-established Weber-Fechner law would fail among the Kalahri – but it would be interesting if it did. The well-established Müller-Lyer illusion does seem to fade in some cultures. That requires different explanations.

Back to the Nature essay:

What, then, constitutes a theoretical commitment? Here’s an idea from economists: a theoretical commitment is something you’re willing to bet on. If researchers are willing to bet on a replication with wide variation in experimental details, that indicates their confidence that a phenomenon is generalizable and robust. … If they cannot suggest any design that they would bet on, perhaps they don’t even believe that the original finding is replicable.

This has the added virtue of encouraging dialogue with the original authors rather than drive-by refutations. And by pre-registering, you both declare that before you saw the results, this seemed a reasonable test. Perhaps that will help you revise beliefs given the results, and suggest productive new tests.