Replication


❝   And that’s why mistakes had to be corrected. BASF fully recognized that Ostwald would be annoyed by criticism of his work. But they couldn’t tiptoe around it, because they were trying to make ammonia from water and air. If Ostwald’s work couldn’t help them do that, then they couldn’t get into the fertilizer and explosives business. They couldn’t make bread from air. And they couldn’t pay Ostwald royalties. If the work wasn’t right, it was useless to everyone, including Ostwald.

From Paul von Hippel, "[When does science self-correct?](https://goodscience.substack.com/p/when-does-science-self-correct-lessons)". And when not.

Misunderstanding the replication crisis

Based on the abstract, it seems Alexander Bird’s Understanding the replication crisis as a base rate fallacy has it backwards. Is there reason to dig into the paper?

He notes a core feature of the crisis:

If most of the hypotheses under test are false, then there will be many false hypotheses that are apparently supported by the outcomes of well conducted experiments and null hypothesis significance tests with a type-I error rate (α) of 5%.

Then he says this solves the problem:

Failure to recognize this is to commit the fallacy of ignoring the base rate.

But it merely states the problem: Why most published research findings are false.

To fix peer review, break it into stages - a short Nature opinion by Olavo Amaral.

  • Separate evaluation of rigor from curation of space.
  • Check the data first!
  • Statcheck can automate a lot of this (a minimal sketch of the idea follows this list).
  • Reallocate the 100M annual hours of peer review to standardization etc.
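
To make the statcheck idea concrete, here's a minimal Python sketch of its core consistency check - recompute the p-value from a reported test statistic and degrees of freedom, and flag mismatches. (The real statcheck is an R package that also parses the statistics straight out of manuscript text; the numbers below are invented.)

```python
# Sketch of a statcheck-style consistency check (illustrative only).
from scipy import stats

def check_t_test(t: float, df: int, reported_p: float, tol: float = 0.01):
    """Recompute a two-sided p-value from t and df, compare to the reported p."""
    recomputed_p = 2 * stats.t.sf(abs(t), df)
    consistent = abs(recomputed_p - reported_p) <= tol
    return recomputed_p, consistent

# Invented example: a paper reports "t(28) = 2.20, p = .04"
p, ok = check_t_test(t=2.20, df=28, reported_p=0.04)
print(f"recomputed p = {p:.3f}; consistent with report: {ok}")
```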

Predicting replicability: scientists are 73% accurate

Congratulations to Michael Gordon et al. for their paper in PLoS ONE!

This paper combines the results of four previous replication market studies. Data & scripts are in the pooledmaRket R package.

Key points:

  • Combined, the four studies cover 103 replications in the behavioral & social sciences.
  • Markets were 73% accurate, surveys a bit less.
  • p-values predict original findings, though not at the frequencies you’d expect.

Enough summarizing - it’s open access, go read it! 😀🔬📄

~ ~ ~

Coda

We used this knowledge in the Replication Markets project, but it took a while to get into print, as these things do.

It should be possible to get 80-90% accuracy:

  • These were one-off markets - no feedback and no learning!
  • A simple p-value model does nearly as well, with different predictions.
  • Simple NLP models on the PDF of the paper do nearly as well, with different predictions.

Replication Markets probably did worse ☹️, but another team may have done better. TBD.

What is a replication

In a recent Nature essay urging pre-registering replications, Brian Nosek and Tim Errington note:

Conducting a replication demands a theoretical commitment to the features that matter.

That draws on their paper What is replication? and Nosek’s earlier UQ talk of the same name arguing that a replication is a test with “no prior reason to expect a different outcome.”

Importantly, it’s not about procedure. I wish I’d thought of that, because it’s obvious after it’s pointed out. Unless you are offering a case study, you should want your result to replicate when there are differences in procedure.

But psychology is a complex domain with weak theory. It’s hard to know what will matter. There is no prior expectation that the well-established Weber-Fechner law would fail in the Kalahari – but it would be interesting if it did. The well-established Müller-Lyer illusion does seem to fade in some cultures. That requires different explanations.

Back to the Nature essay:

What, then, constitutes a theoretical commitment? Here’s an idea from economists: a theoretical commitment is something you’re willing to bet on. If researchers are willing to bet on a replication with wide variation in experimental details, that indicates their confidence that a phenomenon is generalizable and robust. … If they cannot suggest any design that they would bet on, perhaps they don’t even believe that the original finding is replicable.

This has the added virtue of encouraging dialogue with the original authors rather than drive-by refutations. And by pre-registering, you both declare that before you saw the results, this seemed a reasonable test. Perhaps that will help you revise beliefs given the results, and suggest productive new tests.

Rapid Reviews: COVID-19

The announcement of RR:C19 seems a critical step forward, similar to Hopkins' Novel Coronavirus Research Compendium. Both are mentioned in Begley’s article.

So… would it help to add prediction markets on replication, publication, and citations?

Unsurprisingly, popular media is more popular:

A new study finds that “peer-reviewed scientific publications receiving more attention in non-scientific media are more likely to be cited than scientific publications receiving less popular media attention.”

Pro-active Replication

Science Is Hard

Nature Communications has an inspiring piece about a DARPA project baking replication into the original study. The DARPA Biological Control program applies independent validation & verification (IV&V) to synthetic biology. The article is a Lessons Learned piece.

Although DARPA oversight presumably mitigated the element of competition, the IV&V teams had at least as much trouble as some historical cases discussed below. It’s worth reading their Hard lessons table. Here’s one:

We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.

And this lesson, “pick a person,” seems to apply to any discipline:

The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence.

They also report how much they needed personal interaction - a result familiar to those of us in Science Studies (more later).

A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results.

But I was still surprised by how much.

A typical academic lab trying to reproduce another lab’s results would probably limit itself to a month or so and perhaps three or four permutations before giving up. Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research.

To be fair, this is biology, a proverbially cantankerous field. But the canonical Science Studies reference on the difficulty of replication is laser physics.

Before I explore that, pause to appreciate the importance of this DARPA work: (1) The value of baked-in replication for really understanding the original result, and (2) the real difficulty in achieving it. I encourage you to read the NC piece and imagine how this could be practiced at funding levels below that of DARPA programs.

Echoes of the Past

In Science Studies, replication troubles evoke the “experimenters' regress”. The canonical reference is Collins’s 1974 paper on laser physics (or really, his 1985 book):

to date, no-one to whom I have spoken has succeeded in building a TEA laser using written sources (including preprints and internal reports) as the sole source of information, though several unsuccessful attempts have been made, and there is now a considerable literature on the subject. …. The laboratories studied here … actually learned to build working models of TEA lasers by contact with a source laboratory either by personal visits and telephone calls or by transfer of personnel.

Shortly thereafter, he notes that the many failures were:

simply because the parameters of the device were not understood by the source laboratories themselves. … For instance, a spokesman at Origin reports that it was only previous experience that enabled him to see that the success of a laser built by another laboratory depended on the inductance of their transformer, at that time thought to be a quite insignificant element.

This is of course echoed in the new NC piece about the DARPA program.

Collins expands on this and other episodes in his (1985), attempting to make sense of (then nascent) attempts to detect gravitational waves. As Turner (2014) summarizes:

The point was to show that replication was not and could not be understood as a mechanical process…

So the crisis isn’t merely that it’s hard to replicate from publications - any more than it’s a crisis that it’s so hard to learn to ride a bicycle by studying a manual. And no doubt many failed replications are failures of technique, caused by researchers foolishly attempting replication without contacting the original authors. The crisis is that we have many reasons to expect half the original results were indeed spurious. One of the goals of Replication Markets and the larger DARPA SCORE program is to help sort out which.

Back to the Future

I’ve fallen behind in the literature. I see that Collins has a 2016 chapter on the modern situation: “Reproducibility of Experiments: Experimenters' Regress, Statistical Uncertainty Principle, and the Replication Imperative.” I look forward to reading it.

And of course this brings us to Nosek and Errington’s preprint, “What is replication?”, where they argue that replication itself is an “exciting, generative, vital contributor to research progress”.

But now, as Gru says, back to work.


Notes

Collins, Harry M. 1974. “The TEA Set: Tacit Knowledge and Scientific Networks.” Science Studies 4: 165-86. (Online here.)

Collins, Harry M. 1985. Changing Order: Replication and Induction in Scientific Practice.

Nosek, Brian, and Tim Errington. 2019-2020. “What is replication?” MetaArXiv Preprints.

Turner, Stephen P. 2014. Understanding the Tacit. (p. 96)

Aschwanden's teaser subtitle: wisdom in the scientific crowd

Science writer Christie Aschwanden (@cragcrest) just published a nice summary of the October paper “Crowdsourcing Hypothesis Tests” by Landy et al.

Well, half the paper.

I like the WIRED piece. It covers the main result: if you give a dozen labs a novel hypothesis to test, they devise about a dozen viable ways to test it, with variable results. So what?

As Anna Dreber, an economist at the Stockholm School of Economics and an author on the project, put it: “We researchers need to be way more careful now in how we say, ‘I’ve tested the hypothesis.’ You need to say, ‘I’ve tested it in this very specific way.’ Whether it generalizes to other settings is up to more research to show.”

We often neglect this ladder of specificity from theory to experiment and the inevitable rigor-relevance tradeoff, so measuring the effect puts some humility back into our always too-narrow error bars.

But Aschwanden’s subtitle hints at more:

Just how much wisdom is there in the scientific crowd?

Perhaps some paragraphs were cut, because in the second part of the paper, Landy et al. asked a crowd to forecast the outcomes of each variant. In his prize-winning Metascience poster, Domenico Viganola (@dviganola) summarized the result:

[Figure: section of Domenico’s poster, showing correlations between forecasts and results for each hypothesis.]

The scientific crowd was reasonably sensitive to the effect of the design variations from the different labs.

scientists’ predictions are positively correlated with the realized outcomes, both in terms of effect sizes and in terms of whether the result is statistically significant or not for the different sets of study materials.

And,

scientists were able to predict not only which hypotheses would receive empirical support (see Figure 3a) but also variability in results for the same hypothesis based on the design choices made by different research teams (see Figure 3b).

I think that’s pretty cool - and suggests there is reasonable wisdom in the scientific crowd.

But not quite enough wisdom to make this study unnecessary.


Note: Dreber and Viganola are key members of the @ReplicationMkts team.

Horse betting and @ReplicationMkts

A friend sent a May 2018 Bloomberg article by Kit Chellel, “The gambler who cracked the horse-racing code.” It’s a good story of forecasting real-world events and of the interplay between statistical models and expert intuition in tuning them, and it contains useful guidance for @ReplicationMkts forecasters.

I was struck by this:

A breakthrough came when Benter hit on the idea of incorporating a data set hiding in plain sight: the Jockey Club’s publicly available betting odds.

That’s a surprising oversight. In his defense, 1990 was way before Wisdom of the Crowd, the Iowa Electronic Markets were just getting started, and Robin Hanson was only beginning to write about prediction markets. On the other hand, sports betting had been around for a long time. Anyway, the paragraph continues:

Building his own set of odds from scratch had been profitable, but he found that using the public odds as a starting point and refining them with his proprietary algorithm was dramatically more profitable. He considered the move his single most important innovation, and in the 1990-91 season, he said, he won about $3 million.

Well, of course. Those odds contain all the work already done by others.

Use Prior Wisdom

There are at least three easy sources of prior wisdom in Replication Markets:

  • The starting price is set from the original study’s p-value, using a tiny (two-step) decision tree (sketched after this list). This simple model has been 60-70% accurate.
  • The replication rate of the field, from previous replication studies or as forecast in Round 0.
  • The current market value. Markets are not surveys. If you arrive late to a well-traded market, pause. Why is it where it is? Is that value the result of a gradual negotiation of opposing views, or is it a single wild excursion?
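
For illustration, here's a tiny Python sketch in the spirit of that two-step decision tree. The cut-points and probabilities are hypothetical placeholders, not the actual Replication Markets starting prices:

```python
# Hypothetical two-step p-value prior for replication (placeholder numbers).
def starting_price(p_value: float) -> float:
    """Map an original study's p-value to a prior probability of replication."""
    if p_value < 0.001:    # very strong evidence in the original study
        return 0.8
    elif p_value < 0.01:   # moderately strong evidence
        return 0.6
    else:                  # 0.01 <= p < 0.05: fragile result
        return 0.4

for p in (0.0004, 0.004, 0.04):
    print(f"p = {p}: start at {starting_price(p):.0%}")
```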

With attention you can do better than these - as Benter did in horse betting - but don’t ignore them. Here are two stories on correcting the market:

  1. Yes, our markets are noisy. In SciCast, “faber” made much of his money with a simple bot that half-reversed wild trades by new accounts. So do this when it seems right - but note that faber’s first bot lost thousands of points to a savvy new opponent before it gained backoff rules.

  2. DAGGRE’s top forecaster regularly updated a spreadsheet of his best estimates. But his trading rule was to correct the market only halfway to his best estimate, and wait. The market might know something he doesn’t. (Also, market judo: trading +3% 10x is cheaper than changing +30% 1x.)
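
The second rule is simple enough to write down. A minimal sketch of the “correct halfway, then wait” heuristic - nothing official, just the arithmetic:

```python
# "Correct halfway, then wait": trade only to the midpoint between the
# current market probability and your own best estimate.
def halfway_target(market_prob: float, my_estimate: float) -> float:
    return market_prob + 0.5 * (my_estimate - market_prob)

print(halfway_target(0.30, 0.60))  # market at 30%, you believe 60%: trade to 45%, then wait
```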

Kelly Rule

The article also mentions that Benter used the Kelly rule to stay ahead of gambler’s ruin. Kelly’s rule was an important early result in, and unorthodox application of, Shannon’s Information Theory.

The rule is based on the insight that a market - or set of markets - is a source of repeated returns, and you want to maximize your lifetime earnings. Therefore you should be as aggressive as you can while ensuring you never go broke. In theory, a Kelly bettor never goes broke - though in practice (a) full Kelly is more aggressive than most can stomach, so practitioners usually add a cushion, and (b) minimum bets or fees mean the guarantee becomes more of a tendency. But Kelly investors tend to do well, if they really do have an edge over the current market value.
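
For the single-bet case the Kelly fraction has a simple closed form. Here is a Python sketch for a binary contract priced at q, given your own probability estimate p - the textbook formula, not Benter’s model and not what any particular market implements:

```python
# Classic single-bet Kelly fraction for a binary contract (textbook formula).
def kelly_fraction(p: float, q: float) -> float:
    """Fraction of bankroll to stake on a contract priced at q (pays 1 if the
    event happens), given your probability estimate p. Zero if you have no edge."""
    if p <= q:
        return 0.0
    return (p - q) / (1 - q)   # equivalent to p - (1 - p)/b with odds b = (1 - q)/q

# Example: you think the claim replicates with probability 0.70; market price is 0.50.
print(f"Kelly stake: {kelly_fraction(0.70, 0.50):.0%} of bankroll")  # 40%; most bettors stake a fraction of this
```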

Also, where Kelly solved the case for a single bet, in SciCast we created a simplified Kelly algorithm for combinatorial bets.

Many can win

Horse betting is a zero-sum game, like poker:

There was no law against what they were doing, but in a parimutuel gambling system, every dollar they won was a dollar lost by someone else

That’s not the case with Replication Markets. In markets based on LMSR (the logarithmic market scoring rule), if ten people trade on a claim and all raise the chance of the correct outcome, then they all win. The points come from the market maker - us.
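
To make that concrete, here's a toy Python sketch of an LMSR market maker: two traders each push the price toward the eventually correct outcome, both profit, and the shortfall is the market maker's subsidy. (The liquidity parameter and trade sizes are arbitrary.)

```python
# Toy LMSR (logarithmic market scoring rule) market maker for a yes/no claim.
import math

B = 100.0  # liquidity parameter: sets price sensitivity and the maker's maximum subsidy scale

def cost(q_yes: float, q_no: float) -> float:
    """LMSR cost function C(q) = B * ln(e^(q_yes/B) + e^(q_no/B))."""
    return B * math.log(math.exp(q_yes / B) + math.exp(q_no / B))

def price_yes(q_yes: float, q_no: float) -> float:
    """Current market probability of YES."""
    e = math.exp(q_yes / B)
    return e / (e + math.exp(q_no / B))

# Two traders each buy 50 YES shares; the study later replicates (YES pays 1 per share).
q_yes = q_no = 0.0
maker_revenue = 0.0
for trader in ("alice", "bob"):
    charge = cost(q_yes + 50, q_no) - cost(q_yes, q_no)
    q_yes += 50
    maker_revenue += charge
    print(f"{trader} pays {charge:.1f} for 50 YES; profit at resolution {50 - charge:.1f}; "
          f"price now {price_yes(q_yes, q_no):.2f}")

print(f"market maker's subsidy: {2 * 50 - maker_revenue:.1f} points")
```

Both traders end up ahead, and the gap between their payouts and what they paid is exactly the maker’s subsidy - which is why everyone who moves the price toward the truth can win.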

We are also a play-money market - with prizes! - and we don’t charge transaction fees. Thanks to DARPA, which thinks it worthwhile to fund good estimates of the replicability of social science.