Aschwanden's teaser subtitle: wisdom in the scientific crowd

Science writer Christie Aschwanden (@cragcrest) just published a nice WIRED summary of the October paper “Crowdsourcing Hypothesis Tests” by Landy et al.

Well, half the paper.

I like the WIRED piece. It covers the main result: if you give a dozen labs a novel hypothesis to test, they devise about a dozen viable ways to test it, with variable results. So what?

As Anna Dreber, an economist at the Stockholm School of Economics and an author on the project, put it: “We researchers need to be way more careful now in how we say, ‘I’ve tested the hypothesis.’ You need to say, ‘I’ve tested it in this very specific way.’ Whether it generalizes to other settings is up to more research to show.”

We often neglect this ladder of specificity from theory to experiment and the inevitable rigor-relevance tradeoff, so measuring the effect puts some humility back into our always too-narrow error bars.

But Aschwanden’s subtitle hints at more:

Just how much wisdom is there in the scientific crowd?

Perhaps some paragraphs were cut, because in the second part of the paper, Landy et al. asked a crowd to forecast the outcomes of each variant. In his prize-winning Metascience poster, Domenico Viganola (@dviganola) summarized the result:

[Image: section of Domenico’s poster, showing correlations between forecasts and results for each hypothesis.]

The scientific crowd was reasonably sensitive to the effect of the design variations from the different labs.

scientists’ predictions are positively correlated with the realized outcomes, both in terms of effect sizes and in terms of whether the result is statistically significant or not for the different sets of study materials.

And,

scientists were able to predict not only which hypotheses would receive empirical support (see Figure 3a) but also variability in results for the same hypothesis based on the design choices made by different research teams (see Figure 3b).

I think that’s pretty cool - and suggests there is reasonable wisdom in the scientific crowd.

But not quite enough wisdom to make this study unnecessary.


Note: Dreber and Viganola are key members of the @ReplicationMkts team.

Horse betting and @ReplicationMkts

A friend sent a May 2018 Bloomberg article by Kit Chellel, “The gambler who cracked the horse-racing code.” It’s a good story about forecasting real-world events and the interplay of statistical models with the intuition and expertise needed to tune them, and it contains useful guidance for @ReplicationMkts forecasters.

I was struck by this:

A breakthrough came when Benter hit on the idea of incorporating a data set hiding in plain sight: the Jockey Club’s publicly available betting odds.

That’s a surprising oversight. In his defense, 1990 was way before “The Wisdom of Crowds,” the Iowa Electronic Markets were just getting started, and Robin Hanson was only beginning to write about prediction markets. On the other hand, sports betting had been around for a long time. Anyway, the paragraph continues:

Building his own set of odds from scratch had been profitable, but he found that using the public odds as a starting point and refining them with his proprietary algorithm was dramatically more profitable. He considered the move his single most important innovation, and in the 1990-91 season, he said, he won about $3 million.

Well, of course. Those odds contain all the work already done by others.

Use Prior Wisdom

There are at least three easy sources of prior wisdom in Replication Markets:

  • The starting price is set from the original study’s p-value, using a tiny (two-step) decision tree; a toy sketch follows this list. This simple model has been 60-70% accurate.
  • The replication rate of the field, from previous replication studies or as forecast in Round 0.
  • The current market value. Markets are not surveys. If you arrive late to a well-traded market, pause. Why is it where it is? Is that value the result of a gradual negotiation of opposing views, or is it a single wild excursion?
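
For intuition, here is a toy version of that two-step starting-price rule in Python. The thresholds and prices are placeholders I chose to show the shape of the tree; they are not the actual Replication Markets settings.

```python
def starting_price(p_value):
    """Toy two-step decision tree: map an original study's p-value to a
    starting probability of successful replication.  Thresholds and prices
    are illustrative placeholders, not the real Replication Markets values."""
    if p_value < 0.001:      # step 1: very strong original evidence
        return 0.80
    elif p_value < 0.01:     # step 2: moderate evidence
        return 0.55
    else:                    # marginal result (0.01 <= p < 0.05)
        return 0.35

print(starting_price(0.0004))  # 0.80
print(starting_price(0.03))    # 0.35
```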

With attention you can do better than these - as Benter did in horse betting - but don’t ignore them. Here are two stories on correcting the market:

  1. Yes, our markets are noisy. In SciCast, “faber” made much of his money with a simple bot that half-reversed wild trades by new accounts. So do this when it seems right - but note that faber’s first bot lost thousands of points to a savvy new opponent before it gained backoff rules.

  2. DAGGRE’s top forecaster regularly updated a spreadsheet of his best estimates. But his trading rule was to correct the market only halfway to his best estimate, and wait. The market might know something he doesn’t. (Also, market judo: trading +3% 10x is cheaper than changing +30% 1x.) Both “move halfway” heuristics are sketched below.
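
A minimal sketch of those two “move halfway” heuristics, with made-up numbers; the real SciCast bot and DAGGRE forecaster had more guardrails (backoff rules, waiting between trades):

```python
def half_reverse(pre_trade_price, post_trade_price):
    """faber-style bot: pull a suspiciously wild trade halfway back toward
    where the market stood before it."""
    return post_trade_price - 0.5 * (post_trade_price - pre_trade_price)

def halfway_correction(market_price, my_estimate):
    """DAGGRE-style caution: move the market only halfway toward your own
    estimate, then wait - the market might know something you don't."""
    return market_price + 0.5 * (my_estimate - market_price)

print(half_reverse(0.40, 0.90))        # ~0.65: partially undo a 50-point jump
print(halfway_correction(0.30, 0.70))  # ~0.50: trade up, but not all the way
```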

Kelly Rule

The article also mentions that Benter used the Kelly rule to stay ahead of gambler’s ruin. Kelly’s rule was an important early result in, and unorthodox application of, Shannon’s Information Theory.

The rule is based on the insight that a market - or set of markets - is a source of repeated returns, and you want to maximize your lifetime earnings. Therefore you should be as aggressive as you can while ensuring you never go broke. In theory, a Kelly bettor never goes broke - though in practice (a) the rule is still more aggressive than most can stomach, so most practitioners add a cushion, and (b) there are usually minimum bets or fees, so the guarantee becomes more of a tendency. But Kelly investors tend to do well, if they really do have an edge over the current market value.
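
For the simple binary case (not Benter’s full parimutuel setup), the Kelly stake has a closed form: bet the fraction f* = (bp - q)/b of your bankroll, where p is your win probability, q = 1 - p, and b is the net odds. A small Python illustration, with the common half-Kelly cushion:

```python
def kelly_fraction(p, b):
    """Full-Kelly stake, as a fraction of bankroll, for a single binary bet.
    p: your probability that the bet wins
    b: net fractional odds (you win b for every 1 staked)"""
    return (b * p - (1.0 - p)) / b

# Example: you give a claim a 60% chance and the market offers even odds.
p, b = 0.60, 1.0
full_kelly = kelly_fraction(p, b)    # 0.2 -> stake 20% of bankroll
half_kelly = 0.5 * full_kelly        # a common real-world cushion
print(round(full_kelly, 3), round(half_kelly, 3))  # 0.2 0.1
```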

Also, where Kelly solved the case for a single bet, in SciCast, we created a simplified Kelly algorithm for combinatorial bets.

Many can win

Horse betting is a zero-sum game, like poker:

There was no law against what they were doing, but in a parimutuel gambling system, every dollar they won was a dollar lost by someone else.

That’s not the case with Replication Markets. In LMSR-based markets, if ten people trade on a claim, and all raise the chance of the correct outcome, then they all win. The points come from the market maker - us.
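
To see why, here is a small LMSR simulation in Python; the liquidity parameter and trade sizes are arbitrary numbers for the example, not Replication Markets’ actual settings:

```python
import math

def lmsr_cost(q_yes, q_no, b=100.0):
    """LMSR cost function: C(q) = b * ln(exp(q_yes/b) + exp(q_no/b))."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

b = 100.0
q_yes = q_no = 0.0          # the market maker opens the claim at 50/50
ledger = []                 # (shares bought, points paid) per trader

# Ten traders each buy 10 YES shares, nudging the price upward in turn.
for _ in range(10):
    shares = 10.0
    cost = lmsr_cost(q_yes + shares, q_no, b) - lmsr_cost(q_yes, q_no, b)
    q_yes += shares
    ledger.append((shares, cost))

# Suppose the claim replicates (YES): each YES share pays 1 point.
profits = [shares - cost for shares, cost in ledger]
maker_payout = q_yes                            # 100 points paid out
maker_income = sum(cost for _, cost in ledger)  # ~62 points collected

print(all(p > 0 for p in profits))            # True: every trader wins
print(round(maker_payout - maker_income, 1))  # ~38.0: the maker's subsidy
print(round(sum(profits), 1))                 # equals the traders' total profit
```

Every trader who moved the price toward the right answer ends up in the black, and their collective profit is exactly the market maker’s subsidy.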

We are also a play-money market - with prizes! - and we don’t charge transaction fees. Thanks to DARPA, who thinks it worthwhile to fund good estimates of the replicability of social science.

Greenland on Cognition, Causality, and Statistics

Sander Greenland argues that all the mistakes made with P-values will be reproduced in any other method, because they are fundamentally cognitive errors driven by bias and incentives. We need to pay attention to the cognition of statistics, emphasize the causal story, and use plain language properly. As one example, he proposes reporting S-values instead of P-values.
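
The S-value (surprisal) is just a log transform of the P-value, s = -log2(p): the bits of information against the test model, or equivalently how many heads in a row from a fair coin would be equally surprising. A quick illustration:

```python
import math

def s_value(p):
    """Greenland's S-value (surprisal): s = -log2(p), the bits of
    information against the test model carried by a P-value."""
    return -math.log2(p)

for p in (0.50, 0.05, 0.005):
    s = s_value(p)
    print(f"p = {p:<5} -> S = {s:4.1f} bits (~{round(s)} heads in a row)")
```

Same information, different cognitive impact - which is Greenland’s point: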

Logically equivalent is NOT cognitively equivalent.

NISS Webinar: www.niss.org/events/di…

Happy to hear “diversity of thought” called out early in today’s “Inclusive Leaders” training at Jacobs. Is your team all from the same university? Are you agreeing too much? You are likely missing key pieces of the picture. #EnsembleLearning #DiverseTeams