For COVID forecasting, remember the superforecasters at Good Judgment. Currently placing US deaths by March at 200K-1.1M, with 3:2 odds of exceeding 350K, up from 1:1 on July 11.

“Foolish demon, it did not have to be so.” But Taraka was no more. ~R. Zelazny

Alan Jacobs with a cautionary tale about assuming news is representative of reality, and remembering to sanity-check our answers. blog.ayjay.org/proportio…

Protests and COVID

Worried about #COVID, I did not join the #BLM protests. Even with outdoors + masks, marches bunch up, & there are only so many restrooms. It’s been an open question what effect they had. NCRC has reviewed a 1-JUN NBER study: at the county level, seems no effect. It can’t address individual risk.

Good news: despite the case rise, excess deaths have been dropping, nearly back to 100% after a high of 142%. Bad news: @epiellie thinks it’s just lag: earlier testing ➛ more lead time. Cases rose 3-4 wks ago, ICUs 2-3 wks ago, deaths rise in the latest period. Q: why do ensemble models expect a steady death rate?

I saw my old and much-loved Monash colleague #ChrisWallace https://en.wikipedia.org/wiki/Chris_Wallace_(computer_scientist) trending on Twitter. Alas, it turns out it’s just some reporter with a 5-second clip.

How about #WallaceTreeMultiplier, #MML, #ArrowOfTime, #SILIAC?

Open access is good, unless you're a journal?

Bob Horn sent me this news in Nature

NEWS, 16 JULY 2020: “Open-access Plan S to allow publishing in any journal. Funders will override policies of subscription journals that don’t let scientists share accepted manuscripts under open licence.” ~Richard Van Noorden, Nature

This seems good news, unless you’re a journal.

I expect journals to do good quality control, and top journals to do top quality control. At minimum, good review and gatekeeping (they are failing here, but assume that gets fixed separately). But also production: most scientists can neither write nor draw, and I want journals to minimize typos and maximize production quality. If I want to struggle with scrawl, I’ll go to preprints: it’s fair game there.

So, if you (the journal) can’t charge me for access, and I still expect high quality, you need to charge up front. The obvious candidates are the authors and funders. The going rate right now seems to be around $2000 per article, which is a non-starter for authors. Authors of course want to fix this by getting the funders to pay, but that money comes from somewhere.

Challenge: How to get up-front costs below $500 per article?

Here’s some uninformed back-of-the-envelope math suggesting that will be hard.

Editors. Even Rowling needs editors.

  • Assume paper subscriptions pay for themselves and peer review is free.
  • For simplicity, assume we’re paying one editor $50K to do all the key work.
  • Guess: they take at least 5 hours per 10-page paper on correspondence, editing, typesetting, and production. Double for benefits etc. That’s $250 per paper.

Looking good so far!

Webslingers: someone has to refill the bit buckets.

  • Suppose webslinger + servers is $64K/year. Magically including benefits.
  • The average journal has 64 articles in a year.
  • Uh-oh: that’s $1000 right there.

So… seems one webslinger needs to be able to manage about 10 journal websites. Is that doable? How well do the big publishers scale? Do they get super efficient, or fall prey to Parkinson’s law?
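
To make the arithmetic concrete, here’s a tiny sketch using only the guesses above; none of these numbers are real data, and the “journals per webslinger” knob is the open question.

```python
# Back-of-the-envelope cost per article, using only the guesses above (not real data).

EDITOR_SALARY = 50_000        # one editor doing all the key work
OVERHEAD = 2.0                # "double for benefits etc."
HOURS_PER_YEAR = 2_000        # a standard full-time year
HOURS_PER_PAPER = 5           # correspondence, editing, typesetting, production

WEB_COST = 64_000             # webslinger + servers, benefits magically included
ARTICLES_PER_JOURNAL = 64     # assumed average

def per_article_cost(journals_per_webslinger: int) -> float:
    """Editor cost per paper, plus web cost spread over one webslinger's journals."""
    editor_rate = EDITOR_SALARY * OVERHEAD / HOURS_PER_YEAR        # ~$50/hour
    editor_cost = editor_rate * HOURS_PER_PAPER                    # ~$250/paper
    web_cost = WEB_COST / (journals_per_webslinger * ARTICLES_PER_JOURNAL)
    return editor_cost + web_cost

for n in (1, 4, 10):
    print(f"{n:2d} journals per webslinger: ${per_article_cost(n):,.0f} per article")
# 1 -> $1,250;  4 -> $500;  10 -> $350
```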

Alternative: societies / funders have to subsidize the journals as necessary road infrastructure. That might amount to half the costs. How much before they effectively insulate the new journals from accountability to quality control… again?

AI Bias

Dr. Rachel Thomas writes,

When we think about AI, we need to think about complicated real-world systems [because] decision-making happens within complicated real-world systems.

For example, bail/bond algorithms live in this world:

for public defenders to meet with defendants at Rikers Island, where many pre-trial detainees in NYC who can’t afford bail are held, involves a bus ride that is two hours each way and they then only get 30 minutes to see the defendant, assuming the guards are on time (which is not always the case)

Designed or not, this system leads innocents to plead guilty so they can leave jail faster than if they waited for a trial. I hope that was not the point.

Happily I don’t work on bail/bond algorithms, but one decision tree is much like another. “We do things right” means I need to ask more about decision context. We know decision theory - our customers don’t. Decisions should weigh the costs of false positives against false negatives. It’s tempting to hand customers the optimized ROC curve and make the threshold choice Someone Else’s Problem. But Someone Else often accepts the default.
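
For concreteness, here’s a minimal sketch of what weighing those costs could look like: pick the alert threshold that minimizes expected cost, given customer-supplied prices for false alarms and misses. The function, the toy scores, and the costs are illustrative assumptions, not any real system.

```python
import numpy as np

def pick_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0):
    """Pick the alert threshold that minimizes expected cost on labeled data.

    scores: model scores (higher = more suspicious); labels: 1 = real incident, 0 = benign.
    cost_fp / cost_fn: the customer's cost for a false alarm vs. a miss.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        alerts = scores >= t
        fp = np.sum(alerts & (labels == 0))   # false alarms
        fn = np.sum(~alerts & (labels == 1))  # misses
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy usage: ten scored events, four real incidents; a miss costs 10x a false alarm.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
labels = [0,    0,    0,    0,    0,    1,    0,    1,    1,    1]
print(pick_threshold(scores, labels, cost_fp=1, cost_fn=10))
```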

False positives abound in cyber-security. The lesser evil is being ignored like some nervous “check engine” light. The greater is being too easily believed. We can detect anomalies - but usually the customer has to investigate.

We can help by providing context. Does the system report confidence? Does it simply say “I don’t know” when appropriate? Do we know the relative costs of misses and false alarms? Can the customer adjust those for the situation?

Rapid Reviews: COVID-19

The announcement of RR:C19 seems a critical step forward. Similar to Hopkins' Novel Coronavirus Research Compendium. Both are mentioned in Begley’s article.

So… would it help to add prediction markets on replication, publication, citations?

Unsurprisingly, popular media is more popular:

A new study finds that “peer-reviewed scientific publications receiving more attention in non-scientific media are more likely to be cited than scientific publications receiving less popular media attention.”

PDF: Systematically, the NYT COVID-19 counts run high, & CovidTracker’s run low. Two others fall in between. The relative differences are small, though the absolute ones (1000s) are notable. Preprint, so needs a check.

www.medrxiv.org/content/1…

From @Broniatowski: Postdoc with the MOSI project at the Institute for Data Democracy & Politics at GWU. Work on “detecting, tracking, and correcting disinformation/misinformation.” Could start this summer.

www.gwu.jobs/postings/…

Old tabs: found a nice recommendation from one of our @replicationmarkets forecasters, on their beautifully designed website, “Follow the Argument.”

followtheargument.org/sunday-01…

A solid Rodney Brooks essay on the value and flaws of peer review, from the inside. No solution, but worth reading for context and detail. HTT Bob Horn. rodneybrooks.com/peer-revi…

I’m not accustomed to saying anything with certainty after only one or two observations. ~Andreas Vesalius, The China Root Epistle, 1546.

PDF: US intervention timing simulations using county-level & mobility data: “had these same control measures been … 1-2 weeks earlier,” 55% of reported deaths could have been avoided. Strong claim. Haven’t looked, but counterfactuals are hard.

PDF: Simulation: age separation reduced C19 mortality even fixing interactions. “Separating just the older group… reduces the overall mortality… by 66%.” But “age separation is difficult.” ¿Is Erdős-Rényi an OK model here?

  • A new short PDF, Heparin and C19, reviews ~2K Spanish C19 patients. Heparin halved mortality after adjusting for age and gender; same or better after adding severity and other drugs. Needs randomized followup, but I guess ~80% likely to reproduce.

A few COVID-19 PDFs

A non-random sample of new PDFs whose titles caught my eye yesterday. Based on a quick scan, so I may have missed something. (These are brand new PDFs - so use even more doubt than for well-published papers.)

  • Mortality rate and estimate of fraction of undiagnosed… cases… Simple amateur model estimates the mortality curve for US March & April diagnoses, using data from worldometers, and a Gaussian[!?] model. March peak death rate is on the 13th day after diagnosis, similar to Chinese data. Total mortality was 21%[!!?] suggesting severe under-testing [or data problems]. Whatever, the same method applied to cases after 1-APR finds 6.4%, suggesting more testing. If the real rate is 2.4% [!?], then 89% of March cases were untested and 63% of April cases. [The 2.4% IFR seems ~4x too high by other estimates. They got it by averaging rates from China, the Diamond Princess, Iceland, and Thailand. It’s not clear if they weighted those. First author cites themselves in a basically unrelated physics paper. But then, I’m not an epidemiologist either.]

  • [Estimation of the true infection fatality rate… in each country](https://www.medrxiv.org/content/10.1101/2020.05.13.20101071v2?%253fcollection=) Short paper adjusting estimates because a low PCR exam rate means exams are restricted to suspicious cases, and vice versa. “Reported IRs [infection rates] in USA using antibody tests were 1.5% in Santa Clara, 4.6% in Los Angeles county, and 13.9% in New York state, and our estimate of TIR [true infection rate] in the whole USA is 5.0%.” Estimates US IFR [infection fatality rate] as 0.6% [95% CI: 0.33 - 1.07], slightly higher than Germany and Japan’s ~0.5%, a bit lower than Sweden’s 0.7%, and much lower than Italy’s or the UK’s, around 1.6%. [This is similar to the running Metaculus estimate, and note the 2.4% above is way outside the interval.]

  • Estimation of the infection fatality rate… More “simple methodology,” this time claiming to come from science & epi faculty in Mexico. They assume all members of a household will be infected together, which is plausible. But they don’t really dig into household data or case data; they just explore the method on “available data”. Eh.

  • Reproductive number of COVID-19: A systematic review and meta-analysis…: Global average of R=2.7 [95% CI 2.2-3.3] with lots of variation. They note that’s above WHO but lower than another summary. Among the 27 studies in their analysis, only two [both South Korea] publish estimates <1. Take with salt. By non-epidemiologists. They include the Diamond Princess [R>14, albeit with large error bars]. And they claim method [MLE vs SEIR vs MCMC vs…] matters, but they have so few samples per category and so many comparisons that I don’t think it means anything.

  • Restarting after… Comparing policies and infections in the 16 states of Germany, the UC-Davis authors find that contact restrictions were more effective than border closures. Mobility data comes from Google. Using SEIR models, they then predict the effect of ways to relax policy. They think social distancing (German-style) reduced case counts by about 97% – equivalently, the total case count would have been about 38X higher. Contact restrictions were estimated to be about 50% effective, versus about 2% for border closures. [What you’d expect for closing the gate after the Trojan horse is inside.] They put a lot of effort into modeling different parts of the transportation system: cars vs. trucks; public transit. They think that, compared to keeping restrictions, lifting contact restrictions will cause a 51% or a 27% increase depending on the scenario; relaxing initial business closures, a 29% or 16% increase; and relaxing non-essential closures, a 7% or 4% increase.

  • Relationship between Average Daily Temperature and… (v2). The authors say warmer is better, but note it could be UV or a hidden correlation with e.g. age, pollution, or social distancing. And “when no significant differences exist in the average daily temperature of two cities in the same country, there is no significant difference in the average cumulative daily rate of confirmed cases.” Fifteen F-tests, and no obvious correction for multiple tests. I didn’t read in enough detail to know the right correction, but eyeballing the results, I’m guessing it would cut out half their “significant” correlations. Still, it remains plausible.

  • A surprising formula… A mathematician at UNC Chapel Hill claims surprising accuracy using a two-phase differential-equation model. (It switches from one curve to the other at the point of closest approach.) I haven’t had time to dive into the details, but I’m partial to phase space for modeling dynamical systems. The paper argues for hard measures once triggered, a theory the author calls epidemic “momentum managing,” which he expands in a longer linked piece.

  • Disparities in Vulnerability… From 6 scholars in population health and aging, at 4 major US universities: “particular attention should be paid to the risk of adverse outcomes in midlife for non-Hispanic blacks, adults with a high school degree or less, and low-income Americans”. So, the usual. Oh, and “our estimates likely understate those disparities.” An AI model trained on pre-pandemic medical claims created a vulnerability index for severe respiratory infection, by education, income, and race-ethnicity. High school education or less: 2x risk vs. college. Lowest income quartile: 3x risk vs. highest. Both because of early-onset underlying health conditions, esp. hypertension. See Risk Factor Chart below.

Oops: - ‘astonishingly, the forensic medical professional had not died at all. [And] they “do not know for sure and cannot scientifically confirm that the virus moved from the dead body.”’ ~RetractionWatch bit.ly/2Tt71Mv

Innumeracy: In mid-Feb I sent my team home due to C19 worries. ~1 week later someone at the office was diagnosed, so that was a good call. The official case load in my county was still <10.

Three months later with 8K cases (1:140 people), I find I’m getting lax on cleaning & handwashing. :-s

Grumpy Geophysicist argues against public preprint servers.

Confirmation bias is a risk within the scientific community; it is positively rampant in the broader public.

Slopen science?

Pro-active Replication

Science Is Hard

Nature Communications has an inspiring piece about a DARPA project that bakes replication into the original study: the DARPA Biological Control program applies independent validation & verification (IV&V) to synthetic biology. The article is a lessons-learned account.

Although DARPA oversight presumably mitigated the element of competition, the IV&V teams had at least as much trouble as the historical cases discussed below. It’s worth reading their “Hard lessons” table. Here’s one:

We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.

And this one seems to apply to any discipline (“pick a person”):

The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence.

They also report how much they needed personal interaction - a result familiar to those of us in Science Studies (more later).

A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results.

But I was still surprised how much.

A typical academic lab trying to reproduce another lab’s results would probably limit itself to a month or so and perhaps three or four permutations before giving up. Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research.

To be fair, this is biology, a proverbially cantankerous field. But the canonical Science Studies reference of the difficulty of replication is laser physics.

Before I explore that, pause to appreciate the importance of this DARPA work: (1) The value of baked-in replication for really understanding the original result, and (2) the real difficulty in achieving it. I encourage you to read the NC piece and imagine how this could be practiced at funding levels below that of DARPA programs.

Echoes of the Past

In Science Studies, replication troubles evoke the “experimenters' regress”. The canonical reference is Collins’s 1974 paper on laser physics (or really, his 1985 book):

to date, no-one to whom I have spoken has succeeded in building a TEA laser using written sources (including preprints and internal reports) as the sole source of information, though several unsuccessful attempts have been made, and there is now a considerable literature on the subject. …. The laboratories studied here … actually learned to build working models of TEA lasers by contact with a source laboratory either by personal visits and telephone calls or by transfer of personnel.

Shortly thereafter, he notes that the many failures were:

simply because the parameters of the device were not understood by the source laboratories themselves. … For instance, a spokesman at Origin reports that it was only previous experience that enabled him to see that the success of a laser built by another laboratory depended on the inductance of their transformer, at that time thought to be a quite insignificant element.

This is of course echoed in the new NC piece about the DARPA program.

Collins expands on this and other episodes in his (1985), attempting to make sense of (then nascent) attempts to detect gravity waves. As Turner (2014) summarizes:

The point was to show that replication was not and could not be understood as a mechanical process…

So the crisis isn’t merely that it’s hard to replicate from publications - any more than it’s a crisis that it’s so hard to learn to ride a bicycle by studying a manual. And no doubt many failed replications are failures of technique, caused by researchers foolishly attempting replication without contacting the original authors. The crisis is that we have many reasons to expect half the original results were indeed spurious. One of the goals of Replication Markets and the larger DARPA SCORE program is to help sort out which.

Back to the Future

I’ve fallen behind in the literature. I see that Collins has a 2016 chapter on the modern situation: “Reproducibility of Experiments: Experimenters' Regress, Statistical Uncertainty Principle, and the Replication Imperative.” I look forward to reading it.

And of course this brings us to Nosek and Errington’s preprint, “What is replication?”, where they argue that replication itself is an “exciting, generative, vital contributor to research progress”.

But now, as Gru says, back to work.


Notes

Collins, Harry M. 1974. “The TEA Set: Tacit Knowledge and Scientific Networks”. Science Studies 4. 165-86. (Online here.)

Collins, Harry M. 1985. Changing order: replication and induction in scientific practice.

Nosek, Brian, and Errington, Tim. 2019-2020. What is replication? MetaArXiv Preprints.

Turner, Stephen P. 2014. Understanding the Tacit. (p.96)

Aschwanden's teaser subtitle: wisdom in the scientific crowd

Science writer Christie Aschwanden (@cragcrest) just published a nice summary of the October paper “Crowdsourcing Hypothesis Tests” by Landy et al.

Well, half the paper.

I like the WIRED piece. It covers the main result: if you give a dozen labs a novel hypothesis to test, they devise about a dozen viable ways to test it, with variable results. So what?

Anna Dreber, an economist at the Stockholm School of Economics and an author on the project, says: “We researchers need to be way more careful now in how we say, ‘I’ve tested the hypothesis.’ You need to say, ‘I’ve tested it in this very specific way.’ Whether it generalizes to other settings is up to more research to show.”

We often neglect this ladder of specificity from theory to experiment and the inevitable rigor-relevance tradeoff, so measuring the effect puts some humility back into our always too-narrow error bars.

But Aschwanden’s subtitle hints at more:

Just how much wisdom is there in the scientific crowd?

Perhaps some paragraphs were cut, because in the second part of the paper, Landy et al asked a crowd to forecast the outcomes of each variant. In his prize-winning Metascience poster Domenico Viganola (@dviganola) summarized the result:

[Image: section of Domenico’s poster, showing correlations between forecasts and results for each hypothesis.]

The scientific crowd was reasonably sensitive to the effect of the design variations from the different labs.

scientists’ predictions are positively correlated with the realized outcomes, both in terms of effect sizes and in terms of whether the result is statistically significant or not for the different sets of study materials.

And,

scientists were able to predict not only which hypotheses would receive empirical support (see Figure 3a) but also variability in results for the same hypothesis based on the design choices made by different research teams (see Figure 3b).

I think that’s pretty cool - and suggests there is reasonable wisdom in the scientific crowd.

But not quite enough wisdom to make this study unnecessary.


Note: Dreber and Viganola are a key part of the @ReplicationMkts team.

Horse betting and @ReplicationMkts

A friend sent a May 2018 Bloomberg article by Kit Chellel, “The gambler who cracked the horse-racing code."
It’s a good story about forecasting real-world events and the interplay of statistical models with the intuition and expertise needed to tune them, and it contains useful guidance for @ReplicationMkts forecasters.

I was struck by this:

A breakthrough came when Benter hit on the idea of incorporating a data set hiding in plain sight: the Jockey Club’s publicly available betting odds.

That’s a surprising oversight. In his defense, 1990 was way before Wisdom of the Crowd, the Iowa Electronic Markets were just getting started, and Robin Hanson was only beginning to write about prediction markets. On the other hand, sports betting had been around for a long time. Anyway, the paragraph continues:

Building his own set of odds from scratch had been profitable, but he found that using the public odds as a starting point and refining them with his proprietary algorithm was dramatically more profitable. He considered the move his single most important innovation, and in the 1990-91 season, he said, he won about $3 million.

Well, of course. Those odds contain all the work already done by others.

Use Prior Wisdom

There are at least three easy sources of prior wisdom in Replication Markets:

  • The starting price is set from the original study’s p-value, using a tiny (two-step) decision tree; a hypothetical sketch follows this list. This simple model has been 60-70% accurate.
  • The replication rate of the field, from previous replication studies or as forecast in Round 0.
  • The current market value. Markets are not surveys. If you arrive late to a well-traded market, pause. Why is it where it is? Is that value the result of a gradual negotiation of opposing views, or a single wild excursion?
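
I won’t reproduce the actual tree here, but a hypothetical two-step version might look like the sketch below. The cutoffs and prices are placeholders I made up for illustration, not the real Replication Markets settings.

```python
def starting_price(p_value: float) -> float:
    """Hypothetical two-step decision tree: original p-value -> prior chance of replication.

    The cutoffs and prices are placeholders for illustration,
    not the real Replication Markets settings.
    """
    if p_value < 0.001:      # very strong original result
        return 0.80
    elif p_value < 0.01:     # moderately strong
        return 0.55
    else:                    # weaker results (p up to 0.05)
        return 0.35

print(starting_price(0.0004), starting_price(0.03))  # 0.8 0.35
```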

With attention you can do better than these - as Benter did in horse betting - but don’t ignore them. Here are two stories on correcting the market:

  1. Yes, our markets are noisy. In SciCast, “faber” made much of his money with a simple bot that half-reversed wild trades by new accounts. So do this when it seems right - but note that faber’s first bot lost thousands of points to a savvy new opponent before it gained backoff rules.

  2. DAGGRE’s top forecaster regularly updated a spreadsheet of his best estimates. But his trading rule was to correct the market only halfway to his best estimate, and wait: the market might know something he didn’t. (Also, market judo: trading +3% 10x is cheaper than trading +30% 1x.) A toy sketch of that halfway rule follows.
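
A toy rendering of that halfway rule (my sketch, not DAGGRE’s actual code):

```python
def partial_correction(market_prob: float, my_estimate: float, step: float = 0.5) -> float:
    """Move the market only part of the way toward my estimate, then wait.

    step=0.5 is the 'halfway' rule: the market might know something I don't.
    """
    return market_prob + step * (my_estimate - market_prob)

# Market says 20%, I believe 60%: trade it up to 40% and watch for a reaction.
print(partial_correction(0.20, 0.60))  # 0.4
```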

Kelly Rule

The article also mentions that Benter used the Kelly rule to steer clear of gambler’s ruin. Kelly’s rule was an important early result in, and unorthodox application of, Shannon’s information theory.

The rule is based on the insight that a market - or a set of markets - is a source of repeated returns, and you want to maximize your lifetime earnings. So you should be as aggressive as you can while ensuring you never go broke. In theory, Kelly never goes broke, though in practice (a) it’s still more aggressive than most can stomach, so practitioners add a cushion, and (b) there are usually minimum bets or fees, so the guarantee becomes more of a tendency. Still, Kelly bettors tend to do well, if they really do have an edge over the current market value.
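
For a single binary bet, the Kelly fraction has a simple closed form. Here’s a minimal sketch of the standard textbook formula, nothing specific to Benter’s models:

```python
def kelly_fraction(p_win: float, net_odds: float) -> float:
    """Fraction of bankroll to stake on one binary bet.

    p_win: your probability the bet wins; net_odds: profit per unit staked (the b in 'b-to-1').
    Returns 0 when you have no edge.
    """
    q = 1.0 - p_win
    return max((net_odds * p_win - q) / net_odds, 0.0)

# A 40% chance at 2-to-1 odds: Kelly says stake 10% of the bankroll.
print(kelly_fraction(0.40, 2.0))  # 0.1
```

The common “half-Kelly” cushion simply halves that stake.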

Also, where Kelly solved the case of a single bet, in SciCast we created a simplified Kelly algorithm for combinatorial bets.

Many can win

Horse betting is a zero-sum game, like poker:

There was no law against what they were doing, but in a parimutuel gambling system, every dollar they won was a dollar lost by someone else

That’s not the case with Replication Markets. In LMSR-based markets, if ten people trade on a claim, and all raise the chance of the correct outcome, then they all win. The points come from the market maker - us.
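
A minimal sketch of the LMSR mechanics behind that claim, using Hanson’s textbook cost function rather than our production code; the liquidity parameter and share counts are made up:

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """Hanson's LMSR cost function for a binary claim, with liquidity parameter b."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def buy_yes(q_yes: float, q_no: float, shares: float, b: float = 100.0) -> float:
    """What a trader pays to buy `shares` of YES; each share pays 1 if YES resolves true."""
    return lmsr_cost(q_yes + shares, q_no, b) - lmsr_cost(q_yes, q_no, b)

# Ten traders each buy 20 YES shares on a claim that later resolves YES.
q_yes = q_no = 0.0
total_paid = 0.0
for _ in range(10):
    total_paid += buy_yes(q_yes, q_no, 20)
    q_yes += 20

payout = q_yes * 1.0                 # every YES share pays out 1
print(round(total_paid, 1), payout)  # ~143.4 paid in vs 200 paid out
```

Every buyer pays less than 1 per share because the YES price never reaches 1, so all ten profit at resolution; the shortfall is the market maker’s subsidy, bounded here by b·ln 2 ≈ 69 points.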

We are also a play-money market - with prizes! - and we don’t charge transaction fees. Thanks to DARPA, who thinks it worthwhile to fund good estimates of the replicability of social science.

Greenland on Cognition, Causality, and Statistics

Sander Greenland argues that all the mistakes made with P-values will be reproduced with any other method, because they are fundamentally cognitive errors driven by bias and incentives. We need to pay attention to the cognition of statistics, emphasize the causal story, and use plain language properly. As one example, he proposes reporting S-values rather than P-values.

Logically equivalent is NOT cognitively equivalent.
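
The S-value (“surprisal”) is a case in point: logically it is just the P-value re-expressed as bits of information against the test hypothesis, s = -log2(p), but it reads very differently. A quick sketch:

```python
import math

def s_value(p: float) -> float:
    """Greenland's S-value: bits of information against the tested hypothesis."""
    return -math.log2(p)

for p in (0.05, 0.01, 0.005):
    bits = s_value(p)
    print(f"p = {p}: {bits:.1f} bits, like {round(bits)} fair coin tosses all landing heads")
```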

NISS Webinar: www.niss.org/events/di…