AI Bias

Dr. Rachel Thomas writes,

When we think about AI, we need to think about complicated real-world systems [because] decision-making happens within complicated real-world systems.

For example, bail/bond algorithms live in this world:

for public defenders to meet with defendants at Rikers Island, where many pre-trial detainees in NYC who can’t afford bail are held, involves a bus ride that is two hours each way and they then only get 30 minutes to see the defendant, assuming the guards are on time (which is not always the case)

Designed or not, this system leads innocents to plead guilty so they can leave jail faster than if they waited for a trial. I hope that was not the point.

Happily I don’t work on bail/bond algorithms, but one decision tree is much like another. “We do things right” means I need to ask more about decision context. We know decision theory - our customers don’t. Decisions should weigh the costs of false positives against the costs of false negatives. It’s tempting to hand them the maximized ROC curve and make threshold choice Someone Else’s Problem. But Someone Else often accepts the default.

False positives abound in cyber-security. The lesser evil is being ignored like some nervous “check engine” light. The greater is being too easily believed. We can detect anomalies - but usually the customer has to investigate.

We can help by providing context. Does the system report confidence? Does it simply say “I don’t know” when appropriate? Do we know the relative costs of misses and false alarms? Can the customer adjust those for the situation?
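To make that concrete, here is a minimal sketch (illustrative only, not anyone’s production code) of choosing an operating point by minimizing expected cost, given customer-supplied costs for false alarms and misses. All names, scores, and numbers below are invented for the example.

```python
import numpy as np

def pick_threshold(scores, labels, cost_fp, cost_fn):
    """Pick the score cutoff that minimizes expected cost per event.

    scores:  detector scores, higher = more suspicious
    labels:  1 = real incident, 0 = benign
    cost_fp: cost of a false alarm (analyst time, alert fatigue)
    cost_fn: cost of a miss (undetected incident)
    """
    prevalence = labels.mean()                  # base rate of real incidents
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        flagged = scores >= t
        fpr = (flagged & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        fnr = (~flagged & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        expected = cost_fp * fpr * (1 - prevalence) + cost_fn * fnr * prevalence
        if expected < best_cost:
            best_t, best_cost = t, expected
    return best_t, best_cost

# Illustrative numbers only: a miss costs 50x a false alarm, ~5% base rate.
rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.05).astype(int)
scores = rng.normal(labels * 1.5, 1.0)          # fake detector scores
print(pick_threshold(scores, labels, cost_fp=1.0, cost_fn=50.0))
```

The loop is trivial; the hard part is eliciting cost_fp, cost_fn, and an honest base rate from the people who live with the alerts.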

Rapid Reviews: COVID-19

The announcement of RR:C19 seems a critical step forward, similar to Hopkins' Novel Coronavirus Research Compendium. Both are mentioned in Begley’s article.

So… would it help to add prediction markets on replication, publication, and citations?

Unsurprisingly, popular media is more popular:

A new study finds that “peer-reviewed scientific publications receiving more attention in non-scientific media are more likely to be cited than scientific publications receiving less popular media attention.”

PDF: Systematically, the NYT COVID-19 counts are high, & CovidTracker’s low. Two more in between. Small relative difference, though notable absolute (1000s). Preprint, so needs a check.

www.medrxiv.org/content/1…

From @Broniatowski: Postdoc with the MOSI project at the Institute for Data Democracy & Politics at GWU. Work on “detecting, tracking, and correcting disinformation/misinformation.” Could start this summer.

www.gwu.jobs/postings/…

Old tabs: found a nice recommendation from one of our @replicationmarkets forecasters, on their beautifully designed website, “Follow the Argument.”

followtheargument.org/sunday-01…

A solid Rodney Brooks essay on peer review’s value and flaws, from the inside. No solution, but worth reading for context and detail. HTT Bob Horn. rodneybrooks.com/peer-revi…

I’m not accustomed to saying anything with certainty after only one or two observations. ~Andreas Vesalius, The China Root Epistle, 1546.

PDF: US intervention timing simulations using county-level & mobility data: “had these same control measures been … 1-2 weeks earlier,” 55% of reported deaths could have been avoided. Strong claim. Haven’t looked, but counterfactuals are hard.

PDF: Simulation: age separation reduced C19 mortality even fixing interactions. “Separating just the older group… reduces the overall mortality… by 66%.” But “age separation is difficult.” Is Erdős-Rényi an OK model here?

  • New short PDF on heparin and C19 reviews ~2K Spanish C19 patients. Heparin halved mortality after adjusting for age and gender; the benefit was the same or better after adding severity and other drugs. Needs a randomized follow-up, but I guess ~80% likely to reproduce.

A few COVID-19 PDFs

A non-random sample of new PDFs whose titles caught my eye yesterday. Based on a quick scan, so I may have missed something. (These are brand new PDFs - so use even more doubt than for well-published papers.)

  • Mortality rate and estimate of fraction of undiagnosed… cases… Simple amateur model estimates the mortality curve for US March & April diagnoses, using data from worldometers and a Gaussian[!?] model. March peak death rate is on the 13th day after diagnosis, similar to Chinese data. Total mortality was 21%[!!?], suggesting severe under-testing [or data problems]. Whatever, the same method applied to cases after 1-APR finds 6.4%, suggesting more testing. If the real rate is 2.4% [!?], then 89% of March cases were untested, and 63% of April cases (a quick arithmetic check appears after this list). [The 2.4% IFR seems ~4x too high by other estimates. They got it by averaging rates from China, the Diamond Princess, Iceland, and Thailand. It’s not clear if they weighted those. First author cites themselves in a basically unrelated physics paper. But then, I’m not an epidemiologist either.]

  • [Estimation of the true infection fatality rate… in each country](https://www.medrxiv.org/content/10.1101/2020.05.13.20101071v2?%253fcollection=) Short paper adjusting estimates because a low PCR exam rate means exams are restricted to suspicious cases, and vice versa. “Reported IRs [infection rates] in USA using antibody tests were 1.5% in Santa Clara, 4.6% in Los Angeles county, and 13.9% in New York state, and our estimate of TIR [true infection rate] in the whole USA is 5.0%.” Estimates US IFR [infection fatality rate] as 0.6% [95% CI: 0.33 - 1.07], slightly higher than Germany and Japan’s ~0.5%, a bit lower than Sweden’s 0.7%, and much lower than Italy or the UK around 1.6%. [This is similar to the running Metaculus estimate, and note the 2.4% above is way outside the interval.]

  • Estimation of the infection fatality rate… More “simple methodology”, but this one claims to be from science & epi faculty in Mexico. They assume all members of a household will be infected together, which is plausible. But they don’t really dig into household data or case data; they just explore the method on “available data”. Eh.

  • Reproductive number of COVID-19: A systematic review and meta-analysis…: Global average of R=2.7 [95%CI 2.2-3.3] with lots of variation. They note that’s above WHO’s figure but lower than another summary. Among the 27 studies in their analysis, only two [both South Korea] publish estimates <1. Take with salt. By non-epidemiologists. They include the Diamond Princess [R>14, albeit with large error bars]. And they claim method [MLE vs SEIR vs MCMC vs…] matters, but they have so few samples per category and so many comparisons that I don’t think it means anything.

  • Restarting after… Comparing policies and infections in the 16 states of Germany, the UC-Davis authors find that contact restrictions were more effective than border closures. Mobility data comes from Google. Using SEIR models, they then predict the effect of ways to relax policy. They think social distancing (German-style) reduced case counts by about 97% – equivalently, total case count would have been about 38X higher. Contact restrictions were estimated to be about 50% effective, versus about 2% for border closures. [What you’d expect for closing the gate after the Trojan horse is inside.] They put a lot of effort into modeling different parts of the transportation system: cars vs. trucks; public transit. They think that compared to keeping restrictions, lifting contact restrictions will cause a 51% or a 27% increase depending on the scenario. Relaxing initial business closures yields a 29% or 16% increase. Relaxing non-essential closures yields a 7% or 4% increase.

  • Relationship between Average Daily Temperature and… (v2). The authors say warmer is better, but note it could be UV or a hidden correlation with e.g. age, pollution, social distancing, etc. And “when no significant differences exist in the average daily temperature of two cities in the same country, there is no significant difference in the average cumulative daily rate of confirmed cases.” Fifteen F-tests, and no obvious correction for multiple tests. I didn’t read in enough detail to know the right correction, but eyeballing the results, I’m guessing it would cut out half their “significant” correlations. Still, it remains plausible.

  • A surprising formula… A mathematician at UNC Chapel Hill claims surprising accuracy using a two-phase differential equations model. (Switches from one curve to the other at point of closest approach.) I haven’t had time to dive into the details, but I’m partial to phase-space for modeling dynamical systems. The paper argues for hard measures, once triggered, a theory the author calls epidemic “momentum managing”, which he expands in a longer linked piece.

  • Disparities in Vulnerability… From 6 scholars in population health and aging, at 4 major US universities: “particular attention should be paid to the risk of adverse outcomes in midlife for non-Hispanic blacks, adults with a high school degree or less, and low-income Americans”. So, the usual. Oh, “our estimates likely understate those disparities.” An AI model trained on pre-pandemic medical claims created a vulnerability index for severe respiratory infection, by education, income, and race-ethnicity. High school education or less: 2x risk vs. college. Lowest income quartile: 3x risk vs. highest. Both because of early-onset underlying health conditions, esp. hypertension. See Risk Factor Chart below.
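Back to the first bullet above: the 89% and 63% untested figures follow from simple arithmetic if you assume deaths are fully counted while cases are not, so the apparent fatality rate is inflated exactly by the undiagnosed fraction. A minimal check (my own reconstruction, not the paper’s code):

```python
# If deaths are (roughly) fully counted but cases are not, then
#   apparent_rate = deaths / diagnosed   and   true_rate = deaths / all_infections,
# so diagnosed / all_infections = true_rate / apparent_rate, and the
# undiagnosed ("untested") fraction is one minus that ratio.
def untested_fraction(apparent_rate, true_rate):
    return 1 - true_rate / apparent_rate

print(untested_fraction(0.21, 0.024))   # March: ~0.886, i.e. "89% untested"
print(untested_fraction(0.064, 0.024))  # April:  0.625, i.e. "63% untested"
```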

Oops: ‘astonishingly, the forensic medical professional had not died at all. [And] they “do not know for sure and cannot scientifically confirm that the virus moved from the dead body.”’ ~RetractionWatch bit.ly/2Tt71Mv

Innumeracy: In mid-Feb I sent my team home due to C19 worries. ~1 week later someone at the office was diagnosed, so: good call. Official case load in my county was still <10.

Three months later with 8K cases (1:140 people), I find I’m getting lax on cleaning & handwashing. :-s

Grumpy Geophysicist argues against public preprint servers.

Confirmation bias is a risk within the scientific community; it is positively rampant in the broader public.

Slopen science?

Pro-active Replication

Science Is Hard

Nature Communications has an inspiring piece about a DARPA project baking replication into the original study. The DARPA Biological Control program applies independent validation & verification (IV&V) to synthetic biology. The article is a Lessons Learned piece.

Although DARPA oversight presumably mitigated the element of competition, the IV&V teams had at least as much trouble as in some historical cases discussed below. It’s worth reading their “Hard lessons” table. Here’s one:

We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.

And this one seems to apply to any discipline: “pick a person.”

The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence.

They also report how much they needed personal interaction - a result familiar to those of us in Science Studies (more later).

A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results.

But I was still surprised by how much.

A typical academic lab trying to reproduce another lab’s results would probably limit itself to a month or so and perhaps three or four permutations before giving up. Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research.

To be fair, this is biology, a proverbially cantankerous field. But the canonical Science Studies reference on the difficulty of replication is laser physics.

Before I explore that, pause to appreciate the importance of this DARPA work: (1) the value of baked-in replication for really understanding the original result, and (2) the real difficulty in achieving it. I encourage you to read the NC piece and imagine how this could be practiced at funding levels below that of DARPA programs.

Echoes of the Past

In Science Studies, replication troubles evoke the “experimenter’s regress”. The canonical reference is Collins’s 1974 paper on laser physics (or really, his 1985 book):

to date, no-one to whom I have spoken has succeeded in building a TEA laser using written sources (including preprints and internal reports) as the sole source of information, though several unsuccessful attempts have been made, and there is now a considerable literature on the subject. …. The laboratories studied here … actually learned to build working models of TEA lasers by contact with a source laboratory either by personal visits and telephone calls or by transfer of personnel.

Shortly thereafter, he notes that the many failures were:

simply because the parameters of the device were not understood by the source laboratories themselves. … For instance, a spokesman at Origin reports that it was only previous experience that enabled him to see that the success of a laser built by another laboratory depended on the inductance of their transformer, at that time thought to be a quite insignificant element.

This is of course echoed in the new NC piece about the DARPA program.

Collins expands on this and other episodes in his (1985), trying to make sense of the (then nascent) attempts to detect gravitational waves. As Turner (2014) summarizes:

The point was to show that replication was not and could not be understood as a mechanical process…

So the crisis isn’t merely that it’s hard to replicate from publications - any more than it’s a crisis that it’s so hard to learn to ride a bicycle by studying a manual. And no doubt many failed replications are failures of technique, caused by researchers foolishly attempting replication without contacting the original authors. The crisis is that we have many reasons to expect half the original results were indeed spurious. One of the goals of Replication Markets and the larger DARPA SCORE program is to help sort out which.

Back to the Future

I’ve fallen behind in the literature. I see that Collins has a 2016 chapter on the modern situation: “Reproducibility of Experiments: Experimenters' Regress, Statistical Uncertainty Principle, and the Replication Imperative.” I look forward to reading it.

And of course this brings us to Nosek and Errington’s preprint, “What is replication?”, where they argue that replication itself is an “exciting, generative, vital contributor to research progress”.

But now, as Gru says, back to work.


Notes

Collins, Harry M. 1974. “The TEA Set: Tacit Knowledge and Scientific Networks.” Science Studies 4: 165-86. (Online here.)

Collins, Harry M. 1985. Changing Order: Replication and Induction in Scientific Practice.

Nosek, Brian, and Tim Errington. 2019-2020. “What Is Replication?” MetaArXiv Preprints.

Turner, Stephen P. 2014. Understanding the Tacit. (p. 96)