For some reason it’s hard to hold this in mind.


❝   In MAGAworld, declarative statements ... serve as identity markers.... They are not for conveying Facts, Truth, Reality.... Whether ... Democrats have and deploy weather weapons could not be more irrelevant; what matters is that _this is the kind of thing we say about Democrats_

Would people who talk about weather weapons agree?

And was “Defund the police” similar?

Earlier, AJ linked to this from a recent post. Insightful as usual: teachers at the margins – The Homebound Symphony

Bizarre editing glitch in a new Science article about possible image manipulation in Alzheimer’s studies:

Accordingalpha-synuclein and might to Prothena, the drug blocks the spread of toxic slow the progression of Parkinson’s movement disorders and dementia.

❝   the peril of AI is the same as the promise of AI: it’s a thoughtlessness enabler.

Quote by Kozyrkov.
❝   The difficulty of getting the answer was often a filter that ensured only those who understood the problem deeply would arrive at a solution.

kozyrkov.medium.com/strawberrys-paradox, by Cassie Kozyrkov

An excellent article, though reading in full probably requires you join Medium.

Interesting abstract from Herzenstein et al suggests that replicable studies are transparent and confident relative to non-replicable ones.

Alas for itself, the abstract is timid. They “allude to the possibility that”.

Even if true, likely temporarily so. But I’ve requested the full text.

Daniel Lakens on why we won’t move beyond p<.05. Key: few really offered alternatives.

Part 2 of an excellent 2-part reflection on the APA special issue 5 years ago.

In-depth review of pending metascience reforms at NIH by Stuart Buck at GoodScience. Excellent commentary.

This newsletter covers no fewer than four exciting metascience developments, with huge potential for improving science and medicine at NIH and elsewhere.

Including:

  • FY24 Appropriations Bill
  • Senate Report of 2023
  • Cassidy Report
  • House Report

This is promising: you can run LLM inference and training on 13W of power. I’ve yet to read the research paper, but they found you don’t need matrix multiplication if you adopt ternary [-1, 0, 1] values.
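
The core trick is easy to see in miniature. A sketch of my own (not the paper's code): when weights are restricted to {-1, 0, +1}, every dot product collapses into selective additions and subtractions, so no multiplier hardware is needed.

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where W has only {-1, 0, +1} entries.
    No multiplications: just add the inputs where the weight is +1
    and subtract them where it is -1."""
    out = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, 0, -1],
              [0, 1, 1]])      # ternary weights
x = np.array([2.0, 3.0, 5.0])  # activations
print(ternary_matvec(W, x))    # identical to W @ x
```

The real implementation is of course vectorized in hardware; this just shows why ternary weights make the multiply unit optional.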

Daniel Miessler suggests the Left is fueling Trump with a defeatist and anti-American narrative.

I started following Miessler for cyber-security. I like his generally centrist and wide-ranging take, and respect what he’s accomplished with hard work and self-study. He’s onto something here.

The Shackles of Convenience, from Dan Miessler.

Found an old tab with a reminder that misinfo isn’t just on the Right.

I like being wrong?

Chesterton - on The Family


❝   The best way that a man could test his readiness to encounter the common variety of mankind would be to climb down a chimney into any house at random, and get on as well as possible with the people inside. And that is essentially what each one of us did on the day that he was born.

Chesterton, Heretics -- On Certain Modern Writers and the Institution of the Family.

In praise of idleness - Bertrand Russell


❝   The modern man thinks that everything ought to be done for the sake of something else, and never for its own sake.

In praise of idleness, by Bertrand Russell. (HTT Daniel Miessler.)

My neighbor’s astonishingly vivid azalea

Starting Harrow by Tamsyn Muir. 📚

Finished The Difficult Subject by (pen name) Molly Macallen, book 2 in the Maddy Shanks mystery series. 📚

Dr Shanks tries to resume her academic life while the trial from book one begins in Philadelphia. But her chair wants her to investigate the surprising death of her predecessor.

Wim Vanderbauwhede estimates ChatGPT uses ~60x as much energy as Google search. (Geometric mean of other estimates, with analysis.)
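
A geometric mean is the right average for ratio estimates that span a wide range, since it treats "3x too high" and "3x too low" symmetrically. A quick illustration with made-up ratios (not Vanderbauwhede's actual inputs):

```python
import math

def geometric_mean(xs):
    """Average in log space: exp of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-query energy ratios (ChatGPT vs Google search)
# drawn from different published estimates -- illustrative only.
ratios = [20, 50, 90, 120]
print(round(geometric_mean(ratios)))  # ~57, vs arithmetic mean of 70
```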

“The first step is to draw a diagram. When you draw a diagram you’re loading it into your GPU. That is the key.” ~Casey Handmer on solving problems

The analogical case with humans would be where established authors, for example, sued new and upcoming authors for having learned their craft, in part, by reading the works of established authors.

I suggest the answer, however, is not twisting existing copyright law into performing new functions badly, but in writing new laws that directly address the new problems

~Kevin Korb, Generative AI Does Not Violate Copyright but Needs To


❝   Consider that Amazon has had to cap the number of self-published “books” an author can submit to a mere three books per day

~Cory Doctorow [The Coprophagic AI Crisis](https://pluralistic.net/2024/03/14/inhuman-centipede/#enshittibottification)

Infant Mortality & Decline of the West


❝   Infant mortality, the telltale metric that led him to predict the Soviet collapse half a century ago, is higher in Mr. Biden’s America (5.4 per thousand) than in Mr. Putin’s Russia.

~NYT on [Emmanuel Todd & the Decline of the West](https://www.nytimes.com/2024/03/09/opinion/emmanuel-todd-decline-west.html)

Thoughts

The US has infant and maternal mortality problems, but is it this bad, or is it just Russia finally catching up?

  • The CIA World Fact Book estimates 2023 Russia still behind at 6.6 infant deaths per thousand live births, versus 5.1 for the US. For comparison, it estimates 35 European countries are below 5 per thousand, and the US is on par with Poland.
CIA World Factbook table showing estimated 2023 Russian & US infant mortality as 6.6 (Rank 162) and 5.1 (Rank 174) respectively.
  • In contrast, Macrotrends data says Russia has edged ahead at 4.8, while it rates the US worse at 5.5. (US and RU data here.) That’s in line with Todd’s US number, and they claim to source from the UN World Population Prospects, so I’ll presume some overlap there. I don’t know the sources myself.

But here’s a combined trend using Macrotrends’ data, from 1960-2024 (omitting Russia’s disastrous 1950s). Even this data has the US slowly improving, so the story is Russia catching up.

Russia & US Infant Mortality 1960-2024

Possibly relevant: birth rates are similar at 11 & 12 per thousand (Macrotrends).

Either way, Russia is close to the US now, and I’m surprised – my impressions were outdated. But this graph doesn’t seem cause for concern about the US. Comparison to peer democracies might. I’d have to read Todd’s book for the argument.

Other striking thoughts:

A specialist in the anthropology of families, Mr. Todd warns that a lot of the values Americans are currently spreading are less universal than Americans think.

Which thought continues:

In a similar way, during the Cold War, the Soviet Union’s official atheism was a deal-breaker for many people who might otherwise have been well disposed toward Communism.

And despite the US having ~2.5x Russia’s population (per US Census):

Mr. Todd calculates that the United States produces fewer engineers than Russia does, not just per capita but in absolute numbers.

Though this may reflect his values for what counts as productive (my emphasis):

It is experiencing an “internal brain drain,” as its young people drift from demanding, high-skill, high-value-added occupations to law, finance and various occupations that merely transfer value around the economy and in some cases may even destroy it. (He asks us to consider the ravages of the opioid industry, for instance.)


❝   this plutocratic assumption behind progressive fads

An arresting phrase from Chesterton, What’s Wrong with the World. He says:

modern movements… generally repose upon some experience peculiar to the rich.

The greatest of these is Search

A Berkeley Computer Science lab just uploaded “Approaching Human-Level Forecasting with Language Models” to arXiv DOI:10.48550/arXiv.2402.18563. My take:

There were three things that helped: Study, Scale, and Search, but the greatest of these is Search.

Halawi et al replicated earlier results that off-the-shelf LLMs can’t forecast, then showed how to make them better. Quickly:

  • A coin toss gets a squared error score of 0.25.
  • Off-the-shelf LLMs are nearly that bad.
  • Except GPT-4, which got 0.208.
  • With web search and fine tuning, the best LLMs got down to 0.179.
  • The (human) crowd average was 0.149.
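
For context, the Brier score here is just the mean squared error of probability forecasts against 0/1 outcomes. A quick check of my own (not from the paper) shows why always answering 0.5 scores exactly 0.25:

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0 is perfect; 0.25 is the always-say-0.5 baseline."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0, 1, 1, 0]
print(brier([0.5] * 5, outcomes))                  # 0.25: every error is 0.5^2
print(brier([0.9, 0.1, 0.8, 0.7, 0.2], outcomes))  # confident & right: much lower
```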

With news search and fine-tuning, the LLMs were decent (Brier .179) and well calibrated. Not nearly as good as the crowd (.149), but probably (I’m guessing) better than the median forecaster – most of crowd accuracy is usually carried by the top few %. I’m surprised by the calibration.

Calibration curves from Halawi et al

By far the biggest gain was adding Info-Retrieval (Brier .206 -> .186), especially when it found at least 5 relevant articles.

With respect to retrieval, our system nears the performance of the crowd when there are at least 5 relevant articles. We further observe that as the number of articles increases, our Brier score improves and surpasses the crowd’s (Figure 4a). Intuitively, our system relies on high-quality retrieval, and when conditioned on more articles, it performs better.

Note: they worked to avoid information leakage. The test set only used questions published after the models' cutoff date, and they did sanity checks to ensure the model didn’t already know events after the cutoff date (and did know events before it!). New retrieval used APIs that allowed cutoff dates, so they could simulate more information becoming available during the life of the question. Retrieval dates were sampled based on advertised question closing date, not actual resolution.

Study:

Fine-tuning improved on the baseline (.186 -> .179) for the full system, with variants at 0.181-0.183. If I understand correctly, the model was fine-tuned on its own forecasts of training questions that had outperformed the crowd, but not by too much, to mitigate overconfidence.

That last adjustment – good but not too good – suggests there are still a lot of judgmental knobs to twiddle, risking a garden of forking paths. However, assuming no actual flaws like information leakage, the paper stands as an existence proof of decent forecasting, though not a representative sample of what replication-of-method would find.

Scale:

GPT-4 did a little better than GPT-3.5 (.179 vs .182), and a lot better than lunchbox models like Llama-2.

But it’s not just scale, as you can see below: Llama’s 13B model outperforms its 70B, by a lot. Maybe sophistication would be a better word, but that’s too many syllables for a slogan.

Brier scores for Llama-2 (3 variants) and Mistral-7B-Instruct showing values from 0.226 (near chance) to 0.353 (far worse than chance).

Thoughts

Calibration: was surprisingly good. Some of that probably comes from the careful selection of forecasts to fine-tune from, and some likely from the crowd-within style setup where the model forecast is the trimmed mean from at least 16 different forecasts it generated for the question. [Update: the [Schoenegger et al] paper (also this week) ensembled 12 different LLMs and got worse calibration. Fine tuning has my vote now.]
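
The trimmed-mean aggregation is simple to sketch. My own illustration (not the paper's code): sort the sampled forecasts, drop a fraction from each tail, and average what remains, which blunts the effect of a few wild samples.

```python
def trimmed_mean(forecasts, trim_frac=0.1):
    """Drop the lowest and highest trim_frac of forecasts, average the rest."""
    s = sorted(forecasts)
    k = int(len(s) * trim_frac)
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

# Hypothetical samples: one overconfident and one underconfident outlier.
samples = [0.55, 0.60, 0.62, 0.58, 0.95, 0.05, 0.57, 0.61]
print(round(trimmed_mean(samples, 0.125), 3))  # 0.588: outliers discarded
```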

Forking Paths: They did a hyperparameter search to optimize system configuration like the choice of aggregation function (trimmed mean), as well as retrieval and prompting strategies. This bothers me less than it might because (1) adding good article retrieval matters more than all the other steps; (2) hyperparameter search can itself be specified and replicated (though I bet the choice of good-but-not-great for training forecasts was ad hoc), and (3) the paper is an existence proof.

Quality: It’s about 20% worse than the crowd forecast. I would like to know where it falls in the crowd percentile ranking. However, it’s different enough from the crowd that mixing them at 4:1 (crowd:model) improved the aggregate.

They note the system had its biggest Brier gains when the crowd was guessing near 0.5, but I’m unimpressed: (1) this seems of little practical importance, especially if those questions really are uncertain; (2) it’s still only taking Brier from .24 to .237, nothing to write home about; and (3) it’s too post hoc, drawing the target after taking the shots.

Overall: A surprising amount of forecasting is nowcasting, and this is something LLMs with good search and inference could indeed get good at. At minimum they could do the initial sweep on a question, or set better priors. I would imagine that marrying LLMs with Argument Mapping could improve things even more.

This paper looks like a good start.

Alan Jacobs again: Hatred alone is immortal.

Ah, the problem with the world. But Chesterton whispers, “The problem with the world? I am.” How to hold that truth without self-loathing, to see humanity’s flaws and foibles, and love us despite and because of them. Pratchett at his best does this.

‘Speak when you are angry and you will make the best speech you will ever regret.’ – Ambrose Bierce