Daniel Lakens on why we won’t move beyond p<.05. Key: few really offered alternatives.

Part 2 of an excellent 2-part reflection on the APA special issue 5 years ago.

In-depth review of pending metascience reforms at NIH by Stuart Buck at GoodScience. Excellent commentary.

This newsletter covers no fewer than four exciting metascience developments, with huge potential for improving science and medicine at NIH and elsewhere.

Including:

  • FY24 Appropriations Bill
  • Senate Report of 2023
  • Cassidy Report
  • House Report

This is promising: you can run LLM inference and training on 13W of power. I’ve yet to read the research paper, but they found you don’t need matrix multiplication if you adopt ternary [-1, 0, 1] values.
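I haven't read the paper, so treat this as a cartoon of the general idea rather than their actual kernel: if the weights are restricted to {-1, 0, +1}, a matrix-vector product collapses into additions and subtractions. A minimal numpy sketch (the function name and toy values are mine):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where W contains only {-1, 0, +1}.

    Each output element is a sum of signed inputs: additions and
    subtractions only, no multiplications required.
    """
    out = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

# Toy example: ternary weights, ordinary activations.
W = np.array([[1, 0, -1],
              [0, 1, 1]])
x = np.array([0.5, -2.0, 3.0])

assert np.allclose(ternary_matvec(W, x), W @ x)
```

The power savings, if real, would come from hardware: adders are far cheaper than multipliers.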

Daniel Miessler suggests the Left is fueling Trump with a defeatist and anti-American narrative.

I started following Miessler for cybersecurity. I like his generally centrist and wide-ranging take, and respect what he’s accomplished with hard work and self-study. He’s onto something here.

The Shackles of Convenience, from Dan Miessler.

Found an old tab with a reminder that misinfo isn’t just on the Right.

I like being wrong?

Chesterton - on The Family


❝   The best way that a man could test his readiness to encounter the common variety of mankind would be to climb down a chimney into any house at random, and get on as well as possible with the people inside. And that is essentially what each one of us did on the day that he was born.

Chesterton, Heretics -- On Certain Modern Writers and the Institution of the Family.

In praise of idleness - Bertrand Russell


❝   The modern man thinks that everything ought to be done for the sake of something else, and never for its own sake.

In praise of idleness, by Bertrand Russell. (HTT Daniel Miessler.)

My neighbor’s astonishingly vivid azalea

Starting Harrow by Tamsyn Muir. 📚

Finished The Difficult Subject by (pen name) Molly Macallen, book 2 in the Maddy Shanks mystery series. 📚

Dr Shanks tries to resume her academic life while the trial from book one begins in Philadelphia. But her chair wants her to investigate the surprising death of her predecessor.

Wim Vanderbauwhede estimates ChatGPT uses ~60x as much energy as Google search. (Geometric mean of other estimates, with analysis.)
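A geometric mean is the sensible way to average estimates that span orders of magnitude. Here's the shape of the calculation with placeholder numbers, not Vanderbauwhede's actual inputs:

```python
import math

# Placeholder per-query energy estimates for an LLM chat query (Wh),
# chosen only to illustrate the calculation -- not the real inputs.
estimates = [9, 18, 36]
google_search_wh = 0.3          # ballpark figure commonly cited for one search

geo_mean = math.prod(estimates) ** (1 / len(estimates))        # = 18.0 Wh
print(f"~{geo_mean / google_search_wh:.0f}x a Google search")  # ~60x
```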

“The first step is to draw a diagram. When you draw a diagram you’re loading it into your GPU. That is the key.” ~Casey Handmer on solving problems

The analogical case with humans would be where established authors, for example, sued new and upcoming authors for having learned their craft, in part, by reading the works of established authors.

I suggest the answer, however, is not twisting existing copyright law into performing new functions badly, but in writing new laws that directly address the new problems.

~Kevin Korb, Generative AI Does Not Violate Copyright but Needs To


❝   Consider that Amazon has had to cap the number of self-published “books” an author can submit to a mere three books per day

~Cory Doctorow [The Coprophagic AI Crisis](https://pluralistic.net/2024/03/14/inhuman-centipede/#enshittibottification)

Infant Mortality & Decline of the West


❝   Infant mortality, the telltale metric that led him to predict the Soviet collapse half a century ago, is higher in Mr. Biden’s America (5.4 per thousand) than in Mr. Putin’s Russia.

~NYT on [Emmanuel Todd & the Decline of the West](https://www.nytimes.com/2024/03/09/opinion/emmanuel-todd-decline-west.html)

Thoughts

The US has infant and maternal mortality problems, but is it this bad, or is it just Russia finally catching up?

  • The CIA World Factbook estimates that in 2023 Russia was still behind at 6.6 infant deaths per thousand live births, versus 5.1 for the US. For comparison, it estimates 35 European countries are below 5 per thousand, and the US is on par with Poland.
CIA World Factbook table showing estimated 2023 Russian & US infant mortality as 6.6 (Rank 162) and 5.1 (Rank 174) respectively.
  • In contrast, Macrotrends data says Russia has edged ahead at 4.8, while it rates the US worse at 5.5. (US and RU data here.) That’s in line with Todd’s US number, and they claim to source from the UN World Population Prospects, so I’ll presume some overlap there. I haven’t checked the sources myself.

But here’s a combined trend using Macrotrends’ data, from 1960-2024 (omitting Russia’s disastrous 1950s). Even this data has the US slowly improving, so the story is Russia catching up.

Russia & US Infant Mortality 1960-2024

Possibly relevant: birth rates are similar at 11 & 12 per thousand (Macrotrends).

Either way, Russia is close to the US now, and I’m surprised – my impressions were outdated. But this graph doesn’t seem cause for concern about the US. Comparison to peer democracies might. I’d have to read Todd’s book for the argument.

Other striking thoughts:

A specialist in the anthropology of families, Mr. Todd warns that a lot of the values Americans are currently spreading are less universal than Americans think.

Which thought continues:

In a similar way, during the Cold War, the Soviet Union’s official atheism was a deal-breaker for many people who might otherwise have been well disposed toward Communism.

And despite the US having ~2.5x Russia’s population (per US Census):

Mr. Todd calculates that the United States produces fewer engineers than Russia does, not just per capita but in absolute numbers.

Though this may reflect his values for what counts as productive (my emphasis):

It is experiencing an “internal brain drain,” as its young people drift from demanding, high-skill, high-value-added occupations to law, finance and various occupations that merely transfer value around the economy and in some cases may even destroy it. (He asks us to consider the ravages of the opioid industry, for instance.)


❝   this plutocratic assumption behind progressive fads

An arresting phrase from Chesterton, What’s Wrong with the World. He says:

modern movements… generally repose upon some experience peculiar to the rich.

The greatest of these is Search

A Berkeley Computer Science lab just uploaded “Approaching Human-Level Forecasting with Language Models” to arXiv DOI:10.48550/arXiv.2402.18563. My take:

There were three things that helped: Study, Scale, and Search, but the greatest of these is Search.

Halawi et al replicated earlier results that off-the-shelf LLMs can’t forecast, then showed how to make them better. Quickly:

  • A coin toss gets a squared error score of 0.25 (see the sketch after this list).
  • Off-the-shelf LLMs are nearly that bad.
  • Except GPT-4, which got 0.208.
  • With web search and fine tuning, the best LLMs got down to 0.179.
  • The (human) crowd average was 0.149.
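For reference, the Brier score used here is just the mean squared error between a probability forecast and the 0/1 outcome (lower is better, 0 is perfect), which is why a flat 0.5 forecast scores 0.25 no matter what happens. A quick sketch:

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier([0.5, 0.5], [0, 1]))   # 0.25 -- the coin-toss baseline
print(brier([0.9, 0.2], [1, 0]))   # 0.025 -- sharp and right pays off
```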

Adding news search and fine tuning, the LLMs were decent (Brier .179), and well-calibrated. Not nearly as good as the crowd (.149), but probably (I’m guessing) better than the median forecaster – most of crowd accuracy is usually carried by the top few %. I’m surprised by the calibration.

Calibration curves from Halawi et al

By far the biggest gain was adding Info-Retrieval (Brier .206 -> .186), especially when it found at least 5 relevant articles.

With respect to retrieval, our system nears the performance of the crowd when there are at least 5 relevant articles. We further observe that as the number of articles increases, our Brier score improves and surpasses the crowd’s (Figure 4a). Intuitively, our system relies on high-quality retrieval, and when conditioned on more articles, it performs better.

Note: they worked to avoid information leakage. The test set only used questions published after the models' cutoff date, and they did sanity checks to ensure the model didn’t already know events after the cutoff date (and did know events before it!). New retrieval used APIs that allowed cutoff dates, so they could simulate more information becoming available during the life of the question. Retrieval dates were sampled based on advertised question closing date, not actual resolution.

Study:

Fine-tuning the model improved on the baseline (.186 -> .179) for the full system, with variants at 0.181-0.183. If I understand correctly, it was fine-tuned on its own forecasts of training questions that had outperformed the crowd, but not by too much, to mitigate overconfidence.
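A minimal sketch of what that selection rule might look like; the margin value and field names are my assumptions, not the paper's exact criteria:

```python
def select_for_finetuning(examples, margin=0.05):
    """Keep forecasts that beat the crowd, but not by more than `margin`.

    `examples` is a list of dicts with hypothetical keys 'model_brier' and
    'crowd_brier'; the cap is meant to avoid training the model toward
    overconfident extremes.
    """
    return [ex for ex in examples
            if ex["crowd_brier"] - margin <= ex["model_brier"] < ex["crowd_brier"]]
```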

That last adjustment – good but not too good – suggests there are still a lot of judgmental knobs to twiddle, risking a garden of forking paths. However, assuming no actual flaws like information leakage, the paper stands as an existence proof of decent forecasting, though not a representative sample of what replication-of-method would find.

Scale:

GPT-4 did a little better than GPT-3.5 (.179 vs .182), and a lot better than lunchbox models like Llama-2.

But it’s not just scale, as you can see below: Llama’s 13B model outperforms its 70B, by a lot. Maybe sophistication would be a better word, but that’s too many syllables for a slogan.

Brier scores for Llama-2 (3 variants) and Mistral-7B-Instruct showing values from 0.226 (near chance) to 0.353 (far worse than chance).

Thoughts

Calibration: Surprisingly good. Some of that probably comes from the careful selection of forecasts to fine-tune on, and some likely from the crowd-within style setup, where the model forecast is the trimmed mean of at least 16 different forecasts it generated for the question. [Update: the [Schoenegger et al] paper (also this week) ensembled 12 different LLMs and got worse calibration. Fine-tuning has my vote now.]
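That aggregation step is simple to picture: sample a batch of forecasts for the same question, then take a trimmed mean. A sketch, with the sample values and trim fraction assumed for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical: 16 forecasts the model generated for one question.
samples = np.array([0.62, 0.55, 0.70, 0.58, 0.61, 0.95, 0.60, 0.57,
                    0.64, 0.59, 0.63, 0.05, 0.66, 0.60, 0.58, 0.61])

# A trimmed mean drops the extremes (here the 0.95 and 0.05 outliers)
# before averaging; the 10% trim fraction is my assumption.
final_forecast = stats.trim_mean(samples, proportiontocut=0.1)
```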

Forking Paths: They did a hyperparameter search to optimize system configuration, like the choice of aggregation function (trimmed mean), as well as retrieval and prompting strategies. This bothers me less than it might because (1) adding good article retrieval matters more than all the other steps; (2) hyperparameter search can itself be specified and replicated (though I bet the choice of good-but-not-great for training forecasts was ad hoc); and (3) the paper is an existence proof.

Quality: It’s about 20% worse than the crowd forecast. I would like to know where it falls in the crowd percentile distribution. However, it’s different enough from the crowd that mixing them at 4:1 (crowd:model) improved the aggregate.

They note the system had its biggest Brier gains when the crowd was guessing near 0.5, but I’m unimpressed: (1) this seems of little practical importance, especially if those questions really are uncertain; (2) it’s still only taking Brier from .24 to .237, nothing to write home about; and (3) it’s too post-hoc, drawing the target after taking the shots.

Overall: A surprising amount of forecasting is nowcasting, and this is something LLMs with good search and inference could indeed get good at. At minimum they could do the initial sweep on a question, or set better priors. I would imagine that marrying LLMs with Argument Mapping could improve things even more.

This paper looks like a good start.

Alan Jacobs again: Hatred alone is immortal.

Ah, the problem with the world. But Chesterton whispers, “The problem with the world? I am.” How to hold that truth without self-loathing, to see humanity’s flaws and foibles, and love us despite and because of them. Pratchett at his best does this.

‘Speak when you are angry and you will make the best speech you will ever regret.’ – Ambrose Bierce

❝   For Ockham, the principle of simplicity limits the multiplication of hypotheses not necessarily entities. Favoring the formulation “It is useless to do with more what can be done with less,” Ockham implies that theories are meant to do things, namely, explain and predict, and these things can be accomplished more effectively with fewer assumptions.

From the IEP entry for “William of Ockham” by Sharon Kaye

Bari Weiss: Why DEI Must End for Good

I’m afraid she’s right. Worth watching or reading in entirety.

The blurry JPEG

❝   LLMs aren't people, but they act a lot more like people than logical machines.

~Ethan Mollick

Linda McIver and Cory Doctorow do not buy the AI hype.

McIver’s ChatGPT is an evolutionary dead end:

As I have noted in the past, these systems are not intelligent. They do not think. They do not understand language. They literally choose a statistically likely next word, using the vast amounts of text they have cheerfully stolen from the internet as their source.

Doctorow’s Autocomplete Worshippers:

AI has all the hallmarks of a classic pump-and-dump, starting with terminology. AI isn’t “artificial” and it’s not “intelligent.” “Machine learning” doesn’t learn. On this week’s Trashfuture podcast, they made an excellent (and profane and hilarious) case that ChatGPT is best understood as a sophisticated form of autocomplete – not our new robot overlord.

Not so fast. First, AI systems do understand text, though not the real-world referents. Although LLMs were trained by choosing the most likely word, they do more. Representations matter. How you choose the most likely word matters. A very large word frequency table could predict the most likely word, but it couldn’t do novel word algebra (king - man + woman = ___) or any of the other things that LLMs do.
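A toy illustration of why the representation matters, with two-dimensional vectors I invented (real embeddings are learned and run to hundreds of dimensions): the analogy falls out of vector arithmetic, which no frequency table gives you.

```python
import numpy as np

# Invented 2-D "embeddings": dimension 0 ~ royalty, dimension 1 ~ maleness.
vec = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.2, 0.5]),
}

target = vec["king"] - vec["man"] + vec["woman"]

def closest(v, exclude):
    """Return the word whose vector has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vec if w not in exclude), key=lambda w: cos(v, vec[w]))

print(closest(target, exclude={"king", "man", "woman"}))  # -> queen
```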

Second, McIver and Doctorow trade on their expertise to make their debunking claim: we understand AI. But that won’t do. As David Mandel notes in a recent preprint, AI risk is the only existential risk where experts in the field rate it riskier than informed outsiders do.

Google’s Peter Norvig clearly understands AI. And he and colleagues argue they’re already general, if limited:

Artificial General Intelligence (AGI) means many different things to different people, but the most important parts of it have already been achieved by the current generation of advanced AI large language models such as ChatGPT, Bard, LLaMA and Claude. …today’s frontier models perform competently even on novel tasks they were not trained for, crossing a threshold that previous generations of AI and supervised deep learning systems never managed. Decades from now, they will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer.

That doesn’t mean he’s right, only that knowing how LLMs work doesn’t automatically dispel claims.

Meta’s Yann LeCun clearly understands AI. He sides with McIver & Doctorow that AI is dumber than cats, and argues there’s a regulatory-capture game going on. (Meta wants more openness, FYI.)

Demands to police AI stemmed from the “superiority complex” of some of the leading tech companies that argued that only they could be trusted to develop AI safely, LeCun said. “I think that’s incredibly arrogant. And I think the exact opposite,” he said in an interview for the FT’s forthcoming Tech Tonic podcast series.

Regulating leading-edge AI models today would be like regulating the jet airline industry in 1925 when such aeroplanes had not even been invented, he said. “The debate on existential risk is very premature until we have a design for a system that can even rival a cat in terms of learning capabilities, which we don’t have at the moment,” he said.

Could a system be dumber than cats and still general?

McIver again:

There is no viable path from this statistical threshing machine to an intelligent system. You cannot refine statistical plausibility into independent thought. You can only refine it into increased plausibility.

I don’t think McIver was trying to spell out the argument in that short post, but as stated this begs the question. Perhaps you can’t get life from dead matter. Perhaps you can. The argument cannot be, “It can’t be intelligent if I understand the parts”.

Doctorow refers to Ted Chiang’s “instant classic”, ChatGPT Is a Blurry JPEG of the Web:

[AI] hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world.

I think that does a good job of correcting many mistaken impressions, and correctly deflating things a bit. But also, that “Blurry JPEG” is key to LLMs’ abilities: they are compressing their world, be it images, videos, or text. That is, they are making models of it. As Doctorow notes,

Except in some edge cases, these systems don’t store copies of the images they analyze, nor do they reproduce them.

They gist them. Not necessarily the way humans do, but analogously. Those models let them abstract, reason, and create novelty. Compression doesn’t guarantee intelligence, but it is closely related.

Two main limitations of AI right now:

  1. They’re still small. Vast in some ways, but with limited working memory. Andrej Karpathy suggests LLMs are like early 8-bit CPUs. We are still experimenting with the rest of the von Neumann architecture to get a viable system.
  2. AI is trapped in a self-referential world of syntax. The reason they hallucinate (image models) or BS (LLMs) is they have no semantic grounding – no external access to ground truth.

Why not use a century of experience with cognitive measures (PDF) to help quantify AI abilities and gaps?

~ ~ ~

An interesting tangent: Doctorow’s piece covers copyright. He thinks that

Under these [current market] conditions, giving a creator more copyright is like giving a bullied schoolkid extra lunch money.

…there are loud, insistent calls … that training a machine-learning system is a copyright infringement.

This is a bad theory. First, it’s bad as a matter of copyright law. Fundamentally, machine learning … [is] a math-heavy version of what every creator does: analyze how the works they admire are made, so they can make their own new works.

So any law against this would undo what wins creators have had over conglomerates regarding fair use and derivative works.

Turning every part of the creative process into “IP” hasn’t made creators better off. All it’s accomplished is to make it harder to create without taking terms from a giant corporation, whose terms inevitably include forcing you to trade all your IP away to them. That’s something that Spider Robinson prophesied in his Hugo-winning 1982 story, “Melancholy Elephants”.

A few good things in psychology


❝   So if you hear that 60% of papers in your field don’t replicate, shouldn't you care a lot about which ones? Why didn't my colleagues and I immediately open up that paper's supplement, click on the 100 links, and check whether any of our most beloved findings died? _~A. Mastroianni_

HTT to the well-read Robert Horn for the link.

After replication failures and more recent accounts of fraud, Elizabeth Gilbert & Nick Hobson ask, Is psychology good for anything?

If the entire field of psychology disappeared today, would it matter? …

Adam Mastroianni, a postdoctoral research scholar at Columbia Business School says: meh, not really.

At the time I replied with something like this:

~ ~ ~ ~ ~

There’s truth to this. I think Taleb noted that Shakespeare and Aeschylus are keener observers of the human condition than the average academic. But it’s good to remember there are useful things in psychology, as Gilbert & Hobson note near the end.

I might add the Weber-Fechner law and effects that reveal mental mechanisms, like:

  • 7±2
  • Stroop
  • mental rotation
  • lesion-coupled deficits
  • some fMRI.

Losing the Weber-Fechner law would be like losing Newton: $F = ma$ reset default motion from stasis to inertia; $p = k \log(S/S_0)$ reset sensation from absolute to relative.
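To make that absolute-to-relative point concrete (same symbols as above, with $S_0$ the reference stimulus):

$$
p = k \log\frac{S}{S_0}
\quad\Rightarrow\quad
p(2S) - p(S) = k\log\frac{2S}{S_0} - k\log\frac{S}{S_0} = k\log 2,
$$

so doubling the stimulus adds the same perceptual increment whether you start from a whisper or a shout.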

7±2 could be 8±3 or 6±1, and there’s chunking. But to lose the idea that short-term memory has but a few fleeting registers would quake the field. And when the default is only ~7, losing or gaining a few is huge.

Stroop, mental rotation, deficits, & brain function are a mix of observation and implied theory. Removing some of this is erasing the moons of Jupiter: stubborn bright spots that rule out theories and strongly suggest alternatives. Stroop is “just” an illusion – but its existence limits independence of processing. Stroop’s cousins in visual search have practical applications from combat & rescue to user-interface design.

Likewise, that brains rotate images constrains mechanism, and informs dyslexic-friendly fonts and interface design.

Neural signals are too slow to track major-league fastballs. But batters can hit them. That helped find some clever signal-processing hacks that help animals perceive moving objects slightly ahead of where they are.

~ ~ ~

But yes, on the whole psychology is observation-rich and theory-poor: cards tiled in a mosaic, not built into houses.

I opened with Mastroianni’s plane crash analogy – if you heard that 60% of your relatives died in a plane crash, pretty soon you’d want to know which ones.

It’s damning that psychology needn’t much care.

Perhaps There are no statistics in the kingdom of God.


❝   Would studies like this be better if they always did all their stats perfectly? Of course. But the real improvement would be not doing this kind of study at all.

~ Adam Mastroianni, There are no statistics in the kingdom of God