❝   The analogical case with humans would be where established authors, for example, sued new and upcoming authors for having learned their craft, in part, by reading the works of established authors.

I suggest the answer, however, is not twisting existing copyright law into performing new functions badly, but in writing new laws that directly address the new problems.

~Kevin Korb, Generative AI Does Not Violate Copyright but Needs To


❝   Consider that Amazon has had to cap the number of self-published “books” an author can submit to a mere three books per day.

~Cory Doctorow [The Coprophagic AI Crisis](https://pluralistic.net/2024/03/14/inhuman-centipede/#enshittibottification)

Infant Mortality & Decline of the West


❝   Infant mortality, the telltale metric that led him to predict the Soviet collapse half a century ago, is higher in Mr. Biden’s America (5.4 per thousand) than in Mr. Putin’s Russia.

~NYT on [Emmanuel Todd & the Decline of the West](https://www.nytimes.com/2024/03/09/opinion/emmanuel-todd-decline-west.html)

Thoughts

The US has infant and maternal mortality problems, but is it this bad, or is it just Russia finally catching up?

  • The CIA World Factbook estimates that for 2023, Russia was still behind at 6.6 infant deaths per thousand live births versus 5.1 for the US. For comparison, it estimates 35 European countries below 5 per thousand, and puts the US on par with Poland.
CIA World Factbook table showing estimated 2023 Russian & US infant mortality as 6.6 (Rank 162) and 5.1 (Rank 174) respectively.
  • In contrast, Macrotrends data says Russia has edged ahead at 4.8, while it rates the US worse at 5.5. (US and RU data here.) That’s in line with Todd’s US number, and Macrotrends claims to source from the UN World Population Prospects, so I’ll presume some overlap there. I haven’t verified the sources myself.

But here’s a combined trend using Macrotrends’ data, from 1960-2024 (omitting Russia’s disastrous 1950s). Even this data has the US slowly improving, so the story is Russia catching up.

Russia & US Infant Mortality 1960-2024

Possibly relevant: birth rates are similar at 11 & 12 per thousand (Macrotrends).

Either way, Russia is close to the US now, and I’m surprised – my impressions were outdated. But this graph doesn’t seem cause for concern about the US. Comparison to peer democracies might. I’d have to read Todd’s book for the argument.

Other striking thoughts:

A specialist in the anthropology of families, Mr. Todd warns that a lot of the values Americans are currently spreading are less universal than Americans think.

Which thought continues:

In a similar way, during the Cold War, the Soviet Union’s official atheism was a deal-breaker for many people who might otherwise have been well disposed toward Communism.

And despite the US having ~2.5x Russia’s population (per US Census):

Mr. Todd calculates that the United States produces fewer engineers than Russia does, not just per capita but in absolute numbers.

Though this may reflect his values for what counts as productive (my emphasis):

It is experiencing an “internal brain drain,” as its young people drift from demanding, high-skill, high-value-added occupations to law, finance and various occupations that merely transfer value around the economy and in some cases may even destroy it. (He asks us to consider the ravages of the opioid industry, for instance.)


❝   this plutocratic assumption behind progressive fads

An arresting phrase from Chesterton, What’s Wrong with the World. He says:

modern movements… generally repose upon some experience peculiar to the rich.

The greatest of these is Search

A Berkeley Computer Science lab just uploaded “Approaching Human-Level Forecasting with Language Models” to arXiv (DOI: 10.48550/arXiv.2402.18563). My take:

There were three things that helped: Study, Scale, and Search, but the greatest of these is Search.

Halawi et al. replicated earlier results that off-the-shelf LLMs can’t forecast, then showed how to make them better. Quickly:

  • A coin toss, always forecasting 0.5, gets a squared-error (Brier) score of 0.25; see the sketch after this list.
  • Off-the-shelf LLMs are nearly that bad.
  • Except GPT-4, which got 0.208.
  • With web search and fine-tuning, the best LLMs got down to 0.179.
  • The (human) crowd average was 0.149.
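
For concreteness, here’s the arithmetic behind those scores. A minimal sketch; the example forecasts are made up and the `brier` helper is mine, not the paper’s:

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2

# A coin-toss forecaster always says 0.5, so it scores 0.25 either way:
assert brier(0.5, 1) == 0.25
assert brier(0.5, 0) == 0.25

# A sharper, mostly-right forecaster scores much lower (lower is better):
forecasts = [(0.9, 1), (0.2, 0), (0.7, 1), (0.4, 0)]
print(sum(brier(p, o) for p, o in forecasts) / len(forecasts))  # 0.075
```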

With news search and fine-tuning, the LLMs were decent (Brier .179) and well calibrated. Not nearly as good as the crowd (.149), but probably (I’m guessing) better than the median forecaster – most of a crowd’s accuracy is usually carried by the top few percent. I’m surprised by the calibration.

Calibration curves from Halawi et al.

By far the biggest gain came from adding information retrieval (Brier .206 -> .186), especially when it found at least 5 relevant articles.

With respect to retrieval, our system nears the performance of the crowd when there are at least 5 relevant articles. We further observe that as the number of articles increases, our Brier score improves and surpasses the crowd’s (Figure 4a). Intuitively, our system relies on high-quality retrieval, and when conditioned on more articles, it performs better.

Note: they worked to avoid information leakage. The test set used only questions published after the models’ training cutoff, and they ran sanity checks that the models didn’t already know events after the cutoff (and did know events before it!). Retrieval used news APIs that support date cutoffs, so they could simulate more information becoming available during the life of a question. Retrieval dates were sampled up to the advertised question close date, not the actual resolution date.
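
A toy sketch of that simulation; the corpus and function names here are hypothetical stand-ins, not the paper’s actual APIs:

```python
from datetime import date, timedelta
import random

# Toy corpus standing in for a news API that supports publish-date cutoffs.
ARTICLES = [
    {"title": "Early report", "published": date(2023, 1, 10)},
    {"title": "Mid-question update", "published": date(2023, 2, 20)},
    {"title": "Late-breaking news", "published": date(2023, 3, 25)},
]

def retrieve(as_of: date) -> list[dict]:
    """Return only articles published on or before the simulated 'today'."""
    return [a for a in ARTICLES if a["published"] <= as_of]

def sample_retrieval_dates(opened: date, advertised_close: date, k: int = 3) -> list[date]:
    """Sample k query dates between open and the *advertised* close date.
    Sampling up to the actual resolution date would leak information."""
    span = (advertised_close - opened).days
    return sorted(opened + timedelta(days=random.randrange(span + 1)) for _ in range(k))

for d in sample_retrieval_dates(date(2023, 1, 1), date(2023, 3, 31)):
    print(d, [a["title"] for a in retrieve(d)])
```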

Study:

Fine-tuning improved on the baseline (.186 -> .179) for the full system, with variants at .181-.183. If I understand correctly, the model was fine-tuned on its own forecasts over the training questions, keeping only those that had outperformed the crowd, but not by too much, to mitigate overconfidence.
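
If I’ve read that right, the selection rule looks something like this sketch; the margin threshold and field names are illustrative guesses, not the paper’s values:

```python
def select_for_finetuning(examples: list[dict], margin: float = 0.05) -> list[dict]:
    """Keep model forecasts that beat the crowd's Brier score, but not by
    more than `margin`, so fine-tuning doesn't reward overconfidence."""
    return [ex for ex in examples
            if ex["crowd_brier"] - margin <= ex["model_brier"] < ex["crowd_brier"]]

examples = [
    {"q": "A", "model_brier": 0.10, "crowd_brier": 0.12},  # slightly better: keep
    {"q": "B", "model_brier": 0.02, "crowd_brier": 0.20},  # far better: drop
    {"q": "C", "model_brier": 0.30, "crowd_brier": 0.25},  # worse than crowd: drop
]
print([ex["q"] for ex in select_for_finetuning(examples)])  # ['A']
```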

That last adjustment – good but not too good – suggests there are still a lot of judgmental knobs to twiddle, risking a garden of forking paths. However, assuming no actual flaws like information leakage, the paper stands as an existence proof of decent forecasting, though not a representative sample of what replication-of-method would find.

Scale:

GPT-4 did a little better than GPT-3.5 (.179 vs .182), and a lot better than lunchbox models like Llama-2.

But it’s not just scale, as you can see below: Llama’s 13B model outperforms its 70B, by a lot. Maybe sophistication would be a better word, but that’s too many syllables for a slogan.

Brier scores for Llama-2 (3 variants) and Mistral-7B-Instruct showing values from 0.226 (near chance) to 0.353 (far worse than chance).

Thoughts

Calibration: Surprisingly good. Some of that probably comes from the careful selection of forecasts to fine-tune on, and some likely from the crowd-within style setup, where the model’s forecast is the trimmed mean of at least 16 different forecasts it generated for the question. [Update: the Schoenegger et al. paper (also this week) ensembled 12 different LLMs and got worse calibration. Fine-tuning has my vote now.]
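
Since the trimmed mean is doing real work here, a minimal sketch of that aggregation step; the trim fraction is my assumption, not the paper’s setting:

```python
def trimmed_mean(forecasts: list[float], trim_frac: float = 0.1) -> float:
    """Drop the lowest and highest trim_frac of forecasts, then average.
    This keeps a few outlier samples from dragging the aggregate around."""
    xs = sorted(forecasts)
    k = int(len(xs) * trim_frac)
    kept = xs[k: len(xs) - k] if k else xs
    return sum(kept) / len(kept)

# 16 sampled forecasts for one question, with two outliers (0.05 and 0.95):
samples = [0.62, 0.58, 0.65, 0.60, 0.05, 0.63, 0.61, 0.59,
           0.64, 0.57, 0.95, 0.60, 0.62, 0.58, 0.66, 0.61]
print(f"{trimmed_mean(samples):.3f}")  # ~0.61, barely moved by the outliers
```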

Forking Paths: They did a hyperparameter search to optimize system configuration, like the choice of aggregation function (trimmed mean), as well as retrieval and prompting strategies. This bothers me less than it might because (1) adding good article retrieval matters more than all the other steps; (2) a hyperparameter search can itself be specified and replicated (though I bet the choice of good-but-not-great for training forecasts was ad hoc); and (3) the paper is an existence proof.
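
Point (2) deserves a concrete form: a hyperparameter search is just an enumerable loop, so it can be written down and rerun. A toy sketch; the search space and the scorer’s numbers are made up, not the paper’s:

```python
from itertools import product

def validation_brier(agg: str, prompt: str, k_articles: int) -> float:
    """Stand-in scorer; the real pipeline would run the full system on a
    validation split and return its mean Brier score for this config."""
    base = {"mean": 0.200, "median": 0.190, "trimmed_mean": 0.185}[agg]
    bonus = 0.004 if prompt == "scratchpad" else 0.0
    return base - bonus - 0.0002 * k_articles

space = product(["mean", "median", "trimmed_mean"],   # aggregation function
                ["scratchpad", "direct"],             # prompting strategy
                [5, 10, 20])                          # articles retrieved
best = min(space, key=lambda cfg: validation_brier(*cfg))
print(best)  # ('trimmed_mean', 'scratchpad', 20) under these made-up numbers
```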

Quality: It’s about 20% worse than the crowd forecast. I would like to know where it falls in the crowd percentile ranking. However, it’s different enough from the crowd that mixing them at 4:1 (crowd:model) improved the aggregate.
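
That mixing is just a weighted average of the two probabilities; a one-line sketch of the 4:1 blend, with example numbers that are mine:

```python
def blend(crowd: float, model: float, w_crowd: float = 0.8) -> float:
    """Weighted average of crowd and model forecasts; 4:1 means w_crowd = 0.8."""
    return w_crowd * crowd + (1 - w_crowd) * model

print(blend(0.70, 0.40))  # 0.8*0.70 + 0.2*0.40 = 0.64
```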

They note the system had its biggest Brier gains when the crowd was guessing near 0.5, but I’m unimpressed: (1) this seems of little practical importance, especially if those questions really are uncertain; (2) it’s still only taking the Brier from .24 to .237, nothing to write home about; and (3) it’s too post-hoc, drawing the target after taking the shots.

Overall: A surprising amount of forecasting is nowcasting, and this is something LLMs with good search and inference could indeed get good at. At minimum they could do the initial sweep on a question, or set better priors. I would imagine that marrying LLMs with Argument Mapping could improve things even more.

This paper looks like a good start.

Alan Jacobs again: Hatred alone is immortal.

Ah, the problem with the world. But Chesterton whispers, “The problem with the world? I am.” How to hold that truth without self-loathing, to see humanity’s flaws and foibles, and love us despite and because of them. Pratchett at his best does this.

‘Speak when you are angry and you will make the best speech you will ever regret.’ – Ambrose Bierce

❝   For Ockham, the principle of simplicity limits the multiplication of hypotheses not necessarily entities. Favoring the formulation “It is useless to do with more what can be done with less,” Ockham implies that theories are meant to do things, namely, explain and predict, and these things can be accomplished more effectively with fewer assumptions.

From the IEP entry for “William of Ockham” by Sharon Kaye