Wim Vanderbauwhede estimates ChatGPT uses ~60x as much energy as a Google search. (Geometric mean of other estimates, with analysis.)

“The first step is to draw a diagram. When you draw a diagram you’re loading it into your GPU. That is the key.” ~Casey Handmer on solving problems

The analogical case with humans would be where established authors, for example, sued new and upcoming authors for having learned their craft, in part, by reading the works of established authors.

I suggest the answer, however, is not twisting existing copyright law into performing new functions badly, but in writing new laws that directly address the new problems.

~Kevin Korb, Generative AI Does Not Violate Copyright but Needs To


❝   Consider that Amazon has had to cap the number of self-published “books” an author can submit to a mere three books per day

~Cory Doctorow [The Coprophagic AI Crisis](https://pluralistic.net/2024/03/14/inhuman-centipede/#enshittibottification)

Infant Mortality & Decline of the West


❝   Infant mortality, the telltale metric that led him to predict the Soviet collapse half a century ago, is higher in Mr. Biden’s America (5.4 per thousand) than in Mr. Putin’s Russia.

~NYT on [Emmanuel Todd & the Decline of the West](https://www.nytimes.com/2024/03/09/opinion/emmanuel-todd-decline-west.html)

Thoughts

The US has infant and maternal mortality problems, but is it this bad, or is it just Russia finally catching up?

  • The CIA World Fact Book estimates 2023 Russia still behind at 6.6 infant deaths per thousand live births, versus 5.1 for the US. For comparison, it estimates 35 European countries are below 5 per thousand, and the US is on par with Poland.
CIA World Factbook table showing estimated 2023 Russian & US infant mortality as 6.6 (Rank 162) and 5.1 (Rank 174) respectively.
  • In contrast, Macrotrends data says Russia has edged ahead at 4.8, while it rates the US worse at 5.5. (US and RU data here.) That’s in line with Todd’s US number, and they claim to source from the UN World Population Prospects, so I’ll presume some overlap there. I don’t know the sources myself.

But here’s a combined trend using Macrotrends’ data, from 1960-2024 (omitting Russia’s disastrous 1950s). Even this data has the US slowly improving, so the story is Russia catching up.

Russia & US Infant Mortality 1960-2024

Possibly relevant: birth rates are similar at 11 & 12 per thousand (Macrotrends).

Either way, Russia is close to the US now, and I’m surprised – my impressions were outdated. But this graph doesn’t seem cause for concern about the US. Comparison to peer democracies might. I’d have to read Todd’s book for the argument.

Other striking thoughts:

A specialist in the anthropology of families, Mr. Todd warns that a lot of the values Americans are currently spreading are less universal than Americans think.

Which thought continues:

In a similar way, during the Cold War, the Soviet Union’s official atheism was a deal-breaker for many people who might otherwise have been well disposed toward Communism.

And despite the US having ~2.5x Russia’s population (per US Census):

Mr. Todd calculates that the United States produces fewer engineers than Russia does, not just per capita but in absolute numbers.

Though this may reflect his values for what counts as productive (my emphasis):

It is experiencing an “internal brain drain,” as its young people drift from demanding, high-skill, high-value-added occupations to law, finance and various occupations that merely transfer value around the economy and in some cases may even destroy it. (He asks us to consider the ravages of the opioid industry, for instance.)


❝   this plutocratic assumption behind progressive fads

An arresting phrase from Chesterton, What’s Wrong with the World. He says:

modern movements… generally repose upon some experience peculiar to the rich.

The greatest of these is Search

A Berkeley Computer Science lab just uploaded “Approaching Human-Level Forecasting with Language Models” to arXiv DOI:10.48550/arXiv.2402.18563. My take:

There were three things that helped: Study, Scale, and Search, but the greatest of these is Search.

Halawi et al. replicated earlier results that off-the-shelf LLMs can’t forecast, then showed how to make them better. Quickly:

  • A coin toss gets a squared error score of 0.25.
  • Off-the-shelf LLMs are nearly that bad.
  • Except GPT-4, which got 0.208.
  • With web search and fine-tuning, the best LLMs got down to 0.179.
  • The (human) crowd average was 0.149.

Adding news search and fine-tuning, the LLMs were decent (Brier .179) and well calibrated. Not nearly as good as the crowd (.149), but probably (I’m guessing) better than the median forecaster – most of crowd accuracy is usually carried by the top few %. I’m surprised by the calibration.
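
For concreteness, here’s a toy Brier calculation (the questions and outcomes below are invented): always answering 50% scores 0.25 no matter what happens, and sharper, well-calibrated forecasts score lower.

```python
import numpy as np

def brier(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and binary outcomes."""
    forecasts, outcomes = np.asarray(forecasts), np.asarray(outcomes)
    return np.mean((forecasts - outcomes) ** 2)

# Hypothetical resolutions for ten binary questions (1 = happened, 0 = didn't).
outcomes = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

# A "coin toss" forecaster always says 50%: Brier = 0.25 regardless of outcomes.
print(brier([0.5] * 10, outcomes))   # 0.25

# A sharper, reasonably calibrated forecaster does much better on this toy data.
sharp = [0.9, 0.2, 0.1, 0.7, 0.8, 0.3, 0.6, 0.2, 0.4, 0.9]
print(brier(sharp, outcomes))        # 0.065
```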

Calibration curves from Halawi et al

By far the biggest gain was adding Info-Retrieval (Brier .206 -> .186), especially when it found at least 5 relevant articles.

With respect to retrieval, our system nears the performance of the crowd when there are at least 5 relevant articles. We further observe that as the number of articles increases, our Brier score improves and surpasses the crowd’s (Figure 4a). Intuitively, our system relies on high-quality retrieval, and when conditioned on more articles, it performs better.

Note: they worked to avoid information leakage. The test set only used questions published after the models’ cutoff date, and they did sanity checks to ensure the model didn’t already know events after the cutoff date (and did know events before it!). News retrieval used APIs that allow cutoff dates, so they could simulate more information becoming available during the life of the question. Retrieval dates were sampled based on the advertised question closing date, not the actual resolution date.

Study:

Fine-tuning the model improved on the baseline (.186 -> .179) for the full system, with variants at 0.181-0.183. If I understand correctly, it was fine-tuned on its own forecasts from the training data that had outperformed the crowd, but not by too much, to mitigate overconfidence.
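
If I’m reading that right, the selection rule is something like the sketch below; the margin value and field names are my invention, not the paper’s:

```python
def select_for_finetuning(examples, margin=0.15):
    """Keep self-generated forecasts that beat the crowd, but not by so much
    that training on them would encourage overconfidence. Threshold invented."""
    keep = []
    for ex in examples:  # each ex: {"model_brier": ..., "crowd_brier": ..., "prompt": ..., "forecast": ...}
        better = ex["model_brier"] < ex["crowd_brier"]
        not_too_much_better = (ex["crowd_brier"] - ex["model_brier"]) < margin
        if better and not_too_much_better:
            keep.append(ex)
    return keep
```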

That last adjustment – good but not too good – suggests there are still a lot of judgmental knobs to twiddle, risking a garden of forking paths. However, assuming no actual flaws like information leakage, the paper stands as an existence proof of decent forecasting, though not a representative sample of what replication-of-method would find.

Scale:

GPT-4 did a little better than GPT-3.5 (.179 vs .182), and a lot better than lunchbox models like Llama-2.

But it’s not just scale, as you can see below: Llama’s 13B model outperforms its 70B, by a lot. Maybe sophistication would be a better word, but that’s too many syllables for a slogan.

Brier scores for Llama-2 (3 variants) and Mistral-7B-Instruct showing values from 0.226 (near chance) to 0.353 (far worse than chance).

Thoughts

Calibration: surprisingly good. Some of that probably comes from the careful selection of forecasts to fine-tune on, and some likely from the crowd-within style setup, where the model forecast is the trimmed mean of at least 16 different forecasts it generated for the question. [Update: the [Schoenegger et al] paper (also this week) ensembled 12 different LLMs and got worse calibration. Fine-tuning has my vote now.]
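
For reference, a trimmed mean just drops the most extreme samples before averaging; here’s a minimal sketch (the 10% trim fraction is my assumption, not necessarily their setting):

```python
from scipy import stats

# Sixteen forecasts sampled from the model for one question.
samples = [0.62, 0.58, 0.71, 0.66, 0.60, 0.95, 0.64, 0.59,
           0.68, 0.63, 0.05, 0.61, 0.67, 0.70, 0.65, 0.62]

# Trimming removes the most extreme samples on each side before averaging,
# so a couple of wild outliers (0.95, 0.05) barely move the final forecast.
final = stats.trim_mean(samples, proportiontocut=0.1)
print(round(final, 3))  # 0.64
```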

Forking Paths: They did a hyperparameter search to optimize system configuration, like the choice of aggregation function (trimmed mean), as well as retrieval and prompting strategies. This bothers me less than it might because (1) adding good article retrieval matters more than all the other steps; (2) hyperparameter search can itself be specified and replicated (though I bet the choice of good-but-not-great for training forecasts was ad hoc); and (3) the paper is an existence proof.

Quality: It’s about 20% worse than the crowd forecast. I would like to know where it falls in the crowd percentiles. However, it’s different enough from the crowd that mixing them at 4:1 (crowd:model) improved the aggregate.
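
I’m assuming the 4:1 mix is a simple linear pool of the two probabilities, roughly:

```python
def blend(crowd_p, model_p, crowd_weight=4, model_weight=1):
    """Weighted average (linear pool) of crowd and model probabilities at 4:1."""
    total = crowd_weight + model_weight
    return (crowd_weight * crowd_p + model_weight * model_p) / total

print(blend(0.70, 0.40))  # 0.64
```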

They note the system had its biggest Brier gains when the crowd was guessing near 0.5, but I’m unimpressed: (1) this seems of little practical importance, especially if those questions really are uncertain; (2) it’s still only taking the Brier score from .24 to .237, nothing to write home about; and (3) it’s too post hoc, drawing the target after taking the shots.

Overall: A surprising amount of forecasting is nowcasting, and this is something LLMs with good search and inference could indeed get good at. At minimum they could do the initial sweep on a question, or set better priors. I would imagine that marrying LLMs with Argument Mapping could improve things even more.

This paper looks like a good start.

Alan Jacobs again: Hatred alone is immortal.

Ah, the problem with the world. But Chesterton whispers, “The problem with the world? I am.” How to hold that truth without self-loathing, to see humanity’s flaws and foibles, and love us despite and because of them. Pratchett at his best does this.

‘Speak when you are angry and you will make the best speech you will ever regret.’ – Ambrose Bierce

❝   For Ockham, the principle of simplicity limits the multiplication of hypotheses not necessarily entities. Favoring the formulation “It is useless to do with more what can be done with less,” Ockham implies that theories are meant to do things, namely, explain and predict, and these things can be accomplished more effectively with fewer assumptions.

From the IEP entry for “William of Ockham” by Sharon Kaye

Bari Weiss: Why DEI Must End for Good

I’m afraid she’s right. Worth watching or reading in entirety.

The blurry JPEG

❝   LLMs aren't people, but they act a lot more like people than logical machines.

~Ethan Mollick

Linda McIver and Cory Doctorow do not buy the AI hype.

McIver’s ChatGPT is an evolutionary dead end:

As I have noted in the past, these systems are not intelligent. They do not think. They do not understand language. They literally choose a statistically likely next word, using the vast amounts of text they have cheerfully stolen from the internet as their source.

Doctorow’s Autocomplete Worshippers:

AI has all the hallmarks of a classic pump-and-dump, starting with terminology. AI isn’t “artificial” and it’s not “intelligent.” “Machine learning” doesn’t learn. On this week’s Trashfuture podcast, they made an excellent (and profane and hilarious) case that ChatGPT is best understood as a sophisticated form of autocomplete – not our new robot overlord.

Not so fast. First, AI systems do understand text, though not the real-world referents. Although LLMs were trained by choosing the most likely word, they do more. Representations matter. How you choose the most likely word matters. A very large word frequency table could predict the most likely word, but it couldn’t do novel word algebra (king - man + woman = ___) or any of the other things that LLMs do.
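
A toy illustration of the point, with invented two-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions): the analogy falls out of simple vector arithmetic, which a frequency table can’t do.

```python
import numpy as np

# Toy 2-D embeddings; dimensions roughly "royalty" and "gender" (invented for illustration).
vecs = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "prince": np.array([1.0,  0.9]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman lands nearest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```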

Second, McIver and Doctorow trade on their expertise to make their debunking claim: we understand AI. But that won’t do. As David Mandel notes in a recent preprint, AI risk is the only existential risk where the experts in the field rate it riskier than informed outsiders do.

Google’s Peter Norvig clearly understands AI. And he and colleagues argue they’re already general, if limited:

Artificial General Intelligence (AGI) means many different things to different people, but the most important parts of it have already been achieved by the current generation of advanced AI large language models such as ChatGPT, Bard, LLaMA and Claude. …today’s frontier models perform competently even on novel tasks they were not trained for, crossing a threshold that previous generations of AI and supervised deep learning systems never managed. Decades from now, they will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer.

That doesn’t mean he’s right, only that knowing how LLMs work doesn’t automatically dispel claims.

Meta’s Yann LeCun clearly understands AI. He sides with McIver & Doctorow that AI is dumber than cats, and argues there’s a regulatory-capture game going on. (Meta wants more openness, FYI.)

Demands to police AI stemmed from the “superiority complex” of some of the leading tech companies that argued that only they could be trusted to develop AI safely, LeCun said. “I think that’s incredibly arrogant. And I think the exact opposite,” he said in an interview for the FT’s forthcoming Tech Tonic podcast series.

Regulating leading-edge AI models today would be like regulating the jet airline industry in 1925 when such aeroplanes had not even been invented, he said. “The debate on existential risk is very premature until we have a design for a system that can even rival a cat in terms of learning capabilities, which we don’t have at the moment,” he said.

Could a system be dumber than cats and still general?

McIver again:

There is no viable path from this statistical threshing machine to an intelligent system. You cannot refine statistical plausibility into independent thought. You can only refine it into increased plausibility.

I don’t think McIver was trying to spell out the argument in that short post, but as stated this begs the question. Perhaps you can’t get life from dead matter. Perhaps you can. The argument cannot be, “It can’t be intelligent if I understand the parts”.

Doctorow refers to Ted Chiang’s “instant classic”, ChatGPT Is a Blurry JPEG of the Web

[AI] hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world.

I think that does a good job of correcting many mistaken impressions, and correctly deflating things a bit. But also, that “Blurry JPEG” is key to LLMs’ abilities: they are compressing their world, be it images, videos, or text. That is, they are making models of it. As Doctorow notes,

Except in some edge cases, these systems don’t store copies of the images they analyze, nor do they reproduce them.

They gist them. Not necessarily the way humans do, but analogously. Those models let them abstract, reason, and create novelty. Compression doesn’t guarantee intelligence, but it is closely related.

Two main limitations of AI right now:

  1. They’re still small. Vast in some ways, but with limited working memory. Andrej Karpathy suggests LLMs are like early 8-bit CPUs. We are still experimenting with the rest of the von Neumann architecture to get a viable system.
  2. AI is trapped in a self-referential world of syntax. The reason they hallucinate (image models) or BS (LLMs) is they have no semantic grounding – no external access to ground truth.

Why not use a century of experience with cognitive measures (PDF) to help quantify AI abilities and gaps?

~ ~ ~

An interesting tangent: Doctorow’s piece covers copyright. He thinks that

Under these [current market] conditions, giving a creator more copyright is like giving a bullied schoolkid extra lunch money.

…there are loud, insistent calls … that training a machine-learning system is a copyright infringement.

This is a bad theory. First, it’s bad as a matter of copyright law. Fundamentally, machine learning … [is] a math-heavy version of what every creator does: analyze how the works they admire are made, so they can make their own new works.

So any law against this would undo what wins creators have had over conglomerates regarding fair use and derivative works.

Turning every part of the creative process into “IP” hasn’t made creators better off. All it’s accomplished is to make it harder to create without taking terms from a giant corporation, whose terms inevitably include forcing you to trade all your IP away to them. That’s something that Spider Robinson prophesied in his Hugo-winning 1982 story, “Melancholy Elephants”.

A few good things in psychology


❝   So if you hear that 60% of papers in your field don’t replicate, shouldn't you care a lot about which ones? Why didn't my colleagues and I immediately open up that paper's supplement, click on the 100 links, and check whether any of our most beloved findings died? _~A. Mastroianni_

HT to the well-read Robert Horn for the link.

After replication failures and more recent accounts of fraud, Elizabeth Gilbert & Nick Hobson ask, Is psychology good for anything?

If the entire field of psychology disappeared today, would it matter? …

Adam Mastroianni, a postdoctoral research scholar at Columbia Business School, says: meh, not really.

At the time I replied with something like this:

~ ~ ~ ~ ~

There’s truth to this. I think Taleb noted that Shakespeare and Aeschylus are keener observers of the human condition than the average academic. But it’s good to remember there are useful things in psychology, as Gilbert & Hobson note near the end.

I might add the Weber-Fechner law and effects that reveal mental mechanisms, like:

  • 7±2
  • Stroop
  • mental rotation
  • lesion-coupled deficits
  • some fMRI.

Losing the Weber-Fechner would be like losing Newton: $F = ma$ reset default motion from stasis to inertia; $p = k \log(S/S_0)$ reset sensation from absolute to relative.
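
A quick numeric sketch of what $p = k \log(S/S_0)$ buys you ($k$ and $S_0$ are free parameters fit per sense modality; the values here are arbitrary): equal ratios of stimulus produce equal steps in perception.

```python
import math

def perceived(S, k=1.0, S0=1.0):
    """Weber-Fechner: perceived magnitude grows with the log of stimulus intensity."""
    return k * math.log(S / S0)

# Each doubling of the physical stimulus adds the same constant to perception,
# so 100 -> 200 feels like the same step as 1000 -> 2000.
print(perceived(200) - perceived(100))    # ~0.693
print(perceived(2000) - perceived(1000))  # ~0.693
```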

7±2 could be 8±3 or 6±1, and there’s chunking. But to lose the idea that short-term memory has but a few fleeting registers would quake the field. And when the default is only ~7, losing or gaining a few is huge.

Stroop, mental rotation, deficits, & brain function are a mix of observation and implied theory. Removing some of this is erasing the moons of Jupiter: stubborn bright spots that rule out theories and strongly suggest alternatives. Stroop is “just” an illusion – but its existence limits independence of processing. Stroop’s cousins in visual search have practical applications from combat & rescue to user-interface design.

Likewise, that brains rotate images constrains mechanism, and informs dyslexic-friendly fonts and interface design.

Neural signals are too slow to track major-league fastballs. But batters can hit them. That helped find some clever signal-processing hacks that help animals perceive moving objects slightly ahead of where they are.

~ ~ ~

But yes, on the whole psychology is observation-rich and theory-poor: cards tiled in a mosaic, not built into houses.

I opened with Mastroianni’s plane crash analogy – if you heard that 60% of your relatives died in a plane crash, pretty soon you’d want to know which ones.

It’s damning that psychology needn’t much care.

Perhaps There are no statistics in the kingdom of God.


❝   Would studies like this be better if they always did all their stats perfectly? Of course. But the real improvement would be not doing this kind of study at all.

~ Adam Mastroianni, There are no statistics in the kingdom of God

Replication


❝   And that’s why mistakes had to be corrected. BASF fully recognized that Ostwald would be annoyed by criticism of his work. But they couldn’t tiptoe around it, because they were trying to make ammonia from water and air. If Ostwald’s work couldn’t help them do that, then they couldn’t get into the fertilizer and explosives business. They couldn’t make bread from air. And they couldn’t pay Ostwald royalties. If the work wasn’t right, it was useless to everyone, including Ostwald.

From Paul von Hippel, "[When does science self-correct?](https://goodscience.substack.com/p/when-does-science-self-correct-lessons)". And when not.

A friend sent me 250th Anniversary Boston Tea Party tea for the 16th.

The tea must be authentic: it appears to have been intercepted in transit.

Intelligible Failure

Adam Russell created the DARPA SCORE replication project. Here he reflects on the importance of Intelligible Failure.

[Advanced Research Projects Agencies] need intelligible failure to learn from the bets they take. And that means evaluating risks taken (or not) and understanding—not merely observing—failures achieved, which requires both brains and guts. That brings me back to the hardest problem in making failure intelligible: ourselves. Perhaps the neologism we really need going forward is for intelligible failure itself—to distinguish it, as a virtue, from the kind of failure that we never want to celebrate: the unintelligible failure, immeasurable, born of sloppiness, carelessness, expediency, low standards, or incompetence, with no way to know how or even if it contributed to real progress.

Came across an older Alan Jacobs post:

For those who have been formed largely by the mythical core of human culture, disagreement and alternative points of view may well appear to them not as matters for rational adjudication but as defilement from which they must be cleansed.

It also has a section on The mythical core as lossy compression.

Today I listened to Sam Harris and read Alan Jacobs on Israel & Gaza. Highly recommended.

(With luck “Micropost” will link Jacobs here.)

Huh. Moonlight is redder than sunlight. The “silvery moon” is an illusion. iopscience.iop.org/article/1…

This looks like a good way to appreciate the beautiful bright thing that periodically makes it hard to see nebulas. blog.jatan.space

Do your own research

Sabine Hossenfelder’s Do your own research… but do it right is an excellent guide to critical thinking and a helpful antidote to the meme that no one should “do your own research”.

  1. (When not to do your own research.)
  2. Prepare well:
    • Reasonable expectations: what can you reasonably learn in hours of online/library work?
    • Which specific questions are you trying to answer?
    • Be honest with yourself: about your biases and about what you don’t understand, or aren’t understanding as you read.
  3. Start with basics: Begin with peer-reviewed review articles, reports, lectures, & textbooks. Then look at recent publications. Use Google Scholar and related services to track citations to your source. Check for predatory journals. Beware preprints and conference proceedings, unless you can consult an expert.
  4. No cherry-picking! [ Even though you probably started because someone is wrong on the internet. -crt] This is the #1 mistake of “do your own research”.
  5. Track down sources
    • Never trust 2nd hand sources. Look at them to get started, but don’t end there.
    • If data is available, favor that over the text. Abstracts and conclusions especially tend to overstate.

Strong Towns & Ideological Purity

Good essay by Peter Norton, Why we need Strong Towns, critiquing a Current Affairs piece by Allison Dean.

I side with Norton here: Dean falls into the trap of demanding ideological purity. If you have to read only one, pick Norton. But after donning your Norton spectacles, read Dean for a solid discussion of points of overlap, reinterpreting critiques as debate about the best way to reach shared goals.

This other response to Dean attempts to add some middle ground to Dean’s Savannah and Flint examples, hinting at what a reframed critique might look like. Unfortunately it’s long and meanders, and grinds its own axen.

I suspect Dean is allergic to economic justifications like “wealth” and “prosperity”. But we want our communities to thrive, and valuing prosperity is no more yearning for Dickensian hellscapes than loving community is pining for totalitarian ones.

The wonderful, walkable, wish-I-lived-there communities on Not Just Bikes are thriving, apparently in large part by sensible people-oriented design. More of that please.

(And watch Not Just Bikes for a more approachable take on Strong Towns, and examples of success.)

Essentialism

In a recent newsletter, Jesse Singal notes that recent MAGA gains among Democratic constituencies should prompt progressives to pause and question their own political assumptions & theories:

Quote from Singal saying Trump's continued inroads should prompt soul-searching from the Left: what have they got wrong?

It is quite a string of anomalies. A scientist would be prompted to look for alternate theories.

Singal suggests part of the problem is essentialism:

…activists and others like to talk and write about race in the deeply essentialist and condescending and tokenizing way… It’s everywhere, and it has absolutely exploded during the Trump years.

…both right-wing racists and left-of-center social justice types, [tend] to flatten groups of hundreds of millions of people into borderline useless categories, and to then pretend they share some sort of essence…

The irony.

~~~

Aside: I’m also reminded of a grad school story from Ruth.

Zeno: …Descartes is being an essentialist here…

Ruth: Wait, no, I think you’re being the essentialist….

_____: I’m sorry, what’s an essentialist?

Zeno: [short pause] It is a derogatory term.

(Yes, we had a philosophy teacher named Zeno. I’m not sure if the subject was actually Descartes.)

Unforced bias

In The bias we swim in, Linda McIver notes:

Recently I saw a post going around about how ChatGPT assumed that, in the sentence “The paralegal married the attorney because she was pregnant.” “she” had to refer to the paralegal. It went through multiple contortions to justify its assumption…

Her own conversation with ChatGPT was not as bad as the one making the rounds, but still self-contradictory.

Of course it makes the common gender mistake. What amazes me are the contorted justifications. What skin off the AI’s nose would it be to say, “Yeah, OK, ‘she’ could be the attorney”? But it has also read our responses and learned that humans get embarrassed and move to self-defense.

~ ~ ~

If it’s going to justify itself, it could do better. Surely someone must have written about how the framing highlights the power dynamic. And I find this reading less plausible:

The underling married his much better-paid boss BECAUSE she was pregnant.

At least, it’s not the same BECAUSE implied in the original.