This is promising: you can run LLM inference and training on 13W of power. I’ve yet to read the research paper, but they found you don’t need matrix multiplication if you adopt ternary [-1, 0, 1] values.
A Berkeley Computer Science lab just uploaded “Approaching Human-Level Forecasting
with Language Models” to arXiv DOI:10.48550/arXiv.2402.18563. My take:
There were three things that helped: Study, Scale,
and Search, but the greatest of these is Search.
Halawi et al replicated earlier results that off-the-shelf LLMs can’t forecast, then showed how to make them better. Quickly:
A coin toss gets a squared error score of 0.25.
Off-the-shelf LLMs are nearly that bad.
Except GPT-4 that got 0.208.
With web search and fine tuning, the best LLMs got down to 0.179.
The (human) crowd average was 0.149.
Adding news search and fine tuning, the LLMs were decent (Brier .179), and well-calibrated. Not nearly as good as the crowd (.149), but probably (I’m guessing) better than the median forecaster – most of crowd accuracy is usually carried by the top few %. I’m surprised by the calibration.
Search
By far the biggest gain was adding Info-Retrieval (Brier .206 -> .186), especially when it found at least 5 relevant articles.
With respect to retrieval, our system nears the performance of the crowd when there are at least 5 relevant articles. We further observe that as the number of articles increases, our Brier score improves and surpasses the crowd’s (Figure 4a). Intuitively, our system relies on high-quality retrieval, and when conditioned on more articles, it performs better.
Note: they worked to avoid information leakage. The test set only used questions published after the models' cutoff date, and they did sanity checks to ensure the model didn’t already know events after the cutoff date (and did know events before it!). New retrieval used APIs that allowed cutoff dates, so they could simulate more information becoming available during the life of the question. Retrieval dates were sampled based on advertised question closing date, not actual resolution.
Study:
Fine-tuning the model improved versus baseline: (.186 -> .179) for the full system, with variants at 0.181-0.183. If I understand correctly, it was trained on model forecasts of training data which had outperformed the crowd but not by too much, to mitigate overconfidence.
That last adjustment – good but not too good – suggests there are still a lot of judgmental knobs to twiddle, risking a garden of forking paths. However, assuming no actual flaws like information leakage, the paper stands as an existence proof of decent forecasting, though not a representative sample of what replication-of-method would find.
Scale:
GPT4 did a little better than GPT3.5. (.179 vs .182). And a lot better than lunchbox models like Llama-2.
But it’s not just scale, as you can see below: Llama’s 13B model outperforms its 70B, by a lot. Maybe sophistication would be a better word, but that’s too many syllables for a slogan.
Thoughts
Calibration: was surprisingly good. Some of that probably comes from the careful selection of forecasts to fine-tune from, and some likely from the crowd-within style setup where the model forecast is the trimmed mean from at least 16 different forecasts it generated for the question. [Update: the [Schoenegger et al] paper (also this week) ensembled 12 different LLMs and got worse calibration. Fine tuning has my vote now.]
Forking Paths: They did a hyperparameter search to optimize system configuration like the the choice of aggregation function (trimmed mean), as well as retrieval and prompting strategies. This bothers me less than it might because (1) adding good article retrieval matters more than all the other steps; (2) hyperparameter search can itself be specified and replicated (though I bet the choice of good-but-not-great for training forecasts was ad hoc), and (3) the paper is an existence proof.
Quality: It’s about 20% worse than the crowd forecast. I would like to know where it forms in crowd %ile. However, it’s different enough from the crowd that mixing them at 4:1 (crowd:model) improved the aggregate.
They note the system had biggest Brier gains when the crowd was guessing near 0.5, but I’m unimpressed. (1) this seems of little practical importance, especially if those questions really are uncertain, (2) it’s still only taking Brier of .24 -> .237, nothing to write home about, and (3) it’s too post-hoc, drawing the target after taking the shots.
Overall: A surprising amount of forecasting is nowcasting, and this is something LLMs with good search and inference could indeed get good at. At minimum they could do the initial sweep on a question, or set better priors. I would imagine that marrying LLMs with Argument Mapping could improve things even more.
As I have noted in the past, these systems are not intelligent. They do not think. They do not understand language. They literally choose a statistically likely next word, using the vast amounts of text they have cheerfully stolen from the internet as their source.
AI has all the hallmarks of a classic pump-and-dump, starting with terminology. AI isn’t “artificial” and it’s not “intelligent.” “Machine learning” doesn’t learn. On this week’s Trashfuture podcast, they made an excellent (and profane and hilarious) case that ChatGPT is best understood as a sophisticated form of autocomplete – not our new robot overlord.
Not so fast. First, AI systems do understand text, though not the real-world referents. Although LLMs were trained by choosing the most likely word, they do more. Representations matter. How you choose the most likely word matters. A very large word frequency table could predict the most likely word, but it couldn’t do novel word algebra (king - man + woman = ___) or any of the other things that LLMs do.
Second, McIver and Doctorow trade on their expertise to make their debunking claim: we understand AI. But that won’t do. As David Mandel notes in a recent preprint AI Risk is the only existential risk where the experts in the field rate it riskier than informed outsiders.
Google’s Peter Norvig clearly understands AI. And he and colleagues argue they’re already general, if limited:
Artificial General Intelligence (AGI) means many different things to different people, but the most important parts of it have already been achieved by the current generation of advanced AI large language models such as ChatGPT, Bard, LLaMA and Claude.
…today’s frontier models perform competently even on novel tasks they were not trained for, crossing a threshold that previous generations of AI and supervised deep learning systems never managed. Decades from now, they will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer.
That doesn’t mean he’s right, only that knowing how LLMs work doesn’t automatically dispel claims.
Meta’s Yann LeCun clearly understands AI. He sides with McIver & Doctorow that AI is dumber than cats, and argues there’s a regulatory-capture game going on. (Meta wants more openness, FYI.)
Demands to police AI stemmed from the “superiority complex” of some of the leading tech companies that argued that only they could be trusted to develop AI safely, LeCun said. “I think that’s incredibly arrogant. And I think the exact opposite,” he said in an interview for the FT’s forthcoming Tech Tonic podcast series.
Regulating leading-edge AI models today would be like regulating the jet airline industry in 1925 when such aeroplanes had not even been invented, he said. “The debate on existential risk is very premature until we have a design for a system that can even rival a cat in terms of learning capabilities, which we don’t have at the moment,” he said.
Could a system be dumber than cats and still general?
McIver again:
There is no viable path from this statistical threshing machine to an intelligent system. You cannot refine statistical plausibility into independent thought. You can only refine it into increased plausibility.
I don’t think McIver was trying to spell out the argument in that short post, but as stated this begs the question. Perhaps you can’t get life from dead matter. Perhaps you can. The argument cannot be, “It can’t be intelligent if I understand the parts”.
[AI] hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world.
I think that does a good job at correcting many mistaken impressions, and correctly deflating things a bit. But also, that “Blurry JPEG” is key to LLM’s abilities: they are compressing their world, be it images, videos, or text. That is, they are making models of it. As Doctorow notes,
Except in some edge cases, these systems don’t store copies of the images they analyze, nor do they reproduce them.
They gist them. Not necessarily the way humans do, but analogously. Those models let them abstract, reason, and create novelty. Compression doesn’t guarantee intelligence, but it is closely related.
Two main limitations of AI right now:
They’re still small. Vast in some ways, but with limited working memory. Andrej Karpathy suggests LLMs are like early 8-bit CPUs. We are still experimenting with the rest of the von Neumann architecture to get a viable system.
AI is trapped in a self-referential world of syntax. The reason they hallucinate (image models) or BS (LLMs) is they have no semantic grounding – no external access to ground truth.
A interesting tangent: Doctorow’s piece covers copyright. He thinks that
Under these [current market] conditions, giving a creator more copyright is like giving a bullied schoolkid extra lunch money.
…there are loud, insistent calls … that training a machine-learning system is a copyright infringement.
This is a bad theory. First, it’s bad as a matter of copyright law. Fundamentally, machine learning … [is] a math-heavy version of what every creator does: analyze how the works they admire are made, so they can make their own new works.
So any law against this would undo what wins creators have had over conglomerates regarding fair use and derivative works.
Turning every part of the creative process into “IP” hasn’t made creators better off. All that’s it’s accomplished is to make it harder to create without taking terms from a giant corporation, whose terms inevitably include forcing you to trade all your IP away to them. That’s something that Spider Robinson prophesied in his Hugo-winning 1982 story, “Melancholy Elephants”.
I think our mindset should often be that while we don’t know what we’re going to face, we’re going to develop a team that is really good and an industrial base that is fast. We are going to figure it out as it goes, and we’re going to build to need then. Now, that’s terrifying.
On military acquisition:
The mine-resistant vehicles were a classic case; they didn’t exist at all in the US inventory; we produced thousands of them. [after the war started]
This was inspired by Paul Harrison’s (pfh’s) 2021 post, We’ve been doing k-means wrong for more than half a century. Pfh found that the K-means algorithm in his R package put too many clusters in dense areas, resulting in worse fits compared with just cutting a Ward clustering at height (k).
tl;dr
I re-implemented much of pfh’s notebook in Python, and found that Scikit-learn did just fine with k-means++ init, but reproduced the problem using naive init (random restarts). Cross-checking with flexclust, he decided the problem was a bug in the LICORS implementation of k-means++.
Upshot: use either Ward clustering or k-means++ to choose initial clusters. In Python you’re fine with Scikit-learn’s default. But curiously the Kward here ran somewhat faster.
Update Nov-2022:
I just searched for the LICORS bug. LICORS hasn’t been maintained since 2013, but it’s popular in part for its implementation of kmeanspp , compared to the default (naive) kmeans in R’s stats package. However, it had a serious bug in the distance matrix computation reported by Bernd Fritzke in Nov. 2021 that likely accounts for the behavior Paul noticed. Apparently fixing that drastically improved its performance.
I believe k-means is an essential tool for summarizing data. It is not simply “clustering”, it is an approximation that provides good coverage of a whole dataset by neither overly concentrating on the most dense regions nor focussing too much on extremes. Maybe this is something our society needs more of. Anyway, we should get it right. ~pfh
References
Citation for fastcluster:
Daniel Müllner, fastcluster:Fast Hierarchical, Agglomerative Clustering Routines for R and Python, Journal of Statistical Software 53 (2013), no. 9, 1–18, URL http://www.jstatsoft.org/v53/i09/
Speed depends on many things.
Note the up-to-1000x improvements in this 2018 discussion of sklearn’s k-means by using Cython, OpenMP, and well-vectorized code. (The ticket is closed with a [MRG] note, so may have been incorporated already.)
And this 2020 post says Facebook’s faiss library is 8x faster and 27x more accurate than sklearn, at least on larger datasets like MNIST.
Or a 2017 post for CFXKMeans promising 78x speedup – though note sklearn may have gotten much faster since then.
Test Times & Scores
I’ll omit the code running the tests. Defined null_fit() , do_fits(), do_splice(),
functions to run fits and then combine results into a dataframe.
Borrowing an example from pfh, we will generate two squares of uniform density, the first with 10K points and the second with 1K, and find $k=200$ means. Because the points have a ratio of 10:1, we expect the ideal #clusters to be split $\sqrt{10}:1$.
Name
Score
Wall Time[s]
CPU Time[s]
0
KNull
20.008466
0.027320
0.257709
1
KMeans_full
14.536896
0.616821
6.964919
2
KMeans_Elkan
15.171172
4.809588
69.661552
3
KMeans++
13.185790
4.672390
68.037351
4
KWard
13.836546
1.694548
4.551085
5
Polish
13.108796
0.176962
2.568561
Python
We see the same thing using vanilla k-means (random restarts), but the default k-means++ init overcomes it.
The Data:
SciKit KMeans: Null & Full (Naive init):
SciKit KMeans: Elkan & Kmeans++:
Ward & Polish:
More data: vary $n1, k$
What we’re seeing above is that Ward is fast and nearly as good, but not better.
Let’s collect multiple samples, varying $k$ and $n1$ as we go.
Name
Score
Wall Time[s]
CPU Time[s]
k
n1
0
KNull
10.643565
0.001678
0.002181
5
10
1
KMeans_full
4.766947
0.010092
0.010386
5
10
2
KMeans_Elkan
4.766947
0.022314
0.022360
5
10
3
KMeans++
4.766947
0.027672
0.027654
5
10
4
KWard
5.108086
0.008825
0.009259
5
10
...
...
...
...
...
...
...
475
KMeans_full
14.737051
0.546886
6.635604
200
1000
476
KMeans_Elkan
14.452111
6.075230
87.329714
200
1000
477
KMeans++
13.112620
5.592246
78.233175
200
1000
478
KWard
13.729485
1.953153
4.668957
200
1000
479
Polish
13.091032
0.144555
2.160262
200
1000
480 rows × 6 columns
Plot Scores
We will see that KWard+Polish is often competitive on score, but seldom better.
rows = $k$
columns = $n1$ (with $n2 = 10 \times n1$
Pfh’s example was for $n1 = 1000$ and $k = 200$.
Plot CPU Times
Remember that polish() happens after the Ward clustering, so you should really add those two columns. But in most cases it’s on the order of the KNull.
Even with the polish step, Ward is generally faster, often much faster. The first two has curious exceptions for $n1 = 100, 1000$. I’m tempted to call that setup overhead, except it’s not there for $n1 = 10$, and the charts have different orders of magnitude for the $y$ axis.
Note that the wall-time differences are less extreme, as KMeans() uses concurrent processes. (That helps the polish() step as well, but it usually has few iterations.)
Fair enough: On uniform random data, Ward is as fast as naive K-means and as good as a k-means++ init.
How does it do on data it was designed for? Let’s make some blobs and re-run.
Test with Real Clusters
OK, so that’s how it performs if we’re just quanitizing uniform random noise. What about when the data has real clusters? Saw blob generation on the faiss example post.
Preview of some of the data we’re generating:
Name
Score
Wall Time[s]
CPU Time[s]
k
n1
0
KNull
448.832176
0.001025
0.001025
5
100
1
KMeans_full
183.422464
0.007367
0.007343
5
100
2
KMeans_Elkan
183.422464
0.010636
0.010637
5
100
3
KMeans++
183.422464
0.020496
0.020731
5
100
4
KWard
183.422464
0.006334
0.006710
5
100
...
...
...
...
...
...
...
235
KMeans_full
3613.805107
0.446731
5.400928
200
10000
236
KMeans_Elkan
3604.162116
4.658532
68.597281
200
10000
237
KMeans++
3525.459731
4.840202
71.138150
200
10000
238
KWard
3665.501244
1.791814
4.648277
200
10000
239
Polish
3499.884487
0.144633
2.141082
200
10000
240 rows × 6 columns
Scores
Ward and Polish score on par with KMeans++. For some combinations of $n1$ and $k$ the other algorithms are also on par, but for some they do notably worse.
CPU Times
Ward is constant for a given $n1$, about 4-5s for $n1 = 10,000$. KMeans gets surprisingly slow as $k$ increases, taking 75s vs. Ward’s 4-5s for $k=200$. Even more surprising, Elkan is uniformly slower than full. This could be intepreted vs. compiled, but needs looking into.
The basic result holds.
Strangely, Elkan is actually slower than the old EM algorithm, despite having well-organized blobs where the triangle inequality was supposed to help.
Conclusion
On both uniform random data and blob data, KWard+Polish scores like k-means++ while running as fast as vanilla k-means (random restarts).
In uniform data, the polish step seems to be required to match k-means++. In blobs, you can pretty much stop after the initial Ward.
Surprisingly, for sklearn the EM algorithm (algorithm='full') is faster than the default Elkan.
Appendix: the Classes
We defined two classes for the tests:
KNull - Pick $k$ data points at at random and call that your answer.
KWard - A drop-in replacement for Scikit-Learn’s KMeans, but uses Ward hierarchical clustering, but at height $k$.
We call the fastcluster package for the actual Ward clustering, and provide a .polish() method to do a few of the usual EM iterations to polish the result.
It gives results comparable to the default k-means++ init, but (oddly) was notably faster for large $k$. This is probably just interpreted versus compiled, but needs some attention. Ward is $O(n^2)$ while k-means++ is $O(kn)$, but Ward was running in 4-5s while scikit-learn’s kmeans ++ was taking notably longer for $k≥10$. For $k=200$ it took 75s. (!!)
The classes are really a means to an end here. The post is probably most interesting as:
identifying a bug in an R package,
learning a bit more about what makes K-means solutions good
finding an odd slowness in scikit-learn’s k-means.
KNull Class
KNull just choses $k$ points at random from $X$. We could write that from scratch, but it’s equivalent to calling KMeans with 1 init and 1 iteration. (Besides, I was making a subclassing mistake in KWard, and this minimal subclass let me track it down.)
class KNull(KMeans):
"""KMeans with only 1 iteration: pick $k$ points from $X$."""
def __init__(self,
n_clusters: int=3,
random_state: RandLike=None):
"""Initialize w/1 init, 1 iter."""
super().__init__(n_clusters,
init="random",
n_init=1,
max_iter=1,
random_state=random_state)
Aside: A quick test confirms .inertia_ stores the training .score(mat). Good because it runs about 45x faster.
KWard Class
We make this a subclass of KMeans replacing the fit() method with a call to fastcluster.linkage_vector(X, method='ward') followed by cut_tree(k) to get the initial clusters, and a new polish() method that calls regular KMeans starting from the Ward clusters, and running up to 100 iterations, polishing the fit. (We could put polish() into fit() but this is clearer for testing.
Note: The _vector is a memory-saving variant. The Python fastcluster.linkage*() functions are equivalent to the R fastcluster.hclust*() functions.
class KWard(KMeans):
"""KMeans but use Ward algorithm to find the clusters.
See KMeans for full docs on inherited params.
Parameters
-----------
These work as in KMeans:
n_clusters : int, default=8
verbose : int, default=0
random_state : int, RandomState instance, default=None
THESE ARE IGNORED:
init, n_init, max_iter, tol
copy_x, n_jobs, algorithm
Attributes
-----------
algorithm : "KWard"
Else, populated as per KMeans:
cluster_centers_
labels_
inertia_
n_iter_ ????
Notes
-------
Ward hierarchical clustering repeatedly joins the two most similar points. We then cut the
resulting tree at height $k$. We use the fast O(n^2) implementation in fastcluster.
"""
def __init__(self, n_clusters: int=8, *,
verbose: int=0, random_state=None):
super().__init__(n_clusters,
verbose = verbose,
random_state = random_state)
self.algorithm = "KWard" # TODO: Breaks _check_params()
self.polished_ = False
def fit(self, X: np.array, y=None, sample_weight=None):
"""Find K-means cluster centers using Ward clustering.
Set .labels_, .cluster_centers_, .inertia_.
Set .polished_ = False.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Passed to fc.linkage_vector.
y : Ignored
sample_weight : Ignored
TODO: does fc support weights?
"""
# TODO - add/improve the validation/check steps
X = self._validate_data(X, accept_sparse='csr',
dtype=[np.float64, np.float32],
order='C', copy=self.copy_x,
accept_large_sparse=False)
# Do the clustering. Use pandas for easy add_col, groupby.
hc = fc.linkage_vector(X, method="ward")
dfX = pd.DataFrame(X)
dfX['cluster'] = cut_tree(hc, n_clusters=self.n_clusters)
# Calculate centers, labels, inertia
_ = dfX.groupby('cluster').mean().to_numpy()
self.cluster_centers_ = np.ascontiguousarray(_)
self.labels_ = dfX['cluster']
self.inertia_ = -self.score(X)
# Return the raw Ward clustering assignment
self.polished_ = False
return self
def polish(self, X, max_iter: int=100):
"""Use KMeans to polish the Ward centers. Modifies self!"""
if self.polished_:
print("Already polished. Run .fit() to reset.")
return self
# Do a few iterations
ans = KMeans(self.n_clusters, n_init=1, max_iter=max_iter,
init=self.cluster_centers_)\
.fit(X)
# How far did we move?
𝛥c = np.linalg.norm(self.cluster_centers_ - ans.cluster_centers_)
𝛥s = self.inertia_ - ans.inertia_
print(f" Centers moved by: {𝛥c:8.1f};\n"
f" Score improved by: {𝛥s:8.1f} (>0 good).")
self.labels_ = ans.labels_
self.inertia_ = ans.inertia_
self.cluster_centers_ = ans.cluster_centers_
self.polished_ = True
return self
Usage:KWard(k).fit(X) or KWard(k).fit(X).polish(X).
While I’m thinking of Sullivan, this from his newsletter, regarding a conversation on his podcast:
When you regard debate itself as a form of white supremacy, you tend not to be very good at it.
Psychic AI
FollowTheArgument on Inferring Political Orientation From a Single Picture reviews a recent paper showing algorithms can do this an astonishing 70% of the time, beating humans (55%) and a 100-item personality inventory (66%).
It’s still possible it’s latched onto some irrelevant or fragile feature set, but (a) they cropped close to the face, (b) they controlled for age/sex/race, (c) they cross-validated, and (d) they tested models from facebook on photos from other sites, and verce visa. Given the 2,048 features extracted by the facial image classifier, even regression did as well.
It seems ~60% is easily-named transient features like head-tilt, facial expression, etc. Leaving the rest as-yet-unnamed.
It’s tempting to go to physiognomy, but consider the following things humans do specifically to signal their affiliations:
Makeup amount and style
Beard type
Earrings, other jewelry
Hair style
What’s fascinating is how poorly humans do. But then, that result might be very different if the subjects were trained cold-readers.
I’ve been musing about all the Seussing. Social media discussions, and reading the spectrum on Flip Side (recommended).
I’m not especially qualified, but it seems to me:
Liberals are right the board’s decision probably isn’t cancellation. (It depends on motivation, which we’re likely to get wrong.)
But eBay’s action is cancellation. It’s illiberal, and wrong. It’s also downright silly given what eBay still sells.
Conservatives make a strong case that most of the alleged offenses really weren’t – see Cathy Young’s take for why anyone thought so. If I Ran the Zoo being a notable exception.
Writing for IEEE Spectrum, Joanna Goodrich says that Deep Blue beat Kasparov because was just so fast.
The supercomputer could explore up to 200 million possible chess positions per second with its AI program.
But it wasn’t. Fast enough. Not really. IBM didn’t expect to win, just to lose less badly. Kasparov won the first game. Lost the second. Drew three.
In an account I read years ago (Pandolfini?), it came down to psychology. Deep Blue was doing better than expected, and K started to doubt his preparation or understanding.
Then Deep Blue played a move that spooked Kasparov into thinking it was far faster than it really was, and (uncharacteristically) he panicked and resigned.
But the machine itself had panicked. A bug made that move random.
This recent article roughly agrees with my decades-old memory, supplying the bit I forgot about it being a bug.
This fits my mildly informed blockchain scepticism. I see few legitimate use cases. I’m not alone
I’ve never seen so many people searching so hard for a problem to go with their solution.
Blockchain is a cool technology, but mostly malign. As an article I read a couple of years ago (and can’t find) notes, it’s broken by design: a solution to a post-apocalyptic world where there are no trusted institutions. But the whole of civilization is building trusted institutions.
On top of that, the most popular algorithms are an environmental disaster, and this has been known for years. CO2 emissions from bitcoin mining offset greenhouse-gas reductions from the Paris treaty. They are at least 100,000x less efficient than normal visa transactions.
Just capturing a facebook post (screenshot) - I went into the internet archive to see what used to be on antifa dot com and when it started redirecting. The q folks of course see the redirect as strong evidence of some nefarious conspiracy. So, thoughts in screenshot.
~ ~ ~ * ~ ~ ~
I notice qnonymous.org is available cheap. If I made that point to whitehouse.gov, …. well probably no one would care. If they did care, the net effect would just increase total agitation, rather than to serve as a reductio. Unhelpful. Besides, if you start from the premise that the world is absurd, then a reductio loses its force.
But perhaps the best reason not to do this is that it does not arise from a principle of charity. On which point, why the original post?
We found that asynchronous communication methods, such as Slack and email, were being treated as synchronous communication methods. …we had created an always-on distraction…
We found that remote work led to more meetings, not fewer.
The set new rules:
Async means same-day or next-morning.
Need faster? Text. [Or, y’know, phone. Remember phones? -ed]
Block out Maker hours. They used 150-min in AM, and 90-min in PM.