AI, AGI & Things
Introduction
Are we evaluating LLM performance in the right domain(s)? Just by trying one of the chat systems, most people are blown away by how "smart" these models feel, until they eventually discover simple tests like the famous strawberry test that suggest the opposite, or simply shrug and move on when they see an LLM hallucinate for the first time.
Others are convinced that artificial general intelligence is already here, often without any understanding of what's under the hood.
I wrote this post to organize my own high-level, current (as of January 2025) thoughts about the subject in light of the newest results from frontier AI models, reflections from people with lots of insight but questionable incentives, promising research popping up seemingly at random every day, NVIDIA advancing at an incredible pace, and so much more.
The screenshots are taken from Joscha Bach's presentation; many of these thoughts are not mine, and I have tried to reference those giants.
Perception vs Reality
Many of us measure intelligence by a system's factual knowledge, a domain where computers clearly outperform humans thanks to vast amounts of training data. However, it is challenging to determine whether LLMs truly understand the meaning of a text or whether they simply "know the answers" to our questions. Is their apparent understanding more about combinational creativity (combining parts of known answers in new ways), or do they have a genuine grasp of the content?
For example, tests like the well-known strawberry test reveal that LLMs can sometimes fail to demonstrate true comprehension. The vast amount of data these models are trained on makes it difficult for us to fully comprehend and assess their capabilities, since it puts them in a completely different league in a domain we appreciate. Can we be sure that when LLMs provide seemingly intelligent responses, they truly understand the text rather than just memorizing and recombining known information?
Furthermore, can data modeling mimic certain aspects of cognition? How do we know our brains aren't doing something similar, relying more on pattern recognition and memorization (especially in System 1 thinking) than on deep comprehension?
Evaluation Methods
In order to answer some of these questions (and more importantly, measure progress), different evaluation methods are used.
Qualitative Assessments
People often rely on their subjective experience when measuring LLM performance, especially when evaluating more complex aspects of language model capabilities. While quantitative metrics exist, many researchers and users find that qualitative, subjective assessments can provide valuable insights into a model's real-world performance.
Some common types of qualitative tests include:
- Common Sense and World Knowledge
- Dialogue and Conversational Capabilities
- Emotional Intelligence and Empathy
- Creativity and Originality
- Safety and Fairness
Quantitative Metrics
Testing intelligence seems quite challenging because of the way word embeddings are built, often from vast amounts of data. Even though this kind of memorization is more like lossy compression than writing something to disk, we can never be certain whether what we observe is an illusion of intelligence, genuine "creativity" (if we can agree on what creativity means), or intelligence by definition: "the perception or inference of information, and the ability to retain it as knowledge to be applied to adaptive behaviors within an environment or context."
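To make the "lossy compression" point a bit more concrete, here is a toy sketch. The vectors are made up for illustration and are not real model embeddings; the point is only that an embedding preserves a neighborhood of meaning while the exact wording is not recoverable.

```python
# Toy illustration of embeddings as lossy compression: paraphrases land
# near each other in vector space, but the original string cannot be
# reconstructed from its vector. The vectors below are invented values,
# not outputs of any real embedding model.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "the cat sat on the mat": np.array([0.90, 0.10, 0.05]),
    "a feline rested on the rug": np.array([0.85, 0.15, 0.10]),
    "quarterly earnings beat expectations": np.array([0.05, 0.10, 0.95]),
}

query = embeddings["the cat sat on the mat"]
for text, vec in embeddings.items():
    print(f"{cosine(query, vec):.2f}  {text}")
# The paraphrase scores close to 1.0, the unrelated sentence does not;
# the exact wording is gone, only its "meaning neighborhood" remains.
```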
Many evaluations exist, each trying to assess models from a different perspective. Some focus directly on specific applications, like SWE-Bench (which tests a system's ability to solve GitHub issues automatically), RE-Bench (which measures the performance of humans and AI agents on day-long ML research engineering tasks), or MT-Bench (which assesses conversational ability, coherence, and instruction following through multi-turn interactions).
Another trend is using difficult, expert-level questions for assessment. For instance, the Frontier Math test even keeps its question set secret to avoid data contamination (so that models cannot pick the answers up during pre-training).
The ARC-AGI test serves as a benchmark to measure AI skill acquisition on unknown tasks. By its authors' definition, AGI is a system that can efficiently acquire new skills outside of its training data.
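To give a feel for what "acquiring a skill outside the training data" looks like in ARC terms, here is a toy sketch. The grids and the hidden rule are invented for illustration; real ARC-AGI tasks are far more varied and harder.

```python
# An ARC-style task in miniature: a few input/output grid pairs demonstrate
# a hidden rule, and the system must infer that rule and apply it to a new
# input. The grids and the rule here are made up for illustration.
train_pairs = [
    ([[0, 1], [1, 0]], [[1, 0], [0, 1]]),
    ([[1, 1], [0, 0]], [[0, 0], [1, 1]]),
]
test_input = [[0, 0], [0, 1]]

def apply_rule(grid):
    # In this toy task the rule is "invert every cell"; a solver would
    # have to discover this from train_pairs alone, not be handed it.
    return [[1 - cell for cell in row] for row in grid]

assert all(apply_rule(inp) == out for inp, out in train_pairs)
print(apply_rule(test_input))  # [[1, 1], [1, 0]]
```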
Last month, OpenAI's o3 results on the ARC-AGI test sent shockwaves through the AI community. Avoiding clickbait on the Internet already seems like an impossible task, and this release didn't make it any easier.
To refresh my rusty memory, I read the post to confirm that intelligence in the ARC-AGI world is a moving target. A new version of the test is set to be released soon (if it hasn't been already), and it is NOT meant to be definitive proof of AGI, but rather a benchmark for progress.
Considering the high costs associated with OpenAI's solution, this seems like another step, a new dimension of the scaling hypothesis: scaling test-time compute.
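In its simplest form, scaling test-time compute just means spending more inference on each question. Here is a sketch of one such idea, self-consistency-style voting; `sample_answer` is a made-up stand-in for a stochastic model call, and this is not a description of OpenAI's actual method.

```python
# A minimal sketch of test-time compute: sample several candidate answers
# for the same question and keep the most frequent one. More samples cost
# more compute but make the majority answer more reliable.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Stand-in for a stochastic model call that is right most of the time.
    return random.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

def answer_with_voting(question: str, samples: int = 16) -> str:
    votes = Counter(sample_answer(question) for _ in range(samples))
    return votes.most_common(1)[0][0]

print(answer_with_voting("What is 6 * 7?"))  # usually prints "42"
```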
The Nature of Intelligence
People often say that LLMs are just next-token generators, which I think is true. The interesting question then arises: what if our brain (or at least a part of it) functions similarly, as a next-token generator?
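At the interface level, "just a next-token generator" really is this simple: a loop around a function that predicts a distribution over the next token. The sketch below uses a fake `next_token_distribution` as a stand-in for a real model.

```python
# Autoregressive generation in miniature: the model only ever predicts the
# next token; the loop around it produces the whole text.
import random

def next_token_distribution(context: list[str]) -> dict[str, float]:
    # Stand-in for a real model, which would score its entire vocabulary
    # conditioned on the context.
    return {"the": 0.5, "cat": 0.3, "<eos>": 0.2}

def generate(prompt: list[str], max_tokens: int = 20) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

print(" ".join(generate(["once", "upon", "a", "time"])))
```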
Why do we expect the intelligence of an entire human "system" (a brain controlling a body in a physical world, with years of experience) to emerge solely from such a mechanism? For instance, babies are not considered intelligent despite their brains being active (I resist the temptation to go down that rabbit hole now); they need time and experience to develop.
A few questions come to mind:
- What if we add some code (think any programming language with an imperative paradigm for simplicity) that continuously prompts the language model and updates its state?
- What if we allow the model to prompt itself or other models?
The main question is: why do we expect all this complexity to be handled by a single model, without external mechanisms like the loop sketched below?
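As a sketch of what such an external mechanism could look like, here is a plain imperative loop that owns the state and keeps prompting the model; `call_model` is a hypothetical placeholder, not any particular API.

```python
# A minimal "outer loop": ordinary code repeatedly prompts a model, feeds
# the reply back in, and keeps state between calls. The model itself stays
# a stateless next-token generator; the loop provides continuity.
def call_model(prompt: str) -> str:
    # Placeholder; a real implementation would call an LLM here.
    return f"(model reply to: {prompt[-40:]!r})"

def outer_loop(goal: str, steps: int = 3) -> list[str]:
    state: list[str] = [f"Goal: {goal}"]
    for _ in range(steps):
        prompt = "\n".join(state) + "\nWhat should happen next?"
        reply = call_model(prompt)
        state.append(reply)  # the loop, not the model, owns the state
    return state

for line in outer_loop("summarize this blog post"):
    print(line)
```

Nothing here makes the model smarter; it only shows where state, self-prompting, and prompting other models could live outside the model itself.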
According to Joscha Bach, our minds contain two significant domains (the world and ideas) within the representations our brains create of the world. The 'self' is usually part of the world, reflecting what we are currently aware of and enabling us to recreate past states through protocol memory.
If we take our own minds as an example, it seems like (not that I know anything about this to be honest, why are you even reading this?) some kind of "bootstrap" mechanism might be necessary to reach an initial mental state.
Are all these functions supposed to be internal processes within a language model (regardless of its architecture), or do they rely on external mechanisms that prompt the model? Could these mechanisms be akin to what we call the "outer mind"?
Returning to Reality
Does this all matter?
I'm not sure, but I'm happy to see new dimensions of scaling being explored in research, moving beyond pre-training and model size. However, I still feel that the "brute-force" nature of current architectures is missing something.
But is this "intelligence" useful? Absolutely! Can it replace humans? No, but it can certainly help them. By enhancing our capabilities and understanding, these advancements in AI have the potential to support and augment human intelligence in meaningful ways.