Large Language Models (Likely) Aren't the Future of Artificial General Intelligence, And That's Okay
LLMs are a jack-of-all-trades, master of none, and that might be a good thing.
What is Artificial General Intelligence?
Artificial General Intelligence (AGI) is a concept that has intrigued computer scientists and philosophers alike. Simply put, AGI refers to a machine with the ability to perform any intellectual task that a human can. This includes solving mathematical problems, conversing with people, playing chess, driving a car, using a smartphone, and much more. The renowned computer scientist Alan Turing, in his seminal paper “Computing Machinery and Intelligence”, envisioned machines that could think and reason like humans, an idea he explored through his famous Turing Test.
AGI isn't just about handling specific tasks; it’s about having the versatility and adaptability that we, as humans, possess. This includes not only logical reasoning but also creativity, intuition, and ethical decision-making. Currently, humans can process visual information, understand and generate language, learn new skills, remember past experiences, have individual ethics & morality, and make decisions based on a combination of sensory inputs and stored knowledge. AGI aims to replicate this multimodal and flexible intelligence.
LLMs are (currently) limited
Large Language Models (LLMs) like GPT-4, Claude Opus, or Gemini Ultra, have shown remarkable capabilities in understanding and generating human-like text. However, they fall short of being AGI. Humans are innately adept (thank you, six million years of evolution) at the following tasks, while LLMs are currently limited:
Vision — LLMs lack the ability to process visual information. They cannot see and interpret images or videos (at least not yet, as of the writing of this article).
Motor Skills — Tasks like driving a car, playing sports, playing an instrument, or performing manual labor are beyond the scope of LLMs.
Basic Math — Despite being trained on vast amounts of text, LLMs often struggle with simple multiplication, let alone more complex mathematical problems.
Using Devices — Operating a smartphone, making a call, or using apps is something LLMs cannot do on their own. In June 2024, Apple announced a new Siri powered by Apple Intelligence; one piece of that system, MM1, is a Multimodal Large Language Model (MLLM) that performs device actions on behalf of its user. With this MLLM, Siri can not only make calls, book reservations, and schedule meetings for you, but do so while knowing your personal details, like which meeting you mean or who you want to call. Apple claims all these actions will be performed on-device (without making API calls to hosted LLMs), a positive step toward privacy-focused, on-device implementations, a topic of great importance to both people and companies.
Creativity and Intuition — LLMs do not possess the creative and intuitive abilities that humans have (even though their output can feel extremely creative), abilities that are essential for innovation and complex problem-solving. Recent research from Anthropic, detailed in their work “Mapping the Mind of a Large Language Model”, reveals that while LLMs like Claude Sonnet can represent a vast array of concepts, these representations are still limited compared to human creativity and intuition. This interpretability research shows that each concept is distributed across many neurons, and each neuron contributes to multiple concepts; understanding and manipulating these concepts, however, remains challenging.
Ethical and Moral Reasoning — LLMs cannot make decisions grounded in ethics or morals, an area where human intelligence excels. Current LLM ethics and morality are instilled through fine-tuning (such as the constitution behind Anthropic’s Claude), something that sits under the control and oversight of each individual AI lab. It turns out you can even model an LLM to be like Elon Musk. Uh… yeah.
While OpenAI's GPT-4o (the “o” standing for “omni”, with some stunning and controversial demos) and Google's Gemini claim to be natively multimodal, meaning they can understand vision, audio, video, and more, their abilities are still in their infancy compared to human capabilities.
But… what if this is fine?
The goal of AGI is to have a system that can perform a wide array of tasks seamlessly. But LLMs don't need to be the jack-of-all-trades to be valuable. Instead, they can act as the central processing unit (CPU) of a larger, more capable system. Here's how this can work:
Specialized Models
Just as humans use different parts of their brains for different tasks, a large language model can call upon specialized models for specific tasks: for example, AlphaGo for playing Go, Google's Waymo for driving, AlphaFold for protein folding, and continued breakthroughs in native multimodality (like GPT-4o) for vision, video, and audio processing.
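To make the idea concrete, here is a minimal sketch of that routing pattern: a thin “LLM as router” layer decides which specialist to hand a request to. The specialist functions and the keyword-based router are hypothetical stand-ins for real models and for the LLM’s own routing decision, not actual APIs.

```python
# Hypothetical sketch: an LLM "CPU" that routes requests to specialist models.
# All specialist functions below are illustrative stand-ins, not real APIs.
from typing import Callable, Dict

def play_go(prompt: str) -> str:
    return "An AlphaGo-style engine would return a move here"

def plan_route(prompt: str) -> str:
    return "A driving stack (Waymo-style planner) would return a route here"

def fold_protein(prompt: str) -> str:
    return "An AlphaFold-style model would return a predicted structure here"

def general_chat(prompt: str) -> str:
    return "The LLM itself handles open-ended text"

# The LLM's role: decide which specialist to call. A keyword lookup stands in
# here for the model's actual routing / function-calling decision.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "go": play_go,
    "drive": plan_route,
    "protein": fold_protein,
}

def llm_router(prompt: str) -> str:
    for keyword, tool in SPECIALISTS.items():
        if keyword in prompt.lower():
            return tool(prompt)
    return general_chat(prompt)

print(llm_router("Predict the protein structure for this sequence: MKT..."))
```

In a real system the routing decision would come from the LLM itself (for example, via tool or function calling) rather than keyword matching, but the shape of the architecture is the same.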
Memory Systems
The human brain has short-term and long-term memory. Using the analogy of a computer system, short-term memory is the Random Access Memory (RAM), and long-term memory is the Solid-State Drive (SSD).
In a similar vein, LLMs can use their context window as short-term memory and leverage vector databases like Pinecone (disclosure: I work for Pinecone) as long-term memory via Retrieval-Augmented Generation (RAG). This allows them to retrieve information efficiently and enhance their performance.
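A minimal sketch of that memory split might look like the following. A tiny in-memory index with toy character-count “embeddings” stands in for a real vector database (such as Pinecone) and a real embedding model; the stored facts and the query are purely illustrative.

```python
# Minimal RAG sketch: context window as short-term memory, a vector store as
# long-term memory. Toy embeddings and an in-memory list stand in for a real
# embedding model and vector database.
import math

def embed(text: str) -> list[float]:
    # Toy embedding: normalized character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# "Long-term memory": documents stored alongside their embeddings.
long_term_memory = [
    (doc, embed(doc))
    for doc in [
        "The user's favorite programming language is Python.",
        "The user met Alice at a vector search conference.",
        "RAG retrieves relevant documents and adds them to the prompt.",
    ]
]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(long_term_memory, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# "Short-term memory": retrieved facts are placed into the prompt (context).
query = "What is the user's favorite language?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + f"\n\nQuestion: {query}"
print(prompt)  # This prompt would then be sent to the LLM.
```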
Even the concept of RAG is in its infancy; Microsoft Research recently published work on applying knowledge graphs to RAG (GraphRAG) for even more accurate information retrieval. And Google’s Multimodal Embedding Model can natively “understand” images and videos, not just text, without first routing them through an intermediary modality such as a text description. This native multimodal understanding reduces cost and latency and improves accuracy, because the system no longer needs other modalities “described” to it through text.
Think of it this way: reading a book (text) versus watching the movie adaptation (video) is a viscerally different experience.
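To illustrate the architectural difference, here is a small sketch contrasting the two pipelines for an image query. Both model functions are hypothetical stand-ins rather than real APIs; the point is simply that the text-intermediary path takes two lossy steps while the native path takes one.

```python
# Sketch contrasting two retrieval pipelines for an image query.
# caption_image, embed_text, and embed_multimodal are hypothetical stand-ins.

def caption_image(image_bytes: bytes) -> str:
    """Stand-in for a captioning model: image -> text description."""
    return "a golden retriever catching a frisbee in a park"

def embed_text(text: str) -> list[float]:
    """Stand-in for a text embedding model: text -> vector."""
    return [0.10, 0.20, 0.30]

def embed_multimodal(image_bytes: bytes) -> list[float]:
    """Stand-in for a natively multimodal embedding model: image -> vector."""
    return [0.15, 0.22, 0.31]

image = b"raw image bytes"

# Text-intermediary pipeline: two model calls, and detail is lost in the caption.
vector_via_text = embed_text(caption_image(image))

# Native multimodal pipeline: one model call, no lossy "describe it in text" step.
vector_native = embed_multimodal(image)
```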
Not Just Text — Multimodal Understanding
By integrating vision (camera systems), hearing (microphones), and other sensory inputs, LLMs can form part of a multimodal system that mimics human capabilities. These augmented sensory inputs can even exceed what humans can see, hear, feel, and smell. There is ongoing research on models such as Google DeepMind’s GenEm for controlling robotic systems, something that would be impossible with text-only LLMs, since robotics requires multimodal perception and complex reasoning.
The human brain is LLM-like
Think of the human brain. It is indeed more advanced than current LLMs in many ways, but it also relies on appendages and additional systems to function. For example, eyes for vision, ears for hearing, and limbs for movement. Similarly, an LLM can serve as the core intellect while relying on specialized models to extend its capabilities.
A system where the LLM is the CPU would look like a computer system with various peripherals. The LLM handles general tasks and coordinates with other models for specific functions. The advantage of this modular approach is that individual components can be upgraded independently. For instance, the vision model can be improved without needing to retrain the entire LLM.
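A sketch of that modular design might look like the following, with every class as an illustrative stand-in: peripherals sit behind a common interface, so upgrading the vision model is a one-line swap that never touches the LLM core.

```python
# Sketch of the "LLM as CPU" idea: peripherals behind a common interface so any
# one component (e.g., the vision model) can be upgraded without retraining the
# LLM core. All classes here are illustrative stand-ins.
from typing import Dict, Protocol

class Peripheral(Protocol):
    def run(self, payload: str) -> str: ...

class VisionModelV1:
    def run(self, payload: str) -> str:
        return f"[vision v1] description of {payload}"

class VisionModelV2:  # drop-in upgrade; nothing else in the system changes
    def run(self, payload: str) -> str:
        return f"[vision v2] richer description of {payload}"

class LLMCore:
    """Coordinates peripherals and handles general language tasks itself."""
    def __init__(self, peripherals: Dict[str, Peripheral]):
        self.peripherals = peripherals

    def handle(self, task: str, payload: str) -> str:
        if task in self.peripherals:
            return self.peripherals[task].run(payload)
        return f"[llm] answering directly: {payload}"

system = LLMCore({"vision": VisionModelV1()})
print(system.handle("vision", "photo.jpg"))

# Upgrading the vision "peripheral" is a one-line change:
system.peripherals["vision"] = VisionModelV2()
print(system.handle("vision", "photo.jpg"))
```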
LLMs don’t need to master everything
Large Language Models might not be the future of AGI, but they don't necessarily need to be. We can leverage what LLMs are great at — being a jack of all trades, master of none.
Rather than striving for a singular, all-encompassing AI, we may be moving towards a more modular and adaptable approach. LLMs, serving as the central processing unit, can coordinate with specialized models to tackle a wide array of tasks, mirroring the human brain's versatility.
This modular architecture allows for continuous improvement and specialization in various domains. The future of AI may not be a monolithic AGI, but rather a sophisticated ecosystem of interconnected models and systems, each contributing its unique strengths.
So, what do you think?
Do you believe LLMs are the future of AGI?
Or will it be a hybrid multi-system approach?
Or are LLMs just the wrong path entirely?