From text to world: The legal significance of multimodal AI


Guest post by Philip Young, CEO and co-founder of Garfield AI

Young: Might we need a jurisprudence of intelligence?

For decades, the law has treated artificial intelligence (AI) as a technological curiosity: software that automates routine and basic human tasks.

That view is becoming obsolete. Recent advances in large language models (LLMs) have shown that machines trained purely on text can generate reasoning, argument and creativity at a level once thought distinctively human.

The next phase, already underway, will integrate text with vision, sound, motion and even touch.

This convergence will produce systems that no longer “read about” the world but perceive it. They will move from being linguistic savants to embodied participants in physical and digital environments.

For lawyers, policymakers and judges, this represents not just a technological transition but a jurisprudential one. When intelligence becomes grounded in perception, it begins to interact directly with the factual matrix of the world, the same substrate on which law itself operates.

From prediction to understanding

Current LLMs, at their core, are sophisticated probability engines. They predict the next word in a sequence based on vast training corpora.

Their reasoning ability is an emergent property of prediction accuracy: to guess well, a model must internalise the regularities of human discourse, which implicitly encode facts about the physical and social world.
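
For the technically curious, the prediction step can be shown in a few lines of Python. This is a deliberately toy sketch: the candidate words and their scores are invented for illustration, and a real model performs this calculation over tens of thousands of possibilities at every step.

```python
import math

# Toy next-word predictor. The candidate words and their raw scores
# ("logits") are invented; a real LLM produces scores like these over
# a vocabulary of tens of thousands of tokens.
logits = {"contract": 4.2, "duty": 3.8, "window": 0.5, "banana": -2.0}

# Softmax converts raw scores into a probability distribution over
# possible next words.
total = sum(math.exp(score) for score in logits.values())
probs = {word: math.exp(score) / total for word, score in logits.items()}

for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.3f}")
# contract ≈ 0.589, duty ≈ 0.395, window ≈ 0.015, banana ≈ 0.001
```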

This has led to a subtle but crucial shift in our conception of what a ‘model’ is. It is no longer a dictionary of words or a list of rules. It is an abstract world model, a high-dimensional mathematical structure representing relationships between things, actions and consequences.

One useful way to imagine this is as a conceptual space built on the mathematics of vectors. Many readers will remember vectors from school, plotted as arrows in two or three dimensions. Now imagine that instead of three dimensions there are thousands, or even tens of thousands or more, each representing a tiny nuance of meaning, context or association.

Within this immense and interconnected space, every word or concept can be thought of as a point or coordinate. Words such as ‘contract’, ‘breach’, or ‘duty of care’ are not mere tokens but positions within that space that reflect how those ideas relate to each other in practice.

Even legal concepts exist as clusters or regions in this vast geometry of meaning, defined by their proximity to related ideas such as intention, damage or liability.

In effect, the model has learned a form of jurisprudential topography: a map in which reasoning becomes movement through, and recognition of, these conceptual relationships.
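
To make the geometry concrete, here is a minimal Python sketch. The four-dimensional 'embeddings' and their values are entirely hypothetical (real models use thousands of dimensions), but the measure of proximity used, cosine similarity, is a standard one for comparing such vectors.

```python
import math

# Hypothetical 4-dimensional embeddings. The numbers are invented
# purely to illustrate the geometry; real models learn these values
# from data, in thousands of dimensions.
embeddings = {
    "contract":     [0.9, 0.1, 0.3, 0.0],
    "breach":       [0.8, 0.2, 0.4, 0.1],
    "duty of care": [0.2, 0.9, 0.5, 0.1],
    "banana":       [0.0, 0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Proximity in the vector space: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Related legal concepts sit close together; unrelated words sit far apart.
print(cosine_similarity(embeddings["contract"], embeddings["breach"]))  # ≈ 0.98
print(cosine_similarity(embeddings["contract"], embeddings["banana"]))  # ≈ 0.03
```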

In this sense, a text-only LLM lives in a world of language alone. Like Plato’s prisoners watching the shadows on the cave wall, it perceives only descriptions of the world rather than the world itself. It can reason about those shadows, reconstructing patterns and drawing analogies, but it does not yet see the underlying forms.

Its knowledge of law, science or emotion is inferred from patterns of words, not from direct sensory experience.

It is tempting for some lawyers to dismiss this as nothing more than mathematics, a system of weights and probabilities with no genuine understanding.

Yet the same description could apply, in functional terms, to the human brain. The brain is also an abstract system, a biological network of electrochemical signals and connected neurons that constructs an internal model of reality.

The brain does not touch, smell, or see the world directly. It receives signals from the senses and builds a coherent internal simulation from them. Everything a human knows, believes or reasons about is mediated through that simulation.

If one regards an LLM’s latent space as too abstract to constitute thought, it is worth reflecting that the mental representations inside the human cortex are no less abstract. Both systems form internal world models, the difference being one of substrate and origin, not of principle.

The human mind learns through embodied experience; the machine learns through data. But both transform streams of input into structured representations of reality and reason within those structures.

For law, this matters because our profession is itself an exercise in modelling the world. Legal reasoning abstracts from concrete events to rules and from rules back to predictions about human behaviour.

In that sense, lawyers and LLMs perform analogous cognitive functions: both operate on structured representations of reality to derive conclusions consistent with precedent and context.

The coming multimodal shift

Yet today’s LLMs remain confined within the medium of language. They have never seen a crime scene, negotiated a document, heard a witness, handled an exhibit or made submissions to a judge. Their understanding of the world is second-hand, mediated through text written by humans who themselves interpret sensory experience.

That limitation is temporary. The most advanced systems, such as OpenAI’s GPT-4o, Google DeepMind’s Gemini and Anthropic’s multimodal Claude, already integrate text, image and sound into unified models.

Experimental robotic platforms are adding proprioception and tactile feedback, allowing neural networks to learn not just from descriptions but from physical interaction with the world itself.

This evolution arguably parallels human development. Our brains learn by correlating sensory inputs – sight, sound and touch – into coherent models of the environment. Language emerges later as a symbolic overlay on those sensory foundations.

Multimodal AI reverses that order: starting from language and expanding outward into perception.

I suggest the result will likely be systems capable of grounded reasoning, understanding not merely what words mean in context, but what objects, actions and consequences those words refer to in the world.

Put another way, a model will understand not merely that a ball falls when dropped but what that fall looks and feels like, and what consequences follow from it.

Grounding, meaning and the rule of law

Multimodal AI will raise questions for jurisprudence itself.

The rule of law presupposes a shared understanding of facts and concepts. Courts rely on language to represent reality; definitions are fundamental to law, and meaning is the bridge between word and world.

But when machines construct their own high-dimensional representations of reality, representations richer than human perception, the alignment between legal language and empirical fact may begin to drift.

For example, an AI might interpret ‘negligence’ not as a linguistic category but as a cluster in a multidimensional behavioural space derived from vast data on accidents, outcomes and intent.

Its classification could be statistically impeccable yet legally or morally unintuitive. The law will have to decide whether to privilege human normative meaning or machine-derived empirical accuracy when the two diverge.
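
To illustrate, here is a toy sketch of such a machine-derived category. The behavioural features, numbers and labels below are all invented; the point is that the resulting classification is purely geometric, a matter of which cluster a new case sits nearest to, with no normative reasoning anywhere in it.

```python
import math

# Hypothetical behavioural features for past cases:
# (speed over limit in mph, reaction delay in seconds, visibility 0-1).
# Data and labels are invented purely for illustration.
negligent     = [(25, 2.1, 0.3), (30, 1.8, 0.4), (22, 2.5, 0.2)]
not_negligent = [(2, 0.6, 0.9), (0, 0.5, 0.8), (5, 0.7, 0.9)]

def centroid(points):
    """Mean position of a cluster in the feature space."""
    return [sum(dim) / len(points) for dim in zip(*points)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids = {"negligent": centroid(negligent),
             "not negligent": centroid(not_negligent)}

# A new case is labelled by whichever cluster it sits closest to:
# statistically coherent, but blind to the normative content of the term.
new_case = (18, 1.5, 0.5)
label = min(centroids, key=lambda k: distance(new_case, centroids[k]))
print(label)  # -> negligent
```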

This tension echoes the old jurisprudential debate between formalism and realism. Multimodal AI, by providing a data-rich picture of how the world actually works, could tempt a return to pure realism, judgments based on empirical prediction rather than normative principle.

The enduring task of lawyers will be to ensure that technological understanding remains subordinate to human values and rights.

Beyond human senses

One intriguing aspect of multimodal AI is that it need not be limited to human sensory channels. Models can already process infrared, X-ray and radio signals. Future systems may perceive electromagnetic spectra, network topologies, or quantum states beyond human capability.

In much the same way that a dog hears ultrasonic frequencies or a bee perceives ultraviolet patterns invisible to us, an AI could ‘see’ and ‘hear’ realities that lie outside human perception, yet nevertheless exist.

It might detect subsonic vibrations from machinery indicating imminent failure, pick up radio interference patterns revealing data tampering, or analyse spectral light data to determine the authenticity of a painting or a document.
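
By way of illustration, the short Python sketch below (using a simulated signal and an arbitrary threshold, both invented for the purpose) flags a subsonic vibration that no human ear could detect.

```python
import numpy as np

# Simulated sensor reading, sampled at 1,000 samples per second for
# five seconds: an audible hum at 120 Hz plus a faint 8 Hz subsonic
# component of the kind that might precede mechanical failure.
sample_rate = 1000
t = np.arange(0, 5, 1 / sample_rate)
signal = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 8 * t)

# Decompose the signal into its frequency components.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)

# Energy below ~20 Hz is inaudible to humans but plainly present in the data.
subsonic = spectrum[(freqs > 0) & (freqs < 20)]
if subsonic.max() > 100:  # arbitrary illustrative threshold
    print("Subsonic vibration detected: flag for engineering review")
```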

From a legal standpoint, this raises profound evidential questions. If a model detects a pattern invisible to human observers, such as an infrared heat signature indicating tampering, or an ultrasonic recording showing concealed speech, should its finding be admissible? How do we cross-examine an inference that no human could have perceived directly?

The law will have to expand its doctrine of evidence to accommodate forms of ‘machine perception’ that transcend human sense, whilst ensuring that the evidential weight of such data is properly tested and explained to human fact-finders.

A new jurisprudence of intelligence

The integration of sensory grounding with linguistic reasoning marks a transition from artificial intelligence as software to artificial intelligence as participant.

When models can perceive, interpret and act within the same physical environment as humans, their decisions become part of the social and moral fabric the law exists to regulate.

The coming decades may therefore demand a jurisprudence of intelligence, a body of law concerned not merely with the actions of intelligent agents but with the nature of intelligence itself.

Such a jurisprudence would draw on constitutional principles to define limits of delegated decision-making, administrative law to ensure transparency and accountability, and human rights to preserve autonomy and dignity in a world increasingly shared with artificial minds.

Our internal debate here at Garfield AI

After drafting this piece, I discussed it with my technical co-founder, Daniel Long. He takes a different view. Daniel agrees that multimodal AI represents an important technical evolution, but he questions whether it constitutes embodiment.

In his view, multimodality adds new data streams into an existing latent space rather than creating a fundamentally different kind of intelligence.

Daniel argues that true embodiment would require continual learning – in other words, systems that can update their understanding through ongoing interaction with the environment, much as humans learn through lived experience.

Without continual learning, he suggests, multimodal models remain sophisticated but static, limited to the patterns they absorbed during pre-training. Drawing on analogies from biology, he likens current models to animals with innate but fixed behaviours, whereas human-level learning depends on the capacity to adapt and reform internal representations across a lifetime.

This is an important distinction. If Daniel is right, then the real transformation may not come from multimodality itself but from the combination of multimodality with continual learning, i.e. an AI that not only perceives but evolves through experience.

For my part, I continue to believe that the addition of perception, even without ongoing learning, represents a meaningful shift.

It moves AI closer to the human condition of grounding knowledge in sensory data, creating a bridge between language and the physical world – although Daniel has won me around with his fundamental argument that continual learning is ultimately crucial.

It will be fascinating to see, in time, which of the two views proves correct and whether intelligence truly deepens through richer perception alone, or whether it demands the capacity to learn continuously from the world it inhabits.

Either way, both possibilities lead us toward the same horizon: a future in which artificial systems participate more fully in the human project of understanding reality, and the law must once again evolve to meet them there.

Conclusion

Law has always evolved alongside new ways of knowing. The printing press gave rise to copyright, the telegraph to communications law, the computer to data protection. Multimodal artificial intelligence will demand the next transformation.

As machines begin to see, hear and interact with the world, they will cease to be mere tools of language and become actors in the factual matrix of law.

Our task as lawyers is not to resist that evolution but to shape it, ensuring that intelligence, however embodied, remains aligned with justice.

Philip Young will be speaking about Garfield AI at our Law Firm Growth Summit on 18 March in London.



