To understand how image enhancement works, I want to take a long detour into the land of information theory. I was prompted to do this by a recent post by Yann LeCun, in which he claims that images contain much more information than words. He uses this to infer that current large language models (LLMs) could be much more powerful if they could process images in a way similar to, or better than, what humans do. This may be true, but it also got me thinking: just how much information does an image contain? I will attempt to convey that the answer is not straightforward. Information is a concept developed by Claude Shannon in 1948 as a mathematical theory of *communication*. And communication is inherently a two-body phenomenon.

Let’s start with a simple question: how many images can we see? Let us only consider images represented on a regular square grid of \(n \times n\) pixels. The image will also have \(c\) colour channels (\(c=3\) for RGB channels), and the intensity of each channel is represented by 8 bits, i.e. the intensity value can take integer values from 0 to 255, for a total of \(2^{8}\) possibilities. In this simplified model, the number of possible images is given by \(2^{8 \times c \times n^{2}}\). If we set \(c=1\) and \(n=256\), this number is of the order of \(10^{10^{5}}\). The situation is often worse in astronomy, since we consider more channels (often 7 or 8) and we encode intensity with 32 bits instead of 8.
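As a sanity check on these magnitudes, here is a short sketch (the function name is my own) that computes the base-10 logarithm of the number of possible images, since the number itself is far too large to write out:

```python
import math

def log10_num_images(n: int, c: int, bits: int = 8) -> float:
    """Base-10 logarithm of the number of distinct images on an
    n-by-n grid with c channels of `bits`-bit intensities,
    i.e. log10(2 ** (bits * c * n**2))."""
    return bits * c * n**2 * math.log10(2)

# c = 1, n = 256: roughly 10**157826 possible images.
print(log10_num_images(256, 1))  # ~157826
```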

This number suggests that we could never observe the same image at two different moments. The probability of doing so is infinitesimally small

\begin{equation}

P = 0.\underbrace{00\dots 0}_{\sim 10^{5} \times}1

\end{equation}

But of course, experience tells us that this is not true. In reality, images are not uniformly distributed in the space of pixel configurations. Some events are much more likely than others. For example, we are more likely to find pictures of cats on the web than anything else. Looking at the night sky, we find stars and faint diffuse objects like galaxies.

Yet, this number can give us an upper-bound estimate of the amount of information that an image can contain. Shannon’s information theory provides one possible way to quantify this amount. By definition, the information content of an event is given by the negative logarithm of its probability. If we suppose a uniform distribution over images, the amount of information contained in an image, measured in bits, would be

\begin{equation}

I = -\log_{2}\left(\frac{1}{2^{8 \times c \times n^{2}}}\right) = 8 \times c \times n^{2}

\end{equation}

Arguably, this is the number that LeCun is referring to. If a word or a token carries 1 byte (8 bits) of information, then an image can carry thousands of words’ worth of information. Thus, in principle, an image could transmit much more information than words. But, as mentioned previously, \(P\) was assumed to be a uniform distribution. In other words, we have assumed that the receiver’s world model is as uninformative as it can be.
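Under the uniform assumption, this bit count is straightforward to compute; a minimal sketch (names are mine), using the 1-byte-per-token convention from the paragraph above:

```python
def image_info_bits(n: int, c: int, bits_per_channel: int = 8) -> int:
    # I = -log2(2 ** -(8 * c * n**2)) = 8 * c * n**2 bits under a uniform prior
    return bits_per_channel * c * n**2

bits = image_info_bits(256, 1)  # 524288 bits
tokens = bits // 8              # 1 byte per token, as in the text
print(bits, tokens)             # 524288 65536
```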

Information is a concept that depends on the world model assumed by the receiver. In other words, information is deeply related to what someone finds interesting or *surprising*; Shannon information is also sometimes called the “surprise”. It is perhaps surprising to link the concept of surprise to the concept of information. After all, surprise is a subjective feeling, while information seems more like a cold hard fact. But really, information is all about updating knowledge. So knowledge should be viewed as what is stored in memory, and information as the process of updating that memory.

A photograph is often used to capture a memory, a trace of what was. Like Deckard searching a photograph for clues in his case, astronomers study the night sky to find clues about what the Universe looked like millions or billions of years ago. Astronomers never do this by assuming a naive uniform distribution over what images look like. Like Deckard, they look for something specific. They train for years to develop an intuition for what images of the night sky look like. They build an internal distribution such that their surprise is minimized when they look at a new image. Until they find a surprising feature, e.g. a supernova or a new planet, at which point they request telescope time for more observations to confirm the discovery.

Bayesian inference is a way to quantify this process. To be more specific, Bayesian inference quantifies how to update a prior world model, \(p(\mathbf{x})\), with new information, \(\mathbf{y}\), in order to obtain a new world model \(p(\mathbf{x} \mid \mathbf{y})\), called the posterior (knowledge). The connection between information theory and Bayesian inference lies in the difference, more specifically the Kullback-Leibler (KL) divergence, between the prior and the posterior. If Bayesian inference is about updating knowledge, then the KL divergence between the prior and the posterior should be the information gained from the observation (\(\mathbf{y}\))

\begin{equation}

D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{ p(x) }{q(x)} d x\, .

\end{equation}
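For discrete distributions, the integral becomes a sum, and the divergence (and its asymmetry) can be computed directly; a minimal sketch with made-up distributions:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, in bits (log base 2).
    Terms with p(x) = 0 contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]      # hypothetical "informed" distribution
q = [1/3, 1/3, 1/3]      # uniform reference
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different value: the divergence is asymmetric
```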

It is called a divergence, rather than a distance, because it is not symmetric: in general \(D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p)\). The KL divergence is also related to the concept of entropy, which can be viewed as the expected information gained from observing a state \(x\) when using a world model \(p(x)\)

\begin{equation}

H(p) = \mathbb{E}_p[I_p(x)] = -\int p(x) \log p(x) dx\, .

\end{equation}

I introduced \(I_p(x) = -\log p(x)\) for the information content of a state \(x\) under the model \(p(x)\). Using these two equations, it can be shown that

\begin{equation}

D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{p}[I_q(x)] - \mathbb{E}_p[I_p(x)]\, .

\end{equation}

\(H(p, q) = \mathbb{E}_p[I_q(x)]\) is the cross entropy, i.e. the expected information, as measured by \(q\), when generating states according to \(p\).
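This decomposition is easy to verify numerically for discrete distributions; a short sketch with arbitrary example distributions:

```python
import math

def entropy(p):
    """H(p) = E_p[-log2 p(x)], in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = E_p[-log2 q(x)], in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # D_KL(p || q) = H(p, q) - H(p)
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]
q = [0.25, 0.25, 0.5]
# The decomposition agrees with the direct definition of the divergence.
direct = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
assert abs(kl(p, q) - direct) < 1e-12
```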

If we set \(p \mapsto \mathrm{posterior}\) and \(q \mapsto \mathrm{prior}\), then the KL divergence becomes the expected information gained from the observation

\begin{equation}

D_{\mathrm{KL}}(\mathrm{posterior}\parallel \mathrm{prior}) = \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{prior}}(x)] - \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{posterior}}(x)] \, .

\end{equation}

Knowing that the KL divergence is always non-negative (when it exists), we must have

\begin{equation}

\mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{posterior}}(x)] \leq \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{prior}}(x)]\, .

\end{equation}

On average, the information content of an image must *decrease* (or stay the same) when we update our knowledge of the world with new information. This is perhaps surprising at first, but in many ways we have an intuitive sense of it in our daily lives. We often stop looking at common objects because we have already seen them before. We expect them to look the same. They do not surprise us anymore. In other words, they do not change the state of our knowledge of the world, or provide us with new information.
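This expected decrease can be seen in a toy Bayesian update; the following sketch (a made-up three-hypothesis coin model) measures prior and posterior surprise under the posterior, whose difference is exactly the KL divergence above:

```python
import math

# Toy Bayesian update: an unknown coin bias on a small discrete grid.
biases = [0.1, 0.5, 0.9]
prior = [1/3, 1/3, 1/3]          # uniform prior world model

# Observe y = "heads"; the likelihood is p(y | bias) = bias.
unnorm = [pr * b for pr, b in zip(prior, biases)]
posterior = [u / sum(unnorm) for u in unnorm]

# Expected surprise under the posterior, measured by each model.
info_prior = sum(po * -math.log2(pr) for po, pr in zip(posterior, prior))
info_post = sum(po * -math.log2(po) for po in posterior)

# Their difference is D_KL(posterior || prior) >= 0: after the update,
# the world model is (on average) less surprised by what it sees.
assert info_post <= info_prior
print(info_prior - info_post)  # information gained, in bits
```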

Curious, I asked ChatGPT (GPT-4), OpenAI’s LLM, to generate two images. I prompted it to first generate an image with the least amount of information according to its internal world model (whatever that may be), and it generated a blank canvas with clean brushes and other painting tools ready to be used to draw a picture. I then asked it to generate an image with a lot of information, and it generated a very intricate tapestry. According to the LLM, it depicts the entire history of human civilization intertwined with the natural world. I find that the low-information image communicates something relatively simple that can be described in a few words, while the high-information image isn’t captured by the few words I chose. An entire blog post would probably be needed to describe it fully. But this might just be me. Information depends on the receiver. One might find much more meaning in the blank canvas than I do.

In any case, I would argue that the blank canvas image is a bit of a waste of memory. It can be encoded much more efficiently using words rather than an image. And this is where I would slightly disagree with LeCun’s premise that images contain more information than words. While images can contain a lot of information, most images can be efficiently compressed into a few words. A few words can transmit the same message if the receiver has the prerequisite world model for language. Rather, I believe that LeCun is alluding to channel capacity when he says that images can transmit more information.