Alexandre Adam

Edited by Salma Salhi and Sacha Perry-Fagant

Echoes in the noise

In the 1982 movie Blade Runner, there is a scene in which Deckard (Harrison Ford) is trying to find a clue in a photograph. He places a photographic paper in what looks like a scanner and then asks the computer to zoom in and enhance the image, multiple times, to the point of ridicule. In this blog post, I want to ask the question: is this even possible? And if so, how? While we are at it, I will also tell you about my recent paper which uses Bayesian inference to enhance images from the Hubble Space Telescope (HST) to the point of being comparable to the James Webb Space Telescope (JWST).

To understand how image enhancement works, I want to take a long detour into the land of information theory. I was prompted to do this by a recent post by Yann LeCun, in which he claims that images contain much more information than words. He uses this to argue that current large language models (LLMs) could be much more powerful if they could process images in a way similar to, or better than, what humans do. This may be true, but it also got me thinking: just how much information does an image contain? I will attempt to convey that the answer is not straightforward. Information is a concept that was formalized by Claude Shannon in 1948 as a mathematical theory of communication, and communication is inherently a two-body phenomenon.

Let’s start with a simple question: how many distinct images could we possibly see? Let us only consider images represented on a regular square grid of \(n \times n\) pixels. The image will also have \(c\) colour channels (\(c=3\) for RGB channels), and the intensity of each channel is represented by 8 bits, i.e. the intensity value can take values from 0 to 255, for a total of \(2^{8}\) possibilities. In this simplified model, the number of possible images is given by \(2^{8 \times c \times n^{2}}\). If we set \(c=1\) and \(n=256\), this number is of the order of \(10^{10^{5}}\). The numbers are even larger in astronomy, since we consider more channels (often 7 or 8) and we encode intensity with 32 bits instead of 8.

This number suggests that we could never observe the same image at two different moments. The probability of doing so is infinitesimally small

\begin{equation}
P = 0.\underbrace{00\dots 0}_{\sim 10^{5} \text{ zeros}}1
\end{equation}

But of course, experience tells us that this is not true. In reality, images are not uniformly distributed in the space of pixel configurations. Some events are much more likely than others. For example, we are more likely to find pictures of cats on the web than anything else. Looking at the night sky, we find stars and faint diffuse objects like galaxies.

Yet, this number can give us an upper bound estimate of the amount of information that an image can contain. Shannon’s information theory provides one possible way to quantify this amount. By definition, the information content of an outcome is the negative logarithm of its probability. If we suppose a uniform distribution over images, the amount of information contained in an image, measured in bits, would be
\begin{equation}
I = -\log_{2}\left(\frac{1}{2^{8 \times c \times n^{2}}}\right) = 8 \times c \times n^{2}
\end{equation}
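
To make these numbers concrete, here is the back-of-the-envelope calculation in Python (the pixel count, channel count, and bit depth are simply the example values used above):

```python
import math

def image_information_bits(n: int, c: int, bit_depth: int = 8) -> int:
    """Bits required to specify one image on an n x n grid with c channels."""
    return bit_depth * c * n * n

bits = image_information_bits(n=256, c=1)   # 524,288 bits
decimal_exponent = bits * math.log10(2)     # number of possible images is 2**bits

print(f"Information content (uniform prior): {bits} bits")
print(f"Number of possible images: 2^{bits} ~ 10^{decimal_exponent:,.0f}")
```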

Arguably, this is the number that LeCun is referring to. If a word or a token contains 1 byte (8 bits) of information, then an image can contain thousands of words worth of information. Thus, in principle, an image could transmit much more information than words. But, as mentioned previously, \(P\) was assumed to be a uniform distribution. In other words, we have assumed that the world model of the receiver is as uninformative as it can be.

Information is a concept that depends on the world model assumed by the receiver. In other words, information is deeply related to what someone finds interesting or surprising. Shannon information is also sometimes called the “surprisal”. It is perhaps surprising to link the concept of surprise to the concept of information. After all, surprise is a subjective feeling, while information seems more like a cold hard fact. But really, information is all about updating knowledge: knowledge is what is stored in memory, and information is what updates it.

A photograph is often used to capture a memory, a trace of what was. Like Deckard, searching for a clue for his case in a photograph, astronomers study the night sky to find clues about what the Universe looked like millions or billions of years ago. Astronomers never do this by assuming a naive uniform distribution over what images look like. Like Deckard, they look for something specific. They train for years to develop an intuition for what images of the night sky look like. They build an internal distribution such that their surprise is minimized when they look at a new image, until they find a surprising feature, e.g. a supernova or a new planet, for which they then request telescope time to confirm their discovery with more observations.

Bayesian inference is a way to quantify this process. To be more specific, Bayesian inference quantifies how to update a prior world model, \(p(\mathbf{x})\), with new information, \(\mathbf{y}\), in order to obtain a new world model \(p(\mathbf{x} \mid \mathbf{y})\), called the posterior (knowledge). The connection between information theory and Bayesian inference lies in the difference, more specifically the Kullback-Leibler (KL) divergence, between the prior and the posterior. If Bayesian inference is about updating knowledge, then the KL divergence between the prior and the posterior should be the information gained from the observation (\(\mathbf{y}\))

\begin{equation}
D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{ p(x) }{q(x)} d x\, .
\end{equation}
It is called a divergence rather than a distance because it is not symmetric, unlike the concept of distance in geometry. The KL divergence is also related to the concept of entropy, which can be viewed as the expected information gained from observing a state \(x\) when using a world model \(p(x)\)
\begin{equation}
H(p) = \mathbb{E}_p[I_p(x)] = -\int p(x) \log p(x) dx\, .
\end{equation}
I introduced \(I_p(x) = -\log p(x)\) for the information content of a state \(x\) under the model \(p(x)\). Using these two equations, it can be shown that
\begin{equation}
D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{p}[I_q(x)] - \mathbb{E}_p[I_p(x)]\, .
\end{equation}
\(H(p, q) = \mathbb{E}_p[I_q(x)]\) is the cross entropy, i.e. the expected information, as measured by \(q\), when generating states according to \(p\).
If we set \(p \mapsto \mathrm{posterior}\) and \(q \mapsto \mathrm{prior}\), then the KL divergence becomes the expected information gained from the observation
\begin{equation}
D_{\mathrm{KL}}(\mathrm{posterior}\parallel \mathrm{prior}) = \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{prior}}(x)] - \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{posterior}}(x)] \, .
\end{equation}
Knowing that the KL divergence is always zero or positive (when it exists), we must have
\begin{equation}
\mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{posterior}}(x)] \leq \mathbb{E}_{\mathrm{posterior}}[I_{\mathrm{prior}}(x)]\, .
\end{equation}
That is, the expected information content of an image must decrease (or stay the same) when we update our knowledge of the world with new information. This is perhaps surprising at first, but in many ways, we have an intuitive sense of this in our daily lives. We often stop looking at common objects because we have already seen them before. We expect them to look the same. They do not surprise us anymore. In other words, they do not change the state of our knowledge of the world or provide us with new information.
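
As a quick numerical sanity check of the identity and the inequality above, here is a small example with two made-up discrete distributions standing in for a posterior and a prior:

```python
import numpy as np

posterior = np.array([0.5, 0.25, 0.15, 0.10])  # made-up "posterior" over 4 states
prior = np.array([0.25, 0.25, 0.25, 0.25])     # made-up "prior" (uniform)

entropy = -np.sum(posterior * np.log2(posterior))        # E_post[I_post(x)]
cross_entropy = -np.sum(posterior * np.log2(prior))      # E_post[I_prior(x)]
kl = np.sum(posterior * np.log2(posterior / prior))      # D_KL(posterior || prior)

print(f"E[I_posterior] = {entropy:.3f} bits")
print(f"E[I_prior]     = {cross_entropy:.3f} bits")
print(f"D_KL           = {kl:.3f} bits")  # = cross_entropy - entropy >= 0
assert np.isclose(kl, cross_entropy - entropy) and kl >= 0
```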

Curious, I asked ChatGPT (GPT-4), OpenAI’s LLM, to generate two images. I first prompted it to generate an image with the least amount of information according to its internal world model (whatever that may be), and it generated a blank canvas with clean brushes and other painting tools ready to be used to draw a picture. I then asked it to generate an image with a lot of information, and it generated a very intricate tapestry. According to the LLM, it depicts the entire history of human civilization intertwined with the natural world. I find that the low-information image communicates something relatively simple that can be described in a few words, while the high-information image isn’t captured by the few words I chose. An entire blog post would probably be needed to describe it fully. But this might just be me. Information depends on the receiver. One might find much more meaning in the blank canvas than I do.

In any case, I would argue that the blank canvas image is a bit of a waste of memory. It can be encoded much more efficiently using words rather than an image. And this is where I would slightly disagree with LeCun’s premise that images contain more information than words. While images can contain a lot of information, most images can be efficiently compressed in a few words. A few words can transmit the same message if the receiver has the prerequisite world model for language. Rather, I believe that LeCun is alluding to the channel capacity when he says that images can transmit more information.

But first, how is all this related to enhancing images? Well, if Deckard can zoom in on a photograph using a computer, then he must be using extensive prior knowledge to fill in the gaps. This is especially true if the information he seeks is not directly stored on the photographic paper. Rather, the information must be inferred using Deckard’s world model (and/or the computer’s world model) and using the patterns stored on the paper.

In other words, enhancing an image can be viewed as filling in the missing bits of information from a mixture of what is available in the observed world and the prior knowledge available. Bayesian inference is the perfect tool for this, which is why I set out to ask the question: can we use Bayesian inference to enhance images from the HST to the point of being comparable to those from the JWST?

When HST takes a picture of the sky, it must collect photons in a bucket, also called an aperture. This aperture has a distinctive response function, which is often called the point spread function (PSF). The PSF is a measure of how the telescope blurs the image of a point source. As it turns out, this effect is mainly a function of the size, \(D\), of the aperture. More specifically, the blur will have an extent that is proportional to \(\lambda/D\), where \(\lambda\) is the wavelength of the light. This is why the JWST, with a larger aperture, can take sharper images than the HST. The HST has a 2.4m aperture, while the JWST has a 6.5m aperture. This means that, in theory, the JWST can take images that are 2.7 times sharper than the HST (at the same wavelength).
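
To put rough numbers on this, here is the standard Rayleigh criterion (\(\theta \approx 1.22\,\lambda/D\)) evaluated for both apertures; the 1.5 µm wavelength is an arbitrary near-infrared choice for illustration:

```python
import math

RAD_TO_ARCSEC = 180 / math.pi * 3600

def diffraction_limit_arcsec(wavelength_m: float, aperture_m: float) -> float:
    """Angular resolution (Rayleigh criterion) of a circular aperture, in arcseconds."""
    return 1.22 * wavelength_m / aperture_m * RAD_TO_ARCSEC

wavelength = 1.5e-6  # 1.5 micron, an arbitrary near-infrared wavelength
hst = diffraction_limit_arcsec(wavelength, aperture_m=2.4)
jwst = diffraction_limit_arcsec(wavelength, aperture_m=6.5)

print(f"HST  ~ {hst:.3f} arcsec")             # ~0.16 arcsec
print(f"JWST ~ {jwst:.3f} arcsec")            # ~0.06 arcsec
print(f"JWST is ~{hst / jwst:.1f}x sharper")  # 6.5 / 2.4 ~ 2.7
```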

Thus, the first thing to do to improve HST images would be to remove the effect of the PSF. This is easier said than done. Moreover, space images are also affected by noise. Noise is a sort of catch-all term to describe anything that is not related to the signal we care about. The signal could be the photons from a distant faint galaxy, for example. But the telescope also collects photons from the sky background, from the telescope itself emitting like a black body, from the electronics, and so on. There are even charged particles, mostly emitted from the Sun, that hit the camera and cause pixels to saturate. Three examples of HST noise are shown in the figure below.

We need to separate the signal from the noise. This is where Bayesian inference comes in. We can use the prior knowledge of what the signal looks like to generate plausible states for the underlying signal given an observation. Let’s encode the prior knowledge in the distribution \(p(\mathbf{x})\), where \(\mathbf{x}\) is a pixelated image representation of the ideal signal. I’ll touch on why we chose this representation later. We also need a model for the noise and the measurement process such that we can compare the state \(\mathbf{x}\) with the observed raw data from the HST, \(\mathbf{y}\). This, we call the likelihood \(p(\mathbf{y} \mid \mathbf{x})\). Thomas Bayes wrote in his 1763 paper called “An Essay towards Solving a Problem in the Doctrine of Chances” that we can use these two distributions to obtain the posterior distribution \(p(\mathbf{x} \mid \mathbf{y})\), a theorem that now bears his name:
\begin{equation}
p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x}) p(\mathbf{x})}{p(\mathbf{y})}\, .
\end{equation}

A careful reader will notice that I neglected to mention \(p(\mathbf{y})\), called the evidence. We can completely ignore this term with a few simple manipulations of Bayes’s theorem. We first take the logarithm on both sides:
\begin{equation}
\log p(\mathbf{x} \mid \mathbf{y}) = \log p(\mathbf{y} \mid \mathbf{x}) + \log p(\mathbf{x}) - \log p(\mathbf{y})\, .
\end{equation}
We then take the gradient with respect to \(\mathbf{x}\)
\begin{equation}
\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) = \nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x}) + \nabla_{\mathbf{x}} \log p(\mathbf{x})\, .
\end{equation}
The evidence term drops out when we take the gradient, since it does not depend on \(\mathbf{x}\). The quantity we are left with is called the score of the posterior, which is a generalization of the Fisher score in statistics. It’s interesting to note that the posterior score points in the direction that decreases the information content, \(-\log p(\mathbf{x} \mid \mathbf{y})\), of an image \(\mathbf{x}\).
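
To make the decomposition above concrete, here is a minimal sketch in which both terms are analytic: a linear Gaussian likelihood (the forward matrix \(A\) and noise level \(\sigma\) are placeholders standing in for the PSF and noise model) and, as a stand-in for the learned prior used in the paper, a standard Gaussian. The only point is that the posterior score is the sum of the two terms, and the evidence \(p(\mathbf{y})\) never appears:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                       # toy number of pixels
A = rng.normal(size=(d, d)) / np.sqrt(d)     # placeholder linear forward model (e.g. a blur)
sigma = 0.1                                  # placeholder noise standard deviation

def likelihood_score(x, y):
    """Score of a Gaussian likelihood y ~ N(Ax, sigma^2 I), taken w.r.t. x."""
    return A.T @ (y - A @ x) / sigma**2

def prior_score(x):
    """Stand-in prior N(0, I); in the paper this is a trained score network."""
    return -x

def posterior_score(x, y):
    # Sum of the two scores; log p(y) is constant in x, so it never shows up.
    return likelihood_score(x, y) + prior_score(x)

x_true = rng.normal(size=d)
y = A @ x_true + sigma * rng.normal(size=d)
print(posterior_score(np.zeros(d), y)[:4])   # posterior gradient direction at x = 0
```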

To perform Bayesian inference, we must model the likelihood score and the prior score functions as accurately as we can. In our paper, we model the score of the likelihood and the score of the prior directly using a method called score matching. I would highly recommend Yang Song’s blog post for an in-depth discussion of score matching. For the purpose of brevity, I will simply say that score matching is the process of training a deep neural network to match the score of a distribution while only being given samples from that distribution. This is particularly useful when we don’t have a simple model for space images or noise. This is to be contrasted with what is typically done in practice, where simplified models are often assumed for either or both.
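
For a flavour of how such a network can be trained, here is a minimal sketch of the denoising score matching objective with a toy network; the architecture, noise level, and data are placeholders, and the full noise-conditional version used in practice is described in Yang Song’s post:

```python
import torch
import torch.nn as nn

class TinyScoreNet(nn.Module):
    """Placeholder score network; real models are far larger (e.g. U-Nets)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    # Perturb the data with Gaussian noise; the score of the perturbation kernel
    # p(x_noisy | x), evaluated at x_noisy, is -(noise) / sigma^2, which is the target.
    noise = torch.randn_like(x) * sigma
    x_noisy = x + noise
    target = -noise / sigma**2
    return ((score_net(x_noisy) - target) ** 2).mean()

# One illustrative optimisation step on fake data
net = TinyScoreNet(dim=64)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
x_batch = torch.randn(32, 64)                # stand-in for a batch of flattened images
loss = denoising_score_matching_loss(net, x_batch)
loss.backward()
optimizer.step()
print(float(loss))
```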

A common strategy to simplify the likelihood is to preprocess the data to obtain a likelihood as close to Gaussian as possible. This is because a Gaussian likelihood is analytically tractable: we can write it down with pen and paper. The most common such preprocessing is called Drizzle. This algorithm is particularly effective at removing cosmic rays. The idea is to take multiple images of the same patch of the sky and align them to the same reference frame. The images are then combined in such a way that the noise is reduced, for example by taking a weighted median of the exposures, which is a robust estimator of the true signal.
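
To illustrate the robustness argument (and only that; the real Drizzle algorithm also resamples the dithered exposures onto a finer grid), here is a toy comparison of mean versus median stacking of aligned exposures contaminated by a few fake cosmic rays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stack of 5 aligned exposures of the same 64x64 patch of sky
true_signal = rng.gamma(shape=2.0, scale=1.0, size=(64, 64))
exposures = true_signal + rng.normal(scale=0.5, size=(5, 64, 64))

# Inject a few fake "cosmic ray" hits into random exposures
for _ in range(20):
    k, i, j = rng.integers(5), rng.integers(64), rng.integers(64)
    exposures[k, i, j] += 100.0

combined_mean = exposures.mean(axis=0)            # sensitive to the outliers
combined_median = np.median(exposures, axis=0)    # robust to them

print("max error (mean):  ", np.abs(combined_mean - true_signal).max())
print("max error (median):", np.abs(combined_median - true_signal).max())
```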

But, even though Drizzle is a robust estimator, we can also show that some information must be lost in the resulting image. For some downstream inference tasks, this missing information can potentially be crucial. This statement is a direct consequence of the data processing inequality in information theory. To make this statement more precise, let us introduce the mutual information between two random variables \(X\) and \(Y\), which is defined as the KL divergence between the joint distribution and the product of the marginal distributions
\begin{equation}
I(X; Y) = D_{\mathrm{KL}}(p(x, y) \parallel p(x)p(y))
\end{equation}
This divergence is zero when \(X\) and \(Y\) are independent, in which case the joint factorizes as \(p(x, y) = p(x)p(y)\). Now, let us consider the Markov chain \(X \rightarrow Y \rightarrow Z\), where \(X\) is the ideal signal, \(Y\) is the observed image, and \(Z\) is the processed image. The data processing inequality states that
\begin{equation}
I(X; Y) \geq I(X; Z)\, .
\end{equation}
This means that the processed image cannot contain more information about the ideal signal than the observation does. In other words, any preprocessing of the data must lose some information, or, in the ideal case, keep the same amount. This is a fundamental limitation of any data processing algorithm. It is also a fundamental limitation of any inference algorithm.
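
Here is a small numerical illustration of the data processing inequality with discrete variables; the distributions and the lossy “processing” channel are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def mutual_information(joint):
    """I(A; B) in bits, computed from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Markov chain X -> Y -> Z, each with 4 states
p_x = np.array([0.4, 0.3, 0.2, 0.1])
p_y_given_x = rng.dirichlet(np.ones(4) * 0.5, size=4)  # noisy "observation" channel
p_z_given_y = rng.dirichlet(np.ones(4) * 5.0, size=4)  # lossy "processing" channel

joint_xy = p_x[:, None] * p_y_given_x                  # p(x, y)
joint_xz = joint_xy @ p_z_given_y                      # p(x, z): Z depends on X only through Y

print(f"I(X; Y) = {mutual_information(joint_xy):.3f} bits")
print(f"I(X; Z) = {mutual_information(joint_xz):.3f} bits")  # never exceeds I(X; Y)
```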

In our paper, we perform Bayesian inference directly from the raw data using accurate score models, circumventing the preprocessing entirely. This is perhaps the first time such an analysis has been possible, thanks to recent advances in deep learning and the availability of large datasets to train the score models. Such advances also allow us to choose very flexible models, like pixelated grids, to represent the ideal signal.

But, any choice of model for \(\mathbf{x}\) will potentially limit the amount of information we can extract from the data. This is best quantified by the channel capacity in information theory. The capacity is defined as the supremum of the mutual information between the ideal signal \(X\) and the observation \(Y\), taken over all possible world models \(p(\mathbf{x})\)
\begin{equation}
C = \underset{p(\mathbf{x})}{\mathrm{sup}}\, I(X; Y)\, .
\end{equation}
In other words, \(C\) is the upper limit of the information we can theoretically extract. We can also talk about a specific capacity for a specific choice of a world model \(p(\mathbf{x})\). The theoretically optimal enhancement will have a world model with a specific capacity at least equal to the entropy of the observation. This follows from the definition of the mutual information
\begin{equation}
I(X; Y) = H(Y) - H(Y \mid X)\, ,
\end{equation}
which is maximized when \(H(Y \mid X) = 0\). In astronomy, \(Y\) is an image. \(X\) is also modelled as an image in the most general approach for image enhancement, especially if we want to preserve as much information as possible. Furthermore, the world model \(p(\mathbf{x})\) must be chosen so that the conditional entropy \(H(Y \mid X)\) is as small as possible. This suggests that the prior must be chosen to be as informative as possible to minimize \(H(Y \mid X)\), which maximizes the mutual information and brings it as close as possible to the channel capacity.
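
As a toy example of this identity, here is the capacity of a binary symmetric channel, where the supremum over input distributions can be found by brute force; the channel has nothing to do with telescopes and is only meant to make “capacity” concrete:

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def mutual_information_bsc(p_input, flip_prob):
    """I(X; Y) = H(Y) - H(Y|X) for a binary symmetric channel."""
    p_y1 = p_input * (1 - flip_prob) + (1 - p_input) * flip_prob
    return binary_entropy(p_y1) - binary_entropy(flip_prob)

flip_prob = 0.1
capacity = max(mutual_information_bsc(p, flip_prob) for p in np.linspace(0, 1, 1001))
print(f"Capacity ~ {capacity:.3f} bits per channel use")  # = 1 - H_b(0.1) ~ 0.531
```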

Conclusion

Coming back to LeCun’s post, I believe we now have the tools to better understand what he means when he says that processing images (and video, for that matter) is probably a crucial step for any sentient AI to emerge. Humans have most likely developed language in a way that best matches the rate at which they can process information. This is why it has a relatively low bandwidth limit, i.e. a low capacity, compared with images. On the other hand, images can potentially carry a much larger volume of information since they have a large capacity. This is better suited for computers, which process information at a very high rate, on the condition that these computers have the world model to process that information.

Modelling data as images is well suited for astronomy since the observations of the world are also images. Observations in astronomy have a potentially high information content, but traditional methodologies fall short of fully exploiting this richness. This leaves a vast frontier for discovery within the natural world. As technological advancements yield increasingly refined images from both space and ground telescopes, the necessity of revising our prior assumptions (\(p(\mathbf{x})\)) about the cosmos becomes paramount. By doing so, we enhance our ability to distill knowledge from our observations, to the point that previously indistinct images captured by the HST can be transformed into images with clarity comparable to those from the JWST.