Stable Diffusion is a machine learning tool that generates images from text. But one developer found that it can also be used to compress an image to a level that surpasses standards like JPEG and WebP. The resulting file may still show visual artifacts, but only to a small degree.
On the web, where image optimization is so important that it is encouraged by Google, this discovery could be useful to online stores, social networks, and sites like Tecnoblog.
Explaining this Stable Diffusion thing
Let's say Stable Diffusion is the toy of the moment. Like DALL-E, the engine can generate images from text instructions in a matter of minutes. Many of them are impressive.
You can try it yourself: Stable Diffusion is available on sites such as Hugging Face. There, I typed the following words into the text field: house, car, and sky. The result was this:
Cool, but what about image compression?
If, on the one hand, tools like Stable Diffusion have been bothering artists by flooding communities with artificially generated images, on the other hand they attract the attention of many artificial intelligence enthusiasts.
That is the case of software engineer Matthias Bühlmann. While testing Stable Diffusion, he found that the tool relies on three artificial neural networks. One of them is the Variational Autoencoder (VAE), which encodes an image into a latent space and decodes it back.
Think of the latent space as an intermediate representation of the expected result. Making a rough comparison, it is as if this space held a sketch or a reduced version of the image.
In fact, the latent space holds a lower-resolution representation of the original image that still preserves its essential details. Thanks to this, the image can be expanded back without losing its character.
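To get a feel for how much smaller that latent representation is, here is a rough size calculation. The downsampling factor and channel count below are those commonly reported for the v1 Stable Diffusion VAE (8× per spatial dimension, 4 channels); treat them as assumptions, since the article itself doesn't give numbers.

```python
# Rough size arithmetic for Stable Diffusion's latent space (v1 models):
# the VAE downsamples each spatial dimension by 8 and keeps 4 channels.

def latent_shape(height, width, downsample=8, channels=4):
    """Shape of the latent tensor for a given input image size."""
    return (height // downsample, width // downsample, channels)

h, w = 512, 512
lh, lw, lc = latent_shape(h, w)

pixels_in = h * w * 3            # RGB values in the original image
values_in_latent = lh * lw * lc  # values in the latent representation

print(latent_shape(h, w))            # (64, 64, 4)
print(pixels_in // values_in_latent) # 48 -> 48x fewer values to store
```

So even before any clever entropy coding, the latent space describes the picture with dozens of times fewer numbers than the raw pixels.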
The latent space is where another neural network, the U-Net, kicks in. Random noise is inserted into the space, and the mechanism generates predictions about what it "sees" there. It is as if the algorithm were a person trying to identify shapes in clouds.
This process removes the noise in a way consistent with the expected result, and it works together with the third neural network, the text encoder, which guides what the U-Net should try to "see".
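The noise-removal idea above can be sketched in a few lines. This is a deliberately toy version: `predict_noise` below is a hypothetical stand-in for the U-Net (here it trivially knows the clean target), and the loop just shows the shape of the process, in which each step removes a fraction of the predicted noise.

```python
import numpy as np

# Toy sketch of iterative denoising (NOT the real U-Net or scheduler):
# at each step a "model" predicts the noise in the current latents,
# and a fraction of that prediction is subtracted.

rng = np.random.default_rng(0)
target = np.zeros((4, 4))            # stand-in for the "clean" latents
latents = rng.normal(size=(4, 4))    # start from pure random noise

def predict_noise(x):
    # Hypothetical stand-in for the U-Net: here the "noise" is simply
    # the difference between the current state and the clean target.
    return x - target

for step in range(50):
    latents = latents - 0.2 * predict_noise(latents)

print(np.abs(latents).max() < 1e-3)  # the noise is (almost) gone
```

In the real system, of course, the model has no access to the target; it was trained to predict the noise from the noisy latents alone, with the text encoder steering the prediction.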
This is a very simplified explanation. What matters is knowing that all of this is combined to generate the image the user requested.
During this process, the image has “impurities” removed. This is not necessarily to make it smaller, but to make the result more accurate.
During his experiment, Bühlmann discovered that the Stable Diffusion algorithms can be adapted to do image compression alone. To achieve this, he removed the text encoder but kept the procedures related to image processing.
In testing, he found the results quite convincing. A photo of a llama came out at 6.74 KB when compressed to WebP and 5.66 KB in JPEG. With Stable Diffusion, the file was 4.97 KB and, even so, preserved more detail than the other two.
The result is not perfect, to be clear. Bühlmann noticed that faces and text in images can come out less legible. But he believes the mechanism can be adapted and trained to overcome these limitations.
For those who want to experiment with or even contribute to the project, Bühlmann has published the source code of his work on Google Colab.
With information: Ars Technica.