Autoencoders & Variational Autoencoders
Written by Xi Huang
What is an autoencoder?
A traditional autoencoder is a neural network built around
the idea of learning a latent representation of its input, using what is also known
as the encoder-decoder architecture. As the name suggests, it consists
of two main components: the encoder and the decoder.
The job of the encoder is to take in a data point and convert it into
a low-dimensional embedding of the original input, in other words, a
latent representation. For example, if we train an autoencoder on images of
handwritten digits from the MNIST dataset,
the output of the trained encoder can be a condensed representation of the digit that the
input image portrays, such as the combination of features that
distinguishes that digit across its many handwritten depictions.
The decoder, on the other hand, is designed to take in the low-dimensional
embedding and “reconstruct” the input data. The reconstruction may not be identical
to the input, since the decoder only sees the embedding, but the goal of training
is to output something close to the input, minimizing the information lost
in the embedding process as measured by a loss function. For the MNIST example,
the autoencoder can take a handwritten digit as input, extract the digit’s features
into an embedding with the encoder, and regenerate the “handwritten”
look of the same digit with the decoder. This process is unsupervised: we do
not need labeled data, because the output is compared directly with the
input.
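To make the encoder, decoder, and loss concrete, here is a minimal sketch in plain Python. It is not an MNIST model: it assumes a toy dataset of 2-D points on a line, a linear encoder with weights `w`, a linear decoder with weights `v`, a mean squared error loss, and plain gradient descent. All of these names and settings are illustrative choices, not part of any library.

```python
# Toy data: 2-D points that all lie on the line x2 = 2 * x1,
# so a single latent number per point is enough to describe them.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(x, w):
    """Encoder: compress a 2-D input into a 1-D latent code."""
    return w[0] * x[0] + w[1] * x[1]


def decode(z, v):
    """Decoder: expand the 1-D code back into a 2-D reconstruction."""
    return (v[0] * z, v[1] * z)


def reconstruction_loss(w, v):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(steps=500, lr=0.05):
    """Fit encoder and decoder weights with plain gradient descent."""
    w = [0.1, 0.1]  # encoder weights (arbitrary small starting values)
    v = [0.1, 0.1]  # decoder weights (arbitrary small starting values)
    initial_loss = reconstruction_loss(w, v)
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = encode(x, w)
            x_hat = decode(z, v)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            # Gradient of the squared error with respect to the decoder weights.
            for i in range(2):
                grad_v[i] += 2.0 * err[i] * z / n
            # Error signal passed back through the decoder to the latent code.
            back = 2.0 * (err[0] * v[0] + err[1] * v[1])
            for j in range(2):
                grad_w[j] += back * x[j] / n
        for i in range(2):
            w[i] -= lr * grad_w[i]
            v[i] -= lr * grad_v[i]
    return w, v, initial_loss, reconstruction_loss(w, v)


if __name__ == "__main__":
    w, v, initial_loss, final_loss = train()
    print("loss before training:", initial_loss)
    print("loss after training:", final_loss)
    print("reconstruction of (0.5, 1.0):", decode(encode((0.5, 1.0), w), v))
```

Note that no labels appear anywhere: the only training signal is the difference between each input and its own reconstruction.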
Now, what is a variational autoencoder (VAE)?
A variational autoencoder is an adaptation of the traditional autoencoder. Instead of embedding the input into a single vector, the encoder maps the input to a probability distribution over the latent space. A latent vector is then sampled from that distribution, and the decoder constructs a generated output from the sample. Going back to our MNIST example, we now have a model that can embed a digit as a distribution over its latent characteristics, and then recreate a “handwritten” portrayal of the digit from features sampled from that learned distribution. In addition, during training, KL divergence is usually used as a regularizer, since it measures how much one distribution differs from another; in a VAE it is typically computed between the encoded distribution and a chosen prior.
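The two VAE-specific pieces, sampling a latent vector and the KL regularizer, can be sketched in a few lines. This sketch assumes the common setup in which the encoder outputs a mean `mu` and a log-variance `log_var` for a diagonal Gaussian, and the prior is a standard normal; the function names are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z from N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, 1).

    Assumes a diagonal Gaussian, so each latent dimension is sampled
    independently. sigma is recovered from the log-variance.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.2], [0.5, -1.0]
    print("sampled latent vector:", sample_latent(mu, log_var, rng))
    print("KL regularizer:", kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when the encoded distribution already equals the assumed prior (`mu` all zeros and `log_var` all zeros) and grows as the two drift apart, which is what lets it act as a regularizer alongside the reconstruction loss.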
One main advantage of the VAE over the traditional autoencoder is that sampling from the embedded distribution allows the model to generate new data that come from the same embedding distribution as the input data. A traditional autoencoder, by contrast, is deterministic: a given input always maps to the same embedding, so it can only produce one output similar to that input.
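The contrast can be illustrated with a toy stand-in for a trained decoder. Everything here is hypothetical: `toy_decoder` is a fixed function rather than a learned network, and the latent code is one-dimensional, purely to show deterministic versus sampled behavior.

```python
import random


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: 1-D code -> 2-D output."""
    return (z, 2.0 * z)


def autoencoder_outputs(code, n):
    """Traditional autoencoder: a fixed code always decodes to the same output."""
    return [toy_decoder(code) for _ in range(n)]


def vae_outputs(mu, sigma, n, rng):
    """VAE: each output is decoded from a fresh sample of N(mu, sigma^2)."""
    return [toy_decoder(rng.gauss(mu, sigma)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("autoencoder:", autoencoder_outputs(0.5, 3))
    print("VAE:", vae_outputs(0.5, 0.1, 3, rng))
```

Decoding the same code repeatedly gives identical outputs, while decoding repeated samples gives a set of related but distinct outputs.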
In my current project, we are exploring the use of VAEs to generate music data. We work with a dataset called the Lakh MIDI Dataset. Since sequential information is essential when it comes to music, we are experimenting with a bidirectional LSTM encoder to capture temporal relationships in the data. Similarly, an LSTM is used for the decoder to recreate a music sequence. For related previous work, see the blog post on "MusicVAE" [3].
References
[1] C. Doersch, Tutorial on Variational Autoencoders (2021), arXiv
[2] P. Janetzky, Generative Networks: From AE to VAE to GAN to CycleGAN (2021), Towards Data Science
[3] A. Roberts, J. Engel, C. Raffel, I. Simon, C. Hawthorne, MusicVAE: Creating a palette for musical scores with machine learning (2018), Magenta