Especially suited to processing time-variant data such as audio, or spatial frames such as images. The network structure is such that shared parameters are multiplied and summed over localized regions, acting like a convolution filter over the inputs. This reproduces how many traditional edge detection filters etc. worked. There are many pre-trained convolutional models freely available, such as the Google Inception networks for visual classification.
These networks are special in that they can be trained in an unsupervised setting. They effectively learn a compressed representation of inputs over time, which then allows the network to detect the existence of the learned embedding in new samples. The autoencoder is built as a two stage network, the encoding stage compresses the input, and a decoder then decompresses the embedding – with this output being compared to the input value and the difference is then propagated back through the network as errors. Autoencoders have been around for a long time, but more recently Variational Autoencoders (VAEs) have seen a surge in popularity.
Networks capable of generating completely new samples (commonly images) in the same style as the source material. These networks have been used to perform style transfer, for example where Van Goughs signature brushstrokes are applied to any new source image. The training process involves two networks competing in a zero sum game framework, one which generates new samples and another – the discriminator – which determines if the generated sample seems fake. The generator trains by learning to fool the critic, hence the name. Some other examples of GANs include the widely distributed videos of presidential speeches where the script has been altered and the video doctored to make the mouth movements reflect the new script. NVIDIA recently created a number of fake, generated celebrity faces which look real.
LSTMs are great for working with sequences data which requires memory of previously observed samples during processing. Using a recurrent network architecture where previous computations are retained using a bypass line. The major drawback of LSTMs is that they have heavy memory bandwidth requirements for computation, which is the primary bottleneck for processing (moving data to and from the GPU). LSTMs are often used to predict future values for time-series data, a form of sequence to sequence processing.
Since WaveNet was introduced, and with it the Transformer module, the preferred method for operating on sequence data has been to use hierarchical attention networks, a special case of convolutional networks. These have been used with great success in speech synthesis by google, allowing the system to retain contextual information about previous utterances to guide the synthesis of the next speech token.