Attention Is All You Need. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Uses Attention
The Transformer basically does the following: the encoder maps the input tokens into a stack of continuous representations using self-attention and position-wise feed-forward layers; the decoder then generates the output one token at a time, using self-attention over what it has produced so far plus encoder-decoder attention over the encoder's output.
Where does the first set of queries come from? As far as I understand (as described above), in the encoder's self-attention the queries, keys, and values all come from the same place: the input embeddings (plus positional encodings) at the first layer, and the previous layer's output after that. In the encoder-decoder attention layers, the queries come from the decoder while the keys and values come from the encoder output. Yeah, this seems to be correct (see the sketch below).
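To make the data flow concrete, here's a minimal NumPy sketch of one head of scaled dot-product self-attention. This is my own illustration, not code from the paper: the names are mine and the weights are random placeholders standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 16

# x stands in for the input embeddings (plus positional encodings) entering the
# first encoder layer; in deeper layers it would be the previous layer's output.
x = rng.normal(size=(seq_len, d_model))

# Learned projection matrices in the real model; random placeholders here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# In encoder self-attention, queries, keys, and values are all projections of
# the same sequence x.
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) attention logits
out = softmax(scores) @ V           # (seq_len, d_k) attended representation

# In the encoder-decoder attention layers, the only change is that Q is
# projected from the decoder states while K and V come from the encoder output.
print(out.shape)
```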
Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position. They add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
Is adding OK, rather than concatenating? I guess you can still encode a lot of information in these high-dimensional vectors.
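A toy shape check of what summing rather than concatenating means (my own sketch; the random values stand in for real token embeddings and positional encodings):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16

token_emb = rng.normal(size=(seq_len, d_model))  # embedding-layer output (placeholder)
pos_enc   = rng.normal(size=(seq_len, d_model))  # positional encodings, same d_model

x = token_emb + pos_enc   # element-wise sum, not concatenation
print(x.shape)            # (5, 16): the model width stays d_model
```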
They use sinusoidal functions with varying frequencies. They hypothesize that this makes it easy to learn to attend by relative position, since the positional encodings of two positions separated by a fixed offset are related by a linear transformation (that's a nice property of the Fourier basis! It is nice and linear).
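A quick NumPy sketch (my own, not from the paper) that builds the sinusoidal encodings and numerically checks the linearity claim: for a fixed offset k, each (sin, cos) pair of PE[pos] is rotated by k times its frequency to give PE[pos + k], so a single block-diagonal matrix maps PE[pos] to PE[pos + k] for every pos.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))     # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

d_model, k = 16, 3
pe = sinusoidal_pe(100, d_model)

# Build the fixed linear map that should send PE[pos] to PE[pos + k]:
# each (sin, cos) pair at frequency w gets rotated by angle k * w.
freqs = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
M = np.zeros((d_model, d_model))
for j, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]

pos = 10
print(np.allclose(M @ pe[pos], pe[pos + k]))   # True: the same M works for every pos
```

Note that M depends only on the offset k, not on pos, which is exactly why relative positions are "cheap" to represent under this encoding.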