Attention Is All You Need. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Uses Attention
The Transformer basically does the following: the encoder maps the input tokens into a stack of continuous representations using self-attention and position-wise feed-forward layers; the decoder then generates the output one token at a time, using self-attention over what it has produced so far plus encoder-decoder attention over the encoder's output.
Where does the first set of queries come from? As far as I understand (as described above), in the encoder's self-attention the queries, keys, and values all come from the same place: the input embeddings (plus positional encodings) at the first layer, and the previous layer's output after that. In the encoder-decoder attention layers, the queries come from the decoder while the keys and values come from the encoder output. Yeah, this seems to be correct (see the sketch below).
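To make the data flow concrete, here's a minimal NumPy sketch of one head of scaled dot-product self-attention. This is my own illustration, not code from the paper: the names are mine and the weights are random placeholders standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 16

# x stands in for the input embeddings (plus positional encodings) entering the
# first encoder layer; in deeper layers it would be the previous layer's output.
x = rng.normal(size=(seq_len, d_model))

# Learned projection matrices in the real model; random placeholders here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# In encoder self-attention, queries, keys, and values are all projections of
# the same sequence x.
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) attention logits
out = softmax(scores) @ V           # (seq_len, d_k) attended representation

# In the encoder-decoder attention layers, the only change is that Q is
# projected from the decoder states while K and V come from the encoder output.
print(out.shape)
```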
Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position. They add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
Is adding OK, rather than concatenating? I guess you can still encode a lot of information in these high-dimensional vectors.
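A toy shape check of what summing rather than concatenating means (my own sketch; the random values stand in for real token embeddings and positional encodings):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16

token_emb = rng.normal(size=(seq_len, d_model))  # embedding-layer output (placeholder)
pos_enc   = rng.normal(size=(seq_len, d_model))  # positional encodings, same d_model

x = token_emb + pos_enc   # element-wise sum, not concatenation
print(x.shape)            # (5, 16): the model width stays d_model
```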
They use sinusoidal functions with varying frequencies. They hypothesize that this makes it easy to learn to attend by relative position, since the positional encodings of two positions separated by a fixed offset are related by a linear transformation (that's a nice property of the Fourier basis! It is nice and linear).
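A quick NumPy sketch (my own, not from the paper) that builds the sinusoidal encodings and numerically checks the linearity claim: for a fixed offset k, each (sin, cos) pair of PE[pos] is rotated by k times its frequency to give PE[pos + k], so a single block-diagonal matrix maps PE[pos] to PE[pos + k] for every pos.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))     # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

d_model, k = 16, 3
pe = sinusoidal_pe(100, d_model)

# Build the fixed linear map that should send PE[pos] to PE[pos + k]:
# each (sin, cos) pair at frequency w gets rotated by angle k * w.
freqs = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
M = np.zeros((d_model, d_model))
for j, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]

pos = 10
print(np.allclose(M @ pe[pos], pe[pos + k]))   # True: the same M works for every pos
```

Note that M depends only on the offset k, not on pos, which is exactly why relative positions are "cheap" to represent under this encoding.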