Computers need numbers to make calculations. This means neural machine translation (NMT) needs numbers instead of words. But how do you turn words into numbers?

Turning words into numbers sounds like science fiction, but it is how NMT works today. In our previous blog post, we talked about the encoder-decoder architecture of NMT. An encoder neural network reads a source sentence and encodes it into a ‘thought vector’ or ‘meaning vector’: a sequence of numbers that represents the meaning of the sentence. A decoder then produces a translation from that encoded vector. In other words, the ‘digitized thought’ serves as a bridge between two languages.

Turning words into numbers

There are different ways to represent words by numbers. One of them is the so-called one-hot representation. If you feel this is getting a little complex, bear with us; we will try to explain it in an easy way. In a one-hot representation, every word in the vocabulary gets its own position in a long list of numbers, called a vector. The vector for a word contains a “1” at that word's position and a “0” everywhere else. There can be only one “1” in the vector (“one hot” means “one 1”).

You could, for example, represent the words “cat” and “dog” as the following vectors:

Cat: [0, 1, 0, ..., 0]

Dog: [1, 0, 0, ..., 0]

The total number of 0's and 1's in each vector equals the number of words in the vocabulary, so a vocabulary of 50,000 words gives vectors with 50,000 positions.

This one-hot representation is too restrictive and is only used as a starting point for NMT. It does not show similarities or relationships between words, such as the fact that “cat” is related to “cats”, or that both cats and dogs are animals. Instead, every word is equally far from every other word. So even though cats and dogs are both animals, “cat” is exactly as far from “dog” as it is from “blue”, a word that is much less similar in meaning.
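
To make the "equally far" point concrete, here is a minimal sketch in Python. The four-word vocabulary and the helper names `one_hot` and `euclidean_distance` are assumptions made for illustration; a real NMT vocabulary is far larger.

```python
import math

# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
# "dog" is at position 0 and "cat" at position 1, matching the vectors above.
VOCAB = ["dog", "cat", "cats", "blue"]


def one_hot(word, vocab=VOCAB):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cat, dog, blue = one_hot("cat"), one_hot("dog"), one_hot("blue")

# Any two different words differ in exactly two positions, so every pair
# is the same distance apart. "cat" is no closer to "dog" than to "blue".
print(euclidean_distance(cat, dog))
print(euclidean_distance(cat, blue))
```

Both printed distances are identical, which is exactly why one-hot vectors cannot express that two words are related.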

Word embedding

So how can words be represented numerically in a more concise and meaningful way? NMT assigns a set of numbers (a vector) to every word to define its position in a theoretical “meaning space” or cloud. This vector is much shorter than a one-hot vector, and it does not consist of only “0” and “1” but of decimal values. This way, features of words and similarities between them are captured as well.

“Cat” and “dog” could, for example, be presented as follows:

Cat: [0.33, 0.44, ...........]

Dog: [0.33, 0.45, ...........]

The two vectors are almost identical because “cat” and “dog” share features. They are, for example, both animals. The vectors for “cat” and “blue” will be further away from each other.
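
One common way to measure how close two word vectors are is cosine similarity. The sketch below is only an illustration: the first two values for “cat” and “dog” are taken from the example above, while the third value and the whole vector for “blue” are invented. Real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Invented 3-dimensional embeddings, for illustration only.
# Only the first two values of "cat" and "dog" come from the example above.
EMBEDDINGS = {
    "cat": [0.33, 0.44, 0.10],
    "dog": [0.33, 0.45, 0.12],
    "blue": [0.90, -0.20, 0.05],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_blue = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["blue"])

print(f"cat vs dog:  {cat_dog:.3f}")
print(f"cat vs blue: {cat_blue:.3f}")
```

With these made-up numbers, “cat” scores much higher against “dog” than against “blue”, which is the kind of relationship a one-hot vector cannot express.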

Visually, it would look like this:

The magic here is that the neural network itself learns the similarities between words. It does so through repeated guesses on training data, building on ideas such as distributional semantics, which characterizes words by the company they keep (their context). To do so, however, it needs big data, i.e. a large set of data (for example a large translation memory). Without it, the similarities cannot be learned.
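
The idea of “the company words keep” can be sketched without any neural network at all. The example below simply counts which words appear near each other in a tiny invented corpus. This is a count-based stand-in for distributional semantics, not the training procedure an NMT system actually uses; the corpus, the window size and the function names are all assumptions.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real system would need millions of sentences.
CORPUS = [
    "the cat eats food",
    "the dog eats food",
    "the cat sleeps here",
    "the dog sleeps here",
    "the sky is blue",
    "the sea is blue",
]


def context_counts(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def shared_context(counts, a, b):
    """Number of context occurrences that words `a` and `b` have in common."""
    return sum((counts[a] & counts[b]).values())


counts = context_counts(CORPUS)
print("cat/dog shared context: ", shared_context(counts, "cat", "dog"))
print("cat/blue shared context:", shared_context(counts, "cat", "blue"))
```

In this toy corpus “cat” and “dog” appear next to the same words, while “cat” and “blue” share none. With only six sentences the result is fragile, which hints at why large amounts of data are needed.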

Thought vectors

A sentence can be seen as a path through these word vectors, which can in turn be distilled down to its own set of numbers: the thought vector. This is where it gets really interesting. Thought vectors make it possible to translate a sentence while keeping the context of the whole sentence in mind. Whereas Statistical Machine Translation (SMT) translates a sentence in parts (for example three words at a time, regardless of the wider context), NMT translates the whole sentence in line with its context.

Below, you can find a visual representation of how neural networks work. A recurrent neural network (RNN) reads the source sentence word by word and encodes it. The resulting thought vector captures the features, meaning and context of the source sentence. Another recurrent neural network then decodes this vector and produces the translation. Recurrent neural networks are designed for sequences, such as the words in a sentence or a signal over time. Ordinary (feed-forward) neural networks are more static: they process a fixed input in one go, with no notion of time steps.
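
The encoding step can be sketched in a few lines of plain Python. This is a minimal, untrained recurrent encoder with random weights, so its output carries no real meaning. The vocabulary, the sizes and all names are assumptions for illustration; production systems use trained weights and more elaborate recurrent units.

```python
import math
import random

random.seed(0)  # fixed seed so the random, untrained weights are reproducible

EMBED_SIZE = 3   # length of each word vector (assumed for illustration)
HIDDEN_SIZE = 4  # length of the thought vector (assumed for illustration)
VOCAB = ["the", "cat", "chases", "dog"]


def random_matrix(rows, cols):
    """Matrix of random weights; a trained network would learn these values."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


EMBEDDINGS = {w: [random.uniform(-1, 1) for _ in range(EMBED_SIZE)] for w in VOCAB}
W_INPUT = random_matrix(HIDDEN_SIZE, EMBED_SIZE)
W_HIDDEN = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)


def encode(sentence):
    """Read the sentence word by word and return the final hidden state."""
    hidden = [0.0] * HIDDEN_SIZE
    for word in sentence.split():
        x = EMBEDDINGS[word]
        # New state = tanh(W_INPUT * current word + W_HIDDEN * previous state)
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_INPUT[k], x))
                + sum(w * hi for w, hi in zip(W_HIDDEN[k], hidden))
            )
            for k in range(HIDDEN_SIZE)
        ]
    return hidden


short = encode("the cat")
longer = encode("the cat chases the dog")
swapped = encode("the dog chases the cat")

print(len(short), len(longer))  # same length regardless of sentence length
print(longer != swapped)        # word order changes the resulting vector
```

Two things are worth noticing: the thought vector has the same length however long the sentence is, and swapping “cat” and “dog” produces a different vector, because each step depends on everything that was read before it.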


That’s not the end of it

Reports about the quality improvement of NMT over statistical machine translation vary between language pairs. Nevertheless, it is clear that NMT reduces post-editing time significantly. Lexical and morphological mistakes, as well as word-order mistakes, occur less frequently. Despite the excitement around NMT today, the technology isn’t perfect yet. In our final post of this blog series, we’ll discuss the different ways in which neural machine translation is developing.