When an MT project is finished, we compare our original MT output with the post-edited version to see how well our MT engine has performed. But to measure any change in MT quality across different projects, you need a metric that is objective and repeatable.

Metrics

There are plenty of freely available metrics used by academics as well as engineers in the industry. They all have their merits and drawbacks, and there have been endless discussions about which one the industry should focus on.

However, the discussion at Yamagata Europe took a different turn. We asked ourselves whether we should use these metrics at all.

One of the disadvantages of metrics like BLEU or F-Measure is that they aren't very convincing to people without a background in MT. You won't win over a new customer by telling them you can build a statistical MT engine with a BLEU score of 70%.

That's why, instead of relying on these scores, we decided to use our own metrics. We needed figures that were clear to everyone: engineers as well as vendors and clients.

First Time Right Segments 

So we decided to keep it simple and look at First Time Right (FTR) segments. As the name suggests, these are segments where the post-editor doesn't have to change a thing: the MT output, tag placement and punctuation are all perfect. Our goal for improving the quality of our engines became getting this number as high as possible.
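
To make that concrete, here is a minimal sketch of how FTR segments could be counted, assuming the raw MT output and the post-edited version are available as parallel lists of segment strings. The function name and toy data are ours, for illustration only, not part of any actual Yamagata Europe tooling.

```python
# Minimal sketch: a segment is First Time Right (FTR) only if the post-edited
# text is identical to the raw MT output, tags and punctuation included.

def count_ftr(mt_segments, post_edited_segments):
    """Return the number and share of segments left completely untouched."""
    assert len(mt_segments) == len(post_edited_segments)
    ftr = sum(1 for mt, pe in zip(mt_segments, post_edited_segments) if mt == pe)
    return ftr, ftr / len(mt_segments)

# Toy example: one segment untouched, one corrected by the post-editor.
mt = ["Press the <b>Start</b> button.", "The machine turn on."]
pe = ["Press the <b>Start</b> button.", "The machine turns on."]
print(count_ftr(mt, pe))  # -> (1, 0.5)
```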

This worked well for a while, until we hit a ceiling with this number. We kept improving our engines, but it wasn't really showing in the results. We realized that MT has its limits when it comes to producing perfect translations, which in turn limited the usefulness of our metric.

We needed a way to convince vendors and clients that the quality of our engines was still improving, even though the number of FTR segments had stagnated.

Post Editing Effort

Again, there are plenty of metrics available that reflect Post Editing Effort, or Post Editing Distance if you will. But telling someone that your TER score has decreased by two percent since your last update isn't going to impress anyone who doesn't know what that means.

And even when you explain that this means the post-editor will have to put in less effort to complete the job, it doesn't give them anything tangible to grasp the improvement.

That's why we started dividing all the segments that did require editing into Almost Right (AR) and High Effort (HE) segments. This way, our scores reflect the post-editor's effort in a tangible way. So when a project is finished, we can see how many segments needed editing (First Time Right or not) and how many of those needed a lot of effort (Almost Right vs. High Effort).

Setting a cut-off between low and high effort might seem arbitrary, but in practice this metric does exactly what we wanted it to do. We now have an objective, repeatable way to show that the constant updates we make to improve our engines have an actual influence on the post-editing effort.
Even more importantly, we can show this in actual numbers that mean something to everyone, at every end of the translation spectrum.
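
To illustrate, here is a rough sketch of how such a split could be computed, assuming segments are classified with a simple character-level similarity ratio and an example cut-off of 30% changed text. Both the distance measure and the threshold are illustrative assumptions, not the exact rules we use.

```python
# Sketch: split segments into FTR, Almost Right (AR) and High Effort (HE).
# The 30% cut-off and the character-level ratio are assumptions for illustration.
from difflib import SequenceMatcher

def classify_segments(mt_segments, post_edited_segments, max_ar_edit=0.30):
    counts = {"FTR": 0, "AR": 0, "HE": 0}
    for mt, pe in zip(mt_segments, post_edited_segments):
        if mt == pe:
            counts["FTR"] += 1
            continue
        # Rough fraction of the segment that changed (1 - similarity).
        edit_ratio = 1.0 - SequenceMatcher(None, mt, pe).ratio()
        counts["AR" if edit_ratio <= max_ar_edit else "HE"] += 1
    return counts
```

A word-level edit distance or a tool-reported post-editing distance could serve just as well; the point is simply that everything above the cut-off counts as High Effort.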

Example

Maybe an example will make this clearer. Let's say we have a customer with weekly projects averaging 1,000 segments for a certain language combination.

The first time we would use MT for this customer, we would get the basic starting results (see Start in the graph). Before the next project starts, we update our engine with new data, feedback from the translator, and so on. And sure enough, we see improvement in the results of the second project (see After 1 update).

Not only can we show this positive evolution, the figures are clear to everyone: more segments that are right the first time and fewer segments that require high effort, in concrete numbers. We can easily calculate these results every week and update the engine accordingly.
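
The weekly follow-up itself can be as simple as turning each project's segment counts into percentages. The numbers below are purely hypothetical and only illustrate the shape of such a report.

```python
# Sketch: turn per-project segment counts into the plain percentages
# we show to vendors and clients. The counts are hypothetical examples.

def report(label, counts):
    total = sum(counts.values())
    parts = ", ".join(f"{name} {100 * n / total:.0f}%" for name, n in counts.items())
    print(f"{label}: {total} segments -> {parts}")

report("Start",          {"FTR": 350, "AR": 400, "HE": 250})
report("After 1 update", {"FTR": 420, "AR": 410, "HE": 170})
```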

Once we reach a certain level of quality, we might notice that our updates no longer have much effect. We can then either change the way we update or consider the engine mature (see Mature in the graph). Once an engine is considered mature, we only do minor updates unless we see a clear decline in quality.
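
One possible way to make "mature" concrete is to check whether the FTR share has stopped climbing over the last few projects. The margin and window used here are assumptions for the sake of the sketch, not our production rules.

```python
# Sketch: flag an engine as mature when the FTR share has gained less than
# a small margin over the last few projects. Margin and window are assumptions.

def is_mature(ftr_history, window=4, min_gain=0.01):
    """ftr_history: FTR shares (0..1) of consecutive projects, oldest first."""
    if len(ftr_history) <= window:
        return False
    recent = ftr_history[-(window + 1):]
    return (recent[-1] - recent[0]) < min_gain

# Example: FTR has been flat at 45% for the last projects -> mature.
print(is_mature([0.30, 0.40, 0.45, 0.45, 0.45, 0.45, 0.45]))  # -> True
```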

Machine Translation Score

(Graph: share of FTR, AR and HE segments at Start, After 1 update and Mature)

So why did we stop there? Why not divide the output into 10 categories, each representing a post-editing effort of 10 percent?

Because sometimes less is more.

What's the difference between a segment where someone had to change 60% of the sentence and one where they had to change 95%? In practice, the editor will discard both sentences and start translating from scratch. Sometimes it's better to draw the line before you end up with more numbers that say less.