The article concludes that the overall translation score of Llama 4 is below that of Llama 3.3.
However, the included table shows Llama 4 scoring better in every subcategory of the test: coherence, idiomaticity, and accuracy.
Something doesn't add up. The conclusion just states "...downgrade from Llama 3.3 in every respect" without further explanation.
Looking at the individual language pages, Llama 4 does fall behind pretty often. In Japanese, for example, it has higher scores but also a much higher refusal rate. The summary page doesn't show a refusal rate column, so not all of the data is represented there.
Could be Simpson's Paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox
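If that's the mechanism, the refusal rates would be the key. A minimal sketch with entirely made-up numbers, assuming the overall score is a pooled average over completed (non-refused) translations: a model can win within every language and still lose the pooled average, because refusals shift how much each language weighs in its pool.

```python
# Made-up numbers illustrating Simpson's paradox in this setting:
# "llama4" wins within every language, yet loses the pooled average,
# because its refusals (assumed excluded from scoring) leave it with
# few samples in the language where it scores highest.

def pooled_average(results):
    """Average over all completed translations, pooled across languages."""
    total = sum(score * n for score, n in results.values())
    count = sum(n for _, n in results.values())
    return total / count

# Per language: (mean score on completed translations, number completed).
# Hypothetical: llama4 refuses most Japanese prompts, so only 10 of its
# high-scoring Japanese translations enter the pool.
llama4 = {"Japanese": (90, 10), "German": (60, 100)}
llama33 = {"Japanese": (85, 100), "German": (55, 10)}

for lang in llama4:
    print(f"{lang}: Llama 4 = {llama4[lang][0]}, Llama 3.3 = {llama33[lang][0]}")
print(f"Pooled: Llama 4 = {pooled_average(llama4):.1f}, "
      f"Llama 3.3 = {pooled_average(llama33):.1f}")

# Japanese: Llama 4 = 90,   Llama 3.3 = 85    <- Llama 4 wins
# German:   Llama 4 = 60,   Llama 3.3 = 55    <- Llama 4 wins
# Pooled:   Llama 4 = 62.7, Llama 3.3 = 82.3  <- Llama 3.3 wins
```

Whether that's actually what the benchmark does depends on how it aggregates and how refusals are counted, which the summary page doesn't say.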