The article concludes that the overall translation score of Llama 4 is below that of Llama 3.3.
However, the included table shows Llama 4 scoring better in every subcategory of the test: coherence, idiomaticity, and accuracy.
Something doesn't add up. The conclusion just states "...downgrade from Llama 3.3 in every respect" without further explanation.
Looking at the individual language pages, Llama 4 does fall behind pretty often. In Japanese, for example, it has higher scores but also a much higher refusal rate. The summary page doesn't show a refusal rate column, so not all of the data is represented there.
Could be Simpson's Paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox
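If that's the mechanism, the refusal rates would be the key. A minimal sketch with entirely made-up numbers, assuming the overall score is a pooled average over completed (non-refused) translations: a model can win within every language and still lose the pooled average, because refusals shift how much each language weighs in its pool.

```python
# Made-up numbers illustrating Simpson's paradox in this setting:
# "llama4" wins within every language, yet loses the pooled average,
# because its refusals (assumed excluded from scoring) leave it with
# few samples in the language where it scores highest.

def pooled_average(results):
    """Average over all completed translations, pooled across languages."""
    total = sum(score * n for score, n in results.values())
    count = sum(n for _, n in results.values())
    return total / count

# Per language: (mean score on completed translations, number completed).
# Hypothetical: llama4 refuses most Japanese prompts, so only 10 of its
# high-scoring Japanese translations enter the pool.
llama4 = {"Japanese": (90, 10), "German": (60, 100)}
llama33 = {"Japanese": (85, 100), "German": (55, 10)}

for lang in llama4:
    print(f"{lang}: Llama 4 = {llama4[lang][0]}, Llama 3.3 = {llama33[lang][0]}")
print(f"Pooled: Llama 4 = {pooled_average(llama4):.1f}, "
      f"Llama 3.3 = {pooled_average(llama33):.1f}")

# Japanese: Llama 4 = 90,   Llama 3.3 = 85    <- Llama 4 wins
# German:   Llama 4 = 60,   Llama 3.3 = 55    <- Llama 4 wins
# Pooled:   Llama 4 = 62.7, Llama 3.3 = 82.3  <- Llama 3.3 wins
```

Whether that's actually what the benchmark does depends on how it aggregates and how refusals are counted, which the summary page doesn't say.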