Has anyone used BLEU to score text?

tfers-migration · March 30, 2020, 9:57am

The Python code is as follows:

from nltk.translate.bleu_score import sentence_bleu
# Five tokenized reference captions and one tokenized candidate caption
reference = [
    ['a', 'close', 'up', 'picture', 'of', 'a', 'brown', "bear's", 'face'],
    ['A', 'large', 'bear', 'that', 'is', 'sitting', 'on', 'grass', ''],
    ['Closeup', 'of', 'a', 'brown', 'bear', 'sitting', 'in', 'a', 'grassy', 'area'],
    ['The', 'large', 'brown', 'bear', 'has', 'a', 'black', 'nose'],
    ['A', 'big', 'burly', 'grizzly', 'bear', 'is', 'show', 'with', 'grass', 'in', 'the', 'background']]
candidate = ['a', 'large', 'brown', 'bear', 'standing', 'on', 'top', 'of', 'a', 'grass', 'covered', 'field']
# Individual n-gram scores: all of the weight goes to a single n-gram order
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

This produces the following output:

UserWarning:
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Individual 1-gram: 0.666667
Individual 2-gram: 0.272727
Individual 3-gram: 0.100000
Individual 4-gram: 1.000000

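I don't understand why the 4-gram case triggers this warning. Also, is the 1-gram score the same thing as the BLEU-1 score, and the 4-gram score the same as BLEU-4?
I hope someone who knows this can help me out. Thanks, everyone!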

Asked by StormshadowRay, 2018-4-22 21:23:50

tfers-migration · March 30, 2020, 9:58am

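I'm not very familiar with BLEU, so I can't say for sure whether the 4-gram score is what you call BLEU-4, but the weights in your code are the right ones for an individual 4-gram score.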
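As a rough pointer, and as far as I know rather than anything confirmed in this thread: "BLEU-4" in papers usually means the cumulative score, i.e. the geometric mean of the 1- to 4-gram precisions times the brevity penalty. In NLTK that corresponds to weights=(0.25, 0.25, 0.25, 0.25), not (0, 0, 0, 1). Below is a minimal pure-Python sketch built on the individual scores printed in the question. cumulative_bleu is a hypothetical helper written for illustration, not an NLTK function, and it applies no smoothing.

```python
import math

def cumulative_bleu(precisions, brevity_penalty=1.0):
    """Geometric mean of the given n-gram precisions, times the brevity penalty.

    Hypothetical helper for illustration; unsmoothed, so any zero
    precision makes the whole score zero.
    """
    if any(p == 0 for p in precisions):
        return 0.0
    weight = 1.0 / len(precisions)  # uniform weights, e.g. 0.25 each for BLEU-4
    log_sum = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty * math.exp(log_sum)

# Individual precisions taken from the question's output; the true 4-gram
# precision is 0 because no candidate 4-gram appears in any reference.
individual = [0.666667, 0.272727, 0.100000, 0.0]

for n in range(1, 5):
    print('Cumulative BLEU-%d: %f' % (n, cumulative_bleu(individual[:n])))
```

With a single order the cumulative and individual scores coincide, so the 1-gram score is BLEU-1 in either reading. The two readings only diverge from the 2-gram score onward.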

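That said, the reason for the 4-gram warning seems fairly clear: none of your reference sentences contains any of the 4-grams in the candidate, which is exactly what the message says: "Sentence contains 0 counts of 4-gram overlaps."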
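The overlap counts can be checked by hand without NLTK. The sketch below counts clipped n-gram matches between the candidate and the references using the data from the question. The helper names ngrams and clipped_overlap are made up for this illustration, and the comparison is case-sensitive, so 'A' and 'a' count as different tokens.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_overlap(references, candidate, n):
    """Return (matched, total) candidate n-grams, clipping each n-gram's
    count at the largest count seen in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())

reference = [
    ['a', 'close', 'up', 'picture', 'of', 'a', 'brown', "bear's", 'face'],
    ['A', 'large', 'bear', 'that', 'is', 'sitting', 'on', 'grass', ''],
    ['Closeup', 'of', 'a', 'brown', 'bear', 'sitting', 'in', 'a', 'grassy', 'area'],
    ['The', 'large', 'brown', 'bear', 'has', 'a', 'black', 'nose'],
    ['A', 'big', 'burly', 'grizzly', 'bear', 'is', 'show', 'with', 'grass',
     'in', 'the', 'background'],
]
candidate = ['a', 'large', 'brown', 'bear', 'standing', 'on', 'top', 'of',
             'a', 'grass', 'covered', 'field']

for n in range(1, 5):
    matched, total = clipped_overlap(reference, candidate, n)
    print('%d-gram: %d of %d candidate n-grams matched' % (n, matched, total))
```

The ratios 8/12, 3/11 and 1/10 line up with the 0.666667, 0.272727 and 0.100000 in the question, and the 4-gram count is 0 of 9, which is what the warning reports.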

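Posting code from my phone is awkward, so here is just a brief explanation of why you get the implausible result of 1.0 ("BLEU scores might be undesirable"). First, because there are no matching 4-grams, the default smoothing_function returns 0 for the 4-gram precision (really, there is no score at all), and after the score transformation this becomes 1.0. Second, your fifth reference (index 4 in the list) has the same length as the candidate, which makes brevity_penalty = 1. Multiplying the two gives 1.0.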
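The arithmetic described above can be sketched in a few lines. This is only an illustration under the assumption that a precision with zero matches is simply left out of the weighted log sum. It is not NLTK's actual source, and other NLTK versions may behave differently. The match counts are worked out by hand from the question's data, and both function names are hypothetical.

```python
import math

def brevity_penalty(closest_ref_len, candidate_len):
    """Standard BLEU brevity penalty: 1 unless the candidate is shorter
    than the closest reference length."""
    if candidate_len > closest_ref_len:
        return 1.0
    return math.exp(1.0 - closest_ref_len / candidate_len)

def score_skipping_zero_counts(matched, totals, weights, bp):
    """Weighted log-average of n-gram precisions that skips any order
    with zero matches (assumed behavior, for illustration only)."""
    log_sum = 0.0
    for weight, num, den in zip(weights, matched, totals):
        if num == 0:
            continue  # the zero-count term contributes nothing
        log_sum += weight * math.log(num / den)
    return bp * math.exp(log_sum)

# Hand-counted clipped matches and totals for 1- to 4-grams
matched = [8, 3, 1, 0]
totals = [12, 11, 10, 9]

# Candidate has 12 tokens; the closest reference (the fifth) also has 12
bp = brevity_penalty(closest_ref_len=12, candidate_len=12)

score = score_skipping_zero_counts(matched, totals, (0, 0, 0, 1), bp)
print('Individual 4-gram: %f' % score)  # prints 1.000000
```

With weights (0, 0, 0, 1) the first three terms are multiplied by 0 and the fourth is skipped, so the log sum is 0 and exp(0) = 1. A brevity penalty of 1 then leaves the score at 1.0.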

yunhai_luo, posted 2018-4-23 05:30:26