EMNLP MT System Combination Papers
I’ve skimmed the system combination papers mentioned at ACS:
Machine Translation Papers at EMNLP.
System combination can give large improvements in MT quality, but in practice you have to ask people to translate material for you on their system.
Duan et al. (Feature Subspace Method for SMT System Combination) take a single system and create a dozen variants of it (each leaving out one feature, or using a lower-order language model), and gain about 1 BLEU by sentence-level rescoring using ngram-agreement and internal model features (weights trained on a dev set). This is more than the modest 0.25 BLEU I’ve obtained by doing the same thing for a single system’s nbests; they don’t report their single-system reranking gain, but it’s probably similar. When combining two actual systems (Hiero-style and phrase-based), they have to leave out the internal features (which are difficult or impossible to evaluate on an unrelated system’s output), but still get an additional ~1 BLEU from using 2×12 rather than 2×1 systems. They also found that 20-50 nbests were best (this agrees with my experience in sentence reranking, but obviously that depends on the search errors, duplicate derivations, and other properties of the actual system and search procedure; sometimes I’ve found a small benefit from up to 1000-bests). This is an excellent result and easy (if tedious) to apply to your own system(s). I don’t find their theoretical motivation particularly compelling; it seems to me that all the benefit is explained by Bayesian averaging over multiple feature-weight vectors.
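To make the reranking step concrete, here’s a minimal sketch of ngram-agreement rescoring over the pooled n-best lists of the subspace variants. This is not the paper’s implementation: the hypothesis/score representation, the choice of max_n, and the two weights are my assumptions; in practice you’d tune the weights on a dev set (e.g. with MERT) and add the internal model features where available.

```python
from collections import Counter
from itertools import chain

def ngrams(tokens, n):
    """All n-grams of a token list (empty if the sentence is too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_agreement(hyp, pool, max_n=4):
    """Average fraction of hyp n-grams (n = 1..max_n) that also occur
    somewhere in the pooled hypotheses from the other variants."""
    total = 0.0
    for n in range(1, max_n + 1):
        pool_counts = Counter(chain.from_iterable(ngrams(p, n) for p in pool))
        hyp_grams = ngrams(hyp, n)
        if hyp_grams:
            total += sum(1 for g in hyp_grams if pool_counts[g]) / len(hyp_grams)
    return total / max_n

def rerank(pooled_nbest, w_agree=1.0, w_model=1.0):
    """pooled_nbest: list of (tokens, model_score) merged from all variants.
    The weights are placeholders; tune them on a dev set."""
    hyps = [tokens for tokens, _ in pooled_nbest]
    def score(entry):
        tokens, model_score = entry
        others = [h for h in hyps if h is not tokens]
        return w_agree * ngram_agreement(tokens, others) + w_model * model_score
    return max(pooled_nbest, key=score)
```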
Feng et al. (Lattice-based System Combination for SMT) take the nice word-level (“sausage graph”) confusion-network IHMM system combination method of Xiaodong He and take the obvious step of generating lattices instead. They get another 0.5-1 BLEU over his result. They build the lattice from word-to-word alignments against a backbone translation, as in the CN case, so it’s impressive that they get a meaningful lattice. My own thought would be to have the MT systems output alignment information directly, but their way makes integration easier. It’s also possible to get much of the benefit of a lattice’s phrases by using system preferences for ngrams as a feature on top of the unigram CN; Zhao & He get an almost identical improvement by doing so (they also add the usual ngram-agreement reranking feature, where you treat your weighted nbest as an English LM, into their CN decoder, which is smart). If you have an existing CN system, it’s probably easier to do this than to switch to lattices (there’s no reason not to try combining the methods, but they already do very similar things). He & Toutanova (Joint Optimization for MT System Combination) also amend He’s original approach, and get a larger improvement (1.5 BLEU). The original CN IHMM approach was a pipeline of several greedy/hard decisions; this work fixes that by jointly searching possible reorderings/alignments/selections of phrases from the nbest translations. There’s probably no way to “add lattices” on top of this; there’s no explicit backbone, and they already use bigram system voting. I’d try new features before considering lattices (e.g. nbest ngram LM, 3gram voting). They show that the integrated alignment search is important; they lose 0.3 BLEU if they restrict the search to Viterbi union alignments.
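For intuition about what all of these CN-based methods are voting over, here’s a toy sketch of unigram confusion-network voting against a backbone. It assumes each hypothesis has already been word-aligned to the backbone (here just a dict from backbone position to aligned word), which is exactly the hard part that the IHMM, lattice, and joint-search papers address; system confidences, an LM, and ngram-voting features are all omitted.

```python
from collections import defaultdict

def build_confusion_network(backbone, aligned_hyps):
    """backbone: list of words.  aligned_hyps: one dict per system hypothesis,
    mapping backbone position -> aligned word (missing / None = deletion).
    Returns one vote table per backbone slot ('' marks an empty arc)."""
    slots = [defaultdict(float) for _ in backbone]
    for pos, word in enumerate(backbone):
        slots[pos][word] += 1.0                 # the backbone votes for itself
    for alignment in aligned_hyps:
        for pos in range(len(backbone)):
            word = alignment.get(pos)
            slots[pos][word if word is not None else ''] += 1.0
    return slots

def decode(slots):
    """Greedy unigram decoding: highest-voted word per slot, drop empty arcs.
    A real decoder would add system confidences, an LM, and (per Zhao & He)
    ngram system-voting features."""
    best = [max(votes, key=votes.get) for votes in slots]
    return [w for w in best if w]

# Toy example: two hypotheses aligned to the backbone "the cat sat".
backbone = "the cat sat".split()
hyps = [{0: "the", 1: "cat", 2: "sits"}, {0: "a", 1: "cat", 2: "sat"}]
print(decode(build_confusion_network(backbone, hyps)))  # -> ['the', 'cat', 'sat']
```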