sleepyhead111 committed
Commit 12aef23 · verified · 1 Parent(s): 88117f8

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. data/test/trainable_data/de2en/preprocess.log +29 -0
  2. data/test/trainable_data/de2en/test1.de-en.de.idx +0 -0
  3. data/test/trainable_data/de2en/test1.de-en.en.idx +0 -0
  4. data/test/trainable_data/de2en/test2.de-en.de.idx +0 -0
  5. data/test/trainable_data/de2en/test2.de-en.en.idx +0 -0
  6. data/test/trainable_data/en2de/test.en-de.en.idx +0 -0
  7. data/test/trainable_data/en2de/test2.en-de.de.idx +0 -0
  8. data/test/trainable_data/en2de/test2.en-de.en.idx +0 -0
  9. data/test/trainable_data/en2zh/dict.en.txt +0 -0
  10. data/test/trainable_data/en2zh/dict.zh.txt +0 -0
  11. data/test/trainable_data/en2zh/preprocess.log +23 -0
  12. data/test/trainable_data/en2zh/test.en-zh.en.idx +0 -0
  13. data/test/trainable_data/en2zh/test.en-zh.zh.idx +0 -0
  14. data/test/trainable_data/en2zh/test1.en-zh.en.idx +0 -0
  15. data/test/trainable_data/en2zh/test1.en-zh.zh.idx +0 -0
  16. data/test/trainable_data/en2zh/test2.en-zh.en.idx +0 -0
  17. data/test/trainable_data/en2zh/test2.en-zh.zh.idx +0 -0
  18. data/test/trainable_data/zh2en/dict.en.txt +0 -0
  19. data/test/trainable_data/zh2en/dict.zh.txt +0 -0
  20. data/test/trainable_data/zh2en/preprocess.log +6 -0
  21. data/test/trainable_data/zh2en/preprocess1.log +6 -0
  22. data/test/trainable_data/zh2en/test.zh-en.en.idx +0 -0
  23. data/test/trainable_data/zh2en/test1.zh-en.en.idx +0 -0
  24. data/test/trainable_data/zh2en/test2.zh-en.en.idx +0 -0
  25. data/test/trainable_data/zh2en/test2.zh-en.zh.idx +0 -0
  26. mosesdecoder/scripts/analysis/nontranslated_words.pl +100 -0
  27. mosesdecoder/scripts/analysis/smtgui/Corpus.pm +1345 -0
  28. mosesdecoder/scripts/analysis/smtgui/README +42 -0
  29. mosesdecoder/scripts/analysis/smtgui/file-descriptions +4 -0
  30. mosesdecoder/scripts/analysis/smtgui/file-factors +9 -0
  31. mosesdecoder/scripts/analysis/smtgui/newsmtgui.cgi +1006 -0
  32. mosesdecoder/scripts/analysis/weight-scan-summarize.sh +79 -0
  33. mosesdecoder/scripts/ems/web/javascripts/builder.js +136 -0
  34. mosesdecoder/scripts/ems/web/javascripts/dragdrop.js +974 -0
  35. mosesdecoder/scripts/ems/web/javascripts/prototype.js +0 -0
  36. mosesdecoder/scripts/ems/web/javascripts/sound.js +63 -0
  37. mosesdecoder/vw/Classifier.h +197 -0
  38. mosesdecoder/vw/ClassifierFactory.cpp +48 -0
  39. mosesdecoder/vw/Jamfile +20 -0
  40. mosesdecoder/vw/Normalizer.h +78 -0
  41. mosesdecoder/vw/README.md +113 -0
  42. mosesdecoder/vw/VWPredictor.cpp +121 -0
  43. mosesdecoder/vw/VWTrainer.cpp +99 -0
  44. scripts/decode-backtrans.sh +69 -0
  45. scripts/decode.sh +69 -0
  46. scripts/train-backtrans.sh +157 -0
  47. scripts/train.sh +157 -0
  48. subword-nmt/.github/workflows/pythonpublish.yml +26 -0
  49. subword-nmt/.gitignore +105 -0
  50. subword-nmt/CHANGELOG.md +52 -0
data/test/trainable_data/de2en/preprocess.log ADDED
@@ -0,0 +1,29 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=True, only_source=False, padding_factor=8, workers=32)
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train.de: 46388489 sents, 1161403088 tokens, 0.0% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid.de: 1997 sents, 58227 tokens, 0.00515% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.de: 3545 sents, 116081 tokens, 0.00345% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train.en: 46388489 sents, 1094684830 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid.en: 1997 sents, 54062 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.en: 3545 sents, 110575 tokens, 0.00181% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=True, only_source=False, padding_factor=8, workers=32)
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data/dict.en.txt', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data/dict.de.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores.de: 1012 sents, 34004 tokens, 0.00588% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22.de: 1984 sents, 45732 tokens, 0.00219% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23.de: 549 sents, 36345 tokens, 0.00275% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores.en: 1012 sents, 30385 tokens, 0.00658% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22.en: 1984 sents, 45259 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23.en: 549 sents, 34931 tokens, 0.0% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data
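
The first Namespace line above corresponds to a fairseq-preprocess invocation along roughly the following lines (a sketch reconstructed from the logged arguments, not the exact command line; only non-default options shown):

    fairseq-preprocess --task translation --source-lang de --target-lang en \
        --trainpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train \
        --validpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid \
        --testpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test \
        --destdir /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data \
        --joined-dictionary --dataset-impl mmap --workers 32 --seed 30

The final Namespace in the log then reuses the dictionaries written by this first pass (via --srcdict/--tgtdict) to binarize the flores, wmt22, and wmt23 test sets against the same vocabulary.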
data/test/trainable_data/de2en/test1.de-en.de.idx ADDED
Binary file (23.8 kB).

data/test/trainable_data/de2en/test1.de-en.en.idx ADDED
Binary file (23.8 kB).

data/test/trainable_data/de2en/test2.de-en.de.idx ADDED
Binary file (6.61 kB).

data/test/trainable_data/de2en/test2.de-en.en.idx ADDED
Binary file (6.61 kB).

data/test/trainable_data/en2de/test.en-de.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2de/test2.en-de.de.idx ADDED
Binary file (6.71 kB).

data/test/trainable_data/en2de/test2.en-de.en.idx ADDED
Binary file (6.71 kB).

data/test/trainable_data/en2zh/dict.en.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/en2zh/dict.zh.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/en2zh/preprocess.log ADDED
@@ -0,0 +1,23 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='en', target_lang='zh', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpecode_32k/bpecode.zh', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpecode_32k/bpecode.en', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='en', target_lang='zh', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data/dict.zh.txt', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data/dict.en.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train.en: 33431411 sents, 890241636 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid.en: 1999 sents, 59177 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores.en: 1012 sents, 28474 tokens, 0.00702% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22.en: 2037 sents, 44690 tokens, 0.00224% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23.en: 2074 sents, 47187 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train.zh: 33431411 sents, 816506971 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid.zh: 1999 sents, 57690 tokens, 0.00347% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores.zh: 1012 sents, 27872 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22.zh: 2037 sents, 41432 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23.zh: 2074 sents, 44353 tokens, 0.0% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1
data/test/trainable_data/en2zh/test.en-zh.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2zh/test.en-zh.zh.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2zh/test1.en-zh.en.idx ADDED
Binary file (24.5 kB).

data/test/trainable_data/en2zh/test1.en-zh.zh.idx ADDED
Binary file (24.5 kB).

data/test/trainable_data/en2zh/test2.en-zh.en.idx ADDED
Binary file (24.9 kB).

data/test/trainable_data/en2zh/test2.en-zh.zh.idx ADDED
Binary file (24.9 kB).

data/test/trainable_data/zh2en/dict.en.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/zh2en/dict.zh.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/zh2en/preprocess.log ADDED
@@ -0,0 +1,6 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=42, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='zh', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores', align_suffix=None, destdir='/mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en0', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt', srcdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores.zh: 1012 sents, 27918 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores.en: 1012 sents, 28474 tokens, 0.00702% replaced by <unk>
+ Wrote preprocessed data to /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en0
data/test/trainable_data/zh2en/preprocess1.log ADDED
@@ -0,0 +1,6 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=42, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='zh', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22', align_suffix=None, destdir='/mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt', srcdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22.zh: 1875 sents, 51510 tokens, 0.0194% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22.en: 1875 sents, 62056 tokens, 0.00645% replaced by <unk>
+ Wrote preprocessed data to /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1
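
Both zh2en logs follow the usual fairseq pattern of binarizing a single test set against dictionaries built earlier; the second run above corresponds roughly to the following invocation (a sketch reconstructed from the logged arguments):

    fairseq-preprocess --source-lang zh --target-lang en \
        --testpref /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22 \
        --srcdict /mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt \
        --tgtdict /mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt \
        --destdir /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1 \
        --workers 32 --seed 42

Fixing --srcdict and --tgtdict this way keeps the binarized test data consistent with the token IDs the model was trained on.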
data/test/trainable_data/zh2en/test.zh-en.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/zh2en/test1.zh-en.en.idx ADDED
Binary file (22.5 kB).

data/test/trainable_data/zh2en/test2.zh-en.en.idx ADDED
Binary file (23.7 kB).

data/test/trainable_data/zh2en/test2.zh-en.zh.idx ADDED
Binary file (23.7 kB).

mosesdecoder/scripts/analysis/nontranslated_words.pl ADDED
@@ -0,0 +1,100 @@
+ #!/usr/bin/env perl
+ #
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
+ # Public License version 2.1 or, at your option, any later version.
+
+ # $Id$
+ # Reads a source and hypothesis file and counts equal tokens. Some of these
+ # are punctuation, some are numbers, but most of the remaining are simply
+ # unknown words that the decoder just copied. This script tells you how often
+ # this happens.
+ #
+ # Ondrej Bojar
+
+
+ use strict;
+ use warnings;
+ use Getopt::Long;
+
+ my $ignore_numbers = 0;
+ my $ignore_punct = 0;
+ my $usage = 0;
+ my $top = 10;
+
+ GetOptions(
+ "help" => \$usage,
+ "top=i" => \$top,
+ "ignore-numbers" => \$ignore_numbers,
+ "ignore-punct" => \$ignore_punct,
+ ) or exit 1;
+ my $src = shift;
+ my $tgt = shift;
+
+ if ($usage || !defined $src || !defined $tgt) {
+ print STDERR "nontranslated_words.pl srcfile hypothesisfile
+ ...counts the number of words that are equal in src and hyp. These are
+ typically unknown words.
+ Options:
+ --top=N ... list N top copied tokens
+ --ignore-numbers ... numbers usually do not get translated, but do
+ not count them (it is not an error)
+ --ignore-punct ... same for punct, do not include it in the count
+ ";
+ exit 1;
+ }
+
+ binmode(STDOUT, ":utf8");
+ binmode(STDERR, ":utf8");
+
+ open SRC, $src or die "Can't read $src";
+ open TGT, $tgt or die "Can't read $tgt";
+ binmode(SRC, ":utf8");
+ binmode(TGT, ":utf8");
+
+ my $nr=0;
+ my $outtoks = 0;
+ my $intoks = 0;
+ my $copiedtoks = 0;
+ my %copiedtok;
+ while (<SRC>) {
+ $nr++;
+ chomp;
+ s/^\s+|\s+$//g;
+ my @src = split /\s+/;
+ my %src = map {($_,1)} @src;
+ $intoks += scalar @src;
+ my $t = <TGT>;
+ die "$tgt too short!" if !defined $t;
+ $t =~ s/^\s+|\s+$//g;
+ foreach my $outtok (split /\s+/, $t) {
+ $outtoks++;
+ next if !defined $src{$outtok}; # this word did not appear in input, we generated it
+ next if $ignore_numbers && $outtok =~ /^-?[0-9]*([.,][0-9]+)?$/;
+ next if $ignore_punct && $outtok =~ /^[[:punct:]]+$/;
+ $copiedtoks++;
+ $copiedtok{$outtok}++;
+ }
+ }
+ my $t = <TGT>;
+ die "$tgt too long!" if defined $t;
+ close SRC;
+ close TGT;
+
+ print "Sentences:\t$nr
+ Source tokens:\t$intoks
+ Output tokens:\t$outtoks
+ Output tokens appearing also in input sent:\t$copiedtoks\t"
+ .sprintf("%.2f %%", $copiedtoks/$outtoks*100)
+ ."\t".($ignore_punct?"ignoring":"including")." punctuation"
+ ."\t".($ignore_numbers?"ignoring":"including")." numbers"
+ ."\n";
+
+ if ($top) {
+ my $cnt = 0;
+ print "Top $top copied tokens:\n";
+ foreach my $t (sort {$copiedtok{$b}<=>$copiedtok{$a} || $a cmp $b} keys %copiedtok) {
+ print "$copiedtok{$t}\t$t\n";
+ last if $cnt > $top;
+ $cnt++;
+ }
+ }
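
A typical invocation of this script compares a tokenized source file against the decoder output, e.g. (hypothetical file names; the options are those defined in the GetOptions call above):

    perl nontranslated_words.pl --top=20 --ignore-numbers --ignore-punct test.src test.hyp

With --ignore-numbers and --ignore-punct set, the copied-token count approximates a pure unknown-word rate, since numbers and punctuation legitimately pass through untranslated.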
mosesdecoder/scripts/analysis/smtgui/Corpus.pm ADDED
@@ -0,0 +1,1345 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #package Corpus: hold a bunch of sentences in any language, with translation factors and stats about individual sentences and the corpus as a whole
2
+ #Evan Herbst, 7 / 25 / 06
3
+ #
4
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
5
+ # Public License version 2.1 or, at your option, any later version.
6
+
7
+ package Corpus;
8
+ BEGIN
9
+ {
10
+ push @INC, "../perllib"; #for Error.pm
11
+ }
12
+ use Error;
13
+
14
+ return 1;
15
+
16
+ ###########################################################################################################################
17
+
18
+ ##### 'our' variables are available outside the package #####
19
+ #all factor names used should be in this list, just in case
20
+ our @FACTORNAMES = ('surf', 'pos', 'lemma', 'stem', 'morph');
21
+
22
+ #constructor
23
+ #arguments: short corpus name (-name), hashref of filenames to descriptions (-descriptions), formatted string with various config info (-info_line)
24
+ sub new
25
+ {
26
+ my $class = shift;
27
+ my %args = @_; #turn the remainder of @_ into a hash
28
+ my ($corpusName, $refFileDescs, $infoLine) = ($args{'-name'}, $args{'-descriptions'}, $args{'-info_line'});
29
+ my ($factorList, $inputLingmodels, $outputLingmodels) = split(/\s*:\s*/, $infoLine);
30
+ my $self = {};
31
+ $self->{'corpusName'} = $corpusName;
32
+ $self->{'truth'} = []; #arrayref of arrayrefs of factors
33
+ $self->{'input'} = []; #same; also same for any system outputs that get loaded
34
+ $self->{'tokenCount'} = {}; #sysname => number of tokens in file
35
+ $self->{'truthFilename'} = "";
36
+ $self->{'inputFilename'} = "";
37
+ $self->{'sysoutFilenames'} = {}; #hashref of (string => string) for (system name, filename)
38
+ $self->{'phraseTableFilenames'} = {}; #factor name => filename
39
+ $self->{'fileCtimes'} = {}; #file ID of some kind => changetime in seconds
40
+ $self->{'factorIndices'} = {}; #factor name => index
41
+ my @factors = split(/\s+/, $factorList);
42
+ for(my $i = 0; $i < scalar(@factors); $i++)
43
+ {
44
+ $self->{'factorIndices'}->{$factors[$i]} = $i;
45
+ }
46
+ $self->{'inputLMs'} = {}; #factor name => lingmodel filename
47
+ $self->{'outputLMs'} = {};
48
+ foreach my $lmInfo (split(/\s*,\s*/, $inputLingmodels))
49
+ {
50
+ my @tokens = split(/\s+/, $lmInfo);
51
+ $self->{'inputLMs'}->{$tokens[0]} = $tokens[1];
52
+ }
53
+ foreach my $lmInfo (split(/\s*,\s*/, $outputLingmodels))
54
+ {
55
+ my @tokens = split(/\s+/, $lmInfo);
56
+ $self->{'outputLMs'}->{$tokens[0]} = $tokens[1];
57
+ }
58
+ $self->{'phraseTables'} = {}; #factor name (from @FACTORNAMES) => hashref of source phrases to anything; used for unknown-word counting
59
+ $self->{'unknownCount'} = {}; #factor name => count of unknown tokens in input
60
+ $self->{'sysoutWER'} = {}; #system name => (factor name => arrayref with system output total WER and arrayref of WER scores for individual sysout sentences wrt truth)
61
+ $self->{'sysoutPWER'} = {}; #similarly
62
+ $self->{'nnAdjWERPWER'} = {}; #system name => arrayref of [normalized WER, normalized PWER]
63
+ $self->{'perplexity'} = {}; #system name => (factor name => perplexity raw score)
64
+ $self->{'fileDescriptions'} = {}; #filename associated with us => string description of file
65
+ $self->{'bleuScores'} = {}; #system name => (factor name => arrayref of (overall score, arrayref of per-sentence scores) )
66
+ $self->{'bleuConfidence'} = {}; #system name => (factor name => arrayrefs holding statistical test data on BLEU scores)
67
+ $self->{'subsetBLEUstats'} = {}; #system name => (factor name => n-gram precisions and lengths for independent corpus subsets)
68
+ $self->{'comparisonStats'} = {}; #system name 1 => (system name 2 => (factor name => p-values, and indices of better system, for all tests used))
69
+ $self->{'cacheFilename'} = "cache/$corpusName.cache"; #all memory of various scores is stored here
70
+ bless $self, $class;
71
+ $self->locateFiles($refFileDescs); #find all relevant files in the current directory; set filenames and descriptions
72
+ $self->loadCacheFile();
73
+ print STDERR "on load:\n";
74
+ $self->printDetails();
75
+ return $self;
76
+ }
77
+
78
+ #arguments: filename
79
+ #return: description string
80
+ #throw if filename doesn't belong to this corpus
81
+ sub getFileDescription
82
+ {
83
+ my ($self, $filename) = @_;
84
+ if(!defined($self->{'fileDescriptions'}->{$filename}))
85
+ {
86
+ throw Error::Simple(-text => "Corpus::getFileDescription(): invalid filename '$filename'\n");
87
+ }
88
+ return $self->{'fileDescriptions'}->{$filename};
89
+ }
90
+
91
+ #arguments: none
92
+ #return: list of system names (NOT including 'input', 'truth' and other special cases)
93
+ sub getSystemNames
94
+ {
95
+ my $self = shift;
96
+ return keys %{$self->{'sysoutFilenames'}};
97
+ }
98
+
99
+ #calculate the number of unknown factor values for the given factor in the input file
100
+ #arguments: factor name
101
+ #return: unknown factor count, total factor count (note the total doesn't depend on the factor)
102
+ #throw if we don't have an input file or a phrase table for the given factor defined or if there's no index known for the given factor
103
+ sub calcUnknownTokens
104
+ {
105
+ my ($self, $factorName) = @_;
106
+ #check in-memory cache first
107
+ if(exists $self->{'unknownCount'}->{$factorName} && exists $self->{'tokenCount'}->{'input'})
108
+ {
109
+ return ($self->{'unknownCount'}->{$factorName}, $self->{'tokenCount'}->{'input'});
110
+ }
111
+ warn "calcing unknown tokens\n";
112
+
113
+ $self->ensureFilenameDefined('input');
114
+ $self->ensurePhraseTableDefined($factorName);
115
+ $self->ensureFactorPosDefined($factorName);
116
+ $self->loadSentences('input', $self->{'inputFilename'});
117
+ $self->loadPhraseTable($factorName);
118
+
119
+ #count unknown and total words
120
+ my ($unknownTokens, $totalTokens) = (0, 0);
121
+ my $factorIndex = $self->{'factorIndices'}->{$factorName};
122
+ foreach my $sentence (@{$self->{'input'}})
123
+ {
124
+ $totalTokens += scalar(@$sentence);
125
+ foreach my $word (@$sentence)
126
+ {
127
+ if(!defined($self->{'phraseTables'}->{$factorName}->{$word->[$factorIndex]}))
128
+ {
129
+ $unknownTokens++;
130
+ }
131
+ }
132
+ }
133
+ $self->{'unknownCount'}->{$factorName} = $unknownTokens;
134
+ $self->{'tokenCount'}->{'input'} = $totalTokens;
135
+
136
+ return ($unknownTokens, $totalTokens);
137
+ }
138
+
139
+ #arguments: system name
140
+ #return: (WER, PWER) for nouns and adjectives in given system wrt truth
141
+ #throw if given system or truth is not set or if index of 'surf' or 'pos' hasn't been specified
142
+ sub calcNounAdjWER_PWERDiff
143
+ {
144
+ my ($self, $sysname) = @_;
145
+ #check in-memory cache first
146
+ if(exists $self->{'nnAdjWERPWER'}->{$sysname})
147
+ {
148
+ return @{$self->{'nnAdjWERPWER'}->{$sysname}};
149
+ }
150
+ warn "calcing NN/JJ PWER/WER\n";
151
+
152
+ $self->ensureFilenameDefined('truth');
153
+ $self->ensureFilenameDefined($sysname);
154
+ $self->ensureFactorPosDefined('surf');
155
+ $self->ensureFactorPosDefined('pos');
156
+ $self->loadSentences('truth', $self->{'truthFilename'});
157
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
158
+ #find nouns and adjectives and score them
159
+ my ($werScore, $pwerScore) = (0, 0);
160
+ my $nnNadjTags = $self->getPOSTagList('nounAndAdj');
161
+ for(my $i = 0; $i < scalar(@{$self->{'truth'}}); $i++)
162
+ {
163
+ my @nnAdjEWords = $self->filterFactors($self->{'truth'}->[$i], $self->{'factorIndices'}->{'pos'}, $nnNadjTags);
164
+ my @nnAdjSWords = $self->filterFactors($self->{$sysname}->[$i], $self->{'factorIndices'}->{'pos'}, $nnNadjTags);
165
+ my ($sentWer, $tmp) = $self->sentenceWER(\@nnAdjSWords, \@nnAdjEWords, $self->{'factorIndices'}->{'surf'});
166
+ $werScore += $sentWer;
167
+ ($sentWer, $tmp) = $self->sentencePWER(\@nnAdjSWords, \@nnAdjEWords, $self->{'factorIndices'}->{'surf'});
168
+ $pwerScore += $sentWer;
169
+ }
170
+
171
+ #unhog memory
172
+ $self->releaseSentences('truth');
173
+ $self->releaseSentences($sysname);
174
+ $self->{'nnAdjWERPWER'}->{$sysname} = [$werScore / $self->{'tokenCount'}->{'truth'}, $pwerScore / $self->{'tokenCount'}->{'truth'}];
175
+ return @{$self->{'nnAdjWERPWER'}->{$sysname}};
176
+ }
177
+
178
+ #calculate detailed WER statistics and put them into $self
179
+ #arguments: system name, factor name to consider (default 'surf', surface form)
180
+ #return: overall surface WER for given system (w/o filtering)
181
+ #throw if given system or truth is not set or if index of factor name hasn't been specified
182
+ sub calcOverallWER
183
+ {
184
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
185
+ if(scalar(@_) > 0) {$factorName = shift;}
186
+ #check in-memory cache first
187
+ if(exists $self->{'sysoutWER'}->{$sysname}->{$factorName})
188
+ {
189
+ return $self->{'sysoutWER'}->{$sysname}->{$factorName}->[0];
190
+ }
191
+ warn "calcing WER\n";
192
+
193
+ $self->ensureFilenameDefined('truth');
194
+ $self->ensureFilenameDefined($sysname);
195
+ $self->ensureFactorPosDefined($factorName);
196
+ $self->loadSentences('truth', $self->{'truthFilename'});
197
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
198
+
199
+ my ($wer, $swers, $indices) = $self->corpusWER($self->{$sysname}, $self->{'truth'}, $self->{'factorIndices'}->{$factorName});
200
+ $self->{'sysoutWER'}->{$sysname}->{$factorName} = [$wer, $swers, $indices]; #total; arrayref of scores for individual sentences; arrayref of arrayrefs of offending words in each sentence
201
+
202
+ #unhog memory
203
+ $self->releaseSentences('truth');
204
+ $self->releaseSentences($sysname);
205
+ return $self->{'sysoutWER'}->{$sysname}->{$factorName}->[0] / $self->{'tokenCount'}->{'truth'};
206
+ }
207
+
208
+ #calculate detailed PWER statistics and put them into $self
209
+ #arguments: system name, factor name to consider (default 'surf')
210
+ #return: overall surface PWER for given system (w/o filtering)
211
+ #throw if given system or truth is not set or if index of factor name hasn't been specified
212
+ sub calcOverallPWER
213
+ {
214
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
215
+ if(scalar(@_) > 0) {$factorName = shift;}
216
+ #check in-memory cache first
217
+ if(exists $self->{'sysoutPWER'}->{$sysname}->{$factorName})
218
+ {
219
+ return $self->{'sysoutPWER'}->{$sysname}->{$factorName}->[0];
220
+ }
221
+ warn "calcing PWER\n";
222
+
223
+ $self->ensureFilenameDefined('truth');
224
+ $self->ensureFilenameDefined($sysname);
225
+ $self->ensureFactorPosDefined($factorName);
226
+ $self->loadSentences('truth', $self->{'truthFilename'});
227
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
228
+
229
+ my ($pwer, $spwers, $indices) = $self->corpusPWER($self->{$sysname}, $self->{'truth'}, $self->{'factorIndices'}->{$factorName});
230
+ $self->{'sysoutPWER'}->{$sysname}->{$factorName} = [$pwer, $spwers, $indices]; #total; arrayref of scores for individual sentences; arrayref of arrayrefs of offending words in each sentence
231
+
232
+ #unhog memory
233
+ $self->releaseSentences('truth');
234
+ $self->releaseSentences($sysname);
235
+ return $self->{'sysoutPWER'}->{$sysname}->{$factorName}->[0] / $self->{'tokenCount'}->{'truth'};
236
+ }
237
+
238
+ #arguments: system name, factor name to consider (default 'surf')
239
+ #return: array of (BLEU score, n-gram precisions, brevity penalty)
240
+ sub calcBLEU
241
+ {
242
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
243
+ if(scalar(@_) > 0) {$factorName = shift;}
244
+ #check in-memory cache first
245
+ if(exists $self->{'bleuScores'}->{$sysname} && exists $self->{'bleuScores'}->{$sysname}->{$factorName})
246
+ {
247
+ return $self->{'bleuScores'}->{$sysname}->{$factorName};
248
+ }
249
+ warn "calcing BLEU\n";
250
+
251
+ $self->ensureFilenameDefined('truth');
252
+ $self->ensureFilenameDefined($sysname);
253
+ $self->ensureFactorPosDefined($factorName);
254
+ $self->loadSentences('truth', $self->{'truthFilename'});
255
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
256
+
257
+ #score structure: various total scores, arrayref of by-sentence score arrays
258
+ if(!exists $self->{'bleuScores'}->{$sysname}) {$self->{'bleuScores'}->{$sysname} = {};}
259
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName}) {$self->{'bleuScores'}->{$sysname}->{$factorName} = [[], []];}
260
+
261
+ my ($good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $totCLength, $totRLength) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
262
+ my $factorIndex = $self->{'factorIndices'}->{$factorName};
263
+ for(my $i = 0; $i < scalar(@{$self->{'truth'}}); $i++)
264
+ {
265
+ my ($truthSentence, $sysoutSentence) = ($self->{'truth'}->[$i], $self->{$sysname}->[$i]);
266
+ my ($unigood, $unicount, $bigood, $bicount, $trigood, $tricount, $quadrugood, $quadrucount, $cLength, $rLength) =
267
+ $self->sentenceBLEU($truthSentence, $sysoutSentence, $factorIndex, 0); #last argument is whether to debug-print
268
+ push @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]}, [$unigood, $unicount, $bigood, $bicount, $trigood, $tricount, $quadrugood, $quadrucount, $cLength, $rLength];
269
+ $good1 += $unigood; $tot1 += $unicount;
270
+ $good2 += $bigood; $tot2 += $bicount;
271
+ $good3 += $trigood; $tot3 += $tricount;
272
+ $good4 += $quadrugood; $tot4 += $quadrucount;
273
+ $totCLength += $cLength;
274
+ $totRLength += $rLength;
275
+ }
276
+ my $brevity = ($totCLength > $totRLength || $totCLength == 0) ? 1 : exp(1 - $totRLength / $totCLength);
277
+ my ($pct1, $pct2, $pct3, $pct4) = ($tot1 == 0 ? -1 : $good1 / $tot1, $tot2 == 0 ? -1 : $good2 / $tot2,
278
+ $tot3 == 0 ? -1 : $good3 / $tot3, $tot4 == 0 ? -1 : $good4 / $tot4);
279
+ my ($logsum, $logcount) = (0, 0);
280
+ if($tot1 > 0) {$logsum += my_log($pct1); $logcount++;}
281
+ if($tot2 > 0) {$logsum += my_log($pct2); $logcount++;}
282
+ if($tot3 > 0) {$logsum += my_log($pct3); $logcount++;}
283
+ if($tot4 > 0) {$logsum += my_log($pct4); $logcount++;}
284
+ my $bleu = $brevity * exp($logsum / $logcount);
285
+ $self->{'bleuScores'}->{$sysname}->{$factorName}->[0] = [$bleu, 100 * $pct1, 100 * $pct2, 100 * $pct3, 100 * $pct4, $brevity];
286
+
287
+ #unhog memory
288
+ $self->releaseSentences('truth');
289
+ $self->releaseSentences($sysname);
290
+ return @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[0]};
291
+ }
292
+
293
+ #do t-tests on the whole-corpus n-gram precisions vs. the average precisions over a set number of disjoint subsets
294
+ #arguments: system name, factor name BLEU was run on (default 'surf')
295
+ #return: arrayref of [arrayref of p-values for overall precision vs. subset average, arrayrefs of [(lower, upper) 95% credible intervals for true overall n-gram precisions]]
296
+ #
297
+ #written to try to save memory
298
+ sub statisticallyTestBLEUResults
299
+ {
300
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
301
+ if(scalar(@_) > 0) {$factorName = shift;}
302
+ #check in-memory cache first
303
+ if(exists $self->{'bleuConfidence'}->{$sysname} && exists $self->{'bleuConfidence'}->{$sysname}->{$factorName})
304
+ {
305
+ return $self->{'bleuConfidence'}->{$sysname}->{$factorName};
306
+ }
307
+ warn "performing consistency tests\n";
308
+
309
+ my $k = 30; #HARDCODED NUMBER OF SUBSETS (WE DO k-FOLD CROSS-VALIDATION); IF YOU CHANGE THIS YOU MUST ALSO CHANGE getApproxPValue() and $criticalTStat
310
+ my $criticalTStat = 2.045; #hardcoded value given alpha (.025 here) and degrees of freedom (= $k - 1) ########################################
311
+ $self->ensureFilenameDefined('truth');
312
+ $self->ensureFilenameDefined($sysname);
313
+ $self->ensureFactorPosDefined($factorName);
314
+
315
+ #ensure we have full-corpus BLEU results
316
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName})
317
+ {
318
+ $self->calcBLEU($sysname, $factorName);
319
+ }
320
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname}) {$self->{'subsetBLEUstats'}->{$sysname} = {};}
321
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname}->{$factorName}) {$self->{'subsetBLEUstats'}->{$sysname}->{$factorName} = [];}
322
+
323
+ #calculate n-gram precisions for each small subset
324
+ my @sentenceStats = @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]};
325
+ for(my $i = 0; $i < $k; $i++)
326
+ {
327
+ my ($good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $sysoutLength, $truthLength) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
328
+ for(my $j = $i; $j < scalar(@sentenceStats); $j += $k) #subset #K consists of every Kth sentence
329
+ {
330
+ $good1 += $sentenceStats[$j]->[0]; $tot1 += $sentenceStats[$j]->[1];
331
+ $good2 += $sentenceStats[$j]->[2]; $tot2 += $sentenceStats[$j]->[3];
332
+ $good3 += $sentenceStats[$j]->[4]; $tot3 += $sentenceStats[$j]->[5];
333
+ $good4 += $sentenceStats[$j]->[6]; $tot4 += $sentenceStats[$j]->[7];
334
+ $sysoutLength += $sentenceStats[$j]->[8];
335
+ $truthLength += $sentenceStats[$j]->[9];
336
+ }
337
+ push @{$self->{'subsetBLEUstats'}->{$sysname}->{$factorName}}, [$good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $sysoutLength, $truthLength];
338
+ }
339
+ my $subsetStats = $self->{'subsetBLEUstats'}->{$sysname}->{$factorName};
340
+ #calculate first two moments for subset scores for each n-gram precision, and t statistic
341
+ my $fullCorpusBLEU = $self->{'bleuScores'}->{$sysname}->{$factorName}->[0]; #an arrayref
342
+ my @means = (0) x 4;
343
+ my @devs = (0) x 4;
344
+ my $t = []; #t statistics for all n-gram orders
345
+ if(!exists $self->{'bleuConfidence'}->{$sysname}) {$self->{'bleuConfidence'}->{$sysname} = {};}
346
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName} = [[], []]; #lower-bound p-values for whole corpus vs. subset average; confidence intervals for all n-gram orders
347
+ for(my $i = 0; $i < 4; $i++) #run through n-gram orders
348
+ {
349
+ for(my $j = 0; $j < $k; $j++) #run through subsets
350
+ {
351
+ $means[$i] += $subsetStats->[$j]->[2 * $i] / $subsetStats->[$j]->[2 * $i + 1]; #matching / total n-grams
352
+ }
353
+ $means[$i] /= $k;
354
+ for(my $j = 0; $j < $k; $j++) #run through subsets
355
+ {
356
+ $devs[$i] += ($subsetStats->[$j]->[2 * $i] / $subsetStats->[$j]->[2 * $i + 1] - $means[$i]) ** 2;
357
+ }
358
+ $devs[$i] = sqrt($devs[$i] / ($k - 1));
359
+ $t->[$i] = ($fullCorpusBLEU->[$i + 1] / 100 - $means[$i]) / $devs[$i];
360
+ push @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0]}, getLowerBoundPValue($t->[$i]); #p-value for overall score vs. subset average
361
+ push @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1]},
362
+ [$means[$i] - $criticalTStat * $devs[$i] / sqrt($k), $means[$i] + $criticalTStat * $devs[$i] / sqrt($k)]; #the confidence interval
363
+ }
364
+
365
+ return $self->{'bleuConfidence'}->{$sysname}->{$factorName};
366
+ }
367
+
368
+ #arguments: system name, factor name
369
+ #return: perplexity of language model (specified in a config file) wrt given system output
370
+ sub calcPerplexity
371
+ {
372
+ my ($self, $sysname, $factorName) = @_;
373
+ print STDERR "ppl $sysname $factorName\n";
374
+ #check in-memory cache first
375
+ if(exists $self->{'perplexity'}->{$sysname} && exists $self->{'perplexity'}->{$sysname}->{$factorName})
376
+ {
377
+ return $self->{'perplexity'}->{$sysname}->{$factorName};
378
+ }
379
+ warn "calcing perplexity\n";
380
+
381
+ $self->ensureFilenameDefined($sysname);
382
+ my $sysoutFilename;
383
+ if($sysname eq 'truth' || $sysname eq 'input') {$sysoutFilename = $self->{"${sysname}Filename"};}
384
+ else {$sysoutFilename = $self->{'sysoutFilenames'}->{$sysname};}
385
+ my $lmFilename;
386
+ if($sysname eq 'input') {$lmFilename = $self->{'inputLMs'}->{$factorName};}
387
+ else {$lmFilename = $self->{'outputLMs'}->{$factorName};}
388
+ my $tmpfile = ".tmp" . time;
389
+ my $cmd = "perl ./extract-factors.pl $sysoutFilename " . $self->{'factorIndices'}->{$factorName} . " > $tmpfile";
390
+ `$cmd`; #extract just the factor we're interested in; ngram doesn't understand factored notation
391
+ my @output = `./ngram -lm $lmFilename -ppl $tmpfile`; #run the SRI n-gram tool
392
+ `rm -f $tmpfile`;
393
+ $output[1] =~ /ppl1=\s*([0-9\.]+)/;
394
+ $self->{'perplexity'}->{$sysname}->{$factorName} = $1;
395
+ return $self->{'perplexity'}->{$sysname}->{$factorName};
396
+ }
397
+
398
+ #run a paired t test and a sign test on BLEU statistics for subsets of both systems' outputs
399
+ #arguments: system name 1, system name 2, factor name
400
+ #return: arrayref of [arrayref of confidence levels for t test at which results differ, arrayref of index (0/1) of better system by t test,
401
+ # arrayref of confidence levels for sign test at which results differ, arrayref of index (0/1) of better system by sign test],
402
+ # where each inner arrayref has one element per n-gram order considered
403
+ sub statisticallyCompareSystemResults
404
+ {
405
+ my ($self, $sysname1, $sysname2, $factorName) = @_;
406
+ #check in-memory cache first
407
+ if(exists $self->{'comparisonStats'}->{$sysname1} && exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}
408
+ && exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName})
409
+ {
410
+ return $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName};
411
+ }
412
+ warn "comparing sysoutputs\n";
413
+
414
+ $self->ensureFilenameDefined($sysname1);
415
+ $self->ensureFilenameDefined($sysname2);
416
+ $self->ensureFactorPosDefined($factorName);
417
+ #make sure we have tallied results for both systems
418
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}) {$self->statisticallyTestBLEUResults($sysname1, $factorName);}
419
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}) {$self->statisticallyTestBLEUResults($sysname2, $factorName);}
420
+
421
+ if(!exists $self->{'comparisonStats'}->{$sysname1}) {$self->{'comparisonStats'}->{$sysname1} = {};}
422
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2} = {};}
423
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [];}
424
+ my ($tConfidences, $tWinningIndices, $signConfidences, $signWinningIndices) = ([], [], [], []);
425
+ for(my $i = 0; $i < 4; $i++) #loop over n-gram order
426
+ {
427
+ #t-test stats
428
+ my ($mean, $dev) = (0, 0); #of the difference between the first and second systems' precisions
429
+ #sign-test stats
430
+ my ($nPlus, $nMinus) = (0, 0);
431
+ my $j;
432
+ for($j = 0; $j < scalar(@{$self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}}); $j++)
433
+ {
434
+ my ($stats1, $stats2) = ($self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}->[$j], $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}->[$j]);
435
+ my ($prec1, $prec2) = ($stats1->[2 * $i] / $stats1->[2 * $i + 1], $stats2->[2 * $i] / $stats2->[2 * $i + 1]); #n-gram precisions
436
+ $mean += $prec1 - $prec2;
437
+ if($prec1 > $prec2) {$nPlus++;} else {$nMinus++;}
438
+ }
439
+ $mean /= $j;
440
+ for($j = 0; $j < scalar(@{$self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}}); $j++)
441
+ {
442
+ my ($stats1, $stats2) = ($self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}->[$j], $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}->[$j]);
443
+ my ($prec1, $prec2) = ($stats1->[2 * $i] / $stats1->[2 * $i + 1], $stats2->[2 * $i] / $stats2->[2 * $i + 1]); #n-gram precisions
444
+ $dev += ($prec1 - $prec2 - $mean) ** 2;
445
+ }
446
+ $dev = sqrt($dev / (($j - 1) * $j)); #need the extra j because the variance of Xbar is 1/n the variance of X
447
+ #t test
448
+ my $t = $mean / $dev; #this isn't the standard form; remember the difference of the means is equal to the mean of the differences
449
+ my $cc = getUpperBoundPValue($t);
450
+ print STDERR "comparing at n=$i: mu $mean, sigma $dev, t $t -> conf >= " . (1 - $cc) . "\n";
451
+ push @$tConfidences, $cc;
452
+ push @$tWinningIndices, ($mean > 0) ? 0 : 1;
453
+ #sign test
454
+ my %binomialCoefficients; #map (n+ - n-) to a coefficient; compute on the fly!
455
+ for(my $k = 0; $k <= $nPlus + $nMinus; $k++)
456
+ {
457
+ $binomialCoefficients{$k} = binCoeff($nPlus + $nMinus, $k);
458
+ }
459
+ my $sumCoeffs = 0;
460
+ foreach my $coeff (values %binomialCoefficients) #get a lower bound on the probability mass inside (n+ - n-)
461
+ {
462
+ if($coeff > $binomialCoefficients{$nPlus}) {$sumCoeffs += $coeff;}
463
+ }
464
+ push @$signConfidences, $sumCoeffs;
465
+ push @$signWinningIndices, ($nPlus > $nMinus) ? 0 : 1;
466
+ }
467
+ $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [$tConfidences, $tWinningIndices, $signConfidences, $signWinningIndices];
468
+ return $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName};
469
+ }
470
+
471
+ #write HTML to be displayed to compare the various versions we have of each sentence in the corpus;
472
+ #allow to filter which versions will be displayed
473
+ #(we don't write the whole page, just the contents of the body)
474
+ #arguments: filehandleref to which to write, regex to filter filename extensions to be included
475
+ #return: none
476
+ sub writeComparisonPage
477
+ {
478
+ my ($self, $fh, $filter) = @_;
479
+ my @filteredExtensions = grep($filter, ('e', 'f', keys %{$self->{'sysoutFilenames'}}));
480
+ my %openedFiles = $self->openFiles(@filteredExtensions);
481
+ my $id = 1; #sentence ID string
482
+ while(my %lines = $self->readLineFromFiles(%openedFiles))
483
+ {
484
+ $self->printSingleSentenceComparison($fh, $id, %lines);
485
+ $id++;
486
+ }
487
+ $self->closeFiles(%openedFiles);
488
+ }
489
+
490
+ ##########################################################################################################
491
+ ##### INTERNAL ###################################################################################
492
+ ##########################################################################################################
493
+
494
+ #destructor!
495
+ #arguments: none
496
+ #return: none
497
+ sub DESTROY
498
+ {
499
+ my $self = shift;
500
+ $self->writeCacheFile();
501
+ }
502
+
503
+ #write all scores in memory to disk
504
+ #arguments: none
505
+ #return: none
506
+ sub writeCacheFile
507
+ {
508
+ my $self = shift;
509
+ if(!open(CACHEFILE, ">" . $self->{'cacheFilename'}))
510
+ {
511
+ warn "Corpus::writeCacheFile(): can't open '" . $self->{'cacheFilename'} . "' for write\n";
512
+ return;
513
+ }
514
+
515
+ #store file changetimes to disk
516
+ print CACHEFILE "File changetimes\n";
517
+ my $ensureCtimeIsOutput = sub
518
+ {
519
+ my $ext = shift;
520
+ #check for a previously read value
521
+ if(exists $self->{'fileCtimes'}->{$ext} && $self->cacheIsCurrentForFile($ext)) {print CACHEFILE "$ext " . $self->{'fileCtimes'}->{$ext} . "\n";}
522
+ else {print CACHEFILE "$ext " . time . "\n";} #our info must just have been calculated
523
+ };
524
+ if(exists $self->{'truthFilename'}) {&$ensureCtimeIsOutput('e');}
525
+ if(exists $self->{'inputFilename'}) {&$ensureCtimeIsOutput('f');}
526
+ foreach my $factorName (keys %{$self->{'phraseTableFilenames'}}) {&$ensureCtimeIsOutput("pt_$factorName");}
527
+ foreach my $sysname (keys %{$self->{'sysoutFilenames'}}) {&$ensureCtimeIsOutput($sysname);}
528
+ #store bleu scores to disk
529
+ print CACHEFILE "\nBLEU scores\n";
530
+ foreach my $sysname (keys %{$self->{'bleuScores'}})
531
+ {
532
+ foreach my $factorName (keys %{$self->{'bleuScores'}->{$sysname}})
533
+ {
534
+ print CACHEFILE "$sysname $factorName " . join(' ', @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[0]});
535
+ foreach my $sentenceBLEU (@{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]})
536
+ {
537
+ print CACHEFILE ";" . join(' ', @$sentenceBLEU);
538
+ }
539
+ print CACHEFILE "\n";
540
+ }
541
+ }
542
+ #store t statistics for overall BLEU score and subsets in k-fold cross-validation
543
+ print CACHEFILE "\nBLEU statistics\n";
544
+ foreach my $sysname (keys %{$self->{'bleuConfidence'}})
545
+ {
546
+ foreach my $factorName (keys %{$self->{'bleuConfidence'}->{$sysname}})
547
+ {
548
+ print CACHEFILE "$sysname $factorName " . join(' ', @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0]});
549
+ foreach my $subsetConfidence (@{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1]})
550
+ {
551
+ print CACHEFILE ";" . join(' ', @$subsetConfidence);
552
+ }
553
+ print CACHEFILE "\n";
554
+ }
555
+ }
556
+ #store statistics comparing system outputs
557
+ print CACHEFILE "\nStatistical comparisons\n";
558
+ foreach my $sysname1 (keys %{$self->{'comparisonStats'}})
559
+ {
560
+ foreach my $sysname2 (keys %{$self->{'comparisonStats'}->{$sysname1}})
561
+ {
562
+ foreach my $factorName (keys %{$self->{'comparisonStats'}->{$sysname1}->{$sysname2}})
563
+ {
564
+ print CACHEFILE "$sysname1 $sysname2 $factorName " . join(';', map {join(' ', @$_)} @{$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}}) . "\n";
565
+ }
566
+ }
567
+ }
568
+ #store unknown-token counts to disk
569
+ print CACHEFILE "\nUnknown-token counts\n";
570
+ foreach my $factorName (keys %{$self->{'unknownCount'}})
571
+ {
572
+ print CACHEFILE $factorName . " " . $self->{'phraseTableFilenames'}->{$factorName} . " " . $self->{'unknownCount'}->{$factorName} . " " . $self->{'tokenCount'}->{'input'} . "\n";
573
+ }
574
+ #store WER, PWER to disk
575
+ print CACHEFILE "\nWER scores\n";
576
+ my $printWERFunc =
577
+ sub
578
+ {
579
+ my $werType = shift;
580
+ foreach my $sysname (keys %{$self->{$werType}})
581
+ {
582
+ foreach my $factorName (keys %{$self->{$werType}->{$sysname}})
583
+ {
584
+ my ($totalWER, $sentenceWERs, $errorWords) = @{$self->{$werType}->{$sysname}->{$factorName}};
585
+ print CACHEFILE "$werType $sysname $factorName $totalWER " . join(' ', @$sentenceWERs);
586
+ foreach my $indices (@$errorWords)
587
+ {
588
+ print CACHEFILE ";" . join(' ', @$indices);
589
+ }
590
+ print CACHEFILE "\n";
591
+ }
592
+ }
593
+ };
594
+ &$printWERFunc('sysoutWER');
595
+ &$printWERFunc('sysoutPWER');
596
+ #store corpus perplexities to disk
597
+ print CACHEFILE "\nPerplexity\n";
598
+ foreach my $sysname (keys %{$self->{'perplexity'}})
599
+ {
600
+ foreach my $factorName (keys %{$self->{'perplexity'}->{$sysname}})
601
+ {
602
+ print CACHEFILE "$sysname $factorName " . $self->{'perplexity'}->{$sysname}->{$factorName} . "\n";
603
+ }
604
+ }
605
+ print "\nNN/ADJ WER/PWER\n";
606
+ foreach my $sysname (keys %{$self->{'nnAdjWERPWER'}})
607
+ {
608
+ print CACHEFILE "$sysname " . join(' ', @{$self->{'nnAdjWERPWER'}->{$sysname}}) . "\n";
609
+ }
610
+ print "\n";
611
+ close(CACHEFILE);
612
+ }
613
+
614
+ #load all scores present in the cache file into the appropriate fields of $self
615
+ #arguments: none
616
+ #return: none
617
+ sub loadCacheFile
618
+ {
619
+ my $self = shift;
620
+ if(!open(CACHEFILE, "<" . $self->{'cacheFilename'}))
621
+ {
622
+ warn "Corpus::loadCacheFile(): can't open '" . $self->{'cacheFilename'} . "' for read\n";
623
+ return;
624
+ }
625
+ my $mode = 'none';
626
+ while(my $line = <CACHEFILE>)
627
+ {
628
+ next if $line =~ /^[ \t\n\r\x0a]*$/; #anyone know why char 10 (0x0a) shows up on empty lines, at least on solaris?
629
+ chomp $line;
630
+ #check for start of section
631
+ if($line =~ /File changetimes/) {$mode = 'ctime';}
632
+ elsif($line =~ /BLEU scores/) {$mode = 'bleu';}
633
+ elsif($line =~ /BLEU statistics/) {$mode = 'bstats';}
634
+ elsif($line =~ /Statistical comparisons/) {$mode = 'cmp';}
635
+ elsif($line =~ /Unknown-token counts/) {$mode = 'unk';}
636
+ elsif($line =~ /WER scores/) {$mode = 'wer';}
637
+ elsif($line =~ /Perplexity/) {$mode = 'ppl';}
638
+ elsif($line =~ /NN\/ADJ WER\/PWER/) {$mode = 'nawp';}
639
+ #get data when in a mode already
640
+ elsif($mode eq 'ctime')
641
+ {
642
+ my ($fileExtension, $ctime) = split(/\s+/, $line); #lexical 'my', not 'local': block-scoped temporaries
643
+ $self->{'fileCtimes'}->{$fileExtension} = $ctime;
644
+ }
645
+ elsif($mode eq 'bleu')
646
+ {
647
+ my ($sysname, $factorName, $rest) = split(/\s+/, $line, 3);
648
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
649
+ if(!exists $self->{'bleuScores'}->{$sysname}) {$self->{'bleuScores'}->{$sysname} = {};}
650
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName}) {$self->{'bleuScores'}->{$sysname}->{$factorName} = [[], []];}
651
+ my @stats = map {my @tmp = split(/\s+/, $_); \@tmp;} split(/;/, $rest);
652
+ print STDERR "bleu 1: " . join(', ', @{shift @stats}) . "\n";
653
+ print STDERR "bleu 2: " . join(' ', map {"{" . join(', ', @$_) . "}"} @stats) . "\n";
654
+ # $self->{'bleuScores'}->{$sysname}->{$factorName}->[0] = shift @stats;
655
+ # $self->{'bleuScores'}->{$sysname}->{$factorName}->[1] = \@stats;
656
+ }
657
+ elsif($mode eq 'bstats')
658
+ {
659
+ my ($sysname, $factorName, $rest) = split(/\s+/, $line, 3);
660
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
661
+ if(!exists $self->{'bleuConfidence'}->{$sysname}) {$self->{'bleuConfidence'}->{$sysname} = {};}
662
+ if(!exists $self->{'bleuConfidence'}->{$sysname}->{$factorName}) {$self->{'bleuConfidence'}->{$sysname}->{$factorName} = [[], []];}
663
+ my @stats = map {my @tmp = split(/\s+/, $_); \@tmp;} split(/;/, $rest);
664
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0] = shift @stats;
665
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1] = \@stats;
666
+ }
667
+ elsif($mode eq 'cmp')
668
+ {
669
+ my ($sysname1, $sysname2, $factorName, $rest) = split(/\s+/, $line, 4);
670
+ next if !$self->cacheIsCurrentForFile($sysname1) || !$self->cacheIsCurrentForFile($sysname2) || !$self->cacheIsCurrentForFile('e');
671
+ if(!exists $self->{'comparisonStats'}->{$sysname1}) {$self->{'comparisonStats'}->{$sysname1} = {};}
672
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2} = {};}
673
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [];}
674
+ my @stats = map {my @x = split(' ', $_); \@x} split(/;/, $rest);
675
+ $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = \@stats;
676
+ }
677
+ elsif($mode eq 'unk')
678
+ {
679
+ my ($factorName, $phraseTableFilename, $unknownCount, $totalCount) = split(' ', $line);
680
+ next if !$self->cacheIsCurrentForFile('f') || !$self->cacheIsCurrentForFile("pt_$factorName");
681
+ if(defined($self->{'phraseTableFilenames'}->{$factorName}) && $self->{'phraseTableFilenames'}->{$factorName} eq $phraseTableFilename)
682
+ {
683
+ $self->{'unknownCount'}->{$factorName} = $unknownCount;
684
+ $self->{'tokenCount'}->{'input'} = $totalCount; #match the key written by saveCacheFile()
685
+ }
686
+ }
687
+ elsif($mode eq 'wer')
688
+ {
689
+ my ($werType, $sysname, $factorName, $totalWER, $details) = split(/\s+/, $line, 5); #werType is 'sysoutWER' or 'sysoutPWER'
690
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
691
+ $details =~ /^([^;]*);(.*)/;
692
+ my @sentenceWERs = split(/\s+/, $1);
693
+ if(!exists $self->{$werType}->{$sysname}) {$self->{$werType}->{$sysname} = {};}
694
+ $self->{$werType}->{$sysname}->{$factorName} = [$totalWER, \@sentenceWERs, []];
695
+ my @indexLists = split(/;/, $2);
696
+ for(my $i = 0; $i < scalar(@sentenceWERs); $i++)
697
+ {
698
+ my @indices = grep(/\S/, split(/\s+/, $indexLists[$i])); #find all nonempty tokens
699
+ push @{$self->{$werType}->{$sysname}->{$factorName}->[2]}, \@indices; #one list of error indices per sentence
700
+ }
701
+ }
702
+ elsif($mode eq 'ppl')
703
+ {
704
+ my ($sysname, $factorName, $perplexity) = split(/\s+/, $line);
705
+ next if !$self->cacheIsCurrentForFile($sysname);
706
+ if(!exists $self->{'perplexity'}->{$sysname}) {$self->{'perplexity'}->{$sysname} = {};}
707
+ $self->{'perplexity'}->{$sysname}->{$factorName} = $perplexity;
708
+ }
709
+ elsif($mode eq 'nawp')
710
+ {
711
+ my ($sysname, @scores) = split(/\s+/, $line);
712
+ next if !$self->cacheIsCurrentForFile($sysname);
713
+ $self->{'nnAdjWERPWER'}->{$sysname} = \@scores;
714
+ }
715
+ }
716
+ close(CACHEFILE);
717
+ }
718
+
719
+ #arguments: cache type ('bleu' | ...), system name, factor name
720
+ #return: none
721
+ sub flushCache
722
+ {
723
+ my ($self, $cacheType, $sysname, $factorName) = @_;
724
+ if($cacheType eq 'bleu')
725
+ {
726
+ if(defined($self->{'bleuScores'}->{$sysname}) && defined($self->{'bleuScores'}->{$sysname}->{$factorName}))
727
+ {
728
+ delete $self->{'bleuScores'}->{$sysname}->{$factorName};
729
+ }
730
+ }
731
+ }
732
+
733
+ #arguments: file extension
734
+ #return: whether (0/1) our cache for the given file is at least as recent as the file
735
+ sub cacheIsCurrentForFile
736
+ {
737
+ my ($self, $ext) = @_;
738
+ return 0 if !exists $self->{'fileCtimes'}->{$ext} ;
739
+ my @liveStats = stat($self->{'corpusName'} . ".$ext");
740
+ return ($liveStats[9] <= $self->{'fileCtimes'}->{$ext}) ? 1 : 0;
741
+ }
742
+
743
+ ##### utils #####
744
+ #arguments: a, b (scalars)
745
+ sub min
746
+ {
747
+ my ($a, $b) = @_;
748
+ return ($a < $b) ? $a : $b;
749
+ }
750
+ #arguments: a, b (scalars)
751
+ sub max
752
+ {
753
+ my ($a, $b) = @_;
754
+ return ($a > $b) ? $a : $b;
755
+ }
756
+ #arguments: x
757
+ sub my_log
758
+ {
759
+ return -9999999999 unless $_[0];
760
+ return log($_[0]);
761
+ }
762
+ #arguments: x
763
+ sub round
764
+ {
765
+ my $x = shift;
766
+ if($x - int($x) < .5) {return int($x);}
767
+ return int($x) + 1;
768
+ }
769
+
770
+ #return an approximation of the p-value for a given t FOR A HARDCODED NUMBER OF DEGREES OF FREEDOM
771
+ # (IF YOU CHANGE THIS HARDCODED NUMBER YOU MUST ALSO CHANGE statisticallyTestBLEUResults() and getLowerBoundPValue() )
772
+ #arguments: the t statistic, $t
773
+ #return: a lower bound on the probability mass outside (beyond) +/-$t in the t distribution
774
+ #
775
+ #for a wonderful t-distribution calculator, see <http://math.uc.edu/~brycw/classes/148/tables.htm#t>. UC.edu is Cincinnati.
776
+ sub getLowerBoundPValue
777
+ {
778
+ my $t = abs(shift);
779
+ #encode various known p-values for ###### DOF = 29 ######
780
+ my %t2p = #since we're comparing (hopefully) very similar values, this chart is weighted toward the low end of the t-stat
781
+ (
782
+ 0.0063 => .995,
783
+ 0.0126 => .99,
784
+ 0.0253 => .98,
785
+ 0.0380 => .97,
786
+ 0.0506 => .96,
787
+ 0.0633 => .95,
788
+ 0.0950 => .925,
789
+ 0.127 => .9,
790
+ 0.191 => .85,
791
+ 0.256 => .8,
792
+ 0.389 => .7,
793
+ 0.530 => .6,
794
+ 0.683 => .5,
795
+ 0.854 => .4,
796
+ 1.055 => .3,
797
+ 1.311 => .2,
798
+ 1.699 => .1
799
+ );
800
+ foreach my $tCmp (sort {$a <=> $b} keys %t2p) {return $t2p{$tCmp} if $t <= $tCmp;} #numeric sort, not string sort
801
+ return 0; #loosest bound ever! groovy, man
802
+ }
803
+ #arguments: the t statistic, $t
804
+ #return: an upper bound on the probability mass outside (beyond) +/-$t in the t distribution
805
+ sub getUpperBoundPValue
806
+ {
807
+ my $t = abs(shift);
808
+ #encode various known p-values for ###### DOF = 29 ######
809
+ my %t2p =
810
+ (
811
+ 4.506 => .0001,
812
+ 4.254 => .0002,
813
+ 3.918 => .0005,
814
+ 3.659 => .001,
815
+ 3.396 => .002,
816
+ 3.038 => .005,
817
+ 2.756 => .01,
818
+ 2.462 => .02,
819
+ 2.045 => .05,
820
+ 1.699 => .1,
821
+ 1.311 => .2,
822
+ 0.683 => .5
823
+ );
824
+ foreach my $tCmp (sort {$b <=> $a} keys %t2p) {return $t2p{$tCmp} if $t >= $tCmp;} #numeric sort, descending
825
+ return 1; #loosest bound ever!
826
+ }
827
+
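+ #Worked example (added for illustration): with the DOF = 29 tables above,
+ #t = 1.5 falls between 1.311 and 1.699, so getLowerBoundPValue(1.5) returns .1
+ #and getUpperBoundPValue(1.5) returns .2 -- i.e. the two-sided p-value lies
+ #in [.1, .2].
+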
828
+ #arguments: n, r
829
+ #return: binomial coefficient for p = .5 (ie nCr * (1/2)^n)
830
+ sub binCoeff
831
+ {
832
+ my ($n, $r) = @_;
833
+ my $coeff = 1;
834
+ for(my $i = $r + 1; $i <= $n; $i++) {$coeff *= $i; $coeff /= ($i - $r);}
835
+ return $coeff * (.5 ** $n);
836
+ }
837
+
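+ #Worked example (added): binCoeff(5, 2) computes 5C2 * (1/2)^5 = 10/32 = 0.3125,
+ #the probability of exactly 2 successes in 5 fair coin flips, as used by the
+ #sign test.
+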
838
+ #throw if the given factor doesn't have an index defined
839
+ #arguments: factor name
840
+ #return: none
841
+ sub ensureFactorPosDefined
842
+ {
843
+ my ($self, $factorName) = @_;
844
+ if(!defined($self->{'factorIndices'}->{$factorName}))
845
+ {
846
+ throw Error::Simple(-text => "Corpus: no index known for factor '$factorName'\n");
847
+ }
848
+ }
849
+
850
+ #throw if the filename field corresponding to the argument hasn't been defined
851
+ #arguments: 'truth' | 'input' | a system name
852
+ #return: none
853
+ sub ensureFilenameDefined
854
+ {
855
+ my ($self, $sysname) = @_;
856
+ if($sysname eq 'truth' || $sysname eq 'input')
857
+ {
858
+ if(!defined($self->{"${sysname}Filename"}))
859
+ {
860
+ throw Error::Simple(-text => "Corpus: no $sysname corpus defined\n");
861
+ }
862
+ }
863
+ else
864
+ {
865
+ if(!defined($self->{'sysoutFilenames'}->{$sysname}))
866
+ {
867
+ throw Error::Simple(-text => "Corpus: no system $sysname defined\n");
868
+ }
869
+ }
870
+ }
871
+
872
+ #throw if there isn't a defined phrase-table filename for the given factor
873
+ #arguments: factor name
874
+ #return: none
875
+ sub ensurePhraseTableDefined
876
+ {
877
+ my ($self, $factorName) = @_;
878
+ if(!defined($self->{'phraseTableFilenames'}->{$factorName}))
879
+ {
880
+ throw Error::Simple(-text => "Corpus: no phrase table defined for factor '$factorName'\n");
881
+ }
882
+ }
883
+
884
+ #search current directory for files with our corpus name as basename and set filename fields of $self
885
+ #arguments: hashref of filenames to descriptions
886
+ #return: none
887
+ sub locateFiles
888
+ {
889
+ my ($self, $refDescs) = @_;
890
+ open(DIR, "ls -x1 . |") or die "Corpus::locateFiles(): couldn't list current directory\n";
891
+ my $corpusName = $self->{'corpusName'};
892
+ while(my $filename = <DIR>)
893
+ {
894
+ chomp $filename; #remove \n
895
+ if($filename =~ /^$corpusName\.(.*)$/)
896
+ {
897
+ my $ext = $1;
898
+ if($ext eq 'e') {$self->{'truthFilename'} = $filename;}
899
+ elsif($ext eq 'f') {$self->{'inputFilename'} = $filename;}
900
+ elsif($ext =~ /pt_(.*)/) {$self->{'phraseTableFilenames'}->{$1} = $filename;}
901
+ else {$self->{'sysoutFilenames'}->{$ext} = $filename;}
902
+ if(defined($refDescs->{$filename}))
903
+ {
904
+ $self->{'fileDescriptions'}->{$filename} = $refDescs->{$filename};
905
+ }
906
+ }
907
+ }
908
+ close(DIR);
909
+ }
910
+
911
+ #arguments: type ('truth' | 'input' | a string to represent a system output), filename
912
+ #pre: filename exists
913
+ #return: none
914
+ sub loadSentences
915
+ {
916
+ my ($self, $sysname, $filename) = @_;
917
+ #if the sentences are already loaded, leave them be
918
+ if(exists $self->{$sysname} && scalar(@{$self->{$sysname}}) > 0) {return;}
919
+
920
+ $self->{$sysname} = [];
921
+ $self->{'tokenCount'}->{$sysname} = 0;
922
+ open(INFILE, "<$filename") or die "Corpus::load(): couldn't open '$filename' for read\n";
923
+ while(my $line = <INFILE>)
924
+ {
925
+ my @words = split(/\s+/, $line);
926
+ $self->{'tokenCount'}->{$sysname} += scalar(@words);
927
+ my $refFactors = [];
928
+ foreach my $word (@words)
929
+ {
930
+ my @factors = split(/\|/, $word);
931
+ push @$refFactors, \@factors;
932
+ }
933
+ push @{$self->{$sysname}}, $refFactors;
934
+ }
935
+ close(INFILE);
936
+ }
937
+
938
+ #free the memory used for the given corpus (but NOT any associated calculations, eg WER)
939
+ #arguments: type ('truth' | 'input' | a string to represent a system output)
940
+ #return: none
941
+ sub releaseSentences
942
+ {
943
+ # my ($self, $sysname) = @_;
944
+ # $self->{$sysname} = [];
945
+ }
946
+
947
+ #arguments: factor name
948
+ #return: none
949
+ #throw if we don't have a filename for the given phrase table
950
+ sub loadPhraseTable
951
+ {
952
+ my ($self, $factorName) = @_;
953
+ $self->ensurePhraseTableDefined($factorName);
954
+
955
+ my $filename = $self->{'phraseTableFilenames'}->{$factorName};
956
+ open(PTABLE, "<$filename") or die "couldn't open '$filename' for read\n";
957
+ $self->{'phraseTables'}->{$factorName} = {}; #create ref to phrase table (hash of strings, for source phrases, to anything whatsoever)
958
+ #assume the table is sorted so that duplicate source phrases will be consecutive
959
+ while(my $line = <PTABLE>)
960
+ {
961
+ my @phrases = split(/\s*\|\|\|\s*/, $line, 2);
962
+ $self->{'phraseTables'}->{$factorName}->{$phrases[0]} = 0; #just so that it's set to something
963
+ }
964
+ close(PTABLE);
965
+ }
966
+
967
+ #arguments: factor name
968
+ #return: none
969
+ sub releasePhraseTable
970
+ {
971
+ my ($self, $factorName) = @_;
972
+ $self->{'phraseTables'}->{$factorName} = {};
973
+ }
974
+
975
+ #arguments: name of list ('nounAndAdj' | ...)
976
+ #return: arrayref of strings (postags)
977
+ sub getPOSTagList
978
+ {
979
+ my ($self, $listname) = @_;
980
+ ##### assume PTB tagset #####
981
+ if($listname eq 'nounAndAdj') {return ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS'];}
982
+ # if($listname eq '') {return [];}
983
+ }
984
+
985
+ #arguments: list to be filtered (arrayref of arrayrefs of factor strings), desired factor index, arrayref of allowable values
986
+ #return: filtered list as array of arrayrefs of factor strings
987
+ sub filterFactors
988
+ {
989
+ my ($self, $refFullList, $index, $refFactorValues) = @_;
990
+ my $valuesRegex = join("|", @$refFactorValues);
991
+ my @filteredList = ();
992
+ foreach my $factors (@$refFullList)
993
+ {
994
+ if($factors->[$index] =~ m/$valuesRegex/)
995
+ {
996
+ push @filteredList, $factors;
997
+ }
998
+ }
999
+ return @filteredList;
1000
+ }
1001
+
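+ #Example (added): $self->filterFactors($refSentence, $posIndex, $self->getPOSTagList('nounAndAdj'))
+ #keeps only the words whose POS factor matches a PTB noun/adjective tag -- the
+ #sort of token selection behind the noun & adj WER-PWER statistics.
+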
1002
+ #arguments: system output (arrayref of arrayrefs of arrayrefs of factor strings), truth (same), factor index to use
1003
+ #return: wer score, arrayref of sentence scores, arrayref of arrayrefs of indices of errorful words
1004
+ sub corpusWER
1005
+ {
1006
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1007
+ my ($totWER, $sentenceWER, $errIndices) = (0, [], []);
1008
+ for(my $i = 0; $i < scalar(@$refSysOutput); $i++)
1009
+ {
1010
+ my ($sentWER, $indices) = $self->sentenceWER($refSysOutput->[$i], $refTruth->[$i], $index);
1011
+ $totWER += $sentWER;
1012
+ push @$sentenceWER, $sentWER;
1013
+ push @$errIndices, $indices;
1014
+ }
1015
+ return ($totWER, $sentenceWER, $errIndices);
1016
+ }
1017
+
1018
+ #arguments: system output (arrayref of arrayrefs of factor strings), truth (same), factor index to use
1019
+ #return: wer score, arrayref of arrayrefs of indices of errorful words
1020
+ sub sentenceWER
1021
+ {
1022
+ #constants: direction we came through the table
1023
+ my ($DIR_NONE, $DIR_SKIPTRUTH, $DIR_SKIPOUT, $DIR_SKIPBOTH) = (-1, 0, 1, 2); #values don't matter but must be unique
1024
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1025
+ my ($totWER, $indices) = (0, []);
1026
+ my ($sLength, $eLength) = (scalar(@$refSysOutput), scalar(@$refTruth));
1027
+ if($sLength == 0 || $eLength == 0) {return ($totWER, $indices);} #special case
1028
+
1029
+ my @refWordsMatchIndices = (-1) x $eLength; #at what sysout-word index this truth word is first matched
1030
+ my @sysoutWordsMatchIndices = (-1) x $sLength; #at what truth-word index this sysout word is first matched
1031
+ my $table = []; #index by sysout word index, then truth word index; a cell holds max count of matching words and direction we came to get it
1032
+ #dynamic-programming time: find the path through the table with the maximum number of matching words
1033
+ for(my $i = 0; $i < $sLength; $i++)
1034
+ {
1035
+ push @$table, [];
1036
+ for(my $j = 0; $j < $eLength; $j++)
1037
+ {
1038
+ my ($maxPrev, $prevDir) = (0, $DIR_NONE);
1039
+ if($i > 0 && $table->[$i - 1]->[$j]->[0] >= $maxPrev) {$maxPrev = $table->[$i - 1]->[$j]->[0]; $prevDir = $DIR_SKIPOUT;}
1040
+ if($j > 0 && $table->[$i]->[$j - 1]->[0] >= $maxPrev) {$maxPrev = $table->[$i]->[$j - 1]->[0]; $prevDir = $DIR_SKIPTRUTH;}
1041
+ if($i > 0 && $j > 0 && $table->[$i - 1]->[$j - 1]->[0] >= $maxPrev) {$maxPrev = $table->[$i - 1]->[$j - 1]->[0]; $prevDir = $DIR_SKIPBOTH;}
1042
+ my $match = ($refSysOutput->[$i]->[$index] eq $refTruth->[$j]->[$index] && $refWordsMatchIndices[$j] == -1 && $sysoutWordsMatchIndices[$i] == -1) ? 1 : 0;
1043
+ if($match == 1) {$refWordsMatchIndices[$j] = $i; $sysoutWordsMatchIndices[$i] = $j;}
1044
+ push @{$table->[$i]}, [($match ? $maxPrev + 1 : $maxPrev), $prevDir];
1045
+ }
1046
+ }
1047
+
1048
+ #look back along the path and get indices of non-matching words
1049
+ my @unusedSysout = (0) x $sLength; #whether each sysout word was matched--used for outputting html table
1050
+ my ($i, $j) = ($sLength - 1, $eLength - 1);
1051
+ while($i > 0) #work our way back to the first sysout word
1052
+ {
1053
+ push @{$table->[$i]->[$j]}, 0; #length is flag to highlight cell
1054
+ if($table->[$i]->[$j]->[1] == $DIR_SKIPTRUTH)
1055
+ {
1056
+ $j--;
1057
+ }
1058
+ elsif($table->[$i]->[$j]->[1] == $DIR_SKIPOUT)
1059
+ {
1060
+ if($table->[$i - 1]->[$j]->[0] == $table->[$i]->[$j]->[0]) {unshift @$indices, $i; $unusedSysout[$i] = 1;}
1061
+ $i--;
1062
+ }
1063
+ elsif($table->[$i]->[$j]->[1] == $DIR_SKIPBOTH)
1064
+ {
1065
+ if($table->[$i - 1]->[$j - 1]->[0] == $table->[$i]->[$j]->[0]) {unshift @$indices, $i; $unusedSysout[$i] = 1;}
1066
+ $i--; $j--;
1067
+ }
1068
+ }
1069
+ #we're at the first sysout word; finish up checking for matches
1070
+ while($j > 0 && $refWordsMatchIndices[$j] != 0) {push @{$table->[0]->[$j]}, 0; $j--;}
1071
+ if($j == 0 && $refWordsMatchIndices[0] != 0) {unshift @$indices, 0; $unusedSysout[0] = 1;} #no truth word was matched to the first sysout word
1072
+
1073
+ #print some HTML to debug the WER algorithm
1074
+ # print "<table border=1><tr><td></td><td>" . join("</td><td>", map {() . $_->[$index]} @$refTruth) . "</td></tr>";
1075
+ # for(my $i = 0; $i < $sLength; $i++)
1076
+ # {
1077
+ # print "<tr><td" . (($unusedSysout[$i] == 1) ? " style=\"background-color: #ffdd88\">" : ">") . $refSysOutput->[$i]->[$index] . "</td>";
1078
+ # for(my $j = 0; $j < $eLength; $j++)
1079
+ # {
1080
+ # print "<td";
1081
+ # if(scalar(@{$table->[$i]->[$j]}) > 2) {print " style=\"color: yellow; background-color: #000080\"";}
1082
+ # my $arrow;
1083
+ # if($table->[$i]->[$j]->[1] == $DIR_NONE) {$arrow = "&times;";}
1084
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPTRUTH) {$arrow = "&larr;";}
1085
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPOUT) {$arrow = "&uarr;";}
1086
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPBOTH) {$arrow = "&loz;";}
1087
+ # print ">" . $table->[$i]->[$j]->[0] . " " . $arrow . "</td>";
1088
+ # }
1089
+ # print "</tr>";
1090
+ # }
1091
+ # print "</table>";
1092
+
1093
+ my $matchCount = 0;
1094
+ if($sLength > 0) {$matchCount = $table->[$sLength - 1]->[$eLength - 1]->[0];}
1095
+ return ($sLength - $matchCount, $indices);
1096
+ }
1097
+
1098
+ #arguments: system output (arrayref of arrayrefs of arrayrefs of factor strings), truth (same), factor index to use
1099
+ #return: wer score, arrayref of sentence scores, arrayref of arrayrefs of indices of errorful words
1100
+ sub corpusPWER
1101
+ {
1102
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1103
+ my ($totWER, $sentenceWER, $errIndices) = (0, [], []);
1104
+ for(my $i = 0; $i < scalar(@$refSysOutput); $i++)
1105
+ {
1106
+ my ($sentWER, $indices) = $self->sentencePWER($refSysOutput->[$i], $refTruth->[$i], $index);
1107
+ $totWER += $sentWER;
1108
+ push @$sentenceWER, $sentWER;
1109
+ push @$errIndices, $indices;
1110
+ }
1111
+ return ($totWER, $sentenceWER, $errIndices);
1112
+ }
1113
+
1114
+ #arguments: system output (arrayref of arrayrefs of factor strings), truth (same), factor index to use
1115
+ #return: wer score, arrayref of arrayrefs of indices of errorful words
1116
+ sub sentencePWER
1117
+ {
1118
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1119
+ my ($totWER, $indices) = (0, []);
1120
+ my ($sLength, $eLength) = (scalar(@$refSysOutput), scalar(@$refTruth));
1121
+ my @truthWordUsed = (0) x $eLength; #array of 0/1; can only match a given truth word once
1122
+ for(my $j = 0; $j < $sLength; $j++)
1123
+ {
1124
+ my $found = 0;
1125
+ for(my $k = 0; $k < $eLength; $k++) #check output word against entire truth sentence
1126
+ {
1127
+ if(lc $refSysOutput->[$j]->[$index] eq lc $refTruth->[$k]->[$index] && $truthWordUsed[$k] == 0)
1128
+ {
1129
+ $truthWordUsed[$k] = 1;
1130
+ $found = 1;
1131
+ last;
1132
+ }
1133
+ }
1134
+ if($found == 0)
1135
+ {
1136
+ $totWER++;
1137
+ push @$indices, $j;
1138
+ }
1139
+ }
1140
+ return ($totWER, $indices);
1141
+ }
1142
+
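+ #Example (added): for sysout "a b c" vs. truth "c b a" on the surface factor,
+ #sentencePWER() returns 0 errors (every output word occurs somewhere in the
+ #truth), while sentenceWER() above penalizes the reordering; a large WER-PWER
+ #gap therefore points to word-order rather than lexical errors.
+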
1143
+ #BLEU calculation for a single sentence
1144
+ #arguments: truth sentence (arrayref of arrayrefs of factor strings), sysout sentence (same), factor index to use
1145
+ #return: 1- through 4-gram matching and total counts (1-g match, 1-g tot, 2-g match...), candidate length, reference length
1146
+ sub sentenceBLEU
1147
+ {
1148
+ my ($self, $refTruth, $refSysOutput, $factorIndex, $debug) = @_;
1149
+ my ($length_reference, $length_translation) = (scalar(@$refTruth), scalar(@$refSysOutput));
1150
+ my ($correct1, $correct2, $correct3, $correct4, $total1, $total2, $total3, $total4) = (0, 0, 0, 0, 0, 0, 0, 0);
1151
+ my %REF_GRAM = ();
1152
+ my ($i, $gram);
1153
+ for($i = 0; $i < $length_reference; $i++)
1154
+ {
1155
+ $gram = $refTruth->[$i]->[$factorIndex];
1156
+ $REF_GRAM{$gram}++;
1157
+ next if $i<1;
1158
+ $gram = $refTruth->[$i - 1]->[$factorIndex] ." ".$gram;
1159
+ $REF_GRAM{$gram}++;
1160
+ next if $i<2;
1161
+ $gram = $refTruth->[$i - 2]->[$factorIndex] ." ".$gram;
1162
+ $REF_GRAM{$gram}++;
1163
+ next if $i<3;
1164
+ $gram = $refTruth->[$i - 3]->[$factorIndex] ." ".$gram;
1165
+ $REF_GRAM{$gram}++;
1166
+ }
1167
+ for($i = 0; $i < $length_translation; $i++)
1168
+ {
1169
+ $gram = $refSysOutput->[$i]->[$factorIndex];
1170
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1171
+ $REF_GRAM{$gram}--;
1172
+ $correct1++;
1173
+ }
1174
+ next if $i<1;
1175
+ $gram = $refSysOutput->[$i - 1]->[$factorIndex] ." ".$gram;
1176
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1177
+ $REF_GRAM{$gram}--;
1178
+ $correct2++;
1179
+ }
1180
+ next if $i<2;
1181
+ $gram = $refSysOutput->[$i - 2]->[$factorIndex] ." ".$gram;
1182
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1183
+ $REF_GRAM{$gram}--;
1184
+ $correct3++;
1185
+ }
1186
+ next if $i<3;
1187
+ $gram = $refSysOutput->[$i - 3]->[$factorIndex] ." ".$gram;
1188
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1189
+ $REF_GRAM{$gram}--;
1190
+ $correct4++;
1191
+ }
1192
+ }
1193
+ my $total = $length_translation;
1194
+ $total1 = max(1, $total);
1195
+ $total2 = max(1, $total - 1);
1196
+ $total3 = max(1, $total - 2);
1197
+ $total4 = max(1, $total - 3);
1198
+
1199
+ return ($correct1, $total1, $correct2, $total2, $correct3, $total3, $correct4, $total4, $length_translation, $length_reference);
1200
+ }
1201
+
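+ #Worked example (added): sysout "the the cat" vs. reference "the cat sat":
+ #"the" and "cat" each match once, but the second "the" finds no remaining
+ #reference count (clipping), so correct1 = 2, total1 = 3; the bigram "the cat"
+ #gives correct2 = 1, total2 = 2.
+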
1202
+ ##### filesystem #####
1203
+
1204
+ #open as many given files as possible; only warn about the rest
1205
+ #arguments: list of filename extensions to open (assume corpus name is file title)
1206
+ #return: hash from type string to filehandleref, giving all files that were successfully opened
1207
+ sub openFiles
1208
+ {
1209
+ my ($self, @extensions) = @_;
1210
+ my %openedFiles = ();
1211
+ foreach my $ext (@extensions)
1212
+ {
1213
+ if(!open(FILE, "<" . $self->{'corpusName'} . $ext))
1214
+ {
1215
+ warn "Corpus::openFiles(): couldn't open '" . $self->{'corpusName'} . $ext . "' for read\n";
1216
+ }
1217
+ else #success
1218
+ {
1219
+ $openedFiles{$ext} = $fh;
1220
+ }
1221
+ }
1222
+ return %openedFiles;
1223
+ }
1224
+
1225
+ #read one line from each given file
1226
+ #arguments: hash from type string to filehandleref
1227
+ #return: hash from type string to sentence (stored as arrayref of arrayrefs of factors) read from corresponding file
1228
+ sub readLineFromFiles
1229
+ {
1230
+ my ($self, %openedFiles) = @_;
1231
+ my %lines;
1232
+ foreach my $type (keys %openedFiles)
1233
+ {
1234
+ $lines{$type} = [];
1235
+ my $sentence = readline($openedFiles{$type}); #<$openedFiles{$type}> would not be parsed as a readline
1236
+ my @words = split(/\s+/, $sentence);
1237
+ foreach my $word (@words)
1238
+ {
1239
+ my @factors = split(/\|/, $word);
1240
+ push @{$lines{$type}}, \@factors;
1241
+ }
1242
+ }
1243
+ return %lines;
1244
+ }
1245
+
1246
+ #close all given files
1247
+ #arguments: hash from type string to filehandleref
1248
+ #return: none
1249
+ sub closeFiles
1250
+ {
1251
+ my ($self, %openedFiles) = @_;
1252
+ foreach my $type (keys %openedFiles)
1253
+ {
1254
+ close($openedFiles{$type});
1255
+ }
1256
+ }
1257
+
1258
+ ##### write HTML #####
1259
+
1260
+ #print HTML for comparing various versions of a sentence, with special processing for each version as appropriate
1261
+ #arguments: filehandleref to which to write, sentence ID string, hashref of version string to sentence (stored as arrayref of arrayref of factor strings)
1262
+ #return: none
1263
+ sub printSingleSentenceComparison
1264
+ {
1265
+ my ($self, $fh, $sentID, $sentences) = @_;
1266
+ my $curFH = select;
1267
+ select $fh;
1268
+ #javascript to reorder rows to look nice afterward
1269
+ print "<script type=\"text/javascript\">
1270
+ function reorder_$sentID()
1271
+ {/*
1272
+ var table = document.getElementById('div_$sentID').firstChild;
1273
+ var refTransRow = table.getElementById('row_e');
1274
+ var inputRow = table.getElementById('row_f');
1275
+ table.removeRow(refTransRow);
1276
+ table.removeRow(inputRow);
1277
+ var newRow1 = table.insertRow(0);
1278
+ var newRow2 = table.insertRow(1);
1279
+ newRow1.childNodes = inputRow.childNodes;
1280
+ newRow2.childNodes = refTransRow.childNodes;*/
1281
+ }
1282
+ </script>";
1283
+ #html for sentences
1284
+ print "<div id=\"div_$sentID\" style=\"padding: 3px; margin: 5px\">";
1285
+ print "<table border=\"1\">";
1286
+ # my $rowCount = 0;
1287
+ # my @bgColors = ("#ffefbf", "#ffdf7f");
1288
+ #process all rows in order
1289
+ foreach my $sentType (keys %$sentences)
1290
+ {
1291
+ # my $bgcolor = $bgColors[$rowCount % 2]; #disabled along with the row-striping code above
1292
+ print "<tr id=\"row_$sentType\"><td align=right>";
1293
+ #description of sentence
1294
+ if(defined($self->{'fileDescriptions'}->{$self->{'corpusName'} . $sentType}))
1295
+ {
1296
+ print "(" . $self->{'fileDescriptions'}->{$self->{'corpusName'} . $sentType} . ")";
1297
+ }
1298
+ else
1299
+ {
1300
+ print "($sentType)";
1301
+ }
1302
+ print "</td><td align=left>";
1303
+ #sentence with markup
1304
+ if($sentType eq 'f') #input
1305
+ {
1306
+ # $self->writeHTMLSentenceWithFactors($fh, $sentences->{$sentType}, $inputColor);
1307
+ }
1308
+ elsif($sentType eq 'e') #reference translation
1309
+ {
1310
+ # $self->writeHTMLSentenceWithFactors($fh, $sentences->{$sentType}, $reftransColor);
1311
+ }
1312
+ else #system output
1313
+ {
1314
+ # $self->writeHTMLTranslationHighlightedWithFactors($fh, $sentences->{$sentType}, $sentences->{'e'}, $highlightColors);
1315
+ }
1316
+ print "</td></tr>";
1317
+ # $rowCount++;
1318
+ }
1319
+ print "</table>";
1320
+ print "</div>\n";
1321
+ select $curFH;
1322
+ }
1323
+
1324
+ #print contents of all fields of this object, with useful formatting for arrayrefs and hashrefs
1325
+ #arguments: none
1326
+ #return: none
1327
+ sub printDetails
1328
+ {
1329
+ my $self = shift;
1330
+ foreach my $key (keys %$self)
1331
+ {
1332
+ if(ref($self->{$key}) eq 'HASH')
1333
+ {
1334
+ print STDERR "obj: $key => {" . join(', ', map {"$_ => " . $self->{$key}->{$_}} (keys %{$self->{$key}})) . "}\n";
1335
+ }
1336
+ elsif(ref($self->{$key}) eq 'ARRAY')
1337
+ {
1338
+ print STDERR "obj: $key => (" . join(', ', @{$self->{$key}}) . ")\n";
1339
+ }
1340
+ elsif(ref($self->{$key}) eq '') #not a reference
1341
+ {
1342
+ print STDERR "obj: $key => " . $self->{$key} . "\n";
1343
+ }
1344
+ }
1345
+ }
mosesdecoder/scripts/analysis/smtgui/README ADDED
@@ -0,0 +1,42 @@
1
+ Readme for SMTGUI
2
+ Philipp Koehn, Evan Herbst
3
+ 7 / 31 / 06
4
+ -----------------------------------
5
+
6
+ SMTGUI is Philipp's and my code to analyze a decoder's output (the decoder doesn't have to be moses, but most of SMTGUI's features relate to factors, so it probably will be). You can view a list of available corpora by running <newsmtgui.cgi?ACTION=> on any web server. When you're viewing a corpus, click the checkboxes and Compare to see sentences from various sources on one screen. Currently they're in an annoying format; feel free to make the display nicer and more useful. There are per-sentence stats stored in a Corpus object; they just aren't used yet. See compare2() in newsmtgui and Corpus::printSingleSentenceComparison() as a starting point for better display code. For now it's mostly the view-corpus screen that's useful.
7
+
8
+ newsmtgui.cgi is the main program. Corpus.pm is my module; Error.pm is a standard part of Perl but appears to not always be distributed. The accompanying version is Error.pm v1.15.
9
+
10
+ The program requires the file 'file-factors', which gives the list of factors included in each corpus (see the example file for details). Only corpora included in 'file-factors' are displayed. The file 'file-descriptions' is optional and associates a descriptive string with each included filename. These are used only for display. Again an example is provided.
11
+
12
+ For the corpus with name CORPUS, the following files should be present:
13
+ - CORPUS.f, the foreign input
14
+ - CORPUS.e, the truth (aka reference translation)
15
+ - CORPUS.SYSTEM_TRANSLATION for each system to be analyzed
16
+ - CORPUS.pt_FACTORNAME for each factor that requires a phrase table (these are currently used only to count unknown source words)
17
+
18
+ The .f, .e and system-output files should have the usual pipe-delimited format, one sentence per line. Phrase tables should also have standard three-pipe format.
19
+
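+ For illustration (hypothetical data, not from a real corpus): a corpus named
+ devtest2006.de-en with one system 'moses' would consist of devtest2006.de-en.f,
+ devtest2006.de-en.e, devtest2006.de-en.moses and devtest2006.de-en.pt_surf. With
+ factors surf|pos|lemma, one line of a factored file might look like
+
+ the|DT|the houses|NNS|house stand|VBP|stand
+
+ and one line of a phrase table (three-pipe format) might look like
+
+ das haus ||| the house ||| 0.8 0.5 0.7 0.4
+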
20
+ A list of standard factor names is available in @Corpus::FACTORNAMES. Feel free to add, but woe betide you if you muck with 'surf', 'pos' and 'lemma'; those are hardcoded all over the place.
21
+
22
+ Currently the program assumes you've included factors 'surf', 'pos' and 'lemma', in whatever order; if not, you'll want to edit view_corpus() in newsmtgui.cgi so it doesn't automatically display all info. To get English POS tags and lemmas from a words-only corpus and combine the factors into one file:
23
+
24
+ $ $BIN/tag-english < CORPUS.lc > CORPUS.pos-tmp (call Brill)
25
+ $ $BIN/morph < CORPUS.pos-tmp > CORPUS.morph
26
+ $ $DATA/test/factor-stem.en.perl < CORPUS.morph > CORPUS.lemma
27
+ $ cat CORPUS.pos-tmp | perl -n -e 's/_/\|/g; print;' > CORPUS.lc+pos (replace _ with |)
28
+ $ $DATA/test/combine-features.perl CORPUS lc+pos lemma > CORPUS.lc+pos+lemma
29
+ $ rm CORPUS.pos-tmp (cleanup)
30
+
31
+ where $BIN=/export/ws06osmt/bin, $DATA=/export/ws06osmt/data.
32
+
33
+ To get German POS tags and lemmas from a words-only corpus (the first step must be run on linux):
34
+
35
+ $ $BIN/recase.perl --in CORPUS.lc --model $MODELS/en-de/recaser/pharaoh.ini > CORPUS.recased (call pharaoh with a lowercase->uppercase model)
36
+ $ $BIN/run-lopar-tagger-lowercase.perl CORPUS.recased CORPUS.recased.lopar (call LOPAR)
37
+ $ $DATA/test/factor-stem.de.perl < CORPUS.recased.lopar > CORPUS.stem
38
+ $ $BIN/lowercase.latin1.perl < CORPUS.stem > CORPUS.lcstem (as you might guess, assumes latin-1 encoding)
39
+ $ $DATA/test/factor-pos.de.perl < CORPUS.recased.lopar > CORPUS.pos
40
+ $ $DATA/test/combine-features.perl CORPUS lc pos lcstem > CORPUS.lc+pos+lcstem
41
+
42
+ where $MODELS=/export/ws06osmt/models.
mosesdecoder/scripts/analysis/smtgui/file-descriptions ADDED
@@ -0,0 +1,4 @@
1
+ devtest2006.de-en.matrix05-baseline.pharaoh Pharaoh JHUWS baseline run
2
+ devtest2006.de-en.matrix05-baseline.moses-2006-07-20 Moses baseline run
3
+ devtest2006.en-de.matrix05-baseline.pharaoh Pharaoh JHUWS baseline run
4
+ devtest2006.en-de.matrix05-moses.2006-08-02 Moses baseline run
mosesdecoder/scripts/analysis/smtgui/file-factors ADDED
@@ -0,0 +1,9 @@
1
+ #corpus name : list of factors in corpus : [input] factor LMfilename, factor LMfilename, ... : [output] factor LMfilename, factor LMfilename, ...
2
+ #(the given factors should be present in all files for the given corpus)
3
+ devtest2006.de-en : surf pos lemma : surf europarl.de.srilm.gz : surf europarl.en.srilm.gz
4
+ devtest2006.en-de : surf pos lemma : surf europarl.en.srilm.gz : surf europarl.de.srilm.gz
5
+ test2006.en-de : surf : surf europarl.en.srilm.gz : surf europarl.de.srilm.gz
6
+ #pstem: lemmas come from the Porter stemmer (and so are really a mix of stems and lemmas)
7
+ pstem_devtest2006.de-en : surf pos lemma : : surf europarl.en.srilm.gz
8
+ #replace esset with ss in German text
9
+ ss_devtest2006.en-de : surf pos lemma : surf europarl.en.srilm.gz : surf ss_europarl.de.srilm.gz
mosesdecoder/scripts/analysis/smtgui/newsmtgui.cgi ADDED
@@ -0,0 +1,1006 @@
1
+ #!/usr/bin/perl -w
2
+ #
3
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
4
+ # Public License version 2.1 or, at your option, any later version.
5
+
6
+ # $Id$
7
+ use strict;
8
+
9
+ use CGI;
10
+ use Corpus; #Evan's code
11
+ use Error qw(:try);
12
+
13
+ #files with extensions other than these are interpreted as system translations; see the file 'file-descriptions', if it exists, for the comments that go with them
14
+ my %FILETYPE = ('e' => 'Reference Translation',
15
+ 'f' => 'Foreign Original',
16
+ 'ref.sgm' => 'Reference Translations',
17
+ 'e.sgm' => 'Reference Translations',
18
+ 'src.sgm' => 'Foreign Originals',
19
+ 'f.sgm' => 'Foreign Originals');
20
+ my %DONTSCORE = ('f' => 1, 'f.sgm' => 1, 'src.sgm' => 1,
21
+ 'e' => 1, 'e.sgm' => 1, 'ref.sgm' => 1);
22
+ my @SHOW = ('f', 'e', 'comm');
23
+ my %SHOW_COLOR = ('f' => "BLUE",
24
+ 'e' => "GREEN");
25
+ my $FOREIGN = 'f';
26
+
27
+ #FILEDESC: textual descriptions associated with specific filenames; to be displayed on the single-corpus view
28
+ my %FILEDESC = (); &load_descriptions();
29
+ my %factorData = loadFactorData('file-factors');
30
+ my %MEMORY; &load_memory();
31
+ my (@mBLEU,@NIST);
32
+ @mBLEU=`cat mbleu-memory.dat` if -e "mbleu-memory.dat"; chop(@mBLEU);
33
+ @NIST = `cat nist-memory.dat` if -e "nist-memory.dat"; chop(@NIST);
34
+ my %in; &ReadParse(); #parse arguments
35
+
36
+ if (scalar(@ARGV) > 0 && $ARGV[0] eq 'bleu') {
37
+ $in{CORPUS} = $ARGV[1];
38
+ $in{ACTION} = "VIEW_CORPUS";
39
+ }
40
+
41
+ my %MULTI_REF;
42
+ if ($in{CORPUS} && -e "$in{CORPUS}.ref.sgm") {
43
+ my $sysid;
44
+ open(REF,"$in{CORPUS}.ref.sgm");
45
+ while(<REF>) {
46
+ $sysid = $1 if /<DOC.+sysid=\"([^\"]+)\"/;
47
+ if (/<seg[^>]*> *(\S.+\S) *<\/seg>/) {
48
+ push @{$MULTI_REF{$sysid}}, $1;
49
+ }
50
+ }
51
+ close(REF);
52
+ }
53
+
54
+ if ($in{ACTION} eq '') { &show_corpora(); }
55
+ elsif ($in{ACTION} eq 'VIEW_CORPUS') { &view_corpus(); }
56
+ elsif ($in{ACTION} eq 'SCORE_FILE') { &score_file(); }
57
+ elsif ($in{ACTION} eq 'RESCORE_FILE') { &score_file(); }
58
+ elsif ($in{ACTION} eq 'COMPARE') { &compare(); }
59
+ else { &htmlhead("Unknown Action $in{ACTION}"); }
60
+ print "</BODY></HTML>\n";
61
+
62
+ ###### SHOW CORPORA IN EVALUATION DIRECTORY
63
+
64
+ sub show_corpora {
65
+ my %CORPUS = ();
66
+
67
+ # find corpora in evaluation directory: see the factor-index file, which was already read in
68
+ foreach my $corpusName (keys %factorData)
69
+ {
70
+ $CORPUS{$corpusName} = 1;
71
+ }
72
+
73
+ # list corpora
74
+ &htmlhead("All Corpora");
75
+ print "<UL>\n";
76
+ foreach (sort (keys %CORPUS)) {
77
+ print "<LI><A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($_)."\">Corpus $_</A>\n";
78
+ }
79
+ print "</UL>\n";
80
+ }
81
+
82
+ ###### SHOW INFORMATION FOR ONE CORPUS
83
+
84
+ sub view_corpus {
85
+ my @TABLE;
86
+ &htmlhead("View Corpus $in{CORPUS}");
87
+
88
+ # find corpora in evaluation directory
89
+ my $corpus = new Corpus('-name' => "$in{CORPUS}", '-descriptions' => \%FILEDESC, '-info_line' => $factorData{$in{CORPUS}});
90
+ # $corpus->printDetails(); #debugging info
91
+
92
+ my ($sentence_count, $lineInfo);
93
+ if(-e "$in{CORPUS}.f")
94
+ {
95
+ $lineInfo = `wc -l $in{CORPUS}.f`;
96
+ $lineInfo =~ /^\s*(\d+)\s+/;
97
+ $sentence_count = 0 + $1;
98
+ }
99
+ else
100
+ {
101
+ $lineInfo = `wc -l $in{CORPUS}.e`;
102
+ $lineInfo =~ /^\s*(\d+)\s+/;
103
+ $sentence_count = 0 + $1;
104
+ }
105
+
106
+ print "Corpus '$in{CORPUS}' consists of $sentence_count sentences\n";
107
+ print "(<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&mBLEU=1>with mBLEU</A>)" if ((!defined($in{mBLEU})) && (scalar keys %MEMORY) && -e "$in{CORPUS}.e" && -e "$in{CORPUS}.f");
108
+ print "<P>\n";
109
+ print "<FORM ACTION=''>\n";
110
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=COMPARE>\n";
111
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
112
+ print "<TABLE BORDER=1 CELLSPACING=0><TR>
113
+ <TD>File (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS}).">sort</A>)</TD>
114
+ <TD>Date (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=TIME>sort</A>)</TD>";
115
+ if (-e "$in{CORPUS}.e") {
116
+ print "<TD>IBM BLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=IBM>sort</A>)</TD>";
117
+ }
118
+ if (-e "$in{CORPUS}.ref.sgm" && -e "$in{CORPUS}.src.sgm") {
119
+ print "<TD>NIST (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=NIST>sort</A>)</TD>";
120
+ if (! -e "$in{CORPUS}.e") {
121
+ print "<TD>BLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=BLEU>sort</A>)</TD>";
122
+ }
123
+ }
124
+ if ($in{mBLEU} && (scalar keys %MEMORY) && -e "$in{CORPUS}.e" && -e "$in{CORPUS}.f") {
125
+ print "<TD>mBLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=mBLEU>sort</A>)</TD>";
126
+ }
127
+ print "<TD>Unknown Words</TD>"; #can't sort on; only applies to the input
128
+ print "<TD>Perplexity</TD>"; #applies to truth and system outputs
129
+ print "<TD>WER (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=WER>sort</A>)</TD>";
130
+ print "<TD>Noun & adj WER-PWER</TD>"; #can't sort on; only applies to sysoutputs
131
+ print "<TD>Surface vs. lemma PWER</TD>"; #can't sort on; only applies to sysoutputs
132
+ print "<TD>Statistical Measures</TD>";
133
+
134
+ opendir(DIR, ".") or die "couldn't open '.' for read";
135
+ my @filenames = readdir(DIR); #includes . and ..
136
+ closedir(DIR);
137
+ foreach $_ (@filenames)
138
+ {
139
+ next if -d $_; #if is a directory
140
+ my $sgm = 0;
141
+ if (/\.sgm$/)
142
+ {
143
+ `grep '<seg' $_ | wc -l` =~ /^\s*(\d+)\s+/;
144
+ next unless $1 == $sentence_count;
145
+ $sgm = 1;
146
+ }
147
+ else
148
+ {
149
+ `wc -l $_` =~ /^\s*(\d+)\s+/;
150
+ next unless $1 == $sentence_count;
151
+ }
152
+ next unless /^$in{CORPUS}\.([^\/]+)$/;
153
+ my $file = $1;
154
+ my $sort = "";
155
+ # checkbox for compare
156
+ my $row = "<TR><TD style=\"font-size: small\"><INPUT TYPE=CHECKBOX NAME=FILE_$file VALUE=1>";
157
+ # README
158
+ if (-e "$in{CORPUS}.$file.README") {
159
+ my $readme = `cat $in{CORPUS}.$file.README`;
160
+ $readme =~ s/([\"\'])/\\\"/g;
161
+ $readme =~ s/[\n\r]/\\n/g;
162
+ $readme =~ s/\t/\\t/g;
163
+ $row .= "<A HREF='javascript:FieldInfo(\"$in{CORPUS}.$file\",\"$readme\")'>";
164
+ }
165
+ # filename
166
+ $row .= "$file</A>";
167
+ # description (hard-coded)
168
+ my @TRANSLATION_SENTENCE = `cat $in{CORPUS}.$file`;
169
+ chop(@TRANSLATION_SENTENCE);
170
+
171
+ #count sentences that contain null words
172
+ my $null_count = 0;
173
+ foreach (@TRANSLATION_SENTENCE)
174
+ {
175
+ $null_count++ if /^NULL$/ || /^NONE$/;
176
+ }
177
+ if ($null_count > 0) {
178
+ $row .= "$null_count NULL ";
179
+ }
180
+
181
+ $row .= " (".$FILETYPE{$file}.")" if defined($FILETYPE{$file});
182
+ $row .= " (".$FILEDESC{$in{CORPUS}.".".$file}.")" if defined($FILEDESC{$in{CORPUS}.".".$file});
183
+ $row .= " (".$FILEDESC{$file}.")" if defined($FILEDESC{$file});
184
+ # filedate
185
+ my @STAT = stat("$in{CORPUS}.$file");
186
+ my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($STAT[9]); #STAT[9] is last-modified time
187
+ my $time = sprintf("%04d-%02d-%02d %02d:%02d:%02d",$year+1900,$mon+1,$mday,$hour,$min,$sec);
188
+ $row .= "</TD>\n<TD>".$time."</TD>\n";
189
+ if (defined($in{SORT}) && $in{SORT} eq 'TIME') { $sort = $time; }
190
+ # IBM BLEU score
191
+ my $no_bleu =0;
192
+ if (!$sgm && -e "$in{CORPUS}.e") {
193
+ $row .= "<TD>";
194
+ if (!defined($DONTSCORE{$file}) && $file !~ /^f$/ && $file ne "e" && $file !~ /^pt/) {
195
+ my ($score,$p1,$p2,$p3,$p4,$bp) = $corpus->calcBLEU($file, 'surf');
197
+ $row .= sprintf("<B>%.04f</B> %.01f/%.01f/%.01f/%.01f *%.03f", $score, $p1, $p2, $p3, $p4, $bp);
198
+ if (defined($in{SORT}) && $in{SORT} eq 'IBM') { $sort = $score; }
199
+ }
200
+ $row .= "</TD>\n";
201
+ }
202
+ else {
203
+ $no_bleu=1;
204
+ }
205
+ # NIST score
206
+ if (-e "$in{CORPUS}.ref.sgm" && -e "$in{CORPUS}.src.sgm"
207
+ && !$DONTSCORE{$file}) {
208
+ $row .= "<TD>";
210
+ my ($nist,$nist_bleu);
211
+ if ($file =~ /sgm$/) {
212
+ ($nist,$nist_bleu) = get_nist_score("$in{CORPUS}.ref.sgm","$in{CORPUS}.src.sgm","$in{CORPUS}.$file");
213
+ $row .= sprintf("<B>%.04f</B>",$nist);
214
+ if (defined($in{SORT}) && $in{SORT} eq 'NIST') { $sort = $nist; }
215
+ }
216
+ $row .= "</TD>\n";
217
+ if ($no_bleu) {
218
+ $row .= "<TD>";
219
+ if ($file =~ /sgm$/) {
220
+ $row .= sprintf("<B>%.04f</B>",$nist_bleu);
221
+ if (defined($in{SORT}) && $in{SORT} eq 'BLEU') { $sort = $nist_bleu; }
222
+ }
223
+ $row .= "</TD>\n";
224
+ }
225
+ }
226
+ # multi-bleu
227
+ if ($in{mBLEU} && (scalar keys %MEMORY) && -e "$in{CORPUS}.e") {
228
+ $row .= "<TD>";
229
+ if (!defined($DONTSCORE{$file}) && $file !~ /^f$/ && $file ne "e") {
230
+ my ($score,$p1,$p2,$p3,$p4,$bp) = get_multi_bleu_score("$in{CORPUS}.f","$in{CORPUS}.e","$in{CORPUS}.$file");
231
+ $row .= sprintf("<B>%.04f</B> %.01f/%.01f/%.01f/%.01f *%.03f",$score,$p1,$p2,$p3,$p4,$bp);
232
+ if (defined($in{SORT}) && $in{SORT} eq 'mBLEU') { $sort = $score; }
233
+ }
234
+ $row .= "</TD>\n";
235
+ }
236
+
237
+ my $isSystemOutput = ($file ne 'e' && $file ne 'f' && $file !~ /^pt/);
238
+ # misc stats (note the unknown words should come first so the total word count is available for WER)
239
+ $row .= "<TD align=\"center\">";
240
+ if($file eq 'f') #input
241
+ {
242
+ try
243
+ {
244
+ my ($unknownCount, $totalCount) = calc_unknown_words($corpus, 'surf');
245
+ $row .= sprintf("%.4lf (%d / %d)", $unknownCount / $totalCount, $unknownCount, $totalCount);
246
+ }
247
+ catch Error::Simple with {$row .= "[system error]";};
248
+ }
249
+ $row .= "</TD>\n<TD align=\"center\">";
250
+ if($file eq 'e' || $file eq 'f' || $isSystemOutput)
251
+ {
252
+ try
253
+ {
254
+ my $perplexity = $corpus->calcPerplexity(($file eq 'e') ? 'truth' : (($file eq 'f') ? 'input' : $file), 'surf');
255
+ $row .= sprintf("%.2lf", $perplexity);
256
+ }
257
+ catch Error::Simple with {$row .= "[system error]";}
258
+ }
259
+ $row .= "</TD>\n<TD align=\"center\">";
260
+ if($isSystemOutput)
261
+ {
262
+ try
263
+ {
264
+ my $surfaceWER = $corpus->calcOverallWER($file);
265
+ $row .= sprintf("%.4lf", $surfaceWER);
266
+ }
267
+ catch Error::Simple with {$row .= "[system error]";};
268
+ }
269
+ $row .= "</TD>\n<TD align=\"center\">";
270
+ my ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER);
271
+ if($isSystemOutput)
272
+ {
273
+ try
274
+ {
275
+ ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER) = calc_misc_stats($corpus, $file);
276
+ $row .= sprintf("WER = %.4lg<br>PWER = %.4lg<br><b>ratio = %.3lf</b>", $nnAdjWER, $nnAdjPWER, $nnAdjPWER / $nnAdjWER);
277
+ }
278
+ catch Error::Simple with {$row .= "[system error]";};
279
+ }
280
+ $row .= "</TD>\n<TD align=\"center\">";
281
+ if($isSystemOutput)
282
+ {
283
+ if($surfPWER == -1)
284
+ {
285
+ $row .= "[system error]";
286
+ }
287
+ else
288
+ {
289
+ my ($lemmaBLEU, $p1, $p2, $p3, $p4, $brevity) = $corpus->calcBLEU($file, 'lemma');
290
+ $row .= sprintf("surface = %.3lf<br>lemma = %.3lf<br><b>lemma BLEU = %.04f</b> %.01f/%.01f/%.01f/%.01f *%.03f",
291
+ $surfPWER, $lemmaPWER, $lemmaBLEU, $p1, $p2, $p3, $p4, $brevity);
292
+ }
293
+ }
294
+ $row .= "</TD>\n<TD align=\"center\">";
295
+ if($isSystemOutput)
296
+ {
297
+ try
298
+ {
299
+ my $testInfo = $corpus->statisticallyTestBLEUResults($file, 'surf');
300
+ my @tTestPValues = @{$testInfo->[0]};
301
+ my @confidenceIntervals = @{$testInfo->[1]};
302
+ $row .= "n-gram precision p-values (high p <=> consistent score):<br>t test " . join("/", map {sprintf("%.4lf", $_)} @tTestPValues);
303
+ $row .= "<p>n-gram precision 95% intervals:<br>" . join(",<br>", map {sprintf("[%.4lf - %.4lf]", $_->[0], $_->[1])} @confidenceIntervals);
304
+ my @bleuInterval = (approxBLEUFromNgramScores(map {$_->[0]} @confidenceIntervals), approxBLEUFromNgramScores(map {$_->[1]} @confidenceIntervals));
305
+ $row .= sprintf("<br><b>(BLEU: ~[%.4lf - %.4lf])</b>", $bleuInterval[0], $bleuInterval[1]);
306
+ }
307
+ catch Error::Simple with {$row .= "[system error]";}
308
+ }
309
+ $row .= "</TD>\n";
310
+
311
+ # correct sentence score
313
+ $row .= "<TD>";
314
+ if (!defined($DONTSCORE{$file}) && (scalar keys %MEMORY)) {
315
+ my ($correct,$just_syn,$just_sem,$wrong,$unknown) = get_score_from_memory("$in{CORPUS}.$FOREIGN",
316
+ "$in{CORPUS}.$file");
317
+ $row .= "<B><FONT COLOR=GREEN>$correct</FONT></B>";
318
+ $row .= "/<FONT COLOR=ORANGE>$just_syn</FONT>";
319
+ $row .= "/<FONT COLOR=ORANGE>$just_sem</FONT>";
320
+ $row .= "/<FONT COLOR=RED>$wrong</FONT> ($unknown)</TD>\n";
321
+ if (defined($in{SORT}) && $in{SORT} eq 'SCORE') {
322
+ $sort = sprintf("%03d %04d",$correct,$just_syn+$just_sem);
323
+ }
324
+ }
325
+ else
326
+ {
327
+ $row .= "</TD>\n";
328
+ }
329
+
330
+ $row .= "</TR>\n";
331
+ push @TABLE, "<!-- $sort -->\n$row";
332
+ }
334
+ foreach (reverse sort @TABLE) { print $_; }
335
+ print "</TABLE>\n";
336
+ print "<INPUT TYPE=SUBMIT VALUE=\"Compare\">\n";
337
+ print "<INPUT TYPE=CHECKBOX NAME=SURFACE VALUE=1 CHECKED> Compare all different sentences (instead of just differently <I>evaluated</I> sentences) <INPUT TYPE=CHECKBOX NAME=WITH_EVAL VALUE=1 CHECKED> with evaluation</FORM><P>\n";
338
+ print "<P>The score is to be read as: <FONT COLOR=GREEN>correct</FONT>/<FONT COLOR=ORANGE>just-syn-correct</FONT>/<FONT COLOR=ORANGE>just-sem-correct</FONT>/<FONT COLOR=RED>wrong</FONT> (unscored)\n";
339
+ print "<BR>IBM BLEU is to be read as: <B>metric</B> unigram/bigram/trigram/quadgram *brevity-penalty<P>";
340
+ print "<DIV STYLE=\"border: 1px solid #006600\">";
341
+ print "<H2>Comparison of System Translations (p-values)</H2>";
342
+ my @sysnames = $corpus->getSystemNames();
343
+ for(my $i = 0; $i < scalar(@sysnames); $i++)
344
+ {
345
+ for(my $j = $i + 1; $j < scalar(@sysnames); $j++)
346
+ {
347
+ my $comparison = $corpus->statisticallyCompareSystemResults($sysnames[$i], $sysnames[$j], 'surf');
348
+ print "<P><FONT COLOR=#00aa22>" . $sysnames[$i] . " vs. " . $sysnames[$j] . "</FONT>: [<I>t</I> test] ";
349
+ for(my $k = 0; $k < scalar(@{$comparison->[0]}); $k++)
350
+ {
351
+ print sprintf(($k == 0) ? "%.4lg" : "; %.4lg ", $comparison->[0]->[$k]);
352
+ if($comparison->[1]->[$k] == 0) {print "(&larr;)";} else {print "(&rarr;)";}
353
+ }
354
+ print "&nbsp;&nbsp;---&nbsp;&nbsp;[sign test] ";
355
+ for(my $k = 0; $k < scalar(@{$comparison->[2]}); $k++)
356
+ {
357
+ print sprintf(($k == 0) ? "%.4lg " : "; %.4lg ", $comparison->[2]->[$k]);
358
+ if($comparison->[3]->[$k] == 0) {print "(&larr;)";} else {print "(&rarr;)";}
359
+ }
360
+ print "\n";
361
+ }
362
+ }
363
+ print "</DIV\n";
364
+ print "<P><A HREF=\"newsmtgui.cgi?action=\">All corpora</A>\n";
365
+ }
366
+
367
+ ###### SCORE TRANSLATIONS
368
+
369
+ sub score_file {
370
+ if ($in{VIEW}) {
371
+ &htmlhead("View Translations");
372
+ }
373
+ else {
374
+ &htmlhead("Score Translations");
375
+ }
376
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
377
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
378
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
379
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
380
+ print "<INPUT TYPE=HIDDEN NAME=FILE VALUE=\"$in{FILE}\">\n";
381
+
382
+ # get sentences
383
+ my @SENTENCES;
384
+ if ($in{FILE} =~ /\.sgm$/) {
385
+ @SENTENCES = `grep '<seg' $in{CORPUS}.$in{FILE}`;
386
+ for(my $i=0;$i<=$#SENTENCES;$i++) { #<= so the last segment is normalized too
387
+ $SENTENCES[$i] =~ s/^<seg[^>]+> *(\S.+\S) *<\/seg> *$/$1/;
388
+ }
389
+ }
390
+ else {
391
+ @SENTENCES = `cat $in{CORPUS}.$in{FILE}`; chop(@SENTENCES);
392
+ }
393
+
394
+ my %REFERENCE;
395
+ foreach (@SHOW) {
396
+ if (-e "$in{CORPUS}.$_") {
397
+ @{$REFERENCE{$_}} = `cat $in{CORPUS}.$_`; chop(@{$REFERENCE{$_}});
398
+ }
399
+ }
400
+
401
+ # update memory
402
+ foreach (keys %in) {
403
+ next unless /^SYN_SCORE_(\d+)$/;
404
+ next unless $in{"SEM_SCORE_$1"};
405
+ &store_in_memory($REFERENCE{$FOREIGN}[$1],
406
+ $SENTENCES[$1],
407
+ "syn_".$in{"SYN_SCORE_$1"}." sem_".$in{"SEM_SCORE_$1"});
408
+ }
409
+
410
+ # display sentences
411
+ for(my $i=0;$i<=$#SENTENCES;$i++) {
412
+ my $evaluation = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES[$i]);
413
+ next if ($in{ACTION} eq 'SCORE_FILE' &&
414
+ ! $in{VIEW} &&
415
+ $evaluation ne '' && $evaluation ne 'wrong');
416
+ print "<P>Sentence ".($i+1).":<BR>\n";
417
+ # color coding
418
+ &color_highlight_ngrams($i,&nist_normalize_text($SENTENCES[$i]),$REFERENCE{"e"}[$i]);
419
+ if (%MULTI_REF) {
420
+ foreach my $sysid (keys %MULTI_REF) {
421
+ print "<FONT COLOR=GREEN>".$MULTI_REF{$sysid}[$i]."</FONT> (Reference $sysid)<BR>\n";
422
+ }
423
+ }
424
+
425
+ # all sentences
426
+ print "$SENTENCES[$i] (System output)<BR>\n";
427
+ foreach my $ref (@SHOW) {
428
+ if (-e "$in{CORPUS}.$ref") {
429
+ print "<FONT COLOR=$SHOW_COLOR{$ref}>".$REFERENCE{$ref}[$i]."</FONT> (".$FILETYPE{$ref}.")<BR>\n" if $REFERENCE{$ref}[$i];
430
+ }
431
+ }
432
+ if (! $in{VIEW}) {
433
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$i VALUE=correct";
434
+ print " CHECKED" if ($evaluation =~ /syn_correct/);
435
+ print "> perfect English\n";
436
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$i VALUE=wrong";
437
+ print " CHECKED" if ($evaluation =~ /syn_wrong/);
438
+ print "> imperfect English<BR>\n";
439
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$i VALUE=correct";
440
+ print " CHECKED" if ($evaluation =~ /sem_correct/);
441
+ print "> correct meaning\n";
442
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$i VALUE=wrong";
443
+ print " CHECKED" if ($evaluation =~ /sem_wrong/);
444
+ print "> incorrect meaning\n";
445
+ }
446
+ }
447
+ if (! $in{VIEW}) {
448
+ print "<P><INPUT TYPE=SUBMIT VALUE=\"Add evaluation\">\n";
449
+ print "</FORM>\n";
450
+ }
451
+ }
452
+
453
+ sub color_highlight_ngrams {
454
+ my($i,$sentence,$single_reference) = @_;
455
+ my @REF = ();
456
+ my %NGRAM = ();
457
+ if (%MULTI_REF) {
458
+ foreach my $sysid (keys %MULTI_REF) {
459
+ push @REF,&nist_normalize_text($MULTI_REF{$sysid}[$i]);
460
+ }
461
+ }
462
+ elsif ($single_reference) {
463
+ @REF = ($single_reference);
464
+ }
465
+ if (@REF) {
466
+ foreach my $ref (@REF) {
467
+ my @WORD = split(/\s+/,$ref);
468
+ for(my $n=1;$n<=4;$n++) {
469
+ for(my $w=0;$w<=$#WORD-($n-1);$w++) {
470
+ my $ngram = "$n: ";
471
+ for(my $j=0;$j<$n;$j++) {
472
+ $ngram .= $WORD[$w+$j]." ";
473
+ }
474
+ $NGRAM{$ngram}++;
475
+ }
476
+ }
477
+ }
478
+ $sentence =~ s/^\s+//;
479
+ $sentence =~ s/\s+/ /g;
480
+ $sentence =~ s/\s+$//;
481
+ my @WORD = split(/\s+/,$sentence);
482
+ my @CORRECT;
483
+ for(my $w=0;$w<=$#WORD;$w++) {
484
+ $CORRECT[$w] = 0;
485
+ }
486
+ for(my $n=1;$n<=4;$n++) {
487
+ for(my $w=0;$w<=$#WORD-($n-1);$w++) {
488
+ my $ngram = "$n: ";
489
+ for(my $j=0;$j<$n;$j++) {
490
+ $ngram .= $WORD[$w+$j]." ";
491
+ }
492
+ next unless defined($NGRAM{$ngram}) && $NGRAM{$ngram}>0;
493
+ $NGRAM{$ngram}--;
494
+ for(my $j=0;$j<$n;$j++) {
495
+ $CORRECT[$w+$j] = $n;
496
+ }
497
+ }
498
+ }
499
+ my @COLOR;
500
+ $COLOR[0] = "#FF0000";
501
+ $COLOR[1] = "#C000C0";
502
+ $COLOR[2] = "#0000FF";
503
+ $COLOR[3] = "#00C0C0";
504
+ $COLOR[4] = "#00C000";
505
+ for(my $w=0;$w<=$#WORD;$w++) {
506
+ print "<B><FONT COLOR=".$COLOR[$CORRECT[$w]].">$WORD[$w]<SUB>".$CORRECT[$w]."</SUB></FONT></B> ";
507
+ }
508
+ print "\n<BR>";
509
+ }
510
+ }
511
+
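+ #Note (added): each system-output word is printed with a subscript giving the
+ #size of the reference n-gram (0-4) it was counted in, colored from red (0, no
+ #match) through purple (1), blue (2) and teal (3) to green (4) -- a quick
+ #visual BLEU-style diagnostic.
+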
512
+ ###### OTHER STATS
513
+
514
+ #print (in some unspecified way) the offending exception of type Error::Simple
515
+ #arguments: the error object, a context string
516
+ #return: none
517
+ sub printError
518
+ {
519
+ my ($err, $context) = @_;
520
+ warn "$context: " . $err->{'-text'} . " @ " . $err->{'-file'} . " (" .$err->{'-line'} . ")\n";
521
+ }
522
+
523
+ #compute number and percentage of unknown tokens for a given factor in foreign corpus
524
+ #arguments: corpus object ref, factor name
525
+ #return (unkwordCount, totalWordCount), or (-1, -1) if an error occurs
526
+ sub calc_unknown_words
527
+ {
528
+ my ($corpus, $factorName) = @_;
529
+ try
530
+ {
531
+ my ($unknownCount, $totalCount) = $corpus->calcUnknownTokens($factorName);
532
+ return ($unknownCount, $totalCount);
533
+ }
534
+ catch Error::Simple with
535
+ {
536
+ my $err = shift;
537
+ printError($err, 'calc_unknown_words()');
538
+ return (-1, -1);
539
+ };
540
+ }
541
+
542
+ #compute (if we have the necessary factors) info for:
543
+ #- diff btwn wer and pwer for NNs & ADJs -- if large, many reordering errors
544
+ #- diff btwn pwer for surface forms and pwer for lemmas -- if large, morphology errors
545
+ #arguments: corpus object, system name
546
+ #return (NN/ADJ (wer, pwer), surf pwer, lemma pwer), or (-1, -1, -1, -1) if an error occurs
547
+ sub calc_misc_stats
548
+ {
549
+ my ($corpus, $sysname) = @_;
550
+ try
551
+ {
552
+ my ($nnAdjWER, $nnAdjPWER) = $corpus->calcNounAdjWER_PWERDiff($sysname);
553
+ my ($surfPWER, $lemmaPWER) = ($corpus->calcOverallPWER($sysname, 'surf'), $corpus->calcOverallPWER($sysname, 'lemma'));
554
+ return ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER);
555
+ }
556
+ catch Error::Simple with
557
+ {
558
+ my $err = shift;
559
+ printError($err, 'calc_misc_stats()');
560
+ return (-1, -1, -1, -1);
561
+ };
562
+ }
563
+
564
+ #approximate BLEU score from n-gram precisions (currently assumes no length penalty)
565
+ #arguments: n-gram precisions as an array
566
+ #return: BLEU score
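+ #e.g. precisions (0.60, 0.35, 0.20, 0.12) give exp((ln 0.60 + ln 0.35 + ln 0.20 + ln 0.12)/4) ≈ 0.27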
567
+ sub approxBLEUFromNgramScores
568
+ {
569
+ my $logsum = 0;
570
+ foreach my $p (@_) {$logsum += log($p);}
571
+ return exp($logsum / scalar(@_));
572
+ }
573
+
574
+ ###### NIST SCORE
575
+
576
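+ #score a translation file with the external NIST mteval script; results are cached (keyed by file and mtime) in nist-memory.dat
+ #arguments: reference file, source file, translation file
+ #return: (NIST score, BLEU score), or (0,0) if scoring fails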
+ sub get_nist_score {
577
+ my($reference_file,$source_file,$translation_file) = @_;
578
+ my @STAT = stat($translation_file);
579
+ my $current_timestamp = $STAT[9];
580
+ foreach (@NIST) {
581
+ my ($file,$time,$nist,$bleu) = split;
582
+ return ($nist,$bleu)
583
+ if ($file eq $translation_file && $current_timestamp == $time);
584
+ }
585
+
586
+ my $nist_eval = `/home/pkoehn/statmt/bin/mteval-v10.pl -c -r $reference_file -s $source_file -t $translation_file`;
587
+ return (0,0) unless ($nist_eval =~ /NIST score = (\d+\.\d+) BLEU score = (\d+\.\d+)/i);
588
+
589
+ open(NIST,">>nist-memory.dat");
590
+ printf NIST "$translation_file $current_timestamp %f %f\n",$1,$2;
591
+ close(NIST);
592
+ return ($1,$2);
593
+ }
594
+
595
+ sub nist_normalize_text {
596
+ my ($norm_text) = @_;
597
+
598
+ # language-independent part:
599
+ $norm_text =~ s/<skipped>//g; # strip "skipped" tags
600
+ $norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines
601
+ $norm_text =~ s/\n/ /g; # join lines
602
+ $norm_text =~ s/(\d)\s+(\d)/$1$2/g; #join digits
603
+ $norm_text =~ s/&quot;/"/g; # convert SGML tag for quote to "
604
+ $norm_text =~ s/&amp;/&/g; # convert SGML tag for ampersand to &
605
+ $norm_text =~ s/&lt;/</g; # convert SGML tag for less-than to <
606
+ $norm_text =~ s/&gt;/>/g; # convert SGML tag for greater-than to >
607
+
608
+ # language-dependent part (assuming Western languages):
609
+ $norm_text = " $norm_text ";
610
+ # $norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case;
611
+ $norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g; # tokenize punctuation
612
+ $norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
613
+ $norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
614
+ $norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
615
+ $norm_text =~ s/\s+/ /g; # one space only between words
616
+ $norm_text =~ s/^\s+//; # no leading space
617
+ $norm_text =~ s/\s+$//; # no trailing space
618
+
619
+ return $norm_text;
620
+ }
621
+
622
+ ###### BLEU SCORE
623
+
624
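+ #compute multi-reference BLEU: references come from the reference file plus any fully correct
+ #translations stored in the evaluation memory; results are cached by file mtime in mbleu-memory.dat
+ #arguments: foreign file, reference file, translation file
+ #return: (BLEU, 1- to 4-gram precisions in percent, brevity penalty)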
+ sub get_multi_bleu_score {
625
+ my($foreign_file,$reference_file,$translation_file) = @_;
626
+ my @STAT = stat($translation_file);
627
+ my $current_timestamp = $STAT[9];
628
+ foreach (@mBLEU) {
629
+ my ($file,$time,$score,$g1,$g2,$g3,$g4,$bp) = split;
630
+ if ($file eq $translation_file && $current_timestamp == $time) {
631
+ return ($score,$g1*100,$g2*100,$g3*100,$g4*100,$bp);
632
+ }
633
+ }
634
+
635
+ # load reference translation from reference file
636
+ my @REFERENCE_SENTENCE = `cat $reference_file`; chop(@REFERENCE_SENTENCE);
637
+ my @TRANSLATION_SENTENCE = `cat $translation_file`; chop(@TRANSLATION_SENTENCE);
638
+ my %REF;
639
+ my @FOREIGN_SENTENCE = `cat $foreign_file`; chop(@FOREIGN_SENTENCE);
640
+ for(my $i=0;$i<=$#TRANSLATION_SENTENCE;$i++) {
641
+ push @{$REF{$FOREIGN_SENTENCE[$i]}},$REFERENCE_SENTENCE[$i];
642
+ }
643
+ # load reference translation from translation memory
644
+ foreach my $memory (keys %MEMORY) {
645
+ next if $MEMORY{$memory} ne 'syn_correct sem_correct';
646
+ my ($foreign,$english) = split(/ \.o0O0o\. /,$memory);
647
+ next unless defined($REF{$foreign});
648
+ push @{$REF{$foreign}},$english;
649
+ }
650
+ my(@CORRECT,@TOTAL,$length_translation,$length_reference);
651
+ # compute bleu
652
+ for(my $i=0;$i<=$#TRANSLATION_SENTENCE;$i++) {
653
+ my %REF_NGRAM = ();
654
+ my @WORD = split(/ /,$TRANSLATION_SENTENCE[$i]);
655
+ my $length_translation_this_sentence = scalar(@WORD);
656
+ my ($closest_diff,$closest_length) = (9999,9999);
657
+ foreach my $reference (@{$REF{$FOREIGN_SENTENCE[$i]}}) {
658
+ my @WORD = split(/ /,$reference);
659
+ my $length = scalar(@WORD);
660
+ if (abs($length_translation_this_sentence-$length) < $closest_diff) {
661
+ $closest_diff = abs($length_translation_this_sentence-$length);
662
+ $closest_length = $length;
663
+ }
664
+ for(my $n=1;$n<=4;$n++) {
665
+ my %REF_NGRAM_N = ();
666
+ for(my $start=0;$start<=$#WORD-($n-1);$start++) {
667
+ my $ngram = "$n";
668
+ for(my $w=0;$w<$n;$w++) {
669
+ $ngram .= " ".$WORD[$start+$w];
670
+ }
671
+ $REF_NGRAM_N{$ngram}++;
672
+ }
673
+ foreach my $ngram (keys %REF_NGRAM_N) {
674
+ if (!defined($REF_NGRAM{$ngram}) ||
675
+ $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
676
+ $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
677
+ }
678
+ }
679
+ }
680
+ }
681
+ $length_translation += $length_translation_this_sentence;
682
+ $length_reference += $closest_length;
683
+ for(my $n=1;$n<=4;$n++) {
684
+ my %T_NGRAM = ();
685
+ for(my $start=0;$start<=$#WORD-($n-1);$start++) {
686
+ my $ngram = "$n";
687
+ for(my $w=0;$w<$n;$w++) {
688
+ $ngram .= " ".$WORD[$start+$w];
689
+ }
690
+ $T_NGRAM{$ngram}++;
691
+ }
692
+ foreach my $ngram (keys %T_NGRAM) {
693
+ my $n = 0+$ngram;
694
+ # print "$i e $ngram $T_NGRAM{$ngram}<BR>\n";
695
+ $TOTAL[$n] += $T_NGRAM{$ngram};
696
+ if (defined($REF_NGRAM{$ngram})) {
697
+ if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
698
+ $CORRECT[$n] += $T_NGRAM{$ngram};
699
+ # print "$i e correct1 $T_NGRAM{$ngram}<BR>\n";
700
+ }
701
+ else {
702
+ $CORRECT[$n] += $REF_NGRAM{$ngram};
703
+ # print "$i e correct2 $REF_NGRAM{$ngram}<BR>\n";
704
+ }
705
+ }
706
+ }
707
+ }
708
+ }
709
+ my $brevity_penalty = 1;
710
+ if ($length_translation<$length_reference) {
711
+ $brevity_penalty = exp(1-$length_reference/$length_translation);
712
+ }
713
+ my $bleu = $brevity_penalty * exp((my_log( $CORRECT[1]/$TOTAL[1] ) +
714
+ my_log( $CORRECT[2]/$TOTAL[2] ) +
715
+ my_log( $CORRECT[3]/$TOTAL[3] ) +
716
+ my_log( $CORRECT[4]/$TOTAL[4] ) ) / 4);
717
+
718
+ open(BLEU,">>mbleu-memory.dat");
719
+ @STAT = stat($translation_file);
720
+ printf BLEU "$translation_file $STAT[9] %f %f %f %f %f %f\n",$bleu,$CORRECT[1]/$TOTAL[1],$CORRECT[2]/$TOTAL[2],$CORRECT[3]/$TOTAL[3],$CORRECT[4]/$TOTAL[4],$brevity_penalty;
721
+ close(BLEU);
722
+
723
+ return ($bleu,
724
+ 100*$CORRECT[1]/$TOTAL[1],
725
+ 100*$CORRECT[2]/$TOTAL[2],
726
+ 100*$CORRECT[3]/$TOTAL[3],
727
+ 100*$CORRECT[4]/$TOTAL[4],
728
+ $brevity_penalty);
729
+ }
730
+
731
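+ #log() that tolerates zero counts: returns a large negative number instead of dying on log(0)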
+ sub my_log {
732
+ return -9999999999 unless $_[0];
733
+ return log($_[0]);
734
+ }
735
+
736
+
737
+ ###### SCORE TRANSLATIONS
738
+
739
+ ################################ IN PROGRESS ###############################
740
+ sub compare2
741
+ {
742
+ &htmlhead("Compare Translations");
743
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
744
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
745
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
746
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
747
+ my $corpus = new Corpus('-name' => "$in{CORPUS}", '-descriptions' => \%FILEDESC, '-info_line' => $factorData{$in{CORPUS}});
748
+ $corpus->writeComparisonPage(\*STDOUT, qr/^.*$/);
749
+ print "</FORM>\n";
750
+ }
751
+
752
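+ #compare selected system outputs sentence by sentence, showing only sentences where outputs or stored judgments differ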
+ sub compare {
753
+ &htmlhead("Compare Translations");
754
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
755
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
756
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
757
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
758
+
759
+ # get sentences
760
+ my %SENTENCES;
761
+ my $sentence_count;
762
+ foreach (keys %in) {
763
+ if (/^FILE_(.+)$/) {
764
+ my $file = $1;
765
+ print "<INPUT TYPE=HIDDEN NAME=\"$file\" VALUE=1>\n";
766
+ my @SENTENCES;
767
+ if ($file =~ /.sgm$/) {
768
+ @{$SENTENCES{$file}} = `grep '<seg' $in{CORPUS}.$file`;
769
+ for(my $i=0;$i<=$#{$SENTENCES{$file}};$i++) {
770
+ $SENTENCES{$file}[$i] =~ s/^<seg[^>]+> *(\S.+\S) *<\/seg> *$/$1/;
771
+ }
772
+ }
773
+ else {
774
+ @{$SENTENCES{$file}} = `cat $in{CORPUS}.$1`;
775
+ chop(@{$SENTENCES{$file}});
776
+ }
777
+
778
+ $sentence_count = scalar @{$SENTENCES{$file}};
779
+ }
780
+ }
781
+ my %REFERENCE;
782
+ foreach (@SHOW) {
783
+ if (-e "$in{CORPUS}.$_") {
784
+ @{$REFERENCE{$_}} = `cat $in{CORPUS}.$_`; chop(@{$REFERENCE{$_}});
785
+ }
786
+ }
787
+
788
+ # update memory
789
+ foreach (keys %in) {
790
+ next unless /^SYN_SCORE_(.+)_(\d+)$/;
791
+ next unless $in{"SEM_SCORE_$1_$2"};
792
+ &store_in_memory($REFERENCE{$FOREIGN}[$2],
793
+ $SENTENCES{$1}[$2],
794
+ "syn_".$in{"SYN_SCORE_$1_$2"}." sem_".$in{"SEM_SCORE_$1_$2"});
795
+ }
796
+
797
+ # display sentences
798
+ for(my $i=0;$i<$sentence_count;$i++)
799
+ {
800
+ my $evaluation = "";
801
+ my $show = 0;
802
+ my $surface = "";
803
+ foreach my $file (keys %SENTENCES)
804
+ {
805
+ if ($in{SURFACE}) {
806
+ $SENTENCES{$file}[$i] =~ s/ *$//;
807
+ $surface = $SENTENCES{$file}[$i] if ($surface eq '');
808
+ $show = 1 if ($SENTENCES{$file}[$i] ne $surface);
809
+ }
810
+ else {
811
+ my $this_ev = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES{$file}[$i]);
812
+ $this_ev = "syn_wrong sem_wrong" unless $this_ev;
813
+ $evaluation = $this_ev if ($evaluation eq '');
814
+ $show = 1 if ($evaluation ne $this_ev);
815
+ }
816
+ }
817
+ next unless $show;
818
+ print "<HR>Sentence ".($i+1).":<BR>\n";
819
+ foreach my $ref (@SHOW) {
820
+ if (-e "$in{CORPUS}.$ref") {
821
+ print "<FONT COLOR=$SHOW_COLOR{$ref}>".$REFERENCE{$ref}[$i]."</FONT> (".$FILETYPE{$ref}.")<BR>\n";
822
+ }
823
+ }
824
+ foreach my $file (keys %SENTENCES) {
825
+ print "<B>$SENTENCES{$file}[$i]</B> ($file)<BR>\n";
826
+ &color_highlight_ngrams($i,&nist_normalize_text($SENTENCES{$file}[$i]),$REFERENCE{"e"}[$i]);
827
+ if (0 && $in{WITH_EVAL}) {
828
+ $evaluation = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES{$file}[$i]);
829
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$file"."_$i VALUE=correct";
830
+ print " CHECKED" if ($evaluation =~ /syn_correct/);
831
+ print "> perfect English\n";
832
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$file"."_$i VALUE=wrong";
833
+ print " CHECKED" if ($evaluation =~ /syn_wrong/);
834
+ print "> imperfect English<BR>\n";
835
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$file"."_$i VALUE=correct";
836
+ print " CHECKED" if ($evaluation =~ /sem_correct/);
837
+ print "> correct meaning\n";
838
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$file"."_$i VALUE=wrong";
839
+ print " CHECKED" if ($evaluation =~ /sem_wrong/);
840
+ print "> incorrect meaning<BR>\n";
841
+ }
842
+ }
843
+ }
844
+ print "<P><INPUT TYPE=SUBMIT VALUE=\"Add evaluation\">\n";
845
+ print "</FORM>\n";
846
+ }
847
+
848
+ ###### MEMORY SUBS
849
+
850
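+ #load past human judgments from evaluation-memory.dat into %MEMORY, keyed by "foreign .o0O0o. translation"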
+ sub load_memory {
851
+ open(MEMORY,"evaluation-memory.dat") or return;
852
+ while(<MEMORY>) {
853
+ chop;
854
+ my($foreign,$translation,$evaluation) = split(/ \.o0O0o\. /);
855
+ $evaluation = 'syn_correct sem_correct' if ($evaluation eq 'correct');
856
+ $MEMORY{"$foreign .o0O0o. $translation"} = $evaluation;
857
+ }
858
+ close(MEMORY);
859
+ }
860
+
861
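+ #tally remembered judgments for a foreign/translation file pair
+ #return: (fully correct, syntax-only correct, semantics-only correct, wrong, unevaluated) sentence counts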
+ sub get_score_from_memory {
862
+ my($foreign_file,$translation_file) = @_;
863
+ my $unknown=0;
864
+ my $correct=0;
865
+ my $just_syn=0;
866
+ my $just_sem=0;
867
+ my $wrong=0;
868
+ my @FOREIGN = `cat $foreign_file`; chop(@FOREIGN);
869
+ my @TRANSLATION = `cat $translation_file`; chop(@TRANSLATION);
870
+ for(my $i=0;$i<=$#FOREIGN;$i++) {
871
+ if (my $evaluation = &get_from_memory($FOREIGN[$i],$TRANSLATION[$i])) {
872
+ if ($evaluation eq 'syn_correct sem_correct') { $correct++ }
873
+ elsif ($evaluation eq 'syn_correct sem_wrong') { $just_syn++ }
874
+ elsif ($evaluation eq 'syn_wrong sem_correct') { $just_sem++ }
875
+ elsif ($evaluation eq 'syn_wrong sem_wrong') { $wrong++ }
876
+ else { $unknown++; }
877
+ }
878
+ else { $unknown++; }
879
+ }
880
+ return($correct,$just_syn,$just_sem,$wrong,$unknown);
881
+ }
882
+
883
+ sub store_in_memory {
884
+ my($foreign,$translation,$evaluation) = @_;
885
+ &trim(\$translation);
886
+ return if $MEMORY{"$foreign .o0O0o. $translation"} eq $evaluation;
887
+ $MEMORY{"$foreign .o0O0o. $translation"} = $evaluation;
888
+ open(MEMORY,">>evaluation-memory.dat") or die "store_in_memory(): couldn't open 'evaluation-memory.dat' for append\n";
889
+ print MEMORY "$foreign .o0O0o. $translation .o0O0o. $evaluation\n";
890
+ close(MEMORY);
891
+ }
892
+
893
+ sub get_from_memory {
894
+ my($foreign,$translation) = @_;
895
+ &trim(\$translation);
896
+ return $MEMORY{"$foreign .o0O0o. $translation"};
897
+ }
898
+
899
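+ #collapse runs of spaces and strip leading/trailing spaces, in place (takes a scalar reference)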
+ sub trim {
900
+ my($translation) = @_;
901
+ $$translation =~ s/ +/ /g;
902
+ $$translation =~ s/^ +//;
903
+ $$translation =~ s/ +$//;
904
+ }
905
+
906
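+ #read the file-descriptions table into %FILEDESC (file name -> human-readable description)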
+ sub load_descriptions {
907
+ open(FD,"file-descriptions") or die "load_descriptions(): couldn't open 'file-descriptions' for read\n";
908
+ while(<FD>) {
909
+ chomp;
910
+ my($file,$description) = split(/\s+/,$_,2);
911
+ $FILEDESC{$file} = $description;
912
+ }
913
+ close(FD);
914
+ }
915
+
916
+ #read config file giving various corpus config info
917
+ #arguments: filename to read
918
+ #return: hash of corpus names to strings containing formatted info
919
+ sub loadFactorData
920
+ {
921
+ my $filename = shift;
922
+ my %data = ();
923
+ open(INFILE, "<$filename") or die "loadFactorData(): couldn't open '$filename' for read\n";
924
+ while(my $line = <INFILE>)
925
+ {
926
+ if($line =~ /^\#/) {next;} #skip comment lines
927
+ $line =~ /^\s*(\S+)\s*:\s*(\S.*\S)\s*$/;
928
+ my $corpusName = $1;
929
+ $data{$corpusName} = $2;
930
+ }
931
+ close(INFILE);
932
+ return %data;
933
+ }
934
+
935
+ ###### SUBS
936
+
937
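+ #print the CGI content-type header and the shared HTML page header; argument: page title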
+ sub htmlhead {
938
+ print <<"___ENDHTML";
939
+ Content-type: text/html
940
+
941
+ <HTML><HEAD>
942
+ <TITLE>MTEval: $_[0]</TITLE>
943
+ <SCRIPT LANGUAGE="JavaScript">
944
+
945
+ <!-- hide from old browsers
946
+
947
+ function FieldInfo(field,description) {
948
+ popup = window.open("","popDialog","height=500,width=600,scrollbars=yes,resizable=yes");
949
+ popup.document.write("<HTML><HEAD><TITLE>"+field+"</TITLE></HEAD><BODY BGCOLOR=#FFFFCC><CENTER><B>"+field+"</B><HR SIZE=2 NOSHADE></CENTER><PRE>"+description+"</PRE><CENTER><FORM><INPUT TYPE='BUTTON' VALUE='Okay' onClick='self.close()'></FORM><CENTER></BODY></HTML>");
950
+ popup.focus();
951
+ popup.document.close();
952
+ }
953
+
954
+ <!-- done hiding -->
955
+
956
+ </SCRIPT>
957
+ </HEAD>
958
+ <BODY BGCOLOR=white>
959
+ <H2>Evaluation Tool for Machine Translation<BR>$_[0]</H2>
960
+ ___ENDHTML
961
+ }
962
+
963
+
964
+ ############################# parts of cgi-lib.pl
965
+
966
+
967
+ sub ReadParse {
968
+ my ($i, $key, $val);
969
+
970
+ # Read in text
971
+ my $in;
972
+ if (&MethGet) {
973
+ $in = $ENV{'QUERY_STRING'};
974
+ } elsif (&MethPost) {
975
+ read(STDIN,$in,$ENV{'CONTENT_LENGTH'});
976
+ }
977
+
978
+ my @in = split(/[&;]/,$in);
979
+
980
+ foreach $i (0 .. $#in) {
981
+ # Convert plus's to spaces
982
+ $in[$i] =~ s/\+/ /g;
983
+
984
+ # Split into key and value.
985
+ ($key, $val) = split(/=/,$in[$i],2); # splits on the first =.
986
+
987
+ # Convert %XX from hex numbers to alphanumeric
988
+ $key =~ s/%(..)/pack("c",hex($1))/ge;
989
+ $val =~ s/%(..)/pack("c",hex($1))/ge;
990
+
991
+ # Associate key and value
992
+ $in{$key} .= "\0" if (defined($in{$key})); # \0 is the multiple separator
993
+ $in{$key} .= $val;
994
+
995
+ }
996
+
997
+ return scalar(@in);
998
+ }
999
+
1000
+ sub MethGet {
1001
+ return ($ENV{'REQUEST_METHOD'} eq "GET");
1002
+ }
1003
+
1004
+ sub MethPost {
1005
+ return ($ENV{'REQUEST_METHOD'} eq "POST");
1006
+ }
mosesdecoder/scripts/analysis/weight-scan-summarize.sh ADDED
@@ -0,0 +1,79 @@
1
+ #!/bin/bash
2
+ #
3
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
4
+ # Public License version 2.1 or, at your option, any later version.
5
+
6
+ # Hackish summarization of weight-scan.pl results, heavily relies on tools by
7
+ # Ondrej Bojar ([email protected]), some of which need Mercury; beware.
8
+
9
+ function die() { echo "$@" >&2; exit 1; }
10
+ set -o pipefail # safer pipes
11
+
12
+ refs="$1"
13
+ dir="$2"
14
+
15
+ [ -d "$dir" ] && [ -e "$refs" ] \
16
+ || die "usage: $0 ref-file weight-scan-working-dir"
17
+
18
+ testbleu=$HOME/tools/src/obotools/testbleu
19
+ projectbleu=$HOME/tools/src/obotools/projectbleu
20
+
21
+ [ -x "$testbleu" ] || die "Can't run $testbleu"
22
+ [ -x "$projectbleu" ] || die "Can't run $projectbleu"
23
+
24
+ # create exact bleus and put them to bleu.*
25
+ for f in $dir/out.*; do
26
+ bleuf=${f//out./bleu.}
27
+ [ -e "$bleuf" ] \
28
+ || $testbleu $refs < $f | pickre --re='BLEU...([0-9.]*)' > $bleuf \
29
+ || die "Failed to construct $bleuf"
30
+ done
31
+
32
+ # create bleu projections from each best* and put them to corresponding pbleu*
33
+ # first collect all weights
34
+ lcat $dir/weights.* \
35
+ | tr ' ' , \
36
+ | pickre --re='weights.([-0-9.]*)' \
37
+ | cut -f 1,3 \
38
+ | numsort 1 \
39
+ > $dir/allweights
40
+ allwparam=$(cut -f2 $dir/allweights | prefix -- '-w ' | tr '\n' ' ')
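+ # allwparam: one "-w <comma-separated weight vector>" option per scanned setting, passed to projectbleu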
41
+ for f in $dir/best*.*; do
42
+ pbleuf=$(echo $f | sed 's/best[0-9]*/pbleu/')
43
+ if [ ! -e "$pbleuf" ] || [ `wc -l < $pbleuf` -ne `wc -l < $dir/allweights` ]; then
44
+ # need to regenerate the projection
45
+ $projectbleu $refs $allwparam < $f \
46
+ | paste $dir/allweights - \
47
+ | cut -f1,3 \
48
+ > $pbleuf \
49
+ || die "Failed to construct $pbleuf"
50
+ fi
51
+ done
52
+
53
+ # summarize bleu projections
54
+ echo "goal proj/real from was" > $dir/graph.data
55
+ for f in $dir/bleu.*; do
56
+ obs=$(echo $f | sed 's/^.*bleu\.//')
57
+ cat $dir/pbleu.$obs \
58
+ | pickre --re='F: ([0-9.]*)' \
59
+ | recut 2,1 \
60
+ | prefix --tab -- "$obs\tproj" \
61
+ >> $dir/graph.data
62
+ lcat $dir/bleu.$obs \
63
+ | pickre --re='bleu\.([-0-9.]*)' \
64
+ | prefix --tab -- "$obs\treal" \
65
+ | recut 1,2,3,5 \
66
+ >> $dir/graph.data
67
+ done
68
+
69
+
70
+ exit 0
71
+
72
+ ## COMMANDS TO PLOT IT:
73
+ # plot 'walkable' graph of projections at various points
74
+ g=weight-scan-tm_2/graph.data; cat $g | skip 1 | grep real | cut -f2- | numsort 2 | sed 's/real/all/' > cliprealall; skip 1 < $g | numsort 1,3 | split_at_colchange 1 | blockwise "(prefix --tab x cliprealall; cat -) | labelledxychart --data=3,4,0,'',linespoints --blockpivot=2" > clip
75
+
76
+ # plot a combination of projections along with the individual projections and
77
+ # the real scores
78
+ cat best100.-0.100000 best100.-0.500000 best100.-0.300000 best100.-0.200000 | /home/obo/tools/src/obotools/projectbleu ../tune.ref $allwparam | paste allweights - > comb.-0.5_-0.3_-0.2_-0.1
79
+ (lcat pbleu.-0.100000 pbleu.-0.500000 pbleu.-0.300000 pbleu.-0.200000 comb.-0.5_-0.3_-0.2_-0.1 | pickre --re='F: ([0-9.]*)' | recut 2,3,1 ; cat graph.data | skip 1 | grep real | cut -f2- | numsort 2 ) | tee delme | labelledxychart --blockpivot=1 --data=2,3,0,'',linespoints | gpsandbox
mosesdecoder/scripts/ems/web/javascripts/builder.js ADDED
@@ -0,0 +1,136 @@
1
+ // script.aculo.us builder.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
2
+
3
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
4
+ //
5
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
6
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
7
+
8
+ var Builder = {
9
+ NODEMAP: {
10
+ AREA: 'map',
11
+ CAPTION: 'table',
12
+ COL: 'table',
13
+ COLGROUP: 'table',
14
+ LEGEND: 'fieldset',
15
+ OPTGROUP: 'select',
16
+ OPTION: 'select',
17
+ PARAM: 'object',
18
+ TBODY: 'table',
19
+ TD: 'table',
20
+ TFOOT: 'table',
21
+ TH: 'table',
22
+ THEAD: 'table',
23
+ TR: 'table'
24
+ },
25
+ // note: For Firefox < 1.5, OPTION and OPTGROUP tags are currently broken,
26
+ // due to a Firefox bug
27
+ node: function(elementName) {
28
+ elementName = elementName.toUpperCase();
29
+
30
+ // try innerHTML approach
31
+ var parentTag = this.NODEMAP[elementName] || 'div';
32
+ var parentElement = document.createElement(parentTag);
33
+ try { // prevent IE "feature": http://dev.rubyonrails.org/ticket/2707
34
+ parentElement.innerHTML = "<" + elementName + "></" + elementName + ">";
35
+ } catch(e) {}
36
+ var element = parentElement.firstChild || null;
37
+
38
+ // see if browser added wrapping tags
39
+ if(element && (element.tagName.toUpperCase() != elementName))
40
+ element = element.getElementsByTagName(elementName)[0];
41
+
42
+ // fallback to createElement approach
43
+ if(!element) element = document.createElement(elementName);
44
+
45
+ // abort if nothing could be created
46
+ if(!element) return;
47
+
48
+ // attributes (or text)
49
+ if(arguments[1])
50
+ if(this._isStringOrNumber(arguments[1]) ||
51
+ (arguments[1] instanceof Array) ||
52
+ arguments[1].tagName) {
53
+ this._children(element, arguments[1]);
54
+ } else {
55
+ var attrs = this._attributes(arguments[1]);
56
+ if(attrs.length) {
57
+ try { // prevent IE "feature": http://dev.rubyonrails.org/ticket/2707
58
+ parentElement.innerHTML = "<" +elementName + " " +
59
+ attrs + "></" + elementName + ">";
60
+ } catch(e) {}
61
+ element = parentElement.firstChild || null;
62
+ // workaround firefox 1.0.X bug
63
+ if(!element) {
64
+ element = document.createElement(elementName);
65
+ for(attr in arguments[1])
66
+ element[attr == 'class' ? 'className' : attr] = arguments[1][attr];
67
+ }
68
+ if(element.tagName.toUpperCase() != elementName)
69
+ element = parentElement.getElementsByTagName(elementName)[0];
70
+ }
71
+ }
72
+
73
+ // text, or array of children
74
+ if(arguments[2])
75
+ this._children(element, arguments[2]);
76
+
77
+ return $(element);
78
+ },
79
+ _text: function(text) {
80
+ return document.createTextNode(text);
81
+ },
82
+
83
+ ATTR_MAP: {
84
+ 'className': 'class',
85
+ 'htmlFor': 'for'
86
+ },
87
+
88
+ _attributes: function(attributes) {
89
+ var attrs = [];
90
+ for(attribute in attributes)
91
+ attrs.push((attribute in this.ATTR_MAP ? this.ATTR_MAP[attribute] : attribute) +
92
+ '="' + attributes[attribute].toString().escapeHTML().gsub(/"/,'&quot;') + '"');
93
+ return attrs.join(" ");
94
+ },
95
+ _children: function(element, children) {
96
+ if(children.tagName) {
97
+ element.appendChild(children);
98
+ return;
99
+ }
100
+ if(typeof children=='object') { // array can hold nodes and text
101
+ children.flatten().each( function(e) {
102
+ if(typeof e=='object')
103
+ element.appendChild(e);
104
+ else
105
+ if(Builder._isStringOrNumber(e))
106
+ element.appendChild(Builder._text(e));
107
+ });
108
+ } else
109
+ if(Builder._isStringOrNumber(children))
110
+ element.appendChild(Builder._text(children));
111
+ },
112
+ _isStringOrNumber: function(param) {
113
+ return(typeof param=='string' || typeof param=='number');
114
+ },
115
+ build: function(html) {
116
+ var element = this.node('div');
117
+ $(element).update(html.strip());
118
+ return element.down();
119
+ },
120
+ dump: function(scope) {
121
+ if(typeof scope != 'object' && typeof scope != 'function') scope = window; //global scope
122
+
123
+ var tags = ("A ABBR ACRONYM ADDRESS APPLET AREA B BASE BASEFONT BDO BIG BLOCKQUOTE BODY " +
124
+ "BR BUTTON CAPTION CENTER CITE CODE COL COLGROUP DD DEL DFN DIR DIV DL DT EM FIELDSET " +
125
+ "FONT FORM FRAME FRAMESET H1 H2 H3 H4 H5 H6 HEAD HR HTML I IFRAME IMG INPUT INS ISINDEX "+
126
+ "KBD LABEL LEGEND LI LINK MAP MENU META NOFRAMES NOSCRIPT OBJECT OL OPTGROUP OPTION P "+
127
+ "PARAM PRE Q S SAMP SCRIPT SELECT SMALL SPAN STRIKE STRONG STYLE SUB SUP TABLE TBODY TD "+
128
+ "TEXTAREA TFOOT TH THEAD TITLE TR TT U UL VAR").split(/\s+/);
129
+
130
+ tags.each( function(tag){
131
+ scope[tag] = function() {
132
+ return Builder.node.apply(Builder, [tag].concat($A(arguments)));
133
+ };
134
+ });
135
+ }
136
+ };
mosesdecoder/scripts/ems/web/javascripts/dragdrop.js ADDED
@@ -0,0 +1,974 @@
1
+ // script.aculo.us dragdrop.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
2
+
3
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
4
+ //
5
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
6
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
7
+
8
+ if(Object.isUndefined(Effect))
9
+ throw("dragdrop.js requires including script.aculo.us' effects.js library");
10
+
11
+ var Droppables = {
12
+ drops: [],
13
+
14
+ remove: function(element) {
15
+ this.drops = this.drops.reject(function(d) { return d.element==$(element) });
16
+ },
17
+
18
+ add: function(element) {
19
+ element = $(element);
20
+ var options = Object.extend({
21
+ greedy: true,
22
+ hoverclass: null,
23
+ tree: false
24
+ }, arguments[1] || { });
25
+
26
+ // cache containers
27
+ if(options.containment) {
28
+ options._containers = [];
29
+ var containment = options.containment;
30
+ if(Object.isArray(containment)) {
31
+ containment.each( function(c) { options._containers.push($(c)) });
32
+ } else {
33
+ options._containers.push($(containment));
34
+ }
35
+ }
36
+
37
+ if(options.accept) options.accept = [options.accept].flatten();
38
+
39
+ Element.makePositioned(element); // fix IE
40
+ options.element = element;
41
+
42
+ this.drops.push(options);
43
+ },
44
+
45
+ findDeepestChild: function(drops) {
46
+ deepest = drops[0];
47
+
48
+ for (i = 1; i < drops.length; ++i)
49
+ if (Element.isParent(drops[i].element, deepest.element))
50
+ deepest = drops[i];
51
+
52
+ return deepest;
53
+ },
54
+
55
+ isContained: function(element, drop) {
56
+ var containmentNode;
57
+ if(drop.tree) {
58
+ containmentNode = element.treeNode;
59
+ } else {
60
+ containmentNode = element.parentNode;
61
+ }
62
+ return drop._containers.detect(function(c) { return containmentNode == c });
63
+ },
64
+
65
+ isAffected: function(point, element, drop) {
66
+ return (
67
+ (drop.element!=element) &&
68
+ ((!drop._containers) ||
69
+ this.isContained(element, drop)) &&
70
+ ((!drop.accept) ||
71
+ (Element.classNames(element).detect(
72
+ function(v) { return drop.accept.include(v) } ) )) &&
73
+ Position.within(drop.element, point[0], point[1]) );
74
+ },
75
+
76
+ deactivate: function(drop) {
77
+ if(drop.hoverclass)
78
+ Element.removeClassName(drop.element, drop.hoverclass);
79
+ this.last_active = null;
80
+ },
81
+
82
+ activate: function(drop) {
83
+ if(drop.hoverclass)
84
+ Element.addClassName(drop.element, drop.hoverclass);
85
+ this.last_active = drop;
86
+ },
87
+
88
+ show: function(point, element) {
89
+ if(!this.drops.length) return;
90
+ var drop, affected = [];
91
+
92
+ this.drops.each( function(drop) {
93
+ if(Droppables.isAffected(point, element, drop))
94
+ affected.push(drop);
95
+ });
96
+
97
+ if(affected.length>0)
98
+ drop = Droppables.findDeepestChild(affected);
99
+
100
+ if(this.last_active && this.last_active != drop) this.deactivate(this.last_active);
101
+ if (drop) {
102
+ Position.within(drop.element, point[0], point[1]);
103
+ if(drop.onHover)
104
+ drop.onHover(element, drop.element, Position.overlap(drop.overlap, drop.element));
105
+
106
+ if (drop != this.last_active) Droppables.activate(drop);
107
+ }
108
+ },
109
+
110
+ fire: function(event, element) {
111
+ if(!this.last_active) return;
112
+ Position.prepare();
113
+
114
+ if (this.isAffected([Event.pointerX(event), Event.pointerY(event)], element, this.last_active))
115
+ if (this.last_active.onDrop) {
116
+ this.last_active.onDrop(element, this.last_active.element, event);
117
+ return true;
118
+ }
119
+ },
120
+
121
+ reset: function() {
122
+ if(this.last_active)
123
+ this.deactivate(this.last_active);
124
+ }
125
+ };
126
+
127
+ var Draggables = {
128
+ drags: [],
129
+ observers: [],
130
+
131
+ register: function(draggable) {
132
+ if(this.drags.length == 0) {
133
+ this.eventMouseUp = this.endDrag.bindAsEventListener(this);
134
+ this.eventMouseMove = this.updateDrag.bindAsEventListener(this);
135
+ this.eventKeypress = this.keyPress.bindAsEventListener(this);
136
+
137
+ Event.observe(document, "mouseup", this.eventMouseUp);
138
+ Event.observe(document, "mousemove", this.eventMouseMove);
139
+ Event.observe(document, "keypress", this.eventKeypress);
140
+ }
141
+ this.drags.push(draggable);
142
+ },
143
+
144
+ unregister: function(draggable) {
145
+ this.drags = this.drags.reject(function(d) { return d==draggable });
146
+ if(this.drags.length == 0) {
147
+ Event.stopObserving(document, "mouseup", this.eventMouseUp);
148
+ Event.stopObserving(document, "mousemove", this.eventMouseMove);
149
+ Event.stopObserving(document, "keypress", this.eventKeypress);
150
+ }
151
+ },
152
+
153
+ activate: function(draggable) {
154
+ if(draggable.options.delay) {
155
+ this._timeout = setTimeout(function() {
156
+ Draggables._timeout = null;
157
+ window.focus();
158
+ Draggables.activeDraggable = draggable;
159
+ }.bind(this), draggable.options.delay);
160
+ } else {
161
+ window.focus(); // allows keypress events if window isn't currently focused, fails for Safari
162
+ this.activeDraggable = draggable;
163
+ }
164
+ },
165
+
166
+ deactivate: function() {
167
+ this.activeDraggable = null;
168
+ },
169
+
170
+ updateDrag: function(event) {
171
+ if(!this.activeDraggable) return;
172
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
173
+ // Mozilla-based browsers fire successive mousemove events with
174
+ // the same coordinates, prevent needless redrawing (moz bug?)
175
+ if(this._lastPointer && (this._lastPointer.inspect() == pointer.inspect())) return;
176
+ this._lastPointer = pointer;
177
+
178
+ this.activeDraggable.updateDrag(event, pointer);
179
+ },
180
+
181
+ endDrag: function(event) {
182
+ if(this._timeout) {
183
+ clearTimeout(this._timeout);
184
+ this._timeout = null;
185
+ }
186
+ if(!this.activeDraggable) return;
187
+ this._lastPointer = null;
188
+ this.activeDraggable.endDrag(event);
189
+ this.activeDraggable = null;
190
+ },
191
+
192
+ keyPress: function(event) {
193
+ if(this.activeDraggable)
194
+ this.activeDraggable.keyPress(event);
195
+ },
196
+
197
+ addObserver: function(observer) {
198
+ this.observers.push(observer);
199
+ this._cacheObserverCallbacks();
200
+ },
201
+
202
+ removeObserver: function(element) { // element instead of observer fixes mem leaks
203
+ this.observers = this.observers.reject( function(o) { return o.element==element });
204
+ this._cacheObserverCallbacks();
205
+ },
206
+
207
+ notify: function(eventName, draggable, event) { // 'onStart', 'onEnd', 'onDrag'
208
+ if(this[eventName+'Count'] > 0)
209
+ this.observers.each( function(o) {
210
+ if(o[eventName]) o[eventName](eventName, draggable, event);
211
+ });
212
+ if(draggable.options[eventName]) draggable.options[eventName](draggable, event);
213
+ },
214
+
215
+ _cacheObserverCallbacks: function() {
216
+ ['onStart','onEnd','onDrag'].each( function(eventName) {
217
+ Draggables[eventName+'Count'] = Draggables.observers.select(
218
+ function(o) { return o[eventName]; }
219
+ ).length;
220
+ });
221
+ }
222
+ };
223
+
224
+ /*--------------------------------------------------------------------------*/
225
+
226
+ var Draggable = Class.create({
227
+ initialize: function(element) {
228
+ var defaults = {
229
+ handle: false,
230
+ reverteffect: function(element, top_offset, left_offset) {
231
+ var dur = Math.sqrt(Math.abs(top_offset^2)+Math.abs(left_offset^2))*0.02;
232
+ new Effect.Move(element, { x: -left_offset, y: -top_offset, duration: dur,
233
+ queue: {scope:'_draggable', position:'end'}
234
+ });
235
+ },
236
+ endeffect: function(element) {
237
+ var toOpacity = Object.isNumber(element._opacity) ? element._opacity : 1.0;
238
+ new Effect.Opacity(element, {duration:0.2, from:0.7, to:toOpacity,
239
+ queue: {scope:'_draggable', position:'end'},
240
+ afterFinish: function(){
241
+ Draggable._dragging[element] = false
242
+ }
243
+ });
244
+ },
245
+ zindex: 1000,
246
+ revert: false,
247
+ quiet: false,
248
+ scroll: false,
249
+ scrollSensitivity: 20,
250
+ scrollSpeed: 15,
251
+ snap: false, // false, or xy or [x,y] or function(x,y){ return [x,y] }
252
+ delay: 0
253
+ };
254
+
255
+ if(!arguments[1] || Object.isUndefined(arguments[1].endeffect))
256
+ Object.extend(defaults, {
257
+ starteffect: function(element) {
258
+ element._opacity = Element.getOpacity(element);
259
+ Draggable._dragging[element] = true;
260
+ new Effect.Opacity(element, {duration:0.2, from:element._opacity, to:0.7});
261
+ }
262
+ });
263
+
264
+ var options = Object.extend(defaults, arguments[1] || { });
265
+
266
+ this.element = $(element);
267
+
268
+ if(options.handle && Object.isString(options.handle))
269
+ this.handle = this.element.down('.'+options.handle, 0);
270
+
271
+ if(!this.handle) this.handle = $(options.handle);
272
+ if(!this.handle) this.handle = this.element;
273
+
274
+ if(options.scroll && !options.scroll.scrollTo && !options.scroll.outerHTML) {
275
+ options.scroll = $(options.scroll);
276
+ this._isScrollChild = Element.childOf(this.element, options.scroll);
277
+ }
278
+
279
+ Element.makePositioned(this.element); // fix IE
280
+
281
+ this.options = options;
282
+ this.dragging = false;
283
+
284
+ this.eventMouseDown = this.initDrag.bindAsEventListener(this);
285
+ Event.observe(this.handle, "mousedown", this.eventMouseDown);
286
+
287
+ Draggables.register(this);
288
+ },
289
+
290
+ destroy: function() {
291
+ Event.stopObserving(this.handle, "mousedown", this.eventMouseDown);
292
+ Draggables.unregister(this);
293
+ },
294
+
295
+ currentDelta: function() {
296
+ return([
297
+ parseInt(Element.getStyle(this.element,'left') || '0'),
298
+ parseInt(Element.getStyle(this.element,'top') || '0')]);
299
+ },
300
+
301
+ initDrag: function(event) {
302
+ if(!Object.isUndefined(Draggable._dragging[this.element]) &&
303
+ Draggable._dragging[this.element]) return;
304
+ if(Event.isLeftClick(event)) {
305
+ // abort on form elements, fixes a Firefox issue
306
+ var src = Event.element(event);
307
+ if((tag_name = src.tagName.toUpperCase()) && (
308
+ tag_name=='INPUT' ||
309
+ tag_name=='SELECT' ||
310
+ tag_name=='OPTION' ||
311
+ tag_name=='BUTTON' ||
312
+ tag_name=='TEXTAREA')) return;
313
+
314
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
315
+ var pos = this.element.cumulativeOffset();
316
+ this.offset = [0,1].map( function(i) { return (pointer[i] - pos[i]) });
317
+
318
+ Draggables.activate(this);
319
+ Event.stop(event);
320
+ }
321
+ },
322
+
323
+ startDrag: function(event) {
324
+ this.dragging = true;
325
+ if(!this.delta)
326
+ this.delta = this.currentDelta();
327
+
328
+ if(this.options.zindex) {
329
+ this.originalZ = parseInt(Element.getStyle(this.element,'z-index') || 0);
330
+ this.element.style.zIndex = this.options.zindex;
331
+ }
332
+
333
+ if(this.options.ghosting) {
334
+ this._clone = this.element.cloneNode(true);
335
+ this._originallyAbsolute = (this.element.getStyle('position') == 'absolute');
336
+ if (!this._originallyAbsolute)
337
+ Position.absolutize(this.element);
338
+ this.element.parentNode.insertBefore(this._clone, this.element);
339
+ }
340
+
341
+ if(this.options.scroll) {
342
+ if (this.options.scroll == window) {
343
+ var where = this._getWindowScroll(this.options.scroll);
344
+ this.originalScrollLeft = where.left;
345
+ this.originalScrollTop = where.top;
346
+ } else {
347
+ this.originalScrollLeft = this.options.scroll.scrollLeft;
348
+ this.originalScrollTop = this.options.scroll.scrollTop;
349
+ }
350
+ }
351
+
352
+ Draggables.notify('onStart', this, event);
353
+
354
+ if(this.options.starteffect) this.options.starteffect(this.element);
355
+ },
356
+
357
+ updateDrag: function(event, pointer) {
358
+ if(!this.dragging) this.startDrag(event);
359
+
360
+ if(!this.options.quiet){
361
+ Position.prepare();
362
+ Droppables.show(pointer, this.element);
363
+ }
364
+
365
+ Draggables.notify('onDrag', this, event);
366
+
367
+ this.draw(pointer);
368
+ if(this.options.change) this.options.change(this);
369
+
370
+ if(this.options.scroll) {
371
+ this.stopScrolling();
372
+
373
+ var p;
374
+ if (this.options.scroll == window) {
375
+ with(this._getWindowScroll(this.options.scroll)) { p = [ left, top, left+width, top+height ]; }
376
+ } else {
377
+ p = Position.page(this.options.scroll);
378
+ p[0] += this.options.scroll.scrollLeft + Position.deltaX;
379
+ p[1] += this.options.scroll.scrollTop + Position.deltaY;
380
+ p.push(p[0]+this.options.scroll.offsetWidth);
381
+ p.push(p[1]+this.options.scroll.offsetHeight);
382
+ }
383
+ var speed = [0,0];
384
+ if(pointer[0] < (p[0]+this.options.scrollSensitivity)) speed[0] = pointer[0]-(p[0]+this.options.scrollSensitivity);
385
+ if(pointer[1] < (p[1]+this.options.scrollSensitivity)) speed[1] = pointer[1]-(p[1]+this.options.scrollSensitivity);
386
+ if(pointer[0] > (p[2]-this.options.scrollSensitivity)) speed[0] = pointer[0]-(p[2]-this.options.scrollSensitivity);
387
+ if(pointer[1] > (p[3]-this.options.scrollSensitivity)) speed[1] = pointer[1]-(p[3]-this.options.scrollSensitivity);
388
+ this.startScrolling(speed);
389
+ }
390
+
391
+ // fix AppleWebKit rendering
392
+ if(Prototype.Browser.WebKit) window.scrollBy(0,0);
393
+
394
+ Event.stop(event);
395
+ },
396
+
397
+ finishDrag: function(event, success) {
398
+ this.dragging = false;
399
+
400
+ if(this.options.quiet){
401
+ Position.prepare();
402
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
403
+ Droppables.show(pointer, this.element);
404
+ }
405
+
406
+ if(this.options.ghosting) {
407
+ if (!this._originallyAbsolute)
408
+ Position.relativize(this.element);
409
+ delete this._originallyAbsolute;
410
+ Element.remove(this._clone);
411
+ this._clone = null;
412
+ }
413
+
414
+ var dropped = false;
415
+ if(success) {
416
+ dropped = Droppables.fire(event, this.element);
417
+ if (!dropped) dropped = false;
418
+ }
419
+ if(dropped && this.options.onDropped) this.options.onDropped(this.element);
420
+ Draggables.notify('onEnd', this, event);
421
+
422
+ var revert = this.options.revert;
423
+ if(revert && Object.isFunction(revert)) revert = revert(this.element);
424
+
425
+ var d = this.currentDelta();
426
+ if(revert && this.options.reverteffect) {
427
+ if (dropped == 0 || revert != 'failure')
428
+ this.options.reverteffect(this.element,
429
+ d[1]-this.delta[1], d[0]-this.delta[0]);
430
+ } else {
431
+ this.delta = d;
432
+ }
433
+
434
+ if(this.options.zindex)
435
+ this.element.style.zIndex = this.originalZ;
436
+
437
+ if(this.options.endeffect)
438
+ this.options.endeffect(this.element);
439
+
440
+ Draggables.deactivate(this);
441
+ Droppables.reset();
442
+ },
443
+
444
+ keyPress: function(event) {
445
+ if(event.keyCode!=Event.KEY_ESC) return;
446
+ this.finishDrag(event, false);
447
+ Event.stop(event);
448
+ },
449
+
450
+ endDrag: function(event) {
451
+ if(!this.dragging) return;
452
+ this.stopScrolling();
453
+ this.finishDrag(event, true);
454
+ Event.stop(event);
455
+ },
456
+
457
+ draw: function(point) {
458
+ var pos = this.element.cumulativeOffset();
459
+ if(this.options.ghosting) {
460
+ var r = Position.realOffset(this.element);
461
+ pos[0] += r[0] - Position.deltaX; pos[1] += r[1] - Position.deltaY;
462
+ }
463
+
464
+ var d = this.currentDelta();
465
+ pos[0] -= d[0]; pos[1] -= d[1];
466
+
467
+ if(this.options.scroll && (this.options.scroll != window && this._isScrollChild)) {
468
+ pos[0] -= this.options.scroll.scrollLeft-this.originalScrollLeft;
469
+ pos[1] -= this.options.scroll.scrollTop-this.originalScrollTop;
470
+ }
471
+
472
+ var p = [0,1].map(function(i){
473
+ return (point[i]-pos[i]-this.offset[i])
474
+ }.bind(this));
475
+
476
+ if(this.options.snap) {
477
+ if(Object.isFunction(this.options.snap)) {
478
+ p = this.options.snap(p[0],p[1],this);
479
+ } else {
480
+ if(Object.isArray(this.options.snap)) {
481
+ p = p.map( function(v, i) {
482
+ return (v/this.options.snap[i]).round()*this.options.snap[i] }.bind(this));
483
+ } else {
484
+ p = p.map( function(v) {
485
+ return (v/this.options.snap).round()*this.options.snap }.bind(this));
486
+ }
487
+ }}
488
+
489
+ var style = this.element.style;
490
+ if((!this.options.constraint) || (this.options.constraint=='horizontal'))
491
+ style.left = p[0] + "px";
492
+ if((!this.options.constraint) || (this.options.constraint=='vertical'))
493
+ style.top = p[1] + "px";
494
+
495
+ if(style.visibility=="hidden") style.visibility = ""; // fix gecko rendering
496
+ },
497
+
498
+ stopScrolling: function() {
499
+ if(this.scrollInterval) {
500
+ clearInterval(this.scrollInterval);
501
+ this.scrollInterval = null;
502
+ Draggables._lastScrollPointer = null;
503
+ }
504
+ },
505
+
506
+ startScrolling: function(speed) {
507
+ if(!(speed[0] || speed[1])) return;
508
+ this.scrollSpeed = [speed[0]*this.options.scrollSpeed,speed[1]*this.options.scrollSpeed];
509
+ this.lastScrolled = new Date();
510
+ this.scrollInterval = setInterval(this.scroll.bind(this), 10);
511
+ },
512
+
513
+ scroll: function() {
514
+ var current = new Date();
515
+ var delta = current - this.lastScrolled;
516
+ this.lastScrolled = current;
517
+ if(this.options.scroll == window) {
518
+ with (this._getWindowScroll(this.options.scroll)) {
519
+ if (this.scrollSpeed[0] || this.scrollSpeed[1]) {
520
+ var d = delta / 1000;
521
+ this.options.scroll.scrollTo( left + d*this.scrollSpeed[0], top + d*this.scrollSpeed[1] );
522
+ }
523
+ }
524
+ } else {
525
+ this.options.scroll.scrollLeft += this.scrollSpeed[0] * delta / 1000;
526
+ this.options.scroll.scrollTop += this.scrollSpeed[1] * delta / 1000;
527
+ }
528
+
529
+ Position.prepare();
530
+ Droppables.show(Draggables._lastPointer, this.element);
531
+ Draggables.notify('onDrag', this);
532
+ if (this._isScrollChild) {
533
+ Draggables._lastScrollPointer = Draggables._lastScrollPointer || $A(Draggables._lastPointer);
534
+ Draggables._lastScrollPointer[0] += this.scrollSpeed[0] * delta / 1000;
535
+ Draggables._lastScrollPointer[1] += this.scrollSpeed[1] * delta / 1000;
536
+ if (Draggables._lastScrollPointer[0] < 0)
537
+ Draggables._lastScrollPointer[0] = 0;
538
+ if (Draggables._lastScrollPointer[1] < 0)
539
+ Draggables._lastScrollPointer[1] = 0;
540
+ this.draw(Draggables._lastScrollPointer);
541
+ }
542
+
543
+ if(this.options.change) this.options.change(this);
544
+ },
545
+
546
+ _getWindowScroll: function(w) {
547
+ var T, L, W, H;
548
+ with (w.document) {
549
+ if (w.document.documentElement && documentElement.scrollTop) {
550
+ T = documentElement.scrollTop;
551
+ L = documentElement.scrollLeft;
552
+ } else if (w.document.body) {
553
+ T = body.scrollTop;
554
+ L = body.scrollLeft;
555
+ }
556
+ if (w.innerWidth) {
557
+ W = w.innerWidth;
558
+ H = w.innerHeight;
559
+ } else if (w.document.documentElement && documentElement.clientWidth) {
560
+ W = documentElement.clientWidth;
561
+ H = documentElement.clientHeight;
562
+ } else {
563
+ W = body.offsetWidth;
564
+ H = body.offsetHeight;
565
+ }
566
+ }
567
+ return { top: T, left: L, width: W, height: H };
568
+ }
569
+ });
570
+
571
+ Draggable._dragging = { };
572
+
573
+ /*--------------------------------------------------------------------------*/
574
+
575
+ var SortableObserver = Class.create({
576
+ initialize: function(element, observer) {
577
+ this.element = $(element);
578
+ this.observer = observer;
579
+ this.lastValue = Sortable.serialize(this.element);
580
+ },
581
+
582
+ onStart: function() {
583
+ this.lastValue = Sortable.serialize(this.element);
584
+ },
585
+
586
+ onEnd: function() {
587
+ Sortable.unmark();
588
+ if(this.lastValue != Sortable.serialize(this.element))
589
+ this.observer(this.element)
590
+ }
591
+ });
592
+
593
+ var Sortable = {
594
+ SERIALIZE_RULE: /^[^_\-](?:[A-Za-z0-9\-\_]*)[_](.*)$/,
595
+
596
+ sortables: { },
597
+
598
+ _findRootElement: function(element) {
599
+ while (element.tagName.toUpperCase() != "BODY") {
600
+ if(element.id && Sortable.sortables[element.id]) return element;
601
+ element = element.parentNode;
602
+ }
603
+ },
604
+
605
+ options: function(element) {
606
+ element = Sortable._findRootElement($(element));
607
+ if(!element) return;
608
+ return Sortable.sortables[element.id];
609
+ },
610
+
611
+ destroy: function(element){
612
+ element = $(element);
613
+ var s = Sortable.sortables[element.id];
614
+
615
+ if(s) {
616
+ Draggables.removeObserver(s.element);
617
+ s.droppables.each(function(d){ Droppables.remove(d) });
618
+ s.draggables.invoke('destroy');
619
+
620
+ delete Sortable.sortables[s.element.id];
621
+ }
622
+ },
623
+
624
+ create: function(element) {
625
+ element = $(element);
626
+ var options = Object.extend({
627
+ element: element,
628
+ tag: 'li', // assumes li children, override with tag: 'tagname'
629
+ dropOnEmpty: false,
630
+ tree: false,
631
+ treeTag: 'ul',
632
+ overlap: 'vertical', // one of 'vertical', 'horizontal'
633
+ constraint: 'vertical', // one of 'vertical', 'horizontal', false
634
+ containment: element, // also takes array of elements (or id's); or false
635
+ handle: false, // or a CSS class
636
+ only: false,
637
+ delay: 0,
638
+ hoverclass: null,
639
+ ghosting: false,
640
+ quiet: false,
641
+ scroll: false,
642
+ scrollSensitivity: 20,
643
+ scrollSpeed: 15,
644
+ format: this.SERIALIZE_RULE,
645
+
646
+ // these take arrays of elements or ids and can be
647
+ // used for better initialization performance
648
+ elements: false,
649
+ handles: false,
650
+
651
+ onChange: Prototype.emptyFunction,
652
+ onUpdate: Prototype.emptyFunction
653
+ }, arguments[1] || { });
654
+
655
+ // clear any old sortable with same element
656
+ this.destroy(element);
657
+
658
+ // build options for the draggables
659
+ var options_for_draggable = {
660
+ revert: true,
661
+ quiet: options.quiet,
662
+ scroll: options.scroll,
663
+ scrollSpeed: options.scrollSpeed,
664
+ scrollSensitivity: options.scrollSensitivity,
665
+ delay: options.delay,
666
+ ghosting: options.ghosting,
667
+ constraint: options.constraint,
668
+ handle: options.handle };
669
+
670
+ if(options.starteffect)
671
+ options_for_draggable.starteffect = options.starteffect;
672
+
673
+ if(options.reverteffect)
674
+ options_for_draggable.reverteffect = options.reverteffect;
675
+ else
676
+ if(options.ghosting) options_for_draggable.reverteffect = function(element) {
677
+ element.style.top = 0;
678
+ element.style.left = 0;
679
+ };
680
+
681
+ if(options.endeffect)
682
+ options_for_draggable.endeffect = options.endeffect;
683
+
684
+ if(options.zindex)
685
+ options_for_draggable.zindex = options.zindex;
686
+
687
+ // build options for the droppables
688
+ var options_for_droppable = {
689
+ overlap: options.overlap,
690
+ containment: options.containment,
691
+ tree: options.tree,
692
+ hoverclass: options.hoverclass,
693
+ onHover: Sortable.onHover
694
+ };
695
+
696
+ var options_for_tree = {
697
+ onHover: Sortable.onEmptyHover,
698
+ overlap: options.overlap,
699
+ containment: options.containment,
700
+ hoverclass: options.hoverclass
701
+ };
702
+
703
+ // fix for gecko engine
704
+ Element.cleanWhitespace(element);
705
+
706
+ options.draggables = [];
707
+ options.droppables = [];
708
+
709
+ // drop on empty handling
710
+ if(options.dropOnEmpty || options.tree) {
711
+ Droppables.add(element, options_for_tree);
712
+ options.droppables.push(element);
713
+ }
714
+
715
+ (options.elements || this.findElements(element, options) || []).each( function(e,i) {
716
+ var handle = options.handles ? $(options.handles[i]) :
717
+ (options.handle ? $(e).select('.' + options.handle)[0] : e);
718
+ options.draggables.push(
719
+ new Draggable(e, Object.extend(options_for_draggable, { handle: handle })));
720
+ Droppables.add(e, options_for_droppable);
721
+ if(options.tree) e.treeNode = element;
722
+ options.droppables.push(e);
723
+ });
724
+
725
+ if(options.tree) {
726
+ (Sortable.findTreeElements(element, options) || []).each( function(e) {
727
+ Droppables.add(e, options_for_tree);
728
+ e.treeNode = element;
729
+ options.droppables.push(e);
730
+ });
731
+ }
732
+
733
+ // keep reference
734
+ this.sortables[element.identify()] = options;
735
+
736
+ // for onupdate
737
+ Draggables.addObserver(new SortableObserver(element, options.onUpdate));
738
+
739
+ },
740
+
741
+ // return all suitable-for-sortable elements in a guaranteed order
742
+ findElements: function(element, options) {
743
+ return Element.findChildren(
744
+ element, options.only, options.tree ? true : false, options.tag);
745
+ },
746
+
747
+ findTreeElements: function(element, options) {
748
+ return Element.findChildren(
749
+ element, options.only, options.tree ? true : false, options.treeTag);
750
+ },
751
+
752
+ onHover: function(element, dropon, overlap) {
753
+ if(Element.isParent(dropon, element)) return;
754
+
755
+ if(overlap > .33 && overlap < .66 && Sortable.options(dropon).tree) {
756
+ return;
757
+ } else if(overlap>0.5) {
758
+ Sortable.mark(dropon, 'before');
759
+ if(dropon.previousSibling != element) {
760
+ var oldParentNode = element.parentNode;
761
+ element.style.visibility = "hidden"; // fix gecko rendering
762
+ dropon.parentNode.insertBefore(element, dropon);
763
+ if(dropon.parentNode!=oldParentNode)
764
+ Sortable.options(oldParentNode).onChange(element);
765
+ Sortable.options(dropon.parentNode).onChange(element);
766
+ }
767
+ } else {
768
+ Sortable.mark(dropon, 'after');
769
+ var nextElement = dropon.nextSibling || null;
770
+ if(nextElement != element) {
771
+ var oldParentNode = element.parentNode;
772
+ element.style.visibility = "hidden"; // fix gecko rendering
773
+ dropon.parentNode.insertBefore(element, nextElement);
774
+ if(dropon.parentNode!=oldParentNode)
775
+ Sortable.options(oldParentNode).onChange(element);
776
+ Sortable.options(dropon.parentNode).onChange(element);
777
+ }
778
+ }
779
+ },
780
+
781
+ onEmptyHover: function(element, dropon, overlap) {
782
+ var oldParentNode = element.parentNode;
783
+ var droponOptions = Sortable.options(dropon);
784
+
785
+ if(!Element.isParent(dropon, element)) {
786
+ var index;
787
+
788
+ var children = Sortable.findElements(dropon, {tag: droponOptions.tag, only: droponOptions.only});
789
+ var child = null;
790
+
791
+ if(children) {
792
+ var offset = Element.offsetSize(dropon, droponOptions.overlap) * (1.0 - overlap);
793
+
794
+ for (index = 0; index < children.length; index += 1) {
795
+ if (offset - Element.offsetSize (children[index], droponOptions.overlap) >= 0) {
796
+ offset -= Element.offsetSize (children[index], droponOptions.overlap);
797
+           } else if (offset - (Element.offsetSize (children[index], droponOptions.overlap) / 2) >= 0) {
+             child = index + 1 < children.length ? children[index + 1] : null;
+             break;
+           } else {
+             child = children[index];
+             break;
+           }
+         }
+       }
+
+       dropon.insertBefore(element, child);
+
+       Sortable.options(oldParentNode).onChange(element);
+       droponOptions.onChange(element);
+     }
+   },
+
+   unmark: function() {
+     if(Sortable._marker) Sortable._marker.hide();
+   },
+
+   mark: function(dropon, position) {
+     // mark on ghosting only
+     var sortable = Sortable.options(dropon.parentNode);
+     if(sortable && !sortable.ghosting) return;
+
+     if(!Sortable._marker) {
+       Sortable._marker =
+         ($('dropmarker') || Element.extend(document.createElement('DIV'))).
+           hide().addClassName('dropmarker').setStyle({position:'absolute'});
+       document.getElementsByTagName("body").item(0).appendChild(Sortable._marker);
+     }
+     var offsets = dropon.cumulativeOffset();
+     Sortable._marker.setStyle({left: offsets[0]+'px', top: offsets[1] + 'px'});
+
+     if(position=='after')
+       if(sortable.overlap == 'horizontal')
+         Sortable._marker.setStyle({left: (offsets[0]+dropon.clientWidth) + 'px'});
+       else
+         Sortable._marker.setStyle({top: (offsets[1]+dropon.clientHeight) + 'px'});
+
+     Sortable._marker.show();
+   },
+
+   _tree: function(element, options, parent) {
+     var children = Sortable.findElements(element, options) || [];
+
+     for (var i = 0; i < children.length; ++i) {
+       var match = children[i].id.match(options.format);
+
+       if (!match) continue;
+
+       var child = {
+         id: encodeURIComponent(match ? match[1] : null),
+         element: element,
+         parent: parent,
+         children: [],
+         position: parent.children.length,
+         container: $(children[i]).down(options.treeTag)
+       };
+
+       /* Get the element containing the children and recurse over it */
+       if (child.container)
+         this._tree(child.container, options, child);
+
+       parent.children.push (child);
+     }
+
+     return parent;
+   },
+
+   tree: function(element) {
+     element = $(element);
+     var sortableOptions = this.options(element);
+     var options = Object.extend({
+       tag: sortableOptions.tag,
+       treeTag: sortableOptions.treeTag,
+       only: sortableOptions.only,
+       name: element.id,
+       format: sortableOptions.format
+     }, arguments[1] || { });
+
+     var root = {
+       id: null,
+       parent: null,
+       children: [],
+       container: element,
+       position: 0
+     };
+
+     return Sortable._tree(element, options, root);
+   },
+
+   /* Construct a [i] index for a particular node */
+   _constructIndex: function(node) {
+     var index = '';
+     do {
+       if (node.id) index = '[' + node.position + ']' + index;
+     } while ((node = node.parent) != null);
+     return index;
+   },
+
+   sequence: function(element) {
+     element = $(element);
+     var options = Object.extend(this.options(element), arguments[1] || { });
+
+     return $(this.findElements(element, options) || []).map( function(item) {
+       return item.id.match(options.format) ? item.id.match(options.format)[1] : '';
+     });
+   },
+
+   setSequence: function(element, new_sequence) {
+     element = $(element);
+     var options = Object.extend(this.options(element), arguments[2] || { });
+
+     var nodeMap = { };
+     this.findElements(element, options).each( function(n) {
+       if (n.id.match(options.format))
+         nodeMap[n.id.match(options.format)[1]] = [n, n.parentNode];
+       n.parentNode.removeChild(n);
+     });
+
+     new_sequence.each(function(ident) {
+       var n = nodeMap[ident];
+       if (n) {
+         n[1].appendChild(n[0]);
+         delete nodeMap[ident];
+       }
+     });
+   },
+
+   serialize: function(element) {
+     element = $(element);
+     var options = Object.extend(Sortable.options(element), arguments[1] || { });
+     var name = encodeURIComponent(
+       (arguments[1] && arguments[1].name) ? arguments[1].name : element.id);
+
+     if (options.tree) {
+       return Sortable.tree(element, arguments[1]).children.map( function (item) {
+         return [name + Sortable._constructIndex(item) + "[id]=" +
+           encodeURIComponent(item.id)].concat(item.children.map(arguments.callee));
+       }).flatten().join('&');
+     } else {
+       return Sortable.sequence(element, arguments[1]).map( function(item) {
+         return name + "[]=" + encodeURIComponent(item);
+       }).join('&');
+     }
+   }
+ };
+
+ // Returns true if child is contained within element
+ Element.isParent = function(child, element) {
+   if (!child.parentNode || child == element) return false;
+   if (child.parentNode == element) return true;
+   return Element.isParent(child.parentNode, element);
+ };
+
+ Element.findChildren = function(element, only, recursive, tagName) {
+   if(!element.hasChildNodes()) return null;
+   tagName = tagName.toUpperCase();
+   if(only) only = [only].flatten();
+   var elements = [];
+   $A(element.childNodes).each( function(e) {
+     if(e.tagName && e.tagName.toUpperCase()==tagName &&
+       (!only || (Element.classNames(e).detect(function(v) { return only.include(v) }))))
+       elements.push(e);
+     if(recursive) {
+       var grandchildren = Element.findChildren(e, only, recursive, tagName);
+       if(grandchildren) elements.push(grandchildren);
+     }
+   });
+
+   return (elements.length>0 ? elements.flatten() : []);
+ };
+
+ Element.offsetSize = function (element, type) {
+   return element['offset' + ((type=='vertical' || type=='height') ? 'Height' : 'Width')];
+ };
mosesdecoder/scripts/ems/web/javascripts/prototype.js ADDED
The diff for this file is too large to render. See raw diff
 
mosesdecoder/scripts/ems/web/javascripts/sound.js ADDED
@@ -0,0 +1,63 @@
+ // script.aculo.us sound.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
+
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
+ //
+ // Based on code created by Jules Gravinese (http://www.webveteran.com/)
+ //
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
+
+ Sound = {
+   tracks: {},
+   _enabled: true,
+   template:
+     new Template('<embed style="height:0" id="sound_#{track}_#{id}" src="#{url}" loop="false" autostart="true" hidden="true"/>'),
+   enable: function(){
+     Sound._enabled = true;
+   },
+   disable: function(){
+     Sound._enabled = false;
+   },
+   play: function(url){
+     if(!Sound._enabled) {
+       return;
+     }
+     var options = Object.extend({
+       track: 'global', url: url, replace: false
+     }, arguments[1] || {});
+
+     if(options.replace && this.tracks[options.track]) {
+       $R(0, this.tracks[options.track].id).each(function(id){
+         var sound = $('sound_'+options.track+'_'+id);
+         sound.Stop && sound.Stop();
+         sound.remove();
+       });
+       this.tracks[options.track] = null;
+     }
+
+     if(!this.tracks[options.track]) {
+       this.tracks[options.track] = { id: 0 };
+     } else {
+       this.tracks[options.track].id++;
+     }
+
+     options.id = this.tracks[options.track].id;
+     $$('body')[0].insert(
+       Prototype.Browser.IE ? new Element('bgsound',{
+         id: 'sound_'+options.track+'_'+options.id,
+         src: options.url, loop: 1, autostart: true
+       }) : Sound.template.evaluate(options));
+   }
+ };
+
+ if(Prototype.Browser.Gecko && navigator.userAgent.indexOf("Win") > 0){
+   if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('QuickTime') != -1; })) {
+     Sound.template = new Template('<object id="sound_#{track}_#{id}" width="0" height="0" type="audio/mpeg" data="#{url}"/>');
+   } else if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('Windows Media') != -1; })) {
+     Sound.template = new Template('<object id="sound_#{track}_#{id}" type="application/x-mplayer2" data="#{url}"></object>');
+   } else if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('RealPlayer') != -1; })) {
+     Sound.template = new Template('<embed type="audio/x-pn-realaudio-plugin" style="height:0" id="sound_#{track}_#{id}" src="#{url}" loop="false" autostart="true" hidden="true"/>');
+   } else {
+     Sound.play = function(){};
+   }
+ }
mosesdecoder/vw/Classifier.h ADDED
@@ -0,0 +1,197 @@
+ #ifndef moses_Classifier_h
+ #define moses_Classifier_h
+
+ #include <iostream>
+ #include <string>
+ #include <fstream>
+ #include <sstream>
+ #include <deque>
+ #include <vector>
+ #include <boost/shared_ptr.hpp>
+
+ #include <boost/noncopyable.hpp>
+ #include <boost/thread/condition_variable.hpp>
+ #include <boost/thread/locks.hpp>
+ #include <boost/thread/mutex.hpp>
+ #include <boost/iostreams/filtering_stream.hpp>
+ #include <boost/iostreams/filter/gzip.hpp>
+ #include "../util/string_piece.hh"
+ #include "../moses/Util.h"
+
+ // forward declarations to avoid dependency on VW
+ struct vw;
+ class ezexample;
+
+ namespace Discriminative
+ {
+
+ typedef std::pair<uint32_t, float> FeatureType; // feature hash (=ID) and value
+ typedef std::vector<FeatureType> FeatureVector;
+
+ /**
+  * Abstract class to be implemented by classifiers.
+  */
+ class Classifier
+ {
+ public:
+   /**
+    * Add a feature that does not depend on the class (label).
+    */
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value) = 0;
+
+   /**
+    * Add a feature that is specific for the given class.
+    */
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value) = 0;
+
+   /**
+    * Efficient addition of features when their IDs are already computed.
+    */
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features) = 0;
+
+   /**
+    * Efficient addition of features when their IDs are already computed.
+    */
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features) = 0;
+
+   /**
+    * Train using the current example. Use loss to distinguish positive and negative training examples.
+    * Throws away the current label-dependent features (so that features for another label/class can now be set).
+    */
+   virtual void Train(const StringPiece &label, float loss) = 0;
+
+   /**
+    * Predict the loss (inverse of score) of the current example.
+    * Throws away the current label-dependent features (so that features for another label/class can now be set).
+    */
+   virtual float Predict(const StringPiece &label) = 0;
+
+   // helper methods for indicator features
+   FeatureType AddLabelIndependentFeature(const StringPiece &name) {
+     return AddLabelIndependentFeature(name, 1.0);
+   }
+
+   FeatureType AddLabelDependentFeature(const StringPiece &name) {
+     return AddLabelDependentFeature(name, 1.0);
+   }
+
+   virtual ~Classifier() {}
+
+ protected:
+   /**
+    * Escape special characters in a unified way.
+    */
+   static std::string EscapeSpecialChars(const std::string &str) {
+     std::string out;
+     out = Moses::Replace(str, "\\", "_/_");
+     out = Moses::Replace(out, "|", "\\/");
+     out = Moses::Replace(out, ":", "\\;");
+     out = Moses::Replace(out, " ", "\\_");
+     return out;
+   }
+
+   const static bool DEBUG = false;
+ };
+
+ // some of the VW settings are hard-coded because they are always needed in our scenario
+ // (e.g. quadratic source X target features)
+ const std::string VW_DEFAULT_OPTIONS = " --hash all --noconstant -q st -t --ldf_override sc ";
+ const std::string VW_DEFAULT_PARSER_OPTIONS = " --quiet --hash all --noconstant -q st -t --csoaa_ldf sc ";
+
+ /**
+  * Produce a VW training file (does not use the VW library!)
+  */
+ class VWTrainer : public Classifier
+ {
+ public:
+   VWTrainer(const std::string &outputFile);
+   virtual ~VWTrainer();
+
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value);
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value);
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features);
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features);
+   virtual void Train(const StringPiece &label, float loss);
+   virtual float Predict(const StringPiece &label);
+
+ protected:
+   void AddFeature(const StringPiece &name, float value);
+
+   bool m_isFirstSource, m_isFirstTarget, m_isFirstExample;
+
+ private:
+   boost::iostreams::filtering_ostream m_bfos;
+   std::deque<std::string> m_outputBuffer;
+
+   void WriteBuffer();
+ };
+
+ /**
+  * Predict using the VW library.
+  */
+ class VWPredictor : public Classifier, private boost::noncopyable
+ {
+ public:
+   VWPredictor(const std::string &modelFile, const std::string &vwOptions);
+   virtual ~VWPredictor();
+
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value);
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value);
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features);
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features);
+   virtual void Train(const StringPiece &label, float loss);
+   virtual float Predict(const StringPiece &label);
+
+   friend class ClassifierFactory;
+
+ protected:
+   FeatureType AddFeature(const StringPiece &name, float value);
+
+   ::vw *m_VWInstance, *m_VWParser;
+   ::ezexample *m_ex;
+   // if true, then the VW instance is owned by an external party and should NOT be
+   // deleted at the end; if false, then we own the VW instance and must clean up after it.
+   bool m_sharedVwInstance;
+   bool m_isFirstSource, m_isFirstTarget;
+
+ private:
+   // instantiation by the classifier factory
+   VWPredictor(vw * instance, const std::string &vwOption);
+ };
+
+ /**
+  * Provider for classifier instances to be used by individual threads.
+  */
+ class ClassifierFactory : private boost::noncopyable
+ {
+ public:
+   typedef boost::shared_ptr<Classifier> ClassifierPtr;
+
+   /**
+    * Creates VWPredictor instances to be used by individual threads.
+    */
+   ClassifierFactory(const std::string &modelFile, const std::string &vwOptions);
+
+   /**
+    * Creates VWTrainer instances (which write features to a file).
+    */
+   ClassifierFactory(const std::string &modelFilePrefix);
+
+   // return a VWPredictor or VWTrainer instance depending on whether we're in training mode
+   ClassifierPtr operator()();
+
+   ~ClassifierFactory();
+
+ private:
+   std::string m_vwOptions;
+   ::vw *m_VWInstance;
+   int m_lastId;
+   std::string m_modelFilePrefix;
+   bool m_gzip;
+   boost::mutex m_mutex;
+   const bool m_train;
+ };
+
+ } // namespace Discriminative
+
+ #endif // moses_Classifier_h
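
The comments in `Classifier.h` imply a specific call protocol. A minimal sketch of how a caller might drive it, assuming the header above is on the include path (the feature names and the `ScoreCandidates` helper are illustrative, not part of the Moses sources):

    #include <algorithm>
    #include <limits>
    #include <string>
    #include <vector>

    #include "Classifier.h"

    // Score a set of candidate labels with an already-constructed classifier
    // (e.g. a VWPredictor obtained from a ClassifierFactory).
    float ScoreCandidates(Discriminative::Classifier &classifier,
                          const std::vector<std::string> &candidates)
    {
      // shared, label-independent features go first (namespace 's' in the VW backends)
      classifier.AddLabelIndependentFeature("bow^haus");

      float bestLoss = std::numeric_limits<float>::max();
      for (size_t i = 0; i < candidates.size(); ++i) {
        // label-dependent features for this candidate (namespace 't')
        classifier.AddLabelDependentFeature("tind^" + candidates[i]);
        // Predict() returns a loss (lower is better) and discards the 't' features,
        // so the next candidate starts from a clean slate
        bestLoss = std::min(bestLoss, classifier.Predict(candidates[i]));
      }
      return bestLoss;
    }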
mosesdecoder/vw/ClassifierFactory.cpp ADDED
@@ -0,0 +1,48 @@
+ #include "Classifier.h"
+ #include "vw.h"
+ #include "../moses/Util.h"
+ #include <iostream>
+ #include <boost/algorithm/string/predicate.hpp>
+
+ using namespace boost::algorithm;
+
+ namespace Discriminative
+ {
+
+ ClassifierFactory::ClassifierFactory(const std::string &modelFile, const std::string &vwOptions)
+   : m_vwOptions(vwOptions), m_train(false)
+ {
+   m_VWInstance = VW::initialize(VW_DEFAULT_OPTIONS + " -i " + modelFile + vwOptions);
+ }
+
+ ClassifierFactory::ClassifierFactory(const std::string &modelFilePrefix)
+   : m_lastId(0), m_train(true)
+ {
+   if (ends_with(modelFilePrefix, ".gz")) {
+     m_modelFilePrefix = modelFilePrefix.substr(0, modelFilePrefix.size() - 3);
+     m_gzip = true;
+   } else {
+     m_modelFilePrefix = modelFilePrefix;
+     m_gzip = false;
+   }
+ }
+
+ ClassifierFactory::~ClassifierFactory()
+ {
+   if (! m_train)
+     VW::finish(*m_VWInstance);
+ }
+
+ ClassifierFactory::ClassifierPtr ClassifierFactory::operator()()
+ {
+   if (m_train) {
+     boost::unique_lock<boost::mutex> lock(m_mutex); // avoid a possible race for m_lastId
+     return ClassifierFactory::ClassifierPtr(
+       new VWTrainer(m_modelFilePrefix + "." + Moses::SPrint(m_lastId++) + (m_gzip ? ".gz" : "")));
+   } else {
+     return ClassifierFactory::ClassifierPtr(
+       new VWPredictor(m_VWInstance, VW_DEFAULT_PARSER_OPTIONS + m_vwOptions));
+   }
+ }
+
+ }
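
A sketch of the intended per-thread usage (the file names here are hypothetical): the factory is created once, and each worker thread requests its own classifier through `operator()`; in prediction mode the heavy VW model is shared, in training mode each call opens a new numbered feature file.

    #include "Classifier.h"

    // one factory per model; each worker thread then requests its own handle
    Discriminative::ClassifierFactory factory("/path/to/model.vw", " --quiet");

    void WorkerThread()
    {
      // prediction mode: wraps the shared VW model in a per-thread parser/example;
      // training mode (file-prefix constructor): opens a new numbered output file
      Discriminative::ClassifierFactory::ClassifierPtr classifier = factory();
      // ... AddLabelIndependentFeature / AddLabelDependentFeature / Predict ...
    }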
mosesdecoder/vw/Jamfile ADDED
@@ -0,0 +1,20 @@
+ alias headers : : : : <include>. <include>..//moses// <include>.. ;
+ alias deps : ..//z ..//boost_iostreams ..//boost_filesystem ../moses//moses ;
+
+ boost 103600 ;
+
+ # VW
+ local with-vw = [ option.get "with-vw" ] ;
+ if $(with-vw) {
+   lib vw : : <search>$(with-vw)/lib ;
+   lib allreduce : : <search>$(with-vw)/lib ;
+
+   obj ClassifierFactory.o : ClassifierFactory.cpp headers : <include>$(with-vw)/include/vowpalwabbit ;
+   obj VWPredictor.o : VWPredictor.cpp headers : <include>$(with-vw)/include/vowpalwabbit ;
+
+   alias vw_objects : VWPredictor.o ClassifierFactory.o vw allreduce : : : <library>boost_program_options ;
+   lib classifier : [ glob *.cpp : VWPredictor.cpp ClassifierFactory.cpp ] vw_objects headers ;
+
+   exe vwtrainer : MainVW deps ;
+   echo "Linking with Vowpal Wabbit" ;
+ }
mosesdecoder/vw/Normalizer.h ADDED
@@ -0,0 +1,78 @@
+ #ifndef moses_Normalizer_h
+ #define moses_Normalizer_h
+
+ #include <vector>
+ #include <cmath>
+ #include <algorithm>
+ #include "Util.h"
+
+ namespace Discriminative
+ {
+
+ class Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const = 0;
+   virtual ~Normalizer() {}
+ };
+
+ class SquaredLossNormalizer : public Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const {
+     // this is (probably) a reasonable choice for squared loss (the default loss function in VW)
+
+     float sum = 0;
+
+     // clip to [0,1] and take 1 - loss as the non-normalized probability
+     std::vector<float>::iterator it;
+     for (it = losses.begin(); it != losses.end(); it++) {
+       if (*it <= 0.0) *it = 1.0;
+       else if (*it >= 1.0) *it = 0.0;
+       else *it = 1.0 - *it;
+       sum += *it;
+     }
+
+     if (! Moses::Equals(sum, 0)) {
+       // normalize
+       for (it = losses.begin(); it != losses.end(); it++)
+         *it /= sum;
+     } else {
+       // if the sum of non-normalized probabilities is 0, fall back to uniform probabilities
+       for (it = losses.begin(); it != losses.end(); it++)
+         *it = 1.0 / losses.size();
+     }
+   }
+
+   virtual ~SquaredLossNormalizer() {}
+ };
+
+ // safe softmax
+ class LogisticLossNormalizer : public Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const {
+     std::vector<float>::iterator it;
+
+     float sum = 0;
+     float max = 0; // note: starting the max at 0 is safe as long as losses are non-negative
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it = -*it;
+       max = std::max(max, *it);
+     }
+
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it = std::exp(*it - max);
+       sum += *it;
+     }
+
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it /= sum;
+     }
+   }
+
+   virtual ~LogisticLossNormalizer() {}
+ };
+
+ } // namespace Discriminative
+
+ #endif // moses_Normalizer_h
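
Spelled out: given per-label losses l_i, `LogisticLossNormalizer` computes a max-shifted softmax over the negated losses,

    p_i = \frac{\exp(-l_i - M)}{\sum_j \exp(-l_j - M)}, \qquad M = \max\bigl(0, \max_j(-l_j)\bigr)

The shift by M cancels algebraically, so this is exactly softmax(-l); its only purpose is to keep exp() from overflowing when some -l_j is large and positive.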
mosesdecoder/vw/README.md ADDED
@@ -0,0 +1,113 @@
+ Vowpal Wabbit for Moses
+ =======================
+
+ This is an attempt to integrate Vowpal Wabbit with Moses as a stateless feature
+ function.
+
+ Compatible with this frozen version of VW:
+
+ https://github.com/moses-smt/vowpal_wabbit
+
+ To enable VW, provide bjam with the path where VW was installed (using `make install`):
+
+     ./bjam --with-vw=<path/to/vw/installation>
+
+ Implemented classifier features
+ -------------------------------
+
+ * `VWFeatureSourceBagOfWords`: Creates a feature of the form bow^token for every
+ source sentence token.
+ * `VWFeatureSourceExternalFeatures column=0`: When used with -inputtype 5 (`TabbedSentence`), this can be used to supply additional features to VW. The input is a tab-separated file; the first column is the usual input sentence, all other columns can be used for meta-data. The column parameter counts from 0, beginning with the first column that is not the input sentence.
+ * `VWFeatureSourceIndicator`: Adds a feature for the whole source phrase.
+ * `VWFeatureSourcePhraseInternal`: Adds a separate feature for every word of the source phrase.
+ * `VWFeatureSourceWindow size=3`: Adds source words in a window of size 3 before and after the source phrase as features. These do not overlap with `VWFeatureSourcePhraseInternal`.
+ * `VWFeatureTargetIndicator`: Adds a feature for the whole target phrase.
+ * `VWFeatureTargetPhraseInternal`: Adds a separate feature for every word of the target phrase.
+
+ Configuration
+ -------------
+
+ To use the classifier, edit your moses.ini:
+
+     [features]
+     ...
+     VW path=/home/username/vw/classifier1.vw
+     VWFeatureSourceBagOfWords
+     VWFeatureTargetIndicator
+     VWFeatureSourceIndicator
+     ...
+
+     [weights]
+     ...
+     VW0= 0.2
+     ...
+
+ If you change the name of the main VW feature, remember to tell the VW classifier
+ features which classifier they belong to:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/classifier1.vw
+     VWFeatureSourceBagOfWords used-by=bart
+     VWFeatureTargetIndicator used-by=bart
+     VWFeatureSourceIndicator used-by=bart
+     ...
+
+     [weights]
+     ...
+     bart= 0.2
+     ...
+
+ You can also use multiple classifiers:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/classifier1.vw
+     VW path=/home/username/vw/classifier2.vw
+     VW path=/home/username/vw/classifier3.vw
+     VWFeatureSourceBagOfWords used-by=bart,VW0
+     VWFeatureTargetIndicator used-by=VW1,VW0,bart
+     VWFeatureSourceIndicator used-by=bart,VW1
+     ...
+
+     [weights]
+     ...
+     bart= 0.2
+     VW0= 0.2
+     VW1= 0.2
+     ...
+
+ Features can use any combination of factors. Provide a comma-delimited list of factors in the `source-factors` or `target-factors` variables to override the default setting (`0`, i.e. the first factor).
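+
+ For example (illustrative values; which factor indices exist depends on your corpus):
+
+     VWFeatureSourcePhraseInternal used-by=VW0 source-factors=0,2
+     VWFeatureTargetPhraseInternal used-by=VW0 target-factors=0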
+
+ Training the classifier
+ -----------------------
+
+ Training uses `vwtrainer`, which is a limited version of the `moses` binary. To train, provide your training data as input in the following format:
+
+     source tokens<tab>target tokens<tab>word alignment
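+
+ For illustration, a single (hypothetical) training example in this format could look like this:
+
+     das ist ein Test<tab>this is a test<tab>0-0 1-1 2-2 3-3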
+
+ Use Moses format for the word alignment (`0-0 1-0` etc.). Set the input type to 5 (`TabbedSentence`, see above):
+
+     [inputtype]
+     5
+
+ Configure your features in the `moses.ini` file (see above) and set the `train` flag:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/features.txt train=1
+     ...
+
+ The `path` variable points to the file (prefix) where features will be written. Currently, threads write to separate files (this may change sooner or later): `features.txt.1`, `features.txt.2` etc.
+
+ `vwtrainer` creates the translation option collection for each input sentence but does not run decoding. Therefore, you probably want to disable expensive feature functions such as the language model (the LM score is not used by VW features at the moment).
+
+ Run `vwtrainer`:
+
+     vwtrainer -f moses.trainvw.ini < tab-separated-training-data.tsv
+
+ Currently, classification is implemented using VW's `csoaa_ldf` scheme with quadratic features, which take the product of the source namespace (`s`, contains label-independent features) and the target namespace (`t`, contains label-dependent features).
+
+ To train VW in this setting, use the command:
+
+     cat features.txt.* | vw --hash all --loss_function logistic --noconstant -b 26 -q st --csoaa_ldf mc -f classifier1.vw
mosesdecoder/vw/VWPredictor.cpp ADDED
@@ -0,0 +1,121 @@
+ #include <iostream>
+ #include <stdexcept>
+
+ #include "Classifier.h"
+ #include "vw.h"
+ #include "ezexample.h"
+ #include "../moses/Util.h"
+
+ namespace Discriminative
+ {
+
+ using namespace std;
+
+ VWPredictor::VWPredictor(const string &modelFile, const string &vwOptions)
+ {
+   m_VWInstance = VW::initialize(VW_DEFAULT_OPTIONS + " -i " + modelFile + vwOptions);
+   m_VWParser = VW::initialize(VW_DEFAULT_PARSER_OPTIONS + vwOptions + " --noop");
+   m_sharedVwInstance = false;
+   m_ex = new ::ezexample(m_VWInstance, false, m_VWParser);
+   m_isFirstSource = m_isFirstTarget = true;
+ }
+
+ VWPredictor::VWPredictor(vw *instance, const string &vwOptions)
+ {
+   m_VWInstance = instance;
+   m_VWParser = VW::initialize(vwOptions + " --noop");
+   m_sharedVwInstance = true;
+   m_ex = new ::ezexample(m_VWInstance, false, m_VWParser);
+   m_isFirstSource = m_isFirstTarget = true;
+ }
+
+ VWPredictor::~VWPredictor()
+ {
+   delete m_ex;
+   VW::finish(*m_VWParser);
+   if (!m_sharedVwInstance)
+     VW::finish(*m_VWInstance);
+ }
+
+ FeatureType VWPredictor::AddLabelIndependentFeature(const StringPiece &name, float value)
+ {
+   // label-independent features are kept in a different feature namespace ('s' = source)
+
+   if (m_isFirstSource) {
+     // the first feature of a new example => create the source namespace for
+     // label-independent features to live in
+     m_isFirstSource = false;
+     m_ex->finish();
+     m_ex->addns('s');
+     if (DEBUG) std::cerr << "VW :: Setting source namespace\n";
+   }
+   return AddFeature(name, value); // namespace 's' is set up, add the feature
+ }
+
+ FeatureType VWPredictor::AddLabelDependentFeature(const StringPiece &name, float value)
+ {
+   // VW does not use the label directly; instead, we do a Cartesian product between source and target feature
+   // namespaces, where the source namespace ('s') contains label-independent features and the target
+   // namespace ('t') contains label-dependent features
+
+   if (m_isFirstTarget) {
+     // the first target-side feature => create namespace 't'
+     m_isFirstTarget = false;
+     m_ex->addns('t');
+     if (DEBUG) std::cerr << "VW :: Setting target namespace\n";
+   }
+   return AddFeature(name, value);
+ }
+
+ void VWPredictor::AddLabelIndependentFeatureVector(const FeatureVector &features)
+ {
+   if (m_isFirstSource) {
+     // the first feature of a new example => create the source namespace for
+     // label-independent features to live in
+     m_isFirstSource = false;
+     m_ex->finish();
+     m_ex->addns('s');
+     if (DEBUG) std::cerr << "VW :: Setting source namespace\n";
+   }
+
+   // add each feature index using this "low level" call to VW
+   for (FeatureVector::const_iterator it = features.begin(); it != features.end(); it++)
+     m_ex->addf(it->first, it->second);
+ }
+
+ void VWPredictor::AddLabelDependentFeatureVector(const FeatureVector &features)
+ {
+   if (m_isFirstTarget) {
+     // the first target-side feature => create namespace 't'
+     m_isFirstTarget = false;
+     m_ex->addns('t');
+     if (DEBUG) std::cerr << "VW :: Setting target namespace\n";
+   }
+
+   // add each feature index using this "low level" call to VW
+   for (FeatureVector::const_iterator it = features.begin(); it != features.end(); it++)
+     m_ex->addf(it->first, it->second);
+ }
+
+ void VWPredictor::Train(const StringPiece &label, float loss)
+ {
+   throw logic_error("Trying to train during prediction!");
+ }
+
+ float VWPredictor::Predict(const StringPiece &label)
+ {
+   m_ex->set_label(label.as_string());
+   m_isFirstSource = true;
+   m_isFirstTarget = true;
+   float loss = m_ex->predict_partial();
+   if (DEBUG) std::cerr << "VW :: Predicted loss: " << loss << "\n";
+   m_ex->remns(); // remove the target namespace
+   return loss;
+ }
+
+ FeatureType VWPredictor::AddFeature(const StringPiece &name, float value)
+ {
+   if (DEBUG) std::cerr << "VW :: Adding feature: " << EscapeSpecialChars(name.as_string()) << ":" << value << "\n";
+   return std::make_pair(m_ex->addf(EscapeSpecialChars(name.as_string()), value), value);
+ }
+
+ } // namespace Discriminative
mosesdecoder/vw/VWTrainer.cpp ADDED
@@ -0,0 +1,99 @@
+ #include "Util.h"
+ #include "Classifier.h"
+ #include <stdexcept>
+ #include <boost/algorithm/string/predicate.hpp>
+ #include <boost/iostreams/device/file.hpp>
+
+ using namespace std;
+ using namespace boost::algorithm;
+ using namespace Moses;
+
+ namespace Discriminative
+ {
+
+ VWTrainer::VWTrainer(const std::string &outputFile)
+ {
+   if (ends_with(outputFile, ".gz")) {
+     m_bfos.push(boost::iostreams::gzip_compressor());
+   }
+   m_bfos.push(boost::iostreams::file_sink(outputFile));
+   m_isFirstSource = m_isFirstTarget = m_isFirstExample = true;
+ }
+
+ VWTrainer::~VWTrainer()
+ {
+   m_bfos << "\n";
+   close(m_bfos);
+ }
+
+ FeatureType VWTrainer::AddLabelIndependentFeature(const StringPiece &name, float value)
+ {
+   if (m_isFirstSource) {
+     if (m_isFirstExample) {
+       m_isFirstExample = false;
+     } else {
+       // finish the previous example
+       m_bfos << "\n";
+     }
+
+     m_isFirstSource = false;
+     if (! m_outputBuffer.empty())
+       WriteBuffer();
+
+     m_outputBuffer.push_back("shared |s");
+   }
+
+   AddFeature(name, value);
+
+   return std::make_pair(0, value); // we don't hash features
+ }
+
+ FeatureType VWTrainer::AddLabelDependentFeature(const StringPiece &name, float value)
+ {
+   if (m_isFirstTarget) {
+     m_isFirstTarget = false;
+     if (! m_outputBuffer.empty())
+       WriteBuffer();
+
+     m_outputBuffer.push_back("|t");
+   }
+
+   AddFeature(name, value);
+
+   return std::make_pair(0, value); // we don't hash features
+ }
+
+ void VWTrainer::AddLabelIndependentFeatureVector(const FeatureVector &features)
+ {
+   throw logic_error("VW trainer does not support feature IDs.");
+ }
+
+ void VWTrainer::AddLabelDependentFeatureVector(const FeatureVector &features)
+ {
+   throw logic_error("VW trainer does not support feature IDs.");
+ }
+
+ void VWTrainer::Train(const StringPiece &label, float loss)
+ {
+   m_outputBuffer.push_front(label.as_string() + ":" + SPrint(loss));
+   m_isFirstSource = true;
+   m_isFirstTarget = true;
+   WriteBuffer();
+ }
+
+ float VWTrainer::Predict(const StringPiece &label)
+ {
+   throw logic_error("Trying to predict during training!");
+ }
+
+ void VWTrainer::AddFeature(const StringPiece &name, float value)
+ {
+   m_outputBuffer.push_back(EscapeSpecialChars(name.as_string()) + ":" + SPrint(value));
+ }
+
+ void VWTrainer::WriteBuffer()
+ {
+   m_bfos << Join(" ", m_outputBuffer.begin(), m_outputBuffer.end()) << "\n";
+   m_outputBuffer.clear();
+ }
+
+ } // namespace Discriminative
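
For reference, the file this trainer emits follows VW's cost-sensitive label-dependent-features (`csoaa_ldf`) text format: one shared line with the source ('s') namespace, then one line per candidate with its loss and target ('t') namespace, and a blank line between examples. A hypothetical two-candidate example (the actual feature names depend on the configured VW features):

    shared |s bow^haus bow^ist
    house:0.0 |t tind^house
    home:1.0 |t tind^home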
scripts/decode-backtrans.sh ADDED
@@ -0,0 +1,69 @@
+ #! /usr/bin/bash
+ set -eux
+
+ # use CPU if set to 0
+ comet_eval_gpus=8
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+ # language pair
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ task_name=${src_lang}2${tgt_lang}
+ raw_data_dir=$root_dir/data/test/raw/$task_name
+ trainable_data_dir=$root_dir/data/test/trainable_data/$task_name
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ # directory containing the model
+ model_dir=$root_dir/exps/${task_name}_backtrans/${threshold}/transformer_big_wmt23
+
+ ### decode
+ checkpoint_path=$model_dir/checkpoint_best.pt
+ save_dir=$model_dir/decode_result
+
+ mkdir -p $save_dir
+ cp ${BASH_SOURCE[0]} $save_dir
+
+ declare -A gen_subset_dict
+ gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ for gen_subset in ${!gen_subset_dict[*]}
+ do
+     decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+     pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+     CUDA_VISIBLE_DEVICES=0 fairseq-generate $trainable_data_dir -s $src_lang -t $tgt_lang \
+         --gen-subset $gen_subset \
+         --path $checkpoint_path \
+         --max-tokens $decode_max_tokens \
+         --beam $beam \
+         --nbest $nbest \
+         --lenpen $lenpen \
+         --seed 42 \
+         --remove-bpe | tee $decode_file
+
+     ### eval
+     # purify file: extract the hypotheses from the fairseq output and detokenize them
+     grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+     eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+     cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+     echo "=============$cur_time===================" >> $eval_file
+     echo $checkpoint_path >> $eval_file
+     tail -n1 $decode_file >> $eval_file # multi-bleu
+     # get scores
+     src_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$src_lang
+     ref_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$tgt_lang
+     # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     # use the default tokenizer here; pass --tokenize zh only when the target language is Chinese
+     sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+     comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+     echo "Comet22 Score" >> $eval_file
+     tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+ done
scripts/decode.sh ADDED
@@ -0,0 +1,69 @@
+ #! /usr/bin/bash
+ set -eux
+
+ # use CPU if set to 0
+ comet_eval_gpus=8
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+ # language pair
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ task_name=${src_lang}2${tgt_lang}
+ raw_data_dir=$root_dir/data/test/raw/$task_name
+ trainable_data_dir=$root_dir/data/test/trainable_data/$task_name
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ # directory containing the model
+ model_dir=$root_dir/exps/${task_name}/${threshold}/transformer_big_wmt23
+
+ ### decode
+ checkpoint_path=$model_dir/checkpoint_best.pt
+ save_dir=$model_dir/decode_result
+
+ mkdir -p $save_dir
+ cp ${BASH_SOURCE[0]} $save_dir
+
+ declare -A gen_subset_dict
+ gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ for gen_subset in ${!gen_subset_dict[*]}
+ do
+     decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+     pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+     CUDA_VISIBLE_DEVICES=0 fairseq-generate $trainable_data_dir -s $src_lang -t $tgt_lang \
+         --gen-subset $gen_subset \
+         --path $checkpoint_path \
+         --max-tokens $decode_max_tokens \
+         --beam $beam \
+         --nbest $nbest \
+         --lenpen $lenpen \
+         --seed 42 \
+         --remove-bpe | tee $decode_file
+
+     ### eval
+     # purify file: extract the hypotheses from the fairseq output and detokenize them
+     grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+     eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+     cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+     echo "=============$cur_time===================" >> $eval_file
+     echo $checkpoint_path >> $eval_file
+     tail -n1 $decode_file >> $eval_file # multi-bleu
+     # get scores
+     src_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$src_lang
+     ref_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$tgt_lang
+     # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     # use the default tokenizer here; pass --tokenize zh only when the target language is Chinese
+     sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+     comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+     echo "Comet22 Score" >> $eval_file
+     tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+ done
scripts/train-backtrans.sh ADDED
@@ -0,0 +1,157 @@
+ #! /usr/bin/bash
+ set -eux
+
+ train_device=0,1,2,3,4,5,6,7
+ eval_device=0
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ data_name=wmt23
+ # pair_lang=${src_lang}-${tgt_lang}
+ task_name=${src_lang}2${tgt_lang}
+ data_dir=$root_dir/data/${tgt_lang}2${src_lang}/${threshold}
+ raw_data_dir=$data_dir/raw
+ trainable_data_dir=$data_dir/trainable_data
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ ## common params
+ criterion=label_smoothed_cross_entropy
+ label_smoothing=0.1
+ seed=42
+ max_epoch=40
+ keep_last_epochs=1
+ keep_best_checkpoints=5
+ patience=5
+ num_workers=8
+
+ # model-specific params
+ conf_name=transformer_big
+ # Global batch = num_gpus * max-tokens * gradient accumulation steps. For language pairs with
+ # large training data (a train set of tens of millions of sentence pairs), a global batch above
+ # 100k tokens works well; here 8 * 8192 * 4 = 262,144 tokens per update.
+ if [ $conf_name == "transformer_big" ]; then
+     arch=transformer_vaswani_wmt_en_de_big
+     activation_fn=relu
+     encoder_ffn_embed_dim=4096
+     share_all_embeddings=1
+     share_decoder_input_output_embed=1
+     learning_rate=1e-3
+     warmup=4000
+     max_tokens=8192
+     weight_decay=0.0
+     dropout=0.3
+     gradient_accumulation_steps=4
+ else
+     echo "unknown conf_name=$conf_name"
+     exit
+ fi
+
+ model_dir=$root_dir/exps/${task_name}_backtrans/${threshold}/${conf_name}_${data_name}
+ mkdir -p $model_dir
+ cp ${BASH_SOURCE[0]} $model_dir
+
+ gpu_num=`echo "$train_device" | awk '{split($0,arr,",");print length(arr)}'`
+ export CUDA_VISIBLE_DEVICES=$train_device
+ cmd="fairseq-train $trainable_data_dir \
+     --distributed-world-size $gpu_num -s $src_lang -t $tgt_lang \
+     --arch $arch \
+     --fp16 \
+     --optimizer adam --clip-norm 0.0 \
+     --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates $warmup \
+     --lr $learning_rate --adam-betas '(0.9, 0.98)' \
+     --weight-decay $weight_decay \
+     --dropout $dropout \
+     --criterion $criterion --label-smoothing $label_smoothing \
+     --max-epoch $max_epoch \
+     --max-tokens $max_tokens \
+     --update-freq $gradient_accumulation_steps \
+     --activation-fn $activation_fn \
+     --encoder-ffn-embed-dim $encoder_ffn_embed_dim \
+     --seed $seed \
+     --num-workers $num_workers \
+     --no-epoch-checkpoints \
+     --keep-last-epochs $keep_last_epochs \
+     --keep-best-checkpoints $keep_best_checkpoints \
+     --patience $patience \
+     --no-progress-bar \
+     --log-interval 100 \
+     --task "translation" \
+     --ddp-backend no_c10d \
+     --save-dir $model_dir \
+     --tensorboard-logdir $model_dir"
+
+ # optional params
+ if [ $share_all_embeddings -eq 1 ]; then
+     cmd=${cmd}" --share-all-embeddings "
+ fi
+ if [ $share_decoder_input_output_embed -eq 1 ]; then
+     cmd=${cmd}" --share-decoder-input-output-embed "
+ fi
+ if [ ${max_update:=0} -ne 0 ]; then
+     cmd=${cmd}" --max-update $max_update"
+ fi
+
+ # run command
+ cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ echo "=============$cur_time===================" >> $model_dir/train.log
+ cmd="nohup ${cmd} >> $model_dir/train.log 2>&1 &"
+
+ eval $cmd
+
+ # wait
+
+ # ### decode
+ # checkpoint_path=$model_dir/checkpoint_best.pt
+ # save_dir=$model_dir/decode_result
+
+ # mkdir -p $save_dir
+ # cp ${BASH_SOURCE[0]} $save_dir
+
+ # declare -A gen_subset_dict
+ # gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ # for gen_subset in ${!gen_subset_dict[*]}
+ # do
+ # decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+ # pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+ # CUDA_VISIBLE_DEVICES=$eval_device fairseq-generate \
+ # $trainable_data_dir \
+ # -s $src_lang -t $tgt_lang \
+ # --user-dir $user_dir \
+ # --gen-subset $gen_subset \
+ # --path $checkpoint_path \
+ # --max-tokens $decode_max_tokens \
+ # --beam $beam \
+ # --nbest $nbest \
+ # --lenpen $lenpen \
+ # --seed $seed \
+ # --remove-bpe | tee $decode_file
+
+ # ### eval
+ # # purify file
+ # grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+ # eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+ # cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ # echo "=============$cur_time===================" >> $eval_file
+ # echo $checkpoint_path >> $eval_file
+ # tail -n1 $decode_file >> $eval_file # multi-bleu
+ # # get scores
+ # src_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$src_lang
+ # ref_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$tgt_lang
+ # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+ # comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+ # echo "Comet22 Score" >> $eval_file
+ # tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+
+ # echo -e "decode finished! \n decode tokenized file in $decode_file \n detokenized file in $pure_file \n sacrebleu file in $eval_file"
+ # done
scripts/train.sh ADDED
@@ -0,0 +1,157 @@
+ #! /usr/bin/bash
+ set -eux
+
+ train_device=0,1,2,3,4,5,6,7
+ eval_device=0
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ data_name=wmt23
+ # pair_lang=${src_lang}-${tgt_lang}
+ task_name=${src_lang}2${tgt_lang}
+ data_dir=$root_dir/data/${task_name}/${threshold}
+ raw_data_dir=$data_dir/raw
+ trainable_data_dir=$data_dir/trainable_data
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ ## common params
+ criterion=label_smoothed_cross_entropy
+ label_smoothing=0.1
+ seed=42
+ max_epoch=40
+ keep_last_epochs=1
+ keep_best_checkpoints=5
+ patience=5
+ num_workers=8
+
+ # model-specific params
+ conf_name=transformer_big
+ # Global batch = num_gpus * max-tokens * gradient accumulation steps. For language pairs with
+ # large training data (a train set of tens of millions of sentence pairs), a global batch above
+ # 100k tokens works well; here 8 * 8192 * 4 = 262,144 tokens per update.
+ if [ $conf_name == "transformer_big" ]; then
+     arch=transformer_vaswani_wmt_en_de_big
+     activation_fn=relu
+     encoder_ffn_embed_dim=4096
+     share_all_embeddings=0
+     share_decoder_input_output_embed=1
+     learning_rate=1e-3
+     warmup=4000
+     max_tokens=8192
+     weight_decay=0.0
+     dropout=0.3
+     gradient_accumulation_steps=4
+ else
+     echo "unknown conf_name=$conf_name"
+     exit
+ fi
+
+ model_dir=$root_dir/exps/$task_name/${threshold}/${conf_name}_${data_name}
+ mkdir -p $model_dir
+ cp ${BASH_SOURCE[0]} $model_dir
+
+ gpu_num=`echo "$train_device" | awk '{split($0,arr,",");print length(arr)}'`
+ export CUDA_VISIBLE_DEVICES=$train_device
+ cmd="fairseq-train $trainable_data_dir \
+     --distributed-world-size $gpu_num -s $src_lang -t $tgt_lang \
+     --arch $arch \
+     --fp16 \
+     --optimizer adam --clip-norm 0.0 \
+     --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates $warmup \
+     --lr $learning_rate --adam-betas '(0.9, 0.98)' \
+     --weight-decay $weight_decay \
+     --dropout $dropout \
+     --criterion $criterion --label-smoothing $label_smoothing \
+     --max-epoch $max_epoch \
+     --max-tokens $max_tokens \
+     --update-freq $gradient_accumulation_steps \
+     --activation-fn $activation_fn \
+     --encoder-ffn-embed-dim $encoder_ffn_embed_dim \
+     --seed $seed \
+     --num-workers $num_workers \
+     --no-epoch-checkpoints \
+     --keep-last-epochs $keep_last_epochs \
+     --keep-best-checkpoints $keep_best_checkpoints \
+     --patience $patience \
+     --no-progress-bar \
+     --log-interval 100 \
+     --task "translation" \
+     --ddp-backend no_c10d \
+     --save-dir $model_dir \
+     --tensorboard-logdir $model_dir"
+
+ # optional params
+ if [ $share_all_embeddings -eq 1 ]; then
+     cmd=${cmd}" --share-all-embeddings "
+ fi
+ if [ $share_decoder_input_output_embed -eq 1 ]; then
+     cmd=${cmd}" --share-decoder-input-output-embed "
+ fi
+ if [ ${max_update:=0} -ne 0 ]; then
+     cmd=${cmd}" --max-update $max_update"
+ fi
+
+ # run command
+ cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ echo "=============$cur_time===================" >> $model_dir/train.log
+ cmd="nohup ${cmd} >> $model_dir/train.log 2>&1 &"
+
+ eval $cmd
+
+ # wait
+
+ # ### decode
+ # checkpoint_path=$model_dir/checkpoint_best.pt
+ # save_dir=$model_dir/decode_result
+
+ # mkdir -p $save_dir
+ # cp ${BASH_SOURCE[0]} $save_dir
+
+ # declare -A gen_subset_dict
+ # gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ # for gen_subset in ${!gen_subset_dict[*]}
+ # do
+ # decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+ # pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+ # CUDA_VISIBLE_DEVICES=$eval_device fairseq-generate \
+ # $trainable_data_dir \
+ # -s $src_lang -t $tgt_lang \
+ # --user-dir $user_dir \
+ # --gen-subset $gen_subset \
+ # --path $checkpoint_path \
+ # --max-tokens $decode_max_tokens \
+ # --beam $beam \
+ # --nbest $nbest \
+ # --lenpen $lenpen \
+ # --seed $seed \
+ # --remove-bpe | tee $decode_file
+
+ # ### eval
+ # # purify file
+ # grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+ # eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+ # cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ # echo "=============$cur_time===================" >> $eval_file
+ # echo $checkpoint_path >> $eval_file
+ # tail -n1 $decode_file >> $eval_file # multi-bleu
+ # # get scores
+ # src_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$src_lang
+ # ref_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$tgt_lang
+ # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+ # comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+ # echo "Comet22 Score" >> $eval_file
+ # tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+
+ # echo -e "decode finished! \n decode tokenized file in $decode_file \n detokenized file in $pure_file \n sacrebleu file in $eval_file"
+ # done
subword-nmt/.github/workflows/pythonpublish.yml ADDED
@@ -0,0 +1,26 @@
+ name: Upload Python Package
+
+ on:
+   release:
+     types: [created]
+
+ jobs:
+   deploy:
+     runs-on: ubuntu-latest
+     steps:
+     - uses: actions/checkout@v1
+     - name: Set up Python
+       uses: actions/setup-python@v1
+       with:
+         python-version: '3.x'
+     - name: Install dependencies
+       run: |
+         python -m pip install --upgrade pip
+         pip install setuptools wheel twine
+     - name: Build and publish
+       env:
+         TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+         TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+       run: |
+         python setup.py sdist bdist_wheel
+         twine upload dist/*
subword-nmt/.gitignore ADDED
@@ -0,0 +1,105 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ .static_storage/
+ .media/
+ local_settings.py
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # pyenv
+ .python-version
+
+ # celery beat schedule file
+ celerybeat-schedule
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
subword-nmt/CHANGELOG.md ADDED
@@ -0,0 +1,52 @@
+ CHANGELOG
+ ---------
+ v0.3.9:
+ - byte-level BPE support
+ - remove support for Python 2
+
+ v0.3.8:
+ - multiprocessing support (get_vocab and apply_bpe)
+ - progress bar for learn_bpe
+ - seed parameter for deterministic BPE dropout
+ - ignore some unicode line separators which would crash subword-nmt
+
+ v0.3.7:
+ - BPE dropout (Provilkov et al., 2019)
+ - more efficient glossaries (https://github.com/rsennrich/subword-nmt/pull/69)
+
+ v0.3.6:
+ - fix to subword-bpe command encoding
+
+ v0.3.5:
+ - fix to subword-bpe command under Python 2
+ - wider support of --total-symbols argument
+
+ v0.3.4:
+ - segment_tokens method to improve library usability (https://github.com/rsennrich/subword-nmt/pull/52)
+ - support regex glossaries (https://github.com/rsennrich/subword-nmt/pull/56)
+ - allow unicode separators (https://github.com/rsennrich/subword-nmt/pull/57)
+ - new option --total-symbols in learn-bpe (commit 61ad8)
+ - fix documentation (best practices) (https://github.com/rsennrich/subword-nmt/pull/60)
+
+ v0.3:
+ - library is now installable via pip
+ - fix occasional problems with UTF-8 whitespace and new lines in learn_bpe and apply_bpe.
+ - do not silently convert UTF-8 newline characters into "\n"
+ - do not silently convert UTF-8 whitespace characters into " "
+ - UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE
+
+ v0.2:
+ - different, more consistent handling of end-of-word token (commit a749a7) (https://github.com/rsennrich/subword-nmt/issues/19)
+ - allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
+ - made learn_bpe.py deterministic (commit 4c54e)
+ - various changes to make handling of UTF more consistent between Python versions
+ - new command line arguments for apply_bpe.py:
+   - '--glossaries' to prevent given strings from being affected by BPE
+   - '--merges' to apply a subset of learned BPE operations
+ - new command line arguments for learn_bpe.py:
+   - '--dict-input': rather than raw text file, interpret input as a frequency dictionary (as created by get_vocab.py).
+
+ v0.1:
+ - consistent cross-version unicode handling
+ - all scripts are now deterministic