sleepyhead111 committed
Commit 12aef23 · verified · 1 Parent(s): 88117f8

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. data/test/trainable_data/de2en/preprocess.log +29 -0
  2. data/test/trainable_data/de2en/test1.de-en.de.idx +0 -0
  3. data/test/trainable_data/de2en/test1.de-en.en.idx +0 -0
  4. data/test/trainable_data/de2en/test2.de-en.de.idx +0 -0
  5. data/test/trainable_data/de2en/test2.de-en.en.idx +0 -0
  6. data/test/trainable_data/en2de/test.en-de.en.idx +0 -0
  7. data/test/trainable_data/en2de/test2.en-de.de.idx +0 -0
  8. data/test/trainable_data/en2de/test2.en-de.en.idx +0 -0
  9. data/test/trainable_data/en2zh/dict.en.txt +0 -0
  10. data/test/trainable_data/en2zh/dict.zh.txt +0 -0
  11. data/test/trainable_data/en2zh/preprocess.log +23 -0
  12. data/test/trainable_data/en2zh/test.en-zh.en.idx +0 -0
  13. data/test/trainable_data/en2zh/test.en-zh.zh.idx +0 -0
  14. data/test/trainable_data/en2zh/test1.en-zh.en.idx +0 -0
  15. data/test/trainable_data/en2zh/test1.en-zh.zh.idx +0 -0
  16. data/test/trainable_data/en2zh/test2.en-zh.en.idx +0 -0
  17. data/test/trainable_data/en2zh/test2.en-zh.zh.idx +0 -0
  18. data/test/trainable_data/zh2en/dict.en.txt +0 -0
  19. data/test/trainable_data/zh2en/dict.zh.txt +0 -0
  20. data/test/trainable_data/zh2en/preprocess.log +6 -0
  21. data/test/trainable_data/zh2en/preprocess1.log +6 -0
  22. data/test/trainable_data/zh2en/test.zh-en.en.idx +0 -0
  23. data/test/trainable_data/zh2en/test1.zh-en.en.idx +0 -0
  24. data/test/trainable_data/zh2en/test2.zh-en.en.idx +0 -0
  25. data/test/trainable_data/zh2en/test2.zh-en.zh.idx +0 -0
  26. mosesdecoder/scripts/analysis/nontranslated_words.pl +100 -0
  27. mosesdecoder/scripts/analysis/smtgui/Corpus.pm +1345 -0
  28. mosesdecoder/scripts/analysis/smtgui/README +42 -0
  29. mosesdecoder/scripts/analysis/smtgui/file-descriptions +4 -0
  30. mosesdecoder/scripts/analysis/smtgui/file-factors +9 -0
  31. mosesdecoder/scripts/analysis/smtgui/newsmtgui.cgi +1006 -0
  32. mosesdecoder/scripts/analysis/weight-scan-summarize.sh +79 -0
  33. mosesdecoder/scripts/ems/web/javascripts/builder.js +136 -0
  34. mosesdecoder/scripts/ems/web/javascripts/dragdrop.js +974 -0
  35. mosesdecoder/scripts/ems/web/javascripts/prototype.js +0 -0
  36. mosesdecoder/scripts/ems/web/javascripts/sound.js +63 -0
  37. mosesdecoder/vw/Classifier.h +197 -0
  38. mosesdecoder/vw/ClassifierFactory.cpp +48 -0
  39. mosesdecoder/vw/Jamfile +20 -0
  40. mosesdecoder/vw/Normalizer.h +78 -0
  41. mosesdecoder/vw/README.md +113 -0
  42. mosesdecoder/vw/VWPredictor.cpp +121 -0
  43. mosesdecoder/vw/VWTrainer.cpp +99 -0
  44. scripts/decode-backtrans.sh +69 -0
  45. scripts/decode.sh +69 -0
  46. scripts/train-backtrans.sh +157 -0
  47. scripts/train.sh +157 -0
  48. subword-nmt/.github/workflows/pythonpublish.yml +26 -0
  49. subword-nmt/.gitignore +105 -0
  50. subword-nmt/CHANGELOG.md +52 -0
data/test/trainable_data/de2en/preprocess.log ADDED
@@ -0,0 +1,29 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=True, only_source=False, padding_factor=8, workers=32)
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train.de: 46388489 sents, 1161403088 tokens, 0.0% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid.de: 1997 sents, 58227 tokens, 0.00515% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.de: 3545 sents, 116081 tokens, 0.00345% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train.en: 46388489 sents, 1094684830 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid.en: 1997 sents, 54062 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.en: 3545 sents, 110575 tokens, 0.00181% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=True, only_source=False, padding_factor=8, workers=32)
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data/dict.en.txt', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data/dict.de.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores.de: 1012 sents, 34004 tokens, 0.00588% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22.de: 1984 sents, 45732 tokens, 0.00219% replaced by <unk>
+ [de] Dictionary: 47776 types
+ [de] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23.de: 549 sents, 36345 tokens, 0.00275% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.flores.en: 1012 sents, 30385 tokens, 0.00658% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt22.en: 1984 sents, 45259 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 47776 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test.wmt23.en: 549 sents, 34931 tokens, 0.0% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data
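
The first Namespace line above corresponds to a fairseq-preprocess invocation along roughly the following lines (a sketch reconstructed from the logged arguments, not the exact command line; only non-default options shown):

    fairseq-preprocess --task translation --source-lang de --target-lang en \
        --trainpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.train \
        --validpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.valid \
        --testpref /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/bpe/bpe.test \
        --destdir /mnt/ouyangyx/trans_fairseq/nmt/data/de-en/wmt23-50M/trainable_data \
        --joined-dictionary --dataset-impl mmap --workers 32 --seed 30

The final Namespace in the log then reuses the dictionaries written by this first pass (via --srcdict/--tgtdict) to binarize the flores, wmt22, and wmt23 test sets against the same vocabulary.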
data/test/trainable_data/de2en/test1.de-en.de.idx ADDED
Binary file (23.8 kB).

data/test/trainable_data/de2en/test1.de-en.en.idx ADDED
Binary file (23.8 kB).

data/test/trainable_data/de2en/test2.de-en.de.idx ADDED
Binary file (6.61 kB).

data/test/trainable_data/de2en/test2.de-en.en.idx ADDED
Binary file (6.61 kB).

data/test/trainable_data/en2de/test.en-de.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2de/test2.en-de.de.idx ADDED
Binary file (6.71 kB).

data/test/trainable_data/en2de/test2.en-de.en.idx ADDED
Binary file (6.71 kB).

data/test/trainable_data/en2zh/dict.en.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/en2zh/dict.zh.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/en2zh/preprocess.log ADDED
@@ -0,0 +1,23 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='en', target_lang='zh', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpecode_32k/bpecode.zh', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpecode_32k/bpecode.en', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=30, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='en', target_lang='zh', trainpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train', validpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid', testpref='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22,/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23', align_suffix=None, destdir='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data/dict.zh.txt', srcdict='/mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data/dict.en.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train.en: 33431411 sents, 890241636 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid.en: 1999 sents, 59177 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores.en: 1012 sents, 28474 tokens, 0.00702% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22.en: 2037 sents, 44690 tokens, 0.00224% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23.en: 2074 sents, 47187 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.train.zh: 33431411 sents, 816506971 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.valid.zh: 1999 sents, 57690 tokens, 0.00347% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.flores.zh: 1012 sents, 27872 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt22.zh: 2037 sents, 41432 tokens, 0.0% replaced by <unk>
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/bpe/bpe.test.wmt23.zh: 2074 sents, 44353 tokens, 0.0% replaced by <unk>
+ Wrote preprocessed data to /mnt/ouyangyx/trans_fairseq/nmt/data/en2zh/wmt23-50M/trainable_data_1
data/test/trainable_data/en2zh/test.en-zh.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2zh/test.en-zh.zh.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/en2zh/test1.en-zh.en.idx ADDED
Binary file (24.5 kB).

data/test/trainable_data/en2zh/test1.en-zh.zh.idx ADDED
Binary file (24.5 kB).

data/test/trainable_data/en2zh/test2.en-zh.en.idx ADDED
Binary file (24.9 kB).

data/test/trainable_data/en2zh/test2.en-zh.zh.idx ADDED
Binary file (24.9 kB).

data/test/trainable_data/zh2en/dict.en.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/zh2en/dict.zh.txt ADDED
The diff for this file is too large to render.

data/test/trainable_data/zh2en/preprocess.log ADDED
@@ -0,0 +1,6 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=42, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='zh', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores', align_suffix=None, destdir='/mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en0', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt', srcdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores.zh: 1012 sents, 27918 tokens, 0.0% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.flores.en: 1012 sents, 28474 tokens, 0.00702% replaced by <unk>
+ Wrote preprocessed data to /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en0
data/test/trainable_data/zh2en/preprocess1.log ADDED
@@ -0,0 +1,6 @@
+ Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=42, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='zh', target_lang='en', trainpref=None, validpref=None, testpref='/mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22', align_suffix=None, destdir='/mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1', thresholdtgt=0, thresholdsrc=0, tgtdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt', srcdict='/mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=32)
+ [zh] Dictionary: 60432 types
+ [zh] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22.zh: 1875 sents, 51510 tokens, 0.0194% replaced by <unk>
+ [en] Dictionary: 46040 types
+ [en] /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22.en: 1875 sents, 62056 tokens, 0.00645% replaced by <unk>
+ Wrote preprocessed data to /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1
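
Both zh2en logs follow the usual fairseq pattern of binarizing a single test set against dictionaries built earlier; the second run above corresponds roughly to the following invocation (a sketch reconstructed from the logged arguments):

    fairseq-preprocess --source-lang zh --target-lang en \
        --testpref /mnt/congmh/luoyf/xzq-fairseq/data/test/tokenized/zh2en/bpe.test.zh2en.wmt22 \
        --srcdict /mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.zh.txt \
        --tgtdict /mnt/congmh/luoyf/xzq-fairseq/data/en-zh/wmt23/trainable_data/dict.en.txt \
        --destdir /mnt/congmh/luoyf/xzq-fairseq/data/test/trainable_data/zh2en1 \
        --workers 32 --seed 42

Fixing --srcdict and --tgtdict this way keeps the binarized test data consistent with the token IDs the model was trained on.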
data/test/trainable_data/zh2en/test.zh-en.en.idx ADDED
Binary file (12.2 kB).

data/test/trainable_data/zh2en/test1.zh-en.en.idx ADDED
Binary file (22.5 kB).

data/test/trainable_data/zh2en/test2.zh-en.en.idx ADDED
Binary file (23.7 kB).

data/test/trainable_data/zh2en/test2.zh-en.zh.idx ADDED
Binary file (23.7 kB).

mosesdecoder/scripts/analysis/nontranslated_words.pl ADDED
@@ -0,0 +1,100 @@
+ #!/usr/bin/env perl
+ #
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
+ # Public License version 2.1 or, at your option, any later version.
+
+ # $Id$
+ # Reads a source and hypothesis file and counts equal tokens. Some of these
+ # are punctuation, some are numbers, but most of the remaining are simply
+ # unknown words that the decoder just copied. This script tells you how often
+ # this happens.
+ #
+ # Ondrej Bojar
+
+
+ use strict;
+ use warnings;
+ use Getopt::Long;
+
+ my $ignore_numbers = 0;
+ my $ignore_punct = 0;
+ my $usage = 0;
+ my $top = 10;
+
+ GetOptions(
+ "help" => \$usage,
+ "top=i" => \$top,
+ "ignore-numbers" => \$ignore_numbers,
+ "ignore-punct" => \$ignore_punct,
+ ) or exit 1;
+ my $src = shift;
+ my $tgt = shift;
+
+ if ($usage || !defined $src || !defined $tgt) {
+ print STDERR "nontranslated_words.pl srcfile hypothesisfile
+ ...counts the number of words that are equal in src and hyp. These are
+ typically unknown words.
+ Options:
+ --top=N ... list N top copied tokens
+ --ignore-numbers ... numbers usually do not get translated, but do
+ not count them (it is not an error)
+ --ignore-punct ... same for punct, do not include it in the count
+ ";
+ exit 1;
+ }
+
+ binmode(STDOUT, ":utf8");
+ binmode(STDERR, ":utf8");
+
+ open SRC, $src or die "Can't read $src";
+ open TGT, $tgt or die "Can't read $tgt";
+ binmode(SRC, ":utf8");
+ binmode(TGT, ":utf8");
+
+ my $nr=0;
+ my $outtoks = 0;
+ my $intoks = 0;
+ my $copiedtoks = 0;
+ my %copiedtok;
+ while (<SRC>) {
+ $nr++;
+ chomp;
+ s/^\s+|\s+$//g;
+ my @src = split /\s+/;
+ my %src = map {($_,1)} @src;
+ $intoks += scalar @src;
+ my $t = <TGT>;
+ die "$tgt too short!" if !defined $t;
+ $t =~ s/^\s+|\s+$//g;
+ foreach my $outtok (split /\s+/, $t) {
+ $outtoks++;
+ next if !defined $src{$outtok}; # this word did not appear in input, we generated it
+ next if $ignore_numbers && $outtok =~ /^-?[0-9]*([.,][0-9]+)?$/;
+ next if $ignore_punct && $outtok =~ /^[[:punct:]]+$/;
+ $copiedtoks++;
+ $copiedtok{$outtok}++;
+ }
+ }
+ my $t = <TGT>;
+ die "$tgt too long!" if defined $t;
+ close SRC;
+ close TGT;
+
+ print "Sentences:\t$nr
+ Source tokens:\t$intoks
+ Output tokens:\t$outtoks
+ Output tokens appearing also in input sent:\t$copiedtoks\t"
+ .sprintf("%.2f %%", $copiedtoks/$outtoks*100)
+ ."\t".($ignore_punct?"ignoring":"including")." punctuation"
+ ."\t".($ignore_numbers?"ignoring":"including")." numbers"
+ ."\n";
+
+ if ($top) {
+ my $cnt = 0;
+ print "Top $top copied tokens:\n";
+ foreach my $t (sort {$copiedtok{$b}<=>$copiedtok{$a} || $a cmp $b} keys %copiedtok) {
+ print "$copiedtok{$t}\t$t\n";
+ last if $cnt > $top;
+ $cnt++;
+ }
+ }
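
A typical invocation of this script compares a tokenized source file against the decoder output, e.g. (hypothetical file names; the options are those defined in the GetOptions call above):

    perl nontranslated_words.pl --top=20 --ignore-numbers --ignore-punct test.src test.hyp

With --ignore-numbers and --ignore-punct set, the copied-token count approximates a pure unknown-word rate, since numbers and punctuation legitimately pass through untranslated.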
mosesdecoder/scripts/analysis/smtgui/Corpus.pm ADDED
@@ -0,0 +1,1345 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #package Corpus: hold a bunch of sentences in any language, with translation factors and stats about individual sentences and the corpus as a whole
2
+ #Evan Herbst, 7 / 25 / 06
3
+ #
4
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
5
+ # Public License version 2.1 or, at your option, any later version.
6
+
7
+ package Corpus;
8
+ BEGIN
9
+ {
10
+ push @INC, "../perllib"; #for Error.pm
11
+ }
12
+ use Error;
13
+
14
+ return 1;
15
+
16
+ ###########################################################################################################################
17
+
18
+ ##### 'our' variables are available outside the package #####
19
+ #all factor names used should be in this list, just in case
20
+ our @FACTORNAMES = ('surf', 'pos', 'lemma', 'stem', 'morph');
21
+
22
+ #constructor
23
+ #arguments: short corpus name (-name), hashref of filenames to descriptions (-descriptions), formatted string with various config info (-info_line)
24
+ sub new
25
+ {
26
+ my $class = shift;
27
+ my %args = @_; #turn the remainder of @_ into a hash
28
+ my ($corpusName, $refFileDescs, $infoLine) = ($args{'-name'}, $args{'-descriptions'}, $args{'-info_line'});
29
+ my ($factorList, $inputLingmodels, $outputLingmodels) = split(/\s*:\s*/, $infoLine);
30
+ my $self = {};
31
+ $self->{'corpusName'} = $corpusName;
32
+ $self->{'truth'} = []; #arrayref of arrayrefs of factors
33
+ $self->{'input'} = []; #same; also same for any system outputs that get loaded
34
+ $self->{'tokenCount'} = {}; #sysname => number of tokens in file
35
+ $self->{'truthFilename'} = "";
36
+ $self->{'inputFilename'} = "";
37
+ $self->{'sysoutFilenames'} = {}; #hashref of (string => string) for (system name, filename)
38
+ $self->{'phraseTableFilenames'} = {}; #factor name => filename
39
+ $self->{'fileCtimes'} = {}; #file ID of some kind => changetime in seconds
40
+ $self->{'factorIndices'} = {}; #factor name => index
41
+ my @factors = split(/\s+/, $factorList);
42
+ for(my $i = 0; $i < scalar(@factors); $i++)
43
+ {
44
+ $self->{'factorIndices'}->{$factors[$i]} = $i;
45
+ }
46
+ $self->{'inputLMs'} = {}; #factor name => lingmodel filename
47
+ $self->{'outputLMs'} = {};
48
+ foreach my $lmInfo (split(/\s*,\s*/, $inputLingmodels))
49
+ {
50
+ my @tokens = split(/\s+/, $lmInfo);
51
+ $self->{'inputLMs'}->{$tokens[0]} = $tokens[1];
52
+ }
53
+ foreach my $lmInfo (split(/\s*,\s*/, $outputLingmodels))
54
+ {
55
+ my @tokens = split(/\s+/, $lmInfo);
56
+ $self->{'outputLMs'}->{$tokens[0]} = $tokens[1];
57
+ }
58
+ $self->{'phraseTables'} = {}; #factor name (from @FACTORNAMES) => hashref of source phrases to anything; used for unknown-word counting
59
+ $self->{'unknownCount'} = {}; #factor name => count of unknown tokens in input
60
+ $self->{'sysoutWER'} = {}; #system name => (factor name => arrayref with system output total WER and arrayref of WER scores for individual sysout sentences wrt truth)
61
+ $self->{'sysoutPWER'} = {}; #similarly
62
+ $self->{'nnAdjWERPWER'} = {}; #system name => arrayref of [normalized WER, normalized PWER]
63
+ $self->{'perplexity'} = {}; #system name => (factor name => perplexity raw score)
64
+ $self->{'fileDescriptions'} = {}; #filename associated with us => string description of file
65
+ $self->{'bleuScores'} = {}; #system name => (factor name => arrayref of (overall score, arrayref of per-sentence scores) )
66
+ $self->{'bleuConfidence'} = {}; #system name => (factor name => arrayrefs holding statistical test data on BLEU scores)
67
+ $self->{'subsetBLEUstats'} = {}; #system name => (factor name => n-gram precisions and lengths for independent corpus subsets)
68
+ $self->{'comparisonStats'} = {}; #system name 1 => (system name 2 => (factor name => p-values, and indices of better system, for all tests used))
69
+ $self->{'cacheFilename'} = "cache/$corpusName.cache"; #all memory of various scores is stored here
70
+ bless $self, $class;
71
+ $self->locateFiles($refFileDescs); #find all relevant files in the current directory; set filenames and descriptions
72
+ $self->loadCacheFile();
73
+ print STDERR "on load:\n";
74
+ $self->printDetails();
75
+ return $self;
76
+ }
77
+
78
+ #arguments: filename
79
+ #return: description string
80
+ #throw if filename doesn't belong to this corpus
81
+ sub getFileDescription
82
+ {
83
+ my ($self, $filename) = @_;
84
+ if(!defined($self->{'fileDescriptions'}->{$filename}))
85
+ {
86
+ throw Error::Simple(-text => "Corpus::getFileDescription(): invalid filename '$filename'\n");
87
+ }
88
+ return $self->{'fileDescriptions'}->{$filename};
89
+ }
90
+
91
+ #arguments: none
92
+ #return: list of system names (NOT including 'input', 'truth' and other special cases)
93
+ sub getSystemNames
94
+ {
95
+ my $self = shift;
96
+ return keys %{$self->{'sysoutFilenames'}};
97
+ }
98
+
99
+ #calculate the number of unknown factor values for the given factor in the input file
100
+ #arguments: factor name
101
+ #return: unknown factor count, total factor count (note the total doesn't depend on the factor)
102
+ #throw if we don't have an input file or a phrase table for the given factor defined or if there's no index known for the given factor
103
+ sub calcUnknownTokens
104
+ {
105
+ my ($self, $factorName) = @_;
106
+ #check in-memory cache first
107
+ if(exists $self->{'unknownCount'}->{$factorName} && exists $self->{'tokenCount'}->{'input'})
108
+ {
109
+ return ($self->{'unknownCount'}->{$factorName}, $self->{'tokenCount'}->{'input'});
110
+ }
111
+ warn "calcing unknown tokens\n";
112
+
113
+ $self->ensureFilenameDefined('input');
114
+ $self->ensurePhraseTableDefined($factorName);
115
+ $self->ensureFactorPosDefined($factorName);
116
+ $self->loadSentences('input', $self->{'inputFilename'});
117
+ $self->loadPhraseTable($factorName);
118
+
119
+ #count unknown and total words
120
+ my ($unknownTokens, $totalTokens) = (0, 0);
121
+ my $factorIndex = $self->{'factorIndices'}->{$factorName};
122
+ foreach my $sentence (@{$self->{'input'}})
123
+ {
124
+ $totalTokens += scalar(@$sentence);
125
+ foreach my $word (@$sentence)
126
+ {
127
+ if(!defined($self->{'phraseTables'}->{$factorName}->{$word->[$factorIndex]}))
128
+ {
129
+ $unknownTokens++;
130
+ }
131
+ }
132
+ }
133
+ $self->{'unknownCount'}->{$factorName} = $unknownTokens;
134
+ $self->{'tokenCount'}->{'input'} = $totalTokens;
135
+
136
+ return ($unknownTokens, $totalTokens);
137
+ }
138
+
139
+ #arguments: system name
140
+ #return: (WER, PWER) for nouns and adjectives in given system wrt truth
141
+ #throw if given system or truth is not set or if index of 'surf' or 'pos' hasn't been specified
142
+ sub calcNounAdjWER_PWERDiff
143
+ {
144
+ my ($self, $sysname) = @_;
145
+ #check in-memory cache first
146
+ if(exists $self->{'nnAdjWERPWER'}->{$sysname})
147
+ {
148
+ return @{$self->{'nnAdjWERPWER'}->{$sysname}};
149
+ }
150
+ warn "calcing NN/JJ PWER/WER\n";
151
+
152
+ $self->ensureFilenameDefined('truth');
153
+ $self->ensureFilenameDefined($sysname);
154
+ $self->ensureFactorPosDefined('surf');
155
+ $self->ensureFactorPosDefined('pos');
156
+ $self->loadSentences('truth', $self->{'truthFilename'});
157
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
158
+ #find nouns and adjectives and score them
159
+ my ($werScore, $pwerScore) = (0, 0);
160
+ my $nnNadjTags = $self->getPOSTagList('nounAndAdj');
161
+ for(my $i = 0; $i < scalar(@{$self->{'truth'}}); $i++)
162
+ {
163
+ my @nnAdjEWords = $self->filterFactors($self->{'truth'}->[$i], $self->{'factorIndices'}->{'pos'}, $nnNadjTags);
164
+ my @nnAdjSWords = $self->filterFactors($self->{$sysname}->[$i], $self->{'factorIndices'}->{'pos'}, $nnNadjTags);
165
+ my ($sentWer, $tmp) = $self->sentenceWER(\@nnAdjSWords, \@nnAdjEWords, $self->{'factorIndices'}->{'surf'});
166
+ $werScore += $sentWer;
167
+ ($sentWer, $tmp) = $self->sentencePWER(\@nnAdjSWords, \@nnAdjEWords, $self->{'factorIndices'}->{'surf'});
168
+ $pwerScore += $sentWer;
169
+ }
170
+
171
+ #unhog memory
172
+ $self->releaseSentences('truth');
173
+ $self->releaseSentences($sysname);
174
+ $self->{'nnAdjWERPWER'}->{$sysname} = [$werScore / $self->{'tokenCount'}->{'truth'}, $pwerScore / $self->{'tokenCount'}->{'truth'}];
175
+ return @{$self->{'nnAdjWERPWER'}->{$sysname}};
176
+ }
177
+
178
+ #calculate detailed WER statistics and put them into $self
179
+ #arguments: system name, factor name to consider (default 'surf', surface form)
180
+ #return: overall surface WER for given system (w/o filtering)
181
+ #throw if given system or truth is not set or if index of factor name hasn't been specified
182
+ sub calcOverallWER
183
+ {
184
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
185
+ if(scalar(@_) > 0) {$factorName = shift;}
186
+ #check in-memory cache first
187
+ if(exists $self->{'sysoutWER'}->{$sysname}->{$factorName})
188
+ {
189
+ return $self->{'sysoutWER'}->{$sysname}->{$factorName}->[0];
190
+ }
191
+ warn "calcing WER\n";
192
+
193
+ $self->ensureFilenameDefined('truth');
194
+ $self->ensureFilenameDefined($sysname);
195
+ $self->ensureFactorPosDefined($factorName);
196
+ $self->loadSentences('truth', $self->{'truthFilename'});
197
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
198
+
199
+ my ($wer, $swers, $indices) = $self->corpusWER($self->{$sysname}, $self->{'truth'}, $self->{'factorIndices'}->{$factorName});
200
+ $self->{'sysoutWER'}->{$sysname}->{$factorName} = [$wer, $swers, $indices]; #total; arrayref of scores for individual sentences; arrayref of arrayrefs of offending words in each sentence
201
+
202
+ #unhog memory
203
+ $self->releaseSentences('truth');
204
+ $self->releaseSentences($sysname);
205
+ return $self->{'sysoutWER'}->{$sysname}->{$factorName}->[0] / $self->{'tokenCount'}->{'truth'};
206
+ }
207
+
208
+ #calculate detailed PWER statistics and put them into $self
209
+ #arguments: system name, factor name to consider (default 'surf')
210
+ #return: overall surface PWER for given system (w/o filtering)
211
+ #throw if given system or truth is not set or if index of factor name hasn't been specified
212
+ sub calcOverallPWER
213
+ {
214
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
215
+ if(scalar(@_) > 0) {$factorName = shift;}
216
+ #check in-memory cache first
217
+ if(exists $self->{'sysoutPWER'}->{$sysname}->{$factorName})
218
+ {
219
+ return $self->{'sysoutPWER'}->{$sysname}->{$factorName}->[0];
220
+ }
221
+ warn "calcing PWER\n";
222
+
223
+ $self->ensureFilenameDefined('truth');
224
+ $self->ensureFilenameDefined($sysname);
225
+ $self->ensureFactorPosDefined($factorName);
226
+ $self->loadSentences('truth', $self->{'truthFilename'});
227
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
228
+
229
+ my ($pwer, $spwers, $indices) = $self->corpusPWER($self->{$sysname}, $self->{'truth'}, $self->{'factorIndices'}->{$factorName});
230
+ $self->{'sysoutPWER'}->{$sysname}->{$factorName} = [$pwer, $spwers, $indices]; #total; arrayref of scores for individual sentences; arrayref of arrayrefs of offending words in each sentence
231
+
232
+ #unhog memory
233
+ $self->releaseSentences('truth');
234
+ $self->releaseSentences($sysname);
235
+ return $self->{'sysoutPWER'}->{$sysname}->{$factorName}->[0] / $self->{'tokenCount'}->{'truth'};
236
+ }
237
+
238
+ #arguments: system name, factor name to consider (default 'surf')
239
+ #return: array of (BLEU score, n-gram precisions, brevity penalty)
240
+ sub calcBLEU
241
+ {
242
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
243
+ if(scalar(@_) > 0) {$factorName = shift;}
244
+ #check in-memory cache first
245
+ if(exists $self->{'bleuScores'}->{$sysname} && exists $self->{'bleuScores'}->{$sysname}->{$factorName})
246
+ {
247
+ return $self->{'bleuScores'}->{$sysname}->{$factorName};
248
+ }
249
+ warn "calcing BLEU\n";
250
+
251
+ $self->ensureFilenameDefined('truth');
252
+ $self->ensureFilenameDefined($sysname);
253
+ $self->ensureFactorPosDefined($factorName);
254
+ $self->loadSentences('truth', $self->{'truthFilename'});
255
+ $self->loadSentences($sysname, $self->{'sysoutFilenames'}->{$sysname});
256
+
257
+ #score structure: various total scores, arrayref of by-sentence score arrays
258
+ if(!exists $self->{'bleuScores'}->{$sysname}) {$self->{'bleuScores'}->{$sysname} = {};}
259
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName}) {$self->{'bleuScores'}->{$sysname}->{$factorName} = [[], []];}
260
+
261
+ my ($good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $totCLength, $totRLength) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
262
+ my $factorIndex = $self->{'factorIndices'}->{$factorName};
263
+ for(my $i = 0; $i < scalar(@{$self->{'truth'}}); $i++)
264
+ {
265
+ my ($truthSentence, $sysoutSentence) = ($self->{'truth'}->[$i], $self->{$sysname}->[$i]);
266
+ my ($unigood, $unicount, $bigood, $bicount, $trigood, $tricount, $quadrugood, $quadrucount, $cLength, $rLength) =
267
+ $self->sentenceBLEU($truthSentence, $sysoutSentence, $factorIndex, 0); #last argument is whether to debug-print
268
+ push @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]}, [$unigood, $unicount, $bigood, $bicount, $trigood, $tricount, $quadrugood, $quadrucount, $cLength, $rLength];
269
+ $good1 += $unigood; $tot1 += $unicount;
270
+ $good2 += $bigood; $tot2 += $bicount;
271
+ $good3 += $trigood; $tot3 += $tricount;
272
+ $good4 += $quadrugood; $tot4 += $quadrucount;
273
+ $totCLength += $cLength;
274
+ $totRLength += $rLength;
275
+ }
276
+ my $brevity = ($totCLength > $totRLength || $totCLength == 0) ? 1 : exp(1 - $totRLength / $totCLength);
277
+ my ($pct1, $pct2, $pct3, $pct4) = ($tot1 == 0 ? -1 : $good1 / $tot1, $tot2 == 0 ? -1 : $good2 / $tot2,
278
+ $tot3 == 0 ? -1 : $good3 / $tot3, $tot4 == 0 ? -1 : $good4 / $tot4);
279
+ my ($logsum, $logcount) = (0, 0);
280
+ if($tot1 > 0) {$logsum += my_log($pct1); $logcount++;}
281
+ if($tot2 > 0) {$logsum += my_log($pct2); $logcount++;}
282
+ if($tot3 > 0) {$logsum += my_log($pct3); $logcount++;}
283
+ if($tot4 > 0) {$logsum += my_log($pct4); $logcount++;}
284
+ my $bleu = $brevity * exp($logsum / $logcount);
285
+ $self->{'bleuScores'}->{$sysname}->{$factorName}->[0] = [$bleu, 100 * $pct1, 100 * $pct2, 100 * $pct3, 100 * $pct4, $brevity];
286
+
287
+ #unhog memory
288
+ $self->releaseSentences('truth');
289
+ $self->releaseSentences($sysname);
290
+ return @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[0]};
291
+ }
292
+
293
+ #do t-tests on the whole-corpus n-gram precisions vs. the average precisions over a set number of disjoint subsets
294
+ #arguments: system name, factor name BLEU was run on (default 'surf')
295
+ #return: arrayref of [arrayref of p-values for overall precision vs. subset average, arrayrefs of [(lower, upper) 95% credible intervals for true overall n-gram precisions]]
296
+ #
297
+ #written to try to save memory
298
+ sub statisticallyTestBLEUResults
299
+ {
300
+ my ($self, $sysname, $factorName) = (shift, shift, 'surf');
301
+ if(scalar(@_) > 0) {$factorName = shift;}
302
+ #check in-memory cache first
303
+ if(exists $self->{'bleuConfidence'}->{$sysname} && exists $self->{'bleuConfidence'}->{$sysname}->{$factorName})
304
+ {
305
+ return $self->{'bleuConfidence'}->{$sysname}->{$factorName};
306
+ }
307
+ warn "performing consistency tests\n";
308
+
309
+ my $k = 30; #HARDCODED NUMBER OF SUBSETS (WE DO k-FOLD CROSS-VALIDATION); IF YOU CHANGE THIS YOU MUST ALSO CHANGE getApproxPValue() and $criticalTStat
310
+ my $criticalTStat = 2.045; #hardcoded value given alpha (.025 here) and degrees of freedom (= $k - 1) ########################################
311
+ $self->ensureFilenameDefined('truth');
312
+ $self->ensureFilenameDefined($sysname);
313
+ $self->ensureFactorPosDefined($factorName);
314
+
315
+ #ensure we have full-corpus BLEU results
316
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName})
317
+ {
318
+ $self->calcBLEU($sysname, $factorName);
319
+ }
320
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname}) {$self->{'subsetBLEUstats'}->{$sysname} = {};}
321
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname}->{$factorName}) {$self->{'subsetBLEUstats'}->{$sysname}->{$factorName} = [];}
322
+
323
+ #calculate n-gram precisions for each small subset
324
+ my @sentenceStats = @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]};
325
+ for(my $i = 0; $i < $k; $i++)
326
+ {
327
+ my ($good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $sysoutLength, $truthLength) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
328
+ for(my $j = $i; $j < scalar(@sentenceStats); $j += $k) #subset #K consists of every Kth sentence
329
+ {
330
+ $good1 += $sentenceStats[$j]->[0]; $tot1 += $sentenceStats[$j]->[1];
331
+ $good2 += $sentenceStats[$j]->[2]; $tot2 += $sentenceStats[$j]->[3];
332
+ $good3 += $sentenceStats[$j]->[4]; $tot3 += $sentenceStats[$j]->[5];
333
+ $good4 += $sentenceStats[$j]->[6]; $tot4 += $sentenceStats[$j]->[7];
334
+ $sysoutLength += $sentenceStats[$j]->[8];
335
+ $truthLength += $sentenceStats[$j]->[9];
336
+ }
337
+ push @{$self->{'subsetBLEUstats'}->{$sysname}->{$factorName}}, [$good1, $tot1, $good2, $tot2, $good3, $tot3, $good4, $tot4, $sysoutLength, $truthLength];
338
+ }
339
+ my $subsetStats = $self->{'subsetBLEUstats'}->{$sysname}->{$factorName};
340
+ #calculate first two moments for subset scores for each n-gram precision, and t statistic
341
+ my $fullCorpusBLEU = $self->{'bleuScores'}->{$sysname}->{$factorName}->[0]; #an arrayref
342
+ my @means = (0) x 4;
343
+ my @devs = (0) x 4;
344
+ my $t = []; #t statistics for all n-gram orders
345
+ if(!exists $self->{'bleuConfidence'}->{$sysname}) {$self->{'bleuConfidence'}->{$sysname} = {};}
346
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName} = [[], []]; #lower-bound p-values for whole corpus vs. subset average; confidence intervals for all n-gram orders
347
+ for(my $i = 0; $i < 4; $i++) #run through n-gram orders
348
+ {
349
+ for(my $j = 0; $j < $k; $j++) #run through subsets
350
+ {
351
+ $means[$i] += $subsetStats->[$j]->[2 * $i] / $subsetStats->[$j]->[2 * $i + 1]; #matching / total n-grams
352
+ }
353
+ $means[$i] /= $k;
354
+ for(my $j = 0; $j < $k; $j++) #run through subsets
355
+ {
356
+ $devs[$i] += ($subsetStats->[$j]->[2 * $i] / $subsetStats->[$j]->[2 * $i + 1] - $means[$i]) ** 2;
357
+ }
358
+ $devs[$i] = sqrt($devs[$i] / ($k - 1));
359
+ $t->[$i] = ($fullCorpusBLEU->[$i + 1] / 100 - $means[$i]) / $devs[$i];
360
+ push @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0]}, getLowerBoundPValue($t->[$i]); #p-value for overall score vs. subset average
361
+ push @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1]},
362
+ [$means[$i] - $criticalTStat * $devs[$i] / sqrt($k), $means[$i] + $criticalTStat * $devs[$i] / sqrt($k)]; #the confidence interval
363
+ }
364
+
365
+ return $self->{'bleuConfidence'}->{$sysname}->{$factorName};
366
+ }
367
+
368
+ #arguments: system name, factor name
369
+ #return: perplexity of language model (specified in a config file) wrt given system output
370
+ sub calcPerplexity
371
+ {
372
+ my ($self, $sysname, $factorName) = @_;
373
+ print STDERR "ppl $sysname $factorName\n";
374
+ #check in-memory cache first
375
+ if(exists $self->{'perplexity'}->{$sysname} && exists $self->{'perplexity'}->{$sysname}->{$factorName})
376
+ {
377
+ return $self->{'perplexity'}->{$sysname}->{$factorName};
378
+ }
379
+ warn "calcing perplexity\n";
380
+
381
+ $self->ensureFilenameDefined($sysname);
382
+ my $sysoutFilename;
383
+ if($sysname eq 'truth' || $sysname eq 'input') {$sysoutFilename = $self->{"${sysname}Filename"};}
384
+ else {$sysoutFilename = $self->{'sysoutFilenames'}->{$sysname};}
385
+ my $lmFilename;
386
+ if($sysname eq 'input') {$lmFilename = $self->{'inputLMs'}->{$factorName};}
387
+ else {$lmFilename = $self->{'outputLMs'}->{$factorName};}
388
+ my $tmpfile = ".tmp" . time;
389
+ my $cmd = "perl ./extract-factors.pl $sysoutFilename " . $self->{'factorIndices'}->{$factorName} . " > $tmpfile";
390
+ `$cmd`; #extract just the factor we're interested in; ngram doesn't understand factored notation
391
+ my @output = `./ngram -lm $lmFilename -ppl $tmpfile`; #run the SRI n-gram tool
392
+ `rm -f $tmpfile`;
393
+ $output[1] =~ /ppl1=\s*([0-9\.]+)/;
394
+ $self->{'perplexity'}->{$sysname}->{$factorName} = $1;
395
+ return $self->{'perplexity'}->{$sysname}->{$factorName};
396
+ }
397
+
398
+ #run a paired t test and a sign test on BLEU statistics for subsets of both systems' outputs
399
+ #arguments: system name 1, system name 2, factor name
400
+ #return: arrayref of [arrayref of confidence levels for t test at which results differ, arrayref of index (0/1) of better system by t test,
401
+ # arrayref of confidence levels for sign test at which results differ, arrayref of index (0/1) of better system by sign test],
402
+ # where each inner arrayref has one element per n-gram order considered
403
+ sub statisticallyCompareSystemResults
404
+ {
405
+ my ($self, $sysname1, $sysname2, $factorName) = @_;
406
+ #check in-memory cache first
407
+ if(exists $self->{'comparisonStats'}->{$sysname1} && exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}
408
+ && exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName})
409
+ {
410
+ return $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName};
411
+ }
412
+ warn "comparing sysoutputs\n";
413
+
414
+ $self->ensureFilenameDefined($sysname1);
415
+ $self->ensureFilenameDefined($sysname2);
416
+ $self->ensureFactorPosDefined($factorName);
417
+ #make sure we have tallied results for both systems
418
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}) {$self->statisticallyTestBLEUResults($sysname1, $factorName);}
419
+ if(!exists $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}) {$self->statisticallyTestBLEUResults($sysname2, $factorName);}
420
+
421
+ if(!exists $self->{'comparisonStats'}->{$sysname1}) {$self->{'comparisonStats'}->{$sysname1} = {};}
422
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2} = {};}
423
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [];}
424
+ my ($tConfidences, $tWinningIndices, $signConfidences, $signWinningIndices) = ([], [], [], []);
425
+ for(my $i = 0; $i < 4; $i++) #loop over n-gram order
426
+ {
427
+ #t-test stats
428
+ my ($mean, $dev) = (0, 0); #of the difference between the first and second systems' precisions
429
+ #sign-test stats
430
+ my ($nPlus, $nMinus) = (0, 0);
431
+ my $j;
432
+ for($j = 0; $j < scalar(@{$self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}}); $j++)
433
+ {
434
+ my ($stats1, $stats2) = ($self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}->[$j], $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}->[$j]);
435
+ my ($prec1, $prec2) = ($stats1->[2 * $i] / $stats1->[2 * $i + 1], $stats2->[2 * $i] / $stats2->[2 * $i + 1]); #n-gram precisions
436
+ $mean += $prec1 - $prec2;
437
+ if($prec1 > $prec2) {$nPlus++;} else {$nMinus++;}
438
+ }
439
+ $mean /= $j;
440
+ for($j = 0; $j < scalar(@{$self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}}); $j++)
441
+ {
442
+ my ($stats1, $stats2) = ($self->{'subsetBLEUstats'}->{$sysname1}->{$factorName}->[$j], $self->{'subsetBLEUstats'}->{$sysname2}->{$factorName}->[$j]);
443
+ my ($prec1, $prec2) = ($stats1->[2 * $i] / $stats1->[2 * $i + 1], $stats2->[2 * $i] / $stats2->[2 * $i + 1]); #n-gram precisions
444
+ $dev += ($prec1 - $prec2 - $mean) ** 2;
445
+ }
446
+ $dev = sqrt($dev / (($j - 1) * $j)); #need the extra j because the variance of Xbar is 1/n the variance of X
447
+ #t test
448
+ my $t = $mean / $dev; #this isn't the standard form; remember the difference of the means is equal to the mean of the differences
449
+ my $cc = getUpperBoundPValue($t);
450
+ print STDERR "comparing at n=$i: mu $mean, sigma $dev, t $t -> conf >= " . (1 - $cc) . "\n";
451
+ push @$tConfidences, $cc;
452
+ push @$tWinningIndices, ($mean > 0) ? 0 : 1;
453
+ #sign test
454
+ my %binomialCoefficients; #map (n+ - n-) to a coefficient; compute on the fly!
455
+ for(my $k = 0; $k <= $nPlus + $nMinus; $k++)
456
+ {
457
+ $binomialCoefficients{$k} = binCoeff($nPlus + $nMinus, $k);
458
+ }
459
+ my $sumCoeffs = 0;
460
+ foreach my $coeff (values %binomialCoefficients) #get a lower bound on the probability mass inside (n+ - n-)
461
+ {
462
+ if($coeff > $binomialCoefficients{$nPlus}) {$sumCoeffs += $coeff;}
463
+ }
464
+ push @$signConfidences, $sumCoeffs;
465
+ push @$signWinningIndices, ($nPlus > $nMinus) ? 0 : 1;
466
+ }
467
+ $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [$tConfidences, $tWinningIndices, $signConfidences, $signWinningIndices];
468
+ return $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName};
469
+ }
470
+
471
+ #write HTML to be displayed to compare the various versions we have of each sentence in the corpus;
472
+ #allow to filter which versions will be displayed
473
+ #(we don't write the whole page, just the contents of the body)
474
+ #arguments: filehandleref to which to write, regex to filter filename extensions to be included
475
+ #return: none
476
+ sub writeComparisonPage
477
+ {
478
+ my ($self, $fh, $filter) = @_;
479
+ my @filteredExtensions = grep($filter, ('e', 'f', keys %{$self->{'sysoutFilenames'}}));
480
+ my %openedFiles = $self->openFiles(@filteredExtensions);
481
+ my $id = 1; #sentence ID string
482
+ while(my %lines = $self->readLineFromFiles(%openedFiles))
483
+ {
484
+ $self->printSingleSentenceComparison($fh, $id, %lines);
485
+ $id++;
486
+ }
487
+ $self->closeFiles(%openedFiles);
488
+ }
489
+
490
+ ##########################################################################################################
491
+ ##### INTERNAL ###################################################################################
492
+ ##########################################################################################################
493
+
494
+ #destructor!
495
+ #arguments: none
496
+ #return: none
497
+ sub DESTROY
498
+ {
499
+ my $self = shift;
500
+ $self->writeCacheFile();
501
+ }
502
+
503
+ #write all scores in memory to disk
504
+ #arguments: none
505
+ #return: none
506
+ sub writeCacheFile
507
+ {
508
+ my $self = shift;
509
+ if(!open(CACHEFILE, ">" . $self->{'cacheFilename'}))
510
+ {
511
+ warn "Corpus::writeCacheFile(): can't open '" . $self->{'cacheFilename'} . "' for write\n";
512
+ return;
513
+ }
514
+
515
+ #store file changetimes to disk
516
+ print CACHEFILE "File changetimes\n";
517
+ my $ensureCtimeIsOutput = sub
518
+ {
519
+ my $ext = shift;
520
+ #check for a previously read value
521
+ if(exists $self->{'fileCtimes'}->{$ext} && $self->cacheIsCurrentForFile($ext)) {print CACHEFILE "$ext " . $self->{'fileCtimes'}->{$ext} . "\n";}
522
+ else {print CACHEFILE "$ext " . time . "\n";} #our info must just have been calculated
523
+ };
524
+ if(exists $self->{'truthFilename'}) {&$ensureCtimeIsOutput('e');}
525
+ if(exists $self->{'inputFilename'}) {&$ensureCtimeIsOutput('f');}
526
+ foreach my $factorName (keys %{$self->{'phraseTableFilenames'}}) {&$ensureCtimeIsOutput("pt_$factorName");}
527
+ foreach my $sysname (keys %{$self->{'sysoutFilenames'}}) {&$ensureCtimeIsOutput($sysname);}
528
+ #store bleu scores to disk
529
+ print CACHEFILE "\nBLEU scores\n";
530
+ foreach my $sysname (keys %{$self->{'bleuScores'}})
531
+ {
532
+ foreach my $factorName (keys %{$self->{'bleuScores'}->{$sysname}})
533
+ {
534
+ print CACHEFILE "$sysname $factorName " . join(' ', @{$self->{'bleuScores'}->{$sysname}->{$factorName}->[0]});
535
+ foreach my $sentenceBLEU (@{$self->{'bleuScores'}->{$sysname}->{$factorName}->[1]})
536
+ {
537
+ print CACHEFILE ";" . join(' ', @$sentenceBLEU);
538
+ }
539
+ print CACHEFILE "\n";
540
+ }
541
+ }
542
+ #store t statistics for overall BLEU score and subsets in k-fold cross-validation
543
+ print CACHEFILE "\nBLEU statistics\n";
544
+ foreach my $sysname (keys %{$self->{'bleuConfidence'}})
545
+ {
546
+ foreach my $factorName (keys %{$self->{'bleuConfidence'}->{$sysname}})
547
+ {
548
+ print CACHEFILE "$sysname $factorName " . join(' ', @{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0]});
549
+ foreach my $subsetConfidence (@{$self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1]})
550
+ {
551
+ print CACHEFILE ";" . join(' ', @$subsetConfidence);
552
+ }
553
+ print CACHEFILE "\n";
554
+ }
555
+ }
556
+ #store statistics comparing system outputs
557
+ print CACHEFILE "\nStatistical comparisons\n";
558
+ foreach my $sysname1 (keys %{$self->{'comparisonStats'}})
559
+ {
560
+ foreach my $sysname2 (keys %{$self->{'comparisonStats'}->{$sysname1}})
561
+ {
562
+ foreach my $factorName (keys %{$self->{'comparisonStats'}->{$sysname1}->{$sysname2}})
563
+ {
564
+ print CACHEFILE "$sysname1 $sysname2 $factorName " . join(';', map {join(' ', @$_)} @{$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}}) . "\n";
565
+ }
566
+ }
567
+ }
568
+ #store unknown-token counts to disk
569
+ print CACHEFILE "\nUnknown-token counts\n";
570
+ foreach my $factorName (keys %{$self->{'unknownCount'}})
571
+ {
572
+ print CACHEFILE $factorName . " " . $self->{'phraseTableFilenames'}->{$factorName} . " " . $self->{'unknownCount'}->{$factorName} . " " . $self->{'tokenCount'}->{'input'} . "\n";
573
+ }
574
+ #store WER, PWER to disk
575
+ print CACHEFILE "\nWER scores\n";
576
+ my $printWERFunc =
577
+ sub
578
+ {
579
+ my $werType = shift;
580
+ foreach my $sysname (keys %{$self->{$werType}})
581
+ {
582
+ foreach my $factorName (keys %{$self->{$werType}->{$sysname}})
583
+ {
584
+ my ($totalWER, $sentenceWERs, $errorWords) = @{$self->{$werType}->{$sysname}->{$factorName}};
585
+ print CACHEFILE "$werType $sysname $factorName $totalWER " . join(' ', @$sentenceWERs);
586
+ foreach my $indices (@$errorWords)
587
+ {
588
+ print CACHEFILE ";" . join(' ', @$indices);
589
+ }
590
+ print CACHEFILE "\n";
591
+ }
592
+ }
593
+ };
594
+ &$printWERFunc('sysoutWER');
595
+ &$printWERFunc('sysoutPWER');
596
+ #store corpus perplexities to disk
597
+ print CACHEFILE "\nPerplexity\n";
598
+ foreach my $sysname (keys %{$self->{'perplexity'}})
599
+ {
600
+ foreach my $factorName (keys %{$self->{'perplexity'}->{$sysname}})
601
+ {
602
+ print CACHEFILE "$sysname $factorName " . $self->{'perplexity'}->{$sysname}->{$factorName} . "\n";
603
+ }
604
+ }
605
+ print "\nNN/ADJ WER/PWER\n";
606
+ foreach my $sysname (keys %{$self->{'nnAdjWERPWER'}})
607
+ {
608
+ print CACHEFILE "$sysname " . join(' ', @{$self->{'nnAdjWERPWER'}->{$sysname}}) . "\n";
609
+ }
610
+ print "\n";
611
+ close(CACHEFILE);
612
+ }
613
+
614
+ #load all scores present in the cache file into the appropriate fields of $self
615
+ #arguments: none
616
+ #return: none
617
+ sub loadCacheFile
618
+ {
619
+ my $self = shift;
620
+ if(!open(CACHEFILE, "<" . $self->{'cacheFilename'}))
621
+ {
622
+ warn "Corpus::loadCacheFile(): can't open '" . $self->{'cacheFilename'} . "' for read\n";
623
+ return;
624
+ }
625
+ my $mode = 'none';
626
+ while(my $line = <CACHEFILE>)
627
+ {
628
+ next if $line =~ /^[ \t\n\r\x0a]*$/; #anyone know why char 10 (0x0a) shows up on empty lines, at least on solaris?
629
+ chomp $line;
630
+ #check for start of section
631
+ if($line =~ /File changetimes/) {$mode = 'ctime';}
632
+ elsif($line =~ /BLEU scores/) {$mode = 'bleu';}
633
+ elsif($line =~ /BLEU statistics/) {$mode = 'bstats';}
634
+ elsif($line =~ /Statistical comparisons/) {$mode = 'cmp';}
635
+ elsif($line =~ /Unknown-token counts/) {$mode = 'unk';}
636
+ elsif($line =~ /WER scores/) {$mode = 'wer';}
637
+ elsif($line =~ /Perplexity/) {$mode = 'ppl';}
638
+ elsif($line =~ /NN\/ADJ WER\/PWER/) {$mode = 'nawp';}
639
+ #get data when in a mode already
640
+ elsif($mode eq 'ctime')
641
+ {
642
+ my ($fileExtension, $ctime) = split(/\s+/, $line); #lexical 'my', not 'local': block-scoped temporaries
643
+ $self->{'fileCtimes'}->{$fileExtension} = $ctime;
644
+ }
645
+ elsif($mode eq 'bleu')
646
+ {
647
+ my ($sysname, $factorName, $rest) = split(/\s+/, $line, 3);
648
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
649
+ if(!exists $self->{'bleuScores'}->{$sysname}) {$self->{'bleuScores'}->{$sysname} = {};}
650
+ if(!exists $self->{'bleuScores'}->{$sysname}->{$factorName}) {$self->{'bleuScores'}->{$sysname}->{$factorName} = [[], []];}
651
+ my @stats = map {my @tmp = split(/\s+/, $_); \@tmp;} split(/;/, $rest);
652
+ print STDERR "bleu 1: " . join(', ', @{shift @stats}) . "\n";
653
+ print STDERR "bleu 2: " . join(' ', map {"{" . join(', ', @$_) . "}"} @stats) . "\n";
654
+ # $self->{'bleuScores'}->{$sysname}->{$factorName}->[0] = shift @stats;
655
+ # $self->{'bleuScores'}->{$sysname}->{$factorName}->[1] = \@stats;
656
+ }
657
+ elsif($mode eq 'bstats')
658
+ {
659
+ my ($sysname, $factorName, $rest) = split(/\s+/, $line, 3);
660
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
661
+ if(!exists $self->{'bleuConfidence'}->{$sysname}) {$self->{'bleuConfidence'}->{$sysname} = {};}
662
+ if(!exists $self->{'bleuConfidence'}->{$sysname}->{$factorName}) {$self->{'bleuConfidence'}->{$sysname}->{$factorName} = [[], []];}
663
+ my @stats = map {my @tmp = split(/\s+/, $_); \@tmp;} split(/;/, $rest);
664
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName}->[0] = shift @stats;
665
+ $self->{'bleuConfidence'}->{$sysname}->{$factorName}->[1] = \@stats;
666
+ }
667
+ elsif($mode eq 'cmp')
668
+ {
669
+ my ($sysname1, $sysname2, $factorName, $rest) = split(/\s+/, $line, 4);
670
+ next if !$self->cacheIsCurrentForFile($sysname1) || !$self->cacheIsCurrentForFile($sysname2) || !$self->cacheIsCurrentForFile('e');
671
+ if(!exists $self->{'comparisonStats'}->{$sysname1}) {$self->{'comparisonStats'}->{$sysname1} = {};}
672
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2} = {};}
673
+ if(!exists $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName}) {$self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = [];}
674
+ my @stats = map {my @x = split(' ', $_); \@x} split(/;/, $rest);
675
+ $self->{'comparisonStats'}->{$sysname1}->{$sysname2}->{$factorName} = \@stats;
676
+ }
677
+ elsif($mode eq 'unk')
678
+ {
679
+ my ($factorName, $phraseTableFilename, $unknownCount, $totalCount) = split(' ', $line);
680
+ next if !$self->cacheIsCurrentForFile('f') || !$self->cacheIsCurrentForFile("pt_$factorName");
681
+ if(defined($self->{'phraseTableFilenames'}->{$factorName}) && $self->{'phraseTableFilenames'}->{$factorName} eq $phraseTableFilename)
682
+ {
683
+ $self->{'unknownCount'}->{$factorName} = $unknownCount;
684
+ $self->{'tokenCount'}->{'input'} = $totalCount; #match the key written by saveCacheFile()
685
+ }
686
+ }
687
+ elsif($mode eq 'wer')
688
+ {
689
+ my ($werType, $sysname, $factorName, $totalWER, $details) = split(/\s+/, $line, 5); #werType is 'sysoutWER' or 'sysoutPWER'
690
+ next if !$self->cacheIsCurrentForFile($sysname) || !$self->cacheIsCurrentForFile('e');
691
+ $details =~ /^([^;]*);(.*)/;
692
+ my @sentenceWERs = split(/\s+/, $1);
693
+ if(!exists $self->{$werType}->{$sysname}) {$self->{$werType}->{$sysname} = {};}
694
+ $self->{$werType}->{$sysname}->{$factorName} = [$totalWER, \@sentenceWERs, []];
695
+ my @indexLists = split(/;/, $2);
696
+ for(my $i = 0; $i < scalar(@sentenceWERs); $i++)
697
+ {
698
+ my @indices = grep(/\S/, split(/\s+/, $indexLists[$i])); #find all nonempty tokens
699
+ push @{$self->{$werType}->{$sysname}->{$factorName}->[2]}, \@indices; #one list of error indices per sentence
700
+ }
701
+ }
702
+ elsif($mode eq 'ppl')
703
+ {
704
+ my ($sysname, $factorName, $perplexity) = split(/\s+/, $line);
705
+ next if !$self->cacheIsCurrentForFile($sysname);
706
+ if(!exists $self->{'perplexity'}->{$sysname}) {$self->{'perplexity'}->{$sysname} = {};}
707
+ $self->{'perplexity'}->{$sysname}->{$factorName} = $perplexity;
708
+ }
709
+ elsif($mode eq 'nawp')
710
+ {
711
+ my ($sysname, @scores) = split(/\s+/, $line);
712
+ next if !$self->cacheIsCurrentForFile($sysname);
713
+ $self->{'nnAdjWERPWER'}->{$sysname} = \@scores;
714
+ }
715
+ }
716
+ close(CACHEFILE);
717
+ }
718
+
719
+ #arguments: cache type ('bleu' | ...), system name, factor name
720
+ #return: none
721
+ sub flushCache
722
+ {
723
+ my ($self, $cacheType, $sysname, $factorName) = @_;
724
+ if($cacheType eq 'bleu')
725
+ {
726
+ if(defined($self->{'bleuScores'}->{$sysname}) && defined($self->{'bleuScores'}->{$sysname}->{$factorName}))
727
+ {
728
+ delete $self->{'bleuScores'}->{$sysname}->{$factorName};
729
+ }
730
+ }
731
+ }
732
+
733
+ #arguments: file extension
734
+ #return: whether (0/1) our cache for the given file is at least as recent as the file
735
+ sub cacheIsCurrentForFile
736
+ {
737
+ my ($self, $ext) = @_;
738
+ return 0 if !exists $self->{'fileCtimes'}->{$ext} ;
739
+ my @liveStats = stat($self->{'corpusName'} . ".$ext");
740
+ return ($liveStats[9] <= $self->{'fileCtimes'}->{$ext}) ? 1 : 0;
741
+ }
742
+
743
+ ##### utils #####
744
+ #arguments: a, b (scalars)
745
+ sub min
746
+ {
747
+ my ($a, $b) = @_;
748
+ return ($a < $b) ? $a : $b;
749
+ }
750
+ #arguments: a, b (scalars)
751
+ sub max
752
+ {
753
+ my ($a, $b) = @_;
754
+ return ($a > $b) ? $a : $b;
755
+ }
756
+ #arguments: x
757
+ sub my_log
758
+ {
759
+ return -9999999999 unless $_[0];
760
+ return log($_[0]);
761
+ }
762
+ #arguments: x
763
+ sub round
764
+ {
765
+ my $x = shift;
766
+ if($x - int($x) < .5) {return int($x);}
767
+ return int($x) + 1;
768
+ }
769
+
770
+ #return an approximation of the p-value for a given t FOR A HARDCODED NUMBER OF DEGREES OF FREEDOM
771
+ # (IF YOU CHANGE THIS HARDCODED NUMBER YOU MUST ALSO CHANGE statisticallyTestBLEUResults() and getLowerBoundPValue() )
772
+ #arguments: the t statistic, $t
773
+ #return: a lower bound on the probability mass outside (beyond) +/-$t in the t distribution
774
+ #
775
+ #for a wonderful t-distribution calculator, see <http://math.uc.edu/~brycw/classes/148/tables.htm#t>. UC.edu is Cincinnati.
776
+ sub getLowerBoundPValue
777
+ {
778
+ my $t = abs(shift);
779
+ #encode various known p-values for ###### DOF = 29 ######
780
+ my %t2p = #since we're comparing (hopefully) very similar values, this chart is weighted toward the low end of the t-stat
781
+ (
782
+ 0.0063 => .995,
783
+ 0.0126 => .99,
784
+ 0.0253 => .98,
785
+ 0.0380 => .97,
786
+ 0.0506 => .96,
787
+ 0.0633 => .95,
788
+ 0.0950 => .925,
789
+ 0.127 => .9,
790
+ 0.191 => .85,
791
+ 0.256 => .8,
792
+ 0.389 => .7,
793
+ 0.530 => .6,
794
+ 0.683 => .5,
795
+ 0.854 => .4,
796
+ 1.055 => .3,
797
+ 1.311 => .2,
798
+ 1.699 => .1
799
+ );
800
+ foreach my $tCmp (sort {$a <=> $b} keys %t2p) {return $t2p{$tCmp} if $t <= $tCmp;} #numeric sort, not string sort
801
+ return 0; #loosest bound ever! groovy, man
802
+ }
803
+ #arguments: the t statistic, $t
804
+ #return: an upper bound on the probability mass outside (beyond) +/-$t in the t distribution
805
+ sub getUpperBoundPValue
806
+ {
807
+ my $t = abs(shift);
808
+ #encode various known p-values for ###### DOF = 29 ######
809
+ my %t2p =
810
+ (
811
+ 4.506 => .0001,
812
+ 4.254 => .0002,
813
+ 3.918 => .0005,
814
+ 3.659 => .001,
815
+ 3.396 => .002,
816
+ 3.038 => .005,
817
+ 2.756 => .01,
818
+ 2.462 => .02,
819
+ 2.045 => .05,
820
+ 1.699 => .1,
821
+ 1.311 => .2,
822
+ 0.683 => .5
823
+ );
824
+ foreach my $tCmp (sort {$b <=> $a} keys %t2p) {return $t2p{$tCmp} if $t >= $tCmp;} #numeric sort, descending
825
+ return 1; #loosest bound ever!
826
+ }
827
+
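+ #Worked example (added for illustration): with the DOF = 29 tables above,
+ #t = 1.5 falls between 1.311 and 1.699, so getLowerBoundPValue(1.5) returns .1
+ #and getUpperBoundPValue(1.5) returns .2 -- i.e. the two-sided p-value lies
+ #in [.1, .2].
+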
828
+ #arguments: n, r
829
+ #return: binomial coefficient for p = .5 (ie nCr * (1/2)^n)
830
+ sub binCoeff
831
+ {
832
+ my ($n, $r) = @_;
833
+ my $coeff = 1;
834
+ for(my $i = $r + 1; $i <= $n; $i++) {$coeff *= $i; $coeff /= ($i - $r);}
835
+ return $coeff * (.5 ** $n);
836
+ }
837
+
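+ #Worked example (added): binCoeff(5, 2) computes 5C2 * (1/2)^5 = 10/32 = 0.3125,
+ #the probability of exactly 2 successes in 5 fair coin flips, as used by the
+ #sign test.
+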
838
+ #throw if the given factor doesn't have an index defined
839
+ #arguments: factor name
840
+ #return: none
841
+ sub ensureFactorPosDefined
842
+ {
843
+ my ($self, $factorName) = @_;
844
+ if(!defined($self->{'factorIndices'}->{$factorName}))
845
+ {
846
+ throw Error::Simple(-text => "Corpus: no index known for factor '$factorName'\n");
847
+ }
848
+ }
849
+
850
+ #throw if the filename field corresponding to the argument hasn't been defined
851
+ #arguments: 'truth' | 'input' | a system name
852
+ #return: none
853
+ sub ensureFilenameDefined
854
+ {
855
+ my ($self, $sysname) = @_;
856
+ if($sysname eq 'truth' || $sysname eq 'input')
857
+ {
858
+ if(!defined($self->{"${sysname}Filename"}))
859
+ {
860
+ throw Error::Simple(-text => "Corpus: no $sysname corpus defined\n");
861
+ }
862
+ }
863
+ else
864
+ {
865
+ if(!defined($self->{'sysoutFilenames'}->{$sysname}))
866
+ {
867
+ throw Error::Simple(-text => "Corpus: no system $sysname defined\n");
868
+ }
869
+ }
870
+ }
871
+
872
+ #throw if there isn't a defined phrase-table filename for the given factor
873
+ #arguments: factor name
874
+ #return: none
875
+ sub ensurePhraseTableDefined
876
+ {
877
+ my ($self, $factorName) = @_;
878
+ if(!defined($self->{'phraseTableFilenames'}->{$factorName}))
879
+ {
880
+ throw Error::Simple(-text => "Corpus: no phrase table defined for factor '$factorName'\n");
881
+ }
882
+ }
883
+
884
+ #search current directory for files with our corpus name as basename and set filename fields of $self
885
+ #arguments: hashref of filenames to descriptions
886
+ #return: none
887
+ sub locateFiles
888
+ {
889
+ my ($self, $refDescs) = @_;
890
+ open(DIR, "ls -x1 . |") or die "Corpus::locateFiles(): couldn't list current directory\n";
891
+ my $corpusName = $self->{'corpusName'};
892
+ while(my $filename = <DIR>)
893
+ {
894
+ chomp $filename; #remove \n
895
+ if($filename =~ /^$corpusName\.(.*)$/)
896
+ {
897
+ my $ext = $1;
898
+ if($ext eq 'e') {$self->{'truthFilename'} = $filename;}
899
+ elsif($ext eq 'f') {$self->{'inputFilename'} = $filename;}
900
+ elsif($ext =~ /pt_(.*)/) {$self->{'phraseTableFilenames'}->{$1} = $filename;}
901
+ else {$self->{'sysoutFilenames'}->{$ext} = $filename;}
902
+ if(defined($refDescs->{$filename}))
903
+ {
904
+ $self->{'fileDescriptions'}->{$filename} = $refDescs->{$filename};
905
+ }
906
+ }
907
+ }
908
+ close(DIR);
909
+ }
910
+
911
+ #arguments: type ('truth' | 'input' | a string to represent a system output), filename
912
+ #pre: filename exists
913
+ #return: none
914
+ sub loadSentences
915
+ {
916
+ my ($self, $sysname, $filename) = @_;
917
+ #if the sentences are already loaded, leave them be
918
+ if(exists $self->{$sysname} && scalar(@{$self->{$sysname}}) > 0) {return;}
919
+
920
+ $self->{$sysname} = [];
921
+ $self->{'tokenCount'}->{$sysname} = 0;
922
+ open(INFILE, "<$filename") or die "Corpus::load(): couldn't open '$filename' for read\n";
923
+ while(my $line = <INFILE>)
924
+ {
925
+ my @words = split(/\s+/, $line);
926
+ $self->{'tokenCount'}->{$sysname} += scalar(@words);
927
+ my $refFactors = [];
928
+ foreach my $word (@words)
929
+ {
930
+ my @factors = split(/\|/, $word);
931
+ push @$refFactors, \@factors;
932
+ }
933
+ push @{$self->{$sysname}}, $refFactors;
934
+ }
935
+ close(INFILE);
936
+ }
937
+
938
+ #free the memory used for the given corpus (but NOT any associated calculations, eg WER)
939
+ #arguments: type ('truth' | 'input' | a string to represent a system output)
940
+ #return: none
941
+ sub releaseSentences
942
+ {
943
+ # my ($self, $sysname) = @_;
944
+ # $self->{$sysname} = [];
945
+ }
946
+
947
+ #arguments: factor name
948
+ #return: none
949
+ #throw if we don't have a filename for the given phrase table
950
+ sub loadPhraseTable
951
+ {
952
+ my ($self, $factorName) = @_;
953
+ $self->ensurePhraseTableDefined($factorName);
954
+
955
+ my $filename = $self->{'phraseTableFilenames'}->{$factorName};
956
+ open(PTABLE, "<$filename") or die "couldn't open '$filename' for read\n";
957
+ $self->{'phraseTables'}->{$factorName} = {}; #create ref to phrase table (hash of strings, for source phrases, to anything whatsoever)
958
+ #assume the table is sorted so that duplicate source phrases will be consecutive
959
+ while(my $line = <PTABLE>)
960
+ {
961
+ my @phrases = split(/\s*\|\|\|\s*/, $line, 2);
962
+ $self->{'phraseTables'}->{$factorName}->{$phrases[0]} = 0; #just so that it's set to something
963
+ }
964
+ close(PTABLE);
965
+ }
966
+
967
+ #arguments: factor name
968
+ #return: none
969
+ sub releasePhraseTable
970
+ {
971
+ my ($self, $factorName) = @_;
972
+ $self->{'phraseTables'}->{$factorName} = {};
973
+ }
974
+
975
+ #arguments: name of list ('nounAndAdj' | ...)
976
+ #return: arrayref of strings (postags)
977
+ sub getPOSTagList
978
+ {
979
+ my ($self, $listname) = @_;
980
+ ##### assume PTB tagset #####
981
+ if($listname eq 'nounAndAdj') {return ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS'];}
982
+ # if($listname eq '') {return [];}
983
+ }
984
+
985
+ #arguments: list to be filtered (arrayref of arrayrefs of factor strings), desired factor index, arrayref of allowable values
986
+ #return: filtered list as array of arrayrefs of factor strings
987
+ sub filterFactors
988
+ {
989
+ my ($self, $refFullList, $index, $refFactorValues) = @_;
990
+ my $valuesRegex = join("|", @$refFactorValues);
991
+ my @filteredList = ();
992
+ foreach my $factors (@$refFullList)
993
+ {
994
+ if($factors->[$index] =~ m/$valuesRegex/)
995
+ {
996
+ push @filteredList, $factors;
997
+ }
998
+ }
999
+ return @filteredList;
1000
+ }
1001
+
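+ #Example (added): $self->filterFactors($refSentence, $posIndex, $self->getPOSTagList('nounAndAdj'))
+ #keeps only the words whose POS factor matches a PTB noun/adjective tag -- the
+ #sort of token selection behind the noun & adj WER-PWER statistics.
+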
1002
+ #arguments: system output (arrayref of arrayrefs of arrayrefs of factor strings), truth (same), factor index to use
1003
+ #return: wer score, arrayref of sentence scores, arrayref of arrayrefs of indices of errorful words
1004
+ sub corpusWER
1005
+ {
1006
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1007
+ my ($totWER, $sentenceWER, $errIndices) = (0, [], []);
1008
+ for(my $i = 0; $i < scalar(@$refSysOutput); $i++)
1009
+ {
1010
+ my ($sentWER, $indices) = $self->sentenceWER($refSysOutput->[$i], $refTruth->[$i], $index);
1011
+ $totWER += $sentWER;
1012
+ push @$sentenceWER, $sentWER;
1013
+ push @$errIndices, $indices;
1014
+ }
1015
+ return ($totWER, $sentenceWER, $errIndices);
1016
+ }
1017
+
1018
+ #arguments: system output (arrayref of arrayrefs of factor strings), truth (same), factor index to use
1019
+ #return: wer score, arrayref of arrayrefs of indices of errorful words
1020
+ sub sentenceWER
1021
+ {
1022
+ #constants: direction we came through the table
1023
+ my ($DIR_NONE, $DIR_SKIPTRUTH, $DIR_SKIPOUT, $DIR_SKIPBOTH) = (-1, 0, 1, 2); #values don't matter but must be unique
1024
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1025
+ my ($totWER, $indices) = (0, []);
1026
+ my ($sLength, $eLength) = (scalar(@$refSysOutput), scalar(@$refTruth));
1027
+ if($sLength == 0 || $eLength == 0) {return ($totWER, $indices);} #special case
1028
+
1029
+ my @refWordsMatchIndices = (-1) x $eLength; #at what sysout-word index this truth word is first matched
1030
+ my @sysoutWordsMatchIndices = (-1) x $sLength; #at what truth-word index this sysout word is first matched
1031
+ my $table = []; #index by sysout word index, then truth word index; a cell holds max count of matching words and direction we came to get it
1032
+ #dynamic-programming time: find the path through the table with the maximum number of matching words
1033
+ for(my $i = 0; $i < $sLength; $i++)
1034
+ {
1035
+ push @$table, [];
1036
+ for(my $j = 0; $j < $eLength; $j++)
1037
+ {
1038
+ my ($maxPrev, $prevDir) = (0, $DIR_NONE);
1039
+ if($i > 0 && $table->[$i - 1]->[$j]->[0] >= $maxPrev) {$maxPrev = $table->[$i - 1]->[$j]->[0]; $prevDir = $DIR_SKIPOUT;}
1040
+ if($j > 0 && $table->[$i]->[$j - 1]->[0] >= $maxPrev) {$maxPrev = $table->[$i]->[$j - 1]->[0]; $prevDir = $DIR_SKIPTRUTH;}
1041
+ if($i > 0 && $j > 0 && $table->[$i - 1]->[$j - 1]->[0] >= $maxPrev) {$maxPrev = $table->[$i - 1]->[$j - 1]->[0]; $prevDir = $DIR_SKIPBOTH;}
1042
+ my $match = ($refSysOutput->[$i]->[$index] eq $refTruth->[$j]->[$index] && $refWordsMatchIndices[$j] == -1 && $sysoutWordsMatchIndices[$i] == -1) ? 1 : 0;
1043
+ if($match == 1) {$refWordsMatchIndices[$j] = $i; $sysoutWordsMatchIndices[$i] = $j;}
1044
+ push @{$table->[$i]}, [($match ? $maxPrev + 1 : $maxPrev), $prevDir];
1045
+ }
1046
+ }
1047
+
1048
+ #look back along the path and get indices of non-matching words
1049
+ my @unusedSysout = (0) x $sLength; #whether each sysout word was matched--used for outputting html table
1050
+ my ($i, $j) = ($sLength - 1, $eLength - 1);
1051
+ while($i > 0) #work our way back to the first sysout word
1052
+ {
1053
+ push @{$table->[$i]->[$j]}, 0; #length is flag to highlight cell
1054
+ if($table->[$i]->[$j]->[1] == $DIR_SKIPTRUTH)
1055
+ {
1056
+ $j--;
1057
+ }
1058
+ elsif($table->[$i]->[$j]->[1] == $DIR_SKIPOUT)
1059
+ {
1060
+ if($table->[$i - 1]->[$j]->[0] == $table->[$i]->[$j]->[0]) {unshift @$indices, $i; $unusedSysout[$i] = 1;}
1061
+ $i--;
1062
+ }
1063
+ elsif($table->[$i]->[$j]->[1] == $DIR_SKIPBOTH)
1064
+ {
1065
+ if($table->[$i - 1]->[$j - 1]->[0] == $table->[$i]->[$j]->[0]) {unshift @$indices, $i; $unusedSysout[$i] = 1;}
1066
+ $i--; $j--;
1067
+ }
1068
+ }
1069
+ #we're at the first sysout word; finish up checking for matches
1070
+ while($j > 0 && $refWordsMatchIndices[$j] != 0) {push @{$table->[0]->[$j]}, 0; $j--;}
1071
+ if($j == 0 && $refWordsMatchIndices[0] != 0) {unshift @$indices, 0; $unusedSysout[0] = 1;} #no truth word was matched to the first sysout word
1072
+
1073
+ #print some HTML to debug the WER algorithm
1074
+ # print "<table border=1><tr><td></td><td>" . join("</td><td>", map {() . $_->[$index]} @$refTruth) . "</td></tr>";
1075
+ # for(my $i = 0; $i < $sLength; $i++)
1076
+ # {
1077
+ # print "<tr><td" . (($unusedSysout[$i] == 1) ? " style=\"background-color: #ffdd88\">" : ">") . $refSysOutput->[$i]->[$index] . "</td>";
1078
+ # for(my $j = 0; $j < $eLength; $j++)
1079
+ # {
1080
+ # print "<td";
1081
+ # if(scalar(@{$table->[$i]->[$j]}) > 2) {print " style=\"color: yellow; background-color: #000080\"";}
1082
+ # my $arrow;
1083
+ # if($table->[$i]->[$j]->[1] == $DIR_NONE) {$arrow = "&times;";}
1084
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPTRUTH) {$arrow = "&larr;";}
1085
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPOUT) {$arrow = "&uarr;";}
1086
+ # elsif($table->[$i]->[$j]->[1] == $DIR_SKIPBOTH) {$arrow = "&loz;";}
1087
+ # print ">" . $table->[$i]->[$j]->[0] . " " . $arrow . "</td>";
1088
+ # }
1089
+ # print "</tr>";
1090
+ # }
1091
+ # print "</table>";
1092
+
1093
+ my $matchCount = 0;
1094
+ if($sLength > 0) {$matchCount = $table->[$sLength - 1]->[$eLength - 1]->[0];}
1095
+ return ($sLength - $matchCount, $indices);
1096
+ }
1097
+
1098
+ #arguments: system output (arrayref of arrayrefs of arrayrefs of factor strings), truth (same), factor index to use
1099
+ #return: wer score, arrayref of sentence scores, arrayref of arrayrefs of indices of errorful words
1100
+ sub corpusPWER
1101
+ {
1102
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1103
+ my ($totWER, $sentenceWER, $errIndices) = (0, [], []);
1104
+ for(my $i = 0; $i < scalar(@$refSysOutput); $i++)
1105
+ {
1106
+ my ($sentWER, $indices) = $self->sentencePWER($refSysOutput->[$i], $refTruth->[$i], $index);
1107
+ $totWER += $sentWER;
1108
+ push @$sentenceWER, $sentWER;
1109
+ push @$errIndices, $indices;
1110
+ }
1111
+ return ($totWER, $sentenceWER, $errIndices);
1112
+ }
1113
+
1114
+ #arguments: system output (arrayref of arrayrefs of factor strings), truth (same), factor index to use
1115
+ #return: wer score, arrayref of arrayrefs of indices of errorful words
1116
+ sub sentencePWER
1117
+ {
1118
+ my ($self, $refSysOutput, $refTruth, $index) = @_;
1119
+ my ($totWER, $indices) = (0, []);
1120
+ my ($sLength, $eLength) = (scalar(@$refSysOutput), scalar(@$refTruth));
1121
+ my @truthWordUsed = (0) x $eLength; #array of 0/1; can only match a given truth word once
1122
+ for(my $j = 0; $j < $sLength; $j++)
1123
+ {
1124
+ my $found = 0;
1125
+ for(my $k = 0; $k < $eLength; $k++) #check output word against entire truth sentence
1126
+ {
1127
+ if(lc $refSysOutput->[$j]->[$index] eq lc $refTruth->[$k]->[$index] && $truthWordUsed[$k] == 0)
1128
+ {
1129
+ $truthWordUsed[$k] = 1;
1130
+ $found = 1;
1131
+ last;
1132
+ }
1133
+ }
1134
+ if($found == 0)
1135
+ {
1136
+ $totWER++;
1137
+ push @$indices, $j;
1138
+ }
1139
+ }
1140
+ return ($totWER, $indices);
1141
+ }
1142
+
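+ #Example (added): for sysout "a b c" vs. truth "c b a" on the surface factor,
+ #sentencePWER() returns 0 errors (every output word occurs somewhere in the
+ #truth), while sentenceWER() above penalizes the reordering; a large WER-PWER
+ #gap therefore points to word-order rather than lexical errors.
+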
1143
+ #BLEU calculation for a single sentence
1144
+ #arguments: truth sentence (arrayref of arrayrefs of factor strings), sysout sentence (same), factor index to use
1145
+ #return: 1- through 4-gram matching and total counts (1-g match, 1-g tot, 2-g match...), candidate length, reference length
1146
+ sub sentenceBLEU
1147
+ {
1148
+ my ($self, $refTruth, $refSysOutput, $factorIndex, $debug) = @_;
1149
+ my ($length_reference, $length_translation) = (scalar(@$refTruth), scalar(@$refSysOutput));
1150
+ my ($correct1, $correct2, $correct3, $correct4, $total1, $total2, $total3, $total4) = (0, 0, 0, 0, 0, 0, 0, 0);
1151
+ my %REF_GRAM = ();
1152
+ my ($i, $gram);
1153
+ for($i = 0; $i < $length_reference; $i++)
1154
+ {
1155
+ $gram = $refTruth->[$i]->[$factorIndex];
1156
+ $REF_GRAM{$gram}++;
1157
+ next if $i<1;
1158
+ $gram = $refTruth->[$i - 1]->[$factorIndex] ." ".$gram;
1159
+ $REF_GRAM{$gram}++;
1160
+ next if $i<2;
1161
+ $gram = $refTruth->[$i - 2]->[$factorIndex] ." ".$gram;
1162
+ $REF_GRAM{$gram}++;
1163
+ next if $i<3;
1164
+ $gram = $refTruth->[$i - 3]->[$factorIndex] ." ".$gram;
1165
+ $REF_GRAM{$gram}++;
1166
+ }
1167
+ for($i = 0; $i < $length_translation; $i++)
1168
+ {
1169
+ $gram = $refSysOutput->[$i]->[$factorIndex];
1170
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1171
+ $REF_GRAM{$gram}--;
1172
+ $correct1++;
1173
+ }
1174
+ next if $i<1;
1175
+ $gram = $refSysOutput->[$i - 1]->[$factorIndex] ." ".$gram;
1176
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1177
+ $REF_GRAM{$gram}--;
1178
+ $correct2++;
1179
+ }
1180
+ next if $i<2;
1181
+ $gram = $refSysOutput->[$i - 2]->[$factorIndex] ." ".$gram;
1182
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1183
+ $REF_GRAM{$gram}--;
1184
+ $correct3++;
1185
+ }
1186
+ next if $i<3;
1187
+ $gram = $refSysOutput->[$i - 3]->[$factorIndex] ." ".$gram;
1188
+ if (defined($REF_GRAM{$gram}) && $REF_GRAM{$gram} > 0) {
1189
+ $REF_GRAM{$gram}--;
1190
+ $correct4++;
1191
+ }
1192
+ }
1193
+ my $total = $length_translation;
1194
+ $total1 = max(1, $total);
1195
+ $total2 = max(1, $total - 1);
1196
+ $total3 = max(1, $total - 2);
1197
+ $total4 = max(1, $total - 3);
1198
+
1199
+ return ($correct1, $total1, $correct2, $total2, $correct3, $total3, $correct4, $total4, $length_translation, $length_reference);
1200
+ }
1201
+
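+ #Worked example (added): sysout "the the cat" vs. reference "the cat sat":
+ #"the" and "cat" each match once, but the second "the" finds no remaining
+ #reference count (clipping), so correct1 = 2, total1 = 3; the bigram "the cat"
+ #gives correct2 = 1, total2 = 2.
+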
1202
+ ##### filesystem #####
1203
+
1204
+ #open as many given files as possible; only warn about the rest
1205
+ #arguments: list of filename extensions to open (assume corpus name is file title)
1206
+ #return: hash from type string to filehandleref, giving all files that were successfully opened
1207
+ sub openFiles
1208
+ {
1209
+ my ($self, @extensions) = @_;
1210
+ my %openedFiles = ();
1211
+ foreach my $ext (@extensions)
1212
+ {
1213
+ if(!open(FILE, "<" . $self->{'corpusName'} . $ext))
1214
+ {
1215
+ warn "Corpus::openFiles(): couldn't open '" . $self->{'corpusName'} . $ext . "' for read\n";
1216
+ }
1217
+ else #success
1218
+ {
1219
+ $openedFiles{$ext} = $fh;
1220
+ }
1221
+ }
1222
+ return %openedFiles;
1223
+ }
1224
+
1225
+ #read one line from each given file
1226
+ #arguments: hash from type string to filehandleref
1227
+ #return: hash from type string to sentence (stored as arrayref of arrayrefs of factors) read from corresponding file
1228
+ sub readLineFromFiles
1229
+ {
1230
+ my ($self, %openedFiles) = @_;
1231
+ my %lines;
1232
+ foreach my $type (keys %openedFiles)
1233
+ {
1234
+ $lines{$type} = [];
1235
+ my $sentence = readline($openedFiles{$type}); #<$openedFiles{$type}> would not be parsed as a readline
1236
+ my @words = split(/\s+/, $sentence);
1237
+ foreach my $word (@words)
1238
+ {
1239
+ my @factors = split(/\|/, $word);
1240
+ push @{$lines{$type}}, \@factors;
1241
+ }
1242
+ }
1243
+ return %lines;
1244
+ }
1245
+
1246
+ #close all given files
1247
+ #arguments: hash from type string to filehandleref
1248
+ #return: none
1249
+ sub closeFiles
1250
+ {
1251
+ my ($self, %openedFiles) = @_;
1252
+ foreach my $type (keys %openedFiles)
1253
+ {
1254
+ close($openedFiles{$type});
1255
+ }
1256
+ }
1257
+
1258
+ ##### write HTML #####
1259
+
1260
+ #print HTML for comparing various versions of a sentence, with special processing for each version as appropriate
1261
+ #arguments: filehandleref to which to write, sentence ID string, hashref of version string to sentence (stored as arrayref of arrayref of factor strings)
1262
+ #return: none
1263
+ sub printSingleSentenceComparison
1264
+ {
1265
+ my ($self, $fh, $sentID, $sentences) = @_;
1266
+ my $curFH = select;
1267
+ select $fh;
1268
+ #javascript to reorder rows to look nice afterward
1269
+ print "<script type=\"text/javascript\">
1270
+ function reorder_$sentID()
1271
+ {/*
1272
+ var table = document.getElementById('div_$sentID').firstChild;
1273
+ var refTransRow = table.getElementById('row_e');
1274
+ var inputRow = table.getElementById('row_f');
1275
+ table.removeRow(refTransRow);
1276
+ table.removeRow(inputRow);
1277
+ var newRow1 = table.insertRow(0);
1278
+ var newRow2 = table.insertRow(1);
1279
+ newRow1.childNodes = inputRow.childNodes;
1280
+ newRow2.childNodes = refTransRow.childNodes;*/
1281
+ }
1282
+ </script>";
1283
+ #html for sentences
1284
+ print "<div id=\"div_$sentID\" style=\"padding: 3px; margin: 5px\">";
1285
+ print "<table border=\"1\">";
1286
+ # my $rowCount = 0;
1287
+ # my @bgColors = ("#ffefbf", "#ffdf7f");
1288
+ #process all rows in order
1289
+ foreach my $sentType (keys %$sentences)
1290
+ {
1291
+ # my $bgcolor = $bgColors[$rowCount % 2]; #disabled along with the row-striping code above
1292
+ print "<tr id=\"row_$sentType\"><td align=right>";
1293
+ #description of sentence
1294
+ if(defined($self->{'fileDescriptions'}->{$self->{'corpusName'} . $sentType}))
1295
+ {
1296
+ print "(" . $self->{'fileDescriptions'}->{$self->{'corpusName'} . $sentType} . ")";
1297
+ }
1298
+ else
1299
+ {
1300
+ print "($sentType)";
1301
+ }
1302
+ print "</td><td align=left>";
1303
+ #sentence with markup
1304
+ if($sentType eq 'f') #input
1305
+ {
1306
+ # $self->writeHTMLSentenceWithFactors($fh, $sentences->{$sentType}, $inputColor);
1307
+ }
1308
+ elsif($sentType eq 'e') #reference translation
1309
+ {
1310
+ # $self->writeHTMLSentenceWithFactors($fh, $sentences->{$sentType}, $reftransColor);
1311
+ }
1312
+ else #system output
1313
+ {
1314
+ # $self->writeHTMLTranslationHighlightedWithFactors($fh, $sentences->{$sentType}, $sentences->{'e'}, $highlightColors);
1315
+ }
1316
+ print "</td></tr>";
1317
+ # $rowCount++;
1318
+ }
1319
+ print "</table>";
1320
+ print "</div>\n";
1321
+ select $curFH;
1322
+ }
1323
+
1324
+ #print contents of all fields of this object, with useful formatting for arrayrefs and hashrefs
1325
+ #arguments: none
1326
+ #return: none
1327
+ sub printDetails
1328
+ {
1329
+ my $self = shift;
1330
+ foreach my $key (keys %$self)
1331
+ {
1332
+ if(ref($self->{$key}) eq 'HASH')
1333
+ {
1334
+ print STDERR "obj: $key => {" . join(', ', map {"$_ => " . $self->{$key}->{$_}} (keys %{$self->{$key}})) . "}\n";
1335
+ }
1336
+ elsif(ref($self->{$key}) eq 'ARRAY')
1337
+ {
1338
+ print STDERR "obj: $key => (" . join(', ', @{$self->{$key}}) . ")\n";
1339
+ }
1340
+ elsif(ref($self->{$key}) eq '') #not a reference
1341
+ {
1342
+ print STDERR "obj: $key => " . $self->{$key} . "\n";
1343
+ }
1344
+ }
1345
+ }
mosesdecoder/scripts/analysis/smtgui/README ADDED
@@ -0,0 +1,42 @@
1
+ Readme for SMTGUI
2
+ Philipp Koehn, Evan Herbst
3
+ 7 / 31 / 06
4
+ -----------------------------------
5
+
6
+ SMTGUI is Philipp's and my code to analyze a decoder's output (the decoder doesn't have to be moses, but most of SMTGUI's features relate to factors, so it probably will be). You can view a list of available corpora by running <newsmtgui.cgi?ACTION=> on any web server. When you're viewing a corpus, click the checkboxes and Compare to see sentences from various sources on one screen. Currently they're in an annoying format; feel free to make the display nicer and more useful. There are per-sentence stats stored in a Corpus object; they just aren't used yet. See compare2() in newsmtgui and Corpus::printSingleSentenceComparison() as a starting point for better display code. For now it's mostly the view-corpus screen that's useful.
7
+
8
+ newsmtgui.cgi is the main program. Corpus.pm is my module; Error.pm is a standard part of Perl but appears to not always be distributed. The accompanying version is Error.pm v1.15.
9
+
10
+ The program requires the file 'file-factors', which gives the list of factors included in each corpus (see the example file for details). Only corpora included in 'file-factors' are displayed. The file 'file-descriptions' is optional and associates a descriptive string with each included filename. These are used only for display. Again an example is provided.
11
+
12
+ For the corpus with name CORPUS, the following files should be present:
13
+ - CORPUS.f, the foreign input
14
+ - CORPUS.e, the truth (aka reference translation)
15
+ - CORPUS.SYSTEM_TRANSLATION for each system to be analyzed
16
+ - CORPUS.pt_FACTORNAME for each factor that requires a phrase table (these are currently used only to count unknown source words)
17
+
18
+ The .f, .e and system-output files should have the usual pipe-delimited format, one sentence per line. Phrase tables should also have standard three-pipe format.
19
+
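+ For illustration (hypothetical data, not from a real corpus): a corpus named
+ devtest2006.de-en with one system 'moses' would consist of devtest2006.de-en.f,
+ devtest2006.de-en.e, devtest2006.de-en.moses and devtest2006.de-en.pt_surf. With
+ factors surf|pos|lemma, one line of a factored file might look like
+
+ the|DT|the houses|NNS|house stand|VBP|stand
+
+ and one line of a phrase table (three-pipe format) might look like
+
+ das haus ||| the house ||| 0.8 0.5 0.7 0.4
+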
20
+ A list of standard factor names is available in @Corpus::FACTORNAMES. Feel free to add, but woe betide you if you muck with 'surf', 'pos' and 'lemma'; those are hardcoded all over the place.
21
+
22
+ Currently the program assumes you've included factors 'surf', 'pos' and 'lemma', in whatever order; if not, you'll want to edit view_corpus() in newsmtgui.cgi so it doesn't automatically display all info. To get English POS tags and lemmas from a words-only corpus and combine the factors into one file:
23
+
24
+ $ $BIN/tag-english < CORPUS.lc > CORPUS.pos-tmp (call Brill)
25
+ $ $BIN/morph < CORPUS.pos-tmp > CORPUS.morph
26
+ $ $DATA/test/factor-stem.en.perl < CORPUS.morph > CORPUS.lemma
27
+ $ cat CORPUS.pos-tmp | perl -n -e 's/_/\|/g; print;' > CORPUS.lc+pos (replace _ with |)
28
+ $ $DATA/test/combine-features.perl CORPUS lc+pos lemma > CORPUS.lc+pos+lemma
29
+ $ rm CORPUS.pos-tmp (cleanup)
30
+
31
+ where $BIN=/export/ws06osmt/bin, $DATA=/export/ws06osmt/data.
32
+
33
+ To get German POS tags and lemmas from a words-only corpus (the first step must be run on linux):
34
+
35
+ $ $BIN/recase.perl --in CORPUS.lc --model $MODELS/en-de/recaser/pharaoh.ini > CORPUS.recased (call pharaoh with a lowercase->uppercase model)
36
+ $ $BIN/run-lopar-tagger-lowercase.perl CORPUS.recased CORPUS.recased.lopar (call LOPAR)
37
+ $ $DATA/test/factor-stem.de.perl < CORPUS.recased.lopar > CORPUS.stem
38
+ $ $BIN/lowercase.latin1.perl < CORPUS.stem > CORPUS.lcstem (as you might guess, assumes latin-1 encoding)
39
+ $ $DATA/test/factor-pos.de.perl < CORPUS.recased.lopar > CORPUS.pos
40
+ $ $DATA/test/combine-features.perl CORPUS lc pos lcstem > CORPUS.lc+pos+lcstem
41
+
42
+ where $MODELS=/export/ws06osmt/models.
mosesdecoder/scripts/analysis/smtgui/file-descriptions ADDED
@@ -0,0 +1,4 @@
1
+ devtest2006.de-en.matrix05-baseline.pharaoh Pharaoh JHUWS baseline run
2
+ devtest2006.de-en.matrix05-baseline.moses-2006-07-20 Moses baseline run
3
+ devtest2006.en-de.matrix05-baseline.pharaoh Pharaoh JHUWS baseline run
4
+ devtest2006.en-de.matrix05-moses.2006-08-02 Moses baseline run
mosesdecoder/scripts/analysis/smtgui/file-factors ADDED
@@ -0,0 +1,9 @@
1
+ #corpus name : list of factors in corpus : [input] factor LMfilename, factor LMfilename, ... : [output] factor LMfilename, factor LMfilename, ...
2
+ #(the given factors should be present in all files for the given corpus)
3
+ devtest2006.de-en : surf pos lemma : surf europarl.de.srilm.gz : surf europarl.en.srilm.gz
4
+ devtest2006.en-de : surf pos lemma : surf europarl.en.srilm.gz : surf europarl.de.srilm.gz
5
+ test2006.en-de : surf : surf europarl.en.srilm.gz : surf europarl.de.srilm.gz
6
+ #pstem: lemmas come from the Porter stemmer (and so are really a mix of stems and lemmas)
7
+ pstem_devtest2006.de-en : surf pos lemma : : surf europarl.en.srilm.gz
8
+ #replace esset with ss in German text
9
+ ss_devtest2006.en-de : surf pos lemma : surf europarl.en.srilm.gz : surf ss_europarl.de.srilm.gz
mosesdecoder/scripts/analysis/smtgui/newsmtgui.cgi ADDED
@@ -0,0 +1,1006 @@
1
+ #!/usr/bin/perl -w
2
+ #
3
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
4
+ # Public License version 2.1 or, at your option, any later version.
5
+
6
+ # $Id$
7
+ use strict;
8
+
9
+ use CGI;
10
+ use Corpus; #Evan's code
11
+ use Error qw(:try);
12
+
13
+ #files with extensions other than these are interpreted as system translations; see the file 'file-descriptions', if it exists, for the comments that go with them
14
+ my %FILETYPE = ('e' => 'Reference Translation',
15
+ 'f' => 'Foreign Original',
16
+ 'ref.sgm' => 'Reference Translations',
17
+ 'e.sgm' => 'Reference Translations',
18
+ 'src.sgm' => 'Foreign Originals',
19
+ 'f.sgm' => 'Foreign Originals');
20
+ my %DONTSCORE = ('f' => 1, 'f.sgm' => 1, 'src.sgm' => 1,
21
+ 'e' => 1, 'e.sgm' => 1, 'ref.sgm' => 1);
22
+ my @SHOW = ('f', 'e', 'comm');
23
+ my %SHOW_COLOR = ('f' => "BLUE",
24
+ 'e' => "GREEN");
25
+ my $FOREIGN = 'f';
26
+
27
+ #FILEDESC: textual descriptions associated with specific filenames; to be displayed on the single-corpus view
28
+ my %FILEDESC = (); &load_descriptions();
29
+ my %factorData = loadFactorData('file-factors');
30
+ my %MEMORY; &load_memory();
31
+ my (@mBLEU,@NIST);
32
+ @mBLEU=`cat mbleu-memory.dat` if -e "mbleu-memory.dat"; chop(@mBLEU);
33
+ @NIST = `cat nist-memory.dat` if -e "nist-memory.dat"; chop(@NIST);
34
+ my %in; &ReadParse(); #parse arguments
35
+
36
+ if (scalar(@ARGV) > 0 && $ARGV[0] eq 'bleu') {
37
+ $in{CORPUS} = $ARGV[1];
38
+ $in{ACTION} = "VIEW_CORPUS";
39
+ }
40
+
41
+ my %MULTI_REF;
42
+ if ($in{CORPUS} && -e "$in{CORPUS}.ref.sgm") {
43
+ my $sysid;
44
+ open(REF,"$in{CORPUS}.ref.sgm");
45
+ while(<REF>) {
46
+ $sysid = $1 if /<DOC.+sysid=\"([^\"]+)\"/;
47
+ if (/<seg[^>]*> *(\S.+\S) *<\/seg>/) {
48
+ push @{$MULTI_REF{$sysid}}, $1;
49
+ }
50
+ }
51
+ close(REF);
52
+ }
53
+
54
+ if ($in{ACTION} eq '') { &show_corpora(); }
55
+ elsif ($in{ACTION} eq 'VIEW_CORPUS') { &view_corpus(); }
56
+ elsif ($in{ACTION} eq 'SCORE_FILE') { &score_file(); }
57
+ elsif ($in{ACTION} eq 'RESCORE_FILE') { &score_file(); }
58
+ elsif ($in{ACTION} eq 'COMPARE') { &compare(); }
59
+ else { &htmlhead("Unknown Action $in{ACTION}"); }
60
+ print "</BODY></HTML>\n";
61
+
62
+ ###### SHOW CORPORA IN EVALUATION DIRECTORY
63
+
64
+ sub show_corpora {
65
+ my %CORPUS = ();
66
+
67
+ # find corpora in evaluation directory: see the factor-index file, which was already read in
68
+ foreach my $corpusName (keys %factorData)
69
+ {
70
+ $CORPUS{$corpusName} = 1;
71
+ }
72
+
73
+ # list corpora
74
+ &htmlhead("All Corpora");
75
+ print "<UL>\n";
76
+ foreach (sort (keys %CORPUS)) {
77
+ print "<LI><A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($_)."\">Corpus $_</A>\n";
78
+ }
79
+ print "</UL>\n";
80
+ }
81
+
82
+ ###### SHOW INFORMATION FOR ONE CORPUS
83
+
84
+ sub view_corpus {
85
+ my @TABLE;
86
+ &htmlhead("View Corpus $in{CORPUS}");
87
+
88
+ # find corpora in evaluation directory
89
+ my $corpus = new Corpus('-name' => "$in{CORPUS}", '-descriptions' => \%FILEDESC, '-info_line' => $factorData{$in{CORPUS}});
90
+ # $corpus->printDetails(); #debugging info
91
+
92
+ my ($sentence_count, $lineInfo);
93
+ if(-e "$in{CORPUS}.f")
94
+ {
95
+ $lineInfo = `wc -l $in{CORPUS}.f`;
96
+ $lineInfo =~ /^\s*(\d+)\s+/;
97
+ $sentence_count = 0 + $1;
98
+ }
99
+ else
100
+ {
101
+ $lineInfo = `wc -l $in{CORPUS}.e`;
102
+ $lineInfo =~ /^\s*(\d+)\s+/;
103
+ $sentence_count = 0 + $1;
104
+ }
105
+
106
+ print "Corpus '$in{CORPUS}' consists of $sentence_count sentences\n";
107
+ print "(<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&mBLEU=1>with mBLEU</A>)" if ((!defined($in{mBLEU})) && (scalar keys %MEMORY) && -e "$in{CORPUS}.e" && -e "$in{CORPUS}.f");
108
+ print "<P>\n";
109
+ print "<FORM ACTION=''>\n";
110
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=COMPARE>\n";
111
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
112
+ print "<TABLE BORDER=1 CELLSPACING=0><TR>
113
+ <TD>File (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS}).">sort</A>)</TD>
114
+ <TD>Date (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=TIME>sort</A>)</TD>";
115
+ if (-e "$in{CORPUS}.e") {
116
+ print "<TD>IBM BLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=IBM>sort</A>)</TD>";
117
+ }
118
+ if (-e "$in{CORPUS}.ref.sgm" && -e "$in{CORPUS}.src.sgm") {
119
+ print "<TD>NIST (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=NIST>sort</A>)</TD>";
120
+ if (! -e "$in{CORPUS}.e") {
121
+ print "<TD>BLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=BLEU>sort</A>)</TD>";
122
+ }
123
+ }
124
+ if ($in{mBLEU} && (scalar keys %MEMORY) && -e "$in{CORPUS}.e" && -e "$in{CORPUS}.f") {
125
+ print "<TD>mBLEU (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=mBLEU>sort</A>)</TD>";
126
+ }
127
+ print "<TD>Unknown Words</TD>"; #can't sort on; only applies to the input
128
+ print "<TD>Perplexity</TD>"; #applies to truth and system outputs
129
+ print "<TD>WER (<A HREF=?ACTION=VIEW_CORPUS&CORPUS=" . CGI::escape($in{CORPUS})."&SORT=WER>sort</A>)</TD>";
130
+ print "<TD>Noun & adj WER-PWER</TD>"; #can't sort on; only applies to sysoutputs
131
+ print "<TD>Surface vs. lemma PWER</TD>"; #can't sort on; only applies to sysoutputs
132
+ print "<TD>Statistical Measures</TD>";
133
+
134
+ opendir(DIR, ".") or die "couldn't open '.' for read";
135
+ my @filenames = readdir(DIR); #includes . and ..
136
+ closedir(DIR);
137
+ foreach $_ (@filenames)
138
+ {
139
+ next if -d $_; #if is a directory
140
+ my $sgm = 0;
141
+ if (/\.sgm$/)
142
+ {
143
+ `grep '<seg' $_ | wc -l` =~ /^\s*(\d+)\s+/;
144
+ next unless $1 == $sentence_count;
145
+ $sgm = 1;
146
+ }
147
+ else
148
+ {
149
+ `wc -l $_` =~ /^\s*(\d+)\s+/;
150
+ next unless $1 == $sentence_count;
151
+ }
152
+ next unless /^$in{CORPUS}\.([^\/]+)$/;
153
+ my $file = $1;
154
+ my $sort = "";
155
+ # checkbox for compare
156
+ my $row = "<TR><TD style=\"font-size: small\"><INPUT TYPE=CHECKBOX NAME=FILE_$file VALUE=1>";
157
+ # README
158
+ if (-e "$in{CORPUS}.$file.README") {
159
+ my $readme = `cat $in{CORPUS}.$file.README`;
160
+ $readme =~ s/([\"\'])/\\\"/g;
161
+ $readme =~ s/[\n\r]/\\n/g;
162
+ $readme =~ s/\t/\\t/g;
163
+ $row .= "<A HREF='javascript:FieldInfo(\"$in{CORPUS}.$file\",\"$readme\")'>";
164
+ }
165
+ # filename
166
+ $row .= "$file</A>";
167
+ # description (hard-coded)
168
+ my @TRANSLATION_SENTENCE = `cat $in{CORPUS}.$file`;
169
+ chop(@TRANSLATION_SENTENCE);
170
+
171
+ #count sentences that contain null words
172
+ my $null_count = 0;
173
+ foreach (@TRANSLATION_SENTENCE)
174
+ {
175
+ $null_count++ if /^NULL$/ || /^NONE$/;
176
+ }
177
+ if ($null_count > 0) {
178
+ $row .= "$null_count NULL ";
179
+ }
180
+
181
+ $row .= " (".$FILETYPE{$file}.")" if defined($FILETYPE{$file});
182
+ $row .= " (".$FILEDESC{$in{CORPUS}.".".$file}.")" if defined($FILEDESC{$in{CORPUS}.".".$file});
183
+ $row .= " (".$FILEDESC{$file}.")" if defined($FILEDESC{$file});
184
+ # filedate
185
+ my @STAT = stat("$in{CORPUS}.$file");
186
+ my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($STAT[9]); #STAT[9] is last-modified time
187
+ my $time = sprintf("%04d-%02d-%02d %02d:%02d:%02d",$year+1900,$mon+1,$mday,$hour,$min,$sec);
188
+ $row .= "</TD>\n<TD>".$time."</TD>\n";
189
+ if (defined($in{SORT}) && $in{SORT} eq 'TIME') { $sort = $time; }
190
+ # IBM BLEU score
191
+ my $no_bleu =0;
192
+ if (!$sgm && -e "$in{CORPUS}.e") {
193
+ $row .= "<TD>";
194
+ if (!defined($DONTSCORE{$file}) && $file !~ /^f$/ && $file ne "e" && $file !~ /^pt/) {
195
+ my ($score,$p1,$p2,$p3,$p4,$bp) = $corpus->calcBLEU($file, 'surf');
197
+ $row .= sprintf("<B>%.04f</B> %.01f/%.01f/%.01f/%.01f *%.03f", $score, $p1, $p2, $p3, $p4, $bp);
198
+ if (defined($in{SORT}) && $in{SORT} eq 'IBM') { $sort = $score; }
199
+ }
200
+ $row .= "</TD>\n";
201
+ }
202
+ else {
203
+ $no_bleu=1;
204
+ }
205
+ # NIST score
206
+ if (-e "$in{CORPUS}.ref.sgm" && -e "$in{CORPUS}.src.sgm"
207
+ && !$DONTSCORE{$file}) {
208
+ $row .= "<TD>";
210
+ my ($nist,$nist_bleu);
211
+ if ($file =~ /sgm$/) {
212
+ ($nist,$nist_bleu) = get_nist_score("$in{CORPUS}.ref.sgm","$in{CORPUS}.src.sgm","$in{CORPUS}.$file");
213
+ $row .= sprintf("<B>%.04f</B>",$nist);
214
+ if (defined($in{SORT}) && $in{SORT} eq 'NIST') { $sort = $nist; }
215
+ }
216
+ $row .= "</TD>\n";
217
+ if ($no_bleu) {
218
+ $row .= "<TD>";
219
+ if ($file =~ /sgm$/) {
220
+ $row .= sprintf("<B>%.04f</B>",$nist_bleu);
221
+ if (defined($in{SORT}) && $in{SORT} eq 'BLEU') { $sort = $nist_bleu; }
222
+ }
223
+ $row .= "</TD>\n";
224
+ }
225
+ }
226
+ # multi-bleu
227
+ if ($in{mBLEU} && (scalar keys %MEMORY) && -e "$in{CORPUS}.e") {
228
+ $row .= "<TD>";
229
+ if (!defined($DONTSCORE{$file}) && $file !~ /^f$/ && $file ne "e") {
230
+ my ($score,$p1,$p2,$p3,$p4,$bp) = get_multi_bleu_score("$in{CORPUS}.f","$in{CORPUS}.e","$in{CORPUS}.$file");
231
+ $row .= sprintf("<B>%.04f</B> %.01f/%.01f/%.01f/%.01f *%.03f",$score,$p1,$p2,$p3,$p4,$bp);
232
+ if (defined($in{SORT}) && $in{SORT} eq 'mBLEU') { $sort = $score; }
233
+ }
234
+ $row .= "</TD>\n";
235
+ }
236
+
237
+ my $isSystemOutput = ($file ne 'e' && $file ne 'f' && $file !~ /^pt/);
238
+ # misc stats (note the unknown words should come first so the total word count is available for WER)
239
+ $row .= "<TD align=\"center\">";
240
+ if($file eq 'f') #input
241
+ {
242
+ try
243
+ {
244
+ my ($unknownCount, $totalCount) = calc_unknown_words($corpus, 'surf');
245
+ $row .= sprintf("%.4lf (%d / %d)", $unknownCount / $totalCount, $unknownCount, $totalCount);
246
+ }
247
+ catch Error::Simple with {$row .= "[system error]";};
248
+ }
249
+ $row .= "</TD>\n<TD align=\"center\">";
250
+ if($file eq 'e' || $file eq 'f' || $isSystemOutput)
251
+ {
252
+ try
253
+ {
254
+ my $perplexity = $corpus->calcPerplexity(($file eq 'e') ? 'truth' : (($file eq 'f') ? 'input' : $file), 'surf');
255
+ $row .= sprintf("%.2lf", $perplexity);
256
+ }
257
+ catch Error::Simple with {$row .= "[system error]";}
258
+ }
259
+ $row .= "</TD>\n<TD align=\"center\">";
260
+ if($isSystemOutput)
261
+ {
262
+ try
263
+ {
264
+ my $surfaceWER = $corpus->calcOverallWER($file);
265
+ $row .= sprintf("%.4lf", $surfaceWER);
266
+ }
267
+ catch Error::Simple with {$row .= "[system error]";};
268
+ }
269
+ $row .= "</TD>\n<TD align=\"center\">";
270
+ my ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER);
271
+ if($isSystemOutput)
272
+ {
273
+ try
274
+ {
275
+ ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER) = calc_misc_stats($corpus, $file);
276
+ $row .= sprintf("WER = %.4lg<br>PWER = %.4lg<br><b>ratio = %.3lf</b>", $nnAdjWER, $nnAdjPWER, $nnAdjPWER / $nnAdjWER);
277
+ }
278
+ catch Error::Simple with {$row .= "[system error]";};
279
+ }
280
+ $row .= "</TD>\n<TD align=\"center\">";
281
+ if($isSystemOutput)
282
+ {
283
+ if($surfPWER == -1)
284
+ {
285
+ $row .= "[system error]";
286
+ }
287
+ else
288
+ {
289
+ my ($lemmaBLEU, $p1, $p2, $p3, $p4, $brevity) = $corpus->calcBLEU($file, 'lemma');
290
+ $row .= sprintf("surface = %.3lf<br>lemma = %.3lf<br><b>lemma BLEU = %.04f</b> %.01f/%.01f/%.01f/%.01f *%.03f",
291
+ $surfPWER, $lemmaPWER, $lemmaBLEU, $p1, $p2, $p3, $p4, $brevity);
292
+ }
293
+ }
294
+ $row .= "</TD>\n<TD align=\"center\">";
295
+ if($isSystemOutput)
296
+ {
297
+ try
298
+ {
299
+ my $testInfo = $corpus->statisticallyTestBLEUResults($file, 'surf');
300
+ my @tTestPValues = @{$testInfo->[0]};
301
+ my @confidenceIntervals = @{$testInfo->[1]};
302
+ $row .= "n-gram precision p-values (high p <=> consistent score):<br>t test " . join("/", map {sprintf("%.4lf", $_)} @tTestPValues);
303
+ $row .= "<p>n-gram precision 95% intervals:<br>" . join(",<br>", map {sprintf("[%.4lf - %.4lf]", $_->[0], $_->[1])} @confidenceIntervals);
304
+ my @bleuInterval = (approxBLEUFromNgramScores(map {$_->[0]} @confidenceIntervals), approxBLEUFromNgramScores(map {$_->[1]} @confidenceIntervals));
305
+ $row .= sprintf("<br><b>(BLEU: ~[%.4lf - %.4lf])</b>", $bleuInterval[0], $bleuInterval[1]);
306
+ }
307
+ catch Error::Simple with {$row .= "[system error]";}
308
+ }
309
+ $row .= "</TD>\n";
310
+
311
+ # correct sentence score
313
+ $row .= "<TD>";
314
+ if (!defined($DONTSCORE{$file}) && (scalar keys %MEMORY)) {
315
+ my ($correct,$just_syn,$just_sem,$wrong,$unknown) = get_score_from_memory("$in{CORPUS}.$FOREIGN",
316
+ "$in{CORPUS}.$file");
317
+ $row .= "<B><FONT COLOR=GREEN>$correct</FONT></B>";
318
+ $row .= "/<FONT COLOR=ORANGE>$just_syn</FONT>";
319
+ $row .= "/<FONT COLOR=ORANGE>$just_sem</FONT>";
320
+ $row .= "/<FONT COLOR=RED>$wrong</FONT> ($unknown)</TD>\n";
321
+ if (defined($in{SORT}) && $in{SORT} eq 'SCORE') {
322
+ $sort = sprintf("%03d %04d",$correct,$just_syn+$just_sem);
323
+ }
324
+ }
325
+ else
326
+ {
327
+ $row .= "</TD>\n";
328
+ }
329
+
330
+ $row .= "</TR>\n";
331
+ push @TABLE, "<!-- $sort -->\n$row";
332
+ }
334
+ foreach (reverse sort @TABLE) { print $_; }
335
+ print "</TABLE>\n";
336
+ print "<INPUT TYPE=SUBMIT VALUE=\"Compare\">\n";
337
+ print "<INPUT TYPE=CHECKBOX NAME=SURFACE VALUE=1 CHECKED> Compare all different sentences (instead of just differently <I>evaluated</I> sentences) <INPUT TYPE=CHECKBOX NAME=WITH_EVAL VALUE=1 CHECKED> with evaluation</FORM><P>\n";
338
+ print "<P>The score is to be read as: <FONT COLOR=GREEN>correct</FONT>/<FONT COLOR=ORANGE>just-syn-correct</FONT>/<FONT COLOR=ORANGE>just-sem-correct</FONT>/<FONT COLOR=RED>wrong</FONT> (unscored)\n";
339
+ print "<BR>IBM BLEU is to be read as: <B>metric</B> unigram/bigram/trigram/quadgram *brevity-penalty<P>";
340
+ print "<DIV STYLE=\"border: 1px solid #006600\">";
341
+ print "<H2>Comparison of System Translations (p-values)</H2>";
342
+ my @sysnames = $corpus->getSystemNames();
343
+ for(my $i = 0; $i < scalar(@sysnames); $i++)
344
+ {
345
+ for(my $j = $i + 1; $j < scalar(@sysnames); $j++)
346
+ {
347
+ my $comparison = $corpus->statisticallyCompareSystemResults($sysnames[$i], $sysnames[$j], 'surf');
348
+ print "<P><FONT COLOR=#00aa22>" . $sysnames[$i] . " vs. " . $sysnames[$j] . "</FONT>: [<I>t</I> test] ";
349
+ for(my $k = 0; $k < scalar(@{$comparison->[0]}); $k++)
350
+ {
351
+ print sprintf(($k == 0) ? "%.4lg" : "; %.4lg ", $comparison->[0]->[$k]);
352
+ if($comparison->[1]->[$k] == 0) {print "(&larr;)";} else {print "(&rarr;)";}
353
+ }
354
+ print "&nbsp;&nbsp;---&nbsp;&nbsp;[sign test] ";
355
+ for(my $k = 0; $k < scalar(@{$comparison->[2]}); $k++)
356
+ {
357
+ print sprintf(($k == 0) ? "%.4lg " : "; %.4lg ", $comparison->[2]->[$k]);
358
+ if($comparison->[3]->[$k] == 0) {print "(&larr;)";} else {print "(&rarr;)";}
359
+ }
360
+ print "\n";
361
+ }
362
+ }
363
+ print "</DIV\n";
364
+ print "<P><A HREF=\"newsmtgui.cgi?action=\">All corpora</A>\n";
365
+ }
366
+
367
+ ###### SCORE TRANSLATIONS
368
+
369
+ sub score_file {
370
+ if ($in{VIEW}) {
371
+ &htmlhead("View Translations");
372
+ }
373
+ else {
374
+ &htmlhead("Score Translations");
375
+ }
376
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
377
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
378
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
379
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
380
+ print "<INPUT TYPE=HIDDEN NAME=FILE VALUE=\"$in{FILE}\">\n";
381
+
382
+ # get sentences
383
+ my @SENTENCES;
384
+ if ($in{FILE} =~ /\.sgm$/) {
385
+ @SENTENCES = `grep '<seg' $in{CORPUS}.$in{FILE}`;
386
+ for(my $i=0;$i<=$#SENTENCES;$i++) { #<= so the last segment is normalized too
387
+ $SENTENCES[$i] =~ s/^<seg[^>]+> *(\S.+\S) *<\/seg> *$/$1/;
388
+ }
389
+ }
390
+ else {
391
+ @SENTENCES = `cat $in{CORPUS}.$in{FILE}`; chop(@SENTENCES);
392
+ }
393
+
394
+ my %REFERENCE;
395
+ foreach (@SHOW) {
396
+ if (-e "$in{CORPUS}.$_") {
397
+ @{$REFERENCE{$_}} = `cat $in{CORPUS}.$_`; chop(@{$REFERENCE{$_}});
398
+ }
399
+ }
400
+
401
+ # update memory
402
+ foreach (keys %in) {
403
+ next unless /^SYN_SCORE_(\d+)$/;
404
+ next unless $in{"SEM_SCORE_$1"};
405
+ &store_in_memory($REFERENCE{$FOREIGN}[$1],
406
+ $SENTENCES[$1],
407
+ "syn_".$in{"SYN_SCORE_$1"}." sem_".$in{"SEM_SCORE_$1"});
408
+ }
409
+
410
+ # display sentences
411
+ for(my $i=0;$i<=$#SENTENCES;$i++) {
412
+ my $evaluation = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES[$i]);
413
+ next if ($in{ACTION} eq 'SCORE_FILE' &&
414
+ ! $in{VIEW} &&
415
+ $evaluation ne '' && $evaluation ne 'wrong');
416
+ print "<P>Sentence ".($i+1).":<BR>\n";
417
+ # color coding
418
+ &color_highlight_ngrams($i,&nist_normalize_text($SENTENCES[$i]),$REFERENCE{"e"}[$i]);
419
+ if (%MULTI_REF) {
420
+ foreach my $sysid (keys %MULTI_REF) {
421
+ print "<FONT COLOR=GREEN>".$MULTI_REF{$sysid}[$i]."</FONT> (Reference $sysid)<BR>\n";
422
+ }
423
+ }
424
+
425
+ # all sentences
426
+ print "$SENTENCES[$i] (System output)<BR>\n";
427
+ foreach my $ref (@SHOW) {
428
+ if (-e "$in{CORPUS}.$ref") {
429
+ print "<FONT COLOR=$SHOW_COLOR{$ref}>".$REFERENCE{$ref}[$i]."</FONT> (".$FILETYPE{$ref}.")<BR>\n" if $REFERENCE{$ref}[$i];
430
+ }
431
+ }
432
+ if (! $in{VIEW}) {
433
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$i VALUE=correct";
434
+ print " CHECKED" if ($evaluation =~ /syn_correct/);
435
+ print "> perfect English\n";
436
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$i VALUE=wrong";
437
+ print " CHECKED" if ($evaluation =~ /syn_wrong/);
438
+ print "> imperfect English<BR>\n";
439
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$i VALUE=correct";
440
+ print " CHECKED" if ($evaluation =~ /sem_correct/);
441
+ print "> correct meaning\n";
442
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$i VALUE=wrong";
443
+ print " CHECKED" if ($evaluation =~ /sem_wrong/);
444
+ print "> incorrect meaning\n";
445
+ }
446
+ }
447
+ if (! $in{VIEW}) {
448
+ print "<P><INPUT TYPE=SUBMIT VALUE=\"Add evaluation\">\n";
449
+ print "</FORM>\n";
450
+ }
451
+ }
452
+
453
+ sub color_highlight_ngrams {
454
+ my($i,$sentence,$single_reference) = @_;
455
+ my @REF = ();
456
+ my %NGRAM = ();
457
+ if (%MULTI_REF) {
458
+ foreach my $sysid (keys %MULTI_REF) {
459
+ push @REF,&nist_normalize_text($MULTI_REF{$sysid}[$i]);
460
+ }
461
+ }
462
+ elsif ($single_reference) {
463
+ @REF = ($single_reference);
464
+ }
465
+ if (@REF) {
466
+ foreach my $ref (@REF) {
467
+ my @WORD = split(/\s+/,$ref);
468
+ for(my $n=1;$n<=4;$n++) {
469
+ for(my $w=0;$w<=$#WORD-($n-1);$w++) {
470
+ my $ngram = "$n: ";
471
+ for(my $j=0;$j<$n;$j++) {
472
+ $ngram .= $WORD[$w+$j]." ";
473
+ }
474
+ $NGRAM{$ngram}++;
475
+ }
476
+ }
477
+ }
478
+ $sentence =~ s/^\s+//;
479
+ $sentence =~ s/\s+/ /g;
480
+ $sentence =~ s/\s+$//;
481
+ my @WORD = split(/\s+/,$sentence);
482
+ my @CORRECT;
483
+ for(my $w=0;$w<=$#WORD;$w++) {
484
+ $CORRECT[$w] = 0;
485
+ }
486
+ for(my $n=1;$n<=4;$n++) {
487
+ for(my $w=0;$w<=$#WORD-($n-1);$w++) {
488
+ my $ngram = "$n: ";
489
+ for(my $j=0;$j<$n;$j++) {
490
+ $ngram .= $WORD[$w+$j]." ";
491
+ }
492
+ next unless defined($NGRAM{$ngram}) && $NGRAM{$ngram}>0;
493
+ $NGRAM{$ngram}--;
494
+ for(my $j=0;$j<$n;$j++) {
495
+ $CORRECT[$w+$j] = $n;
496
+ }
497
+ }
498
+ }
499
+ my @COLOR;
500
+ $COLOR[0] = "#FF0000";
501
+ $COLOR[1] = "#C000C0";
502
+ $COLOR[2] = "#0000FF";
503
+ $COLOR[3] = "#00C0C0";
504
+ $COLOR[4] = "#00C000";
505
+ for(my $w=0;$w<=$#WORD;$w++) {
506
+ print "<B><FONT COLOR=".$COLOR[$CORRECT[$w]].">$WORD[$w]<SUB>".$CORRECT[$w]."</SUB></FONT></B> ";
507
+ }
508
+ print "\n<BR>";
509
+ }
510
+ }
511
+
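+ #Note (added): each system-output word is printed with a subscript giving the
+ #size of the reference n-gram (0-4) it was counted in, colored from red (0, no
+ #match) through purple (1), blue (2) and teal (3) to green (4) -- a quick
+ #visual BLEU-style diagnostic.
+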
512
+ ###### OTHER STATS
513
+
514
+ #print (in some unspecified way) the offending exception of type Error::Simple
515
+ #arguments: the error object, a context string
516
+ #return: none
517
+ sub printError
518
+ {
519
+ my ($err, $context) = @_;
520
+ warn "$context: " . $err->{'-text'} . " @ " . $err->{'-file'} . " (" .$err->{'-line'} . ")\n";
521
+ }
522
+
523
+ #compute number and percentage of unknown tokens for a given factor in foreign corpus
524
+ #arguments: corpus object ref, factor name
525
+ #return (unkwordCount, totalWordCount), or (-1, -1) if an error occurs
526
+ sub calc_unknown_words
527
+ {
528
+ my ($corpus, $factorName) = @_;
529
+ try
530
+ {
531
+ my ($unknownCount, $totalCount) = $corpus->calcUnknownTokens($factorName);
532
+ return ($unknownCount, $totalCount);
533
+ }
534
+ catch Error::Simple with
535
+ {
536
+ my $err = shift;
537
+ printError($err, 'calc_unknown_words()');
538
+ return (-1, -1);
539
+ };
540
+ }
541
+
542
+ #compute (if we have the necessary factors) info for:
543
+ #- diff btwn wer and pwer for NNs & ADJs -- if large, many reordering errors
544
+ #- diff btwn pwer for surface forms and pwer for lemmas -- if large, morphology errors
545
+ #arguments: corpus object, system name
546
+ #return (NN/ADJ (wer, pwer), surf pwer, lemma pwer), or (-1, -1, -1, -1) if an error occurs
547
+ sub calc_misc_stats
548
+ {
549
+ my ($corpus, $sysname) = @_;
550
+ try
551
+ {
552
+ my ($nnAdjWER, $nnAdjPWER) = $corpus->calcNounAdjWER_PWERDiff($sysname);
553
+ my ($surfPWER, $lemmaPWER) = ($corpus->calcOverallPWER($sysname, 'surf'), $corpus->calcOverallPWER($sysname, 'lemma'));
554
+ return ($nnAdjWER, $nnAdjPWER, $surfPWER, $lemmaPWER);
555
+ }
556
+ catch Error::Simple with
557
+ {
558
+ my $err = shift;
559
+ printError($err, 'calc_misc_stats()');
560
+ return (-1, -1, -1, -1);
561
+ };
562
+ }
563
+
564
+ #approximate BLEU score from n-gram precisions (currently assumes no length penalty)
565
+ #arguments: n-gram precisions as an array
566
+ #return: BLEU score
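+ #e.g. precisions (0.60, 0.35, 0.20, 0.12) give exp((ln 0.60 + ln 0.35 + ln 0.20 + ln 0.12)/4) ≈ 0.27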
567
+ sub approxBLEUFromNgramScores
568
+ {
569
+ my $logsum = 0;
570
+ foreach my $p (@_) {$logsum += log($p);}
571
+ return exp($logsum / scalar(@_));
572
+ }
573
+
574
+ ###### NIST SCORE
575
+
576
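+ #score a translation file with the external NIST mteval script; results are cached (keyed by file and mtime) in nist-memory.dat
+ #arguments: reference file, source file, translation file
+ #return: (NIST score, BLEU score), or (0,0) if scoring fails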
+ sub get_nist_score {
577
+ my($reference_file,$source_file,$translation_file) = @_;
578
+ my @STAT = stat($translation_file);
579
+ my $current_timestamp = $STAT[9];
580
+ foreach (@NIST) {
581
+ my ($file,$time,$nist,$bleu) = split;
582
+ return ($nist,$bleu)
583
+ if ($file eq $translation_file && $current_timestamp == $time);
584
+ }
585
+
586
+ my $nist_eval = `/home/pkoehn/statmt/bin/mteval-v10.pl -c -r $reference_file -s $source_file -t $translation_file`;
587
+ return (0,0) unless ($nist_eval =~ /NIST score = (\d+\.\d+) BLEU score = (\d+\.\d+)/i);
588
+
589
+ open(NIST,">>nist-memory.dat");
590
+ printf NIST "$translation_file $current_timestamp %f %f\n",$1,$2;
591
+ close(NIST);
592
+ return ($1,$2);
593
+ }
594
+
595
+ sub nist_normalize_text {
596
+ my ($norm_text) = @_;
597
+
598
+ # language-independent part:
599
+ $norm_text =~ s/<skipped>//g; # strip "skipped" tags
600
+ $norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines
601
+ $norm_text =~ s/\n/ /g; # join lines
602
+ $norm_text =~ s/(\d)\s+(\d)/$1$2/g; #join digits
603
+ $norm_text =~ s/&quot;/"/g; # convert SGML tag for quote to "
604
+ $norm_text =~ s/&amp;/&/g; # convert SGML tag for ampersand to &
605
+ $norm_text =~ s/&lt;/</g; # convert SGML tag for less-than to <
606
+ $norm_text =~ s/&gt;/>/g; # convert SGML tag for greater-than to >
607
+
608
+ # language-dependent part (assuming Western languages):
609
+ $norm_text = " $norm_text ";
610
+ # $norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case;
611
+ $norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g; # tokenize punctuation
612
+ $norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
613
+ $norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
614
+ $norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
615
+ $norm_text =~ s/\s+/ /g; # one space only between words
616
+ $norm_text =~ s/^\s+//; # no leading space
617
+ $norm_text =~ s/\s+$//; # no trailing space
618
+
619
+ return $norm_text;
620
+ }
621
+
622
+ ###### BLEU SCORE
623
+
624
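+ #compute multi-reference BLEU: references come from the reference file plus any fully correct
+ #translations stored in the evaluation memory; results are cached by file mtime in mbleu-memory.dat
+ #arguments: foreign file, reference file, translation file
+ #return: (BLEU, 1- to 4-gram precisions in percent, brevity penalty)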
+ sub get_multi_bleu_score {
625
+ my($foreign_file,$reference_file,$translation_file) = @_;
626
+ my @STAT = stat($translation_file);
627
+ my $current_timestamp = $STAT[9];
628
+ foreach (@mBLEU) {
629
+ my ($file,$time,$score,$g1,$g2,$g3,$g4,$bp) = split;
630
+ if ($file eq $translation_file && $current_timestamp == $time) {
631
+ return ($score,$g1*100,$g2*100,$g3*100,$g4*100,$bp);
632
+ }
633
+ }
634
+
635
+ # load reference translation from reference file
636
+ my @REFERENCE_SENTENCE = `cat $reference_file`; chop(@REFERENCE_SENTENCE);
637
+ my @TRANSLATION_SENTENCE = `cat $translation_file`; chop(@TRANSLATION_SENTENCE);
638
+ my %REF;
639
+ my @FOREIGN_SENTENCE = `cat $foreign_file`; chop(@FOREIGN_SENTENCE);
640
+ for(my $i=0;$i<=$#TRANSLATION_SENTENCE;$i++) {
641
+ push @{$REF{$FOREIGN_SENTENCE[$i]}},$REFERENCE_SENTENCE[$i];
642
+ }
643
+ # load reference translation from translation memory
644
+ foreach my $memory (keys %MEMORY) {
645
+ next if $MEMORY{$memory} ne 'syn_correct sem_correct';
646
+ my ($foreign,$english) = split(/ \.o0O0o\. /,$memory);
647
+ next unless defined($REF{$foreign});
648
+ push @{$REF{$foreign}},$english;
649
+ }
650
+ my(@CORRECT,@TOTAL,$length_translation,$length_reference);
651
+ # compute bleu
652
+ for(my $i=0;$i<=$#TRANSLATION_SENTENCE;$i++) {
653
+ my %REF_NGRAM = ();
654
+ my @WORD = split(/ /,$TRANSLATION_SENTENCE[$i]);
655
+ my $length_translation_this_sentence = scalar(@WORD);
656
+ my ($closest_diff,$closest_length) = (9999,9999);
657
+ foreach my $reference (@{$REF{$FOREIGN_SENTENCE[$i]}}) {
658
+ my @WORD = split(/ /,$reference);
659
+ my $length = scalar(@WORD);
660
+ if (abs($length_translation_this_sentence-$length) < $closest_diff) {
661
+ $closest_diff = abs($length_translation_this_sentence-$length);
662
+ $closest_length = $length;
663
+ }
664
+ for(my $n=1;$n<=4;$n++) {
665
+ my %REF_NGRAM_N = ();
666
+ for(my $start=0;$start<=$#WORD-($n-1);$start++) {
667
+ my $ngram = "$n";
668
+ for(my $w=0;$w<$n;$w++) {
669
+ $ngram .= " ".$WORD[$start+$w];
670
+ }
671
+ $REF_NGRAM_N{$ngram}++;
672
+ }
673
+ foreach my $ngram (keys %REF_NGRAM_N) {
674
+ if (!defined($REF_NGRAM{$ngram}) ||
675
+ $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
676
+ $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
677
+ }
678
+ }
679
+ }
680
+ }
681
+ $length_translation += $length_translation_this_sentence;
682
+ $length_reference += $closest_length;
683
+ for(my $n=1;$n<=4;$n++) {
684
+ my %T_NGRAM = ();
685
+ for(my $start=0;$start<=$#WORD-($n-1);$start++) {
686
+ my $ngram = "$n";
687
+ for(my $w=0;$w<$n;$w++) {
688
+ $ngram .= " ".$WORD[$start+$w];
689
+ }
690
+ $T_NGRAM{$ngram}++;
691
+ }
692
+ foreach my $ngram (keys %T_NGRAM) {
693
+ my $n = 0+$ngram;
694
+ # print "$i e $ngram $T_NGRAM{$ngram}<BR>\n";
695
+ $TOTAL[$n] += $T_NGRAM{$ngram};
696
+ if (defined($REF_NGRAM{$ngram})) {
697
+ if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
698
+ $CORRECT[$n] += $T_NGRAM{$ngram};
699
+ # print "$i e correct1 $T_NGRAM{$ngram}<BR>\n";
700
+ }
701
+ else {
702
+ $CORRECT[$n] += $REF_NGRAM{$ngram};
703
+ # print "$i e correct2 $REF_NGRAM{$ngram}<BR>\n";
704
+ }
705
+ }
706
+ }
707
+ }
708
+ }
709
+ my $brevity_penalty = 1;
710
+ if ($length_translation<$length_reference) {
711
+ $brevity_penalty = exp(1-$length_reference/$length_translation);
712
+ }
713
+ my $bleu = $brevity_penalty * exp((my_log( $CORRECT[1]/$TOTAL[1] ) +
714
+ my_log( $CORRECT[2]/$TOTAL[2] ) +
715
+ my_log( $CORRECT[3]/$TOTAL[3] ) +
716
+ my_log( $CORRECT[4]/$TOTAL[4] ) ) / 4);
717
+
718
+ open(BLEU,">>mbleu-memory.dat");
719
+ @STAT = stat($translation_file);
720
+ printf BLEU "$translation_file $STAT[9] %f %f %f %f %f %f\n",$bleu,$CORRECT[1]/$TOTAL[1],$CORRECT[2]/$TOTAL[2],$CORRECT[3]/$TOTAL[3],$CORRECT[4]/$TOTAL[4],$brevity_penalty;
721
+ close(BLEU);
722
+
723
+ return ($bleu,
724
+ 100*$CORRECT[1]/$TOTAL[1],
725
+ 100*$CORRECT[2]/$TOTAL[2],
726
+ 100*$CORRECT[3]/$TOTAL[3],
727
+ 100*$CORRECT[4]/$TOTAL[4],
728
+ $brevity_penalty);
729
+ }
730
+
731
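+ #log() that tolerates zero counts: returns a large negative number instead of dying on log(0)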
+ sub my_log {
732
+ return -9999999999 unless $_[0];
733
+ return log($_[0]);
734
+ }
735
+
736
+
737
+ ###### SCORE TRANSLATIONS
738
+
739
+ ################################ IN PROGRESS ###############################
740
+ sub compare2
741
+ {
742
+ &htmlhead("Compare Translations");
743
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
744
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
745
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
746
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
747
+ my $corpus = new Corpus('-name' => "$in{CORPUS}", '-descriptions' => \%FILEDESC, '-info_line' => $factorData{$in{CORPUS}});
748
+ $corpus->writeComparisonPage(\*STDOUT, qr/^.*$/);
749
+ print "</FORM>\n";
750
+ }
751
+
752
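+ #compare selected system outputs sentence by sentence, showing only sentences where outputs or stored judgments differ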
+ sub compare {
753
+ &htmlhead("Compare Translations");
754
+ print "<A HREF=\"?ACTION=VIEW_CORPUS&CORPUS=".CGI::escape($in{CORPUS})."\">View Corpus $in{CORPUS}</A><P>\n";
755
+ print "<FORM ACTION=\"\" METHOD=POST>\n";
756
+ print "<INPUT TYPE=HIDDEN NAME=ACTION VALUE=$in{ACTION}>\n";
757
+ print "<INPUT TYPE=HIDDEN NAME=CORPUS VALUE=\"$in{CORPUS}\">\n";
758
+
759
+ # get sentences
760
+ my %SENTENCES;
761
+ my $sentence_count;
762
+ foreach (keys %in) {
763
+ if (/^FILE_(.+)$/) {
764
+ my $file = $1;
765
+ print "<INPUT TYPE=HIDDEN NAME=\"$file\" VALUE=1>\n";
766
+ my @SENTENCES;
767
+ if ($file =~ /.sgm$/) {
768
+ @{$SENTENCES{$file}} = `grep '<seg' $in{CORPUS}.$file`;
769
+ for(my $i=0;$i<=$#{$SENTENCES{$file}};$i++) {
770
+ $SENTENCES{$file}[$i] =~ s/^<seg[^>]+> *(\S.+\S) *<\/seg> *$/$1/;
771
+ }
772
+ }
773
+ else {
774
+ @{$SENTENCES{$file}} = `cat $in{CORPUS}.$1`;
775
+ chop(@{$SENTENCES{$file}});
776
+ }
777
+
778
+ $sentence_count = scalar @{$SENTENCES{$file}};
779
+ }
780
+ }
781
+ my %REFERENCE;
782
+ foreach (@SHOW) {
783
+ if (-e "$in{CORPUS}.$_") {
784
+ @{$REFERENCE{$_}} = `cat $in{CORPUS}.$_`; chop(@{$REFERENCE{$_}});
785
+ }
786
+ }
787
+
788
+ # update memory
789
+ foreach (keys %in) {
790
+ next unless /^SYN_SCORE_(.+)_(\d+)$/;
791
+ next unless $in{"SEM_SCORE_$1_$2"};
792
+ &store_in_memory($REFERENCE{$FOREIGN}[$2],
793
+ $SENTENCES{$1}[$2],
794
+ "syn_".$in{"SYN_SCORE_$1_$2"}." sem_".$in{"SEM_SCORE_$1_$2"});
795
+ }
796
+
797
+ # display sentences
798
+ for(my $i=0;$i<$sentence_count;$i++)
799
+ {
800
+ my $evaluation = "";
801
+ my $show = 0;
802
+ my $surface = "";
803
+ foreach my $file (keys %SENTENCES)
804
+ {
805
+ if ($in{SURFACE}) {
806
+ $SENTENCES{$file}[$i] =~ s/ *$//;
807
+ $surface = $SENTENCES{$file}[$i] if ($surface eq '');
808
+ $show = 1 if ($SENTENCES{$file}[$i] ne $surface);
809
+ }
810
+ else {
811
+ my $this_ev = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES{$file}[$i]);
812
+ $this_ev = "syn_wrong sem_wrong" unless $this_ev;
813
+ $evaluation = $this_ev if ($evaluation eq '');
814
+ $show = 1 if ($evaluation ne $this_ev);
815
+ }
816
+ }
817
+ next unless $show;
818
+ print "<HR>Sentence ".($i+1).":<BR>\n";
819
+ foreach my $ref (@SHOW) {
820
+ if (-e "$in{CORPUS}.$ref") {
821
+ print "<FONT COLOR=$SHOW_COLOR{$ref}>".$REFERENCE{$ref}[$i]."</FONT> (".$FILETYPE{$ref}.")<BR>\n";
822
+ }
823
+ }
824
+ foreach my $file (keys %SENTENCES) {
825
+ print "<B>$SENTENCES{$file}[$i]</B> ($file)<BR>\n";
826
+ &color_highlight_ngrams($i,&nist_normalize_text($SENTENCES{$file}[$i]),$REFERENCE{"e"}[$i]);
827
+ if (0 && $in{WITH_EVAL}) {
828
+ $evaluation = &get_from_memory($REFERENCE{$FOREIGN}[$i],$SENTENCES{$file}[$i]);
829
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$file"."_$i VALUE=correct";
830
+ print " CHECKED" if ($evaluation =~ /syn_correct/);
831
+ print "> perfect English\n";
832
+ print "<INPUT TYPE=RADIO NAME=SYN_SCORE_$file"."_$i VALUE=wrong";
833
+ print " CHECKED" if ($evaluation =~ /syn_wrong/);
834
+ print "> imperfect English<BR>\n";
835
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$file"."_$i VALUE=correct";
836
+ print " CHECKED" if ($evaluation =~ /sem_correct/);
837
+ print "> correct meaning\n";
838
+ print "<INPUT TYPE=RADIO NAME=SEM_SCORE_$file"."_$i VALUE=wrong";
839
+ print " CHECKED" if ($evaluation =~ /sem_wrong/);
840
+ print "> incorrect meaning<BR>\n";
841
+ }
842
+ }
843
+ }
844
+ print "<P><INPUT TYPE=SUBMIT VALUE=\"Add evaluation\">\n";
845
+ print "</FORM>\n";
846
+ }
847
+
848
+ ###### MEMORY SUBS
849
+
850
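+ #load past human judgments from evaluation-memory.dat into %MEMORY, keyed by "foreign .o0O0o. translation"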
+ sub load_memory {
851
+ open(MEMORY,"evaluation-memory.dat") or return;
852
+ while(<MEMORY>) {
853
+ chop;
854
+ my($foreign,$translation,$evaluation) = split(/ \.o0O0o\. /);
855
+ $evaluation = 'syn_correct sem_correct' if ($evaluation eq 'correct');
856
+ $MEMORY{"$foreign .o0O0o. $translation"} = $evaluation;
857
+ }
858
+ close(MEMORY);
859
+ }
860
+
861
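+ #tally remembered judgments for a foreign/translation file pair
+ #return: (fully correct, syntax-only correct, semantics-only correct, wrong, unevaluated) sentence counts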
+ sub get_score_from_memory {
862
+ my($foreign_file,$translation_file) = @_;
863
+ my $unknown=0;
864
+ my $correct=0;
865
+ my $just_syn=0;
866
+ my $just_sem=0;
867
+ my $wrong=0;
868
+ my @FOREIGN = `cat $foreign_file`; chop(@FOREIGN);
869
+ my @TRANSLATION = `cat $translation_file`; chop(@TRANSLATION);
870
+ for(my $i=0;$i<=$#FOREIGN;$i++) {
871
+ if (my $evaluation = &get_from_memory($FOREIGN[$i],$TRANSLATION[$i])) {
872
+ if ($evaluation eq 'syn_correct sem_correct') { $correct++ }
873
+ elsif ($evaluation eq 'syn_correct sem_wrong') { $just_syn++ }
874
+ elsif ($evaluation eq 'syn_wrong sem_correct') { $just_sem++ }
875
+ elsif ($evaluation eq 'syn_wrong sem_wrong') { $wrong++ }
876
+ else { $unknown++; }
877
+ }
878
+ else { $unknown++; }
879
+ }
880
+ return($correct,$just_syn,$just_sem,$wrong,$unknown);
881
+ }
882
+
883
+ sub store_in_memory {
884
+ my($foreign,$translation,$evaluation) = @_;
885
+ &trim(\$translation);
886
+ return if $MEMORY{"$foreign .o0O0o. $translation"} eq $evaluation;
887
+ $MEMORY{"$foreign .o0O0o. $translation"} = $evaluation;
888
+ open(MEMORY,">>evaluation-memory.dat") or die "store_in_memory(): couldn't open 'evaluation-memory.dat' for append\n";
889
+ print MEMORY "$foreign .o0O0o. $translation .o0O0o. $evaluation\n";
890
+ close(MEMORY);
891
+ }
892
+
893
+ sub get_from_memory {
894
+ my($foreign,$translation) = @_;
895
+ &trim(\$translation);
896
+ return $MEMORY{"$foreign .o0O0o. $translation"};
897
+ }
898
+
899
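+ #collapse runs of spaces and strip leading/trailing spaces, in place (takes a scalar reference)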
+ sub trim {
900
+ my($translation) = @_;
901
+ $$translation =~ s/ +/ /g;
902
+ $$translation =~ s/^ +//;
903
+ $$translation =~ s/ +$//;
904
+ }
905
+
906
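+ #read the file-descriptions table into %FILEDESC (file name -> human-readable description)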
+ sub load_descriptions {
907
+ open(FD,"file-descriptions") or die "load_descriptions(): couldn't open 'file-descriptions' for read\n";
908
+ while(<FD>) {
909
+ chomp;
910
+ my($file,$description) = split(/\s+/,$_,2);
911
+ $FILEDESC{$file} = $description;
912
+ }
913
+ close(FD);
914
+ }
915
+
916
+ #read config file giving various corpus config info
917
+ #arguments: filename to read
918
+ #return: hash of corpus names to strings containing formatted info
919
+ sub loadFactorData
920
+ {
921
+ my $filename = shift;
922
+ my %data = ();
923
+ open(INFILE, "<$filename") or die "loadFactorData(): couldn't open '$filename' for read\n";
924
+ while(my $line = <INFILE>)
925
+ {
926
+ if($line =~ /^\#/) {next;} #skip comment lines
927
+ $line =~ /^\s*(\S+)\s*:\s*(\S.*\S)\s*$/;
928
+ my $corpusName = $1;
929
+ $data{$corpusName} = $2;
930
+ }
931
+ close(INFILE);
932
+ return %data;
933
+ }
934
+
935
+ ###### SUBS
936
+
937
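+ #print the CGI content-type header and the shared HTML page header; argument: page title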
+ sub htmlhead {
938
+ print <<"___ENDHTML";
939
+ Content-type: text/html
940
+
941
+ <HTML><HEAD>
942
+ <TITLE>MTEval: $_[0]</TITLE>
943
+ <SCRIPT LANGUAGE="JavaScript">
944
+
945
+ <!-- hide from old browsers
946
+
947
+ function FieldInfo(field,description) {
948
+ popup = window.open("","popDialog","height=500,width=600,scrollbars=yes,resizable=yes");
949
+ popup.document.write("<HTML><HEAD><TITLE>"+field+"</TITLE></HEAD><BODY BGCOLOR=#FFFFCC><CENTER><B>"+field+"</B><HR SIZE=2 NOSHADE></CENTER><PRE>"+description+"</PRE><CENTER><FORM><INPUT TYPE='BUTTON' VALUE='Okay' onClick='self.close()'></FORM><CENTER></BODY></HTML>");
950
+ popup.focus();
951
+ popup.document.close();
952
+ }
953
+
954
+ <!-- done hiding -->
955
+
956
+ </SCRIPT>
957
+ </HEAD>
958
+ <BODY BGCOLOR=white>
959
+ <H2>Evaluation Tool for Machine Translation<BR>$_[0]</H2>
960
+ ___ENDHTML
961
+ }
962
+
963
+
964
+ ############################# parts of cgi-lib.pl
965
+
966
+
967
+ sub ReadParse {
968
+ my ($i, $key, $val);
969
+
970
+ # Read in text
971
+ my $in;
972
+ if (&MethGet) {
973
+ $in = $ENV{'QUERY_STRING'};
974
+ } elsif (&MethPost) {
975
+ read(STDIN,$in,$ENV{'CONTENT_LENGTH'});
976
+ }
977
+
978
+ my @in = split(/[&;]/,$in);
979
+
980
+ foreach $i (0 .. $#in) {
981
+ # Convert plus's to spaces
982
+ $in[$i] =~ s/\+/ /g;
983
+
984
+ # Split into key and value.
985
+ ($key, $val) = split(/=/,$in[$i],2); # splits on the first =.
986
+
987
+ # Convert %XX from hex numbers to alphanumeric
988
+ $key =~ s/%(..)/pack("c",hex($1))/ge;
989
+ $val =~ s/%(..)/pack("c",hex($1))/ge;
990
+
991
+ # Associate key and value
992
+ $in{$key} .= "\0" if (defined($in{$key})); # \0 is the multiple separator
993
+ $in{$key} .= $val;
994
+
995
+ }
996
+
997
+ return scalar(@in);
998
+ }
999
+
1000
+ sub MethGet {
1001
+ return ($ENV{'REQUEST_METHOD'} eq "GET");
1002
+ }
1003
+
1004
+ sub MethPost {
1005
+ return ($ENV{'REQUEST_METHOD'} eq "POST");
1006
+ }
mosesdecoder/scripts/analysis/weight-scan-summarize.sh ADDED
@@ -0,0 +1,79 @@
1
+ #!/bin/bash
2
+ #
3
+ # This file is part of moses. Its use is licensed under the GNU Lesser General
4
+ # Public License version 2.1 or, at your option, any later version.
5
+
6
+ # Hackish summarization of weight-scan.pl results, heavily relies on tools by
7
+ # Ondrej Bojar ([email protected]), some of which need Mercury; beware.
8
+
9
+ function die() { echo "$@" >&2; exit 1; }
10
+ set -o pipefail # safer pipes
11
+
12
+ refs="$1"
13
+ dir="$2"
14
+
15
+ [ -d "$dir" ] && [ -e "$refs" ] \
16
+ || die "usage: $0 ref-file weight-scan-working-dir"
17
+
18
+ testbleu=$HOME/tools/src/obotools/testbleu
19
+ projectbleu=$HOME/tools/src/obotools/projectbleu
20
+
21
+ [ -x "$testbleu" ] || die "Can't run $testbleu"
22
+ [ -x "$projectbleu" ] || die "Can't run $projectbleu"
23
+
24
+ # create exact bleus and put them to bleu.*
25
+ for f in $dir/out.*; do
26
+ bleuf=${f//out./bleu.}
27
+ [ -e "$bleuf" ] \
28
+ || $testbleu $refs < $f | pickre --re='BLEU...([0-9.]*)' > $bleuf \
29
+ || die "Failed to construct $bleuf"
30
+ done
31
+
32
+ # create bleu projections from each best* and put them to corresponding pbleu*
33
+ # first collect all weights
34
+ lcat $dir/weights.* \
35
+ | tr ' ' , \
36
+ | pickre --re='weights.([-0-9.]*)' \
37
+ | cut -f 1,3 \
38
+ | numsort 1 \
39
+ > $dir/allweights
40
+ allwparam=$(cut -f2 $dir/allweights | prefix -- '-w ' | tr '\n' ' ')
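+ # allwparam: one "-w <comma-separated weight vector>" option per scanned setting, passed to projectbleu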
41
+ for f in $dir/best*.*; do
42
+ pbleuf=$(echo $f | sed 's/best[0-9]*/pbleu/')
43
+ if [ ! -e "$pbleuf" ] || [ `wc -l < $pbleuf` -ne `wc -l < $dir/allweights` ]; then
44
+ # need to regenerate the projection
45
+ $projectbleu $refs $allwparam < $f \
46
+ | paste $dir/allweights - \
47
+ | cut -f1,3 \
48
+ > $pbleuf \
49
+ || die "Failed to construct $pbleuf"
50
+ fi
51
+ done
52
+
53
+ # summarize bleu projections
54
+ echo "goal proj/real from was" > $dir/graph.data
55
+ for f in $dir/bleu.*; do
56
+ obs=$(echo $f | sed 's/^.*bleu\.//')
57
+ cat $dir/pbleu.$obs \
58
+ | pickre --re='F: ([0-9.]*)' \
59
+ | recut 2,1 \
60
+ | prefix --tab -- "$obs\tproj" \
61
+ >> $dir/graph.data
62
+ lcat $dir/bleu.$obs \
63
+ | pickre --re='bleu\.([-0-9.]*)' \
64
+ | prefix --tab -- "$obs\treal" \
65
+ | recut 1,2,3,5 \
66
+ >> $dir/graph.data
67
+ done
68
+
69
+
70
+ exit 0
71
+
72
+ ## COMMANDS TO PLOT IT:
73
+ # plot 'walkable' graph of projections at various points
74
+ g=weight-scan-tm_2/graph.data; cat $g | skip 1 | grep real | cut -f2- | numsort 2 | sed 's/real/all/' > cliprealall; skip 1 < $g | numsort 1,3 | split_at_colchange 1 | blockwise "(prefix --tab x cliprealall; cat -) | labelledxychart --data=3,4,0,'',linespoints --blockpivot=2" > clip
75
+
76
+ # plot a combination of projections along with the individual projections and
77
+ # the real scores
78
+ cat best100.-0.100000 best100.-0.500000 best100.-0.300000 best100.-0.200000 | /home/obo/tools/src/obotools/projectbleu ../tune.ref $allwparam | paste allweights - > comb.-0.5_-0.3_-0.2_-0.1
79
+ (lcat pbleu.-0.100000 pbleu.-0.500000 pbleu.-0.300000 pbleu.-0.200000 comb.-0.5_-0.3_-0.2_-0.1 | pickre --re='F: ([0-9.]*)' | recut 2,3,1 ; cat graph.data | skip 1 | grep real | cut -f2- | numsort 2 ) | tee delme | labelledxychart --blockpivot=1 --data=2,3,0,'',linespoints | gpsandbox
mosesdecoder/scripts/ems/web/javascripts/builder.js ADDED
@@ -0,0 +1,136 @@
1
+ // script.aculo.us builder.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
2
+
3
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
4
+ //
5
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
6
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
7
+
8
+ var Builder = {
9
+ NODEMAP: {
10
+ AREA: 'map',
11
+ CAPTION: 'table',
12
+ COL: 'table',
13
+ COLGROUP: 'table',
14
+ LEGEND: 'fieldset',
15
+ OPTGROUP: 'select',
16
+ OPTION: 'select',
17
+ PARAM: 'object',
18
+ TBODY: 'table',
19
+ TD: 'table',
20
+ TFOOT: 'table',
21
+ TH: 'table',
22
+ THEAD: 'table',
23
+ TR: 'table'
24
+ },
25
+ // note: For Firefox < 1.5, OPTION and OPTGROUP tags are currently broken,
26
+ // due to a Firefox bug
27
+ node: function(elementName) {
28
+ elementName = elementName.toUpperCase();
29
+
30
+ // try innerHTML approach
31
+ var parentTag = this.NODEMAP[elementName] || 'div';
32
+ var parentElement = document.createElement(parentTag);
33
+ try { // prevent IE "feature": http://dev.rubyonrails.org/ticket/2707
34
+ parentElement.innerHTML = "<" + elementName + "></" + elementName + ">";
35
+ } catch(e) {}
36
+ var element = parentElement.firstChild || null;
37
+
38
+ // see if browser added wrapping tags
39
+ if(element && (element.tagName.toUpperCase() != elementName))
40
+ element = element.getElementsByTagName(elementName)[0];
41
+
42
+ // fallback to createElement approach
43
+ if(!element) element = document.createElement(elementName);
44
+
45
+ // abort if nothing could be created
46
+ if(!element) return;
47
+
48
+ // attributes (or text)
49
+ if(arguments[1])
50
+ if(this._isStringOrNumber(arguments[1]) ||
51
+ (arguments[1] instanceof Array) ||
52
+ arguments[1].tagName) {
53
+ this._children(element, arguments[1]);
54
+ } else {
55
+ var attrs = this._attributes(arguments[1]);
56
+ if(attrs.length) {
57
+ try { // prevent IE "feature": http://dev.rubyonrails.org/ticket/2707
58
+ parentElement.innerHTML = "<" +elementName + " " +
59
+ attrs + "></" + elementName + ">";
60
+ } catch(e) {}
61
+ element = parentElement.firstChild || null;
62
+ // workaround firefox 1.0.X bug
63
+ if(!element) {
64
+ element = document.createElement(elementName);
65
+ for(attr in arguments[1])
66
+ element[attr == 'class' ? 'className' : attr] = arguments[1][attr];
67
+ }
68
+ if(element.tagName.toUpperCase() != elementName)
69
+ element = parentElement.getElementsByTagName(elementName)[0];
70
+ }
71
+ }
72
+
73
+ // text, or array of children
74
+ if(arguments[2])
75
+ this._children(element, arguments[2]);
76
+
77
+ return $(element);
78
+ },
79
+ _text: function(text) {
80
+ return document.createTextNode(text);
81
+ },
82
+
83
+ ATTR_MAP: {
84
+ 'className': 'class',
85
+ 'htmlFor': 'for'
86
+ },
87
+
88
+ _attributes: function(attributes) {
89
+ var attrs = [];
90
+ for(attribute in attributes)
91
+ attrs.push((attribute in this.ATTR_MAP ? this.ATTR_MAP[attribute] : attribute) +
92
+ '="' + attributes[attribute].toString().escapeHTML().gsub(/"/,'&quot;') + '"');
93
+ return attrs.join(" ");
94
+ },
95
+ _children: function(element, children) {
96
+ if(children.tagName) {
97
+ element.appendChild(children);
98
+ return;
99
+ }
100
+ if(typeof children=='object') { // array can hold nodes and text
101
+ children.flatten().each( function(e) {
102
+ if(typeof e=='object')
103
+ element.appendChild(e);
104
+ else
105
+ if(Builder._isStringOrNumber(e))
106
+ element.appendChild(Builder._text(e));
107
+ });
108
+ } else
109
+ if(Builder._isStringOrNumber(children))
110
+ element.appendChild(Builder._text(children));
111
+ },
112
+ _isStringOrNumber: function(param) {
113
+ return(typeof param=='string' || typeof param=='number');
114
+ },
115
+ build: function(html) {
116
+ var element = this.node('div');
117
+ $(element).update(html.strip());
118
+ return element.down();
119
+ },
120
+ dump: function(scope) {
121
+ if(typeof scope != 'object' && typeof scope != 'function') scope = window; //global scope
122
+
123
+ var tags = ("A ABBR ACRONYM ADDRESS APPLET AREA B BASE BASEFONT BDO BIG BLOCKQUOTE BODY " +
124
+ "BR BUTTON CAPTION CENTER CITE CODE COL COLGROUP DD DEL DFN DIR DIV DL DT EM FIELDSET " +
125
+ "FONT FORM FRAME FRAMESET H1 H2 H3 H4 H5 H6 HEAD HR HTML I IFRAME IMG INPUT INS ISINDEX "+
126
+ "KBD LABEL LEGEND LI LINK MAP MENU META NOFRAMES NOSCRIPT OBJECT OL OPTGROUP OPTION P "+
127
+ "PARAM PRE Q S SAMP SCRIPT SELECT SMALL SPAN STRIKE STRONG STYLE SUB SUP TABLE TBODY TD "+
128
+ "TEXTAREA TFOOT TH THEAD TITLE TR TT U UL VAR").split(/\s+/);
129
+
130
+ tags.each( function(tag){
131
+ scope[tag] = function() {
132
+ return Builder.node.apply(Builder, [tag].concat($A(arguments)));
133
+ };
134
+ });
135
+ }
136
+ };
mosesdecoder/scripts/ems/web/javascripts/dragdrop.js ADDED
@@ -0,0 +1,974 @@
1
+ // script.aculo.us dragdrop.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
2
+
3
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
4
+ //
5
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
6
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
7
+
8
+ if(Object.isUndefined(Effect))
9
+ throw("dragdrop.js requires including script.aculo.us' effects.js library");
10
+
11
+ var Droppables = {
12
+ drops: [],
13
+
14
+ remove: function(element) {
15
+ this.drops = this.drops.reject(function(d) { return d.element==$(element) });
16
+ },
17
+
18
+ add: function(element) {
19
+ element = $(element);
20
+ var options = Object.extend({
21
+ greedy: true,
22
+ hoverclass: null,
23
+ tree: false
24
+ }, arguments[1] || { });
25
+
26
+ // cache containers
27
+ if(options.containment) {
28
+ options._containers = [];
29
+ var containment = options.containment;
30
+ if(Object.isArray(containment)) {
31
+ containment.each( function(c) { options._containers.push($(c)) });
32
+ } else {
33
+ options._containers.push($(containment));
34
+ }
35
+ }
36
+
37
+ if(options.accept) options.accept = [options.accept].flatten();
38
+
39
+ Element.makePositioned(element); // fix IE
40
+ options.element = element;
41
+
42
+ this.drops.push(options);
43
+ },
44
+
45
+ findDeepestChild: function(drops) {
46
+ deepest = drops[0];
47
+
48
+ for (i = 1; i < drops.length; ++i)
49
+ if (Element.isParent(drops[i].element, deepest.element))
50
+ deepest = drops[i];
51
+
52
+ return deepest;
53
+ },
54
+
55
+ isContained: function(element, drop) {
56
+ var containmentNode;
57
+ if(drop.tree) {
58
+ containmentNode = element.treeNode;
59
+ } else {
60
+ containmentNode = element.parentNode;
61
+ }
62
+ return drop._containers.detect(function(c) { return containmentNode == c });
63
+ },
64
+
65
+ isAffected: function(point, element, drop) {
66
+ return (
67
+ (drop.element!=element) &&
68
+ ((!drop._containers) ||
69
+ this.isContained(element, drop)) &&
70
+ ((!drop.accept) ||
71
+ (Element.classNames(element).detect(
72
+ function(v) { return drop.accept.include(v) } ) )) &&
73
+ Position.within(drop.element, point[0], point[1]) );
74
+ },
75
+
76
+ deactivate: function(drop) {
77
+ if(drop.hoverclass)
78
+ Element.removeClassName(drop.element, drop.hoverclass);
79
+ this.last_active = null;
80
+ },
81
+
82
+ activate: function(drop) {
83
+ if(drop.hoverclass)
84
+ Element.addClassName(drop.element, drop.hoverclass);
85
+ this.last_active = drop;
86
+ },
87
+
88
+ show: function(point, element) {
89
+ if(!this.drops.length) return;
90
+ var drop, affected = [];
91
+
92
+ this.drops.each( function(drop) {
93
+ if(Droppables.isAffected(point, element, drop))
94
+ affected.push(drop);
95
+ });
96
+
97
+ if(affected.length>0)
98
+ drop = Droppables.findDeepestChild(affected);
99
+
100
+ if(this.last_active && this.last_active != drop) this.deactivate(this.last_active);
101
+ if (drop) {
102
+ Position.within(drop.element, point[0], point[1]);
103
+ if(drop.onHover)
104
+ drop.onHover(element, drop.element, Position.overlap(drop.overlap, drop.element));
105
+
106
+ if (drop != this.last_active) Droppables.activate(drop);
107
+ }
108
+ },
109
+
110
+ fire: function(event, element) {
111
+ if(!this.last_active) return;
112
+ Position.prepare();
113
+
114
+ if (this.isAffected([Event.pointerX(event), Event.pointerY(event)], element, this.last_active))
115
+ if (this.last_active.onDrop) {
116
+ this.last_active.onDrop(element, this.last_active.element, event);
117
+ return true;
118
+ }
119
+ },
120
+
121
+ reset: function() {
122
+ if(this.last_active)
123
+ this.deactivate(this.last_active);
124
+ }
125
+ };
126
+
127
+ var Draggables = {
128
+ drags: [],
129
+ observers: [],
130
+
131
+ register: function(draggable) {
132
+ if(this.drags.length == 0) {
133
+ this.eventMouseUp = this.endDrag.bindAsEventListener(this);
134
+ this.eventMouseMove = this.updateDrag.bindAsEventListener(this);
135
+ this.eventKeypress = this.keyPress.bindAsEventListener(this);
136
+
137
+ Event.observe(document, "mouseup", this.eventMouseUp);
138
+ Event.observe(document, "mousemove", this.eventMouseMove);
139
+ Event.observe(document, "keypress", this.eventKeypress);
140
+ }
141
+ this.drags.push(draggable);
142
+ },
143
+
144
+ unregister: function(draggable) {
145
+ this.drags = this.drags.reject(function(d) { return d==draggable });
146
+ if(this.drags.length == 0) {
147
+ Event.stopObserving(document, "mouseup", this.eventMouseUp);
148
+ Event.stopObserving(document, "mousemove", this.eventMouseMove);
149
+ Event.stopObserving(document, "keypress", this.eventKeypress);
150
+ }
151
+ },
152
+
153
+ activate: function(draggable) {
154
+ if(draggable.options.delay) {
155
+ this._timeout = setTimeout(function() {
156
+ Draggables._timeout = null;
157
+ window.focus();
158
+ Draggables.activeDraggable = draggable;
159
+ }.bind(this), draggable.options.delay);
160
+ } else {
161
+ window.focus(); // allows keypress events if window isn't currently focused, fails for Safari
162
+ this.activeDraggable = draggable;
163
+ }
164
+ },
165
+
166
+ deactivate: function() {
167
+ this.activeDraggable = null;
168
+ },
169
+
170
+ updateDrag: function(event) {
171
+ if(!this.activeDraggable) return;
172
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
173
+ // Mozilla-based browsers fire successive mousemove events with
174
+ // the same coordinates, prevent needless redrawing (moz bug?)
175
+ if(this._lastPointer && (this._lastPointer.inspect() == pointer.inspect())) return;
176
+ this._lastPointer = pointer;
177
+
178
+ this.activeDraggable.updateDrag(event, pointer);
179
+ },
180
+
181
+ endDrag: function(event) {
182
+ if(this._timeout) {
183
+ clearTimeout(this._timeout);
184
+ this._timeout = null;
185
+ }
186
+ if(!this.activeDraggable) return;
187
+ this._lastPointer = null;
188
+ this.activeDraggable.endDrag(event);
189
+ this.activeDraggable = null;
190
+ },
191
+
192
+ keyPress: function(event) {
193
+ if(this.activeDraggable)
194
+ this.activeDraggable.keyPress(event);
195
+ },
196
+
197
+ addObserver: function(observer) {
198
+ this.observers.push(observer);
199
+ this._cacheObserverCallbacks();
200
+ },
201
+
202
+ removeObserver: function(element) { // element instead of observer fixes mem leaks
203
+ this.observers = this.observers.reject( function(o) { return o.element==element });
204
+ this._cacheObserverCallbacks();
205
+ },
206
+
207
+ notify: function(eventName, draggable, event) { // 'onStart', 'onEnd', 'onDrag'
208
+ if(this[eventName+'Count'] > 0)
209
+ this.observers.each( function(o) {
210
+ if(o[eventName]) o[eventName](eventName, draggable, event);
211
+ });
212
+ if(draggable.options[eventName]) draggable.options[eventName](draggable, event);
213
+ },
214
+
215
+ _cacheObserverCallbacks: function() {
216
+ ['onStart','onEnd','onDrag'].each( function(eventName) {
217
+ Draggables[eventName+'Count'] = Draggables.observers.select(
218
+ function(o) { return o[eventName]; }
219
+ ).length;
220
+ });
221
+ }
222
+ };
223
+
224
+ /*--------------------------------------------------------------------------*/
225
+
226
+ var Draggable = Class.create({
227
+ initialize: function(element) {
228
+ var defaults = {
229
+ handle: false,
230
+ reverteffect: function(element, top_offset, left_offset) {
231
+ var dur = Math.sqrt(Math.abs(top_offset^2)+Math.abs(left_offset^2))*0.02;
232
+ new Effect.Move(element, { x: -left_offset, y: -top_offset, duration: dur,
233
+ queue: {scope:'_draggable', position:'end'}
234
+ });
235
+ },
236
+ endeffect: function(element) {
237
+ var toOpacity = Object.isNumber(element._opacity) ? element._opacity : 1.0;
238
+ new Effect.Opacity(element, {duration:0.2, from:0.7, to:toOpacity,
239
+ queue: {scope:'_draggable', position:'end'},
240
+ afterFinish: function(){
241
+ Draggable._dragging[element] = false
242
+ }
243
+ });
244
+ },
245
+ zindex: 1000,
246
+ revert: false,
247
+ quiet: false,
248
+ scroll: false,
249
+ scrollSensitivity: 20,
250
+ scrollSpeed: 15,
251
+ snap: false, // false, or xy or [x,y] or function(x,y){ return [x,y] }
252
+ delay: 0
253
+ };
254
+
255
+ if(!arguments[1] || Object.isUndefined(arguments[1].endeffect))
256
+ Object.extend(defaults, {
257
+ starteffect: function(element) {
258
+ element._opacity = Element.getOpacity(element);
259
+ Draggable._dragging[element] = true;
260
+ new Effect.Opacity(element, {duration:0.2, from:element._opacity, to:0.7});
261
+ }
262
+ });
263
+
264
+ var options = Object.extend(defaults, arguments[1] || { });
265
+
266
+ this.element = $(element);
267
+
268
+ if(options.handle && Object.isString(options.handle))
269
+ this.handle = this.element.down('.'+options.handle, 0);
270
+
271
+ if(!this.handle) this.handle = $(options.handle);
272
+ if(!this.handle) this.handle = this.element;
273
+
274
+ if(options.scroll && !options.scroll.scrollTo && !options.scroll.outerHTML) {
275
+ options.scroll = $(options.scroll);
276
+ this._isScrollChild = Element.childOf(this.element, options.scroll);
277
+ }
278
+
279
+ Element.makePositioned(this.element); // fix IE
280
+
281
+ this.options = options;
282
+ this.dragging = false;
283
+
284
+ this.eventMouseDown = this.initDrag.bindAsEventListener(this);
285
+ Event.observe(this.handle, "mousedown", this.eventMouseDown);
286
+
287
+ Draggables.register(this);
288
+ },
289
+
290
+ destroy: function() {
291
+ Event.stopObserving(this.handle, "mousedown", this.eventMouseDown);
292
+ Draggables.unregister(this);
293
+ },
294
+
295
+ currentDelta: function() {
296
+ return([
297
+ parseInt(Element.getStyle(this.element,'left') || '0'),
298
+ parseInt(Element.getStyle(this.element,'top') || '0')]);
299
+ },
300
+
301
+ initDrag: function(event) {
302
+ if(!Object.isUndefined(Draggable._dragging[this.element]) &&
303
+ Draggable._dragging[this.element]) return;
304
+ if(Event.isLeftClick(event)) {
305
+ // abort on form elements, fixes a Firefox issue
306
+ var src = Event.element(event);
307
+ if((tag_name = src.tagName.toUpperCase()) && (
308
+ tag_name=='INPUT' ||
309
+ tag_name=='SELECT' ||
310
+ tag_name=='OPTION' ||
311
+ tag_name=='BUTTON' ||
312
+ tag_name=='TEXTAREA')) return;
313
+
314
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
315
+ var pos = this.element.cumulativeOffset();
316
+ this.offset = [0,1].map( function(i) { return (pointer[i] - pos[i]) });
317
+
318
+ Draggables.activate(this);
319
+ Event.stop(event);
320
+ }
321
+ },
322
+
323
+ startDrag: function(event) {
324
+ this.dragging = true;
325
+ if(!this.delta)
326
+ this.delta = this.currentDelta();
327
+
328
+ if(this.options.zindex) {
329
+ this.originalZ = parseInt(Element.getStyle(this.element,'z-index') || 0);
330
+ this.element.style.zIndex = this.options.zindex;
331
+ }
332
+
333
+ if(this.options.ghosting) {
334
+ this._clone = this.element.cloneNode(true);
335
+ this._originallyAbsolute = (this.element.getStyle('position') == 'absolute');
336
+ if (!this._originallyAbsolute)
337
+ Position.absolutize(this.element);
338
+ this.element.parentNode.insertBefore(this._clone, this.element);
339
+ }
340
+
341
+ if(this.options.scroll) {
342
+ if (this.options.scroll == window) {
343
+ var where = this._getWindowScroll(this.options.scroll);
344
+ this.originalScrollLeft = where.left;
345
+ this.originalScrollTop = where.top;
346
+ } else {
347
+ this.originalScrollLeft = this.options.scroll.scrollLeft;
348
+ this.originalScrollTop = this.options.scroll.scrollTop;
349
+ }
350
+ }
351
+
352
+ Draggables.notify('onStart', this, event);
353
+
354
+ if(this.options.starteffect) this.options.starteffect(this.element);
355
+ },
356
+
357
+ updateDrag: function(event, pointer) {
358
+ if(!this.dragging) this.startDrag(event);
359
+
360
+ if(!this.options.quiet){
361
+ Position.prepare();
362
+ Droppables.show(pointer, this.element);
363
+ }
364
+
365
+ Draggables.notify('onDrag', this, event);
366
+
367
+ this.draw(pointer);
368
+ if(this.options.change) this.options.change(this);
369
+
370
+ if(this.options.scroll) {
371
+ this.stopScrolling();
372
+
373
+ var p;
374
+ if (this.options.scroll == window) {
375
+ with(this._getWindowScroll(this.options.scroll)) { p = [ left, top, left+width, top+height ]; }
376
+ } else {
377
+ p = Position.page(this.options.scroll);
378
+ p[0] += this.options.scroll.scrollLeft + Position.deltaX;
379
+ p[1] += this.options.scroll.scrollTop + Position.deltaY;
380
+ p.push(p[0]+this.options.scroll.offsetWidth);
381
+ p.push(p[1]+this.options.scroll.offsetHeight);
382
+ }
383
+ var speed = [0,0];
384
+ if(pointer[0] < (p[0]+this.options.scrollSensitivity)) speed[0] = pointer[0]-(p[0]+this.options.scrollSensitivity);
385
+ if(pointer[1] < (p[1]+this.options.scrollSensitivity)) speed[1] = pointer[1]-(p[1]+this.options.scrollSensitivity);
386
+ if(pointer[0] > (p[2]-this.options.scrollSensitivity)) speed[0] = pointer[0]-(p[2]-this.options.scrollSensitivity);
387
+ if(pointer[1] > (p[3]-this.options.scrollSensitivity)) speed[1] = pointer[1]-(p[3]-this.options.scrollSensitivity);
388
+ this.startScrolling(speed);
389
+ }
390
+
391
+ // fix AppleWebKit rendering
392
+ if(Prototype.Browser.WebKit) window.scrollBy(0,0);
393
+
394
+ Event.stop(event);
395
+ },
396
+
397
+ finishDrag: function(event, success) {
398
+ this.dragging = false;
399
+
400
+ if(this.options.quiet){
401
+ Position.prepare();
402
+ var pointer = [Event.pointerX(event), Event.pointerY(event)];
403
+ Droppables.show(pointer, this.element);
404
+ }
405
+
406
+ if(this.options.ghosting) {
407
+ if (!this._originallyAbsolute)
408
+ Position.relativize(this.element);
409
+ delete this._originallyAbsolute;
410
+ Element.remove(this._clone);
411
+ this._clone = null;
412
+ }
413
+
414
+ var dropped = false;
415
+ if(success) {
416
+ dropped = Droppables.fire(event, this.element);
417
+ if (!dropped) dropped = false;
418
+ }
419
+ if(dropped && this.options.onDropped) this.options.onDropped(this.element);
420
+ Draggables.notify('onEnd', this, event);
421
+
422
+ var revert = this.options.revert;
423
+ if(revert && Object.isFunction(revert)) revert = revert(this.element);
424
+
425
+ var d = this.currentDelta();
426
+ if(revert && this.options.reverteffect) {
427
+ if (dropped == 0 || revert != 'failure')
428
+ this.options.reverteffect(this.element,
429
+ d[1]-this.delta[1], d[0]-this.delta[0]);
430
+ } else {
431
+ this.delta = d;
432
+ }
433
+
434
+ if(this.options.zindex)
435
+ this.element.style.zIndex = this.originalZ;
436
+
437
+ if(this.options.endeffect)
438
+ this.options.endeffect(this.element);
439
+
440
+ Draggables.deactivate(this);
441
+ Droppables.reset();
442
+ },
443
+
444
+ keyPress: function(event) {
445
+ if(event.keyCode!=Event.KEY_ESC) return;
446
+ this.finishDrag(event, false);
447
+ Event.stop(event);
448
+ },
449
+
450
+ endDrag: function(event) {
451
+ if(!this.dragging) return;
452
+ this.stopScrolling();
453
+ this.finishDrag(event, true);
454
+ Event.stop(event);
455
+ },
456
+
457
+ draw: function(point) {
458
+ var pos = this.element.cumulativeOffset();
459
+ if(this.options.ghosting) {
460
+ var r = Position.realOffset(this.element);
461
+ pos[0] += r[0] - Position.deltaX; pos[1] += r[1] - Position.deltaY;
462
+ }
463
+
464
+ var d = this.currentDelta();
465
+ pos[0] -= d[0]; pos[1] -= d[1];
466
+
467
+ if(this.options.scroll && (this.options.scroll != window && this._isScrollChild)) {
468
+ pos[0] -= this.options.scroll.scrollLeft-this.originalScrollLeft;
469
+ pos[1] -= this.options.scroll.scrollTop-this.originalScrollTop;
470
+ }
471
+
472
+ var p = [0,1].map(function(i){
473
+ return (point[i]-pos[i]-this.offset[i])
474
+ }.bind(this));
475
+
476
+ if(this.options.snap) {
477
+ if(Object.isFunction(this.options.snap)) {
478
+ p = this.options.snap(p[0],p[1],this);
479
+ } else {
480
+ if(Object.isArray(this.options.snap)) {
481
+ p = p.map( function(v, i) {
482
+ return (v/this.options.snap[i]).round()*this.options.snap[i] }.bind(this));
483
+ } else {
484
+ p = p.map( function(v) {
485
+ return (v/this.options.snap).round()*this.options.snap }.bind(this));
486
+ }
487
+ }}
488
+
489
+ var style = this.element.style;
490
+ if((!this.options.constraint) || (this.options.constraint=='horizontal'))
491
+ style.left = p[0] + "px";
492
+ if((!this.options.constraint) || (this.options.constraint=='vertical'))
493
+ style.top = p[1] + "px";
494
+
495
+ if(style.visibility=="hidden") style.visibility = ""; // fix gecko rendering
496
+ },
497
+
498
+ stopScrolling: function() {
499
+ if(this.scrollInterval) {
500
+ clearInterval(this.scrollInterval);
501
+ this.scrollInterval = null;
502
+ Draggables._lastScrollPointer = null;
503
+ }
504
+ },
505
+
506
+ startScrolling: function(speed) {
507
+ if(!(speed[0] || speed[1])) return;
508
+ this.scrollSpeed = [speed[0]*this.options.scrollSpeed,speed[1]*this.options.scrollSpeed];
509
+ this.lastScrolled = new Date();
510
+ this.scrollInterval = setInterval(this.scroll.bind(this), 10);
511
+ },
512
+
513
+ scroll: function() {
514
+ var current = new Date();
515
+ var delta = current - this.lastScrolled;
516
+ this.lastScrolled = current;
517
+ if(this.options.scroll == window) {
518
+ with (this._getWindowScroll(this.options.scroll)) {
519
+ if (this.scrollSpeed[0] || this.scrollSpeed[1]) {
520
+ var d = delta / 1000;
521
+ this.options.scroll.scrollTo( left + d*this.scrollSpeed[0], top + d*this.scrollSpeed[1] );
522
+ }
523
+ }
524
+ } else {
525
+ this.options.scroll.scrollLeft += this.scrollSpeed[0] * delta / 1000;
526
+ this.options.scroll.scrollTop += this.scrollSpeed[1] * delta / 1000;
527
+ }
528
+
529
+ Position.prepare();
530
+ Droppables.show(Draggables._lastPointer, this.element);
531
+ Draggables.notify('onDrag', this);
532
+ if (this._isScrollChild) {
533
+ Draggables._lastScrollPointer = Draggables._lastScrollPointer || $A(Draggables._lastPointer);
534
+ Draggables._lastScrollPointer[0] += this.scrollSpeed[0] * delta / 1000;
535
+ Draggables._lastScrollPointer[1] += this.scrollSpeed[1] * delta / 1000;
536
+ if (Draggables._lastScrollPointer[0] < 0)
537
+ Draggables._lastScrollPointer[0] = 0;
538
+ if (Draggables._lastScrollPointer[1] < 0)
539
+ Draggables._lastScrollPointer[1] = 0;
540
+ this.draw(Draggables._lastScrollPointer);
541
+ }
542
+
543
+ if(this.options.change) this.options.change(this);
544
+ },
545
+
546
+ _getWindowScroll: function(w) {
547
+ var T, L, W, H;
548
+ with (w.document) {
549
+ if (w.document.documentElement && documentElement.scrollTop) {
550
+ T = documentElement.scrollTop;
551
+ L = documentElement.scrollLeft;
552
+ } else if (w.document.body) {
553
+ T = body.scrollTop;
554
+ L = body.scrollLeft;
555
+ }
556
+ if (w.innerWidth) {
557
+ W = w.innerWidth;
558
+ H = w.innerHeight;
559
+ } else if (w.document.documentElement && documentElement.clientWidth) {
560
+ W = documentElement.clientWidth;
561
+ H = documentElement.clientHeight;
562
+ } else {
563
+ W = body.offsetWidth;
564
+ H = body.offsetHeight;
565
+ }
566
+ }
567
+ return { top: T, left: L, width: W, height: H };
568
+ }
569
+ });
570
+
571
+ Draggable._dragging = { };
572
+
573
+ /*--------------------------------------------------------------------------*/
574
+
575
+ var SortableObserver = Class.create({
576
+ initialize: function(element, observer) {
577
+ this.element = $(element);
578
+ this.observer = observer;
579
+ this.lastValue = Sortable.serialize(this.element);
580
+ },
581
+
582
+ onStart: function() {
583
+ this.lastValue = Sortable.serialize(this.element);
584
+ },
585
+
586
+ onEnd: function() {
587
+ Sortable.unmark();
588
+ if(this.lastValue != Sortable.serialize(this.element))
589
+ this.observer(this.element)
590
+ }
591
+ });
592
+
593
+ var Sortable = {
594
+ SERIALIZE_RULE: /^[^_\-](?:[A-Za-z0-9\-\_]*)[_](.*)$/,
595
+
596
+ sortables: { },
597
+
598
+ _findRootElement: function(element) {
599
+ while (element.tagName.toUpperCase() != "BODY") {
600
+ if(element.id && Sortable.sortables[element.id]) return element;
601
+ element = element.parentNode;
602
+ }
603
+ },
604
+
605
+ options: function(element) {
606
+ element = Sortable._findRootElement($(element));
607
+ if(!element) return;
608
+ return Sortable.sortables[element.id];
609
+ },
610
+
611
+ destroy: function(element){
612
+ element = $(element);
613
+ var s = Sortable.sortables[element.id];
614
+
615
+ if(s) {
616
+ Draggables.removeObserver(s.element);
617
+ s.droppables.each(function(d){ Droppables.remove(d) });
618
+ s.draggables.invoke('destroy');
619
+
620
+ delete Sortable.sortables[s.element.id];
621
+ }
622
+ },
623
+
624
+ create: function(element) {
625
+ element = $(element);
626
+ var options = Object.extend({
627
+ element: element,
628
+ tag: 'li', // assumes li children, override with tag: 'tagname'
629
+ dropOnEmpty: false,
630
+ tree: false,
631
+ treeTag: 'ul',
632
+ overlap: 'vertical', // one of 'vertical', 'horizontal'
633
+ constraint: 'vertical', // one of 'vertical', 'horizontal', false
634
+ containment: element, // also takes array of elements (or id's); or false
635
+ handle: false, // or a CSS class
636
+ only: false,
637
+ delay: 0,
638
+ hoverclass: null,
639
+ ghosting: false,
640
+ quiet: false,
641
+ scroll: false,
642
+ scrollSensitivity: 20,
643
+ scrollSpeed: 15,
644
+ format: this.SERIALIZE_RULE,
645
+
646
+ // these take arrays of elements or ids and can be
647
+ // used for better initialization performance
648
+ elements: false,
649
+ handles: false,
650
+
651
+ onChange: Prototype.emptyFunction,
652
+ onUpdate: Prototype.emptyFunction
653
+ }, arguments[1] || { });
654
+
655
+ // clear any old sortable with same element
656
+ this.destroy(element);
657
+
658
+ // build options for the draggables
659
+ var options_for_draggable = {
660
+ revert: true,
661
+ quiet: options.quiet,
662
+ scroll: options.scroll,
663
+ scrollSpeed: options.scrollSpeed,
664
+ scrollSensitivity: options.scrollSensitivity,
665
+ delay: options.delay,
666
+ ghosting: options.ghosting,
667
+ constraint: options.constraint,
668
+ handle: options.handle };
669
+
670
+ if(options.starteffect)
671
+ options_for_draggable.starteffect = options.starteffect;
672
+
673
+ if(options.reverteffect)
674
+ options_for_draggable.reverteffect = options.reverteffect;
675
+ else
676
+ if(options.ghosting) options_for_draggable.reverteffect = function(element) {
677
+ element.style.top = 0;
678
+ element.style.left = 0;
679
+ };
680
+
681
+ if(options.endeffect)
682
+ options_for_draggable.endeffect = options.endeffect;
683
+
684
+ if(options.zindex)
685
+ options_for_draggable.zindex = options.zindex;
686
+
687
+ // build options for the droppables
688
+ var options_for_droppable = {
689
+ overlap: options.overlap,
690
+ containment: options.containment,
691
+ tree: options.tree,
692
+ hoverclass: options.hoverclass,
693
+ onHover: Sortable.onHover
694
+ };
695
+
696
+ var options_for_tree = {
697
+ onHover: Sortable.onEmptyHover,
698
+ overlap: options.overlap,
699
+ containment: options.containment,
700
+ hoverclass: options.hoverclass
701
+ };
702
+
703
+ // fix for gecko engine
704
+ Element.cleanWhitespace(element);
705
+
706
+ options.draggables = [];
707
+ options.droppables = [];
708
+
709
+ // drop on empty handling
710
+ if(options.dropOnEmpty || options.tree) {
711
+ Droppables.add(element, options_for_tree);
712
+ options.droppables.push(element);
713
+ }
714
+
715
+ (options.elements || this.findElements(element, options) || []).each( function(e,i) {
716
+ var handle = options.handles ? $(options.handles[i]) :
717
+ (options.handle ? $(e).select('.' + options.handle)[0] : e);
718
+ options.draggables.push(
719
+ new Draggable(e, Object.extend(options_for_draggable, { handle: handle })));
720
+ Droppables.add(e, options_for_droppable);
721
+ if(options.tree) e.treeNode = element;
722
+ options.droppables.push(e);
723
+ });
724
+
725
+ if(options.tree) {
726
+ (Sortable.findTreeElements(element, options) || []).each( function(e) {
727
+ Droppables.add(e, options_for_tree);
728
+ e.treeNode = element;
729
+ options.droppables.push(e);
730
+ });
731
+ }
732
+
733
+ // keep reference
734
+ this.sortables[element.identify()] = options;
735
+
736
+ // for onupdate
737
+ Draggables.addObserver(new SortableObserver(element, options.onUpdate));
738
+
739
+ },
740
+
741
+ // return all suitable-for-sortable elements in a guaranteed order
742
+ findElements: function(element, options) {
743
+ return Element.findChildren(
744
+ element, options.only, options.tree ? true : false, options.tag);
745
+ },
746
+
747
+ findTreeElements: function(element, options) {
748
+ return Element.findChildren(
749
+ element, options.only, options.tree ? true : false, options.treeTag);
750
+ },
751
+
752
+ onHover: function(element, dropon, overlap) {
753
+ if(Element.isParent(dropon, element)) return;
754
+
755
+ if(overlap > .33 && overlap < .66 && Sortable.options(dropon).tree) {
756
+ return;
757
+ } else if(overlap>0.5) {
758
+ Sortable.mark(dropon, 'before');
759
+ if(dropon.previousSibling != element) {
760
+ var oldParentNode = element.parentNode;
761
+ element.style.visibility = "hidden"; // fix gecko rendering
762
+ dropon.parentNode.insertBefore(element, dropon);
763
+ if(dropon.parentNode!=oldParentNode)
764
+ Sortable.options(oldParentNode).onChange(element);
765
+ Sortable.options(dropon.parentNode).onChange(element);
766
+ }
767
+ } else {
768
+ Sortable.mark(dropon, 'after');
769
+ var nextElement = dropon.nextSibling || null;
770
+ if(nextElement != element) {
771
+ var oldParentNode = element.parentNode;
772
+ element.style.visibility = "hidden"; // fix gecko rendering
773
+ dropon.parentNode.insertBefore(element, nextElement);
774
+ if(dropon.parentNode!=oldParentNode)
775
+ Sortable.options(oldParentNode).onChange(element);
776
+ Sortable.options(dropon.parentNode).onChange(element);
777
+ }
778
+ }
779
+ },
780
+
781
+ onEmptyHover: function(element, dropon, overlap) {
782
+ var oldParentNode = element.parentNode;
783
+ var droponOptions = Sortable.options(dropon);
784
+
785
+ if(!Element.isParent(dropon, element)) {
786
+ var index;
787
+
788
+ var children = Sortable.findElements(dropon, {tag: droponOptions.tag, only: droponOptions.only});
789
+ var child = null;
790
+
791
+ if(children) {
792
+ var offset = Element.offsetSize(dropon, droponOptions.overlap) * (1.0 - overlap);
793
+
794
+ for (index = 0; index < children.length; index += 1) {
795
+ if (offset - Element.offsetSize (children[index], droponOptions.overlap) >= 0) {
796
+ offset -= Element.offsetSize (children[index], droponOptions.overlap);
797
+           } else if (offset - (Element.offsetSize (children[index], droponOptions.overlap) / 2) >= 0) {
+             child = index + 1 < children.length ? children[index + 1] : null;
+             break;
+           } else {
+             child = children[index];
+             break;
+           }
+         }
+       }
+
+       dropon.insertBefore(element, child);
+
+       Sortable.options(oldParentNode).onChange(element);
+       droponOptions.onChange(element);
+     }
+   },
+
+   unmark: function() {
+     if(Sortable._marker) Sortable._marker.hide();
+   },
+
+   mark: function(dropon, position) {
+     // mark on ghosting only
+     var sortable = Sortable.options(dropon.parentNode);
+     if(sortable && !sortable.ghosting) return;
+
+     if(!Sortable._marker) {
+       Sortable._marker =
+         ($('dropmarker') || Element.extend(document.createElement('DIV'))).
+           hide().addClassName('dropmarker').setStyle({position:'absolute'});
+       document.getElementsByTagName("body").item(0).appendChild(Sortable._marker);
+     }
+     var offsets = dropon.cumulativeOffset();
+     Sortable._marker.setStyle({left: offsets[0]+'px', top: offsets[1] + 'px'});
+
+     if(position=='after')
+       if(sortable.overlap == 'horizontal')
+         Sortable._marker.setStyle({left: (offsets[0]+dropon.clientWidth) + 'px'});
+       else
+         Sortable._marker.setStyle({top: (offsets[1]+dropon.clientHeight) + 'px'});
+
+     Sortable._marker.show();
+   },
+
+   _tree: function(element, options, parent) {
+     var children = Sortable.findElements(element, options) || [];
+
+     for (var i = 0; i < children.length; ++i) {
+       var match = children[i].id.match(options.format);
+
+       if (!match) continue;
+
+       var child = {
+         id: encodeURIComponent(match ? match[1] : null),
+         element: element,
+         parent: parent,
+         children: [],
+         position: parent.children.length,
+         container: $(children[i]).down(options.treeTag)
+       };
+
+       /* Get the element containing the children and recurse over it */
+       if (child.container)
+         this._tree(child.container, options, child);
+
+       parent.children.push (child);
+     }
+
+     return parent;
+   },
+
+   tree: function(element) {
+     element = $(element);
+     var sortableOptions = this.options(element);
+     var options = Object.extend({
+       tag: sortableOptions.tag,
+       treeTag: sortableOptions.treeTag,
+       only: sortableOptions.only,
+       name: element.id,
+       format: sortableOptions.format
+     }, arguments[1] || { });
+
+     var root = {
+       id: null,
+       parent: null,
+       children: [],
+       container: element,
+       position: 0
+     };
+
+     return Sortable._tree(element, options, root);
+   },
+
+   /* Construct a [i] index for a particular node */
+   _constructIndex: function(node) {
+     var index = '';
+     do {
+       if (node.id) index = '[' + node.position + ']' + index;
+     } while ((node = node.parent) != null);
+     return index;
+   },
+
+   sequence: function(element) {
+     element = $(element);
+     var options = Object.extend(this.options(element), arguments[1] || { });
+
+     return $(this.findElements(element, options) || []).map( function(item) {
+       return item.id.match(options.format) ? item.id.match(options.format)[1] : '';
+     });
+   },
+
+   setSequence: function(element, new_sequence) {
+     element = $(element);
+     var options = Object.extend(this.options(element), arguments[2] || { });
+
+     var nodeMap = { };
+     this.findElements(element, options).each( function(n) {
+       if (n.id.match(options.format))
+         nodeMap[n.id.match(options.format)[1]] = [n, n.parentNode];
+       n.parentNode.removeChild(n);
+     });
+
+     new_sequence.each(function(ident) {
+       var n = nodeMap[ident];
+       if (n) {
+         n[1].appendChild(n[0]);
+         delete nodeMap[ident];
+       }
+     });
+   },
+
+   serialize: function(element) {
+     element = $(element);
+     var options = Object.extend(Sortable.options(element), arguments[1] || { });
+     var name = encodeURIComponent(
+       (arguments[1] && arguments[1].name) ? arguments[1].name : element.id);
+
+     if (options.tree) {
+       return Sortable.tree(element, arguments[1]).children.map( function (item) {
+         return [name + Sortable._constructIndex(item) + "[id]=" +
+           encodeURIComponent(item.id)].concat(item.children.map(arguments.callee));
+       }).flatten().join('&');
+     } else {
+       return Sortable.sequence(element, arguments[1]).map( function(item) {
+         return name + "[]=" + encodeURIComponent(item);
+       }).join('&');
+     }
+   }
+ };
+
+ // Returns true if child is contained within element
+ Element.isParent = function(child, element) {
+   if (!child.parentNode || child == element) return false;
+   if (child.parentNode == element) return true;
+   return Element.isParent(child.parentNode, element);
+ };
+
+ Element.findChildren = function(element, only, recursive, tagName) {
+   if(!element.hasChildNodes()) return null;
+   tagName = tagName.toUpperCase();
+   if(only) only = [only].flatten();
+   var elements = [];
+   $A(element.childNodes).each( function(e) {
+     if(e.tagName && e.tagName.toUpperCase()==tagName &&
+       (!only || (Element.classNames(e).detect(function(v) { return only.include(v) }))))
+       elements.push(e);
+     if(recursive) {
+       var grandchildren = Element.findChildren(e, only, recursive, tagName);
+       if(grandchildren) elements.push(grandchildren);
+     }
+   });
+
+   return (elements.length>0 ? elements.flatten() : []);
+ };
+
+ Element.offsetSize = function (element, type) {
+   return element['offset' + ((type=='vertical' || type=='height') ? 'Height' : 'Width')];
+ };
mosesdecoder/scripts/ems/web/javascripts/prototype.js ADDED
The diff for this file is too large to render. See raw diff
 
mosesdecoder/scripts/ems/web/javascripts/sound.js ADDED
@@ -0,0 +1,63 @@
+ // script.aculo.us sound.js v1.8.3, Thu Oct 08 11:23:33 +0200 2009
+
+ // Copyright (c) 2005-2009 Thomas Fuchs (http://script.aculo.us, http://mir.aculo.us)
+ //
+ // Based on code created by Jules Gravinese (http://www.webveteran.com/)
+ //
+ // script.aculo.us is freely distributable under the terms of an MIT-style license.
+ // For details, see the script.aculo.us web site: http://script.aculo.us/
+
+ Sound = {
+   tracks: {},
+   _enabled: true,
+   template:
+     new Template('<embed style="height:0" id="sound_#{track}_#{id}" src="#{url}" loop="false" autostart="true" hidden="true"/>'),
+   enable: function(){
+     Sound._enabled = true;
+   },
+   disable: function(){
+     Sound._enabled = false;
+   },
+   play: function(url){
+     if(!Sound._enabled) {
+       return;
+     }
+     var options = Object.extend({
+       track: 'global', url: url, replace: false
+     }, arguments[1] || {});
+
+     if(options.replace && this.tracks[options.track]) {
+       $R(0, this.tracks[options.track].id).each(function(id){
+         var sound = $('sound_'+options.track+'_'+id);
+         sound.Stop && sound.Stop();
+         sound.remove();
+       });
+       this.tracks[options.track] = null;
+     }
+
+     if(!this.tracks[options.track]) {
+       this.tracks[options.track] = { id: 0 };
+     } else {
+       this.tracks[options.track].id++;
+     }
+
+     options.id = this.tracks[options.track].id;
+     $$('body')[0].insert(
+       Prototype.Browser.IE ? new Element('bgsound',{
+         id: 'sound_'+options.track+'_'+options.id,
+         src: options.url, loop: 1, autostart: true
+       }) : Sound.template.evaluate(options));
+   }
+ };
+
+ if(Prototype.Browser.Gecko && navigator.userAgent.indexOf("Win") > 0){
+   if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('QuickTime') != -1; })) {
+     Sound.template = new Template('<object id="sound_#{track}_#{id}" width="0" height="0" type="audio/mpeg" data="#{url}"/>');
+   } else if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('Windows Media') != -1; })) {
+     Sound.template = new Template('<object id="sound_#{track}_#{id}" type="application/x-mplayer2" data="#{url}"></object>');
+   } else if(navigator.plugins && $A(navigator.plugins).detect(function(p){ return p.name.indexOf('RealPlayer') != -1; })) {
+     Sound.template = new Template('<embed type="audio/x-pn-realaudio-plugin" style="height:0" id="sound_#{track}_#{id}" src="#{url}" loop="false" autostart="true" hidden="true"/>');
+   } else {
+     Sound.play = function(){};
+   }
+ }
mosesdecoder/vw/Classifier.h ADDED
@@ -0,0 +1,197 @@
+ #ifndef moses_Classifier_h
+ #define moses_Classifier_h
+
+ #include <iostream>
+ #include <string>
+ #include <fstream>
+ #include <sstream>
+ #include <deque>
+ #include <vector>
+ #include <boost/shared_ptr.hpp>
+
+ #include <boost/noncopyable.hpp>
+ #include <boost/thread/condition_variable.hpp>
+ #include <boost/thread/locks.hpp>
+ #include <boost/thread/mutex.hpp>
+ #include <boost/iostreams/filtering_stream.hpp>
+ #include <boost/iostreams/filter/gzip.hpp>
+ #include "../util/string_piece.hh"
+ #include "../moses/Util.h"
+
+ // forward declarations to avoid dependency on VW
+ struct vw;
+ class ezexample;
+
+ namespace Discriminative
+ {
+
+ typedef std::pair<uint32_t, float> FeatureType; // feature hash (=ID) and value
+ typedef std::vector<FeatureType> FeatureVector;
+
+ /**
+  * Abstract class to be implemented by classifiers.
+  */
+ class Classifier
+ {
+ public:
+   /**
+    * Add a feature that does not depend on the class (label).
+    */
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value) = 0;
+
+   /**
+    * Add a feature that is specific for the given class.
+    */
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value) = 0;
+
+   /**
+    * Efficient addition of features when their IDs are already computed.
+    */
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features) = 0;
+
+   /**
+    * Efficient addition of features when their IDs are already computed.
+    */
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features) = 0;
+
+   /**
+    * Train using the current example. Use loss to distinguish positive and negative training examples.
+    * Throws away the current label-dependent features (so that features for another label/class can now be set).
+    */
+   virtual void Train(const StringPiece &label, float loss) = 0;
+
+   /**
+    * Predict the loss (inverse of score) of the current example.
+    * Throws away the current label-dependent features (so that features for another label/class can now be set).
+    */
+   virtual float Predict(const StringPiece &label) = 0;
+
+   // helper methods for indicator features
+   FeatureType AddLabelIndependentFeature(const StringPiece &name) {
+     return AddLabelIndependentFeature(name, 1.0);
+   }
+
+   FeatureType AddLabelDependentFeature(const StringPiece &name) {
+     return AddLabelDependentFeature(name, 1.0);
+   }
+
+   virtual ~Classifier() {}
+
+ protected:
+   /**
+    * Escape special characters in a unified way.
+    */
+   static std::string EscapeSpecialChars(const std::string &str) {
+     std::string out;
+     out = Moses::Replace(str, "\\", "_/_");
+     out = Moses::Replace(out, "|", "\\/");
+     out = Moses::Replace(out, ":", "\\;");
+     out = Moses::Replace(out, " ", "\\_");
+     return out;
+   }
+
+   const static bool DEBUG = false;
+ };
+
+ // some of the VW settings are hard-coded because they are always needed in our scenario
+ // (e.g. quadratic source X target features)
+ const std::string VW_DEFAULT_OPTIONS = " --hash all --noconstant -q st -t --ldf_override sc ";
+ const std::string VW_DEFAULT_PARSER_OPTIONS = " --quiet --hash all --noconstant -q st -t --csoaa_ldf sc ";
+
+ /**
+  * Produce a VW training file (does not use the VW library!)
+  */
+ class VWTrainer : public Classifier
+ {
+ public:
+   VWTrainer(const std::string &outputFile);
+   virtual ~VWTrainer();
+
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value);
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value);
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features);
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features);
+   virtual void Train(const StringPiece &label, float loss);
+   virtual float Predict(const StringPiece &label);
+
+ protected:
+   void AddFeature(const StringPiece &name, float value);
+
+   bool m_isFirstSource, m_isFirstTarget, m_isFirstExample;
+
+ private:
+   boost::iostreams::filtering_ostream m_bfos;
+   std::deque<std::string> m_outputBuffer;
+
+   void WriteBuffer();
+ };
+
+ /**
+  * Predict using the VW library.
+  */
+ class VWPredictor : public Classifier, private boost::noncopyable
+ {
+ public:
+   VWPredictor(const std::string &modelFile, const std::string &vwOptions);
+   virtual ~VWPredictor();
+
+   virtual FeatureType AddLabelIndependentFeature(const StringPiece &name, float value);
+   virtual FeatureType AddLabelDependentFeature(const StringPiece &name, float value);
+   virtual void AddLabelIndependentFeatureVector(const FeatureVector &features);
+   virtual void AddLabelDependentFeatureVector(const FeatureVector &features);
+   virtual void Train(const StringPiece &label, float loss);
+   virtual float Predict(const StringPiece &label);
+
+   friend class ClassifierFactory;
+
+ protected:
+   FeatureType AddFeature(const StringPiece &name, float value);
+
+   ::vw *m_VWInstance, *m_VWParser;
+   ::ezexample *m_ex;
+   // if true, then the VW instance is owned by an external party and should NOT be
+   // deleted at the end; if false, then we own the VW instance and must clean up after it.
+   bool m_sharedVwInstance;
+   bool m_isFirstSource, m_isFirstTarget;
+
+ private:
+   // instantiation by the classifier factory
+   VWPredictor(vw * instance, const std::string &vwOption);
+ };
+
+ /**
+  * Provider for classifier instances to be used by individual threads.
+  */
+ class ClassifierFactory : private boost::noncopyable
+ {
+ public:
+   typedef boost::shared_ptr<Classifier> ClassifierPtr;
+
+   /**
+    * Creates VWPredictor instances to be used by individual threads.
+    */
+   ClassifierFactory(const std::string &modelFile, const std::string &vwOptions);
+
+   /**
+    * Creates VWTrainer instances (which write features to a file).
+    */
+   ClassifierFactory(const std::string &modelFilePrefix);
+
+   // return a VWPredictor or VWTrainer instance depending on whether we're in training mode
+   ClassifierPtr operator()();
+
+   ~ClassifierFactory();
+
+ private:
+   std::string m_vwOptions;
+   ::vw *m_VWInstance;
+   int m_lastId;
+   std::string m_modelFilePrefix;
+   bool m_gzip;
+   boost::mutex m_mutex;
+   const bool m_train;
+ };
+
+ } // namespace Discriminative
+
+ #endif // moses_Classifier_h
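
The comments in `Classifier.h` imply a specific call protocol. A minimal sketch of how a caller might drive it, assuming the header above is on the include path (the feature names and the `ScoreCandidates` helper are illustrative, not part of the Moses sources):

    #include <algorithm>
    #include <limits>
    #include <string>
    #include <vector>

    #include "Classifier.h"

    // Score a set of candidate labels with an already-constructed classifier
    // (e.g. a VWPredictor obtained from a ClassifierFactory).
    float ScoreCandidates(Discriminative::Classifier &classifier,
                          const std::vector<std::string> &candidates)
    {
      // shared, label-independent features go first (namespace 's' in the VW backends)
      classifier.AddLabelIndependentFeature("bow^haus");

      float bestLoss = std::numeric_limits<float>::max();
      for (size_t i = 0; i < candidates.size(); ++i) {
        // label-dependent features for this candidate (namespace 't')
        classifier.AddLabelDependentFeature("tind^" + candidates[i]);
        // Predict() returns a loss (lower is better) and discards the 't' features,
        // so the next candidate starts from a clean slate
        bestLoss = std::min(bestLoss, classifier.Predict(candidates[i]));
      }
      return bestLoss;
    }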
mosesdecoder/vw/ClassifierFactory.cpp ADDED
@@ -0,0 +1,48 @@
+ #include "Classifier.h"
+ #include "vw.h"
+ #include "../moses/Util.h"
+ #include <iostream>
+ #include <boost/algorithm/string/predicate.hpp>
+
+ using namespace boost::algorithm;
+
+ namespace Discriminative
+ {
+
+ ClassifierFactory::ClassifierFactory(const std::string &modelFile, const std::string &vwOptions)
+   : m_vwOptions(vwOptions), m_train(false)
+ {
+   m_VWInstance = VW::initialize(VW_DEFAULT_OPTIONS + " -i " + modelFile + vwOptions);
+ }
+
+ ClassifierFactory::ClassifierFactory(const std::string &modelFilePrefix)
+   : m_lastId(0), m_train(true)
+ {
+   if (ends_with(modelFilePrefix, ".gz")) {
+     m_modelFilePrefix = modelFilePrefix.substr(0, modelFilePrefix.size() - 3);
+     m_gzip = true;
+   } else {
+     m_modelFilePrefix = modelFilePrefix;
+     m_gzip = false;
+   }
+ }
+
+ ClassifierFactory::~ClassifierFactory()
+ {
+   if (! m_train)
+     VW::finish(*m_VWInstance);
+ }
+
+ ClassifierFactory::ClassifierPtr ClassifierFactory::operator()()
+ {
+   if (m_train) {
+     boost::unique_lock<boost::mutex> lock(m_mutex); // avoid a possible race for m_lastId
+     return ClassifierFactory::ClassifierPtr(
+       new VWTrainer(m_modelFilePrefix + "." + Moses::SPrint(m_lastId++) + (m_gzip ? ".gz" : "")));
+   } else {
+     return ClassifierFactory::ClassifierPtr(
+       new VWPredictor(m_VWInstance, VW_DEFAULT_PARSER_OPTIONS + m_vwOptions));
+   }
+ }
+
+ }
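
A sketch of the intended per-thread usage (the file names here are hypothetical): the factory is created once, and each worker thread requests its own classifier through `operator()`; in prediction mode the heavy VW model is shared, in training mode each call opens a new numbered feature file.

    #include "Classifier.h"

    // one factory per model; each worker thread then requests its own handle
    Discriminative::ClassifierFactory factory("/path/to/model.vw", " --quiet");

    void WorkerThread()
    {
      // prediction mode: wraps the shared VW model in a per-thread parser/example;
      // training mode (file-prefix constructor): opens a new numbered output file
      Discriminative::ClassifierFactory::ClassifierPtr classifier = factory();
      // ... AddLabelIndependentFeature / AddLabelDependentFeature / Predict ...
    }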
mosesdecoder/vw/Jamfile ADDED
@@ -0,0 +1,20 @@
+ alias headers : : : : <include>. <include>..//moses// <include>.. ;
+ alias deps : ..//z ..//boost_iostreams ..//boost_filesystem ../moses//moses ;
+
+ boost 103600 ;
+
+ # VW
+ local with-vw = [ option.get "with-vw" ] ;
+ if $(with-vw) {
+   lib vw : : <search>$(with-vw)/lib ;
+   lib allreduce : : <search>$(with-vw)/lib ;
+
+   obj ClassifierFactory.o : ClassifierFactory.cpp headers : <include>$(with-vw)/include/vowpalwabbit ;
+   obj VWPredictor.o : VWPredictor.cpp headers : <include>$(with-vw)/include/vowpalwabbit ;
+
+   alias vw_objects : VWPredictor.o ClassifierFactory.o vw allreduce : : : <library>boost_program_options ;
+   lib classifier : [ glob *.cpp : VWPredictor.cpp ClassifierFactory.cpp ] vw_objects headers ;
+
+   exe vwtrainer : MainVW deps ;
+   echo "Linking with Vowpal Wabbit" ;
+ }
mosesdecoder/vw/Normalizer.h ADDED
@@ -0,0 +1,78 @@
+ #ifndef moses_Normalizer_h
+ #define moses_Normalizer_h
+
+ #include <vector>
+ #include <cmath>
+ #include <algorithm>
+ #include "Util.h"
+
+ namespace Discriminative
+ {
+
+ class Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const = 0;
+   virtual ~Normalizer() {}
+ };
+
+ class SquaredLossNormalizer : public Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const {
+     // this is (probably) a reasonable choice for squared loss (the default loss function in VW)
+
+     float sum = 0;
+
+     // clip to [0,1] and take 1 - loss as the non-normalized probability
+     std::vector<float>::iterator it;
+     for (it = losses.begin(); it != losses.end(); it++) {
+       if (*it <= 0.0) *it = 1.0;
+       else if (*it >= 1.0) *it = 0.0;
+       else *it = 1.0 - *it;
+       sum += *it;
+     }
+
+     if (! Moses::Equals(sum, 0)) {
+       // normalize
+       for (it = losses.begin(); it != losses.end(); it++)
+         *it /= sum;
+     } else {
+       // if the sum of non-normalized probabilities is 0, fall back to uniform probabilities
+       for (it = losses.begin(); it != losses.end(); it++)
+         *it = 1.0 / losses.size();
+     }
+   }
+
+   virtual ~SquaredLossNormalizer() {}
+ };
+
+ // safe softmax
+ class LogisticLossNormalizer : public Normalizer
+ {
+ public:
+   virtual void operator()(std::vector<float> &losses) const {
+     std::vector<float>::iterator it;
+
+     float sum = 0;
+     float max = 0; // note: starting the max at 0 is safe as long as losses are non-negative
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it = -*it;
+       max = std::max(max, *it);
+     }
+
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it = std::exp(*it - max);
+       sum += *it;
+     }
+
+     for (it = losses.begin(); it != losses.end(); it++) {
+       *it /= sum;
+     }
+   }
+
+   virtual ~LogisticLossNormalizer() {}
+ };
+
+ } // namespace Discriminative
+
+ #endif // moses_Normalizer_h
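
Spelled out: given per-label losses l_i, `LogisticLossNormalizer` computes a max-shifted softmax over the negated losses,

    p_i = \frac{\exp(-l_i - M)}{\sum_j \exp(-l_j - M)}, \qquad M = \max\bigl(0, \max_j(-l_j)\bigr)

The shift by M cancels algebraically, so this is exactly softmax(-l); its only purpose is to keep exp() from overflowing when some -l_j is large and positive.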
mosesdecoder/vw/README.md ADDED
@@ -0,0 +1,113 @@
+ Vowpal Wabbit for Moses
+ =======================
+
+ This is an attempt to integrate Vowpal Wabbit with Moses as a stateless feature
+ function.
+
+ Compatible with this frozen version of VW:
+
+ https://github.com/moses-smt/vowpal_wabbit
+
+ To enable VW, provide bjam with the path where VW was installed (using `make install`):
+
+     ./bjam --with-vw=<path/to/vw/installation>
+
+ Implemented classifier features
+ -------------------------------
+
+ * `VWFeatureSourceBagOfWords`: Creates a feature of the form bow^token for every
+ source sentence token.
+ * `VWFeatureSourceExternalFeatures column=0`: When used with -inputtype 5 (`TabbedSentence`), this can be used to supply additional features to VW. The input is a tab-separated file; the first column is the usual input sentence, all other columns can be used for meta-data. The column parameter counts from 0, beginning with the first column that is not the input sentence.
+ * `VWFeatureSourceIndicator`: Adds a feature for the whole source phrase.
+ * `VWFeatureSourcePhraseInternal`: Adds a separate feature for every word of the source phrase.
+ * `VWFeatureSourceWindow size=3`: Adds source words in a window of size 3 before and after the source phrase as features. These do not overlap with `VWFeatureSourcePhraseInternal`.
+ * `VWFeatureTargetIndicator`: Adds a feature for the whole target phrase.
+ * `VWFeatureTargetPhraseInternal`: Adds a separate feature for every word of the target phrase.
+
+ Configuration
+ -------------
+
+ To use the classifier, edit your moses.ini:
+
+     [features]
+     ...
+     VW path=/home/username/vw/classifier1.vw
+     VWFeatureSourceBagOfWords
+     VWFeatureTargetIndicator
+     VWFeatureSourceIndicator
+     ...
+
+     [weights]
+     ...
+     VW0= 0.2
+     ...
+
+ If you change the name of the main VW feature, remember to tell the VW classifier
+ features which classifier they belong to:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/classifier1.vw
+     VWFeatureSourceBagOfWords used-by=bart
+     VWFeatureTargetIndicator used-by=bart
+     VWFeatureSourceIndicator used-by=bart
+     ...
+
+     [weights]
+     ...
+     bart= 0.2
+     ...
+
+ You can also use multiple classifiers:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/classifier1.vw
+     VW path=/home/username/vw/classifier2.vw
+     VW path=/home/username/vw/classifier3.vw
+     VWFeatureSourceBagOfWords used-by=bart,VW0
+     VWFeatureTargetIndicator used-by=VW1,VW0,bart
+     VWFeatureSourceIndicator used-by=bart,VW1
+     ...
+
+     [weights]
+     ...
+     bart= 0.2
+     VW0= 0.2
+     VW1= 0.2
+     ...
+
+ Features can use any combination of factors. Provide a comma-delimited list of factors in the `source-factors` or `target-factors` variables to override the default setting (`0`, i.e. the first factor).
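+
+ For example (illustrative values; which factor indices exist depends on your corpus):
+
+     VWFeatureSourcePhraseInternal used-by=VW0 source-factors=0,2
+     VWFeatureTargetPhraseInternal used-by=VW0 target-factors=0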
+
+ Training the classifier
+ -----------------------
+
+ Training uses `vwtrainer`, which is a limited version of the `moses` binary. To train, provide your training data as input in the following format:
+
+     source tokens<tab>target tokens<tab>word alignment
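+
+ For illustration, a single (hypothetical) training example in this format could look like this:
+
+     das ist ein Test<tab>this is a test<tab>0-0 1-1 2-2 3-3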
+
+ Use Moses format for the word alignment (`0-0 1-0` etc.). Set the input type to 5 (`TabbedSentence`, see above):
+
+     [inputtype]
+     5
+
+ Configure your features in the `moses.ini` file (see above) and set the `train` flag:
+
+     [features]
+     ...
+     VW name=bart path=/home/username/vw/features.txt train=1
+     ...
+
+ The `path` variable points to the file (prefix) where features will be written. Currently, threads write to separate files (this may change sooner or later): `features.txt.1`, `features.txt.2` etc.
+
+ `vwtrainer` creates the translation option collection for each input sentence but does not run decoding. Therefore, you probably want to disable expensive feature functions such as the language model (the LM score is not used by VW features at the moment).
+
+ Run `vwtrainer`:
+
+     vwtrainer -f moses.trainvw.ini < tab-separated-training-data.tsv
+
+ Currently, classification is implemented using VW's `csoaa_ldf` scheme with quadratic features, which take the product of the source namespace (`s`, contains label-independent features) and the target namespace (`t`, contains label-dependent features).
+
+ To train VW in this setting, use the command:
+
+     cat features.txt.* | vw --hash all --loss_function logistic --noconstant -b 26 -q st --csoaa_ldf mc -f classifier1.vw
mosesdecoder/vw/VWPredictor.cpp ADDED
@@ -0,0 +1,121 @@
+ #include <iostream>
+ #include <stdexcept>
+
+ #include "Classifier.h"
+ #include "vw.h"
+ #include "ezexample.h"
+ #include "../moses/Util.h"
+
+ namespace Discriminative
+ {
+
+ using namespace std;
+
+ VWPredictor::VWPredictor(const string &modelFile, const string &vwOptions)
+ {
+   m_VWInstance = VW::initialize(VW_DEFAULT_OPTIONS + " -i " + modelFile + vwOptions);
+   m_VWParser = VW::initialize(VW_DEFAULT_PARSER_OPTIONS + vwOptions + " --noop");
+   m_sharedVwInstance = false;
+   m_ex = new ::ezexample(m_VWInstance, false, m_VWParser);
+   m_isFirstSource = m_isFirstTarget = true;
+ }
+
+ VWPredictor::VWPredictor(vw *instance, const string &vwOptions)
+ {
+   m_VWInstance = instance;
+   m_VWParser = VW::initialize(vwOptions + " --noop");
+   m_sharedVwInstance = true;
+   m_ex = new ::ezexample(m_VWInstance, false, m_VWParser);
+   m_isFirstSource = m_isFirstTarget = true;
+ }
+
+ VWPredictor::~VWPredictor()
+ {
+   delete m_ex;
+   VW::finish(*m_VWParser);
+   if (!m_sharedVwInstance)
+     VW::finish(*m_VWInstance);
+ }
+
+ FeatureType VWPredictor::AddLabelIndependentFeature(const StringPiece &name, float value)
+ {
+   // label-independent features are kept in a different feature namespace ('s' = source)
+
+   if (m_isFirstSource) {
+     // the first feature of a new example => create the source namespace for
+     // label-independent features to live in
+     m_isFirstSource = false;
+     m_ex->finish();
+     m_ex->addns('s');
+     if (DEBUG) std::cerr << "VW :: Setting source namespace\n";
+   }
+   return AddFeature(name, value); // namespace 's' is set up, add the feature
+ }
+
+ FeatureType VWPredictor::AddLabelDependentFeature(const StringPiece &name, float value)
+ {
+   // VW does not use the label directly; instead, we do a Cartesian product between source and target feature
+   // namespaces, where the source namespace ('s') contains label-independent features and the target
+   // namespace ('t') contains label-dependent features
+
+   if (m_isFirstTarget) {
+     // the first target-side feature => create namespace 't'
+     m_isFirstTarget = false;
+     m_ex->addns('t');
+     if (DEBUG) std::cerr << "VW :: Setting target namespace\n";
+   }
+   return AddFeature(name, value);
+ }
+
+ void VWPredictor::AddLabelIndependentFeatureVector(const FeatureVector &features)
+ {
+   if (m_isFirstSource) {
+     // the first feature of a new example => create the source namespace for
+     // label-independent features to live in
+     m_isFirstSource = false;
+     m_ex->finish();
+     m_ex->addns('s');
+     if (DEBUG) std::cerr << "VW :: Setting source namespace\n";
+   }
+
+   // add each feature index using this "low level" call to VW
+   for (FeatureVector::const_iterator it = features.begin(); it != features.end(); it++)
+     m_ex->addf(it->first, it->second);
+ }
+
+ void VWPredictor::AddLabelDependentFeatureVector(const FeatureVector &features)
+ {
+   if (m_isFirstTarget) {
+     // the first target-side feature => create namespace 't'
+     m_isFirstTarget = false;
+     m_ex->addns('t');
+     if (DEBUG) std::cerr << "VW :: Setting target namespace\n";
+   }
+
+   // add each feature index using this "low level" call to VW
+   for (FeatureVector::const_iterator it = features.begin(); it != features.end(); it++)
+     m_ex->addf(it->first, it->second);
+ }
+
+ void VWPredictor::Train(const StringPiece &label, float loss)
+ {
+   throw logic_error("Trying to train during prediction!");
+ }
+
+ float VWPredictor::Predict(const StringPiece &label)
+ {
+   m_ex->set_label(label.as_string());
+   m_isFirstSource = true;
+   m_isFirstTarget = true;
+   float loss = m_ex->predict_partial();
+   if (DEBUG) std::cerr << "VW :: Predicted loss: " << loss << "\n";
+   m_ex->remns(); // remove the target namespace
+   return loss;
+ }
+
+ FeatureType VWPredictor::AddFeature(const StringPiece &name, float value)
+ {
+   if (DEBUG) std::cerr << "VW :: Adding feature: " << EscapeSpecialChars(name.as_string()) << ":" << value << "\n";
+   return std::make_pair(m_ex->addf(EscapeSpecialChars(name.as_string()), value), value);
+ }
+
+ } // namespace Discriminative
mosesdecoder/vw/VWTrainer.cpp ADDED
@@ -0,0 +1,99 @@
+ #include "Util.h"
+ #include "Classifier.h"
+ #include <stdexcept>
+ #include <boost/algorithm/string/predicate.hpp>
+ #include <boost/iostreams/device/file.hpp>
+
+ using namespace std;
+ using namespace boost::algorithm;
+ using namespace Moses;
+
+ namespace Discriminative
+ {
+
+ VWTrainer::VWTrainer(const std::string &outputFile)
+ {
+   if (ends_with(outputFile, ".gz")) {
+     m_bfos.push(boost::iostreams::gzip_compressor());
+   }
+   m_bfos.push(boost::iostreams::file_sink(outputFile));
+   m_isFirstSource = m_isFirstTarget = m_isFirstExample = true;
+ }
+
+ VWTrainer::~VWTrainer()
+ {
+   m_bfos << "\n";
+   close(m_bfos);
+ }
+
+ FeatureType VWTrainer::AddLabelIndependentFeature(const StringPiece &name, float value)
+ {
+   if (m_isFirstSource) {
+     if (m_isFirstExample) {
+       m_isFirstExample = false;
+     } else {
+       // finish the previous example
+       m_bfos << "\n";
+     }
+
+     m_isFirstSource = false;
+     if (! m_outputBuffer.empty())
+       WriteBuffer();
+
+     m_outputBuffer.push_back("shared |s");
+   }
+
+   AddFeature(name, value);
+
+   return std::make_pair(0, value); // we don't hash features
+ }
+
+ FeatureType VWTrainer::AddLabelDependentFeature(const StringPiece &name, float value)
+ {
+   if (m_isFirstTarget) {
+     m_isFirstTarget = false;
+     if (! m_outputBuffer.empty())
+       WriteBuffer();
+
+     m_outputBuffer.push_back("|t");
+   }
+
+   AddFeature(name, value);
+
+   return std::make_pair(0, value); // we don't hash features
+ }
+
+ void VWTrainer::AddLabelIndependentFeatureVector(const FeatureVector &features)
+ {
+   throw logic_error("VW trainer does not support feature IDs.");
+ }
+
+ void VWTrainer::AddLabelDependentFeatureVector(const FeatureVector &features)
+ {
+   throw logic_error("VW trainer does not support feature IDs.");
+ }
+
+ void VWTrainer::Train(const StringPiece &label, float loss)
+ {
+   m_outputBuffer.push_front(label.as_string() + ":" + SPrint(loss));
+   m_isFirstSource = true;
+   m_isFirstTarget = true;
+   WriteBuffer();
+ }
+
+ float VWTrainer::Predict(const StringPiece &label)
+ {
+   throw logic_error("Trying to predict during training!");
+ }
+
+ void VWTrainer::AddFeature(const StringPiece &name, float value)
+ {
+   m_outputBuffer.push_back(EscapeSpecialChars(name.as_string()) + ":" + SPrint(value));
+ }
+
+ void VWTrainer::WriteBuffer()
+ {
+   m_bfos << Join(" ", m_outputBuffer.begin(), m_outputBuffer.end()) << "\n";
+   m_outputBuffer.clear();
+ }
+
+ } // namespace Discriminative
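
For reference, the file this trainer emits follows VW's cost-sensitive label-dependent-features (`csoaa_ldf`) text format: one shared line with the source ('s') namespace, then one line per candidate with its loss and target ('t') namespace, and a blank line between examples. A hypothetical two-candidate example (the actual feature names depend on the configured VW features):

    shared |s bow^haus bow^ist
    house:0.0 |t tind^house
    home:1.0 |t tind^home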
scripts/decode-backtrans.sh ADDED
@@ -0,0 +1,69 @@
+ #! /usr/bin/bash
+ set -eux
+
+ # use CPU if set to 0
+ comet_eval_gpus=8
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+ # language pair
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ task_name=${src_lang}2${tgt_lang}
+ raw_data_dir=$root_dir/data/test/raw/$task_name
+ trainable_data_dir=$root_dir/data/test/trainable_data/$task_name
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ # directory containing the model
+ model_dir=$root_dir/exps/${task_name}_backtrans/${threshold}/transformer_big_wmt23
+
+ ### decode
+ checkpoint_path=$model_dir/checkpoint_best.pt
+ save_dir=$model_dir/decode_result
+
+ mkdir -p $save_dir
+ cp ${BASH_SOURCE[0]} $save_dir
+
+ declare -A gen_subset_dict
+ gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ for gen_subset in ${!gen_subset_dict[*]}
+ do
+     decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+     pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+     CUDA_VISIBLE_DEVICES=0 fairseq-generate $trainable_data_dir -s $src_lang -t $tgt_lang \
+         --gen-subset $gen_subset \
+         --path $checkpoint_path \
+         --max-tokens $decode_max_tokens \
+         --beam $beam \
+         --nbest $nbest \
+         --lenpen $lenpen \
+         --seed 42 \
+         --remove-bpe | tee $decode_file
+
+     ### eval
+     # purify file: extract the hypotheses from the fairseq output and detokenize them
+     grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+     eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+     cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+     echo "=============$cur_time===================" >> $eval_file
+     echo $checkpoint_path >> $eval_file
+     tail -n1 $decode_file >> $eval_file # multi-bleu
+     # get scores
+     src_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$src_lang
+     ref_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$tgt_lang
+     # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     # use the default tokenizer here; pass --tokenize zh only when the target language is Chinese
+     sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+     comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+     echo "Comet22 Score" >> $eval_file
+     tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+ done
scripts/decode.sh ADDED
@@ -0,0 +1,69 @@
+ #! /usr/bin/bash
+ set -eux
+
+ # use CPU if set to 0
+ comet_eval_gpus=8
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+ # language pair
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ task_name=${src_lang}2${tgt_lang}
+ raw_data_dir=$root_dir/data/test/raw/$task_name
+ trainable_data_dir=$root_dir/data/test/trainable_data/$task_name
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ # directory containing the model
+ model_dir=$root_dir/exps/${task_name}/${threshold}/transformer_big_wmt23
+
+ ### decode
+ checkpoint_path=$model_dir/checkpoint_best.pt
+ save_dir=$model_dir/decode_result
+
+ mkdir -p $save_dir
+ cp ${BASH_SOURCE[0]} $save_dir
+
+ declare -A gen_subset_dict
+ gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ for gen_subset in ${!gen_subset_dict[*]}
+ do
+     decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+     pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+     CUDA_VISIBLE_DEVICES=0 fairseq-generate $trainable_data_dir -s $src_lang -t $tgt_lang \
+         --gen-subset $gen_subset \
+         --path $checkpoint_path \
+         --max-tokens $decode_max_tokens \
+         --beam $beam \
+         --nbest $nbest \
+         --lenpen $lenpen \
+         --seed 42 \
+         --remove-bpe | tee $decode_file
+
+     ### eval
+     # purify file: extract the hypotheses from the fairseq output and detokenize them
+     grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+     eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+     cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+     echo "=============$cur_time===================" >> $eval_file
+     echo $checkpoint_path >> $eval_file
+     tail -n1 $decode_file >> $eval_file # multi-bleu
+     # get scores
+     src_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$src_lang
+     ref_file=$raw_data_dir/test.${task_name}.${gen_subset_dict[$gen_subset]}.$tgt_lang
+     # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+     # use the default tokenizer here; pass --tokenize zh only when the target language is Chinese
+     sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+     comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+     echo "Comet22 Score" >> $eval_file
+     tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+ done
scripts/train-backtrans.sh ADDED
@@ -0,0 +1,157 @@
+ #! /usr/bin/bash
+ set -eux
+
+ train_device=0,1,2,3,4,5,6,7
+ eval_device=0
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ data_name=wmt23
+ # pair_lang=${src_lang}-${tgt_lang}
+ task_name=${src_lang}2${tgt_lang}
+ data_dir=$root_dir/data/${tgt_lang}2${src_lang}/${threshold}
+ raw_data_dir=$data_dir/raw
+ trainable_data_dir=$data_dir/trainable_data
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ ## common params
+ criterion=label_smoothed_cross_entropy
+ label_smoothing=0.1
+ seed=42
+ max_epoch=40
+ keep_last_epochs=1
+ keep_best_checkpoints=5
+ patience=5
+ num_workers=8
+
+ # model-specific params
+ conf_name=transformer_big
+ # Global batch = num_gpus * max-tokens * gradient accumulation steps. For language pairs with
+ # large training data (a train set of tens of millions of sentence pairs), a global batch above
+ # 100k tokens works well; here 8 * 8192 * 4 = 262,144 tokens per update.
+ if [ $conf_name == "transformer_big" ]; then
+     arch=transformer_vaswani_wmt_en_de_big
+     activation_fn=relu
+     encoder_ffn_embed_dim=4096
+     share_all_embeddings=1
+     share_decoder_input_output_embed=1
+     learning_rate=1e-3
+     warmup=4000
+     max_tokens=8192
+     weight_decay=0.0
+     dropout=0.3
+     gradient_accumulation_steps=4
+ else
+     echo "unknown conf_name=$conf_name"
+     exit
+ fi
+
+ model_dir=$root_dir/exps/${task_name}_backtrans/${threshold}/${conf_name}_${data_name}
+ mkdir -p $model_dir
+ cp ${BASH_SOURCE[0]} $model_dir
+
+ gpu_num=`echo "$train_device" | awk '{split($0,arr,",");print length(arr)}'`
+ export CUDA_VISIBLE_DEVICES=$train_device
+ cmd="fairseq-train $trainable_data_dir \
+     --distributed-world-size $gpu_num -s $src_lang -t $tgt_lang \
+     --arch $arch \
+     --fp16 \
+     --optimizer adam --clip-norm 0.0 \
+     --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates $warmup \
+     --lr $learning_rate --adam-betas '(0.9, 0.98)' \
+     --weight-decay $weight_decay \
+     --dropout $dropout \
+     --criterion $criterion --label-smoothing $label_smoothing \
+     --max-epoch $max_epoch \
+     --max-tokens $max_tokens \
+     --update-freq $gradient_accumulation_steps \
+     --activation-fn $activation_fn \
+     --encoder-ffn-embed-dim $encoder_ffn_embed_dim \
+     --seed $seed \
+     --num-workers $num_workers \
+     --no-epoch-checkpoints \
+     --keep-last-epochs $keep_last_epochs \
+     --keep-best-checkpoints $keep_best_checkpoints \
+     --patience $patience \
+     --no-progress-bar \
+     --log-interval 100 \
+     --task "translation" \
+     --ddp-backend no_c10d \
+     --save-dir $model_dir \
+     --tensorboard-logdir $model_dir"
+
+ # optional params
+ if [ $share_all_embeddings -eq 1 ]; then
+     cmd=${cmd}" --share-all-embeddings "
+ fi
+ if [ $share_decoder_input_output_embed -eq 1 ]; then
+     cmd=${cmd}" --share-decoder-input-output-embed "
+ fi
+ if [ ${max_update:=0} -ne 0 ]; then
+     cmd=${cmd}" --max-update $max_update"
+ fi
+
+ # run command
+ cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ echo "=============$cur_time===================" >> $model_dir/train.log
+ cmd="nohup ${cmd} >> $model_dir/train.log 2>&1 &"
+
+ eval $cmd
+
+ # wait
+
+ # ### decode
+ # checkpoint_path=$model_dir/checkpoint_best.pt
+ # save_dir=$model_dir/decode_result
+
+ # mkdir -p $save_dir
+ # cp ${BASH_SOURCE[0]} $save_dir
+
+ # declare -A gen_subset_dict
+ # gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ # for gen_subset in ${!gen_subset_dict[*]}
+ # do
+ # decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+ # pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+ # CUDA_VISIBLE_DEVICES=$eval_device fairseq-generate \
+ # $trainable_data_dir \
+ # -s $src_lang -t $tgt_lang \
+ # --user-dir $user_dir \
+ # --gen-subset $gen_subset \
+ # --path $checkpoint_path \
+ # --max-tokens $decode_max_tokens \
+ # --beam $beam \
+ # --nbest $nbest \
+ # --lenpen $lenpen \
+ # --seed $seed \
+ # --remove-bpe | tee $decode_file
+
+ # ### eval
+ # # purify file
+ # grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+ # eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+ # cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ # echo "=============$cur_time===================" >> $eval_file
+ # echo $checkpoint_path >> $eval_file
+ # tail -n1 $decode_file >> $eval_file # multi-bleu
+ # # get scores
+ # src_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$src_lang
+ # ref_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$tgt_lang
+ # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+ # comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+ # echo "Comet22 Score" >> $eval_file
+ # tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+
+ # echo -e "decode finished! \n decode tokenized file in $decode_file \n detokenized file in $pure_file \n sacrebleu file in $eval_file"
+ # done
scripts/train.sh ADDED
@@ -0,0 +1,157 @@
+ #! /usr/bin/bash
+ set -eux
+
+ train_device=0,1,2,3,4,5,6,7
+ eval_device=0
+ # xzq-fairseq
+ root_dir=$(dirname "$PWD")
+
+ src_lang=en
+ tgt_lang=de
+ threshold=0.7
+
+ data_name=wmt23
+ # pair_lang=${src_lang}-${tgt_lang}
+ task_name=${src_lang}2${tgt_lang}
+ data_dir=$root_dir/data/${task_name}/${threshold}
+ raw_data_dir=$data_dir/raw
+ trainable_data_dir=$data_dir/trainable_data
+
+ ## eval & decode params
+ decode_max_tokens=2048
+ beam=5
+ nbest=1
+ lenpen=1.0
+
+ ## common params
+ criterion=label_smoothed_cross_entropy
+ label_smoothing=0.1
+ seed=42
+ max_epoch=40
+ keep_last_epochs=1
+ keep_best_checkpoints=5
+ patience=5
+ num_workers=8
+
+ # model-specific params
+ conf_name=transformer_big
+ # Global batch = num_gpus * max-tokens * gradient accumulation steps. For language pairs with
+ # large training data (a train set of tens of millions of sentence pairs), a global batch above
+ # 100k tokens works well; here 8 * 8192 * 4 = 262,144 tokens per update.
+ if [ $conf_name == "transformer_big" ]; then
+     arch=transformer_vaswani_wmt_en_de_big
+     activation_fn=relu
+     encoder_ffn_embed_dim=4096
+     share_all_embeddings=0
+     share_decoder_input_output_embed=1
+     learning_rate=1e-3
+     warmup=4000
+     max_tokens=8192
+     weight_decay=0.0
+     dropout=0.3
+     gradient_accumulation_steps=4
+ else
+     echo "unknown conf_name=$conf_name"
+     exit
+ fi
+
+ model_dir=$root_dir/exps/$task_name/${threshold}/${conf_name}_${data_name}
+ mkdir -p $model_dir
+ cp ${BASH_SOURCE[0]} $model_dir
+
+ gpu_num=`echo "$train_device" | awk '{split($0,arr,",");print length(arr)}'`
+ export CUDA_VISIBLE_DEVICES=$train_device
+ cmd="fairseq-train $trainable_data_dir \
+     --distributed-world-size $gpu_num -s $src_lang -t $tgt_lang \
+     --arch $arch \
+     --fp16 \
+     --optimizer adam --clip-norm 0.0 \
+     --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates $warmup \
+     --lr $learning_rate --adam-betas '(0.9, 0.98)' \
+     --weight-decay $weight_decay \
+     --dropout $dropout \
+     --criterion $criterion --label-smoothing $label_smoothing \
+     --max-epoch $max_epoch \
+     --max-tokens $max_tokens \
+     --update-freq $gradient_accumulation_steps \
+     --activation-fn $activation_fn \
+     --encoder-ffn-embed-dim $encoder_ffn_embed_dim \
+     --seed $seed \
+     --num-workers $num_workers \
+     --no-epoch-checkpoints \
+     --keep-last-epochs $keep_last_epochs \
+     --keep-best-checkpoints $keep_best_checkpoints \
+     --patience $patience \
+     --no-progress-bar \
+     --log-interval 100 \
+     --task "translation" \
+     --ddp-backend no_c10d \
+     --save-dir $model_dir \
+     --tensorboard-logdir $model_dir"
+
+ # optional params
+ if [ $share_all_embeddings -eq 1 ]; then
+     cmd=${cmd}" --share-all-embeddings "
+ fi
+ if [ $share_decoder_input_output_embed -eq 1 ]; then
+     cmd=${cmd}" --share-decoder-input-output-embed "
+ fi
+ if [ ${max_update:=0} -ne 0 ]; then
+     cmd=${cmd}" --max-update $max_update"
+ fi
+
+ # run command
+ cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ echo "=============$cur_time===================" >> $model_dir/train.log
+ cmd="nohup ${cmd} >> $model_dir/train.log 2>&1 &"
+
+ eval $cmd
+
+ # wait
+
+ # ### decode
+ # checkpoint_path=$model_dir/checkpoint_best.pt
+ # save_dir=$model_dir/decode_result
+
+ # mkdir -p $save_dir
+ # cp ${BASH_SOURCE[0]} $save_dir
+
+ # declare -A gen_subset_dict
+ # gen_subset_dict=([test]=flores [test1]=wmt22 [test2]=wmt23)
+ # for gen_subset in ${!gen_subset_dict[*]}
+ # do
+ # decode_file=$save_dir/decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+ # pure_file=$save_dir/pure_decode_${gen_subset_dict[$gen_subset]}_beam${beam}_lenpen${lenpen}.$tgt_lang
+
+ # CUDA_VISIBLE_DEVICES=$eval_device fairseq-generate \
+ # $trainable_data_dir \
+ # -s $src_lang -t $tgt_lang \
+ # --user-dir $user_dir \
+ # --gen-subset $gen_subset \
+ # --path $checkpoint_path \
+ # --max-tokens $decode_max_tokens \
+ # --beam $beam \
+ # --nbest $nbest \
+ # --lenpen $lenpen \
+ # --seed $seed \
+ # --remove-bpe | tee $decode_file
+
+ # ### eval
+ # # purify file
+ # grep ^H $decode_file | LC_ALL=C sort -V | cut -f3- | perl $root_dir/mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt_lang > $pure_file
+
+ # eval_file=$model_dir/eval_${gen_subset_dict[$gen_subset]}.log
+ # cur_time=`date +"%Y-%m-%d %H:%M:%S"`
+ # echo "=============$cur_time===================" >> $eval_file
+ # echo $checkpoint_path >> $eval_file
+ # tail -n1 $decode_file >> $eval_file # multi-bleu
+ # # get scores
+ # src_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$src_lang
+ # ref_file=$raw_data_dir/test.${gen_subset_dict[$gen_subset]}.$tgt_lang
+ # sacrebleu_file=$save_dir/sacrebleu.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # comet22_file=$save_dir/comet22.${gen_subset_dict[$gen_subset]}.beam${beam}_lenpen${lenpen}
+ # sacrebleu $ref_file -i $pure_file -w 2 >> $eval_file
+ # comet-score -s $src_file -t $pure_file -r $ref_file --model $root_dir/wmt22-comet-da/checkpoints/model.ckpt | tee $comet22_file
+ # echo "Comet22 Score" >> $eval_file
+ # tail -n1 $comet22_file >> $eval_file # keep only the average COMET score
+
+ # echo -e "decode finished! \n decode tokenized file in $decode_file \n detokenized file in $pure_file \n sacrebleu file in $eval_file"
+ # done
subword-nmt/.github/workflows/pythonpublish.yml ADDED
@@ -0,0 +1,26 @@
+ name: Upload Python Package
+
+ on:
+   release:
+     types: [created]
+
+ jobs:
+   deploy:
+     runs-on: ubuntu-latest
+     steps:
+     - uses: actions/checkout@v1
+     - name: Set up Python
+       uses: actions/setup-python@v1
+       with:
+         python-version: '3.x'
+     - name: Install dependencies
+       run: |
+         python -m pip install --upgrade pip
+         pip install setuptools wheel twine
+     - name: Build and publish
+       env:
+         TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+         TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+       run: |
+         python setup.py sdist bdist_wheel
+         twine upload dist/*
subword-nmt/.gitignore ADDED
@@ -0,0 +1,105 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ .static_storage/
+ .media/
+ local_settings.py
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # pyenv
+ .python-version
+
+ # celery beat schedule file
+ celerybeat-schedule
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
subword-nmt/CHANGELOG.md ADDED
@@ -0,0 +1,52 @@
+ CHANGELOG
+ ---------
+ v0.3.9:
+ - byte-level BPE support
+ - remove support for Python 2
+
+ v0.3.8:
+ - multiprocessing support (get_vocab and apply_bpe)
+ - progress bar for learn_bpe
+ - seed parameter for deterministic BPE dropout
+ - ignore some unicode line separators which would crash subword-nmt
+
+ v0.3.7:
+ - BPE dropout (Provilkov et al., 2019)
+ - more efficient glossaries (https://github.com/rsennrich/subword-nmt/pull/69)
+
+ v0.3.6:
+ - fix to subword-bpe command encoding
+
+ v0.3.5:
+ - fix to subword-bpe command under Python 2
+ - wider support of --total-symbols argument
+
+ v0.3.4:
+ - segment_tokens method to improve library usability (https://github.com/rsennrich/subword-nmt/pull/52)
+ - support regex glossaries (https://github.com/rsennrich/subword-nmt/pull/56)
+ - allow unicode separators (https://github.com/rsennrich/subword-nmt/pull/57)
+ - new option --total-symbols in learn-bpe (commit 61ad8)
+ - fix documentation (best practices) (https://github.com/rsennrich/subword-nmt/pull/60)
+
+ v0.3:
+ - library is now installable via pip
+ - fix occasional problems with UTF-8 whitespace and new lines in learn_bpe and apply_bpe.
+ - do not silently convert UTF-8 newline characters into "\n"
+ - do not silently convert UTF-8 whitespace characters into " "
+ - UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE
+
+ v0.2:
+ - different, more consistent handling of end-of-word token (commit a749a7) (https://github.com/rsennrich/subword-nmt/issues/19)
+ - allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
+ - made learn_bpe.py deterministic (commit 4c54e)
+ - various changes to make handling of UTF more consistent between Python versions
+ - new command line arguments for apply_bpe.py:
+   - '--glossaries' to prevent given strings from being affected by BPE
+   - '--merges' to apply a subset of learned BPE operations
+ - new command line arguments for learn_bpe.py:
+   - '--dict-input': rather than raw text file, interpret input as a frequency dictionary (as created by get_vocab.py).
+
+ v0.1:
+ - consistent cross-version unicode handling
+ - all scripts are now deterministic