In machine translation, and neural MT in particular, properly pre-processing input before passing it to the learner can greatly increase translation accuracy. This document describes the preprocessing options available within xnmt, and documents where external executables can be plugged into the experiment framework.
A number of tokenization methods are available out of the box; others can be plugged in either with some help (like sentencepiece) or by passing parameters through the experiment framework to the external tools.
Multiple tokenizers can be run on the same text; for example, running the Moses tokenizer before performing byte-pair encoding (BPE) may be preferable to using either alone. It is worth noting, however, that if you want to specify your vocabulary size exactly, an exact-size tokenizer like BPE should be specified (and thus run) last.
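The ordering constraint can be illustrated with a small sketch (this is not xnmt's actual API; the tokenizer functions below are stand-ins): tokenizers are composed left to right, and the exact-vocabulary stage goes last so no later stage disturbs its output.

```python
# Illustrative sketch (not xnmt's actual API) of chaining tokenizers so
# that the exact-vocabulary-size tokenizer runs last.
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

def chain(*tokenizers: Tokenizer) -> Tokenizer:
    """Compose tokenizers left to right: the output tokens of one stage
    are re-joined with spaces and fed to the next stage."""
    def run(text: str) -> List[str]:
        for tok in tokenizers:
            text = " ".join(tok(text))
        return text.split()
    return run

def rule_based(text: str) -> List[str]:
    # Stand-in for a Moses-style tokenizer: split off trailing punctuation.
    out = []
    for w in text.split():
        if w and w[-1] in ".,!?":
            out.extend([w[:-1], w[-1]])
        else:
            out.append(w)
    return out

def subword(text: str) -> List[str]:
    # Stand-in for an exact-vocabulary subword tokenizer such as BPE.
    return [piece for w in text.split() for piece in (w[:3], w[3:]) if piece]

pipeline = chain(rule_based, subword)
print(pipeline("Hello, world!"))
```

Running the subword stage last guarantees the final token inventory is the one the subword model fixed.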
- Sentencepiece: An external tokenizer library that permits a large number of tokenization options, is written in C++, and is very fast. It is an optional dependency for xnmt (install via `pip install sentencepiece`; see `requirements-extra.txt`). Specification of the training file is set through the experiment framework, but that (and all other) options can be passed transparently by adding them to the experiment config. See the Sentencepiece section for more specific information on this tokenizer.
- External Tokenizers: Any external tokenizer can be used as long as it reads from stdin and writes tokenized output to stdout. A single YAML dictionary labelled `tokenizer_args` is used to pass all (and any) options to the external tokenizer. The option `detokenizer_path`, and its option dictionary, `detokenizer_args`, can optionally be used to specify a detokenizer.
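The stdin/stdout contract can be sketched as follows (this is an illustration of the protocol, not xnmt's internal code; the lowercasing "tokenizer" is a placeholder for a real executable such as the Moses tokenizer script):

```python
# Minimal sketch of driving an external stdin/stdout tokenizer.
import subprocess
import sys

def external_tokenize(command, text):
    """Feed `text` to `command` on stdin and return its stdout."""
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout

# Placeholder tokenizer: a Python one-liner that lowercases its input.
cmd = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read().lower())"]
print(external_tokenize(cmd, "Hello WORLD"))
```

Any executable that follows this pipe contract can be dropped in, with its flags supplied through `tokenizer_args`.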
- Byte-Pair Encoding: A compression-inspired unsupervised sub-word unit encoding that performs well (Sennrich et al., 2016) and permits specification of an exact vocabulary size. Intended to be native to xnmt (written in Python) and invoked with tokenizer type `bpe`; right now, however, there is no separate BPE implementation (contributions are welcome). Sentencepiece provides a `bpe` option that performs something similar for a fixed vocabulary size; see the following section for more details.
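To make the technique concrete, here is a toy merge learner in the style of Sennrich et al. (2016). It is illustrative only; the sentencepiece implementation (and any future xnmt-native one) differs in detail.

```python
# Toy byte-pair-encoding merge learner: repeatedly merge the most
# frequent adjacent symbol pair in a word-frequency dictionary.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from {word: frequency}."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 2}, 2))
```

Each learned merge shrinks frequent character sequences into single symbols, which is what lets the final vocabulary size be controlled exactly by the number of merges.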
The YAML options supported by the SentencepieceTokenizer are almost exactly those presented in the Sentencepiece readme, namely:
- `model_type`: one of `unigram` (the default), `bpe`, `char`, or `word`. Please refer to the sentencepiece documentation for more details.
- `model_prefix`: the prefix under which the trained model will be saved (sentencepiece writes `<model_prefix>.model` and `<model_prefix>.vocab`).
- `vocab_size`: fixes the vocabulary size.
- `hard_vocab_limit`: setting this to `False` will make the vocab size a soft limit, which is useful for small datasets. This is `True` by default.
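As a sketch, these options might appear in an experiment config as below. The class and field names here are illustrative assumptions and should be checked against your xnmt version.

```yaml
# Hypothetical config fragment; verify names against your xnmt version.
tokenizers:
- !SentencepieceTokenizer
  train_files: [train.en]
  model_prefix: models/spm.en
  model_type: bpe
  vocab_size: 8000
  hard_vocab_limit: False
```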
Some notable exceptions are below:
- Instead of `extra_options`, since one must be able to pass separate options to the encoder and the decoder, use `encode_extra_options` and `decode_extra_options`.
- When specifying extra options as above, note that `eos` and `bos` are both off-limits, and will produce odd errors in `vocab.py`. This is because these options add `<s>` and `</s>` to the output, which are already added by xnmt, and are reserved types.
- Unfortunately, right now, if tokenizers are chained together we see the following behavior:
  - If the Moses tokenizer is run first, and tokenizes files that are to be used for training BPE in Sentencepiece, Sentencepiece will learn from the original files, not the Moses-tokenized ones.