eXtensible Neural Machine Translation¶
This is a repository for the extensible neural machine translation toolkit xnmt.
It is written in Python and based on DyNet.
Getting Started¶
Prerequisites¶
xnmt requires Python 3.6.
Before running xnmt you must install the required packages, including Python bindings for DyNet.
This can be done by running pip install -r requirements.txt
Next, install xnmt by running python setup.py install for normal usage or python setup.py develop for development.
Running the examples¶
xnmt includes a series of tutorial-style examples in the examples subfolder.
These are a good starting point for getting familiar with specifying models and experiments. To run the first experiment, use the following:
python -m xnmt.xnmt_run_experiments examples/01_standard.yaml
Make sure to read the comments provided in examples/01_standard.yaml.
See experiments.md for more details about writing experiment configuration files, which allow you to specify the various components of an experiment.
Running unit tests¶
From the main directory, run: python -m unittest discover
Or, to run a specific test, use e.g. python -m unittest test.test_run.TestRunningConfig.test_standard
Cython modules¶
If you wish to use all the modules in xnmt that require Cython, you need to build the Cython extensions with this command:
python setup.py build_ext --inplace --use-cython-extensions
Experiment configuration file format¶
Configuration files are in YAML dictionary format.
At the top-level, a config file consists of a dictionary where keys are experiment names and values are the experiment specifications. By default, all experiments are run in lexicographical ordering, but xnmt_run_experiments can also be told to run only a selection of the specified experiments. An example template with 2 experiments looks like this:
exp1: !Experiment
  exp_global: ...
  preproc: ...
  model: ...
  train: ...
  evaluate: ...
exp2: !Experiment
  exp_global: ...
  preproc: ...
  model: ...
  train: ...
  evaluate: ...
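The ordering behavior described above can be sketched in a few lines of Python. This is an illustrative toy, not xnmt's actual runner code; the dictionary contents are placeholders:

```python
# Toy sketch: experiments run in lexicographic order of their names,
# and a top-level "defaults" entry is never run itself.
config = {"exp2": {"train": "..."}, "exp1": {"train": "..."}, "defaults": {}}

run_order = [name for name in sorted(config) if name != "defaults"]
print(run_order)  # ['exp1', 'exp2']
```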
!Experiment is YAML syntax specifying a Python object of the same name, and its parameters will be passed on to the Python constructor.
There can be a special top-level entry named defaults; this experiment will never be run, but can be used as a template where components are partially shared using YAML anchors or the !Ref mechanism (more on this later).
The usage of exp_global, preproc, model, train, and evaluate is explained below.
Not all of them need to be specified, depending on the use case.
exp_global¶
This specifies settings that are global to this experiment. An example:
exp_global: !ExpGlobal
  model_file: '{EXP_DIR}/models/{EXP}.mod'
  log_file: '{EXP_DIR}/logs/{EXP}.log'
  default_layer_dim: 512
  dropout: 0.3
Note that for any strings used here or anywhere in the config file, {EXP} will be overwritten by the name of the experiment, {EXP_DIR} will be overwritten by the directory the config file lies in, and {PID} by the process id.
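The substitution behaves roughly like the following sketch. This is illustrative only, not xnmt's actual implementation:

```python
import os

def fill_placeholders(s, exp_name, config_path):
    # Illustrative stand-in for xnmt's placeholder substitution:
    # {EXP} -> experiment name, {EXP_DIR} -> config file's directory,
    # {PID} -> current process id.
    return (s.replace("{EXP}", exp_name)
             .replace("{EXP_DIR}", os.path.dirname(os.path.abspath(config_path)))
             .replace("{PID}", str(os.getpid())))

path = fill_placeholders("{EXP_DIR}/models/{EXP}.mod",
                         "exp1", "examples/01_standard.yaml")
```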
To obtain a full list of allowed parameters, please check the constructor of ExpGlobal, specified under xnmt/exp_global.py. Behind the scenes, this class also manages the DyNet parameters; it is therefore referenced by all components that use DyNet parameters.
preproc¶
xnmt supports a variety of data preprocessing features. Please refer to preprocessing.rst for details.
model¶
This specifies the model architecture. A typical example looks like this:
model: !DefaultTranslator
  src_reader: !PlainTextReader
    vocab: !Vocab {vocab_file: examples/data/head.ja.vocab}
  trg_reader: !PlainTextReader
    vocab: !Vocab {vocab_file: examples/data/head.en.vocab}
  encoder: !BiLSTMSeqTransducer
    layers: 1
  attender: !MlpAttender
    hidden_dim: 512
    state_dim: 512
    input_dim: 512
  trg_embedder: !SimpleWordEmbedder
    emb_dim: 512
  decoder: !MlpSoftmaxDecoder
    layers: 1
    mlp_hidden_dim: 512
    bridge: !CopyBridge {}
The top level entry is typically DefaultTranslator, which implements a standard attentional sequence-to-sequence model. It allows flexible specification of encoder, attender, source / target embedder, and other settings. Again, to obtain the full list of supported options, please refer to the corresponding class initializer methods.
Note that some of these Python objects are passed to their parent object’s initializer method, which requires that the children are initialized first.
xnmt therefore uses a bottom-up initialization strategy, where siblings are initialized in the order they appear in the constructor. Among other things, this causes exp_global (the first child of the top-level experiment) to be initialized before any model component, so that model components are free to use exp_global’s global default settings, DyNet parameters, etc.
It also guarantees that preprocessing is carried out before model training.
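The bottom-up order can be illustrated with plain Python constructor calls, since arguments are evaluated before the enclosing constructor runs, which mirrors how the YAML deserializer builds the object graph. The class and component names here are illustrative, not xnmt's actual classes:

```python
# Toy illustration of bottom-up initialization: nested (child) objects are
# constructed before their parent, in the order they appear.
init_order = []

class Component:
    def __init__(self, name, *children):
        init_order.append(name)
        self.name, self.children = name, children

experiment = Component("experiment",
                       Component("exp_global"),
                       Component("model", Component("encoder")))
print(init_order)  # ['exp_global', 'encoder', 'model', 'experiment']
```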
train¶
A typical example looks like this:
train: !SimpleTrainingRegimen
  trainer: !AdamTrainer
    alpha: 0.001
  run_for_epochs: 2
  src_file: examples/data/head.ja
  trg_file: examples/data/head.en
  dev_tasks:
  - !LossEvalTask
    src_file: examples/data/head.ja
    ref_file: examples/data/head.en
The expected object here is a subclass of TrainingRegimen. Besides SimpleTrainingRegimen, multi-task style training regimens are supported.
For multi-task training, each training regimen uses its own model, so in this case models must be specified as sub-components of the training regimen. Please refer to examples/08_multitask.yaml for more details on this.
evaluate¶
If specified, the model is tested after training has finished.
Translator Structure¶
If you want to dig into using xnmt for your research, it is necessary to understand the overall structure. The main class that you need to be aware of is Translator, which can calculate the conditional probability of the target sentence given the source sentence.
This is useful for calculating losses at training time, or generating sentences at test time.
Basically it consists of 5 major components:
- Source Embedder: This converts input symbols into continuous-space vectors. Usually this is done by looking up the word in a lookup table, but it could be done any other way.
- Source Encoder: Takes the embedded input and encodes it, for example using a bi-directional LSTM to calculate context-sensitive embeddings.
- Attender: This is the “attention” module, which takes the encoded input and decoder state, then calculates attention.
- Target Embedder: This converts output symbols into continuous-space vectors like its counterpart in the source language.
- Target Decoder: This calculates a probability distribution over the words in the output, either to calculate a loss function during training, or to generate outputs at test time.
In addition, given this Translator, we have a SearchStrategy that takes the probabilities calculated by the decoder and actually generates outputs at test time.
There are a bunch of auxiliary classes as well to handle saving/loading of the inputs, etc. However, if you’re interested in using xnmt to develop a new method, most of your work will probably go into one or a couple of the classes listed above.
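To make the division of labor concrete, here is a fully self-contained toy sketch of how the five components cooperate to compute a training loss. Every function here is an illustrative stand-in (scalar "vectors", uniform attention, a fake per-word loss), NOT xnmt's actual API:

```python
def embed(word):                      # Embedder: symbol -> vector (toy: a scalar)
    return float(len(word))

def encode(embs):                     # Encoder: toy "context-sensitive" running sums
    total, out = 0.0, []
    for e in embs:
        total += e
        out.append(total)
    return out

def attend(encodings, dec_state):     # Attender: toy uniform attention
    return sum(encodings) / len(encodings)

def decode_step(state, prev_emb, context, gold):  # Decoder: toy per-word "loss"
    new_state = (state or 0.0) + prev_emb + context
    return new_state, abs(new_state - embed(gold))

def calc_loss(src_words, trg_words):
    # Embed and encode the source, then decode the target word by word,
    # attending to the encodings at each step and accumulating the loss.
    encodings = encode([embed(w) for w in src_words])
    state, loss, prev = None, 0.0, "<s>"
    for gold in trg_words:
        context = attend(encodings, state)
        state, word_loss = decode_step(state, embed(prev), context, gold)
        loss += word_loss
        prev = gold
    return loss
```

At test time a SearchStrategy would replace the gold target words with a search over the decoder's output distribution.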
Preprocessing¶
In machine translation, and neural MT in particular, properly pre-processing the input before passing it to the learner can greatly increase translation accuracy.
This document describes the preprocessing options available within xnmt, and documents where external executables can be plugged into the experiment framework.
Tokenization¶
A number of tokenization methods are available out of the box; others can be plugged in either with some help (like sentencepiece) or by passing parameters through the experiment framework through to the external decoders.
Multiple tokenizers can be run on the same text; for example, it may be (is there a citation?) that running the Moses tokenizer before performing Byte-pair encoding (BPE) is preferable to either one or the other. It is worth noting, however, that if you want to exactly specify your vocabulary size, an exact-size tokenizer like BPE should be specified (and thus run) last.
- Sentencepiece: An external tokenizer library that permits a large number of tokenization options, is written in C++, and is very fast. However, it must be installed separately from xnmt. Specification of the training file is set through the experiment framework, but that (and all other) options can be passed transparently by adding them to the experiment config. See the Sentencepiece section for more specific information on this tokenizer.
- External Tokenizers: Any external tokenizer can be used as long as it reads from stdin and writes to stdout. A single YAML dictionary labelled tokenizer_args is used to pass all (and any) options to the external tokenizer. The option detokenizer_path, and its option dictionary, detokenizer_args, can optionally be used to specify a detokenizer.
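The chaining behavior and the "exact-size tokenizer last" advice can be sketched as follows. Both tokenizers here are crude stand-ins written for illustration, not the real Moses or BPE implementations:

```python
def moses_like(line):
    # Stand-in for a rule-based tokenizer: split off punctuation.
    return line.replace(",", " ,").replace(".", " .")

def bpe_like(line):
    # Stand-in for an exact-vocabulary subword tokenizer: crudely split
    # long tokens in half, marking continuations with "@@".
    pieces = []
    for tok in line.split():
        if len(tok) > 4:
            mid = len(tok) // 2
            pieces.extend([tok[:mid] + "@@", tok[mid:]])
        else:
            pieces.append(tok)
    return " ".join(pieces)

def tokenize(line, tokenizers):
    # Apply tokenizers in order; the exact-vocabulary one runs last so it
    # determines the final vocabulary.
    for tok in tokenizers:
        line = tok(line)
    return line

out = tokenize("greetings, world.", [moses_like, bpe_like])
print(out)  # gree@@ tings , wo@@ rld .
```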
Sentencepiece¶
The YAML options supported by the SentencepieceTokenizer are almost exactly those presented in the Sentencepiece readme. Some notable exceptions are below:
- Instead of extra_options, since one must be able to pass separate options to the encoder and the decoder, use encode_extra_options and decode_extra_options, respectively.
- When specifying extra options as above, note that eos and bos are both off-limits, and will produce odd errors in vocab.py. This is because these options add <s> and </s> to the output, which are already added by xnmt, and are reserved types.
- Unfortunately, right now, if tokenizers are chained together we see the following behavior: if the Moses tokenizer is run first, and tokenizes files that are to be used for training BPE in Sentencepiece, Sentencepiece will learn off of the original files, not the Moses-tokenized ones.
API Doc¶
Embedder¶
Encoder¶
TODO: Pending refactoring to reduce the number of classes in Encoder; the doc should also be refactored.
Attender¶
Decoder¶
SearchStrategy¶
class search_strategy.BeamSearch(beam_size, max_len=100, len_norm=None)[source]¶
Bases: search_strategy.SearchStrategy
generate_output(decoder, attender, output_embedder, dec_state, src_length=None, forced_trg_ids=None)[source]¶
Parameters:
- decoder – decoder.Decoder subclass
- attender – attender.Attender subclass
- output_embedder – embedder.Embedder subclass
- dec_state – the decoder state
- src_length – length of the source sequence, required for some types of length normalization
- forced_trg_ids – list of word ids; if given, generation is forced to produce this target sequence
Returns: (id list, score)
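To clarify what BeamSearch does conceptually, here is a minimal, generic beam-search sketch over a toy next-word scorer. It illustrates the algorithm only; it is not xnmt's BeamSearch class, and toy_scores stands in for the decoder's output distribution:

```python
import math

def beam_search(step_scores, beam_size=2, max_len=3, eos="</s>"):
    # step_scores(prefix) -> {word: log_prob} for the next position.
    beams = [([], 0.0)]            # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_scores(prefix).items():
                hyp = (prefix + [word], score + lp)
                # Hypotheses ending in eos are complete; others compete
                # for the beam_size slots of the next step.
                (finished if word == eos else candidates).append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda h: -h[1])
        beams = candidates[:beam_size]
    finished.extend(beams)         # fall back to unfinished hypotheses
    return max(finished, key=lambda h: h[1])

def toy_scores(prefix):
    if not prefix:
        return {"hello": math.log(0.6), "hi": math.log(0.4)}
    return {"</s>": math.log(0.9), "world": math.log(0.1)}

best_words, best_score = beam_search(toy_scores)
print(best_words)  # ['hello', '</s>']
```

A length-normalization strategy (the len_norm parameter above) would rescale each hypothesis's score by its length before the final comparison.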
Other Classes¶
TODO: Add documentation.
Programming Style¶
Philosophy¶
The over-arching goal of xnmt is that it be easy to use for research. When implementing a new method, it should require only minimal changes (e.g. ideally the changes will be limited to a single file, over-riding an existing class). Obviously this ideal will not be realizable all the time, but when designing new functionality, try to think of this goal. If there are tradeoffs, the following is the order of priority (of course getting all is great!):
- Code Correctness
- Extensibility and Readability
- Accuracy and Effectiveness of the Models
- Efficiency
Coding Conventions¶
There is also a minimal set of coding style conventions:
- Follow Python 3 conventions, Python 2 is no longer supported.
- Functions should be snake_case, classes should be UpperCamelCase.
- Indentation should be two whitespace characters.
- Docstrings should be in reST format (e.g. :param param_name:, :returns:, etc.)
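A short docstring following these conventions (snake_case, two-space indentation, reST fields) might look like this; the function itself is an illustrative toy, not taken from xnmt:

```python
def attend(encodings, dec_state):
  """Compute a toy context value by uniformly averaging the encodings.

  :param encodings: list of encoder output values
  :param dec_state: current decoder state (unused in this toy example)
  :returns: the average of the encodings
  """
  return sum(encodings) / len(encodings)
```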
A collection of unit tests exists to make sure things don’t break.
In variable names, common words should be abbreviated as:
- source -> src
- target -> trg
- sentence -> sent
- hypothesis -> hyp
- reference -> ref
For printing output in a consistent and controllable way, a few conventions should be followed (see the official documentation: https://docs.python.org/3/howto/logging.html#when-to-use-logging for more details):
- logger.info() should be used for most outputs. Such outputs are assumed to be usually shown but can be turned off if needed.
- print() for regular output without which the execution would be incomplete. The main use case is to print final results, etc.
- logger.debug() for detailed information that isn’t needed in normal operation.
- logger.warning(), logger.error() or logger.critical() for problematic situations.
- yaml_logger(dict) for structured logging of information that should be easily automatically parseable and might be too bulky to print to the console.
These loggers can be requested as follows:
import logging
logger = logging.getLogger('xnmt')
yaml_logger = logging.getLogger('yaml')
Contributing¶
Go ahead and send a pull request! If you’re not sure whether something will be useful and want to ask beforehand, feel free to open an issue on GitHub.