eXtensible Neural Machine Translation

This is a repository for the extensible neural machine translation toolkit xnmt. It is written in Python and based on DyNet.

Getting Started

Prerequisites

xnmt requires Python 3.6.

Before running xnmt you must install the required packages, including the Python bindings for DyNet. This can be done by running pip install -r requirements.txt

Next, install xnmt by running python setup.py install for normal usage or python setup.py develop for development.

Running the examples

xnmt includes a series of tutorial-style examples in the examples subfolder. These are a good starting point for familiarizing yourself with specifying models and experiments. To run the first experiment, use the following:

python -m xnmt.xnmt_run_experiments examples/01_standard.yaml

Make sure to read the comments provided in examples/01_standard.yaml.

See experiments.md for more details about writing experiment configuration files that allow you to specify the various parts of an experiment (preprocessing, model architecture, training, and evaluation).

Running unit tests

From the main directory, run:

python -m unittest discover

Or, to run a specific test, use e.g.:

python -m unittest test.test_run.TestRunningConfig.test_standard

Cython modules

If you wish to use all the modules in xnmt that require Cython, you need to build the Cython extensions with the following command:

python setup.py build_ext --inplace --use-cython-extensions

Experiment configuration file format

Configuration files are in YAML dictionary format.

At the top level, a config file consists of a dictionary where keys are experiment names and values are the experiment specifications. By default, all experiments are run in lexicographical order, but xnmt_run_experiments can also be told to run only a selection of the specified experiments. An example template with two experiments looks like this:

exp1: !Experiment
  exp_global: ...
  preproc: ...
  model: ...
  train: ...
  evaluate: ...
exp2: !Experiment
  exp_global: ...
  preproc: ...
  model: ...
  train: ...
  evaluate: ...

!Experiment is YAML syntax specifying a Python object of the same name, and its parameters will be passed on to the Python constructor. There can be a special top-level entry named defaults; this experiment will never be run, but can be used as a template where components are partially shared using YAML anchors or the !Ref mechanism (more on this later).
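As a minimal sketch of the defaults mechanism, standard YAML anchors (&name) and aliases (*name) can be used to share a component between the template and a concrete experiment; the component values below are placeholders:

defaults: !Experiment
  exp_global: !ExpGlobal &shared_globals
    default_layer_dim: 512
    dropout: 0.3
exp1: !Experiment
  exp_global: *shared_globals   # re-uses the anchored ExpGlobal defined under defaults
  model: ...
  train: ...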

The usage of exp_global, preproc, model, train, and evaluate is explained below. Not all of them need to be specified, depending on the use case.

exp_global

This specifies settings that are global to this experiment. An example:

exp_global: !ExpGlobal
  model_file: '{EXP_DIR}/models/{EXP}.mod'
  log_file: '{EXP_DIR}/logs/{EXP}.log'
  default_layer_dim: 512
  dropout: 0.3

Note that for any string used here or anywhere else in the config file, {EXP} will be overwritten by the name of the experiment, {EXP_DIR} by the directory the config file lies in, and {PID} by the process id.
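As a hypothetical illustration of this substitution:

model_file: '{EXP_DIR}/models/{EXP}.mod'
# for an experiment named exp1 defined in /home/user/experiments/my_exps.yaml, this resolves to:
model_file: '/home/user/experiments/models/exp1.mod'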

To obtain a full list of allowed parameters, please check the constructor of ExpGlobal, specified under xnmt/exp_global.py. Behind the scenes, this class also manages the DyNet parameters; it is therefore referenced by all components that use DyNet parameters.

preproc

xnmt supports a variety of data preprocessing features. Please refer to preprocessing.rst for details.

model

This specifies the model architecture. A typical example looks like this:

model: !DefaultTranslator
  src_reader: !PlainTextReader
    vocab: !Vocab {vocab_file: examples/data/head.ja.vocab}
  trg_reader: !PlainTextReader
    vocab: !Vocab {vocab_file: examples/data/head.en.vocab}
  encoder: !BiLSTMSeqTransducer
    layers: 1
  attender: !MlpAttender
    hidden_dim: 512
    state_dim: 512
    input_dim: 512
  trg_embedder: !SimpleWordEmbedder
    emb_dim: 512
  decoder: !MlpSoftmaxDecoder
    layers: 1
    mlp_hidden_dim: 512
    bridge: !CopyBridge {}

The top level entry is typically DefaultTranslator, which implements a standard attentional sequence-to-sequence model. It allows flexible specification of encoder, attender, source / target embedder, and other settings. Again, to obtain the full list of supported options, please refer to the corresponding class initializer methods.

Note that some of these Python objects are passed to their parent object's initializer method, which requires that the children are initialized first. xnmt therefore uses a bottom-up initialization strategy, where siblings are initialized in the order they appear in the constructor. Among other things, this causes exp_global (the first child of the top-level experiment) to be initialized before any model component, so that model components are free to use exp_global's global default settings, DyNet parameters, etc. It also guarantees that preprocessing is carried out before model training.

train

A typical example looks like this:

train: !SimpleTrainingRegimen
  trainer: !AdamTrainer
    alpha: 0.001
  run_for_epochs: 2
  src_file: examples/data/head.ja
  trg_file: examples/data/head.en
  dev_tasks:
    - !LossEvalTask
      src_file: examples/data/head.ja
      ref_file: examples/data/head.en

The expected object here is a subclass of TrainingRegimen. Besides SimpleTrainingRegimen, multi-task style training regimens are supported. For multi-task training, each training regimen uses its own model, so in this case models must be specified as sub-components of the training regimen. Please refer to examples/08_multitask.yaml for more details on this.

evaluate

If specified, the model is tested after training has finished.
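For illustration, a minimal evaluate section might look like the following. This sketch assumes an AccuracyEvalTask analogous to the LossEvalTask used under dev_tasks above; check the bundled examples and the class initializers for the exact class and parameter names:

evaluate:
  - !AccuracyEvalTask
    eval_metrics: bleu
    src_file: examples/data/head.ja
    ref_file: examples/data/head.en
    hyp_file: examples/output/{EXP}.test_hyp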

Translator Structure

If you want to dig into using xnmt for your research, it is necessary to understand its overall structure. The main class that you need to be aware of is Translator, which can calculate the conditional probability of the target sentence given the source sentence. This is useful for calculating losses at training time, or for generating sentences at test time. It consists of five major components:

  1. Source Embedder: This converts input symbols into continuous-space vectors. Usually this
    is done by looking up the word in a lookup table, but it could be done any other way.
  2. Encoder: Takes the embedded input and encodes it, for example using a bi-directional
    LSTM to calculate context-sensitive embeddings.
  3. Attender: This is the “attention” module, which takes the encoded input and decoder
    state, then calculates attention.
  4. Target Embedder: This converts output symbols into continuous-space vectors like its counterpart
    in the source language.
  5. Decoder: This calculates a probability distribution over the words in the output,
    either to calculate a loss function during training, or to generate outputs at test time.

In addition, given this Translator, we have a SearchStrategy that takes the probabilities calculated by the decoder and actually generates outputs at test time.
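To make the division of labor concrete, below is a framework-free toy sketch of how the five components interact when scoring one sentence pair. Every name and every piece of arithmetic here is an illustrative stand-in, not xnmt's actual API; a real model builds DyNet expressions through the classes above.

import math
import random

class ToyEmbedder:
  # Components 1 and 4: map word ids to continuous vectors via a lookup table.
  def __init__(self, vocab_size, dim):
    self.table = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
  def embed(self, word_id):
    return self.table[word_id]

def toy_encode(embedded_src):
  # Component 2: a real encoder might be a BiLSTM; here it is just the identity.
  return embedded_src

def toy_attend(encodings, dec_state):
  # Component 3: a real attender scores encodings against dec_state;
  # here we simply average the encodings (uniform attention).
  n = len(encodings)
  return [sum(col) / n for col in zip(*encodings)]

def toy_decoder_loss(context, dec_state, trg_word, vocab_size):
  # Component 5: a real decoder scores trg_word given context and dec_state;
  # here we just return the negative log-likelihood of a uniform distribution.
  return math.log(vocab_size)

def sent_loss(src_ids, trg_ids, src_embedder, trg_embedder, vocab_size):
  encodings = toy_encode([src_embedder.embed(w) for w in src_ids])
  dec_state = [0.0] * len(encodings[0])
  loss = 0.0
  for w in trg_ids:
    context = toy_attend(encodings, dec_state)
    loss += toy_decoder_loss(context, dec_state, w, vocab_size)
    dec_state = trg_embedder.embed(w)  # feed the embedded target word back into the state
  return loss

src_embedder, trg_embedder = ToyEmbedder(100, 4), ToyEmbedder(100, 4)
print(sent_loss([3, 7, 2], [5, 9, 2], src_embedder, trg_embedder, vocab_size=100))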

There are a bunch of auxiliary classes as well to handle saving/loading of the inputs, etc. However, if you’re interested in using xnmt to develop a new method, most of your work will probably go into one or a couple of the classes listed above.

Preprocessing

In machine translation, and neural MT in particular, properly pre-processing input before passing it to the learner can greatly increase translation accuracy. This document describes the preprocessing options available within xnmt, and documents where external executables can be plugged into the experiment framework.

Tokenization

A number of tokenization methods are available out of the box; others can be plugged in either with some help (like Sentencepiece) or by passing parameters through the experiment framework on to the external tokenizers.

Multiple tokenizers can be run on the same text; for example, it may be that running the Moses tokenizer before performing byte-pair encoding (BPE) is preferable to using either one on its own. It is worth noting, however, that if you want to specify your vocabulary size exactly at tokenization time, an exact-size tokenizer like BPE should be specified (and thus run) last.

  1. Sentencepiece: An external tokenizer library that permits a large number of tokenization
    options, is written in C++, and is very fast. However, it must be installed separately from xnmt. The training file is specified through the experiment framework, and that option (and all others) can be passed transparently by adding them to the experiment config. See the Sentencepiece section for more specific information on this tokenizer.
  2. External Tokenizers: Any external tokenizer can be used as long as it reads from stdin and writes
    to stdout. A single YAML dictionary labelled tokenizer_args is used to pass any and all options to the external tokenizer. The option detokenizer_path, and its option dictionary, detokenizer_args, can optionally be used to specify a detokenizer.

Sentencepiece

The YAML options supported by the SentencepieceTokenizer are almost exactly those presented in the Sentencepiece readme. Some notable exceptions are below:

  • Instead of extra_options, since one must be able to pass separate options to the encoder and the decoder, use encode_extra_options and decode_extra_options, respectively.
  • When specifying extra options as above, note that eos and bos are both off-limits, and will produce odd errors in vocab.py. This is because these options add <s> and </s> to the output, which are already added by xnmt, and are reserved types.
  • Unfortunately, right now, if tokenizers are chained together we see the following behavior:
    • If the Moses tokenizer is run first, and tokenizes files that are to be used for training BPE in Sentencepiece, Sentencepiece will learn off of the original files, not the Moses-tokenized ones.
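A sketch of what a Sentencepiece specification might contain: vocab_size and model_type are standard Sentencepiece training options (per the readme referenced above), while the list placement and the train_files key are illustrative assumptions; check the SentencepieceTokenizer initializer for the exact names:

tokenizers:                     # illustrative placement
  - !SentencepieceTokenizer
    train_files:                # illustrative key name
      - examples/data/head.en
    vocab_size: 8000            # standard Sentencepiece option
    model_type: bpe             # standard Sentencepiece option
    encode_extra_options: []    # see the note on eos/bos above
    decode_extra_options: []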

API Doc

Embedder

Encoder

TODO: Pending a refactoring to reduce the number of classes in Encoder, this doc should also be refactored.

Attender

Decoder

SearchStrategy

class search_strategy.SearchStrategy

Bases: object

class search_strategy.BeamSearch(beam_size, max_len=100, len_norm=None)

Bases: search_strategy.SearchStrategy

generate_output(decoder, attender, output_embedder, dec_state, src_length=None, forced_trg_ids=None)
Parameters:
  • decoder – decoder.Decoder subclass
  • attender – attender.Attender subclass
  • output_embedder – embedder.Embedder subclass
  • dec_state – The decoder state
  • src_length – length of src sequence, required for some types of length normalization
  • forced_trg_ids – list of word ids; if given, generation is forced to produce this target sequence
Returns:
  (id list, score)

Other Classes

TODO: Add documentation.

Programming Style

Philosophy

The over-arching goal of xnmt is that it be easy to use for research. Implementing a new method should ideally require only minimal changes (e.g. changes limited to a single file, over-riding an existing class). Obviously this ideal will not be realizable all the time, but when designing new functionality, try to keep this goal in mind. If there are tradeoffs, the following is the order of priority (of course achieving all of them is great!):

  1. Code Correctness
  2. Extensibility and Readability
  3. Accuracy and Effectiveness of the Models
  4. Efficiency

Coding Conventions

There is also a minimal set of coding style conventions:

  • Follow Python 3 conventions; Python 2 is no longer supported.
  • Functions should be snake_case, classes should be UpperCamelCase.
  • Indentation should be two whitespace characters.
  • Docstrings should be made in reST format (e.g. :param param_name:, :returns: etc.)

A collection of unit tests exists to make sure things don’t break.

In variable names, common words should be abbreviated as:

  • source -> src
  • target -> trg
  • sentence -> sent
  • hypothesis -> hyp
  • reference -> ref

For printing output in a consistent and controllable way, a few conventions should be followed (see the official documentation at https://docs.python.org/3/howto/logging.html#when-to-use-logging for more details):

  • logger.info() should be used for most outputs. Such outputs are assumed to be usually shown, but can be turned off if needed.
  • print() for regular output without which the execution would be incomplete. The main use case is to print final results, etc.
  • logger.debug() for detailed information that isn't needed in normal operation.
  • logger.warning(), logger.error() or logger.critical() for problematic situations.
  • yaml_logger(dict) for structured logging of information that should be easily automatically parseable and might be too bulky to print to the console.

These loggers can be requested as follows:

import logging
logger = logging.getLogger('xnmt')
yaml_logger = logging.getLogger('yaml')
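A minimal sketch of these conventions in use; the messages and the event dictionary passed to the yaml logger are illustrative only:

import logging

logger = logging.getLogger('xnmt')
yaml_logger = logging.getLogger('yaml')

logger.info("Starting training")                   # normally shown, can be silenced
logger.debug("Minibatch 17: 32 sentences")         # detail not needed in normal operation
logger.warning("No development set specified")     # problematic situation
yaml_logger.info({"epoch": 1, "dev_loss": 123.4})  # structured, machine-parseable record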

Contributing

Go ahead and send a pull request! If you're not sure whether something will be useful and want to ask beforehand, feel free to open an issue on GitHub.