Linux for Translators

HowTo: Managing translation memories in OmegaT

General

A translation memory ("TM") is a database in which texts are broken down into segments and stored in "translation units". Each translation unit consists of a pair of segments: the segment in the original text, and the segment in the translation. A segment is usually a sentence, though it may be just a phrase, or in some cases may span several sentences or even a whole paragraph.

OmegaT stores translation memories in the industry-standard TMX file format. In this HowTo, "a translation memory" therefore refers to a file.
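
To make the idea of translation units concrete, the following sketch writes a minimal TMX file containing a single unit and counts the units in it. The file name, language codes, header values and sentences are illustrative assumptions; the files that OmegaT writes contain more metadata than shown here.

```shell
#!/bin/sh
# Write a minimal TMX file (illustrative content) to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sample.tmx" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="EN-US" srclang="EN-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Speichern Sie die Datei.</seg></tuv>
    </tu>
  </body>
</tmx>
EOF

# Each <tu> element is one translation unit (a source/target segment pair).
tu_count=$(grep -c '<tu>' "$workdir/sample.tmx")
echo "Translation units: $tu_count"
```

Each tuv element holds one language variant of the segment, so a unit for a language pair contains two of them.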

The OmegaT philosophy

OmegaT is project oriented. Understanding this concept is crucial if you aim to use OmegaT effectively.

The OmegaT program does not have its "own" translation memory. It recognizes translation memories only in the context of a project. Each project has internal, input and output translation memories.

The internal translation memory

The internal translation memory is the project's own TM, which remains within the project, i.e. it is not used in other projects. It is the file project_save.tmx and can be found in the project's \omegat\ folder. This file contains only the segments that have been translated in the project up to now. It is therefore empty when the project is first created. The internal translation memory is used by the OmegaT program itself; as the user, you do not normally need to deal with it.

Input translation memories

Input translation memories are translation memories created from past translations that are to be used in the active project for reference. In order for OmegaT to make use of them, place them in the project's \tm\ folder before loading the project (or, if the project is already loaded, reload it).

As standard, the project's translation memory folder is at <project>\tm\ (where <project> is the project's main or "root" folder). However, if you have a folder on your computer in which you have collected translation memories, you can make this folder the project's translation memory folder by selecting it in OmegaT: choose Project > Properties, then hit the Browse button next to Translation Memory Folder. Selecting a translation memory folder in this way instructs OmegaT to look for the input translation memories there instead of in the default location of <project>\tm\. Once you have selected a different translation memory folder, OmegaT regards it as being part of the project.

There is no limit to the number of input translation memories (i.e. individual TMX files) you can have in a project. You may however want to limit the number (in the interests of speed), and select them carefully, in order to have more relevant matches displayed to you.
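
As a rough illustration of selecting input TMs carefully, the sketch below copies only one customer's TMX files into a project's tm folder (written tm/ on Linux). It runs in a scratch directory; the project name, the past-jobs folder and the file names are hypothetical.

```shell
#!/bin/sh
# Scratch layout standing in for a real project and a folder of past TMs.
base=$(mktemp -d)
project="$base/MyProject"
past="$base/past-jobs"
mkdir -p "$project/tm" "$past"
touch "$past/acme-manual.tmx" "$past/acme-website.tmx" "$past/other-client.tmx"

# Copy only the TMs that are relevant to this job (here: one customer).
cp "$past"/acme-*.tmx "$project/tm/"

# Show what OmegaT would find in the tm folder on the next load.
ls "$project/tm"
```

Remember that OmegaT reads the folder when the project is loaded, so reload the project if it was already open when you copied the files.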

Output translation memories

Hitting Ctrl-d in OmegaT causes your translated documents (e.g. a translation in a Word file, if your source text was also a Word file) to be created. It also creates an output translation memory of the project. You can use this output TM as an input TM in future projects.

Here we have an idiosyncrasy of OmegaT: there is no function within OmegaT itself for copying, moving or exporting output translation memories to a chosen location. They are deposited in the project's root folder, and what you then do with them is left to you. (OmegaT does let you open the project's folders in your file manager with Project > Access Project Contents.)

Three output translation memory files are created when you hit Ctrl-d. They are all placed in the project's main ("root") folder, and have the names <project name>-level1.tmx, <project name>-level2.tmx and <project name>-omegat.tmx. The text contained in these three files is identical; they differ only in the amount of formatting information they also contain. For more details of the differences between the three variants of output translation memories, refer to the OmegaT manual.

Viewing the content of translation memories in an OmegaT project

OmegaT does not have a function with which the user can open input translation memory files directly. Instead, if they have been placed in the \tm\ folder, they are loaded automatically by OmegaT when the project is opened. Individual segments in input TMs are displayed to you when a segment identical or similar to the active segment (full match or fuzzy match) is found. If you are not sure whether OmegaT has loaded an input TM, a simple way of checking is to search (with Ctrl-f) for a word or phrase that you know to be in it.
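
Outside OmegaT, you can at least check which TMX files in a folder contain a given phrase. The sketch below does this with grep in a scratch directory; the file names and sentences are made up for illustration. Note that it only confirms that the text is in a file, not that OmegaT has loaded the file, and that a phrase interrupted by formatting tags in the TMX file will not be found this way.

```shell
#!/bin/sh
# Scratch tm folder with two tiny stand-in TMX files.
base=$(mktemp -d)
mkdir -p "$base/tm"
printf '<tu><tuv xml:lang="EN"><seg>Tighten the bolt.</seg></tuv></tu>\n' > "$base/tm/manual.tmx"
printf '<tu><tuv xml:lang="EN"><seg>Open the menu.</seg></tuv></tu>\n' > "$base/tm/software.tmx"

# -r: search the folder recursively
# -l: print only the names of files that contain a match
# -F: treat the phrase as fixed text, not as a pattern
matches=$(grep -rlF 'Tighten the bolt' "$base/tm")
echo "$matches"
```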

Methods of managing translation memories when using OmegaT

As described above, OmegaT has, at the time of writing (November 2020), no functions of its own for managing translation memories; you must manage them yourself in your file manager. There are numerous ways of doing this, and OmegaT leaves it very much up to you, the user, how you go about it. Two methods, both popular, are described below.

The per-job project method

As its name suggests, in the per-job project method you create a new project for each job. (A job may contain one or more files to be translated.) After creating the project, you place any translation memories from past jobs (input TMs) that you think might be useful in the \tm\ folder.

Once you have understood the concept of projects and input and output TMs, this is not difficult. However, if you have tens or even hundreds of possibly useful translation memories from past jobs (e.g. in the same language combination, in the same subject, for the same customer), it is of course tedious to go looking for them in your file manager and copy them individually into the project's \tm\ folder each time you start a new job.

Instead, it is much more practical to collect translation memories in different folders ("repositories") according to language combination, subject, etc., and then, when you create a new project, simply to select the relevant folder using Project > Properties > Translation Memory Folder (Browse).

This requires some discipline on your part: whenever you finish a job, you must copy one of the three output translation memories into the relevant repository. There are means of automating this process (see below), but ultimately, it is your responsibility to decide how you organize your repositories.

The recycled project method

This method is quite different: instead of creating a new project for each job, you "recycle" an existing project. To do this, you remove the source files from the last job from the \source\ folder, and replace them with the source files for the current job. The editor pane of OmegaT only displays the content of files that are in \source\ when the project is loaded. Consequently, once you have removed files from \source\ after translation and you load the project again, the editor no longer shows you the content of these files. However, their segments are still present in the internal translation memory, and will be presented to you as matches or search results.
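
The file operations involved in recycling a project can be sketched as follows. The example runs in a scratch directory; the archive and incoming folders and all file names are hypothetical, and the sketch assumes the project is reloaded in OmegaT afterwards.

```shell
#!/bin/sh
# Scratch layout standing in for a recycled project.
base=$(mktemp -d)
project="$base/Recycled"
mkdir -p "$project/source" "$project/omegat" "$base/archive/job-001" "$base/incoming"
touch "$project/source/old-job.docx" "$project/omegat/project_save.tmx" "$base/incoming/new-job.docx"

# Move the finished job's source files out of the project...
mv "$project/source"/* "$base/archive/job-001/"
# ...and bring in the source files for the next job.
cp "$base/incoming"/* "$project/source/"

# project_save.tmx is not touched, so earlier translations stay available.
ls "$project/source" "$project/omegat"
```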

Advantages and drawbacks of the different methods

A major advantage of the recycled project method is that it spares you a lot of the effort of setting up a project, and in particular of managing translation memories by moving them around in your file manager. All the segments from past jobs completed in the recycled project are still present in the project's internal translation memory and are accessible in the form of matches or search results.

The recycled project method also has drawbacks, however.

One drawback concerns orphan segments. Orphan segments are segments that were translated in the current project but are no longer present in the files in the \source\ folder. This is the case for example when a source file is changed mid-project (perhaps because your customer sent you an edited version before you had completed the translation, or because you edited it yourself, for example to remove an unwanted line break). Such segments may be "half-finished", rough translations, but they are still available to you, and useful in these cases.

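These orphan segments are not included in the output TM. In a "recycled" project, however, every segment from a past job becomes an orphan segment once its source file has been removed. Genuinely half-finished orphan segments are therefore still present, but indistinguishable from segments in completed translations. This may give you a false sense of security about their reliability.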

This marking of all segments from past jobs as "orphan segments" has a further drawback compared to the per-job project method. If you create a new project for each job, segments in fuzzy matches and text searches are accompanied by an indication of the project in which they were produced. If you name your projects intelligently (for example indicating the customer, subject, and/or approximate date of the project), this information is therefore provided along with the match.

Variants of the per-job and recycled project methods

These are not the only possible ways of managing OmegaT projects. If you use the per-job project method, you can create an empty project with all the desired settings (language combination, a particular \tm\ repository, etc.) and use it as a template for relevant jobs. Equally, using the recycled project method does not prevent you from having separate projects for different customers, different subjects, etc., or adding input translation memories selectively for each project.
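
One possible way of using a template project is simply to copy its folder under a new name for each job. The sketch below shows this in a scratch directory. The project names are hypothetical, the template is represented by empty folders and an empty omegat.project file, and copying the folder is an assumption about how you might apply the template rather than a procedure prescribed by OmegaT.

```shell
#!/bin/sh
# Scratch stand-in for an empty template project.
base=$(mktemp -d)
template="$base/Template-EN-DE"
mkdir -p "$template/source" "$template/target" "$template/tm" \
         "$template/glossary" "$template/omegat"
touch "$template/omegat.project"

# Copy the whole template folder under a descriptive name for the new job.
new_project="$base/ACME-2020-11-manual"
cp -R "$template" "$new_project"

ls "$new_project"
```

The settings file is copied unchanged, so any folder outside the project that the template refers to (such as a shared TM repository) is referred to by the copy as well.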

Advanced TM management using symbolic links

Symbolic links are a function provided by your operating system. Linux, Windows and macOS all support symbolic links, though their use is more user-friendly on some operating systems than others.

If you are familiar with symbolic links, you have a powerful tool for managing your TMs. They enable you for example to have separate repositories for each customer, but at the same time to build up subject-based repositories by linking TMs relevant to the subject from the customer repository to the relevant subject repository. The benefit of this kind of management is that you spend less time moving TM files around each time you create a new project, and do not have duplicate copies of TMs taking up space.
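
On Linux, the arrangement described above might look like the following sketch, which links one customer TM into a subject repository instead of copying it. It runs in a scratch directory; the repository layout and file names are hypothetical.

```shell
#!/bin/sh
# Scratch repositories: one per customer, one per subject.
base=$(mktemp -d)
mkdir -p "$base/TM/customers/acme" "$base/TM/subjects/automotive"
touch "$base/TM/customers/acme/acme-brakes-level1.tmx"

# Link (not copy) the customer's TM into the subject repository.
# An absolute target path keeps the link valid wherever it is read from.
ln -s "$base/TM/customers/acme/acme-brakes-level1.tmx" \
      "$base/TM/subjects/automotive/acme-brakes-level1.tmx"

# The listing shows the link and the file it points to.
ls -l "$base/TM/subjects/automotive"
```

Only one copy of the TM exists on disk. If the file in the customer repository is later replaced by a newer version with the same name, the subject repository sees the new version automatically.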

Advanced TM management using commands and scripts

As already mentioned, OmegaT does not have dedicated built-in functions for managing translation memories. It does however have ways by which you can extend its own functionality by means of scripts.

You may not be able to program, but this should not put you off exploiting these very powerful functions of OmegaT. One of these is the external post-processing command. OmegaT executes this command each time you create the translated documents, which is also when it produces the output translation memories.

Here is an example of how this command can be used to manage your translation memories.

First create a repository, e.g. named TM-repository, in your home folder (e.g. at C:\Users\Me\TM-repository on Windows 10 or /home/Me/TM-repository on Linux; insert your own user name instead of Me).

Then hit Options > Preferences > Saving and Output to display the External Post-processing Command field.

In this field, enter the following command:

cmd /c copy '${projectRoot}${projectName}-level1.tmx' '${HOMEPATH}\TM-repository\${projectName}-level1.tmx'

(if you are using Windows)

or

cp '${projectRoot}${projectName}-level1.tmx' '${HOME}/TM-repository'

(if you are using Linux).

Confirm with OK.

Now, whenever you create your target documents, the output TM of the project will be copied automatically to the TM-repository folder. If you select this TM-repository folder as your \tm\ folder in future projects, translation memories in it will then be imported into those projects.
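
If you prefer something a little more defensive than a bare cp, the same copy can be wrapped in a small shell function that creates the repository folder if it is missing and reports a missing output TM. This is only a sketch: the function name copy_output_tm and its arguments are hypothetical, the demonstration runs in a scratch directory, and calling such a script from the External Post-processing Command field with the project variables as arguments is an assumption that is not shown here.

```shell
#!/bin/sh
# copy_output_tm ROOT NAME REPO
# Copy ROOT/NAME-level1.tmx into the folder REPO, creating REPO if needed.
copy_output_tm() {
    root=${1%/}            # drop a trailing slash, if any
    name=$2
    repo=$3
    tm="$root/$name-level1.tmx"
    if [ ! -f "$tm" ]; then
        echo "No output TM found at $tm" >&2
        return 1
    fi
    mkdir -p "$repo" && cp "$tm" "$repo/"
}

# Demonstration in a scratch directory.
base=$(mktemp -d)
mkdir -p "$base/MyProject"
touch "$base/MyProject/MyProject-level1.tmx"
copy_output_tm "$base/MyProject/" "MyProject" "$base/TM-repository"
ls "$base/TM-repository"
```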

You can also use the external post-processing command on individual projects. In this case, access it in the Project Properties dialog after enabling it in Preferences. This is useful if you use the recycled project method with multiple recycled projects, or the per-job project method with template projects (e.g. for different subjects): you can set up each recycled project or project template to save the output TM to a dedicated repository folder, rather than exporting all output TMs to the same folder.

Bear in mind that if you follow the procedure above and the current project uses TM-repository as its translation memory folder, the output TM of the current project will also be present in that folder as soon as you create the translated documents. You will then start seeing matches and search results that appear to be from legacy projects but are in fact from the present project.

With a little more effort, you can refine this process. In particular, if you are willing to get to grips with scripts, the possibilities are endless. For example:

The Auto Save TM script extends the automatic saving of the output TM to a repository as described above. You can define a pattern for your project names and configure the script to save the output TM to a particular repository folder based on the project's name. By default, the script saves the output TM to a particular folder according to the first three characters of its name.
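
The naming idea behind this default can be illustrated with a few lines of shell. This is not the Auto Save TM script itself, only a sketch of how a folder might be derived from the first three characters of a project name; the function name repo_for_project, the project name and the repository path are hypothetical.

```shell
#!/bin/sh
# repo_for_project NAME BASE
# Print the repository folder for a project: BASE/<first three characters of NAME>.
repo_for_project() {
    prefix=$(printf '%s' "$1" | cut -c1-3)
    printf '%s/%s\n' "$2" "$prefix"
}

target=$(repo_for_project "ACM-2020-manual" "/home/Me/TM-repository")
echo "$target"
```

With a naming scheme of this kind, all projects whose names begin with the same three-character customer or subject code end up sharing one repository folder.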

As an alternative to using the external post-processing command, you can also use OmegaT's scripting function to manage your TMs. An example implementation is the Save TM As pair of scripts. When you run the first of the scripts from within OmegaT with Tools > Scripting, the second script presents you with a "Save As" menu with which you can save the output TM to the location and with the name of your choice.