Package translate :: Package storage :: Module wordfast

Module wordfast

Manage the Wordfast Translation Memory format

The Wordfast TM format is the Translation Memory format used by the Wordfast computer-aided translation tool.

The format is derived from the bilingual base class, with WordfastTMFile and WordfastUnit providing file-level and unit-level access.

Wordfast tools

Wordfast is a computer-aided translation tool. It is built on top of Microsoft Word and implemented as a rather sophisticated set of macros. Knowing this helps explain many of the seemingly strange choices in the format, including its encoding, escaping and file naming.

Implementation

The implementation covers the full requirements of a Wordfast TM file. The files are simple Tab Separated Value (TSV) files that can be read by Microsoft Excel and other spreadsheet programs. They use the .txt extension, which makes it more difficult to identify such files automatically.

The dialect of the TSV files is specified by WordfastDialect.
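As an illustration of the TSV layout, the following minimal sketch reads tab-separated translation units with the standard csv module, using the field names listed under WF_FIELDNAMES. The function name read_units and the reader settings (tab delimiter, no quoting) are assumptions made for this example; they are not taken from WordfastDialect itself.

```python
import csv
import io

# Field names for a Wordfast TU, as documented for WF_FIELDNAMES.
WF_FIELDNAMES = ["date", "user", "reuse", "src-lang", "source",
                 "target-lang", "target", "attr1", "attr2", "attr3", "attr4"]


def read_units(text):
    """Parse tab-separated unit lines into dictionaries (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""),
        fieldnames=WF_FIELDNAMES,
        delimiter="\t",          # assumed: plain tab-separated columns
        quoting=csv.QUOTE_NONE,  # assumed: quote characters are ordinary text
    )
    return list(reader)


sample = "20090131~235959\tUSER\t0\tEN-US\tHello\tFR-FR\tBonjour\r\n"
units = read_units(sample)
```

Columns missing from a line, such as the extended attributes in the sample above, are filled with None by csv.DictReader.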

Encoding

The files are UTF-16 or ISO-8859-1 (Latin1) encoded. These choices are most likely because Microsoft Word is the base editing tool for Wordfast.

The format is tab separated, so we are able to distinguish UTF-16 from Latin-1 by searching for the occurrence of a UTF-16 tab character and then continuing with the parsing.
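The detection idea can be sketched as follows. This is only an illustration under stated assumptions: the function name guess_wordfast_encoding is hypothetical, the input is assumed to be the raw bytes of the file, and the byte pair is the documented TAB_UTF16 value written as a bytes literal.

```python
# The tab character as it would appear in UTF-16 (documented as TAB_UTF16).
TAB_UTF16 = b"\x00\x09"


def guess_wordfast_encoding(raw):
    """Return a codec name for raw file bytes (hypothetical helper)."""
    if TAB_UTF16 in raw:
        return "utf-16"
    # No UTF-16 tab found, so fall back to Latin-1.
    return "iso-8859-1"


utf16_bytes = "caf\u00e9\tcoffee".encode("utf-16")
latin1_bytes = "caf\u00e9\tcoffee".encode("iso-8859-1")
```

The returned name can then be passed to bytes.decode() before the tab-separated content is parsed.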

Timestamps

WordfastTime allows for the correct management of the Wordfast YYYYMMDD~hhmmss timestamps. However, the timestamp on an individual unit is not updated when that unit is edited.
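For illustration, the timestamp layout maps directly onto the documented WF_TIMEFORMAT string and the standard time module. The helper names parse_wf_time and format_wf_time are hypothetical and are not part of WordfastTime.

```python
import time

# Time format used by Wordfast, as documented for WF_TIMEFORMAT.
WF_TIMEFORMAT = "%Y%m%d~%H%M%S"


def parse_wf_time(stamp):
    """Convert a Wordfast timestamp string to a time.struct_time."""
    return time.strptime(stamp, WF_TIMEFORMAT)


def format_wf_time(struct):
    """Convert a time.struct_time back to a Wordfast timestamp string."""
    return time.strftime(WF_TIMEFORMAT, struct)


parsed = parse_wf_time("20090131~235959")
```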

Header

WordfastHeader provides header management support. The header functionality is fully implemented through observing the behaviour of the files in real use cases, input from the Wordfast programmers and public documentation.

Escaping

Wordfast TM implements a form of escaping that covers two aspects:

  1. Placeables: bold, formatting, etc. These are left as is and ignored. It is up to the editor and a future placeable implementation to manage these.
  2. Escapes: items that may confuse Excel or translators are escaped as &'XX;. These are fully implemented and are converted to and from Unicode. By observing behaviour and reading documentation we were able to identify all possible escapes. Unfortunately the escaping differs slightly between the Windows and Mac versions. This might cause errors in future.

Functions allow for conversion to Unicode and back to Wordfast escapes.
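The conversion can be sketched with a small subset of the documented WF_ESCAPE_MAP pairs. This is an illustration only, not the module's _char_to_wf or _wf_to_char: the names ESCAPE_SUBSET, escape_for_wordfast and unescape_from_wordfast are hypothetical, and the ordering rule (ampersand handled first when escaping, last when unescaping) is an assumption made so that the sketch does not re-escape its own output.

```python
# A subset of the documented escape pairs: (Wordfast escape, Unicode character).
ESCAPE_SUBSET = (
    ("&'26;", "\u0026"),  # ampersand; kept first so escapes are not re-escaped
    ("&'82;", "\u201A"),
    ("&'85;", "\u2026"),
    ("&'91;", "\u2018"),
    ("&'92;", "\u2019"),
)


def escape_for_wordfast(text):
    """Replace mapped characters with &'XX; escapes (hypothetical helper)."""
    for code, char in ESCAPE_SUBSET:
        text = text.replace(char, code)
    return text


def unescape_from_wordfast(text):
    """Replace &'XX; escapes with characters, handling the ampersand last."""
    for code, char in reversed(ESCAPE_SUBSET):
        text = text.replace(code, char)
    return text


escaped = escape_for_wordfast("Tom & Jerry\u2019s")
```

Because the full map contains more than one escape for some characters (the Windows and Mac variants), a complete implementation has to choose which escape to emit; this sketch sidesteps that by using only unambiguous pairs.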

Extended Attributes

The last 4 columns allow users to define and manage extended attributes. These are left as is and are not directly managed by our implementation.

Classes
  WordfastDialect
Describe the properties of a Wordfast generated TAB-delimited file.
  WordfastTime
Manages timestamps in the Wordfast format YYYYMMDD~hhmmss
  WordfastHeader
A Wordfast translation memory header
  WordfastUnit
A Wordfast translation memory unit
  WordfastTMFile
A Wordfast translation memory file
Functions
 
_char_to_wf(string)
Char -> Wordfast &'XX; escapes
 
_wf_to_char(string)
Wordfast &'XX; escapes -> Char
Variables
  WF_TIMEFORMAT = "%Y%m%d~%H%M%S"
Time format used by Wordfast
  WF_FIELDNAMES_HEADER = ["date", "userlist", "tucount", "src-la...
Field names for the Wordfast header
  WF_FIELDNAMES = ["date", "user", "reuse", "src-lang", "source"...
Field names for a Wordfast TU
  WF_FIELDNAMES_HEADER_DEFAULTS = {"date": "%19000101~121212", "...
Default or minimum header entries for a Wordfast file
  WF_ESCAPE_MAP = ("&'26;", u"\u0026"), ("&'82;", u"\u201A"), ("&'...
Mapping of Wordfast &'XX; escapes to correct Unicode characters
  TAB_UTF16 = "\x00\x09"
The tab \t character as it would appear in UTF-16 encoding

Imports: csv, sys, time, base


Function Details

_char_to_wf(string)

Char -> Wordfast &'XX; escapes

Full round-tripping is not possible because of the escaping of NEWLINE \n and TAB \t.


Variables Details

WF_FIELDNAMES_HEADER

Field names for the Wordfast header

Value:
["date", "userlist", "tucount", "src-lang", "version", "target-lang", \
"license", "attr1list", "attr2list", "attr3list", "attr4list", "attr5l\
ist"]

WF_FIELDNAMES

Field names for a Wordfast TU

Value:
["date", "user", "reuse", "src-lang", "source", "target-lang", "target\
", "attr1", "attr2", "attr3", "attr4"]

WF_FIELDNAMES_HEADER_DEFAULTS

Default or minimum header entries for a Wordfast file

Value:
{"date": "%19000101~121212", "userlist": "%User ID,TT,TT Translate-Too\
lkit", "tucount": "%TU=00000001", "src-lang": "%EN-US", "version": "%W\
ordfast TM v.5.51w9/00", "target-lang": "", "license": "%---00000001",\
 "attr1list": "", "attr2list": "", "attr3list": "", "attr4list": ""}

WF_ESCAPE_MAP

Mapping of Wordfast &'XX; escapes to correct Unicode characters

Value:
("&'26;", u"\u0026"), ("&'82;", u"\u201A"), ("&'85;", u"\u2026"), ("&'91\
;", u"\u2018"), ("&'92;", u"\u2019"), ("&'93;", u"\u201C"), ("&'94;", \
u"\u201D"), ("&'96;", u"\u2013"), ("&'97;", u"\u2014"), ("&'99;", u"\u\
2122"), ("&'A0;", u"\u00A0"), ("&'A9;", u"\u00A9"), ("&'AE;", u"\u00AE\
"), ("&'BC;", u"\u00BC"), ("&'BD;", u"\u00BD"), ("&'BE;", u"\u00BE"), \
("&'A8;", u"\u00AE"), ("&'AA;", u"\u2122"), ("&'C7;", u"\u00AB"), ("&'\
C8;", u"\u00BB"), ("&'C9;", u"\u2026"), ("&'CA;", u"\u00A0"), ("&'D0;"\
, u"\u2013"), ("&'D1;", u"\u2014"), ("&'D2;", u"\u201C"), ("&'D3;", u"\
...