Manage the Wordfast Translation Memory format
Wordfast TM format is the Translation Memory format used by the Wordfast computer aided translation tool.
The format is derived from the bilingual base classes, with WordfastTMFile and WordfastUnit providing file-level and unit-level access.
Wordfast is built on top of Microsoft Word and is implemented as a rather sophisticated set of macros. Knowing this helps explain many of the seemingly strange choices in the format, including encoding, escaping and file naming.
The implementation covers the full requirements of a Wordfast TM file. The files are simple Tab Separated Value (TSV) files that can be read by Microsoft Excel and other spreadsheet programs. They use the .txt extension, which makes it more difficult to identify such files automatically.
The dialect of the TSV files is specified by WordfastDialect.
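As a rough illustration of what such a dialect could look like, the sketch below defines a tab-delimited csv dialect with quoting disabled. The class name SketchTabDialect, its attribute values and the sample row are assumptions made for this example; they are not taken from WordfastDialect itself.

```python
import csv
import io


class SketchTabDialect(csv.Dialect):
    """Hypothetical stand-in for a Wordfast-style dialect (values are assumptions)."""
    delimiter = "\t"           # fields are separated by tabs
    lineterminator = "\r\n"    # assumed Windows line endings
    quoting = csv.QUOTE_NONE   # quote characters are treated as ordinary text
    quotechar = '"'            # ignored while quoting is QUOTE_NONE
    doublequote = False
    skipinitialspace = False
    escapechar = None


def read_rows(text):
    """Parse tab-separated text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect=SketchTabDialect))


# Illustrative row only; the real column layout is given by WF_FIELDNAMES.
sample = '20091015~135351\tUSER\t0\tEN-US\tSay "hello"\tFR-FR\tDites "bonjour"\r\n'
rows = read_rows(sample)
```

Disabling quoting matters here because translation text often contains literal quote characters, which a default spreadsheet-style dialect would try to interpret.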
The files are UTF-16 or ISO-8859-1 (Latin1) encoded. These choices are most likely because Microsoft Word is the base editing tool for Wordfast.
The format is tab separated, so we can detect UTF-16 versus Latin-1 by searching for the occurrence of a UTF-16 tab character and then continuing with the parsing.
WordfastTime allows for the correct management of the Wordfast YYYYMMDD~hhmmss timestamps. However, the timestamps on individual units are not updated when the units are edited.
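The detection idea can be sketched in a few lines. This is a minimal illustration written for Python 3 bytes, not the module's own code: the function names detect_encoding and decode_tm are hypothetical, and only the TAB_UTF16 byte pair comes from this page.

```python
TAB_UTF16 = b"\x00\x09"  # tab as it appears in UTF-16 data (bytes form of the module constant)


def detect_encoding(raw):
    """Guess the encoding of raw TM bytes (hypothetical helper).

    In UTF-16 text a tab after a Latin-range character yields the byte
    pair 00 09 in either byte order; Latin-1 text has no NUL bytes.
    """
    if TAB_UTF16 in raw:
        return "utf-16"
    return "iso-8859-1"


def decode_tm(raw):
    """Decode raw bytes using the guessed encoding."""
    return raw.decode(detect_encoding(raw))
```

The "utf-16" codec relies on the byte order mark at the start of the data, so this sketch assumes the file carries one.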
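For illustration, the timestamp format can be parsed and produced with the standard time module and the WF_TIMEFORMAT string listed on this page. The helper names parse_wf_time and format_wf_time are hypothetical and are not part of WordfastTime.

```python
import time

WF_TIMEFORMAT = "%Y%m%d~%H%M%S"  # format string as listed for this module


def parse_wf_time(stamp):
    """Turn a Wordfast timestamp string into a time.struct_time (hypothetical helper)."""
    return time.strptime(stamp, WF_TIMEFORMAT)


def format_wf_time(parsed):
    """Turn a time.struct_time back into a Wordfast timestamp string (hypothetical helper)."""
    return time.strftime(WF_TIMEFORMAT, parsed)


parsed = parse_wf_time("20091015~135351")
restored = format_wf_time(parsed)
```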
WordfastHeader provides header management support. The header functionality is fully implemented through observing the behaviour of the files in real use cases, input from the Wordfast programmers and public documentation.
Wordfast TM implements a form of escaping that covers two aspects:
Functions allow for conversion to Unicode and back to Wordfast escapes.
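A minimal sketch of such a conversion is shown below. It uses only the two escape pairs visible in the WF_ESCAPE_MAP listing on this page; the names ESCAPE_MAP_SAMPLE, char_to_wf and wf_to_char are chosen for this example and the code is not the module's own implementation.

```python
# Only the two pairs visible on this page; the real map has more entries.
ESCAPE_MAP_SAMPLE = (
    ("&'26;", "\u0026"),  # ampersand
    ("&'82;", "\u201A"),  # single low-9 quotation mark
)


def char_to_wf(text):
    """Replace characters with Wordfast &'XX; escapes (sketch).

    The ampersand is listed first so that the '&' introduced by later
    escapes is not escaped a second time.
    """
    for escape, char in ESCAPE_MAP_SAMPLE:
        text = text.replace(char, escape)
    return text


def wf_to_char(text):
    """Replace Wordfast &'XX; escapes with characters (sketch).

    The map is walked in reverse so the ampersand is restored last and
    cannot combine with following text into a new escape.
    """
    for escape, char in reversed(ESCAPE_MAP_SAMPLE):
        text = text.replace(escape, char)
    return text
```

The replacement order is a design choice of this sketch: escaping the ampersand first and unescaping it last keeps text that merely looks like an escape intact on the way back.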
The last 4 columns allow users to define and manage extended attributes. These are left as is and are not directly managed by our implementation.
WordfastDialect
    Describe the properties of a Wordfast generated TAB-delimited file.
WordfastTime
    Manages time stamps in the Wordfast format of YYYYMMDD~hhmmss
WordfastHeader
    A wordfast translation memory header
WordfastUnit
    A Wordfast translation memory unit
WordfastTMFile
    A Wordfast translation memory file
WF_TIMEFORMAT = "%Y%m%d~%H%M%S"
    Time format used by Wordfast
WF_FIELDNAMES_HEADER = ["date", "userlist", "tucount", "src-la...
    Field names for the Wordfast header
WF_FIELDNAMES = ["date", "user", "reuse", "src-lang", "source"...
    Field names for a Wordfast TU
WF_FIELDNAMES_HEADER_DEFAULTS = {"date": "%19000101~121212", "...
    Default or minimum header entries for a Wordfast file
WF_ESCAPE_MAP = (("&'26;", u"\u0026"), ("&'82;", u"\u201A"), ("&'...
    Mapping of Wordfast &'XX; escapes to correct Unicode characters
TAB_UTF16 = "\x00\x09"
    The tab \t character as it would appear in UTF-16 encoding
Imports: csv, sys, time, base
Char -> Wordfast &'XX; escapes. Full roundtripping is not possible because of the escaping of NEWLINE \n and TAB \t.
Generated by Epydoc 3.0.1 on Thu Oct 15 13:53:51 2009 (http://epydoc.sourceforge.net)