You may find it odd that the user manual contains a section on programming, as that is normally the subject of a separate, programmer-oriented document. On the other hand, while reading the "pure user" sections of this manual, you may have noticed that in Pology the distinction between a user and a programmer is blurrier than one would expect of a translation-related tool. Indeed, before getting into writing standalone Python programs which use the Pology library, there are many places in Pology itself where you can plug in some Python code to adapt the behavior to your language and translation environment. This section exists to support and stimulate such interaction with Pology.
The Pology library is quite simple conceptually and organizationally. It consists of a small core abstraction of the PO format, and a lot of mutually unrelated functionality that may come in handy in particular translation processing scenarios. Everything is covered by the Pology API documentation, but since API documentation tends to be non-linear and full of details obstructing the bigger picture, the following subsections are there to provide synthesis and rationale of salient points.
The PO format abstraction in Pology is a quite direct and fine-grained reflection of PO format elements and conventions. This was a design goal from the start; no attempt was made at a more general abstraction, one which might support various translation file formats.
There is, however, one glaring but intentional omission: multi-domain PO files (those which contain domain "..." directives) are not supported. We have never observed a multi-domain PO file in the wild, nor thought of a significant advantage it could have today over multiple single-domain PO files. Supporting multi-domain PO files would not only require two nested loops to iterate through messages in a PO file, but it would also interfere with higher levels in Pology which assume equivalence between PO files and domains. Pology will simply report an error when trying to read a multi-domain PO file.
Because the PO abstraction is intended to be robust against programming errors when quickly writing custom scripts, and frugal on file modifications, by default some of the abstracted objects are "monitored". This means that they are checked for expected data types and have modification counters. The main monitored objects are PO files, PO headers, and PO messages, but also their attributes which are not plain data types (strings or numbers). For the moment, these secondary monitored types include Monlist (the monitored counterpart to the built-in list), Monset (counterpart to set), and Monpair (like a two-element tuple). Monitored types do not in general provide the full scope of functionality of their built-in counterparts, so sometimes it may be easier (and faster) to work with built-in types and convert them to monitored at the moment of adding to PO objects.
To take a Monlist instance as an example, here is how it behaves on its own:
>>> from pology.monitored import Monlist
>>> l = Monlist([u"a", u"b", u"c"])
>>> l.modcount
0
>>> l.append(10)
>>> l
Monlist([u"a", u"b", u"c", 10])
>>> l.modcount
1
>>>
Appending an element has caused the modification counter to increase, but, as expected, it was possible to add an integer in spite of previous elements being strings. However, if the monitored list comes from a PO message:
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgstr
Monlist([])
>>> msg.msgstr.append(10)
Traceback (most recent call last):
...
pology.PologyError: Expected <type 'unicode'> for sequence element, got <type 'int'>.
>>> msg.msgstr.append(u"bar")
>>> msg.msgstr.modcount
1
>>> msg.modcount
1
The Message class has type constraints added to its attributes, and therefore the addition of an integer to the .msgstr list was rejected: only unicode values are allowed. This automatically prevents carelessness with encodings, which is particularly important because the basic string type in Python is the raw byte array str[51]. Once a proper string was added to the .msgstr list, its modification counter increased, and so did the modification counter of the parent object.
A few more notes on modification counters. Consider this example:
>>> msg = Message()
>>> msg.msgstr = Monlist(u"foo")
>>> msg.msgstr.modcount
0
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
>>> msg.msgstr[0] = u"foo"
>>> msg.msgstr.modcount
0
>>> msg.msgstr = Monlist(u"foo")
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
Monlist(u"foo")
itself is a fresh list with modification counter at 0, so after it was assigned to msg.msgstr
, its modification counter is still 0. However, every attribute of a parent monitored object also has the associated attribute modification counter, denoted with trailing _modcount
; therefore msg.msgstr_modcount
did increase on assignment, and so did the parent msg.modcount
. Modification tracking actually checks for equality of values, so when same-valued objects are repeadetly assigned (starting from msg.msgstr[0] = u"foo"
above), modification counters do not increase.
Compound monitored objects may also have the attributes themselves constrained, to prevent typos and other brain glitches from causing mysterious wrong behavior when processing PO files. For example:
>>> msg = Message()
>>> msg.msgtsr = Monlist(u"foo")
Traceback (most recent call last):
...
pology.PologyError: Attribute 'msgtsr' is not among specified.
>>>
You may conclude that modification tracking and type and attribute constraining would slow down processing, and you would be right. Since PO messages are by far the most processed objects, a non-monitored counterpart to Message is provided as well, for occasions where the code is only reading PO files, or has been sufficiently tested, and speed is of importance. See Section 11.1.2, “Message” for details.
PO messages are by default represented with the Message class. It is monitored for modifications, and constrained on attributes and attribute types. It provides direct attribute access to parts of a PO message:
>>> from pology.monitored import Monpair
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgid = u"Foo %s"
>>> msg.msgstr.append(u"Bar %s")
>>> msg.flag.add(u"c-format")
>>> msg.fuzzy = True
>>> print msg.to_string(),
#, fuzzy, c-format
msgid "Foo %s"
msgstr "Bar %s"
>>>
Attribute access provides the least hassle, while being guarded by monitoring, and makes clear the semantics of particular message parts. For example, the .flag attribute is a set, to indicate that the order of flags should be of no importance to either a human translator or a PO processor, and the .msgstr attribute is always a list, in order to prevent the programmer from overlooking plural messages. While the fuzzy state is formally indicated by a flag, it is considered special enough to have a separate attribute.
Some message parts may or may not be present in a message; when they are not present, the corresponding attributes are either empty if sequences (e.g. the .manual_comment list for translator comments), or set to None if strings[52] (e.g. .msgctxt).
There are also several derived, read-only attributes for special purposes. For example, if in some context the messages are to be tracked in a dictionary by their keys, there is the .key attribute available, which is an undefined but unique combination of the .msgctxt and .msgid attributes. Or, there is the .active attribute, which is True if the message is neither fuzzy nor obsolete, i.e. its translation (if there is one) would be used by the consumer of the PO file that the message is part of.
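For illustration, a minimal sketch (assuming cat is an already opened catalog, as described later) which indexes the messages by key and counts the active ones:

# Index messages by their unique key and count active messages.
by_key = {}
nactive = 0
for msg in cat:
    by_key[msg.key] = msg
    if msg.active:
        nactive += 1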
Message has a number of methods for frequent operations that need to read or modify more than one attribute. For example, to thoroughly unfuzzy a message, it is not sufficient to just remove its fuzzy flag (by setting .fuzzy to False or removing u"fuzzy" from the .flag set); previous field comments (#| ...) should be removed as well, and this is what the .unfuzzy() method does:
>>> print msg.to_string(),
#| msgid "Foubar"
#, fuzzy
msgid "Foobar"
msgstr "Fubar"
>>> msg.unfuzzy()
>>> print msg.to_string(),
msgid "Foobar"
msgstr "Fubar"
Other methods include those to copy over a subset of parts from another message, to revert the message to pristine untranslated state, and so on.
There exists a non-monitored counterpart to Message, the MessageUnsafe class. Its attributes are of built-in types (e.g. .msgstr is a plain list), and there is no type nor attribute checking. By using MessageUnsafe, a speedup of 50% to 100% has been observed in practical applications, so it makes for a good trade-off when you know what you are doing (e.g. you are certain that no modifications will be made). A PO file is opened with non-monitored messages by passing the monitored=False argument to the Catalog constructor.
Read-only code should work with Message and MessageUnsafe objects without any type-based specialization. Code that writes may need some care to achieve the same, for example:
def translate_moo_as_mu (msg):

    if msg.msgid == u"Moo!":      # works for both
        msg.msgstr = [u"Mu!"]     # raises exception if Message
        msg.msgstr[:] = [u"Mu!"]  # works for both
        msg.msgstr[0] = u"Mu!"    # works for both (when not empty)
If you need to create an empty message of the same type as another message, or make a same-type copy of the message, you can use the type built-in:
newmsg1 = type(msg)()     # create empty
newmsg2 = type(msg)(msg)  # copy
Message and MessageUnsafe share the virtual base class Message_base, so you can use isinstance(obj, Message_base) to check if an object is a PO message of either type.
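For instance, a minimal sketch of a type-agnostic helper; it assumes Message_base is importable from pology.message alongside Message and MessageUnsafe:

from pology.message import Message_base

def message_key (obj):

    # Works for Message and MessageUnsafe alike.
    if not isinstance(obj, Message_base):
        raise TypeError("not a PO message")
    return (obj.msgctxt, obj.msgid)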
The PO header could be treated as just another message, but that would be both inconvenient for operating on it and disruptive in iteration over a catalog. Instead, the Header class is introduced. Similarly to Message, it provides both direct attribute access to parts of the header (like the .field list of name-value pairs), and methods for usual manipulations which would otherwise need a sequence of basic data manipulations (like .set_field() to either modify an existing or add a new header field with the given value).
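A small sketch of header manipulation; the cat.header attribute and the field name and value used here are assumptions for illustration only:

# Set (or add) a header field on the header of an opened catalog.
hdr = cat.header
hdr.set_field(u"X-Generator", u"my-po-script 0.1")  # illustrative field/value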
In particular, header comments are represented by a number of attributes (.title, .author, etc.), some of which are strings and some lists, depending on semantics. Unfortunately, the PO format does not define this separation formally, so when the PO file is parsed, comments are split heuristically (.title will be the first comment line, .author will get every line which looks like it has an email address and a year in it, etc.)
Header is a monitored class just like Message, but unlike Message it has no non-monitored counterpart. This is because in practice header operations make up a small part of total processing, so there is no real advantage in having non-monitored headers.
PO files are read and written through Catalog objects. A small script to open a PO file on disk (given as the first argument), find all messages that contain a certain substring in the original text (given as the second argument), and write those messages to standard output, would look like this:
import sys

from pology.catalog import Catalog
from pology.msgreport import report_msg_content

popath = sys.argv[1]
substr = sys.argv[2]

cat = Catalog(popath)
for msg in cat:
    if substr in msg.msgid:
        report_msg_content(msg, cat)
Note the minimalistic code, both by raw length and access interface. Instead of using something like print msg.to_string() to output the message, already in this example we introduce the msgreport module, which contains various functions for reporting on PO messages;[53] report_msg_content() will first output the PO file name and the location of the message (line and entry number) within the file, and then the message content itself, with some highlighting (for field keywords, fuzzy state, etc.) if the output destination permits it. Since no modifications are done to messages, this example would be just as safe but run significantly faster if the PO file were opened in non-monitored mode. This is done by adding the monitored=False argument to the Catalog constructor:
cat = Catalog(popath, monitored=False)
and no other modification is required.
When some messages are modified in a catalog created by opening a PO file on disk, the modifications will not be written back to disk until the .sync() method is called -- not even if the program exits. If the catalog is monitored and there were no modifications to it up to the moment .sync() is called, the file on disk will not be touched, and .sync() will return False (it returns True if the file is written).[54] In a scenario where a bunch of PO files are processed, this allows you to report only those which were actually modified. Take as an example a simplistic[55] script to search and replace in translation:
import sys

from pology.catalog import Catalog
from pology.fsops import collect_catalogs
from pology.report import report

searchstr = sys.argv[1]
replacestr = sys.argv[2]
popaths = sys.argv[3:]

popaths = collect_catalogs(popaths)
for popath in popaths:
    cat = Catalog(popath)
    for msg in cat:
        for i, text in enumerate(msg.msgstr):
            msg.msgstr[i] = text.replace(searchstr, replacestr)
    if cat.sync():
        report("%s (%d)" % (cat.filename, cat.modcount))
This script takes the search and replace strings as the first two arguments, followed by any number of PO paths. The paths do not have to be only file paths, but can also be directory paths, in which case the collect_catalogs() function from the fsops module will recursively collect any PO files in them. After the search and replace iteration through a catalog is done (msgstr being properly handled on plain and plural messages alike), its .sync() method is called, and if it reports that the file was modified, the file's path and the number of modified texts are output. The latter is obtained simply as the modification counter state of the catalog, since it was bumped up by one on each text that actually got modified. Note the use of the .filename attribute for illustration, although in this particular case we had the path available in the popath variable.
Syncing to disk is an atomic operation. This means that if you or something else aborts the program in the middle of execution, none of the processed PO files will become corrupted; they will either be in their original state, or in the expected modified state.
As can be seen, at its base the Catalog class is an iterable container of messages. However, the precise nature of this container is less obvious. To the consumer (a program or converter) the PO file is a dictionary of messages by keys (msgctxt and msgid fields); there can be no two messages with the same key, and the order of messages is of no importance. For the human translator, however, the order of messages in the PO file is of great importance, because it is one of the context indicators. Message keys are parts of the messages themselves, which means that a message is both its own dictionary key and the value. Taking these constraints together, in Pology the PO file is treated as an ordered set, and the Catalog class interface is made to reflect this.
The ordered set nature of catalogs comes into play when the composition of messages, rather than just the messages themselves, is modified. For example, to remove all obsolete messages from the catalog, the .remove() method could be used:
for msg in list(cat):
    if msg.obsolete:
        cat.remove(msg)
cat.sync()
Note that the message sequence was first copied into a list, since the removal would otherwise clobber the iteration. Unfortunately, this code will be very slow (linear time w.r.t. catalog size), since when a message is removed, internal indexing has to be updated to maintain both the order and quick lookups. Instead, the better way to remove messages is the .remove_on_sync() method, which marks the message for removal on syncing. This runs fast (constant time w.r.t. catalog size) and requires no copying into a list prior to iteration:
for msg in cat:
    if msg.obsolete:
        cat.remove_on_sync(msg)
cat.sync()
A message is added to the catalog using the .add() method. If .add() is given only the message itself, it will overwrite the message with the same key if one exists, or else insert it according to source references, or append it to the end. If .add() is also given the insertion position, it will insert the message at that position only if a message with the same key does not exist in the catalog; if it does, it will ignore the given position and overwrite the existing message. When the message is inserted, .add() suffers the same performance problem as .remove(): it runs in linear time. However, the common case when an empty catalog is created and messages are added one by one to the end can run in constant time, and this is what the .add_last() method does.[56]
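For example, a minimal sketch (assuming newcat is an already created, initially empty catalog) which copies all non-obsolete messages from cat to its end:

for msg in cat:
    if not msg.obsolete:
        newcat.add_last(type(msg)(msg))  # append a same-type copy to the end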
The basic way to check if a message with the same key exists in the catalog is to use the in operator. Since the catalog is ordered, if the position of the message is wanted, the .find() method can be used instead. Both of these methods are fast, running in constant time. There is a series of .select_*() methods for looking up messages by criteria other than the key; they run in linear time, and return lists of messages since the result may no longer be unique.
Since it is ordered, the catalog can be indexed, either by position or by a message (whose key is used for the lookup). To replace a message in the catalog with a message which has the same key but is otherwise different, you can either first fetch its position and then use it as the index, or use the message itself as the index:
# Indexing by position.
pos = cat.find(msg)
cat[pos] = msg

# Indexing by message key.
cat[msg] = msg
This leads to the following question: what happens if you modify the key of a message (its .msgctxt or .msgid attributes) while it is in the catalog? In that case the internal index goes out of sync, rather than being automatically updated; this is a necessary performance measure. If you need to change message keys, while doing that you should treat the catalog as a pure list, using only iteration and positional indexing. Afterwards you should either call .sync() if you are done with the catalog, or .sync_map() to only update the indexing (and remove messages marked with .remove_on_sync()) without writing out the PO file.
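A small sketch of this pattern, where a hypothetical disambiguating context is added to every message and the internal index is then refreshed without touching the file on disk:

for msg in cat:
    msg.msgctxt = u"disambiguation"  # key change: internal index is now stale
cat.sync_map()                       # rebuild the index, do not write the file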
The Catalog class provides a number of convenience methods which report things about the catalog based on the header information, rather than requiring manual examination of the header. These include the number of plural forms, the msgstr index for a given plural number, as well as information important in some Pology contexts, like the language code, accelerator markers, markup types, etc. Each of these methods has a counterpart which sets the appropriate value, but this value is not written to disk when the catalog is synced. This is because there are frequently several ways in which the value can be determined from the header, so it would be ambiguous how to write it out. Instead, these methods are used to set or override values provided by the catalog (e.g. based on command line options) for the duration of processing only.
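As a rough sketch only -- the method names used here (plural_index, nplurals) are assumptions for illustration, so check the Catalog API documentation for the actual ones -- selecting the msgstr form for a given plural number could look like this:

num = 3                        # some number governing the plural choice
form = cat.plural_index(num)   # assumed name: msgstr index for 'num'
if form < cat.nplurals():      # assumed name: number of plural forms
    text = msg.msgstr[form]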
To create an empty catalog if it does not exist on disk, the create=True argument can be added to the constructor. If the catalog does exist, it will be opened as usual; if it did not exist, the new PO file will be written to disk on sync. To unconditionally create an empty catalog, whether the PO file exists or not at the given path, the truncate=True parameter should be added as well. In this case, if the PO file did exist, it will be overwritten with the new content only when the catalog is synced. The catalog can also be created with an empty string for the path, in which case it is guaranteed to be empty even without setting truncate=True. If a catalog with an empty path should later be synced (as opposed to being transient during processing), its .filename attribute can simply be assigned a valid path before calling .sync().
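Putting these options together, a minimal sketch (the output path is illustrative) of writing a freshly built catalog to disk:

# Start from an empty catalog regardless of what is on disk.
newcat = Catalog("new.po", create=True, truncate=True)
# ... add messages with newcat.add_last(...), set header fields ...
newcat.sync()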
In summary, it can be said that the Catalog class is biased, in terms of performance and ease of use, towards processing existing PO files rather than creating PO files from scratch, and towards processing existing messages in the PO file rather than shuffling them around.
This section describes the style and conventions that the code which is intended to be included in Pology distribution should adhere to. The general coding style is expected to follow the Python style guide described in PEP 8.
Lines should be up to 80 characters long. Class names should be written in camel case, and all other names in lower case with underscores:
class SomeThingy (object):
    ...
    def some_method (self, ...):
        ...

longer_variable = ...

def some_function (...):
    ...
Long expressions with operators should be wrapped in parentheses and broken before the binary operator, with the continuation line indented to the level of the other operand:
some_quantity = (  a_number_of_thingies * quantity_of_that_per_unit
                 + the_base_offset)
In particular, long conditions in if and while statements should be written like this:
if (    something and something_else
    and yet_something and somewhere_in_between
    and who_knows_what_else
):
    do_something_appropriate()
All messages, warnings, and errors should be issued through the report and msgreport modules. There should be no print statements or raw writes to sys.stdout/sys.stderr.
For the code in the Pology library, it is always preferable to raise an exception instead of aborting execution. On the other hand, it is fine to add optional parameters by which the client can select whether the function should abort rather than raise an exception. All topical problems should raise pology.PologyError or a subclass of it, and built-in exceptions should be used only for simple general problems (e.g. IndexError for indexing past the end of something).
All user-visible text, be it reports, warnings, or errors (including exception messages), should be wrapped for internationalization through Gettext. The top-level pology module provides several wrappers for Gettext functions, which have the following special traits: context is mandatory on every wrapped text, all format directives must be named, and arguments are specified as keyword-value pairs just after the text argument (unless deferred translation is used). Some examples:
# Simple message with context marker.
_("@info",
  "Trying to sync unnamed catalog.")

# Simple message with extended context.
_("@info command description",
  "Keep track of who, when, and how, has translated, modified, "
  "or reviewed messages in a collection of PO files.")

# Another context marker and extended context.
_("@title:column words per message in original",
  "w/msg-or")

# Parameter substitution.
_("@info",
  "Review tag '%(tag)s' not defined in '%(file)s'.",
  tag=rev_tag, file=config_path)

# Plural message.
n_("@item:inlist",
   "written %(num)d word",
   "written %(num)d words",
   num=nwords)

# Deferred translation, when arguments are known later.
tmsg = t_("@info:progress",
          "Examining state: %(file)s")
...
msg = tmsg.with_args(file=some_path).to_string()
Every context starts with the "context marker" in the form of @keyword, drawn from a predefined set (see the article on i18n semantics at KDE Techbase); it is most often @info in Pology code. The context marker may be, and should be, followed by a free-form extended context whenever it can help the translator to understand how and where the message is used. It is usual to have the context, text, and arguments in different lines, though not necessary if they are short enough to fit one line.
Pology defines a lightweight XML markup for coloring text in the colors module. In fact, the Gettext wrappers do not return ordinary strings, but ColorString objects, and functions from the report and msgreport modules know how to convert them to raw strings for a given output destination (file, terminal, web page...). Therefore you can use colors in any wrapped string:
_("@info:progress", "<green>History follows:</green>") _("@info", "<bold>Context:</bold> %(snippet)s", snippet=some_text)
Coloring should be used sparingly, only when it will help to cue the user's eyes to significant elements of the output.
There are two consequences of having text markup available throughout. The first is that every message must be well-formed XML, which means that it must contain no unbalanced tags, and that literal < characters must be escaped (and then also > for good style):
_("@item automatic name for anonymous input stream", "<stream-%(num)s>", num=strno)
The other consequence is that ColorString instances must be joined and interpolated with dedicated functions; see the cjoin() and cinterp() functions in the colors module.
Unless the text of the message is specifically intended to be a title or an insert (i.e. @title or @item context markers), it should be a proper sentence, starting with a capital letter and ending with a dot.
Pology sieves are filtering-like processing elements applied by the posieve script to collections of PO files. A sieve can examine as well as modify the PO entries passed through it. Each sieve is written in a separate file. If the sieve file is put into the sieve/ directory of the Pology distribution (or installation), the sieve can be referenced on the posieve command line by shorthand notation; otherwise the path to the sieve file is given. The former is called an internal sieve, and the latter an external sieve, but the sieve file layout and the sieve definition are the same in both cases.
In the following, posieve will be referred to as "the client". This is because tools other than posieve may start to use sieves in the future, and it will also be described what these clients should adhere to when using sieves.
The sieve file must define the Sieve class, with some mandatory and some optional interface methods and instance variables. There are no restrictions on what you can put into the sieve file besides this class; only keep in mind that posieve will load the sieve file as a Python module, exactly once during a single run.
Here is a simple sieve (also the complete sieve file) which just counts the number of translated messages:
from pology.report import report

class Sieve (object):

    def __init__ (self, params):

        self.ntranslated = 0

    def process (self, msg, cat):

        if msg.translated:
            self.ntranslated += 1

    def finalize (self):

        report("Total translated: %d" % self.ntranslated)
The constructor takes as argument an object specifying any sieve parameters (more on that soon). The process method is called for each message in each PO file processed by the client, and must take as parameters the message (an instance of Message_base) and the catalog which contains it (Catalog). The client calls the finalize method after no more messages will be fed to the sieve, but this method does not need to be defined (the client should check if it exists before placing the call).
Another optional method is process_header, which the client calls on the PO header:
def process_header (self, hdr, cat):
    # ...
hdr is an instance of Header, and cat is the containing catalog. The client will check for the presence of this method, and if it is defined, it will call it prior to any process call on the messages from the given catalog. In other words, the client is not allowed to switch catalogs between two calls to process without calling process_header in between.
There is also the optional process_header_last method, for which everything holds just like for process_header, except that, when present, the client must call it after all consecutive process calls on messages from the same catalog:
def process_header_last (self, hdr, cat):
    # ...
Sieve methods should not abort program execution in case of errors; instead they should throw an exception. In particular, if the process method throws SieveMessageError, it means that the sieve can still process other messages in the same catalog; if it throws SieveCatalogError, then any following messages from the same catalog must be skipped, but other catalogs may be processed. Similarly, if process_header throws SieveCatalogError, other catalogs may still be processed. Any other type of exception tells the client that the sieve should no longer be used.
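A small sketch of this convention, assuming (as in the sieves shipped with Pology) that these exception classes live in the pology.sieve module:

from pology.sieve import SieveMessageError

class Sieve (object):

    def __init__ (self, params):
        pass

    def process (self, msg, cat):

        # Skip this message, but let the client continue with the catalog.
        if len(msg.msgstr) == 0:
            raise SieveMessageError("message has no msgstr fields")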
The process and process_header methods should either return None or an integer exit code. A return value which is neither None nor 0 indicates that while the evaluation was successful (no exception was thrown), the processed entry (message or header) should not be passed further along the sieve chain.
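For example, a sketch of a process method which withholds obsolete messages from the rest of the sieve chain:

def process (self, msg, cat):

    # Do not pass obsolete messages further along the sieve chain.
    if msg.obsolete:
        return 1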
The params parameter of the sieve constructor is an object with data attributes as parameters which may influence the sieve operation. The sieve file can define the setup_sieve function, which the client will call with a SubcmdView object as the single argument, to fill in the sieve description and define all mandatory and optional parameters. For example, if the sieve takes an optional parameter named checklevel, which controls the level (an integer) at which to perform some checks, here is how setup_sieve could look:
def setup_sieve (p):

    p.set_desc("An example sieve.")
    p.add_param("checklevel", int, defval=0,
                desc="Validity checking level.")


class Sieve (object):

    def __init__ (self, params):

        if params.checklevel >= 1:
            # ...setup some level 1 validity checks...
        if params.checklevel >= 2:
            # ...setup some level 2 validity checks...
        #...

    ...
See the add_param method for details on defining sieve parameters.
The client is not obliged to call setup_sieve, but it must make sure that the object it sends to the sieve as params has all the instance variables corresponding to the defined parameters.
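A rough sketch of what a minimal client might do instead of calling setup_sieve; the plain attribute object here is purely illustrative:

class _Params (object):
    pass

params = _Params()
params.checklevel = 0   # one attribute per defined sieve parameter

sieve = Sieve(params)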
There are two boolean instance variables that the sieve may define, and which the client may check for to decide on the regime in which the catalogs are opened and closed:
class Sieve (object):

    def __init__ (self, params):

        # These are the defaults:
        self.caller_sync = True
        self.caller_monitored = True

        ...
The variables are:
caller_sync instructs the client whether the catalogs processed by the sieve should be synced to disk at the end. If the sieve does not define this variable, the client should assume True and sync the catalogs. This variable is typically set to False in sieves which do not modify anything, because syncing catalogs takes time.
caller_monitored tells the client whether it should open catalogs in monitored mode. If this variable is not set, the client should assume it to be True. This is another way of reducing processing time for sieves which do not modify PO entries.
Usually a modifying sieve will set neither of these variables, i.e. catalogs will be monitored and synced by default, while a checker sieve will set both to False. For a modifying sieve that unconditionally modifies all entries sent to it, only caller_monitored may be set to False, and caller_sync left undefined (i.e. True).
If a sieve requests no monitoring or no syncing, the client is not obliged to satisfy these requests. On the other hand, if a sieve does request monitoring or syncing (either explicitly or by not defining the corresponding variables), the client must provide catalogs in that regime. This is because there may be several sieves operating at the same time (a sieve chain), and monitoring and syncing is usually necessary for proper operation of those sieves that request it.
Since monitored catalogs have modification counters, the sieve may use them within its process* methods to find out if any modification really took place. The proper way to do this is to record the counter at the start, and check for an increase at the end:
def process (self, msg, cat):

    startcount = msg.modcount

    # ...
    # ... do some stuff
    # ...

    if msg.modcount > startcount:
        self.nmodified += 1
The wrong way to do it would be to merely check if msg.modcount > 0, because several modifying sieves may be operating at the same time, each increasing the counters.
If the sieve wants to remove a message from the catalog, if at all possible it should use the catalog's remove_on_sync instead of the remove method, to defer the actual removal to sync time. This is because remove will probably ruin the client's iteration over the catalog, so if it must be used, the sieve documentation should state this clearly. remove also has linear execution time, while remove_on_sync has constant.
If the sieve is to become part of the Pology distribution, it should be properly documented. This means a fully equipped setup_sieve function in the sieve file, and a piece of user manual documentation. The Sieve class itself should not be documented in general. Only when process* methods return an exit code should this be stated in their own comments (and in the user manual).
Hooks are functions with specified sets of input parameters, return values, processing intent, and behavioral constraints. They can be used as modification and validation plugins in many processing contexts in Pology. There are three broad categories of hooks: filtering, validation and side-effect hooks.
Filtering hooks modify some of their inputs. Modifications are done in-place whenever the input is mutable (like a PO message), otherwise the modified input is provided in a return value (like a PO message text field).
Validation hooks perform certain checks on their inputs, and return a list of annotated spans or annotated parts, which record all the encountered errors:
Annotated spans are reported when the object of validation is a piece of text. Each span is a tuple of start and end index of the problematic segment in the text, and a note which explains the problem. The return value of a text-validation hook will thus be a list:
[(start1, end1, "note1"), (start2, end2, "note2"), ...]
The note can also be None, if there is nothing to say about the problem.
Annotated parts are reported for an object which has more than one distinct piece of text, such as a PO message. Each annotated part is a tuple stating the name of the problematic part of the object (e.g. "msgid", "msgstr"), the item index for array-like parts (e.g. for msgstr), and the list of problems in the appropriate form (for a PO message this is a list of annotated spans). The return value of a PO message-validation hook will look like this:
[("part1", item1, [(start11, end11, "note11"), ...]), ("part2", item2, [(start21, end21, "note21"), ...]), ...]
Side-effect hooks neither modify their inputs nor report validation information, but can be used for any purpose independent of the processing chain into which the hook is inserted. For example, a validation hook can be implemented like this as well, when it is enough that it reports problems to standard output, or when the hook client does not know how to use structured validation data (annotated spans or parts). The return value of a side-effect hook is the number of errors encountered internally by the hook (an integer). Clients may use this number to decide upon further behavior. For example, if a side-effect hook modified a temporary copy of a file, the client may decide to abandon the result and use the original file if there were some errors.
In this section a number of hook types are described and assigned a formal type keyword, so that they can be conveniently referred to elsewhere in Pology documentation.
Each type keyword has the form <letter1><number><letter2>, e.g. F1A. The first letter represents the hook category: F for filtering hooks, V for validation hooks, and S for side-effect hooks. The number enumerates the input signature by parameter types, and the final letter denotes the difference in semantics of input parameters for equal input signatures. As a handy mnemonic, each type is also given an informal signature in the form of (param1, param2, ...) -> result; in these signatures, spans stands for annotated spans, parts for annotated parts, and numerr for the number of errors.
Hooks on pure text:
F1A ((text) -> text): filters the text

V1A ((text) -> spans): validates the text

S1A ((text) -> numerr): side-effects on text
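For illustration, a minimal sketch of the two simplest types above (the function names are invented for this example, not part of Pology):

def filter_nbsp (text):
    # F1A: replace no-break spaces with ordinary spaces in the text.
    return text.replace(u"\u00a0", u" ")

def check_double_space (text):
    # V1A: report each run of two spaces as an annotated span.
    spans = []
    pos = text.find(u"  ")
    while pos >= 0:
        spans.append((pos, pos + 2, u"double space in text"))
        pos = text.find(u"  ", pos + 2)
    return spans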
Hooks on text fields in a PO message in a catalog:
F3A ((text, msg, cat) -> text): filters any text field

V3A ((text, msg, cat) -> spans): validates any text field

S3A ((text, msg, cat) -> numerr): side-effects on any text field

F3B ((msgid, msg, cat) -> msgid): filters an original text field; original fields are either msgid or msgid_plural

V3B ((msgid, msg, cat) -> spans): validates an original text field

S3B ((msgid, msg, cat) -> numerr): side-effects on an original text field

F3C ((msgstr, msg, cat) -> msgstr): filters a translation text field; translation fields are the msgstr array

V3C ((msgstr, msg, cat) -> spans): validates a translation text field

S3C ((msgstr, msg, cat) -> numerr): side-effects on a translation text field
The *3B and *3C hook series are introduced next to *3A for cases when it does not make sense for the text field to be anything other than one of the original or translation fields. For example, to process the translation, sometimes the original (obtained through the msg parameter) must be consulted. If a *3B or *3C hook is applied to an inappropriate text field, the results are undefined.
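For example, a hedged sketch of an F3C hook which consults the original text through the msg parameter (the function name and the check are illustrative only):

def ensure_trailing_newline (msgstr, msg, cat):
    # If the original text ends with a newline and the translation does
    # not, append one; otherwise return the translation unchanged.
    if msg.msgid.endswith(u"\n") and not msgstr.endswith(u"\n"):
        return msgstr + u"\n"
    return msgstr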
Hooks on PO entries in a catalog:
F4A ((msg, cat) -> numerr): filters a message, modifying it

V4A ((msg, cat) -> parts): validates a message

S4A ((msg, cat) -> numerr): side-effects on a message (no modification)

F4B ((hdr, cat) -> numerr): filters a header, modifying it

V4B ((hdr, cat) -> parts): validates a header

S4B ((hdr, cat) -> numerr): side-effects on a header (no modification)
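And a sketch of an F4A hook, which modifies the message in place and returns the number of errors (here always zero, as nothing can fail; the function name is illustrative):

def strip_msgstr_whitespace (msg, cat):
    # Strip leading and trailing whitespace from all translation strings.
    for i, text in enumerate(msg.msgstr):
        msg.msgstr[i] = text.strip()
    return 0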
Hooks on PO catalogs:
F5A ((cat) -> numerr): filters a catalog, modifying it in any way

S5A ((cat) -> numerr): side-effects on a catalog (no modification)
Hooks on file paths:
F6A ((filepath) -> numerr): filters a file, modifying it in any way

S6A ((filepath) -> numerr): side-effects on a file, no modification
The *2* hook series (with signatures (text, msg) -> ...) has been skipped because no need for it was observed so far next to the *3* hooks.
Since hooks have fixed input signatures by type, the way to customize a given hook's behavior is to produce the hook function by another function. The hook-producing function is called a hook factory. It works by preparing anything needed for the hook, then defining the hook proper and returning it, thereby creating a lexical closure around it:
def hook_factory (param1, param2, ...):

    # Use param1, param2, ... to prepare for hook definition.

    def hook (...):
        # Perhaps use param1, param2, ... in the hook definition too.

    return hook
In fact, most internal Pology hooks are defined by factories.
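To make the pattern concrete, here is a hedged, illustrative F1A hook factory (not an actual Pology hook):

def replace_substring (old, new):

    # Produce a text filter which replaces all occurrences of 'old' with 'new'.
    def hook (text):
        return text.replace(old, new)

    return hook

# Usage: turn a literal "..." into an ellipsis character.
to_ellipsis = replace_substring(u"...", u"\u2026")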
General hooks should be defined in top level modules, language-dependent hooks in lang.code.module, project-dependent hooks in proj.name.module, and hooks that are both language- and project-dependent in lang.code.proj.name.module. Hooks placed like this can be fetched by getfunc.get_hook_ireq in various non-code contexts, in particular from Pology utilities which allow users to insert hooks into processing through command line options or configurations. If the complete module is dedicated to a single hook, the hook function (or factory) should be named the same as the module, so that users can select it by giving only the hook module name.
Annotated parts for PO messages returned by hooks are a reduced but valid instance of the highlight specifications used by reporting functions, e.g. msgreport.report_msg_content. Annotated parts do not have the optional fourth element of a tuple in the highlight specification, which is used to provide the filtered text against which the spans were constructed, instead of the original text. If a validation hook constructs the list of problematic spans against the filtered text, just before returning it can apply diff.adapt_spans to reconstruct the spans against the original text.
The documentation of a hook function should state the hook type within the short description, in square brackets at the end, as [type ... hook]. Input parameters should be named like in the informal signatures in the taxonomy above, and should not be omitted in @param: Epydoc entries; the return value should be given under @return:, also using one of the listed return names, in order to complete the hook signature.
The documentation of a hook factory should have [hook factory] at the end of the short description. It should normally list all the input parameters, while the return value should be given as @return: type ... hook, and the hook signature as the @rtype: Epydoc field.
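Following these conventions, the documentation of the earlier illustrative V1A hook might look like this (a sketch only):

def check_double_space (text):
    """
    Check for double spaces in the text [type V1A hook].

    @param text: the text to check
    @type text: string

    @return: annotated spans
    @rtype: list of tuples
    """

    spans = []
    # ... detection logic as in the earlier sketch ...
    return spans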
Ascription selectors are functions used by poascribe in the translation review workflow as described in Chapter 6, Ascribing Modifications and Reviews. This section describes how you can write your own ascription selector, which you can then put to use by following the instructions in Section 6.8.1, “Custom Review Selectors”.
In terms of code, an ascription selector is a function factory, which constructs the actual selector function based on supplied selector arguments. It has the following form:
# Selector factory.
def selector_foo (args):

    # Validate input arguments.
    if (...):
        raise PologyError(...)

    # Prepare selector definition.
    ...

    # The selector function itself.
    def selector (msg, cat, ahist, aconf):

        # Prepare selection process.
        ...

        # Iterate through ascription history looking for something.
        for i, asc in enumerate(ahist):
            ...

        # Return False or True if a shallow selector,
        # and 0 or 1-based history index if history selector.
        return ...

    return selector
It is customary to name the selector function selector_something, where something will also be used as the selector name (in command line, etc.). The input args parameter is always a list of strings. It should first be validated, insofar as that is possible without having in hand the particular message, catalog, ascription history, or ascription configuration. Whatever does not depend on any of these can also be precomputed for later use in the selector function.
The selector function takes as arguments the message (an instance of Message_base), the catalog (Catalog) it comes from, the ascription history (a list of AscPoint objects), and the ascription configuration (AscConfig). For the most part, AscPoint and AscConfig are simple attribute objects; check their API documentation for the list and description of attributes. Some of the attributes of AscPoint objects that you will usually inspect are .msg (the historical version of the message), .user (the user to whom the ascription was made), or .type (the type of the ascription, one of the AscPoint.ATYPE_* constants). The ascription history is sorted from the latest to the earliest ascription. If the .user of the first entry in the history is None, that means that the current version of the message has not been ascribed yet (e.g. if its translation has been modified compared to the latest ascribed version). If you are writing a shallow selector, it should return True to select the message, or False otherwise. In a history selector, the return value should be a 1-based index of the entry in the ascription history which caused the message to be selected, or 0 if the message was not selected.[57]
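As an illustration, here is a hedged sketch of a complete selector factory (not shipped with Pology) which selects messages whose latest ascription was made by the user named in the single selector argument:

from pology import PologyError

def selector_lastby (args):

    # Validate the single expected argument: a user name.
    if len(args) != 1:
        raise PologyError("selector 'lastby' expects exactly one argument")
    user = args[0]

    def selector (msg, cat, ahist, aconf):

        # History is sorted from the latest to the earliest ascription;
        # skip the unascribed current version, if any.
        for i, asc in enumerate(ahist):
            if asc.user is None:
                continue
            if asc.user == user:
                # 1-based history index, so this can serve as a history
                # selector as well as a shallow one.
                return i + 1
            return 0
        return 0

    return selector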
The entry index returned by history selectors is used to compute the embedded difference from a historical to the current version of the message, e.g. on poascribe diff. Note that poascribe will actually take as the base for differencing the first non-fuzzy historical message after the indexed one, because it is assumed that the historical message which triggered the selection already contains some changes to be inspected. (When this behavior is not sufficient, poascribe offers the user the possibility to specify a second history selector, which directly selects the historical message to base the difference on.)
Most of the time the selector will operate on messages covered by a single ascription configuration, which means that the ascription configuration argument sent to it will always be the same. On the other hand, the resolution of some of the arguments to the selector factory will depend only on the ascription configuration (e.g. a list of users). In this scenario, it would be a waste of performance if such arguments were resolved anew on each call to the selector. You can instead write a small caching (memoizing) resolver function which, when called for the second and subsequent times with the same configuration object, returns the previously resolved argument value from a cache. A few such caching resolvers for some common arguments are provided in the ascript module, in functions named cached_*() (e.g. cached_users()).
[51] In Python 2 to be precise, on which Pology is based, while in Python 3 there are only Unicode strings.
[52] The canonical way to check whether a message is a plural message is msg.msgid_plural is not None.
[53] There is also the report module for reporting general strings. In fact, all code in the Pology distribution is expected to use functions from these modules for writing to output streams, and there should not be a print in sight.
[54] This holds only for catalogs created with monitoring, i.e. without the monitored=False constructor argument. For non-monitored catalogs, .sync() will always touch the file and report True.
[55] As opposed to the find-messages sieve.
[56] In fact, .add_last() does a bit more: if both non-obsolete and obsolete messages are added in mixed order, in the catalog they will be separated such that all non-obsolete messages come before all obsolete ones, but otherwise maintaining the order of addition.
[57] In this way the history selector can automatically behave as shallow selector as well, because simply testing for falsity on the return value will show whether the message has been selected or not.