Module markup

Convert and validate markup in text.


Author: Chusslove Illich (Часлав Илић) <caslav.ilic@gmx.net>

License: GPLv3

Functions
string
plain_to_unwrapped(text)
Convert wrapped plain text to unwrapped.
string
xml_to_plain(text, tags=None, subs={}, ents={}, keepws=set([]), ignels=set([]))
Convert any XML-like markup to plain text.
string
html_plain(text)
Convert HTML markup to plain text.
string
qtrich_to_plain(text)
Convert Qt rich-text markup to plain text.
string
kuit_to_plain(text)
Convert KUIT markup to plain text.
string
kde4_to_plain(text)
Convert KDE4 GUI markup to plain text.
string
docbook4_to_plain(text)
Convert Docbook 4.x markup to plain text.
dict
collect_xml_spec_l1(specpath)
Collect lightweight XML format specification, level 1.
list of (int, int, string) tuples
validate_xml_l1(text, spec=None, xmlfmt=None, ents=None, casesens=True, accelamp=False)
Validate XML markup in text against level1 specification.
(msgstr, msg, cat) -> numerr
check_xml(strict=False, entities={}, mkeyw=None)
Check general XML markup in translation [hook factory].
(msgstr, msg, cat) -> spans
check_xml_sp(strict=False, entities={}, mkeyw=None)
Like check_xml, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
list of (int, int, string) tuples
validate_docbook4_l1(text, ents=None)
Validate Docbook 4.x markup in text against level1 specification.
(msgstr, msg, cat) -> numerr
check_docbook4(strict=False, entities={}, mkeyw=None)
Check XML markup in translations of Docbook 4.x catalogs [hook factory].
(msgstr, msg, cat) -> spans
check_docbook4_sp(strict=False, entities={}, mkeyw=None)
Like check_docbook4, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
(msg, cat) -> parts
check_docbook4_msg(strict=False, entities={}, mkeyw=None)
Check for any known problems in translations of messages in Docbook 4.x catalogs [hook factory].
list of (int, int, string) tuples
validate_html_l1(text, ents=None)
Validate HTML markup in text against level1 specification.
(msgstr, msg, cat) -> numerr
check_html(strict=False, entities={}, mkeyw=None)
Check HTML markup in translations [hook factory].
(msgstr, msg, cat) -> spans
check_html_sp(strict=False, entities={}, mkeyw=None)
Like check_html, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
list of (int, int, string) tuples
validate_qtrich_l1(text, ents=None)
Validate Qt rich-text markup in text against level1 specification.
(msgstr, msg, cat) -> numerr
check_qtrich(strict=False, entities={}, mkeyw=None)
Check Qt rich-text markup in translations [hook factory].
(msgstr, msg, cat) -> spans
check_qtrich_sp(strict=False, entities={}, mkeyw=None)
Like check_qtrich, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
list of (int, int, string) tuples
validate_kuit_l1(text, ents=None)
Validate KUIT markup in text against level1 specification.
list of (int, int, string) tuples
validate_kde4_l1(text, ents=None)
Validate markup in texts used in KDE4 GUI.
(msgstr, msg, cat) -> numerr
check_kde4(strict=False, entities={}, mkeyw=None)
Check XML markup in translations of KDE4 UI catalogs [hook factory].
(msgstr, msg, cat) -> spans
check_kde4_sp(strict=False, entities={}, mkeyw=None)
Like check_kde4, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
list of (int, int, string) tuples
validate_pango_l1(text, ents=None)
Validate Pango markup in text against level1 specification.
(msgstr, msg, cat) -> numerr
check_pango(strict=False, entities={}, mkeyw=None)
Check XML markup in translations of Pango UI catalogs [hook factory].
(msgstr, msg, cat) -> spans
check_pango_sp(strict=False, entities={}, mkeyw=None)
Like check_pango, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
string or None
nument_to_char(nument)
Convert numeric XML entity to character.
list of (int, int, string) tuples
validate_xmlents(text, ents={}, default=False, numeric=False)
Check whether XML-like entities in the text are among the known ones.
(msgstr, msg, cat) -> numerr
check_xmlents(strict=False, entities={}, mkeyw=None, default=False, numeric=False)
Check existence of XML entities in translations [hook factory].
(msgstr, msg, cat) -> spans
check_xmlents_sp(strict=False, entities={}, mkeyw=None, default=False, numeric=False)
Like check_xmlents, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].
list of (int, int, string) tuples
check_placeholder_els(orig, trans)
Check if sets of <placeholder-N/> elements are matching between original and translated text.
Variables
  flag_no_check_markup = 'no-check-markup'
  xml_entities = {'amp': '&', 'apos': '\'', 'gt': '>', 'lt': '<'...
  WS_SPACE = '\x04~sp'
  WS_TAB = '\x04~tb'
  WS_NEWLINE = '\x04~nl'
  html_entities = {u'AElig': u'Æ', u'Aacute': u'Á', u'Acirc': u'...
  kuit_entities = {u'nbsp': u' '}
  __package__ = 'pology'
  x = 'interface'
  y = ' '
Function Details

plain_to_unwrapped(text)

 

Convert wrapped plain text to unwrapped.

Two or more consecutive newlines are treated as a paragraph boundary and left in, while all other newlines are removed. Whitespace in the text is simplified throughout.

Parameters:
  • text (string) - text to unwrap
Returns: string
unwrapped text
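
The rule above can be sketched in a few lines. This is a stand-in written for illustration, not the module's implementation; the name unwrap_sketch is hypothetical, and it assumes that each paragraph boundary is normalized to exactly one blank line, which the description does not state.

```python
import re

def unwrap_sketch(text):
    """Illustrative stand-in for the unwrapping rule, not Pology's own code."""
    # Runs of two or more newlines (possibly with blanks between them)
    # mark paragraph boundaries.
    paragraphs = re.split(r"\n[ \t]*(?:\n[ \t]*)+", text.strip())
    # Inside a paragraph, every whitespace run collapses to a single space,
    # which also removes the single newlines left by wrapping.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Assumption: boundaries are written back as exactly one blank line.
    return "\n\n".join(p for p in cleaned if p)

wrapped = "First line\nof a paragraph.\n\n\nSecond   paragraph,\nalso wrapped."
print(unwrap_sketch(wrapped))
```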

xml_to_plain(text, tags=None, subs={}, ents={}, keepws=set([]), ignels=set([]))

 

Convert any XML-like markup to plain text.

By default, all tags in the text are replaced with a single space; entities, unless they are among the XML defaults (&lt;, &gt;, &amp;, &quot;, &apos;), are left untouched; all whitespace groups are simplified to a single space, and leading and trailing whitespace is removed.

If only a particular subset of tags should be taken into account, it can be specified by the tags parameter, as a sequence of tag names (the sequence is internally converted to set before processing).

If a tag should be replaced with a special sequence of characters (either opening or closing tag), or the text wrapped by it replaced too, this can be specified by the subs parameter. It is a dictionary of 3-tuples by tag name, where each tuple gives the replacements for the opening tag, the closing tag, and the wrapped text, in that order. For example, to replace <i>foobar</i> with /foobar/, the dictionary entry would be {"i": ("/", "/", None)} (where the final None states not to touch the wrapped text); to replace <code>...</code> with @@@ (i.e. remove the code segment completely but leave in a marker that there was something), the entry is {"code": ("", "", "@@@")}. The replacement for the wrapped text can also be a function, taking a string and returning a string. Note that whitespace is automatically simplified, so if whitespace given by the replacements should be exactly preserved, use the WS_* string constants in place of the corresponding whitespace characters.

To have some entities other than the XML default replaced with proper values, a dictionary of known entities with values may be provided using the ents parameter.

Whitespace can be preserved within some elements, as given by their tags in the keepws sequence.

Some elements may be completely removed, as given by the ignels sequence. Each element of the sequence should be either a tag, or a (tag, type) tuple, where type is the value of the type attribute of the element, if any.

It is assumed that the markup is well-formed; if it is not, the result is undefined, although a best attempt at conversion is made.

There are several other functions in this module which deal with well known markups, such that it is not necessary to use this function with tags, subs, or ents manually specified.

If you only want to resolve entities from a known set, instead of calling this function with empty tags and entities given in ents, consider using the more powerful pology.resolve.resolve_entities.

Parameters:
  • text (string) - markup text to convert to plain
  • tags (sequence of strings) - known tags
  • subs (dictionary of 3-tuples) - replacement specification
  • ents (dictionary) - known entities and their values
  • keepws (sequence of strings) - tags of elements in which to preserve whitespace
  • ignels (sequence of strings and (string, string) tuples) - tags or tag/types of elements to completely remove
Returns: string
plain text version
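
To make the subs 3-tuples concrete, here is a toy re-creation of their semantics. It is a sketch under stated assumptions: apply_subs_sketch is a hypothetical name, it handles only flat, non-nested elements, and it ignores the tags, ents, keepws and ignels parameters and the WS_* constants entirely.

```python
import re

def apply_subs_sketch(text, subs):
    """Toy illustration of subs 3-tuples; flat, non-nested elements only."""
    for tag, (open_repl, close_repl, body_repl) in subs.items():
        name = re.escape(tag)
        pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s>" % (name, name), re.S)

        def replace(match, o=open_repl, c=close_repl, b=body_repl):
            body = match.group(1)
            if callable(b):
                body = b(body)        # function: transform the wrapped text
            elif b is not None:
                body = b              # string: substitute the wrapped text
            return o + body + c       # None: wrapped text left as it is

        text = pattern.sub(replace, text)
    # Any remaining tag becomes a single space, then whitespace is simplified.
    text = re.sub(r"<[^>]*>", " ", text)
    return " ".join(text.split())

subs = {"i": ("/", "/", None), "code": ("", "", "@@@")}
print(apply_subs_sketch("Run <code>ls -l</code> on <i>foobar</i> <b>now</b>", subs))
```

The two dictionary entries are the ones from the description above; the <b> element has no entry, so its tags fall back to the default single-space replacement.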

html_plain(text)

 

Convert HTML markup to plain text.

Parameters:
  • text (string) - HTML text to convert to plain
Returns: string
plain text version

qtrich_to_plain(text)

 

Convert Qt rich-text markup to plain text.

Parameters:
  • text (string) - Qt rich text to convert to plain
Returns: string
plain text version

kuit_to_plain(text)

 

Convert KUIT markup to plain text.

Parameters:
  • text (string) - KUIT text to convert to plain
Returns: string
plain text version

kde4_to_plain(text)

 

Convert KDE4 GUI markup to plain text.

KDE4 GUI texts may contain both Qt rich-text and KUIT markup, even mixed in the same text. Note that the conversion cannot be achieved, in general, by first converting Qt rich-text and then KUIT, or vice versa. For example, if the text contains the &lt; entity, after the first conversion it will become a plain <, and interfere with the second conversion.

Parameters:
  • text (string) - KDE4 text to convert to plain
Returns: string
plain text version
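
The interference described above can be reproduced with a toy converter. This is only a sketch: to_plain_sketch is a hypothetical helper that removes the listed tags and resolves two entities, and the tag lists are small illustrative subsets rather than the module's definitions.

```python
import re

def to_plain_sketch(text, tags):
    """Toy converter: remove the listed tags, then resolve &lt; and &gt;."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(re.escape(t) for t in tags)
    text = re.sub(pattern, "", text)
    return text.replace("&lt;", "<").replace("&gt;", ">")

QT_TAGS = ["b", "i"]                  # illustrative subset of Qt rich-text
KUIT_TAGS = ["filename", "emphasis"]  # illustrative subset of KUIT

text = "Type <b>&lt;filename&gt;</b> to insert it"

# Two passes: the first turns &lt;filename&gt; into a literal <filename>,
# which the second pass then mistakes for a KUIT tag and removes.
two_pass = to_plain_sketch(to_plain_sketch(text, QT_TAGS), KUIT_TAGS)

# One pass over both tag sets keeps the literal text intact.
one_pass = to_plain_sketch(text, QT_TAGS + KUIT_TAGS)

print(repr(two_pass))
print(repr(one_pass))
```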

docbook4_to_plain(text)

 

Convert Docbook 4.x markup to plain text.

Parameters:
  • text (string) - Docbook text to convert to plain
Returns: string
plain text version

collect_xml_spec_l1(specpath)

 

Collect lightweight XML format specification, level 1.

Level 1 specification is the dictionary of all known tags, with allowed attributes and subtags for each.

File of the level 1 specification is in the following format:

   # A comment.
   # Tag with unconstrained attributes and subtags:
   tagA;
   # Tag with constrained attributes and unconstrained subtags:
   tagF : attr1 attr2 ...;
   # Tag with unconstrained attributes and constrained subtags:
   tagF > stag1 stag2 ...;
   # Tag with constrained attributes and subtags:
   tagF : attr1 attr2 ... > stag1 stag2 ...;
   # Tag with no attributes and unconstrained subtags:
   tagA :;
   # Tag with unconstrained attributes and no subtags:
   tagA >;
   # Tag with no attributes and no subtags:
   tagA :>;
   # Attribute value constrained by a regular expression:
   .... attr1=/^(val1|val2|val3)$/i ...
   # Reserved dummy tag specifying attributes common to all tags:
   pe-common-attrib : attrX attrY;

The specification can contain a dummy tag named pe-common-attrib, stating attributes which are common to all tags, instead of having to list them with each and every tag. To make an attribute mandatory, its name should be prefixed by an exclamation mark (!).

The specification file must be UTF-8 encoded.

Parameters:
  • specpath (string) - path to level 1 specification file
Returns: dict
level 1 specification
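
As a reading aid for the file format, the following sketch parses a simplified subset of it. Everything about the result is an assumption: the layout of the real level 1 dictionary is not documented here, so the sketch uses a hypothetical tag -> (attrs, mandatory, subtags) mapping with None meaning unconstrained. Regular-expression value constraints and pe-common-attrib expansion are not handled.

```python
def parse_spec_sketch(source):
    """Parse a simplified level 1 specification into a plain dict.

    Hypothetical layout: tag -> (attrs, mandatory, subtags),
    where None stands for "unconstrained".
    """
    # Drop comment lines, then split the rest into ';'-terminated statements.
    lines = [ln for ln in source.splitlines()
             if not ln.strip().startswith("#")]
    spec = {}
    for stmt in " ".join(lines).split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        # Everything after '>' lists the allowed subtags.
        head, sep, subtag_part = stmt.partition(">")
        subtags = set(subtag_part.split()) if sep else None
        # Everything between ':' and '>' lists the allowed attributes.
        tag, sep, attr_part = head.partition(":")
        attrs = mandatory = None
        if sep:
            names = attr_part.split()
            mandatory = {n[1:] for n in names if n.startswith("!")}
            attrs = {n.lstrip("!") for n in names}
        spec[tag.strip()] = (attrs, mandatory, subtags)
    return spec

example = """
# Hypothetical markup with three tags.
para : !id lang > b link;
b :>;
link;
"""
spec = parse_spec_sketch(example)
for tag in sorted(spec):
    print(tag, spec[tag])
```

Here para allows the attributes id (mandatory) and lang and the subtags b and link, b allows neither attributes nor subtags, and link is unconstrained in both.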

validate_xml_l1(text, spec=None, xmlfmt=None, ents=None, casesens=True, accelamp=False)

 

Validate XML markup in text against level1 specification.

Text is not required to have a top tag; if it does not, a dummy one will be added to ensure that the check passes.

If spec is None, text is only checked to be well-formed.

If ents is None, entities in the text are ignored by the check; otherwise, an entity not belonging to the known set is considered erroneous. Default XML entities (&lt;, &gt;, &amp;, &quot;, &apos;) are automatically added to the set of known entities.

Tag and attribute names can be made case-insensitive by setting casesens to False.

If text is a part of the user interface, and the environment may use the literal ampersand as an accelerator marker, it can be allowed to pass the check by setting accelamp to True.

Text can be one or more entity definitions of the form <!ENTITY ...>, in which case a special check is applied.

The result of the check is a list of erroneous spans in the text, each given by start and end index (in Python standard semantics) and the error description, packed in a tuple. If there are no errors, an empty list is returned. Reported spans need not be formally complete with respect to the error location, but are heuristically determined to be short and provide a good visual indication of what triggers the error.

Parameters:
  • text (string) - text to check
  • spec (level1 specification) - markup definition
  • xmlfmt (string) - name of the particular XML format (for error messages)
  • ents (sequence) - set of known entities
  • casesens (bool) - whether tag and attribute names are case-sensitive
  • accelamp (bool) - whether to allow ampersand as accelerator marker
Returns: list of (int, int, string) tuples
erroneous spans in the text
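
The well-formedness part of the check, including the dummy top tag and the (start, end, description) return shape, can be approximated with the standard library's expat binding. This is a sketch, not the module's algorithm: wellformed_spans_sketch is a hypothetical name, it reports only the first error, uses a crude one-character span, does no specification or entity handling (an undefined entity is simply a parse error), and relies on byte offsets, so indices are only reliable for ASCII text.

```python
import xml.parsers.expat

def wellformed_spans_sketch(text):
    """Return [(start, end, message)] for the first parse error, else []."""
    prefix = "<dummy>"
    wrapped = "%s%s</dummy>" % (prefix, text)  # text may lack a top tag
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(wrapped, True)
    except xml.parsers.expat.ExpatError as err:
        # Map the parser position back into the unwrapped text and clamp it,
        # since the error may be detected only at the closing dummy tag.
        start = parser.ErrorByteIndex - len(prefix)
        start = min(max(start, 0), max(len(text) - 1, 0))
        message = xml.parsers.expat.ErrorString(err.code)
        return [(start, start + 1, message)]
    return []

print(wellformed_spans_sketch("plain <b>bold</b> text"))
for start, end, message in wellformed_spans_sketch("plain <b>bold</i> text"):
    print(start, end, message)
```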

check_xml(strict=False, entities={}, mkeyw=None)

 

Check general XML markup in translation [hook factory].

Text is only checked to be well-formed XML, and possibly also whether encountered entities are defined. Markup errors are reported to stdout.

msgstr can be either checked only if the msgid is valid itself, or regardless of the validity of the original. This is governed by the strict parameter.

Entities in addition to XML's default (&lt;, etc.) may be provided using the entities parameter. Several types of values with different semantics are possible:

  • if entities is None, unknown entities are ignored on checking
  • if string, it is understood as a general function evaluation request, and its result expected to be (name, value) dictionary-like object
  • otherwise, entities is considered to be a (name, value) dictionary

If a message has the sieve flag no-check-markup, the check is skipped for that message. If one or several markup keywords are given as the mkeyw parameter, the check is skipped for all messages in a catalog which does not report one of the given keywords by its markup() method. See set_markup() for the list of markup keywords recognized at the moment.

Parameters:
  • strict (bool) - whether to require valid msgstr even if msgid is not
  • entities (None, dict, or string) - additional entities to consider as known
  • mkeyw (string or list of strings) - markup keywords for taking catalogs into account
Returns: (msgstr, msg, cat) -> numerr
type S3C hook
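
The hook-factory pattern used throughout this module can be illustrated as follows: the factory captures the configuration once and returns a function with the (msgstr, msg, cat) -> numerr signature. The sketch is hypothetical in every name it defines; the message is a SimpleNamespace stand-in with only a msgid attribute, and the entities and mkeyw parameters and the no-check-markup flag are left out.

```python
import xml.parsers.expat
from types import SimpleNamespace

def is_wellformed(text):
    """True if text parses as XML once wrapped in a dummy top tag."""
    try:
        xml.parsers.expat.ParserCreate().Parse("<dummy>%s</dummy>" % text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False

def check_xml_sketch(strict=False):
    """Hook factory: options are fixed here, the hook runs per message."""
    def hook(msgstr, msg, cat):
        # Non-strict mode: skip translations whose original is itself invalid.
        if not strict and not is_wellformed(msg.msgid):
            return 0
        if is_wellformed(msgstr):
            return 0
        print("markup error in translation: %r" % msgstr)
        return 1  # number of errors found
    return hook

msg = SimpleNamespace(msgid="Open <b>file</b>")  # stand-in for a PO message
hook = check_xml_sketch(strict=False)
print(hook("Otvori <b>datoteku</b>", msg, None))
print(hook("Otvori <b>datoteku", msg, None))
```

A corresponding _sp variant would return the list of spans from the validator instead of printing and counting.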

check_xml_sp(strict=False, entities={}, mkeyw=None)

 

Like check_xml, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

validate_docbook4_l1(text, ents=None)

 

Validate Docbook 4.x markup in text against level1 specification.

Markup definition is extended to include <placeholder-N/> elements, which xml2po uses to segment text when extracting markup documents into PO templates.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

check_docbook4(strict=False, entities={}, mkeyw=None)

 

Check XML markup in translations of Docbook 4.x catalogs [hook factory].

See check_xml for description of parameters.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_docbook4_sp(strict=False, entities={}, mkeyw=None)

 

Like check_docbook4, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

check_docbook4_msg(strict=False, entities={}, mkeyw=None)

 

Check for any known problems in translations of messages in Docbook 4.x catalogs [hook factory].

Currently performed checks:

  • Docbook markup
  • cross-message insertion placeholders

See check_xml for description of parameters.

Returns: (msg, cat) -> parts
type V4A hook

validate_html_l1(text, ents=None)

 

Validate HTML markup in text against level1 specification.

At the moment, this function can only check HTML markup if well-formed in the XML sense, although HTML allows omission of some closing tags.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

check_html(strict=False, entities={}, mkeyw=None)

 

Check HTML markup in translations [hook factory].

See check_xml for description of parameters. See the notes on checking HTML markup in validate_html_l1.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_html_sp(strict=False, entities={}, mkeyw=None)

 

Like check_html, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

validate_qtrich_l1(text, ents=None)

 

Validate Qt rich-text markup in text against level1 specification.

At the moment, this function can only check Qt rich-text if well-formed in the XML sense, although Qt rich-text allows HTML-type omission of closing tags.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

check_qtrich(strict=False, entities={}, mkeyw=None)

 

Check Qt rich-text markup in translations [hook factory].

See check_xml for description of parameters. See the notes on checking Qt rich-text in validate_qtrich_l1.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_qtrich_sp(strict=False, entities={}, mkeyw=None)

 

Like check_qtrich, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

validate_kuit_l1(text, ents=None)

 

Validate KUIT markup in text against level1 specification.

KUIT is the semantic markup for user interface in KDE4.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

validate_kde4_l1(text, ents=None)

 

Validate markup in texts used in KDE4 GUI.

KDE4 GUI texts may contain both Qt rich-text and KUIT markup, even mixed in the same text.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

check_kde4(strict=False, entities={}, mkeyw=None)

 

Check XML markup in translations of KDE4 UI catalogs [hook factory].

See check_xml for description of parameters.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_kde4_sp(strict=False, entities={}, mkeyw=None)

 

Like check_kde4, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

validate_pango_l1(text, ents=None)

 

Validate Pango markup in text against level1 specification.

See validate_xml_l1 for description of the ents parameter and the return value.

Parameters:
  • text (string) - text to check
  • ents (sequence) - set of known entities (in addition to default)
Returns: list of (int, int, string) tuples
erroneous spans in the text

check_pango(strict=False, entities={}, mkeyw=None)

 

Check XML markup in translations of Pango UI catalogs [hook factory].

See check_xml for description of parameters.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_pango_sp(strict=False, entities={}, mkeyw=None)

 

Like check_pango, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

nument_to_char(nument)

 

Convert numeric XML entity to character.

Numeric XML entities can be decimal, &#DDDD;, or hexadecimal, &#xHHHH;, where D and H stand for digits of the respective number system. At most 4 digits are accepted, but there can be fewer.

If the entity cannot be converted to a character, for whatever reason, None is reported.

Parameters:
  • nument (string) - numeric entity, with or without & and ;
Returns: string or None
character represented by the entity
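
A minimal sketch of the conversion described above, written against the stated rules only (optional & and ;, decimal or x-prefixed hexadecimal, one to four digits, None on failure). The name nument_to_char_sketch is hypothetical and this is not the module's code.

```python
import re

# Optional '&', then '#', then 1-4 hex digits after 'x' or 1-4 decimal
# digits, then optional ';'.
_NUMENT = re.compile(r"&?#(?:x([0-9A-Fa-f]{1,4})|([0-9]{1,4}));?")

def nument_to_char_sketch(nument):
    """Return the character for a numeric entity, or None if not convertible."""
    match = _NUMENT.fullmatch(nument)
    if match is None:
        return None
    hexdigits, decdigits = match.groups()
    code = int(hexdigits, 16) if hexdigits else int(decdigits, 10)
    return chr(code)

for sample in ("&#233;", "#xe9", "&#x263A;", "&nbsp;", "&#12345;"):
    print(sample, repr(nument_to_char_sketch(sample)))
```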

validate_xmlents(text, ents={}, default=False, numeric=False)

 

Check whether XML-like entities in the text are among the known ones.

The text does not have to be XML markup as such. No XML parsing is performed, only a raw search for XML-like entities.

Parameters:
  • text (string) - text with entities to check
  • ents (sequence) - known entities
  • default (bool) - whether default XML entities are allowed (&amp;, etc.)
  • numeric (bool) - whether numeric character entities are allowed
Returns: list of (int, int, string) tuples
erroneous spans in the text
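
The raw search can be pictured as a regular-expression scan that reports each unrecognized entity as a span. The sketch below is an assumption-laden stand-in: unknown_entity_spans_sketch is a hypothetical name, the entity pattern is deliberately loose, and the wording of the error descriptions is invented.

```python
import re

XML_DEFAULT = {"lt", "gt", "amp", "quot", "apos"}
_ENTITY = re.compile(r"&(#?[\w.:-]+);")

def unknown_entity_spans_sketch(text, ents, default=False, numeric=False):
    """Return (start, end, description) for each entity not considered known."""
    spans = []
    for match in _ENTITY.finditer(text):
        name = match.group(1)
        if name.startswith("#"):
            known = numeric           # numeric entities pass only if allowed
        else:
            known = name in ents or (default and name in XML_DEFAULT)
        if not known:
            spans.append((match.start(), match.end(),
                          "unknown entity '%s'" % name))
    return spans

text = "Fish &amp; chips&nbsp;&mdash; &#8364;5"
for start, end, description in unknown_entity_spans_sketch(
        text, ents={"nbsp"}, default=True):
    print(text[start:end], description)
```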

check_xmlents(strict=False, entities={}, mkeyw=None, default=False, numeric=False)

 

Check existence of XML entities in translations [hook factory].

See check_xml for description of parameters strict, entities, and mkeyw. See validate_xmlents for parameters default and numeric, and for general notes on checking entities.

Returns: (msgstr, msg, cat) -> numerr
type S3C hook

check_xmlents_sp(strict=False, entities={}, mkeyw=None, default=False, numeric=False)

 

Like check_xmlents, except that erroneous spans are returned instead of reporting problems to stdout [hook factory].

Returns: (msgstr, msg, cat) -> spans
type V3C hook

check_placeholder_els(orig, trans)

 

Check if sets of <placeholder-N/> elements are matching between original and translated text.

<placeholder-N/> elements are added into text by xml2po, for finer segmentation of markup documents extracted into PO templates.

See validate_xml_l1 for description of the return value.

Parameters:
  • orig (string) - original text
  • trans (string) - translated text
Returns: list of (int, int, string) tuples
erroneous spans in translation
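
A rough sketch of the comparison follows. It is hypothetical throughout: placeholder_mismatch_sketch is not the module's function, and the choice to report placeholders missing from the translation as a single zero-length span at the start of the text is an assumption, since a missing element has no position of its own.

```python
import re

_PLACEHOLDER = re.compile(r"<\s*placeholder-(\d+)\s*/\s*>")

def placeholder_mismatch_sketch(orig, trans):
    """Compare sets of <placeholder-N/> elements; return spans in trans."""
    orig_ids = set(_PLACEHOLDER.findall(orig))
    trans_ids = set()
    spans = []
    for match in _PLACEHOLDER.finditer(trans):
        trans_ids.add(match.group(1))
        if match.group(1) not in orig_ids:
            # Extra placeholder: point directly at it in the translation.
            spans.append((match.start(), match.end(),
                          "placeholder not present in original"))
    missing = sorted(orig_ids - trans_ids, key=int)
    if missing:
        # Assumption: missing placeholders get one zero-length span.
        spans.append((0, 0, "missing placeholders: " + ", ".join(missing)))
    return spans

orig = "See <placeholder-1/> and <placeholder-2/>."
trans = "Vidi <placeholder-1/> i <placeholder-3/>."
for start, end, description in placeholder_mismatch_sketch(orig, trans):
    print(repr(trans[start:end]), description)
```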

Variables Details

xml_entities

Value:
{'amp': '&', 'apos': '\'', 'gt': '>', 'lt': '<', 'quot': '"'}

html_entities

Value:
{u'AElig': u'Æ',
 u'Aacute': u'Á',
 u'Acirc': u'Â',
 u'Agrave': u'À',
 u'Aring': u'Å',
 u'Atilde': u'Ã',
 u'Auml': u'Ä',
 u'Ccedil': u'Ç',
...