Tag handling

The management of inline formatting markers, placeholders, and structural codes (HTML tags, variables, custom syntax) that must survive the translation process intact, correctly positioned, and unmodified.

The honest version

Tags are the thing that breaks most MT pipelines in production. They are also the thing that is most often underestimated in pilot evaluations, because pilots use clean source text and production strings do not.

Real software strings are not pure text. They contain <b>bold text</b>, {count, plural, one {# item} other {# items}}, %1$s, {{user_name}}, <ph id="1"/>, and dozens of other markers that must survive translation with specific properties: they must not be translated, they must not be dropped, they must be moved only as far as the target language’s grammar requires, and they must remain syntactically valid in whatever system will render them.

An LLM that translates "Click <b>Save</b> to confirm your changes" into "Нажмите Сохранить, чтобы подтвердить изменения" has stripped the bold tag. The output is linguistically correct. It is a production bug.
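A minimal sketch of how this particular bug can be caught mechanically, assuming only HTML/XML-style tags matter. The pattern TAG_RE and the helper names are illustrative, not part of any specific tool.

```python
import re
from collections import Counter

# Illustrative pattern: opening, closing, and self-closing HTML/XML-style tags.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*?>")

def tag_counts(text: str) -> Counter:
    """Count each distinct tag string found in the text."""
    return Counter(TAG_RE.findall(text))

def dropped_tags(source: str, target: str) -> list[str]:
    """Return the tags that occur in the source but are missing from the target."""
    missing = tag_counts(source) - tag_counts(target)
    return sorted(missing.elements())

source = "Click <b>Save</b> to confirm your changes"
stripped = "Нажмите Сохранить, чтобы подтвердить изменения"
preserved = "Нажмите <b>Сохранить</b>, чтобы подтвердить изменения"

print(dropped_tags(source, stripped))   # ['</b>', '<b>']
print(dropped_tags(source, preserved))  # []
```

The comparison ignores position on purpose: the tag may legitimately move in the target, but it may not disappear.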

Why it matters for translation

In software localization, the majority of strings contain at least one tag or variable. UI strings have bold, links, line breaks, and count variables. Marketing strings have CMS-specific template syntax. Regulatory documents have structural tags that correspond to legal requirements. XLIFF files have inline codes that must be preserved through every processing step.

Incorrect tag handling produces a range of consequences depending on what the tag does. A dropped formatting tag (<b>, <em>) produces a visual regression. A dropped structural tag produces broken layouts or content gaps. A malformed variable ({user_name}, %1$s) shows up as a literal placeholder in the UI: the user sees {user_name} instead of their name. A dropped variable silently omits the value, and in some formatting libraries a mismatch between placeholders and arguments raises an error at runtime. A dropped CDATA section or conditional tag in a localizable resource file can break the file parser and prevent the entire locale from loading.

In the most serious cases (injection of script content via tag manipulation in a localization workflow), incorrect tag handling is a security issue, not just a quality issue.

Where it fails

MT systems (including LLM-based ones) treat tags as text unless you explicitly constrain them. The model will translate, move, or drop tags depending on what the surrounding content suggests is grammatically appropriate. A bold tag around a verb in English may or may not belong around a verb in German; the model makes a probabilistic decision that may be wrong.

Even models that handle simple HTML tags correctly will fail on custom tag formats they have not been trained to recognize. Your CMS might use [[placeholder:invoice_date]] or <xlt:var name="count"/> or ICU message format plurals, none of which the model has a reliable prior about. Handling requires explicit specification in the prompt and structural validation of the output.
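One common way to make custom formats explicit is to mask them before translation and restore them afterwards. The sketch below assumes the placeholder syntaxes mentioned in this entry and a hypothetical token format (<x0/>, <x1/>, ...). It does not cover ICU plural blocks, which need a real parser, and it assumes the source never contains a literal token of that shape.

```python
import re

# Assumed placeholder syntaxes, taken from the examples in this entry; extend per project.
PLACEHOLDER_RE = re.compile(
    r"\[\[placeholder:[^\]]+\]\]"   # [[placeholder:invoice_date]]
    r"|<xlt:var\b[^>]*/>"           # <xlt:var name="count"/>
    r"|\{\{[^{}]+\}\}"              # {{user_name}}
    r"|%\d+\$[sd]"                  # %1$s
)
# Hypothetical opaque token that stands in for a placeholder during translation.
TOKEN_RE = re.compile(r"<x(\d+)/>")

def mask(text: str) -> tuple[str, list[str]]:
    """Replace each placeholder with a numbered token; return masked text and originals."""
    originals: list[str] = []

    def swap(match: re.Match) -> str:
        originals.append(match.group(0))
        return f"<x{len(originals) - 1}/>"

    return PLACEHOLDER_RE.sub(swap, text), originals

def unmask(translated: str, originals: list[str]) -> str:
    """Restore placeholders; fail loudly if any token was dropped, duplicated, or invented."""
    seen = sorted(int(i) for i in TOKEN_RE.findall(translated))
    if seen != list(range(len(originals))):
        raise ValueError(f"token mismatch: expected {len(originals)} tokens, found {seen}")
    return TOKEN_RE.sub(lambda m: originals[int(m.group(1))], translated)

source = 'Invoice dated [[placeholder:invoice_date]] covers <xlt:var name="count"/> items.'
masked, originals = mask(source)
print(masked)  # Invoice dated <x0/> covers <x1/> items.

# Stand-in for the MT output; tokens may move, but each must appear exactly once.
translated = "Die Rechnung vom <x0/> umfasst <x1/> Artikel."
print(unmask(translated, originals))
```

Masking narrows what the model has to preserve to one uniform token shape, and the check in unmask turns a silent drop into an explicit failure.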

Tag placement rules vary with grammar. In German, the verb often moves to the end of the clause, so a bold tag around the verb in an English source may need to move a long way in the German target. In Japanese, word order changes more dramatically. In Arabic, right-to-left script means inline tags sit inside bidirectional text, where their placement affects how the rendered string is ordered. A tag handling system that works for European languages may fail for languages with substantially different syntax or script direction.

The most reliable approach is structural validation: after translation, parse the output against a schema derived from the source tag structure and reject any output that does not comply. This catches drops, additions, and malformed tags before they enter the delivery pipeline. Relying on MT to handle tags correctly without validation is an assumption that production data will eventually falsify.