Keep translations when source is rearranged and tweaked

Description

Problem

If the key of a source string is changed (structured file formats) or sections are edited/inserted/moved/combined/deleted (unstructured file formats), Zanata does not keep translations properly associated with their source strings.

A few different things can happen, depending on the file format:

  • translations are not associated with anything (key changed, hash-keyed paragraph edited, or positional paragraphs removed from the end of the document)

  • translations are associated with the wrong source (positional paragraphs inserted/moved/combined/deleted)

A human would not make this mistake, since they would notice things like:

  1. the source text is identical but is in a different position or has a different key

  2. a paragraph is split into 2 paragraphs but the text is identical

  3. two paragraphs have been combined to a single paragraph

  4. a paragraph is very similar but just has a few words changed (typo fixed, updated terminology, grammar/tone corrected, punctuation fixed)

  5. a section with several paragraphs has been moved

  6. combinations of some or all of the above

Solution

There is a fairly straightforward algorithm that would easily identify many of these cases, which Zanata could use to increase the number of translations that remain associated with the correct source. This would reduce translator work as some translations would need no checking, and others would be faster to check and update since they are still associated with the correct source.

The algorithm

Given "old" and "new" source documents, each as an ordered list of strings with keys:

  1. Calculate the similarity of each new string's content to each old string's content, and save the result as a "similarity pair" holding the match percentage and the identities of the new and old strings.

  2. Sort the similarity pairs in descending order of similarity.

    1. As second- and third-order sorts among pairs with equal similarity, put pairs with identical keys first, then pairs with identical index position.

  3. Iterate over the pairs in order. For each pair:

    1. Stop iterating if the similarity of the pair is below a threshold (perhaps 90%; the threshold sets how different the strings can be and still be treated as an updated version of the same string)

    2. Skip the pair if either old or new string is already flagged as being matched

    3. Flag the new and old strings as being matched

    4. Store the pair for use in the next step

  4. Iterate over the stored pairs from the previous step. For each pair:

    1. If the content is identical, associate translations from the old id with the new id using the current status. (Note: if the id is identical, no change is needed.)

    2. If the content is not identical, associate translations from the old id with the new id but downgrade Translated/Approved status to Fuzzy
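
The steps above could be sketched roughly as follows in Python. This is only an illustration, not Zanata code: the `SourceString` type, the `match_strings` function and the "keep"/"fuzzy" action labels are hypothetical names, `difflib.SequenceMatcher` stands in for whatever similarity measure Zanata would actually use, and the tie-breaking (identical keys first, then identical index position) is one reading of step 2.1.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass(frozen=True)
class SourceString:
    key: str      # id of the string within its document (hypothetical type)
    content: str  # source text


def match_strings(old: List[SourceString], new: List[SourceString],
                  threshold: float = 0.9) -> List[Tuple[str, str, str]]:
    """Return (old_key, new_key, action) for each matched pair.

    action is "keep" (content identical, keep current status) or
    "fuzzy" (content changed, downgrade Translated/Approved to Fuzzy).
    """
    # Step 1: similarity of every new string to every old string.
    pairs = []
    for ni, n in enumerate(new):
        for oi, o in enumerate(old):
            ratio = SequenceMatcher(None, o.content, n.content).ratio()
            pairs.append((ratio, o.key == n.key, oi == ni, oi, ni))

    # Step 2: descending similarity; ties prefer same key, then same index.
    pairs.sort(key=lambda p: (-p[0], not p[1], not p[2]))

    # Step 3: greedily accept the best pairs above the threshold.
    matched_old, matched_new = set(), set()
    result = []
    for ratio, _same_key, _same_index, oi, ni in pairs:
        if ratio < threshold:
            break  # all remaining pairs are even less similar
        if oi in matched_old or ni in matched_new:
            continue  # one side is already matched
        matched_old.add(oi)
        matched_new.add(ni)
        # Step 4: decide how the translation is carried over.
        action = "keep" if old[oi].content == new[ni].content else "fuzzy"
        result.append((old[oi].key, new[ni].key, action))
    return result
```

Comparing every new string with every old string is quadratic in document size, so a real implementation would probably match identical content first (for example through a hash lookup) and only score the remaining strings.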

Note 1: additional sophistication could be added by finding strings with high substring similarity. For example, an old paragraph that is split in half might have 99% substring similarity with 2 new paragraphs, in which case the full translation would be useful on each of them as fuzzy (the translator can just delete the half that does not apply). Similarly, two old paragraphs that are combined may each have 99% substring similarity with one new paragraph, so the combined translations of both could be used as a fuzzy translation.
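
One possible way to measure the substring similarity from Note 1 is sketched below. The ticket does not define the measure, so the definition here is an assumption: the fraction of the shorter string's characters that fall inside in-order matching blocks of the longer string. `substring_similarity` is a hypothetical helper name, and `difflib.SequenceMatcher` is again only a stand-in.

```python
from difflib import SequenceMatcher


def substring_similarity(part: str, whole: str) -> float:
    """Fraction of `part` covered by in-order matching blocks found in `whole`.

    Assumed definition: 1.0 means `part` appears verbatim inside `whole`.
    """
    if not part:
        return 0.0
    matcher = SequenceMatcher(None, part, whole, autojunk=False)
    # The final block returned is a zero-size sentinel, so it adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(part)
```

With this definition, each half of a split paragraph scores 1.0 against the old paragraph, and each old paragraph scores 1.0 against a new paragraph that combines them. Short strings can collect scattered matches inside a long unrelated string, so a high threshold or a minimum block size would probably be needed in practice.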

Note 2: it may work better if old strings are not flagged as matched. If a paragraph is duplicated and the duplicate then receives some minor edits, fuzzy translations of the original would be useful for the duplicate too.

Note 3: a more conservative option for an exact content match with a different id or position is to downgrade translations to fuzzy, in case the order of paragraphs changes how they should be translated.

Created 31 May 2016 at 05:06
Updated 22 February 2017 at 06:02