microsoft / xliff2-object-model

If you need to store localization data and propagate it through your localization pipeline so that tools can interoperate, consider the XLIFF 2.0 object model. It implements the OASIS Standard for XLIFF 2.0 as defined at http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html.

License: Other

C# 99.93% Batchfile 0.07%

xliff2-object-model's Introduction

XLIFF 2.0 Object Model

The XLIFF 2.0 Object Model contains classes and methods for generating and manipulating XLIFF 2.0 documents as described in the XLIFF 2.0 Standard. The library is built as a Portable Class Library, so developers can generate XLIFF documents on various platforms.

Goals for this project

The XLIFF 2.0 Object Model allows a developer to build up an XLF document in memory and change various properties on the elements before writing the file. This is intended to give developers a head-start in building localization tools, platforms, and engineering systems that take advantage of the newest open localization standard.

What this project provides

The library includes classes for all Core elements and all module elements described in the standard:

- Core Elements (xliff, file, group, etc)
- Change Tracking Module
- Format Style Module
- Glossary Module
- Metadata Module
- Resource Data Module
- Size and Length Restriction Module
- Translation Candidates Module
- Validation Module

For more information, please take a look at the XLIFF 2.0 Class Guide documentation provided with this project.

Constraint Validation

This initial drop allows developers to read and write Core XLIFF 2.0 and all associated modules as required by the XLIFF 2.0 Standard. However, full validation for constraints defined in the standard is only available for Core, Metadata, Glossary, and Translation Candidates Modules.

Contributing

Please help us improve the XLIFF 2.0 Object Model by filing bugs or feature requests on this repo. You are encouraged to fork and contribute a fix via a pull request.

Bug Fixes

If you believe you've found a bug, you're encouraged to file an issue on this repo.

Licensing

The XLIFF 2.0 Object Model is licensed under the MIT License.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

xliff2-object-model's Issues

<source> with only spaces is emptied on deserialization

It looks like the content of <source> is removed on reading when it consists only of spaces.
For example if we have this:

String data = "<xliff srcLang='en' version='2.0' xmlns='urn:oasis:names:tc:xliff:document:2.0'>"
    + "<file id='f1'><unit id='u1'>"
    + "<segment><source>Sentence 1.</source></segment>"
    + "<ignorable><source> </source></ignorable>"
    + "<segment><source>Sentence 2.</source></segment>"
    + "</unit></file></xliff>";
using (IO.MemoryStream ms = new IO.MemoryStream(Encoding.UTF8.GetBytes(data)) )
{
    XliffReader reader = new XliffReader();
    XliffDocument doc = reader.Deserialize(ms);
    foreach (XliffElement e in doc.CollapseChildren<XliffElement>() )
    {
        Console.WriteLine("Type: " + e.GetType().ToString());
        if ( e is PlainText )
        {
            PlainText pt = (PlainText)e;
            Console.WriteLine("Content: '" + pt.Text + "'");
        }
    }
}

We get this output (no content for <ignorable>):

Type: Localization.Xliff.OM.Core.File
Type: Localization.Xliff.OM.Core.Unit
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 1.'
Type: Localization.Xliff.OM.Core.Ignorable
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 2.'

While I would expect this output:

Type: Localization.Xliff.OM.Core.File
Type: Localization.Xliff.OM.Core.Unit
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 1.'
Type: Localization.Xliff.OM.Core.Ignorable
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.PlainText
Content: ' '
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 2.'

The same thing happens for <source> inside <segment> elements.
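A similar effect can be reproduced with the standard .NET XML stack alone, which may help narrow down where the spaces get lost. The sketch below does not use the Object Model at all; it assumes (without confirmation from the library source) that the behaviour is related to whitespace-only text nodes being treated as insignificant by the underlying reader. `WhitespaceDemo` and `ReadIgnorableSource` are illustrative names only.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class WhitespaceDemo
{
    // Parses a small XLIFF fragment and returns the text of the <source>
    // element inside <ignorable>, with or without whitespace preservation.
    public static string ReadIgnorableSource(bool preserveWhitespace)
    {
        const string data =
            "<xliff srcLang='en' version='2.0' xmlns='urn:oasis:names:tc:xliff:document:2.0'>"
            + "<file id='f1'><unit id='u1'>"
            + "<ignorable><source> </source></ignorable>"
            + "</unit></file></xliff>";

        // Without PreserveWhitespace, LINQ to XML discards text nodes that
        // contain only whitespace.
        LoadOptions options = preserveWhitespace
            ? LoadOptions.PreserveWhitespace
            : LoadOptions.None;

        XDocument doc = XDocument.Parse(data, options);
        XNamespace ns = "urn:oasis:names:tc:xliff:document:2.0";
        return doc.Root.Descendants(ns + "source").First().Value;
    }
}
```

Calling `WhitespaceDemo.ReadIgnorableSource(false)` returns an empty string, while `WhitespaceDemo.ReadIgnorableSource(true)` returns the single space, which is the behaviour this issue asks for.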

The validator shouldn't validate any <ec/> tags which have isolated attribute set to "yes" and no id attribute

According to the XLIFF 2.0 documentation, the <ec/> tag must have the id attribute set whenever the isolated attribute is set to "yes":

If and only if the attribute isolated is set to yes, the attribute id MUST be used instead of the attribute startRef that MUST be used otherwise.

Given this, the OM should reject an <ec/> tag that has isolated="yes" but no id attribute. Currently it accepts such a tag as valid.

OM writes id attribute on an <ec/> tag even if the id attribute is not set

When a default StandaloneCodeEnd object is created and written back, the OM automatically adds the id attribute with an empty string as its value.

For example, when writing back the tag , the OM will write it as

The "priority" attribute of the <note> element is not properly validated

The attached file should not pass validation because the value of the "priority" attribute in the <note> element does not conform to the schema - notice that its value is "1 " instead of "1".

space_test.xlf.txt

The character '/' in namespace URIs is not allowed by the serializer

When I read the file below, the Deserializer works fine.
But if I try to serialize the document read, I get this error:

Unhandled Exception: Localization.Xliff.OM.Exceptions.InvalidXmlSpecifierException: A valid XML prefix, namespace, and local name must be specified for the entity named 'version'. ---> System.Xml.XmlException: The '/' character, hexadecimal value 0x2F, cannot be included in a name.

If I replace the '/' in the ITS namespace URI by '_' all works fine.

Here is the example file:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-us"
 xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
 <file id="f1" original="test.html">
  <unit id="tu1">
   <segment>
    <source>Test</source>
   </segment>
  </unit>
 </file>
</xliff>

And here is the code:

class Program
{
    static void Main(string[] args)
    {
        using (IO.FileStream stream = new IO.FileStream(args[0], IO.FileMode.Open))
        {
            XliffReader reader = new XliffReader();
            XliffDocument doc = reader.Deserialize(stream);
            System.Console.WriteLine("cont 0 = " + doc.Files[0].Containers[0].Id);
            String path2 = "C:\\temp\\out.xlf";
            using (IO.FileStream stream2 = new IO.FileStream(path2, IO.FileMode.Create, IO.FileAccess.Write))
            {
                XliffWriter writer = new XliffWriter();
                writer.Serialize(stream2, doc);
            }
        }
    }
}
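The wording of the inner exception matches what System.Xml reports when a string is checked as an XML name rather than as a namespace URI. The sketch below shows that check in isolation. It is only an assumption that the serializer performs something similar on the namespace value; `NameCheckDemo` and `IsValidXmlName` are hypothetical helpers, not part of the Object Model.

```csharp
using System.Xml;

public static class NameCheckDemo
{
    // Returns true when the candidate is a valid XML Name according to
    // XmlConvert.VerifyName, false when that check throws.
    public static bool IsValidXmlName(string candidate)
    {
        try
        {
            XmlConvert.VerifyName(candidate);
            return true;
        }
        catch (XmlException)
        {
            // A '/' is not allowed in an XML name, so namespace URIs such
            // as the ITS one fail this check.
            return false;
        }
    }
}
```

`NameCheckDemo.IsValidXmlName("http://www.w3.org/2005/11/its")` returns false, whereas the same string with every '/' replaced by '_' passes, which is consistent with the workaround described above.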

CDataTag in Note

Hello,

Is it possible to use CDataTag in note element?

 <note category="DevComment"><![CDATA[ Unicode control characters:  &#4;]]></note>

Enhancement request: Have option to open the file even if is not valid and return validation errors

There are cases in which we want to open a file even if it has validation errors (for instance, if you don't care about a specific error or want to fix it programmatically). The library should let users decide whether to open a file with or without validation, and if they choose the latter, they should have a way to find out which validation errors the library encountered.

One way to do this would be to create two separate methods: ValidateFile(string filePath), which would return a list of validation errors, and DeserializeWithoutValidating(string filePath), which would return the XLIFF file stream even if the file is not valid.

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

nuget package

Any plans for publishing to nuget? It will be very helpful.

Validator rejects <mrk> tags without the "translate" attribute

When an XLIFF 2.0 file contains a <mrk> tag without the "translate" attribute, the validator sees it as invalid, throwing the error: "System.FormatException: A value for translate must be specified".

According to the official specification, the translate attribute is optional in <mrk>, so the attached file should be valid. It also passes validation by http://okapi-lynx.appspot.com/validation.

This only started happening after the last commit. It wasn't an issue before that.

markedSpan.xlf.txt

Source with HTML content

I have HTML in the <source>. Can the XLIFF2-Object-Model library create the XLIFF tags (<bpt> or <g>) from it?

The reason I need this is because I would like CAT tools like MemoQ or SDL Trados to display the source with the inline HTML tags like here:

http://kb.kilgray.com/admin/media_store/2/AA-00470/new_inlineTag.png

I also tried creating an XLIFF file in Notepad with a source that has restype='x-html', hoping the tools would display the inline tags in their editor instead of plain text, but it didn't work:

     <trans-unit id="1" xml:space="preserve" restype='x-html'>
        <source>This is &lt;b&gt;bold&lt;/b&gt;</source>
     </trans-unit>

What's interesting is that these tools are able to show inline tags properly for HTML documents, so I'm trying to find a way to do this with XLIFF without having to write the code that transforms HTML tags into XLIFF tags myself.

Additional whitespaces added to segments in generated file

When creating an Xliff file from the XliffDocument object with the indentation setting enabled, the library introduces a number of whitespace and newline characters in order to pretty-print the Xliff document.

If the <source> or <target> element begins with an inline element, the output file will have the inline element on a new line.

Example:

<segment>
    <source><pc id="pc5">spanning code t2</pc></source>
    <target>target here</target>
</segment>

If we deserialize this Xliff document and serialize it back using the library, then the snippet will be written in the following way:

<segment>
    <source>
                     <pc id="pc5">spanning code t2</pc>             
</source>
    <target>target here</target>
</segment>

No whitespace should be added before or after the inline element.
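The same kind of reflow can be observed with plain LINQ to XML, independent of the Object Model. The following sketch serializes the `<source>` element from the example with and without formatting. It only illustrates how indentation changes element content; whether the library's writer behaves the same way internally is an assumption. `IndentDemo` is an illustrative name.

```csharp
using System.Xml.Linq;

public static class IndentDemo
{
    public const string Snippet =
        "<source><pc id=\"pc5\">spanning code t2</pc></source>";

    // Serializes the snippet either indented (the LINQ to XML default)
    // or with formatting disabled.
    public static string Write(bool indent)
    {
        XElement source = XElement.Parse(Snippet);
        SaveOptions options = indent
            ? SaveOptions.None
            : SaveOptions.DisableFormatting;
        return source.ToString(options);
    }
}
```

`IndentDemo.Write(false)` reproduces the input exactly, while `IndentDemo.Write(true)` places `<pc>` on its own line, which changes the content of `<source>` for any consumer that treats whitespace as significant.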

Ongoing development? Next project iteration?

Hello maintainers. @RyanKing77 @marcta76
You did an incredibly good job, but unfortunately the project seems to be dying.
I think I can say on behalf of the community that we really need this project and want to see it continue.

I completely understand that you probably don't have enough time to maintain it. In that case, could you please consider adding more maintainers to the project? The community has already created a few alternative forks and NuGet packages. I think it would be much better to find a new person who will care for the project and take responsibility for further development.

I'm highly interested in supporting this project. I've already published a .NET Standard version, and I'd like to fix all known issues and publish a new package version.
So, if you don't mind, I'd like to kindly ask you to grant me maintainer permissions. I'd be happy to help you with contributions.

Support for .Net Standard

Any plans in this direction?

Multiple Groups with the same Id should not be allowed

Currently, if I add two Groups with the same Id, I can save the file, but when I read it back I get a validation exception.
Validation should prevent me from adding two groups with the same Id in the first place.
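A check of this kind could run before the document is written. The sketch below is independent of the Object Model and works on a plain sequence of id strings; `DuplicateIdCheck` and `FindDuplicates` are hypothetical names, and how the ids would be gathered from the document's groups is left open.

```csharp
using System.Collections.Generic;

public static class DuplicateIdCheck
{
    // Returns each id that occurs more than once, reported a single time
    // and in the order in which the first repetition was seen.
    public static List<string> FindDuplicates(IEnumerable<string> ids)
    {
        var seen = new HashSet<string>();
        var reported = new HashSet<string>();
        var duplicates = new List<string>();

        foreach (string id in ids)
        {
            // HashSet.Add returns false when the id is already present.
            if (!seen.Add(id) && reported.Add(id))
            {
                duplicates.Add(id);
            }
        }

        return duplicates;
    }
}
```

For the ids `g1`, `g2`, `g1` the method returns a list containing only `g1`; an empty result means the ids are unique.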

Overwriting an attribute with default value results in writing the attribute explicitly

If we explicitly set an attribute to the same value as its default, serializing the document writes the attribute out explicitly with that default value.

Example:

Overwriting the default value of the "EquivalentText" attribute with the default value (default value is empty string)

standaloneCode.EquivalentText = "";

When serializing the document this element will look like this:

Whether or not I overwrite the attribute, I expect attributes with default values not to be written explicitly when the document is serialized.

One type of note reference type is accepted

The only accepted note reference type for comment marked span elements is:

<note id="n1">Note text</note>
......
<mrk id="1" type="comment" ref="#n=n1"/>

This is the only accepted reference: #n=n1

If the XLIFF 2.0 file contains a comment marked span with any other type of reference, the validator reports an error.

Other note references that are not supported:

#n1 : <mrk id="1" type="comment" ref="#n1"
#f=f2/u=u1/n=n1 : <mrk id="1" type="comment" ref="#f=f2/u=u1/n=n1"

These note references are valid and can be found in the XLIFF 2.0 documentation. Validating with the Okapi-Lynx validator (http://okapi-lynx.appspot.com/validation) also does not detect any errors.

The library should support different types of note references.
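To make the three reference forms concrete, here is a minimal sketch that extracts the note id from each of them. It handles only the forms listed in this issue and is not a full implementation of XLIFF fragment identification; `NoteReference` and `ExtractNoteId` are hypothetical names, not part of the Object Model.

```csharp
public static class NoteReference
{
    // Returns the note id for "#n1", "#n=n1" and "#f=f2/u=u1/n=n1" style
    // references, or null when the reference does not point at a note.
    public static string ExtractNoteId(string reference)
    {
        if (string.IsNullOrEmpty(reference) || reference[0] != '#')
        {
            return null;
        }

        // Only the last path segment identifies the note itself.
        string fragment = reference.Substring(1);
        int slash = fragment.LastIndexOf('/');
        string last = slash >= 0 ? fragment.Substring(slash + 1) : fragment;
        if (last.Length == 0)
        {
            return null;
        }

        int equals = last.IndexOf('=');
        if (equals < 0)
        {
            return last; // bare form, e.g. "#n1"
        }

        // Prefixed form: accept only the "n" (note) prefix.
        if (last.Substring(0, equals) != "n")
        {
            return null;
        }

        string id = last.Substring(equals + 1);
        return id.Length > 0 ? id : null;
    }
}
```

All three references from this issue resolve to the id `n1`, while a reference such as `#u=u1` yields null because its last segment is not a note selector.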

The validator shouldn't validate any <ec/> tags which have both id and startRef attributes set

According to the XLIFF 2.0 documentation, the <ec/> tag must set the id attribute only when the isolated attribute is set to "yes". When the isolated attribute is set to "no", the startRef attribute must be used instead of the id attribute:

If and only if the attribute isolated is set to yes, the attribute id MUST be used instead of the attribute startRef that MUST be used otherwise.

For example, the tag should not be validated by the OM.

ViewValidations method in sample code doesn't show the actual validation problem

When running the sample code with an empty XliffDocument I get

ValidationException Details:
  'SelectorPath': '#'

Which is pretty useless.
Instead it should also print the actual exception message

Something more like

ValidationException Details:
The element must contain one or more Files.
  'SelectorPath': '#'

When 2 validation errors happen, exception should contain both errors

When I add two empty units (without segments), the exception reports only the first error encountered, not all of them.
For example

XliffDocument xliff = new XliffDocument("en-GB");
Localization.Xliff.OM.Core.File file = new Localization.Xliff.OM.Core.File("f1");
xliff.Files.Add(file);

Unit unit1 = new Unit("u1");
Unit unit2 = new Unit("u2");

file.Containers.Add(unit1);
file.Containers.Add(unit2);

And I write it using the sample code

try
{
	using (IO.Stream stream = new IO.MemoryStream())
	{
		XliffWriter writer;

		writer = new XliffWriter();
		writer.Serialize(stream, document);
	}
}
catch (ValidationException e)
{
	Console.WriteLine("ValidationException Details:");
	Console.WriteLine(e.Message);
	if (e.Data != null)
	{
		foreach (object key in e.Data.Keys)
		{
			Console.WriteLine("  '{0}': '{1}'", key, e.Data[key]);
		}
	}
}

I get

ValidationException Details:
The element must contain one or more Resources.
  'SelectorPath': '#/f=f1/u=u1'

while I should also get '#/f=f1/u=u2'
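One common way to report every problem at once is to collect the failures first and throw a single aggregate exception afterwards. The sketch below illustrates that pattern on plain data (selector path plus child count) rather than on Object Model types. `CollectAllErrors`, `FindEmptyUnits`, and `ThrowIfAny` are hypothetical names, and the library's validator may be structured quite differently.

```csharp
using System;
using System.Collections.Generic;

public static class CollectAllErrors
{
    // Returns the selector path of every unit that has no children,
    // instead of stopping at the first one.
    public static List<string> FindEmptyUnits(
        IEnumerable<KeyValuePair<string, int>> unitChildCounts)
    {
        var emptyUnits = new List<string>();
        foreach (KeyValuePair<string, int> unit in unitChildCounts)
        {
            if (unit.Value == 0)
            {
                emptyUnits.Add(unit.Key);
            }
        }
        return emptyUnits;
    }

    // Throws one AggregateException that carries an inner exception per
    // offending selector path; does nothing when the list is empty.
    public static void ThrowIfAny(List<string> selectorPaths)
    {
        if (selectorPaths.Count == 0)
        {
            return;
        }

        var inner = new List<Exception>();
        foreach (string path in selectorPaths)
        {
            inner.Add(new InvalidOperationException(
                "The element must contain one or more Resources: " + path));
        }
        throw new AggregateException(inner);
    }
}
```

With this shape, a caller can iterate over `InnerExceptions` and print both `#/f=f1/u=u1` and `#/f=f1/u=u2` in one pass.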

Custom namespace prefixes should be allowed for attributes from known namespaces

Opening a sample file with the content below through XliffReader throws a NotSupportedException: "The attribute 'ns1:storageRestriction' is not supported on the Unit element.".

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:ns1="urn:oasis:names:tc:xliff:sizerestriction:2.0" version="2.0" srcLang="de" trgLang="en">
	<file id="fQNQ3186">
		<ns1:profiles generalProfile="xliff:codepoints" storageProfile="xliff:utf16">
			<ns1:normalization general="nfc" storage="nfc"/>
		</ns1:profiles>
		<unit id="reference_info" ns1:storageRestriction="255">
			<segment>
				<source/>
				<target/>
			</segment>
		</unit>
	</file>
</xliff>

If the "ns1" namespace prefix is renamed to "slr", the file can be opened normally. Since the original version of the file is also valid, the XliffReader should not check what the prefix name is, only that it refers to the correct namespace ("urn:oasis:names:tc:xliff:sizerestriction:2.0" in this case).
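For comparison, namespace-aware XML APIs in .NET identify attributes by namespace URI and local name, so the prefix chosen in the file does not matter. The sketch below uses LINQ to XML only, not the Object Model, to read `storageRestriction` under an arbitrary prefix; `PrefixIndependentLookup` and `ReadStorageRestriction` are illustrative names.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class PrefixIndependentLookup
{
    // Builds a small XLIFF document that binds the size restriction
    // namespace to the given prefix, then reads the attribute back by
    // namespace URI rather than by prefix.
    public static string ReadStorageRestriction(string prefix)
    {
        string xml =
            "<xliff xmlns='urn:oasis:names:tc:xliff:document:2.0' "
            + "xmlns:" + prefix + "='urn:oasis:names:tc:xliff:sizerestriction:2.0' "
            + "version='2.0' srcLang='de' trgLang='en'>"
            + "<file id='f1'><unit id='u1' " + prefix + ":storageRestriction='255'>"
            + "<segment><source/><target/></segment></unit></file></xliff>";

        XNamespace core = "urn:oasis:names:tc:xliff:document:2.0";
        XNamespace slr = "urn:oasis:names:tc:xliff:sizerestriction:2.0";

        XElement unit = XDocument.Parse(xml).Descendants(core + "unit").First();
        XAttribute attribute = unit.Attribute(slr + "storageRestriction");
        return attribute == null ? null : attribute.Value;
    }
}
```

Both `ReadStorageRestriction("ns1")` and `ReadStorageRestriction("slr")` return `255`, which is the prefix-independent behaviour requested for XliffReader.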

xml:lang attribute should be allowed on some other elements than <source> and <target>

Currently, the validator incorrectly throws an exception if the xml:lang attribute is found on elements such as <xliff> or <file>. In the XSD schema for the <xliff> element, only the xml:space attribute is declared explicitly, but the line <xs:anyAttribute namespace="##other" processContents="lax"/> means that any attribute from a namespace other than the target namespace (which is urn:oasis:names:tc:xliff:document:2.0 for the Core schema) is allowed. Since the xml:lang attribute falls into that category, the following file should be valid:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang='fr-FR' xml:lang="fr-FR">
 <file id="f1">
  <unit id="1">
   <segment>
    <source>source</source>
   </segment>
  </unit>
 </file>
</xliff>

To make sure, I have checked the file and the official xsd http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/schemas/xliff_core_2.0.xsd against some online validators and all of them said that the file is valid.

Also, besides <xliff>, the following elements from Core should support xml:lang and other namespace attributes: <file>, <group>, <unit>, <note>, <ph>, <pc>, <sc>, <ec>, <mrk> and <sm>.

<data> made of spaces only is not preserved.

When a <data> element contains only spaces the content is discarded when reading the file. The expected behavior is to preserve the spaces.

For example, for this content:

String data = "<xliff srcLang='en' version='2.0' xmlns='urn:oasis:names:tc:xliff:document:2.0'>"
   + "<file id='f1'><unit id='u1'>"
   + "<originalData><data id='d1'>   </data></originalData>"
   + "<segment><source><ph id='ph1' dataRef='d1'/>Sentence 1.</source></segment>"
   + "</unit></file></xliff>";

We get:

Type: Localization.Xliff.OM.Core.File
Type: Localization.Xliff.OM.Core.Unit
Type: Localization.Xliff.OM.Core.OriginalData
Type: Localization.Xliff.OM.Core.Data
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.StandaloneCode
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 1.'

While the expected result is:

Type: Localization.Xliff.OM.Core.File
Type: Localization.Xliff.OM.Core.Unit
Type: Localization.Xliff.OM.Core.OriginalData
Type: Localization.Xliff.OM.Core.Data
Type: Localization.Xliff.OM.Core.PlainText
Content: '   '
Type: Localization.Xliff.OM.Core.Segment
Type: Localization.Xliff.OM.Core.Source
Type: Localization.Xliff.OM.Core.StandaloneCode
Type: Localization.Xliff.OM.Core.PlainText
Content: 'Sentence 1.'

The validator should validate ref attributes on <sm/> tags which do not contain note

Currently, if we use the ref attribute on the <sm/> tag, its value can only be a reference to a note, regardless of the type of the <sm/>. However, that restriction applies only to the comment type. For the term or custom types, the ref attribute can hold values other than note references.

As an example, the following snippet should be valid:
<sm id="m1" translate="yes" type="term" ref="http://dbpedia.org/page/Doppelgänger"/>marked span start<em startRef="m1"/>

Unit has at least one <segment> or <ignorable>

Hi.
I just started using this library and ran into a strange exception that doesn't match the documentation at http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html.

The Unit tag requires "at least one of (<segment> OR <ignorable>)", right? Yet I get an exception with the following kind of XLIFF file:

------------------------------------------
<unit id="Product">
	<ignorable>
		<source>Product Source</source>
	</ignorable>
</unit>
------------------------------------------

I can fix this issue, and deliver code. Are you okay with this?

<mrk> and <sm> elements with type="generic" should have the "translate" attribute set explicitly

The XLIFF 2.0 documentation specifies that if the "type" is generic, the "translate" attribute must be set explicitly.

Validating files with the Okapi validator (http://okapi-lynx.appspot.com/validation) gives the same result: <mrk> and <sm> elements with type="generic" must have the "translate" attribute set explicitly.

Example of invalid elements which are not detected by the validator:

<mrk id="1"/> 
or 
<mrk id="1" type="generic"/>

To be valid, these elements should look like this:

<mrk id="1" translate="yes"/> 
or 
<mrk id="1" type="generic" translate="yes" />

When the document is serialized, the "translate" attribute is not written explicitly if it has its default value. The only way to force this is to overwrite the default value with the default value.

The validator should flag these elements as invalid, and when the document is serialized, the "translate" attribute must be written explicitly on all <mrk> and <sm> elements with the generic type (the default type).

Validator shouldn't validate <sm/> tags which do not have an <em/> tag referencing them.

Currently, the validator wrongly validates a file which contains the following snippet:

<source>Example of an sm with <sm id="2" type="term"/> term</source>

Since we didn't see anything about this in the XLIFF 2.0 Specification, we reached out to OASIS TC and they told us this should not be validated and they will also update the specification accordingly.

Since the <em/> tag might be in another segment, I will upload the whole document in question.

MarkedSpanStart.txt
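The pairing rule can be checked at unit level, so that an `<em/>` in a different segment of the same unit still counts. The sketch below does this with LINQ to XML on a unit fragment; it is not how the library's validator is implemented, and `MarkerPairing` and `FindUnclosedMarkers` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MarkerPairing
{
    // Returns the ids of all <sm/> elements in the unit that are not
    // referenced by the startRef attribute of any <em/> in the same unit.
    public static List<string> FindUnclosedMarkers(string unitXml)
    {
        XElement unit = XElement.Parse(unitXml);
        XNamespace ns = unit.Name.Namespace;

        // Collect every startRef value found on <em/> elements.
        var closed = new HashSet<string>(
            unit.Descendants(ns + "em")
                .Select(em => (string)em.Attribute("startRef"))
                .Where(startRef => startRef != null));

        return unit.Descendants(ns + "sm")
            .Select(sm => (string)sm.Attribute("id"))
            .Where(id => id != null && !closed.Contains(id))
            .ToList();
    }
}
```

For a unit whose only segment contains `<sm id="2" type="term"/>` and no `<em/>`, the method returns a list with the single id `2`. Adding `<em startRef="2"/>` in a later segment of the same unit makes the list empty.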

Adding a Generic MarkedSpan element inside a Comment MarkedSpan element will change the type to "comment"

Adding a marked span element with default type (generic) into a marked span element with type="comment" will change the type from generic to comment:

Expected result:

<source><mrk id="m1" type="comment" value="This is a comment marked span element">marked <mrk id="m2" type="generic" translate="yes">marked span</mrk>span</mrk></source>

Actual result:

<source><mrk id="m1" type="comment" value="This is a comment marked span element">marked <mrk id="m2" type="comment" translate="yes">marked span</mrk>span</mrk></source>

When serializing the document, the library will throw an error because the marked span element that was transformed into a comment does not contain a "value" or "ref" attribute.

If we add a marked span with default type or "generic" inside a marked span element with type="comment" it shouldn't change the type from "generic" to "comment".