kannan-ar / marigold.openxhtml

MariGold.OpenXHTML is a wrapper library for Open XML SDK to convert HTML documents into Open XML word documents.

License: MIT License

C# 99.64% HTML 0.36%

openxml html-parser html-docx html-docx-converter openxml-sdk docx docx-generator

marigold.openxhtml's Introduction

MariGold.OpenXHTML

OpenXHTML is a wrapper library for the Open XML SDK that converts HTML documents into Open XML word documents. It encapsulates the complexity of Open XML while still exposing the underlying Open XML objects for manipulation.

Installing via NuGet

In Package Manager Console, enter the following command:

Install-Package MariGold.OpenXHTML

Usage

To create an empty Open XML word document using OpenXHTML, use the following code.

using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");
doc.Save();

To create an Open XML document from an HTML document, use the following code.

using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");
doc.Process(new HtmlParser("<div>sample text</div>"));
doc.Save();

Once the HTML is processed, you can access the Open XML document using the following properties in WordDocument.

public WordprocessingDocument WordprocessingDocument { get; }
public MainDocumentPart MainDocumentPart { get; }
public Document Document { get; }

Any modifications to the Open XML document must be made before calling the Save method, because Save writes all changes and unloads the document from memory; any later modification may result in an exception. For example, to append a paragraph to the document body, use the following code.

using MariGold.OpenXHTML;
using DocumentFormat.OpenXml.Wordprocessing;

WordDocument doc = new WordDocument("sample.docx");
doc.Process(new HtmlParser("<div>sample text</div>"));
doc.Document.Body.AppendChild<Paragraph>(new Paragraph(new Run(new Text("added text"))));
doc.Save();
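
The same pattern applies to other Open XML elements. As a minimal sketch, the code below sets an A4 page size and one-inch margins through the exposed Document property before calling Save. It uses only standard Open XML SDK types (SectionProperties, PageSize, PageMargin) and assumes the converter has not already written its own section properties to the body; this has not been verified against the library.

```csharp
using DocumentFormat.OpenXml.Wordprocessing;
using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");
doc.Process(new HtmlParser("<div>sample text</div>"));

// Assumption: the body does not yet contain a SectionProperties element.
// Section properties must be the last child of the body, so append them
// after all other content has been added.
// All values are in twentieths of a point (1440 = 1 inch).
doc.Document.Body.AppendChild(new SectionProperties(
    new PageSize { Width = 11906U, Height = 16838U }, // A4: 210 mm x 297 mm
    new PageMargin
    {
        Top = 1440,
        Right = 1440U,
        Bottom = 1440,
        Left = 1440U,
        Header = 720U,
        Footer = 720U,
        Gutter = 0U
    }));

doc.Save();
```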

You can also create an Open XML document in memory. The following example saves the document to a MemoryStream.

using System.IO;
using MariGold.OpenXHTML;

using (MemoryStream mem = new MemoryStream())
{
	WordDocument doc = new WordDocument(mem);
	doc.Save();
}
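
To hand the in-memory document to another component, the stream contents can be copied out once Save has run. The sketch below is a minimal illustration, assuming Save flushes the complete package into the supplied MemoryStream; the output file name is hypothetical.

```csharp
using System.IO;
using MariGold.OpenXHTML;

using (MemoryStream mem = new MemoryStream())
{
    WordDocument doc = new WordDocument(mem);
    doc.Process(new HtmlParser("<div>sample text</div>"));
    doc.Save();

    // Assumption: after Save the stream holds the finished .docx package.
    // ToArray copies the whole buffer regardless of the stream position.
    byte[] bytes = mem.ToArray();

    // Hypothetical destination; the bytes could equally be returned
    // from a web request or stored elsewhere.
    File.WriteAllBytes("from-memory.docx", bytes);
}
```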

Relative Images

OpenXHTML cannot resolve images with relative URLs on its own. Set the ImagePath property to provide the base address for all relative image paths. The image path can be either a URL or a physical folder path.

using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");
doc.ImagePath = "http://abc.com";
doc.Process(new HtmlParser("<img src=\"sample.png\" />"));
doc.Save();

You can also assign a file URI to ImagePath.

doc.ImagePath = @"file:///C:/Img";

Base URL

Like relative images, an HTML document may also contain links with relative paths. These can be resolved using the BaseURL property.

using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");
doc.BaseURL = "http://abc.com";
doc.Process(new HtmlParser("<a href=\"index.htm\">sample</a>"));
doc.Save();

Also, if the HTML document contains relative images and ImagePath is not assigned, OpenXHTML will attempt to use BaseURL to resolve the relative image paths. BaseURL can therefore resolve both relative image paths and links. ImagePath exists as a separate property because images are sometimes hosted at a different location from the base URL.

Uri Schema

Protocol-relative URLs (URLs that begin with // and omit the scheme) can be resolved using the UriSchema property.

doc.UriSchema = Uri.UriSchemeHttp;
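
As a fuller sketch of the line above, the following converts a fragment whose image uses a protocol-relative source. The host abc.com is the same placeholder used in the earlier examples, and the exact way the library combines the scheme with the URL is an assumption.

```csharp
using System;
using MariGold.OpenXHTML;

WordDocument doc = new WordDocument("sample.docx");

// Assumption: with the scheme set to "http", the source
// //abc.com/sample.png is requested as http://abc.com/sample.png.
doc.UriSchema = Uri.UriSchemeHttp;

doc.Process(new HtmlParser("<img src=\"//abc.com/sample.png\" />"));
doc.Save();
```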

HTML Parsing

OpenXHTML has a built-in HTML and CSS parser (MariGold.HtmlParser), which can be completely replaced with any external HTML and CSS parser. The Process method of the WordDocument class expects an implementation of the IParser interface to process the HTML and CSS, so you can supply your own implementation to parse them.

public void Process(IParser parser);

interface IParser
{
	string BaseURL { get; set; }
	string UriSchema { get; set; }

	decimal CalculateRelativeChildFontSize(string parentFontSize, string childFontSize);
	IHtmlNode FindBodyOrFirstElement();
}

The listing above shows the structure of IParser. BaseURL and UriSchema are simple properties that store the base URL and URI scheme used when processing HTML images and links. Both are also used to resolve protocol-relative and relative paths of external style sheet URLs. The CalculateRelativeChildFontSize method calculates the font size of a child element relative to its parent. For example, in the HTML below, the font size of the h1 tag is 20 pixels.

<div style="font-size:16px"><h1>sample</h1></div>

If you don't want to re-implement this functionality, you can simply use the CSSUtility class in your implementation.

using MariGold.HtmlParser;

return CSSUtility.CalculateRelativeChildFontSize(parentFontSize, childFontSize);

The FindBodyOrFirstElement method is expected to return an IHtmlNode representing the HTML body tag and the hierarchy of its child elements. If the document has no body element, it should return the first root element. All CSS styles and HTML attributes of each IHtmlNode must be resolved and filled into the respective properties.
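
As a starting point for a custom parser, the sketch below shows the shape of an IParser implementation. DelegatingParser is a hypothetical class, not part of the library: it simply forwards every member to another IParser, so individual members can be replaced one at a time. It assumes IParser and IHtmlNode are public types in the MariGold.OpenXHTML namespace and that the interface has exactly the members listed above.

```csharp
using MariGold.OpenXHTML;

// Hypothetical decorator: forwards every IParser member to another parser.
public class DelegatingParser : IParser
{
    private readonly IParser inner;

    public DelegatingParser(IParser inner)
    {
        this.inner = inner;
    }

    public string BaseURL
    {
        get { return inner.BaseURL; }
        set { inner.BaseURL = value; }
    }

    public string UriSchema
    {
        get { return inner.UriSchema; }
        set { inner.UriSchema = value; }
    }

    public decimal CalculateRelativeChildFontSize(string parentFontSize, string childFontSize)
    {
        // Forwarded unchanged; a custom parser could apply its own rules here.
        return inner.CalculateRelativeChildFontSize(parentFontSize, childFontSize);
    }

    public IHtmlNode FindBodyOrFirstElement()
    {
        // A real replacement parser would build its own IHtmlNode tree here.
        return inner.FindBodyOrFirstElement();
    }
}
```

Because the built-in HtmlParser is accepted by Process, it can serve as the inner parser: doc.Process(new DelegatingParser(new HtmlParser("<div>sample text</div>")));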

References

Convert HTML to Word Document using CKEditor and MariGold.OpenXHTML

Windows Forms Application - Convert HTML Files To DOCX Files With MariGold.OpenXHTML

Implement Custom HTML Parser using AngleSharp


marigold.openxhtml's Issues

Handling OLE object from OLE_Compound_File

Hello,

I want to include an OLE object (Excel, PDF, etc.) for which I have an OLE_Compound_File (*.ole). I want to be able to open the OLE compound object from the resulting *.doc document. This OLE_Compound_File comes from the requirements tool DOORS, exported to *.reqif (in essence XHTML to show values).

The xhtml code snippet I have:
<object data="OLE_AB_4e7c971411315592_23_2100089280_2800000003__24775003-8f9c-4f50-8af2-33b21e0265ec_OBJECTTEXT_0.ole" type="text/rtf">This is ssss

There are also images which are working by exchanging '<object' by '<img'.

When I try to open the XHTML file with a browser I get the error 'Plugin not supported'.

Any idea?

Thanks for your help,

Helmut

Table lets converter execute but output file is broken

I use this code to convert the html into openXML:
var html is the table as string
var filename is the filename as string

  WordDocument doc = new WordDocument(filename);
  doc.Process(new HtmlParser(html));
  doc.Save();

The following structure gives me problems in the conversion. The HTML is converted into a .docx file, but when opening it an "unknown error" appears, which makes it hard for me to debug.
I read the following table from file.

<table>
 <tbody>
   <tr>
     <th colspan="1" rowspan="1" colwidth="260">asdasd</th>
     <th colspan="1" rowspan="1"></th>
     <th colspan="1" rowspan="1"></th>
     <th colspan="1" rowspan="1" colwidth="110"></th>
   </tr>
   <tr>
    <td colspan="1" rowspan="1" colwidth="260"></td>
    <td colspan="1" rowspan="1"> </td>
    <td colspan="1" rowspan="1"></td>
    <td colspan="1" rowspan="1" colwidth="110">{{Arbeitgeber.Hey}}</td>
  </tr>
  <tr>
    <td colspan="1" rowspan="1" colwidth="260">asxdasd </td>
    <td colspan="1" rowspan="1">asdasd</td>
    <td colspan="1" rowspan="1">asd</td>
    <td colspan="1" rowspan="1" colwidth="110">asd</td>
  </tr>
 </tbody>
</table>

I also tried removing the tag but it doesn't change anything.

@kannan-ar Could you point me to my error?

how to set LeftMargin and RightMargin？

Hello,
how can I set the LeftMargin and RightMargin?

Link to an files doesn't allow blanks

Hello,

When the file path contains a blank, it doesn't work. After removing the blanks, everything runs smoothly.

Example:
<img type="image/png" src="Feature Realisation - template - ReqIF V_000968c1/x04000000011CA34C.png" />

Thanks,

Helmut

how to use RTL format and change page size to A4

I want to change the direction to RTL. Is it possible?
Is there any setting to change the page size?

Diacritic letters are in Calibri, while other letters use Times New Roman (the correct font)

Conversion works amazing, except for letters like "Č" or "Ć" which are assigned Calibri font, while the rest of the content is in Times New Roman.

blank paragraph

<div>111111</div>
<div>222222</div>
== additional blank paragraph
<div>111111</div><div>222222</div> == good

Licensing for this project

Would it be possible to license this project (and MariGold.HtmlParser) with an open source license? I would like to use this library for a server side application and wanted to make sure that I was legally in the clear.
Code hosted on Github without a license excludes both commercial and personal use for anyone except the author.

Alternatively, could I obtain explicit permission to use this library in my project?

Thanks

Images Base64 encoded are not rendered

Steps to reproduce:

Use html containing an img with src = base64 image as attached

base64 img.txt

Generate doc and it doesn't render the img

Problem with <hr> tag.

When converting from HTML to DOCX, if there is any <hr> tag (not closed, like <hr />), then the rest of the HTML is not converted.

images and tables are not rendered

How to reproduce:

Grab the html using TinyMCE
https://www.tiny.cloud/docs/demo/full-featured/
Generate docx

var doc = new WordDocument(filename);
doc.Process(new HtmlParser(html));

Nested ul/Li have no identation in docX file

When I have an HTML file with nested ul/li tags and convert it to DOCX, all the bullets are at the same level. They are not nested anymore.

How can I fix that?

Image sizing in outputted Word document

I'm outputting a Word document and applying styles to the h1, h2, h3, p and table elements. I'm also trying to set the image width using CSS, and it doesn't seem to translate into the Word document. I always end up with the image displayed full size. I basically want to constrain all images to a set width so that they flow inline with the paragraphs. Any help with overriding the image sizing would be really appreciated.

When "rowspan" is specified in HTML "table", the marged cell left and right border is null.

Hello.
I'm trying to validate the following HTML

<style type="text/css">
table.sample {
border:1px;
/*border-collapse:collapse;*/
}
table.sample th {
border-top:1px solid #000000;
border-left:1px solid #000000;
border-right:1px solid #000000;
border-bottom:1px solid #000000;
}
table.sample td.top {
border-top:3px double #000000;
border-left:1px solid #000000;
border-right:1px solid #000000;
border-bottom:1px solid #000000;
}
table.sample td {
border-top:1px solid #000000;
border-left:1px solid #000000;
border-right:1px solid #000000;
border-bottom:1px solid #000000;
}
</style>

<div>
<table class="sample">
  <thead>
    <tr>
      <th rowspan="3">Header1</th>
      <th colspan="4">Header2<br>two rows header</th>
    </tr>
    <tr>
      <th rowspan="2">Header2-1</th>
      <th rowspan="2">Header2-2</th>
      <th colspan="2">Header2-3</th>
    </tr>
    <tr>
      <th>Header2-3-1</th>
      <th>Header2-3-2<br>two rows header</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="top">val1</td>
      <td class="top">val2-1</td>
      <td class="top">val2-2</td>
      <td class="top">val2-3-1</td>
      <td class="top">val2-3-2</td>
    </tr>
  </tbody>
</table>
</div>

If I convert this HTML to docx, we get the following result.

When "rowspan" is specified in an HTML "table", the left and right borders of the merged cell are null.
What I expect is that the merged cell will also have a border, as shown below.

Is there a way to get the expected results?

Handle <object tag

Hello,

great converter!

Is it possible to also support the '<object' tag like:

Background: I want to convert ReqIF XHTML tags, and they often use <object tags. They usually do it for images but also for other kinds of binary objects.

Thanks a lot,

Helmut

Saving a File in Track Changes mode

Can I convert an HTML document to a .docx file in 'Track Changes' mode? That is, when I hover over the inserted/deleted text in the document, it should show which user inserted/deleted the text and at what date and time, along with the actual text.
The actual inserted/deleted text would be identified with the help of custom tags in the HTML file.

Please confirm if this can be done using the current code.

Regards,

Page Break issue

HTML to word conversion

Used CSS page-break-before: always; but it is not reflected in the output Word document, and there is not much documentation available for this kind of issue.

How do page breaks work? Is there any way other than CSS rules in the HTML to break the page in the output Word file?

Thank you for your work and effort in making this tool publicly available.
