mwilliamson / dotnet-mammoth

Convert Word documents to simple and clean HTML (C#/.NET)

License: BSD 2-Clause "Simplified" License

C# 99.96% Makefile 0.04%

dotnet-mammoth's Issues

Preserving colors

Hello,

I'd like to preserve text colours if possible. I know colours are ignored for headings (the conversion is lossy there), but I don't see anything about paragraph text. The only mention I found is about headings:

For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.

If not supported, do you have any suggestions?

I found this, but it's for your npm variant:
https://npmjs.com/package/mammoth-colors

Have I got the code right?

Hi
In the warnings I see
Unrecognised paragraph style: Level 1 (Style ID: Level1)

In my code I've got
var converter = new DocumentConverter();
converter.AddStyleMap("p[style-name='Level 1'] => h1.level1");

Is my expectation correct that it should change all paragraphs styled "Level 1" to an h1 with a class of "level1"?

Thanks
Dan

Is there a way to (in the .NET version) suppress inclusion of the entire image file within the HTML?
I would like it to include only an <img src="c:/folder/filename" alt="" ... with no raw image data.
If so, how does one accomplish this? (An example in C# or VB would be helpful)
Thanks
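
A minimal sketch of one possible approach, not a confirmed answer from the maintainer. It assumes the `ImageConverter` callback shape that appears in another issue on this page (`img.ContentType`, `img.GetStream()`, returning a dictionary of attributes). `SaveImageToDisk` is a hypothetical helper, not part of Mammoth: it writes each image to a folder and returns only a `src` path.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Mammoth): writes one image to disk and
// returns the attributes for the <img> element, so the HTML holds a file
// path instead of base64 data.
static IDictionary<string, string> SaveImageToDisk(
    Stream imageStream, string contentType, string outputDirectory)
{
    // "image/png" -> "png"; fall back to "bin" if there is no subtype.
    int slash = contentType.IndexOf('/');
    string extension = slash >= 0 && slash < contentType.Length - 1
        ? contentType.Substring(slash + 1).ToLowerInvariant()
        : "bin";

    Directory.CreateDirectory(outputDirectory);
    string path = Path.Combine(
        outputDirectory, Guid.NewGuid().ToString("N") + "." + extension);

    using (FileStream file = File.Create(path))
    {
        imageStream.CopyTo(file);
    }

    return new Dictionary<string, string> { { "src", path } };
}

// Possible wiring, assuming the callback shape shown elsewhere on this page:
//
//   var converter = new DocumentConverter().ImageConverter(img =>
//   {
//       using (var stream = img.GetStream())
//       {
//           return SaveImageToDisk(stream, img.ContentType, @"c:\folder");
//       }
//   });
```

The random file name avoids collisions between images. A real application would probably want stable names and a URL rather than a local path in `src`.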

An attempt was made to move the position before the beginning of the stream

Hi there,

I'm trying to use this library, but I'm running into the error "An attempt was made to move the position before the beginning of the stream" when calling ConvertToHtml in the code below:

MemoryStream stream = getFormDocument(formID, context.Request);
            
DocumentConverter converter = new DocumentConverter();
IResult<string> result = converter.ConvertToHtml(stream);

Before this, I had been using the following code, which wrote the MemoryStream containing the .docx file to the response. The browser prompted to save/open the file, and it rendered correctly:

MemoryStream stream = getFormDocument(formID, context.Request);

if (stream.Length > 0)
{
     context.Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
     context.Response.AddHeader("Content-Disposition", "attachment; filename=\"test.docx\"");

     stream.Position = 0;
     stream.CopyTo(context.Response.OutputStream);
     context.Response.End();
}

I tried adding "stream.Position = 0;" before calling the ConvertToHtml, at which point the error changed to "expected token of type _SYMBOL but was of type _EOF".

Any ideas on what I could try? I looked into your code to try to isolate the error, but the ConvertToHtml call seems to be hidden behind libraries that I can't view.
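
One way to narrow this down is to check the stream before handing it to the converter. This is a diagnostic sketch, not a fix confirmed in this thread. `DescribeDocxProblem` is a hypothetical helper that uses only `System.IO.Compression`. It assumes the stream should be a seekable zip package containing `word/document.xml`.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical diagnostic helper: returns an empty string if the stream looks
// like a .docx package, otherwise a short description of the problem.
// The stream is rewound to position 0 before returning.
static string DescribeDocxProblem(Stream stream)
{
    if (!stream.CanSeek) return "stream is not seekable";
    if (stream.Length == 0) return "stream is empty";

    stream.Position = 0;
    try
    {
        // leaveOpen: true so the caller can still use the stream afterwards.
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
        {
            if (zip.GetEntry("word/document.xml") == null)
            {
                return "zip has no word/document.xml entry";
            }
        }
    }
    catch (InvalidDataException)
    {
        return "stream is not a valid zip archive";
    }
    finally
    {
        stream.Position = 0; // rewind for the real consumer
    }

    return "";
}
```

Calling this on the result of `getFormDocument(...)` just before `ConvertToHtml(stream)` would show whether the stream is empty or not a zip package at that point.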

Custom Font Usage

I really loved using Mammoth and the way it converts docx into clean HTML. My problem is that I'm using some custom fonts and they are not being carried over to the HTML. Can you help me enable custom fonts when converting documents?
Thank you.

Read from memory and not from a physical path

I love this little program/tool, but I'd like to know if it's possible to clean and convert a Word document without having a physical path to the file.

Custom mapping for <a> to <span>

Hi there,

Firstly, great job it works like a charm :)

I am trying to convert <a> to <span> to avoid linking. I tried the code below but I am getting an error. I tried multiple combinations but still get the same error.

var converter = new DocumentConverter()
                        .AddStyleMap("a => span");

Console error:

error reading style map at line 1, character 1: Unrecognised document element: Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.Token`1[Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.TokenType]
blazor.webassembly.js:1 
blazor.webassembly.js:1 a => span
blazor.webassembly.js:1 ^

Could you please help with this issue?

Thank you.

Getting compile-time errors in Html.java

I downloaded the zip from https://github.com/mwilliamson/java-mammoth and tried to run it, but I am getting compile-time errors in Html.java on the following lines:

The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 144 Java Problem

The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 147 Java Problem

The method isMatch(HtmlElement, HtmlElement) in the type Html is not applicable for the arguments (Object, HtmlElement) Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 141 Java Problem

Type mismatch: cannot convert from Optional to Optional Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 138 Java Problem

Tables, diagrams not visible

Greetings,

I tried running your application with the following code fragment:

using System;

namespace Mammoth.Cli {
    internal class Program {
        public static void Main(string[] args) {
            var converter = new DocumentConverter();
            var result = converter.ConvertToHtml("document.docx");
            var html = result.Value;         // the generated HTML
            var warnings = result.Warnings;  // any warnings raised during conversion

            Console.WriteLine(html);
        }
    }
}

But the output does not contain the tables and diagrams from the Word document. They are missing from the HTML.

Is there a workaround when using your library? Can it render tables, diagrams and similar components in HTML as well? Thank you for your response.

OutOfMemoryException when converting docx with many images

I get an OutOfMemoryException when doing a conversion of a document (rather small - 1 MB), but with a lot of images. Here is the stack trace:

   at System.Text.StringBuilder.ToString()
   at Mammoth.Couscous.java.lang.StringBuilder.toString()
   at Mammoth.Couscous.org.zwobble.mammoth.internal.html.Html.write(List`1 nodes)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_9.apply(List`1 arg0)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.results.InternalResult`1.map[R](Function`2 function)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(Optional`1 path, Archive zipFile)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_0.apply(Archive zipFile)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.withDocxFile[T](InputStream stream, Function`2 function)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_1.get()
   at Mammoth.Couscous.org.zwobble.mammoth.internal.util.PassThroughException.unwrap[T](SupplierWithException`2 supplier)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(InputStream stream)
   at Mammoth.DocumentConverter.ConvertToHtml(Stream stream)

I'm attaching two files, one with 950 pages (which works) and one with 1000 pages (which fails). The IIS worker process goes up to 10 GB of memory or more while processing them. The first document (950 images) converts; the second crashes. The resulting HTML output for the first document is about 1 GB in size, if that matters.
Thank you for your support!

Big document 1000 pages.docx
Big document 950 pages.docx

Fonts and colors not preserved

Hi,

Not sure if this is the expected behaviour or if I am doing something wrong here:

void GenerateHtmlWithMammoth(string filePath)
        {
            var converter = new DocumentConverter().ImageConverter(img =>
            {
                string extension = img.ContentType.Split('/')[1].ToLower();
                ImageFormat imageFormat = null;
                switch (extension)
                {
                    case "png":
                        imageFormat = ImageFormat.Png;
                        break;
                    case "bmp":
                        imageFormat = ImageFormat.Bmp;
                        break;
                    case "jpeg":
                        imageFormat = ImageFormat.Jpeg;
                        break;
                    case "tiff":
                        imageFormat = ImageFormat.Tiff;
                        break;
                }
                string b64 = string.Empty;

                using (var stream = img.GetStream())
                {
                    using (var memoryStream = new MemoryStream())
                    {
                        stream.CopyTo(memoryStream);
                        var arr = memoryStream.ToArray();
                        b64 = Convert.ToBase64String(arr);
                    }
                }
                var returnDictionary = new Dictionary<string, string>();
                returnDictionary.Add("src", "data:" + img.ContentType + ";base64," + b64);
                return returnDictionary;
            }).PreserveEmptyParagraphs();
            string head = @"<!DOCTYPE html><html><head><title></title><meta http-equiv=""content-type"" content=""text/html; charset=utf-8"" /><meta name=""author"" content=""Alikhan, Mujeeb M"" /><meta name=""lastsavedby"" content=""Alikhan, Mujeeb M"" /><meta name=""datecontentcreated"" content=""2019-01-24T16:32:00Z"" /><meta name=""datelastsaved"" content=""2019-01-24T16:38:00Z"" /><meta name=""application"" content=""Microsoft Office Word"" /><meta name=""company"" content=""Saudi Aramco"" /></head><body>";
            var result = converter.ConvertToHtml(filePath);
            var h = result.Value; 
            var warnings = result.Warnings;
            System.IO.File.WriteAllText(System.IO.Path.ChangeExtension(filePath, ".html"), head + h + "</body></html>");
        }

Attached files with expected output as well.
sample.zip

Thanks,
Aamir

Convert to html, and the checkbox cannot be displayed.

Other conversion software can display different characters.

Word Online: Missing entry in file: word/document.xml

Hi there.

Any document that was created in Word Online and has not yet been saved in desktop Word produces the above error when you download the file and run it through Mammoth.

As soon as you save in the desktop application it works just fine.
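
One possible explanation, which is an assumption here and not confirmed in this thread, is that such files store the main document part under a different name than `word/document.xml` and point to it from the package relationships in `_rels/.rels`. The sketch below, with a hypothetical helper `FindMainDocumentPath`, reads that relationship so you can see where the main part of a given file lives.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns the path of the main document part as declared
// in _rels/.rels, or an empty string if it cannot be determined.
static string FindMainDocumentPath(Stream docx)
{
    // Relationship type used by transitional OOXML packages; strict
    // documents use a different URI and are not handled here.
    const string officeDocumentType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
    XNamespace rel = "http://schemas.openxmlformats.org/package/2006/relationships";

    using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
    {
        var rels = zip.GetEntry("_rels/.rels");
        if (rels == null) return "";

        using (Stream relsStream = rels.Open())
        {
            XDocument xml = XDocument.Load(relsStream);
            var target = xml.Descendants(rel + "Relationship")
                .Where(r => (string)r.Attribute("Type") == officeDocumentType)
                .Select(r => (string)r.Attribute("Target"))
                .FirstOrDefault();

            // Targets may be written as "/word/document.xml" or "word/document.xml".
            return (target ?? "").TrimStart('/');
        }
    }
}
```

If this returns something other than `word/document.xml` for a Word Online file, that would be consistent with the error in the title.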

The Symbol corresponding to Greek letter "mu" is not interpreted

Hello,

I'm using Mammoth Converter to convert Microsoft Word documents in Open XML format to HTML and I just discovered that if I use the Symbol for Greek letter "mu" (0x6D in Symbol font), this character is completely ignored by the Mammoth Converter.

I have created this sample Word document showing the case.
Document with micro sign.docx

It seems that the XML element
<w:sym w:font="Symbol" w:char="006D"/>
is completely ignored during the conversion process.

Thank you.
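
To illustrate what handling `w:sym` would involve, here is a small sketch that maps a few Symbol-font character codes to their Unicode equivalents. `SymbolCharToUnicode` is a hypothetical helper, not Mammoth's implementation. The table is deliberately partial, and the handling of codes in the F000 to F0FF range (treated as the same code with F000 subtracted) is an assumption about how Word may store them.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: maps a <w:sym w:font="..." w:char="..."/> pair to a
// Unicode string, or returns an empty string if the pair is not recognised.
static string SymbolCharToUnicode(string font, string hexChar)
{
    // Partial table for the Symbol font: Latin letter code -> Greek letter.
    var symbolFont = new Dictionary<int, string>
    {
        { 0x61, "\u03B1" }, // alpha
        { 0x62, "\u03B2" }, // beta
        { 0x67, "\u03B3" }, // gamma
        { 0x64, "\u03B4" }, // delta
        { 0x6D, "\u03BC" }, // mu
        { 0x70, "\u03C0" }, // pi
        { 0x57, "\u03A9" }, // capital omega
    };

    if (!string.Equals(font, "Symbol", StringComparison.OrdinalIgnoreCase))
    {
        return "";
    }

    if (!int.TryParse(hexChar, NumberStyles.HexNumber,
            CultureInfo.InvariantCulture, out int code))
    {
        return "";
    }

    // Assumption: codes may be offset into the private use area (F000-F0FF).
    if (code >= 0xF000 && code <= 0xF0FF)
    {
        code -= 0xF000;
    }

    return symbolFont.TryGetValue(code, out var mapped) ? mapped : "";
}
```

With this table, `SymbolCharToUnicode("Symbol", "006D")` yields the Greek letter mu (U+03BC), which is the character from the sample document above.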

.NET Standard version

This utility works great.
Are there any plans to release a .NET Standard version of it?
Alternatively, if we did the port ourselves, do you foresee any compatibility issues?
