mwilliamson / dotnet-mammoth

Convert Word documents to simple and clean HTML (C#/.NET)

License: BSD 2-Clause "Simplified" License

C# 99.96% Makefile 0.04%

dotnet-mammoth's Issues

Preserving colors

Hello,

I'd like to preserve text colours if possible. I know they are ignored for headings (the conversion is lossy), but I don't see anything mentioned for paragraph text. The only related passage I can find is:

For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.

If not supported, do you have any suggestions?

I found this, but it's for your npm variant:
https://npmjs.com/package/mammoth-colors

Have I got the code right?

Hi
In the warnings I see
Unrecognised paragraph style: Level 1 (Style ID: Level1)

In my code I've got
var converter = new DocumentConverter();
converter.AddStyleMap("p[style-name='Level 1'] => h1.level1");

Is my expectation correct that it should change all "Level 1" styled paragraphs to an h1 with a class of "level1"?

Thanks
Dan

Suppress <img stream data

Is there a way (in the .NET version) to suppress inclusion of the entire image file within the HTML?
I would like it to include only <img src="c:/folder/filename" alt="" ...>, with no raw image data.
If so, how does one accomplish this? (An example in C# or VB would be helpful.)
Thanks
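
One possible approach, sketched under assumptions rather than confirmed against the library: the ImageConverter callback used in the "Fonts and colors not preserved" issue on this page returns the attributes for each <img> element, so it could write the image to disk and return only a file path as "src". The helper below is hypothetical (ImageFileWriter, SaveAndLink and the "imageN" file naming are illustrative names, not part of Mammoth) and covers only the file-writing step:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImageFileWriter
{
    // Hypothetical helper: copies one image stream into outputDir and returns
    // the attributes for the <img> element (only "src", no embedded data).
    public static IDictionary<string, string> SaveAndLink(
        Stream imageStream, string contentType, string outputDir, int index)
    {
        // "image/png" -> "png"; fall back to "bin" when no subtype is present.
        string[] parts = contentType.Split('/');
        string extension = parts.Length == 2 && parts[1].Length > 0
            ? parts[1].ToLowerInvariant()
            : "bin";

        Directory.CreateDirectory(outputDir);
        string path = Path.Combine(outputDir, "image" + index + "." + extension);

        using (var file = File.Create(path))
        {
            imageStream.CopyTo(file);
        }

        return new Dictionary<string, string> { { "src", path } };
    }
}

// Assumed wiring, mirroring the ImageConverter usage shown elsewhere on this page:
//
//   int counter = 0;
//   var converter = new DocumentConverter().ImageConverter(img =>
//   {
//       using (var stream = img.GetStream())
//       {
//           return ImageFileWriter.SaveAndLink(stream, img.ContentType, @"c:\folder", counter++);
//       }
//   });
```

The returned path is a local file path, so the HTML would only display the images on a machine that can read that folder; a web application would more likely return a URL that maps to the saved file.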

An attempt was made to move the position before the beginning of the stream

Hi there,

I'm trying to make use of this library, but I'm running into the error "An attempt was made to move the position before the beginning of the stream" when calling ConvertToHtml in the code below:

MemoryStream stream = getFormDocument(formID, context.Request);
            
DocumentConverter converter = new DocumentConverter();
IResult<string> result = converter.ConvertToHtml(stream);

Before using this code, I had been using the following, which correctly wrote the MemoryStream containing the .DOCX to the response. The browser prompted to save/open the .DOCX file, which rendered correctly:

MemoryStream stream = getFormDocument(formID, context.Request);

if (stream.Length > 0)
{
     context.Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
     context.Response.AddHeader("Content-Disposition", "attachment; filename=\"test.docx\"");

     stream.Position = 0;
     stream.CopyTo(context.Response.OutputStream);
     context.Response.End();
}

I tried adding "stream.Position = 0;" before calling ConvertToHtml, at which point the error changed to "expected token of type _SYMBOL but was of type _EOF".

Any ideas on what I could try? I looked into your code to try to isolate the error, but it seems that the ConvertToHtml call is hidden behind libraries that I can't view.
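
A .docx file is a ZIP container, so one diagnostic step that might help narrow this down is to confirm that the MemoryStream really holds a ZIP archive and is rewound before it is handed to the converter. This is only a sketch: DocxStreamCheck and LooksLikeZip are hypothetical names, and the check does not explain either error message by itself.

```csharp
using System;
using System.IO;

public static class DocxStreamCheck
{
    // Returns true when the stream starts with the ZIP signature bytes "PK".
    // The stream is left positioned at 0 so it can be passed on afterwards.
    public static bool LooksLikeZip(Stream stream)
    {
        if (!stream.CanSeek || stream.Length < 2)
        {
            return false;
        }

        stream.Position = 0;
        var header = new byte[2];
        int read = stream.Read(header, 0, 2);
        stream.Position = 0;

        return read == 2 && header[0] == 0x50 && header[1] == 0x4B;
    }
}

// Assumed usage with the code from this issue:
//
//   MemoryStream stream = getFormDocument(formID, context.Request);
//   if (!DocxStreamCheck.LooksLikeZip(stream))
//   {
//       throw new InvalidDataException("Stream does not contain a .docx (ZIP) file");
//   }
//   IResult<string> result = new DocumentConverter().ConvertToHtml(stream);
```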

Custom Font Usage

I really love using Mammoth and the way it converts docx into clear HTML. My problem is that I'm using some custom fonts, and they are not carried over to the HTML. Can you help me enable custom fonts when converting documents?
Thank you.

Custom mapping for <a> to <span>

Hi there,

Firstly, great job it works like a charm :)

I am trying to convert <a> to <span> to avoid linking. I tried the code below, but I am getting an error. I tried multiple combinations, but I still get the same error.

var converter = new DocumentConverter()
                        .AddStyleMap("a => span");

Console error:

error reading style map at line 1, character 1: Unrecognised document element: Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.Token`1[Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.TokenType]
blazor.webassembly.js:1 
blazor.webassembly.js:1 a => span
blazor.webassembly.js:1 ^

Could you please help with this issue?

Thank you.

Getting compile time error in HTML.java

I downloaded the zip from https://github.com/mwilliamson/java-mammoth and tried to run it, but I am getting compile-time errors in Html.java on the following lines:

The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 144 Java Problem

The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 147 Java Problem

The method isMatch(HtmlElement, HtmlElement) in the type Html is not applicable for the arguments (Object, HtmlElement) Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 141 Java Problem

Type mismatch: cannot convert from Optional to Optional Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 138 Java Problem

Tables, diagrams not visible

Greetings,

I tried running your library with the following code fragment:

using System;

namespace Mammoth.Cli {
    internal class Program {
        public static void Main(string[] args) {
            var converter = new DocumentConverter();
            var result = converter.ConvertToHtml("document.docx");
            var html = result.Value;         // the generated HTML
            var warnings = result.Warnings;  // any warnings raised during conversion

            Console.WriteLine(html);
        }
    }
}

However, the tables and diagrams from the Word document do not appear in the output. They are missing from the HTML.

Is there a workaround when using your library? Can it also render tables, diagrams and similar components in the HTML? Thank you for your response.

OutOfMemoryException when converting docx with many images

I get an OutOfMemoryException when doing a conversion of a document (rather small - 1 MB), but with a lot of images. Here is the stack trace:

   at System.Text.StringBuilder.ToString()
   at Mammoth.Couscous.java.lang.StringBuilder.toString()
   at Mammoth.Couscous.org.zwobble.mammoth.internal.html.Html.write(List`1 nodes)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_9.apply(List`1 arg0)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.results.InternalResult`1.map[R](Function`2 function)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(Optional`1 path, Archive zipFile)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_0.apply(Archive zipFile)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.withDocxFile[T](InputStream stream, Function`2 function)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_1.get()
   at Mammoth.Couscous.org.zwobble.mammoth.internal.util.PassThroughException.unwrap[T](SupplierWithException`2 supplier)
   at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(InputStream stream)
   at Mammoth.DocumentConverter.ConvertToHtml(Stream stream)

I'm attaching 2 files, one with 950 pages (that works) and one with 1000 pages (that fails). The IIS worker process goes up to 10 GB of memory or more when processing those two. The first document (950 images) converts successfully; the second crashes. The resulting HTML output for the first document is about 1 GB in size, if that matters...
Thank you for your support!

Big document 1000 pages.docx
Big document 950 pages.docx

Fonts and colors not preserved

Hi,

Not sure if this is the expected behaviour or if I am doing anything wrong here:

        void GenerateHtmlWithMammoth(string filePath)
        {
            var converter = new DocumentConverter().ImageConverter(img =>
            {
                string extension = img.ContentType.Split('/')[1].ToLower();
                ImageFormat imageFormat = null;
                switch (extension)
                {
                    case "png":
                        imageFormat = ImageFormat.Png;
                        break;
                    case "bmp":
                        imageFormat = ImageFormat.Bmp;
                        break;
                    case "jpeg":
                        imageFormat = ImageFormat.Jpeg;
                        break;
                    case "tiff":
                        imageFormat = ImageFormat.Tiff;
                        break;
                }
                string b64 = string.Empty;

                using (var stream = img.GetStream())
                {
                    using (var memoryStream = new MemoryStream())
                    {
                        stream.CopyTo(memoryStream);
                        var arr = memoryStream.ToArray();
                        b64 = Convert.ToBase64String(arr);
                    }
                }
                var returnDictionary = new Dictionary<string, string>();
                returnDictionary.Add("src", "data:" + img.ContentType + ";base64," + b64);
                return returnDictionary;
            }).PreserveEmptyParagraphs();
            string head = @"<!DOCTYPE html><html><head><title></title><meta http-equiv=""content-type"" content=""text/html; charset=utf-8"" /><meta name=""author"" content=""Alikhan, Mujeeb M"" /><meta name=""lastsavedby"" content=""Alikhan, Mujeeb M"" /><meta name=""datecontentcreated"" content=""2019-01-24T16:32:00Z"" /><meta name=""datelastsaved"" content=""2019-01-24T16:38:00Z"" /><meta name=""application"" content=""Microsoft Office Word"" /><meta name=""company"" content=""Saudi Aramco"" /></head><body>";
            var result = converter.ConvertToHtml(filePath);
            var h = result.Value; 
            var warnings = result.Warnings;
            System.IO.File.WriteAllText(System.IO.Path.ChangeExtension(filePath, ".html"), head + h + "</body></html>");
        }

I have attached files with the expected output as well.
sample.zip

Thanks,
Aamir

Word Online: Missing entry in file: word/document.xml

Hi there.

Any document created by Word Online and not yet saved in the desktop version of Word results in the above error if you download the file and run it through Mammoth.

As soon as you save in the desktop application it works just fine.

The Symbol corresponding to Greek letter "mu" is not interpreted

Hello,

I'm using Mammoth Converter to convert Microsoft Word documents in Open XML format to HTML and I just discovered that if I use the Symbol for Greek letter "mu" (0x6D in Symbol font), this character is completely ignored by the Mammoth Converter.

I have created this sample Word document showing the case.
Document with micro sign.docx

It seems that the XML element
<w:sym w:font="Symbol" w:char="006D"/>
is completely ignored during the conversion process.

Thank you.
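
To illustrate what handling this element could involve, here is a minimal sketch of translating a w:sym character code from the Symbol font into a Unicode character. It is not Mammoth's implementation: SymbolFontSketch and Translate are hypothetical names, the table holds only a handful of entries, and the F000 offset handling is an assumption based on Word commonly storing symbol characters in the private use range U+F000 to U+F0FF.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SymbolFontSketch
{
    // Partial, illustrative table: Symbol font code -> Unicode character.
    private static readonly Dictionary<int, string> SymbolToUnicode =
        new Dictionary<int, string>
        {
            { 0x61, "\u03B1" }, // alpha
            { 0x62, "\u03B2" }, // beta
            { 0x6D, "\u03BC" }, // mu
            { 0x70, "\u03C0" }, // pi
        };

    // font and hexChar correspond to the w:font and w:char attributes.
    // Returns null when the font or character is not covered by the table.
    public static string Translate(string font, string hexChar)
    {
        if (!string.Equals(font, "Symbol", StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }

        int code;
        if (!int.TryParse(hexChar, NumberStyles.HexNumber,
                CultureInfo.InvariantCulture, out code))
        {
            return null;
        }

        // Assumption: codes in U+F000..U+F0FF carry the real code in the low byte.
        if (code >= 0xF000 && code <= 0xF0FF)
        {
            code -= 0xF000;
        }

        string result;
        return SymbolToUnicode.TryGetValue(code, out result) ? result : null;
    }
}
```

With the element from this issue, Translate("Symbol", "006D") looks up code 0x6D and yields the Greek small letter mu (U+03BC).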

.Net Standard version

This utility works great.
Are there any plans to release a .NET Standard version of this?
Or, if we would like to do the port ourselves, do you foresee any compatibility issues?
