mwilliamson / dotnet-mammoth
Convert Word documents to simple and clean HTML (C#/.NET)
License: BSD 2-Clause "Simplified" License
Hello,
I'd like to preserve text colours if possible. I know heading styling is deliberately ignored (lossy), but I don't see paragraph text mentioned. The README only says:
"For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading."
If not supported, do you have any suggestions?
I found this, but it's for your npm variant:
https://npmjs.com/package/mammoth-colors
Hi
In the warnings I see
Unrecognised paragraph style: Level 1 (Style ID: Level1)
In my code I've got
var converter = new DocumentConverter();
converter.AddStyleMap("p[style-name='Level 1'] => h1.level1");
Is my expectation correct that it should change all "Level 1" styled paragraphs to an h1 with a class of "level1"?
Thanks
Dan
Is there a way to (in the .NET version) suppress inclusion of the entire image file within the HTML?
I would like it to include only an <img src="c:/folder/filename" alt="" ... with no raw image data.
If so, how does one accomplish this? (An example in C# or VB would be helpful)
Thanks
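One possible approach, sketched here under assumptions: the `ImageConverter` hook and `image.ContentType` used in another report on this page let the caller decide which attributes the `<img>` element gets, so returning only a `src` path keeps the raw image data out of the HTML. The class name `LinkedImages`, the helper `AttributesFor`, the `imageFolder` parameter, and the `imageN.ext` naming scheme are all hypothetical and chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using Mammoth;

public static class LinkedImages
{
    // Hypothetical helper: builds the <img> attributes for the n-th image.
    // "image/png" becomes the extension "png"; the file name scheme is an assumption.
    public static IDictionary<string, string> AttributesFor(string folder, int index, string contentType)
    {
        var extension = contentType.Substring(contentType.IndexOf('/') + 1).ToLowerInvariant();
        return new Dictionary<string, string>
        {
            { "src", folder.TrimEnd('/') + "/image" + index + "." + extension }
        };
    }

    // Converts a .docx file, emitting <img src="..."> references instead of inline data.
    public static string Convert(string docxPath, string imageFolder)
    {
        var count = 0;
        var converter = new DocumentConverter()
            .ImageConverter(image => AttributesFor(imageFolder, ++count, image.ContentType));
        return converter.ConvertToHtml(docxPath).Value;
    }
}
```

This sketch only rewrites the `src` attribute. It does not save the images, so the caller would still need to copy each `image.GetStream()` to the matching file for the links to resolve.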
Hi there,
I'm trying to make use of this library, but I'm running into the error "An attempt was made to move the position before the beginning of the stream" when calling ConvertToHtml in the code below:
MemoryStream stream = getFormDocument(formID, context.Request);
DocumentConverter converter = new DocumentConverter();
IResult<string> result = converter.ConvertToHtml(stream);
Prior to using this code, I had been using the following code, which correctly wrote the MemoryStream containing the .DOCX file to the response. The browser then prompted to save/open the .DOCX file, which rendered correctly:
MemoryStream stream = getFormDocument(formID, context.Request);
if (stream.Length > 0)
{
    context.Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    context.Response.AddHeader("Content-Disposition", "attachment; filename=\"test.docx\"");
    stream.Position = 0;
    stream.CopyTo(context.Response.OutputStream);
    context.Response.End();
}
I tried adding "stream.Position = 0;" before calling ConvertToHtml, at which point the error changed to "expected token of type _SYMBOL but was of type _EOF".
Any ideas on what I could try? I looked into your code to try to isolate the error, but it seems that the ConvertToHtml call is hidden behind libraries that I can't view.
I really love using Mammoth and the way it converts docx into clean HTML. My problem is that I'm using some custom fonts and they are not carried over to the HTML. Can you help me enable custom fonts when converting documents?
Thank you.
I love this little program/tool, but I'd like to know if it's possible to clean and convert a word document without having a physical path to the file.
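As a minimal sketch, assuming the `ConvertToHtml(Stream)` overload that appears in the code and stack trace elsewhere on this page, a document that exists only in memory can be wrapped in a `MemoryStream` and converted without touching the file system. The class name `InMemoryConversion` and the method name `ToHtml` are hypothetical.

```csharp
using System;
using System.IO;
using Mammoth;

public static class InMemoryConversion
{
    // Converts .docx content held in a byte array, with no file path involved.
    public static string ToHtml(byte[] docxBytes)
    {
        // A MemoryStream created from a byte array starts at position 0.
        using (var stream = new MemoryStream(docxBytes))
        {
            return ToHtml(stream);
        }
    }

    // Converts .docx content from any readable stream positioned at the start of the document.
    public static string ToHtml(Stream docxStream)
    {
        var converter = new DocumentConverter();
        var result = converter.ConvertToHtml(docxStream);
        foreach (var warning in result.Warnings)
        {
            // Warnings are reported on stderr so they do not mix with the HTML.
            Console.Error.WriteLine(warning);
        }
        return result.Value;
    }
}
```

The bytes could come from an upload, a database column, or a web response; the sketch does not care where they originate as long as they form a complete .docx file.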
Hi there,
Firstly, great job it works like a charm :)
I am trying to convert <a> to <span> to avoid linking. I tried the code below but I am getting an error. I tried multiple combinations but still get the same error.
var converter = new DocumentConverter()
.AddStyleMap("a => span");
Console error:
error reading style map at line 1, character 1: Unrecognised document element: Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.Token`1[Mammoth.Couscous.org.zwobble.mammoth.internal.styles.parsing.TokenType]
a => span
^
Could you please help with this issue?
Thank you.
I downloaded the zip from https://github.com/mwilliamson/java-mammoth and am trying to run it, but I get compile-time errors in Html.java on the following lines:
The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 144 Java Problem
The method getChildren() is undefined for the type Object Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 147 Java Problem
The method isMatch(HtmlElement, HtmlElement) in the type Html is not applicable for the arguments (Object, HtmlElement) Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 141 Java Problem
Type mismatch: cannot convert from Optional to Optional Html.java /mammoth/src/main/java/org/zwobble/mammoth/internal/html line 138 Java Problem
Greetings,
I tried running your library with the following code fragment:
using System;

namespace Mammoth.Cli {
    internal class Program {
        public static void Main(string[] args) {
            var converter = new DocumentConverter();
            var result = converter.ConvertToHtml("document.docx");
            var html = result.Value;
            var warnings = result.Warnings;
            Console.WriteLine(html);
        }
    }
}
However, tables and diagrams from the Word document do not appear in the output. They are missing from the HTML.
Is there a workaround when using your library? Can it render tables, diagrams and similar components in HTML as well? Thank you for your response.
I get an OutOfMemoryException when doing a conversion of a document (rather small - 1 MB), but with a lot of images. Here is the stack trace:
at System.Text.StringBuilder.ToString()
at Mammoth.Couscous.java.lang.StringBuilder.toString()
at Mammoth.Couscous.org.zwobble.mammoth.internal.html.Html.write(List`1 nodes)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_9.apply(List`1 arg0)
at Mammoth.Couscous.org.zwobble.mammoth.internal.results.InternalResult`1.map[R](Function`2 function)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(Optional`1 path, Archive zipFile)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_0.apply(Archive zipFile)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.withDocxFile[T](InputStream stream, Function`2 function)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter__Anonymous_1.get()
at Mammoth.Couscous.org.zwobble.mammoth.internal.util.PassThroughException.unwrap[T](SupplierWithException`2 supplier)
at Mammoth.Couscous.org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(InputStream stream)
at Mammoth.DocumentConverter.ConvertToHtml(Stream stream)
I'm attaching 2 files, one with 950 pages (that works) and one with 1000 pages (that fails). The IIS worker process goes up to 10 GB of memory or more when processing them. The first document (950 images) converts; the second crashes. The resulting HTML output for the first document is about 1 GB in size, if that matters.
Thank you for your support!
Hi,
Not sure if this is the expected behaviour or if I am doing anything wrong here:
void GenerateHtmlWithMammoth(string filePath)
{
    var converter = new DocumentConverter().ImageConverter(img =>
    {
        string extension = img.ContentType.Split('/')[1].ToLower();
        ImageFormat imageFormat = null;
        switch (extension)
        {
            case "png":
                imageFormat = ImageFormat.Png;
                break;
            case "bmp":
                imageFormat = ImageFormat.Bmp;
                break;
            case "jpeg":
                imageFormat = ImageFormat.Jpeg;
                break;
            case "tiff":
                imageFormat = ImageFormat.Tiff;
                break;
        }
        string b64 = string.Empty;
        using (var stream = img.GetStream())
        {
            using (var memoryStream = new MemoryStream())
            {
                stream.CopyTo(memoryStream);
                var arr = memoryStream.ToArray();
                b64 = Convert.ToBase64String(arr);
            }
        }
        var returnDictionary = new Dictionary<string, string>();
        returnDictionary.Add("src", "data:" + img.ContentType + ";base64," + b64);
        return returnDictionary;
    }).PreserveEmptyParagraphs();
    string head = @"<!DOCTYPE html><html><head><title></title><meta http-equiv=""content-type"" content=""text/html; charset=utf-8"" /><meta name=""author"" content=""Alikhan, Mujeeb M"" /><meta name=""lastsavedby"" content=""Alikhan, Mujeeb M"" /><meta name=""datecontentcreated"" content=""2019-01-24T16:32:00Z"" /><meta name=""datelastsaved"" content=""2019-01-24T16:38:00Z"" /><meta name=""application"" content=""Microsoft Office Word"" /><meta name=""company"" content=""Saudi Aramco"" /></head><body>";
    var result = converter.ConvertToHtml(filePath);
    var h = result.Value;
    var warnings = result.Warnings;
    System.IO.File.WriteAllText(System.IO.Path.ChangeExtension(filePath, ".html"), head + h + "</body></html>");
}
Attached files with expected output as well.
sample.zip
Thanks,
Aamir
When converting to HTML, checkboxes are not displayed.
Other conversion software displays them using different characters.
Hi there.
Any document created by Word Online and not yet saved in desktop Word results in the above error if you download the file and run it through Mammoth.
As soon as you save it in the desktop application, it works just fine.
Hello,
I'm using Mammoth Converter to convert Microsoft Word documents in Open XML format to HTML and I just discovered that if I use the Symbol for Greek letter "mu" (0x6D in Symbol font), this character is completely ignored by the Mammoth Converter.
I have created this sample Word document showing the case.
Document with micro sign.docx
It seems that the XML element
<w:sym w:font="Symbol" w:char="006D"/>
is completely ignored during the conversion process.
Thank you.
This utility works great.
Any plans to release a .NET Standard version of this?
OR
If we would like to do the port ourselves, do you foresee any compatibility issues?