
benbalter / word-to-markdown


A ruby gem to liberate content from Microsoft Word documents

Home Page: https://word2md.com

License: MIT License

Ruby 97.17% Shell 0.97% Dockerfile 1.86%
converter word microsoft-word markdown ruby libreoffice

word-to-markdown's Introduction

Word to Markdown converter

A Ruby gem to liberate content from the jail that is Word documents


The problem

Our default content publishing workflow is terribly broken. We've all been trained to make paper, yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.

I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to Markdown, the lingua franca of the internet, but as my recent foray into building just such a converter proves, it's not that simple.

Markdown isn't just an alternative format. Markdown forces you to write for the web.

Read more

Just want to convert a Microsoft Word (or Google) document to Markdown?

You can use this hosted service (or check out its source).

Install

You'll need to install LibreOffice. Then:

gem install word-to-markdown

Usage

file = WordToMarkdown.new("/path/to/document.docx")
=> <WordToMarkdown path="/path/to/document.docx">

file.to_s
=> "# Test\n\n This is a test"

file.document.tree
=> <Nokogiri Document>
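A minimal sketch of how the calls above might be used in a script, assuming the gem and LibreOffice are installed. Only `WordToMarkdown.new` and `#to_s` come from the usage shown above; the helper name `write_markdown` and its injectable `converter:` argument are hypothetical, added so the file-writing part can be tried without LibreOffice.

```ruby
# Default converter: loads the gem lazily, so this file can be loaded
# even on a machine where word-to-markdown is not installed.
DEFAULT_CONVERTER = lambda do |path|
  require "word-to-markdown"
  WordToMarkdown.new(path).to_s
end

# Converts the document at docx_path and writes the Markdown next to it
# (same name, .md extension). Returns the path of the written file.
# `converter:` is a hypothetical hook, not part of the gem.
def write_markdown(docx_path, converter: DEFAULT_CONVERTER)
  out_path = docx_path.sub(/\.docx?\z/i, "") + ".md"
  File.write(out_path, converter.call(docx_path))
  out_path
end
```

Calling `write_markdown("/path/to/document.docx")` would then leave `/path/to/document.md` beside the original.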

Command line usage

Once you've installed the gem, it's just:

$ w2m path/to/document.docx

Outputs the resulting Markdown to stdout.
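Because `w2m` writes to stdout, converting many files means capturing that output once per document. The following is a rough sketch of one way to do that from Ruby. The helper names `run_w2m` and `convert_directory` are hypothetical, and it assumes `w2m` is on the PATH and exits with a non-zero status on failure.

```ruby
require "open3"

# Runs `w2m <path>` and returns its stdout (the Markdown).
# Assumes w2m signals failure through a non-zero exit status.
def run_w2m(path)
  stdout, status = Open3.capture2("w2m", path)
  raise "w2m failed for #{path}" unless status.success?
  stdout
end

# Converts every .docx file directly inside dir, writing foo.md beside
# foo.docx. Returns the list of Markdown files written.
def convert_directory(dir, runner: method(:run_w2m))
  Dir.glob(File.join(dir, "*.docx")).sort.map do |docx|
    md = File.join(dir, File.basename(docx, ".docx") + ".md")
    File.write(md, runner.call(docx))
    md
  end
end
```

The `runner:` argument exists only so the loop can be exercised with a stand-in for the real command.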

Supports

  • Paragraphs
  • Numbered lists
  • Unnumbered lists
  • Nested lists
  • Italic
  • Bold
  • Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
  • Implicit headings (e.g., text with a larger font size relative to paragraph text)
  • Images
  • Tables
  • Hyperlinks

Requirements and configuration

Word-to-markdown requires soffice, a command line interface to LibreOffice that works on Linux, Mac, and Windows. To install soffice, see the LibreOffice documentation.
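To check whether `soffice` is reachable before converting anything, a small PATH lookup along these lines may help. This is a sketch, not the gem's own lookup logic; the helper name `which` is an assumption, and LibreOffice installs do not always put `soffice` on the PATH.

```ruby
# Returns the full path of the first executable called `name` found on PATH,
# or nil when there is none. On Windows, PATHEXT extensions are tried too.
def which(name, path: ENV.fetch("PATH", ""), exts: ENV.fetch("PATHEXT", "").split(";"))
  exts = [""] + exts
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    exts.each do |ext|
      candidate = File.join(dir, name + ext)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

soffice = which("soffice")
puts(soffice ? "soffice found at #{soffice}" : "soffice not found on PATH")
```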

Testing

script/cibuild

Docker

First, create the Gemfile.lock by installing the dependencies:

bundle install

Everything you need to run the executable locally:

docker-compose build
docker-compose run --rm app bundle exec w2m --help
docker-compose run --rm app bundle exec w2m test/fixtures/em.docx

Hosted service

Word-to-markdown-server contains a lightweight server for converting Word Documents as a service. A live version runs at word2md.com.

word-to-markdown's People

Contributors

benbalter, blambeau, cabo, cameron423698, dependabot-preview[bot], dependabot[bot], dweinberger, erictrinh, fouad, github-actions[bot], horlogeskynet, konklone, niharikasingh, parkr, rrrene, zlatanvasovic


word-to-markdown's Issues

feat. request: to_gh_repo workflow

Let me upload an HTML-ified word doc, login with gh creds, create a public or private repo and let me continue editing inside of prose/gh's ace editor/something.

Copy to Clipboard

Cool project. Copy to Clipboard does not work on Mac 10.12 Chrome Version 58.0.3029.110 (64-bit)

Cannot find Soffice on Windows

Thanks for this cool little tool. I'm new to Ruby gems, so I apologize up front if this is an easy fix. I'm trying to convert a bunch of .docx files to Markdown to publish in mkdocs. I've installed the gem, and when I run the command "w2m /path/todocument.docx" I get the error below:

C:\Users\pyin>w2m c:/users/pyin/desktop/temp/yourwaytest2.docx
C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown.rb:59:in ``': No such file or directory - C:\Program Files (x86)\LibreOffice 4\program/soffice.EXE --headless --convert-to html 'c:/users/pyin/desktop/temp/yourwaytest2.docx' --outdir C:/Users/pyin/AppData/Local/Temp/d20141203-2896-oe614b (Errno::ENOENT)
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown.rb:59:in `run_command'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown/document.rb:88:in `raw_html'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown/document.rb:20:in `tree'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown/converter.rb:79:in `semanticize_font_styles!'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown/converter.rb:19:in `convert!'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/lib/word-to-markdown.rb:29:in `initialize'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/bin/w2m:10:in `new'
    from C:/Ruby193/lib/ruby/gems/1.9.1/gems/word-to-markdown-1.1.0/bin/w2m:10:in `<top (required)>'
    from C:/Ruby193/bin/w2m:23:in `load'
    from C:/Ruby193/bin/w2m:23:in `<main>'

Any advice on this?

Thanks.

Phillip

Crazy output

This is the output of a basic docx file. 2017-09-06 Meeting Agenda.docx

This doesn't look correct.

ISIG: Meeting Notes

Date September 6th from 2:00 PM-3:00 PM EST

Medium [https://join.skype.com/tT2eK2Eeid68](https://join.skype.com/tT2eK2Eeid68)

Chair/Note Taker: Don Richards

Agenda

- [Old Business](https://docs.google.com/document/d/1zHpCslyjtf1snle4Wvwz20EOb9tuSS8FpSIRjq8HCM4)
- [Google Chrome 62 Release](https://security.googleblog.com/2016/09/moving-towards-more-secure-web.html)
  - What is the possible impact (if any)?
  - [How to test and/or check within your browser](https://developers.google.com/web/updates/2016/10/avoid-not-secure-warn)?
- [Suggesting GhostScript 9.05 for Islandora Vagrant](https://github.com/Islandora-Labs/islandora_vagrant/issues/127)
  - Is there any issues with suggesting this or should suggestions be made to modify the modules affecting the error?
    - ■■9.06 Adds PDF/A compliance
    - ■■Possible to go up to the current 9.21?
      - Fixed in 9.20: Bug #697190 .initialize\_dsc\_parser doesn&#39;t validate the parameter is a dict type before using it. PostScript operator not to validate its parameter(s).
  - What modules are utilizing GhostScript?
    - ■■Islandora\_solution\_pack\_pdf
    - ■■islandora\_paged\_content
  - [Possibly unrelated to a solution](https://jira.duraspace.org/browse/ISLANDORA-2037)
- [https://jira.duraspace.org/browse/ISLANDORA-1999](https://jira.duraspace.org/browse/ISLANDORA-1999)
- What code or type of code would you like for us to look at next month?
  - [Fagan inspection](https://en.wikipedia.org/wiki/Fagan_inspection) method?

Notes:

- Google Chrome 62 release seems to have very little impact to the user experience (with the Islandora stack) and should not actually block any content.
- Link to an example page of http mismatched: [http://http-password.badssl.com/](http://http-password.badssl.com/)
- Ideal approach to address GhostScript for Islandora would be to test on vagrant updating drush commands for the PDF solution pack to include a test (determine which version) and set the switches accordingly.
- Cropbox may complicate the solution or have no effect at all.
- Cropbox testing failed and may have been caused by the GhostScript version (speculating)
- Fagan Inspection should be brought up during the next meeting.
  - Include some popular alternatives that have very little impact on the community if possible.

Next Meeting: October 4th, 2017 2PM EST

Next Chair: Takers?

 CONTINUED IN THE COMMENTS

Feature: Automator Workflow on Mac OSX

I love this great repo and did some further automation to speed up the whole process.
Because I'm using OneNote, I had to go through the process below manually every time.

  1. Open Word

  2. Pause for 1 second. (to wait for system to open Word)

  3. Create new doc

  4. Paste clipboard content to word

  5. Save ~/a.docx

  6. Run terminal to w2m

    $ cd ~
    $ w2m a.docx
    
  7. Copy the result to clipboard. You got md here!

  8. Run terminal to remove doc

    $ cd ~
    $ rm a.docx
    
  9. Enjoy the result!

It's easier now

Now I use Automator on Mac OSX to help me do exactly the same thing.
Here's the workflow. clipboard2md.workflow

Now you only need to

  1. Copy your Word-format content
  2. Open the workflow(I'm using Alfred 2)
  3. Run it (⌘+R) and wait a while
  4. You got the markdown result at your clipboard! (⌘+V)

Please feel free to download it, or is there anywhere better to share this?

CLI wrapper

Basically, a way to easily use this gem from other non-Ruby contexts.

encounter an error while converting

Seems it will first convert to html?

jacksondeMacBook-Pro:word2markdown jackson$ w2m abc0.0.11.docx
/Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:91:in `read': No such file or directory - /var/folders/vn/81b5g27170j8bqsydq6w7n1w0000gn/T/d20150531-1488-ogxosw/abc0.0.11.html (Errno::ENOENT)
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:91:in `raw_html'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:58:in `normalized_html'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:20:in `tree'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/converter.rb:18:in `convert!'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown.rb:31:in `initialize'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/bin/w2m:14:in `new'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/bin/w2m:14:in `<top (required)>'
    from /usr/bin/w2m:23:in `load'
    from /usr/bin/w2m:23:in `<main>'

Take a look at Yomu

https://github.com/Erol/yomu

Maybe it can read the HTML from a Word file directly?

require 'yomu'

data = File.read 'mydoc.docx'
text = Yomu.read :html, data
puts text

yields results like

<p class="list_Paragraph">And think of all the minor towns who sacrificed everything to build an altar for the locomotive. Mayors that believed their town&rsquo;s future success depended on the promises of growth that a train could deliver.</p>
<p class="list_Paragraph">Towns with names like Kiowa, Modena, Soap Creek, Barclay.</p>

Word-to-markdown gives an Internal Server Error

I have been converting word docs all morning. Now every time I try to open a doc, it goes to another page that says Internal Server Error. I even tried converting docs this morning that worked in the converter and were converted. Can you tell me why it isn't working?

soffice executable error with libreoffice 5

Possibly related:
unoconv/unoconv#60
https://forum.openoffice.org/en/forum/viewtopic.php?f=17&t=17547
https://bz.apache.org/ooo/show_bug.cgi?id=101203

Using:
WordToMarkdown v1.1.7
LibreOffice v5.0.4.2

RuntimeError - Command `/usr/local/bin/soffice --headless --convert-to html:XHTML Writer File:UTF8 /Users/ia/dev/rstacks/tmp/uploads/work/document/1454983300-24280-5845/AlveyandtheChameleon-3.docx --outdir /var/folders/zw/5jg9p0yx0gb97hh28z85c9rw0000gn/T/d20160208-24280-12tnqnf` failed: 2016-02-08 21:01:41.797 soffice[24343:313698] No Info.plist file in application bundle or no NSPrincipalClass in the Info.plist file, exiting
:
  word-to-markdown (1.1.7) lib/word-to-markdown.rb:53:in `run_command'
  word-to-markdown (1.1.7) lib/word-to-markdown/document.rb:91:in `raw_html'
  word-to-markdown (1.1.7) lib/word-to-markdown/document.rb:59:in `normalized_html'
  word-to-markdown (1.1.7) lib/word-to-markdown/document.rb:21:in `tree'
  word-to-markdown (1.1.7) lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
  word-to-markdown (1.1.7) lib/word-to-markdown/converter.rb:18:in `convert!'
  word-to-markdown (1.1.7) lib/word-to-markdown.rb:45:in `initialize'
  app/uploaders/document_uploader.rb:69:in `set_file_contents_md_and_html'
  carrierwave (0.10.0) lib/carrierwave/uploader/processing.rb:84:in `block in process!'
  carrierwave (0.10.0) lib/carrierwave/uploader/processing.rb:76:in `process!'
  carrierwave (0.10.0) lib/carrierwave/uploader/callbacks.rb:18:in `block in with_callbacks'
  carrierwave (0.10.0) lib/carrierwave/uploader/callbacks.rb:18:in `with_callbacks'
  carrierwave (0.10.0) lib/carrierwave/uploader/cache.rb:122:in `cache!'
  carrierwave (0.10.0) lib/carrierwave/mount.rb:329:in `cache'
  carrierwave (0.10.0) lib/carrierwave/mount.rb:163:in `document='
  carrierwave (0.10.0) lib/carrierwave/orm/activerecord.rb:39:in `document='
  activerecord (4.2.0) lib/active_record/attribute_assignment.rb:54:in `_assign_attribute'
  activerecord (4.2.0) lib/active_record/attribute_assignment.rb:41:in `block in assign_attributes'
  actionpack (4.2.0) lib/action_controller/metal/strong_parameters.rb:183:in `each_pair'
  activerecord (4.2.0) lib/active_record/attribute_assignment.rb:35:in `assign_attributes'
  activerecord (4.2.0) lib/active_record/core.rb:557:in `init_attributes'
  activerecord (4.2.0) lib/active_record/core.rb:280:in `initialize'
  actionpack (4.2.0) lib/action_dispatch/routing/url_for.rb:104:in `initialize'
  activerecord (4.2.0) lib/active_record/inheritance.rb:61:in `new'
  activerecord (4.2.0) lib/active_record/reflection.rb:131:in `build_association'
  activerecord (4.2.0) lib/active_record/associations/association.rb:247:in `build_record'
  activerecord (4.2.0) lib/active_record/associations/collection_association.rb:136:in `build'
  activerecord (4.2.0) lib/active_record/associations/collection_proxy.rb:254:in `build'
  app/controllers/works_controller.rb:304:in `block in create'
  actionpack (4.2.0) lib/action_controller/metal/mime_responds.rb:211:in `respond_to'
  app/controllers/works_controller.rb:303:in `create'
  actionpack (4.2.0) lib/action_controller/metal/implicit_render.rb:4:in `send_action'
  actionpack (4.2.0) lib/abstract_controller/base.rb:198:in `process_action'
  actionpack (4.2.0) lib/action_controller/metal/rendering.rb:10:in `process_action'
  actionpack (4.2.0) lib/abstract_controller/callbacks.rb:20:in `block in process_action'
  activesupport (4.2.0) lib/active_support/callbacks.rb:117:in `call'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:169:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:234:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:234:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:169:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:234:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:169:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:169:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:169:in `block in halting'
  activesupport (4.2.0) lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
  activesupport (4.2.0) lib/active_support/callbacks.rb:92:in `_run_callbacks'
  activesupport (4.2.0) lib/active_support/callbacks.rb:734:in `_run_process_action_callbacks'
  activesupport (4.2.0) lib/active_support/callbacks.rb:81:in `run_callbacks'
  actionpack (4.2.0) lib/abstract_controller/callbacks.rb:19:in `process_action'
  actionpack (4.2.0) lib/action_controller/metal/rescue.rb:29:in `process_action'
  actionpack (4.2.0) lib/action_controller/metal/instrumentation.rb:31:in `block in process_action'
  activesupport (4.2.0) lib/active_support/notifications.rb:164:in `block in instrument'
  activesupport (4.2.0) lib/active_support/notifications/instrumenter.rb:20:in `instrument'
  activesupport (4.2.0) lib/active_support/notifications.rb:164:in `instrument'
  actionpack (4.2.0) lib/action_controller/metal/instrumentation.rb:30:in `process_action'
  actionpack (4.2.0) lib/action_controller/metal/params_wrapper.rb:250:in `process_action'
  activerecord (4.2.0) lib/active_record/railties/controller_runtime.rb:18:in `process_action'
  actionpack (4.2.0) lib/abstract_controller/base.rb:137:in `process'
  actionview (4.2.0) lib/action_view/rendering.rb:30:in `process'
  actionpack (4.2.0) lib/action_controller/metal.rb:195:in `dispatch'
  actionpack (4.2.0) lib/action_controller/metal/rack_delegation.rb:13:in `dispatch'
  actionpack (4.2.0) lib/action_controller/metal.rb:236:in `block in action'
  actionpack (4.2.0) lib/action_dispatch/routing/route_set.rb:73:in `dispatch'
  actionpack (4.2.0) lib/action_dispatch/routing/route_set.rb:42:in `serve'
  actionpack (4.2.0) lib/action_dispatch/journey/router.rb:43:in `block in serve'
  actionpack (4.2.0) lib/action_dispatch/journey/router.rb:30:in `serve'
  actionpack (4.2.0) lib/action_dispatch/routing/route_set.rb:802:in `call'
  warden (1.2.3) lib/warden/manager.rb:35:in `block in call'
  warden (1.2.3) lib/warden/manager.rb:34:in `call'
  rack (1.6.4) lib/rack/etag.rb:24:in `call'
  rack (1.6.4) lib/rack/conditionalget.rb:38:in `call'
  rack (1.6.4) lib/rack/head.rb:13:in `call'
  remotipart (1.2.1) lib/remotipart/middleware.rb:27:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/params_parser.rb:27:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/flash.rb:260:in `call'
  rack (1.6.4) lib/rack/session/abstract/id.rb:225:in `context'
  rack (1.6.4) lib/rack/session/abstract/id.rb:220:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/cookies.rb:560:in `call'
  activerecord (4.2.0) lib/active_record/query_cache.rb:36:in `call'
  activerecord (4.2.0) lib/active_record/connection_adapters/abstract/connection_pool.rb:647:in `call'
  activerecord (4.2.0) lib/active_record/migration.rb:378:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
  activesupport (4.2.0) lib/active_support/callbacks.rb:88:in `_run_callbacks'
  activesupport (4.2.0) lib/active_support/callbacks.rb:734:in `_run_call_callbacks'
  activesupport (4.2.0) lib/active_support/callbacks.rb:81:in `run_callbacks'
  actionpack (4.2.0) lib/action_dispatch/middleware/callbacks.rb:27:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/reloader.rb:73:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/remote_ip.rb:78:in `call'
  better_errors (2.1.1) lib/better_errors/middleware.rb:84:in `protected_app_call'
  better_errors (2.1.1) lib/better_errors/middleware.rb:79:in `better_errors_call'
  better_errors (2.1.1) lib/better_errors/middleware.rb:57:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
  web-console (2.2.1) lib/web_console/middleware.rb:39:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
  railties (4.2.0) lib/rails/rack/logger.rb:38:in `call_app'
  railties (4.2.0) lib/rails/rack/logger.rb:20:in `block in call'
  activesupport (4.2.0) lib/active_support/tagged_logging.rb:68:in `block in tagged'
  activesupport (4.2.0) lib/active_support/tagged_logging.rb:26:in `tagged'
  activesupport (4.2.0) lib/active_support/tagged_logging.rb:68:in `tagged'
  railties (4.2.0) lib/rails/rack/logger.rb:20:in `call'
  ahoy_matey (1.2.1) lib/ahoy/engine.rb:18:in `call_with_quiet_ahoy'
  request_store (1.3.0) lib/request_store/middleware.rb:9:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/request_id.rb:21:in `call'
  rack (1.6.4) lib/rack/methodoverride.rb:22:in `call'
  rack (1.6.4) lib/rack/runtime.rb:18:in `call'
  activesupport (4.2.0) lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call'
  rack (1.6.4) lib/rack/lock.rb:17:in `call'
  actionpack (4.2.0) lib/action_dispatch/middleware/static.rb:113:in `call'
  rack (1.6.4) lib/rack/sendfile.rb:113:in `call'
  railties (4.2.0) lib/rails/engine.rb:518:in `call'
  railties (4.2.0) lib/rails/application.rb:164:in `call'
  rack (1.6.4) lib/rack/lock.rb:17:in `call'
  rack (1.6.4) lib/rack/content_length.rb:15:in `call'
  rack (1.6.4) lib/rack/handler/webrick.rb:88:in `service'
  /Users/ia/.rvm/rubies/ruby-2.2.2/lib/ruby/2.2.0/webrick/httpserver.rb:138:in `service'
  /Users/ia/.rvm/rubies/ruby-2.2.2/lib/ruby/2.2.0/webrick/httpserver.rb:94:in `run'
  /Users/ia/.rvm/rubies/ruby-2.2.2/lib/ruby/2.2.0/webrick/server.rb:294:in `block in start_thread'

Converting Math formulas

Hi,

Is there any function to convert math formulas from doc/docx back to markdown?

$$\phi = ax^2 + b$$

Cheers,
Arnaldo.

Unordered list rendering as ordered unordered list

Essentially, I'm trying to use this as a way of updating a blog (as an exercise), and ran into the following problem. Wrote a line of text, then an unordered list, then another line of text. This is parsed as line of text, second line of text, then the list, whose items are now both numbered and disced.

That may not be clear - this should be better

Word html output looks like this in Word (roughly the same as the input and how it looks in the browser).

Word output

The file itself looks like this

And the output (from the demo site) looks like this:

word-to-markdown output

I'm using Word for Mac 2011, version 14.3.9 (131030)

Fail with specific exception message, stating that soffice executable from LibreOffice cannot be found

bogon:AVOSDemo apple$ w2m /Users/apple/Downloads/Ipad.docx 
/Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/open3.rb:211:in `spawn': no implicit conversion of nil into String (TypeError)
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/open3.rb:211:in `popen_run'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/open3.rb:206:in `popen2e'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/open3.rb:372:in `capture2e'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/gems/word-to-markdown-1.1.1/lib/word-to-markdown.rb:67:in `run_command'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/gems/2.0.0/gems/word-to-markdown-1.1.1/lib/word-to-markdown/document.rb:88:in `raw_html'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/gems/2.0.0/gems/word-to-markdown-1.1.1/lib/word-to-markdown/document.rb:20:in `tree'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/gems/2.0.0/gems/word-to-markdown-1.1.1/lib/word-to-markdown/converter.rb:79:in `semanticize_font_styles!'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/gems/2.0.0/gems/word-to-markdown-1.1.1/lib/word-to-markdown/converter.rb:19:in `convert!'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/gems/word-to-markdown-1.1.1/lib/word-to-markdown.rb:31:in `initialize'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/gems/word-to-markdown-1.1.1/bin/w2m:10:in `new'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/gems/word-to-markdown-1.1.1/bin/w2m:10:in `<top (required)>'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/bin/w2m:23:in `load'
    from /Users/apple/.rvm/rubies/ruby-2.0.0-p353/bin/w2m:23:in `<main>'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/bin/ruby_executable_hooks:15:in `eval'
    from /Users/apple/.rvm/gems/ruby-2.0.0-p353@global/bin/ruby_executable_hooks:15:in `<main>'
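The behaviour asked for in this issue's title could be sketched roughly as follows: check the executable path before handing it to `Open3`, and raise a readable error instead of the `TypeError` above. The class name `SofficeNotFoundError` and the helper `run_soffice` are hypothetical and not part of the gem; the sketch assumes the `nil` in the trace is a missing soffice path.

```ruby
require "open3"

# Hypothetical error class for this sketch; not part of the gem's API.
class SofficeNotFoundError < StandardError; end

# Runs soffice with the given arguments and returns its combined output.
# Fails early, with a readable message, when no usable executable is known.
def run_soffice(soffice_path, *args)
  if soffice_path.nil? || !File.executable?(soffice_path)
    raise SofficeNotFoundError,
          "Could not find the soffice executable from LibreOffice. " \
          "Is LibreOffice installed?"
  end
  # Program and arguments are passed separately, so no shell is involved
  # and paths containing spaces need no extra quoting.
  output, status = Open3.capture2e(soffice_path, *args)
  raise "soffice exited with status #{status.exitstatus}" unless status.success?
  output
end
```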

LibreOffice in Local directory

Thanks for the excellent tool. But small bug report.

The easy way to install LibreOffice on mac is now brew cask install LibreOffice (supposing that brew and brew cask are set up)

However, this installs into ~/Applications/ rather than /Applications. w2m should search in both places...
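Searching both places could look something like the sketch below. The `.app/Contents/MacOS/soffice` layout is an assumption about a standard macOS LibreOffice install, and the helper names `soffice_candidates` and `first_executable` are hypothetical, not the gem's.

```ruby
# Candidate locations in search order: system-wide first, then per-user.
# The bundle layout is an assumption about a standard macOS install.
def soffice_candidates(home: Dir.home)
  relative = "Applications/LibreOffice.app/Contents/MacOS/soffice"
  ["/#{relative}", File.join(home, relative)]
end

# Returns the first candidate that exists and is executable, or nil.
def first_executable(candidates = soffice_candidates)
  candidates.find { |path| File.file?(path) && File.executable?(path) }
end
```

With this shape, supporting another install location is a matter of adding one more entry to the list.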

About normalized html

I find the current implementation removes fullwidth quotation marks.
I'm processing some Chinese Word documents and I would like those fullwidth characters to remain.

Is normalized_html necessary? If it isn't, I could send a pull request adding an option to skip HTML normalization.
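One possible shape for such an option, as a sketch only. It assumes the characters being lost are the curly double quotes targeted by the `gsub!` line quoted in the "Invalid multibyte character issue" report on this page; the method name `normalize_html` and the `straighten_quotes:` keyword are hypothetical.

```ruby
# Replaces curly double quotes with straight ones unless the caller opts out.
# `straighten_quotes:` is a hypothetical option, not an existing gem setting.
def normalize_html(html, straighten_quotes: true)
  return html unless straighten_quotes
  html.gsub(/“|”/, '"')
end
```

Passing `straighten_quotes: false` returns the HTML untouched, which would leave quotation marks in Chinese text as written.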

Image handling?

Hi, this may be more a question rather than an issue but wasn't sure of a better place to ask.

Currently I'm just testing this code; it would be perfect for me and save a lot of time. I just have a question about image handling.

I'm converting like this:

w2m /path/to/file.doc >> new.md

However, images are spamming the code with hex or whatever it is :). How can I convert the images to a URI or something that'll point to a local file?
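If the "hex" in the output is in fact images embedded as base64 `data:` URIs (an assumption; the report does not show the output), a post-processing step along these lines could write them to local files. The helper name `extract_images` and the `image-N` file naming are hypothetical.

```ruby
# Matches inline images such as data:image/png;base64,iVBORw0...
DATA_URI = %r{data:image/(?<ext>[a-z0-9]+);base64,(?<data>[A-Za-z0-9+/=]+)}i

# Writes each embedded image to out_dir and replaces its data URI in the
# text with the path of the written file. Returns the rewritten text.
def extract_images(markdown, out_dir)
  index = 0
  markdown.gsub(DATA_URI) do
    match = Regexp.last_match
    index += 1
    path = File.join(out_dir, "image-#{index}.#{match[:ext].downcase}")
    File.binwrite(path, match[:data].unpack1("m")) # "m" decodes base64
    path
  end
end
```

It would run on the text captured from `w2m` before that text is saved as `new.md`.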

Many thanks

Word-to-markdown gives an Internal Server Error for custom styles

Hi there,

I uploaded this document and got an Internal Server Error before I could click Parse. I kept deleting stuff I thought might be a problem, still no luck. I even got it down to a single word "hello", and no luck.

I tried making a new document and copying my content from the desired file to the new document. Here's what I used, which was successful. The difference is: in the original document, I had created and saved a set of custom styles.

So, word-to-markdown crashes when a document has a set of custom styles saved with it. Hope this helps.

Invalid multibyte character issue

I could be missing something very simple here in terms of the encoding, but when trying to run this via the OSX command line, I'm getting this error:

<user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/lib/word-to-markdown.rb:8:in `require_relative': <user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:60: invalid multibyte char (US-ASCII) (SyntaxError)
<user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:60: invalid multibyte char (US-ASCII)
<user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:60: syntax error, unexpected $end, expecting keyword_end
      html.gsub! /“|”/, '"'          # Straig...
                    ^
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/lib/word-to-markdown.rb:8:in `<top (required)>'
    from <user-home>/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:55:in `require'
    from <user-home>/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:55:in `require'
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/gems/word-to-markdown-1.0.0/bin/w2m:3:in `<top (required)>'
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/bin/w2m:19:in `load'
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/bin/w2m:19:in `<main>'
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/bin/ruby_executable_hooks:15:in `eval'
    from <user-home>/.rvm/gems/ruby-1.9.3-p194/bin/ruby_executable_hooks:15:in `<main>'

Install Error

I'm new to Ruby, so forgive me if it is inappropriate to post this here.

gem install word-to-markdown
Fetching: mini_portile-0.7.0.rc4.gem (100%)
Successfully installed mini_portile-0.7.0.rc4
Fetching: nokogiri-1.6.7.rc3.gem (100%)
Building native extensions. This could take a while...
ERROR: Error installing word-to-markdown:
ERROR: Failed to build gem native extension.

/usr/bin/ruby extconf.rb

mkmf.rb can't find header files for ruby at /usr/share/include/ruby.h

extconf failed, exit code 1

Gem files will remain installed in /home/mediascover/.gem/ruby/gems/nokogiri-1.6.7.rc3 for inspection.
Results logged to /home/mediascover/.gem/ruby/extensions/x86_64-linux/nokogiri-1.6.7.rc3/gem_make.out

Error when LibreOffice is open

When any LibreOffice program is open (tested Linux Mint with both Calc and Word) this fails with an error like the following:

/var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/document.rb:91:in `read': No such file or directory - /tmp/d20150618-31341-8olz7h/some_document.html (Errno::ENOENT)
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/document.rb:91:in `raw_html'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/document.rb:58:in `normalized_html'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/document.rb:20:in `tree'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown/converter.rb:18:in `convert!'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/lib/word-to-markdown.rb:31:in `initialize'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/bin/w2m:14:in `new'
    from /var/lib/gems/1.9.1/gems/word-to-markdown-1.1.3/bin/w2m:14:in `<top (required)>'
    from /usr/local/bin/w2m:23:in `load'
    from /usr/local/bin/w2m:23:in `<main>'

Invalid multibyte char (US-ASCII)

I installed the gem and tried to run w2m, but I got several of these errors, stack trace indicated line 9 of converter.rb and line 60 of document.rb, presumably because some of those characters you are trying to match are not ASCII. Adding # encoding: utf-8 to the top of both files fixed this issue for me, in case others experience the same thing.
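An alternative to the magic comment, sketched here, is to keep the source file pure ASCII by writing the quote characters as Unicode escapes. The method name `straighten_quotes` is hypothetical; the substitution mirrors the `gsub!` line shown in the error output of the earlier "Invalid multibyte character issue" report. Ruby 2.0 and later treat source files as UTF-8 by default, so the original error mainly affects Ruby 1.9.

```ruby
# Curly double quotes written as Unicode escapes, so this file contains
# only ASCII bytes whatever the default source encoding is.
CURLY_DOUBLE_QUOTES = /\u201C|\u201D/

# Returns a copy of html with curly double quotes replaced by straight ones.
def straighten_quotes(html)
  html.gsub(CURLY_DOUBLE_QUOTES, '"')
end

puts straighten_quotes("\u201CTest\u201D") # prints "Test" in straight quotes
```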

Convert from onedrive.live.com?

I have a bunch of word docs online that I'd like to convert to markdown. Is there a way to do that easily (other than going through all of them and downloading & saving them manually)?

It doesn't even convert at all. Shows weird errors.

These are the errors I received:

/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/open3.rb:211:in `spawn': no implicit conversion of nil into String (TypeError)
    from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/open3.rb:211:in `popen_run'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/open3.rb:206:in `popen2e'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/open3.rb:372:in `capture2e'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown.rb:67:in `run_command'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:90:in `raw_html'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:58:in `normalized_html'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/document.rb:20:in `tree'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown/converter.rb:18:in `convert!'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/lib/word-to-markdown.rb:31:in `initialize'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/bin/w2m:14:in `new'
    from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.2/bin/w2m:14:in `<top (required)>'
    from /usr/bin/w2m:23:in `load'
    from /usr/bin/w2m:23:in `<main>'

Conversion error

I love this library, and I've been using the demo version online, but I'm finally getting around to setting it up locally. I installed the gem and Libre Office (I'm on a Mac running Sierra), but when I try to convert I get this:

leo:docs iangilman$ w2m mydoc.docx
/Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown/document.rb:92:in `raw_html': Failed to convert /Users/iangilman/work/docs/mydoc.docx (WordToMarkdown::Document::ConverstionError)
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown/document.rb:59:in `normalized_html'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown/document.rb:21:in `tree'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown/converter.rb:18:in `convert!'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/lib/word-to-markdown.rb:45:in `initialize'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/bin/w2m:14:in `new'
	from /Library/Ruby/Gems/2.0.0/gems/word-to-markdown-1.1.7/bin/w2m:14:in `<top (required)>'
	from /usr/local/bin/w2m:23:in `load'
	from /usr/local/bin/w2m:23:in `<main>'

The doc in question opens fine in Libre Office.

Any idea what is going wrong? Anything I can try to fix it? Thank you!

multi-level outlines

First off, Ben, this is wonderful. I absolutely love the direction you're going here. I went and found a word doc just so I could test it out and provide useful feedback if I could.

Here's the sample: https://gist.github.com/daveloyall/9772587

Note that I generated that with Office 2010 on Windows. (Send me a link to your gif-maker and I'll produce one that shows what I had to click.)

The issue is that this:

  1. Thing
    1. Subthing
    2. Subthing, too
  2. Other thing
    1. tameighta
    2. tamahta

...becomes this:

    1. Thing
  • i. Subthing
  • ii. Subthing, too
    1. Other thing
  • i. tameighta
  • ii. tamahta

Cheers,

Ordered lists become unordered lists

A chunk like this: https://gist.github.com/gjtorikian/9750065

Turns into this:

- 19. And think of all the minor towns who sacrificed everything to build an altar for the locomotive. Mayors that believed their town's future success depended on the promises of growth that a train could deliver.
- 20. Towns with names like Kiowa, Modena, Soap Creek, Barclay.

The leading - should be dropped so that the items render as an ordered list.
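
Until this is fixed in the gem, one possible workaround is to post-process the generated Markdown. The sketch below is an assumption about how that could look, not the gem's own fix: the unbullet_numbered_items helper is hypothetical and strips a bullet marker only when it sits directly in front of an ordered-list number.

```ruby
# Hypothetical post-processing helper, not part of word-to-markdown.
# Removes a bullet marker that directly precedes an ordered-list number,
# e.g. "- 19. Text" becomes "19. Text". Plain bullets are left untouched.
def unbullet_numbered_items(markdown)
  markdown.gsub(/^([ \t]*)[-*+][ \t]+(\d+\.[ \t])/, '\1\2')
end

sample = <<~MARKDOWN
  - 19. And think of all the minor towns
  - 20. Towns with names like Kiowa
  - A plain bullet
MARKDOWN

puts unbullet_numbered_items(sample)
```

Indentation is preserved, so nested items keep their level; only the stray marker is removed.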

ERROR: Error installing word-to-markdown:

sudo gem install word-to-markdown
Fetching: mini_portile2-2.3.0.gem (100%)
Successfully installed mini_portile2-2.3.0
Fetching: nokogiri-1.8.1.gem (100%)
Building native extensions. This could take a while...
ERROR: Error installing word-to-markdown:
ERROR: Failed to build gem native extension.

current directory: /var/lib/gems/2.3.0/gems/nokogiri-1.8.1/ext/nokogiri

/usr/bin/ruby2.3 -r ./siteconf20171106-14125-1km0ne7.rb extconf.rb
mkmf.rb can't find header files for ruby at /usr/lib/ruby/include/ruby.h

extconf failed, exit code 1

Gem files will remain installed in /var/lib/gems/2.3.0/gems/nokogiri-1.8.1 for inspection.
Results logged to /var/lib/gems/2.3.0/extensions/x86_64-linux/2.3.0/nokogiri-1.8.1/gem_make.out


current directory: /var/lib/gems/2.3.0/gems/nokogiri-1.8.1/ext/nokogiri
/usr/bin/ruby2.3 -r ./siteconf20171106-14125-1km0ne7.rb extconf.rb
mkmf.rb can't find header files for ruby at /usr/lib/ruby/include/ruby.h

extconf failed, exit code 1
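
The "can't find header files for ruby" message means the Ruby development headers are missing, so native extensions such as nokogiri cannot be compiled. On Debian and Ubuntu style systems the headers usually come from a separate development package. The sketch below is only a rough diagnostic under that assumption: it looks for ruby.h in the directory reported by RbConfig, which approximates the check mkmf performs but is not identical to it. The helper names are hypothetical.

```ruby
require "rbconfig"

# Path where this Ruby expects its main header file to live.
def ruby_header_path
  File.join(RbConfig::CONFIG["rubyhdrdir"].to_s, "ruby.h")
end

# True when the header is present, i.e. native extensions have a chance to build.
def ruby_headers_installed?
  File.exist?(ruby_header_path)
end

puts "ruby.h expected at: #{ruby_header_path}"
if ruby_headers_installed?
  puts "headers found"
else
  puts "headers missing; native extensions such as nokogiri will not build"
end
```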

Reflected XSS Vulnerability

When attempting to convert a Word document describing web application security issues, I discovered word-to-markdown is vulnerable to Reflected Cross-Site Scripting. Specifically, the application places input from the converted document directly into the source code of the webpage, which allows attackers to execute arbitrary script in a victim's browser if the victim converts a malicious document. Below is a harmless example that illustrates the attack:

poc
xss.docx

The Markdown preview correctly HTML-encodes 'problem characters' such as angle brackets, but these characters are not HTML-encoded in the Rendered view. While Markdown does support inline HTML, I feel it's unlikely a user would include HTML in a Word document and expect it to be rendered as such; thus the Rendered view should also HTML-encode characters. This would resolve the Reflected XSS vulnerability.
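
A minimal sketch of the suggested mitigation, assuming the Rendered view is produced by Ruby code that could escape text before output. The hosted app's rendering code is not shown in this report, so the render_safely helper is hypothetical; ERB::Util.html_escape is part of Ruby's standard library.

```ruby
require "erb"

# Hypothetical helper: escapes text extracted from a document before it is
# placed into a page, so markup in the document is displayed, not executed.
def render_safely(untrusted_text)
  ERB::Util.html_escape(untrusted_text)
end

payload = '<script>alert("xss")</script>'
puts render_safely(payload)
# prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

The trade-off is the one noted above: escaping everything also disables any inline HTML a user did intend to be rendered.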

Error converting the file to markdown

Hi, I followed the procedure you described. I installed LibreOffice and then gems, and then installed word-to-markdown with the command gem install word-to-markdown.

Then I ran the command w2m .\Desktop\part1.docx, as shown in the image below, and got an error.

image

Bug: Link in list won't show

I'm using Mac OS X with Word version 14.4.0.
I could convert the file successfully, but there is a small bug.

For example

word-to-markdown

would turn out to be

[word-to-markdown](https://github.com/benbalter/word-to-markdown)
- word-to-markdown

I guess this may be caused by the indentation.

Please tell me if I'm misunderstanding something, and how I can get the output I expected.
Thanks for this great repo!

Error message on markdown tool

Hi Ben,

For the past few weeks I've been getting the following message:


Application Error
An error occurred in the application and your page could not be served. Please try again in a few moments.

If you are the application owner, check your logs for details.

I have tried this with documents that previously converted fine, but they no longer work either.

I was wondering if perhaps you could take a look at what could be causing the problem.

Best,
Sam

Fails to run when source file contains whitespace

If the source document has whitespace in its filename, w2m fails to run.

For example:

Here's w2m running without an error:

$ w2m policy.docx > rk-policy.md
WARNING: Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1

However, when you rename the source file, it blows up:

$ mv policy.docx has\ space.docx
$ w2m has\ space.docx > /dev/null
WARNING: Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1
/Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:91:in `read': No such file or directory - /var/folders/kf/34twy6lx36q2y45w9vwvd4mm0000gp/T/d20140616-48253-1d7c8p6/has space.html (Errno::ENOENT)
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:91:in `raw_html'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown/document.rb:18:in `tree'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown/converter.rb:78:in `semanticize_font_styles!'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown/converter.rb:18:in `convert!'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/lib/word-to-markdown.rb:29:in `initialize'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/bin/w2m:10:in `new'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/gems/word-to-markdown-1.0.0/bin/w2m:10:in `<top (required)>'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/bin/w2m:23:in `load'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/bin/w2m:23:in `<main>'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/bin/ruby_noexec_wrapper:14:in `eval'
    from /Users/vrivellino/.rvm/gems/ruby-2.0.0-p0/bin/ruby_noexec_wrapper:14:in `<main>'
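
One plausible cause, which the trace alone does not confirm, is that the filename is interpolated into a shell command string without escaping, so the converter never sees the full path and the expected HTML file is never written. The sketch below shows how a shell would tokenize such a string and how escaping avoids the split. The soffice command line is illustrative only and is not taken from the gem's source.

```ruby
require "shellwords"

path = "has space.docx"

# Naive interpolation: a shell would see two arguments, "has" and "space.docx".
naive = "soffice --headless --convert-to html #{path}"
p Shellwords.split(naive).last(2) # prints ["has", "space.docx"]

# Escaped interpolation: the filename survives as a single argument.
escaped = "soffice --headless --convert-to html #{Shellwords.escape(path)}"
p Shellwords.split(escaped).last # prints "has space.docx"

# Alternative: pass the arguments as a list so no shell parsing happens at all:
# Open3.capture2e("soffice", "--headless", "--convert-to", "html", path)
```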
