GithubHelp home page GithubHelp logo

doc2html's Introduction

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>DOC2HTML</title>
  <link href="https://fonts.googleapis.com/css2?family=Fanwood+Text:ital@0;1&family=IBM+Plex+Mono:ital,wght@0,500;1,500&display=swap" rel="stylesheet">
  <style type="text/css">
    h1 {
      text-align: center;
    }
    h2 {
      text-align: center;
      margin: 2em 0 1em 0;
    }
    h3 {
      text-align: left;
      margin: 2em 0 .5em 0;
    }
    body {
      font-size: 1.25em;
      background-color: #eed;
      font-family: "Fanwood Text", serif;
      margin: 0 8px 2em 10px;
    }
    p {

      margin: 0;
      text-indent: 1.75em;
    }
    li p {
      text-indent: 0;
      margin: .25em 0 .25em 0;
    }
    tt {
      font-family: "IBM Plex Mono", monospace;
      color: navy;
      background-color: #ccfbcc;
      margin: 0;
      font-size: 90%;
    }
  </style>
</head>
<body>
     <h1>A Program to Transform<br />
       Complex Text Documents<br />to<br />
       Simple HTML</h1>
<p>Word processors are wonderful: they manage multiple versions,
  and WYSIWYG editors such as Microsoft Word, Wordperfect, and
  LibreOffice can double as document publishing tools for simple
  (and even complex) documents.
</p>
<p>
  But this power comes with a price. Retrieving nicely
  formatted text as simplistic HTML (without the powerful
  formatting capabilities) is hard. Native output to HTML
  wants to include font information, color information,
  metadata, complex paragraph rules, &hellip
</p>
<p>
  Sometimes, all one wants is the simplest formatting.
  Headers, bold, italic, bold-italic, and paragraphs.
  Monospace, maybe. Retrieving that bare-bones HTML
  is difficult.
</p>
<p>
  <tt>DOC2HTML</tt> makes it easy. It uses a powerful
  parser <i>(Apache Tika)</i>
  to identify and parse input. It then renders the document
  into simple HTML.
</p>
<p><tt>DOC2HTML</tt> was written for Java 11. There is an all-in-one
JAR file, so running it as simple as </p>
<ul>
  <li><p>Making sure the right Java is present on the system. From a
  command prompt, run:<br/></p>
  <div class="indent">
    <tt>java --version</tt>
  </div>
    <p>As long as the version is 11 or higher, all is well.</p>
  </li>
  <li>
    <p>Download the <tt>doc2html.jar</tt> from the Release folder.
    This JAR is specially compiled as an all-in-one jar, thanks to
      the immensely useful
      <a href="http://one-jar.sourceforge.net/">One-JAR&trade;</a>
      project by Simon Tuffs. <i>Thank you, Simon, for this contribution to the open
    source community!</i>
  </li>
  <li>
    <p><tt>java -jar doc2html.jar -f MyDocument.docx -o MyOutput.html</tt>
  </li>
  <h2>Using the <b><tt>.exe</tt></b> version of doc2html</h2>
  <p>This is a Windows executable
    and it relies on having a Java engine available.
    I am not certain what version of Java it requires &mdash; quite possibly Java 11, but
    it <i>might</i> function with lower versions. I have tested it only with Java 11.
  </p>
  <p>The <tt>doc2html.exe</tt> file is available from the <tt>Releases</tt> folder.</p>
  <p>I am looking into something that will not require Java on Windows, but the program <i>is</i>
  written in Java, after all.
  </p>


<h2>Building this Project</h2>
  <p>Building this project and the OneJar release jar is a two-step process (albeit big steps).</p>
  <h3>Step One <i>build project</i></h3>
  <p>Build the project. This is most easily done in Jetbrains IntelliJ, as all
  the IDE configuration files and project files are committed as part of the project. There is
  nothing magic about Intellij, however, other than its putting all the required
    JAR files (and their dependencies) in the <tt>&hellip;/out/artifacts/doc2html_jar</tt>
  directory as part of building the JAR file.</p>
  <h3>Step Two <i>run makefile</i></h3>
  <p>From the command line, run&nbsp;<tt>make onejar</tt>, but double-check your
  assumptions at the door:</p>
  <ul>
    <li>The makefile assumes one is running on Windows. Rewriting it to work on *nix
    is straightforward, but left to the reader.</li>
    <li>The makefile assumes it can find two critical files <tt>d2h_MANIFEST.mf</tt>
    and <tt>one-jar-boot-0.97.jar</tt> in the OneJar directory.</li>
    <li>The makefile assumes it can create (and delete)
      a temporary directory <tt>root</tt>
    in the project directory, and it will assemble the project there</li>
    <li>The makefile assumes it has access to <tt>7z</tt> (to extract files from a
      jar file, and no, JAR is too limited to do this).</li>
    <li>The makefile assumes it has access to the <tt>jar</tt>
      command to reassemble the jar file. Using 7z was tempting
    from a performance perspective, but I am not certain that
    7z will stay within the rigid limits of the JAR file format
      (also, the <tt>JAR</tt> command makes putting
      the manifest in easier).</li>
    <li>This uses GNU <tt>make</tt>.
      Another flavor of <tt>make</tt>
    might not behave similarly. It <i>probably</i> will,
    but this is a list of assumptions, safe and otherwise.</li>
  </ul>
</ul>
     <h4>Building the DOC2HTML executable <i>not a step, no explicit directions provided here</i></h4>
<p>The project was written and compiled with GraalVM-CE, and uses the GraalVM native-image
command to transform the One-JAR&trade; file into an executable, albeit an executable that
still requires Java (this makes sense, as GraalVM&rsquo;s handling of reflection is
  admitted to be iffy, and One-JAR&trade; depends on reflection to get its custom class loader
  working. I&rsquo;m pleased it works at all, given the reservations that the GraalVM
  project team expresses about the use of reflection).
<p>Building a native-image directly from the initial JAR and the JAR files it requires on its
     CLASSPATH should be a simpler exercise &mdash; but it fails with an error message about
     a class being missing <i>that is present</i> in the <tt>org.apache.tika.parsers&hellipjar</tt>.
     This may be another, subtler, manifestation of a reflection problem, as Tika attempts to
     ignore parsers that aren&rsquo;t present (more reflection).</p>
</body>
</html>

doc2html's People

Contributors

nathanverrilli avatar

Watchers

 avatar

doc2html's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.