
Comments (10)

mkobos commented on June 1, 2024

I introduced the timeout facility in #20.

Here I'll present design considerations that led to this solution and some comments related to testing the solution.

Design considerations

Two options were considered when implementing the timeout functionality:

  1. separate JVM: the processing code that should be interruptible by the timeout would run in a separate JVM. That separate JVM process would be killed once the timeout deadline passed.
  2. check points: check points would be inserted into the existing processing code. Each would check whether the timeout deadline has passed and, if so, throw an exception.
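
A minimal sketch of what a check point could look like, to make option 2 concrete. This is not CERMINE's actual code: the class `TimeoutCheckPoint`, its exception type, and the `process` loop are hypothetical names invented for illustration.

```java
import java.time.Duration;

/** Hypothetical helper (not CERMINE's API): a deadline that processing code polls. */
final class TimeoutCheckPoint {
    /** Unchecked exception thrown by a check point once the deadline has passed. */
    static final class DeadlineExceededException extends RuntimeException {
        DeadlineExceededException(String message) {
            super(message);
        }
    }

    private final long deadlineNanos;

    TimeoutCheckPoint(Duration timeout) {
        // System.nanoTime() is monotonic, so wall-clock adjustments do not affect it.
        this.deadlineNanos = System.nanoTime() + timeout.toNanos();
    }

    /** The check point itself: cheap enough to call inside hot loops. */
    void check() {
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new DeadlineExceededException("timeout deadline passed");
        }
    }

    /** Stand-in for a long-running processing step with a check point in its loop. */
    static long process(int iterations, TimeoutCheckPoint timeout) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            timeout.check();
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(process(1_000, new TimeoutCheckPoint(Duration.ofSeconds(60))));
        try {
            process(1_000, new TimeoutCheckPoint(Duration.ZERO));
        } catch (DeadlineExceededException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

Because the exception propagates through ordinary Java stack unwinding, `finally` blocks and try-with-resources still run, which is what makes a clean exit possible with this approach.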

The "check points" approach was chosen, mainly because of its simplicity. Both approaches nevertheless have disadvantages:

Disadvantages of the separate JVM approach:

  • Resource issues: starting two JVMs - one corresponding to the code that the user interacts with directly and one for the processing that might be interrupted by a timeout - would require more startup time and more memory.
  • A lot of brittle boilerplate code for communication between the two JVMs would have to be created and maintained.
  • It might not be obvious which parts of the code should run in which JVM.
  • A completely clean exit from the processing code would not be possible (since the whole JVM process would be killed).
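
For comparison, here is a rough sketch of the core of the separate-JVM option using only standard `java.lang.Process` calls. The class and method names are hypothetical, and the sketch deliberately leaves out the hard part (passing inputs and results between the two JVMs); the child command in `main` is just `java -version` as a placeholder.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: run work in a child process and kill it at the deadline. */
final class SeparateJvmTimeout {
    /** Returns true if the child exited on its own, false if it was killed at the deadline. */
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // inheritIO() avoids blocking on a full output pipe that nobody reads.
        Process child = new ProcessBuilder(command).inheritIO().start();
        if (child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;
        }
        // No clean exit: the child gets no chance to release resources or flush output.
        child.destroyForcibly();
        child.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder child: the JVM that runs this program, asked for its version.
        String java = System.getProperty("java.home") + "/bin/java";
        boolean finished = runWithTimeout(Arrays.asList(java, "-version"), 60);
        System.out.println(finished ? "child finished" : "child killed");
    }
}
```

The deadline is enforced from outside, so it holds regardless of what the child is doing, but everything the child had in progress is lost when it is killed.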

Disadvantages of the check points approach:

  • It’s virtually impossible to guarantee that the program ends within a given time, for two reasons.
    • A CERMINE developer might introduce long-running code without inserting a timeout check point into it.
    • Some PDF documents may spend a lot of processing time in parts of the code that don’t have check points yet, because I ran the tests and placed the check points based on a limited sample of PDF files.
  • Maintaining the timeout functionality when new code is introduced is difficult, because in general you don't know where the program will spend most of its processing time, i.e. where the hot spots are.
    • Ideally, after adding new code you would run the program on a batch of sample PDF files with a profiler-like tool, look for hot spots in the code, and put check points inside them.
    • You would also run a tool that measures the real-life resolution of timeouts after introducing the new code. If the resolution is high, processing stops almost immediately after the given time passes, no matter what the timeout value is; if the resolution is low, processing does not stop immediately and the actual stop time might depend heavily on the timeout value. New code might lower the resolution, in which case additional check points should be introduced.
  • The resolution depends on the speed of the computer (the faster the computer, the higher the resolution) and on the complexity of the document (the more complex the document, the lower the resolution, because the various processing steps take longer).
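
To illustrate what "resolution" means here, the sketch below measures how long after the deadline a simulated job actually stops. It is only an illustration under stated assumptions: `TimeoutResolution` and `measureOvershootMillis` are hypothetical names, and `Thread.sleep` stands in for a processing step that contains no check point.

```java
/** Hypothetical sketch: measure how far past the deadline processing actually stops. */
final class TimeoutResolution {
    /**
     * Runs simulated steps of stepMillis each, with a check point between steps,
     * and returns how many milliseconds after the deadline the loop stopped.
     */
    static long measureOvershootMillis(long timeoutMillis, long stepMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) { // check point between steps
            Thread.sleep(stepMillis);              // one step without check points
        }
        return (System.nanoTime() - deadline) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        // Longer steps between check points mean a larger possible overshoot.
        for (long step : new long[] {5, 50, 200}) {
            System.out.printf("step=%d ms, overshoot=%d ms%n",
                    step, measureOvershootMillis(300, step));
        }
    }
}
```

In this model the overshoot is bounded roughly by the duration of the longest step between two check points, which is why a slow step without check points lowers the resolution.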

Testing the solution

When testing the solution in practice, I used:

  • a random sample of 100 PDF documents processed by IIS, the inference subsystem of the OpenAIRE system, and
  • a set of 13 complex PDF files. One of these files came from the IIS system; its characteristic feature is that it took very long to process using CERMINE in IIS (the file is now part of the source code: cermine-impl/src/test/resources/pl/edu/icm/cermine/tools/timeout/complex.pdf). The rest of the files were gathered by @dtkaczyk from various sources; CERMINE also takes very long to process them.

I made sure that the resolution of timeouts on these files is appropriate (i.e. that it doesn't take more than 1-2 seconds to stop the application after the timeout deadline passes). As discussed with @dtkaczyk elsewhere, it would be nice to implement tests like these as integration tests.

from cermine.

mwojnars commented on June 1, 2024

The same problem occurs occasionally in Paperity. I would suggest adding a command-line parameter to CERMINE, say, "timeout", so that CERMINE can control the execution time internally and terminate when the timeout passes. With this approach there is likely no need to use threads to control the timeout: a periodic time check in a key loop of the PDF processing routine, raising an exception when the timeout passes, should be enough.


mkobos commented on June 1, 2024

I updated the "Testing the solution" section in the above comment #7 (comment) to reflect the fact that I tested the solution on additional complex PDF files.


mkobos commented on June 1, 2024

@dtkaczyk: do you think that we can close this ticket?


mwojnars commented on June 1, 2024

Which CERMINE version contains this change, and how do I use it (is there a command-line parameter)? Does it work with the PdfNLMContentExtractor class?


mkobos commented on June 1, 2024

@mwojnars: It works with all user-facing classes, and there's a command-line parameter as you requested. I don't know about the version though - we would need to ask @dtkaczyk (the lead of the project) when she's back from leave next week.


dtkaczyk commented on June 1, 2024

@mwojnars: The timeout functionality can be found in the latest snapshot (1.10). It works with the newer ContentExtractor class, and from the command line as well. For example, the following command extracts document metadata in JATS format ("-outputs jats") and labelled textual zones ("-outputs zones"):

java -cp cermine-impl-1.10-20160717.210009-6-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/pdfs -outputs jats,zones -timeout 60

The Pdf*Extractor classes are now deprecated, and all the necessary updates and extensions will be done in ContentExtractor only.


dtkaczyk commented on June 1, 2024

@mwojnars: Did you have time to test this functionality? Our tests on large collections in the OpenAIRE system suggest it works as expected.


mwojnars commented on June 1, 2024

@dtkaczyk In the meantime we upgraded to a newer version of CERMINE, where execution times dropped to a normal several seconds, so in the end we didn't need to try the timeout. If the problem recurs, we'll check it. Thanks.


dtkaczyk commented on June 1, 2024

Thanks for the update. I am closing this issue for now; if any problems occur again, I'll reopen it.

