hazyresearch / fonduer

A knowledge base construction engine for richly formatted data

Home Page: https://fonduer.readthedocs.io/

License: MIT License

Python 73.91% Shell 0.05% Makefile 0.06% HTML 25.86% Dockerfile 0.12%
multimodality machine-learning knowledge-base-construction

fonduer's Introduction

Fonduer


Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data.

Note that Fonduer is still actively under development, so feedback and contributions are welcome. Submit bugs in the Issues section or feel free to submit your contributions as a pull request.

Getting Started

Check out our Getting Started Guide to get up and running with Fonduer.

Learning how to use Fonduer

The Fonduer tutorials cover the Fonduer workflow, showing how to extract relations from hardware datasheets and scientific literature.

Reference

Fonduer: Knowledge Base Construction from Richly Formatted Data (blog):

@inproceedings{wu2018fonduer,
  title={Fonduer: Knowledge Base Construction from Richly Formatted Data},
  author={Wu, Sen and Hsiao, Luke and Cheng, Xiao and Hancock, Braden and Rekatsinas, Theodoros and Levis, Philip and R{\'e}, Christopher},
  booktitle={Proceedings of the 2018 International Conference on Management of Data},
  pages={1301--1316},
  year={2018},
  organization={ACM}
}

Acknowledgements

Fonduer leverages the work of Emmental and Snorkel.

fonduer's People

Contributors

annelhote, bhancock8, hiromuhota, j-rausch, kaikun213, lilacpps, lukehsiao, nicholaschiang, payalbajaj, senwu, wajdikhattel, yasushimiyata


fonduer's Issues

Insufficient checking for missing PDFs in parser

Describe the bug
Insufficient missing_pdf checking in parser.py

# Add visual attributes
filename = self.pdf_path + document.name
missing_pdf = (
    not os.path.isfile(self.pdf_path)
    and not os.path.isfile(filename + ".pdf")
    and not os.path.isfile(filename + ".PDF")
    and not os.path.isfile(filename)
)
if missing_pdf:
    logger.error("Visual parsing failed: pdf files are required")

For example, this misses the case where pdf_path points to an existing file that is HTML rather than a PDF.

Expected behavior
If a user is using HTML as input, but has visual parsing enabled, they should get an error describing that PDFs are missing.

Additional context
This could cause users who are not using PDFs as input to hit errors during a visual parse without a useful error message indicating that a PDF is missing. See HazyResearch/fonduer-tutorials#12.
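
A minimal sketch of a stricter check, assuming the same pdf_path and document.name values as the snippet above (the helper names are hypothetical):

import os

def looks_like_pdf(path):
    # Check the magic bytes instead of trusting the extension: a real PDF
    # starts with b"%PDF", so an existing-but-HTML file is rejected.
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"%PDF"
    except OSError:
        return False

def resolve_pdf(pdf_path, doc_name):
    # Return a usable PDF path for this document, or None if missing.
    candidates = [
        pdf_path,  # pdf_path may point directly at a single file
        pdf_path + doc_name + ".pdf",
        pdf_path + doc_name + ".PDF",
        pdf_path + doc_name,
    ]
    for path in candidates:
        if os.path.isfile(path) and looks_like_pdf(path):
            return path
    return None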

Setup mailing list for discussions

Rather than keeping technical discussion about Fonduer stuck in GitHub Issues, we should use a mailing list (more permanence and searchability).

Make fonduer a pip-installable package

Once we have cleaned up the environment vars and dependencies, there should be nothing stopping us from improving usability and making fonduer a pip-installable package.

Transistor Image Tutorial - UnicodeEncodeError from CorpusParser

Hello,

First off, thanks for this interesting library. I have only gone through it a little bit but am quite excited to explore its full capabilities.

I am going through the transistor_image_tutorial at the moment. Running the lines:

corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path, flatten=[])
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

I get the following UnicodeEncodeError.

(Screenshot: UnicodeEncodeError traceback from the transistor_image_tutorial.)

As a result, when I execute the next line, I get

Documents: 0
Sentences: 0
Figures: 0

I was wondering if there is a workaround for this or if I needed to do something first to avoid the error and get the same result as the original transistor_image_tutorial file. Thank you.

Integrate new parser to support pdftotree output

TODO:

  • Create simple, verifiable test data for unit testing the parser
  • Get numbers from old parser for comparison
  • Test current version of new parser
  • Rewrite to fix the mismatches. That is, build the document model and run each paragraph through spaCy for phrases, rather than chunking the whole document as was done for CoreNLP.

Switch to spaCy as the default parser

Support using spaCy as the lingual parser for the old parser (i.e. the one that does not support pdftotree output).

TODO:

  • Upgrade to spaCy 2.x (#9)
  • Compare features pre and post spaCy
  • Check the visual linker for mismatches. Update: see #12. However, it looks like we don't have unicode issues.

Document Preprocessor for PDF Documents

Hello,

I was wondering if there is a document preprocessor for PDF documents. I tried using the DocPreprocessor class, but it looks like the parse_file function has not been built out yet. In the tutorial notebooks, it looks like HTML files and PDF files are processed together, but I was wondering how I would go about parsing sentences for PDF files only. If there is some documentation I should be referring to, please let me know. Thanks.
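
One possible workaround (not an official API; import paths and the pdftotree.parse signature vary by version) is to convert each PDF to HTML with pdftotree first, then feed the HTML to HTMLDocPreprocessor while keeping the PDFs around for visual parsing:

import os
import pdftotree
from fonduer.parser.preprocessors import HTMLDocPreprocessor

pdf_dir, html_dir = "data/pdf/", "data/html/"
for name in os.listdir(pdf_dir):
    if name.lower().endswith(".pdf"):
        out = os.path.join(html_dir, os.path.splitext(name)[0] + ".html")
        pdftotree.parse(os.path.join(pdf_dir, name), html_path=out)

doc_preprocessor = HTMLDocPreprocessor(html_dir)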

Fonduer max_storage_temp_tutorial error while parsing html files

To Reproduce

While doing max_storage_temp_tutorial tutorial in the attached Jupyter notebook, I get an error while trying to execute the following code:

corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

Expected behavior
To complete parsing without any issue

Error Logs/Screenshots
The following is the error that I got:

UnicodeEncodeError: 'ascii' codec can't encode character '\uf0b7' in position 6282: ordinal not in range(128)

The following is the complete error stacktrace

[INFO] fonduer.utils.udf - Clearing existing...
[INFO] fonduer.utils.udf - Running UDF...
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<timed eval> in <module>()

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/fonduer/utils/udf.py in apply(self, xs, clear, parallelism, progress_bar, count, **kwargs)
     48         self.logger.info("Running UDF...")
     49         if parallelism is None or parallelism < 2:
---> 50             self.apply_st(xs, progress_bar, clear=clear, count=count, **kwargs)
     51         else:
     52             self.apply_mt(xs, parallelism, clear=clear, **kwargs)

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/fonduer/utils/udf.py in apply_st(self, xs, progress_bar, count, **kwargs)
     81 
     82         # Commit session and close progress bar if applicable
---> 83         udf.session.commit()
     84         if pb:
     85             pb.close()

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in commit(self)
    941                 raise sa_exc.InvalidRequestError("No transaction is begun.")
    942 
--> 943         self.transaction.commit()
    944 
    945     def prepare(self):

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in commit(self)
    465         self._assert_active(prepared_ok=True)
    466         if self._state is not PREPARED:
--> 467             self._prepare_impl()
    468 
    469         if self._parent is None or self.nested:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in _prepare_impl(self)
    445                 if self.session._is_clean():
    446                     break
--> 447                 self.session.flush()
    448             else:
    449                 raise exc.FlushError(

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in flush(self, objects)
   2252         try:
   2253             self._flushing = True
-> 2254             self._flush(objects)
   2255         finally:
   2256             self._flushing = False

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in _flush(self, objects)
   2378         except:
   2379             with util.safe_reraise():
-> 2380                 transaction.rollback(_capture_exception=True)
   2381 
   2382     def bulk_save_objects(

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     64             self._exc_info = None   # remove potential circular references
     65             if not self.warn_only:
---> 66                 compat.reraise(exc_type, exc_value, exc_tb)
     67         else:
     68             if not compat.py3k and self._exc_info and self._exc_info[1]:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    247         if value.__traceback__ is not tb:
    248             raise value.with_traceback(tb)
--> 249         raise value
    250 
    251 else:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/session.py in _flush(self, objects)
   2342             self._warn_on_events = True
   2343             try:
-> 2344                 flush_context.execute()
   2345             finally:
   2346                 self._warn_on_events = False

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py in execute(self)
    384                 while set_:
    385                     n = set_.pop()
--> 386                     n.execute_aggregate(self, set_)
    387         else:
    388             for rec in topological.sort(

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py in execute_aggregate(self, uow, recs)
    666                              [self.state] +
    667                              [r.state for r in our_recs],
--> 668                              uow)
    669 
    670     def __repr__(self):

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py in save_obj(base_mapper, states, uowtransaction, single)
    179         _emit_insert_statements(base_mapper, uowtransaction,
    180                                 cached_connections,
--> 181                                 mapper, table, insert)
    182 
    183     _finalize_insert_update_commands(

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py in _emit_insert_statements(base_mapper, uowtransaction, cached_connections, mapper, table, insert, bookkeeping)
    828 
    829             c = cached_connections[connection].\
--> 830                 execute(statement, multiparams)
    831 
    832             if bookkeeping:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object, *multiparams, **params)
    946             raise exc.ObjectNotExecutableError(object)
    947         else:
--> 948             return meth(self, multiparams, params)
    949 
    950     def _execute_function(self, func, multiparams, params):

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
    267     def _execute_on_connection(self, connection, multiparams, params):
    268         if self.supports_execution:
--> 269             return connection._execute_clauseelement(self, multiparams, params)
    270         else:
    271             raise exc.ObjectNotExecutableError(self)

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
   1058             compiled_sql,
   1059             distilled_params,
-> 1060             compiled_sql, distilled_params
   1061         )
   1062         if self._has_events or self.engine._has_events:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1198                 parameters,
   1199                 cursor,
-> 1200                 context)
   1201 
   1202         if self._has_events or self.engine._has_events:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1414                 )
   1415             else:
-> 1416                 util.reraise(*exc_info)
   1417 
   1418         finally:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    247         if value.__traceback__ is not tb:
    248             raise value.with_traceback(tb)
--> 249         raise value
    250 
    251 else:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1168                         statement,
   1169                         parameters,
-> 1170                         context)
   1171             elif not parameters and context.no_parameters:
   1172                 if self.dialect._has_events:

~/anaconda3/envs/fonduer/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py in do_executemany(self, cursor, statement, parameters, context)
    681             extras.execute_batch(cursor, statement, parameters)
    682         else:
--> 683             cursor.executemany(statement, parameters)
    684 
    685     @util.memoized_instancemethod

UnicodeEncodeError: 'ascii' codec can't encode character '\uf0b7' in position 6282: ordinal not in range(128)

Environment (please complete the following information):

  • OS: Ubuntu 16.04 (Bash on Windows)
  • PostgreSQL Version: 9.5.13
  • Poppler Utils Version: 0.41.0-0ubuntu1.7
  • Fonduer Version: 0.2.3

Additional context
I have used the corpus I downloaded using the download_data.sh script in the folder.
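
One likely culprit (an assumption, not confirmed in this thread) is a PostgreSQL database or connection using an ASCII encoding, so the insert fails when a non-ASCII character like '\uf0b7' reaches psycopg2. A minimal sketch of a check and workaround (database name illustrative):

from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/max_storage_temp")
with engine.connect() as conn:
    # A UTF-8 capable database reports UTF8 here; SQL_ASCII is trouble.
    print(conn.execute("SHOW server_encoding").scalar())

# Forcing the client encoding on the connection is one workaround:
engine = create_engine(
    "postgresql://localhost/max_storage_temp", client_encoding="utf8"
)
# If the database itself was created as SQL_ASCII, recreate it:
#   createdb max_storage_temp --encoding=UTF8 --template=template0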

Eliminate global variable use from meta.py

Right now, snorkel uses some hacky global variables in meta.py. This poses two main issues:

  1. It forces the user to set the SNORKELDB environment variable before importing
  2. It executes code merely by being imported. For example, just importing fonduer will immediately create a snorkel.db file. We should say no to import side-effects.

Perhaps this should be some sort of Session class instead, with attributes that are accessed when needed, rather than using these global variables throughout the codebase.
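
A rough sketch of what such a Session class could look like (names and parameters are illustrative, not an existing fonduer API):

class FonduerSession:
    # Holds connection state explicitly instead of module-level globals.
    # Nothing touches the database at import time; the engine is created
    # lazily on first use.

    def __init__(self, conn_string):
        self.conn_string = conn_string
        self._engine = None

    @property
    def engine(self):
        if self._engine is None:
            from sqlalchemy import create_engine
            self._engine = create_engine(self.conn_string)
        return self._engine

# The user passes the connection string explicitly, so no SNORKELDB
# environment variable is read (and no snorkel.db created) at import time.
session = FonduerSession("postgresql://localhost/fonduer")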

Using Oracle in place of Postgres for fonduer

Is your feature request related to a problem? Please describe.
I have an Oracle instance and want to use it instead of Postgres.

Describe the solution you'd like
What changes are required to make the switch, and is there an existing config or notes that can help make the change? From my understanding, SQLAlchemy supports both, and we just have to make sure any Postgres-specific initializations/imports are handled.
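
In principle the swap is just a different SQLAlchemy URL (a sketch with illustrative credentials, assuming the cx_Oracle driver is installed); the real work is the Postgres-specific pieces, e.g. the ARRAY columns used by the phrase table, which Oracle has no direct equivalent for:

from sqlalchemy import create_engine

# PostgreSQL (current default)
pg_engine = create_engine("postgresql://user:pass@localhost:5432/fonduer")

# Oracle via the cx_Oracle dialect
ora_engine = create_engine(
    "oracle+cx_oracle://user:pass@host:1521/?service_name=ORCL"
)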

Remove phantomjs blob from git repo

We have a leftover commit from migrating from the snorkel repo containing a reference to the phantomjs binary that someone accidentally committed.

$ git-sizer 
Processing blobs: 10705                        
Processing trees: 13264                        
Processing commits: 4381                        
Matching commits to trees: 4381                        
Processing annotated tags: 1                        
Processing references: 70                        
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  2.83 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |  64.8 MiB | ******                         |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum tag depth      [3] |     1     | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Maximum path length    [4] |   119 B   | *                              |

[1]  d23d5cc63594e508c00ec026a2c85458f032a059 (refs/remotes/origin/newftrs:examples/old/gene_phen_relation_example/data)
[2]  d72e801ce9c8681a006994da60b76a516f7f1853 (refs/remotes/origin/fonduer_parser:snorkel/contrib/fonduer/phantomjs/bin/phantomjs)
[3]  0c6fa1639c65cd73d81e72c47db5b4c826b21008 (refs/tags/v0.4.alpha)
[4]  22ffd4e03fca0fcdb79add2578d8432e2075803e (8a3bf199abb2b05aba34447a7c5b3286cafb964a^{tree})

We should get rid of that blob completely from the repo, most likely using BFG.

Refactor codebase into submodules for each pipeline phase

Currently, Fonduer is kind of a monolithic package, where all database tables are created on init. In order to make development easier, we want to split Fonduer into submodules, each of which handles a single task in the pipeline:

  1. parser
  2. candidates
  3. featurization
  4. supervision
  5. learning
  6. utils

Database tables will only be created in the initialization of each of these independent modules.

TODO:

  • Reorganize files as is, just fixing imports and setting up new directories
  • Use all absolute imports
  • Split up the initialization performed by Meta into the respective pipeline phases
  • Define each phase's API. Expose submodules, rather than all functions, in the root fonduer package
  • Simplify the source files where possible

Separate workers for parsing and database insertions

Is your feature request related to a problem? Please describe.
Decouple UDF processes from the backend/database session.
Right now, when we run UDFRunner.apply_mt(), we create a number of UDF worker processes. These processes all own an sqlalchemy Session object and add/commit to the database at the end of their respective parsing loop.

Describe the solution you'd like
Make the UDF processes backend-agnostic, e.g. by having a set of separate BackendWorker processes handle the insertion of sentences. One possible way: connect the output_queue of UDF to the input of the BackendWorker processes, which receive Sentence lists and handle the sqlalchemy commits.

This will not fully decouple UDF from the backend, because the parser returns sqlalchemy-specific Sentence objects, but it could be one step towards that goal.

Additional context
This feature request refers to decoupling of parsing and backend.
There's likely more coupling with the backend later in the processing pipeline.
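
A minimal sketch of the proposed decoupling, assuming UDF workers put parsed Sentence lists on a queue (names here are illustrative, not the existing UDFRunner API):

import multiprocessing as mp

def backend_worker(output_queue, conn_string):
    # Owns the only sqlalchemy Session; UDF workers never touch the DB.
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    session = sessionmaker(bind=create_engine(conn_string))()
    while True:
        sentences = output_queue.get()
        if sentences is None:  # sentinel: no more work
            break
        session.add_all(sentences)
        session.commit()

# Wiring: each UDF worker calls output_queue.put(sentence_list); one (or
# a few) BackendWorker processes drain the queue and do all inserts.
output_queue = mp.Queue()
writer = mp.Process(
    target=backend_worker,
    args=(output_queue, "postgresql://localhost/fonduer"),
)
writer.start()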

Error with fontconfig during poppler installation

If you get errors during poppler installation with pkg-config or fontconfig, make sure that you have the following two packages installed on your system.

sudo apt-get install pkg-config libfontconfig1-dev

Move all snorkel code directly into Fonduer.

Remove snorkel subdirectory.

The motivation here is that the line between importing from snorkel or fonduer is a little blurry, and kind of unnecessary. It may simplify the code if we just absorb all the snorkel files into fonduer directly.

Add data model, matchers, preprocessors to docs

We need better docs for the data model. For example, given a candidate in the lf_helpers, how do I view the sentence that one of its mentions is in? What attributes does a Span have? These are frequent questions that can be answered by reading the code directly, but it would be much more user-friendly if we had docs for them.

Related: #32

Host documentation on readthedocs

There are a few parts to this task.

  • Setup readthedocs so that documentation is auto built
  • Fix the import errors with readthedocs
  • Go through the code and improve docstrings throughout (this will be ongoing...)

The first part is easy; the second will take time.

[Error]CalledProcessError in visual.py

Hello,
In fonduer-tutorials, after running cell:

%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

I got:

  File "/home/hagen/git/fonduer/fonduer/visual.py", line 59, in extract_pdf_words
    shell=True)
......
subprocess.CalledProcessError: Command 'pdftotext -f 1 -l 1 -bbox-layout 'data/pdf/DISES00616-1.pdf' -' returned non-zero exit status 99.

When I add a space between '-bbox' and '-layout' in line 59 of visual.py:

html_content = subprocess.check_output(
    "pdftotext -f {} -l {} -bbox -layout '{}' -".format(
        str(i), str(i), self.pdf_file),
    shell=True)

I then repeatedly got:

RuntimeError: Words could not be extracted from PDF: data/pdf/DISES00616-1.pdf

Best regards.

Add japanese tokenization support

Is your feature request related to a problem? Please describe.
My documents are written in Japanese, which is not supported by spaCy and hence not by Fonduer.

Describe the solution you'd like
According to spaCy, tokenization of Japanese and other languages has alpha support.
Please support these languages if tokenization alone is more useful to Fonduer than nothing.

Can Fonduer extract arbitrary tabular data?

I have a requirement where a table has three columns: the first column holds each row's main header, and the second and third columns hold that row's respective values (the second and third columns need a small regular-expression function to filter digits, etc.). All data are text.

I want Fonduer to consider each row as one candidate. How can I do that?
In some cases a row has multiple paragraphs (sub-rows); there I want Fonduer to give me each paragraph in that row as a separate row, with the main header (paragraph) concatenated to each sub-row.
Can Fonduer do this type of task?

While trying to understand the Fonduer flow, I found that it stores data as phrases in the "Phrase" table, keeping full track of them. But how will this help solve the issue above?

Thanks in advance.
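
Not an official answer, but one plausible direction in the Fonduer of this era is to combine a matcher over the first column with a regex matcher over the value columns, using the row/column metadata the Phrase table already tracks (a sketch; import paths and attribute names, e.g. span.sentence vs. span.parent, vary by version):

from fonduer import candidate_subclass
from fonduer.matchers import LambdaFunctionMatcher, RegexMatchSpan

# A binary relation per row: (row header, value in the same row).
RowValue = candidate_subclass("RowValue", ["header", "value"])

def in_first_column(span):
    phrase = span.sentence  # the Phrase this span comes from
    return phrase.cell is not None and phrase.col_start == 0

header_matcher = LambdaFunctionMatcher(func=in_first_column)

# Digit-bearing spans for the second/third columns.
value_matcher = RegexMatchSpan(rgx=r"\d+(\.\d+)?", longest_match_only=True)

# A throttler can then keep only pairs whose phrases share row_start,
# which effectively yields one candidate per row.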

psycopg2 error in python 3.6

On the fonduer_parser branch, Travis-CI is failing in Python 3.6. Specifically, build number 16.

If you look at those logs, python2 and python3.5 are failing, but for an unrelated reason (an assertion is failing). However, python3.6 is failing due to a ProgrammingError:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 object at 0x7fc90e7cc6a0>
cursor = <cursor object at 0x7fc8e2189ce0; closed: -1>
statement = 'INSERT INTO phrase (lemmas, pos_tags, ner_tags, dep_parents, dep_labels, row_start, row_end, col_start, col_end, posi...xt)s, %(words)s, %(char_offsets)s, %(entity_cids)s, %(entity_types)s, %(abs_char_offsets)s, %(table_id)s, %(cell_id)s)'
parameters = ({'abs_char_offsets': [3726, 3733, 3741, 3745, 3755, 3769, ...], 'bottom': [[207.15112578]], 'cell_id': 4852, 'char_of...m': [-inf, -inf, -inf, -inf, -inf, -inf, ...], 'cell_id': 5269, 'char_offsets': [0, 4, 10, 16, 17, 27, ...], ...}, ...)
context = <sqlalchemy.dialects.postgresql.psycopg2.PGExecutionContext_psycopg2 object at 0x7fc8df4a6208>
    def do_executemany(self, cursor, statement, parameters, context=None):
        if self.psycopg2_batch_mode:
            extras = self._psycopg2_extras()
            extras.execute_batch(cursor, statement, parameters)
        else:
>           cursor.executemany(statement, parameters)
E           psycopg2.ProgrammingError: ARRAY types double precision and numeric[] cannot be matched
E           LINE 1: ...ty'::float, 'Infinity'::float, 'Infinity'::float, ARRAY[364....
E                                                                        ^
../../../virtualenv/python3.6.3/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py:683: ProgrammingError
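
The error message suggests some rows mix 'Infinity' floats with plain integer coordinates, so psycopg2 renders inconsistent array literals (double precision vs. numeric[]). One defensive fix (a sketch, not the actual patch) is to coerce every visual coordinate array to floats before insertion:

def as_float_array(values):
    # Ensures psycopg2 renders a consistent double precision[] literal,
    # instead of ARRAY[364, ...] (numeric[]) next to 'Infinity'::float.
    return None if values is None else [float(v) for v in values]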

Improving parser performance

The current Parser takes quite a bit of time to process a large corpus of documents (~25min for 100 PDF datasheets on a consumer laptop). It would be nice to see what can be done to improve performance.

Profiling Setup

I did some quick profiling of the e2e hardware tutorial using Fonduer v0.2.3, with two modifications:

max_docs = 10

and using only a single thread:

import cProfile

corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
cProfile.runctx('corpus_parser.apply(doc_preprocessor, parallelism=1)', globals(), locals(), 'profile.prof')
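
The tables below were then read out of the saved profile, roughly as follows (standard-library pstats; a sketch of the calls, not the exact session):

import pstats

stats = pstats.Stats("profile.prof")
stats.sort_stats("cumulative").print_stats(50)  # Top 50 Cumulative Time
stats.sort_stats("tottime").print_stats(50)     # Top 50 Total Time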

Top 50 Cumulative Time

   List reduced from 1834 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000  127.815  127.815 {built-in method builtins.exec}
      2/1    0.001    0.000  127.801  127.801 /home/lwhsiao/repos/fonduer/fonduer/utils/udf.py:31(apply)
      2/1    0.000    0.000  127.781  127.781 /home/lwhsiao/repos/fonduer/fonduer/utils/udf.py:57(apply_st)
        2    0.001    0.000  126.855   63.428 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py:909(commit)
      3/2    0.000    0.000  126.855   63.427 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py:464(commit)
      3/2    0.000    0.000  126.634   63.317 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py:433(_prepare_impl)
        8    0.077    0.010  126.634   15.829 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py:2220(flush)
        1    0.029    0.029  126.556  126.556 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py:2271(_flush)
        1    0.014    0.014  126.068  126.068 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py:369(execute)
        6    0.001    0.000  119.647   19.941 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py:658(execute_aggregate)
        6    0.033    0.005  119.642   19.940 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py:131(save_obj)
       78    1.160    0.015  119.413    1.531 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py:799(_emit_insert_statements)
    16198    0.142    0.000  113.913    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py:882(execute)
    16192    0.102    0.000  113.730    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/sql/elements.py:267(_execute_on_connection)
    16192    0.686    0.000  113.628    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1016(_execute_clauseelement)
    16198    0.908    0.000  112.330    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1111(_execute_context)
21342/18958    0.118    0.000   80.031    0.004 /home/lwhsiao/repos/fonduer/fonduer/parser/parser.py:550(_parse_node)
11230/5902    0.204    0.000   79.332    0.013 /home/lwhsiao/repos/fonduer/fonduer/parser/parser.py:570(parse)
    18958    0.088    0.000   78.291    0.004 /home/lwhsiao/repos/fonduer/fonduer/parser/parser.py:419(_parse_paragraph)
    11227    0.428    0.000   67.047    0.006 /home/lwhsiao/repos/fonduer/fonduer/parser/parser.py:322(_parse_sentence)
        9    0.000    0.000   57.452    6.384 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py:678(do_executemany)
        9   57.031    6.337   57.452    6.384 {method 'executemany' of 'psycopg2.extensions.cursor' objects}
    10670    0.324    0.000   50.675    0.005 nn_parser.pyx:326(__call__)
    10670    0.088    0.000   47.024    0.004 nn_parser.pyx:727(get_batch_model)
    16193    0.074    0.000   46.940    0.003 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py:508(do_execute)
    16197   46.179    0.003   46.868    0.003 {method 'execute' of 'psycopg2.extensions.cursor' objects}
80025/16005    0.720    0.000   44.993    0.003 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:58(begin_update)
    10670    0.110    0.000   37.959    0.004 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:277(begin_update)
    58685    0.959    0.000   30.840    0.001 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/layernorm.py:50(begin_update)
    11227    0.611    0.000   24.486    0.002 /home/lwhsiao/repos/fonduer/fonduer/parser/spacy_parser.py:121(parse)
    42680    0.229    0.000   23.576    0.001 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/resnet.py:17(begin_update)
5912/5902    0.013    0.000   23.415    0.004 /home/lwhsiao/repos/fonduer/fonduer/parser/parser.py:124(apply)
     5902    0.003    0.000   23.271    0.004 /home/lwhsiao/repos/fonduer/fonduer/parser/visual_linker.py:30(parse_visual)
       10    0.012    0.001   21.981    2.198 /home/lwhsiao/repos/fonduer/fonduer/parser/visual_linker.py:47(extract_pdf_words)
    16005    0.464    0.000   20.554    0.001 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:365(uniqued_fwd)
     5335    0.051    0.000   20.365    0.004 pipeline.pyx:425(__call__)
     5335    0.073    0.000   20.089    0.004 pipeline.pyx:437(predict)
101365/10670    0.224    0.000   20.016    0.002 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/model.py:155(__call__)
    10670    0.106    0.000   19.680    0.002 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:291(predict)
       74    0.005    0.000   19.545    0.264 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/bs4/__init__.py:87(__init__)
    58685    1.543    0.000   19.542    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/maxout.py:66(begin_update)
32010/5335    0.183    0.000   18.830    0.004 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:53(predict)
    85360   17.542    0.000   17.542    0.000 ops.pyx:333(batch_dot)
      128    0.001    0.000   15.675    0.122 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/bs4/builder/_htmlparser.py:192(prepare_markup)
       64    0.003    0.000   15.673    0.245 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/bs4/dammit.py:344(__init__)
      128    0.001    0.000   15.662    0.122 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/bs4/dammit.py:240(encodings)
       64    0.001    0.000   15.656    0.245 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/bs4/dammit.py:33(chardet_dammit)
       64    0.003    0.000   15.654    0.245 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/__init__.py:24(detect)
       64    0.003    0.000   15.635    0.244 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/universaldetector.py:111(feed)
    10670    2.077    0.000   14.986    0.001 nn_parser.pyx:387(parse_batch)

Top 50 Total Time

   List reduced from 1834 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9   57.031    6.337   57.452    6.384 {method 'executemany' of 'psycopg2.extensions.cursor' objects}
    16197   46.179    0.003   46.868    0.003 {method 'execute' of 'psycopg2.extensions.cursor' objects}
    85360   17.542    0.000   17.542    0.000 ops.pyx:333(batch_dot)
   901615    7.821    0.000    7.821    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    10670    6.399    0.001    6.399    0.001 {built-in method numpy.core.multiarray.dot}
  4904371    3.915    0.000    3.915    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/codingstatemachine.py:66(next_state)
   586850    3.229    0.000    8.996    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2456(prod)
    80025    2.345    0.000    3.893    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/numpy/core/_methods.py:86(_var)
      808    2.241    0.003    2.241    0.003 {method 'findall' of '_sre.SRE_Pattern' objects}
       62    2.096    0.034    4.827    0.078 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/utf8prober.py:57(feed)
    10670    2.077    0.000   14.986    0.001 nn_parser.pyx:387(parse_batch)
    64020    1.875    0.000    6.569    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/hash_embed.py:48(begin_update)
       62    1.820    0.029    2.005    0.032 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/charsetprober.py:103(filter_with_english_letters)
   458810    1.725    0.000    9.098    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/mem.py:28(__getitem__)
    58685    1.543    0.000   19.542    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/maxout.py:66(begin_update)
       50    1.488    0.030    2.805    0.056 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/mbcharsetprober.py:61(feed)
    16192    1.446    0.000    3.478    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py:595(_init_compiled)
    80025    1.425    0.000    1.443    0.000 ops.pyx:419(maxout)
   458810    1.313    0.000   10.790    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/describe.py:35(__get__)
   631016    1.310    0.000    1.310    0.000 {built-in method numpy.core.multiarray.array}
      868    1.206    0.001    3.469    0.004 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/sbcharsetprober.py:77(feed)
   108507    1.205    0.000    2.466    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/sync.py:16(populate)
    80025    1.194    0.000    1.194    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/layernorm.py:102(_forward)
       78    1.160    0.015  119.413    1.531 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py:799(_emit_insert_statements)
    80025    1.141    0.000    2.187    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/numpy/core/_methods.py:53(_mean)
   186725    1.087    0.000    1.166    0.000 ops.pyx:168(asarray)
    32440    0.975    0.000    1.746    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py:380(_collect_insert_commands)
    58685    0.959    0.000   30.840    0.001 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/layernorm.py:50(begin_update)
       94    0.956    0.010    0.956    0.010 {method 'read' of '_io.BufferedReader' objects}
    16198    0.908    0.000  112.330    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1111(_execute_context)
    64020    0.908    0.000    2.544    0.000 ops.pyx:452(seq2col)
   736230    0.801    0.000    0.801    0.000 {method 'reshape' of 'numpy.ndarray' objects}
   128040    0.775    0.000    2.938    0.000 ops.pyx:158(allocate)
   298389    0.772    0.000    1.465    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/attributes.py:699(set)
       74    0.757    0.010    0.757    0.010 {built-in method posix.read}
    58685    0.756    0.000    3.397    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/layernorm.py:70(_begin_update_scale_shift)
80025/16005    0.720    0.000   44.993    0.003 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/api.py:58(begin_update)
   311497    0.708    0.000    2.042    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py:193(get_attribute_history)
    80025    0.689    0.000    7.659    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/thinc/neural/_classes/layernorm.py:81(_get_moments)
    16192    0.686    0.000  113.628    0.007 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1016(_execute_clauseelement)
   101365    0.649    0.000    0.649    0.000 {built-in method numpy.core.multiarray.concatenate}
    32362    0.637    0.000    2.376    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py:1168(_postfetch)
1862054/1861624    0.631    0.000    0.786    0.000 {built-in method builtins.isinstance}
   108741    0.614    0.000    0.614    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    11227    0.611    0.000   24.486    0.002 /home/lwhsiao/repos/fonduer/fonduer/parser/spacy_parser.py:121(parse)
       74    0.588    0.008    0.588    0.008 {built-in method _posixsubprocess.fork_exec}
    48585    0.577    0.000    0.577    0.000 {built-in method _codecs.utf_8_decode}
519893/519599    0.549    0.000    0.586    0.000 {built-in method builtins.hasattr}
       10    0.534    0.053    1.217    0.122 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/chardet/sjisprober.py:56(feed)
   390715    0.527    0.000    0.579    0.000 /home/lwhsiao/repos/tutorials/.venv/lib/python3.6/site-packages/sqlalchemy/orm/state.py:665(_modified_event)

Clean up featurization code

There appear to be several bugs in the featurization code. Examples below.

In content_features.py:

  • We use global names like Mention, Indicator, etc. from treedlib without importing them.
  • span.parent should probably be span.sentence

In structural_features.py

  • unary_tdl_feats should be unary_strlib_feats, I believe

only parsing first page of 136 page pdf

When running the code in the fonduer-tutorials max_storage_temp_tutorial notebook to parse the pdf linked to below, only the first page is parsed.
I only get 19 sentences and all of them are from the first page.

To Reproduce
Steps to reproduce the behavior:

  1. download https://www.theice.com/publicdocs/regulatory_filings/ICUS_Rules_Clean_up.pdf .
  2. convert the pdf to html with poppler's pdftohtml.
  3. rename html file to match pdf file.
  4. run through code the same way as the fonduer-tutorials max_storage_temp_tutorial notebook.

Expected behavior
Get a list of all the sentences in the 136 page document.

Environment (please complete the following information):

  • OS: Ubuntu 16.04 running on Windows Subsystem for Linux
  • PostgreSQL Version: 10+192.pgdg16.04+1
  • Poppler Utils Version: 0.41.0-0ubuntu1.7
  • Fonduer Version: 0.2.3

Enforce match between DB and candidate encodings

Suggest putting a check in the baseline Fonduer doc preprocessor that ensures the database encoding matches the candidate text encoding; otherwise you'll end up with DB errors when running the featurizers.
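
A minimal sketch of such a check (names are illustrative), run once before parsing:

def check_encoding(session, corpus_encoding="UTF8"):
    # Fail fast if the database encoding differs from the encoding of
    # the candidate text, instead of erroring during featurization.
    db_encoding = session.execute("SHOW server_encoding").scalar()
    if db_encoding.upper() != corpus_encoding.upper():
        raise RuntimeError(
            "Database encoding %s != corpus encoding %s"
            % (db_encoding, corpus_encoding)
        )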

Switch README to RST.

The standard format for PyPI is reStructuredText, not Markdown. It would simplify things by letting us cut out the conversion using pandoc.

Generating/checking matching pdf file paths for visual linking outside of parser

Is your feature request related to a problem? Please describe.
At the moment we use HtmlDocPreprocessor to separately generate pre-processed documents that are fed into the parser.

If we want to extract visual features, we currently need a corresponding PDF file for each input document. Fetching the PDF file path currently happens inside the parser, which is initialized with a pdf_path argument. This couples the parser with input data generation. Furthermore, we can only test whether a matching PDF file exists when the ParserUDF.apply() method is called, because we have no knowledge of the HTML input files beforehand.

Describe the solution you'd like
Have a (separate) generator that handles generation and checking of the matching pdf file paths, which are fed into the parser.apply() method, e.g. parser.apply((doc,text), pdf_path, **kwargs).

Describe alternatives you've considered
Extend HtmlDocPreprocessor to return tuples of three values (doc,text,pdf_path), if a visual_linking_pdf_path is provided.

Additional context
One thing to consider is that there are also other ways of visual linking that would not require PDF files in the future.
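
A sketch of the proposed generator (the wrapper name and tuple shape are illustrative):

import os

def with_pdf_paths(doc_preprocessor, pdf_dir):
    # Yields (doc, text, pdf_path) tuples, where pdf_path is None when
    # no matching PDF exists -- checked up front, outside the parser.
    for doc, text in doc_preprocessor:
        pdf_path = os.path.join(pdf_dir, doc.name + ".pdf")
        yield doc, text, pdf_path if os.path.isfile(pdf_path) else None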

Word mismatch between HTML and PDF for visual linker

In the test md document we use an ordered HTML list, which renders as numbers in the PDF. This is causing a mismatch of words when doing the visual parse. The lists of words in that document are:

HTML PDF
Sample Sample
Markdown Markdown
This This
is is
some some
basic basic
, ,
sample sample
markdown markdown
. .
Second Second
Heading Heading
Unordered Unordered
lists lists
, ,
and and
: :
One 1
Two .
Three One
More 2
Blockquote .
And Two
bold 3
, .
italics Three
, More
and Blockquote
even And
italics bold
and ,
later italics
. ,
Even and
bold even
strikethrough italics
. and
A later
link bold
to .
somewhere Even
. strikethrough
Here .
is A
a link
table to
Name somewhere
Lunch .
order Here
Spicy is
Owes a
Joan table
saag Name
paneer Lunch
medium order
$ Spicy
11 Owes
Sally Joan
vindaloo saag
mild paneer
$ medium
14 $11
Erin Sally
lamb vindaloo
madras mild
HOT $14
$ Erin
5 lamb
Or madras
inline HOT
code $5
like Or
var inline
foo code
= like
'", 'var
bar foo
'", '=
; 'bar';
. .
Or Or
an an
image image
of of
bears bears
The The
end end
... ...

Notice that in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML.

Ideally this will be resolved when we switch to pdftotree and the new parser.

Error when copying feature updates to Postgresql database

Hello,
I am getting this error when applying the batch feature annotator:
CalledProcessError: Command 'cat /tmp/reg_topic_reg_topic_feature_0_*.tsv | psql reg_topic -p 5432 -c "COPY reg_topic_feature_updates(candidate_id, keys, values) FROM STDIN" --set=ON_ERROR_STOP=true' returned non-zero exit status 1.

This is the output I got:

[INFO] fonduer.udf - Clearing existing...
[INFO] fonduer.udf - Running UDF...
[INFO] fonduer.async_annotations - Copying reg_topic_feature_updates to postgres

Thank you in advance.

Version conflicts in dependencies

tensorboard 1.8.0 has requirement bleach==1.5.0, but you'll have bleach 2.1.3 which is incompatible.
bleach 2.1.3 has requirement html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre, but you'll have html5lib 0.9999999 which is incompatible.

bleach 2.1.3 is installed by jupyter. We also had

error: html5lib 1.0b8 is installed but html5lib==0.9999999 is required by {'tensorboard'}

which will hopefully get resolved in the next release of tensorboard [1].

Split tutorials into a separate repo

This separate repo can then simply

pip install fonduer

(once we get that working) and it will be a nice clean boilerplate for future fonduer apps.

New parser needs to handle angle brackets

With the INFNS19372-1.pdf document, after passing through pdftotree we have the following issue.

<section_header char='  F o r   f u r t h e r   i n f o r m a t i o n   o n t e
c h n o l o g y ,   d e l i v e r y   t e r m s   a n d   c o n d i t i o n s
a n d   p r i c e s , p l e a s e   c o n t a c t   t h e   n e a r e s t   I n
f i n e o n   T e c h n o l o g i e s   O f f i c e   ( < w w w . i n f i n e o
    n . c o m > ) . ' , [leaving off coordinates for brevity...] '>F or further
information on technology, delivery terms and conditions and prices, please
contact the nearest Infineon Technologies Office ( <www.infineon.com>).
</section_header>

This <www.infineon.com> is being parsed as an html_tag, rather than as part of the content.

As a related issue, we are currently swapping all ' for " in pdftotree, and swapping those back in the new parser. In both these cases, we want to treat these as normal strings, but because we're using standard parsing tools these can be problematic.

Should we instead just HTML escape these values? E.g., to &lt;, &gt;, &quot;?

[1] https://stackoverflow.com/questions/44430571/how-to-get-text-in-angle-brackets-with-lxml-or-bs
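
For reference, the standard library already round-trips this, which is what makes escaping attractive (a sketch, not the actual pdftotree change):

import html

s = "(<www.infineon.com>) and a 'quoted' value"
escaped = html.escape(s, quote=True)  # -> &lt;...&gt; and &#x27;...&#x27;
assert html.unescape(escaped) == s    # lossless round trip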

Move BatchFeatureAnnotator

Currently this is in async_annotators within the supervision directory. It should be split out under features.

Make fonduer a pip-installable package?

This should be very doable, as it just requires the user to install phantomjs and poppler themselves, which seems reasonable.

The other main thing would be to actually use proper import paths so we don't have a bunch of environment variables that we need to modify.

Wrong NER tag

I'm not really sure, but it looks like a bug unless I'm missing something.

Describe the bug

test_parser.py::test_parse_md_details tests NER tags as below:

assert header.ner_tags == ["ORG", "ORG"]

where header.words == ['Sample', 'Markdown'], but neither 'Sample' nor 'Markdown' is ORG (organization).

To Reproduce

You can confirm the right NER tag as follows:

import spacy
nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
print("Recognized number of NER: %d" % len(doc.ents))
"""This should print
Recognized number of NER: 3
"""

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

"""This should print  
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
"""

doc = nlp(u'Sample Markdown')
print("Recognized number of NER: %d" % len(doc.ents))
"""This should print
Recognized number of NER: 0
"""

for token in doc:
    print(token.text, token.pos_, token.dep_, token.ent_type_ if token.ent_type_ else "O")

"""This should print  
Sample NOUN compound O
Markdown PROPN ROOT O
"""

Expected behavior

The test should instead assert the following, which should pass.

assert header.ner_tags == ["O", "O"]

Environment (please complete the following information):

  • Fonduer Version: 0.3.0 (d5c1e9b)
  • spacy: 2.0.12

[Errno 32] Broken pipe for Parser in parallel execution on OSX

Hi,

In fonduer-tutorials, after running cell:

corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

whenever PARALLEL is smaller than max_docs, I get:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL.
With parallelism turned off, it works.

Best regards
