run-llama / llama_parse

Parse files for optimal RAG

Home Page: https://www.llamaindex.ai

License: MIT License


Introduction

LlamaParse

LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for retrieval and context augmentation with LlamaIndex frameworks.

LlamaParse directly integrates with LlamaIndex.

The free plan allows up to 1,000 pages a day. The paid plan includes 7,000 free pages per week, plus 0.3 cents per additional page.

Read below for some quickstart information, or see the full documentation.

Getting Started

First, log in and get an API key from https://cloud.llamaindex.ai.

Then, make sure you have the latest LlamaIndex version installed.

NOTE: If you are upgrading from v0.9.X, we recommend following our migration guide, as well as uninstalling your previous version first.

pip uninstall llama-index  # run this if upgrading from v0.9.x or older
pip install -U llama-index --upgrade --no-cache-dir --force-reinstall

Lastly, install the package:

pip install llama-parse

Now you can run the following to parse your first PDF file:

import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

# sync
documents = parser.load_data("./my_file.pdf")

# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])

# async
documents = await parser.aload_data("./my_file.pdf")

# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
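Each call returns a list of llama-index Document objects; a minimal sketch of inspecting them (attribute names per llama-index-core's Document):

# each Document holds the parsed text plus a metadata dict
for doc in documents:
    print(doc.metadata)
    print(doc.text[:200])  # first 200 characters of parsed content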

Using with a file object

You can parse a file object directly:

import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

with open("./my_file1.pdf", "rb") as f:
    documents = parser.load_data(f)

# you can also pass file bytes directly
with open("./my_file1.pdf", "rb") as f:
    file_bytes = f.read()
    documents = parser.load_data(file_bytes)
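One hedged caveat on the bytes path above: depending on your llama-parse version, the backend may need a file-name hint to detect the file type when given raw bytes. The extra_info parameter below is assumed from later package versions:

# hint the original file name when passing raw bytes (assumed newer-version API)
documents = parser.load_data(file_bytes, extra_info={"file_name": "my_file1.pdf"})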

Using with SimpleDirectoryReader

You can also integrate the parser as the default PDF loader in SimpleDirectoryReader:

import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

Full documentation for SimpleDirectoryReader can be found on the LlamaIndex Documentation.
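From here the parsed documents drop straight into a standard LlamaIndex pipeline. A minimal sketch using llama-index-core (assumes an LLM and embedding model are configured, e.g. the OpenAI defaults with OPENAI_API_KEY set):

from llama_index.core import VectorStoreIndex

# index the parsed documents and ask a question over them
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What are the key findings in these documents?"))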

Examples

Several end-to-end indexing examples can be found in the examples folder.

Documentation

https://docs.cloud.llamaindex.ai/

Terms of Service

See the Terms of Service here.


Issues

some information incorrectly presented as tables

First, let me say what a great service you have! Really impressive. As an early tester, I'm just doing my duty to help refine the product.

Developing_the_Home-based_Early_Intervention_Program_A_Case_Study.pdf

You can see here an example where it's just a regular paragraph, but it was output as a table:

|Family Centered Approach|is another important approach in the developmental early intervention program. In early intervention services, studying with every family individually and planning intervention programs considering family’s properties and opinions effect results positively. Primary principle of the family centered approach is composed of focusing family’s strengths, respect family’s differences and values, letting family to decide and support their authority, communicating with the family in an open and cooperative way and having a flexible approach on supplying services (Bailey, Raspa & Fox, 2012). In the current study, all of these properties are considered during the applications.|
|---|---|
|According to results and previous studies|providing the required developmental support at correct time and directing families properly is very important in early intervention studies. Brorson (2005) stated that studies that evaluated the effectiveness of early intervention mention a single point: Early intervention has a positive effect on infants and young children. When all these early intervention studies are considered, applications in this study, serves to early intervention applications/programs for 0-3 age group. In our country, there is limited number of studies especially for 0-3 age group. A study by Gul & Diken (2009) also supports this idea. They investigated postgraduate studies in Turkey and did not find any study about teaching ability to 0-3 age group children with developmental delay or disability.|
|This study is thought to be important for early intervention studies for 0-3 age group children|with/with risk of developmental delay or developmental disability. According to results of the current study, it can be suggested to design a new “Early Intervention” regulation for 0-3 age group children and their families to guarantee their rights and opportunities; to develop a new systematical early intervention program incorporated with Ministry of Health to identify and redirect children with developmental delay or disability; to plan early intervention programs for 0-3 age group children according to natural environments, support them primarily in home environment or social environments and plan their transition to required institute based programs after 3 years old; to consider, evaluate and reorganize home environment according to developmental requirements when studying with children with developmental delay or disability; to accept individualized family education applications as a part of the early intervention program, plan it to contribute family into developmental support practices/programs and consider it as a family centered approach; to study early intervention applications in cooperation with related field specialists and according to transdisciplinary approach. More studies about early intervention programs and model suggestions for 0-3 age group children with developmental disability must be done in Turkey. Families must be consulted how to support their children’s development in their natural environment, whether child has institute-based or home-based support. Appropriate home based support programs must be prepared according to family’s individual properties.|

Here is the section from the PDF itself: (screenshot)

would be nice to remove header/footer from pdf

Probably a tricky ask, but academic texts especially, and really books or any PDF, will commonly have header/footer information that just gets in the way. Especially when using it for RAG, that can alter the understanding of a chunk.

I know it slows me down in manual review.


Multilingual support?

When will multilingual support be out? I tried with a Hindi document, only to get random characters as output. Hoping you extend to other languages soon!

Formulas are not parsed correctly

When formulas are parsed, some characters like the square root sign √ are deleted. Characters that should be subscript (ₐ) or superscript (²) are not correctly positioned.

The input: (screenshot)

This is the raw text from the PDF (copy and paste):
[H3O +] Ka
2

  • � Ka
    2
    4
    = – + Ka ca

The output:
[H3O+] = – K2a + K4a2 + Ka ca

The expected output:
[H₃O⁺] = -Kₐ/2 + √{Kₐ²/4+Kₐcₐ}

The expected output was made manually by me and contains only unicode characters and no markdown or formatting information.
For example: √ (U+221A) and ₐ (U+2090)

Missing values from table

Trying to extract financial data but there seem to be missing values when using the preview.
Markdown output: (screenshot)

PDF page: (screenshot)

As indicated in the PDF page image, the file contains a total sum row for all 'Vlottende activa' (current assets), but the markdown representation of the page does not include this row.

Waited a long time but no result returned

Hi, I used the demo_api ipynb code on Colab. While trying to use LlamaParse to extract text from a PDF with a lot of unstructured content, such as figures and an irregular text layout, I did not receive any response for a long time. The file size is 1.7 MB, not a big file at all. So I want to know whether LlamaParse cannot handle such an irregular text layout, or whether there is some other problem.


ValueError: Could not extract json string from output

from llama_index.core.node_parser import MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(llm=Settings.llm, num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)
llm = mistral

ValueError: Could not extract json string from output: {
"summary": "As of December 31, 2021 and March 31, 2022, the company's assets include cash and cash equivalents, restricted cash and cash equivalents, accounts receivable, prepaid expenses, investments, equity method investments, property and equipment, operating lease right-of-use assets, and intangible assets. Liabilities include accounts payable, short-term insurance reserves, operating lease liabilities (current and non-current), long-term insurance reserves, long-term debt, and other long-term liabilities. Non-controlling interests are also reported. As of the end of the periods, total assets were $38,774 and $32,812, and total equity was $15,145 and $9,613, respectively.",
"table_title": "Balance Sheet",
"table_id": "id_16",
"columns": [
{
"col_name": "Assets",
"col_type": "string",
"summary": "As of December 31, 202

PDF too large error

Hello,

Thanks for making LlamaParse (and LlamaIndex, of course)!

I tried out parsing the content of a relatively small PDF file (~200 pages) using LlamaParse. In the extracted text file, the visual layout of the tables from the PDF looked great!

Next, I tried doing the same with a relatively large PDF file (~800 pages; ~18 MB). However, I got the following error:

Started parsing the file under job_id 77ecd2da-c950-488d-8d86-821c0425602f
Error while parsing the PDF file:  Failed to parse the PDF file: Job failed: PDF_TOO_LARGE

Is there a way to work with large (much larger) PDF files?

Thanks.

AttributeError: 'dict' object has no attribute 'json'

from llama_parse import LlamaParse
documents = LlamaParse(result_type="markdown").load_data("uber_10q_march_2022.pdf")

Output--Started parsing the file under job_id c9d970f1-76ea-4fe2-89cc-859878869740
Error while parsing the PDF file 'uber_10q_march_2022.pdf': 'dict' object has no attribute 'json'

I am using a Colab notebook; btw, it was working fine a day back.

Extracting tables and text information separately using parser

Hi Team,

We have already created a RAG pipeline using the Unstructured.io package as the PDF parser. With it, we can extract tables and text separately; based on the extracted information, I summarize the tables and text before saving to a vector DB. I would like to continue using the same pipeline with this parser, but unfortunately I don't see any option to extract the table data alone. Could you please let me know whether an option like this is planned?

No support for the .doc extension

When I install llama-parse using pip, the SUPPORTED_FILE_TYPES list in base.py (screenshot) does not include ".doc", which gives an error when I am dealing with .doc files.
I suggest changing the list to:

SUPPORTED_FILE_TYPES = [
    ".pdf",
    ".xml",
    ".doc",
    ".docx",
    ".pptx",
    ".rtf",
    ".pages",
    ".key",
    ".epub",
]

`FlagEmbeddingReranker` OSError

I have a successful end-to-end test of demo_advanced.ipynb using my own PDF files when executing the code in a Jupyter notebook environment. So I wrote my own functions to encapsulate the code and moved them to a Python file (let's call this file utility_gpt_rag.py). I call my functions within the Jupyter notebook, which still works fine, but it does not work when I call the functions from another Python file (let's call this main_document_understanding.py).

The errors mention things like being authenticated with HuggingFace and the repo being private or not. This was not covered in the demo notebook or any of the tutorials on YouTube. Did anyone encounter a similar issue when executing the code from a Python file?

This is the error traceback messages:

Traceback (most recent call last):
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\utils\_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/BBAI/bge-reranker-large/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\transformers\utils\hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\file_download.py", line 1403, in hf_hub_download
    raise head_call_error
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\file_download.py", line 1261, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\file_download.py", line 1667, in get_hf_file_metadata
    r = _request_wrapper(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\file_download.py", line 409, in _request_wrapper
    hf_raise_for_status(response)
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\huggingface_hub\utils\_errors.py", line 352, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65f41745-40a3fa9e02ab0aa4589fafb6;ad44574d-642a-4a63-aae5-17c14136388d)

Repository Not Found for url: https://huggingface.co/BBAI/bge-reranker-large/resolve/main/tokenizer_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\agpgago\data-science\document-automation-wizard\main_document_understanding.py", line 60, in <module>
    email_data = extract.extract_workflow(emails, container_name, folder_path)
  File "C:\Users\agpgago\data-science\document-automation-wizard\app\workflow_extract.py", line 81, in extract_workflow
    query_engine = llama.query_engine_w_ranker(embeddings)
  File "C:\Users\agpgago\data-science\document-automation-wizard\utility\utility_gpt_rag.py", line 58, in query_engine_w_ranker
    reranker = FlagEmbeddingReranker(top_n=2, model='BBAI/bge-reranker-large')
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\llama_index\postprocessor\flag_embedding_reranker\base.py", line 30, in __init__
    self._model = FlagReranker(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\FlagEmbedding\flag_models.py", line 133, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 767, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 600, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "C:\Users\agpgago\data-science\document-automation-wizard\.venv\lib\site-packages\transformers\utils\hub.py", line 421, in cached_file
    raise EnvironmentError(
OSError: BBAI/bge-reranker-large is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

Package versions:

FlagEmbedding @ git+https://github.com/FlagOpen/FlagEmbedding.git@5d43be378200e728ea9f32d552e7a125d8496c94
huggingface-hub==0.21.4
llama-index==0.10.19
llama-index-agent-openai==0.1.5
llama-index-cli==0.1.8
llama-index-core==0.10.19
llama-index-embeddings-openai==0.1.6
llama-index-indices-managed-llama-cloud==0.1.3
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.7
llama-index-multi-modal-llms-openai==0.1.4
llama-index-postprocessor-flag-embedding-reranker==0.1.2
llama-index-program-openai==0.1.4
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.8
llama-index-readers-llama-parse==0.1.3
llama-index-vector-stores-chroma==0.1.5
llama-parse==0.3.8
llamaindex-py-client==0.1.13
openai==1.3.9
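One hedged observation from the traceback: the model id is spelled BBAI, while bge-reranker-large is published under the BAAI organization on Hugging Face; that alone would produce exactly this 401/RepositoryNotFound. A minimal check:

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# the traceback calls FlagEmbeddingReranker(top_n=2, model='BBAI/bge-reranker-large');
# the public reranker lives under BAAI, so this id is the likely fix
reranker = FlagEmbeddingReranker(top_n=2, model="BAAI/bge-reranker-large")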

AttributeError on colab notebook?

Hi all,

The Colab notebook fails. Does anybody know why?

!pip install -U numpy -q
!pip install -U llama-index --upgrade -q
!pip install -U llama-parse --upgrade -q

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-XXXXXXXXXXXXXXXXXXXXXXXXX",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4, # if multiple files passed, split in `num_workers` API calls
    verbose=True
)

# sync
documents = parser.load_data("/content/EN_TXT.pdf")
Result:
--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-0ded22901b98> in <cell line: 4>()
      2 nest_asyncio.apply()
      3 
----> 4 from llama_parse import LlamaParse
      5 
      6 parser = LlamaParse(

14 frames
/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py in <module>
     55 IS_PYSTON = hasattr(sys, "pyston_version_info")
     56 HAS_REFCOUNT = getattr(sys, 'getrefcount', None) is not None and not IS_PYSTON
---> 57 HAS_LAPACK64 = numpy.linalg._umath_linalg._ilp64
     58 
     59 _OLD_PROMOTION = lambda: np._get_promotion_state() == 'legacy'

llama_parse import error

Bug Description

Getting the error: TypeError: 'type' object is not subscriptable when just importing llamaparse

Version

Llama Index version : 0.10.19
Llamaparse version : 0.3.8

Steps to Reproduce

Just try to import llamaparse using from llama_parse import LlamaParse

Relevant Logs/Tracebacks

File "", line 1, in
File "/home/krd/.pyenv/versions/llm/lib/python3.8/site-packages/llama_parse/init.py", line 1, in
from llama_parse.base import LlamaParse, ResultType
File "/home/krd/.pyenv/versions/llm/lib/python3.8/site-packages/llama_parse/base.py", line 111, in
class LlamaParse(BasePydanticReader):
File "/home/krd/.pyenv/versions/llm/lib/python3.8/site-packages/llama_parse/base.py", line 321, in LlamaParse
def get_images(self, json_result: list[dict], download_path: str) -> List[dict]:
TypeError: 'type' object is not subscriptable

Error while parsing the PDF file: Failed to parse the PDF file

The online preview works, but I get this error when running it in a Jupyter notebook:
Error while parsing the PDF file: Failed to parse the PDF file: {"detail":[{"loc":["body","language",0],"msg":"value is not a valid enumeration member; permitted: 'af', 'az', 'bs', 'cs', 'cy', 'da', 'de', 'en', 'es', 'et', 'fr', 'ga', 'hr', 'hu', 'id', 'is', 'it', 'ku', 'la', 'lt', 'lv', 'mi', 'ms', 'mt', 'nl', 'no', 'oc', 'pi', 'pl', 'pt', 'ro', 'rs_latin', 'sk', 'sl', 'sq', 'sv', 'sw', 'tl', 'tr', 'uz', 'vi', 'ar', 'fa', 'ug', 'ur', 'bn', 'as', 'mni', 'ru', 'rs_cyrillic', 'be', 'bg', 'uk', 'mn', 'abq', 'ady', 'kbd', 'ava', 'dar', 'inh', 'che', 'lbe', 'lez', 'tab', 'tjk', 'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom', 'sa', 'bgc', 'th', 'ch_sim', 'ch_tra', 'ja', 'ko', 'ta', 'te', 'kn'","type":"type_error.enum","ctx":{"enum_values":["af","az","bs","cs","cy","da","de","en","es","et","fr","ga","hr","hu","id","is","it","ku","la","lt","lv","mi","ms","mt","nl","no","oc","pi","pl","pt","ro","rs_latin","sk","sl","sq","sv","sw","tl","tr","uz","vi","ar","fa","ug","ur","bn","as","mni","ru","rs_cyrillic","be","bg","uk","mn","abq","ady","kbd","ava","dar","inh","che","lbe","lez","tab","tjk","hi","mr","ne","bh","mai","ang","bho","mah","sck","new","gom","sa","bgc","th","ch_sim","ch_tra","ja","ko","ta","te","kn"]}}]}
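For reference, the 422 above means the language value sent to the API was not one of the permitted enum codes. A hedged example of a valid call, per the README's language parameter:

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    language="en",  # must be one of the enum codes listed in the error
)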

Tables Not Parsing Properly

I have a table that is not parsing properly; it's missing one of the main columns.

Source image: (screenshot)

It's parsed like this:

Valuation Inputs Level 2 Other Total Market Value at 10/31/22
Level 1 Quoted Prices Significant Observable Inputs
INVESTMENTS IN SECURITIES:
ASSETS (Market Value):
Common Stocks
Financial Services $294,480 $29,700
Other Industries (a) $2,825,497
Total Common Stocks $3,119,977 $29,700
Preferred Stocks (a) $36,165
Rights (a) $100
Warrants (a) $86
U.S. Government Obligations $1,338,806
TOTAL INVESTMENTS IN SECURITIES – ASSETS $3,156,228 $1,368,606

markdown table needs <br> rather than newline

Developmental_Trauma_Disorder_A_Legacy_of_Attachment_Trauma.pdf

Correctly interprets the table, but does not correctly render it; a hand-corrected example follows the table below.

|Criterion|Subcriteria|
|---|---|
|Criterion A: Lifetime contemporaneous exposure to both types of developmental trauma|- A1: traumatic interpersonal victimization
- A2: traumatic disruption in attachment bonding with primary caregiver(s)
|
|Criterion B: Current emotion or somatic dysregulation (4 items; 3 required for DTD)|- B1: Emotion dysregulation
- B2: Somatic dysregulation
- B3: Impaired access to emotion or somatic feelings
- B4: Impaired verbal mediation of emotion or somatic feelings
|
|Criterion C: Current attentional or behavioral dysregulation (5 items; 2 required for DTD)|- C1: Attention bias toward or away from threat
- C2: Impaired self-protection
- C3: Maladaptive self-soothing
- C4: Nonsuicidal self-injury
- C5: Impaired ability to initiate or sustain goal-directed behavior
|
|Criterion D: Current relational- or self-dysregulation (6 items; 2 required for DTD)|- D1: Self-loathing or self viewed as irreparably damaged and defective
- D2: Attachment insecurity and disorganization
- D3: Betrayal-based relational schemas
- D4: Reactive verbal or physical aggression
- D5: Impaired psychological boundaries
- D6: Impaired interpersonal empathy
|
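For illustration, the same rows render correctly when the in-cell line breaks are <br> tags instead of raw newlines (hand-edited example of the first data row):

|Criterion|Subcriteria|
|---|---|
|Criterion A: Lifetime contemporaneous exposure to both types of developmental trauma|- A1: traumatic interpersonal victimization<br>- A2: traumatic disruption in attachment bonding with primary caregiver(s)|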

Thanks again!

Invalid authentication token. Even with newly generated cloud API key.

Hello Team,

Really like your work on LlamaParse. The web app is working fine for PDF parsing but not the package.
Even when a new cloud API key is used, I get the same error.

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
import os
from dotenv import load_dotenv

load_dotenv()
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_PARSE_API")

parser = LlamaParse(api_key=LLAMA_CLOUD_API_KEY, result_type='markdown')

try:
    document = parser.load_data(file_path='data/file.pdf')
except Exception as error:
    print(f'An error occurred: {error}')

Output:

Error while parsing the PDF file 'data/file.pdf': Failed to parse the PDF file: {"detail":"Invalid authentication token"}
An error occurred: Failed to parse the PDF file: {"detail":"Invalid authentication token"}
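One hedged debugging step before blaming the token itself: if python-dotenv doesn't find the .env file, or the variable name differs, os.getenv returns None, and an empty key reaches the API as an invalid token. Worth ruling out first:

# sanity check: confirm the key actually loaded from .env
assert LLAMA_CLOUD_API_KEY is not None, "LLAMA_PARSE_API not found in environment"
print(LLAMA_CLOUD_API_KEY[:4])  # should print "llx-"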

Column parsing errors

pharymngitis.pdf

For the attached parsed document, the parser is having difficulty with two-column layouts: e.g. page 2, column 2 is completely missing, while the page 3, column 2 paragraphs appear before column 1. See the marker project on GitHub re: column-identification strategies that appear to work well.

Not getting result

When I am trying to use the parser it returns:
{'markdown': 'undefined\n---\nundefined\n---\nundefined\n---\nundefined\n---\nundefined\n---\nundefined\n---\nundefined\n---\nundefined\n---\nundefined'}

I tried both the raw API and llama_parse. The same code was working and returning proper text for the same PDF file yesterday. Also, I tried 'preview' in the LlamaCloud interface, which returned the same response.

I have premium plan.
This is the link to the parsing result page: https://api.cloud.llamaindex.ai/api/parsing/job/f3435fe9-94d5-45f0-949b-05aee265d3ad/result/markdown

Reading order is messed up

Hello, the attached information leaflet has a somewhat complex layout with different columns, and llama parse is completely confused about the reading order: xanax-uk.pdf

I would expect to read first the whole upper-left column with the title, then the second column on the right, all columns in the first row, then all columns from left to right in the second row. Instead, after the first line of the first column, the returned text jumps to the second column, back to the first, and completely garbles the sense of the text.

# Component Type: Leaflet

# Package leaflet: Information for the patient

If you are pregnant, think you might be pregnant now, are planning to become pregnant or if you are breast-feeding (see also the sections on ‘Pregnancy’ and ‘Breast-feeding’ for more information).

Do not take your tablets with an alcoholic drink.

## 250 microgram and 500 microgram Tablets

### Warnings and precautions

Talk to your doctor or pharmacist before taking Xanax if you:

- Have ever felt so depressed that you have thought about taking your own life.
- ...

Plan to Opensource llama_parse

Hello! Thank you for this amazing work.
Currently, llama_parse is only available in the form of an API call. I am wondering if there is a future plan to open-source the codebase?

Page numbers returned in response

For some indexing into our VectorDB, it's very helpful to know the page numbers if possible.

Is this feature available now, or could it be added?

Either way, love the library. Great work!
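For anyone needing this today, a hedged sketch: the package's JSON result mode (see get_images in base.py, which consumes a json_result) returns per-page entries that include the page number. Names below are assumed from llama_parse's base.py:

# one dict per input file; each entry in "pages" carries its page number
json_results = parser.get_json_result("./my_file.pdf")
for page in json_results[0]["pages"]:
    print(page["page"], page["md"][:80])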

Error while parsing the PDF file: Job failed: PDF_TOO_LARGE

I am using a Google Colab notebook and trying to load a ~960-page PDF file, and I am getting the below error. Is this expected, since it's way below the 10k-page limit?

"" Error while parsing the PDF file '/content/drive/MyDrive/somefile.pdf': Failed to parse the PDF file: Job failed: PDF_TOO_LARGE
Failed to load file '/content/drive/MyDrive/somefile.pdf': with error: Failed to parse the PDF file: Job failed: PDF_TOO_LARGE. Skipping...
""

Issue with Parsing and No Error/Timeout

Using fedex 10-K from https://d18rn0p25nwr6d.cloudfront.net/CIK-0001048911/a9687faf-0a00-4135-a02e-0fc07e15d117.pdf

(screenshot)

Received an error using roughly the example code, with no actual error output.

I suspect the issue here is that the client has a timeout not exposed in the example code. I uploaded the PDF to the LlamaParse UI and it parsed successfully (at least on the surface), and then I noticed the usage counter showing 446/1000; the PDF is 223 pages long. So I'd conclude the job worked, but the client thought there was an error because it gave up too soon (it took about 8-9 minutes when I did the UI drag and drop). That left me unable to access the output, while each attempt (UI + notebook) still used up pages.
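If the client-side timeout is indeed the culprit, a hedged sketch of loosening the polling settings (parameter names assumed from llama_parse's base.py):

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    check_interval=10,  # seconds between job-status polls (assumed name)
    max_timeout=2000,  # total seconds to wait for the job (assumed name)
)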

Auto Rotation of Pages with horizontal content

If a table/content is rotated horizontally the output comes out incorrect.

The page would need to be rotated to be extracted.

Example table as text in horizontal format: (screenshot)

Example table as image in horizontal format: (screenshot)

No result (NO_CONTENT_HERE)

Hello,
After installing the new version (v0.10.40), I see many NO_CONTENT_HERE results for PDF files that do have content (about 10% of the documents return NO_CONTENT_HERE).

The sampled documents are here: pdf1, pdf2, pdf3, pdf4.

This is the code that I used:

parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=1, # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en" 
)

documents = parser.load_data(inputs) # inputs are the documents stored in my local storage

Textract result in blog post

(screenshot)

I am curious about what the red highlight means in this picture, notably for Textract. The output of the Textract API is (near-)perfect on that document, so I am wondering where the degradation might come from.

(screenshot)

You might want to check some of the document-to-text approaches we have made available through our Textractor client library, in case they are useful:

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/textractor_for_large_language_models.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/tabular_data_linearization.html

thanks,

(disclaimer: I work on Textract table recognition)

Type Annotation Issue in get_images Method inside llama_parse -> base.py

Description:
I encountered an issue while using the get_images method in the provided codebase. It seems there's a type annotation error related to subscripting the built-in list type in the method signature. The error message indicates that subscripting the list type will raise a runtime exception and suggests enclosing the type annotation in quotes.

Error message: (screenshot)

Code snippet (affected method): (screenshot)

The issue arises with the list[dict] type annotation. To resolve it, the type annotation should be enclosed in quotes: (screenshot)

This correction should prevent the runtime exception and ensure proper type hinting.

Environment:

Python version: [e.g., Python 3.8]
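For reference, a hedged sketch of the fix on Python 3.8, where built-in generics like list[dict] cannot be subscripted in evaluated annotations; typing.List (or quoting the annotation) avoids the runtime error:

from typing import List

# On Python < 3.9, `list[dict]` in an evaluated annotation raises
# "TypeError: 'type' object is not subscriptable"; typing.List does not.
def get_images(self, json_result: List[dict], download_path: str) -> List[dict]:
    ...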

binding parameter 0

Hi there, for some reason some of my PDFs are not processed; I get an error.

Reference code:
def convert_pdf_to_markdown(file_name):
    # Necessary for running async code in notebooks or scripts
    nest_asyncio.apply()

    # Initialize the LlamaParse parser
    parser = LlamaParse(
        api_key=llamaindex_api_key,
        result_type="markdown",  # Choose "markdown" as the output format
        verbose=True,  # Enable verbose output to see detailed logs
    )

    # Define the path to your PDF file
    pdf_file_path = os.path.join("./papers/", file_name)
    print(pdf_file_path, "type:", type(pdf_file_path))

    # Convert the PDF to Markdown
    # This is a synchronous call; you can also use asynchronous calls as shown in the documentation
    documents = parser.load_data(pdf_file_path)

    # Return the converted documents
    return documents

Terminal:
./papers/Retrieval-Augmented_Generation_for_Knowledge-Intensive_NLP_Tasks.pdf type: <class 'str'>
Started parsing the file under job_id 6b58df53-9d59-4286-8e88-427e3e93d956
An error occurred while updating paper rowid 77: Error binding parameter 0 - probably unsupported type.
./papers/RA-DIT_Retrieval-Augmented_Dual_Instruction_Tuning.pdf type: <class 'str'>
Started parsing the file under job_id fe3b8c19-0248-431a-a497-30c119a0694e
An error occurred while updating paper rowid 79: Error binding parameter 0 - probably unsupported type.
./papers/ColBERT_Efficient_and_Effective_Passage_Search_via_Contextualized_Late__Interaction_over_BERT.pdf type: <class 'str'>
Started parsing the file under job_id c97fc58f-9e15-4e38-819e-90a0b140bfb8
An error occurred while updating paper rowid 80: Error binding parameter 0 - probably unsupported type.
./papers/Lost_in_the_Middle_How_Language_Models_Use_Long_Contexts.pdf type: <class 'str'>
Started parsing the file under job_id 16b4aee0-25a4-4f7f-926d-4d96a779567b
An error occurred while updating paper rowid 81: Error binding parameter 0 - probably unsupported type.
./papers/Enhancing_Recommender_Systems_with_Large_Language_Model_Reasoning_Graphs.pdf type: <class 'str'>
Started parsing the file under job_id a35e8ce8-cae6-4885-8e59-fcfecea99f59
An error occurred while updating paper rowid 84: Error binding parameter 0 - probably unsupported type.
./papers/LIMA_Less_Is_More_for_Alignment.pdf type: <class 'str'>
Started parsing the file under job_id 51a97a81-db5b-4432-951d-429a403620dc
An error occurred while updating paper rowid 85: Error binding parameter 0 - probably unsupported type.
./papers/Retrieval-Augmented_Generation_for_Knowledge-Intensive_NLP_Tasks.pdf type: <class 'str'>
Started parsing the file under job_id 1298bad1-3537-40f8-9563-641d53ef852c
An error occurred while updating paper rowid 92: Error binding parameter 0 - probably unsupported type.
./papers/Retrieval-Augmented_Generation_for_Large_Language_Models_A_Survey.pdf type: <class 'str'>
Started parsing the file under job_id 36e20f5d-2dd9-466d-bf6d-4c136d684ef9
An error occurred while updating paper rowid 102: Error binding parameter 0 - probably unsupported type.

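Hedged reading: "Error binding parameter 0 - probably unsupported type" is sqlite3's message, so the failure is in the caller's own database update rather than in llama-parse; the returned Document objects need to be reduced to plain text before binding. A sketch with hypothetical table and column names:

# `documents` is a list of Document objects, which sqlite3 cannot bind;
# join their text into a plain string first
markdown_text = "\n\n".join(doc.text for doc in documents)
cursor.execute(
    "UPDATE papers SET markdown = ? WHERE rowid = ?",  # hypothetical schema
    (markdown_text, rowid),
)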

A chunk of text is missing in the output

I have a PDF document with a two-column structure and a lot of tables. Tables were extracted correctly, but the text at the very beginning of the document wasn't included in the parsing result. In the screenshot, the part in the red rectangle is completely missing from the parser's output:

(screenshot)

Tried a couple of runs, but result is the same.

Metrics

Dear Llama Parse team
I really enjoyed Jerry's presentation at the MLOps conference yesterday. This seems like an exciting project.

What metrics did you use to evaluate parse quality for complex tables? And, how did Llama Parse do on these metrics vs the incumbents?

Thanks!

Issue in getting data from tables spanning across pages with footer text

If tables span across pages, the data gets split across multiple tables with inconsistent results:
Instance 1. Started a new table structure with a different order of column names in the second part of the table (rows carrying over onto the next page).
Instance 2. Assumed the first row on the next page to be the header (although the header on the previous page was formatted as a bold, centered header row, distinct from the rest of the table).
Instance 3. Completely omitted rows on the next page and started with the text after the table, so rows that carried onto the next page were erroneously dropped.

If there is a footer on each page, at times the footer is ignored and the table is continued. In most instances, the table gets split and text from the footer is parsed as text between the two tables. Again, the table often assumes a different order or even different column header names.

Instance 4. Complex columns with a nested structure (e.g. a merged header field with 2 sub-headers) do not produce consistent output, e.g.:
| | |Merged Col C| |
|Col-A|Col-B|Col-C1|Col-C2|
|A1|B1|C1-1|C2-1|
|A2|B2|C1-2|C2-2|
|A3|B3|C1-3|C2-3|

The output from tables like these (merged Col C with split sub-columns) is inconsistent, and the headers get confused.

Error parsing markdown (Parse -> Try demo)

In the Parse -> 'Try demo' tab, I attempted to upload an 8.3 MB PDF. The right panel immediately showed "Error parsing markdown" after I dropped the file into the target area in the browser window. When the upload eventually finished, it showed "Error during upload", and an error toast popped up in the bottom right corner showing "undefined\nundefined".

Perhaps there is a size limit on PDF uploads?

Information for source file name ...

First up, great tool. Love using it and I'm getting very good usable results.

If possible, I'd like to request that the filename gets preserved, passed through in the parsing process and added to the metadata in the markdown output. The reason is that a lot of organizations (including mine) keep very orderly filename nomenclature. The name itself provides a lot of useful metadata that I need to make use of later on in the RAG pipeline. It would be a shame for it to be lost.

Hope this can be added. Or, in case it's already available, please let me know how to access and retrieve the information; I think I inspected the output thoroughly but couldn't spot the original filenames anywhere post-parsing.

Cheers!
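Until that lands, a hedged workaround: stamp the name onto the returned documents yourself; llama-index Document objects carry a metadata dict that downstream nodes inherit:

file_name = "2023-Q4_Report_ACME.pdf"  # hypothetical nomenclature
documents = parser.load_data(f"./docs/{file_name}")
for doc in documents:
    doc.metadata["file_name"] = file_name  # travels with the chunks into the RAG pipeline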


Return positional information

I'd like to parse the document and get positional coordinates for each section of text and each table cell. This would take the form of a page number and a (x1,y1,x2,y2) rectangle. If the text spans a page, you might need to provide multiple coordinates. If the document doesn't have pages, then some other coordinate system appropriate to the document type should be used such as characters or lines.
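For concreteness, a hypothetical sketch of the requested return shape, transcribing the description above:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextSpan:
    """Hypothetical positional record for one text section or table cell."""
    page: int
    rects: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2); several if the span crosses pages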

Accept S3 type bucket links of PDFs?

We store our PDFs and documents in S3 buckets, and generate a signed key for them whenever we want to do anything with them.

Is there a way to pass this in, instead of downloading the file and uploading it again when calling LlamaParse?

Example: https://mybucket.s3.amazonaws.com/uploads/file.pdf?X-Amz-Algorithm=...
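A hedged interim workaround: fetch the signed URL into memory and hand the bytes to the parser (the README's file-object section shows load_data accepting raw bytes), avoiding the download-then-reupload round trip through disk:

import requests

signed_url = "https://mybucket.s3.amazonaws.com/uploads/file.pdf?X-Amz-Algorithm=..."
resp = requests.get(signed_url, timeout=60)
resp.raise_for_status()
documents = parser.load_data(resp.content)  # bytes, per the file-object example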

Data in Columns in the Markdown is not correct.

You can see, for example, that Water Damage is being moved into the Limit column, when in fact it should be under the Deductible column.

Table (source): (screenshot)

LlamaParse output:

### INSURING AGREEMENT PROPERTY (Appraisal Date: April 15, 2023)

| |DEDUCTIBLE|LIMIT|
|---|---|---|
|All Property, Stated Amount Co-Insurance, Replacement Cost, Blanket By-Laws.| |$25,600,000|
|Property Extensions| |Included|
|Lock & Key|$2,500|$25,000|
|Additional Living Expenses - Per Unit| |$50,000|
|Additional Living Expenses - Annual Aggregate| |$1,000,000|
|Excess Property Extensions - Annually Aggregated| |Up to $5,000,000|
|- Excludes all damage arising from the peril of Earthquake| | |
|All Risks| |$50,000|
|Sewer Backup| |$250,000|
|Water Damage| |$250,000|
|Earthquake (Annual Aggregate)|20% (minimum $250,000)|100% of the Policy Limit|
|Flood (Annual Aggregate)| |$250,000 100% of the Policy Limit|
|Gross Rentals, 100% Co-Insurance, Indemnity Period (Months) :|N/A|Not Covered|

Readable markdown (translated via ChatGPT): (screenshot)

Where is the md file stored?

I am trying to use llama_parse in a Google Colab notebook to test it out.

This is the code:

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-my-API",
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)

# sync
documents = parser.load_data("./coding.pdf")

# sync batch
# documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])

# async
# documents = await parser.aload_data("./my_file.pdf")

# async batch
# documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])

It says: "Started parsing the file under job_id".

But where is the markdown file saved? I couldn't find it in the cloud nor in the Colab notebook.
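Hedged clarification: load_data returns Document objects in memory; no markdown file is written anywhere, locally or in the cloud. To get a file, write the parsed text out yourself:

# persist the parsed markdown to the notebook's filesystem
with open("coding.md", "w") as f:
    f.write("\n\n".join(doc.text for doc in documents))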

ImportError: circular import?

See the below error with llama-parse-0.3.9

~/labs via 🐍 v3.10.12 took 3s 
❯ python llama_parse.py 
Traceback (most recent call last):
  File "/Users/hemanth/labs/llama_parse.py", line 4, in <module>
    from llama_parse import LlamaParse
  File "/Users/hemanth/labs/llama_parse.py", line 4, in <module>
    from llama_parse import LlamaParse
ImportError: cannot import name 'LlamaParse' from partially initialized module 'llama_parse' (most likely due to a circular import) (/Users/hemanth/labs/llama_parse.py)

for:

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-key",  
    result_type="markdown", 
    num_workers=4, 
    verbose=True,
    language="en"
)

documents = parser.load_data("./my.pdf")

bad ocr

Questioning_development_review.PDF

I wasn't intentionally testing OCR, but here we are. I won't share an example, but it's missing spaces/newlines and puts numbers where they don't belong.

When I run it through ocrmypdf with the command ocrmypdf --clean --output-type pdf --redo-ocr and then re-run it through llama-parse, I get a much better result.
