TypeError: 'PDFObjRef' object is not iterable #90

@sanjays-rao

Initial Checks

  • I confirm that I'm on the latest version

Description

Hi, I need some help with the above error, which I hit while parsing my document. A few PDFs fail to parse, and I can't tell why. The document is a medical invoice PDF, and I am aiming to extract its text contents along with their bounding box coordinates.

Example Code

import openparse

basic_doc_path = "/home/sanjayr/Workspace/30-claims/42969914.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)
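
Since only a few PDFs fail, one way to narrow this down is to run the parser over all of them and record the full traceback for each failure, so the frame that raises the `TypeError` is visible. The following is a minimal sketch using only the standard library; `collect_parse_failures` and its `parse_fn` parameter are hypothetical names introduced here for illustration, not part of open-parse.

```python
import traceback
from typing import Callable, Dict, Iterable, Tuple


def collect_parse_failures(
    paths: Iterable[str],
    parse_fn: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run parse_fn on each path and return (results, failures).

    parse_fn is any callable that takes a file path (an assumption made
    for this sketch). failures maps each failing path to its full
    formatted traceback.
    """
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = parse_fn(path)
        except Exception:
            # Keep the whole traceback, not just the message, so the
            # line that raised the error can be identified.
            failures[path] = traceback.format_exc()
    return results, failures


# Hypothetical usage with the parser from the snippet above:
# parser = openparse.DocumentParser()
# results, failures = collect_parse_failures([basic_doc_path], parser.parse)
# for path, tb in failures.items():
#     print(path)
#     print(tb)
```

Attaching the traceback for one failing file would show whether the error originates in open-parse itself or in an underlying PDF library.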

Python, open-parse & OS Version

python_version: 3.8.20
operating_system: Linux
os_version: 5.15.0-1074-azure
open-parse version: 0.7.0
install path: /home/sanjayr/.conda/envs/be-env/lib/python3.8/site-packages/openparse
python version: 3.8.20 (default, Oct  3 2024, 15:24:27)  [GCC 11.2.0]
platform: Linux-5.15.0-1074-azure-x86_64-with-glibc2.17
related packages: PyMuPDF-1.24.11 pydantic-2.10.4 tokenizers-0.20.3 transformers-4.46.3 torch-2.4.1 torchvision-0.19.1

Labels: bug
