Extracting Text from Unstructured PDFs for LLMs

bigegpt 2024-11-20 12:39

Have you ever faced the daunting task of extracting content from unstructured PDF files, only to find that existing Python packages fall short? By unstructured, I mean PDFs with multiple kinds of elements and varying page layouts. Picture a file with one table on one page, two tables on another, or none at all. Picture a single-column layout on one page and a two- or three-column layout on the next, with tables inside those columns. That is exactly the challenge I faced.

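I tried various packages, such as tabula. Each has its strengths and weaknesses. tabula is good at extracting tables, but it struggled to extract merged-cell information from tables and is not suited to text extraction. I found pdfminer well suited to extracting text from multi-column pages, while pdfplumber shines at extracting tables with merged cells. I even found a package called unstructured, which was a step in the right direction and met roughly 80% of my needs. Although it extracted some tables well, it often missed others entirely. This problem is common among extraction packages that rely on optical character recognition (OCR): not every table is extracted accurately from the PDF, and sometimes the last row or last column of a table is dropped.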

So I designed a solution that handled most of my sample PDFs successfully. I will discuss the few exceptions later.

My approach combines pdfminer, pdfplumber, and fitz (from PyMuPDF) to extract text, table information, and images while preserving their layout and flow.

Extracting text and tables

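To extract text and tables in an organized way, I used pdfplumber functions to identify and extract the tables on a page. I then used pdfminer functions to extract the page's text elements in order, excluding any text elements that fall within the boundaries of the identified tables. When building the output string, I made sure to preserve the reading flow and the position of each table.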

A screenshot of a sample 2 column layout page of a PDF file and its output string in a text file. Note: The text file only shows part of the extraction, which is the second layout of the PDF — starting from the red text.

Let's dive straight into the function without dwelling on package installation:

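At the start of my function, I instantiate an object from each package so that a single PDF file can be processed in full. I then iterate over each page, first using pdfplumber's find_tables function to identify the table objects on the page, as shown in the code below.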

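import fitz  # PyMuPDF
import pdfplumber
from pdfminer.high_level import extract_pages

def pdf_process(path):
    # open the same file with each of the three libraries
    plumberObj = pdfplumber.open(path)
    minerPages = extract_pages(path)
    fitzDoc = fitz.open(path)

    ...
    for i, page_layout in enumerate(minerPages):
        # pdfplumber page that corresponds to the current pdfminer layout
        plumberPage = plumberObj.pages[i]
        tables = plumberPage.find_tables(table_settings={"text_vertical_ttb": False})
        page_text = miner_extract_page(page_layout, tables)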

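Once I had the table objects, I moved on to pdfminer. Although pdfminer is not flawless at table extraction, it is good at handling different page layouts within a PDF. Note that my sample PDF files do not have a consistent layout across pages. So for each page layout, I made sure to crop out all table objects from that page using the helper check function is_obj_in_bbox.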

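Note that the table object has an attribute holding its bounding box (read as t.bbox in the code further down). The helper function receives that boundary as its _bbox parameter.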

def is_obj_in_bbox(obj, _bbox, page_height):
    """
    checks if an element boundary box is within another boundary
    """
    objx0, y0, objx1, y1 = obj
    x0, top, x1, bottom = _bbox
    return (objx0 >= x0) and (objx1 <= x1) and (page_height - y1 >= top) and (page_height - y0 <= bottom)

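You will notice that I use page_height in the check function. This is because pdfplumber and pdfminer measure bounding boxes differently: pdfminer measures y from the bottom of the page, whereas pdfplumber measures top and bottom from the top of the page. You can read more about the bbox inversion between the two packages in this issue.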
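
To make the coordinate flip concrete, here is a minimal sketch using plain tuples. The helper names miner_to_plumber_bbox and is_inside, the page height, and the sample boxes are all made up for illustration; the check is the same comparison as is_obj_in_bbox, only split into two steps.

```python
def miner_to_plumber_bbox(miner_bbox, page_height):
    """Convert a pdfminer bbox (origin bottom-left) to pdfplumber order (origin top-left)."""
    x0, y0, x1, y1 = miner_bbox
    top = page_height - y1      # upper edge, measured from the top of the page
    bottom = page_height - y0   # lower edge, measured from the top of the page
    return (x0, top, x1, bottom)


def is_inside(inner, outer):
    """True if `inner` lies fully within `outer`; both in (x0, top, x1, bottom) order."""
    ix0, itop, ix1, ibottom = inner
    ox0, otop, ox1, obottom = outer
    return ix0 >= ox0 and ix1 <= ox1 and itop >= otop and ibottom <= obottom


# Hypothetical values, chosen only to illustrate the conversion
page_height = 800
table_bbox = (50, 100, 500, 300)   # pdfplumber style: x0, top, x1, bottom
text_bbox = (60, 550, 200, 650)    # pdfminer style: x0, y0, x1, y1

converted = miner_to_plumber_bbox(text_bbox, page_height)
print(converted)                         # (60, 150, 200, 250)
print(is_inside(converted, table_bbox))  # True
```

A text box near the bottom of the page, such as (60, 50, 200, 90), converts to a top of 710 and therefore falls outside this table.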

Now let's look at the miner_extract_page function:


from pdfminer.layout import LTTextContainer
from tabulate import tabulate

def miner_extract_page(page_layout, tables):
    """
    this function extracts texts, tables, images from a single page layout
    Parameters:
      page_layout -> A pdfminer page object
      tables -> An array of pdfplumber table objects
    Returns:
      A string with the page text and tables in reading order
    """
    page_height = page_layout.height
    extractedTables = []
    page_output_str = ""
    
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            tabBox = []
            # if current element exists in any of the tables, 
            # append the t to tabBox
            for t in tables:
                is_obj_n_box = is_obj_in_bbox(element.bbox, t.bbox, page_height)
                if is_obj_n_box:
                    tabBox.append(t)

            # if tabBox is empty, extract the element with get_text() function
            if not len(tabBox):
                if isinstance(element, LTTextContainer):
                    elementText = element.get_text()
                    page_output_str += elementText
                else:
                    # check for figures layout at this point
                    # check for part 2 when I talk about images/figures
                    pass

            # else, element exist in a certain table
            # therefore, we extract the found table 
            # using pdfplumber table extract function and 
            # concatenate to our end results
            else:
                if not tabBox[0] in extractedTables:
                    table_str = tabulate(tabBox[0].extract(vertical_ttb=False), tablefmt="grid")
                    page_output_str += table_str
                    page_output_str += "\n"
                    # to avoid repetition we used extractedTables to 
                    # filter already extracted tables. 
                    extractedTables.append(tabBox[0])
    return page_output_str

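In the function above, I used the tabulate package to print the tables in the final string. It is not otherwise necessary. If you are extracting PDFs for an LLM, it is best to avoid it, because it adds extra characters to your final output. Still, make sure to print tables in some organized way to avoid confusion, since the output of table extraction is a two-dimensional array.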
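
As one possible alternative to tabulate, the sketch below joins each row of the two-dimensional array with a short separator. The function name table_to_text, the separator, and the sample rows are assumptions for illustration and are not part of the original code. It also assumes that empty cells may arrive as None.

```python
def table_to_text(rows, sep=" | "):
    """Join a 2D table (list of rows) into one line of text per row."""
    lines = []
    for row in rows:
        # Treat missing cells (None) as empty and flatten line breaks inside a cell
        cells = ["" if cell is None else str(cell).replace("\n", " ").strip() for cell in row]
        lines.append(sep.join(cells))
    return "\n".join(lines)


# Hypothetical extracted table with one empty cell and one multi-line cell
sample = [
    ["Name", "Qty", None],
    ["Bolt\nM8", "12", "steel"],
]
print(table_to_text(sample))
```

This keeps one row per line without the border characters that the grid format adds, which is usually enough for a model to follow the row and column structure.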

Extracting images

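This task also involved identifying certain pictograms/images in the PDF files. I had about 10 different pictograms, and I needed to check whether any of them appeared in a file. My solution was to first extract all images on the PDF pages, then use image hashing or another template comparison method (matchTemplate from the cv2 package) to determine whether an extracted image matched any of my pictograms.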
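
To illustrate the image hashing idea, here is a minimal sketch of an average hash in plain Python. It is not the comparison used in the original project. It assumes each image has already been decoded, converted to grayscale, and shrunk to a small grid of the same size (in practice with a library such as Pillow or cv2), and the pixel grids below are invented.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Hypothetical 4x4 grayscale thumbnails (0 = black, 255 = white)
pictogram = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
extracted = [  # same shape as the pictogram, with slight noise
    [10, 5, 250, 240],
    [0, 20, 245, 255],
    [250, 255, 15, 0],
    [240, 250, 5, 10],
]
unrelated = [  # a dark square on a white background
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]

print(hamming_distance(average_hash(pictogram), average_hash(extracted)))  # 0
print(hamming_distance(average_hash(pictogram), average_hash(unrelated)))  # 8
```

A small distance suggests a match. The threshold for "small" has to be tuned on real samples.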

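PyMuPDF (fitz) is an effective package for extracting images, and it is the one I chose. However, I ran into a challenge: not all images were correctly identified or extracted. I suspect this is related to the type of image used when the PDF was compiled, such as PNG or SVG, but I could not confirm this.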

For the image extraction process, I implemented a function named check_for_image:

import logging
import os

import fitz  # PyMuPDF

logger = logging.getLogger(__name__)

def check_for_image(pdf_path):
    """
    Function that gets images from pdf pages and saves them to local drive

    Parameters:
    pdf_path -> Path of the PDF file
    """

    pdf_document = fitz.open(pdf_path)
    xreflist = []

    for page_num in range(pdf_document.page_count):
        il = pdf_document.get_page_images(page_num)
        logger.info(f"Found {len(il)} images on page {page_num + 1}")

        for img in il:
            xref = img[0]
            if xref in xreflist:
                continue

            width = img[2]
            height = img[3]

            # skip tiny images
            if min(width, height) <= 5:
                continue

            # get_page_images() only lists the images;
            # the bytes and file extension are fetched by xref
            extracted = pdf_document.extract_image(xref)
            imgdata = extracted["image"]

            imgfile = os.path.join(f"drawing-img-{xref}-{page_num + 1}.{extracted['ext']}")
            with open(imgfile, "wb") as fout:
                fout.write(imgdata)
            xreflist.append(xref)
    pdf_document.close()

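Because of the image challenge mentioned above, I implemented an additional function alongside the check for regular images: a check for drawings. Drawings range from lines to more complex shapes (you can read more in the documentation for the get_drawings function). I took several steps: enlarging each rectangle by a certain amount, filtering out any small drawings and any drawings contained within a larger one, and then obtaining a pixmap of the enlarged page region containing the drawing. I then compare the resulting pixmap against the pictograms I need to decide on.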

def check_for_drawings(pdf_path):
    """
    Function that finds drawings on pdf pages and saves a pixmap of
    each enlarged drawing's rectangle to local drive

    Parameters:
    pdf_path -> Path of the PDF file
    """

    pdf_document = fitz.open(pdf_path)

    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]

        d = page.get_drawings()
        new_rects = []
        for p in d:
            # filter empty rectangle
            if p["rect"].is_empty:
                continue
            w = p["width"]
            if w:
                r = p["rect"] + (-w, -w, w, w)  # enlarge each rectangle by width value
                for i in range(len(new_rects)):
                    if abs(r & new_rects[i]) > 0:  # touching one of the new rects?
                        new_rects[i] |= r  # enlarge it
                        break

                # now look if contained in one of the new rects
                remainder = [s for s in new_rects if r in s]
                if remainder == []:  # no ==> add this rect to new rects
                    new_rects.append(r)

        new_rects = list(set(new_rects))  # remove any duplicates
        new_rects.sort(key=lambda r: abs(r), reverse=True)
        remove = []
        for j in range(len(new_rects)):
            for i in range(len(new_rects)):
                if new_rects[j] in new_rects[i] and i != j:
                    remove.append(j)
        remove = list(set(remove))
        for i in reversed(remove):
            del new_rects[i]
        new_rects.sort(key=lambda r: (r.tl.y, r.tl.x))  # sort by location


        mat = fitz.Matrix(5, 5)  # high resolution matrix
        for i, r in enumerate(new_rects):
            if r.width is None or r.height <= 15 or r.width <= 15:
                continue  # skip lines and empty rects
            pix = page.get_pixmap(matrix=mat, clip=r)
            hayPath = f"drawing-rect{page_num}-{i}.png"
            if pix.n - pix.alpha >= 4:      # CMYK: convert to RGB before saving as PNG
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(hayPath)
            pix = None                     # free Pixmap resources

    pdf_document.close()
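
The enlarge-and-merge idea in check_for_drawings can be sketched without PyMuPDF using plain tuples. This is a simplified variant rather than the exact logic above: it merges rectangles until none overlap, and every name and sample value here is hypothetical.

```python
def enlarge(rect, w):
    """Grow an (x0, y0, x1, y1) rectangle by w on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - w, y0 - w, x1 + w, y1 + w)


def overlaps(a, b):
    """True if the two rectangles share a region of non-zero area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rects(rects):
    """Merge overlapping rectangles until no two of them overlap."""
    merged = []
    for rect in rects:
        changed = True
        while changed:
            changed = False
            for other in merged:
                if overlaps(rect, other):
                    merged.remove(other)      # absorb the overlapping rectangle
                    rect = union(rect, other)
                    changed = True
                    break                     # restart the scan with the grown rect
        merged.append(rect)
    # sort by location: top to bottom, then left to right
    return sorted(merged, key=lambda r: (r[1], r[0]))


# Hypothetical strokes as (rectangle, line width) pairs
strokes = [((10, 10, 20, 20), 1), ((19, 12, 40, 18), 1), ((100, 100, 120, 130), 2)]
boxes = merge_rects([enlarge(rect, w) for rect, w in strokes])
print(boxes)  # [(9, 9, 41, 21), (98, 98, 122, 132)]
```

The first two strokes touch once enlarged, so they collapse into one region, while the third stays separate. Each resulting box corresponds to one clip region that would be rendered to a pixmap.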

This comprehensive approach let me successfully extract information from almost all of the sample PDFs I encountered.

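Reflecting on the limitations and edge cases of my solution, one challenge I faced was extracting borderless tables. Despite considerable effort, I have not found a straightforward way to do this, especially when you are dealing with a large number of PDF files. It is worth noting, however, that there are issues and discussions about table borders on the pdfplumber GitHub page. You can find them all with the following search query: https://github.com/jsvine/pdfplumber/issues?q=border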

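In the meantime, I encourage exploring open-source projects that may offer valuable insights and potential solutions to this particular challenge.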

For anyone interested in the code, you can find it here: https://github.com/KhadijaMahanga/textract
