Hello, I’m using Pdf2TextLibrary to convert a PDF file into text. The PDF contains a table-like structure showing days of the week, hours (worked) and status (approved). However, after converting to text, the formatting is not how I would like it. It first grabs all the days (the first column), and then it seems to grab both the hours and status columns and put them into the text in random order. I tried removing or replacing words and empty lines to get a somewhat better format, but what I really need is for the hours to be displayed after the corresponding days, so I can check whether the hours and days match in the PDF (and also the total hours). Is there any way to do this? Or is there another library I could use? Or maybe some string formatting options? I have attached a screenshot of the PDF itself and of the text after converting (it is in Dutch). Thanks!
If you don’t mind writing a bit of Python code yourself:
I’ve heard that pdfplumber is good at extracting table information from PDFs.
I always wanted to try it myself, but haven’t had the time yet.
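For reference, a minimal sketch of what table extraction with pdfplumber could look like. The helper names and the page_number parameter are made up for illustration; pdfplumber.open and page.extract_table() are part of pdfplumber, but whether the table is actually detected depends on how the PDF draws it, so treat this as a starting point rather than a tested solution.

```python
def clean_row(row):
    """Replace empty cells (None) with "" and strip surrounding whitespace."""
    return [(cell or "").strip() for cell in row]


def extract_first_table(pdf_file, page_number=0):
    """Return the first table found on a page as a list of rows.

    Hypothetical helper: the name and arguments are assumptions.
    """
    # Imported here so clean_row can be used without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_file) as pdf:
        table = pdf.pages[page_number].extract_table()
    # extract_table() returns None when no table is detected on the page.
    return [clean_row(row) for row in (table or [])]
```

Each returned row is a list of cell strings, so a day and its hours end up in the same row if the table is recognised.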
Hi John,
Well 2 OCR related questions on this forum in 2 hours … I wasn’t sure if I should reply because I’m not an OCR guru, but I’ll tell you what I know.
OCR is often quite flaky, and the cleaner the image, the better the results. I was surprised all your text came through with no incorrect characters, so your input image must be quite clean.
A few years back I performance tested a system that OCR’d invoices from external suppliers. To test volumes of invoices, we needed to generate PDFs in our test script, submit them, and then verify that the invoice generated in the system had the correct values.
The first thing we had to do was get some template supplier invoices, because the OCR system had to be “trained” for each supplier’s invoice; if a supplier changed their invoice, the system had to be retrained.
The training basically meant telling the OCR tool in which region of the page to find which value. They didn’t OCR the whole page, but rather OCR’d the region where the invoice number was, then the region where the total was, etc.
In your case, for best results I’d suggest OCR’ing the region where the column headings are, then the region for row 1, and so on. If all goes well, each OCR’d row should return 3 lines, one for each column.
Hope that helps,
Dave.
@JST81 You might try using PyMuPDF (available on PyPI). If you could upload an example PDF, I could also check whether this library does the trick, if you wish.
Hi Dave, I have no idea what you mean by OCR? I never stated the PDF contains an image; it’s just text formatted in some kind of table form. So I’m not looking for something to convert an image to text, just to extract the text from the PDF.
Hi LukasB, I looked into that one (as well as others) but I’m not sure how that works. I need to be able to import a library and then use a simple keyword to extract text. It seems PyMuPDF does not have keywords but uses code like this:
import pymupdf  # imports the pymupdf library
doc = pymupdf.open("example.pdf")  # open a document
for page in doc:  # iterate the document pages
    text = page.get_text()  # get plain text encoded as UTF-8
I have no idea how to use that in Robot Framework. Do I need to write some custom Python code for that? I can’t really upload a full PDF to be honest, since it’s work-related.
Yes, you do have to write some Python code as a bridge between Robot Framework and this library.
The following Python code can be used as a simple library to convert a PDF to text:
import pymupdf


def pdf_to_text(pdf_file):
    """Extracts text from the PDF file and returns the result."""
    text = ""
    doc = pymupdf.open(pdf_file)  # open a document
    for page in doc:  # iterate the document pages
        text = text + page.get_text()
    return text  # return the result
Just create a file named, for instance, pdf_to_text.py with the above code. Then include this library in your Robot Framework file like this:
*** Settings ***
Library    "..path_to_your_libraries.."/pdf_to_text.py
After this you can use the keyword Pdf To Text, with the location of the PDF document as its argument:
${pdf_text}=    Pdf To Text    ${my_pdf_file}
Also, don’t forget to install the PyMuPDF library first (pip install PyMuPDF).
The PyMuPDF documentation also has example code showing how to deal with some specific situations.
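If the default output still mixes up the columns, one option is to rebuild the rows yourself from word positions. Below is a minimal sketch, not something I have run against your PDF: page.get_text("words") is a real PyMuPDF call that returns tuples starting with (x0, y0, x1, y1, word, ...), but the function names and the y_tolerance value are my own assumptions and may need tuning for your layout.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    A word joins the current row when its top edge (y0) lies within
    y_tolerance points of the first word in that row. Each row is then
    ordered left to right. The tolerance is an assumed value.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(word[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [" ".join(w[4] for w in sorted(row, key=lambda w: w[0]))
            for row in rows]


def pdf_to_rows(pdf_file):
    """Return the text of a PDF as a list of rows, one string per row."""
    # Imported here so the grouping helper works without PyMuPDF installed.
    import pymupdf  # third-party: pip install PyMuPDF

    rows = []
    doc = pymupdf.open(pdf_file)
    for page in doc:
        rows.extend(group_words_into_rows(page.get_text("words")))
    doc.close()
    return rows
```

Saved in the same .py file, pdf_to_rows would show up in Robot Framework as the keyword Pdf To Rows and return a list with one string per table row.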
Hi John,
Ah, my mistake. Many PDFs are just images embedded in a PDF, but if your PDF is pure postscript that’s a different story, hopefully an easier one.
Dave.
PyMuPDF has a Repo with a lot of examples:
They even have some about extracting values from tables:
I can personally recommend PyMuPDF; I also use it in my Robot Framework library for document tests.
However, I don’t have any keywords to retrieve tables from PDF, so my library won’t help you that much (yet).
One more question if you don’t mind:
How do you WANT the text in the table to be extracted? So in what kind of format?
List of tuples? Nested List? List of dictionaries (with table column as key) ?
I’m just interested, in case I implement a keyword for table retrieval in my library.
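To make the options concrete, here is a small sketch of the same table in each shape. The values are made up (not taken from the actual PDF) and the function names are only for illustration.

```python
# Made-up sample data: a nested list with the header as the first row.
table = [
    ["Day", "Hours", "Status"],
    ["Monday", "8", "Approved"],
    ["Tuesday", "6", "Approved"],
]


def table_as_tuples(table):
    """List of tuples, one per data row (header row dropped)."""
    return [tuple(row) for row in table[1:]]


def table_as_dicts(table):
    """List of dictionaries, with the column headers as keys."""
    header, *body = table
    return [dict(zip(header, row)) for row in body]
```

With the list of dictionaries, a test could read a value by column name, for example table_as_dicts(table)[0]["Hours"].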
Thanks Lukas, that seems to work! I created my own little .py-file using your code and imported that next to PyMuPDF itself. I was able to extract the text from the PDF this way and now the formatting is a bit better (the hours are placed directly after each day). So I managed to do my checks in the converted text and the test passes. One question though: how come this library formats the text in a different way than the one I was using before? Is that default or did you write something in the Python code to get a different formatting? Thanks again!
I have no idea why the formatting is different. It’s the default output. Maybe @Many has some idea, as he has more experience with this library than I do.
Your previous library, Pdf2TextLibrary (GitHub: qahive/robotframework-pdf2textlibrary), used pdfminer2 to extract the data from the PDF.
The latest release of pdfminer2 is from 2015: https://pypi.org/project/pdfminer2/
I guess there have been plenty of improvements since then in how data can be retrieved from a PDF.
It seems PyMuPDF is better with tables.
Just plain text, nothing more. As long as the values of the 3 columns are put together, and that is the case now
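For anyone landing here later: once each day, its hours and its status are on one line, the check on the total can be done with a few lines of Python. This is only a sketch under assumptions, since the real text was not posted: it assumes lines of the form "day hours status" and accepts a decimal comma or point. The pattern and the function names are hypothetical.

```python
import re

# Assumed line layout: "<day> <hours> <status>", e.g. "Monday 8,00 Approved".
ROW_PATTERN = re.compile(
    r"^(?P<day>\S+)\s+(?P<hours>\d+(?:[.,]\d+)?)\s+(?P<status>\S+)$"
)


def parse_rows(lines):
    """Return (day, hours, status) tuples for every line that matches."""
    entries = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            # Accept a decimal comma by converting it to a point.
            hours = float(match.group("hours").replace(",", "."))
            entries.append((match.group("day"), hours, match.group("status")))
    return entries


def total_hours(lines):
    """Sum the hours of all matching lines; other lines are ignored."""
    return sum(hours for _, hours, _ in parse_rows(lines))
```

Lines that do not fit the assumed layout, such as a header or a totals line with only two fields, are skipped, so the sum can be compared against the total printed in the PDF.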