Help needed with formatting text from PDF

damies13 · 30 January 2025 12:10

Hi John,

Well, 2 OCR-related questions on this forum in 2 hours … I wasn’t sure if I should reply because I’m not an OCR guru, but I’ll tell you what I know.

OCR is often quite flaky, and the cleaner the image the better the results. I was surprised all your text came through with no incorrect characters, so your input image must be quite clean.

A few years back I performance tested a system that OCR’d invoices from external suppliers. To test volumes of invoices, we needed to generate PDFs in our test script, submit them, then verify that the invoice generated in the system had the correct values.

The first thing we had to do was get some template supplier invoices, because the OCR system had to be “trained” for each supplier’s invoice. If a supplier changed their invoice, the system had to be retrained.

The training was basically telling the OCR tool which region of the page to find each value in. They didn’t OCR the whole page, but rather OCR’d the region where the invoice number was, then the region where the total was, etc.
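
To give a rough idea of what that per-supplier “training” amounts to, here is a minimal Python sketch. It is not the tool we used; the template name, field names and pixel coordinates are all made up, and `crop` and `ocr` are placeholders you would supply yourself (for example Pillow’s `Image.crop` and pytesseract’s `image_to_string`, assuming those are installed).

```python
from typing import Callable, Dict, Tuple

# A region is (left, upper, right, lower) in pixels, the same order
# Pillow's Image.crop expects.
Box = Tuple[int, int, int, int]

# Hypothetical template for one supplier. The coordinates are invented
# for illustration; a real template would be measured from a sample invoice.
SUPPLIER_A_TEMPLATE: Dict[str, Box] = {
    "invoice_number": (400, 50, 580, 80),
    "total": (400, 700, 580, 730),
}


def read_fields(page, template: Dict[str, Box],
                crop: Callable, ocr: Callable) -> Dict[str, str]:
    """OCR each named region separately instead of the whole page.

    crop(page, box) must return the sub-image for a region and
    ocr(region) must return the recognised text; both are supplied
    by the caller so this sketch is not tied to a particular library.
    """
    return {name: ocr(crop(page, box)).strip()
            for name, box in template.items()}
```

With Pillow and pytesseract the call might look like `read_fields(img, SUPPLIER_A_TEMPLATE, lambda p, b: p.crop(b), pytesseract.image_to_string)`. If the supplier moves a field, only the template changes.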

In your case, for best results I’d suggest OCRing the region where the column headings are, then the region for row 1, and so on. If all goes well, each OCR’d row should return 3 lines, one per column.
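
Something like the Python sketch below is what I have in mind. It assumes the header and every row have the same fixed height, which may not hold for your PDF, and the function names are just ones I picked for illustration. Each box would be cropped and passed to whatever OCR call you are already using.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def row_boxes(left: int, top: int, right: int,
              row_height: int, n_rows: int) -> List[Box]:
    """Return one box for the header plus one per data row.

    Assumes the header and all rows share the same height and are
    stacked directly under each other, starting at `top`.
    """
    boxes = []
    for i in range(n_rows + 1):  # index 0 is the column headings
        upper = top + i * row_height
        boxes.append((left, upper, right, upper + row_height))
    return boxes


def split_columns(ocr_text: str, expected: int = 3) -> List[str]:
    """Split the OCR output for one row into its column values.

    Blank lines are dropped; a row that does not yield the expected
    number of values raises an error so a bad read is noticed early.
    """
    cells = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    if len(cells) != expected:
        raise ValueError(f"expected {expected} columns, got {len(cells)}: {cells}")
    return cells
```

Checking the column count per row is deliberate: if a row comes back with 2 or 4 lines, the region is probably misaligned rather than the data being wrong.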

Hope that helps,

Dave.
