#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)

#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (1)

I have clearly been out of the loop because I have only recently learned about the tesseract library in R. If I knew about it earlier I would have wrote about it much sooner!

The tesseract library is a package which has bindings to the Tesseract-OCR engine: a powerful optical character recognition (OCR) engine that supports over 100 languages and enables users to scan and extract text from pictures, which has direct applications in any field that deals with with high amounts of manual data processing- like accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how its possible to parse a JPMorgan Chase bank statement using the tesseract,pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", dpi = 1000)

## Converting page 1 to sample-bank-statement_1.png... done!## Converting page 2 to sample-bank-statement_2.png... done!## Converting page 3 to sample-bank-statement_3.png... done!## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files it can now be read with tesseract. Right now the data is very unstructured and it needs to be parsed. I’ll save the output for you, but you can see it for yourself if you output the raw_text vector on your machine.

library(tesseract)raw_text <- ocr(bank_statement, engine = tesseract("eng") )

From a bank statement, businesses are interested in data from the fields listed in document. Namely the:

Deposits and additions,
Checks paid,
Other withdrawals, fees and charges; and
Daily ending balances.

While this bank statement is relatively small and only consists of 4 pages, to create a general method for extracting data from JPMorgan Chase bank statements which could be larger, the scanned text will need to be combined into a single text file and then parsed accordingly.

# Bind raw text togetherraw_text_combined<- paste(raw_text,collapse="")

To get the data for the desired fields, we can use the field titles as anchors for parsing the data. It was with this and with the help of regex101.com this cleaning script was constructed. Beyond the particular regular expressions involved, the cleaning script relies heavily on the natural anchors in the text relating encompassing the values where they begin (around the title of the table they belong) and where they end (just before the total amount).

library(tidyverse)library(stringr)deposits_and_additions<- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'), amount = ...1 %>% str_extract("(?<=[A-z] ).*")%>% str_extract('\\d{1,}.*'))checks_paid <- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>% # Would have to change regex to get check numbers and description but its not relevant here str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))others <- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), description = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'), amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))daily_ending_balances<- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% lapply(function(x) { x %>% str_split('(?= \\d{2}\\/\\d{2} )')}) %>% unlist() %>% as_tibble(.name_repair = "unique") %>% transmute(date= value %>% str_extract('\\d{2}\\/\\d{2}'), amount = value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS"=deposits_and_additions, "CHECKS PAID"=checks_paid, "OTHER WITHDRAWALS, FEES & CHARGES"=others, "DAILY ENDING BALANCE"=daily_ending_balances)statement_data

## $`DEPOSITS AND ADDITIONS`## # A tibble: 10 x 3## date transaction amount ## <chr> <chr> <chr> ## 1 07/02 Deposit 17,120.00## 2 07/09 Deposit 24,610.00## 3 07/14 Deposit 11,424.00## 4 07/15 Deposit 1,349.00 ## 5 07/21 Deposit 5,000.00 ## 6 07/21 Deposit 3,120.00 ## 7 07/23 Deposit 33,138.00## 8 07/28 Deposit 18,114.00## 9 07/30 Deposit 6,908.63 ## 10 07/30 Deposit 5,100.00 ## ## $`CHECKS PAID`## # A tibble: 2 x 2## date amount ## <chr> <chr> ## 1 07/14 1,471.99## 2 07/08 1,697.05## ## $`OTHER WITHDRAWALS, FEES & CHARGES`## # A tibble: 4 x 3## date description amount ## <chr> <chr> <chr> ## 1 07/11 Online Payment XXXXX To Vendor 8,928.00## 2 07/11 Online Payment XXXXX To Vendor 2,960.00## 3 07/25 Online Payment XXXXX To Vendor 250.00 ## 4 07/30 ADP TX/Fincl Svc ADP 2,887.68## ## $`DAILY ENDING BALANCE`## # A tibble: 11 x 2## date amount ## <chr> <chr> ## 1 07/02 98,727.40 ## 2 07/21 129,173.36## 3 07/08 97,030.35 ## 4 07/23 162,311.36## 5 07/09 121,640.35## 6 07/25 162,061.36## 7 07/11 109,752.35## 8 07/28 180,175.36## 9 07/14 108,280.36## 10 07/30 189,296.31## 11 07/16 121,053.36

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal and I hope I was able to show it in this blog. While this blog only focused on JPMorgan Chase bank statements, its possible to apply the same techniques to other bank statements by having the cleaning script tweaked accordingly.

Thank you for reading!

Want to see more of my content?

Be sure to subscribe and never miss an update!

YouTube

Facebook

Patreon

#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)

FAQs

What is Tesseract OCR used for? ›

Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and documents without a text layer and outputs the document into a new searchable text file, PDF, or most other popular formats.

Read On ›

Is Tesseract OCR still good? ›

Tesseract OCR can be very useful in many instances and use cases. However, like any other open-source solution, there are always some drawbacks to consider. In this section, we will shed light on these limitations one by one: Tesseract is not as accurate as more advanced solutions embedded with AI.

Discover More Details ›

How much does Tesseract OCR cost? ›

Tesseract OCR and Omni Page OCR are free. For aomni Page OCR you need to install the package UiPath. OmniPage.

Is Tesseract OCR free for commercial use? ›

Tesseract is an open-source OCR engine, available under the Apache 2.0 license. This means that the software is freely available for commercial use. For developers, it also means that they can access its source code, modify it to suit their needs, and contribute to its improvement.

See Details ›

What is a OCR used for? ›

OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image. It is commonly used to recognize text in scanned documents and images.

Find Out More ›

Is Google Tesseract free? ›

Tesseract is a free and open-source command line OCR engine that was developed at Hewlett-Packard in the mid–80s, and has been maintained by Google since 2006.

Tell Me More ›

Does Tesseract need internet? ›

Since it is open source and thus runs locally, the only cost of using Tesseract are the resources the machine uses, and there is no need to communicate the document and the results over the internet.

Show Me More ›

What are the limitations of Tesseract? ›

Some of the limitations of Tesseract OCR include: Tesseract is not very good at handling low-quality images or documents with low resolution. It may have difficulty accurately extracting text from images that are blurry, pixelated, or have poor lighting.

Explore More ›

What is the difference between OCR and Tesseract? ›

Tesseract is a library for OCR, which is a specialized subset of CV that's dedicated to extracting text from images. From Tesseract Github: .....can be used directly, or (for programmers) using an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages.

Is Google OCR better than Tesseract? ›

Tesseract is a great option for clean text for which typography does not present particular challenges. Google Vision will produce high-quality results on more complex characters, as long as the layout is very basic.

Show Me More ›

Is OCR 100% accurate? ›

Even simple features like rubber band OCR and zonal OCR require accurate underlying character recognition. Although there's no such thing as 100% accurate OCR without human help, making a huge improvement is very possible.

Read The Full Story ›

How can I do OCR for free? ›

How To Do OCR on PDF for Free

Import or drag & drop your file to our PDF OCR tool.
Wait a few seconds while our OCR tech works its magic.
Edit the PDF with our other tools if needed.
Download or share your editable PDF file—done!

See Details ›

Can Tesseract read PDF? ›

Tesseract does not support reading PDF files. If you need to OCR PDF files, you should either convert them to another format or use OCRmyPDF. Note: Tesseract does support PDF as an output format.

Get More Info Here ›

Can Tesseract read handwriting? ›

Handwritten text can also be recognized using tesseract but with a lower accuracy as compared to the recognition done on printed or typed text. This is because every person has a unique style of writing and the computer has to be trained with a limited amount of input.

Does Tesseract OCR work offline? ›

Tesseract OCR is an offline tool, which provides some options it can be run with. The one that makes the most difference in the example problems we have here is page segmentation mode.

When would you use OCR? ›

OCR can be used for a variety of applications, including the following:

Scanning printed documents into versions that can be edited with word processors, like Microsoft Word or Google Docs.
Indexing print material for search engines.
Automating data entry, extraction and processing.

More items...

View Details ›

What are the capabilities of the Tesseract? ›

Capabilities. The Tesseract was a cube which contained the Space Stone, an Infinity Stone representing the element of space. The Tesseract could open wormholes to any part of the universe and provide interdimensional travel.