#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)

I have clearly been out of the loop, because I have only recently learned about the tesseract library in R. If I had known about it earlier, I would have written about it much sooner!

The tesseract library is an R package with bindings to the Tesseract-OCR engine, a powerful optical character recognition (OCR) engine that supports over 100 languages and lets users scan and extract text from pictures. This has direct applications in any field that deals with large amounts of manual data processing, such as accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how it's possible to parse a JPMorgan Chase bank statement using the tesseract, pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

[Images: the four pages of the sample bank statement]

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", dpi = 1000)
## Converting page 1 to sample-bank-statement_1.png... done!
## Converting page 2 to sample-bank-statement_2.png... done!
## Converting page 3 to sample-bank-statement_3.png... done!
## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files, it can be read with tesseract. At this point the text is very unstructured and needs to be parsed. I'll spare you the raw output, but you can see it for yourself by printing the raw_text vector on your machine.
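The OCR call itself is not shown above, so here is a minimal sketch of what that step might look like. It assumes the tesseract package and its English language data are installed, and it rebuilds the .png file names from the conversion messages (in the post these are the paths held in bank_statement). The guard simply skips the OCR call when the package or the files are missing.

```r
# A minimal sketch of the OCR step, not necessarily the exact call used for this post.
# Assumption: the four .png pages produced by pdf_convert() are in the working directory.
png_files <- paste0("sample-bank-statement_", 1:4, ".png")

if (requireNamespace("tesseract", quietly = TRUE) && all(file.exists(png_files))) {
  eng <- tesseract::tesseract("eng")                   # English OCR engine
  raw_text <- tesseract::ocr(png_files, engine = eng)  # character vector, one string per page
  print(length(raw_text))
}
```

Each element of raw_text then holds the recognised text of one page, with lines separated by newline characters.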

From a bank statement, businesses are interested in the data from the fields listed in the document, namely the:

  • Deposits and additions,
  • Checks paid,
  • Other withdrawals, fees and charges; and
  • Daily ending balances.

While this bank statement is relatively small at only 4 pages, a general method for extracting data from JPMorgan Chase bank statements, which could be larger, needs the scanned text to be combined into a single string before it is parsed.

# Bind raw text together
raw_text_combined <- paste(raw_text, collapse = "")
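To make the effect of collapse = "" concrete, here is a small sketch on made-up page text (not the real OCR output). It also shows one thing worth checking: pages are glued together with no separator, so the page boundary only survives if each page's text ends with a newline.

```r
# Hypothetical two-page OCR output (made-up text for illustration)
pages <- c("Page one line A\nPage one line B\n",
           "Page two line A\n")

# Same call as in the post: one long string, pages joined with no separator
combined <- paste(pages, collapse = "")

length(combined)        # a single string
endsWith(pages, "\n")   # TRUE for every page means no lines get fused at page breaks

# Splitting on newlines recovers the individual lines across both pages
all_lines <- strsplit(combined, "\n")[[1]]
all_lines
```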

To get the data for the desired fields, we can use the field titles as anchors for parsing the data. With this idea and the help of regex101.com, the cleaning script below was constructed. Beyond the particular regular expressions involved, the script relies heavily on the natural anchors in the text that enclose the values: where they begin (around the title of the table they belong to) and where they end (just before the total amount).
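Before the full script, here is a small sketch of the anchor idea on a made-up snippet shaped like one section of a statement. The toy text and the variable names are assumptions for illustration; the patterns follow the same steps as the script: flatten newlines, keep what lies between the title and the total, trim to the first date and the last amount, then split into rows.

```r
library(stringr)

# Made-up text in the shape of a statement section (not real OCR output)
toy <- paste0("DEPOSITS AND ADDITIONS\nDATE DESCRIPTION AMOUNT\n",
              "07/02 Deposit $100.00\n07/09 Deposit 250.50\n",
              "Total Deposits and Additions $350.50\n")

# "." does not match a newline, so flatten the text first
flat <- str_replace_all(toy, "\\n", ";")

# Keep only what sits between the section title and the total line
section <- str_extract(flat, "(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)")

# Trim to the span from the first date to the last amount, then split into rows
body <- str_extract(section, "\\d{2}/\\d{2}.*\\.\\d{2}")
rows <- str_split(body, ";")[[1]]

dates   <- str_extract(rows, "\\d{2}/\\d{2}")
amounts <- str_extract(str_extract(rows, "(?<=[A-Za-z] ).*"), "\\d{1,}.*")

data.frame(date = dates, amount = amounts)
```

The lookbehind is case-sensitive, which is what lets the upper-case title and the mixed-case total line act as distinct start and end anchors.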

library(tidyverse)
library(stringr)

deposits_and_additions <- raw_text_combined %>%
  # Replace newlines with ";" because "." does not match a newline
  str_replace_all('\\n', ';') %>%
  # Keep the text between the section title and its total line
  str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>%
  # Trim to the span from the first date to the last amount
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            # [A-Za-z] rather than [A-z], which also matches some punctuation
            amount = ...1 %>% str_extract("(?<=[A-Za-z] ).*") %>% str_extract('\\d{1,}.*'))

checks_paid <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>%
  # Would have to change the regex to get check numbers and descriptions, but it's not relevant here
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

others <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  # Lookbehind (?<=...) on the title, consistent with the other sections
  str_extract('(?<=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            description = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

daily_ending_balances <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  # A line can hold more than one date/balance pair, so split before each further date
  lapply(function(x) { x %>% str_split('(?= \\d{2}\\/\\d{2} )') }) %>%
  unlist() %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = value %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS" = deposits_and_additions,
                       "CHECKS PAID" = checks_paid,
                       "OTHER WITHDRAWALS, FEES & CHARGES" = others,
                       "DAILY ENDING BALANCE" = daily_ending_balances)
statement_data
## $`DEPOSITS AND ADDITIONS`
## # A tibble: 10 x 3
##    date  transaction amount   
##    <chr> <chr>       <chr>    
##  1 07/02 Deposit     17,120.00
##  2 07/09 Deposit     24,610.00
##  3 07/14 Deposit     11,424.00
##  4 07/15 Deposit     1,349.00 
##  5 07/21 Deposit     5,000.00 
##  6 07/21 Deposit     3,120.00 
##  7 07/23 Deposit     33,138.00
##  8 07/28 Deposit     18,114.00
##  9 07/30 Deposit     6,908.63 
## 10 07/30 Deposit     5,100.00 
## 
## $`CHECKS PAID`
## # A tibble: 2 x 2
##   date  amount  
##   <chr> <chr>   
## 1 07/14 1,471.99
## 2 07/08 1,697.05
## 
## $`OTHER WITHDRAWALS, FEES & CHARGES`
## # A tibble: 4 x 3
##   date  description                    amount  
##   <chr> <chr>                          <chr>   
## 1 07/11 Online Payment XXXXX To Vendor 8,928.00
## 2 07/11 Online Payment XXXXX To Vendor 2,960.00
## 3 07/25 Online Payment XXXXX To Vendor 250.00  
## 4 07/30 ADP TX/Fincl Svc ADP           2,887.68
## 
## $`DAILY ENDING BALANCE`
## # A tibble: 11 x 2
##    date  amount    
##    <chr> <chr>     
##  1 07/02 98,727.40 
##  2 07/21 129,173.36
##  3 07/08 97,030.35 
##  4 07/23 162,311.36
##  5 07/09 121,640.35
##  6 07/25 162,061.36
##  7 07/11 109,752.35
##  8 07/28 180,175.36
##  9 07/14 108,280.36
## 10 07/30 189,296.31
## 11 07/16 121,053.36
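The amounts above are still character strings, and the rows are not always in date order (the daily balances come out interleaved, presumably because the statement prints them in two columns). A possible next step, which is not part of the original script, is to convert the amounts to numbers and sort by date. The sketch below uses the two CHECKS PAID rows from the output above; to_number() is a hypothetical helper name.

```r
# Values copied from the CHECKS PAID output above
checks <- data.frame(date = c("07/14", "07/08"),
                     amount = c("1,471.99", "1,697.05"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: drop thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

checks$amount <- to_number(checks$amount)

# Zero-padded "MM/DD" strings sort chronologically within a single year
checks <- checks[order(checks$date), ]

total_checks <- sum(checks$amount)
total_checks
```

A total computed this way can be compared against the "Total Checks Paid" line on the statement as a quick sanity check on the OCR.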

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal, and I hope I was able to show that in this blog. While this blog only focused on JPMorgan Chase bank statements, it's possible to apply the same techniques to other bank statements by tweaking the cleaning script accordingly.
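One way to make that tweaking easier is to pass the anchors in as arguments. The sketch below is only an illustration: extract_between() is a hypothetical helper, the toy text is made up, and the anchor strings for another bank would have to be read off that bank's own statements. It matches the anchors literally (base R, fixed = TRUE) and stops at the first end anchor after the start, whereas the greedy regex in the script runs to the last one.

```r
# Hypothetical helper: text between two literal anchors, or NA if either is missing
extract_between <- function(text, start_anchor, end_anchor) {
  s <- regexpr(start_anchor, text, fixed = TRUE)
  s_pos <- as.integer(s)
  if (s_pos == -1L) return(NA_character_)
  # Everything after the start anchor
  after_start <- substring(text, s_pos + attr(s, "match.length"))
  e_pos <- as.integer(regexpr(end_anchor, after_start, fixed = TRUE))
  if (e_pos == -1L) return(NA_character_)
  substring(after_start, 1, e_pos - 1L)
}

# Made-up text for illustration
toy <- "HEADER\nCHECKS PAID\n07/08 100.00\nTotal Checks Paid 100.00\n"

checks_section <- extract_between(toy, "CHECKS PAID", "Total Checks Paid")
missing_section <- extract_between(toy, "NOT A SECTION", "Total Checks Paid")
```

The returned section can then go through the same date and amount patterns used in the script.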

Thank you for reading!

