How To Install and Use Tesseract OCR on Ubuntu 24.04
In this short tutorial we will show you how to install and use Tesseract on the Ubuntu 24.04 Linux operating system.
Tesseract is an open-source Optical Character Recognition (OCR) engine originally developed by Hewlett-Packard, later sponsored by Google, and now maintained as a community open-source project. It’s widely used for converting images or scanned documents into editable text. Tesseract supports many languages and reads common image formats such as JPG, PNG, and TIFF; it does not read PDF files directly, although it can write searchable PDF output. It works on multiple platforms, including Linux, macOS, and Windows, making it popular in document processing, data extraction, and image-to-text applications.
To install and use Tesseract on Ubuntu, follow these steps:
1. Install Tesseract
2. Convert Images to Text
3. Specify Language (Optional)
4. View Output
1. Install Tesseract
To install Tesseract, first refresh the Ubuntu package index, then install the package. Open a terminal and run the commands below.
$ sudo apt update
$ sudo apt install tesseract-ocr
The output of the install command looks like this:
ramansah@dev02:~$ sudo apt install tesseract-ocr
[sudo] password for ramansah:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  liblept5 libtesseract5 tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  liblept5 libtesseract5 tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 5 newly installed, 0 to remove and 6 not upgraded.
Need to get 8,377 kB of archives.
After this operation, 21.8 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://id.archive.ubuntu.com/ubuntu noble/universe amd64 liblept5 amd64 1.82.0-3build4 [1,099 kB]
Get:2 http://id.archive.ubuntu.com/ubuntu noble/universe amd64 libtesseract5 amd64 5.3.4-1build5 [1,291 kB]
Get:3 http://id.archive.ubuntu.com/ubuntu noble/universe amd64 tesseract-ocr-eng all 1:4.1.0-2 [1,818 kB]
Get:4 http://id.archive.ubuntu.com/ubuntu noble/universe amd64 tesseract-ocr-osd all 1:4.1.0-2 [3,841 kB]
Get:5 http://id.archive.ubuntu.com/ubuntu noble/universe amd64 tesseract-ocr amd64 5.3.4-1build5 [328 kB]
Fetched 8,377 kB in 10s (843 kB/s)
Selecting previously unselected package liblept5:amd64.
(Reading database ... 302415 files and directories currently installed.)
Preparing to unpack .../liblept5_1.82.0-3build4_amd64.deb ...
Unpacking liblept5:amd64 (1.82.0-3build4) ...
Selecting previously unselected package libtesseract5:amd64.
Preparing to unpack .../libtesseract5_5.3.4-1build5_amd64.deb ...
Unpacking libtesseract5:amd64 (5.3.4-1build5) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.1.0-2_all.deb ...
Unpacking tesseract-ocr-eng (1:4.1.0-2) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.1.0-2_all.deb ...
Unpacking tesseract-ocr-osd (1:4.1.0-2) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_5.3.4-1build5_amd64.deb ...
Unpacking tesseract-ocr (5.3.4-1build5) ...
Setting up tesseract-ocr-eng (1:4.1.0-2) ...
Setting up liblept5:amd64 (1.82.0-3build4) ...
Setting up libtesseract5:amd64 (5.3.4-1build5) ...
Setting up tesseract-ocr-osd (1:4.1.0-2) ...
Setting up tesseract-ocr (5.3.4-1build5) ...
Processing triggers for man-db (2.12.0-4build2) ...
Processing triggers for libc-bin (2.39-0ubuntu8.3) ...
You may also want to install additional language packs:
$ sudo apt install tesseract-ocr-LANG
Replace `LANG` with the desired language code (e.g., `eng` for English, `fra` for French).
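As a rough sketch of how the package naming works, the helper below builds the package name from a Tesseract language code. The function name `lang_pkg` is hypothetical and introduced only for illustration; it also assumes that codes containing an underscore (such as `chi_sim`) map to package names with a hyphen, since Debian package names cannot contain underscores.

```shell
#!/bin/sh
# lang_pkg (hypothetical helper): map a Tesseract language code to the
# Ubuntu package name pattern tesseract-ocr-LANG.
# Assumption: underscores in the code become hyphens in the package name.
lang_pkg() {
    printf 'tesseract-ocr-%s\n' "$1" | tr '_' '-'
}

# Example usage (run manually; needs apt and an installed tesseract):
#   sudo apt install "$(lang_pkg fra)"
#   tesseract --list-langs    # lists the languages Tesseract can currently use
```

Running `tesseract --list-langs` after installing a pack is a quick way to confirm that the new language is available.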
After the installation, verify it by checking the installed version with the following command:
$ tesseract --version
The output will look like this:
ramansah@dev02:~$ tesseract --version
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7
ramansah@dev02:~$
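If you want to run this check from a script, a minimal sketch might look like the following. The function names `parse_version` and `check_tesseract` are hypothetical; the parsing assumes the first line of the output has the form `tesseract <version>`, as in the sample above.

```shell
#!/bin/sh
# parse_version (hypothetical helper): read `tesseract --version` output
# on stdin and print the version number found on the first line.
parse_version() {
    awk 'NR == 1 && $1 == "tesseract" { print $2 }'
}

# check_tesseract (hypothetical helper): report whether the tesseract
# binary is on PATH and, if so, which version it is.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # 2>&1 is a precaution in case a release prints the version on stderr
        echo "tesseract $(tesseract --version 2>&1 | parse_version) found"
    else
        echo "tesseract not found; try: sudo apt install tesseract-ocr" >&2
        return 1
    fi
}
```

Calling `check_tesseract` at the top of an automation script lets it stop early with a clear message when the package is missing.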
2. Convert Images to Text
To convert an image to text, use the simple command below.
$ tesseract image.jpg output
Replace `image.jpg` with your image file and `output` with your desired output base name. Tesseract appends the `.txt` extension itself, so this extracts the text from the image and saves it in `output.txt`. (Passing `output.txt` as the second argument would produce a file named `output.txt.txt`.)
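A sample run is shown below. The output base here is `/home/ramansah/ocr_text`, so the extracted text is written to `/home/ramansah/ocr_text.txt`: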
ramansah@dev02:~/Downloads$ ls -ltr
total 76
-rw-rw-r-- 1 ramansah ramansah 8102 Nov 14 14:21 football.jpeg
-rw-rw-r-- 1 ramansah ramansah 0 Nov 14 14:35 ocr_text.txt
-rw-rw-r-- 1 ramansah ramansah 67411 Nov 14 14:48 sbux_sample.jpg
ramansah@dev02:~/Downloads$ pwd
/home/ramansah/Downloads
ramansah@dev02:~/Downloads$ tesseract /home/ramansah/Downloads/sbux_sample.jpg /home/ramansah/ocr_text
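Because Tesseract takes an output base name and adds `.txt` on its own, it can be convenient to derive that base from the image name. The sketch below shows one way to do this; `output_base` and `ocr_image` are hypothetical helper names, not part of Tesseract.

```shell
#!/bin/sh
# output_base (hypothetical helper): strip the directory and the file
# extension from an image path, e.g. /tmp/scan.jpg -> scan.
output_base() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# ocr_image (hypothetical helper): run Tesseract on one image, writing
# <base>.txt into the current directory, then print that file name.
ocr_image() {
    base=$(output_base "$1")
    tesseract "$1" "$base" && printf '%s.txt\n' "$base"
}

# Example usage (requires tesseract to be installed):
#   ocr_image /home/ramansah/Downloads/sbux_sample.jpg
```

With this approach the text file is named after the image, which avoids accidentally overwriting an earlier result when several images are processed.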
3. Specify Language (Optional)
If the image text is in a specific language, add the `-l` option:
$ tesseract image.jpg output -l eng
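The `-l` option also accepts several languages joined with `+` (for example `eng+fra`), which can help with mixed-language documents. The small sketch below builds such a value from a list of codes; `join_langs` is a hypothetical helper name, and each language used must have its pack installed.

```shell
#!/bin/sh
# join_langs (hypothetical helper): join all arguments with "+",
# e.g. "eng fra" -> "eng+fra", the form accepted by tesseract -l.
join_langs() {
    # The subshell keeps the IFS change from leaking into the caller.
    ( IFS=+; printf '%s\n' "$*" )
}

# Example usage (requires tesseract and both language packs):
#   tesseract image.jpg output -l "$(join_langs eng fra)"
```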
4. View Output
Check the `output.txt` file for the extracted text.
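When checking results from a script, it may help to confirm that the file actually contains recognized text before reading it. The following is a minimal sketch under that assumption; `has_text` and `preview_text` are hypothetical helper names.

```shell
#!/bin/sh
# has_text (hypothetical helper): succeed if the file contains at least
# one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# preview_text (hypothetical helper): print the first 20 lines of the
# result file, or complain if nothing was recognized.
preview_text() {
    if has_text "$1"; then
        head -n 20 "$1"
    else
        echo "no text recognized in $1" >&2
        return 1
    fi
}

# Example usage:
#   preview_text output.txt
```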
A sample is shown below:
1. Original source file
In this sample we use a Starbucks Coffee payment receipt as the image to be converted to a text file.
2. Converted file
As the screenshot above shows, the image file was converted to text successfully. This tool is very useful for automation applications that need to convert large numbers of image files to text.
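For the automation case, a simple loop over a directory is often enough. The sketch below assumes the images have `.jpg`, `.jpeg`, or `.png` extensions and writes each text file next to its image; `ocr_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# ocr_dir (hypothetical helper): run Tesseract on every .jpg, .jpeg and
# .png file in a directory (default: current directory). Each result is
# written next to its image, e.g. receipts/a.jpg -> receipts/a.txt.
ocr_dir() {
    dir=${1:-.}
    for img in "$dir"/*.jpg "$dir"/*.jpeg "$dir"/*.png; do
        [ -e "$img" ] || continue    # skip patterns that matched nothing
        # ${img%.*} drops the extension; Tesseract appends .txt itself
        tesseract "$img" "${img%.*}" || echo "failed: $img" >&2
    done
}

# Example usage (requires tesseract to be installed):
#   ocr_dir /home/ramansah/Downloads
```

The loop continues past images that fail, printing their names to standard error so they can be retried later.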
Conclusion
In this short article we have shown how to install Tesseract on the Ubuntu 24.04 LTS operating system and how to perform a simple task: converting an image file into a text file.