iOS Offline On-Device Live OCR and Translation with ML Kit, Apple Vision and Tesseract

In this article, I’ll explore the technical implementation behind an iOS demo application designed to showcase offline live OCR with real-time translation, as well as barcode scanning.

The full source code is available in the GitHub repository: https://github.com/AndreiMaksimovich/ios-live-offline-ocr-and-translation-demo

Technology Stack

The application is written in Swift and uses SwiftUI as its GUI framework.

OCR

The application integrates multiple libraries to provide live, offline OCR functionality:

ML Kit

The Google ML Kit Text Recognition v2 API (on-device/offline) supports text recognition in Chinese, Devanagari, Japanese, Korean, and Latin-based character sets.

Learn more: https://developers.google.com/ml-kit/vision/text-recognition/v2
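
As a rough illustration of how the on-device API is typically driven (the image source and the Latin-script options below are assumptions, not the demo’s exact code):

```swift
import MLKitTextRecognition
import MLKitVision
import UIKit

func recognizeText(in image: UIImage) {
    // Latin-script recognizer; Chinese, Devanagari, Japanese and Korean use
    // their own options types (e.g. JapaneseTextRecognizerOptions).
    let recognizer = TextRecognizer.textRecognizer(options: TextRecognizerOptions())

    let visionImage = VisionImage(image: image)
    visionImage.orientation = image.imageOrientation

    recognizer.process(visionImage) { result, error in
        guard let result, error == nil else { return }
        for block in result.blocks {
            // Each block carries its text plus frame and corner points.
            print(block.text, block.frame)
        }
    }
}
```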

Apple Vision

The Apple Vision Framework supports on-device text recognition in 18 languages, including: English (US, UK, Australia), French (France, Canada), German, Italian, Spanish (Spain, Mexico), Portuguese (Brazil), Chinese (Simplified — zh-Hans, Traditional — zh-Hant, Hong Kong), Japanese, Korean, Arabic, Russian, and Ukrainian.

Learn more: https://developer.apple.com/documentation/vision/recognizing-text-in-images
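
A minimal Vision sketch (the language list and input image are illustrative):

```swift
import Vision

func recognizeText(in cgImage: CGImage) throws {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        for observation in observations {
            // topCandidates(1) returns the most confident transcription.
            guard let candidate = observation.topCandidates(1).first else { continue }
            print(candidate.string, candidate.confidence, observation.boundingBox)
        }
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US", "uk-UA"]

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])
}
```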

Tesseract

Tesseract is an open-source OCR engine released under the Apache 2.0 license, capable of recognizing text in a wide range of languages (https://github.com/tesseract-ocr/tessdata/tree/3.04.00/). This project uses the Tesseract-OCR-iOS wrapper, which is based on Tesseract v3 (the latest available version at the time of writing is v5.5.1). This older library performs OCR on the CPU, making it slower and less reliable than Google ML Kit or Apple Vision. Within this demo, its primary role is as a fallback engine for less common languages, such as Georgian (Kartuli).

Learn more:

https://github.com/tesseract-ocr/tesseract

https://github.com/gali8/Tesseract-OCR-iOS
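
A rough sketch of how the gali8 wrapper is typically used (the language code and configuration are assumptions; the corresponding traineddata files must be bundled by the app):

```swift
import TesseractOCR
import UIKit

func recognizeGeorgianText(in image: UIImage) -> String? {
    // "kat" is the tessdata code for Georgian; the matching .traineddata
    // file must ship with the app in a "tessdata" directory.
    guard let tesseract = G8Tesseract(language: "kat") else { return nil }
    tesseract.engineMode = .tesseractOnly
    tesseract.pageSegmentationMode = .auto
    tesseract.image = image
    tesseract.recognize() // synchronous and CPU-bound: run off the main thread
    return tesseract.recognizedText
}
```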

Offline On-Device Translation

For offline on-device translation, the project uses Google ML Kit Translation, which supports over 50 languages: https://developers.google.com/ml-kit/language/translation/translation-language-support. Translation models are downloaded on demand, and while the translation quality isn’t perfect, it is sufficient for practical tasks such as translating labels, signs, and short text snippets.

Learn more: https://developers.google.com/ml-kit/language/translation
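
A minimal sketch of the translation flow, including the on-demand model download (the language pair is illustrative):

```swift
import MLKitTranslate

func translate(_ text: String, completion: @escaping (String?) -> Void) {
    let options = TranslatorOptions(sourceLanguage: .english, targetLanguage: .german)
    let translator = Translator.translator(options: options)

    // Models are fetched on demand; the conditions control when downloads may run.
    let conditions = ModelDownloadConditions(allowsCellularAccess: false,
                                             allowsBackgroundDownloading: true)
    translator.downloadModelIfNeeded(with: conditions) { error in
        guard error == nil else { completion(nil); return }
        translator.translate(text) { translated, error in
            completion(error == nil ? translated : nil)
        }
    }
}
```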

Barcode Scanner

For barcode scanning, the project uses the ML Kit Barcode Scanner API, which supports a wide range of standard formats:

  • Linear barcodes: Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E
  • 2D barcodes: Aztec, Data Matrix, PDF417, QR Code

Learn more: https://developers.google.com/ml-kit/vision/barcode-scanning

Application Architecture

The application is built with SwiftUI, and its architecture for state management and service/API provisioning is centered on contextual usage, which allows seamless integration with SwiftUI previews as well as efficient on-device and off-device testing.

Application State

The application is simple enough to rely on a single global @Observable object (AppState) for global state management. For individual views, ViewModels handle local state, and in some cases model values are passed directly to child views, since the project’s views don’t require complex contextual or shared model structures.
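
A minimal sketch of this setup; the property names are hypothetical, not the project’s actual state:

```swift
import Observation
import SwiftUI

// Hypothetical shape of the global state object.
@Observable
final class AppState {
    var translationEnabled = false
    var selectedLanguage = "en"
}

@main
struct DemoApp: App {
    @State private var appState = AppState()

    var body: some Scene {
        WindowGroup {
            ContentView()
                .environment(appState) // injected once, read anywhere below
        }
    }
}

struct ContentView: View {
    @Environment(AppState.self) private var appState

    var body: some View {
        Text(appState.translationEnabled ? "Translation on" : "Translation off")
    }
}
```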

Service/API Provision

The application uses a single provision point for all Services and APIs – a global @Observable object called AppCore. All services, factories, and APIs are designed as abstract components, allowing this global object to be reconfigured during the production, debug, or test initialization phase to use production, mocked, proxied, or dummy-data implementations as needed.
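
A simplified sketch of the idea, assuming a single illustrative service protocol (the real AppCore exposes many more):

```swift
import Observation

// Illustrative service abstraction; the project defines richer interfaces.
protocol TranslationService {
    func translate(_ text: String) async throws -> String
}

struct MockTranslationService: TranslationService {
    func translate(_ text: String) async throws -> String { "[mock] \(text)" }
}

// Single provision point for services and APIs.
@Observable
final class AppCore {
    var translationService: any TranslationService

    init(translationService: any TranslationService) {
        self.translationService = translationService
    }
}

extension AppCore {
    // Previews and tests swap in mocked implementations at initialization time.
    static func preview() -> AppCore {
        AppCore(translationService: MockTranslationService())
    }
}
```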

Concurrency

The application uses a DispatchQueue to ensure ordered background execution of camera-related tasks. The camera subsystem also employs an Owner Lock mechanism to prevent overlapping or excessive configuration operations. All services and APIs run asynchronously on the main thread, while time-intensive or synchronous tasks are delegated to background threads through dedicated DispatchQueues, ensuring responsive and stable performance.
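
One common way to express this pattern, shown here as a sketch rather than the project’s actual code, is to bridge blocking work onto a dedicated serial queue behind an async function:

```swift
import Foundation

// Dedicated serial queue: camera-related work executes in order, off the main thread.
private let cameraQueue = DispatchQueue(label: "demo.camera", qos: .userInitiated)

/// Runs a blocking, camera-related operation on the serial queue while callers simply await.
func performOnCameraQueue<T>(_ work: @escaping () throws -> T) async throws -> T {
    try await withCheckedThrowingContinuation { continuation in
        cameraQueue.async {
            do {
                continuation.resume(returning: try work())
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}
```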

Camera

The camera system follows a Provider–Client architecture, where the camera is automatically initialized or stopped based on the presence of active clients. Components such as the Camera Preview View and image/video capture services operate as Camera Clients, leveraging the shared provider for coordinated access and resource management.
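
A minimal sketch of such a provider, with hypothetical method names and the session configuration (inputs/outputs) omitted:

```swift
import AVFoundation

/// Starts the capture session when the first client registers and stops it
/// when the last one leaves.
final class CameraProvider {
    private let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "demo.camera.provider")
    private var clients = Set<UUID>()

    func register() -> UUID {
        let id = UUID()
        queue.async {
            self.clients.insert(id)
            if !self.session.isRunning { self.session.startRunning() }
        }
        return id
    }

    func unregister(_ id: UUID) {
        queue.async {
            self.clients.remove(id)
            if self.clients.isEmpty, self.session.isRunning { self.session.stopRunning() }
        }
    }
}
```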

OCR

The OCR system is implemented as an abstraction layer, a unified API that integrates Tesseract, ML Kit, and Apple Vision under a common interface. It uses standardized OCR result data structures and a simple async API to provide OCR functionality.
Language metadata baked into the application defines the preferred OCR engine and its configuration, enabling flexible initialization through the following flow:
Target Language → Model Info & Configuration → OCR Service Factory → Initialized OCR Service Instance.
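
A sketch of what such an abstraction might look like (type and field names are illustrative, not the project’s):

```swift
import UIKit

// Unified result structure shared by all engines.
struct OCRTextBlock {
    let text: String
    let boundingBox: CGRect
    let confidence: Float
}

// Common async interface each engine adapter conforms to.
protocol OCRService {
    func recognize(in image: UIImage) async throws -> [OCRTextBlock]
}

// Per-language metadata selects the preferred engine and configuration.
enum OCREngine { case mlKit, appleVision, tesseract }

struct OCRModelInfo {
    let languageCode: String
    let engine: OCREngine
}

// The factory turns metadata into an initialized service instance.
protocol OCRServiceFactory {
    func makeService(for model: OCRModelInfo) -> any OCRService
}
```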

Live Translation

The live translation system follows a similar architectural pattern to the OCR component. Its initialization flow is structured as:
Source & Target Languages → ML Kit Translation Source & Target Languages → Translation Service Factory → Initialized Translation Service Instance.

Barcode Scanner

The Barcode Scanner Service is the simplest component in the system – it doesn’t use factories or unified result structures. Instead, it provides a straightforward API abstraction that directly returns ML Kit’s Barcode Scanner results.
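
A minimal sketch of that abstraction, assuming the documented ML Kit iOS barcode API (the wrapper shape itself is illustrative):

```swift
import MLKitBarcodeScanning
import MLKitVision
import UIKit

/// Thin abstraction that hands back ML Kit's barcode results directly.
final class BarcodeScannerService {
    // .all enables every supported format; a narrower OptionSet can be passed instead.
    private let scanner = BarcodeScanner.barcodeScanner(options: BarcodeScannerOptions(formats: .all))

    func scan(_ image: UIImage, completion: @escaping ([Barcode]) -> Void) {
        let visionImage = VisionImage(image: image)
        visionImage.orientation = image.imageOrientation
        scanner.process(visionImage) { barcodes, error in
            completion(error == nil ? (barcodes ?? []) : [])
        }
    }
}
```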

OCR Cycle

The application’s OCR cycle follows a straightforward, asynchronous, linear process (a simplified sketch follows the list):

  1. Capture a video frame using the camera’s image/video capture service.
  2. Run OCR on the captured frame using the currently selected OCR model.
  3. Translate the OCR results, if translation is enabled.
  4. Scan for barcodes within the same frame.
  5. Publish results to the shared model for display.
  6. Wait briefly, then repeat the cycle.
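
A condensed sketch of this cycle, using illustrative protocol names rather than the project’s actual types:

```swift
import UIKit

// Illustrative interfaces; the real project defines its own abstractions.
protocol FrameSource { func captureFrame() async throws -> UIImage }
protocol TextRecognizing { func recognize(in image: UIImage) async throws -> [String] }
protocol Translating { func translate(_ texts: [String]) async throws -> [String] }
protocol BarcodeScanning { func scan(_ image: UIImage) async throws -> [String] }

@MainActor
final class ResultsModel {
    var texts: [String] = []
    var barcodes: [String] = []
}

@MainActor
func runOCRCycle(camera: any FrameSource,
                 ocr: any TextRecognizing,
                 translator: (any Translating)?,
                 scanner: any BarcodeScanning,
                 model: ResultsModel) async {
    while !Task.isCancelled {
        do {
            let frame = try await camera.captureFrame()        // 1. capture a frame
            var texts = try await ocr.recognize(in: frame)     // 2. run OCR
            if let translator {                                // 3. optional translation
                texts = try await translator.translate(texts)
            }
            let codes = try await scanner.scan(frame)          // 4. scan barcodes
            model.texts = texts                                // 5. publish results
            model.barcodes = codes
        } catch {
            // A failed frame is skipped; the cycle keeps running.
        }
        try? await Task.sleep(for: .milliseconds(300))         // 6. wait, then repeat
    }
}
```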

Application UI

Initialization View

The application starts in an initialization view, a typical demo-style approach to upfront data initialization. In this case, it is used to ensure that camera permission is granted and that translation models are downloaded. In a production application, this step should be replaced with on-demand resource download and initialization.
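
A rough sketch of what this initialization step involves (the language pair and control flow are assumptions):

```swift
import AVFoundation
import MLKitTranslate

/// Requests camera permission, then downloads the required translation model.
func performInitialization() async -> Bool {
    // 1. Camera permission.
    let granted = await AVCaptureDevice.requestAccess(for: .video)
    guard granted else { return false }

    // 2. Translation model download (language pair is illustrative).
    let translator = Translator.translator(
        options: TranslatorOptions(sourceLanguage: .english, targetLanguage: .german)
    )
    let conditions = ModelDownloadConditions(allowsCellularAccess: true,
                                             allowsBackgroundDownloading: true)
    return await withCheckedContinuation { continuation in
        translator.downloadModelIfNeeded(with: conditions) { error in
            continuation.resume(returning: error == nil)
        }
    }
}
```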

Main View

The application’s root view functions as a custom tab view, designed to provide full control over each tab’s lifecycle. In the current implementation, each tab contains a single view, but the system can be easily extended to support per-tab navigation stacks, features like history tracking, and interactions such as double-tapping a tab icon to reset its navigation history.

Camera Preview

The Camera Preview View serves as a SwiftUI wrapper around the UIKit-based CameraPreviewLayerView. Its model functions as a Camera Client, managing connection and interaction with the shared camera manager.
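
A minimal sketch of such a wrapper; the type names mirror the article’s terminology, but the implementation below is illustrative rather than the project’s:

```swift
import SwiftUI
import AVFoundation

/// UIKit view whose backing layer is an AVCaptureVideoPreviewLayer.
final class CameraPreviewLayerView: UIView {
    override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }
    var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }
}

/// SwiftUI wrapper around the UIKit preview view.
struct CameraPreview: UIViewRepresentable {
    let session: AVCaptureSession

    func makeUIView(context: Context) -> CameraPreviewLayerView {
        let view = CameraPreviewLayerView()
        view.previewLayer.session = session
        view.previewLayer.videoGravity = .resizeAspectFill
        return view
    }

    func updateUIView(_ uiView: CameraPreviewLayerView, context: Context) {}
}
```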

Result View

The Result View functions as an overlay on top of the Camera Preview View. It receives the OCR results, barcode scan results, and source image data directly from the parent view model. Using this information, it renders text and bounding rectangles around detected elements with SwiftUI’s Canvas component.
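
A simplified sketch of a Canvas-based overlay (the result structure and the normalized-coordinate convention are assumptions):

```swift
import SwiftUI

// Illustrative result type; rects are assumed to be normalized (0...1).
struct RecognizedRegion: Identifiable {
    let id = UUID()
    let text: String
    let rect: CGRect
}

struct ResultOverlay: View {
    let regions: [RecognizedRegion]

    var body: some View {
        Canvas { context, size in
            for region in regions {
                // Scale the normalized rect to the overlay's size.
                let rect = CGRect(x: region.rect.minX * size.width,
                                  y: region.rect.minY * size.height,
                                  width: region.rect.width * size.width,
                                  height: region.rect.height * size.height)
                context.stroke(Path(rect), with: .color(.green), lineWidth: 2)
                context.draw(Text(region.text).font(.caption),
                             at: CGPoint(x: rect.midX, y: rect.minY - 8))
            }
        }
        .allowsHitTesting(false)
    }
}
```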

What Next

Tesseract

The current Tesseract wrapper relies on an outdated version of the engine. Upgrading to Tesseract v5 would be highly beneficial, especially since it introduces experimental OpenCL support, which could significantly reduce processing time by offloading computations to the GPU.

AR

Replacing the Camera Preview with an AR View could provide a more engaging user experience. Recognized text elements could serve as AR anchor points, allowing 3D text overlays to move dynamically with the anchors, creating a more interactive and visually appealing interface.

Image Analysis

Image analysis can be employed as a preprocessing step before performing OCR. For instance, using AR, the system can focus text recognition on detected surfaces, applying surface position data to correct perspective and align text to a front-facing orientation. On LiDAR-equipped devices, depth information can similarly aid in object differentiation. Additionally, Vision and ML Kit image analysis tools can segment images into smaller, context-aware regions, enabling more accurate and efficient OCR.

Result Caching / Memory

With AR anchoring capabilities, it would be beneficial to introduce a Result Caching/Memory system. This would enable incremental updates of recognized regions with more accurate results over time, allow experimentation with various effects or filters on low-confidence areas, and skip regions that already have high-confidence recognition scores.

Image Filters & Effects

OCR performs best on high-contrast images, ideally in grayscale or black and white. To improve accuracy in challenging areas, transformational effects and image filters should be integrated into the OCR workflow. A custom neural network could be trained to determine the appropriate filters and/or effects to apply – for example, handling cases like red text on a black background.