In this article, I’ll explore the technical implementation behind an iOS demo application designed to showcase offline live OCR, real-time translation, and barcode scanning.
The full source code is available in the GitHub repository: https://github.com/AndreiMaksimovich/ios-live-offline-ocr-and-translation-demo
Technology Stack
The application is written in Swift and uses SwiftUI as its GUI framework.
OCR
The application integrates multiple libraries to provide live, offline OCR functionality:
MLKit
The Google MLKit Text Recognition v2 API (on-device/offline) supports text recognition in Chinese, Devanagari, Japanese, Korean, and Latin-based character sets.
Learn more: https://developers.google.com/ml-kit/vision/text-recognition/v2
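As a rough illustration of the API (not the demo’s actual wrapper code), recognizing Latin-script text in a captured frame with ML Kit looks roughly like this; the function shape and image handling are assumptions:

```swift
import MLKitTextRecognition
import MLKitVision
import UIKit

// Rough sketch (not the demo's wrapper): on-device Latin-script recognition with
// ML Kit Text Recognition v2. Other scripts use dedicated option types,
// e.g. ChineseTextRecognizerOptions from MLKitTextRecognitionChinese.
func recognizeLatinText(in image: UIImage, completion: @escaping (String?) -> Void) {
    let recognizer = TextRecognizer.textRecognizer(options: TextRecognizerOptions())
    let visionImage = VisionImage(image: image)
    visionImage.orientation = image.imageOrientation

    recognizer.process(visionImage) { result, error in
        guard error == nil, let result else {
            completion(nil)
            return
        }
        // result.blocks -> lines -> elements also expose frames and corner points
        // that can be used to draw overlays.
        completion(result.text)
    }
}
```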
Apple Vision
The Apple Vision Framework supports on-device text recognition in 18 languages, including: English (US, UK, Australia), French (France, Canada), German, Italian, Spanish (Spain, Mexico), Portuguese (Brazil), Chinese (Simplified — zh-Hans, Traditional — zh-Hant, Hong Kong), Japanese, Korean, Arabic, Russian, and Ukrainian.
Learn more: https://developer.apple.com/documentation/vision/recognizing-text-in-images
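For comparison, a minimal Vision-based sketch might look like the following; the language list, recognition level, and queue are illustrative choices rather than the demo’s exact configuration:

```swift
import Foundation
import CoreGraphics
import Vision

// Rough sketch of on-device text recognition with the Vision framework.
func recognizeText(in cgImage: CGImage, completion: @escaping ([String]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = (request.results as? [VNRecognizedTextObservation]) ?? []
        // Keep the top candidate for every detected text region.
        completion(observations.compactMap { $0.topCandidates(1).first?.string })
    }
    request.recognitionLevel = .accurate        // .fast suits live camera frames better
    request.recognitionLanguages = ["en-US", "de-DE"]
    request.usesLanguageCorrection = true

    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        try? handler.perform([request])
    }
}
```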
Tesseract
Tesseract is an open-source OCR engine released under the Apache 2.0 license, capable of recognizing text in a wide range of languages (https://github.com/tesseract-ocr/tessdata/tree/3.04.00/). This project uses the Tesseract-OCR-iOS wrapper, which is based on Tesseract v3, while the latest Tesseract release at the time of writing is v5.5.1. This older library performs OCR on the CPU, making it slower and less reliable than Google ML Kit or Apple Vision. Within this demo, its primary role is as a fallback engine for less common languages, such as Georgian (Kartuli).
Learn more:
https://github.com/tesseract-ocr/tesseract
https://github.com/gali8/Tesseract-OCR-iOS
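A minimal sketch of recognizing Georgian text through the Tesseract-OCR-iOS wrapper could look like this, assuming kat.traineddata is bundled in the app’s tessdata folder; the configuration values are assumptions based on the wrapper’s public G8Tesseract API:

```swift
import TesseractOCR   // Tesseract-OCR-iOS wrapper (pod "TesseractOCRiOS")
import UIKit

// Rough sketch: CPU-based recognition of Georgian text via the G8Tesseract wrapper.
// Assumes kat.traineddata is bundled in a "tessdata" folder in the app bundle.
func recognizeGeorgianText(in image: UIImage) -> String? {
    guard let tesseract = G8Tesseract(language: "kat") else { return nil }
    tesseract.engineMode = .tesseractOnly
    tesseract.pageSegmentationMode = .auto
    tesseract.image = image
    _ = tesseract.recognize()          // synchronous and slow, run it off the main thread
    return tesseract.recognizedText
}
```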
Offline On-Device Translation
For offline on-device translation, the project uses Google MLKit Translation, which supports over 50 languages: https://developers.google.com/ml-kit/language/translation/translation-language-support. Translation models are downloaded on demand, and while the translation quality isn’t perfect, it is sufficient for practical tasks such as translating labels, signs, and short text snippets.
Learn more: https://developers.google.com/ml-kit/language/translation
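A minimal sketch of the on-demand model download and translation with ML Kit Translation might look like this; the language pair and download conditions are illustrative:

```swift
import MLKitTranslate

// Rough sketch: download a translation model on demand, then translate a string.
let options = TranslatorOptions(sourceLanguage: .english, targetLanguage: .german)
let translator = Translator.translator(options: options)

let conditions = ModelDownloadConditions(allowsCellularAccess: false,
                                         allowsBackgroundDownloading: true)
translator.downloadModelIfNeeded(with: conditions) { error in
    guard error == nil else { return }
    translator.translate("Exit") { translatedText, error in
        if let translatedText {
            print(translatedText)   // e.g. "Ausgang"
        }
    }
}
```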
Barcode Scanner
For barcode scanning, the project uses the ML Kit Barcode Scanner API, which supports a wide range of standard formats:
- Linear barcodes: Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E
- 2D barcodes: Aztec, Data Matrix, PDF417, QR Code
Learn more: https://developers.google.com/ml-kit/vision/barcode-scanning
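A minimal sketch of scanning a captured frame with the ML Kit Barcode Scanner API could look like this; the format restriction and function shape are illustrative choices:

```swift
import MLKitBarcodeScanning
import MLKitVision
import UIKit

// Rough sketch: scan a captured frame for QR codes with ML Kit. Restricting the
// formats is optional, but it speeds up detection.
func scanForQRCodes(in image: UIImage, completion: @escaping ([Barcode]) -> Void) {
    let options = BarcodeScannerOptions(formats: .qrCode)
    let scanner = BarcodeScanner.barcodeScanner(options: options)
    let visionImage = VisionImage(image: image)
    visionImage.orientation = image.imageOrientation

    scanner.process(visionImage) { barcodes, _ in
        // Each Barcode exposes rawValue, displayValue, format, and frame (for overlays).
        completion(barcodes ?? [])
    }
}
```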
Application Architecture
The application is built with SwiftUI, and its architecture for state management and service/API provisioning is designed around contextual usage, allowing seamless integration with SwiftUI previews as well as efficient on-device and off-device testing.
Application State
The application is simple enough to rely on a single global @Observable object (AppState) for global state management. For individual views, ViewModels handle local state, and in some cases model values are passed directly to child views, as the project’s views don’t require complex contextual or shared model structures.
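As a purely hypothetical sketch (the real AppState will differ), such a global state object could be as simple as:

```swift
import Observation

// Hypothetical sketch; the property names are illustrative, not the demo's actual AppState.
@Observable
final class AppState {
    var sourceLanguage = "en"
    var targetLanguage = "de"
    var isTranslationEnabled = true
    var isBarcodeScanningEnabled = true
}
```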
Service/API Provision
The application uses a single provision point for all Services and APIs – a global @Observable object called AppCore. All services, factories, and APIs are designed as abstract components, allowing this global object to be reconfigured during the production, debug, or test initialization phase to use production, mocked, proxied, or dummy-data implementations as needed.
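A hypothetical sketch of this provision point, with made-up protocol and type names, could look like the following; swapping production implementations for mocks then only requires constructing AppCore differently:

```swift
import Observation
import CoreGraphics

// Hypothetical sketch of a single provision point; the protocols and names are
// illustrative, not the project's actual API.
protocol OCRProvider { func recognize(_ image: CGImage) async throws -> [String] }
protocol TranslationProvider { func translate(_ text: String) async throws -> String }

@Observable
final class AppCore {
    var ocr: any OCRProvider
    var translation: any TranslationProvider

    init(ocr: any OCRProvider, translation: any TranslationProvider) {
        self.ocr = ocr
        self.translation = translation
    }
}

// Production vs. preview/test wiring (all types hypothetical):
// let core = AppCore(ocr: MLKitOCRProvider(), translation: MLKitTranslationProvider())
// let previewCore = AppCore(ocr: MockOCRProvider(), translation: MockTranslationProvider())
```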
Concurrency
The application uses a DispatchQueue to ensure ordered background execution of camera-related tasks. The camera subsystem also employs an Owner Lock mechanism to prevent overlapping or excessive configuration operations. All services and APIs run asynchronously on the main thread, while time-intensive or synchronous tasks are delegated to background threads through dedicated DispatchQueues, ensuring responsive and stable performance.
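As a hedged sketch of this pattern, a synchronous, time-intensive task can be delegated to a dedicated serial queue and bridged back into async code roughly like this (the Owner Lock mechanism is not shown, and the names are illustrative):

```swift
import Foundation

// Hedged sketch: run a synchronous, time-intensive task (e.g. a Tesseract pass)
// on a dedicated serial DispatchQueue and await its result from async code.
let ocrQueue = DispatchQueue(label: "demo.ocr.work", qos: .userInitiated)

func runOnOCRQueue<T>(_ work: @escaping () -> T) async -> T {
    await withCheckedContinuation { continuation in
        ocrQueue.async {
            // Serial queue => tasks submitted here execute in order.
            continuation.resume(returning: work())
        }
    }
}
```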
Camera
The camera system follows a Provider–Client architecture, where the camera is automatically initialized or stopped based on the presence of active clients. Components such as the Camera Preview View and image/video capture services operate as Camera Clients, leveraging the shared provider for coordinated access and resource management.
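A hypothetical sketch of the Provider–Client idea, with session input/output configuration omitted and illustrative names, might look like this:

```swift
import AVFoundation

// Hypothetical sketch: start the capture session when the first client attaches,
// stop it when the last one detaches. Session configuration is omitted.
final class CameraProvider {
    private let session = AVCaptureSession()
    private let sessionQueue = DispatchQueue(label: "demo.camera.session")
    private var clientCount = 0

    func attachClient() {
        sessionQueue.async {
            self.clientCount += 1
            if self.clientCount == 1 { self.session.startRunning() }   // first client
        }
    }

    func detachClient() {
        sessionQueue.async {
            self.clientCount -= 1
            if self.clientCount == 0 { self.session.stopRunning() }    // last client
        }
    }
}
```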
OCR
The OCR system is implemented as an abstraction layer / unified API that integrates Tesseract, ML Kit, and Apple Vision under a common interface. It uses standardized OCR result data structures and a simple async API to provide OCR functionality.
Language metadata baked into the application defines the preferred OCR engine and its configuration, enabling flexible initialization through the following flow:
Target Language → Model Info & Configuration → OCR Service Factory → Initialized OCR Service Instance.
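A hypothetical sketch of this flow, using made-up types rather than the project’s actual API, could look like the following:

```swift
import CoreGraphics

// Hypothetical sketch of the initialization flow above.
enum OCREngine { case mlKit, appleVision, tesseract }

protocol OCRService {
    func recognizeText(in image: CGImage) async throws -> [String]
}

struct OCRServiceFactory {
    // Language metadata baked into the app: target language -> preferred engine.
    let engineForLanguage: [String: OCREngine]
    // One builder per engine, injected so previews and tests can swap in mocks.
    let builders: [OCREngine: (String) -> any OCRService]

    func makeService(for language: String) -> (any OCRService)? {
        guard let engine = engineForLanguage[language],
              let build = builders[engine] else { return nil }
        return build(language)
    }
}
```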
Live Translation
The live translation system follows a similar architectural pattern to the OCR component. Its initialization flow is structured as:
Source & Target Languages → ML Kit Translation Source & Target Languages → Translation Service Factory → Initialized Translation Service Instance.
Barcode Scanner
The Barcode Scanner Service is the simplest component in the system – it doesn’t use factories or unified result structures. Instead, it provides a straightforward API abstraction that directly returns ML Kit’s Barcode Scanner results.
OCR Cycle
The application’s OCR workflow follows a straightforward, linear async process (a simplified sketch follows the list):
- Capture a video frame using the camera’s image/video capture service.
- Run OCR on the captured frame using the currently selected OCR model.
- Translate the OCR results, if translation is enabled.
- Scan for barcodes within the same frame.
- Publish results to the shared model for display.
- Wait briefly, then repeat the cycle.
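Using structured concurrency, the cycle could be sketched roughly as follows; every dependency is passed in as a closure to keep the sketch self-contained, and the fixed delay is an illustrative value, not the demo’s actual timing:

```swift
import CoreGraphics

// Simplified sketch of the OCR cycle; the real services and timing may differ.
func runOCRCycle(
    captureFrame: () async -> CGImage?,
    recognizeText: (CGImage) async -> [String],
    translate: ([String]) async -> [String],
    scanBarcodes: (CGImage) async -> [String],
    publish: @MainActor ([String], [String], [String]) -> Void,
    translationEnabled: Bool
) async {
    while !Task.isCancelled {
        if let frame = await captureFrame() {                                 // 1. capture a frame
            let text = await recognizeText(frame)                             // 2. run OCR
            let translated = translationEnabled ? await translate(text) : []  // 3. translate (optional)
            let barcodes = await scanBarcodes(frame)                          // 4. scan barcodes
            await publish(text, translated, barcodes)                         // 5. publish to the shared model
        }
        try? await Task.sleep(for: .milliseconds(250))                        // 6. wait, then repeat
    }
}
```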
Application UI
Initialization View
The application starts in an initialization view, a typical demo-style approach to upfront data initialization; in this case, it ensures that camera permission is granted and that the required translation models are downloaded. In a production application, this step should be replaced with on-demand resource downloading and initialization.
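A hedged sketch of these two initialization steps, using the standard AVFoundation and ML Kit APIs with an illustrative language pair, might look like this:

```swift
import AVFoundation
import MLKitTranslate

// Hedged sketch of the initialization steps; language pair and download
// conditions are illustrative.
func performInitialization() async -> Bool {
    // 1. Camera permission.
    let cameraGranted = await AVCaptureDevice.requestAccess(for: .video)

    // 2. On-device translation model download.
    let translator = Translator.translator(
        options: TranslatorOptions(sourceLanguage: .english, targetLanguage: .german))
    let conditions = ModelDownloadConditions(allowsCellularAccess: true,
                                             allowsBackgroundDownloading: true)
    let modelReady: Bool = await withCheckedContinuation { continuation in
        translator.downloadModelIfNeeded(with: conditions) { error in
            continuation.resume(returning: error == nil)
        }
    }
    return cameraGranted && modelReady
}
```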
Main View
The application’s root view functions as a custom tab view, designed to provide full control over each tab’s lifecycle. In the current implementation, each tab contains a single view, but the system can be easily extended to support per-tab navigation stacks, features like history tracking, and interactions such as double-tapping a tab icon to reset its navigation history.
Camera Preview
The Camera Preview View serves as a SwiftUI wrapper around the UIKit-based CameraPreviewLayerView. Its model functions as a Camera Client, managing connection and interaction with the shared camera manager.
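A hypothetical sketch of such a wrapper, with the capture session injected directly for brevity, might look like this; the demo’s actual CameraPreviewLayerView and its Camera Client model may differ:

```swift
import SwiftUI
import UIKit
import AVFoundation

// Hypothetical sketch: a UIKit view backed by AVCaptureVideoPreviewLayer,
// wrapped for SwiftUI via UIViewRepresentable.
final class CameraPreviewLayerView: UIView {
    override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }
    var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }
}

struct CameraPreviewView: UIViewRepresentable {
    let session: AVCaptureSession

    func makeUIView(context: Context) -> CameraPreviewLayerView {
        let view = CameraPreviewLayerView()
        view.previewLayer.session = session
        view.previewLayer.videoGravity = .resizeAspectFill
        return view
    }

    func updateUIView(_ uiView: CameraPreviewLayerView, context: Context) {}
}
```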
Result View
The Result View functions as an overlay on top of the Camera Preview View. It receives the OCR results, barcode scan results, and source image data directly from the parent view model. Using this information, it renders text and bounding rectangles around detected elements with SwiftUI’s Canvas component.
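A hypothetical sketch of this overlay, assuming normalized bounding rectangles and a made-up result type, might look like this:

```swift
import SwiftUI

// Hypothetical sketch: draw OCR bounding boxes and text with SwiftUI's Canvas.
// The result type and normalized-coordinate assumption are illustrative.
struct RecognizedItem: Identifiable {
    let id = UUID()
    let text: String
    let normalizedRect: CGRect   // 0...1, relative to the source image
}

struct ResultOverlayView: View {
    let items: [RecognizedItem]

    var body: some View {
        Canvas { context, size in
            for item in items {
                // Scale the normalized rect to the overlay's size.
                let rect = CGRect(x: item.normalizedRect.minX * size.width,
                                  y: item.normalizedRect.minY * size.height,
                                  width: item.normalizedRect.width * size.width,
                                  height: item.normalizedRect.height * size.height)
                context.stroke(Path(rect), with: .color(.green), lineWidth: 2)
                context.draw(Text(item.text).font(.caption).foregroundColor(.green),
                             at: CGPoint(x: rect.midX, y: rect.minY - 8))
            }
        }
    }
}
```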
What Next
Tesseract
The current Tesseract wrapper relies on an outdated version of the engine. Upgrading to Tesseract v5 would be highly beneficial, especially since it introduces experimental OpenCL support, which could significantly reduce processing time by offloading computations to the GPU.
AR
Replacing the Camera Preview with an AR View could provide a more engaging user experience. Recognized text elements could serve as AR anchor points, allowing 3D text overlays to move dynamically with the anchors, creating a more interactive and visually appealing interface.
Image Analysis
Image analysis can be employed as a preprocessing step before performing OCR. For instance, using AR, the system can focus text recognition on detected surfaces, applying surface position data to correct perspective and align text to a front-facing orientation. On LiDAR-equipped devices, depth information can similarly aid in object differentiation. Additionally, Vision and ML Kit image analysis tools can segment images into smaller, context-aware regions, enabling more accurate and efficient OCR.
Result Caching / Memory
With AR anchoring capabilities, it would be beneficial to introduce a Result Caching/Memory system. This would enable incremental updates of recognized regions with more accurate results over time, allow experimentation with various effects or filters on low-confidence areas, and skip regions that already have high-confidence recognition scores.
Image Filters & Effects
OCR performs best on high-contrast images, ideally in grayscale or black and white. To improve accuracy in challenging areas, transformational effects and image filters should be integrated into the OCR workflow. A custom neural network could be trained to determine the appropriate filters and/or effects to apply – for example, handling cases like red text on a black background.
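As a hedged illustration of such preprocessing, a Core Image filter that converts a frame to high-contrast grayscale before OCR could look like this; the filter choice and values are illustrative, not a tuned pipeline:

```swift
import CoreImage
import CoreImage.CIFilterBuiltins
import UIKit

// Rough sketch: high-contrast grayscale preprocessing with Core Image.
func preprocessForOCR(_ image: UIImage) -> UIImage? {
    guard let ciImage = CIImage(image: image) else { return nil }

    let filter = CIFilter.colorControls()
    filter.inputImage = ciImage
    filter.saturation = 0.0     // drop color information (grayscale)
    filter.contrast = 1.5       // boost contrast

    let context = CIContext()
    guard let output = filter.outputImage,
          let cgImage = context.createCGImage(output, from: output.extent) else { return nil }
    return UIImage(cgImage: cgImage)
}
```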
