How We Built the Document Scanner

The Document Scanner is one of several new and exciting features we’ve recently released in the Bipsync Notes iOS app (available for download here). It’s a powerful tool that captures any form of document, such as reports or articles. We’ve written about this feature previously, covering how to access it as well as some tips and tricks for achieving the best possible results from a scan. In this post we detail the techniques we use to identify sentences and paragraphs within scanned documents.

The Process

Apple’s Vision framework makes recognizing text in images a straightforward task. These are the steps we take to extract text from images and attempt to form paragraphs:

1. Scan images of documents.
2. Perform OCR on the scanned images.
3. Transform the OCR candidates into text with paragraphs.

Let’s discuss each of these in turn.

Step 1: Scan Images of Documents

To recognize text in images, we first need images. We opted for a native solution: presenting VNDocumentCameraViewController. This class supports everything we need to scan documents: it automatically recognizes and captures documents, and users can also capture manually by tapping the shutter button or pressing one of the volume buttons on their device.

The scanned documents from VNDocumentCameraViewController are then passed to an object that conforms to VNDocumentCameraViewControllerDelegate as a collection of UIImage instances.

The code looks like this:
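The original snippet isn’t reproduced here, but a minimal sketch of the delegate handling might look like the following (the class and method names around the delegate callbacks are our own, illustrative choices):

```swift
import UIKit
import VisionKit

final class ScannerViewController: UIViewController, VNDocumentCameraViewControllerDelegate {

    // Present the native document camera.
    func presentScanner() {
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    // Called when the user finishes scanning one or more pages.
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        // Collect every captured page as a UIImage.
        var pages: [UIImage] = []
        for index in 0..<scan.pageCount {
            pages.append(scan.imageOfPage(at: index))
        }
        controller.dismiss(animated: true)
        // Hand the images on to the OCR stage (step 2).
        process(pages)
    }

    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        controller.dismiss(animated: true)
    }

    func process(_ images: [UIImage]) { /* step 2 */ }
}
```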

Step 2: Perform OCR on Scanned Images

Now that we have a collection of images we can perform OCR. The Vision framework is used here too, specifically the VNImageRequestHandler and VNRecognizeTextRequest classes.

VNRecognizeTextRequest is an image-based request type that finds and extracts text in images. We can define parameters such as the following:

  • recognitionLevel: this can be set to “fast” or “accurate”. You can read about the difference between these levels here. We opted for the accurate recognition level, as it lets us leverage whole words and sentences when forming paragraphs for scanned documents.
  • recognitionLanguages: an array of languages to detect in the document. The order of the languages in the array defines the order in which they are used during language processing and text recognition. We currently use American English only.
  • usesLanguageCorrection: a Boolean property that applies language correction, greatly reducing anomalies in the recognized words. We set it to true because we want the most accurate result.
  • revision: specifies which version of the text recognizer to use for a VNRecognizeTextRequest. We currently use revision 1 because we support older versions of iOS (13+).
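Putting those parameters together, a hedged sketch of the request configuration could look like this (the completion handler body is elided; its name follows the article):

```swift
import Vision

// Configure a text recognition request with the parameters described above.
let request = VNRecognizeTextRequest { request, error in
    // Results are handled in ocrRequestHandler(request:); see step 3.
}
request.recognitionLevel = .accurate        // favour accuracy over speed
request.recognitionLanguages = ["en-US"]    // American English only, for now
request.usesLanguageCorrection = true       // reduce anomalies in recognized words
request.revision = VNRecognizeTextRequestRevision1  // iOS 13 compatibility
```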

The VNImageRequestHandler class is responsible for processing a VNRecognizeTextRequest for a given image.

Because we support scanning multiple documents we need to perform OCR on multiple images. However, a VNImageRequestHandler performs OCR on only one image at a time. To get around this limitation we opted to use Foundation’s Operation API with delegates to create a chain of operations. Below is an example of the class we created to perform OCR for a single document image:
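The class itself isn’t reproduced here; a simplified sketch of such an operation, assuming a hypothetical OCROperationDelegate protocol for reporting results and failures, might look like this:

```swift
import UIKit
import Vision

// Illustrative delegate protocol; the article's actual delegate also
// reports progress and overall completion.
protocol OCROperationDelegate: AnyObject {
    func ocrOperation(_ operation: OCROperation, didRecognize observations: [VNRecognizedTextObservation])
    func ocrOperation(_ operation: OCROperation, didFailWith error: Error)
}

final class OCROperation: Operation {
    private let image: UIImage
    weak var delegate: OCROperationDelegate?

    init(image: UIImage) {
        self.image = image
    }

    override func main() {
        guard !isCancelled, let cgImage = image.cgImage else { return }
        performOCR(on: cgImage)
    }

    private func performOCR(on cgImage: CGImage) {
        let request = VNRecognizeTextRequest { [weak self] request, error in
            self?.ocrRequestHandler(request: request, error: error)
        }
        request.recognitionLevel = .accurate
        request.usesLanguageCorrection = true

        // The handler processes the request synchronously for this one image.
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        do {
            try handler.perform([request])
        } catch {
            delegate?.ocrOperation(self, didFailWith: error)
        }
    }

    private func ocrRequestHandler(request: VNRequest, error: Error?) {
        if let error = error {
            delegate?.ocrOperation(self, didFailWith: error)
            return
        }
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        delegate?.ocrOperation(self, didRecognize: observations)
    }
}
```

One such operation per scanned page can be added to a serial OperationQueue to form the chain of operations described above.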

This snippet illustrates how we use a VNImageRequestHandler with a VNRecognizeTextRequest. In our operation queue we call the performOCR: method, which calls the ocrRequestHandler(request:) method with a collection of OCR candidates (OCR candidates are returned from VNRequest observations) containing words and sentences from the image. We also rely on the delegate method calls in this class to update the user on progress, as well as to determine when all OCR operations have completed.

Step 3: Transform OCR Candidates into Text with Paragraphs

The final step is to perform paragraphing before the extracted text can be displayed. This step is crucial to achieving the best possible OCR results.

An OCR candidate looks like this:
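The original illustration isn’t reproduced here; conceptually, each candidate pairs a normalized bounding box with one or more ranked text variations. A hedged sketch of reading one candidate (the helper function is our own, illustrative choice):

```swift
import Vision

// Given one observation from a VNRecognizeTextRequest's results:
func candidate(from observation: VNRecognizedTextObservation) -> (text: String, box: CGRect)? {
    // topCandidates(_:) returns variations ordered from most to least accurate.
    guard let best = observation.topCandidates(1).first else { return nil }
    // boundingBox is a normalized CGRect (values between 0 and 1)
    // in Vision's coordinate space.
    return (best.string, observation.boundingBox)
}
```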

After the request handler processes the request, it calls the request’s completion closure, ocrRequestHandler(request:), passing it the request and any errors that occurred. We then retrieve the observations by querying the request object for its results, which are returned as an array of VNRecognizedTextObservation objects.

Each observation contains information about the extracted text, including the bounding box (a CGRect) and the extracted word or sentence. Each observation also contains one or more variations of the extracted word or sentence, ordered from most accurate to least accurate. We access the top variation and use its text and bounding box to help with paragraphing.

The array of VNRecognizedTextObservation objects is often received in the order in which humans would read English text, but sometimes observations can be positioned incorrectly in the array. For example, in the image below the observation for the term “The first draft” could appear just after the observation for the term “a paper or an article”, making it difficult to decide where words or sentences belong. The results also don’t tell us where paragraphs begin and end.

Despite using the accurate recognition level we can still expect anomalies like these. They usually occur when the images are distorted, misaligned or contain non-uniform formatting. So we needed to refine our approach so we could identify the position of each word or sentence, which would in turn let us construct text with the best possible paragraphing. To achieve this we perform the following steps:

1. Sorting Candidates

When an observation candidate’s bounding box has a greater Y position than the next candidate’s, the two should be swapped: for the current candidate to precede the next one, the next candidate should have an identical, similar or larger Y position (note the coordinate transformation happening in the ocrRequestHandler(request:) method). This can be achieved with a sort by ascending Y position, although great care should be taken, as we are dealing with fractional values that differ beyond two decimal places.

Here we are also assuming that observation candidates are given to us in a manner that matches the way humans read English text. Below is the code we use to sort the collection of observations; we call the sortCandidates: method before paragraphing is performed.
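Their exact implementation isn’t shown here; a simplified sketch, using a plain Candidate struct as a stand-in for VNRecognizedTextObservation and an illustrative tolerance for near-equal Y positions, could look like this:

```swift
import Foundation

// Simplified stand-in for a recognized-text observation:
// the extracted string plus its normalized bounding box.
struct Candidate {
    let text: String
    let boundingBox: CGRect
}

// Sort candidates into reading order: top-to-bottom first, then
// left-to-right for candidates that sit on (roughly) the same line.
// Assumes Y coordinates have already been flipped to grow downward.
func sortCandidates(_ candidates: [Candidate]) -> [Candidate] {
    // Positions are normalized fractions, so compare with a small
    // tolerance rather than exact equality.
    let lineTolerance: CGFloat = 0.01
    return candidates.sorted { a, b in
        if abs(a.boundingBox.origin.y - b.boundingBox.origin.y) < lineTolerance {
            // Same line: order left to right.
            return a.boundingBox.origin.x < b.boundingBox.origin.x
        }
        // Different lines: order top to bottom.
        return a.boundingBox.origin.y < b.boundingBox.origin.y
    }
}
```

Note that a tolerance-based comparator is only a sketch; a production sort would need to guarantee a strict ordering across all candidates.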

2. Detecting Paragraphs

Once we have the sorted candidates we call another function to concatenate all the strings from the candidates into a single text, with paragraph breaks inserted as newlines ("\n\n") where necessary. Deciding where to put paragraph breaks is a little complicated, as we can’t accommodate every possible paragraph style, so we made a few assumptions that let paragraphing work in most cases.

We assumed that a paragraph should begin when there is a large enough vertical gap between the current and next candidates’ bounding boxes. We also assumed a paragraph should begin when the next candidate is on a new line and its bounding box X position is greater than the smallest X position of the current paragraph — in short, where there is an indentation.

The pseudocode is as follows: 
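Expressed in Swift rather than pseudocode, and hedged with illustrative thresholds (the gap, indentation and line tolerances below are our own choices, and Candidate is a simplified stand-in for VNRecognizedTextObservation):

```swift
import Foundation

// Simplified stand-in for a recognized-text observation.
struct Candidate {
    let text: String
    let boundingBox: CGRect
}

// Join sorted candidates into a single string, inserting "\n\n"
// where a new paragraph appears to begin.
func paragraphedText(from sorted: [Candidate]) -> String {
    guard let first = sorted.first else { return "" }
    var result = first.text
    // Smallest X seen in the current paragraph, used to spot indentation.
    var paragraphMinX = first.boundingBox.origin.x
    var previous = first

    let gapThreshold: CGFloat = 0.03     // vertical gap that implies a paragraph break
    let indentThreshold: CGFloat = 0.02  // horizontal offset that implies indentation
    let lineTolerance: CGFloat = 0.01    // Y difference treated as "same line"

    for candidate in sorted.dropFirst() {
        let verticalGap = candidate.boundingBox.origin.y
            - (previous.boundingBox.origin.y + previous.boundingBox.height)
        let isNewLine = abs(candidate.boundingBox.origin.y - previous.boundingBox.origin.y) > lineTolerance
        let isIndented = candidate.boundingBox.origin.x > paragraphMinX + indentThreshold

        if verticalGap > gapThreshold || (isNewLine && isIndented) {
            // Start a new paragraph.
            result += "\n\n" + candidate.text
            paragraphMinX = candidate.boundingBox.origin.x
        } else {
            // Continue the current paragraph.
            result += " " + candidate.text
            paragraphMinX = min(paragraphMinX, candidate.boundingBox.origin.x)
        }
        previous = candidate
    }
    return result
}
```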


As a result of the steps above we achieved reasonable support for paragraphing and can now display the extracted text with some level of confidence. This may not be enough, however, to support the variety of formatting found in documents. Formatting techniques that can cause issues include:

  • Multiple columns — the columns would be interpreted as one paragraph, especially if they are aligned parallel to each other.
  • Bulleted lists where the spacing between bullets is not significant.
  • Paragraphs that begin with a word which has an oversized initial character.
  • Embedded images, although any text within images will be extracted and added to the text as its own paragraph if there is enough separation from the previous paragraph.


We have significantly improved paragraph detection in our Document Scanner and the results are now available within the Bipsync Notes app. We’ve also identified a new set of challenges to address, and we hope to do so in due course by refining our technique and potentially making use of improved Apple APIs.