Simon Bisson
Contributor

Use Azure Cognitive Services to automate forms processing

analysis
Feb 25, 2020 | 7 mins
Artificial Intelligence, Cloud Computing, Deep Learning

Form Recognizer brings unsupervised machine learning to paper document processing, and it’s a snap to build into your applications


Microsoft’s Cognitive Services, powered by machine learning, are an easy way to add artificial intelligence to your apps, offering pay-as-you-go access to a selection of useful algorithms. Unlike many other web services, they’re continuously evolving, improving as they ingest more and more labeled data.

That’s an important difference between machine learning and other, more familiar algorithms. As Microsoft improves its training data and models, the scope of the services continues to grow, along with their responsiveness and accuracy. Some can even take advantage of a process called transfer learning, where training a model on one set of data improves its performance on another.

Continuous improvement isn’t the only benefit of the research work Microsoft puts into its Cognitive Services. The services operationalize that research, delivering new tools as work moves from the lab into commercial products. What matters here is the transition from preview to general availability, as Azure and Microsoft Research work together to turn what was pure research into tools you can include in your applications.

Microsoft has been able to containerize some of its Cognitive Services for use on Azure’s Edge servers and on any other platform that supports Docker. Instead of pushing data to the cloud over low-bandwidth links, you can process data locally, as part of an IoT Hub instance, sending only the information that matters to other applications or to administrators.

Introducing Form Recognizer

One of the more interesting new services currently in preview is Form Recognizer. As organizations work their way through digital transformations, it’s important to bring paper documents and forms into new business processes. Traditional scanning and optical character recognition go some way toward digitizing documents, but they miss the semistructured nature of forms, capturing all the text on the page without understanding how it’s organized.

Form Recognizer takes a more nuanced approach to working with form data, using machine learning to parse the structure of a form and then extract the information it holds. By building a model of a form’s structure, you can produce semantically tagged output, with key/value pairs and tables that can then be used to populate structured stores, whether SQL or NoSQL document databases. All you need is a handful of forms to use as training data, allowing you to build a labeled data set that tunes the Form Recognizer model to work with your files.

As Form Recognizer is an API, you can incorporate it into new and existing business processes, replacing manual data capture while flagging exceptions that may need human intervention. You can even use Form Recognizer in conjunction with Power Platform tools such as Power BI to deliver business insights from what would have been paper-only data.

Training a Form Recognizer model 

One of the interesting aspects of Form Recognizer is that the underlying model uses unsupervised learning. There’s no need to label the training data. The system recognizes the form elements and generates the appropriate data structures for your form data. Although that’s an easier way to train a system, you do have the option of using labeled data to get more accurate and faster results.

A key element of the training process is the layout API. This gives the model a structure for the layout of a form, with labels for the various fields. Using the data from this and from labeled training forms, you can quickly define the output data structures and ensure that your code is ready to work with the service.

Building labeled samples for training requires a local application, available as a Docker container with a web UI. You can download it from Microsoft and run the container on Windows, macOS, or Linux, as long as you have Docker installed. There’s even the option of running the container on Azure Kubernetes Service (AKS) or any other Kubernetes infrastructure. Form images are stored in Azure blob storage, and the local tool will OCR the forms, making them ready for you to label the various elements you want to extract with Form Recognizer. You only need five or six sample forms to train the model.

Once trained, you have a custom Form Recognizer model with its own model ID and an accuracy score. If you want to improve the model, add more sample data. The resulting model can be tested using the training tool on documents that haven’t been part of your training set. You’ll be presented with a view of the source document with bounding boxes for recognized data and a confidence level for each element. It’s important to note that Form Recognizer can’t work with all form elements; at the moment there’s no support for check boxes or for complex tables.
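
To make the workflow concrete, here’s a minimal sketch of starting a training run against the v2.0 REST endpoint and polling for the finished model. The endpoint, key, and blob SAS URL are placeholders, and later API versions may use different paths and payloads, so treat this as an illustration rather than a definitive implementation.

```python
import time
import requests

# Placeholders -- substitute the values for your own Form Recognizer resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
API_KEY = "<your-subscription-key>"
TRAINING_DATA_SAS_URL = "<SAS URL of the blob container holding your sample forms>"

headers = {"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "application/json"}

# Kick off training; useLabelFile tells the service to use the label files
# written by the sample labeling tool alongside the form images.
resp = requests.post(
    f"{ENDPOINT}/formrecognizer/v2.0/custom/models",
    headers=headers,
    json={"source": TRAINING_DATA_SAS_URL, "useLabelFile": True},
)
resp.raise_for_status()
model_url = resp.headers["Location"]  # URL of the new custom model

# Poll until the model leaves the "creating" state, then report its ID and status.
while True:
    model = requests.get(model_url, headers=headers).json()
    status = model["modelInfo"]["status"]
    if status != "creating":
        break
    time.sleep(5)

print(model["modelInfo"]["modelId"], status)
```

Once the status reports the model as ready, the same response carries the model ID and accuracy information you’ll also see in the training tool.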

Using Form Recognizer in your apps

Building an application around Form Recognizer is relatively easy. If you’re not using a language with a supported SDK, there’s a REST API that can take your form images and extract the data. The service currently supports the most common document formats: JPEG, PNG, PDF, and TIFF.

The API is straightforward: It uses a POST to upload and analyze the form contents, with a GET to bring back the result. Images are sent as part of a JSON object with the POST, or as a standard file stream. Once the job has been accepted, a standard HTTP 202 response returns the ID of the result that will hold the analysis output. You can then make a call to the service with that result ID. If the form has been processed, the results are delivered as a JSON object that can be parsed, yielding the form’s key/value pairs and any tables.
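
As a rough sketch of that POST-then-GET pattern, the following assumes the v2.0 REST paths and uses placeholder values for the endpoint, key, and custom model ID; a supported SDK wraps the same calls for you.

```python
import time
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
API_KEY = "<your-subscription-key>"                               # placeholder
MODEL_ID = "<your-custom-model-id>"                               # placeholder

# POST the form as a raw file stream; the service answers with 202 Accepted
# and an Operation-Location header pointing at the eventual result.
with open("form.pdf", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/formrecognizer/v2.0/custom/models/{MODEL_ID}/analyze",
        headers={"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "application/pdf"},
        data=f,
    )
resp.raise_for_status()
result_url = resp.headers["Operation-Location"]

# GET the result until the analysis succeeds or fails.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": API_KEY}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(2)

analysis = result.get("analyzeResult") or {}
```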

As with all the Cognitive Services, results come with a confidence level. You can use this to route low-confidence forms for manual checks while delivering the rest directly into your line-of-business applications, either storing the data for future use or using it to drive a business process.
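
Continuing the sketch above, a hypothetical routing step might look at each extracted field’s confidence and queue anything below a threshold for review. The documentResults/fields shape shown is the one returned for models trained with labels, and the 0.8 cutoff is an arbitrary choice for illustration.

```python
REVIEW_THRESHOLD = 0.8  # arbitrary cutoff; tune it for your own forms

def route_fields(analysis):
    """Split extracted fields into accepted values and ones needing a manual check."""
    accepted, needs_review = {}, []
    for doc in analysis.get("documentResults", []):
        for name, field in (doc.get("fields") or {}).items():
            if not field:
                continue
            if field.get("confidence", 0.0) >= REVIEW_THRESHOLD:
                accepted[name] = field.get("text")
            else:
                needs_review.append((name, field.get("text"), field.get("confidence")))
    return accepted, needs_review

# 'analysis' is the analyzeResult dictionary from the previous sketch.
accepted, needs_review = route_fields(analysis)
```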

One useful feature of Form Recognizer (and one that clearly builds on Microsoft’s own requirements for its expense system) is a prebuilt model that works with common U.S. receipt formats. You can use it to capture and feed receipt data into your own expenses workflow, using a phone camera to capture receipt data on the go. Workers will be able to generate expense reports from their phones without having to spend time entering data into web forms; the Form Recognizer tools will capture the necessary data and, together with user information and device locations, update records automatically.
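
Calling the prebuilt receipt model follows the same submit-and-poll pattern, just against a prebuilt endpoint instead of a custom model ID. Again, this sketch assumes the v2.0 paths and placeholder credentials.

```python
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
API_KEY = "<your-subscription-key>"                               # placeholder

# Submit a phone photo of a receipt to the prebuilt receipt model, then poll
# the Operation-Location URL exactly as in the custom-model example above.
with open("receipt.jpg", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/formrecognizer/v2.0/prebuilt/receipt/analyze",
        headers={"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "image/jpeg"},
        data=f,
    )
resp.raise_for_status()
result_url = resp.headers["Operation-Location"]
```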

Getting started with Form Recognizer is relatively simple, and with a generous limit of 500 free pages per month, you should be able to see quickly if it works for you. Once up and running, it should provide a useful bridge between pen and paper and the digital world, using photographs or scans to quickly bring form content into your business processes. With the quality of modern phone cameras and their support for computational photography, it’s possible to make form recognition a simple plug-in that takes a photo and uploads it to your recognizer, saving a local copy for your records.

Form Recognizer is a tool that quickly shows the benefit of machine learning, with a model that’s designed to work flexibly in a relatively closed domain. Applying Azure’s Cognitive Services to specific business problems makes a lot of sense. Handling a paper-to-digital transition is one of those problems that has long been a blocker to improving business processes. Using machine learning to reduce the cost and time needed to deliver digitization is a win for most businesses, especially if it means that we can use the cameras in our pocket rather than expensive scanners and unreliable OCR software.


Author of InfoWorld's Enterprise Microsoft blog, Simon Bisson prefers to think of "career" as a verb rather than a noun, having worked in academic and telecoms research, as well as having been the CTO of a startup, running the technical side of UK Online (the first national ISP with content as well as connections), before moving into consultancy and technology strategy. He’s built plenty of large-scale web applications, designed architectures for multi-terabyte online image stores, implemented B2B information hubs, and come up with next generation mobile network architectures and knowledge management solutions. In between doing all that, he’s been a freelance journalist since the early days of the web and writes about everything from enterprise architecture down to gadgets.
