Easy Optical Character Recognition

3/27/2017

Let’s say we have scanned a document into our computer and want to read certain pieces of data off that form. If this were a one-off project, I’d simply open the form and read it myself. But suppose we have hundreds of copies of the same form, all with the same layout but different data. This is where OCR comes in handy. I recently had such a project, where I needed to grab data off a form that was supplied as an image. After a lot of research, I settled on the cheap and easy solution that I’m about to show you. Keep in mind that there are other, more elegant solutions available; I chose this one specifically because of how cheap it is and how little extra work it takes to get running on a client’s system.

Requirements
This solution requires a version of Microsoft Office that includes the Microsoft Office Document Imaging library, or MODI for short. MODI shipped with Office 2003 and Office 2007; it was removed in Office 2010, so on newer versions it has to be installed separately. The actual name of the DLL file is MDIVWCTL.DLL. Make sure that this library is installed and available to your code project, and that any targeted client machines have this component of Microsoft Office installed as well.

Assumptions
For this tutorial, we’ll assume that you already have a Bitmap object (bmp in the code below) that contains a picture to work with. We’ll also assume that you already know the coordinates of the area on the image that you want to read text from, and that these coordinates are stored in the variables x1, y1, x2, y2. Finally, we assume that (x1, y1) is the top left corner of the area and that (x2, y2) is the bottom right corner.

Steps
Reading data from an area of an image takes three steps. The first is to extract the area into its own image. The second is to save the new image as a temp file. The third and final step is to read the data from that temp file. I know this doesn’t sound very elegant, and it’s not. As I said earlier, there are other solutions that can read the data from the area without saving it to disk, but they are generally much more expensive than MODI. MODI is designed so that it cannot work with images in memory, which is why we save the image of the area to disk.

Extraction
The first step is to extract the area that we want to read.

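// Requires: using System.Drawing;
Bitmap Extract(Bitmap bmp, int x1, int y1, int x2, int y2)
{
  try
  {
    var width = x2 - x1;
    var height = y2 - y1;
    // Reject a missing image or an empty/inverted area.
    if(bmp == null || width < 1 || height < 1)
    {
      return null;
    }
    // Copy just the requested area into a new bitmap.
    var subImage = bmp.Clone(new Rectangle(x1, y1, width, height), bmp.PixelFormat);
    return subImage;
  }
  catch
  {
    // Clone throws if the area falls outside the image; treat that as "nothing to read".
    return null;
  }
}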

That’s a simplified version of the method. For production, you should validate that none of the coordinates are outside the image and make the necessary adjustments if they are.
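
One way to do that validation is sketched below. The names RegionClamp and ClampToImage are hypothetical and not part of the original code. The helper intersects the requested area with the image bounds and returns an empty rectangle when nothing usable is left, so the caller can skip the OCR step.

```csharp
using System.Drawing;

static class RegionClamp
{
  // Hypothetical helper: trims the requested area so it lies inside the image.
  // Returns Rectangle.Empty when the area is inverted, has no size,
  // or does not overlap the image at all.
  public static Rectangle ClampToImage(int imageWidth, int imageHeight,
                                       int x1, int y1, int x2, int y2)
  {
    if (x2 <= x1 || y2 <= y1)
    {
      return Rectangle.Empty;
    }

    var requested = Rectangle.FromLTRB(x1, y1, x2, y2);
    var bounds = new Rectangle(0, 0, imageWidth, imageHeight);
    var clipped = Rectangle.Intersect(requested, bounds);

    // Intersect can return a zero-width or zero-height rectangle when the
    // two areas only touch at an edge; treat that as empty too.
    if (clipped.Width < 1 || clipped.Height < 1)
    {
      return Rectangle.Empty;
    }
    return clipped;
  }
}
```

With something like this in place, Extract could call ClampToImage(bmp.Width, bmp.Height, x1, y1, x2, y2) and pass the result to bmp.Clone only when its Width and Height are both at least 1.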

Temp File Creation
The second step is to save the newly created image into a temp file.

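// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
private string CreateTempFile(Bitmap img)
{
  // A GUID gives each temp file a unique name.
  var fId = Guid.NewGuid().ToString("N");
  var path = System.IO.Path.Combine(System.IO.Path.GetTempPath(), fId + ".tiff");
  img.Save(path, ImageFormat.Tiff);
  return path;
}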

We are saving the image into a temp TIFF file. MODI only works with the TIFF and MDI formats, but our source images usually come in others, such as PNG, JPEG, or BMP. Because the Bitmap is re-saved as TIFF here, we can open the image from whatever format we want and still hand MODI a file it accepts.

Reading the Image
The last step is to read the data from the image.

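using System.Runtime.ExceptionServices;

// Lets the catch block below see corrupted state exceptions.
[HandleProcessCorruptedStateExceptions]
private string OcrTempFile(string path)
{
  try
  {
    var md = new MODI.Document();
    md.Create(path);
    // Run English OCR over the document.
    md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
    // The temp file holds a single page, so take the first image.
    var img = (MODI.Image)md.Images[0];
    var layout = img.Layout;
    var str = layout.Text;
    // false: close without saving changes back to the file.
    md.Close(false);
    return str.Trim();
  }
  catch
  {
    // Any failure, including a CSE, yields an empty result.
    return string.Empty;
  }
}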

This opens the temp file as a MODI document, runs OCR on it, loads the first page as a MODI image, and then retrieves that image’s text. MODI tends to throw a corrupted state exception (CSE), which normal exception handlers don’t catch, so we added the HandleProcessCorruptedStateExceptions attribute to the method to force the exception to be caught and handled gracefully.

Putting It All Together
Now that we’ve built the three steps, we need to put them all together.

string result = "";
var newBmp = Extract(bmp, x1, y1, x2, y2);
if(newBmp == null)
{
  result = string.Empty;
}
else
{
  var tempFile = CreateTempFile(newBmp);
  result = OcrTempFile(tempFile);
}
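
The snippet above leaves the temp TIFF on disk once the text has been read. A minimal sketch of a best-effort cleanup follows, assuming you want to remove the file after OcrTempFile returns. TempFiles and TryDelete are hypothetical names, not part of the original code. The helper reports failure instead of throwing, since the delete can fail if something still has the file open.

```csharp
using System;
using System.IO;

static class TempFiles
{
  // Hypothetical helper: best-effort removal of a temp file.
  // Returns true when the file does not exist afterwards,
  // false for a null/empty path or when the delete fails.
  public static bool TryDelete(string path)
  {
    if (string.IsNullOrEmpty(path))
    {
      return false;
    }
    try
    {
      // File.Delete does nothing if the file is already gone.
      File.Delete(path);
      return !File.Exists(path);
    }
    catch (IOException)
    {
      // For example, the file is still open in another component.
      return false;
    }
    catch (UnauthorizedAccessException)
    {
      return false;
    }
  }
}
```

In the code above, this would be called as TempFiles.TryDelete(tempFile) right after the OcrTempFile call.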

	CupCode Gamers
	From the Cup, to the Code, for the Gamers