Data Labeling of Images for Supervised Learning

Successful machine learning applications depend on high-quality labels. In the field of Automated Visual Inspection (AVI), labels are human-provided signals that teach models:

  • How to identify a specific class of defects of interest
  • How to highlight the defect area

What is data labeling?

Data labeling is the process of manually annotating content with labels or tags. We refer to the people who add these labels as annotators. In computer vision, a label identifies elements within an image. The annotated data is then used in supervised learning, where the labeled dataset teaches the model through examples. Data labeling is critical to the success of machine learning models: label errors can result in lower model accuracy.

Through labeling, we aim to distill the knowledge of subject matter experts (SMEs), who often have decades of experience, into machine learning models. These models can be replicated across tens or hundreds of production lines to support large-scale visual inspection. The better the signals from human SMEs, the more accurate the resulting model will be.

Depending on the application, there are different types of labels. In an object detection task, we are interested not only in the class of the target objects but also in their locations, so we draw bounding boxes around the target objects in the images. There are also image classification, semantic segmentation, and instance segmentation tasks, which we label with classes, segmentation maps, and instance segmentation maps respectively, as shown in the image below.
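
As a rough sketch of how these four label types differ as data, the snippet below represents each one with plain Python structures. The class names ("scratch", "dent"), the coordinates, and the `BoundingBox` type are made up for illustration; real labeling tools use their own formats.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BoundingBox:
    """One object detection label: a class plus a pixel-coordinate box."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


# Image classification: a single class for the whole image.
classification_label = "scratch"

# Object detection: a class and a location for every target object.
detection_labels: List[BoundingBox] = [
    BoundingBox("scratch", 12, 30, 48, 41),
    BoundingBox("dent", 70, 15, 90, 35),
]

# Semantic segmentation: one class id per pixel (0 = background, 1 = scratch).
semantic_map = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

# Instance segmentation: one instance id per pixel, so two scratches that
# share a class still get separate ids (0 = background).
instance_map = [
    [1, 0, 0, 2],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]
instance_classes: Dict[int, str] = {1: "scratch", 2: "scratch"}
```

Note how the semantic map cannot tell the two scratches apart, while the instance map can.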

Machine learning engineers (MLEs) team up with annotators to create labels on their datasets. To help labelers carry out the labeling tasks accurately, MLEs prepare a labeling book that gives precise descriptions of the target classes and detailed instructions on how to draw labels on images. In the automated visual inspection (AVI) domain, the labeling book is also called the defect book.

Challenges to Data Labeling in Computer Vision

In our experience, labeling in AVI poses two key challenges:

  1. The number of defective samples is relatively small compared to research datasets like ImageNet and COCO.
  2. SMEs’ judgment on defective samples is not consistent.

Few Defective Sample Images

Modern quality control methods have reduced the defect rate on inspection lines to less than 1%. For some types of rare defects, a defective pattern may only occur once in a million. As a result, only a small number of unique samples per defect class can be collected for each model iteration.
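
To make the scarcity concrete, here is a back-of-the-envelope calculation. The throughput of 10,000 units per day is a hypothetical figure chosen only for illustration, and the rates are treated as per-unit probabilities.

```python
def expected_defect_samples(units_inspected: int, defect_rate: float) -> float:
    """Expected number of defective units among those inspected."""
    return units_inspected * defect_rate


units_per_day = 10_000  # hypothetical line throughput, not from the article

# A defect class at the 1% level yields about 100 samples per day.
common_per_day = expected_defect_samples(units_per_day, 0.01)

# A one-in-a-million defect yields about 0.01 samples per day,
# which is roughly one sample every 100 days.
rare_per_day = expected_defect_samples(units_per_day, 1e-6)
days_between_rare_samples = 1 / rare_per_day

print(f"common defect: ~{common_per_day:.0f} samples per day")
print(f"rare defect: ~1 sample every {days_between_rare_samples:.0f} days")
```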

Defect classes in manufacturing are not commonly seen in daily life. Sometimes a defect is defined as “3 cm long gap” or “hair-like scratch at the top left corner reflecting light”. They are much more difficult to label than a cat, a dog or a motorcycle. Typically, SMEs take months or years to develop their heuristics to detect such failures on production lines.

Inconsistent Labeling of Image Data

Given the same defective pattern, different SMEs may have different opinions about the type of defect present in the image. In addition, the same SME can judge differently depending on the day or time.

Typically, deep learning research teams handle such inconsistencies by collecting a large number of samples from a large team of labelers. Misalignments are averaged out over the large dataset.

However, as we mentioned in the first challenge, our sample dataset is very limited, and training a large team of SMEs costs too much time and money. So we need another method to remove the inconsistencies.

Below is a sample image showing clear vs. ambiguous visual inspection defects.

Credit: image is from the Batch newsletter

Obtaining Consistency in Data Labeling

We have developed a process to address the two challenges above. It includes a few key steps, which we’ll cover in detail.

  1. Create a Defect Book
  2. Establish Defect Labeling Consensus
  3. Review Data Labeling for Quality Assurance

1. Create a Defect Book

The defect book contains a list of the most important defects, with clear definitions and some example images. It serves as a reliable and trusted source of ground truth, giving the reader a precise and unambiguous description of each defect. Refer to it for questions like “Should this area of the image be considered defective?” or “Is this model prediction correct?”

Based on our past experience, creating an accurate and complete defect book is one of the most important prerequisites for a successful AVI project. It captures the formal definitions of defects so that new annotators can quickly be trained to label defects correctly.

The process of creating a defect book consists of extracting all the heuristics from the minds of experienced SMEs and putting them in writing. Deviations between the SMEs’ judgment and the defect book lead to labeling errors. If the defect book is complete enough, you can train new annotators to quickly reach the SMEs’ level of knowledge of those defects.

Below we describe the key elements that go into creating such a book.

Document the Project’s Context and Terminologies

First, provide an overview of the project background and terminology. In our experience, many first-time practitioners skip this step and go straight to listing the defects. However, we have found that a detailed description of the project context and purpose improves communication with the annotators and helps them recognize regions of interest and distinguish critical defects from noise.

Most AVI projects have special foreground and background compositions or domain-specific terminologies. Help readers understand by introducing key terminology and explaining the layout of the image at the beginning of the defect book.

Example battery inspection: explain the composition in the image.

Example steel surface inspection: explain which area in the image the annotators need to inspect for defects.

Specify Each Class of Defects

Each section of the defect book should provide an accurate description of a specific type of defect. Include its major visual patterns and where it may appear in an image. We find sample images extremely effective for conveying a defect: provide images that represent the majority of its occurrences, covering both the common patterns and some edge cases.

It is also useful to include some counter-examples: images with similar patterns that are not valid defects. This helps labelers tell one class of defects apart from others.

If a defect class has several distinct appearances, create a subsection for each of them to avoid confusion.
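
One way to keep these elements organized is to treat the defect book as structured data. The sketch below is only an illustration: the `DefectClass` and `DefectBook` types, their field names, the file names, and the sample entries are all hypothetical, not a format prescribed by the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DefectClass:
    """One section of the defect book (hypothetical schema)."""
    name: str
    description: str
    typical_locations: str
    example_images: List[str] = field(default_factory=list)
    edge_case_images: List[str] = field(default_factory=list)
    counter_example_images: List[str] = field(default_factory=list)


@dataclass
class DefectBook:
    """Project context, terminology, then one entry per defect class."""
    project_context: str
    terminology: Dict[str, str]
    defect_classes: List[DefectClass] = field(default_factory=list)

    def missing_examples(self) -> List[str]:
        """Names of defect classes that have no example image yet."""
        return [d.name for d in self.defect_classes if not d.example_images]


book = DefectBook(
    project_context="Hypothetical steel surface inspection: inspect the strip only.",
    terminology={"strip": "the steel surface under inspection"},
    defect_classes=[
        DefectClass(
            name="gap",
            description="Gap about 3 cm long",
            typical_locations="along the seam (made-up example)",
            example_images=["gap_01.png"],
            counter_example_images=["shadow_01.png"],
        ),
        DefectClass(
            name="hairline scratch",
            description="Hair-like scratch reflecting light",
            typical_locations="top left corner",
        ),
    ],
)

# Flags sections that still need sample images before labeling starts.
print(book.missing_examples())
```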

Provide Clear Instruction on How to Label Defects

We have seen a few customers take this for granted: they start labeling without defining a clear set of labeling instructions. As a result, the labeling quality is very poor, with large inconsistencies among annotators. This problem can be avoided by writing clear labeling instructions into the defect book at the beginning.

If you are drawing bounding box or segmentation labels, here are recommended best practices:

  • Draw labels tightly around the target objects

Models are penalized or rewarded based on how well their predictions match the labels at the pixel level. If you leave unnecessary margins between the labels and the objects, you will misguide the model.

Example: draw bounding boxes tightly around the objects.
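
The effect of a loose label can be illustrated with intersection over union (IoU), a common overlap measure for boxes. The article does not say which metric is used, so IoU is an assumption here, and the coordinates are made up.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 if the box is empty."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    intersection = box_area((
        max(a[0], b[0]), max(a[1], b[1]),
        min(a[2], b[2]), min(a[3], b[3]),
    ))
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction that fits the defect exactly.
prediction = (20, 20, 60, 40)

tight_label = (20, 20, 60, 40)  # drawn tightly around the defect
loose_label = (10, 10, 70, 50)  # same defect with a 10-pixel margin

print(iou(prediction, tight_label))  # 1.0
print(iou(prediction, loose_label))  # 800 / 2400, about 0.33
```

Against the loose label, the model is scored as if it were wrong even though its prediction fits the defect exactly.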

  • Label each target object individually

By default, give each target object its own label. You may, however, encounter a cluster of small defective objects that sit close to each other. Labeling each of them with an individual bounding box costs time and makes it difficult for your model to fit each ground truth label precisely. In that case, draw one big bounding box that covers the whole cluster instead. Create heuristics for when to draw a single bounding box and when to draw separate ones, and keep them consistent among annotators.

Example: draw bounding boxes for each of the defects separately.
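
A single-versus-separate heuristic is easier to apply consistently when it is written down as an explicit rule. The sketch below shows one possible rule, which is an assumption rather than the article's method: boxes whose gap is at most `max_gap` pixels are merged into one covering box. The threshold and coordinates are hypothetical.

```python
def gap(a, b):
    """Larger of the horizontal and vertical gaps between two boxes.

    Boxes are (x_min, y_min, x_max, y_max); touching or overlapping gives 0.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def merge_close_boxes(boxes, max_gap):
    """Repeatedly merge any two boxes whose gap is at most max_gap."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if gap(merged[i], merged[j]) <= max_gap:
                    a, b = merged[i], merged[j]
                    cover = (min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3]))
                    # Replace the pair with one box covering both.
                    merged = [m for k, m in enumerate(merged) if k not in (i, j)]
                    merged.append(cover)
                    changed = True
                    break
            if changed:
                break
    return merged


defects = [
    (10, 10, 14, 14),   # three small defects close together
    (16, 11, 20, 15),
    (15, 17, 19, 21),
    (80, 80, 90, 90),   # one isolated defect
]

print(sorted(merge_close_boxes(defects, max_gap=5)))
# [(10, 10, 20, 21), (80, 80, 90, 90)]
```

Whatever rule you choose, stating it as a number in the defect book gives every annotator the same decision boundary.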

  • Update the defect book frequently

Keep your defect book updated, so that all of your labelers will have the latest knowledge about the defects. When you have a new defect type or edge case sample, it’s time to update the defect book.

2. Establish Defect Labeling Consensus

After creating a defect book, quickly test its accuracy and coverage before labeling all the data. If the defect book contains incorrect definitions or does not cover edge cases sufficiently, you want to catch these issues early. Use a defect consensus task to evaluate whether people are aligned on their defect definitions and labeling.

Typically you will have three people participate in a defect consensus task. We ask both the SME and new annotators to label the same set of defect samples by referring to the defect book, which helps us surface any misalignments. The recommended composition is one SME, one labeler, and one Machine Learning Engineer (MLE) or an additional labeler. The SME labels based on their own knowledge as well as the defect book, whereas the other participants rely entirely on the defect book’s instructions, since they don’t have much domain knowledge.

We highly recommend having the MLE participate in this process. By taking part in the defect consensus task, the MLE gets more context on the labeling rules and a better understanding of the defect definitions. Later, when analyzing model errors, the MLE can quickly tell whether an error is due to ambiguities in the defect book. This is the most common type of error we’ve seen.

We recommend randomly picking 10 samples per defect class from the entire dataset. This allows you to examine all the defect classes and cover the major pattern types within each class. Then ask each participant to label these samples independently.
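
The sampling step can be sketched as follows. The sketch assumes each image already carries a provisional defect class (for example, from an SME's first pass); the function name, the seed, and the image ids are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(items, per_class=10, seed=0):
    """Randomly pick up to `per_class` image ids for every defect class.

    items: list of (image_id, defect_class) pairs.
    Returns a dict mapping each defect class to its sampled image ids.
    """
    by_class = defaultdict(list)
    for image_id, defect_class in items:
        by_class[defect_class].append(image_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        defect_class: rng.sample(ids, min(per_class, len(ids)))
        for defect_class, ids in sorted(by_class.items())
    }


# Hypothetical dataset: 25 scratch images and only 4 dent images.
dataset = [(f"scratch_{i:03d}.png", "scratch") for i in range(25)]
dataset += [(f"dent_{i:03d}.png", "dent") for i in range(4)]

consensus_set = sample_per_class(dataset, per_class=10)
print({cls: len(ids) for cls, ids in consensus_set.items()})
# {'dent': 4, 'scratch': 10}
```

Rare classes with fewer than 10 images simply contribute everything they have.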

Once the participants are finished, an agreement score is calculated for each image. We have developed an internal scoring system that covers all the labeling types and offer it as a tool to all of our users. For classification labels, the agreement is calculated from the classes given by the participants. For object detection and semantic segmentation labels, the agreement score is calculated from both the class and the region labeled by each participant.
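
The internal scoring system is not described here, so the following is only a stand-in to show the idea. It assumes pairwise agreement for classes, and pairwise IoU (counted as 0 when classes differ) for boxes, with one label per participant per image. All names and values are hypothetical.

```python
from itertools import combinations


def class_agreement(labels):
    """Fraction of participant pairs that chose the same class."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def detection_agreement(annotations):
    """Mean pairwise score over (class, box) labels from each participant.

    A pair scores its IoU when the classes match and 0 otherwise.
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        return 1.0
    scores = [
        iou(box_a, box_b) if cls_a == cls_b else 0.0
        for (cls_a, box_a), (cls_b, box_b) in pairs
    ]
    return sum(scores) / len(scores)


# Classification: SME and MLE say "scratch", the labeler says "dent".
print(class_agreement(["scratch", "scratch", "dent"]))  # 1 of 3 pairs agree

# Detection: same class, but the third box is shifted by 5 pixels.
boxes = [
    ("scratch", (0, 0, 10, 10)),
    ("scratch", (0, 0, 10, 10)),
    ("scratch", (5, 0, 15, 10)),
]
print(detection_agreement(boxes))  # (1 + 1/3 + 1/3) / 3, about 0.56
```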

An overall consensus score is calculated by aggregating the agreement scores of all images. It tells you how well your participants are aligned with each other. This reflects how accurate and complete the defect book is given the sample dataset. For images that achieve very low agreement scores, discuss with SMEs the root cause of misalignment. Once you identify the source, update the corresponding section in the defect book. Add the image as an example if needed.
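
A minimal sketch of the aggregation step, assuming the overall score is the plain mean of the per-image scores (the article only says the scores are aggregated). The 0.5 cut-off for "very low" agreement and the image ids are hypothetical.

```python
def consensus_report(image_scores, low_threshold=0.5):
    """Return (overall score, images to discuss with SMEs).

    image_scores: dict mapping image id to its agreement score in [0, 1].
    Images below `low_threshold` are listed worst first.
    """
    if not image_scores:
        raise ValueError("no agreement scores to aggregate")
    overall = sum(image_scores.values()) / len(image_scores)
    to_discuss = sorted(
        (image for image, score in image_scores.items() if score < low_threshold),
        key=lambda image: image_scores[image],
    )
    return overall, to_discuss


scores = {"img_001": 1.0, "img_002": 0.9, "img_003": 0.2, "img_004": 0.4}
overall, to_discuss = consensus_report(scores)
print(f"overall consensus: {overall:.3f}")  # 0.625
print(to_discuss)  # ['img_003', 'img_004']
```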

Illustration of how labelers drew their bounding boxes differently.

Establishing a defect consensus is not a one-time task. Every time the defect book is updated, or a new labeler is added to the project, run a defect consensus task again. This ensures your labelers reach sufficiently high alignment in their understanding of the defect book.

3. Review Data Labeling for Quality Assurance

Finally, you are ready to start labeling all of your data. To ensure the quality of your labels, there is usually a review process, so that only approved images are released to the next step: model training and evaluation.

An accurate and complete defect book helps you train labelers with the SMEs’ knowledge without decades of practice, so you can now afford to have multiple labelers working on your dataset. We recommend having two or more labelers label the same dataset independently: assign each image to multiple labelers and only accept labels with high agreement among all of them.

After they finish, an agreement score is calculated for each image, similar to what we did in the defect consensus. It is based on the class and, if available, the region labeled by each participant. You can set a minimum threshold to automatically reject images with inconsistent labels, then review the remaining images whose agreement scores are above the bar. This way you can quickly review your labeled dataset and prevent inconsistent labels from leaking into the next step.
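
The threshold step can be sketched as a simple triage over per-image agreement scores. The 0.8 threshold, the function name, and the image ids are hypothetical; choose a bar that suits your project.

```python
def triage_by_agreement(image_scores, min_agreement=0.8):
    """Split images into auto-rejected and to-be-reviewed lists.

    image_scores: dict mapping image id to its agreement score in [0, 1].
    Images below `min_agreement` are rejected; the rest go to review.
    """
    rejected = sorted(
        image for image, score in image_scores.items() if score < min_agreement
    )
    to_review = sorted(
        image for image, score in image_scores.items() if score >= min_agreement
    )
    return rejected, to_review


labeled = {"img_101": 0.95, "img_102": 0.55, "img_103": 0.8}
rejected, to_review = triage_by_agreement(labeled)
print(rejected)   # ['img_102']
print(to_review)  # ['img_101', 'img_103']
```

Rejected images can be sent back for relabeling or raised with SMEs, which keeps reviewer time focused on labels that are already consistent.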

Successful ML Projects Formalize Data Labeling

At TagOn, we have seen many projects take an unnecessarily long and painful path to completion because of ambiguous defect definitions or poor labeling quality. In comparison, a dataset with high-quality labels makes the life of machine learning engineers much easier and the whole project lifespan much shorter. Therefore, it is very important to invest time in the project’s early stage to clarify defect definitions and formalize labeling.

We have iterated on the labeling process described above across many projects. We formalized the defect definitions and wrote the SMEs’ heuristics for recognizing defects into the defect book. It is an important source of truth for training labelers as well as for evaluating model predictions at the model iteration stage.

With defect consensus, we can examine the accuracy and completeness of the defect book and identify possible misalignments between labelers and SMEs in their understanding of the defect definitions. In the final labeling step, we have multiple labelers label the same dataset and then approve only images with consistent and unambiguous labels. Once this whole process completes, the data is ready to be used for model training and evaluation.

Source: LandingAI