Using Amazon SageMaker Ground Truth to Label Your Data


AllCloud Blog: Cloud Insights and Innovation

Machine Learning (ML) is an amazing technology that can help pretty much any company leverage its data to create actionable insights. This is especially true when using a reliable platform like Amazon SageMaker on Amazon Web Services (AWS). However, any ML model will only be as good as the data it was trained on!

Data collected by organizations that did not design their systems specifically for ML purposes is usually disorganized or unlabeled. Once those organizations want to use their data, they find they need to invest large amounts of money to clean and label it. While we will discuss data cleaning, among other things, in the future, the difficult and costly challenge of labeling data has a direct answer in SageMaker: Ground Truth.

What is Labeling and How Does SageMaker Ground Truth Help?

Ground Truth (GT) is a platform that takes unlabeled data as input and returns it labeled. The ‘label’ that’s added represents what we would like the ML algorithm to learn later on. For example, take a company that wants to predict customer churn. The data collected daily on its customers is unlabeled. However, we can take a yearly batch of customer data and mark which customers stayed subscribed and which cancelled their subscription. This is our label, and essentially the target for the ML algorithm to learn. GT helps with labeling through the methods presented below, but the premise is this: high-quality data is required to produce high-quality models. High-quality data is usually created by labeling experts in each domain, and it requires a long-term commitment because the process is lengthy. GT is an attempt to reduce the time and cost of creating high-quality labeled data.
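To make the churn example concrete, here is a minimal Python sketch of what "adding a label" means. The record fields (customer_id, subscription_end) and the label_churn helper are hypothetical names made up for illustration; real customer data will look different.

```python
from datetime import date

# Hypothetical customer records; the field names are assumptions for illustration.
customers = [
    {"customer_id": "c1", "subscription_end": None},
    {"customer_id": "c2", "subscription_end": date(2023, 6, 30)},
    {"customer_id": "c3", "subscription_end": date(2024, 2, 1)},
]

def label_churn(records, year_end):
    """Attach a 'churned' label: True if the subscription ended on or before year_end."""
    labeled = []
    for record in records:
        end = record["subscription_end"]
        churned = end is not None and end <= year_end
        labeled.append({**record, "churned": churned})
    return labeled

labeled_customers = label_churn(customers, year_end=date(2023, 12, 31))
```

The new "churned" field is the label: it is the target an ML algorithm would later learn to predict from the other fields.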

Using Human Labeling Workforces

In GT, you start by uploading your dataset. If your data is already in an S3 bucket on AWS, most of the heavy lifting is already done. All that’s left to do is let GT generate a manifest file automatically (essentially a list of your data in the S3 bucket) and create a Labeling Job. These jobs can be assigned to a human workforce of labelers of your choice: Amazon Mechanical Turk (a public crowdsourced workforce), an external labeling service from a list of providers curated by AWS, your own in-house workforce, or other AWS certified data experts.
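For a sense of what the manifest looks like, here is a small Python sketch that builds one by hand. An input manifest is a JSON Lines file in which each line points to one object through a "source-ref" key. The bucket name, object keys and the build_manifest_lines helper are hypothetical; in practice GT can generate this file for you.

```python
import json

def build_manifest_lines(bucket, keys):
    """Return one JSON line per object, using the 'source-ref' key to point at S3."""
    return [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]

# Hypothetical bucket and object keys, for illustration only.
lines = build_manifest_lines("example-bucket", ["images/cat1.jpg", "images/dog1.jpg"])

# The manifest file content: one JSON object per line.
manifest_text = "\n".join(lines) + "\n"
```

Writing manifest_text to a file and uploading it to S3 is left out here, since that step depends on your own bucket layout and tooling.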

Automated Data Labeling

Human labelers are essential to this process, but GT is more than just a platform for human data labeling. While the workforce you choose labels the data, GT learns from those labels and trains a model that automatically labels new data that resembles what has already been labeled. The model is retrained as the workforce labels more data. According to AWS, this process can cut labeling costs by up to 70%.
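The idea behind automated labeling can be sketched in a few lines of Python. This is a simplified illustration of confidence-based routing, not GT's actual implementation: the route_by_confidence helper, the 0.9 threshold and the sample predictions are all assumptions.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split model predictions into auto-labeled items and items for human review.

    predictions: iterable of (item_id, predicted_label, confidence) tuples.
    """
    auto_labeled, needs_human = {}, []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            # The model is confident enough: accept its label.
            auto_labeled[item_id] = label
        else:
            # Uncertain prediction: send the item to the human workforce.
            needs_human.append(item_id)
    return auto_labeled, needs_human

# Hypothetical model outputs for three images.
predictions = [("img1", "cat", 0.97), ("img2", "dog", 0.55), ("img3", "cat", 0.92)]
auto_labeled, needs_human = route_by_confidence(predictions, threshold=0.9)
```

Only the uncertain items cost human labeling time, which is where the savings come from.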

Image captured from the Ground Truth Overview page, AWS

Not All Data Types are Born Equal

GT is a great tool for labeling data, but not all data is the right fit. At the time of writing, GT supports several labeling tasks for image and text data. Image tasks include Classification (single/multi-label), Object Bounding Boxes, Semantic Segmentation and Label Verification. Similarly, text tasks include Classification (single/multi-label) and Entity Recognition. Additionally, GT supports completely custom tasks: you can create an entire flow with Lambda Functions and a custom HTML interface for your labeling task. This significant list of tasks continues to grow as SageMaker makes improvements on a regular basis.
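As a rough idea of what the Lambda side of a custom task can look like, here is a minimal Python sketch of a pre-annotation function, which prepares one dataset item for the labeling interface. It assumes the event carries the manifest line under "dataObject" and that the response wraps the task data in "taskInput"; the "taskObject" key name is an illustrative choice, so check the current AWS documentation before relying on these shapes.

```python
def lambda_handler(event, context):
    """Pre-annotation step: shape one dataset item into the input for the labeling UI."""
    data_object = event["dataObject"]
    # A manifest line carries either an S3 reference or inline text.
    source = data_object.get("source-ref") or data_object.get("source")
    # "taskObject" is an illustrative key that the custom HTML template would read.
    return {"taskInput": {"taskObject": source}}

# Hypothetical event for local testing; real events include additional fields.
sample_event = {"dataObject": {"source-ref": "s3://example-bucket/images/cat1.jpg"}}
response = lambda_handler(sample_event, None)
```

A second Lambda function would typically consolidate the workers' answers after labeling; it is omitted here to keep the sketch short.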

AllCloud Chooses SageMaker Ground Truth

Here at AllCloud, we have the AWS expertise, Data Engineering experience and Machine Learning know-how to produce real value and actionable insights from customer data. We have used Ground Truth successfully as a labeling platform for customers that wanted to reap the benefits of ML, but didn’t have enough (or any) data labeled.

The Bottom Line

Do you want to use ML, but don’t have labeled data? The best platform on AWS for that is SageMaker Ground Truth, as it allows you to delegate the labeling to human workforces while saving on costs through Automated Data Labeling. Get in touch with an AllCloud data professional to start your Machine Learning journey on AWS today!

 

Ido Nissim

Data Engineer
