Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services
Content moderation is the process of screening and monitoring user-generated content online. To provide a safe environment for both users and brands, platforms must moderate content to ensure that it falls within preestablished guidelines of acceptable behavior that are specific to the platform and its audience.
When a platform moderates content, acceptable user-generated content (UGC) can be created and shared with other users. Inappropriate, toxic, or banned behaviors can be prevented, blocked in real time, or removed after the fact, depending on the content moderation tools and procedures the platform has in place.
You can use Amazon Rekognition Content Moderation to detect content that is inappropriate, unwanted, or offensive, to create a safer user experience, provide brand safety assurances to advertisers, and comply with local and global regulations.
In this post, we discuss the key elements needed to evaluate the performance aspect of a content moderation service in terms of various accuracy metrics, and a provide an example using Amazon Rekognition Content Moderation API’s.
What to evaluate
When evaluating a content moderation service, we recommend the following steps.
Before you can evaluate the performance of the API on your use cases, you need to prepare a representative test dataset. The following are some high-level guidelines:
Collection – Take a large enough random sample (images or videos) of the data you eventually want to run through Amazon Rekognition. For example, if you plan to moderate user-uploaded images, you can take a week’s worth of user images for the test. We recommend choosing a set that has enough images without getting too large to process (such as 1,000–10,000 images), although larger sets are better.
Definition – Use your application’s content guidelines to decide which types of unsafe content you’re interested in detecting from the Amazon Rekognition moderation concepts taxonomy. For example, you may be interested in detecting all types of explicit nudity and graphic violence or gore.
Annotation – Now you need a human-generated ground truth for your test set using the chosen labels, so that you can compare machine predictions against them. This means that each image is annotated for the presence or absence of your chosen concepts. To annotate your image data, you can use Amazon SageMaker Ground Truth (GT)to manage image annotation. You can refer to GT for image labeling, consolidating annotations and processing annotation output.
Get predictions on your test dataset with Amazon Rekognition
Next, you want to get predictions on your test dataset.
The first step is to decide on a minimum confidence score (a threshold value, such as 50%) at which you want to measure results. Our default threshold is set to 50, which offers a good balance between retrieving large amounts of unsafe content without incurring too many false predictions on safe content. However, your platform may have different business needs, so you should customize this confidence threshold as needed. You can use the MinConfidence parameter in your API requests to balance detection of content (recall) vs the accuracy of detection (precision). If you reduce MinConfidence, you are likely to detect most of the inappropriate content, but are also likely to pick up content that is not actually inappropriate. If you increase MinConfidence you are likely to ensure that all your detected content is truly inappropriate but some content may not be tagged. We suggest experimenting with a few MinConfidence values on your dataset and quantitatively select the best value for your data domain.
Next, run each sample (image or video) of your test set through the Amazon Rekognition moderation API (DetectModerationLabels).
Measure model accuracy on images
You can assess the accuracy of a model by comparing human-generated ground truth annotations with the model predictions. You repeat this comparison for every image independently and then aggregate over the whole test set:
Per-image results – A model prediction is defined as the pair label_name, confidence_score (where the confidence score >= the threshold you selected earlier). For each image, a prediction is considered correct when it matches the ground truth (GT). A prediction is one of the following options:
True Positive (TP): both prediction and GT are “unsafe”
True Negative (TN): both prediction and GT are “safe”
False Positive (FP): the prediction says “unsafe”, but the GT is “safe”
False Negative (FN): the prediction is “safe”, but the GT is “unsafe”
Aggregated results over all images – Next, you can aggregate these predictions into dataset-level results:
False positive rate (FPR) – This is the percentage of images in the test set that are wrongly flagged by the model as containing unsafe content: (FP): FP / (TN+FP).
False negative rate (FNR) – This is the percentage of unsafe images in the test set that are missed by the model: (FN): FN / (FN+TP).
True positive rate (TPR) – Also called recall, this computes the percentage of unsafe content (ground truth) that is correctly discovered or predicted by the model: TP / (TP + FN) = 1 – FNR.
Precision – This computes the percentage of correct predictions (unsafe content) with regards to the total number of predictions made: TP / (TP+FP).
Let’s explore an example. Let’s assume that your test set contains 10,000 images: 9,950 safe and 50 unsafe. The model correctly predicts 9,800 out of 9,950 images as safe and 45 out of 50 as unsafe:
TP = 45
TN = 9800
FP = 9950 – 9800 = 150
FN = 50 – 45 = 5
FPR = 150 / (9950 + 150) = 0.015 = 1.5%
FNR = 5 / (5 + 45) = 0.1 = 10%
TPR/Recall = 45 / (45 + 5) = 0.9 = 90%
Precision = 45 / (45 + 150) = 0.23 = 23%
Measure model accuracy on videos
If you want to evaluate the performance on videos, a few additional steps are necessary:
Sample a subset of frames from each video. We suggest sampling uniformly with a rate of 0.3–1 frames per second (fps). For example, if a video is encoded at 24 fps and you want to sample one frame every 3 seconds (0.3 fps), you need to select one every 72 frames.
Run these sampled frames through Amazon Rekognition content moderation. You can either use our video API, which already samples frames for you (at a rate of 3 fps), or use the image API, in which case you want to sample more sparsely. We recommend the latter option, given the redundancy of information in videos (consecutive frames are very similar).
Compute the per-frame results as explained in the previous section (per-image results).
Aggregate results over the whole test set. Here you have two options, depending on the type of outcome that matters for your business:
Frame-level results – This considers all the sampled frames as independent images and aggregates the results exactly as explained earlier for images (FPR, FNR, recall, precision). If some videos are considerably longer than others, they will contribute more frames to the total count, making the comparison unbalanced. In that case, we suggest changing the initial sampling strategy to a fixed number of frames per video. For example, you could uniformly sample 50–100 frames per video (assuming videos are at least 2–3 minutes long).
Video-level results – For some use cases, it doesn’t matter whether the model is capable of correctly predicting 50% or 99% of the frames in a video. Even a single wrong unsafe prediction on a single frame could trigger a downstream human evaluation and only videos with 100% correct predictions are truly considered correctly. If this is your use case, we suggest you compute FPR/FNR/TPR over the frames of each video and consider the video as follows:
Results Aggregated Over All the Frames of Video ID
Total FP = 0
Total FN = 0
Total FP > 0
False Positive (FP)
Total FN > 0
False Negative (FN)
After you have computed these for each video independently, you can then compute all the metrics we introduced earlier:
The percentage of videos that are wrongly flagged (FP) or missed (FN)
Precision and recall
Measure performance against goals
Finally, you need to interpret these results in the context of your goals and capabilities.
First, consider your business needs in regards to the following:
Data – Learn about your data (daily volume, type of data, and so on) and the distribution of your unsafe vs. safe content. For example, is it balanced (50/50), skewed (10/90) or very skewed (1/99, meaning that only 1% is unsafe)? Understanding such distribution can help you define your actual metric goals. For example, the number of safe content is often an order of magnitude larger than unsafe content (very skewed), making this almost an anomaly detection problem. Within this scenario, the number of false positives may outnumber the number of true positives, and you can use your data information (distribution skewness, volume of data, and so on) to decide the FPR you can work with.
Metric goals – What are the most critical aspects of your business? Lowering the FPR often comes at the cost of a higher FNR (and vice versa) and it’s important to find the right balance that works for you. If you can’t miss any unsafe content, you likely want close to 0% FNR (100% recall). However, this will incur the largest number of false positives, and you need to decide the target (maximum) FPR you can work with, based on your post-prediction pipeline. You may want to allow some level of false negatives to be able to find a better balance and lower your FPR: for example, accepting a 5% FNR instead of 0% could reduce the FPR from 2% to 0.5%, considerably reducing the number of flagged contents.
Next, ask yourself what mechanisms you will use to parse the flagged images. Even though the API’s may not provide 0% FPR and FNR, it can still bring huge savings and scale (for example, by only flagging 3% of your images, you have already filtered out 97% of your content). When you pair the API with some downstream mechanisms, like a human workforce that reviews the flagged content, you can easily reach your goals (for example, 0.5% flagged content). Note how this pairing is considerably cheaper than having to do a human review on 100% of your content.
When you have decided on your downstream mechanisms, we suggest you evaluate the throughput that you can support. For example, if you have a workforce that can only verify 2% of your daily content, then your target goal from our content moderation API is a flag rate (FPR+TPR) of 2%.
Finally, if obtaining ground truth annotations is too hard or too expensive (for example, your volume of data is too large), we suggest annotating the small number of images flagged by the API. Although this doesn’t allow for FNR evaluations (because your data doesn’t contain any false negatives), you can still measure TPR and FPR.
In the following section, we provide a solution for image moderation evaluation. You can take a similar approach for video moderation evaluation.
The following diagram illustrates the various AWS services you can use to evaluate the performance of Amazon Rekognition content moderation on your test dataset.
The content moderation evaluation has the following steps:
Upload your evaluation dataset into Amazon Simple Storage Service (Amazon S3).
Use Ground Truth to assign ground truth moderation labels.
Generate the predicted moderation labels using the Amazon Rekognition pre-trained moderation API using a few threshold values. (For example, 70%, 75% and 80%).
Assess the performance for each threshold by computing true positives, true negatives, false positives, and false negatives. Determine the optimum threshold value for your use case.
Optionally, you can tailor the size of the workforce based on true and false positives, and use Amazon Augmented AI (Amazon A2I) to automatically send all flagged content to your designated workforce for a manual review.
The following sections provide the code snippets for steps 1, 2, and 3. For complete end-to-end source code, refer to the provided Jupyter notebook.
Before you get started, complete the following steps to set up the Jupyter notebook:
Open the notebook for this post: content-moderation-evaluation/Evaluating-Amazon-Rekognition-Content-Moderation-Service.ipynb.
Upload your evaluation dataset to Amazon Simple Storage Service (Amazon S3).
We will now go through steps 2 through 4 in the Jupyter notebook.
Use Ground Truth to assign moderation labels
To assign labels in Ground Truth, complete the following steps:
Create a manifest input file for your Ground Truth job and upload it to Amazon S3.
Create the labeling configuration, which contains all moderation labels that are needed for the Ground Truth labeling job.To check the limit for the number of label categories you can use, refer to Label Category Quotas. In the following code snippet, we use five labels (refer to the hierarchical taxonomy used in Amazon Rekognition for more details) plus one label (Safe_Content) that marks content as safe:
Create a custom worker task template to provide the Ground Truth workforce with labeling instructions and upload it to Amazon S3.
The Ground Truth label job is defined as an image classification (multi-label) task. Refer to the source code for instructions to customize the instruction template.
Decide which workforce you want to use to complete the Ground Truth job. You have two options (refer to the source code for details):
Create and submit a Ground Truth labeling job. You can also adjust the following code to configure the labeling job parameters to meet your specific business requirements. Refer to the source code for complete instructions on creating and configuring the Ground Truth job.
After the job is submitted, you should see output similar to the following:
Wait for labeling job on the evaluation dataset to complete successfully, then continue to the next step.
Use the Amazon Rekognition moderation API to generate predicted moderation labels.
The following code snippet shows how to use the Amazon Rekognition moderation API to generate moderation labels:
Assess the performance
You first retrieved ground truth moderation labels from the Ground Truth labeling job results for the evaluation dataset, then you ran the Amazon Rekognition moderation API to get predicted moderation labels for the same dataset. Because this is a binary classification problem (safe vs. unsafe content), we calculate the following metrics (assuming unsafe content is positive):
We also calculate the corresponding evaluation metrics:
The following code snippet shows how to calculate those metrics:
This post discusses the key elements needed to evaluate the performance aspect of your content moderation service in terms of various accuracy metrics. However, accuracy is only one of the many dimensions that you need to evaluate when choosing a particular content moderation service. It’s critical that you include other parameters, such as the service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing. To learn more about content moderation in Amazon Rekognition, visit Amazon Rekognition Content Moderation.
About the authors
Davide Modolo is an Applied Science Manager at AWS AI Labs. He has a PhD in computer vision from the University of Edinburgh (UK) and is passionate about developing new scientific solutions for real-world customer problems. Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.
Jian Wu is a Senior Enterprise Solutions Architect at AWS. He’s been with AWS for 6 years working with customers of all sizes. He is passionate about helping customers to innovate faster via the adoption of the Cloud and AI/ML. Prior to joining AWS, Jian spent 10+ years focusing on software development, system implementation and infrastructure management. Aside from work, he enjoys staying active and spending time with his family.