Augment search with metadata by chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra
Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization. With Amazon Kendra, you can stop searching through troves of unstructured data and discover the right answers to your questions, when you need them.
Although Amazon Kendra is a great search tool, it only performs as well as the quality of the documents in its index. As with all things AI/ML, the better the quality of the data you put into Amazon Kendra, the more targeted and precise the search results. So how can we improve the documents in our Amazon Kendra index to maximize search result performance? To allow Amazon Kendra to return more targeted search results, we enrich the documents with metadata so we can use attributes such as main language, named entities, key phrases, and more.
In this post, we address the following use case: With large amounts of raw historical documents to search on, how do we connect metadata to the documents to take advantage of Amazon Kendra’s boosting and filtering features? We aim to demonstrate a way in which you can enrich your historical data by adding metadata to searchable documents with Amazon Textract and Amazon Comprehend, to get more targeted and flexible searches with Amazon Kendra.
This tutorial uses Amazon SageMaker notebooks as the code platform, but you can make the same API calls from any IDE of your choice. To save costs, or simply for familiarity, feel free to follow along in your favorite IDE instead of SageMaker notebooks.
For this post, we examine a hypothetical use case for a media entertainment company. We have many documents about movies and television shows and want to use Amazon Kendra to query the data. For demonstration purposes, we pull public data from Wikipedia to create PDF documents that act as our company’s data that we want to query on. We use an Amazon SageMaker notebook instance as our code platform. We use Python, along with the Boto3 Python library, to connect to and use the Amazon Textract, Amazon Comprehend, and Amazon Kendra APIs.
We walk you through the following high-level steps:
Create our media PDF documents through Wikipedia.
Create metadata using Amazon Textract and Amazon Comprehend.
Configure the Amazon Kendra index and load the data.
Run a sample query and experiment with boosting query performance.
As a prerequisite, we first set up a SageMaker notebook instance and a Python notebook within it.
Create a SageMaker notebook instance
To create a SageMaker notebook instance, you can follow the instructions in the documentation Create a Notebook Instance, or follow the configuration that we use in this post.
Create a notebook instance with the following configuration:
Notebook instance name – KendraAugmentation
Notebook instance type – ml.t2.medium
Elastic inference – None
Next, we create an AWS Identity and Access Management (IAM) role.
Choose Create a new role.
In the dialog box that opens, choose Create role.
The role name starts with AmazonSageMaker-ExecutionRole-xxxxxxxx. For this example, we create a role called AmazonSageMaker-ExecutionRole-Kendra-Blog.
For Root access, select Enable.
Leave the remaining options at their default.
Choose Create notebook instance.
You’re redirected to a page that shows that your notebook instance is being created. The process takes a few minutes. When you see a green InService state, the notebook is ready.
Create a Python3 notebook in your SageMaker notebook instance
With the notebook created, we’re ready to start writing and running Python code. In the first Jupyter notebook cell, run the following code to import the modules we need:
As we go through each of the sections, we import other modules as necessary.
Create Media PDF documents through Wikipedia
Run the following code in one of the Jupyter notebook cells to create an Amazon Simple Storage Service (Amazon S3) bucket where we store all the media documents that we search on:
For this post, our bucket is kendra-augmentation-documents-jp. You can update the code with a different name.
As we mentioned earlier, we create mock PDF documents from public Wikipedia content that represent the media data that we augment and perform searches on. I’ve pre-selected movies and TV shows from the entertainment industry in the following code, but you can choose different topics in your notebook.
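One way to build such documents, sketched below with a hypothetical selection of titles: fetch plain-text summaries from the Wikipedia REST API with the standard library, then render them to PDF (the build_pdf helper assumes the third-party fpdf2 package is installed):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical selection standing in for the post's pre-selected titles
TITLES = ["The Matrix", "Parasite (2019 film)", "Breaking Bad", "Seinfeld"]

WIKI_API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def fetch_summary(title: str) -> str:
    """Fetch a page's plain-text summary from the Wikipedia REST API."""
    url = WIKI_API + urllib.parse.quote(title, safe="")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["extract"]

def build_pdf(title: str, text: str, out_path: str) -> None:
    """Render the text into a simple one-page PDF (pip install fpdf2)."""
    from fpdf import FPDF  # imported lazily so the rest of the sketch is stdlib-only
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=11)
    pdf.multi_cell(0, 6, text)
    pdf.output(out_path)
```

You’d then upload the generated PDFs to the documents bucket, for example with s3.upload_file.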
Create metadata using Amazon Textract and Amazon Comprehend
To create metadata for each of our PDF files, we must first extract the text portions of each PDF using Amazon Textract. We run the extracted text through Amazon Comprehend to attach attributes (metadata) to the PDFs, such as named entities, dominant language, and key phrases. Note that Amazon Comprehend will be able to read PDF files directly in a future feature release.
Use the following helper function (s3_get_filenames) to get all the file names in a specific bucket or prefix folder in Amazon S3:
We run Amazon Textract on each of our PDF files to extract the text of each file and transform the data into a format that we later ingest into Amazon Comprehend.
Next, we create the S3 bucket and service role settings needed to run Amazon Textract through SageMaker notebook instances.
Create a new S3 bucket to store our Amazon Textract output, called kendra-augmentation-textract-output-jp:
Attach the AmazonTextractFullAccess policy to the same AmazonSageMaker-ExecutionRole-Kendra-Blog role.
Run the following code to run Amazon Textract on our PDF files, create new .txt files for Amazon Comprehend to use, and send these files to the S3 bucket we created:
Now that we have our .txt files with the text representation of our PDF files, we can create metadata out of them using Amazon Comprehend.
We extract and attach the following Amazon Comprehend metadata attributes to each document for Amazon Kendra to index on:
Dominant language – The language that’s being used the most in the document
Named entities – A textual reference to the unique name of a real-world object, such as people, places, and commercial items, and precise references to measures such as dates and quantities
Key phrases – A string containing a noun phrase that describes a particular thing
Sentiment – The positive, negative, neutral, and mixed sentiment score of the entire document
Use the following ComprehendAnalyzer Python class to simplify and unify the Amazon Comprehend API calls. Either copy and paste the code into one of the notebook cells and run it, or create a separate .py file and import it in the notebook.
We now have everything we need to create an Amazon Kendra index, create and add metadata to the index, and start boosting and filtering our Amazon Kendra searches!
Configure our Amazon Kendra index
Now that we’ve got our Amazon Textract outputs and our Amazon Comprehend class in ComprehendAnalyzer, we can put everything together with Amazon Kendra.
Configure Amazon Kendra IAM access
Like in the previous steps, we need to give SageMaker access to use Amazon Kendra by attaching the AmazonKendraFullAccess policy to the AmazonSageMaker-ExecutionRole-Kendra-Blog role. Then we create an IAM policy and service role.
To create an index with Amazon Kendra, we first create an IAM policy that lets Amazon Kendra access our CloudWatch Logs, and then create an Amazon Kendra service role. For full instructions, see the Amazon Kendra Developer Guide. I outline the exact steps in this section for your convenience.
On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Choose JSON and replace the default policy with the following:
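The policy below follows the Amazon Kendra Developer Guide’s getting-started instructions; replace region and account ID with your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cloudwatch:PutMetricData"],
      "Resource": "*",
      "Condition": {"StringEquals": {"cloudwatch:namespace": "AWS/Kendra"}}
    },
    {"Effect": "Allow", "Action": "logs:DescribeLogGroups", "Resource": "*"},
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogGroup",
      "Resource": "arn:aws:logs:region:account ID:log-group:/aws/kendra/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:region:account ID:log-group:/aws/kendra/*:log-stream:*"
    }
  ]
}
```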
Choose Review policy.
Name the policy KendraPolicyForGettingStartedIndex and choose Create policy.
In the navigation pane, choose Roles.
Choose Create role.
Choose Another AWS account and enter your account ID.
Choose Next: Permissions.
Choose the policy that you just created and choose Next: Tags.
Don’t add any tags and choose Next: Review.
Name the role KendraRoleForGettingStartedIndex and choose Create role.
Find the role that you just created and open the role summary.
Choose Trust relationships and then choose Edit trust relationship.
Replace the existing trust relationship with the following:
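This trust policy lets Amazon Kendra assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "kendra.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```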
Choose Update trust policy.
Create your Amazon Kendra Index
Now that we’ve got all the policies and roles that we need, let’s create our Amazon Kendra index using the following code. You have to update the role ARN with your AWS account number.
Define the Amazon Kendra index metadata configuration
We now define the metadata configuration for the index blog-media-company-index we just made. It follows the Amazon Comprehend attributes we defined in our Python class ComprehendAnalyzer. See the following code:
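The exact configuration isn’t shown in this extract; the sketch below is one plausible meta_config_dict. The field names, and the choice to store sentiment scores as LONG_VALUE integers (score * 100, because Kendra custom attributes have no float type), are assumptions:

```python
meta_config_dict = [
    {
        "Name": "Languages",
        "Type": "STRING_VALUE",
        "Search": {"Facetable": True, "Searchable": True, "Displayable": True},
        "Relevance": {"Importance": 1},
    },
    {
        "Name": "Entities",
        "Type": "STRING_LIST_VALUE",
        "Search": {"Facetable": True, "Searchable": True, "Displayable": True},
    },
    {
        "Name": "KeyPhrases",
        "Type": "STRING_LIST_VALUE",
        "Search": {"Facetable": True, "Searchable": True, "Displayable": True},
    },
    {
        "Name": "Positive_score",
        "Type": "LONG_VALUE",  # sentiment score scaled to an integer (score * 100)
        "Search": {"Facetable": True, "Displayable": True},
        "Relevance": {"Importance": 1, "RankOrder": "DESCENDING"},
    },
    {
        "Name": "Negative_score",
        "Type": "LONG_VALUE",
        "Search": {"Facetable": True, "Displayable": True},
        "Relevance": {"Importance": 1, "RankOrder": "ASCENDING"},
    },
]
```

Apply it with kendra.update_index(Id=index_id, DocumentMetadataConfigurationUpdates=meta_config_dict).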
Create metadata using ComprehendAnalyzer
Now that we’ve created our index blog-media-company-index and defined and set our metadata configuration, we use ComprehendAnalyzer to extract metadata from our media files in Amazon S3:
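However the analyzer is structured internally, each entry in the resulting documents list needs to follow the BatchPutDocument shape. The helper below is a sketch of that packaging step (the attribute names and the LONG_VALUE score scaling are assumptions):

```python
def build_document(doc_id, title, text, lang, entities, phrases, sentiment):
    """Assemble one Kendra BatchPutDocument entry from Comprehend output.
    `sentiment` is the SentimentScore dict returned by detect_sentiment."""
    return {
        "Id": doc_id,
        "Title": title,
        "Blob": text.encode("utf-8"),
        "ContentType": "PLAIN_TEXT",
        "Attributes": [
            {"Key": "Languages", "Value": {"StringValue": lang}},
            {"Key": "Entities", "Value": {"StringListValue": entities}},
            {"Key": "KeyPhrases", "Value": {"StringListValue": phrases}},
            {"Key": "Positive_score",
             "Value": {"LongValue": round(sentiment["Positive"] * 100)}},
            {"Key": "Negative_score",
             "Value": {"LongValue": round(sentiment["Negative"] * 100)}},
        ],
    }
```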
If you want to see what the metadata looks like, look at the first item in the documents Python list by running the following code:
Load metadata into the Amazon Kendra index
The last step is to load the metadata we extracted using ComprehendAnalyzer into the blog-media-company-index index by running the following code:
Now we’re ready to start querying and boosting some of the metadata attributes!
Query the index and boost metadata attributes
We now have everything set up to start querying our data. We’re able to weigh attributes differently in terms of significance, make metadata attributes searchable, influence the order of results a query returns based on the sentiment metadata, and much more.
Run a sample query
Before we get into a few examples that demonstrate the power and flexibility this metadata attachment gives us, let’s run the following code to query the blog-media-company-index index:
We can test the following query to get a sense of how to query our new index:
Now that you know how to query, let’s get into some examples of how we can use our metadata to influence our searches.
This section contains some examples of how we can influence and control our search for more targeted results. For each of the examples, we update our blog-media-company-index index by modifying our meta_config_dict and rerunning the following code:
Example 1: Weighing attributes
To weigh attributes by significance, update the Importance value of the attributes. Importance ranges from 1 (lowest) to 10 (highest).
For example, let’s say we have a use case where we have different country entities and we have documents in many different languages. We can increase the significance of the Languages metadata attribute to account for this by updating its Importance to 10, and making sure Searchable is set to True. This makes it so that the text in the field Languages is searchable. See the following code:
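One possible shape for that update (the exact field definition is an assumption):

```python
# Boost the Languages field to Importance 10 and make its text searchable.
languages_config = {
    "Name": "Languages",
    "Type": "STRING_VALUE",
    "Search": {"Facetable": True, "Searchable": True, "Displayable": True},
    "Relevance": {"Importance": 10},
}
```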
Now let’s say that we’re looking for more positive context results. We increase the Importance value of the metadata attribute Sentiment to 10:
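Assuming the overall sentiment label was stored as a STRING_VALUE field named Sentiment, that update might look like:

```python
# Favor documents whose sentiment matches the query context.
sentiment_config = {
    "Name": "Sentiment",
    "Type": "STRING_VALUE",
    "Search": {"Facetable": True, "Searchable": True, "Displayable": True},
    "Relevance": {"Importance": 10},
}
```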
Example 2: Ranking search results
Let’s say we want to influence the rank of the search results by a particular sentiment metadata attribute. We can simply configure the Importance and RankOrder of the sentiment we want. For example, to increase the significance of positive results and rank them above negative ones, we update the Positive_score attribute to have an Importance of 10 and a RankOrder of DESCENDING, which puts the most positive results at the top. We leave the Importance of Negative_score at 1 and update its RankOrder to ASCENDING so that the least negative sentiment results show up higher. See the following code:
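A sketch of that update (storing the scores as LONG_VALUE fields is an assumption, because Kendra custom attributes have no float type):

```python
# Rank the most positive documents first and push negative ones down.
positive_config = {
    "Name": "Positive_score",
    "Type": "LONG_VALUE",
    "Search": {"Facetable": True, "Displayable": True},
    "Relevance": {"Importance": 10, "RankOrder": "DESCENDING"},
}
negative_config = {
    "Name": "Negative_score",
    "Type": "LONG_VALUE",
    "Search": {"Facetable": True, "Displayable": True},
    "Relevance": {"Importance": 1, "RankOrder": "ASCENDING"},
}
```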
At this point, you’ve got your Amazon Kendra index and metadata attributes set up. Go ahead and play around with querying, weighing metadata, and ranking results by creating your own creative combinations!
Clean up
To avoid extra charges, shut down the SageMaker and Amazon Kendra resources when you’re done.
On the SageMaker console, choose Notebook and Notebook instances.
Select the notebook that you created.
On the Actions menu, choose Stop.
Alternatively, you can keep the instance stopped indefinitely; a stopped instance doesn’t incur compute charges (you pay only for its attached storage).
On the Amazon Kendra console, choose Indexes.
Select the index you created.
On the Actions menu, choose Delete.
Because we used Amazon Textract and Amazon Comprehend via API, there are no shutdown steps necessary for those resources.
Conclusion
In this post, we showed how to do the following:
Use Amazon Textract on PDF files to extract text from documents
Use Amazon Comprehend to extract metadata attributes from Amazon Textract output
Perform targeted searches with Amazon Kendra using the metadata attributes extracted by Amazon Comprehend
Although this may have been a mock media company example using public sample data, I hope you were able to have some fun following along and realize the potential—and power—of chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra together. Use this new knowledge and start augmenting your historical data! To learn more about how Amazon Kendra’s fully managed intelligent search service can help your business, visit our webpage or dive into our documentation and tutorials!
About the Author
James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.