How to Use Watson Discovery to Store and Query Your PDF Documents

Tian Huat Tan
Nov 29, 2021

Storing data in databases has been common practice for decades. However, storing PDF documents and querying their content remains a challenge, because the content of a PDF document is largely unstructured.

In this article, I will show you how to do this with Watson Discovery, which builds on AI developed at IBM. You’ll be amazed at how simple it is.

The following diagram illustrates the process.

First, we import the documents into Watson Discovery. Second, we annotate the PDF documents. We can optionally add stop words and synonyms to improve query results. Watson Discovery can then index the PDF documents. Once indexing is done, we can search the indexed documents using natural language.

With the high-level workflow in mind, I’ll go through how to actually set this up on IBM Watson Discovery (cloud version).

First, in the IBM Watson Discovery interface, select “Upload your own data,” fill in the collection name, and click “Create.”

Subsequently, select your PDF documents and upload them.
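The article does these two steps in the UI. For readers who prefer scripting, they can be sketched in Python as below. This is a minimal sketch, not the method used in the article: it assumes `client` is a `DiscoveryV2` instance from the `ibm-watson` Python SDK, and `create_and_upload`, the project ID, and the file paths are hypothetical names for illustration.

```python
from pathlib import Path


def create_and_upload(client, project_id, collection_name, pdf_paths):
    """Create a collection and upload PDFs into it; return the collection ID.

    Assumption: `client` is an ibm_watson.DiscoveryV2 instance, built e.g. as
        client = DiscoveryV2(version="2020-08-30",
                             authenticator=IAMAuthenticator(API_KEY))
        client.set_service_url(SERVICE_URL)
    """
    created = client.create_collection(
        project_id=project_id, name=collection_name
    ).get_result()
    collection_id = created["collection_id"]

    for path in map(Path, pdf_paths):
        # Each PDF is sent as a binary stream with an explicit content type.
        with path.open("rb") as pdf:
            client.add_document(
                project_id=project_id,
                collection_id=collection_id,
                file=pdf,
                filename=path.name,
                file_content_type="application/pdf",
            )
    return collection_id
```

Passing the client in as a parameter keeps credentials out of the function and makes it easy to try against a stub before pointing it at a real service instance.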

After uploading your documents, click on “Configure data.”

Now, you will be able to annotate your documents using the “Field Labels” on the right-hand side. Examples of “Field Labels” are “header,” “footer,” and “subtitle.” You only need to annotate a few pages (e.g., label the “header” on the first few pages), and Watson Discovery will learn from those examples and annotate the rest for you (e.g., the “header” on the remaining pages).

Once we are done with the annotation, we select which annotations should be indexed. Only the text within the selected annotated portions will be stored in the index; everything else will be ignored. We can also choose to split the documents further at the chosen annotations. This creates smaller documents, which makes our search more targeted.
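To make the effect of these two settings concrete, here is a toy Python sketch. It is purely illustrative and says nothing about how Watson Discovery works internally: it assumes a document has already been reduced to `(label, text)` pairs, and `split_and_filter` is a hypothetical helper.

```python
def split_and_filter(blocks, index_labels, split_label):
    """Keep only text whose label is indexed, starting a new document at each split label.

    blocks: list of (label, text) pairs in reading order.
    index_labels: set of labels whose text should be kept.
    split_label: label that marks the start of a new, smaller document.
    """
    documents, current = [], []
    for label, text in blocks:
        # A split label closes the document collected so far.
        if label == split_label and current:
            documents.append(current)
            current = []
        # Text under labels that are not selected is dropped.
        if label in index_labels:
            current.append(text)
    if current:
        documents.append(current)
    return [" ".join(parts) for parts in documents]


pages = [
    ("header", "Business Law"),
    ("subtitle", "Case A"),
    ("text", "Facts of case A"),
    ("footer", "Page 1"),
    ("subtitle", "Case B"),
    ("text", "Facts of case B"),
]
print(split_and_filter(pages, {"subtitle", "text"}, "subtitle"))
# ['Case A Facts of case A', 'Case B Facts of case B']
```

Headers and footers disappear from the index, and each subtitle starts its own small document, so a query can return the single relevant case instead of the whole PDF.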

Once this is done, we choose “Apply changes to collection.” Then, voilà! Watson Discovery will index the documents for us. Sit back and grab a coffee, as indexing will take a few minutes to complete.

Our PDF documents are about cases related to business law. Suppose we are interested in cases related to the keyword “rainbow.” Once we click “run query,” we get the related documents!

This is just one possible query. You can ask any number of other questions about your PDF documents.
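The same query can also be issued from code rather than the UI. The sketch below assumes, as before, that `client` is a `DiscoveryV2` instance from the `ibm-watson` Python SDK; `search_cases` and the IDs are hypothetical, and the response is assumed to carry a `results` list whose entries include a `document_id`.

```python
def search_cases(client, project_id, collection_id, question, limit=5):
    """Run a natural language query and return the IDs of the matching documents.

    Assumption: `client` is an ibm_watson.DiscoveryV2 instance.
    """
    response = client.query(
        project_id=project_id,
        collection_ids=[collection_id],
        natural_language_query=question,
        count=limit,  # maximum number of results to return
    ).get_result()
    return [doc["document_id"] for doc in response.get("results", [])]


# Hypothetical usage:
# search_cases(client, PROJECT_ID, COLLECTION_ID, "rainbow")
```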

The query works well for most cases in our example. There are a few misclassifications, mostly due to the use of abbreviations. For example, a user might write “misrep” instead of “misrepresentation.” We can address this by adding “misrep” and “misrepresentation” as synonyms. Once the abbreviations have been added as synonyms, these misclassifications are rectified.
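Synonyms can be added in the UI, as described above. If you want to script it, the following is a hedged sketch using the `ibm-watson` Python SDK's `create_expansions` call on a `DiscoveryV2` client. `add_synonyms` is a hypothetical helper, and as far as I know the call replaces any expansion list the collection already has, so check the API documentation before running it against a collection with existing synonyms.

```python
def add_synonyms(client, project_id, collection_id, synonym_groups):
    """Register groups of interchangeable terms as query expansions.

    Assumption: `client` is an ibm_watson.DiscoveryV2 instance.
    Each group, e.g. ["misrep", "misrepresentation"], becomes one
    bidirectional expansion: a query for any term also matches the others.
    """
    expansions = [
        {"expanded_terms": sorted(set(group))} for group in synonym_groups
    ]
    client.create_expansions(
        project_id=project_id,
        collection_id=collection_id,
        expansions=expansions,
    )
    return expansions


# Hypothetical usage:
# add_synonyms(client, PROJECT_ID, COLLECTION_ID, [["misrep", "misrepresentation"]])
```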

That’s it. Watson Discovery is visually driven and designed for non-technical users. You can use the provided UI to accomplish a challenging task in a few steps. I hope you enjoyed this article.
