Agentic Splitting: Our novel approach to chunking

Engineering
Shashank Agarwal
June 8, 2024

At Zango, we are building regulation-aware AI agents for compliance (read: KYC, AML, etc.) teams. Our initial use case is identifying areas of non-compliance in a compliance team's policies, given the constant stream of regulatory updates across jurisdictions.

We leverage large language models (LLMs) in the legal and regulatory compliance domains by developing advanced generative AI solutions, including a horizon scanning tool (constantly scanning public websites for any updates from regulators), a gap analysis tool, and an LLM-based compliance expert system.

Our approach integrates RAG, multiple LLMs, text embeddings, fine-tuning, and prompt engineering techniques to effectively minimize hallucinations and produce reliable, accurate, domain-specific outputs. Moreover, a human-in-the-loop control mechanism serves as the ultimate safeguard to ensure accuracy and mitigate risk.

Here’s what our research focuses on:

  • Expanding support to multiple regions and languages
  • Improving models, prompts, and text embedding datasets for better compliance reasoning
  • Implementing human-in-the-loop processes and verification workflows
  • Benchmarking models
  • Creating a regulatory-indexed repository

The challenge with chunking

The primary challenge when it comes to using AI in compliance is that regulatory updates are anywhere between 10 and 500 pages long and almost always refer back to 5-6 other regulatory documents. This means that, for an update to be meaningful to the compliance team, we have to combine multiple documents to build the context for a single regulatory update. For example, this UK AML regulation is 120 pages long and, after conversion, comes to 614 chunks and around 592,467 tokens. We don't have access to an LLM that can hold that many tokens in context and still perform the task of cross-referencing the 5 other documents mentioned in the policy update.
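
To get a rough sense of scale, a minimal sketch like the one below (using the open-source tiktoken tokenizer; the file path and chunk size are illustrative, not our pipeline) shows how quickly a long regulation exceeds typical context windows:

```python
# Rough token count for a converted regulation (file path is illustrative).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-class models

with open("uk_aml_regulation.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"Total tokens: {len(tokens)}")                     # ~592k for the 120-page example above
print(f"Naive 1,000-token chunks: {len(tokens) // 1000 + 1}")
```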

Since LLMs are trained on historical, publicly available data, there are two main issues:

  1. They are trained up to a fixed cutoff date (for example, gpt-4-turbo-2024-04-09 is supposed to have knowledge up to December 2023), so they miss anything published after that date.
  2. They are trained on public data, so there's no guarantee of a correct answer or reliable sources.

One solution is to search the internet in real time at query time, but this does not completely eliminate the issue of unreliable sources. Therefore, we use RAG (retrieval-augmented generation), alongside fine-tuning, to prevent hallucinations by providing the model with up-to-date context (the document). In the RAG setup, we set the temperature to 0 so the model uses only the information provided in the document during the query, and we further control hallucinations with strict prompts.
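
As an illustration, a retrieval-augmented query with temperature 0 and a strict system prompt might look like the sketch below (the model name and prompt wording are placeholders, not our production configuration):

```python
# Minimal sketch of a RAG-style query: the retrieved document is passed as context,
# temperature is 0, and the system prompt restricts the model to that context.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "Answer strictly from the provided regulatory context. "
    "If the answer is not in the context, say you do not know."
)

def answer_with_context(question: str, context: str) -> str:
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        temperature=0,  # deterministic; stick to the supplied context
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```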

The major problem with LLMs is that they are constrained by token limits. The graph below shows the context-window limits of different LLMs:

To work with LLMs, we have to split documents into smaller pieces that fit within the models' token limits. This is called splitting or chunking (we'll use the terms interchangeably). In multi-modal settings, splitting also applies to images.

The solution

There are different approaches for chunking, and some popular ones are:

  1. Character Splitting: Simple, static character-count chunks of data.
  2. Recursive Character Text Splitting: Recursive chunking based on a list of separators (a minimal sketch follows this list).
  3. Document Specific Splitting: Chunking methods tailored to different document types (PDF, Python, Markdown).
  4. Semantic Splitting: Embedding-walk-based chunking. The first three methods rely on fixed text-size chunking and do not consider the actual content, so meaning can be lost at chunk boundaries. Embeddings represent the semantic meaning of a string; by comparing the embeddings of texts, the system can infer relationships between chunks and build clusters of similar content. The advantage of this approach is that it considers content as well as text, and it can also be used for semantic matching in chatbots.
  5. Agentic Splitting: An experimental method of splitting text with an agent-like system, where another model acts as the agent.
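
For contrast with the agentic approach, here is a minimal sketch of option 2, Recursive Character Text Splitting, using LangChain's splitter (chunk size, overlap, and the file path are illustrative values):

```python
# Recursive character splitting: tries the largest separator first, then recurses
# down to smaller ones until each chunk fits within the size limit.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=100,     # overlap to soften context loss at the boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, largest unit first
)

with open("uk_aml_regulation.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks")
```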

We chose option 5, Agentic Splitting, because the other options chunk mostly on raw text and lose context and meaning in the process. We wanted to chunk the way a human would.

So why not use the LLM itself to do this chunking job?

Here’s how a human would do the chunking:

  • Create a fresh document
  • Start reading from the top and treat the opening content as the first chunk (topic)
  • Keep reading and evaluate whether each new sentence or section belongs in the current chunk (topic); if not, create a new one
  • Continue this process until reaching the end of the document

Rather than using simple raw text-based chunks from the document, we use proposition-based retrieval: instead of splitting a page into chunks directly, we run it through a "proposition-izer" that extracts propositions, which are then stored.

Full paper: https://arxiv.org/pdf/2312.06648 

Example: "Payment Providers need to do KYC. They need to do KYB also" > ['Payment Providers need to do KYC.', ‘Payment Providers need to do KYB']

We are using Claude Opus for this (we will discuss model selection in upcoming blogs). The outputs look like regular sentences, but they are statements that can stand on their own rather than follow-up sentences that depend on the previous one. For example, if we simply split on sentences, "They need to do KYB also" would not make sense independently; the proposition should read "Payment Providers need to do KYB."
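
A simplified proposition-izer can be sketched as a single LLM call; the prompt and JSON handling below are illustrative rather than our exact production prompt:

```python
# Sketch of a "proposition-izer": ask the model to rewrite a passage into
# standalone statements and return them as a JSON array.
import json
import anthropic

client = anthropic.Anthropic()

def extract_propositions(passage: str) -> list[str]:
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        temperature=0,
        system=(
            "Decompose the passage into simple, self-contained propositions. "
            "Resolve pronouns so each proposition stands on its own. "
            "Return a JSON array of strings and nothing else."
        ),
        messages=[{"role": "user", "content": passage}],
    )
    # A production version would validate the model output before parsing.
    return json.loads(response.content[0].text)

print(extract_propositions("Payment Providers need to do KYC. They need to do KYB also."))
# Expected shape: ["Payment Providers need to do KYC.", "Payment Providers need to do KYB."]
```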

Then we built a system that reasons about each proposition and determines whether it should become part of an existing chunk or a new chunk should be created (a minimal sketch of this loop follows the example chunk below). This produces chunks like the following:

Chunk #1

  • Chunk_ID: a445F
  • Summary: This chunk contains information about payment providers and customer identification.
  • Propositions: Payment providers must do KYC for all customers.
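
A minimal sketch of that routing loop, with illustrative prompts and chunk IDs (a production version would also keep each chunk's summary up to date as propositions are added), could look like this:

```python
# Agentic chunking loop: for each proposition, an LLM decides whether it belongs
# to an existing chunk or a new chunk should be created.
import uuid
import anthropic

client = anthropic.Anthropic()
chunks: dict[str, dict] = {}  # chunk_id -> {"summary": str, "propositions": list[str]}

def route_proposition(proposition: str) -> str:
    """Ask the model which existing chunk (if any) the proposition belongs to."""
    summaries = "\n".join(f"{cid}: {c['summary']}" for cid, c in chunks.items())
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=200,
        temperature=0,
        system=(
            "You assign propositions to topical chunks. Given the chunk summaries "
            "and a new proposition, reply with the matching chunk id, or NEW if "
            "none of the chunks fit."
        ),
        messages=[{"role": "user",
                   "content": f"Chunks:\n{summaries}\n\nProposition: {proposition}"}],
    )
    return response.content[0].text.strip()

def add_proposition(proposition: str) -> None:
    decision = route_proposition(proposition)
    if decision in chunks:
        chunks[decision]["propositions"].append(proposition)
    else:
        chunk_id = uuid.uuid4().hex[:5]  # short id, e.g. "a445f"
        chunks[chunk_id] = {"summary": proposition, "propositions": [proposition]}
```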

Great, we can store these chunks in a vector datastore and now use them in our evaluations for retrieval in the RAG workflow.
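
As an example, storing and querying those chunks with a local Chroma collection could look like the sketch below (collection name and metadata fields are illustrative, and `chunks` comes from the routing sketch above):

```python
# Store each agentic chunk in a vector datastore and run a retrieval query.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="regulatory_chunks")

for chunk_id, chunk in chunks.items():
    collection.add(
        ids=[chunk_id],
        documents=[" ".join(chunk["propositions"])],
        metadatas=[{"summary": chunk["summary"]}],
    )

# Retrieval step of the RAG workflow
results = collection.query(
    query_texts=["What KYC checks do payment providers need?"],
    n_results=3,
)
print(results["documents"][0])
```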

This highlights why software historically couldn't automate this use case, leaving compliance teams to rely on in-house experts or outside consultants to read and synthesize regulatory updates and perform gap analyses against internal policies and processes.

In our next blog, we will detail how a human reads a regulatory update while referring to multiple sources to provide a comprehensive update to the compliance team, further illustrating the challenges in developing an AI-based solution for this use case.

Top image source: https://www.linkedin.com/pulse/chunking-your-way-ai-precision-tars-8sc8e/

Reference:

https://arxiv.org/pdf/2312.06648

https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
