At Zango, we are building regulation-aware AI agents for compliance teams (think KYC, AML, and similar functions). Our initial use case is identifying areas of non-compliance in a compliance team's policies as regulations are continually updated across jurisdictions.
We leverage large language models (LLMs) in the legal and regulatory compliance domains by developing advanced generative AI solutions, including a horizon scanning tool (constantly scanning public websites for any updates from regulators), a gap analysis tool, and an LLM-based compliance expert system.
Our approach integrates RAG, multimodal LLMs, text embeddings, fine-tuning, and prompt engineering to minimize hallucinations and produce reliable, accurate, domain-specific outputs. A human-in-the-loop control mechanism serves as the ultimate safeguard to ensure accuracy and mitigate risk.
Here’s what our research focuses on:
The primary challenge in using AI for compliance is that regulatory updates are anywhere between 10 and 500 pages long and almost always refer back to 5-6 other regulatory documents. This means that for an update to be meaningful to the compliance team, we have to combine multiple documents to build the context for a single regulatory update. For example, this UK AML regulation is 120 pages long; after conversion it comes to 614 chunks and roughly 592,467 tokens. We don't have access to an LLM that can hold that many tokens in context and still perform the harder task of referring back to the 5 other documents mentioned in the policy update.
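As a rough illustration of the scale involved, here is a minimal sketch of how a token count like that can be estimated. The file name and the use of tiktoken's cl100k_base encoding are assumptions for the example, not our production setup.

```python
# Rough token-count estimate for a long regulatory document.
# Assumes the document has already been converted to plain text;
# "regulation.txt" and the cl100k_base encoding are illustrative.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

with open("regulation.txt", encoding="utf-8") as f:
    text = f.read()

tokens = encoding.encode(text)
chunk_size = 1000  # tokens per chunk, illustrative value

print(f"Total tokens: {len(tokens)}")
print(f"Approx. chunks of {chunk_size} tokens: {len(tokens) // chunk_size + 1}")
```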
Since LLMs are trained on historical, publicly available data, two main issues arise: their knowledge can be out of date relative to the latest regulatory changes, and they can hallucinate when asked about material they have never seen.
One solution is to search the internet in real time while querying, but this does not completely eliminate the issue of unreliable sources. Therefore, we use RAG (retrieval-augmented generation), alongside fine-tuning, to prevent hallucinations by providing the model with an up-to-date context (document). In the RAG pipeline, we set the temperature to 0 so the output is deterministic and, together with strict prompts, constrain the model to the information provided in the document during the query.
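A minimal sketch of what such a constrained RAG query can look like is below. The prompt wording, model ID, and the idea of passing pre-retrieved chunks in are illustrative assumptions, not our exact production code.

```python
# RAG-style query with temperature 0 and a strict system prompt.
# The model ID, prompt wording, and chunk-passing convention are
# illustrative assumptions, not the production implementation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer_from_document(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    system_prompt = (
        "You are a compliance assistant. Answer ONLY from the provided "
        "regulatory context. If the answer is not in the context, reply "
        "'Not found in the provided documents.' Do not use outside knowledge."
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",   # illustrative model ID
        max_tokens=1024,
        temperature=0,                    # deterministic, no creative drift
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```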
The major problem with LLMs is that they are constrained by token (context-window) limits. The graph below shows the limits of different models:
To work within these token limits, we have to split documents into smaller pieces. This is called splitting or chunking (we'll use the terms interchangeably). In multi-modal settings, splitting applies to images as well.
There are different approaches to chunking; some popular ones are listed below (a sketch of a simple text-based splitter follows the list):
1. Character splitting
2. Recursive character splitting
3. Document-specific splitting
4. Semantic splitting
5. Agentic splitting
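To make the contrast concrete, here is a minimal sketch of option 2, recursive character splitting, using LangChain's text splitter. The chunk size, overlap, and file name are illustrative assumptions.

```python
# Baseline text-based chunking (option 2): recursive character splitting.
# Chunk size, overlap, and the file name are illustrative values.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # target characters per chunk
    chunk_overlap=100,    # overlap to soften hard cuts
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraph, line, sentence, word
)

with open("regulation.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0][:200]}")
```

Splits like these respect character boundaries but have no notion of which statements belong together, which is the gap agentic splitting addresses.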
We chose option 5, agentic splitting, because the other options chunk mostly on raw text boundaries and lose context and meaning in the process. We wanted the chunking to be done the way a human would do it.
So why not use the LLM itself to do this chunking job?
Instead of using simple raw-text chunks from the document, we use proposition-based retrieval. Rather than splitting a page into fixed-size chunks, we run it through a "proposition-izer" that extracts propositions, which are then stored.
Full paper: https://arxiv.org/pdf/2312.06648
Example: "Payment Providers need to do KYC. They need to do KYB also" > ['Payment Providers need to do KYC.', 'Payment Providers need to do KYB.']
We are using Claude Opus for this (we will discuss model selection in upcoming blogs). The output propositions look like regular sentences, but they are statements that can stand on their own rather than follow-up sentences that depend on the previous one. For example, with a naive sentence split, "They need to do KYB also" would not make sense independently; the proposition should be "Payment Providers need to do KYB."
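A minimal sketch of such a proposition-izer call is below, following the style of the paper linked above. The exact prompt wording, model ID, and JSON output contract are assumptions for illustration.

```python
# Proposition extraction ("proposition-izer"): turn a passage into
# self-contained, stand-alone statements. Prompt wording, model ID, and
# the JSON output contract are illustrative assumptions.
import json
import anthropic

client = anthropic.Anthropic()

def extract_propositions(passage: str) -> list[str]:
    prompt = (
        "Decompose the following passage into simple, self-contained "
        "propositions. Each proposition must make sense on its own: "
        "resolve pronouns back to the entities they refer to. "
        "Return a JSON array of strings and nothing else.\n\n"
        f"Passage: {passage}"
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",  # illustrative model ID
        max_tokens=1024,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

# Example from above:
# extract_propositions("Payment Providers need to do KYC. They need to do KYB also")
# -> ["Payment Providers need to do KYC.", "Payment Providers need to do KYB."]
```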
Then we built a system that reasons about each proposition and decides whether it should become part of an existing chunk or whether a new chunk should be started; a sketch of this agentic chunker is below.
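This is a minimal sketch of that agentic chunker, assuming the extract_propositions() helper above and letting the LLM itself decide chunk membership. The decision prompt and data structures are illustrative, not our production code.

```python
# Agentic chunking sketch: the LLM decides, per proposition, whether it
# belongs in an existing chunk or should start a new one. The decision
# prompt and chunk summaries here are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

def decide_chunk(proposition: str, chunks: dict[int, list[str]]) -> int:
    """Return the id of the chunk this proposition belongs to, or -1 for a new chunk."""
    summaries = "\n".join(
        f"Chunk {cid}: {' '.join(props)[:300]}" for cid, props in chunks.items()
    )
    prompt = (
        "Existing chunks:\n" + (summaries or "(none)") + "\n\n"
        f"New proposition: {proposition}\n\n"
        "If the proposition belongs with one of the existing chunks, reply with "
        "that chunk's number only. If it needs a new chunk, reply with -1."
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",  # illustrative model ID
        max_tokens=10,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return int(response.content[0].text.strip())
    except ValueError:
        return -1  # fall back to a new chunk if the reply is not a number

def build_chunks(propositions: list[str]) -> dict[int, list[str]]:
    chunks: dict[int, list[str]] = {}
    for prop in propositions:
        cid = decide_chunk(prop, chunks)
        if cid not in chunks:
            cid = len(chunks)
            chunks[cid] = []
        chunks[cid].append(prop)
    return chunks
```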
Great, we can now store these chunks in a vector datastore and use them for retrieval in the RAG workflow and in our evaluations.
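For completeness, here is a minimal sketch of storing the chunks and retrieving them at query time, using ChromaDB as an example vector store. The collection name, default embedding function, and top-k value are illustrative assumptions, not our actual stack.

```python
# Store agentically built chunks in a vector store and retrieve them
# for a RAG query. ChromaDB and its default embedding function are used
# purely for illustration; names and parameters are assumptions.
import chromadb

client = chromadb.Client()
collection = client.create_collection("regulatory_chunks")  # illustrative name

def index_chunks(chunks: dict[int, list[str]]) -> None:
    collection.add(
        ids=[f"chunk-{cid}" for cid in chunks],
        documents=[" ".join(props) for props in chunks.values()],
    )

def retrieve(question: str, k: int = 4) -> list[str]:
    results = collection.query(query_texts=[question], n_results=k)
    return results["documents"][0]

# The retrieved chunks can then be passed to answer_from_document() above.
```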
This highlights why software historically couldn't automate this use case, leaving compliance teams to rely on in-house experts or outside consultants to read and synthesize regulatory updates and perform gap analyses against internal policies and processes.
In our next blog, we will detail how a human reads a regulatory update while referring to multiple sources to provide a comprehensive update to the compliance team, further illustrating the challenges in developing an AI-based solution for this use case.
Top image source: https://www.linkedin.com/pulse/chunking-your-way-ai-precision-tars-8sc8e/
References:
https://arxiv.org/pdf/2312.06648
https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb