System Architecture
HLD (High Level Design)
High-Level Overview
The Hack Rx 6.0 system is built on a modern, modular architecture designed for efficiency, scalability, and accuracy. It follows a two-stage process:
Document Ingestion and Indexing: First, the system processes and understands the input documents, converting them into a searchable, intelligent format.
Query Processing and Answer Generation: When a user asks a question, the system searches the indexed documents for relevant information and uses a large language model (LLM) to generate a precise, human-readable answer.
This two-stage approach ensures that user queries are handled with minimal latency, as the heavy lifting of document processing is done upfront.
Architectural Diagram
The following diagram illustrates the end-to-end flow of data and processing within the Hack Rx 6.0 engine:

Breakdown of Components
Stage 1: Document Ingestion and Indexing
This stage is responsible for preparing the knowledge base that the system will use to answer questions.
PDF Processing Engine: The process begins when insurance documents (in PDF format) are fed into the system. The Document Service uses libraries like PyMuPDF to handle these files.
Text Extraction & Chunking: The raw text is extracted from the documents. To make it manageable for the language model, this text is broken down into smaller, overlapping "chunks." This technique ensures that the semantic context of each passage is preserved.
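A minimal sketch of this extraction and chunking step is shown below. It assumes PyMuPDF (imported as fitz); the function names, the sample file name, and the chunk size and overlap values are illustrative rather than the engine's actual configuration.

```python
import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    # Concatenate the plain text of every page in the PDF.
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Overlapping windows preserve context that straddles chunk boundaries.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


chunks = chunk_text(extract_text("policy.pdf"))  # hypothetical input file
```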
Sentence Embeddings: Each text chunk is then converted into a numerical representation called an "embedding" using the all-MiniLM-L6-v2 Sentence Transformer model. This model is optimized for semantic similarity, so chunks with similar meanings have mathematically close vector representations.
FAISS Vector Store: These embeddings are stored and indexed in a FAISS (Facebook AI Similarity Search) vector store. FAISS is a highly efficient library that performs fast similarity searches across millions of vectors, making it well suited to finding the most relevant document chunks in real time.
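The sketch below shows how the chunks from the previous step might be embedded and indexed. It assumes the sentence-transformers and faiss packages; the flat inner-product index is one straightforward choice and not necessarily the index type the Document Service actually uses.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode every chunk; normalizing lets inner-product search behave like cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)

# Build a flat (exact) index with the embedding dimensionality and add the vectors.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```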
Stage 2: Query Processing and Answer Generation
This stage is activated when a user submits a question to the API.
Query Processing: The user's question, received as a natural language string, is also converted into an embedding using the same all-MiniLM-L6-v2 model.
Similarity Search: The system then queries the FAISS vector store with the user's question embedding. FAISS rapidly identifies the text chunks from the original documents whose embeddings are most similar to the question's embedding.
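A sketch of this retrieval step, reusing the model, index, and chunks from the ingestion sketches above; the example question and the choice of retrieving five chunks are illustrative.

```python
question = "Is knee replacement surgery covered under this policy?"

# Embed the question with the same model used for the document chunks.
query_vec = model.encode([question], normalize_embeddings=True)

# Retrieve the similarity scores and row ids of the five closest chunks.
scores, ids = index.search(query_vec, 5)

# Map the returned row ids back to the original text chunks.
context_chunks = [chunks[i] for i in ids[0]]
```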
Context Retrieval: The most relevant text chunks identified by FAISS are retrieved. These chunks form the "context" that will be used to answer the user's question. This is a critical step in retrieval-augmented generation (RAG), as it grounds the LLM's response in the actual content of the source documents.
Answer Generation: The retrieved context, along with the original user question, is passed to the gemini-1.5-flash generative model. A carefully crafted prompt instructs the model to formulate an answer based only on the provided context, ensuring that the response is accurate and directly tied to the source material.
Structured Response: The final answer is formatted into a clear, structured response and sent back to the user via the API, completing the query lifecycle.
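A sketch of the generation step, assuming the google-generativeai client library and an API key supplied through configuration; the prompt wording and the response shape below are illustrative, not the engine's actual prompt or schema.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumed to come from configuration
llm = genai.GenerativeModel("gemini-1.5-flash")

# Ground the model in the retrieved chunks only.
context = "\n\n".join(context_chunks)
prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = llm.generate_content(prompt)
answer = {"question": question, "answer": response.text}  # illustrative response shape
```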