Core Services Deep Dive
This section breaks down the core components of the Hack Rx 6.0 engine. Each service is designed with a specific responsibility, creating a modular and maintainable system.
1. DocumentService
File: app/services/document_service.py
The DocumentService is the entry point for all document-related processing. Its primary role is to ingest raw files, extract their text content, and prepare it for the embedding process.
Key Responsibilities:
File Handling: It can process documents from a local file path, which is the primary method used in the current setup.
Text Extraction: It contains specialized methods for extracting clean text from different file types:
_extract_text_from_pdf(): Uses the PyMuPDF library to parse PDF files.
_extract_text_from_docx(): Uses the python-docx library for Microsoft Word documents.
Content Hashing: To avoid reprocessing the same document, it calculates a unique SHA-256 hash of the file's content. This hash is used for efficient caching in the database.
Text Chunking: The chunk_text() method splits the extracted text into smaller, overlapping chunks (default: 500 words with a 50-word overlap). The overlap ensures that sentences spanning a chunk boundary appear intact in at least one chunk, preserving their semantic meaning when embeddings are generated.
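The hashing and chunking steps above can be sketched as follows. This is a minimal illustration of the described behavior, not the service's actual code; the function names mirror those in the text, but signatures and details are assumptions.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # SHA-256 hex digest of the raw file bytes, used as the cache key.
    return hashlib.sha256(data).hexdigest()

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50):
    # Split text into overlapping word-based chunks: each chunk holds up to
    # `chunk_size` words, and consecutive chunks share `overlap` words.
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because identical files always hash to the same digest, the cache lookup is a simple equality check on content_hash.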
2. EmbeddingService
File: app/services/embedding_service.py
The EmbeddingService is the heart of the AI-powered search capability. It converts text into numerical vectors (embeddings) and manages the high-speed search index.
Key Responsibilities:
Embedding Generation: It uses the all-MiniLM-L6-v2 Sentence Transformer model to convert text chunks into 384-dimensional vector embeddings that capture the semantic meaning of the text.
FAISS Index Management: It builds a FAISS (Facebook AI Similarity Search) index, a library highly optimized for fast similarity search over large sets of vectors. The index uses IndexFlatIP (inner product), which after L2 normalization is equivalent to cosine similarity. The service also loads and saves the FAISS index to disk, allowing persistence between application restarts.
Similarity Search: The search() method takes a user's query, generates an embedding for it, and uses the FAISS index to find the most semantically similar text chunks from the documents.
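The reason IndexFlatIP works for cosine similarity can be shown with a NumPy-only sketch: once every vector is scaled to unit length, the inner product of two vectors equals their cosine similarity. The toy 4-dimensional vectors below stand in for the 384-dimensional MiniLM embeddings.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Scale each row to unit length; inner product then equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "corpus embeddings" (4-dim stand-ins for the 384-dim MiniLM vectors).
corpus = l2_normalize(np.array([[1.0, 0.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0, 0.0],
                                [1.0, 1.0, 0.0, 0.0]]))
query = l2_normalize(np.array([[1.0, 0.1, 0.0, 0.0]]))

# IndexFlatIP ranks by inner product; with unit vectors this is cosine similarity.
scores = corpus @ query.T
best = int(np.argmax(scores))
```

In the real service the same normalization is applied before adding vectors to the index and before each query, so the index's inner-product scores are directly comparable cosine similarities.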
3. ClauseMatcher
File: app/services/clause_matcher.py
The ClauseMatcher acts as a refinement layer on top of the EmbeddingService. It is responsible for identifying the most relevant clauses from the document to construct the context for the LLM.
Key Responsibilities:
Clause Extraction: It uses the EmbeddingService to perform the initial semantic search and retrieve a set of candidate clauses based on the user's question.
Relevance Ranking: The rank_clauses_by_relevance() method improves the search results by combining the initial semantic similarity score with a keyword overlap score. This hybrid approach prioritizes clauses that are both semantically and lexically relevant.
Source Identification: It attempts to identify the source section of each clause within the original document, providing valuable context for the final answer.
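A hybrid ranking of this kind can be sketched as a weighted blend of the two scores. The function names, the 70/30 weighting, and the word-set overlap measure below are illustrative assumptions, not the project's actual formula.

```python
def keyword_overlap(query: str, clause: str) -> float:
    # Fraction of the clause's words that also appear in the query
    # is one simple lexical-overlap measure (assumed here).
    q = set(query.lower().split())
    c = set(clause.lower().split())
    return len(q & c) / len(c) if c else 0.0

def rank_clauses_by_relevance(query, candidates, alpha=0.7):
    # `candidates` is a list of (clause_text, semantic_score) pairs.
    # Blend semantic similarity with keyword overlap, weighted by `alpha`.
    scored = [(text, alpha * sem + (1 - alpha) * keyword_overlap(query, text))
              for text, sem in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The blend lets an exact keyword match rescue a clause whose embedding score is marginally lower, which is the main benefit of hybrid retrieval.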
4. QAService
File: app/services/qa_service.py
The QAService is the final and most critical component of the pipeline. It orchestrates the process of generating a human-readable answer by bringing together the retrieved context and the power of a large language model (LLM).
Key Responsibilities:
Orchestration: The answer_questions() method manages the end-to-end process for each question: it calls the ClauseMatcher to get relevant context, builds the prompt, and generates the answer.
Context Building: It compiles the top-ranked clauses into a clean, structured context to be fed into the LLM.
Prompt Engineering: It uses a carefully designed prompt template (_create_prompt) that instructs the gemini-1.5-flash model to answer the user's question based only on the provided context. This is a key Retrieval-Augmented Generation (RAG) technique: it reduces hallucination and keeps answers grounded in the source document.
Answer Generation: It calls the Google Generative AI API to obtain the final answer from the gemini-1.5-flash model.
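The grounding instruction at the core of _create_prompt can be illustrated with a minimal template. The exact production wording is not shown in this document, so the text below is an assumed example of the "answer only from the provided context" pattern.

```python
def create_prompt(context_clauses, question):
    # Number each retrieved clause so the model can reference them,
    # and instruct it to stay within the supplied context.
    context = "\n\n".join(
        f"Clause {i + 1}: {clause}" for i, clause in enumerate(context_clauses)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you cannot find it.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Constraining the model this way is what keeps RAG answers auditable: every claim in the output should be traceable to one of the numbered clauses.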

Database Schema and Models
The system's data is persisted in a PostgreSQL database, with the structure managed by SQLAlchemy ORM. The schema is designed to be simple yet effective for caching documents and logging user interactions.
Database Schema
The database consists of two primary tables, which are defined in app/models/database.py and initialized using the migrations/__init__.sql script.
documents Table
This table stores the content and metadata of every document processed by the system. It acts as a cache to prevent redundant processing of the same file.
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique, auto-incrementing primary key for each document record. |
| blob_url | TEXT | The original file path or URL of the document, used as a unique identifier. |
| content_hash | VARCHAR(64) | A SHA-256 hash of the document's binary content and the key field for caching: if a document with the same hash is submitted again, the system reuses the existing record instead of re-processing the file. |
| processed_at | TIMESTAMP | When the document was first processed and added to the database. |
| content | TEXT | The full, extracted text content of the document. |
qa_sessions Table
This table serves as a log for every query made to the system, linking questions to their generated answers and the document they were based on.
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique, auto-incrementing primary key for each Q&A session. |
| document_id | INTEGER | A foreign key referencing the id of the corresponding record in the documents table. |
| questions | JSONB | The list of questions asked by the user, stored in a flexible JSONB format. |
| answers | JSONB | The list of answers generated by the system, also stored in JSONB format. |
| created_at | TIMESTAMP | When the Q&A session occurred. |
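The two tables above map naturally onto SQLAlchemy ORM models. The sketch below is an assumed reconstruction from the schema descriptions; the actual class names and column options in app/models/database.py may differ.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text, TIMESTAMP
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True, autoincrement=True)
    blob_url = Column(Text, unique=True)          # original path or URL
    content_hash = Column(String(64), index=True)  # SHA-256 cache key
    processed_at = Column(TIMESTAMP)
    content = Column(Text)                         # full extracted text

class QASession(Base):
    __tablename__ = "qa_sessions"
    id = Column(Integer, primary_key=True, autoincrement=True)
    document_id = Column(Integer, ForeignKey("documents.id"))
    questions = Column(JSONB)                      # list of questions
    answers = Column(JSONB)                        # list of generated answers
    created_at = Column(TIMESTAMP)
```

Indexing content_hash keeps the cache lookup (hash equality) fast even as the documents table grows.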
Data Models (Pydantic Schemas)
The API uses Pydantic models, defined in app/models/schemas.py, for data validation and serialization.
QueryRequest: Defines the expected structure of the incoming request body for the /hackrx/run endpoint, ensuring that both documents and questions are provided.
QueryResponse: Defines the structure of the API's response, guaranteeing that it will contain a list of answers.
ClauseMatch: A model used internally to structure the data for relevant clauses found by the ClauseMatcher.
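Minimal Pydantic definitions matching the request and response descriptions above might look like this. The field types are assumptions inferred from the text (the source does not show the schema file), so the real models may carry extra validation.

```python
from typing import List
from pydantic import BaseModel

class QueryRequest(BaseModel):
    # Body of POST /hackrx/run: both fields are required by default.
    documents: str          # document path or URL (type assumed)
    questions: List[str]    # questions to answer against the document

class QueryResponse(BaseModel):
    # Response body: one answer per question, in order.
    answers: List[str]
```

Because neither field has a default, Pydantic rejects any request body missing documents or questions before it reaches the service layer.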