Core Services Deep Dive

This section breaks down the core components of the Hack Rx 6.0 engine. Each service is designed with a specific responsibility, creating a modular and maintainable system.

1. DocumentService

File: app/services/document_service.py

The DocumentService is the entry point for all document-related processing. Its primary role is to ingest raw files, extract their text content, and prepare it for the embedding process.

Key Responsibilities:

  • File Handling: It can process documents from a local file path, which is the primary method used in the current setup.

  • Text Extraction: It contains specialized methods for extracting clean text from different file types:

    • _extract_text_from_pdf(): Uses the PyMuPDF library to parse PDF files.

    • _extract_text_from_docx(): Uses the python-docx library for Microsoft Word documents.

  • Content Hashing: To avoid reprocessing the same document, it calculates a unique SHA-256 hash of the file's content. This hash is used for efficient caching in the database.

  • Text Chunking: The chunk_text() method intelligently splits the extracted text into smaller, overlapping chunks (default size: 500 words with a 50-word overlap). This is a crucial step to ensure that the semantic meaning of the text is preserved when generating embeddings.
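
The flow above can be condensed into a short sketch. The function names mirror the responsibilities described here, but treat them as illustrative rather than the exact signatures in app/services/document_service.py:

```python
import hashlib

import fitz  # PyMuPDF
from docx import Document


def extract_text_from_pdf(path: str) -> str:
    """Concatenate the text of every page in a PDF (PyMuPDF)."""
    with fitz.open(path) as pdf:
        return "".join(page.get_text() for page in pdf)


def extract_text_from_docx(path: str) -> str:
    """Join the text of every paragraph in a Word document (python-docx)."""
    return "\n".join(p.text for p in Document(path).paragraphs)


def content_hash(path: str) -> str:
    """SHA-256 of the raw file bytes -- the cache key in the documents table."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (500 words, 50-word overlap)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```

The overlap is what preserves meaning at boundaries: because each chunk repeats the last 50 words of its predecessor, a sentence that straddles a chunk boundary still appears intact in at least one chunk.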

2. EmbeddingService

File: app/services/embedding_service.py

The EmbeddingService is the heart of the AI-powered search capability. It converts text into numerical vectors (embeddings) and manages the high-speed search index.

Key Responsibilities:

  • Embedding Generation: It uses the all-MiniLM-L6-v2 Sentence Transformer model to convert text chunks into 384-dimensional vector embeddings. These embeddings capture the semantic meaning of the text.

  • FAISS Index Management:

    • It builds a FAISS (Facebook AI Similarity Search) index; FAISS is a library optimized for fast similarity search over large collections of vectors. The index uses IndexFlatIP (inner product), which produces cosine-similarity rankings once the embeddings are L2-normalized.

    • It handles the loading and saving of the FAISS index to disk, allowing for persistence between application restarts.

  • Similarity Search: The search() method takes a user's query, generates an embedding for it, and uses the FAISS index to find the most semantically similar text chunks from the documents.
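
A minimal sketch of this encode → normalize → search flow, assuming the standard sentence-transformers and faiss-cpu APIs (the service's actual interface may differ):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """Encode chunks into 384-d vectors and add them to an inner-product index."""
    embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(embeddings)                  # inner product == cosine after this
    index = faiss.IndexFlatIP(embeddings.shape[1])  # 384 for all-MiniLM-L6-v2
    index.add(embeddings)
    return index


def search(index: faiss.IndexFlatIP, query: str, k: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """Return the scores and chunk indices of the k most similar chunks."""
    q = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return scores[0], ids[0]
```

The persistence mentioned above reduces to faiss.write_index(index, path) to save and faiss.read_index(path) to restore the index on startup.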

3. ClauseMatcher

File: app/services/clause_matcher.py

The ClauseMatcher acts as a refinement layer on top of the EmbeddingService. It is responsible for identifying the most relevant clauses from the document to construct the context for the LLM.

Key Responsibilities:

  • Clause Extraction: It uses the EmbeddingService to perform the initial semantic search and retrieve a set of candidate clauses based on the user's question.

  • Relevance Ranking: The rank_clauses_by_relevance() method improves the search results by combining the initial semantic similarity score with a keyword overlap score. This hybrid approach ensures that the most relevant clauses are prioritized.

  • Source Identification: It attempts to identify the source section of each clause within the original document, providing valuable context for the final answer.
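
A sketch of what such a hybrid ranking can look like; the 0.7/0.3 weights and helper names are illustrative assumptions, not values taken from clause_matcher.py:

```python
def keyword_overlap(question: str, clause: str) -> float:
    """Fraction of the question's words that also appear in the clause."""
    q_words = set(question.lower().split())
    c_words = set(clause.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rank_clauses_by_relevance(
    question: str,
    candidates: list[tuple[str, float]],  # (clause_text, semantic_score) pairs
    w_sem: float = 0.7,                   # assumed weights, for illustration only
    w_kw: float = 0.3,
) -> list[tuple[str, float]]:
    """Blend each candidate's semantic score with keyword overlap, then sort."""
    scored = [
        (clause, w_sem * sim + w_kw * keyword_overlap(question, clause))
        for clause, sim in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The blend guards against a known weakness of pure embedding search: a chunk can be semantically close overall yet miss the specific term (a policy number, a defined term) the user asked about, and the keyword component pushes exact matches back up.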

4. QAService

File: app/services/qa_service.py

The QAService is the final and most critical component of the pipeline. It orchestrates the process of generating a human-readable answer by bringing together the retrieved context and the power of a large language model (LLM).

Key Responsibilities:

  • Orchestration: The answer_questions() method manages the end-to-end process for each question: it calls the ClauseMatcher to get relevant context, builds the prompt, and generates the answer.

  • Context Building: It compiles the top-ranked clauses into a clean, structured context to be fed into the LLM.

  • Prompt Engineering: It uses a carefully designed prompt template (_create_prompt) that instructs the gemini-1.5-flash model to answer the user's question based only on the provided context. This is a key technique in Retrieval-Augmented Generation (RAG) to prevent the model from hallucinating and to ensure the answer is grounded in the source document.

  • Answer Generation: It interacts with the Google Generative AI API to get the final answer from the gemini-1.5-flash model.
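
Putting these three responsibilities together, a minimal sketch using the google-generativeai client (the prompt wording here is illustrative; the real template lives in _create_prompt):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; read from the environment in practice
llm = genai.GenerativeModel("gemini-1.5-flash")


def create_prompt(context: str, question: str) -> str:
    """Grounded-answer prompt in the spirit of _create_prompt (wording assumed)."""
    return (
        "Answer the question using ONLY the context below. If the context "
        "does not contain the answer, say that it is not covered.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer_question(ranked_clauses: list[str], question: str) -> str:
    """Build the context from the top-ranked clauses and query the LLM."""
    context = "\n\n".join(ranked_clauses)
    response = llm.generate_content(create_prompt(context, question))
    return response.text
```

Restricting the model to the supplied context is what makes this RAG rather than open-ended generation: the retrieval step bounds what the LLM is allowed to assert.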

Database Schema and Models

The system's data is persisted in a PostgreSQL database, with the structure managed through the SQLAlchemy ORM. The schema is designed to be simple yet effective for caching documents and logging user interactions.

Database Schema

The database consists of two primary tables, which are defined in app/models/database.py and initialized using the migrations/__init__.sql script.

documents Table

This table stores the content and metadata of every document processed by the system. It acts as a cache to prevent redundant processing of the same file.

| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique, auto-incrementing primary key for each document record. |
| blob_url | TEXT | The original file path or URL of the document, used as a unique identifier. |
| content_hash | VARCHAR(64) | A SHA-256 hash of the document's binary content and the key field for caching: if a document with the same hash is submitted again, the system reuses the existing record instead of re-processing it. |
| processed_at | TIMESTAMP | The timestamp indicating when the document was first processed and added to the database. |
| content | TEXT | The full, extracted text content of the document. |


qa_sessions Table

This table serves as a log for every query made to the system, linking questions to their generated answers and the document they were based on.

| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique, auto-incrementing primary key for each Q&A session. |
| document_id | INTEGER | A foreign key referencing the id of the corresponding record in the documents table. |
| questions | JSONB | The list of questions asked by the user, stored in the flexible JSONB format. |
| answers | JSONB | The list of answers generated by the system, also stored as JSONB. |
| created_at | TIMESTAMP | The timestamp indicating when the Q&A session occurred. |

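For illustration, the two tables above might be declared roughly as follows with the SQLAlchemy ORM; the actual models in app/models/database.py may differ in detail:

```python
from sqlalchemy import TIMESTAMP, Column, ForeignKey, Integer, String, Text, func
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Document(Base):
    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    blob_url = Column(Text, unique=True)            # original path or URL
    content_hash = Column(String(64), index=True)   # SHA-256 cache key
    processed_at = Column(TIMESTAMP, server_default=func.now())
    content = Column(Text)                          # full extracted text


class QASession(Base):
    __tablename__ = "qa_sessions"

    id = Column(Integer, primary_key=True)
    document_id = Column(Integer, ForeignKey("documents.id"))
    questions = Column(JSONB)                       # list of questions asked
    answers = Column(JSONB)                         # list of generated answers
    created_at = Column(TIMESTAMP, server_default=func.now())
```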

Data Models (Pydantic Schemas)

The API uses Pydantic models, defined in app/models/schemas.py, for data validation and serialization.

  • QueryRequest: Defines the expected structure of the incoming request body for the /hackrx/run endpoint, ensuring that both documents and questions are provided.

  • QueryResponse: Defines the structure of the API's response, guaranteeing that it will contain a list of answers.

  • ClauseMatch: A model used internally to structure the data for relevant clauses found by the ClauseMatcher.
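
A sketch of what these schemas plausibly look like; the QueryRequest and QueryResponse fields follow the descriptions above, while the ClauseMatch fields are assumptions for illustration:

```python
from typing import Optional

from pydantic import BaseModel


class QueryRequest(BaseModel):
    documents: str        # path or URL of the document to process
    questions: list[str]  # the questions to answer against it


class QueryResponse(BaseModel):
    answers: list[str]    # one generated answer per question, in order


class ClauseMatch(BaseModel):               # internal model; field names assumed
    clause: str                             # the matched text chunk
    score: float                            # combined relevance score
    source_section: Optional[str] = None    # section it came from, when identified
```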
