# Component Interfaces

## Current Interfaces

### JinaSimilarity Class

#### Initialization

```python
js = JinaSimilarity()
```

- **Description**: Initializes the JinaSimilarity class
- **Requirements**: JINA_API_KEY environment variable must be set
- **Raises**: ValueError if JINA_API_KEY is not set

#### count_tokens

```python
token_count = js.count_tokens(text)
```

- **Description**: Counts the number of tokens in a text
- **Parameters**:
  - `text` (str): The text to count tokens for
- **Returns**: int - Number of tokens in the text
- **Dependencies**: tiktoken library

#### get_embedding

```python
embedding = js.get_embedding(text)
```

- **Description**: Generates an embedding for a text using Jina AI's Embeddings API
- **Parameters**:
  - `text` (str): The text to generate an embedding for (max 8,192 tokens)
- **Returns**: list - The embedding vector
- **Raises**:
  - `TokenLimitError`: If the text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API call fails
- **Dependencies**: requests library, Jina AI API

#### compute_similarity

```python
similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
```

- **Description**: Computes similarity between a text chunk and a query
- **Parameters**:
  - `chunk` (str): The text chunk to compare against
  - `query` (str): The query text
- **Returns**: Tuple containing:
  - `similarity` (float): Cosine similarity score (0-1)
  - `chunk_embedding` (list): Chunk embedding
  - `query_embedding` (list): Query embedding
- **Raises**:
  - `TokenLimitError`: If either text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API calls fail
- **Dependencies**: numpy library, get_embedding method

### Markdown Segmenter

#### segment_markdown

```python
segments = segment_markdown(file_path)
```

- **Description**: Segments a markdown file using Jina AI's Segmenter API
- **Parameters**:
  - `file_path` (str): Path to the markdown file
- **Returns**: dict - JSON structure containing the segments
- **Raises**: Exception if segmentation fails
- **Dependencies**: requests library, Jina AI API

### Test Similarity Script

#### Command-line Interface

```
python test_similarity.py chunk_file query_file [--verbose]
```

- **Description**: Computes similarity between text from two files
- **Arguments**:
  - `chunk_file`: Path to the file containing the text chunk
  - `query_file`: Path to the file containing the query
  - `--verbose` or `-v`: Print token counts and embeddings
- **Output**: Similarity score and optional verbose information
- **Dependencies**: JinaSimilarity class

#### read_file

```python
content = read_file(file_path)
```

- **Description**: Reads content from a file
- **Parameters**:
  - `file_path` (str): Path to the file to read
- **Returns**: str - Content of the file
- **Raises**: FileNotFoundError if the file doesn't exist

## Search Execution Module

### SearchExecutor Class

#### Initialization

```python
from execution.search_executor import SearchExecutor
executor = SearchExecutor()
```

- **Description**: Initializes the SearchExecutor class
- **Requirements**: Configuration file with API keys for search engines

#### execute_search

```python
results = executor.execute_search(query_data)
```

- **Description**: Executes a search across multiple search engines
- **Parameters**:
  - `query_data` (dict): Dictionary containing query information with keys:
    - `raw_query` (str): The original user query
    - `enhanced_query` (str): The enhanced query from the LLM
    - `search_engines` (list, optional): List of search engines to use
    - `num_results` (int, optional): Number of results to return per engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
- **Example**:

```python
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})
```

### BaseSearchHandler Class

#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Abstract search method, implemented by all handlers
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters specific to the search engine
- **Returns**: List[Dict[str, Any]] - List of search results
- **Example**:

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)
```

### SerperSearchHandler Class

#### search

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Serper API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the result
  - `url` (str): URL of the result
  - `snippet` (str): Snippet of text from the result
  - `source` (str): Source of the result (always "serper")
- **Requirements**: Serper API key in configuration
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ScholarSearchHandler Class

#### search

```python
from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search on Google Scholar using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Scholar API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `snippet` (str): Snippet of text from the paper
  - `source` (str): Source of the result (always "scholar")
  - `authors` (str): Authors of the paper
  - `publication` (str): Publication venue
  - `year` (int): Publication year
- **Requirements**: Serper API key in configuration
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ArxivSearchHandler Class

#### search

```python
from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search on arXiv
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the arXiv API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `pdf_url` (str): URL to the PDF
  - `snippet` (str): Abstract of the paper
  - `source` (str): Source of the result (always "arxiv")
  - `arxiv_id` (str): arXiv ID
  - `authors` (list): List of author names
  - `categories` (list): List of arXiv categories
  - `published_date` (str): Publication date
  - `updated_date` (str): Last update date
  - `full_text` (str): Full abstract text
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ResultCollector Class

#### process_results

```python
from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
```

- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - Combined and processed list of search results
- **Example**:

```python
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
```

#### save_results

```python
collector.save_results(results, file_path)
```

- **Description**: Saves search results to a JSON file
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `file_path` (str): Path to save the results
- **Example**:

```python
collector.save_results(processed_results, "search_results.json")
```

## Planned Interfaces for Research System

### ResearchSystem Class

#### Initialization

```python
rs = ResearchSystem(config=None)
```

- **Description**: Initializes the ResearchSystem with optional configuration
- **Parameters**:
  - `config` (dict, optional): Configuration options for the research system
- **Requirements**: Various API keys set in environment variables or config
- **Raises**: ValueError if required API keys are not set

#### execute_research

```python
report = rs.execute_research(query, options=None)
```

- **Description**: Executes a complete research pipeline from query to report
- **Parameters**:
  - `query` (str): The research query
  - `options` (dict, optional): Options to customize the research process
- **Returns**: dict - Research report with metadata
- **Raises**: Various exceptions for different stages of the pipeline

#### save_report

```python
rs.save_report(report, file_path, format="markdown")
```

- **Description**: Saves the research report to a file
- **Parameters**:
  - `report` (dict): The research report to save
  - `file_path` (str): Path to save the report
  - `format` (str, optional): Format of the report (markdown, html, pdf)
- **Raises**: IOError if the file cannot be saved

### QueryProcessor Class

#### process_query

```python
structured_query = query_processor.process_query(query)
```

- **Description**: Processes a raw query into a structured format
- **Parameters**:
  - `query` (str): The raw research query
- **Returns**: dict - Structured query with metadata
- **Raises**: ValueError if the query is invalid

### SearchStrategy Class

#### develop_strategy

```python
search_plan = search_strategy.develop_strategy(structured_query)
```

- **Description**: Develops a search strategy based on the query
- **Parameters**:
  - `structured_query` (dict): The structured query
- **Returns**: dict - Search plan with target-specific queries
- **Raises**: ValueError if the query cannot be processed

### SearchExecutor Class

#### execute_search

```python
search_results = search_executor.execute_search(search_plan)
```

- **Description**: Executes search queries against selected targets
- **Parameters**:
  - `search_plan` (dict): The search plan with queries
- **Returns**: dict - Collection of search results
- **Raises**: APIError if the search APIs fail

### JinaReranker Class

#### rerank

```python
ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
```

- **Description**: Reranks documents based on their relevance to the query
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[str]): List of document strings to rerank
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores and indices

#### rerank_with_metadata

```python
ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
```

- **Description**: Reranks documents with metadata based on their relevance to the query
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
  - `document_key` (str): The key in the document dictionaries that contains the text content
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores, indices, and original metadata

#### get_jina_reranker

```python
jina_reranker = get_jina_reranker()
```

- **Description**: Gets the global Jina Reranker instance
- **Returns**: JinaReranker instance

### DocumentScraper Class

#### scrape_documents

```python
markdown_documents = document_scraper.scrape_documents(ranked_documents)
```

- **Description**: Scrapes and converts documents to markdown
- **Parameters**:
  - `ranked_documents` (list): The ranked list of documents to scrape
- **Returns**: list - Collection of markdown documents
- **Raises**: ScrapingError if the documents cannot be scraped

### DocumentSelector Class

#### select_documents

```python
selected_documents = document_selector.select_documents(documents_with_scores)
```

- **Description**: Selects the most relevant and diverse documents
- **Parameters**:
  - `documents_with_scores` (list): Documents with similarity scores
- **Returns**: list - Curated set of documents
- **Raises**: ValueError if the selection criteria are invalid

### ReportGenerator Class

#### generate_report

```python
report = report_generator.generate_report(selected_documents, query)
```

- **Description**: Generates a research report from selected documents
- **Parameters**:
  - `selected_documents` (list): The selected documents
  - `query` (str): The original query for context
- **Returns**: dict - Final research report
- **Raises**: GenerationError if the report cannot be generated

## Search Execution Module

### SearchExecutor Class

The `SearchExecutor` class manages the execution of search queries across multiple search engines.
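As a rough illustration of the fan-out pattern this class implements, the sketch below dispatches one query to several handler callables in parallel with `concurrent.futures`. The handler functions here are stand-ins for demonstration, not the project's real API handlers:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "handlers": in the real module these are API-backed handler
# objects; plain functions suffice to show the dispatch pattern.
def fake_serper(query, num_results):
    return [{"title": f"serper result {i}", "url": f"https://example.com/s/{i}"}
            for i in range(num_results)]

def fake_arxiv(query, num_results):
    return [{"title": f"arxiv result {i}", "url": f"https://example.com/a/{i}"}
            for i in range(num_results)]

def execute_search(query, handlers, num_results=3):
    """Run every handler for the same query in parallel and return a
    mapping of handler name -> result list, mirroring the documented
    Dict[str, List[Dict[str, Any]]] return shape."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query, num_results)
                   for name, fn in handlers.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = execute_search("quantum computing",
                         {"serper": fake_serper, "arxiv": fake_arxiv})
```

Submitting each handler before collecting any result is what makes the engines run concurrently rather than sequentially.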
#### Initialization

```python
executor = SearchExecutor()
```

- **Description**: Initializes the search executor with available search handlers
- **Requirements**: Appropriate API keys must be set for the search engines to be used

#### execute_search

```python
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
```

- **Description**: Executes search queries across specified search engines in parallel
- **Parameters**:
  - `structured_query` (Dict[str, Any]): The structured query from the query processor
  - `search_engines` (Optional[List[str]]): List of search engines to use
  - `num_results` (int): Number of results to return per search engine
  - `timeout` (int): Timeout in seconds for each search engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results

#### execute_search_async

```python
results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
```

- **Description**: Executes search queries across specified search engines asynchronously
- **Parameters**: Same as `execute_search`
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results

#### get_available_search_engines

```python
engines = executor.get_available_search_engines()
```

- **Description**: Gets a list of available search engines
- **Returns**: List[str] - List of available search engine names

### ResultCollector Class

The `ResultCollector` class processes and organizes search results from multiple search engines.
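To make the URL-based deduplication and result capping concrete, here is a minimal, self-contained sketch of the kind of merging such a collector performs; it is illustrative only, and the real class may differ in details such as scoring and ordering:

```python
def process_results(search_results, dedup=True, max_results=None):
    """Flatten engine -> results mappings into one list, optionally
    dropping duplicate URLs and capping the list length."""
    combined, seen_urls = [], set()
    for engine, results in search_results.items():
        for result in results:
            url = result.get("url")
            if dedup and url in seen_urls:
                continue  # skip URLs already collected from another engine
            seen_urls.add(url)
            combined.append({**result, "source": engine})
    return combined if max_results is None else combined[:max_results]

merged = process_results({
    "serper": [{"title": "A", "url": "https://example.com/a"}],
    "arxiv": [{"title": "A (dup)", "url": "https://example.com/a"},
              {"title": "B", "url": "https://example.com/b"}],
})
# The duplicate URL from "arxiv" is dropped, leaving two results.
```

Keying the `seen_urls` set on the URL rather than the title is what lets the same paper surfaced by two engines collapse into one entry.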
#### Initialization

```python
collector = ResultCollector()
```

- **Description**: Initializes the result collector

#### process_results

```python
processed_results = collector.process_results(search_results, dedup=True, max_results=20)
```

- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - List of processed search results

#### filter_results

```python
filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
```

- **Description**: Filters results based on specified criteria
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `filters` (Dict[str, Any]): Dictionary of filter criteria
- **Returns**: List[Dict[str, Any]] - Filtered list of search results

#### group_results_by_domain

```python
grouped_results = collector.group_results_by_domain(results)
```

- **Description**: Groups results by domain
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results

### BaseSearchHandler Interface

The `BaseSearchHandler` class defines the interface for all search API handlers.
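In Python such an interface is usually expressed as an abstract base class. The sketch below is illustrative only: the method names mirror those documented for the handlers, but the project's actual base class (and the toy `EchoSearchHandler` subclass) are not taken from the source:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseSearchHandler(ABC):
    """Illustrative interface: concrete handlers must implement search()."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10,
               **kwargs) -> List[Dict[str, Any]]:
        """Execute a search query and return a list of result dicts."""

    def get_name(self) -> str:
        # Default implementation: derive a short name from the class name.
        return type(self).__name__.replace("SearchHandler", "").lower()

class EchoSearchHandler(BaseSearchHandler):
    """Toy handler that fabricates results, for demonstration only."""

    def search(self, query, num_results=10, **kwargs):
        return [{"title": f"{query} #{i}", "url": f"https://example.com/{i}"}
                for i in range(num_results)]

handler = EchoSearchHandler()
```

Because `search` is marked `@abstractmethod`, instantiating `BaseSearchHandler` directly raises `TypeError`, which is how the interface contract is enforced at runtime.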
#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search query
- **Parameters**:
  - `query` (str): The search query to execute
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional search parameters specific to the API
- **Returns**: List[Dict[str, Any]] - List of search results

#### get_name

```python
name = handler.get_name()
```

- **Description**: Gets the name of the search handler
- **Returns**: str - Name of the search handler

#### is_available

```python
available = handler.is_available()
```

- **Description**: Checks if the search API is available
- **Returns**: bool - True if the API is available, False otherwise

#### get_rate_limit_info

```python
rate_limits = handler.get_rate_limit_info()
```

- **Description**: Gets information about the API's rate limits
- **Returns**: Dict[str, Any] - Dictionary with rate limit information

## Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

### Test Script (test_search_execution.py)

```python
# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
```

- **Purpose**: Tests the search execution module with various queries
- **Features**:
  - Tests with multiple queries
  - Uses all available search engines
  - Saves results to a JSON file
  - Provides detailed output of search results

## Document Ranking Interface

### JinaReranker

The `JinaReranker` class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

#### Methods

```python
def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """
```

```python
def rerank_with_metadata(query: str, documents: List[Dict[str, Any]], document_key: str = 'content', top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """
```

#### Factory Function

```python
def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.

    Returns:
        JinaReranker instance
    """
```

#### Example Usage

```python
from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=[
        "Document about quantum physics",
        "Document about quantum computing",
        "Document about classical computing"
    ],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
```

## Query Processor Testing

The query processor module has been tested with the Groq LLM provider to ensure it functions correctly with the newly integrated models.
### Test Scripts

Two test scripts have been created to validate the query processor functionality:

#### Basic Test Script (test_query_processor.py)

```python
# Get the query processor
processor = get_query_processor()

# Process a query
result = processor.process_query("What are the latest advancements in quantum computing?")

# Generate search queries
search_result = processor.generate_search_queries(result, ["google", "bing", "scholar"])
```

- **Purpose**: Tests the core functionality of the query processor
- **Features**:
  - Uses monkey patching to ensure the Groq model is used
  - Provides detailed output of processing results

#### Comprehensive Test Script (test_query_processor_comprehensive.py)

```python
# Test query enhancement
enhanced_query = test_enhance_query("What is quantum computing?")

# Test query classification
classification = test_classify_query("What is quantum computing?")

# Test the full processing pipeline
structured_query = test_process_query("What is quantum computing?")

# Test search query generation
search_result = test_generate_search_queries(structured_query, ["google", "bing", "scholar"])
```

- **Purpose**: Tests all aspects of the query processor in detail
- **Features**:
  - Tests individual components in isolation
  - Tests a variety of query types
  - Saves detailed test results to a JSON file

## LLM Interface

### LLMInterface Class

The `LLMInterface` class provides a unified interface for interacting with various LLM providers through LiteLLM.
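One detail worth illustrating is the post-processing a JSON-returning completion method has to do: model output arrives as plain text and must be parsed and checked before it can be returned as a dict. The sketch below uses only the standard library and is a hedged approximation; the `parse_json_response` helper is hypothetical, and the real implementation may instead rely on LiteLLM's own JSON handling:

```python
import json

class LLMError(Exception):
    """Raised when a completion cannot be parsed as the expected JSON."""

def parse_json_response(raw_text, required_keys=()):
    """Parse an LLM completion as JSON and verify expected keys exist.
    Strips the markdown code fences chat models often wrap around JSON."""
    text = raw_text.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the
        # trailing closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise LLMError(f"completion is not valid JSON: {exc}") from exc
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise LLMError(f"missing keys in JSON response: {missing}")
    return data

# A fenced response, as chat models commonly produce:
raw = "```json\n{\"topic\": \"quantum computing\", \"confidence\": 0.9}\n```"
parsed = parse_json_response(raw, required_keys=["topic"])
```

Raising a single `LLMError` for both parse failures and missing keys matches the error surface documented for the interface, where callers need only handle one exception type.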
#### Initialization

```python
llm = LLMInterface(model_name="gpt-4")
```

- **Description**: Initializes the LLM interface with the specified model
- **Parameters**:
  - `model_name` (Optional[str]): The name of the model to use (defaults to config value)
- **Requirements**: Appropriate API key must be set in environment or config

#### complete

```python
response = llm.complete(prompt, system_prompt=None, temperature=None, max_tokens=None)
```

- **Description**: Generates a completion for the given prompt
- **Parameters**:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `temperature` (Optional[float]): Temperature for generation
  - `max_tokens` (Optional[int]): Maximum tokens to generate
- **Returns**: str - The generated completion
- **Raises**: LLMError if the completion fails

#### complete_json

```python
json_response = llm.complete_json(prompt, system_prompt=None, json_schema=None)
```

- **Description**: Generates a JSON response for the given prompt
- **Parameters**:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `json_schema` (Optional[Dict]): JSON schema for validation
- **Returns**: Dict - The generated JSON response
- **Raises**: LLMError if the completion fails or the JSON is invalid

#### Supported Providers

- OpenAI
- Azure OpenAI
- Anthropic
- Ollama
- Groq
- OpenRouter

#### Example Usage

```python
from query.llm_interface import LLMInterface

# Initialize with specific model
llm = LLMInterface(model_name="llama-3.1-8b-instant")

# Generate a completion
response = llm.complete(
    prompt="Explain quantum computing",
    system_prompt="You are a helpful assistant that explains complex topics simply.",
    temperature=0.7
)

print(response)
```