ira/.note/interfaces.md

# Component Interfaces

## Current Interfaces

### JinaSimilarity Class

#### Initialization
```python
js = JinaSimilarity()
```
- **Description**: Initializes the JinaSimilarity class
- **Requirements**: JINA_API_KEY environment variable must be set
- **Raises**: ValueError if JINA_API_KEY is not set
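
The key check described above could look roughly like the following. This is a minimal sketch, not the class's actual constructor: the helper name `load_jina_api_key` and its injectable `env` argument are assumptions made so the check can be run in isolation.

```python
import os


def load_jina_api_key(env=None):
    """Return the Jina API key, or raise ValueError if it is missing."""
    # `env` is injectable (hypothetical parameter) so the check can be
    # exercised without modifying the real process environment.
    env = os.environ if env is None else env
    api_key = env.get("JINA_API_KEY")
    if not api_key:
        raise ValueError("JINA_API_KEY environment variable not set")
    return api_key
```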

#### count_tokens
```python
token_count = js.count_tokens(text)
```
- **Description**: Counts the number of tokens in a text
- **Parameters**:
  - `text` (str): The text to count tokens for
- **Returns**: int - Number of tokens in the text
- **Dependencies**: tiktoken library

#### get_embedding
```python
embedding = js.get_embedding(text)
```
- **Description**: Generates an embedding for a text using Jina AI's Embeddings API
- **Parameters**:
  - `text` (str): The text to generate an embedding for (max 8,192 tokens)
- **Returns**: list - The embedding vector
- **Raises**:
  - `TokenLimitError`: If the text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API call fails
- **Dependencies**: requests library, Jina AI API
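
The token limit check that precedes the API call can be sketched as below. The `TokenLimitError` class here is a stand-in (the project's own definition is not shown in this document), and `ensure_within_limit` and `naive_count` are hypothetical helpers; a real implementation would presumably pass `js.count_tokens` as the counter.

```python
MAX_TOKENS = 8192  # limit stated in this document


class TokenLimitError(Exception):
    """Stand-in for the project's TokenLimitError (definition assumed)."""


def ensure_within_limit(text, count_tokens, limit=MAX_TOKENS):
    """Return the token count, raising TokenLimitError if it exceeds `limit`."""
    n_tokens = count_tokens(text)
    if n_tokens > limit:
        raise TokenLimitError(f"Text has {n_tokens} tokens; the limit is {limit}")
    return n_tokens


def naive_count(text):
    # Whitespace splitting is only a placeholder tokenizer for this demo.
    return len(text.split())
```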

#### compute_similarity
```python
similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
```
- **Description**: Computes similarity between a text chunk and a query
- **Parameters**:
  - `chunk` (str): The text chunk to compare against
  - `query` (str): The query text
- **Returns**: Tuple containing:
  - `similarity` (float): Cosine similarity score (mathematically in [-1, 1]; typically 0-1 for text embeddings)
  - `chunk_embedding` (list): Chunk embedding
  - `query_embedding` (list): Query embedding
- **Raises**:
  - `TokenLimitError`: If either text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API calls fail
- **Dependencies**: numpy library, get_embedding method
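
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The sketch below is dependency-free for illustration only; the class itself is documented as using numpy, and the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In general the value lies in [-1, 1]; scores for text embeddings usually fall in the positive part of that range.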

### Markdown Segmenter

#### segment_markdown
```python
segments = segment_markdown(file_path)
```
- **Description**: Segments a markdown file using Jina AI's Segmenter API
- **Parameters**:
  - `file_path` (str): Path to the markdown file
- **Returns**: dict - JSON structure containing the segments
- **Raises**: Exception if segmentation fails
- **Dependencies**: requests library, Jina AI API

### Test Similarity Script

#### Command-line Interface
```bash
python test_similarity.py chunk_file query_file [--verbose]
```
- **Description**: Computes similarity between text from two files
- **Arguments**:
  - `chunk_file`: Path to the file containing the text chunk
  - `query_file`: Path to the file containing the query
  - `--verbose` or `-v`: Print token counts and embeddings
- **Output**: Similarity score and optional verbose information
- **Dependencies**: JinaSimilarity class
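
A command-line interface with these arguments can be declared with `argparse` as sketched below. The function name `build_parser` is an assumption, and the script's real parser may differ in help text and defaults.

```python
import argparse


def build_parser():
    """Build a parser matching the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Compute similarity between text from two files"
    )
    parser.add_argument("chunk_file", help="Path to the file containing the text chunk")
    parser.add_argument("query_file", help="Path to the file containing the query")
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="Print token counts and embeddings",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["chunk.txt", "query.txt", "-v"])
print(args.chunk_file, args.query_file, args.verbose)
```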

#### read_file
```python
content = read_file(file_path)
```
- **Description**: Reads content from a file
- **Parameters**:
  - `file_path` (str): Path to the file to read
- **Returns**: str - Content of the file
- **Raises**: FileNotFoundError if the file doesn't exist

## Search Execution Module

### SearchExecutor Class

#### Initialization
```python
from execution.search_executor import SearchExecutor
executor = SearchExecutor()
```
- **Description**: Initializes the SearchExecutor class
- **Requirements**: Configuration file with API keys for search engines

#### execute_search
```python
results = executor.execute_search(query_data)
```
- **Description**: Executes a search across multiple search engines
- **Parameters**:
  - `query_data` (dict): Dictionary containing query information with keys:
    - `raw_query` (str): The original user query
    - `enhanced_query` (str): The enhanced query from the LLM
    - `search_engines` (list, optional): List of search engines to use
    - `num_results` (int, optional): Number of results to return per engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
- **Example**:
```python
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})
```

### BaseSearchHandler Class

#### search
```python
results = handler.search(query, num_results=10, **kwargs)
```
- **Description**: Abstract search method that every handler implements
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters specific to the search engine
- **Returns**: List[Dict[str, Any]] - List of search results
- **Example**:
```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)
```

### SerperSearchHandler Class

#### search
```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```
- **Description**: Executes a search using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Serper API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the result
  - `url` (str): URL of the result
  - `snippet` (str): Snippet of text from the result
  - `source` (str): Source of the result (always "serper")
- **Requirements**: Serper API key in configuration
- **Example**:
```python
results = handler.search("quantum computing", num_results=5)
```

### ScholarSearchHandler Class

#### search
```python
from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```
- **Description**: Executes a search on Google Scholar using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Scholar API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `snippet` (str): Snippet of text from the paper
  - `source` (str): Source of the result (always "scholar")
  - `authors` (str): Authors of the paper
  - `publication` (str): Publication venue
  - `year` (int): Publication year
- **Requirements**: Serper API key in configuration
- **Example**:
```python
results = handler.search("quantum computing", num_results=5)
```

### ArxivSearchHandler Class

#### search
```python
from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```
- **Description**: Executes a search on arXiv
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the arXiv API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `pdf_url` (str): URL to the PDF
  - `snippet` (str): Abstract of the paper
  - `source` (str): Source of the result (always "arxiv")
  - `arxiv_id` (str): arXiv ID
  - `authors` (list): List of author names
  - `categories` (list): List of arXiv categories
  - `published_date` (str): Publication date
  - `updated_date` (str): Last update date
  - `full_text` (str): Full abstract text
- **Example**:
```python
results = handler.search("quantum computing", num_results=5)
```

### ResultCollector Class

#### process_results
```python
from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
```
- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - Combined and processed list of search results
- **Example**:
```python
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
```
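
To illustrate what URL-based deduplication and the `max_results` cap might involve, here is a minimal sketch. The function `merge_results` is hypothetical, and the actual `process_results` may also rank or normalize results, which is not shown here.

```python
from typing import Any, Dict, List, Optional


def merge_results(
    search_results: Dict[str, List[Dict[str, Any]]],
    dedup: bool = True,
    max_results: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Flatten per-engine results, optionally dropping repeated URLs."""
    merged: List[Dict[str, Any]] = []
    seen_urls = set()
    for results in search_results.values():
        for result in results:
            url = result.get("url")
            if dedup and url:
                if url in seen_urls:
                    continue  # keep only the first result seen for each URL
                seen_urls.add(url)
            merged.append(result)
    return merged if max_results is None else merged[:max_results]
```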

#### save_results
```python
collector.save_results(results, file_path)
```
- **Description**: Saves search results to a JSON file
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `file_path` (str): Path to save the results
- **Example**:
```python
collector.save_results(processed_results, "search_results.json")
```

## Planned Interfaces for Research System

### ResearchSystem Class

#### Initialization
```python
rs = ResearchSystem(config=None)
```
- **Description**: Initializes the ResearchSystem with optional configuration
- **Parameters**:
  - `config` (dict, optional): Configuration options for the research system
- **Requirements**: Various API keys set in environment variables or config
- **Raises**: ValueError if required API keys are not set

#### execute_research
```python
report = rs.execute_research(query, options=None)
```
- **Description**: Executes a complete research pipeline from query to report
- **Parameters**:
  - `query` (str): The research query
  - `options` (dict, optional): Options to customize the research process
- **Returns**: dict - Research report with metadata
- **Raises**: Various exceptions for different stages of the pipeline

#### save_report
```python
rs.save_report(report, file_path, format="markdown")
```
- **Description**: Saves the research report to a file
- **Parameters**:
  - `report` (dict): The research report to save
  - `file_path` (str): Path to save the report
  - `format` (str, optional): Format of the report (markdown, html, pdf)
- **Raises**: IOError if the file cannot be saved

### QueryProcessor Class

#### process_query
```python
structured_query = query_processor.process_query(query)
```
- **Description**: Processes a raw query into a structured format
- **Parameters**:
  - `query` (str): The raw research query
- **Returns**: dict - Structured query with metadata
- **Raises**: ValueError if the query is invalid

### SearchStrategy Class

#### develop_strategy
```python
search_plan = search_strategy.develop_strategy(structured_query)
```
- **Description**: Develops a search strategy based on the query
- **Parameters**:
  - `structured_query` (dict): The structured query
- **Returns**: dict - Search plan with target-specific queries
- **Raises**: ValueError if the query cannot be processed

### SearchExecutor Class

#### execute_search
```python
search_results = search_executor.execute_search(search_plan)
```
- **Description**: Executes search queries against selected targets
- **Parameters**:
  - `search_plan` (dict): The search plan with queries
- **Returns**: dict - Collection of search results
- **Raises**: APIError if the search APIs fail

### JinaReranker Class

#### rerank
```python
ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
```
- **Description**: Rerank documents based on their relevance to the query.
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[str]): List of document strings to rerank
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores and indices

#### rerank_with_metadata
```python
ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
```
- **Description**: Rerank documents with metadata based on their relevance to the query.
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
  - `document_key` (str): The key in the document dictionaries that contains the text content
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores, indices, and original metadata
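
One way `rerank_with_metadata` could be layered on top of `rerank` is sketched below: extract the text under `document_key`, rerank the plain strings, then use the returned indices to reattach the original dictionaries. The result keys `index`, `score`, and `document` are assumptions, and `overlap_rerank` is a toy stand-in for the Reranker API, not the real scoring.

```python
def overlap_rerank(query, documents, top_n=None):
    """Toy reranker: scores each document by words shared with the query."""
    query_words = set(query.lower().split())
    scored = [
        {"index": i, "score": len(query_words & set(doc.lower().split())), "document": doc}
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored if top_n is None else scored[:top_n]


def rerank_with_metadata(rerank, query, documents, document_key="content", top_n=None):
    """Rerank dictionaries by their text field and keep their metadata."""
    texts = [doc[document_key] for doc in documents]
    ranked = rerank(query, texts, top_n)
    # Replace each plain-text document with the original dictionary.
    return [{**item, "document": documents[item["index"]]} for item in ranked]


docs = [
    {"content": "classical computing", "url": "https://example.com/a"},
    {"content": "quantum computing basics", "url": "https://example.com/b"},
]
top = rerank_with_metadata(overlap_rerank, "quantum computing", docs, top_n=1)
```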

#### get_jina_reranker
```python
jina_reranker = get_jina_reranker()
```
- **Description**: Get the global Jina Reranker instance.
- **Returns**: JinaReranker instance

### DocumentScraper Class

#### scrape_documents
```python
markdown_documents = document_scraper.scrape_documents(ranked_documents)
```
- **Description**: Scrapes and converts documents to markdown
- **Parameters**:
  - `ranked_documents` (list): The ranked list of documents to scrape
- **Returns**: list - Collection of markdown documents
- **Raises**: ScrapingError if the documents cannot be scraped

### DocumentSelector Class

#### select_documents
```python
selected_documents = document_selector.select_documents(documents_with_scores)
```
- **Description**: Selects the most relevant and diverse documents
- **Parameters**:
  - `documents_with_scores` (list): Documents with similarity scores
- **Returns**: list - Curated set of documents
- **Raises**: ValueError if the selection criteria are invalid

### ReportGenerator Class

#### generate_report
```python
report = report_generator.generate_report(selected_documents, query)
```
- **Description**: Generates a research report from selected documents
- **Parameters**:
  - `selected_documents` (list): The selected documents
  - `query` (str): The original query for context
- **Returns**: dict - Final research report
- **Raises**: GenerationError if the report cannot be generated

## Search Execution Module

### SearchExecutor Class

The `SearchExecutor` class manages the execution of search queries across multiple search engines.

#### Initialization
```python
executor = SearchExecutor()
```
- **Description**: Initializes the search executor with available search handlers
- **Requirements**: Appropriate API keys must be set for the search engines to be used

#### execute_search
```python
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
```
- **Description**: Executes search queries across specified search engines in parallel
- **Parameters**:
  - `structured_query` (Dict[str, Any]): The structured query from the query processor
  - `search_engines` (Optional[List[str]]): List of search engines to use
  - `num_results` (int): Number of results to return per search engine
  - `timeout` (int): Timeout in seconds for each search engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
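
Parallel execution with a per-engine timeout could be arranged with a thread pool, as in the sketch below. This is an illustration under stated assumptions, not the executor's actual code: `run_searches`, `fake_google`, and `broken_engine` are hypothetical names, and handlers are modeled as plain callables taking `(query, num_results)`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

SearchResults = List[Dict[str, Any]]


def run_searches(
    handlers: Dict[str, Callable[[str, int], SearchResults]],
    query: str,
    num_results: int = 10,
    timeout: float = 30.0,
) -> Dict[str, SearchResults]:
    """Run every handler in its own thread and collect results per engine."""
    results: Dict[str, SearchResults] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(handlers))) as pool:
        futures = {
            name: pool.submit(search, query, num_results)
            for name, search in handlers.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout)
            except Exception:
                # A failed or timed-out engine yields an empty list instead of
                # aborting the whole search. The worker thread itself still
                # runs to completion before the pool shuts down.
                results[name] = []
    return results


def fake_google(query, num_results):
    return [{"title": f"{query} #{i}", "url": f"https://example.com/{i}"}
            for i in range(num_results)]


def broken_engine(query, num_results):
    raise RuntimeError("simulated API failure")


results = run_searches(
    {"google": fake_google, "scholar": broken_engine},
    "quantum computing",
    num_results=2,
)
```

Returning an empty list for a failing engine keeps the documented return shape (engine name mapped to a list) intact; whether the real executor does this or raises is not stated here.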

#### execute_search_async
```python
results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
```
- **Description**: Executes search queries across specified search engines asynchronously
- **Parameters**: Same as `execute_search`
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results

#### get_available_search_engines
```python
engines = executor.get_available_search_engines()
```
- **Description**: Gets a list of available search engines
- **Returns**: List[str] - List of available search engine names

### ResultCollector Class

The `ResultCollector` class processes and organizes search results from multiple search engines.

#### Initialization
```python
collector = ResultCollector()
```
- **Description**: Initializes the result collector

#### process_results
```python
processed_results = collector.process_results(search_results, dedup=True, max_results=20)
```
- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - List of processed search results

#### filter_results
```python
filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
```
- **Description**: Filters results based on specified criteria
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `filters` (Dict[str, Any]): Dictionary of filter criteria
- **Returns**: List[Dict[str, Any]] - Filtered list of search results
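
The two filter keys used in the call above (`domains` and `min_score`) could be applied as in this sketch. It assumes each result carries a `url` and, for score filtering, a numeric `score` key; neither the `score` key nor the subdomain matching rule is confirmed by this document.

```python
from urllib.parse import urlparse


def filter_results(results, filters):
    """Keep results that match every criterion present in `filters`."""
    domains = filters.get("domains")
    min_score = filters.get("min_score")
    kept = []
    for result in results:
        host = urlparse(result.get("url", "")).netloc.lower()
        # Accept an exact host match or any subdomain of a listed domain.
        if domains and not any(host == d or host.endswith("." + d) for d in domains):
            continue
        # Results without a score are treated as scoring 0 (assumption).
        if min_score is not None and result.get("score", 0) < min_score:
            continue
        kept.append(result)
    return kept
```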

#### group_results_by_domain
```python
grouped_results = collector.group_results_by_domain(results)
```
- **Description**: Groups results by domain
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results
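
Grouping by domain can be sketched with `urllib.parse` and a `defaultdict`. The function name `group_by_domain` and the `"unknown"` bucket for results without a parseable URL are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(results):
    """Map each URL's host name to the results found under it."""
    grouped = defaultdict(list)
    for result in results:
        domain = urlparse(result.get("url", "")).netloc.lower() or "unknown"
        grouped[domain].append(result)
    return dict(grouped)
```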

### BaseSearchHandler Interface

The `BaseSearchHandler` class defines the interface for all search API handlers.

#### search
```python
results = handler.search(query, num_results=10, **kwargs)
```
- **Description**: Executes a search query
- **Parameters**:
  - `query` (str): The search query to execute
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional search parameters specific to the API
- **Returns**: List[Dict[str, Any]] - List of search results

#### get_name
```python
name = handler.get_name()
```
- **Description**: Gets the name of the search handler
- **Returns**: str - Name of the search handler

#### is_available
```python
available = handler.is_available()
```
- **Description**: Checks if the search API is available
- **Returns**: bool - True if the API is available, False otherwise

#### get_rate_limit_info
```python
rate_limits = handler.get_rate_limit_info()
```
- **Description**: Gets information about the API's rate limits
- **Returns**: Dict[str, Any] - Dictionary with rate limit information
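
The four methods above map naturally onto an abstract base class. The following is a sketch of that shape, assuming the interface is defined with `abc`; `CannedSearchHandler` and the `requests_per_minute` key are hypothetical and exist only to show a concrete subclass.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseSearchHandler(ABC):
    """Sketch of the handler interface described above."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        """Execute a search query."""

    @abstractmethod
    def get_name(self) -> str:
        """Return the handler name."""

    @abstractmethod
    def is_available(self) -> bool:
        """Report whether the API can be used."""

    @abstractmethod
    def get_rate_limit_info(self) -> Dict[str, Any]:
        """Return rate limit details."""


class CannedSearchHandler(BaseSearchHandler):
    """Hypothetical handler that serves fixed results (illustration only)."""

    def __init__(self, results):
        self._results = list(results)

    def search(self, query, num_results=10, **kwargs):
        return self._results[:num_results]

    def get_name(self):
        return "canned"

    def is_available(self):
        return True

    def get_rate_limit_info(self):
        # Key name invented for the example.
        return {"requests_per_minute": None}


handler = CannedSearchHandler([{"title": "A", "url": "https://example.com/a"}])
```

Because every method is abstract, instantiating `BaseSearchHandler` directly raises `TypeError`, which catches handlers that forget to implement part of the interface.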

## Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

### Test Script (test_search_execution.py)

```python
# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
```

- **Purpose**: Tests the search execution module with various queries
- **Features**:
  - Tests with multiple queries
  - Uses all available search engines
  - Saves results to a JSON file
  - Provides detailed output of search results

## Document Ranking Interface

### JinaReranker

The `JinaReranker` class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

#### Methods

```python
def rerank(self, query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """
```

```python
def rerank_with_metadata(self, query: str, documents: List[Dict[str, Any]],
                         document_key: str = 'content',
                         top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """
```

#### Factory Function

```python
def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.

    Returns:
        JinaReranker instance
    """
```

#### Example Usage

```python
from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
```

## Query Processor Testing

The query processor module has been tested with the Groq LLM provider to ensure it functions correctly with the newly integrated models.

### Test Scripts

Two test scripts have been created to validate the query processor functionality:

#### Basic Test Script (test_query_processor.py)

```python
# Get the query processor
processor = get_query_processor()

# Process a query
result = processor.process_query("What are the latest advancements in quantum computing?")

# Generate search queries
search_result = processor.generate_search_queries(result, ["google", "bing", "scholar"])
```

- **Purpose**: Tests the core functionality of the query processor
- **Features**:
  - Uses monkey patching to ensure the Groq model is used
  - Provides detailed output of processing results

#### Comprehensive Test Script (test_query_processor_comprehensive.py)

```python
# Test query enhancement
enhanced_query = test_enhance_query("What is quantum computing?")

# Test query classification
classification = test_classify_query("What is quantum computing?")

# Test the full processing pipeline
structured_query = test_process_query("What is quantum computing?")

# Test search query generation
search_result = test_generate_search_queries(structured_query, ["google", "bing", "scholar"])
```

- **Purpose**: Tests all aspects of the query processor in detail
- **Features**:
  - Tests individual components in isolation
  - Tests a variety of query types
  - Saves detailed test results to a JSON file

## LLM Interface

### LLMInterface Class

The `LLMInterface` class provides a unified interface for interacting with various LLM providers through LiteLLM.

#### Initialization
```python
llm = LLMInterface(model_name="gpt-4")
```
- **Description**: Initializes the LLM interface with the specified model
- **Parameters**:
  - `model_name` (Optional[str]): The name of the model to use (defaults to config value)
- **Requirements**: Appropriate API key must be set in environment or config

#### complete
```python
response = llm.complete(prompt, system_prompt=None, temperature=None, max_tokens=None)
```
- **Description**: Generates a completion for the given prompt
- **Parameters**:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `temperature` (Optional[float]): Temperature for generation
  - `max_tokens` (Optional[int]): Maximum tokens to generate
- **Returns**: str - The generated completion
- **Raises**: LLMError if the completion fails

#### complete_json
```python
json_response = llm.complete_json(prompt, system_prompt=None, json_schema=None)
```
- **Description**: Generates a JSON response for the given prompt
- **Parameters**:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `json_schema` (Optional[Dict]): JSON schema for validation
- **Returns**: Dict - The generated JSON response
- **Raises**: LLMError if the completion fails or JSON is invalid
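
The "JSON is invalid" failure path can be illustrated with the standard `json` module. This sketch assumes the raw completion is a string; `LLMError` is a stand-in for the project's exception, `parse_json_completion` is a hypothetical helper, and only the schema's top-level `required` list is checked, which is far short of full JSON Schema validation.

```python
import json


class LLMError(Exception):
    """Stand-in for the project's LLMError (definition assumed)."""


def parse_json_completion(raw, json_schema=None):
    """Parse a completion as a JSON object, raising LLMError on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise LLMError(f"Response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise LLMError("Expected a JSON object at the top level")
    # Simplified check: only the schema's top-level "required" keys.
    required = (json_schema or {}).get("required", [])
    missing = [key for key in required if key not in data]
    if missing:
        raise LLMError(f"Response is missing required keys: {missing}")
    return data
```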

#### Supported Providers
- OpenAI
- Azure OpenAI
- Anthropic
- Ollama
- Groq
- OpenRouter

#### Example Usage
```python
from query.llm_interface import LLMInterface

# Initialize with specific model
llm = LLMInterface(model_name="llama-3.1-8b-instant")

# Generate a completion
response = llm.complete(
    prompt="Explain quantum computing",
    system_prompt="You are a helpful assistant that explains complex topics simply.",
    temperature=0.7
)

print(response)
```