Component Interfaces

Current Interfaces

JinaSimilarity Class

Initialization

js = JinaSimilarity()
  • Description: Initializes the JinaSimilarity class
  • Requirements: JINA_API_KEY environment variable must be set
  • Raises: ValueError if JINA_API_KEY is not set

count_tokens

token_count = js.count_tokens(text)
  • Description: Counts the number of tokens in a text
  • Parameters:
    • text (str): The text to count tokens for
  • Returns: int - Number of tokens in the text
  • Dependencies: tiktoken library

get_embedding

embedding = js.get_embedding(text)
  • Description: Generates an embedding for a text using Jina AI's Embeddings API
  • Parameters:
    • text (str): The text to generate an embedding for (max 8,192 tokens)
  • Returns: list - The embedding vector
  • Raises:
    • TokenLimitError: If the text exceeds 8,192 tokens
    • requests.exceptions.RequestException: If the API call fails
  • Dependencies: requests library, Jina AI API

compute_similarity

similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
  • Description: Computes similarity between a text chunk and a query
  • Parameters:
    • chunk (str): The text chunk to compare against
    • query (str): The query text
  • Returns: Tuple containing:
    • similarity (float): Cosine similarity score (0-1)
    • chunk_embedding (list): Chunk embedding
    • query_embedding (list): Query embedding
  • Raises:
    • TokenLimitError: If either text exceeds 8,192 tokens
    • requests.exceptions.RequestException: If the API calls fail
  • Dependencies: numpy library, get_embedding method
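  • Example (illustrative sketch; the import path is an assumption, adjust to the actual module layout):
from jina_similarity import JinaSimilarity, TokenLimitError  # assumed module name

js = JinaSimilarity()  # requires JINA_API_KEY in the environment
chunk = "Quantum computers use qubits to represent information."
query = "How do quantum computers store information?"
try:
    print(f"Chunk tokens: {js.count_tokens(chunk)}")
    similarity, chunk_emb, query_emb = js.compute_similarity(chunk, query)
    print(f"Similarity: {similarity:.4f}")
except TokenLimitError:
    print("Input exceeds the 8,192-token limit")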

Markdown Segmenter

segment_markdown

segments = segment_markdown(file_path)
  • Description: Segments a markdown file using Jina AI's Segmenter API
  • Parameters:
    • file_path (str): Path to the markdown file
  • Returns: dict - JSON structure containing the segments
  • Raises: Exception if segmentation fails
  • Dependencies: requests library, Jina AI API
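  • Example (illustrative sketch; the module name and "chunks" key are assumptions):
from markdown_segmenter import segment_markdown  # assumed module name

segments = segment_markdown("docs/overview.md")
for chunk in segments.get("chunks", []):  # key name mirrors the Segmenter API response
    print(chunk[:80])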

Test Similarity Script

Command-line Interface

python test_similarity.py chunk_file query_file [--verbose]
  • Description: Computes similarity between text from two files
  • Arguments:
    • chunk_file: Path to the file containing the text chunk
    • query_file: Path to the file containing the query
    • --verbose or -v: Print token counts and embeddings
  • Output: Similarity score and optional verbose information
  • Dependencies: JinaSimilarity class

read_file

content = read_file(file_path)
  • Description: Reads content from a file
  • Parameters:
    • file_path (str): Path to the file to read
  • Returns: str - Content of the file
  • Raises: FileNotFoundError if the file doesn't exist

Search Execution Module

SearchExecutor Class

Initialization

from execution.search_executor import SearchExecutor
executor = SearchExecutor()
  • Description: Initializes the SearchExecutor class
  • Requirements: Configuration file with API keys for search engines

execute_search

results = executor.execute_search(query_data)
  • Description: Executes a search across multiple search engines
  • Parameters:
    • query_data (dict): Dictionary containing query information with keys:
      • raw_query (str): The original user query
      • enhanced_query (str): The enhanced query from the LLM
      • search_engines (list, optional): List of search engines to use
      • num_results (int, optional): Number of results to return per engine
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
  • Example:
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})

BaseSearchHandler Class

search

results = handler.search(query, num_results=10, **kwargs)
  • Description: Abstract method for searching implemented by all handlers
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters specific to the search engine
  • Returns: List[Dict[str, Any]] - List of search results
  • Example:
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)

SerperSearchHandler Class

search

from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search using the Serper API
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the Serper API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the result
    • url (str): URL of the result
    • snippet (str): Snippet of text from the result
    • source (str): Source of the result (always "serper")
  • Requirements: Serper API key in configuration
  • Example:
results = handler.search("quantum computing", num_results=5)

ScholarSearchHandler Class

search

from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search on Google Scholar using the Serper API
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the Scholar API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the paper
    • url (str): URL of the paper
    • snippet (str): Snippet of text from the paper
    • source (str): Source of the result (always "scholar")
    • authors (str): Authors of the paper
    • publication (str): Publication venue
    • year (int): Publication year
  • Requirements: Serper API key in configuration
  • Example:
results = handler.search("quantum computing", num_results=5)

ArxivSearchHandler Class

search

from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search on arXiv
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the arXiv API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the paper
    • url (str): URL of the paper
    • pdf_url (str): URL to the PDF
    • snippet (str): Abstract of the paper
    • source (str): Source of the result (always "arxiv")
    • arxiv_id (str): arXiv ID
    • authors (list): List of author names
    • categories (list): List of arXiv categories
    • published_date (str): Publication date
    • updated_date (str): Last update date
    • full_text (str): Full abstract text
  • Example:
results = handler.search("quantum computing", num_results=5)

ResultCollector Class

process_results

from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
  • Description: Processes search results from multiple search engines
  • Parameters:
    • search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
    • dedup (bool): Whether to deduplicate results based on URL
    • max_results (Optional[int]): Maximum number of results to return
  • Returns: List[Dict[str, Any]] - Combined and processed list of search results
  • Example:
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)

save_results

collector.save_results(results, file_path)
  • Description: Saves search results to a JSON file
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
    • file_path (str): Path to save the results
  • Example:
collector.save_results(processed_results, "search_results.json")

Planned Interfaces for Research System

ResearchSystem Class

Initialization

rs = ResearchSystem(config=None)
  • Description: Initializes the ResearchSystem with optional configuration
  • Parameters:
    • config (dict, optional): Configuration options for the research system
  • Requirements: Various API keys set in environment variables or config
  • Raises: ValueError if required API keys are not set

execute_research

report = rs.execute_research(query, options=None)
  • Description: Executes a complete research pipeline from query to report
  • Parameters:
    • query (str): The research query
    • options (dict, optional): Options to customize the research process
  • Returns: dict - Research report with metadata
  • Raises: Various exceptions for different stages of the pipeline

save_report

rs.save_report(report, file_path, format="markdown")
  • Description: Saves the research report to a file
  • Parameters:
    • report (dict): The research report to save
    • file_path (str): Path to save the report
    • format (str, optional): Format of the report (markdown, html, pdf)
  • Raises: IOError if the file cannot be saved
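  • Example (illustrative sketch only; these interfaces are planned and the options key is hypothetical):
rs = ResearchSystem()
report = rs.execute_research(
    "What are the latest advancements in quantum computing?",
    options={"num_results": 10}  # hypothetical option key
)
rs.save_report(report, "quantum_computing_report.md", format="markdown")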

QueryProcessor Class

process_query

structured_query = query_processor.process_query(query)
  • Description: Processes a raw query into a structured format
  • Parameters:
    • query (str): The raw research query
  • Returns: dict - Structured query with metadata
  • Raises: ValueError if the query is invalid

SearchStrategy Class

develop_strategy

search_plan = search_strategy.develop_strategy(structured_query)
  • Description: Develops a search strategy based on the query
  • Parameters:
    • structured_query (dict): The structured query
  • Returns: dict - Search plan with target-specific queries
  • Raises: ValueError if the query cannot be processed

SearchExecutor Class

execute_search

search_results = search_executor.execute_search(search_plan)
  • Description: Executes search queries against selected targets
  • Parameters:
    • search_plan (dict): The search plan with queries
  • Returns: dict - Collection of search results
  • Raises: APIError if the search APIs fail
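  • Example (illustrative end-to-end sketch of the planned pipeline; construction of these objects is not yet specified):
structured_query = query_processor.process_query("What are the latest advancements in quantum computing?")
search_plan = search_strategy.develop_strategy(structured_query)
search_results = search_executor.execute_search(search_plan)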

JinaReranker Class

rerank

ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
  • Description: Rerank documents based on their relevance to the query.
  • Parameters:
    • query (str): The query to rank documents against
    • documents (List[str]): List of document strings to rerank
    • top_n (Optional[int]): Number of top results to return (optional)
  • Returns: List of dictionaries containing reranked documents with scores and indices

rerank_with_metadata

ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
  • Description: Rerank documents with metadata based on their relevance to the query.
  • Parameters:
    • query (str): The query to rank documents against
    • documents (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
    • document_key (str): The key in the document dictionaries that contains the text content
    • top_n (Optional[int]): Number of top results to return (optional)
  • Returns: List of dictionaries containing reranked documents with scores, indices, and original metadata

get_jina_reranker

jina_reranker = get_jina_reranker()
  • Description: Get the global Jina Reranker instance.
  • Returns: JinaReranker instance
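  • Example (illustrative sketch of rerank_with_metadata; the exact result keys are an assumption):
from ranking.jina_reranker import get_jina_reranker

reranker = get_jina_reranker()
documents = [
    {"title": "Qubits", "content": "Qubits are the basic unit of quantum information."},
    {"title": "Bits", "content": "Classical computers store information in bits."},
]
ranked = reranker.rerank_with_metadata(
    query="What is quantum computing?",
    documents=documents,
    document_key="content",
    top_n=1
)
for item in ranked:
    print(item["score"], item.get("title"))  # exact result keys depend on the implementation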

DocumentScraper Class

scrape_documents

markdown_documents = document_scraper.scrape_documents(ranked_documents)
  • Description: Scrapes and converts documents to markdown
  • Parameters:
    • ranked_documents (list): The ranked list of documents to scrape
  • Returns: list - Collection of markdown documents
  • Raises: ScrapingError if the documents cannot be scraped

DocumentSelector Class

select_documents

selected_documents = document_selector.select_documents(documents_with_scores)
  • Description: Selects the most relevant and diverse documents
  • Parameters:
    • documents_with_scores (list): Documents with similarity scores
  • Returns: list - Curated set of documents
  • Raises: ValueError if the selection criteria are invalid

ReportGenerator Class

generate_report

report = report_generator.generate_report(selected_documents, query)
  • Description: Generates a research report from selected documents
  • Parameters:
    • selected_documents (list): The selected documents
    • query (str): The original query for context
  • Returns: dict - Final research report
  • Raises: GenerationError if the report cannot be generated
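  • Example (illustrative call order for the planned pipeline; the step that attaches similarity scores before selection is not documented here):
markdown_documents = document_scraper.scrape_documents(ranked_documents)
selected_documents = document_selector.select_documents(documents_with_scores)
report = report_generator.generate_report(selected_documents, query=original_query)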

Search Execution Module

SearchExecutor Class

The SearchExecutor class manages the execution of search queries across multiple search engines.

Initialization

executor = SearchExecutor()
  • Description: Initializes the search executor with available search handlers
  • Requirements: Appropriate API keys must be set for the search engines to be used

execute_search

results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
  • Description: Executes search queries across specified search engines in parallel
  • Parameters:
    • structured_query (Dict[str, Any]): The structured query from the query processor
    • search_engines (Optional[List[str]]): List of search engines to use
    • num_results (int): Number of results to return per search engine
    • timeout (int): Timeout in seconds for each search engine
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
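  • Example (illustrative sketch; assumes structured_query was produced by the query processor):
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=5)
for engine, engine_results in results.items():
    print(f"{engine}: {len(engine_results)} results")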

execute_search_async

results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
  • Description: Executes search queries across specified search engines asynchronously
  • Parameters: Same as execute_search
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
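  • Example (minimal asyncio sketch, reusing structured_query from above):
import asyncio

async def run_search():
    return await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])

results = asyncio.run(run_search())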

get_available_search_engines

engines = executor.get_available_search_engines()
  • Description: Gets a list of available search engines
  • Returns: List[str] - List of available search engine names

ResultCollector Class

The ResultCollector class processes and organizes search results from multiple search engines.

Initialization

collector = ResultCollector()
  • Description: Initializes the result collector

process_results

processed_results = collector.process_results(search_results, dedup=True, max_results=20)
  • Description: Processes search results from multiple search engines
  • Parameters:
    • search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
    • dedup (bool): Whether to deduplicate results based on URL
    • max_results (Optional[int]): Maximum number of results to return
  • Returns: List[Dict[str, Any]] - List of processed search results

filter_results

filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
  • Description: Filters results based on specified criteria
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
    • filters (Dict[str, Any]): Dictionary of filter criteria
  • Returns: List[Dict[str, Any]] - Filtered list of search results

group_results_by_domain

grouped_results = collector.group_results_by_domain(results)
  • Description: Groups results by domain
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results

BaseSearchHandler Interface

The BaseSearchHandler class defines the interface for all search API handlers.

search

results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search query
  • Parameters:
    • query (str): The search query to execute
    • num_results (int): Number of results to return
    • **kwargs: Additional search parameters specific to the API
  • Returns: List[Dict[str, Any]] - List of search results

get_name

name = handler.get_name()
  • Description: Gets the name of the search handler
  • Returns: str - Name of the search handler

is_available

available = handler.is_available()
  • Description: Checks if the search API is available
  • Returns: bool - True if the API is available, False otherwise

get_rate_limit_info

rate_limits = handler.get_rate_limit_info()
  • Description: Gets information about the API's rate limits
  • Returns: Dict[str, Any] - Dictionary with rate limit information
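
A hypothetical handler implementing this interface might look like the following sketch (the base class import path is an assumption):

from execution.api_handlers.base_handler import BaseSearchHandler  # assumed module path

class DummySearchHandler(BaseSearchHandler):
    """Hypothetical handler used only to illustrate the interface."""

    def search(self, query, num_results=10, **kwargs):
        results = [{"title": "Example", "url": "https://example.com",
                    "snippet": query, "source": self.get_name()}]
        return results[:num_results]

    def get_name(self):
        return "dummy"

    def is_available(self):
        return True

    def get_rate_limit_info(self):
        return {"requests_per_minute": 60, "requests_remaining": 60}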

Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

Test Script (test_search_execution.py)

# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
  • Purpose: Tests the search execution module with various queries
  • Features:
    • Tests with multiple queries
    • Uses all available search engines
    • Saves results to a JSON file
    • Provides detailed output of search results

Document Ranking Interface

JinaReranker

The JinaReranker class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

Methods

def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.
    
    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)
        
    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """
def rerank_with_metadata(query: str, documents: List[Dict[str, Any]], 
                        document_key: str = 'content',
                        top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.
    
    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)
        
    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """

Factory Function

def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.
    
    Returns:
        JinaReranker instance
    """

Example Usage

from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")

Query Processor Testing

The query processor module has been tested with the Groq LLM provider to ensure it functions correctly with the newly integrated models.

Test Scripts

Two test scripts have been created to validate the query processor functionality:

Basic Test Script (test_query_processor.py)

# Get the query processor
processor = get_query_processor()

# Process a query
result = processor.process_query("What are the latest advancements in quantum computing?")

# Generate search queries
search_result = processor.generate_search_queries(result, ["google", "bing", "scholar"])
  • Purpose: Tests the core functionality of the query processor
  • Features:
    • Uses monkey patching to ensure the Groq model is used
    • Provides detailed output of processing results

Comprehensive Test Script (test_query_processor_comprehensive.py)

# Test query enhancement
enhanced_query = test_enhance_query("What is quantum computing?")

# Test query classification
classification = test_classify_query("What is quantum computing?")

# Test the full processing pipeline
structured_query = test_process_query("What is quantum computing?")

# Test search query generation
search_result = test_generate_search_queries(structured_query, ["google", "bing", "scholar"])
  • Purpose: Tests all aspects of the query processor in detail
  • Features:
    • Tests individual components in isolation
    • Tests a variety of query types
    • Saves detailed test results to a JSON file

LLM Interface

LLMInterface Class

The LLMInterface class provides a unified interface for interacting with various LLM providers through LiteLLM.

Initialization

llm = LLMInterface(model_name="gpt-4")
  • Description: Initializes the LLM interface with the specified model
  • Parameters:
    • model_name (Optional[str]): The name of the model to use (defaults to config value)
  • Requirements: Appropriate API key must be set in environment or config

complete

response = llm.complete(prompt, system_prompt=None, temperature=None, max_tokens=None)
  • Description: Generates a completion for the given prompt
  • Parameters:
    • prompt (str): The prompt to complete
    • system_prompt (Optional[str]): System prompt for context
    • temperature (Optional[float]): Temperature for generation
    • max_tokens (Optional[int]): Maximum tokens to generate
  • Returns: str - The generated completion
  • Raises: LLMError if the completion fails

complete_json

json_response = llm.complete_json(prompt, system_prompt=None, json_schema=None)
  • Description: Generates a JSON response for the given prompt
  • Parameters:
    • prompt (str): The prompt to complete
    • system_prompt (Optional[str]): System prompt for context
    • json_schema (Optional[Dict]): JSON schema for validation
  • Returns: Dict - The generated JSON response
  • Raises: LLMError if the completion fails or JSON is invalid

Supported Providers

  • OpenAI
  • Azure OpenAI
  • Anthropic
  • Ollama
  • Groq
  • OpenRouter

Example Usage

from query.llm_interface import LLMInterface

# Initialize with specific model
llm = LLMInterface(model_name="llama-3.1-8b-instant")

# Generate a completion
response = llm.complete(
    prompt="Explain quantum computing",
    system_prompt="You are a helpful assistant that explains complex topics simply.",
    temperature=0.7
)

print(response)
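
The complete_json method can be used in the same way for structured output; the schema below is illustrative and the validation behavior depends on the implementation (reusing the llm instance from above):

json_response = llm.complete_json(
    prompt="List three applications of quantum computing.",
    system_prompt="Respond with a JSON object only.",
    json_schema={
        "type": "object",
        "properties": {
            "applications": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["applications"]
    }
)
print(json_response)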