Component Interfaces

Current Interfaces

JinaSimilarity Class

Initialization

js = JinaSimilarity()
  • Description: Initializes the JinaSimilarity class
  • Requirements: JINA_API_KEY environment variable must be set
  • Raises: ValueError if JINA_API_KEY is not set

count_tokens

token_count = js.count_tokens(text)
  • Description: Counts the number of tokens in a text
  • Parameters:
    • text (str): The text to count tokens for
  • Returns: int - Number of tokens in the text
  • Dependencies: tiktoken library

get_embedding

embedding = js.get_embedding(text)
  • Description: Generates an embedding for a text using Jina AI's Embeddings API
  • Parameters:
    • text (str): The text to generate an embedding for (max 8,192 tokens)
  • Returns: list - The embedding vector
  • Raises:
    • TokenLimitError: If the text exceeds 8,192 tokens
    • requests.exceptions.RequestException: If the API call fails
  • Dependencies: requests library, Jina AI API
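
The 8,192-token limit above implies a guard before the API call. A minimal sketch of such a guard follows; the TokenLimitError definition and the check_token_limit helper are assumptions made for illustration, not the project's actual code.

```python
MAX_TOKENS = 8192  # limit stated in the get_embedding interface


class TokenLimitError(Exception):
    """Stand-in for the project's TokenLimitError (its real definition is assumed)."""


def check_token_limit(token_count: int, limit: int = MAX_TOKENS) -> None:
    """Raise TokenLimitError when token_count exceeds limit (hypothetical helper)."""
    if token_count > limit:
        raise TokenLimitError(
            f"text has {token_count} tokens; the limit is {limit}"
        )


# A text exactly at the limit passes; one token more would raise.
check_token_limit(8192)
```

Checking the count locally, for example with the value returned by count_tokens, avoids spending an API request on input that would be rejected.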

compute_similarity

similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
  • Description: Computes similarity between a text chunk and a query
  • Parameters:
    • chunk (str): The text chunk to compare against
    • query (str): The query text
  • Returns: Tuple containing:
    • similarity (float): Cosine similarity score (mathematically in the range -1 to 1; higher means more similar)
    • chunk_embedding (list): Chunk embedding
    • query_embedding (list): Query embedding
  • Raises:
    • TokenLimitError: If either text exceeds 8,192 tokens
    • requests.exceptions.RequestException: If the API calls fail
  • Dependencies: numpy library, get_embedding method
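
The similarity score is the cosine of the angle between the two embedding vectors. The sketch below shows that computation using only the standard library so it runs anywhere; the interface above lists numpy as the dependency, so the real method presumably uses it instead. The vectors are toy values, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the chunk and query embeddings.
chunk_embedding = [1.0, 2.0, 3.0]
query_embedding = [2.0, 4.0, 6.0]  # same direction, different length
similarity = cosine_similarity(chunk_embedding, query_embedding)
```

Because the two toy vectors point in the same direction, the score is 1 up to floating-point rounding; orthogonal vectors score 0 and opposite vectors score -1.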

Markdown Segmenter

segment_markdown

segments = segment_markdown(file_path)
  • Description: Segments a markdown file using Jina AI's Segmenter API
  • Parameters:
    • file_path (str): Path to the markdown file
  • Returns: dict - JSON structure containing the segments
  • Raises: Exception if segmentation fails
  • Dependencies: requests library, Jina AI API

Test Similarity Script

Command-line Interface

python test_similarity.py chunk_file query_file [--verbose]
  • Description: Computes similarity between text from two files
  • Arguments:
    • chunk_file: Path to the file containing the text chunk
    • query_file: Path to the file containing the query
    • --verbose or -v: Print token counts and embeddings
  • Output: Similarity score and optional verbose information
  • Dependencies: JinaSimilarity class

read_file

content = read_file(file_path)
  • Description: Reads content from a file
  • Parameters:
    • file_path (str): Path to the file to read
  • Returns: str - Content of the file
  • Raises: FileNotFoundError if the file doesn't exist

Search Execution Module

SearchExecutor Class

Initialization

from execution.search_executor import SearchExecutor
executor = SearchExecutor()
  • Description: Initializes the SearchExecutor class
  • Requirements: Configuration file with API keys for search engines

execute_search

results = executor.execute_search(query_data)
  • Description: Executes a search across multiple search engines
  • Parameters:
    • query_data (dict): Dictionary containing query information with keys:
      • raw_query (str): The original user query
      • enhanced_query (str): The enhanced query from the LLM
      • search_engines (list, optional): List of search engines to use
      • num_results (int, optional): Number of results to return per engine
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
  • Example:
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})

BaseSearchHandler Class

search

results = handler.search(query, num_results=10, **kwargs)
  • Description: Abstract method for searching implemented by all handlers
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters specific to the search engine
  • Returns: List[Dict[str, Any]] - List of search results
  • Example:
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)

SerperSearchHandler Class

search

from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search using the Serper API
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the Serper API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the result
    • url (str): URL of the result
    • snippet (str): Snippet of text from the result
    • source (str): Source of the result (always "serper")
  • Requirements: Serper API key in configuration
  • Example:
results = handler.search("quantum computing", num_results=5)

ScholarSearchHandler Class

search

from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search on Google Scholar using the Serper API
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the Scholar API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the paper
    • url (str): URL of the paper
    • snippet (str): Snippet of text from the paper
    • source (str): Source of the result (always "scholar")
    • authors (str): Authors of the paper
    • publication (str): Publication venue
    • year (int): Publication year
  • Requirements: Serper API key in configuration
  • Example:
results = handler.search("quantum computing", num_results=5)

ArxivSearchHandler Class

search

from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search on arXiv
  • Parameters:
    • query (str): The search query
    • num_results (int): Number of results to return
    • **kwargs: Additional parameters for the arXiv API
  • Returns: List[Dict[str, Any]] - List of search results with keys:
    • title (str): Title of the paper
    • url (str): URL of the paper
    • pdf_url (str): URL to the PDF
    • snippet (str): Abstract of the paper
    • source (str): Source of the result (always "arxiv")
    • arxiv_id (str): arXiv ID
    • authors (list): List of author names
    • categories (list): List of arXiv categories
    • published_date (str): Publication date
    • updated_date (str): Last update date
    • full_text (str): Full abstract text
  • Example:
results = handler.search("quantum computing", num_results=5)

ResultCollector Class

process_results

from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
  • Description: Processes search results from multiple search engines
  • Parameters:
    • search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
    • dedup (bool): Whether to deduplicate results based on URL
    • max_results (Optional[int]): Maximum number of results to return
  • Returns: List[Dict[str, Any]] - Combined and processed list of search results
  • Example:
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
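
One way to read the dedup and max_results parameters is sketched below: results are merged in engine order, later duplicates of an already-seen URL are dropped, and the list is truncated at the end. The function name merge_results and these exact semantics are assumptions; the real process_results may also score or reorder results.

```python
from typing import Any, Dict, List, Optional


def merge_results(
    search_results: Dict[str, List[Dict[str, Any]]],
    dedup: bool = True,
    max_results: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Flatten per-engine results, optionally dropping repeated URLs."""
    merged: List[Dict[str, Any]] = []
    seen_urls = set()
    for results in search_results.values():
        for result in results:
            url = result.get("url")
            if dedup and url in seen_urls:
                continue  # keep only the first result for each URL
            if url is not None:
                seen_urls.add(url)
            merged.append(result)
    return merged if max_results is None else merged[:max_results]


sample = {
    "serper": [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/b"},
    ],
    "arxiv": [
        {"title": "A again", "url": "https://example.com/a"},
        {"title": "C", "url": "https://example.com/c"},
    ],
}
merged = merge_results(sample, dedup=True, max_results=2)
```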

save_results

collector.save_results(results, file_path)
  • Description: Saves search results to a JSON file
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
    • file_path (str): Path to save the results
  • Example:
collector.save_results(processed_results, "search_results.json")
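
Saving to JSON can be sketched with the standard library alone. The version below is an assumption about the behavior, not the collector's actual code: it creates missing parent folders and writes indented UTF-8 JSON, then reads the file back to confirm the round trip.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_results(results: List[Dict[str, Any]], file_path: str) -> None:
    """Write results to file_path as indented UTF-8 JSON."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    with path.open("w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2, ensure_ascii=False)


sample = [
    {"title": "Qubits 101", "url": "https://example.com/qubits", "source": "serper"}
]

# Round-trip through a temporary directory to confirm the file is valid JSON.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "out" / "search_results.json"
    save_results(sample, str(target))
    loaded = json.loads(target.read_text(encoding="utf-8"))
```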

Planned Interfaces for Research System

ResearchSystem Class

Initialization

rs = ResearchSystem(config=None)
  • Description: Initializes the ResearchSystem with optional configuration
  • Parameters:
    • config (dict, optional): Configuration options for the research system
  • Requirements: Various API keys set in environment variables or config
  • Raises: ValueError if required API keys are not set

execute_research

report = rs.execute_research(query, options=None)
  • Description: Executes a complete research pipeline from query to report
  • Parameters:
    • query (str): The research query
    • options (dict, optional): Options to customize the research process
  • Returns: dict - Research report with metadata
  • Raises: Various exceptions for different stages of the pipeline

save_report

rs.save_report(report, file_path, format="markdown")
  • Description: Saves the research report to a file
  • Parameters:
    • report (dict): The research report to save
    • file_path (str): Path to save the report
    • format (str, optional): Format of the report (markdown, html, pdf)
  • Raises: IOError if the file cannot be saved

QueryProcessor Class

process_query

structured_query = query_processor.process_query(query)
  • Description: Processes a raw query into a structured format
  • Parameters:
    • query (str): The raw research query
  • Returns: dict - Structured query with metadata
  • Raises: ValueError if the query is invalid

SearchStrategy Class

develop_strategy

search_plan = search_strategy.develop_strategy(structured_query)
  • Description: Develops a search strategy based on the query
  • Parameters:
    • structured_query (dict): The structured query
  • Returns: dict - Search plan with target-specific queries
  • Raises: ValueError if the query cannot be processed

SearchExecutor Class

execute_search

search_results = search_executor.execute_search(search_plan)
  • Description: Executes search queries against selected targets
  • Parameters:
    • search_plan (dict): The search plan with queries
  • Returns: dict - Collection of search results
  • Raises: APIError if the search APIs fail

JinaReranker Class

rerank

ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
  • Description: Rerank documents based on their relevance to the query.
  • Parameters:
    • query (str): The query to rank documents against
    • documents (List[str]): List of document strings to rerank
    • top_n (Optional[int]): Number of top results to return (defaults to all)
  • Returns: List of dictionaries containing reranked documents with scores and indices

rerank_with_metadata

ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
  • Description: Rerank documents with metadata based on their relevance to the query.
  • Parameters:
    • query (str): The query to rank documents against
    • documents (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
    • document_key (str): The key in the document dictionaries that contains the text content
    • top_n (Optional[int]): Number of top results to return (defaults to all)
  • Returns: List of dictionaries containing reranked documents with scores, indices, and original metadata
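
The metadata variant can be pictured as scoring the text under document_key while carrying the whole dictionary along. The sketch below replaces the Jina Reranker API with a toy word-overlap scorer (overlap_score is invented for illustration), so only the data flow is representative, not the ranking quality.

```python
from typing import Any, Callable, Dict, List, Optional


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank_with_metadata(
    query: str,
    documents: List[Dict[str, Any]],
    document_key: str = "content",
    top_n: Optional[int] = None,
    score_fn: Callable[[str, str], float] = overlap_score,  # stand-in for the API
) -> List[Dict[str, Any]]:
    """Score each document's text and return them best-first with metadata intact."""
    scored = [
        {"index": i, "score": score_fn(query, doc.get(document_key, "")), "document": doc}
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored if top_n is None else scored[:top_n]


docs = [
    {"content": "classical computing basics", "url": "https://example.com/a"},
    {"content": "quantum computing overview", "url": "https://example.com/b"},
]
ranked = rerank_with_metadata("quantum computing", docs, top_n=1)
```

Keeping the original index in each entry lets callers map a ranked item back to its position in the input list.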

get_jina_reranker

jina_reranker = get_jina_reranker()
  • Description: Get the global Jina Reranker instance.
  • Returns: JinaReranker instance

DocumentScraper Class

scrape_documents

markdown_documents = document_scraper.scrape_documents(ranked_documents)
  • Description: Scrapes and converts documents to markdown
  • Parameters:
    • ranked_documents (list): The ranked list of documents to scrape
  • Returns: list - Collection of markdown documents
  • Raises: ScrapingError if the documents cannot be scraped

DocumentSelector Class

select_documents

selected_documents = document_selector.select_documents(documents_with_scores)
  • Description: Selects the most relevant and diverse documents
  • Parameters:
    • documents_with_scores (list): Documents with similarity scores
  • Returns: list - Curated set of documents
  • Raises: ValueError if the selection criteria are invalid

ReportGenerator Class

generate_report

report = report_generator.generate_report(selected_documents, query)
  • Description: Generates a research report from selected documents
  • Parameters:
    • selected_documents (list): The selected documents
    • query (str): The original query for context
  • Returns: dict - Final research report
  • Raises: GenerationError if the report cannot be generated

Search Execution Module

SearchExecutor Class

The SearchExecutor class manages the execution of search queries across multiple search engines.

Initialization

executor = SearchExecutor()
  • Description: Initializes the search executor with available search handlers
  • Requirements: Appropriate API keys must be set for the search engines to be used

execute_search

results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
  • Description: Executes search queries across specified search engines in parallel
  • Parameters:
    • structured_query (Dict[str, Any]): The structured query from the query processor
    • search_engines (Optional[List[str]]): List of search engines to use
    • num_results (int): Number of results to return per search engine
    • timeout (int): Timeout in seconds for each search engine
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
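
Parallel execution with a per-engine timeout could look roughly like the sketch below, which uses a thread pool and plain functions in place of real handlers. The names run_searches, fake_serper, and broken_engine are hypothetical, and returning an empty list for a failed engine is an assumed policy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

SearchFn = Callable[[str, int], List[Dict[str, Any]]]


def run_searches(
    handlers: Dict[str, SearchFn],
    query: str,
    num_results: int = 10,
    timeout: float = 30.0,
) -> Dict[str, List[Dict[str, Any]]]:
    """Run every handler in its own thread and collect results per engine."""
    results: Dict[str, List[Dict[str, Any]]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(handlers))) as pool:
        futures = {
            name: pool.submit(fn, query, num_results)
            for name, fn in handlers.items()
        }
        for name, future in futures.items():
            try:
                # timeout bounds the wait for this result; the pool still
                # joins any thread that is running when the block exits.
                results[name] = future.result(timeout=timeout)
            except Exception:
                results[name] = []  # failed or timed-out engine: empty list
    return results


def fake_serper(query: str, num_results: int) -> List[Dict[str, Any]]:
    return [{"title": f"{query} #{i}", "source": "serper"} for i in range(num_results)]


def broken_engine(query: str, num_results: int) -> List[Dict[str, Any]]:
    raise RuntimeError("simulated API failure")


results = run_searches(
    {"serper": fake_serper, "broken": broken_engine},
    "quantum computing",
    num_results=2,
)
```

Isolating failures per engine means one unavailable API does not discard results from the others.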

execute_search_async

results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
  • Description: Executes search queries across specified search engines asynchronously
  • Parameters: Same as execute_search
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
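
The asynchronous variant can be sketched with asyncio.gather, which runs one coroutine per engine concurrently and returns their results in the order the coroutines were passed. The fake_search coroutine below is a placeholder for a real handler call, and the standalone execute_search_async function only mirrors the method's shape.

```python
import asyncio
from typing import Any, Dict, List


async def fake_search(engine: str, query: str, num_results: int) -> List[Dict[str, Any]]:
    """Placeholder for a real handler call; yields control once, then returns."""
    await asyncio.sleep(0)
    return [
        {"title": f"{engine} result {i}", "url": f"https://example.com/{engine}/{i}", "source": engine}
        for i in range(num_results)
    ]


async def execute_search_async(
    query: str, search_engines: List[str], num_results: int = 2
) -> Dict[str, List[Dict[str, Any]]]:
    """Query all engines concurrently and map each engine name to its results."""
    tasks = [fake_search(engine, query, num_results) for engine in search_engines]
    per_engine = await asyncio.gather(*tasks)  # order matches search_engines
    return dict(zip(search_engines, per_engine))


results = asyncio.run(execute_search_async("quantum computing", ["serper", "arxiv"]))
```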

get_available_search_engines

engines = executor.get_available_search_engines()
  • Description: Gets a list of available search engines
  • Returns: List[str] - List of available search engine names

ResultCollector Class

The ResultCollector class processes and organizes search results from multiple search engines.

Initialization

collector = ResultCollector()
  • Description: Initializes the result collector

process_results

processed_results = collector.process_results(search_results, dedup=True, max_results=20)
  • Description: Processes search results from multiple search engines
  • Parameters:
    • search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
    • dedup (bool): Whether to deduplicate results based on URL
    • max_results (Optional[int]): Maximum number of results to return
  • Returns: List[Dict[str, Any]] - List of processed search results

filter_results

filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
  • Description: Filters results based on specified criteria
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
    • filters (Dict[str, Any]): Dictionary of filter criteria
  • Returns: List[Dict[str, Any]] - Filtered list of search results
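
A plausible reading of the two filter keys in the example call is sketched here: "domains" keeps results whose host matches a listed domain or one of its subdomains, and "min_score" drops results whose "score" field is lower. Both interpretations, and the "score" field itself, are assumptions rather than documented behavior.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def filter_results(
    results: List[Dict[str, Any]], filters: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep results that satisfy every filter present in filters."""
    domains = filters.get("domains")
    min_score = filters.get("min_score")
    kept = []
    for result in results:
        host = urlparse(result.get("url", "")).netloc.lower()
        if domains and not any(
            host == d or host.endswith("." + d) for d in domains
        ):
            continue  # host is neither a listed domain nor a subdomain of one
        if min_score is not None and result.get("score", 0) < min_score:
            continue  # missing scores are treated as 0 in this sketch
        kept.append(result)
    return kept


sample = [
    {"url": "https://arxiv.org/abs/1", "score": 7},
    {"url": "https://arxiv.org/abs/2", "score": 3},
    {"url": "https://example.com/post", "score": 9},
]
filtered = filter_results(sample, {"domains": ["arxiv.org"], "min_score": 5})
```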

group_results_by_domain

grouped_results = collector.group_results_by_domain(results)
  • Description: Groups results by domain
  • Parameters:
    • results (List[Dict[str, Any]]): List of search results
  • Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results
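
Grouping by domain amounts to bucketing results by the host part of each URL. The sketch below assumes the domain is the lower-cased network location and that results without a usable URL go under an "unknown" key; the real method may normalize domains differently.

```python
from collections import defaultdict
from typing import Any, Dict, List
from urllib.parse import urlparse


def group_results_by_domain(
    results: List[Dict[str, Any]]
) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket results by the host of their URL."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for result in results:
        domain = urlparse(result.get("url", "")).netloc.lower() or "unknown"
        grouped[domain].append(result)
    return dict(grouped)


sample = [
    {"title": "Paper 1", "url": "https://arxiv.org/abs/1"},
    {"title": "Paper 2", "url": "https://arxiv.org/abs/2"},
    {"title": "Blog", "url": "https://example.com/post"},
    {"title": "No link"},
]
grouped = group_results_by_domain(sample)
```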

BaseSearchHandler Interface

The BaseSearchHandler class defines the interface for all search API handlers.

search

results = handler.search(query, num_results=10, **kwargs)
  • Description: Executes a search query
  • Parameters:
    • query (str): The search query to execute
    • num_results (int): Number of results to return
    • **kwargs: Additional search parameters specific to the API
  • Returns: List[Dict[str, Any]] - List of search results

get_name

name = handler.get_name()
  • Description: Gets the name of the search handler
  • Returns: str - Name of the search handler

is_available

available = handler.is_available()
  • Description: Checks if the search API is available
  • Returns: bool - True if the API is available, False otherwise

get_rate_limit_info

rate_limits = handler.get_rate_limit_info()
  • Description: Gets information about the API's rate limits
  • Returns: Dict[str, Any] - Dictionary with rate limit information
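
The four methods above map naturally onto an abstract base class. The following sketch is one possible shape, not the project's actual definition; StaticSearchHandler and the "requests_per_minute" key are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseSearchHandler(ABC):
    """Interface that every search handler is expected to implement."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        ...

    @abstractmethod
    def get_name(self) -> str:
        ...

    @abstractmethod
    def is_available(self) -> bool:
        ...

    @abstractmethod
    def get_rate_limit_info(self) -> Dict[str, Any]:
        ...


class StaticSearchHandler(BaseSearchHandler):
    """Hypothetical handler that serves canned results, e.g. for tests."""

    def __init__(self, results: List[Dict[str, Any]]) -> None:
        self._results = results

    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        return self._results[:num_results]

    def get_name(self) -> str:
        return "static"

    def is_available(self) -> bool:
        return True  # no external API to check

    def get_rate_limit_info(self) -> Dict[str, Any]:
        return {"requests_per_minute": None}  # key name is illustrative


handler = StaticSearchHandler([{"title": f"Result {i}"} for i in range(3)])
top_two = handler.search("quantum computing", num_results=2)
```

With abstract methods, a subclass that forgets to implement one of them fails at instantiation time instead of at the first call.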

Ranking Module

JinaReranker Class

The JinaReranker class provides document reranking functionality using Jina AI's Reranker API.

Initialization

reranker = JinaReranker(
    api_key=None,  # Optional, will use environment variable if not provided
    model="jina-reranker-v2-base-multilingual",  # Default model
    endpoint="https://api.jina.ai/v1/rerank"  # Default endpoint
)
  • Description: Initializes the JinaReranker with the specified API key, model, and endpoint
  • Parameters:
    • api_key (Optional[str]): Jina AI API key (defaults to environment variable)
    • model (str): The reranker model to use
    • endpoint (str): The API endpoint
  • Requirements: JINA_API_KEY environment variable must be set if api_key is not provided
  • Raises: ValueError if API key is not available
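
The key-resolution rule described above (explicit argument first, then the environment variable, otherwise ValueError) can be sketched as a small helper. The function name resolve_api_key and its env_var parameter are hypothetical.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None, env_var: str = "JINA_API_KEY") -> str:
    """Return the explicit key if given, else the environment value, else raise."""
    key = api_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key provided and {env_var} is not set")
    return key


# An explicit key always wins over the environment.
explicit = resolve_api_key("explicit-key")
```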

rerank

reranked_docs = reranker.rerank(query, documents, top_n=None)
  • Description: Reranks a list of documents based on their relevance to the query
  • Parameters:
    • query (str): The query string
    • documents (List[str]): List of document strings to rerank
    • top_n (Optional[int]): Number of top documents to return (defaults to all)
  • Returns: List[Dict[str, Any]] - List of reranked documents with scores
  • Example Return Format:
[
  {
    "index": 0,
    "score": 0.95,
    "document": "Document content here"
  },
  {
    "index": 3,
    "score": 0.82,
    "document": "Another document content"
  }
]

get_jina_reranker

reranker = get_jina_reranker()
  • Description: Factory function to get a JinaReranker instance with configuration from the config file
  • Returns: JinaReranker - Initialized reranker instance
  • Raises: ValueError if API key is not available

Usage Examples

Basic Usage

from ranking.jina_reranker import JinaReranker

# Initialize with the default model and endpoint
reranker = JinaReranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")

Integration with ResultCollector

from execution.result_collector import ResultCollector
from ranking.jina_reranker import get_jina_reranker

# Initialize components
reranker = get_jina_reranker()
collector = ResultCollector(reranker=reranker)

# Process search results with reranking
reranked_results = collector.process_results(
    search_results,
    dedup=True,
    max_results=20,
    use_reranker=True
)

Testing

# Simple test script
import json
from ranking.jina_reranker import get_jina_reranker

reranker = get_jina_reranker()
query = "What is quantum computing?"
documents = [
    "Quantum computing is a type of computation that harnesses quantum mechanics.",
    "Classical computers use bits, while quantum computers use qubits.",
    "Machine learning is a subset of artificial intelligence."
]

reranked = reranker.rerank(query, documents)
print(json.dumps(reranked, indent=2))

Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

Test Script (test_search_execution.py)

# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
  • Purpose: Tests the search execution module with various queries
  • Features:
    • Tests with multiple queries
    • Uses all available search engines
    • Saves results to a JSON file
    • Provides detailed output of search results

UI Module

GradioInterface Class

Initialization

from ui.gradio_interface import GradioInterface
interface = GradioInterface()
  • Description: Initializes the Gradio interface for the research system
  • Requirements: Gradio library installed

process_query

markdown_results, results_file = interface.process_query(query, num_results=10)
  • Description: Processes a query and returns the results
  • Parameters:
    • query (str): The query to process
    • num_results (int): Number of results to return
  • Returns:
    • markdown_results (str): Markdown formatted results
    • results_file (str): Path to the JSON file with saved results
  • Example:
results, file_path = interface.process_query("What are the latest advancements in quantum computing?", num_results=15)

create_interface

interface_blocks = interface.create_interface()
  • Description: Creates and returns the Gradio interface
  • Returns: gr.Blocks - The Gradio interface object
  • Example:
blocks = interface.create_interface()
blocks.launch()

launch

interface.launch(share=True, server_port=7860, debug=False)
  • Description: Launches the Gradio interface
  • Parameters:
    • share (bool): Whether to create a public link for sharing
    • server_port (int): Port to run the server on
    • debug (bool): Whether to run in debug mode
  • Example:
interface.launch(share=True)

Running the UI

python run_ui.py --share --port 7860
  • Description: Runs the Gradio interface
  • Parameters:
    • --share: Create a public link for sharing
    • --port: Port to run the server on (default: 7860)
    • --debug: Run in debug mode
  • Example:
python run_ui.py --share
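
The three flags listed above could be parsed with argparse along these lines. This is a sketch of how run_ui.py might define its options, with help texts taken from the list; the actual script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --share, --port, and --debug options."""
    parser = argparse.ArgumentParser(description="Run the Gradio interface")
    parser.add_argument("--share", action="store_true",
                        help="Create a public link for sharing")
    parser.add_argument("--port", type=int, default=7860,
                        help="Port to run the server on (default: 7860)")
    parser.add_argument("--debug", action="store_true",
                        help="Run in debug mode")
    return parser


# Equivalent to: python run_ui.py --share
args = build_parser().parse_args(["--share"])
```

The parsed values would then be passed on to interface.launch as share, server_port, and debug.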

Document Ranking Interface

JinaReranker

The JinaReranker class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

Methods

def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.
    
    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)
        
    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """

def rerank_with_metadata(query: str, documents: List[Dict[str, Any]], 
                        document_key: str = 'content',
                        top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.
    
    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)
        
    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """

Factory Function

def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.
    
    Returns:
        JinaReranker instance
    """

Example Usage

from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")

Report Generation Module

ReportDetailLevelManager Class

The ReportDetailLevelManager class manages configurations for different report detail levels.

Initialization

detail_level_manager = get_report_detail_level_manager()
  • Description: Gets a singleton instance of the ReportDetailLevelManager

get_detail_level_config

config = detail_level_manager.get_detail_level_config(detail_level)
  • Description: Gets configuration parameters for a specific detail level
  • Parameters:
    • detail_level (str): Detail level as a string (brief, standard, detailed, comprehensive)
  • Returns: Dict[str, Any] - Configuration parameters for the specified detail level
  • Raises: ValueError if the detail level is not valid
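
A lookup of this kind is often a dictionary keyed by level name plus a validation step. In the sketch below the four level names come from the interface above, but the "max_sections" key and every number are placeholders invented for illustration; the real configuration parameters are not specified here.

```python
from typing import Any, Dict

# Placeholder settings: only the level names are taken from the interface.
DETAIL_LEVEL_CONFIGS: Dict[str, Dict[str, Any]] = {
    "brief": {"max_sections": 2},
    "standard": {"max_sections": 4},
    "detailed": {"max_sections": 6},
    "comprehensive": {"max_sections": 8},
}


def get_detail_level_config(detail_level: str) -> Dict[str, Any]:
    """Return a copy of the config for detail_level, or raise ValueError."""
    key = detail_level.strip().lower()
    if key not in DETAIL_LEVEL_CONFIGS:
        valid = ", ".join(DETAIL_LEVEL_CONFIGS)
        raise ValueError(
            f"Invalid detail level {detail_level!r}; expected one of: {valid}"
        )
    return dict(DETAIL_LEVEL_CONFIGS[key])  # copy so callers cannot mutate defaults


config = get_detail_level_config("Brief")
```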

get_template_modifier

template = detail_level_manager.get_template_modifier(detail_level, query_type)
  • Description: Gets template modifier for a specific detail level and query type
  • Parameters:
    • detail_level (str): Detail level as a string (brief, standard, detailed, comprehensive)
    • query_type (str): Query type as a string (factual, exploratory, comparative)
  • Returns: str - Template modifier as a string
  • Raises: ValueError if the detail level or query type is not valid

get_available_detail_levels

levels = detail_level_manager.get_available_detail_levels()
  • Description: Gets a list of available detail levels with descriptions
  • Returns: List[Tuple[str, str]] - List of tuples containing detail level and description

ReportGenerator Class

The ReportGenerator class generates reports from search results.

Initialization

report_generator = get_report_generator()
  • Description: Gets a singleton instance of the ReportGenerator

initialize

await report_generator.initialize()
  • Description: Initializes the report generator by setting up the database
  • Returns: None

set_detail_level

report_generator.set_detail_level(detail_level)
  • Description: Sets the detail level for report generation
  • Parameters:
    • detail_level (str): Detail level (brief, standard, detailed, comprehensive)
  • Returns: None
  • Raises: ValueError if the detail level is not valid

get_detail_level_config

config = report_generator.get_detail_level_config()
  • Description: Gets the current detail level configuration
  • Returns: Dict[str, Any] - Configuration parameters for the current detail level

get_available_detail_levels

levels = report_generator.get_available_detail_levels()
  • Description: Gets a list of available detail levels with descriptions
  • Returns: List[Tuple[str, str]] - List of tuples containing detail level and description

process_search_results

documents = await report_generator.process_search_results(search_results)
  • Description: Processes search results by scraping the URLs and storing them in the database
  • Parameters:
    • search_results (List[Dict[str, Any]]): List of search results, each containing at least a 'url' field
  • Returns: List[Dict[str, Any]] - List of processed documents

prepare_documents_for_report

chunks = await report_generator.prepare_documents_for_report(search_results, token_budget, chunk_size, overlap_size)
  • Description: Prepares documents for report generation by chunking and selecting relevant content
  • Parameters:
    • search_results (List[Dict[str, Any]]): List of search results
    • token_budget (Optional[int]): Maximum number of tokens to use
    • chunk_size (Optional[int]): Maximum number of tokens per chunk
    • overlap_size (Optional[int]): Number of tokens to overlap between chunks
  • Returns: List[Dict[str, Any]] - List of selected document chunks
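
The interaction of chunk_size, overlap_size, and token_budget can be illustrated on a plain list of tokens. The sketch below assumes fixed-size sliding windows and a greedy, in-order budget cut; the real method presumably ranks chunks by relevance before selecting, which is not shown. Both helper names are hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap_size: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share overlap_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    step = chunk_size - overlap_size
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def select_within_budget(chunks: List[List[str]], token_budget: int) -> List[List[str]]:
    """Take chunks in order until adding the next one would exceed the budget."""
    selected: List[List[str]] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > token_budget:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


tokens = list("abcdefghij")  # ten single-character "tokens"
chunks = chunk_tokens(tokens, chunk_size=4, overlap_size=1)
selected = select_within_budget(chunks, token_budget=8)
```

With ten tokens, a chunk size of 4, and an overlap of 1, the windows start at positions 0, 3, and 6, so each chunk repeats the last token of the previous one.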

generate_report

report = await report_generator.generate_report(
    search_results=search_results,
    query=query,
    token_budget=token_budget,
    chunk_size=chunk_size,
    overlap_size=overlap_size,
    detail_level=detail_level
)
  • Description: Generates a report from search results
  • Parameters:
    • search_results (List[Dict[str, Any]]): List of search results
    • query (str): Original search query
    • token_budget (Optional[int]): Maximum number of tokens to use
    • chunk_size (Optional[int]): Maximum number of tokens per chunk
    • overlap_size (Optional[int]): Number of tokens to overlap between chunks
    • detail_level (Optional[str]): Level of detail for the report (brief, standard, detailed, comprehensive)
  • Returns: str - Generated report as a string

initialize_report_generator

await initialize_report_generator()
  • Description: Initializes the global report generator instance
  • Returns: None

get_report_generator

report_generator = get_report_generator()
  • Description: Gets the global report generator instance
  • Returns: ReportGenerator - Initialized report generator instance