Component Interfaces
Current Interfaces
JinaSimilarity Class
Initialization
js = JinaSimilarity()
- Description: Initializes the JinaSimilarity class
- Requirements: JINA_API_KEY environment variable must be set
- Raises: ValueError if JINA_API_KEY is not set
count_tokens
token_count = js.count_tokens(text)
- Description: Counts the number of tokens in a text
- Parameters:
  - text (str): The text to count tokens for
- Returns: int - Number of tokens in the text
- Dependencies: tiktoken library
get_embedding
embedding = js.get_embedding(text)
- Description: Generates an embedding for a text using Jina AI's Embeddings API
- Parameters:
  - text (str): The text to generate an embedding for (max 8,192 tokens)
- Returns: list - The embedding vector
- Raises:
  - TokenLimitError: If the text exceeds 8,192 tokens
  - requests.exceptions.RequestException: If the API call fails
- Dependencies: requests library, Jina AI API
compute_similarity
similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
- Description: Computes similarity between a text chunk and a query
- Parameters:
  - chunk (str): The text chunk to compare against
  - query (str): The query text
- Returns: Tuple containing:
  - similarity (float): Cosine similarity score (0-1)
  - chunk_embedding (list): Chunk embedding
  - query_embedding (list): Query embedding
- Raises:
  - TokenLimitError: If either text exceeds 8,192 tokens
  - requests.exceptions.RequestException: If the API calls fail
- Dependencies: numpy library, get_embedding method
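The similarity score above is a cosine similarity between two embedding vectors. The sketch below shows that calculation using only the standard library rather than numpy; the helper name `cosine_similarity` and the toy vectors are illustrative assumptions and are not part of the JinaSimilarity class.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Mathematically the result can fall anywhere between -1 and 1, so callers that rely on the 0-1 range quoted above may want to check it for their own data.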
Markdown Segmenter
segment_markdown
segments = segment_markdown(file_path)
- Description: Segments a markdown file using Jina AI's Segmenter API
- Parameters:
  - file_path (str): Path to the markdown file
- Returns: dict - JSON structure containing the segments
- Raises: Exception if segmentation fails
- Dependencies: requests library, Jina AI API
Test Similarity Script
Command-line Interface
python test_similarity.py chunk_file query_file [--verbose]
- Description: Computes similarity between text from two files
- Arguments:
  - chunk_file: Path to the file containing the text chunk
  - query_file: Path to the file containing the query
  - --verbose or -v: Print token counts and embeddings
- Output: Similarity score and optional verbose information
- Dependencies: JinaSimilarity class
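The command line above can be parsed with argparse. This is a minimal sketch of such a parser, assuming the script uses argparse at all; the function name `build_parser` is hypothetical and the help strings are copied from the argument list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching: test_similarity.py chunk_file query_file [--verbose]."""
    parser = argparse.ArgumentParser(
        description="Computes similarity between text from two files"
    )
    parser.add_argument("chunk_file", help="Path to the file containing the text chunk")
    parser.add_argument("query_file", help="Path to the file containing the query")
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="Print token counts and embeddings",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["chunk.txt", "query.txt", "--verbose"])
```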
read_file
content = read_file(file_path)
- Description: Reads content from a file
- Parameters:
  - file_path (str): Path to the file to read
- Returns: str - Content of the file
- Raises: FileNotFoundError if the file doesn't exist
Search Execution Module
SearchExecutor Class
Initialization
from execution.search_executor import SearchExecutor
executor = SearchExecutor()
- Description: Initializes the SearchExecutor class
- Requirements: Configuration file with API keys for search engines
execute_search
results = executor.execute_search(query_data)
- Description: Executes a search across multiple search engines
- Parameters:
  - query_data (dict): Dictionary containing query information with keys:
    - raw_query (str): The original user query
    - enhanced_query (str): The enhanced query from the LLM
    - search_engines (list, optional): List of search engines to use
    - num_results (int, optional): Number of results to return per engine
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
- Example:
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})
BaseSearchHandler Class
search
results = handler.search(query, num_results=10, **kwargs)
- Description: Abstract method for searching implemented by all handlers
- Parameters:
  - query (str): The search query
  - num_results (int): Number of results to return
  - **kwargs: Additional parameters specific to the search engine
- Returns: List[Dict[str, Any]] - List of search results
- Example:
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)
SerperSearchHandler Class
search
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
- Description: Executes a search using the Serper API
- Parameters:
  - query (str): The search query
  - num_results (int): Number of results to return
  - **kwargs: Additional parameters for the Serper API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - title (str): Title of the result
  - url (str): URL of the result
  - snippet (str): Snippet of text from the result
  - source (str): Source of the result (always "serper")
- Requirements: Serper API key in configuration
- Example:
results = handler.search("quantum computing", num_results=5)
ScholarSearchHandler Class
search
from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
- Description: Executes a search on Google Scholar using the Serper API
- Parameters:
  - query (str): The search query
  - num_results (int): Number of results to return
  - **kwargs: Additional parameters for the Scholar API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - title (str): Title of the paper
  - url (str): URL of the paper
  - snippet (str): Snippet of text from the paper
  - source (str): Source of the result (always "scholar")
  - authors (str): Authors of the paper
  - publication (str): Publication venue
  - year (int): Publication year
- Requirements: Serper API key in configuration
- Example:
results = handler.search("quantum computing", num_results=5)
ArxivSearchHandler Class
search
from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
- Description: Executes a search on arXiv
- Parameters:
  - query (str): The search query
  - num_results (int): Number of results to return
  - **kwargs: Additional parameters for the arXiv API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - title (str): Title of the paper
  - url (str): URL of the paper
  - pdf_url (str): URL to the PDF
  - snippet (str): Abstract of the paper
  - source (str): Source of the result (always "arxiv")
  - arxiv_id (str): arXiv ID
  - authors (list): List of author names
  - categories (list): List of arXiv categories
  - published_date (str): Publication date
  - updated_date (str): Last update date
  - full_text (str): Full abstract text
- Example:
results = handler.search("quantum computing", num_results=5)
ResultCollector Class
process_results
from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
- Description: Processes search results from multiple search engines
- Parameters:
  - search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - dedup (bool): Whether to deduplicate results based on URL
  - max_results (Optional[int]): Maximum number of results to return
- Returns: List[Dict[str, Any]] - Combined and processed list of search results
- Example:
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
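To make the dedup and max_results parameters concrete, here is a minimal sketch of merging per-engine results with URL-based deduplication. It is not the actual ResultCollector code: the function name `merge_results`, the keep-first policy, and the sample data are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def merge_results(
    search_results: Dict[str, List[Dict[str, Any]]],
    dedup: bool = True,
    max_results: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Flatten per-engine results into one list (illustrative, not the real method)."""
    merged: List[Dict[str, Any]] = []
    seen_urls = set()
    for results in search_results.values():
        for result in results:
            url = result.get("url")
            if dedup and url is not None:
                if url in seen_urls:
                    continue  # Skip a URL that an earlier engine already returned
                seen_urls.add(url)
            merged.append(result)
    # A max_results of None means "no limit"
    return merged if max_results is None else merged[:max_results]


# Made-up results from two engines that share one URL.
sample = {
    "serper": [{"title": "A", "url": "https://example.com/a", "source": "serper"}],
    "arxiv": [
        {"title": "A again", "url": "https://example.com/a", "source": "arxiv"},
        {"title": "B", "url": "https://example.com/b", "source": "arxiv"},
    ],
}
combined = merge_results(sample, dedup=True, max_results=20)
```

Keeping the first occurrence of each URL is only one possible policy; a real collector might instead prefer the copy with the richest metadata.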
save_results
collector.save_results(results, file_path)
- Description: Saves search results to a JSON file
- Parameters:
  - results (List[Dict[str, Any]]): List of search results
  - file_path (str): Path to save the results
- Example:
collector.save_results(processed_results, "search_results.json")
Planned Interfaces for Research System
ResearchSystem Class
Initialization
rs = ResearchSystem(config=None)
- Description: Initializes the ResearchSystem with optional configuration
- Parameters:
  - config (dict, optional): Configuration options for the research system
- Requirements: Various API keys set in environment variables or config
- Raises: ValueError if required API keys are not set
execute_research
report = rs.execute_research(query, options=None)
- Description: Executes a complete research pipeline from query to report
- Parameters:
  - query (str): The research query
  - options (dict, optional): Options to customize the research process
- Returns: dict - Research report with metadata
- Raises: Various exceptions for different stages of the pipeline
save_report
rs.save_report(report, file_path, format="markdown")
- Description: Saves the research report to a file
- Parameters:
  - report (dict): The research report to save
  - file_path (str): Path to save the report
  - format (str, optional): Format of the report (markdown, html, pdf)
- Raises: IOError if the file cannot be saved
QueryProcessor Class
process_query
structured_query = query_processor.process_query(query)
- Description: Processes a raw query into a structured format
- Parameters:
  - query (str): The raw research query
- Returns: dict - Structured query with metadata
- Raises: ValueError if the query is invalid
SearchStrategy Class
develop_strategy
search_plan = search_strategy.develop_strategy(structured_query)
- Description: Develops a search strategy based on the query
- Parameters:
  - structured_query (dict): The structured query
- Returns: dict - Search plan with target-specific queries
- Raises: ValueError if the query cannot be processed
SearchExecutor Class
execute_search
search_results = search_executor.execute_search(search_plan)
- Description: Executes search queries against selected targets
- Parameters:
  - search_plan (dict): The search plan with queries
- Returns: dict - Collection of search results
- Raises: APIError if the search APIs fail
JinaReranker Class
rerank
ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
- Description: Rerank documents based on their relevance to the query.
- Parameters:
  - query (str): The query to rank documents against
  - documents (List[str]): List of document strings to rerank
  - top_n (Optional[int]): Number of top results to return (optional)
- Returns: List of dictionaries containing reranked documents with scores and indices
rerank_with_metadata
ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
- Description: Rerank documents with metadata based on their relevance to the query.
- Parameters:
  - query (str): The query to rank documents against
  - documents (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
  - document_key (str): The key in the document dictionaries that contains the text content
  - top_n (Optional[int]): Number of top results to return (optional)
- Returns: List of dictionaries containing reranked documents with scores, indices, and original metadata
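Returning the original metadata means each reranked entry has to be mapped back to the dictionary it came from. The sketch below shows one way to do that, assuming the reranker yields entries with an index into the input list and a score; the helper name `attach_metadata` and the hand-written scores are hypothetical.

```python
from typing import Any, Dict, List


def attach_metadata(
    reranked: List[Dict[str, Any]],
    documents: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Copy each original document and add its rerank index and score."""
    enriched = []
    for item in reranked:
        original = documents[item["index"]]
        # Build a new dict so the caller's documents are left unmodified
        enriched.append({**original, "index": item["index"], "score": item["score"]})
    return enriched


documents = [
    {"content": "Document about quantum physics", "url": "https://example.com/physics"},
    {"content": "Document about quantum computing", "url": "https://example.com/computing"},
]
# Hand-written stand-in for a reranker response; the scores are made up.
reranked = [{"index": 1, "score": 0.9}, {"index": 0, "score": 0.4}]
ranked_documents = attach_metadata(reranked, documents)
```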
get_jina_reranker
jina_reranker = get_jina_reranker()
- Description: Get the global Jina Reranker instance.
- Returns: JinaReranker instance
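A "global instance" accessor is commonly written as a lazily created singleton. The following sketch uses functools.lru_cache on a zero-argument factory to get that behaviour; `RerankerStub` and `get_reranker_stub` are hypothetical names, and the real get_jina_reranker may well use a module-level variable instead.

```python
from functools import lru_cache


class RerankerStub:
    """Hypothetical stand-in for JinaReranker so the sketch runs offline."""


@lru_cache(maxsize=None)
def get_reranker_stub() -> RerankerStub:
    """Create the shared instance on first call; later calls return the cached one."""
    return RerankerStub()


first = get_reranker_stub()
second = get_reranker_stub()
```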
DocumentScraper Class
scrape_documents
markdown_documents = document_scraper.scrape_documents(ranked_documents)
- Description: Scrapes and converts documents to markdown
- Parameters:
  - ranked_documents (list): The ranked list of documents to scrape
- Returns: list - Collection of markdown documents
- Raises: ScrapingError if the documents cannot be scraped
DocumentSelector Class
select_documents
selected_documents = document_selector.select_documents(documents_with_scores)
- Description: Selects the most relevant and diverse documents
- Parameters:
  - documents_with_scores (list): Documents with similarity scores
- Returns: list - Curated set of documents
- Raises: ValueError if the selection criteria are invalid
ReportGenerator Class
generate_report
report = report_generator.generate_report(selected_documents, query)
- Description: Generates a research report from selected documents
- Parameters:
  - selected_documents (list): The selected documents
  - query (str): The original query for context
- Returns: dict - Final research report
- Raises: GenerationError if the report cannot be generated
Search Execution Module
SearchExecutor Class
The SearchExecutor class manages the execution of search queries across multiple search engines.
Initialization
executor = SearchExecutor()
- Description: Initializes the search executor with available search handlers
- Requirements: Appropriate API keys must be set for the search engines to be used
execute_search
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
- Description: Executes search queries across specified search engines in parallel
- Parameters:
  - structured_query (Dict[str, Any]): The structured query from the query processor
  - search_engines (Optional[List[str]]): List of search engines to use
  - num_results (int): Number of results to return per search engine
  - timeout (int): Timeout in seconds for each search engine
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
execute_search_async
results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
- Description: Executes search queries across specified search engines asynchronously
- Parameters: Same as execute_search
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
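Running several engines concurrently with a per-engine timeout can be sketched with asyncio. This is an illustration under stated assumptions, not the SearchExecutor implementation: `run_engines`, `fake_engine`, and `broken_engine` are hypothetical, handlers are modelled as plain blocking functions, and returning an empty list for a failed engine is a design choice made here.

```python
import asyncio
from typing import Any, Callable, Dict, List, Tuple

SearchFn = Callable[[str, int], List[Dict[str, Any]]]


async def run_engines(
    handlers: Dict[str, SearchFn],
    query: str,
    num_results: int = 10,
    timeout: float = 30.0,
) -> Dict[str, List[Dict[str, Any]]]:
    """Query every engine concurrently and map engine name to its results."""
    loop = asyncio.get_running_loop()

    async def run_one(name: str, search: SearchFn) -> Tuple[str, List[Dict[str, Any]]]:
        try:
            # Run the blocking call in a worker thread; stop waiting after `timeout` seconds
            future = loop.run_in_executor(None, search, query, num_results)
            return name, await asyncio.wait_for(future, timeout)
        except Exception:
            # Assumption: a failing or slow engine yields [] instead of aborting the batch
            return name, []

    pairs = await asyncio.gather(*(run_one(name, fn) for name, fn in handlers.items()))
    return dict(pairs)


def fake_engine(query: str, num_results: int) -> List[Dict[str, Any]]:
    """Offline stand-in that fabricates results from the query text."""
    return [{"title": f"{query} result {i}", "source": "fake"} for i in range(num_results)]


def broken_engine(query: str, num_results: int) -> List[Dict[str, Any]]:
    raise RuntimeError("simulated API failure")


results = asyncio.run(
    run_engines({"fake": fake_engine, "broken": broken_engine}, "quantum computing", num_results=2)
)
```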
get_available_search_engines
engines = executor.get_available_search_engines()
- Description: Gets a list of available search engines
- Returns: List[str] - List of available search engine names
ResultCollector Class
The ResultCollector class processes and organizes search results from multiple search engines.
Initialization
collector = ResultCollector()
- Description: Initializes the result collector
process_results
processed_results = collector.process_results(search_results, dedup=True, max_results=20)
- Description: Processes search results from multiple search engines
- Parameters:
  - search_results (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - dedup (bool): Whether to deduplicate results based on URL
  - max_results (Optional[int]): Maximum number of results to return
- Returns: List[Dict[str, Any]] - List of processed search results
filter_results
filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
- Description: Filters results based on specified criteria
- Parameters:
  - results (List[Dict[str, Any]]): List of search results
  - filters (Dict[str, Any]): Dictionary of filter criteria
- Returns: List[Dict[str, Any]] - Filtered list of search results
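The filter criteria in the call above ("domains" and "min_score") suggest a straightforward per-result check. The sketch below is one possible reading, assuming each result carries a url and a numeric score field; the name `filter_results_sketch`, the subdomain matching rule, and the placeholder data are assumptions.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def filter_results_sketch(
    results: List[Dict[str, Any]],
    filters: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Keep results that satisfy every criterion present in `filters`."""
    domains = filters.get("domains")
    min_score = filters.get("min_score")
    kept = []
    for result in results:
        host = urlparse(result.get("url", "")).netloc.lower()
        # Accept an exact host match or any subdomain of an allowed domain
        if domains is not None and not any(
            host == d or host.endswith("." + d) for d in domains
        ):
            continue
        if min_score is not None and result.get("score", 0) < min_score:
            continue
        kept.append(result)
    return kept


# Placeholder URLs and scores, not real search results.
results = [
    {"url": "https://arxiv.org/abs/placeholder-1", "score": 7},
    {"url": "https://arxiv.org/abs/placeholder-2", "score": 3},
    {"url": "https://example.com/post", "score": 9},
]
filtered = filter_results_sketch(results, {"domains": ["arxiv.org"], "min_score": 5})
```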
group_results_by_domain
grouped_results = collector.group_results_by_domain(results)
- Description: Groups results by domain
- Parameters:
  - results (List[Dict[str, Any]]): List of search results
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results
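Grouping by domain amounts to extracting the host from each result's URL. A minimal sketch using urllib.parse follows; the function name `group_by_domain`, the "unknown" bucket for results without a usable URL, and the sample data are assumptions rather than documented behaviour.

```python
from collections import defaultdict
from typing import Any, Dict, List
from urllib.parse import urlparse


def group_by_domain(results: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Map each domain to the results whose URL points at it."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for result in results:
        # netloc is empty for missing or relative URLs, so fall back to a catch-all key
        domain = urlparse(result.get("url", "")).netloc.lower() or "unknown"
        grouped[domain].append(result)
    return dict(grouped)


sample_results = [
    {"title": "Paper", "url": "https://arxiv.org/abs/placeholder"},
    {"title": "Post", "url": "https://example.com/post"},
    {"title": "Listing", "url": "https://arxiv.org/list/placeholder"},
    {"title": "No link"},
]
grouped_results = group_by_domain(sample_results)
```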
BaseSearchHandler Interface
The BaseSearchHandler class defines the interface for all search API handlers.
search
results = handler.search(query, num_results=10, **kwargs)
- Description: Executes a search query
- Parameters:
  - query (str): The search query to execute
  - num_results (int): Number of results to return
  - **kwargs: Additional search parameters specific to the API
- Returns: List[Dict[str, Any]] - List of search results
get_name
name = handler.get_name()
- Description: Gets the name of the search handler
- Returns: str - Name of the search handler
is_available
available = handler.is_available()
- Description: Checks if the search API is available
- Returns: bool - True if the API is available, False otherwise
get_rate_limit_info
rate_limits = handler.get_rate_limit_info()
- Description: Gets information about the API's rate limits
- Returns: Dict[str, Any] - Dictionary with rate limit information
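The four methods above can be expressed as an abstract base class so that incomplete handlers fail at instantiation time. The sketch below assumes the interface is built on abc; `SearchHandlerSketch` and `EchoHandler` are hypothetical classes, and the rate-limit key shown is invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SearchHandlerSketch(ABC):
    """Illustrative stand-in for the BaseSearchHandler interface."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        """Execute a search query."""

    @abstractmethod
    def get_name(self) -> str:
        """Return the name of the search handler."""

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if the search API can be used."""

    @abstractmethod
    def get_rate_limit_info(self) -> Dict[str, Any]:
        """Return information about the API's rate limits."""


class EchoHandler(SearchHandlerSketch):
    """Hypothetical offline handler that fabricates results from the query text."""

    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        return [
            {"title": f"{query} #{i + 1}", "url": f"https://example.com/{i + 1}",
             "snippet": "", "source": self.get_name()}
            for i in range(num_results)
        ]

    def get_name(self) -> str:
        return "echo"

    def is_available(self) -> bool:
        return True

    def get_rate_limit_info(self) -> Dict[str, Any]:
        return {"requests_per_minute": None}  # Invented key; no limit for a local stub


handler = EchoHandler()
results = handler.search("quantum computing", num_results=2)
```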
Ranking Module
JinaReranker Class
The JinaReranker class provides document reranking functionality using Jina AI's Reranker API.
Initialization
reranker = JinaReranker(
    api_key=None,  # Optional, will use environment variable if not provided
    model="jina-reranker-v2-base-multilingual",  # Default model
    endpoint="https://api.jina.ai/v1/rerank"  # Default endpoint
)
- Description: Initializes the JinaReranker with the specified API key, model, and endpoint
- Parameters:
  - api_key (Optional[str]): Jina AI API key (defaults to environment variable)
  - model (str): The reranker model to use
  - endpoint (str): The API endpoint
- Requirements: JINA_API_KEY environment variable must be set if api_key is not provided
- Raises: ValueError if API key is not available
rerank
reranked_docs = reranker.rerank(query, documents, top_n=None)
- Description: Reranks a list of documents based on their relevance to the query
- Parameters:
  - query (str): The query string
  - documents (List[str]): List of document strings to rerank
  - top_n (Optional[int]): Number of top documents to return (defaults to all)
- Returns: List[Dict[str, Any]] - List of reranked documents with scores
- Example Return Format:
[
    {
        "index": 0,
        "score": 0.95,
        "document": "Document content here"
    },
    {
        "index": 3,
        "score": 0.82,
        "document": "Another document content"
    }
]
get_jina_reranker
reranker = get_jina_reranker()
- Description: Factory function to get a JinaReranker instance with configuration from the config file
- Returns: JinaReranker - Initialized reranker instance
- Raises: ValueError if API key is not available
Usage Examples
Basic Usage
from ranking.jina_reranker import JinaReranker
# Initialize with the default model and endpoint
reranker = JinaReranker()
# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)
# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
Integration with ResultCollector
from execution.result_collector import ResultCollector
from ranking.jina_reranker import get_jina_reranker
# Initialize components
reranker = get_jina_reranker()
collector = ResultCollector(reranker=reranker)
# Process search results with reranking
reranked_results = collector.process_results(
    search_results,
    dedup=True,
    max_results=20,
    use_reranker=True
)
Testing
# Simple test script
import json
from ranking.jina_reranker import get_jina_reranker
reranker = get_jina_reranker()
query = "What is quantum computing?"
documents = [
"Quantum computing is a type of computation that harnesses quantum mechanics.",
"Classical computers use bits, while quantum computers use qubits.",
"Machine learning is a subset of artificial intelligence."
]
reranked = reranker.rerank(query, documents)
print(json.dumps(reranked, indent=2))
Search Execution Testing
The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.
Test Script (test_search_execution.py)
# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")
# Save test results
save_test_results(results, "search_execution_test_results.json")
- Purpose: Tests the search execution module with various queries
- Features:
- Tests with multiple queries
- Uses all available search engines
- Saves results to a JSON file
- Provides detailed output of search results
UI Module
GradioInterface Class
Initialization
from ui.gradio_interface import GradioInterface
interface = GradioInterface()
- Description: Initializes the Gradio interface for the research system
- Requirements: Gradio library installed
process_query
markdown_results, results_file = interface.process_query(query, num_results=10)
- Description: Processes a query and returns the results
- Parameters:
  - query (str): The query to process
  - num_results (int): Number of results to return
- Returns:
  - markdown_results (str): Markdown formatted results
  - results_file (str): Path to the JSON file with saved results
- Example:
results, file_path = interface.process_query("What are the latest advancements in quantum computing?", num_results=15)
create_interface
interface_blocks = interface.create_interface()
- Description: Creates and returns the Gradio interface
- Returns: gr.Blocks - The Gradio interface object
- Example:
blocks = interface.create_interface()
blocks.launch()
launch
interface.launch(share=True, server_port=7860, debug=False)
- Description: Launches the Gradio interface
- Parameters:
  - share (bool): Whether to create a public link for sharing
  - server_port (int): Port to run the server on
  - debug (bool): Whether to run in debug mode
- Example:
interface.launch(share=True)
Running the UI
python run_ui.py --share --port 7860
- Description: Runs the Gradio interface
- Parameters:
  - --share: Create a public link for sharing
  - --port: Port to run the server on (default: 7860)
  - --debug: Run in debug mode
- Example:
python run_ui.py --share
Document Ranking Interface
JinaReranker
The JinaReranker class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.
Methods
def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """

def rerank_with_metadata(query: str, documents: List[Dict[str, Any]],
                         document_key: str = 'content',
                         top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """
Factory Function
def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.

    Returns:
        JinaReranker instance
    """
Example Usage
from ranking.jina_reranker import get_jina_reranker
# Get the reranker
reranker = get_jina_reranker()
# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)
# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
Report Generation Module
ReportDetailLevelManager Class
The ReportDetailLevelManager class manages configurations for different report detail levels.
Initialization
detail_level_manager = get_report_detail_level_manager()
- Description: Gets a singleton instance of the ReportDetailLevelManager
get_detail_level_config
config = detail_level_manager.get_detail_level_config(detail_level)
- Description: Gets configuration parameters for a specific detail level
- Parameters:
  - detail_level (str): Detail level as a string (brief, standard, detailed, comprehensive)
- Returns: Dict[str, Any] - Configuration parameters for the specified detail level
- Raises: ValueError if the detail level is not valid
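The lookup-or-raise behaviour described above can be sketched as a table keyed by detail level. The level names come from this document; everything else, including the `num_results` setting, its values, the case-insensitive matching, and the function name `lookup_detail_level`, is a hypothetical placeholder.

```python
from typing import Any, Dict

# Hypothetical settings; the real configuration keys and values are not documented here.
DETAIL_LEVELS: Dict[str, Dict[str, Any]] = {
    "brief": {"num_results": 3},
    "standard": {"num_results": 7},
    "detailed": {"num_results": 12},
    "comprehensive": {"num_results": 20},
}


def lookup_detail_level(detail_level: str) -> Dict[str, Any]:
    """Return the settings for a detail level, or raise ValueError if it is unknown."""
    key = detail_level.strip().lower()
    if key not in DETAIL_LEVELS:
        valid = ", ".join(DETAIL_LEVELS)
        raise ValueError(f"Invalid detail level {detail_level!r}; expected one of: {valid}")
    return dict(DETAIL_LEVELS[key])  # Copy so callers cannot mutate the shared table


config = lookup_detail_level("Standard")
```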
get_template_modifier
template = detail_level_manager.get_template_modifier(detail_level, query_type)
- Description: Gets template modifier for a specific detail level and query type
- Parameters:
  - detail_level (str): Detail level as a string (brief, standard, detailed, comprehensive)
  - query_type (str): Query type as a string (factual, exploratory, comparative)
- Returns: str - Template modifier as a string
- Raises: ValueError if the detail level or query type is not valid
get_available_detail_levels
levels = detail_level_manager.get_available_detail_levels()
- Description: Gets a list of available detail levels with descriptions
- Returns: List[Tuple[str, str]] - List of tuples containing detail level and description
ReportGenerator Class
The ReportGenerator class generates reports from search results.
Initialization
report_generator = get_report_generator()
- Description: Gets a singleton instance of the ReportGenerator
initialize
await report_generator.initialize()
- Description: Initializes the report generator by setting up the database
- Returns: None
set_detail_level
report_generator.set_detail_level(detail_level)
- Description: Sets the detail level for report generation
- Parameters:
  - detail_level (str): Detail level (brief, standard, detailed, comprehensive)
- Returns: None
- Raises: ValueError if the detail level is not valid
get_detail_level_config
config = report_generator.get_detail_level_config()
- Description: Gets the current detail level configuration
- Returns: Dict[str, Any] - Configuration parameters for the current detail level
get_available_detail_levels
levels = report_generator.get_available_detail_levels()
- Description: Gets a list of available detail levels with descriptions
- Returns: List[Tuple[str, str]] - List of tuples containing detail level and description
process_search_results
documents = await report_generator.process_search_results(search_results)
- Description: Processes search results by scraping the URLs and storing them in the database
- Parameters:
  - search_results (List[Dict[str, Any]]): List of search results, each containing at least a 'url' field
- Returns: List[Dict[str, Any]] - List of processed documents
prepare_documents_for_report
chunks = await report_generator.prepare_documents_for_report(search_results, token_budget, chunk_size, overlap_size)
- Description: Prepares documents for report generation by chunking and selecting relevant content
- Parameters:
  - search_results (List[Dict[str, Any]]): List of search results
  - token_budget (Optional[int]): Maximum number of tokens to use
  - chunk_size (Optional[int]): Maximum number of tokens per chunk
  - overlap_size (Optional[int]): Number of tokens to overlap between chunks
- Returns: List[Dict[str, Any]] - List of selected document chunks
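The interplay of chunk_size, overlap_size, and token_budget is easiest to see on a small example. The sketch below splits a token list into overlapping windows and then keeps windows until the budget is spent. It treats whitespace-separated words as tokens and selects chunks in document order, both of which are simplifying assumptions; the function names are hypothetical and the real method may choose chunks by relevance instead.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap_size: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share overlap_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reaches the end of the text
    return chunks


def select_within_budget(chunks: List[List[str]], token_budget: int) -> List[List[str]]:
    """Keep chunks in order until adding another would exceed token_budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > token_budget:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap_size=1)
selected = select_within_budget(chunks, token_budget=8)
```

With ten tokens, a chunk size of 4, and an overlap of 1, the windows start at positions 0, 3, and 6, so neighbouring chunks share exactly one token.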
generate_report
report = await report_generator.generate_report(
    search_results=search_results,
    query=query,
    token_budget=token_budget,
    chunk_size=chunk_size,
    overlap_size=overlap_size,
    detail_level=detail_level
)
- Description: Generates a report from search results
- Parameters:
  - search_results (List[Dict[str, Any]]): List of search results
  - query (str): Original search query
  - token_budget (Optional[int]): Maximum number of tokens to use
  - chunk_size (Optional[int]): Maximum number of tokens per chunk
  - overlap_size (Optional[int]): Number of tokens to overlap between chunks
  - detail_level (Optional[str]): Level of detail for the report (brief, standard, detailed, comprehensive)
- Returns: str - Generated report as a string
initialize_report_generator
await initialize_report_generator()
- Description: Initializes the global report generator instance
- Returns: None
get_report_generator
report_generator = get_report_generator()
- Description: Gets the global report generator instance
- Returns: ReportGenerator - Initialized report generator instance