# Component Interfaces

## Current Interfaces
### JinaSimilarity Class

#### Initialization

```python
js = JinaSimilarity()
```

- Description: Initializes the JinaSimilarity class
- Requirements: `JINA_API_KEY` environment variable must be set
- Raises: ValueError if `JINA_API_KEY` is not set
#### count_tokens

```python
token_count = js.count_tokens(text)
```

- Description: Counts the number of tokens in a text
- Parameters:
  - `text` (str): The text to count tokens for
- Returns: int - Number of tokens in the text
- Dependencies: tiktoken library
#### get_embedding

```python
embedding = js.get_embedding(text)
```

- Description: Generates an embedding for a text using Jina AI's Embeddings API
- Parameters:
  - `text` (str): The text to generate an embedding for (max 8,192 tokens)
- Returns: list - The embedding vector
- Raises:
  - `TokenLimitError`: If the text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API call fails
- Dependencies: requests library, Jina AI API
#### compute_similarity

```python
similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
```

- Description: Computes similarity between a text chunk and a query
- Parameters:
  - `chunk` (str): The text chunk to compare against
  - `query` (str): The query text
- Returns: Tuple containing:
  - `similarity` (float): Cosine similarity score (0-1)
  - `chunk_embedding` (list): Chunk embedding
  - `query_embedding` (list): Query embedding
- Raises:
  - `TokenLimitError`: If either text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API calls fail
- Dependencies: numpy library, get_embedding method
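The methods above are typically used together. A minimal usage sketch, assuming the class lives in a `jina_similarity` module (adjust the import to the actual repository layout) and that `JINA_API_KEY` is set:

```python
# Illustrative usage sketch; the module path `jina_similarity` is an assumption.
from jina_similarity import JinaSimilarity, TokenLimitError

js = JinaSimilarity()

chunk = "Quantum computers use qubits to represent information."
query = "How do quantum computers store data?"

try:
    similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
    print(f"Tokens in chunk: {js.count_tokens(chunk)}")
    print(f"Cosine similarity: {similarity:.3f}")
except TokenLimitError:
    print("One of the texts exceeds the 8,192-token limit.")
```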
### Markdown Segmenter

#### segment_markdown

```python
segments = segment_markdown(file_path)
```

- Description: Segments a markdown file using Jina AI's Segmenter API
- Parameters:
  - `file_path` (str): Path to the markdown file
- Returns: dict - JSON structure containing the segments
- Raises: Exception if segmentation fails
- Dependencies: requests library, Jina AI API
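A brief usage sketch; the input path is hypothetical, and the exact shape of the returned JSON should be inspected rather than assumed:

```python
import json

# Hypothetical input file; segment_markdown raises an exception if the
# Segmenter API call fails.
segments = segment_markdown("docs/example.md")

# Inspect the returned JSON structure before relying on specific keys
print(json.dumps(segments, indent=2)[:500])
```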
### Test Similarity Script

#### Command-line Interface

```bash
python test_similarity.py chunk_file query_file [--verbose]
```

- Description: Computes similarity between text from two files
- Arguments:
  - `chunk_file`: Path to the file containing the text chunk
  - `query_file`: Path to the file containing the query
  - `--verbose` or `-v`: Print token counts and embeddings
- Output: Similarity score and optional verbose information
- Dependencies: JinaSimilarity class
#### read_file

```python
content = read_file(file_path)
```

- Description: Reads content from a file
- Parameters:
  - `file_path` (str): Path to the file to read
- Returns: str - Content of the file
- Raises: FileNotFoundError if the file doesn't exist
## Search Execution Module

### SearchExecutor Class

#### Initialization

```python
from execution.search_executor import SearchExecutor
executor = SearchExecutor()
```

- Description: Initializes the SearchExecutor class
- Requirements: Configuration file with API keys for search engines

#### execute_search

```python
results = executor.execute_search(query_data)
```

- Description: Executes a search across multiple search engines
- Parameters:
  - `query_data` (dict): Dictionary containing query information with keys:
    - `raw_query` (str): The original user query
    - `enhanced_query` (str): The enhanced query from the LLM
    - `search_engines` (list, optional): List of search engines to use
    - `num_results` (int, optional): Number of results to return per engine
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
- Example:

```python
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})
```
### BaseSearchHandler Class

#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- Description: Abstract method for searching, implemented by all handlers
- Parameters:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters specific to the search engine
- Returns: List[Dict[str, Any]] - List of search results
- Example:

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)
```
### SerperSearchHandler Class

#### search

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- Description: Executes a search using the Serper API
- Parameters:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Serper API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the result
  - `url` (str): URL of the result
  - `snippet` (str): Snippet of text from the result
  - `source` (str): Source of the result (always "serper")
- Requirements: Serper API key in configuration
- Example:

```python
results = handler.search("quantum computing", num_results=5)
```
### ScholarSearchHandler Class

#### search

```python
from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- Description: Executes a search on Google Scholar using the Serper API
- Parameters:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Scholar API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `snippet` (str): Snippet of text from the paper
  - `source` (str): Source of the result (always "scholar")
  - `authors` (str): Authors of the paper
  - `publication` (str): Publication venue
  - `year` (int): Publication year
- Requirements: Serper API key in configuration
- Example:

```python
results = handler.search("quantum computing", num_results=5)
```
### ArxivSearchHandler Class

#### search

```python
from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- Description: Executes a search on arXiv
- Parameters:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the arXiv API
- Returns: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `pdf_url` (str): URL to the PDF
  - `snippet` (str): Abstract of the paper
  - `source` (str): Source of the result (always "arxiv")
  - `arxiv_id` (str): arXiv ID
  - `authors` (list): List of author names
  - `categories` (list): List of arXiv categories
  - `published_date` (str): Publication date
  - `updated_date` (str): Last update date
  - `full_text` (str): Full abstract text
- Example:

```python
results = handler.search("quantum computing", num_results=5)
```
### ResultCollector Class

#### process_results

```python
from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
```

- Description: Processes search results from multiple search engines
- Parameters:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- Returns: List[Dict[str, Any]] - Combined and processed list of search results
- Example:

```python
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
```

#### save_results

```python
collector.save_results(results, file_path)
```

- Description: Saves search results to a JSON file
- Parameters:
  - `results` (List[Dict[str, Any]]): List of search results
  - `file_path` (str): Path to save the results
- Example:

```python
collector.save_results(processed_results, "search_results.json")
```
## Planned Interfaces for Research System

### ResearchSystem Class

#### Initialization

```python
rs = ResearchSystem(config=None)
```

- Description: Initializes the ResearchSystem with optional configuration
- Parameters:
  - `config` (dict, optional): Configuration options for the research system
- Requirements: Various API keys set in environment variables or config
- Raises: ValueError if required API keys are not set
#### execute_research

```python
report = rs.execute_research(query, options=None)
```

- Description: Executes a complete research pipeline from query to report
- Parameters:
  - `query` (str): The research query
  - `options` (dict, optional): Options to customize the research process
- Returns: dict - Research report with metadata
- Raises: Various exceptions for different stages of the pipeline

#### save_report

```python
rs.save_report(report, file_path, format="markdown")
```

- Description: Saves the research report to a file
- Parameters:
  - `report` (dict): The research report to save
  - `file_path` (str): Path to save the report
  - `format` (str, optional): Format of the report (markdown, html, pdf)
- Raises: IOError if the file cannot be saved
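Because `ResearchSystem` is planned rather than implemented, the following is only a sketch of the intended end-to-end call pattern; the option key shown is hypothetical:

```python
# Planned usage sketch; ResearchSystem is not implemented yet.
rs = ResearchSystem()

report = rs.execute_research(
    "What are the latest advancements in quantum computing?",
    options={"max_documents": 20},  # hypothetical option key
)

rs.save_report(report, "quantum_computing_report.md", format="markdown")
```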
### QueryProcessor Class

#### process_query

```python
structured_query = query_processor.process_query(query)
```

- Description: Processes a raw query into a structured format
- Parameters:
  - `query` (str): The raw research query
- Returns: dict - Structured query with metadata
- Raises: ValueError if the query is invalid
### SearchStrategy Class

#### develop_strategy

```python
search_plan = search_strategy.develop_strategy(structured_query)
```

- Description: Develops a search strategy based on the query
- Parameters:
  - `structured_query` (dict): The structured query
- Returns: dict - Search plan with target-specific queries
- Raises: ValueError if the query cannot be processed
### SearchExecutor Class

#### execute_search

```python
search_results = search_executor.execute_search(search_plan)
```

- Description: Executes search queries against selected targets
- Parameters:
  - `search_plan` (dict): The search plan with queries
- Returns: dict - Collection of search results
- Raises: APIError if the search APIs fail
### JinaReranker Class

#### rerank

```python
ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
```

- Description: Reranks documents based on their relevance to the query
- Parameters:
  - `query` (str): The query to rank documents against
  - `documents` (List[str]): List of document strings to rerank
  - `top_n` (Optional[int]): Number of top results to return (optional)
- Returns: List of dictionaries containing reranked documents with scores and indices

#### rerank_with_metadata

```python
ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
```

- Description: Reranks documents with metadata based on their relevance to the query
- Parameters:
  - `query` (str): The query to rank documents against
  - `documents` (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
  - `document_key` (str): The key in the document dictionaries that contains the text content
  - `top_n` (Optional[int]): Number of top results to return (optional)
- Returns: List of dictionaries containing reranked documents with scores, indices, and original metadata

#### get_jina_reranker

```python
jina_reranker = get_jina_reranker()
```

- Description: Gets the global Jina Reranker instance
- Returns: JinaReranker instance
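A sketch of how `rerank_with_metadata` is intended to be called, mirroring the `rerank` example later in this document; the exact keys of each returned dictionary are not specified here, so the sketch prints the whole result:

```python
from ranking.jina_reranker import get_jina_reranker

reranker = get_jina_reranker()

documents = [
    {"content": "Document about quantum physics", "url": "https://example.org/physics"},
    {"content": "Document about quantum computing", "url": "https://example.org/computing"},
]

ranked = reranker.rerank_with_metadata(
    query="What is quantum computing?",
    documents=documents,
    document_key="content",
    top_n=1,
)

# Each entry carries the relevance score, index, and the original metadata
for result in ranked:
    print(result)
```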
### DocumentScraper Class

#### scrape_documents

```python
markdown_documents = document_scraper.scrape_documents(ranked_documents)
```

- Description: Scrapes and converts documents to markdown
- Parameters:
  - `ranked_documents` (list): The ranked list of documents to scrape
- Returns: list - Collection of markdown documents
- Raises: ScrapingError if the documents cannot be scraped
### DocumentSelector Class

#### select_documents

```python
selected_documents = document_selector.select_documents(documents_with_scores)
```

- Description: Selects the most relevant and diverse documents
- Parameters:
  - `documents_with_scores` (list): Documents with similarity scores
- Returns: list - Curated set of documents
- Raises: ValueError if the selection criteria are invalid
### ReportGenerator Class

#### generate_report

```python
report = report_generator.generate_report(selected_documents, query)
```

- Description: Generates a research report from selected documents
- Parameters:
  - `selected_documents` (list): The selected documents
  - `query` (str): The original query for context
- Returns: dict - Final research report
- Raises: GenerationError if the report cannot be generated
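Taken together, the planned components are meant to chain into a single pipeline. The sketch below shows only that intended flow; component construction, intermediate data shapes, and the `search_results_to_texts` glue helper are all assumptions, not part of the planned interfaces:

```python
# Planned pipeline sketch; none of these components are implemented yet.
query = "What are the latest advancements in quantum computing?"

structured_query = query_processor.process_query(query)
search_plan = search_strategy.develop_strategy(structured_query)
search_results = search_executor.execute_search(search_plan)

# Rank the raw results against the query, then scrape the ranked set to markdown.
# search_results_to_texts is a hypothetical helper for extracting document text.
ranked_documents = jina_reranker.rerank(query, search_results_to_texts(search_results))
markdown_documents = document_scraper.scrape_documents(ranked_documents)

# Select a relevant, diverse subset and generate the final report
selected_documents = document_selector.select_documents(markdown_documents)
report = report_generator.generate_report(selected_documents, query)
```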
## Search Execution Module

### SearchExecutor Class

The `SearchExecutor` class manages the execution of search queries across multiple search engines.

#### Initialization

```python
executor = SearchExecutor()
```

- Description: Initializes the search executor with available search handlers
- Requirements: Appropriate API keys must be set for the search engines to be used

#### execute_search

```python
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
```

- Description: Executes search queries across specified search engines in parallel
- Parameters:
  - `structured_query` (Dict[str, Any]): The structured query from the query processor
  - `search_engines` (Optional[List[str]]): List of search engines to use
  - `num_results` (int): Number of results to return per search engine
  - `timeout` (int): Timeout in seconds for each search engine
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
#### execute_search_async

```python
results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
```

- Description: Executes search queries across specified search engines asynchronously
- Parameters: Same as `execute_search`
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results

#### get_available_search_engines

```python
engines = executor.get_available_search_engines()
```

- Description: Gets a list of available search engines
- Returns: List[str] - List of available search engine names
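A short sketch combining the methods above: check which engines are configured, then restrict the search to them. The `structured_query` value is assumed to come from the query processor:

```python
executor = SearchExecutor()

# Only query engines that are actually configured with API keys
available = executor.get_available_search_engines()
print(f"Available engines: {available}")

results = executor.execute_search(
    structured_query,           # produced earlier by the query processor
    search_engines=available,
    num_results=10,
)

for engine, engine_results in results.items():
    print(f"{engine}: {len(engine_results)} results")
```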
### ResultCollector Class

The `ResultCollector` class processes and organizes search results from multiple search engines.

#### Initialization

```python
collector = ResultCollector()
```

- Description: Initializes the result collector

#### process_results

```python
processed_results = collector.process_results(search_results, dedup=True, max_results=20)
```

- Description: Processes search results from multiple search engines
- Parameters:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- Returns: List[Dict[str, Any]] - List of processed search results
#### filter_results

```python
filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
```

- Description: Filters results based on specified criteria
- Parameters:
  - `results` (List[Dict[str, Any]]): List of search results
  - `filters` (Dict[str, Any]): Dictionary of filter criteria
- Returns: List[Dict[str, Any]] - Filtered list of search results

#### group_results_by_domain

```python
grouped_results = collector.group_results_by_domain(results)
```

- Description: Groups results by domain
- Parameters:
  - `results` (List[Dict[str, Any]]): List of search results
- Returns: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results
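A combined sketch of the collector methods above; the filter keys follow the `filter_results` example, and `results` is assumed to be the per-engine dictionary returned by the executor:

```python
collector = ResultCollector()

# Merge, deduplicate, and cap the per-engine results
processed = collector.process_results(results, dedup=True, max_results=50)

# Keep only arXiv results, then see how the full set spreads across domains
arxiv_only = collector.filter_results(processed, filters={"domains": ["arxiv.org"]})
by_domain = collector.group_results_by_domain(processed)

for domain, domain_results in by_domain.items():
    print(f"{domain}: {len(domain_results)} results")
```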
### BaseSearchHandler Interface

The `BaseSearchHandler` class defines the interface for all search API handlers.

#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- Description: Executes a search query
- Parameters:
  - `query` (str): The search query to execute
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional search parameters specific to the API
- Returns: List[Dict[str, Any]] - List of search results

#### get_name

```python
name = handler.get_name()
```

- Description: Gets the name of the search handler
- Returns: str - Name of the search handler

#### is_available

```python
available = handler.is_available()
```

- Description: Checks if the search API is available
- Returns: bool - True if the API is available, False otherwise

#### get_rate_limit_info

```python
rate_limits = handler.get_rate_limit_info()
```

- Description: Gets information about the API's rate limits
- Returns: Dict[str, Any] - Dictionary with rate limit information
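A sketch of using the interface defensively, checking availability and rate limits before issuing a query (shown with the Serper handler):

```python
from execution.api_handlers.serper_handler import SerperSearchHandler

handler = SerperSearchHandler()

if handler.is_available():
    print(f"{handler.get_name()} rate limits: {handler.get_rate_limit_info()}")
    results = handler.search("quantum computing", num_results=5)
else:
    print(f"{handler.get_name()} is not configured; skipping.")
```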
## Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

### Test Script (test_search_execution.py)

```python
# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
```

- Purpose: Tests the search execution module with various queries
- Features:
  - Tests with multiple queries
  - Uses all available search engines
  - Saves results to a JSON file
  - Provides detailed output of search results
## UI Module

### GradioInterface Class

#### Initialization

```python
from ui.gradio_interface import GradioInterface
interface = GradioInterface()
```

- Description: Initializes the Gradio interface for the research system
- Requirements: Gradio library installed

#### process_query

```python
markdown_results, results_file = interface.process_query(query, num_results=10)
```

- Description: Processes a query and returns the results
- Parameters:
  - `query` (str): The query to process
  - `num_results` (int): Number of results to return
- Returns:
  - `markdown_results` (str): Markdown formatted results
  - `results_file` (str): Path to the JSON file with saved results
- Example:

```python
results, file_path = interface.process_query("What are the latest advancements in quantum computing?", num_results=15)
```
#### create_interface

```python
interface_blocks = interface.create_interface()
```

- Description: Creates and returns the Gradio interface
- Returns: `gr.Blocks` - The Gradio interface object
- Example:

```python
blocks = interface.create_interface()
blocks.launch()
```
#### launch

```python
interface.launch(share=True, server_port=7860, debug=False)
```

- Description: Launches the Gradio interface
- Parameters:
  - `share` (bool): Whether to create a public link for sharing
  - `server_port` (int): Port to run the server on
  - `debug` (bool): Whether to run in debug mode
- Example:

```python
interface.launch(share=True)
```
### Running the UI

```bash
python run_ui.py --share --port 7860
```

- Description: Runs the Gradio interface
- Parameters:
  - `--share`: Create a public link for sharing
  - `--port`: Port to run the server on (default: 7860)
  - `--debug`: Run in debug mode
- Example:

```bash
python run_ui.py --share
```
## Document Ranking Interface

### JinaReranker

The `JinaReranker` class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

#### Methods

```python
def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """

def rerank_with_metadata(query: str, documents: List[Dict[str, Any]],
                         document_key: str = 'content',
                         top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """
```

#### Factory Function

```python
def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.

    Returns:
        JinaReranker instance
    """
```
#### Example Usage

```python
from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
```
## Query Processor Testing
The query processor module has been tested with the Groq LLM provider to ensure it functions correctly with the newly integrated models.
### Test Scripts
Two test scripts have been created to validate the query processor functionality:
#### Basic Test Script (test_query_processor.py)
```python
# Get the query processor
processor = get_query_processor()

# Process a query
result = processor.process_query("What are the latest advancements in quantum computing?")

# Generate search queries
search_result = processor.generate_search_queries(result, ["google", "bing", "scholar"])
```

- Purpose: Tests the core functionality of the query processor
- Features:
  - Uses monkey patching to ensure the Groq model is used
  - Provides detailed output of processing results
#### Comprehensive Test Script (test_query_processor_comprehensive.py)

```python
# Test query enhancement
enhanced_query = test_enhance_query("What is quantum computing?")

# Test query classification
classification = test_classify_query("What is quantum computing?")

# Test the full processing pipeline
structured_query = test_process_query("What is quantum computing?")

# Test search query generation
search_result = test_generate_search_queries(structured_query, ["google", "bing", "scholar"])
```

- Purpose: Tests all aspects of the query processor in detail
- Features:
  - Tests individual components in isolation
  - Tests a variety of query types
  - Saves detailed test results to a JSON file
## LLM Interface

### LLMInterface Class

The `LLMInterface` class provides a unified interface for interacting with various LLM providers through LiteLLM.

#### Initialization

```python
llm = LLMInterface(model_name="gpt-4")
```

- Description: Initializes the LLM interface with the specified model
- Parameters:
  - `model_name` (Optional[str]): The name of the model to use (defaults to config value)
- Requirements: Appropriate API key must be set in environment or config
#### complete

```python
response = llm.complete(prompt, system_prompt=None, temperature=None, max_tokens=None)
```

- Description: Generates a completion for the given prompt
- Parameters:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `temperature` (Optional[float]): Temperature for generation
  - `max_tokens` (Optional[int]): Maximum tokens to generate
- Returns: str - The generated completion
- Raises: LLMError if the completion fails
#### complete_json

```python
json_response = llm.complete_json(prompt, system_prompt=None, json_schema=None)
```

- Description: Generates a JSON response for the given prompt
- Parameters:
  - `prompt` (str): The prompt to complete
  - `system_prompt` (Optional[str]): System prompt for context
  - `json_schema` (Optional[Dict]): JSON schema for validation
- Returns: Dict - The generated JSON response
- Raises: LLMError if the completion fails or JSON is invalid
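A brief sketch of `complete_json`; the schema below is a hypothetical example for illustration, not one the system ships with:

```python
# Hypothetical schema used only for illustration
schema = {
    "type": "object",
    "properties": {
        "topic": {"type": "string"},
        "subtopics": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["topic", "subtopics"],
}

structured = llm.complete_json(
    prompt="Break 'quantum computing' into three research subtopics.",
    system_prompt="Respond only with valid JSON.",
    json_schema=schema,
)
print(structured["subtopics"])
```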
#### Supported Providers
- OpenAI
- Azure OpenAI
- Anthropic
- Ollama
- Groq
- OpenRouter
#### Example Usage

```python
from query.llm_interface import LLMInterface

# Initialize with specific model
llm = LLMInterface(model_name="llama-3.1-8b-instant")

# Generate a completion
response = llm.complete(
    prompt="Explain quantum computing",
    system_prompt="You are a helpful assistant that explains complex topics simply.",
    temperature=0.7
)

print(response)
```