# Component Interfaces

## Current Interfaces

### JinaSimilarity Class

#### Initialization

```python
js = JinaSimilarity()
```

- **Description**: Initializes the JinaSimilarity class
- **Requirements**: JINA_API_KEY environment variable must be set
- **Raises**: ValueError if JINA_API_KEY is not set

#### count_tokens

```python
token_count = js.count_tokens(text)
```

- **Description**: Counts the number of tokens in a text
- **Parameters**:
  - `text` (str): The text to count tokens for
- **Returns**: int - Number of tokens in the text
- **Dependencies**: tiktoken library

#### get_embedding

```python
embedding = js.get_embedding(text)
```

- **Description**: Generates an embedding for a text using Jina AI's Embeddings API
- **Parameters**:
  - `text` (str): The text to generate an embedding for (max 8,192 tokens)
- **Returns**: list - The embedding vector
- **Raises**:
  - `TokenLimitError`: If the text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API call fails
- **Dependencies**: requests library, Jina AI API

#### compute_similarity

```python
similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
```

- **Description**: Computes similarity between a text chunk and a query
- **Parameters**:
  - `chunk` (str): The text chunk to compare against
  - `query` (str): The query text
- **Returns**: Tuple containing:
  - `similarity` (float): Cosine similarity score (0-1)
  - `chunk_embedding` (list): Chunk embedding
  - `query_embedding` (list): Query embedding
- **Raises**:
  - `TokenLimitError`: If either text exceeds 8,192 tokens
  - `requests.exceptions.RequestException`: If the API calls fail
- **Dependencies**: numpy library, get_embedding method
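
A minimal end-to-end sketch tying the methods above together. The import path is an assumption (adjust it to wherever `JinaSimilarity` and `TokenLimitError` live in the project), and `JINA_API_KEY` must already be set in the environment.

```python
# Assumed import path; adjust to the actual module location.
from jina_similarity import JinaSimilarity, TokenLimitError

# JINA_API_KEY must be set in the environment before this call, or it raises ValueError.
js = JinaSimilarity()

chunk = "Quantum computers use qubits, which can represent 0 and 1 at the same time."
query = "How do quantum computers store information?"

# Check token counts before calling the embeddings API (limit: 8,192 tokens per text).
print("chunk tokens:", js.count_tokens(chunk))
print("query tokens:", js.count_tokens(query))

try:
    similarity, chunk_embedding, query_embedding = js.compute_similarity(chunk, query)
    print(f"similarity: {similarity:.4f}")
    print(f"embedding dimensions: {len(chunk_embedding)}")
except TokenLimitError as exc:
    print(f"Input exceeds the token limit: {exc}")
```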

### Markdown Segmenter

#### segment_markdown

```python
segments = segment_markdown(file_path)
```

- **Description**: Segments a markdown file using Jina AI's Segmenter API
- **Parameters**:
  - `file_path` (str): Path to the markdown file
- **Returns**: dict - JSON structure containing the segments
- **Raises**: Exception if segmentation fails
- **Dependencies**: requests library, Jina AI API
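
As a rough illustration of the call above, the sketch below segments a file and walks the returned structure; the import path and the exact shape of the returned JSON (a `chunks` list is assumed here) depend on the actual implementation.

```python
# Assumed import path; adjust to the actual module location.
from markdown_segmenter import segment_markdown

segments = segment_markdown("docs/example.md")

# The Segmenter API returns a JSON structure; a "chunks" list is assumed here for illustration.
for i, chunk in enumerate(segments.get("chunks", [])):
    print(f"--- segment {i} ---")
    print(chunk[:100])
```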

### Test Similarity Script

#### Command-line Interface

```
python test_similarity.py chunk_file query_file [--verbose]
```

- **Description**: Computes similarity between text from two files
- **Arguments**:
  - `chunk_file`: Path to the file containing the text chunk
  - `query_file`: Path to the file containing the query
  - `--verbose` or `-v`: Print token counts and embeddings
- **Output**: Similarity score and optional verbose information
- **Dependencies**: JinaSimilarity class

#### read_file

```python
content = read_file(file_path)
```

- **Description**: Reads content from a file
- **Parameters**:
  - `file_path` (str): Path to the file to read
- **Returns**: str - Content of the file
- **Raises**: FileNotFoundError if the file doesn't exist

## Search Execution Module

### SearchExecutor Class

#### Initialization

```python
from execution.search_executor import SearchExecutor
executor = SearchExecutor()
```

- **Description**: Initializes the SearchExecutor class
- **Requirements**: Configuration file with API keys for search engines

#### execute_search

```python
results = executor.execute_search(query_data)
```

- **Description**: Executes a search across multiple search engines
- **Parameters**:
  - `query_data` (dict): Dictionary containing query information with keys:
    - `raw_query` (str): The original user query
    - `enhanced_query` (str): The enhanced query from the LLM
    - `search_engines` (list, optional): List of search engines to use
    - `num_results` (int, optional): Number of results to return per engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
- **Example**:

```python
results = executor.execute_search({
    'raw_query': 'quantum computing',
    'enhanced_query': 'recent advancements in quantum computing algorithms and hardware'
})
```

### BaseSearchHandler Class

#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Abstract search method implemented by all handlers
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters specific to the search engine
- **Returns**: List[Dict[str, Any]] - List of search results
- **Example**:

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search("quantum computing", num_results=5)
```

### SerperSearchHandler Class

#### search

```python
from execution.api_handlers.serper_handler import SerperSearchHandler
handler = SerperSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Serper API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the result
  - `url` (str): URL of the result
  - `snippet` (str): Snippet of text from the result
  - `source` (str): Source of the result (always "serper")
- **Requirements**: Serper API key in configuration
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ScholarSearchHandler Class

#### search

```python
from execution.api_handlers.scholar_handler import ScholarSearchHandler
handler = ScholarSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search on Google Scholar using the Serper API
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the Scholar API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `snippet` (str): Snippet of text from the paper
  - `source` (str): Source of the result (always "scholar")
  - `authors` (str): Authors of the paper
  - `publication` (str): Publication venue
  - `year` (int): Publication year
- **Requirements**: Serper API key in configuration
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ArxivSearchHandler Class

#### search

```python
from execution.api_handlers.arxiv_handler import ArxivSearchHandler
handler = ArxivSearchHandler()
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search on arXiv
- **Parameters**:
  - `query` (str): The search query
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional parameters for the arXiv API
- **Returns**: List[Dict[str, Any]] - List of search results with keys:
  - `title` (str): Title of the paper
  - `url` (str): URL of the paper
  - `pdf_url` (str): URL to the PDF
  - `snippet` (str): Abstract of the paper
  - `source` (str): Source of the result (always "arxiv")
  - `arxiv_id` (str): arXiv ID
  - `authors` (list): List of author names
  - `categories` (list): List of arXiv categories
  - `published_date` (str): Publication date
  - `updated_date` (str): Last update date
  - `full_text` (str): Full abstract text
- **Example**:

```python
results = handler.search("quantum computing", num_results=5)
```

### ResultCollector Class

#### process_results

```python
from execution.result_collector import ResultCollector
collector = ResultCollector()
processed_results = collector.process_results(search_results, dedup=True, max_results=None)
```

- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - Combined and processed list of search results
- **Example**:

```python
processed_results = collector.process_results({
    'serper': serper_results,
    'scholar': scholar_results,
    'arxiv': arxiv_results
}, dedup=True, max_results=20)
```

#### save_results

```python
collector.save_results(results, file_path)
```

- **Description**: Saves search results to a JSON file
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `file_path` (str): Path to save the results
- **Example**:

```python
collector.save_results(processed_results, "search_results.json")
```

## Planned Interfaces for Research System

### ResearchSystem Class

#### Initialization

```python
rs = ResearchSystem(config=None)
```

- **Description**: Initializes the ResearchSystem with optional configuration
- **Parameters**:
  - `config` (dict, optional): Configuration options for the research system
- **Requirements**: Various API keys set in environment variables or config
- **Raises**: ValueError if required API keys are not set

#### execute_research

```python
report = rs.execute_research(query, options=None)
```

- **Description**: Executes a complete research pipeline from query to report
- **Parameters**:
  - `query` (str): The research query
  - `options` (dict, optional): Options to customize the research process
- **Returns**: dict - Research report with metadata
- **Raises**: Various exceptions for different stages of the pipeline

#### save_report

```python
rs.save_report(report, file_path, format="markdown")
```

- **Description**: Saves the research report to a file
- **Parameters**:
  - `report` (dict): The research report to save
  - `file_path` (str): Path to save the report
  - `format` (str, optional): Format of the report (markdown, html, pdf)
- **Raises**: IOError if the file cannot be saved
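
Because these interfaces are still planned, the snippet below is only a sketch of the intended call pattern once the class exists; the configuration and option keys shown are illustrative, not final.

```python
# Planned usage sketch; ResearchSystem is not implemented yet and the option names are illustrative.
rs = ResearchSystem(config={"search_engines": ["serper", "arxiv"]})

report = rs.execute_research(
    "What are the latest advancements in quantum computing?",
    options={"max_documents": 20},
)

rs.save_report(report, "quantum_computing_report.md", format="markdown")
```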

### QueryProcessor Class

#### process_query

```python
structured_query = query_processor.process_query(query)
```

- **Description**: Processes a raw query into a structured format
- **Parameters**:
  - `query` (str): The raw research query
- **Returns**: dict - Structured query with metadata
- **Raises**: ValueError if the query is invalid

### SearchStrategy Class

#### develop_strategy

```python
search_plan = search_strategy.develop_strategy(structured_query)
```

- **Description**: Develops a search strategy based on the query
- **Parameters**:
  - `structured_query` (dict): The structured query
- **Returns**: dict - Search plan with target-specific queries
- **Raises**: ValueError if the query cannot be processed

### SearchExecutor Class

#### execute_search

```python
search_results = search_executor.execute_search(search_plan)
```

- **Description**: Executes search queries against selected targets
- **Parameters**:
  - `search_plan` (dict): The search plan with queries
- **Returns**: dict - Collection of search results
- **Raises**: APIError if the search APIs fail

### JinaReranker Class

#### rerank

```python
ranked_documents = jina_reranker.rerank(query, documents, top_n=None)
```

- **Description**: Reranks documents based on their relevance to the query
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[str]): List of document strings to rerank
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores and indices

#### rerank_with_metadata

```python
ranked_documents = jina_reranker.rerank_with_metadata(query, documents, document_key='content', top_n=None)
```

- **Description**: Reranks documents with metadata based on their relevance to the query
- **Parameters**:
  - `query` (str): The query to rank documents against
  - `documents` (List[Dict[str, Any]]): List of document dictionaries containing content and metadata
  - `document_key` (str): The key in the document dictionaries that contains the text content
  - `top_n` (Optional[int]): Number of top results to return (optional)
- **Returns**: List of dictionaries containing reranked documents with scores, indices, and original metadata

#### get_jina_reranker

```python
jina_reranker = get_jina_reranker()
```

- **Description**: Gets the global Jina Reranker instance
- **Returns**: JinaReranker instance

### DocumentScraper Class

#### scrape_documents

```python
markdown_documents = document_scraper.scrape_documents(ranked_documents)
```

- **Description**: Scrapes and converts documents to markdown
- **Parameters**:
  - `ranked_documents` (list): The ranked list of documents to scrape
- **Returns**: list - Collection of markdown documents
- **Raises**: ScrapingError if the documents cannot be scraped

### DocumentSelector Class

#### select_documents

```python
selected_documents = document_selector.select_documents(documents_with_scores)
```

- **Description**: Selects the most relevant and diverse documents
- **Parameters**:
  - `documents_with_scores` (list): Documents with similarity scores
- **Returns**: list - Curated set of documents
- **Raises**: ValueError if the selection criteria are invalid

### ReportGenerator Class

#### generate_report

```python
report = report_generator.generate_report(selected_documents, query)
```

- **Description**: Generates a research report from selected documents
- **Parameters**:
  - `selected_documents` (list): The selected documents
  - `query` (str): The original query for context
- **Returns**: dict - Final research report
- **Raises**: GenerationError if the report cannot be generated
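
Taken together, the planned components form a linear pipeline from raw query to report. The sketch below strings them together in the intended order; all of the objects are planned interfaces, and the way results are flattened for the reranker is an assumption for illustration only.

```python
# Planned pipeline sketch; every component below is a planned interface, not working code.
query = "What are the latest advancements in quantum computing?"

structured_query = query_processor.process_query(query)           # dict with metadata
search_plan = search_strategy.develop_strategy(structured_query)  # target-specific queries
search_results = search_executor.execute_search(search_plan)      # collection of raw results

# Flatten result snippets into plain strings for the reranker (illustrative assumption).
documents = [result["snippet"] for results in search_results.values() for result in results]
ranked_documents = jina_reranker.rerank(query, documents, top_n=20)

markdown_documents = document_scraper.scrape_documents(ranked_documents)
selected_documents = document_selector.select_documents(markdown_documents)
report = report_generator.generate_report(selected_documents, query)
```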

## Search Execution Module

### SearchExecutor Class

The `SearchExecutor` class manages the execution of search queries across multiple search engines.

#### Initialization

```python
executor = SearchExecutor()
```

- **Description**: Initializes the search executor with available search handlers
- **Requirements**: Appropriate API keys must be set for the search engines to be used

#### execute_search

```python
results = executor.execute_search(structured_query, search_engines=["google", "scholar"], num_results=10)
```

- **Description**: Executes search queries across specified search engines in parallel
- **Parameters**:
  - `structured_query` (Dict[str, Any]): The structured query from the query processor
  - `search_engines` (Optional[List[str]]): List of search engines to use
  - `num_results` (int): Number of results to return per search engine
  - `timeout` (int): Timeout in seconds for each search engine
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results

#### execute_search_async

```python
results = await executor.execute_search_async(structured_query, search_engines=["google", "scholar"])
```

- **Description**: Executes search queries across specified search engines asynchronously
- **Parameters**: Same as `execute_search`
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping search engine names to lists of search results
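
A minimal async sketch: `execute_search_async` is a coroutine, so it has to run inside an event loop, for example via `asyncio.run`. The structured query shown is an assumed shape based on the parameters documented above.

```python
import asyncio

from execution.search_executor import SearchExecutor


async def main() -> None:
    executor = SearchExecutor()
    # Assumed structured-query shape; use whatever the query processor actually produces.
    structured_query = {
        "raw_query": "quantum computing",
        "enhanced_query": "recent advancements in quantum computing algorithms and hardware",
    }
    results = await executor.execute_search_async(
        structured_query,
        search_engines=["google", "scholar"],
        num_results=10,
    )
    for engine, engine_results in results.items():
        print(f"{engine}: {len(engine_results)} results")


asyncio.run(main())
```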

#### get_available_search_engines

```python
engines = executor.get_available_search_engines()
```

- **Description**: Gets a list of available search engines
- **Returns**: List[str] - List of available search engine names

### ResultCollector Class

The `ResultCollector` class processes and organizes search results from multiple search engines.

#### Initialization

```python
collector = ResultCollector()
```

- **Description**: Initializes the result collector

#### process_results

```python
processed_results = collector.process_results(search_results, dedup=True, max_results=20)
```

- **Description**: Processes search results from multiple search engines
- **Parameters**:
  - `search_results` (Dict[str, List[Dict[str, Any]]]): Dictionary mapping search engine names to lists of search results
  - `dedup` (bool): Whether to deduplicate results based on URL
  - `max_results` (Optional[int]): Maximum number of results to return
- **Returns**: List[Dict[str, Any]] - List of processed search results

#### filter_results

```python
filtered_results = collector.filter_results(results, filters={"domains": ["arxiv.org"], "min_score": 5})
```

- **Description**: Filters results based on specified criteria
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
  - `filters` (Dict[str, Any]): Dictionary of filter criteria
- **Returns**: List[Dict[str, Any]] - Filtered list of search results

#### group_results_by_domain

```python
grouped_results = collector.group_results_by_domain(results)
```

- **Description**: Groups results by domain
- **Parameters**:
  - `results` (List[Dict[str, Any]]): List of search results
- **Returns**: Dict[str, List[Dict[str, Any]]] - Dictionary mapping domains to lists of search results
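
The two helpers compose naturally after `process_results`: filter first, then group what remains by domain. The sketch below uses inline placeholder results and the filter keys from the example above; both are illustrative.

```python
from execution.result_collector import ResultCollector

collector = ResultCollector()

# Placeholder for the per-engine dict returned by SearchExecutor.execute_search().
search_results = {
    "arxiv": [{"title": "Qubit review", "url": "https://arxiv.org/abs/0000.00000",
               "snippet": "...", "source": "arxiv"}],
    "serper": [{"title": "Quantum news", "url": "https://example.com/news",
                "snippet": "...", "source": "serper"}],
}

processed = collector.process_results(search_results, dedup=True, max_results=50)

# Keep only arxiv.org results (the "min_score" key shown above is also available).
filtered = collector.filter_results(processed, filters={"domains": ["arxiv.org"]})

# Inspect how the remaining results are distributed across domains.
for domain, domain_results in collector.group_results_by_domain(filtered).items():
    print(f"{domain}: {len(domain_results)} results")
```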

### BaseSearchHandler Interface

The `BaseSearchHandler` class defines the interface for all search API handlers.

#### search

```python
results = handler.search(query, num_results=10, **kwargs)
```

- **Description**: Executes a search query
- **Parameters**:
  - `query` (str): The search query to execute
  - `num_results` (int): Number of results to return
  - `**kwargs`: Additional search parameters specific to the API
- **Returns**: List[Dict[str, Any]] - List of search results

#### get_name

```python
name = handler.get_name()
```

- **Description**: Gets the name of the search handler
- **Returns**: str - Name of the search handler

#### is_available

```python
available = handler.is_available()
```

- **Description**: Checks if the search API is available
- **Returns**: bool - True if the API is available, False otherwise

#### get_rate_limit_info

```python
rate_limits = handler.get_rate_limit_info()
```

- **Description**: Gets information about the API's rate limits
- **Returns**: Dict[str, Any] - Dictionary with rate limit information
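
To add a new search engine, a handler implements the four methods of this interface. The sketch below is a hypothetical handler that returns canned data; the base-class import path is an assumption about the project layout.

```python
from typing import Any, Dict, List

from execution.api_handlers.base_handler import BaseSearchHandler  # assumed module path


class DummySearchHandler(BaseSearchHandler):
    """Hypothetical handler that returns canned results, for illustration only."""

    def search(self, query: str, num_results: int = 10, **kwargs: Any) -> List[Dict[str, Any]]:
        results = [
            {
                "title": f"Result for {query}",
                "url": "https://example.com/result",
                "snippet": "Placeholder snippet.",
                "source": self.get_name(),
            }
        ]
        return results[:num_results]

    def get_name(self) -> str:
        return "dummy"

    def is_available(self) -> bool:
        return True  # no API key needed for canned data

    def get_rate_limit_info(self) -> Dict[str, Any]:
        return {"requests_per_minute": None, "notes": "Local stub; no rate limit."}
```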

## Ranking Module

### JinaReranker Class

The `JinaReranker` class provides document reranking functionality using Jina AI's Reranker API.

#### Initialization

```python
reranker = JinaReranker(
    api_key=None,  # Optional, will use environment variable if not provided
    model="jina-reranker-v2-base-multilingual",  # Default model
    endpoint="https://api.jina.ai/v1/rerank"  # Default endpoint
)
```

- **Description**: Initializes the JinaReranker with the specified API key, model, and endpoint
- **Parameters**:
  - `api_key` (Optional[str]): Jina AI API key (defaults to environment variable)
  - `model` (str): The reranker model to use
  - `endpoint` (str): The API endpoint
- **Requirements**: JINA_API_KEY environment variable must be set if api_key is not provided
- **Raises**: ValueError if API key is not available

#### rerank

```python
reranked_docs = reranker.rerank(query, documents, top_n=None)
```

- **Description**: Reranks a list of documents based on their relevance to the query
- **Parameters**:
  - `query` (str): The query string
  - `documents` (List[str]): List of document strings to rerank
  - `top_n` (Optional[int]): Number of top documents to return (defaults to all)
- **Returns**: List[Dict[str, Any]] - List of reranked documents with scores
- **Example Return Format**:

```json
[
  {
    "index": 0,
    "score": 0.95,
    "document": "Document content here"
  },
  {
    "index": 3,
    "score": 0.82,
    "document": "Another document content"
  }
]
```

#### get_jina_reranker

```python
reranker = get_jina_reranker()
```

- **Description**: Factory function to get a JinaReranker instance with configuration from the config file
- **Returns**: JinaReranker - Initialized reranker instance
- **Raises**: ValueError if API key is not available

### Usage Examples

#### Basic Usage

```python
from ranking.jina_reranker import JinaReranker

# Initialize with the default model
reranker = JinaReranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
```

#### Integration with ResultCollector

```python
from execution.result_collector import ResultCollector
from ranking.jina_reranker import get_jina_reranker

# Initialize components
reranker = get_jina_reranker()
collector = ResultCollector(reranker=reranker)

# Process search results with reranking
reranked_results = collector.process_results(
    search_results,
    dedup=True,
    max_results=20,
    use_reranker=True
)
```

#### Testing

```python
# Simple test script
import json
from ranking.jina_reranker import get_jina_reranker

reranker = get_jina_reranker()
query = "What is quantum computing?"
documents = [
    "Quantum computing is a type of computation that harnesses quantum mechanics.",
    "Classical computers use bits, while quantum computers use qubits.",
    "Machine learning is a subset of artificial intelligence."
]

reranked = reranker.rerank(query, documents)
print(json.dumps(reranked, indent=2))
```

## Search Execution Testing

The search execution module has been tested to ensure it correctly executes search queries across multiple search engines and processes the results.

### Test Script (test_search_execution.py)

```python
# Process a query and execute search
results = test_search_execution("What are the latest advancements in quantum computing?")

# Save test results
save_test_results(results, "search_execution_test_results.json")
```

- **Purpose**: Tests the search execution module with various queries
- **Features**:
  - Tests with multiple queries
  - Uses all available search engines
  - Saves results to a JSON file
  - Provides detailed output of search results

## UI Module

### GradioInterface Class

#### Initialization

```python
from ui.gradio_interface import GradioInterface
interface = GradioInterface()
```

- **Description**: Initializes the Gradio interface for the research system
- **Requirements**: Gradio library installed

#### process_query

```python
markdown_results, results_file = interface.process_query(query, num_results=10)
```

- **Description**: Processes a query and returns the results
- **Parameters**:
  - `query` (str): The query to process
  - `num_results` (int): Number of results to return
- **Returns**:
  - `markdown_results` (str): Markdown formatted results
  - `results_file` (str): Path to the JSON file with saved results
- **Example**:

```python
results, file_path = interface.process_query("What are the latest advancements in quantum computing?", num_results=15)
```

#### create_interface

```python
interface_blocks = interface.create_interface()
```

- **Description**: Creates and returns the Gradio interface
- **Returns**: `gr.Blocks` - The Gradio interface object
- **Example**:

```python
blocks = interface.create_interface()
blocks.launch()
```

#### launch

```python
interface.launch(share=True, server_port=7860, debug=False)
```

- **Description**: Launches the Gradio interface
- **Parameters**:
  - `share` (bool): Whether to create a public link for sharing
  - `server_port` (int): Port to run the server on
  - `debug` (bool): Whether to run in debug mode
- **Example**:

```python
interface.launch(share=True)
```

### Running the UI

```bash
python run_ui.py --share --port 7860
```

- **Description**: Runs the Gradio interface
- **Parameters**:
  - `--share`: Create a public link for sharing
  - `--port`: Port to run the server on (default: 7860)
  - `--debug`: Run in debug mode
- **Example**:

```bash
python run_ui.py --share
```

## Document Ranking Interface

### JinaReranker

The `JinaReranker` class provides an interface for reranking documents based on their relevance to a query using Jina AI's Reranker API.

#### Methods

```python
def rerank(query: str, documents: List[str], top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document strings to rerank
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores and indices
    """
```

```python
def rerank_with_metadata(query: str, documents: List[Dict[str, Any]],
                         document_key: str = 'content',
                         top_n: Optional[int] = None) -> List[Dict[str, Any]]:
    """
    Rerank documents with metadata based on their relevance to the query.

    Args:
        query: The query to rank documents against
        documents: List of document dictionaries containing content and metadata
        document_key: The key in the document dictionaries that contains the text content
        top_n: Number of top results to return (optional)

    Returns:
        List of dictionaries containing reranked documents with scores, indices, and original metadata
    """
```
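
A short sketch of `rerank_with_metadata` with documents carried as dictionaries. The `content` and `url` fields are assumptions about the document shape; only `document_key` has to point at the text field, and the exact shape of each returned entry depends on the implementation.

```python
from ranking.jina_reranker import get_jina_reranker

reranker = get_jina_reranker()

# Document dictionaries: "content" holds the text, other keys are arbitrary metadata.
documents = [
    {"content": "Quantum computing uses qubits.", "url": "https://example.com/a"},
    {"content": "Classical computing uses bits.", "url": "https://example.com/b"},
]

results = reranker.rerank_with_metadata(
    query="What is quantum computing?",
    documents=documents,
    document_key="content",
    top_n=1,
)

for result in results:
    # Each entry carries a relevance score, the original index, and the original metadata.
    print(result)
```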

#### Factory Function

```python
def get_jina_reranker() -> JinaReranker:
    """
    Get the global Jina Reranker instance.

    Returns:
        JinaReranker instance
    """
```

#### Example Usage

```python
from ranking.jina_reranker import get_jina_reranker

# Get the reranker
reranker = get_jina_reranker()

# Rerank documents
results = reranker.rerank(
    query="What is quantum computing?",
    documents=["Document about quantum physics", "Document about quantum computing", "Document about classical computing"],
    top_n=2
)

# Process results
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']}")
```

## Report Generation Module

### ReportDetailLevelManager Class

The `ReportDetailLevelManager` class manages configurations for different report detail levels.

#### Initialization

```python
detail_level_manager = get_report_detail_level_manager()
```

- **Description**: Gets a singleton instance of the ReportDetailLevelManager

#### get_detail_level_config

```python
config = detail_level_manager.get_detail_level_config(detail_level)
```

- **Description**: Gets configuration parameters for a specific detail level
- **Parameters**:
  - `detail_level` (str): Detail level as a string (brief, standard, detailed, comprehensive)
- **Returns**: Dict[str, Any] - Configuration parameters for the specified detail level
- **Raises**: ValueError if the detail level is not valid

#### get_template_modifier

```python
template = detail_level_manager.get_template_modifier(detail_level, query_type)
```

- **Description**: Gets template modifier for a specific detail level and query type
- **Parameters**:
  - `detail_level` (str): Detail level as a string (brief, standard, detailed, comprehensive)
  - `query_type` (str): Query type as a string (factual, exploratory, comparative)
- **Returns**: str - Template modifier as a string
- **Raises**: ValueError if the detail level or query type is not valid

#### get_available_detail_levels

```python
levels = detail_level_manager.get_available_detail_levels()
```

- **Description**: Gets a list of available detail levels with descriptions
- **Returns**: List[Tuple[str, str]] - List of tuples containing detail level and description
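
A small sketch of querying the detail-level manager; `get_report_detail_level_manager` is assumed to be importable from the report generation module, and the level and query-type strings follow the values listed above.

```python
# get_report_detail_level_manager is assumed to be imported from the report generation module.
detail_level_manager = get_report_detail_level_manager()

# List the available levels with their descriptions.
for level, description in detail_level_manager.get_available_detail_levels():
    print(f"{level}: {description}")

# Fetch the configuration and the template modifier for one combination.
config = detail_level_manager.get_detail_level_config("detailed")
template = detail_level_manager.get_template_modifier("detailed", "exploratory")
print(config)
```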

### ReportGenerator Class

The `ReportGenerator` class generates reports from search results.

#### Initialization

```python
report_generator = get_report_generator()
```

- **Description**: Gets a singleton instance of the ReportGenerator

#### initialize

```python
await report_generator.initialize()
```

- **Description**: Initializes the report generator by setting up the database
- **Returns**: None

#### set_detail_level

```python
report_generator.set_detail_level(detail_level)
```

- **Description**: Sets the detail level for report generation
- **Parameters**:
  - `detail_level` (str): Detail level (brief, standard, detailed, comprehensive)
- **Returns**: None
- **Raises**: ValueError if the detail level is not valid

#### get_detail_level_config

```python
config = report_generator.get_detail_level_config()
```

- **Description**: Gets the current detail level configuration
- **Returns**: Dict[str, Any] - Configuration parameters for the current detail level

#### get_available_detail_levels

```python
levels = report_generator.get_available_detail_levels()
```

- **Description**: Gets a list of available detail levels with descriptions
- **Returns**: List[Tuple[str, str]] - List of tuples containing detail level and description

#### process_search_results

```python
documents = await report_generator.process_search_results(search_results)
```

- **Description**: Processes search results by scraping the URLs and storing them in the database
- **Parameters**:
  - `search_results` (List[Dict[str, Any]]): List of search results, each containing at least a 'url' field
- **Returns**: List[Dict[str, Any]] - List of processed documents

#### prepare_documents_for_report

```python
chunks = await report_generator.prepare_documents_for_report(search_results, token_budget, chunk_size, overlap_size)
```

- **Description**: Prepares documents for report generation by chunking and selecting relevant content
- **Parameters**:
  - `search_results` (List[Dict[str, Any]]): List of search results
  - `token_budget` (Optional[int]): Maximum number of tokens to use
  - `chunk_size` (Optional[int]): Maximum number of tokens per chunk
  - `overlap_size` (Optional[int]): Number of tokens to overlap between chunks
- **Returns**: List[Dict[str, Any]] - List of selected document chunks

#### generate_report

```python
report = await report_generator.generate_report(
    search_results=search_results,
    query=query,
    token_budget=token_budget,
    chunk_size=chunk_size,
    overlap_size=overlap_size,
    detail_level=detail_level
)
```

- **Description**: Generates a report from search results
- **Parameters**:
  - `search_results` (List[Dict[str, Any]]): List of search results
  - `query` (str): Original search query
  - `token_budget` (Optional[int]): Maximum number of tokens to use
  - `chunk_size` (Optional[int]): Maximum number of tokens per chunk
  - `overlap_size` (Optional[int]): Number of tokens to overlap between chunks
  - `detail_level` (Optional[str]): Level of detail for the report (brief, standard, detailed, comprehensive)
- **Returns**: str - Generated report as a string

#### initialize_report_generator

```python
await initialize_report_generator()
```

- **Description**: Initializes the global report generator instance
- **Returns**: None

#### get_report_generator

```python
report_generator = get_report_generator()
```

- **Description**: Gets the global report generator instance
- **Returns**: ReportGenerator - Initialized report generator instance
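
Putting the pieces together, a hedged end-to-end sketch: initialize the global report generator, then generate a report from previously collected search results. The two functions are assumed to be importable from the report generation module, and each search result needs at least a `url` field, as noted above.

```python
import asyncio

# initialize_report_generator and get_report_generator are assumed to be importable
# from the report generation module; adjust the import to the actual project layout.


async def main() -> None:
    await initialize_report_generator()
    report_generator = get_report_generator()

    # Each result needs at least a 'url' field; the other fields shown are illustrative.
    search_results = [
        {"url": "https://arxiv.org/abs/0000.00000", "title": "Quantum computing survey"},
    ]

    report = await report_generator.generate_report(
        search_results=search_results,
        query="What are the latest advancements in quantum computing?",
        detail_level="standard",
    )
    print(report)


asyncio.run(main())
```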