# Intelligent Research System
An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.
## Overview
This system automates the research process by:
1. Processing and enhancing user queries
2. Executing searches across multiple engines (Serper, Google Scholar, arXiv)
3. Ranking and filtering results based on relevance
4. Generating comprehensive research reports
## Features
- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across Serper (Google), Google Scholar, and arXiv
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers
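
The actual `ResultCollector` implementation may differ, but the deduplication feature above can be sketched as URL normalization plus first-occurrence filtering. Everything below (function names included) is illustrative, not the system's real API:

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different forms compare equal."""
    parsed = urlparse(url.lower())
    netloc = parsed.netloc
    # Treat "www.example.com" and "example.com" as the same host.
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Ignore scheme differences and trailing slashes.
    return f"{netloc}{parsed.path.rstrip('/')}"

def deduplicate_results(results: list[dict]) -> list[dict]:
    """Keep the first occurrence of each unique URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

Because the first occurrence wins, results from earlier (higher-priority) engines survive when the same page appears in multiple sources.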
## Components
- **Query Processor**: Enhances and classifies user queries
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into a coherent report (coming soon)
## Getting Started
### Prerequisites
- Python 3.8+
- API keys for:
  - Serper API (for Google and Scholar search)
  - Groq (or another LLM provider)
  - Jina AI (for reranking)
### Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/sim-search.git
cd sim-search
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Create a configuration file:
```bash
cp config/config.yaml.example config/config.yaml
```
4. Edit the configuration file to add your API keys:
```yaml
api_keys:
  serper: "your-serper-api-key"
  groq: "your-groq-api-key"
  jina: "your-jina-api-key"
```
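
If you prefer not to commit keys to the config file, a common pattern is an environment-variable fallback. The helper below is a hypothetical sketch (the project's real config loader may work differently), assuming keys are stored under an `api_keys` mapping as shown above:

```python
import os

def get_api_key(name: str, config: dict) -> str:
    """Look up an API key in the parsed config, falling back to
    SERPER_API_KEY-style environment variables."""
    key = config.get("api_keys", {}).get(name) or os.environ.get(f"{name.upper()}_API_KEY")
    if not key:
        raise KeyError(f"No API key configured for {name!r}")
    return key
```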
### Usage
#### Basic Usage
```python
from query.query_processor import QueryProcessor
from execution.search_executor import SearchExecutor
from execution.result_collector import ResultCollector
# Initialize components
query_processor = QueryProcessor()
search_executor = SearchExecutor()
result_collector = ResultCollector()
# Process a query
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")
# Execute search
search_results = search_executor.execute_search(processed_query)
# Process results
processed_results = result_collector.process_results(search_results)
# Print top results
for i, result in enumerate(processed_results[:5]):
    print(f"{i+1}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Snippet: {result['snippet'][:100]}...")
    print()
```
#### Testing
Run the test scripts to verify functionality:
```bash
# Test search execution
python test_search_execution.py
# Test all search handlers
python test_all_handlers.py
```
## Project Structure
```
sim-search/
├── config/              # Configuration management
├── query/               # Query processing
├── execution/           # Search execution
│   └── api_handlers/    # Search API handlers
├── ranking/             # Document ranking
├── test_*.py            # Test scripts
└── requirements.txt     # Dependencies
```
## LLM Providers
The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using `llama-3.1-8b-instant`)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI
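
LiteLLM routes requests by a `provider/model` prefix convention in the model string. How this project wires providers in isn't shown here, but a small illustrative helper (the mapping and function are assumptions, not the project's API) might look like:

```python
# Hypothetical mapping from provider name to LiteLLM model-string prefix.
# Note: some providers (e.g. OpenAI) are also accepted without a prefix.
PROVIDER_PREFIXES = {
    "groq": "groq/",
    "openai": "openai/",
    "anthropic": "anthropic/",
    "openrouter": "openrouter/",
    "azure": "azure/",
}

def litellm_model_string(provider: str, model: str) -> str:
    """Build a LiteLLM-style model identifier, e.g. 'groq/llama-3.1-8b-instant'."""
    try:
        prefix = PROVIDER_PREFIXES[provider]
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider!r}")
    return f"{prefix}{model}"

# Usage with LiteLLM (requires `pip install litellm` and the provider's API key):
# from litellm import completion
# response = completion(
#     model=litellm_model_string("groq", "llama-3.1-8b-instant"),
#     messages=[{"role": "user", "content": "Summarize recent quantum computing advances."}],
# )
```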
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
- [Serper](https://serper.dev/) for their Google search API
- [Groq](https://groq.com/) for their fast LLM inference