ira/README.md

158 lines
5.0 KiB
Markdown

# Intelligent Research System
An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.
## Overview
This system automates the research process by:
1. Processing and enhancing user queries
2. Executing searches across multiple engines (Serper, Google Scholar, arXiv)
3. Ranking and filtering results based on relevance
4. Generating comprehensive research reports
## Features
- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across general web (Serper/Google), academic sources, and current news
- **Specialized Search Handlers**:
- **Current Events**: Optimized news search for recent developments
- **Academic Research**: Specialized academic search with OpenAlex, CORE, arXiv, and Google Scholar
- **Open Access Detection**: Finds freely available versions of paywalled papers using Unpaywall
- **Code/Programming**: Specialized code search using GitHub and StackExchange
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers
## Components
- **Query Processor**: Enhances and classifies user queries
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into coherent reports with specialized templates for different query types
## Getting Started
### Prerequisites
- Python 3.8+
- API keys for:
- Serper API (for Google and Scholar search)
- NewsAPI (for current events search)
- CORE API (for open access academic search)
- GitHub API (for code search)
- StackExchange API (for programming Q&A content)
- Groq (or other LLM provider)
- Jina AI (for reranking)
- Email for OpenAlex and Unpaywall (recommended but not required)
### Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/sim-search.git
cd sim-search
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Create a configuration file:
```bash
cp config/config.yaml.example config/config.yaml
```
4. Edit the configuration file to add your API keys:
```yaml
api_keys:
serper: "your-serper-api-key"
newsapi: "your-newsapi-key"
groq: "your-groq-api-key"
jina: "your-jina-api-key"
github: "your-github-api-key"
stackexchange: "your-stackexchange-api-key"
```
### Usage
#### Basic Usage
```python
from query.query_processor import QueryProcessor
from execution.search_executor import SearchExecutor
from execution.result_collector import ResultCollector
# Initialize components
query_processor = QueryProcessor()
search_executor = SearchExecutor()
result_collector = ResultCollector()
# Process a query
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")
# Execute search
search_results = search_executor.execute_search(processed_query)
# Process results
processed_results = result_collector.process_results(search_results)
# Print top results
for i, result in enumerate(processed_results[:5]):
print(f"{i+1}. {result['title']}")
print(f" URL: {result['url']}")
print(f" Snippet: {result['snippet'][:100]}...")
print()
```
#### Testing
Run the test scripts to verify functionality:
```bash
# Test search execution
python test_search_execution.py
# Test all search handlers
python test_all_handlers.py
```
## Project Structure
```
sim-search/
├── config/ # Configuration management
├── query/ # Query processing
├── execution/ # Search execution
│ └── api_handlers/ # Search API handlers
├── ranking/ # Document ranking
├── test_*.py # Test scripts
└── requirements.txt # Dependencies
```
## LLM Providers
The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using Llama 3.1-8b-instant)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
- [Serper](https://serper.dev/) for their Google search API
- [NewsAPI](https://newsapi.org/) for their news search API
- [OpenAlex](https://openalex.org/) for their academic search API
- [CORE](https://core.ac.uk/) for their open access academic search API
- [Unpaywall](https://unpaywall.org/) for their open access discovery API
- [Groq](https://groq.com/) for their fast LLM inference
- [GitHub](https://github.com/) for their code search API
- [StackExchange](https://stackexchange.com/) for their programming Q&A API