ira/README.md

# Intelligent Research System

An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.

## Overview

This system automates the research process by:
1. Processing and enhancing user queries
2. Executing searches across multiple engines (Serper, Google Scholar, arXiv)
3. Ranking and filtering results based on relevance
4. Generating comprehensive research reports

## Features

- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across general web (Serper/Google), academic sources, and current news
- **Specialized Search Handlers**:
  - **Current Events**: Optimized news search for recent developments
  - **Academic Research**: Specialized academic search with OpenAlex, CORE, arXiv, and Google Scholar
  - **Open Access Detection**: Finds freely available versions of paywalled papers using Unpaywall
  - **Code/Programming**: Specialized code search using GitHub and StackExchange
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers

## Components

- **Query Processor**: Enhances and classifies user queries
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into coherent reports with specialized templates for different query types

## Getting Started

### Prerequisites

- Python 3.8+
- API keys for:
  - Serper API (for Google and Scholar search)
  - NewsAPI (for current events search)
  - CORE API (for open access academic search)
  - GitHub API (for code search)
  - StackExchange API (for programming Q&A content)
  - Groq (or other LLM provider)
  - Jina AI (for reranking)
  - Email for OpenAlex and Unpaywall (recommended but not required)

### Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/sim-search.git
cd sim-search
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Create a configuration file:
```bash
cp config/config.yaml.example config/config.yaml
```

4. Edit the configuration file to add your API keys:
```yaml
api_keys:
  serper: "your-serper-api-key"
  newsapi: "your-newsapi-key"
  groq: "your-groq-api-key"
  jina: "your-jina-api-key"
  github: "your-github-api-key"
  stackexchange: "your-stackexchange-api-key"
```

### Usage

#### Basic Usage

```python
from query.query_processor import QueryProcessor
from execution.search_executor import SearchExecutor
from execution.result_collector import ResultCollector

# Initialize components
query_processor = QueryProcessor()
search_executor = SearchExecutor()
result_collector = ResultCollector()

# Process a query
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")

# Execute search
search_results = search_executor.execute_search(processed_query)

# Process results
processed_results = result_collector.process_results(search_results)

# Print top results
for i, result in enumerate(processed_results[:5]):
    print(f"{i+1}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Snippet: {result['snippet'][:100]}...")
    print()
```

#### Testing

Run the test scripts to verify functionality:

```bash
# Test search execution
python test_search_execution.py

# Test all search handlers
python test_all_handlers.py
```

## Project Structure

```
sim-search/
├── config/                 # Configuration management
├── query/                  # Query processing
├── execution/              # Search execution
│   └── api_handlers/       # Search API handlers
├── ranking/                # Document ranking
├── test_*.py               # Test scripts
└── requirements.txt        # Dependencies
```

## LLM Providers

The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using Llama 3.1-8b-instant)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
- [Serper](https://serper.dev/) for their Google search API
- [NewsAPI](https://newsapi.org/) for their news search API
- [OpenAlex](https://openalex.org/) for their academic search API
- [CORE](https://core.ac.uk/) for their open access academic search API
- [Unpaywall](https://unpaywall.org/) for their open access discovery API
- [Groq](https://groq.com/) for their fast LLM inference
- [GitHub](https://github.com/) for their code search API
- [StackExchange](https://stackexchange.com/) for their programming Q&A API