139 lines
3.8 KiB
Markdown
139 lines
3.8 KiB
Markdown
# Intelligent Research System
|
|
|
|
An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.
|
|
|
|
## Overview
|
|
|
|
This system automates the research process by:
|
|
1. Processing and enhancing user queries
|
|
2. Executing searches across multiple engines (Serper, Google Scholar, arXiv)
|
|
3. Ranking and filtering results based on relevance
|
|
4. Generating comprehensive research reports
|
|
|
|
## Features
|
|
|
|
- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
|
|
- **Multi-Source Search**: Executes searches across Serper (Google), Google Scholar, and arXiv
|
|
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
|
|
- **Result Deduplication**: Removes duplicate results across different search engines
|
|
- **Modular Architecture**: Easily extensible with new search engines and LLM providers
|
|
|
|
## Components
|
|
|
|
- **Query Processor**: Enhances and classifies user queries
|
|
- **Search Executor**: Executes searches across multiple engines
|
|
- **Result Collector**: Processes and organizes search results
|
|
- **Document Ranker**: Ranks documents by relevance
|
|
- **Report Generator**: Synthesizes information into a coherent report (coming soon)
|
|
|
|
## Getting Started
|
|
|
|
### Prerequisites
|
|
|
|
- Python 3.8+
|
|
- API keys for:
|
|
- Serper API (for Google and Scholar search)
|
|
- Groq (or other LLM provider)
|
|
- Jina AI (for reranking)
|
|
|
|
### Installation
|
|
|
|
1. Clone the repository:
|
|
```bash
|
|
git clone https://github.com/yourusername/sim-search.git
|
|
cd sim-search
|
|
```
|
|
|
|
2. Install dependencies:
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
|
|
3. Create a configuration file:
|
|
```bash
|
|
cp config/config.yaml.example config/config.yaml
|
|
```
|
|
|
|
4. Edit the configuration file to add your API keys:
|
|
```yaml
|
|
api_keys:
|
|
serper: "your-serper-api-key"
|
|
groq: "your-groq-api-key"
|
|
jina: "your-jina-api-key"
|
|
```
|
|
|
|
### Usage
|
|
|
|
#### Basic Usage
|
|
|
|
```python
|
|
from query.query_processor import QueryProcessor
|
|
from execution.search_executor import SearchExecutor
|
|
from execution.result_collector import ResultCollector
|
|
|
|
# Initialize components
|
|
query_processor = QueryProcessor()
|
|
search_executor = SearchExecutor()
|
|
result_collector = ResultCollector()
|
|
|
|
# Process a query
|
|
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")
|
|
|
|
# Execute search
|
|
search_results = search_executor.execute_search(processed_query)
|
|
|
|
# Process results
|
|
processed_results = result_collector.process_results(search_results)
|
|
|
|
# Print top results
|
|
for i, result in enumerate(processed_results[:5]):
|
|
print(f"{i+1}. {result['title']}")
|
|
print(f" URL: {result['url']}")
|
|
print(f" Snippet: {result['snippet'][:100]}...")
|
|
print()
|
|
```
|
|
|
|
#### Testing
|
|
|
|
Run the test scripts to verify functionality:
|
|
|
|
```bash
|
|
# Test search execution
|
|
python test_search_execution.py
|
|
|
|
# Test all search handlers
|
|
python test_all_handlers.py
|
|
```
|
|
|
|
## Project Structure
|
|
|
|
```
|
|
sim-search/
|
|
├── config/ # Configuration management
|
|
├── query/ # Query processing
|
|
├── execution/ # Search execution
|
|
│ └── api_handlers/ # Search API handlers
|
|
├── ranking/ # Document ranking
|
|
├── test_*.py # Test scripts
|
|
└── requirements.txt # Dependencies
|
|
```
|
|
|
|
## LLM Providers
|
|
|
|
The system supports multiple LLM providers through the LiteLLM interface:
|
|
- Groq (currently using Llama 3.1-8b-instant)
|
|
- OpenAI
|
|
- Anthropic
|
|
- OpenRouter
|
|
- Azure OpenAI
|
|
|
|
## License
|
|
|
|
This project is licensed under the MIT License - see the LICENSE file for details.
|
|
|
|
## Acknowledgments
|
|
|
|
- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
|
|
- [Serper](https://serper.dev/) for their Google search API
|
|
- [Groq](https://groq.com/) for their fast LLM inference
|