# Intelligent Research System

An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.

## Overview

This system automates the research process by:

1. Processing and enhancing user queries
2. Executing searches across multiple engines (Serper, Google Scholar, arXiv)
3. Ranking and filtering results based on relevance
4. Generating comprehensive research reports

## Features

- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across Serper (Google), Google Scholar, and arXiv
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers

## Components

- **Query Processor**: Enhances and classifies user queries
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into a coherent report (coming soon)

## Getting Started

### Prerequisites

- Python 3.8+
- API keys for:
  - Serper API (for Google and Scholar search)
  - Groq (or another LLM provider)
  - Jina AI (for reranking)

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/sim-search.git
   cd sim-search
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create a configuration file:

   ```bash
   cp config/config.yaml.example config/config.yaml
   ```
4. Edit the configuration file to add your API keys:

   ```yaml
   api_keys:
     serper: "your-serper-api-key"
     groq: "your-groq-api-key"
     jina: "your-jina-api-key"
   ```

### Usage

#### Basic Usage

```python
from query.query_processor import QueryProcessor
from execution.search_executor import SearchExecutor
from execution.result_collector import ResultCollector

# Initialize components
query_processor = QueryProcessor()
search_executor = SearchExecutor()
result_collector = ResultCollector()

# Process a query
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")

# Execute search
search_results = search_executor.execute_search(processed_query)

# Process results
processed_results = result_collector.process_results(search_results)

# Print top results
for i, result in enumerate(processed_results[:5]):
    print(f"{i+1}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Snippet: {result['snippet'][:100]}...")
    print()
```

#### Testing

Run the test scripts to verify functionality:

```bash
# Test search execution
python test_search_execution.py

# Test all search handlers
python test_all_handlers.py
```

## Project Structure

```
sim-search/
├── config/              # Configuration management
├── query/               # Query processing
├── execution/           # Search execution
│   └── api_handlers/    # Search API handlers
├── ranking/             # Document ranking
├── test_*.py            # Test scripts
└── requirements.txt     # Dependencies
```

## LLM Providers

The system supports multiple LLM providers through the LiteLLM interface:

- Groq (currently using `llama-3.1-8b-instant`)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Acknowledgments

- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
- [Serper](https://serper.dev/) for their Google search API
- [Groq](https://groq.com/) for their fast LLM inference
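## Example: Result Deduplication

The result deduplication mentioned under Features can be sketched roughly as below. This is a minimal illustration, not the project's actual implementation; the `normalize_url` and `dedupe_results` helper names and the normalization rules are assumptions.

```python
from typing import Dict, List
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different duplicates compare equal."""
    parsed = urlparse(url.lower())
    # Drop the scheme, query string, fragment, and any trailing slash,
    # so http://example.com/qc/ and https://example.com/qc collapse together.
    return parsed.netloc + parsed.path.rstrip("/")


def dedupe_results(results: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


results = [
    {"title": "Quantum computing", "url": "https://example.com/qc"},
    {"title": "Quantum computing", "url": "http://example.com/qc/"},  # duplicate
    {"title": "arXiv paper", "url": "https://arxiv.org/abs/2301.00001"},
]
print(len(dedupe_results(results)))  # 2
```

Keeping the first occurrence matters because results arrive pre-ranked: when two engines return the same page, the higher-ranked copy survives.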
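## Example: Adding a Search Handler

The modular architecture noted under Features means new engines can be added by implementing a handler alongside the existing ones in `execution/api_handlers/`. The interface below is a hypothetical sketch: the project's actual base class, method names, and result shape may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseSearchHandler(ABC):
    """Hypothetical handler interface; the real base class in
    execution/api_handlers/ may look different."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> List[Dict]:
        """Return results as dicts with 'title', 'url', and 'snippet' keys."""


class StubSearchHandler(BaseSearchHandler):
    """Example of plugging in a new engine (stubbed, no real HTTP call)."""

    def search(self, query: str, num_results: int = 10) -> List[Dict]:
        # A real handler would call the engine's API here and map its
        # response into the common result shape used by the collector.
        return [
            {
                "title": f"Stub result for {query!r}",
                "url": "https://example.com/result",
                "snippet": "Placeholder snippet.",
            }
        ][:num_results]


handler = StubSearchHandler()
print(handler.search("quantum computing")[0]["url"])  # https://example.com/result
```

Because every handler returns the same result shape, the collector and ranker need no changes when an engine is added.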