# Intelligent Research System
An end-to-end research automation system that handles the entire process from query to final report, leveraging multiple search sources and semantic similarity to produce comprehensive research results.
## Overview
This system automates the research process by:
- Processing and enhancing user queries
- Executing searches across multiple engines (Serper, Google Scholar, arXiv)
- Ranking and filtering results based on relevance
- Generating comprehensive research reports
## Features
- Query Processing: Enhances user queries with additional context and classifies them by type and intent
- Multi-Source Search: Executes searches across the general web (Serper/Google), academic sources, and current news
- Specialized Search Handlers:
  - Current Events: Optimized news search for recent developments
  - Academic Research: Specialized academic search with OpenAlex, CORE, arXiv, and Google Scholar
  - Open Access Detection: Finds freely available versions of paywalled papers using Unpaywall
  - Code/Programming: Specialized code search using GitHub and StackExchange
- Intelligent Ranking: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- Result Deduplication: Removes duplicate results across different search engines (a minimal sketch follows this list)
- Modular Architecture: Easily extensible with new search engines and LLM providers
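As an illustration of the deduplication step, here is a minimal sketch that keeps only the first result for each normalized URL across engines; the project's actual logic lives in the execution and ranking modules and may differ.

```python
from typing import Dict, List
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    """Normalize a URL for comparison: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlparse(url)
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), "", parts.query, ""))

def deduplicate(results: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each URL across all search engines."""
    seen, unique = set(), []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```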
## Components
- Query Processor: Enhances and classifies user queries
- Search Executor: Executes searches across multiple engines
- Result Collector: Processes and organizes search results
- Document Ranker: Ranks documents by relevance
- Report Generator: Synthesizes information into coherent reports with specialized templates for different query types
## Getting Started
### Prerequisites
- Python 3.8+
- API keys for:
  - Serper API (for Google and Scholar search)
  - NewsAPI (for current events search)
  - CORE API (for open access academic search)
  - GitHub API (for code search)
  - StackExchange API (for programming Q&A content)
  - Groq (or another LLM provider)
  - Jina AI (for reranking)
- An email address for OpenAlex and Unpaywall (recommended but not required)
### Installation
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/sim-search.git
  cd sim-search
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Create a configuration file:
  ```bash
  cp config/config.yaml.example config/config.yaml
  ```
- Edit the configuration file to add your API keys:
  ```yaml
  api_keys:
    serper: "your-serper-api-key"
    newsapi: "your-newsapi-key"
    groq: "your-groq-api-key"
    jina: "your-jina-api-key"
    github: "your-github-api-key"
    stackexchange: "your-stackexchange-api-key"
  ```
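A minimal sketch of reading these keys at runtime, assuming the YAML file is parsed with PyYAML; the loader in the config/ package may expose a different interface:

```python
import yaml

# Load API keys from the YAML configuration file created above.
with open("config/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

serper_key = config["api_keys"]["serper"]
jina_key = config["api_keys"]["jina"]
```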
## Usage
### Basic Usage
```python
from query.query_processor import QueryProcessor
from execution.search_executor import SearchExecutor
from execution.result_collector import ResultCollector

# Initialize components
query_processor = QueryProcessor()
search_executor = SearchExecutor()
result_collector = ResultCollector()

# Process a query
processed_query = query_processor.process_query("What are the latest advancements in quantum computing?")

# Execute search
search_results = search_executor.execute_search(processed_query)

# Process results
processed_results = result_collector.process_results(search_results)

# Print top results
for i, result in enumerate(processed_results[:5]):
    print(f"{i+1}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Snippet: {result['snippet'][:100]}...")
    print()
```
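The ranking and report-generation stages can be chained onto the same pipeline. The class and method names below (`JinaReranker`, `ReportGenerator`, `rerank`, `generate_report`) are illustrative guesses based on the ranking/ and report/ packages, so check those modules for the actual interfaces.

```python
# Hypothetical continuation of the pipeline above; these names are assumptions,
# not the project's confirmed API.
from ranking.jina_reranker import JinaReranker        # assumed module/class
from report.report_generator import ReportGenerator   # assumed module/class

reranker = JinaReranker()
report_generator = ReportGenerator()

# Re-rank the collected results against the enhanced query using Jina AI's Re-Ranker.
ranked_results = reranker.rerank(processed_query, processed_results)

# Synthesize the top-ranked documents into a report.
report = report_generator.generate_report(processed_query, ranked_results[:10])
print(report)
```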
## Testing
Run the test scripts to verify functionality:
```bash
# Test search execution
python test_search_execution.py

# Test all search handlers
python test_all_handlers.py
```
## Project Structure
```
sim-search/
├── config/              # Configuration management
├── query/               # Query processing
├── execution/           # Search execution
│   └── api_handlers/    # Search API handlers
├── ranking/             # Document ranking
├── test_*.py            # Test scripts
└── requirements.txt     # Dependencies
```
## LLM Providers
The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using Llama 3.1-8b-instant)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI
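As a sketch of how providers are swapped through LiteLLM, the call below targets the Groq-hosted model named above; switching to another provider usually only requires changing the model string and setting the matching API key environment variable (assuming default LiteLLM behavior).

```python
import os
from litellm import completion

# LiteLLM routes the call based on the provider prefix in the model string.
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

response = completion(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Summarize recent advances in quantum computing."}],
)
print(response.choices[0].message.content)
```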
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Jina AI for their embedding and reranking APIs
- Serper for their Google search API
- NewsAPI for their news search API
- OpenAlex for their academic search API
- CORE for their open access academic search API
- Unpaywall for their open access discovery API
- Groq for their fast LLM inference
- GitHub for their code search API
- StackExchange for their programming Q&A API