Project Overview: Intelligent Research System with Semantic Search
Purpose
This project implements an intelligent research system that automates the process of finding, filtering, and synthesizing information from various sources. At its core, the system uses semantic similarity search powered by Jina AI's APIs to understand context beyond simple keyword matching, enabling more intelligent document processing and information retrieval.
Goals
- Create an end-to-end research automation system that handles the entire process from query to final report
- Leverage multiple search sources to gather comprehensive information (Serper, Google Scholar, arXiv)
- Implement intelligent filtering and ranking of documents using semantic similarity
- Produce synthesized reports that extract and combine the most relevant information
- Build a modular and extensible architecture that can be enhanced with additional capabilities
High-Level Architecture
The system follows a modular pipeline:
- Query Processing:
  - Accept and process user research queries
  - Enhance queries with additional context and structure
  - Classify queries by type, intent, and entities
  - Generate optimized queries for different search engines
- Search Execution:
  - Execute search queries across multiple search engines (Serper, Google Scholar, arXiv)
  - Collect and process search results
  - Handle deduplication and result filtering
- Document Ranking:
  - Use Jina AI's Re-Ranker to order documents by relevance
  - Filter out less relevant documents
  - Apply additional filtering based on metadata (date, source, etc.)
- Report Generation:
  - Synthesize a comprehensive report from the selected documents
  - Format the report for readability
  - Include citations and references
- User Interface:
  - Provide a Gradio-based web interface for user interaction
  - Display search results and generated reports
  - Allow configuration of search parameters
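The pipeline stages above can be sketched as a chain of small functions over a shared session object. Everything here is illustrative: the names (`ResearchSession`, `run_pipeline`, and friends) are hypothetical rather than the project's actual API, and the LLM and search-engine calls are elided.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    url: str
    snippet: str
    score: float = 0.0

@dataclass
class ResearchSession:
    query: str
    enhanced_queries: list[str] = field(default_factory=list)
    documents: list[Document] = field(default_factory=list)
    report: str = ""

def process_query(session: ResearchSession) -> ResearchSession:
    # Stage 1: enhance/classify the query (LLM calls elided in this sketch).
    session.enhanced_queries = [session.query]
    return session

def execute_search(session: ResearchSession) -> ResearchSession:
    # Stage 2: fan out to Serper / Google Scholar / arXiv (elided),
    # then deduplicate results by URL.
    seen, deduped = set(), []
    for doc in session.documents:
        if doc.url not in seen:
            seen.add(doc.url)
            deduped.append(doc)
    session.documents = deduped
    return session

def rank_documents(session: ResearchSession) -> ResearchSession:
    # Stage 3: order by relevance score (the real system calls Jina's reranker).
    session.documents.sort(key=lambda d: d.score, reverse=True)
    return session

def generate_report(session: ResearchSession) -> ResearchSession:
    # Stage 4: synthesize a cited report (LLM call elided; a listing stands in).
    session.report = "\n".join(f"- {d.title} ({d.url})" for d in session.documents)
    return session

def run_pipeline(query: str) -> ResearchSession:
    session = ResearchSession(query=query)
    for stage in (process_query, execute_search, rank_documents, generate_report):
        session = stage(session)
    return session
```

Keeping each stage a plain function over one session object is what makes the pipeline modular: a new stage (or a Gradio front end) only needs to consume and return the session.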
Current Implementation Status
The project currently has the following modules implemented:
- Configuration Module:
  - Manages configuration settings for the entire system
  - Handles API keys and model selections
  - Supports different LLM providers and endpoints
- Query Processing Module:
  - Processes and enhances user queries
  - Classifies queries by type and intent
  - Generates optimized search queries
  - Integrates with LiteLLM for LLM provider support
- Search Execution Module:
  - Executes search queries across multiple search engines
  - Implements handlers for Serper, Google Scholar, and arXiv
  - Collects and processes search results
  - Handles deduplication and result filtering
- Document Ranking Module:
  - Implements Jina AI's Re-Ranker for document ranking
  - Supports reranking with metadata preservation
  - Provides filtering capabilities
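The reranking-with-metadata-preservation step might look roughly like the sketch below: post the document texts to Jina AI's rerank endpoint, then use the returned indices to re-attach each document's original metadata. The function names and the `"text"` key convention are assumptions, and the model name and response schema should be checked against Jina's API documentation.

```python
def attach_scores(docs: list[dict], results: list[dict]) -> list[dict]:
    """Map reranker results (each with 'index' and 'relevance_score') back
    onto the original document dicts, preserving all metadata."""
    ranked = []
    for r in results:
        doc = dict(docs[r["index"]])  # copy, so title/url/date metadata survives
        doc["relevance_score"] = r["relevance_score"]
        ranked.append(doc)
    return ranked

def rerank(query: str, docs: list[dict], api_key: str, top_n: int = 10) -> list[dict]:
    import requests  # deferred so the pure helper above has no dependency
    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",  # assumed model choice
            "query": query,
            "documents": [d["text"] for d in docs],  # 'text' key is an assumption
            "top_n": top_n,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return attach_scores(docs, resp.json()["results"])
```

Separating the HTTP call from `attach_scores` keeps the metadata-preservation logic testable without network access.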
Dependencies
- requests: For making HTTP requests to external APIs
- numpy: For vector operations in similarity computation
- tiktoken: For tokenization and token counting
- litellm: For unified LLM provider interface
- pyyaml: For configuration file parsing
- feedparser: For parsing RSS/Atom feeds (arXiv)
- beautifulsoup4: For HTML parsing
- gradio: For the web interface (planned)
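As an illustration of where pyyaml fits in, a minimal config loader might look like this. The `config.yaml` schema shown is hypothetical; the project's actual settings layout may differ.

```python
import os

# Hypothetical config.yaml contents -- the project's real schema may differ.
EXAMPLE_CONFIG = """\
llm:
  provider: groq
  model: llama-3.1-8b-instant
search:
  engines: [serper, google_scholar, arxiv]
  results_per_engine: 10
"""

def load_config(text: str) -> dict:
    import yaml  # pyyaml
    cfg = yaml.safe_load(text)
    # Keep secrets out of the YAML file; read them from the environment.
    cfg["api_keys"] = {
        "serper": os.environ.get("SERPER_API_KEY", ""),
        "jina": os.environ.get("JINA_API_KEY", ""),
    }
    return cfg
```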
LLM Providers
The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using llama-3.1-8b-instant)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI
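LiteLLM routes to these providers via `"provider/model"` strings, so switching providers is a one-line configuration change. A minimal sketch (function names are illustrative, and the prefix table should be verified against LiteLLM's provider docs):

```python
# LiteLLM's "provider/model" naming convention for the providers listed above.
PROVIDER_PREFIXES = {
    "groq": "groq/",
    "openai": "",            # plain OpenAI model names need no prefix in LiteLLM
    "anthropic": "anthropic/",
    "openrouter": "openrouter/",
    "azure": "azure/",
}

def litellm_model_name(provider: str, model: str) -> str:
    return PROVIDER_PREFIXES[provider] + model

def ask_llm(provider: str, model: str, prompt: str) -> str:
    # Requires the matching API key in the environment (e.g. GROQ_API_KEY).
    from litellm import completion  # deferred import; needs `litellm` installed
    resp = completion(
        model=litellm_model_name(provider, model),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```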
Search Engines
The system currently integrates with the following search engines:
- Serper API (for Google search)
- Google Scholar (via Serper API)
- arXiv (via official API)
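The arXiv integration can be sketched against the official query API, whose Atom responses feedparser handles directly. Function names here are illustrative, not the project's actual handler interface.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(terms: str, start: int = 0, max_results: int = 10) -> str:
    # Build a query against arXiv's public API; "all:" searches every field.
    params = {
        "search_query": f"all:{terms}",
        "start": start,
        "max_results": max_results,
        "sortBy": "relevance",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

def search_arxiv(terms: str, max_results: int = 10) -> list[dict]:
    import feedparser  # deferred import; parses the Atom feed arXiv returns
    feed = feedparser.parse(arxiv_query_url(terms, max_results=max_results))
    return [
        {"title": e.title, "url": e.link, "summary": e.summary}
        for e in feed.entries
    ]
```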
Next Steps
- Implement the Report Generation module
- Develop the Gradio UI for user interaction
- Add more search engines and LLM providers
- Implement document retrieval and processing
- Add support for saving and loading research sessions