Project Overview: Intelligent Research System with Semantic Search
Purpose
This project implements an intelligent research system that automates the process of finding, filtering, and synthesizing information from various sources. At its core, the system uses semantic similarity search powered by Jina AI's APIs to understand context beyond simple keyword matching, enabling more intelligent document processing and information retrieval.
Goals
- Create an end-to-end research automation system that handles the entire process from query to final report
- Leverage multiple search sources to gather comprehensive information (Serper, Google Scholar, arXiv)
- Implement intelligent filtering and ranking of documents using semantic similarity
- Produce synthesized reports that extract and combine the most relevant information
- Build a modular and extensible architecture that can be enhanced with additional capabilities
High-Level Architecture
The system follows a modular pipeline:
- Query Processing:
  - Accept and process user research queries
  - Enhance queries with additional context and structure
  - Classify queries by type, intent, and entities
  - Generate optimized queries for different search engines
- Search Execution:
  - Execute search queries across multiple search engines (Serper, Google Scholar, arXiv)
  - Collect and process search results
  - Handle deduplication and result filtering
- Document Ranking:
  - Use Jina AI's Re-Ranker to order documents by relevance
  - Filter out less relevant documents
  - Apply additional filtering based on metadata (date, source, etc.)
- Report Generation:
  - Synthesize a comprehensive report from the selected documents
  - Format the report for readability
  - Include citations and references
- User Interface:
  - Provide a Gradio-based web interface for user interaction
  - Display search results and generated reports
  - Allow configuration of search parameters
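The pipeline stages above can be sketched as a chain of small functions over a shared session object. Everything here is illustrative: the names (`ResearchSession`, `run_pipeline`, and friends) are hypothetical rather than the project's actual API, and the LLM and search-engine calls are elided.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    url: str
    snippet: str
    score: float = 0.0

@dataclass
class ResearchSession:
    query: str
    enhanced_queries: list[str] = field(default_factory=list)
    documents: list[Document] = field(default_factory=list)
    report: str = ""

def process_query(session: ResearchSession) -> ResearchSession:
    # Stage 1: enhance/classify the query (LLM calls elided in this sketch).
    session.enhanced_queries = [session.query]
    return session

def execute_search(session: ResearchSession) -> ResearchSession:
    # Stage 2: fan out to Serper / Google Scholar / arXiv (elided),
    # then deduplicate results by URL.
    seen, deduped = set(), []
    for doc in session.documents:
        if doc.url not in seen:
            seen.add(doc.url)
            deduped.append(doc)
    session.documents = deduped
    return session

def rank_documents(session: ResearchSession) -> ResearchSession:
    # Stage 3: order by relevance score (the real system calls Jina's reranker).
    session.documents.sort(key=lambda d: d.score, reverse=True)
    return session

def generate_report(session: ResearchSession) -> ResearchSession:
    # Stage 4: synthesize a cited report (LLM call elided; a listing stands in).
    session.report = "\n".join(f"- {d.title} ({d.url})" for d in session.documents)
    return session

def run_pipeline(query: str) -> ResearchSession:
    session = ResearchSession(query=query)
    for stage in (process_query, execute_search, rank_documents, generate_report):
        session = stage(session)
    return session
```

Keeping each stage a plain function over one session object is what makes the pipeline modular: a new stage (or a Gradio front end) only needs to consume and return the session.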
Current Implementation Status
The project currently has the following modules implemented:
- Configuration Module:
  - Manages configuration settings for the entire system
  - Handles API keys and model selections
  - Supports different LLM providers and endpoints
- Query Processing Module:
  - Processes and enhances user queries
  - Classifies queries by type and intent
  - Generates optimized search queries
  - Integrates with LiteLLM for LLM provider support
- Search Execution Module:
  - Executes search queries across multiple search engines
  - Implements handlers for Serper, Google Scholar, and arXiv
  - Collects and processes search results
  - Handles deduplication and result filtering
- Document Ranking Module:
  - Implements Jina AI's Re-Ranker for document ranking
  - Supports reranking with metadata preservation
  - Provides filtering capabilities
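The reranking-with-metadata-preservation step might look roughly like the sketch below: post the document texts to Jina AI's rerank endpoint, then use the returned indices to re-attach each document's original metadata. The function names and the `"text"` key convention are assumptions, and the model name and response schema should be checked against Jina's API documentation.

```python
def attach_scores(docs: list[dict], results: list[dict]) -> list[dict]:
    """Map reranker results (each with 'index' and 'relevance_score') back
    onto the original document dicts, preserving all metadata."""
    ranked = []
    for r in results:
        doc = dict(docs[r["index"]])  # copy, so title/url/date metadata survives
        doc["relevance_score"] = r["relevance_score"]
        ranked.append(doc)
    return ranked

def rerank(query: str, docs: list[dict], api_key: str, top_n: int = 10) -> list[dict]:
    import requests  # deferred so the pure helper above has no dependency
    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",  # assumed model choice
            "query": query,
            "documents": [d["text"] for d in docs],  # 'text' key is an assumption
            "top_n": top_n,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return attach_scores(docs, resp.json()["results"])
```

Separating the HTTP call from `attach_scores` keeps the metadata-preservation logic testable without network access.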
Dependencies
- requests: For making HTTP requests to external APIs
- numpy: For vector operations in similarity computation
- tiktoken: For tokenization and token counting
- litellm: For unified LLM provider interface
- pyyaml: For configuration file parsing
- feedparser: For parsing RSS/Atom feeds (arXiv)
- beautifulsoup4: For HTML parsing
- gradio: For the web interface (planned)
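As an illustration of where pyyaml fits in, a minimal config loader might look like this. The `config.yaml` schema shown is hypothetical; the project's actual settings layout may differ.

```python
import os

# Hypothetical config.yaml contents -- the project's real schema may differ.
EXAMPLE_CONFIG = """\
llm:
  provider: groq
  model: llama-3.1-8b-instant
search:
  engines: [serper, google_scholar, arxiv]
  results_per_engine: 10
"""

def load_config(text: str) -> dict:
    import yaml  # pyyaml
    cfg = yaml.safe_load(text)
    # Keep secrets out of the YAML file; read them from the environment.
    cfg["api_keys"] = {
        "serper": os.environ.get("SERPER_API_KEY", ""),
        "jina": os.environ.get("JINA_API_KEY", ""),
    }
    return cfg
```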
LLM Providers
The system supports multiple LLM providers through the LiteLLM interface:
- Groq (currently using llama-3.1-8b-instant)
- OpenAI
- Anthropic
- OpenRouter
- Azure OpenAI
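LiteLLM routes to these providers via `"provider/model"` strings, so switching providers is a one-line configuration change. A minimal sketch (function names are illustrative, and the prefix table should be verified against LiteLLM's provider docs):

```python
# LiteLLM's "provider/model" naming convention for the providers listed above.
PROVIDER_PREFIXES = {
    "groq": "groq/",
    "openai": "",            # plain OpenAI model names need no prefix in LiteLLM
    "anthropic": "anthropic/",
    "openrouter": "openrouter/",
    "azure": "azure/",
}

def litellm_model_name(provider: str, model: str) -> str:
    return PROVIDER_PREFIXES[provider] + model

def ask_llm(provider: str, model: str, prompt: str) -> str:
    # Requires the matching API key in the environment (e.g. GROQ_API_KEY).
    from litellm import completion  # deferred import; needs `litellm` installed
    resp = completion(
        model=litellm_model_name(provider, model),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```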
Search Engines
The system currently integrates with the following search engines:
- Serper API (for Google search)
- Google Scholar (via Serper API)
- arXiv (via official API)
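The arXiv integration can be sketched against the official query API, whose Atom responses feedparser handles directly. Function names here are illustrative, not the project's actual handler interface.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(terms: str, start: int = 0, max_results: int = 10) -> str:
    # Build a query against arXiv's public API; "all:" searches every field.
    params = {
        "search_query": f"all:{terms}",
        "start": start,
        "max_results": max_results,
        "sortBy": "relevance",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

def search_arxiv(terms: str, max_results: int = 10) -> list[dict]:
    import feedparser  # deferred import; parses the Atom feed arXiv returns
    feed = feedparser.parse(arxiv_query_url(terms, max_results=max_results))
    return [
        {"title": e.title, "url": e.link, "summary": e.summary}
        for e in feed.entries
    ]
```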
Next Steps
- Implement the Report Generation module
- Develop the Gradio UI for user interaction
- Add more search engines and LLM providers
- Implement document retrieval and processing
- Add support for saving and loading research sessions