# Code Structure

## Current Project Organization

```
project/
│
├── examples/  # Sample data and query examples
├── report/  # Report generation module
│   ├── __init__.py
│   ├── report_generator.py  # Module for generating reports
│   ├── report_synthesis.py  # Module for synthesizing reports
│   ├── progressive_report_synthesis.py  # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py  # Module for scraping documents
│   ├── report_detail_levels.py  # Module for managing report detail levels
│   ├── report_templates.py  # Module for managing report templates
│   └── database/  # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py  # Module for managing the database
├── tests/  # Test suite
│   ├── __init__.py
│   ├── execution/  # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/  # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/  # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/  # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/  # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/  # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/  # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py  # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/  # Configuration management
│   ├── __init__.py
│   ├── config.py  # Configuration management class
│   └── config.yaml  # YAML configuration file with settings for different components
├── query/  # Query processing module
│   ├── __init__.py
│   ├── query_processor.py  # Module for processing user queries
│   └── llm_interface.py  # Module for interacting with LLM providers
├── execution/  # Search execution module
│   ├── __init__.py
│   ├── search_executor.py  # Module for executing search queries
│   ├── result_collector.py  # Module for collecting search results
│   └── api_handlers/  # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py  # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py  # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py  # Handler for arXiv API
├── ranking/  # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py  # Module for reranking documents using Jina AI
├── ui/  # UI module
│   ├── __init__.py
│   └── gradio_interface.py  # Gradio-based web interface
├── scripts/  # Scripts
│   └── query_to_report.py  # Script for generating reports from queries
├── sim-search-api/  # FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth.py  # Authentication routes
│   │   │   │   ├── query.py  # Query processing routes
│   │   │   │   ├── search.py  # Search execution routes
│   │   │   │   └── report.py  # Report generation routes
│   │   │   ├── __init__.py
│   │   │   └── dependencies.py  # API dependencies (auth, rate limiting)
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── config.py  # API configuration
│   │   │   └── security.py  # Security utilities
│   │   ├── db/
│   │   │   ├── __init__.py
│   │   │   ├── session.py  # Database session
│   │   │   └── models.py  # Database models for reports, searches
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── token.py  # Token schemas
│   │   │   ├── user.py  # User schemas
│   │   │   ├── query.py  # Query schemas
│   │   │   ├── search.py  # Search result schemas
│   │   │   └── report.py  # Report schemas
│   │   ├── services/
│   │   │   ├── __init__.py
│   │   │   ├── query_service.py  # Query processing service
│   │   │   ├── search_service.py  # Search execution service
│   │   │   └── report_service.py  # Report generation service
│   │   └── main.py  # FastAPI application
│   ├── alembic/  # Database migrations
│   │   ├── versions/
│   │   │   └── 001_initial_migration.py  # Initial migration
│   │   ├── env.py  # Alembic environment
│   │   └── script.py.mako  # Alembic script template
│   ├── .env.example  # Environment variables template
│   ├── alembic.ini  # Alembic configuration
│   ├── requirements.txt  # API dependencies
│   ├── run.py  # Script to run the API
│   └── README.md  # API documentation
├── run_ui.py  # Script to run the UI
└── requirements.txt  # Project dependencies
```

## Module Details

### Config Module

The `config` module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

### Files

- `__init__.py`: Package initialization file
- `config.py`: Configuration management class
- `config.yaml`: YAML configuration file with settings for different components

### Classes

- `Config`: Singleton class for loading and accessing configuration settings
  - `load_config(config_path)`: Loads configuration from a YAML file
  - `get(key, default=None)`: Gets a configuration value by key

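
The singleton behavior described for `Config` can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: JSON stands in for YAML so that the example needs only the standard library, the dotted-key lookup in `get` is assumed, and `demo_config` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class Config:
    """Illustrative singleton; not the project's actual class."""

    _instance: Optional["Config"] = None

    def __new__(cls) -> "Config":
        # Reuse the single shared instance on every construction.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._settings = {}
        return cls._instance

    def load_config(self, config_path: str) -> None:
        # Assumption: JSON stands in for YAML to avoid a third-party parser;
        # the real module reads config.yaml.
        text = Path(config_path).read_text(encoding="utf-8")
        self._settings = json.loads(text)

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: dotted keys such as "llm.model" walk nested mappings.
        node: Any = self._settings
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node


def demo_config() -> Config:
    """Hypothetical helper: writes a temporary settings file and loads it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.json"
        path.write_text(
            json.dumps({"llm": {"model": "example-model"}}), encoding="utf-8"
        )
        config = Config()
        config.load_config(str(path))
    return config
```

Because `__new__` always returns the same object, any module that calls `Config()` after `load_config` has run sees the loaded settings.
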
### Query Module

The `query` module handles the processing and enhancement of user queries, including classification and optimization for search.

### Files

- `__init__.py`: Package initialization file
- `query_processor.py`: Main module for processing user queries
- `query_classifier.py`: Module for classifying query types
- `llm_interface.py`: Interface for interacting with LLM providers

### Classes

- `QueryProcessor`: Main class for processing user queries
  - `process_query(query)`: Processes a user query and returns enhanced results
  - `classify_query(query)`: Classifies a query by type and intent
  - `generate_search_queries(query, classification)`: Generates optimized search queries

- `QueryClassifier`: Class for classifying queries
  - `classify(query)`: Classifies a query by type, intent, and entities

- `LLMInterface`: Interface for interacting with LLM providers
  - `get_completion(prompt, model=None)`: Gets a completion from an LLM
  - `enhance_query(query)`: Enhances a query with additional context
  - `classify_query(query)`: Uses an LLM to classify a query

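
The way `QueryProcessor` might chain enhancement, classification, and search-query generation is sketched below. `StubLLMInterface`, the keyword-based classification rule, and the shape of the returned dictionary are assumptions made for illustration; the real classes call an LLM provider.

```python
import asyncio
from typing import Any, Dict, List


class StubLLMInterface:
    """Hypothetical stand-in for LLMInterface that needs no network access."""

    async def enhance_query(self, query: str) -> str:
        # A real implementation would ask an LLM to add context.
        return query.strip()

    async def classify_query(self, query: str) -> Dict[str, str]:
        # Keyword rule used only for this sketch.
        lowered = query.lower()
        if " vs " in lowered or "compare" in lowered:
            return {"type": "comparative"}
        return {"type": "factual"}


class QueryProcessor:
    def __init__(self, llm: StubLLMInterface) -> None:
        self.llm = llm

    async def classify_query(self, query: str) -> Dict[str, str]:
        return await self.llm.classify_query(query)

    async def generate_search_queries(
        self, query: str, classification: Dict[str, str]
    ) -> List[str]:
        # Assumption: comparative queries get one extra search variant.
        queries = [query]
        if classification["type"] == "comparative":
            queries.append(f"{query} comparison")
        return queries

    async def process_query(self, query: str) -> Dict[str, Any]:
        enhanced = await self.llm.enhance_query(query)
        classification = await self.classify_query(enhanced)
        search_queries = await self.generate_search_queries(enhanced, classification)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }


result = asyncio.run(QueryProcessor(StubLLMInterface()).process_query(" solar vs wind "))
```

Keeping the LLM behind its own interface lets the processor be exercised with a stub like this one in tests.
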
### Execution Module

The `execution` module handles the execution of search queries across multiple search engines and the collection of results.

### Files

- `__init__.py`: Package initialization file
- `search_executor.py`: Module for executing search queries
- `result_collector.py`: Module for collecting and processing search results
- `api_handlers/`: Directory containing handlers for different search APIs
  - `__init__.py`: Package initialization file
  - `base_handler.py`: Base class for search handlers
  - `serper_handler.py`: Handler for Serper API (Google search)
  - `scholar_handler.py`: Handler for Google Scholar via Serper
  - `arxiv_handler.py`: Handler for arXiv API

### Classes

- `SearchExecutor`: Class for executing search queries
  - `execute_search(query_data)`: Executes a search across multiple engines
  - `_execute_search_async(query, engines)`: Executes a search asynchronously
  - `_execute_search_sync(query, engines)`: Executes a search synchronously

- `ResultCollector`: Class for collecting and processing search results
  - `process_results(search_results)`: Processes search results from multiple engines
  - `deduplicate_results(results)`: Deduplicates results based on URL
  - `save_results(results, file_path)`: Saves results to a file

- `BaseSearchHandler`: Base class for search handlers
  - `search(query, num_results)`: Abstract method for searching
  - `_process_response(response)`: Processes the API response

- `SerperSearchHandler`: Handler for Serper API
  - `search(query, num_results)`: Searches using Serper API
  - `_process_response(response)`: Processes the Serper API response

- `ScholarSearchHandler`: Handler for Google Scholar via Serper
  - `search(query, num_results)`: Searches Google Scholar
  - `_process_response(response)`: Processes the Scholar API response

- `ArxivSearchHandler`: Handler for arXiv API
  - `search(query, num_results)`: Searches arXiv
  - `_process_response(response)`: Processes the arXiv API response

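
The handler contract and the URL-based deduplication described above can be sketched like this. `StaticSearchHandler`, the `link`/`title`/`url` field names, and treating `_process_response` as abstract are assumptions for illustration; the real handlers call external APIs.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

SearchResult = Dict[str, Any]


class BaseSearchHandler(ABC):
    """Contract that every API handler fulfils in this sketch."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> List[SearchResult]:
        """Return up to num_results normalized results for query."""

    @abstractmethod
    def _process_response(self, response: Dict[str, Any]) -> List[SearchResult]:
        """Convert a raw API payload into normalized results."""


class StaticSearchHandler(BaseSearchHandler):
    """Hypothetical handler that serves a canned payload instead of an API."""

    def __init__(self, payload: Dict[str, Any]) -> None:
        self._payload = payload

    def search(self, query: str, num_results: int = 10) -> List[SearchResult]:
        return self._process_response(self._payload)[:num_results]

    def _process_response(self, response: Dict[str, Any]) -> List[SearchResult]:
        # Assumption: raw items carry "link" and "title"; results use "url".
        return [
            {"url": item["link"], "title": item["title"]}
            for item in response.get("items", [])
        ]


def deduplicate_results(results: List[SearchResult]) -> List[SearchResult]:
    """Keep the first result seen for each URL, preserving order."""
    seen = set()
    unique: List[SearchResult] = []
    for result in results:
        url = result.get("url")
        if url is not None and url in seen:
            continue
        if url is not None:
            seen.add(url)
        unique.append(result)
    return unique


payload = {
    "items": [
        {"link": "https://a.example", "title": "A"},
        {"link": "https://b.example", "title": "B"},
        {"link": "https://a.example", "title": "A again"},
    ]
}
results = StaticSearchHandler(payload).search("anything", num_results=3)
unique_results = deduplicate_results(results)
```

Normalizing every engine's response to the same dictionary shape is what allows a single deduplication pass over results from all handlers.
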
### Ranking Module

The `ranking` module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.

### Files

- `__init__.py`: Package initialization file
- `jina_reranker.py`: Module for reranking documents using Jina AI
- `filter_manager.py`: Module for filtering documents

### Classes

- `JinaReranker`: Class for reranking documents
  - `rerank(documents, query)`: Reranks documents based on relevance to query
  - `_prepare_inputs(documents, query)`: Prepares inputs for the reranker

- `FilterManager`: Class for filtering documents
  - `filter_by_date(documents, start_date, end_date)`: Filters by date
  - `filter_by_source(documents, sources)`: Filters by source

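
A minimal sketch of the two `FilterManager` filters follows. It assumes that each document is a dictionary with an ISO-formatted `date` string and a `source` string, and that the date range is inclusive at both ends; none of these details are confirmed by the module description.

```python
from datetime import date
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


class FilterManager:
    """Illustrative filters over a list of document dictionaries."""

    def filter_by_date(
        self, documents: List[Document], start_date: date, end_date: date
    ) -> List[Document]:
        # Assumption: "date" holds an ISO string and both bounds are inclusive.
        return [
            doc
            for doc in documents
            if start_date <= date.fromisoformat(doc["date"]) <= end_date
        ]

    def filter_by_source(
        self, documents: List[Document], sources: Iterable[str]
    ) -> List[Document]:
        # Assumption: "source" names the engine that returned the document.
        allowed = set(sources)
        return [doc for doc in documents if doc["source"] in allowed]


documents = [
    {"title": "Old paper", "date": "2019-05-01", "source": "arxiv"},
    {"title": "New paper", "date": "2024-02-10", "source": "arxiv"},
    {"title": "News item", "date": "2024-03-01", "source": "serper"},
]
manager = FilterManager()
recent = manager.filter_by_date(documents, date(2024, 1, 1), date(2024, 12, 31))
recent_arxiv = manager.filter_by_source(recent, ["arxiv"])
```

Both filters return new lists, so they can be chained in either order without mutating the input.
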
### Report Templates Module

The `report_templates` module provides a template system for generating reports with different detail levels and query types.

### Files

- `__init__.py`: Package initialization file
- `report_templates.py`: Module for managing report templates

### Classes

- `QueryType` (Enum): Defines the types of queries supported by the system
  - `FACTUAL`: For factual queries seeking specific information
  - `EXPLORATORY`: For exploratory queries investigating a topic
  - `COMPARATIVE`: For comparative queries comparing multiple items

- `DetailLevel` (Enum): Defines the levels of detail for generated reports
  - `BRIEF`: Short summary with key findings
  - `STANDARD`: Standard report with introduction, key findings, and analysis
  - `DETAILED`: Detailed report with methodology and more in-depth analysis
  - `COMPREHENSIVE`: Comprehensive report with executive summary, literature review, and appendices

- `ReportTemplate`: Class representing a report template
  - `template` (str): The template string with placeholders
  - `detail_level` (DetailLevel): The detail level of the template
  - `query_type` (QueryType): The query type the template is designed for
  - `model` (Optional[str]): The LLM model recommended for this template
  - `required_sections` (Optional[List[str]]): Required sections in the template
  - `validate()`: Validates that the template contains all required sections

- `ReportTemplateManager`: Class for managing report templates
  - `add_template(template)`: Adds a template to the manager
  - `get_template(query_type, detail_level)`: Gets a template for a specific query type and detail level
  - `get_available_templates()`: Gets a list of available templates
  - `initialize_default_templates()`: Initializes the default templates for all combinations of query types and detail levels

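
The template system above can be sketched as a pair of enums, a dataclass, and a manager keyed by (query type, detail level). The enum values, the placeholder template text, and the rule that a required section is present when its placeholder appears in the template string are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        # Assumption: a section is present if its placeholder text appears
        # somewhere in the template string.
        return all(s in self.template for s in self.required_sections or [])


class ReportTemplateManager:
    def __init__(self) -> None:
        self._templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self._templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self._templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[str]:
        return [f"{q.value}_{d.value}" for q, d in self._templates]

    def initialize_default_templates(self) -> None:
        # One placeholder template per combination: 3 query types x 4 levels.
        for query_type, detail_level in product(QueryType, DetailLevel):
            self.add_template(
                ReportTemplate(
                    template="# {title}\n\n{key_findings}",
                    detail_level=detail_level,
                    query_type=query_type,
                    required_sections=["{title}", "{key_findings}"],
                )
            )


template_manager = ReportTemplateManager()
template_manager.initialize_default_templates()
brief_factual = template_manager.get_template(QueryType.FACTUAL, DetailLevel.BRIEF)
```

Keying the lookup on the pair of enums makes a missing combination explicit: `get_template` returns `None`, and the caller can decide on a fallback.
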
### Progressive Report Synthesis Module

The `progressive_report_synthesis` module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.

### Files

- `__init__.py`: Package initialization file
- `progressive_report_synthesis.py`: Module for progressive report generation

### Classes

- `ReportState`: Class to track the state of a progressive report
  - `current_report` (str): The current version of the report
  - `processed_chunks` (Set[str]): Set of document IDs that have been processed
  - `version` (int): Current version number of the report
  - `last_update_time` (float): Timestamp of the last update
  - `improvement_scores` (List[float]): List of improvement scores for each iteration
  - `is_complete` (bool): Whether the report generation is complete
  - `termination_reason` (Optional[str]): Reason for termination if complete

- `ProgressiveReportSynthesizer`: Class for progressive report synthesis
  - Extends `ReportSynthesizer` to implement a progressive approach
  - `set_progress_callback(callback)`: Sets a callback function to report progress
  - `prioritize_chunks(chunks, query)`: Prioritizes chunks based on relevance
  - `extract_information_from_chunk(chunk, query, detail_level)`: Extracts key information from a chunk
  - `refine_report(current_report, new_information, query, query_type, detail_level)`: Refines the report with new information
  - `initialize_report(initial_chunks, query, query_type, detail_level)`: Initializes the report with the first batch of chunks
  - `should_terminate(improvement_score)`: Determines if the process should terminate
  - `synthesize_report_progressively(chunks, query, query_type, detail_level)`: Main method for progressive report generation
  - `synthesize_report(chunks, query, query_type, detail_level)`: Override of parent method to use progressive approach for comprehensive detail level

- `get_progressive_report_synthesizer(model_name)`: Factory function to get a singleton instance

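
The iterative loop and the state it tracks can be sketched as follows. The `ReportState` fields mirror the list above; everything else is an assumption for illustration: the real synthesizer refines the report with LLM calls, whereas this sketch appends chunk text and scores improvement by the share of previously unseen words. The chunk keys `document_id` and `content`, the thresholds, and the termination reason strings are also hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def improvement_score(report: str, chunk_text: str) -> float:
    """Stand-in metric: fraction of the chunk's words not yet in the report."""
    chunk_words = set(chunk_text.lower().split())
    if not chunk_words:
        return 0.0
    new_words = chunk_words - set(report.lower().split())
    return len(new_words) / len(chunk_words)


def synthesize_progressively(
    chunks: List[Dict[str, str]],
    min_improvement: float = 0.2,
    max_iterations: int = 5,
) -> ReportState:
    state = ReportState()
    for chunk in chunks:
        if state.version >= max_iterations:
            state.termination_reason = "max_iterations"
            break
        score = improvement_score(state.current_report, chunk["content"])
        # Stand-in for the LLM refinement step: append the chunk text.
        state.current_report = f"{state.current_report}\n{chunk['content']}".strip()
        state.processed_chunks.add(chunk["document_id"])
        state.version += 1
        state.last_update_time = time.time()
        state.improvement_scores.append(score)
        if score < min_improvement:
            state.termination_reason = "diminishing_returns"
            break
    else:
        # The loop finished without hitting either early-exit condition.
        state.termination_reason = "all_chunks_processed"
    state.is_complete = True
    return state
```

For example, feeding the same chunk twice stops after the second iteration with `diminishing_returns`, because the repeated chunk contributes no new words.
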
### FastAPI Backend Module

The `sim-search-api` module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.

### Files

- `app/`: Main application directory
  - `api/`: API routes and dependencies
    - `routes/`: API route handlers
      - `auth.py`: Authentication routes
      - `query.py`: Query processing routes
      - `search.py`: Search execution routes
      - `report.py`: Report generation routes
    - `dependencies.py`: API dependencies (auth, rate limiting)
  - `core/`: Core functionality
    - `config.py`: API configuration
    - `security.py`: Security utilities
  - `db/`: Database models and session management
    - `models.py`: Database models for users, searches, and reports
    - `session.py`: Database session management
  - `schemas/`: Pydantic schemas for request/response validation
    - `token.py`: Token schemas
    - `user.py`: User schemas
    - `query.py`: Query schemas
    - `search.py`: Search result schemas
    - `report.py`: Report schemas
  - `services/`: Service layer for business logic
    - `query_service.py`: Query processing service
    - `search_service.py`: Search execution service
    - `report_service.py`: Report generation service
  - `main.py`: FastAPI application entry point
- `alembic/`: Database migrations
  - `versions/`: Migration versions
    - `001_initial_migration.py`: Initial migration
  - `env.py`: Alembic environment
  - `script.py.mako`: Alembic script template
- `alembic.ini`: Alembic configuration
- `requirements.txt`: API dependencies
- `run.py`: Script to run the API
- `.env.example`: Environment variables template
- `README.md`: API documentation

### Classes

- `app.db.models.User`: User model for authentication
  - `id` (str): User ID
  - `email` (str): User email
  - `hashed_password` (str): Hashed password
  - `full_name` (str): User's full name
  - `is_active` (bool): Whether the user is active
  - `is_superuser` (bool): Whether the user is a superuser

- `app.db.models.Search`: Search model for storing search results
  - `id` (str): Search ID
  - `user_id` (str): User ID
  - `query` (str): Original query
  - `enhanced_query` (str): Enhanced query
  - `query_type` (str): Query type
  - `engines` (str): Search engines used
  - `results_count` (int): Number of results
  - `results` (JSON): Search results
  - `created_at` (datetime): Creation timestamp

- `app.db.models.Report`: Report model for storing generated reports
  - `id` (str): Report ID
  - `user_id` (str): User ID
  - `search_id` (str): Search ID
  - `title` (str): Report title
  - `content` (str): Report content
  - `detail_level` (str): Detail level
  - `query_type` (str): Query type
  - `model_used` (str): Model used for generation
  - `created_at` (datetime): Creation timestamp
  - `updated_at` (datetime): Update timestamp

- `app.services.QueryService`: Service for query processing
  - `process_query(query)`: Processes a query
  - `classify_query(query)`: Classifies a query

- `app.services.SearchService`: Service for search execution
  - `execute_search(structured_query, search_engines, num_results, timeout, user_id, db)`: Executes a search
  - `get_available_search_engines()`: Gets available search engines
  - `get_search_results(search)`: Gets results for a specific search

- `app.services.ReportService`: Service for report generation
  - `generate_report_background(report_id, report_in, search, db, progress_dict)`: Generates a report in the background
  - `generate_report_file(report, format)`: Generates a report file in the specified format

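
The role of `progress_dict` in background report generation can be illustrated with a simplified sketch. The signature below is deliberately reduced (no `report_in`, `search`, or `db`), the `progress` and `status` keys are hypothetical, and joining chunks stands in for the real synthesis step.

```python
import asyncio
from typing import Any, Dict, List


async def generate_report_background(
    report_id: str,
    chunks: List[str],
    progress_dict: Dict[str, Dict[str, Any]],
) -> str:
    """Simplified stand-in: joins chunks and records progress per report ID."""
    progress_dict[report_id] = {"progress": 0.0, "status": "running"}
    parts: List[str] = []
    for index, chunk in enumerate(chunks, start=1):
        # Yield to the event loop, as a real LLM or database call would.
        await asyncio.sleep(0)
        parts.append(chunk)
        progress_dict[report_id]["progress"] = index / len(chunks)
    progress_dict[report_id]["progress"] = 1.0
    progress_dict[report_id]["status"] = "completed"
    return "\n".join(parts)


progress: Dict[str, Dict[str, Any]] = {}
report_text = asyncio.run(
    generate_report_background("report-1", ["Introduction", "Findings"], progress)
)
```

Because the dictionary is shared and keyed by report ID, a separate status endpoint could read `progress["report-1"]` while generation is still running.
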
## Recent Updates

### 2025-03-20: FastAPI Backend Implementation

1. **FastAPI Application Structure**:
   - Created a new directory `sim-search-api` for the FastAPI application
   - Set up project structure with API routes, core functionality, database models, schemas, and services
   - Implemented a layered architecture with API, service, and data layers
   - Added proper `__init__.py` files to make all directories proper Python packages

2. **API Routes Implementation**:
   - Created authentication routes for user registration and token generation
   - Implemented query processing routes for query enhancement and classification
   - Added search execution routes for executing searches and managing search history
   - Created report generation routes for generating and managing reports
   - Implemented proper error handling and validation for all routes

3. **Service Layer Implementation**:
   - Created `QueryService` to bridge between API and existing query processing functionality
   - Implemented `SearchService` for search execution and result management
   - Added `ReportService` for report generation and management
   - Ensured proper integration with existing sim-search functionality
   - Implemented asynchronous operation for all services

4. **Database Setup**:
   - Created SQLAlchemy models for users, searches, and reports
   - Implemented database session management
   - Set up Alembic for database migrations
   - Created initial migration script to create all tables
   - Added proper relationships between models

5. **Authentication and Security**:
   - Implemented JWT-based authentication
   - Added password hashing and verification
   - Created token generation and validation
   - Implemented user registration and login
   - Added proper authorization for protected routes

6. **Documentation and Configuration**:
   - Created comprehensive API documentation
   - Added OpenAPI documentation endpoints
   - Implemented environment variable configuration
   - Created a README with setup and usage instructions
   - Added example environment variables file

### 2025-03-12: Progressive Report Generation Implementation

1. **Progressive Report Synthesis Module**:
   - Created a new module `progressive_report_synthesis.py` for progressive report generation
   - Implemented `ReportState` class to track the state of a progressive report
   - Created `ProgressiveReportSynthesizer` class extending from `ReportSynthesizer`
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism

2. **Report Generator Integration**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
   - Added proper model selection and configuration for both synthesizers

3. **Testing**:
   - Created `test_progressive_report.py` to test progressive report generation
   - Implemented comparison functionality between progressive and standard approaches
   - Added test cases for different query types and document collections

### 2025-03-11: Report Templates Implementation

1. **Report Templates Module**:
   - Created a new module `report_templates.py` for managing report templates
   - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Created a template system with placeholders for different report sections
   - Implemented 12 different templates (3 query types × 4 detail levels)
   - Added validation to ensure templates contain all required sections

2. **Report Synthesis Integration**:
   - Updated the report synthesis module to use the new template system
   - Added support for different templates based on query type and detail level
   - Implemented fallback to standard templates when specific templates are not found
   - Added better logging for template retrieval process

3. **Testing**:
   - Created `test_report_templates.py` to test template retrieval and validation
   - Implemented `test_brief_report.py` to test the brief report generation
   - Successfully tested all combinations of detail levels and query types

### 2025-02-28: Async Implementation and Reference Formatting

1. **LLM Interface Updates**:
   - Converted key methods to async:
     - `generate_completion`
     - `classify_query`
     - `enhance_query`
     - `generate_search_queries`
   - Added special handling for Gemini models
   - Improved reference formatting instructions

2. **Query Processor Updates**:
   - Updated `process_query` to be async
   - Made `generate_search_queries` async
   - Fixed async/await patterns throughout

3. **Gradio Interface Updates**:
   - Modified `generate_report` to handle async operations
   - Updated report button click handler
   - Improved error handling