Code Structure

Current Project Organization

project/
│
├── examples/          # Sample data and query examples
├── report/            # Report generation module
│   ├── __init__.py
│   ├── report_generator.py    # Module for generating reports
│   ├── report_synthesis.py    # Module for synthesizing reports
│   ├── progressive_report_synthesis.py # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py    # Module for scraping documents
│   ├── report_detail_levels.py # Module for managing report detail levels
│   ├── report_templates.py    # Module for managing report templates
│   └── database/              # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py      # Module for managing the database
├── tests/             # Test suite
│   ├── __init__.py
│   ├── execution/     # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/   # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/         # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/       # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/        # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/            # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/             # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py     # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/            # Configuration management
│   ├── __init__.py
│   ├── config.py              # Configuration management class
│   └── config.yaml            # YAML configuration file with settings for different components
├── query/            # Query processing module
│   ├── __init__.py
│   ├── query_processor.py     # Module for processing user queries
│   └── llm_interface.py       # Module for interacting with LLM providers
├── execution/        # Search execution module
│   ├── __init__.py
│   ├── search_executor.py     # Module for executing search queries
│   ├── result_collector.py    # Module for collecting search results
│   └── api_handlers/          # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py    # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py   # Handler for arXiv API
├── ranking/          # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py       # Module for reranking documents using Jina AI
├── ui/              # UI module
│   ├── __init__.py
│   └── gradio_interface.py    # Gradio-based web interface
├── scripts/         # Scripts
│   └── query_to_report.py     # Script for generating reports from queries
├── sim-search-api/   # FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth.py           # Authentication routes
│   │   │   │   ├── query.py          # Query processing routes
│   │   │   │   ├── search.py         # Search execution routes
│   │   │   │   └── report.py         # Report generation routes
│   │   │   ├── __init__.py
│   │   │   └── dependencies.py       # API dependencies (auth, rate limiting)
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── config.py             # API configuration
│   │   │   └── security.py           # Security utilities
│   │   ├── db/
│   │   │   ├── __init__.py
│   │   │   ├── session.py            # Database session
│   │   │   └── models.py             # Database models for users, searches, and reports
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── token.py              # Token schemas
│   │   │   ├── user.py               # User schemas
│   │   │   ├── query.py              # Query schemas
│   │   │   ├── search.py             # Search result schemas
│   │   │   └── report.py             # Report schemas
│   │   ├── services/
│   │   │   ├── __init__.py
│   │   │   ├── query_service.py      # Query processing service
│   │   │   ├── search_service.py     # Search execution service
│   │   │   └── report_service.py     # Report generation service
│   │   └── main.py                   # FastAPI application
│   ├── alembic/                      # Database migrations
│   │   ├── versions/
│   │   │   └── 001_initial_migration.py  # Initial migration
│   │   ├── env.py                    # Alembic environment
│   │   └── script.py.mako            # Alembic script template
│   ├── .env.example                  # Environment variables template
│   ├── alembic.ini                   # Alembic configuration
│   ├── requirements.txt              # API dependencies
│   ├── run.py                        # Script to run the API
│   └── README.md                     # API documentation
├── run_ui.py         # Script to run the UI
└── requirements.txt  # Project dependencies

Module Details

Config Module

The config module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

Files

__init__.py: Package initialization file
config.py: Configuration management class
config.yaml: YAML configuration file with settings for different components

Classes

Config: Singleton class for loading and accessing configuration settings
- load_config(config_path): Loads configuration from a YAML file
- get(key, default=None): Gets a configuration value by key
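
The following is a minimal sketch of how a singleton `Config` with `load_config` and `get` could look. It is not the project's actual implementation: the `_data` attribute, the dotted-key lookup in `get`, and the use of PyYAML's `yaml.safe_load` are assumptions made for illustration.

```python
from typing import Any, Optional


class Config:
    """Process-wide configuration holder (singleton sketch)."""

    _instance: Optional["Config"] = None

    def __new__(cls) -> "Config":
        # Create the single shared instance on first use, then reuse it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._data = {}  # hypothetical internal storage
        return cls._instance

    def load_config(self, config_path: str) -> None:
        """Load settings from a YAML file (assumes PyYAML is installed)."""
        import yaml  # imported lazily so the class has no hard dependency

        with open(config_path, "r", encoding="utf-8") as fh:
            self._data = yaml.safe_load(fh) or {}

    def get(self, key: str, default: Any = None) -> Any:
        """Return a value; dotted keys such as 'a.b' walk nested mappings (assumed)."""
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

Because every `Config()` call returns the same object, modules can construct it freely without passing a configuration handle around; the trade-off is that tests must reset the shared state themselves.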

Query Module

The query module handles the processing and enhancement of user queries, including classification and optimization for search.

Files

__init__.py: Package initialization file
query_processor.py: Main module for processing user queries
query_classifier.py: Module for classifying query types
llm_interface.py: Interface for interacting with LLM providers

Classes

QueryProcessor: Main class for processing user queries
- process_query(query): Processes a user query and returns enhanced results
- classify_query(query): Classifies a query by type and intent
- generate_search_queries(query, classification): Generates optimized search queries
QueryClassifier: Class for classifying queries
- classify(query): Classifies a query by type, intent, and entities
LLMInterface: Interface for interacting with LLM providers
- get_completion(prompt, model=None): Gets a completion from an LLM
- enhance_query(query): Enhances a query with additional context
- classify_query(query): Uses an LLM to classify a query

Execution Module

The execution module handles the execution of search queries across multiple search engines and the collection of results.

Files

__init__.py: Package initialization file
search_executor.py: Module for executing search queries
result_collector.py: Module for collecting and processing search results
api_handlers/: Directory containing handlers for different search APIs
- __init__.py: Package initialization file
- base_handler.py: Base class for search handlers
- serper_handler.py: Handler for Serper API (Google search)
- scholar_handler.py: Handler for Google Scholar via Serper
- arxiv_handler.py: Handler for arXiv API

Classes

SearchExecutor: Class for executing search queries
- execute_search(query_data): Executes a search across multiple engines
- _execute_search_async(query, engines): Executes a search asynchronously
- _execute_search_sync(query, engines): Executes a search synchronously
ResultCollector: Class for collecting and processing search results
- process_results(search_results): Processes search results from multiple engines
- deduplicate_results(results): Deduplicates results based on URL
- save_results(results, file_path): Saves results to a file
BaseSearchHandler: Base class for search handlers
- search(query, num_results): Abstract method for searching
- _process_response(response): Processes the API response
SerperSearchHandler: Handler for Serper API
- search(query, num_results): Searches using Serper API
- _process_response(response): Processes the Serper API response
ScholarSearchHandler: Handler for Google Scholar via Serper
- search(query, num_results): Searches Google Scholar
- _process_response(response): Processes the Scholar API response
ArxivSearchHandler: Handler for arXiv API
- search(query, num_results): Searches arXiv
- _process_response(response): Processes the arXiv API response
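
The handler hierarchy, the asynchronous fan-out, and URL-based deduplication described above can be sketched as follows. This is an illustration under assumptions, not the project's code: `FakeHandler`, the result dictionary keys (`url`, `title`, `source`), the use of `asyncio.to_thread` (Python 3.9+), and the trailing-slash normalisation are all hypothetical choices.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Result = Dict[str, Any]


class BaseSearchHandler(ABC):
    """Common interface that every engine-specific handler implements."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> List[Result]:
        ...


class FakeHandler(BaseSearchHandler):
    """Hypothetical handler returning canned URLs instead of calling an API."""

    def __init__(self, name: str, urls: List[str]) -> None:
        self.name = name
        self.urls = urls

    def search(self, query: str, num_results: int = 10) -> List[Result]:
        return [
            {"url": url, "title": f"{self.name}: {query}", "source": self.name}
            for url in self.urls[:num_results]
        ]


async def execute_search_async(
    query: str, handlers: Dict[str, BaseSearchHandler], num_results: int = 10
) -> Dict[str, List[Result]]:
    """Query all engines concurrently; blocking handlers run in worker threads."""
    names = list(handlers)
    batches = await asyncio.gather(
        *(asyncio.to_thread(handlers[n].search, query, num_results) for n in names),
        return_exceptions=True,
    )
    # A failing engine yields an empty list instead of aborting the whole search.
    return {
        name: ([] if isinstance(batch, BaseException) else batch)
        for name, batch in zip(names, batches)
    }


def deduplicate_results(results: List[Result]) -> List[Result]:
    """Keep the first result for each URL, ignoring a trailing slash."""
    seen = set()
    unique: List[Result] = []
    for item in results:
        url = str(item.get("url", "")).rstrip("/")
        if url and url in seen:
            continue
        seen.add(url)
        unique.append(item)
    return unique
```

Collecting exceptions with `return_exceptions=True` means one unavailable engine degrades the result set rather than failing the whole request, which is usually the preferable behaviour for a multi-engine search.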

Ranking Module

The ranking module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.

Files

__init__.py: Package initialization file
jina_reranker.py: Module for reranking documents using Jina AI
filter_manager.py: Module for filtering documents

Classes

JinaReranker: Class for reranking documents
- rerank(documents, query): Reranks documents based on relevance to query
- _prepare_inputs(documents, query): Prepares inputs for the reranker
FilterManager: Class for filtering documents
- filter_by_date(documents, start_date, end_date): Filters by date
- filter_by_source(documents, sources): Filters by source
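
The two `FilterManager` methods could be sketched as below. The document shape is an assumption: each document is taken to be a dictionary with an ISO-formatted `date` string and a `source` name, and undated documents are dropped by the date filter. The real module may use different field names and rules.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Optional

Document = Dict[str, Any]


class FilterManager:
    def filter_by_date(
        self,
        documents: Iterable[Document],
        start_date: Optional[date] = None,
        end_date: Optional[date] = None,
    ) -> List[Document]:
        """Keep documents published within [start_date, end_date], inclusive."""
        kept: List[Document] = []
        for doc in documents:
            raw = doc.get("date")
            if raw is None:
                continue  # assumed policy: undated documents are excluded
            published = date.fromisoformat(raw)
            if start_date is not None and published < start_date:
                continue
            if end_date is not None and published > end_date:
                continue
            kept.append(doc)
        return kept

    def filter_by_source(
        self, documents: Iterable[Document], sources: Iterable[str]
    ) -> List[Document]:
        """Keep documents whose source matches one of `sources`, case-insensitively."""
        allowed = {s.lower() for s in sources}
        return [d for d in documents if str(d.get("source", "")).lower() in allowed]
```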

Report Templates Module

The report_templates module provides a template system for generating reports with different detail levels and query types.

Files

__init__.py: Package initialization file
report_templates.py: Module for managing report templates

Classes

QueryType (Enum): Defines the types of queries supported by the system
- FACTUAL: For factual queries seeking specific information
- EXPLORATORY: For exploratory queries investigating a topic
- COMPARATIVE: For comparative queries comparing multiple items
DetailLevel (Enum): Defines the levels of detail for generated reports
- BRIEF: Short summary with key findings
- STANDARD: Standard report with introduction, key findings, and analysis
- DETAILED: Detailed report with methodology and more in-depth analysis
- COMPREHENSIVE: Comprehensive report with executive summary, literature review, and appendices
ReportTemplate: Class representing a report template
- template (str): The template string with placeholders
- detail_level (DetailLevel): The detail level of the template
- query_type (QueryType): The query type the template is designed for
- model (Optional[str]): The LLM model recommended for this template
- required_sections (Optional[List[str]]): Required sections in the template
- validate(): Validates that the template contains all required sections
ReportTemplateManager: Class for managing report templates
- add_template(template): Adds a template to the manager
- get_template(query_type, detail_level): Gets a template for a specific query type and detail level
- get_available_templates(): Gets a list of available templates
- initialize_default_templates(): Initializes the default templates for all combinations of query types and detail levels
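
A compact sketch of the template system described above is shown here. The enum members and class attributes follow the listing; the enum values, the placeholder template text, the `"## Key Findings"` section marker, and the dictionary keyed by `(query_type, detail_level)` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        """True when every required section marker appears in the template text."""
        return all(s in self.template for s in (self.required_sections or []))


class ReportTemplateManager:
    def __init__(self) -> None:
        self._templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self._templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self._templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[ReportTemplate]:
        return list(self._templates.values())

    def initialize_default_templates(self) -> None:
        # One template per combination: 3 query types x 4 detail levels = 12.
        for query_type in QueryType:
            for detail_level in DetailLevel:
                self.add_template(
                    ReportTemplate(
                        template="# {title}\n\n## Key Findings\n{findings}\n",
                        detail_level=detail_level,
                        query_type=query_type,
                        required_sections=["## Key Findings"],
                    )
                )
```

Keying the registry on the `(query_type, detail_level)` pair makes lookup a single dictionary access and lets a caller detect a missing combination by checking for `None`.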

Progressive Report Synthesis Module

The progressive_report_synthesis module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.

Files

__init__.py: Package initialization file
progressive_report_synthesis.py: Module for progressive report generation

Classes

ReportState: Class to track the state of a progressive report
- current_report (str): The current version of the report
- processed_chunks (Set[str]): Set of document IDs that have been processed
- version (int): Current version number of the report
- last_update_time (float): Timestamp of the last update
- improvement_scores (List[float]): List of improvement scores for each iteration
- is_complete (bool): Whether the report generation is complete
- termination_reason (Optional[str]): Reason for termination if complete
ProgressiveReportSynthesizer: Class for progressive report synthesis
- Extends ReportSynthesizer to implement a progressive approach
- set_progress_callback(callback): Sets a callback function to report progress
- prioritize_chunks(chunks, query): Prioritizes chunks based on relevance
- extract_information_from_chunk(chunk, query, detail_level): Extracts key information from a chunk
- refine_report(current_report, new_information, query, query_type, detail_level): Refines the report with new information
- initialize_report(initial_chunks, query, query_type, detail_level): Initializes the report with the first batch of chunks
- should_terminate(improvement_score): Determines if the process should terminate
- synthesize_report_progressively(chunks, query, query_type, detail_level): Main method for progressive report generation
- synthesize_report(chunks, query, query_type, detail_level): Override of parent method to use progressive approach for comprehensive detail level
get_progressive_report_synthesizer(model_name): Factory function to get a singleton instance
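
To make the state tracking and termination logic more concrete, here is a simplified sketch. `ReportState` mirrors the attributes listed above, but `should_terminate` is written as a free function with extra parameters, and the thresholds (`min_improvement`, `patience`, `max_iterations`), the reason strings, and the `run_progressive` loop are hypothetical. The real synthesizer calls an LLM to extract information and refine the report; this sketch only appends text.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def should_terminate(
    state: ReportState,
    total_chunks: int,
    improvement_score: float,
    min_improvement: float = 0.2,  # assumed threshold
    patience: int = 2,             # assumed number of low-gain rounds tolerated
    max_iterations: int = 20,      # assumed cap
) -> bool:
    """Record the score and decide whether another refinement round is worthwhile."""
    state.improvement_scores.append(improvement_score)
    recent = state.improvement_scores[-patience:]
    if len(state.processed_chunks) >= total_chunks:
        reason = "all_chunks_processed"
    elif state.version >= max_iterations:
        reason = "max_iterations"
    elif len(recent) == patience and all(s < min_improvement for s in recent):
        reason = "diminishing_returns"
    else:
        return False
    state.is_complete = True
    state.termination_reason = reason
    return True


def run_progressive(chunks: List[Tuple[str, str]], scores: List[float]) -> ReportState:
    """Process (chunk_id, text) pairs in priority order until a stop condition fires."""
    state = ReportState()
    for (chunk_id, text), score in zip(chunks, scores):
        state.current_report += text + "\n"  # stand-in for LLM-based refinement
        state.processed_chunks.add(chunk_id)
        state.version += 1
        state.last_update_time = time.time()
        if should_terminate(state, len(chunks), score):
            break
    return state
```

Requiring several consecutive low scores before stopping avoids ending the run on a single uninformative chunk that happens to precede more useful ones.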

FastAPI Backend Module

The sim-search-api module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.

Files

app/: Main application directory
- api/: API routes and dependencies
  - routes/: API route handlers
    - auth.py: Authentication routes
    - query.py: Query processing routes
    - search.py: Search execution routes
    - report.py: Report generation routes
  - dependencies.py: API dependencies (auth, rate limiting)
- core/: Core functionality
  - config.py: API configuration
  - security.py: Security utilities
- db/: Database models and session management
  - models.py: Database models for users, searches, and reports
  - session.py: Database session management
- schemas/: Pydantic schemas for request/response validation
  - token.py: Token schemas
  - user.py: User schemas
  - query.py: Query schemas
  - search.py: Search result schemas
  - report.py: Report schemas
- services/: Service layer for business logic
  - query_service.py: Query processing service
  - search_service.py: Search execution service
  - report_service.py: Report generation service
- main.py: FastAPI application entry point
alembic/: Database migrations
- versions/: Migration versions
  - 001_initial_migration.py: Initial migration
- env.py: Alembic environment
- script.py.mako: Alembic script template
alembic.ini: Alembic configuration
requirements.txt: API dependencies
run.py: Script to run the API
.env.example: Environment variables template
README.md: API documentation

Classes

app.db.models.User: User model for authentication
- id (str): User ID
- email (str): User email
- hashed_password (str): Hashed password
- full_name (str): User's full name
- is_active (bool): Whether the user is active
- is_superuser (bool): Whether the user is a superuser
app.db.models.Search: Search model for storing search results
- id (str): Search ID
- user_id (str): User ID
- query (str): Original query
- enhanced_query (str): Enhanced query
- query_type (str): Query type
- engines (str): Search engines used
- results_count (int): Number of results
- results (JSON): Search results
- created_at (datetime): Creation timestamp
app.db.models.Report: Report model for storing generated reports
- id (str): Report ID
- user_id (str): User ID
- search_id (str): Search ID
- title (str): Report title
- content (str): Report content
- detail_level (str): Detail level
- query_type (str): Query type
- model_used (str): Model used for generation
- created_at (datetime): Creation timestamp
- updated_at (datetime): Update timestamp
app.services.QueryService: Service for query processing
- process_query(query): Processes a query
- classify_query(query): Classifies a query
app.services.SearchService: Service for search execution
- execute_search(structured_query, search_engines, num_results, timeout, user_id, db): Executes a search
- get_available_search_engines(): Gets available search engines
- get_search_results(search): Gets results for a specific search
app.services.ReportService: Service for report generation
- generate_report_background(report_id, report_in, search, db, progress_dict): Generates a report in the background
- generate_report_file(report, format): Generates a report file in the specified format
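
The `progress_dict` argument suggests that background report generation publishes its status to a shared dictionary that an endpoint can poll. A framework-free sketch of that pattern follows. It is an assumption about the design, not the service's actual code: the simplified signature, the `synthesize_chunk` callback, and the status strings are hypothetical, and the real method also takes `report_in`, `search`, and `db`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

ProgressDict = Dict[str, Dict[str, Any]]


async def generate_report_background(
    report_id: str,
    chunks: List[str],
    progress_dict: ProgressDict,
    synthesize_chunk: Callable[[str], Awaitable[str]],
) -> str:
    """Build a report section by section, publishing progress under report_id."""
    progress_dict[report_id] = {"status": "in_progress", "progress": 0.0}
    sections: List[str] = []
    try:
        for index, chunk in enumerate(chunks, start=1):
            sections.append(await synthesize_chunk(chunk))
            progress_dict[report_id]["progress"] = index / len(chunks)
    except Exception as exc:
        # Surface the failure to pollers instead of losing it in the background task.
        progress_dict[report_id].update(status="failed", error=str(exc))
        return ""
    progress_dict[report_id] = {"status": "completed", "progress": 1.0}
    return "\n\n".join(sections)
```

A status endpoint would then only need to read `progress_dict[report_id]`; in a multi-process deployment the dictionary would have to be replaced by shared storage such as the database.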

Recent Updates

2025-03-20: FastAPI Backend Implementation

FastAPI Application Structure:
- Created a new directory sim-search-api for the FastAPI application
- Set up project structure with API routes, core functionality, database models, schemas, and services
- Implemented a layered architecture with API, service, and data layers
- Added proper __init__.py files to make all directories proper Python packages
API Routes Implementation:
- Created authentication routes for user registration and token generation
- Implemented query processing routes for query enhancement and classification
- Added search execution routes for executing searches and managing search history
- Created report generation routes for generating and managing reports
- Implemented proper error handling and validation for all routes
Service Layer Implementation:
- Created QueryService to bridge between API and existing query processing functionality
- Implemented SearchService for search execution and result management
- Added ReportService for report generation and management
- Ensured proper integration with existing sim-search functionality
- Implemented asynchronous operation for all services
Database Setup:
- Created SQLAlchemy models for users, searches, and reports
- Implemented database session management
- Set up Alembic for database migrations
- Created initial migration script to create all tables
- Added proper relationships between models
Authentication and Security:
- Implemented JWT-based authentication
- Added password hashing and verification
- Created token generation and validation
- Implemented user registration and login
- Added proper authorization for protected routes
Documentation and Configuration:
- Created comprehensive API documentation
- Added OpenAPI documentation endpoints
- Implemented environment variable configuration
- Created a README with setup and usage instructions
- Added example environment variables file

2025-03-12: Progressive Report Generation Implementation

Progressive Report Synthesis Module:
- Created a new module progressive_report_synthesis.py for progressive report generation
- Implemented ReportState class to track the state of a progressive report
- Created ProgressiveReportSynthesizer class extending from ReportSynthesizer
- Implemented chunk prioritization algorithm based on relevance scores
- Developed iterative refinement process with specialized prompts
- Added state management to track report versions and processed chunks
- Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
- Added support for different models with adaptive batch sizing
- Implemented progress tracking and callback mechanism
Report Generator Integration:
- Modified report_generator.py to use the progressive report synthesizer for comprehensive detail level
- Created a hybrid system that uses standard map-reduce for the brief, standard, and detailed levels and the progressive synthesizer for the comprehensive level
- Added proper model selection and configuration for both synthesizers
Testing:
- Created test_progressive_report.py to test progressive report generation
- Implemented comparison functionality between progressive and standard approaches
- Added test cases for different query types and document collections

2025-03-11: Report Templates Implementation

Report Templates Module:
- Created a new module report_templates.py for managing report templates
- Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
- Created a template system with placeholders for different report sections
- Implemented 12 different templates (3 query types × 4 detail levels)
- Added validation to ensure templates contain all required sections
Report Synthesis Integration:
- Updated the report synthesis module to use the new template system
- Added support for different templates based on query type and detail level
- Implemented fallback to standard templates when specific templates are not found
- Added better logging for template retrieval process
Testing:
- Created test_report_templates.py to test template retrieval and validation
- Implemented test_brief_report.py to test the brief report generation
- Successfully tested all combinations of detail levels and query types

2025-02-28: Async Implementation and Reference Formatting

LLM Interface Updates:
- Converted key methods to async:
  - generate_completion
  - classify_query
  - enhance_query
  - generate_search_queries
- Added special handling for Gemini models
- Improved reference formatting instructions
Query Processor Updates:
- Updated process_query to be async
- Made generate_search_queries async
- Fixed async/await patterns throughout
Gradio Interface Updates:
- Modified generate_report to handle async operations
- Updated report button click handler
- Improved error handling
