Code Structure
Current Project Organization
project/
│
├── examples/ # Sample data and query examples
├── report/ # Report generation module
│ ├── __init__.py
│ ├── report_generator.py # Module for generating reports
│ ├── report_synthesis.py # Module for synthesizing reports
│ ├── progressive_report_synthesis.py # Module for progressive report generation
│ ├── document_processor.py # Module for processing documents
│ ├── document_scraper.py # Module for scraping documents
│ ├── report_detail_levels.py # Module for managing report detail levels
│ ├── report_templates.py # Module for managing report templates
│ └── database/ # Database for storing reports
│   ├── __init__.py
│   └── db_manager.py # Module for managing the database
├── tests/ # Test suite
│ ├── __init__.py
│ ├── execution/ # Search execution tests
│ │ ├── __init__.py
│ │ ├── test_search.py
│ │ ├── test_search_execution.py
│ │ └── test_all_handlers.py
│ ├── integration/ # Integration tests
│ │ ├── __init__.py
│ │ ├── test_ev_query.py
│ │ └── test_query_to_report.py
│ ├── query/ # Query processing tests
│ │ ├── __init__.py
│ │ ├── test_query_processor.py
│ │ ├── test_query_processor_comprehensive.py
│ │ └── test_llm_interface.py
│ ├── ranking/ # Ranking algorithm tests
│ │ ├── __init__.py
│ │ ├── test_reranker.py
│ │ ├── test_similarity.py
│ │ └── test_simple_reranker.py
│ ├── report/ # Report generation tests
│ │ ├── __init__.py
│ │ ├── test_custom_model.py
│ │ ├── test_detail_levels.py
│ │ ├── test_brief_report.py
│ │ └── test_report_templates.py
│ ├── ui/ # UI component tests
│ │ ├── __init__.py
│ │ └── test_ui_search.py
│ ├── test_document_processor.py
│ ├── test_document_scraper.py
│ └── test_report_synthesis.py
├── utils/ # Utility scripts and shared functions
│ ├── __init__.py
│ ├── jina_similarity.py # Module for computing text similarity
│ └── markdown_segmenter.py # Module for segmenting markdown documents
├── config/ # Configuration management
│ ├── __init__.py
│ ├── config.py # Configuration management class
│ └── config.yaml # YAML configuration file with settings for different components
├── query/ # Query processing module
│ ├── __init__.py
│ ├── query_processor.py # Module for processing user queries
│ └── llm_interface.py # Module for interacting with LLM providers
├── execution/ # Search execution module
│ ├── __init__.py
│ ├── search_executor.py # Module for executing search queries
│ ├── result_collector.py # Module for collecting search results
│ └── api_handlers/ # Handlers for different search APIs
│   ├── __init__.py
│   ├── base_handler.py # Base class for search handlers
│   ├── serper_handler.py # Handler for Serper API (Google search)
│   ├── scholar_handler.py # Handler for Google Scholar via Serper
│   ├── google_handler.py # Handler for Google search
│   └── arxiv_handler.py # Handler for arXiv API
├── ranking/ # Ranking module
│ ├── __init__.py
│ └── jina_reranker.py # Module for reranking documents using Jina AI
├── ui/ # UI module
│ ├── __init__.py
│ └── gradio_interface.py # Gradio-based web interface
├── scripts/ # Scripts
│ └── query_to_report.py # Script for generating reports from queries
├── sim-search-api/ # FastAPI backend
│ ├── app/
│ │ ├── api/
│ │ │ ├── routes/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── auth.py # Authentication routes
│ │ │ │ ├── query.py # Query processing routes
│ │ │ │ ├── search.py # Search execution routes
│ │ │ │ └── report.py # Report generation routes
│ │ │ ├── __init__.py
│ │ │ └── dependencies.py # API dependencies (auth, rate limiting)
│ │ ├── core/
│ │ │ ├── __init__.py
│ │ │ ├── config.py # API configuration
│ │ │ └── security.py # Security utilities
│ │ ├── db/
│ │ │ ├── __init__.py
│ │ │ ├── session.py # Database session
│ │ │ └── models.py # Database models for reports, searches
│ │ ├── schemas/
│ │ │ ├── __init__.py
│ │ │ ├── token.py # Token schemas
│ │ │ ├── user.py # User schemas
│ │ │ ├── query.py # Query schemas
│ │ │ ├── search.py # Search result schemas
│ │ │ └── report.py # Report schemas
│ │ ├── services/
│ │ │ ├── __init__.py
│ │ │ ├── query_service.py # Query processing service
│ │ │ ├── search_service.py # Search execution service
│ │ │ └── report_service.py # Report generation service
│ │ └── main.py # FastAPI application
│ ├── alembic/ # Database migrations
│ │ ├── versions/
│ │ │ └── 001_initial_migration.py # Initial migration
│ │ ├── env.py # Alembic environment
│ │ └── script.py.mako # Alembic script template
│ ├── .env.example # Environment variables template
│ ├── alembic.ini # Alembic configuration
│ ├── requirements.txt # API dependencies
│ ├── run.py # Script to run the API
│ └── README.md # API documentation
├── run_ui.py # Script to run the UI
└── requirements.txt # Project dependencies
Module Details
Config Module
The config module manages configuration settings for the entire system, including API keys, model selections, and other parameters.
Files
- __init__.py: Package initialization file
- config.py: Configuration management class
- config.yaml: YAML configuration file with settings for different components
Classes
- Config: Singleton class for loading and accessing configuration settings
  - load_config(config_path): Loads configuration from a YAML file
  - get(key, default=None): Gets a configuration value by key
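A minimal sketch of how a singleton Config with load_config and get might look. Only the class and method names come from the list above; the dotted-key lookup, the internal _data attribute, and the use of PyYAML's yaml.safe_load are assumptions made for illustration.

```python
import threading
from typing import Any, Optional


class Config:
    """Singleton holding settings loaded from a YAML file (illustrative sketch)."""

    _instance: Optional["Config"] = None
    _lock = threading.Lock()

    def __new__(cls) -> "Config":
        # Create the single shared instance on first use.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._data = {}
        return cls._instance

    def load_config(self, config_path: str) -> None:
        # Assumption: PyYAML is installed; it is imported lazily so that
        # the class can be constructed without it.
        import yaml

        with open(config_path, "r", encoding="utf-8") as fh:
            self._data = yaml.safe_load(fh) or {}

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: dotted keys such as "models.default" walk nested mappings.
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

Because every call to Config() returns the same object, any module can read settings with Config().get(...) after a single load_config call at startup.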
Query Module
The query module handles the processing and enhancement of user queries, including classification and optimization for search.
Files
- __init__.py: Package initialization file
- query_processor.py: Main module for processing user queries
- query_classifier.py: Module for classifying query types
- llm_interface.py: Interface for interacting with LLM providers
Classes
- QueryProcessor: Main class for processing user queries
  - process_query(query): Processes a user query and returns enhanced results
  - classify_query(query): Classifies a query by type and intent
  - generate_search_queries(query, classification): Generates optimized search queries
- QueryClassifier: Class for classifying queries
  - classify(query): Classifies a query by type, intent, and entities
- LLMInterface: Interface for interacting with LLM providers
  - get_completion(prompt, model=None): Gets a completion from an LLM
  - enhance_query(query): Enhances a query with additional context
  - classify_query(query): Uses an LLM to classify a query
Execution Module
The execution module handles the execution of search queries across multiple search engines and the collection of results.
Files
- __init__.py: Package initialization file
- search_executor.py: Module for executing search queries
- result_collector.py: Module for collecting and processing search results
- api_handlers/: Directory containing handlers for different search APIs
  - __init__.py: Package initialization file
  - base_handler.py: Base class for search handlers
  - serper_handler.py: Handler for Serper API (Google search)
  - scholar_handler.py: Handler for Google Scholar via Serper
  - arxiv_handler.py: Handler for arXiv API
Classes
- SearchExecutor: Class for executing search queries
  - execute_search(query_data): Executes a search across multiple engines
  - _execute_search_async(query, engines): Executes a search asynchronously
  - _execute_search_sync(query, engines): Executes a search synchronously
- ResultCollector: Class for collecting and processing search results
  - process_results(search_results): Processes search results from multiple engines
  - deduplicate_results(results): Deduplicates results based on URL
  - save_results(results, file_path): Saves results to a file
- BaseSearchHandler: Base class for search handlers
  - search(query, num_results): Abstract method for searching
  - _process_response(response): Processes the API response
- SerperSearchHandler: Handler for Serper API
  - search(query, num_results): Searches using Serper API
  - _process_response(response): Processes the Serper API response
- ScholarSearchHandler: Handler for Google Scholar via Serper
  - search(query, num_results): Searches Google Scholar
  - _process_response(response): Processes the Scholar API response
- ArxivSearchHandler: Handler for arXiv API
  - search(query, num_results): Searches arXiv
  - _process_response(response): Processes the arXiv API response
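The handler hierarchy and URL-based deduplication described above could be sketched roughly as follows. BaseSearchHandler, search, _process_response, and deduplicate_results are names from the list; StaticSearchHandler, the payload shape ("items", "title", "link"), and the normalized result keys are hypothetical and stand in for a real API call.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Result = Dict[str, Any]


class BaseSearchHandler(ABC):
    """Common interface that each engine-specific handler implements."""

    @abstractmethod
    def search(self, query: str, num_results: int) -> List[Result]:
        """Run the query and return normalized results."""

    def _process_response(self, response: Dict[str, Any]) -> List[Result]:
        # Subclasses convert their raw API payload into normalized results.
        raise NotImplementedError


class StaticSearchHandler(BaseSearchHandler):
    """Hypothetical handler that serves canned data instead of calling an API."""

    def __init__(self, payload: Dict[str, Any]) -> None:
        self._payload = payload

    def search(self, query: str, num_results: int) -> List[Result]:
        return self._process_response(self._payload)[:num_results]

    def _process_response(self, response: Dict[str, Any]) -> List[Result]:
        # Assumed payload shape: {"items": [{"title": ..., "link": ...}, ...]}
        return [
            {"title": item["title"], "url": item["link"]}
            for item in response.get("items", [])
        ]


def deduplicate_results(results: List[Result]) -> List[Result]:
    """Keep the first result seen for each URL, preserving input order."""
    seen = set()
    unique: List[Result] = []
    for result in results:
        url = result.get("url")
        if url in seen:
            continue
        seen.add(url)
        unique.append(result)
    return unique
```

Keeping the first occurrence means that the order in which engines are merged decides which copy of a duplicated URL survives.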
Ranking Module
The ranking module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.
Files
- __init__.py: Package initialization file
- jina_reranker.py: Module for reranking documents using Jina AI
- filter_manager.py: Module for filtering documents
Classes
- JinaReranker: Class for reranking documents
  - rerank(documents, query): Reranks documents based on relevance to query
  - _prepare_inputs(documents, query): Prepares inputs for the reranker
- FilterManager: Class for filtering documents
  - filter_by_date(documents, start_date, end_date): Filters by date
  - filter_by_source(documents, sources): Filters by source
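As an illustration of the two FilterManager methods listed above, here is a small sketch. The method names and parameters follow the list; the document shape (a dict with "date" and "source" keys) and the inclusive date range are assumptions.

```python
from datetime import date
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


class FilterManager:
    """Illustrative filters over a list of document dictionaries."""

    def filter_by_date(
        self, documents: List[Document], start_date: date, end_date: date
    ) -> List[Document]:
        # Assumption: each document stores a datetime.date under "date";
        # documents without a date are dropped. Both bounds are inclusive.
        return [
            doc
            for doc in documents
            if doc.get("date") is not None
            and start_date <= doc["date"] <= end_date
        ]

    def filter_by_source(
        self, documents: List[Document], sources: Iterable[str]
    ) -> List[Document]:
        # Assumption: each document names its origin under "source".
        allowed = set(sources)
        return [doc for doc in documents if doc.get("source") in allowed]
```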
Report Templates Module
The report_templates module provides a template system for generating reports with different detail levels and query types.
Files
- __init__.py: Package initialization file
- report_templates.py: Module for managing report templates
Classes
- QueryType (Enum): Defines the types of queries supported by the system
  - FACTUAL: For factual queries seeking specific information
  - EXPLORATORY: For exploratory queries investigating a topic
  - COMPARATIVE: For comparative queries comparing multiple items
- DetailLevel (Enum): Defines the levels of detail for generated reports
  - BRIEF: Short summary with key findings
  - STANDARD: Standard report with introduction, key findings, and analysis
  - DETAILED: Detailed report with methodology and more in-depth analysis
  - COMPREHENSIVE: Comprehensive report with executive summary, literature review, and appendices
- ReportTemplate: Class representing a report template
  - template (str): The template string with placeholders
  - detail_level (DetailLevel): The detail level of the template
  - query_type (QueryType): The query type the template is designed for
  - model (Optional[str]): The LLM model recommended for this template
  - required_sections (Optional[List[str]]): Required sections in the template
  - validate(): Validates that the template contains all required sections
- ReportTemplateManager: Class for managing report templates
  - add_template(template): Adds a template to the manager
  - get_template(query_type, detail_level): Gets a template for a specific query type and detail level
  - get_available_templates(): Gets a list of available templates
  - initialize_default_templates(): Initializes the default templates for all combinations of query types and detail levels
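The template system above might be laid out along these lines. The class, field, and method names follow the list; the enum string values, the placeholder template text, and the dictionary keyed by (query type, detail level) are assumptions, and the real default templates are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        # Valid when every required section marker appears in the template text.
        return all(s in self.template for s in (self.required_sections or []))


class ReportTemplateManager:
    def __init__(self) -> None:
        self._templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self._templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self._templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[ReportTemplate]:
        return list(self._templates.values())

    def initialize_default_templates(self) -> None:
        # Placeholder text only: one template per (query type, detail level) pair.
        for query_type in QueryType:
            for detail_level in DetailLevel:
                self.add_template(
                    ReportTemplate(
                        template="## Key Findings\n{key_findings}\n",
                        detail_level=detail_level,
                        query_type=query_type,
                        required_sections=["{key_findings}"],
                    )
                )
```

With three query types and four detail levels, initialize_default_templates registers 12 templates in this sketch.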
Progressive Report Synthesis Module
The progressive_report_synthesis module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.
Files
- __init__.py: Package initialization file
- progressive_report_synthesis.py: Module for progressive report generation
Classes
- ReportState: Class to track the state of a progressive report
  - current_report (str): The current version of the report
  - processed_chunks (Set[str]): Set of document IDs that have been processed
  - version (int): Current version number of the report
  - last_update_time (float): Timestamp of the last update
  - improvement_scores (List[float]): List of improvement scores for each iteration
  - is_complete (bool): Whether the report generation is complete
  - termination_reason (Optional[str]): Reason for termination if complete
- ProgressiveReportSynthesizer: Class for progressive report synthesis
  - Extends ReportSynthesizer to implement a progressive approach
  - set_progress_callback(callback): Sets a callback function to report progress
  - prioritize_chunks(chunks, query): Prioritizes chunks based on relevance
  - extract_information_from_chunk(chunk, query, detail_level): Extracts key information from a chunk
  - refine_report(current_report, new_information, query, query_type, detail_level): Refines the report with new information
  - initialize_report(initial_chunks, query, query_type, detail_level): Initializes the report with the first batch of chunks
  - should_terminate(improvement_score): Determines if the process should terminate
  - synthesize_report_progressively(chunks, query, query_type, detail_level): Main method for progressive report generation
  - synthesize_report(chunks, query, query_type, detail_level): Override of parent method to use progressive approach for comprehensive detail level
- get_progressive_report_synthesizer(model_name): Factory function to get a singleton instance
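To make the state tracking and termination check more concrete, here is a sketch. The ReportState fields mirror the list above; the free-function form of should_terminate, its extra parameters, the threshold values, and the reason strings are all hypothetical choices for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def should_terminate(
    state: ReportState,
    total_chunks: int,
    improvement_score: float,
    min_improvement: float = 0.05,  # hypothetical threshold
    max_iterations: int = 10,       # hypothetical cap
    patience: int = 2,              # hypothetical number of weak rounds tolerated
) -> bool:
    """Record the latest score and decide whether refinement should stop."""
    state.improvement_scores.append(improvement_score)
    recent = state.improvement_scores[-patience:]

    if len(state.processed_chunks) >= total_chunks:
        reason = "all_chunks_processed"
    elif state.version >= max_iterations:
        reason = "max_iterations"
    elif len(recent) == patience and all(s < min_improvement for s in recent):
        reason = "diminishing_returns"
    else:
        return False

    state.is_complete = True
    state.termination_reason = reason
    return True
```

Requiring several consecutive low scores, rather than one, avoids stopping early when a single batch of chunks happens to add little.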
FastAPI Backend Module
The sim-search-api module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.
Files
- app/: Main application directory
  - api/: API routes and dependencies
    - routes/: API route handlers
      - auth.py: Authentication routes
      - query.py: Query processing routes
      - search.py: Search execution routes
      - report.py: Report generation routes
    - dependencies.py: API dependencies (auth, rate limiting)
  - core/: Core functionality
    - config.py: API configuration
    - security.py: Security utilities
  - db/: Database models and session management
    - models.py: Database models for users, searches, and reports
    - session.py: Database session management
  - schemas/: Pydantic schemas for request/response validation
    - token.py: Token schemas
    - user.py: User schemas
    - query.py: Query schemas
    - search.py: Search result schemas
    - report.py: Report schemas
  - services/: Service layer for business logic
    - query_service.py: Query processing service
    - search_service.py: Search execution service
    - report_service.py: Report generation service
  - main.py: FastAPI application entry point
- alembic/: Database migrations
  - versions/: Migration versions
    - 001_initial_migration.py: Initial migration
  - env.py: Alembic environment
  - script.py.mako: Alembic script template
- alembic.ini: Alembic configuration
- requirements.txt: API dependencies
- run.py: Script to run the API
- .env.example: Environment variables template
- README.md: API documentation
Classes
- app.db.models.User: User model for authentication
  - id (str): User ID
  - email (str): User email
  - hashed_password (str): Hashed password
  - full_name (str): User's full name
  - is_active (bool): Whether the user is active
  - is_superuser (bool): Whether the user is a superuser
- app.db.models.Search: Search model for storing search results
  - id (str): Search ID
  - user_id (str): User ID
  - query (str): Original query
  - enhanced_query (str): Enhanced query
  - query_type (str): Query type
  - engines (str): Search engines used
  - results_count (int): Number of results
  - results (JSON): Search results
  - created_at (datetime): Creation timestamp
- app.db.models.Report: Report model for storing generated reports
  - id (str): Report ID
  - user_id (str): User ID
  - search_id (str): Search ID
  - title (str): Report title
  - content (str): Report content
  - detail_level (str): Detail level
  - query_type (str): Query type
  - model_used (str): Model used for generation
  - created_at (datetime): Creation timestamp
  - updated_at (datetime): Update timestamp
- app.services.QueryService: Service for query processing
  - process_query(query): Processes a query
  - classify_query(query): Classifies a query
- app.services.SearchService: Service for search execution
  - execute_search(structured_query, search_engines, num_results, timeout, user_id, db): Executes a search
  - get_available_search_engines(): Gets available search engines
  - get_search_results(search): Gets results for a specific search
- app.services.ReportService: Service for report generation
  - generate_report_background(report_id, report_in, search, db, progress_dict): Generates a report in the background
  - generate_report_file(report, format): Generates a report file in the specified format
Recent Updates
2025-03-20: FastAPI Backend Implementation
- FastAPI Application Structure:
  - Created a new directory sim-search-api for the FastAPI application
  - Set up project structure with API routes, core functionality, database models, schemas, and services
  - Implemented a layered architecture with API, service, and data layers
  - Added proper __init__.py files to make all directories proper Python packages
- API Routes Implementation:
  - Created authentication routes for user registration and token generation
  - Implemented query processing routes for query enhancement and classification
  - Added search execution routes for executing searches and managing search history
  - Created report generation routes for generating and managing reports
  - Implemented proper error handling and validation for all routes
- Service Layer Implementation:
  - Created QueryService to bridge between API and existing query processing functionality
  - Implemented SearchService for search execution and result management
  - Added ReportService for report generation and management
  - Ensured proper integration with existing sim-search functionality
  - Implemented asynchronous operation for all services
- Database Setup:
  - Created SQLAlchemy models for users, searches, and reports
  - Implemented database session management
  - Set up Alembic for database migrations
  - Created initial migration script to create all tables
  - Added proper relationships between models
- Authentication and Security:
  - Implemented JWT-based authentication
  - Added password hashing and verification
  - Created token generation and validation
  - Implemented user registration and login
  - Added proper authorization for protected routes
- Documentation and Configuration:
  - Created comprehensive API documentation
  - Added OpenAPI documentation endpoints
  - Implemented environment variable configuration
  - Created a README with setup and usage instructions
  - Added example environment variables file
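The password hashing and verification item under Authentication and Security does not name a scheme. As a stand-in, the sketch below uses PBKDF2 from the Python standard library; the function names, the stored-string format, and the iteration count are assumptions, and the actual API may rely on a different library or algorithm.

```python
import hashlib
import hmac
import os


def hash_password(password: str, iterations: int = 200_000) -> str:
    """Return a salted PBKDF2-SHA256 hash encoded as one string."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    # Hypothetical storage format: scheme$iterations$salt$digest
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _scheme, iterations, salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        bytes.fromhex(salt_hex),
        int(iterations),
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

Storing the salt and iteration count alongside the digest lets the cost be raised later without invalidating existing hashes.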
2025-03-12: Progressive Report Generation Implementation
- Progressive Report Synthesis Module:
  - Created a new module progressive_report_synthesis.py for progressive report generation
  - Implemented ReportState class to track the state of a progressive report
  - Created ProgressiveReportSynthesizer class extending from ReportSynthesizer
  - Implemented chunk prioritization algorithm based on relevance scores
  - Developed iterative refinement process with specialized prompts
  - Added state management to track report versions and processed chunks
  - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
  - Added support for different models with adaptive batch sizing
  - Implemented progress tracking and callback mechanism
- Report Generator Integration:
  - Modified report_generator.py to use the progressive report synthesizer for comprehensive detail level
  - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
  - Added proper model selection and configuration for both synthesizers
- Testing:
  - Created test_progressive_report.py to test progressive report generation
  - Implemented comparison functionality between progressive and standard approaches
  - Added test cases for different query types and document collections
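The chunk prioritization and adaptive batch sizing mentioned in this update could look something like the sketch below. The "score" key, the model names, and the batch sizes are hypothetical; the source only states that chunks are ordered by relevance score and that batch size adapts to the model.

```python
from typing import Any, Dict, Iterator, List

Chunk = Dict[str, Any]

# Hypothetical per-model batch sizes; real values would depend on each
# model's context window.
BATCH_SIZES = {"large-context-model": 8, "small-context-model": 3}
DEFAULT_BATCH_SIZE = 3


def prioritize_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Order chunks from most to least relevant (stable for equal scores)."""
    return sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)


def iter_batches(chunks: List[Chunk], model_name: str) -> Iterator[List[Chunk]]:
    """Yield consecutive batches sized for the given model."""
    size = BATCH_SIZES.get(model_name, DEFAULT_BATCH_SIZE)
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]
```

Processing the highest-scoring chunks first means an early stop on diminishing returns discards the least relevant material.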
2025-03-11: Report Templates Implementation
- Report Templates Module:
  - Created a new module report_templates.py for managing report templates
  - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
  - Created a template system with placeholders for different report sections
  - Implemented 12 different templates (3 query types × 4 detail levels)
  - Added validation to ensure templates contain all required sections
- Report Synthesis Integration:
  - Updated the report synthesis module to use the new template system
  - Added support for different templates based on query type and detail level
  - Implemented fallback to standard templates when specific templates are not found
  - Added better logging for template retrieval process
- Testing:
  - Created test_report_templates.py to test template retrieval and validation
  - Implemented test_brief_report.py to test the brief report generation
  - Successfully tested all combinations of detail levels and query types
2025-02-28: Async Implementation and Reference Formatting
- LLM Interface Updates:
  - Converted key methods to async:
    - generate_completion
    - classify_query
    - enhance_query
    - generate_search_queries
  - Added special handling for Gemini models
  - Improved reference formatting instructions
- Query Processor Updates:
  - Updated process_query to be async
  - Made generate_search_queries async
  - Fixed async/await patterns throughout
- Gradio Interface Updates:
  - Modified generate_report to handle async operations
  - Updated report button click handler
  - Improved error handling
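The async conversion described in this update follows a familiar pattern, sketched below with asyncio. FakeLLMInterface is a hypothetical stub that returns canned values rather than calling a provider, the result dictionary keys are assumptions, and run_query_sync is an illustrative synchronous entry point, not the project's actual Gradio handler.

```python
import asyncio
from typing import Any, Dict, List


class FakeLLMInterface:
    """Hypothetical stub with the async method names listed above."""

    async def enhance_query(self, query: str) -> str:
        await asyncio.sleep(0)  # stands in for a network round trip
        return f"{query} (enhanced)"

    async def classify_query(self, query: str) -> Dict[str, str]:
        await asyncio.sleep(0)
        return {"type": "factual"}

    async def generate_search_queries(self, query: str) -> List[str]:
        await asyncio.sleep(0)
        return [query, f"{query} overview"]


class QueryProcessor:
    def __init__(self, llm: FakeLLMInterface) -> None:
        self.llm = llm

    async def process_query(self, query: str) -> Dict[str, Any]:
        # Enhancement and classification do not depend on each other,
        # so they are awaited concurrently.
        enhanced, classification = await asyncio.gather(
            self.llm.enhance_query(query),
            self.llm.classify_query(query),
        )
        search_queries = await self.llm.generate_search_queries(enhanced)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }


def run_query_sync(query: str) -> Dict[str, Any]:
    """Run the async pipeline from synchronous code such as a script."""
    return asyncio.run(QueryProcessor(FakeLLMInterface()).process_query(query))
```

asyncio.run starts a fresh event loop, so a wrapper like run_query_sync is only suitable where no loop is already running; inside async code the coroutine should be awaited directly.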