# Decision Log

## 2025-02-27: Initial Project Setup

### Decision: Use Jina AI APIs for Semantic Search

- **Context**: Need for semantic search capabilities that understand context beyond keywords
- **Options Considered**:
  1. Build a custom embedding solution
  2. Use open-source models locally
  3. Use Jina AI's APIs
- **Decision**: Use Jina AI's APIs for embedding generation and similarity computation
- **Rationale**:
  - High-quality embeddings with state-of-the-art models
  - No need to manage model deployment and infrastructure
  - Simple API integration with reasonable pricing
  - Support for long texts through segmentation
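A minimal sketch of this integration, calling Jina's public embeddings endpoint with `requests` (the model name is illustrative, not necessarily the one the project pins):

```python
import os
import requests

JINA_API_URL = "https://api.jina.ai/v1/embeddings"

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Fetch one embedding per input text from the Jina API."""
    response = requests.post(
        JINA_API_URL,
        headers={
            "Authorization": f"Bearer {os.environ['JINA_API_KEY']}",
            "Content-Type": "application/json",
        },
        # "jina-embeddings-v3" is an illustrative model name
        json={"model": "jina-embeddings-v3", "input": texts},
        timeout=30,
    )
    response.raise_for_status()
    # The API returns one embedding object per input, in order
    return [item["embedding"] for item in response.json()["data"]]
```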
### Decision: Separate Markdown Segmentation from Similarity Computation

- **Context**: Need to handle potentially long markdown documents
- **Options Considered**:
  1. Integrate segmentation directly into the similarity module
  2. Create a separate module for segmentation
- **Decision**: Create a separate module (markdown_segmenter.py) for document segmentation
- **Rationale**:
  - Better separation of concerns
  - More modular design allows for independent use of components
  - Easier to maintain and extend each component separately
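As a sketch of the independent use this enables, markdown_segmenter.py can be called on its own, assuming it wraps Jina's Segmenter API (the wrapper shape here is hypothetical):

```python
import os
import requests

def segment_markdown(content: str, max_chunk_length: int = 1000) -> list[str]:
    """Split a markdown document into chunks via Jina's Segmenter API."""
    response = requests.post(
        "https://segment.jina.ai/",
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "content": content,
            "return_chunks": True,  # ask for the chunk texts themselves
            "max_chunk_length": max_chunk_length,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["chunks"]
```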
### Decision: Use Environment Variables for API Keys

- **Context**: Need to securely manage API credentials
- **Options Considered**:
  1. Configuration files
  2. Environment variables
  3. A secret management service
- **Decision**: Use environment variables (JINA_API_KEY)
- **Rationale**:
  - Simple to implement
  - Standard practice for managing secrets
  - Works well across different environments
  - Prevents accidental commits of credentials to version control
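The lookup pattern is the standard one; failing fast with a clear message beats a confusing authentication error later (a sketch):

```python
import os

# Fail fast if the key is missing rather than letting an HTTP 401
# surface deep inside a search run.
JINA_API_KEY = os.environ.get("JINA_API_KEY")
if not JINA_API_KEY:
    raise RuntimeError("JINA_API_KEY environment variable is not set")
```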
### Decision: Use Cosine Similarity with Normalized Vectors

- **Context**: Need a metric for comparing semantic similarity between text embeddings
- **Options Considered**:
  1. Euclidean distance
  2. Cosine similarity
  3. Dot product
- **Decision**: Use cosine similarity with normalized vectors
- **Rationale**:
  - Standard approach for semantic similarity
  - Normalized vectors simplify computation (the dot product equals the cosine similarity)
  - Less sensitive to embedding magnitude, focusing on direction (meaning)
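A worked example of the "dot product equals cosine similarity" point:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # After normalization, the dot product *is* the cosine similarity
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Vectors pointing the same way score 1.0 regardless of magnitude
u = np.array([1.0, 2.0, 2.0])   # norm = 3
v = np.array([2.0, 4.0, 4.0])   # same direction, norm = 6
print(cosine_similarity(u, v))  # 1.0
```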
## 2025-02-27: Research System Architecture

### Decision: Implement a Multi-Stage Research Pipeline

- **Context**: Need to define the overall architecture for the intelligent research system
- **Options Considered**:
  1. A monolithic application with tightly coupled components
  2. A microservices architecture with independent services
  3. A pipeline architecture with distinct processing stages
- **Decision**: Implement an 8-stage pipeline architecture
- **Rationale**:
  - Clear separation of concerns, with each stage having a specific responsibility
  - Easier to develop and test individual components
  - Flexibility to swap or enhance specific stages without affecting others
  - The natural flow of data through the system matches the research process
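The underlying pattern looks roughly like this (a hypothetical sketch; the stage signature below is a placeholder, not the project's actual eight stages):

```python
from typing import Any, Callable

# Each stage takes the accumulated context dict and returns it enriched,
# which is what makes stages swappable and independently testable.
Stage = Callable[[dict[str, Any]], dict[str, Any]]

def run_pipeline(stages: list[Stage], query: str) -> dict[str, Any]:
    """Thread a research query through each stage in order."""
    context: dict[str, Any] = {"query": query}
    for stage in stages:
        context = stage(context)
    return context
```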
### Decision: Use Multiple Search Sources

- **Context**: Need to gather comprehensive information from various sources
- **Options Considered**:
  1. Use a single search API for simplicity
  2. Implement custom web scraping for all sources
  3. Use multiple specialized search APIs
- **Decision**: Integrate multiple search sources (Google, Serper, Jina Search, Google Scholar, arXiv)
- **Rationale**:
  - Different sources provide different types of information (academic, general, etc.)
  - Increases the breadth and diversity of search results
  - Specialized APIs such as arXiv provide domain-specific information
  - Redundancy ensures more comprehensive coverage
### Decision: Use Jina AI for Semantic Processing

- **Context**: Need for advanced semantic understanding in document processing
- **Options Considered**:
  1. Use simple keyword matching
  2. Implement custom embedding models
  3. Use Jina AI's suite of APIs
- **Decision**: Use Jina AI's APIs for embedding generation, similarity computation, and reranking
- **Rationale**:
  - High-quality embeddings with state-of-the-art models
  - Comprehensive API suite covering multiple needs (embeddings, segmentation, reranking)
  - Simple integration with reasonable pricing
  - Consistent approach across different semantic processing tasks
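The reranking piece, sketched against Jina's public rerank endpoint (the model name is illustrative):

```python
import os
import requests

def rerank(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    """Score documents against a query and return them sorted by relevance."""
    response = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",  # illustrative
            "query": query,
            "documents": documents,
            "top_n": top_n,
        },
        timeout=30,
    )
    response.raise_for_status()
    # Each result carries the document index and a relevance score
    return response.json()["results"]
```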
## 2025-02-27: Search Execution Architecture

### Decision: Search Execution Architecture

- **Context**: We needed to implement a search execution module that could execute search queries across multiple search engines and process the results in a standardized way.

- **Decision**:
  1. Create a modular search execution architecture (see the sketch at the end of this entry):
     - Implement a base handler interface (`BaseSearchHandler`) for all search API handlers
     - Create specific handlers for each search engine (Google, Serper, Scholar, arXiv)
     - Develop a central `SearchExecutor` class to manage execution across multiple engines
     - Implement a `ResultCollector` class for processing and organizing results
  2. Use parallel execution for search queries:
     - Implement thread-based parallelism using `concurrent.futures`
     - Add support for both synchronous and asynchronous execution
     - Include timeout management and error handling
  3. Standardize search results:
     - Define a common result format across all search engines
     - Include metadata specific to each search engine in a standardized way
     - Implement deduplication and scoring for result ranking

- **Rationale**:
  - A modular architecture allows for easy addition of new search engines
  - Parallel execution significantly improves search performance
  - A standardized result format simplifies downstream processing
  - Separation of concerns between execution and result processing

- **Alternatives Considered**:
  1. Sequential execution of search queries:
     - Simpler implementation
     - Much slower performance
     - Would not scale well with additional search engines
  2. Separate modules for each search engine:
     - Would lead to code duplication
     - More difficult to maintain
     - Less consistent result format
  3. Using a third-party search aggregation service:
     - Would introduce additional dependencies
     - Less control over the search process
     - Potential cost implications

- **Impact**:
  - Efficient execution of search queries across multiple engines
  - Consistent result format for downstream processing
  - Flexible architecture that can be extended with new search engines
  - Clear separation of concerns between different components
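A hypothetical sketch of the handler interface and the parallel executor described above (actual signatures in the module may differ):

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed

class BaseSearchHandler(ABC):
    """Interface implemented by every engine-specific handler."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> list[dict]:
        """Return results in the common format:
        {"title", "url", "snippet", "source", plus engine metadata}."""

class SearchExecutor:
    """Fans a query out to all registered handlers in parallel threads."""

    def __init__(self, handlers: dict[str, BaseSearchHandler], timeout: float = 30.0):
        self.handlers = handlers
        self.timeout = timeout

    def execute(self, query: str) -> dict[str, list[dict]]:
        results: dict[str, list[dict]] = {}
        with ThreadPoolExecutor(max_workers=len(self.handlers)) as pool:
            futures = {
                pool.submit(handler.search, query): name
                for name, handler in self.handlers.items()
            }
            for future in as_completed(futures, timeout=self.timeout):
                name = futures[future]
                try:
                    results[name] = future.result()
                except Exception as exc:
                    # One failing engine should not abort the whole run
                    print(f"{name} search failed: {exc}")
                    results[name] = []
        return results
```

A `ResultCollector` would then flatten these per-engine lists, deduplicate (e.g. by URL), and score the survivors for ranking.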
## 2025-02-27: Search Execution Module Refinements

### Decision: Remove Google Search Handler

- **Context**: Both Google and Serper handlers were implemented, but Serper is essentially a front-end for Google search
- **Options Considered**:
  1. Keep both handlers for redundancy
  2. Remove the Google handler and use only Serper
- **Decision**: Remove the Google search handler
- **Rationale**:
  - Redundant functionality, as Serper provides the same results
  - Simplifies the codebase and reduces maintenance
  - Reduces API costs by avoiding duplicate searches
  - Serper provides a more reliable and consistent API for Google search
### Decision: Modify LLM Query Enhancement Prompt

- **Context**: The LLM was returning enhanced queries with explanations, which caused issues with the search APIs
- **Options Considered**:
  1. Post-process the LLM output to extract just the query
  2. Modify the prompt to request only the enhanced query
- **Decision**: Modify the LLM prompt to request only the enhanced query, without explanations
- **Rationale**:
  - More reliable than post-processing, which could be error-prone
  - Cleaner implementation that addresses the root cause
  - Ensures a consistent output format for downstream processing
  - Reduces the risk of exceeding API character limits
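The revised prompt might look like the following (hypothetical wording; the essential part is the output-only constraint):

```python
# Hypothetical prompt wording; the project's actual prompt may differ.
ENHANCEMENT_PROMPT = """Rewrite the following search query to make it more
specific and effective for web search.

Return ONLY the enhanced query text. Do not include explanations,
preamble, or quotation marks.

Query: {query}"""
```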
### Decision: Implement Query Truncation

- **Context**: Enhanced queries could exceed the Serper API's 2048-character limit
- **Options Considered**:
  1. Limit the LLM's output length
  2. Truncate queries before sending them to the API
  3. Split long queries into multiple searches
- **Decision**: Implement query truncation in the search executor
- **Rationale**:
  - Simple and effective solution
  - Preserves as much of the enhanced query as possible
  - Ensures API requests don't fail due to length constraints
  - Can be easily adjusted if API limits change
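The truncation itself is a few lines in the executor (a sketch; the constant tracks Serper's documented limit):

```python
MAX_QUERY_LENGTH = 2048  # Serper's documented query limit

def truncate_query(query: str, limit: int = MAX_QUERY_LENGTH) -> str:
    """Clip an over-long query, cutting at a word boundary when possible."""
    if len(query) <= limit:
        return query
    truncated = query[:limit]
    # Drop any partial trailing word so the last term stays meaningful
    return truncated.rsplit(" ", 1)[0] if " " in truncated else truncated
```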
## 2025-02-27: Testing Strategy for Query Processor

### Context

After integrating Groq and OpenRouter as additional LLM providers, we needed to verify that the query processor module functions correctly with these new providers.

### Decision

1. Create dedicated test scripts to validate the query processor functionality:
   - A basic test script for the core processing pipeline
   - A comprehensive test script for detailed component testing
2. Use monkey patching to ensure the tests consistently use the Groq model (see the sketch after this list):
   - Create a global LLM interface with the Groq model
   - Override the `get_llm_interface` function to always return this interface
   - This approach allows testing without modifying the core code
3. Test all key functionality of the query processor:
   - Query enhancement
   - Query classification
   - Search query generation
   - The end-to-end processing pipeline
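A hypothetical sketch of the monkey patching (module path, class, and model name are placeholders for whatever the test scripts actually use):

```python
import query_processor                   # placeholder module path
from llm_interface import LLMInterface   # placeholder import

# One Groq-backed interface shared by every call in the test run
groq_interface = LLMInterface(model_name="llama-3.1-8b-instant")  # placeholder

# Every get_llm_interface() call inside the processor now returns the
# Groq interface, without any change to the production code.
query_processor.get_llm_interface = lambda *args, **kwargs: groq_interface
```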
### Rationale

- Dedicated test scripts provide a repeatable way to verify functionality
- Monkey patching allows testing with specific models without changing the core code
- Comprehensive testing ensures all components work correctly with the new providers
- Saving test results to a JSON file provides a reference for future development

### Alternatives Considered

1. Modifying the query processor to accept a model parameter:
   - Would require changing the core code
   - Could introduce bugs in the production code
2. Using environment variables to control model selection:
   - Less precise control over which model is used
   - Could interfere with other tests or production use

### Impact

- Verified that the query processor works correctly with Groq models
- Established a testing approach that can be used for other modules
- Created reusable test scripts for future development
|