Compare commits

..

2 Commits

Author SHA1 Message Date
Steve White 12b453a14f massive changes 2025-03-14 16:14:09 -05:00
Steve White b6b50e4ef8 Add code search capability with GitHub and StackExchange APIs
This commit implements specialized code/programming search functionality:
- Add CODE as a new query type with templates at all detail levels
- Implement GitHub and StackExchange search handlers
- Add code query detection based on programming languages and frameworks
- Update result ranking to prioritize code-related sources
- Implement integration testing for code queries
- Update configuration and documentation
2025-03-14 16:12:26 -05:00
34 changed files with 2393 additions and 450 deletions

5
.clinerules Normal file
View File

@@ -0,0 +1,5 @@
Review the contents of .note/ before modifying any files.
After each major successful test, please commit the changes to the repository with a meaningful commit message.
Update the contents of .note/ after each major change.

1
.gitignore vendored
View File

@@ -51,3 +51,4 @@ logs/
# Database files
*.db
report/database/*.db
config/config.yaml

View File

@@ -47,6 +47,18 @@
- Tested the reranking functionality with the `JinaReranker` class
- Checked that the report generation process works with the new structure
### Query Type Selection in Gradio UI
- ✅ Added a dropdown menu for query type selection in the "Generate Report" tab
- ✅ Included options for "auto-detect", "factual", "exploratory", and "comparative"
- ✅ Added descriptive tooltips explaining each query type
- ✅ Set "auto-detect" as the default option
- ✅ Modified the `generate_report` method in the `GradioInterface` class to handle the new query_type parameter
- ✅ Updated the report button click handler to pass the query type to the generate_report method
- ✅ Updated the `generate_report` method in the `ReportGenerator` class to accept a query_type parameter
- ✅ Modified the report synthesizer calls to pass the query_type parameter
- ✅ Added a "Query Types" section to the Gradio UI explaining each query type
- ✅ Committed changes with message "Add query type selection to Gradio UI and improve report generation"
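
The core of the dropdown wiring can be sketched without the UI layer. This is a minimal illustration, not the project's exact code: the choice list matches the checklist above, but `resolve_query_type` and the tooltip strings are hypothetical names invented here.

```python
# "auto-detect" maps to None so downstream code falls back to automatic
# classification; any explicit choice is passed through unchanged.

QUERY_TYPE_CHOICES = ["auto-detect", "factual", "exploratory", "comparative"]

QUERY_TYPE_TOOLTIPS = {  # descriptive tooltips, one per option
    "auto-detect": "Let the system classify the query automatically",
    "factual": "Questions with a specific, verifiable answer",
    "exploratory": "Open-ended questions surveying a topic",
    "comparative": "Questions weighing two or more alternatives",
}

def resolve_query_type(selection):
    """Translate the dropdown selection into the generate_report parameter."""
    if selection == "auto-detect":
        return None  # downstream code runs automatic detection
    if selection not in QUERY_TYPE_CHOICES:
        raise ValueError(f"Unknown query type: {selection}")
    return selection
```

In the real UI this would back a `gr.Dropdown(choices=QUERY_TYPE_CHOICES, value="auto-detect")` whose value is forwarded to `generate_report`.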
## Next Steps
1. Run comprehensive tests to ensure all functionality works with the new directory structure
@@ -75,11 +87,20 @@
- Estimated difficulty: Easy to Moderate (2-3 days of work)
2. **UI Improvements**:
- **Add Chunk Processing Progress Indicators**:
- Modify the `report_synthesis.py` file to add logging during the map phase of the map-reduce process
- Add a counter variable to track which chunk is being processed
- Use the existing logging infrastructure to output progress messages in the UI
- Estimated difficulty: Easy (15-30 minutes of work)
- ✅ **Add Chunk Processing Progress Indicators**:
- ✅ Added a `set_progress_callback` method to the `ReportGenerator` class
- ✅ Implemented progress tracking in both standard and progressive report synthesizers
- ✅ Updated the Gradio UI to display progress during report generation
- ✅ Fixed issues with progress reporting in the UI
- ✅ Ensured proper initialization of the report generator in the UI
- ✅ Added proper error handling for progress updates
- ✅ **Add Query Type Selection**:
- ✅ Added a dropdown menu for query type selection in the "Generate Report" tab
- ✅ Included options for "auto-detect", "factual", "exploratory", "comparative", and "code"
- ✅ Added descriptive tooltips explaining each query type
- ✅ Modified the report generation logic to handle the selected query type
- ✅ Added documentation to help users understand when to use each query type
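
The progress-callback mechanism above can be sketched as follows. Method names (`set_progress_callback`, `synthesize`) follow the checklist, but the class body is illustrative rather than the project's actual implementation; the `chunk.upper()` step stands in for the real map phase.

```python
class ReportGenerator:
    def __init__(self):
        self._progress_callback = None

    def set_progress_callback(self, callback):
        """Register a callable(current_chunk, total_chunks, message)."""
        self._progress_callback = callback

    def _report_progress(self, current, total, message):
        if self._progress_callback is None:
            return
        try:  # a failing UI update should never abort synthesis
            self._progress_callback(current, total, message)
        except Exception:
            pass

    def synthesize(self, chunks):
        results = []
        total = len(chunks)
        for i, chunk in enumerate(chunks, start=1):
            self._report_progress(i, total, f"Processing chunk {i}/{total}")
            results.append(chunk.upper())  # stand-in for the real map step
        return results
```

The UI registers a callback that updates a progress widget; wrapping the callback call in `try`/`except` is one way to get the "proper error handling for progress updates" noted above.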
3. **Visualization Components**:
- Identify common data types in reports that would benefit from visualization
@@ -96,8 +117,9 @@
- Implementing report versioning and comparison
2. **Integration with UI**:
- Adding report generation options to the UI
- Implementing progress indicators for document scraping and report generation
- ✅ Adding report generation options to the UI
- ✅ Implementing progress indicators for document scraping and report generation
- ✅ Adding query type selection to the UI
- Creating visualization components for generated reports
- Adding options to customize report generation parameters
@@ -111,11 +133,11 @@
1. **Report Templates Implementation**:
- ✅ Created a dedicated `report_templates.py` module with a comprehensive template system
- ✅ Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE)
- ✅ Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE, CODE)
- ✅ Created `DetailLevel` enum for different report detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
- ✅ Designed a `ReportTemplate` class with validation for required sections
- ✅ Implemented a `ReportTemplateManager` to manage and retrieve templates
- ✅ Created 12 different templates (3 query types × 4 detail levels)
- ✅ Created 16 different templates (4 query types × 4 detail levels)
- ✅ Added testing with `test_report_templates.py` and `test_brief_report.py`
- ✅ Updated memory bank documentation with template system details
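
The enum-based template system can be sketched like this. The enum members match the lists above; the template bodies and the manager's internals are placeholders, since the real templates define full report sections.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"
    CODE = "code"

class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"

class ReportTemplateManager:
    """Holds one template per (query type, detail level) pair: 4 x 4 = 16."""

    def __init__(self):
        self._templates = {
            (q, d): f"{q.value}/{d.value} template"  # placeholder bodies
            for q in QueryType
            for d in DetailLevel
        }

    def get_template(self, query_type, detail_level):
        return self._templates[(query_type, detail_level)]
```

Using enums rather than raw strings makes invalid combinations a lookup error at the call site instead of a silent mismatch deep in synthesis.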
@@ -127,6 +149,12 @@
- ✅ Improved error handling in template retrieval with fallback to standard templates
- ✅ Added better logging for template retrieval process
3. **UI Enhancements**:
- ✅ Added progress tracking for report generation
- ✅ Added query type selection dropdown
- ✅ Added documentation for query types and detail levels
- ✅ Improved error handling in the UI
### Next Steps
1. **Further Refinement of Report Templates**:
@@ -173,7 +201,20 @@
- ✅ Implemented optimization for token usage and processing efficiency
- ✅ Fine-tuned prompts and parameters based on testing results
3. **Visualization Components**:
3. **Query Type Selection Enhancement**:
- ✅ Added query type selection dropdown to the UI
- ✅ Implemented handling of user-selected query types in the report generation process
- ✅ Added documentation to help users understand when to use each query type
- ✅ Added CODE as a new query type with specialized templates at all detail levels
- ✅ Implemented code query detection with language, framework, and pattern recognition
- ✅ Added GitHub and StackExchange search handlers for code-related queries
- ⏳ Test the query type selection with various queries to ensure it works correctly
- ⏳ Gather user feedback on the usefulness of manual query type selection
- ⏳ Consider adding more specialized templates for specific query types
- ⏳ Explore adding query type detection confidence scores to help users decide when to override
- ⏳ Add examples of each query type to help users understand the differences
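
Code query detection with language, framework, and pattern recognition might look like the sketch below. The term lists and matching rules here are assumptions for illustration; the real detector's vocabulary and scoring are not shown in this diff.

```python
LANGUAGES = {"python", "javascript", "rust", "go", "java", "c++", "typescript"}
FRAMEWORKS = {"django", "react", "flask", "numpy", "pandas", "pytorch"}
CODE_PATTERNS = ("how to implement", "stack trace", "code example", "compiler error")

def is_code_query(query):
    """Heuristically decide whether a query should use the CODE query type."""
    q = query.lower()
    words = set(q.replace(",", " ").split())
    # Direct mention of a programming language or framework is a strong signal
    if words & LANGUAGES or words & FRAMEWORKS:
        return True
    # Otherwise fall back to characteristic programming phrases
    return any(pattern in q for pattern in CODE_PATTERNS)
```

A detector like this could also return a confidence score rather than a boolean, which is what the "detection confidence scores" item above would build on.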
4. **Visualization Components**:
- Identify common data types in reports that would benefit from visualization
- Design and implement visualization components for these data types
- Integrate visualization components into the report generation process
@@ -194,3 +235,14 @@
- Tracks improvement scores to detect diminishing returns
- Adapts batch size based on model context window
- Provides progress tracking through callback mechanism
- Added query type selection to the UI:
- Allows users to explicitly select the query type (factual, exploratory, comparative, code)
- Provides auto-detect option for convenience
- Includes documentation to help users understand when to use each query type
- Passes the selected query type through the report generation pipeline
- Implemented specialized code query support:
- Added GitHub API for searching code repositories
- Added StackExchange API for programming Q&A content
- Created code detection based on programming languages, frameworks, and patterns
- Designed specialized report templates for code content with syntax highlighting
- Enhanced result ranking to prioritize code-related sources for programming queries
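
A minimal sketch of how the two code-search handlers might build their requests, using the public GitHub code-search and Stack Exchange search endpoints. The helper names and parameter choices (result counts, sort order) are assumptions; only the endpoint URLs and query parameters are standard API features.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/code"
STACKEXCHANGE_SEARCH_URL = "https://api.stackexchange.com/2.3/search/advanced"

def build_github_search_request(query, language=None):
    """Return (url, headers) for a GitHub code search call."""
    q = f"{query} language:{language}" if language else query
    params = urlencode({"q": q, "per_page": 10})
    # Requesting text-match metadata gives snippet fragments for ranking
    headers = {"Accept": "application/vnd.github.text-match+json"}
    return f"{GITHUB_SEARCH_URL}?{params}", headers

def build_stackexchange_search_request(query, tag=None):
    """Return the URL for a Stack Exchange advanced search on Stack Overflow."""
    params = {"q": query, "site": "stackoverflow",
              "order": "desc", "sort": "relevance"}
    if tag:
        params["tagged"] = tag
    return f"{STACKEXCHANGE_SEARCH_URL}?{urlencode(params)}"
```

The actual handlers would issue these requests with an API key, parse the JSON responses, and normalize them into the pipeline's common result format before ranking.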

View File

@@ -583,281 +583,103 @@ In this session, we fixed issues in the Gradio UI for report generation and plan
3. Test the current implementation with various query types to identify any remaining issues
4. Update the documentation to reflect the new features and future plans
## Session: 2025-02-28: Google Gemini Integration and Reference Formatting
## Session: 2025-03-12 - Query Type Selection in Gradio UI
### Overview
Fixed the integration of Google Gemini models with LiteLLM, and fixed reference formatting issues.
In this session, we enhanced the Gradio UI by adding a query type selection dropdown, allowing users to explicitly select the query type (factual, exploratory, comparative) instead of relying on automatic detection.
### Key Activities
1. **Fixed Google Gemini Integration**:
- Updated the model format to `gemini/gemini-2.0-flash` in config.yaml
- Modified message formatting for Gemini models in LLM interface
- Added proper handling for the 'gemini' provider in environment variable setup
1. **Added Query Type Selection to Gradio UI**:
- Added a dropdown menu for query type selection in the "Generate Report" tab
- Included options for "auto-detect", "factual", "exploratory", and "comparative"
- Added descriptive tooltips explaining each query type
- Set "auto-detect" as the default option
2. **Fixed Reference Formatting Issues**:
- Enhanced the instructions for reference formatting to ensure URLs are included
- Added a recovery mechanism for truncated references
- Improved context preparation to better extract URLs for references
2. **Updated Report Generation Logic**:
- Modified the `generate_report` method in the `GradioInterface` class to handle the new query_type parameter
- Updated the report button click handler to pass the query type to the generate_report method
- Added logging to show when a user-selected query type is being used
3. **Converted LLM Interface Methods to Async**:
- Made `generate_completion`, `classify_query`, and `enhance_query` methods async
- Updated dependent code to properly await these methods
- Fixed runtime errors related to async/await patterns
3. **Enhanced Report Generator**:
- Updated the `generate_report` method in the `ReportGenerator` class to accept a query_type parameter
- Modified the report synthesizer calls to pass the query_type parameter
- Added logging to track query type usage
### Key Insights
- Gemini models require special message formatting (using 'user' and 'model' roles instead of 'system' and 'assistant')
- References were getting cut off due to token limits, requiring a separate generation step
- The async conversion was necessary to properly handle async LLM calls throughout the codebase
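
The role remapping described in the first insight can be sketched as below. This is a guess at the shape of the conversion, not the project's actual LLM-interface code: it maps `assistant` to `model` and folds `system` content into the first user turn, since Gemini's chat format has no separate system role in this style of conversation.

```python
def to_gemini_messages(messages):
    """Convert OpenAI-style messages to Gemini's 'user'/'model' roles."""
    converted = []
    system_parts = []
    for m in messages:
        role, content = m["role"], m["content"]
        if role == "system":
            system_parts.append(content)  # no system role; fold in below
        elif role == "assistant":
            converted.append({"role": "model", "content": content})
        else:
            converted.append({"role": "user", "content": content})
    if system_parts and converted and converted[0]["role"] == "user":
        converted[0]["content"] = ("\n".join(system_parts) + "\n\n"
                                   + converted[0]["content"])
    return converted
```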
4. **Added Documentation**:
- Added a "Query Types" section to the Gradio UI explaining each query type
- Included examples of when to use each query type
- Updated code comments to explain the query type parameter
### Insights
- Explicit query type selection gives users more control over the report generation process
- Different query types benefit from specialized report templates and structures
- The auto-detect option provides convenience while still allowing manual override
- Clear documentation helps users understand when to use each query type
### Challenges
- Ensuring that the templates produce appropriate output for each detail level
- Balancing between speed and quality for different detail levels
- Managing token budgets effectively across different detail levels
- Ensuring backward compatibility with existing code
- Maintaining the auto-detect functionality while adding manual selection
- Passing the query type parameter through multiple layers of the application
- Providing clear explanations of query types for users
### Next Steps
1. Continue testing with Gemini models to ensure stable operation
2. Consider adding more robust error handling for LLM provider-specific issues
3. Improve the reference formatting further if needed
1. Test the query type selection with various queries to ensure it works correctly
2. Gather user feedback on the usefulness of manual query type selection
3. Consider adding more specialized templates for specific query types
4. Explore adding query type detection confidence scores to help users decide when to override
5. Add examples of each query type to help users understand the differences
## Session: 2025-02-28: Fixing Reference Formatting and Async Implementation
## Session: 2025-03-12 - Fixed Query Type Parameter Bug
### Overview
Fixed reference formatting issues with Gemini models and updated the codebase to properly handle async methods.
Fixed a bug in the report generation process where the `query_type` parameter was not properly handled, causing an error when it was `None`.
### Key Activities
1. **Enhanced Reference Formatting**:
- Improved instructions to emphasize including URLs for each reference
- Added duplicate URL fields in the context to ensure URLs are captured
- Updated the reference generation prompt to explicitly request URLs
- Added a separate reference generation step to handle truncated references
1. **Fixed NoneType Error in Report Synthesis**:
- Added a null check in the `_get_extraction_prompt` method in `report_synthesis.py`
- Modified the condition that checks for comparative queries to handle the case where `query_type` is `None`
- Ensured the method works correctly regardless of whether a query type is explicitly provided
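
The guard described above amounts to checking the parameter before calling `.lower()` on it. This is a reduced sketch: the real `_get_extraction_prompt` builds full prompts, and the strings here are placeholders.

```python
def get_extraction_prompt(query, query_type=None):
    """Choose an extraction prompt; tolerate query_type being None."""
    # Guard first: calling .lower() on None raises AttributeError
    if query_type is not None and query_type.lower() == "comparative":
        return f"Extract points of comparison for: {query}"
    return f"Extract key information for: {query}"
```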
2. **Fixed Async Implementation**:
- Converted all LLM interface methods to async for proper handling
- Updated QueryProcessor's generate_search_queries method to be async
- Modified query_to_report.py to correctly await async methods
- Fixed runtime errors related to async/await patterns
3. **Updated Gradio Interface**:
- Modified the generate_report method to properly handle async operations
- Updated the report button click handler to correctly pass parameters
- Fixed the parameter order in the lambda function for async execution
- Improved error handling in the UI
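
One way to bridge a synchronous UI click handler to the now-async report pipeline is a thin `asyncio.run` wrapper, sketched below. This is an assumption about the approach: Gradio can also accept async handlers directly, and the function names here are illustrative.

```python
import asyncio

async def generate_report(query, query_type=None):
    await asyncio.sleep(0)  # stands in for the awaited LLM calls
    return f"Report for {query!r} ({query_type or 'auto-detect'})"

def on_generate_click(query, query_type):
    """Synchronous wrapper a UI button handler can call directly."""
    return asyncio.run(generate_report(query, query_type))
```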
## Session: 2025-03-11
### Overview
Reorganized the project directory structure to improve maintainability and clarity, ensuring all components are properly organized into their respective directories.
### Key Activities
1. **Directory Structure Reorganization**:
- Created a dedicated `utils/` directory for utility scripts
- Moved `jina_similarity.py` to `utils/`
- Added `__init__.py` to make it a proper Python package
- Organized test files into subdirectories under `tests/`
- Created subdirectories for each module (query, execution, ranking, report, ui, integration)
- Added `__init__.py` files to all test directories
- Created an `examples/` directory with subdirectories for data and scripts
- Moved sample data to `examples/data/`
- Added `__init__.py` files to make them proper Python packages
- Added a dedicated `scripts/` directory for utility scripts
- Moved `query_to_report.py` to `scripts/`
2. **Pipeline Verification**:
- Tested the pipeline after reorganization to ensure functionality
- Verified that the UI works correctly with the new directory structure
- Confirmed that all imports are working properly with the new structure
3. **Embedding Usage Analysis**:
- Confirmed that the pipeline uses Jina AI's Embeddings API through the `JinaSimilarity` class
- Verified that the `JinaReranker` class uses embeddings for document reranking
- Analyzed how embeddings are integrated into the search and ranking process
2. **Root Cause Analysis**:
- Identified that the error occurred when the `query_type` parameter was `None` and the code tried to call `.lower()` on it
- Traced the issue through the call chain from the UI to the report generator to the report synthesizer
- Confirmed that the fix addresses the specific error message: `'NoneType' object has no attribute 'lower'`
### Insights
- A well-organized directory structure significantly improves code maintainability and readability
- Using proper Python package structure with `__init__.py` files ensures clean imports
- Separating tests, utilities, examples, and scripts into dedicated directories makes the codebase more navigable
- The Jina AI embeddings are used throughout the pipeline for semantic similarity and document reranking
### Challenges
- Ensuring all import statements are updated correctly after moving files
- Maintaining backward compatibility with existing code
- Verifying that all components still work together after reorganization
- Proper null checking is essential when working with optional parameters that are passed through multiple layers
- The error occurred in the report synthesis module but was triggered by the UI's query type selection feature
- The fix maintains backward compatibility while ensuring the new query type selection feature works correctly
### Next Steps
1. Test the fix with various query types to ensure it works correctly
2. Consider adding similar null checks in other parts of the code that handle the query_type parameter
3. Add more comprehensive error handling throughout the report generation process
4. Update the test suite to include tests for null query_type values
1. Run comprehensive tests to ensure all functionality works with the new directory structure
2. Update any remaining documentation to reflect the new directory structure
3. Consider moving the remaining test files in the root of the `tests/` directory to appropriate subdirectories
4. Review import statements throughout the codebase to ensure they follow the new structure
### Key Insights
- Async/await patterns need to be consistently applied throughout the codebase
- Reference formatting requires explicit instructions to include URLs
- Gradio's interface needs special handling for async functions
### Challenges
- Ensuring that all async methods are properly awaited
- Balancing between detailed instructions and token limits for reference generation
- Managing the increased processing time for async operations
### Next Steps
1. Continue testing with Gemini models to ensure stable operation
2. Consider adding more robust error handling for LLM provider-specific issues
3. Improve the reference formatting further if needed
4. Update documentation to reflect the changes made to the LLM interface
5. Consider adding more unit tests for the async methods
## Session: 2025-02-28: Fixed NoneType Error in Report Synthesis
### Issue
Encountered an error during report generation:
```
TypeError: 'NoneType' object is not subscriptable
```
The error occurred in the `map_document_chunks` method of the `ReportSynthesizer` class when trying to slice a title that was `None`.
### Changes Made
1. Fixed the chunk counter in `map_document_chunks` method:
- Used a separate counter for individual chunks instead of using the batch index
- Added a null check for chunk titles with a fallback to 'Untitled'
2. Added defensive code in `synthesize_report` method:
- Added code to ensure all chunks have a title before processing
- Added null checks for title fields
3. Updated the `DocumentProcessor` class:
- Modified `process_documents_for_report` to ensure all chunks have a title
- Updated `chunk_document_by_sections`, `chunk_document_fixed_size`, and `chunk_document_hierarchical` methods to handle None titles
- Added default 'Untitled' value for all title fields
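
The title-normalization fix can be sketched as a small pass over the chunks before synthesis. Function names here are illustrative; the real logic lives inside `DocumentProcessor` and `ReportSynthesizer`.

```python
def ensure_chunk_titles(chunks, default="Untitled"):
    """Guarantee every chunk dict has a non-None title before synthesis."""
    for chunk in chunks:
        if chunk.get("title") is None:
            chunk["title"] = default
    return chunks

def preview_title(chunk, width=50):
    # Safe after normalization: slicing a string cannot raise the
    # "'NoneType' object is not subscriptable" error seen before the fix
    return chunk["title"][:width]
```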
### Testing
The changes were tested with a report generation task that previously failed, and the error was resolved.
### Next Steps
1. Consider adding more comprehensive null checks throughout the codebase
2. Add unit tests to verify proper handling of missing or null fields
3. Implement better error handling and recovery mechanisms
## Session: 2025-03-11
## Session: 2025-03-12 - Fixed Template Retrieval for Null Query Type
### Overview
Focused on resolving issues with the report generation template system and ensuring that different detail levels and query types work correctly in the report synthesis process.
Fixed a second issue in the report generation process where the template retrieval was failing when the `query_type` parameter was `None`.
### Key Activities
1. **Fixed Template Retrieval Issues**:
- Updated the `get_template` method in the `ReportTemplateManager` to ensure it retrieves templates correctly based on query type and detail level
- Implemented a helper method `_get_template_from_strings` in the `ReportSynthesizer` to convert string values for query types and detail levels to their respective enum objects
- Added better logging for template retrieval process to aid in debugging
1. **Fixed Template Retrieval for Null Query Type**:
- Updated the `_get_template_from_strings` method in `report_synthesis.py` to handle `None` query_type
- Added a default value of "exploratory" when query_type is `None`
- Modified the method signature to explicitly indicate that query_type_str can be `None`
- Added logging to indicate when the default query type is being used
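
The string-to-enum conversion with the documented "exploratory" default can be sketched as follows; the enum mirrors the project's `QueryType`, while the function name is a stand-in for `_get_template_from_strings`.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"
    CODE = "code"

def query_type_from_string(query_type_str=None):
    """Map an optional string to a QueryType, defaulting to EXPLORATORY."""
    if query_type_str is None:
        # Documented default when no query type is provided
        return QueryType.EXPLORATORY
    return QueryType(query_type_str.lower())
```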
2. **Tested All Detail Levels and Query Types**:
- Created a comprehensive test script `test_all_detail_levels.py` to test all combinations of detail levels and query types
- Successfully tested all detail levels (brief, standard, detailed, comprehensive) with factual queries
- Successfully tested all detail levels with exploratory queries
- Successfully tested all detail levels with comparative queries
3. **Improved Error Handling**:
- Added fallback to standard templates if specific templates are not found
- Enhanced logging to track whether templates are found during the synthesis process
4. **Code Organization**:
- Removed duplicate `ReportTemplateManager` and `ReportTemplate` classes from `report_synthesis.py`
- Used the imported versions from `report_templates.py` for better code maintainability
2. **Root Cause Analysis**:
- Identified that the error occurred when trying to convert `None` to a `QueryType` enum value
- The error message was: "No template found for None standard" and "None is not a valid QueryType"
- The issue was in the template retrieval process which is used by both standard and progressive report synthesis
### Insights
- The template system is now working correctly for all combinations of query types and detail levels
- Proper logging is essential for debugging template retrieval issues
- Converting string values to enum objects is necessary for consistent template retrieval
- Having a dedicated test script for all combinations helps ensure comprehensive coverage
### Challenges
- Initially encountered issues where templates were not found during report synthesis, leading to `ValueError`
- Needed to ensure that the correct classes and methods were used for template retrieval
- When fixing one issue with optional parameters, it's important to check for similar issues in related code paths
- Providing sensible defaults for optional parameters helps maintain robustness
- Proper error handling and logging helps diagnose issues in complex systems with multiple layers
### Next Steps
1. Conduct additional testing with real-world queries and document sets
2. Compare the analytical depth and quality of reports generated with different detail levels
3. Gather user feedback on the improved reports at different detail levels
4. Further refine the detail level configurations based on testing and feedback
## Session: 2025-03-12 - Report Templates and Progressive Report Generation
### Overview
Implemented a dedicated report templates module to standardize report generation across different query types and detail levels, and implemented progressive report generation for comprehensive reports.
### Key Activities
1. **Created Report Templates Module**:
- Developed a new `report_templates.py` module with a comprehensive template system
- Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE)
- Created `DetailLevel` enum for different report detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
- Designed a `ReportTemplate` class with validation for required sections
- Implemented a `ReportTemplateManager` to manage and retrieve templates
2. **Implemented Template Variations**:
- Created 12 different templates (3 query types × 4 detail levels)
- Designed templates with appropriate sections for each combination
- Added placeholders for dynamic content in each template
- Ensured templates follow a consistent structure while adapting to specific needs
3. **Added Testing**:
- Created `test_report_templates.py` to verify template retrieval and validation
- Implemented `test_brief_report.py` to test brief report generation with a simple query
- Verified that all templates can be correctly retrieved and used
4. **Implemented Progressive Report Generation**:
- Created a new `progressive_report_synthesis.py` module with a `ProgressiveReportSynthesizer` class
- Implemented chunk prioritization algorithm based on relevance scores
- Developed iterative refinement process with specialized prompts
- Added state management to track report versions and processed chunks
- Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
- Added support for different models with adaptive batch sizing
- Implemented progress tracking and callback mechanism
- Created comprehensive test suite for progressive report generation
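
The three termination conditions can be sketched as a single control-flow check. This is a simplified model under stated assumptions: each iteration folds in one chunk, and the per-iteration improvement scores are passed in (in the real synthesizer they would be computed from successive report versions); the threshold values are illustrative.

```python
def refinement_termination(total_chunks, improvements,
                           min_improvement=0.05, max_iterations=10):
    """Return (reason, iterations_run) for the progressive refinement loop."""
    for iteration, improvement in enumerate(improvements, start=1):
        if iteration >= total_chunks:       # every chunk folded in
            return "all_chunks_processed", iteration
        if improvement < min_improvement:   # diminishing returns detected
            return "diminishing_returns", iteration
        if iteration >= max_iterations:     # hard cap on iterations
            return "max_iterations", iteration
    return "all_chunks_processed", len(improvements)
```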
5. **Updated Report Generator**:
- Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
- Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
- Added proper model selection and configuration for both synthesizers
6. **Updated Memory Bank**:
- Added report templates information to code_structure.md
- Updated current_focus.md with implementation details for progressive report generation
- Updated session_log.md with details about the implementation
- Ensured all new files are properly documented
### Insights
- A standardized template system significantly improves report consistency
- Different query types require specialized report structures
- Validation ensures all required sections are present in templates
- Enums provide type safety and prevent errors from string comparisons
- Progressive report generation provides better results for very large document collections
- The hybrid approach leverages the strengths of both map-reduce and progressive methods
- Tracking improvement scores helps detect diminishing returns and optimize processing
- Adaptive batch sizing based on model context window improves efficiency
### Challenges
- Designing templates that are flexible enough for various content types
- Balancing between standardization and customization for different query types
- Ensuring proper integration with the existing report synthesis process
- Managing state and tracking progress in progressive report generation
- Preventing entrenchment of initial report structure in progressive approach
- Optimizing token usage when sending entire reports for refinement
- Determining appropriate termination conditions for the progressive approach
### Next Steps
1. Integrate the progressive approach with the UI
- Implement controls to pause, resume, or terminate the process
- Create a preview mode to see the current report state
- Add options to compare different versions of the report
2. Conduct additional testing with real-world queries and document sets
3. Add specialized templates for specific research domains
4. Implement template customization options for users
5. Implement visualization components for data mentioned in reports
1. Test the fix with comprehensive reports to ensure it works correctly
2. Consider adding similar default values for other optional parameters
3. Review the codebase for other potential null reference issues
4. Update documentation to clarify the behavior when optional parameters are not provided

View File

@@ -13,7 +13,12 @@ This system automates the research process by:
## Features
- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across Serper (Google), Google Scholar, and arXiv
- **Multi-Source Search**: Executes searches across general web (Serper/Google), academic sources, and current news
- **Specialized Search Handlers**:
- **Current Events**: Optimized news search for recent developments
- **Academic Research**: Specialized academic search with OpenAlex, CORE, arXiv, and Google Scholar
- **Open Access Detection**: Finds freely available versions of paywalled papers using Unpaywall
- **Code/Programming**: Specialized code search using GitHub and StackExchange
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers
@@ -24,7 +29,7 @@ This system automates the research process by:
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into a coherent report (coming soon)
- **Report Generator**: Synthesizes information into coherent reports with specialized templates for different query types
## Getting Started
@@ -33,8 +38,13 @@ This system automates the research process by:
- Python 3.8+
- API keys for:
- Serper API (for Google and Scholar search)
- NewsAPI (for current events search)
- CORE API (for open access academic search)
- GitHub API (for code search)
- StackExchange API (for programming Q&A content)
- Groq (or other LLM provider)
- Jina AI (for reranking)
- Email for OpenAlex and Unpaywall (recommended but not required)
### Installation
@@ -58,8 +68,11 @@ cp config/config.yaml.example config/config.yaml
```yaml
api_keys:
serper: "your-serper-api-key"
newsapi: "your-newsapi-key"
groq: "your-groq-api-key"
jina: "your-jina-api-key"
github: "your-github-api-key"
stackexchange: "your-stackexchange-api-key"
```
### Usage
@@ -135,4 +148,10 @@ This project is licensed under the MIT License - see the LICENSE file for detail
- [Jina AI](https://jina.ai/) for their embedding and reranking APIs
- [Serper](https://serper.dev/) for their Google search API
- [NewsAPI](https://newsapi.org/) for their news search API
- [OpenAlex](https://openalex.org/) for their academic search API
- [CORE](https://core.ac.uk/) for their open access academic search API
- [Unpaywall](https://unpaywall.org/) for their open access discovery API
- [Groq](https://groq.com/) for their fast LLM inference
- [GitHub](https://github.com/) for their code search API
- [StackExchange](https://stackexchange.com/) for their programming Q&A API

View File

@@ -1,157 +0,0 @@
# Example configuration file for the intelligent research system
# Rename this file to config.yaml and fill in your API keys and settings
# API keys (alternatively, set environment variables)
api_keys:
openai: "your-openai-api-key" # Or set OPENAI_API_KEY environment variable
jina: "your-jina-api-key" # Or set JINA_API_KEY environment variable
serper: "your-serper-api-key" # Or set SERPER_API_KEY environment variable
google: "your-google-api-key" # Or set GOOGLE_API_KEY environment variable
anthropic: "your-anthropic-api-key" # Or set ANTHROPIC_API_KEY environment variable
openrouter: "your-openrouter-api-key" # Or set OPENROUTER_API_KEY environment variable
groq: "your-groq-api-key" # Or set GROQ_API_KEY environment variable
# LLM model configurations
models:
gpt-3.5-turbo:
provider: "openai"
temperature: 0.7
max_tokens: 1000
top_p: 1.0
endpoint: null # Use default OpenAI endpoint
gpt-4:
provider: "openai"
temperature: 0.5
max_tokens: 2000
top_p: 1.0
endpoint: null # Use default OpenAI endpoint
claude-2:
provider: "anthropic"
temperature: 0.7
max_tokens: 1500
top_p: 1.0
endpoint: null # Use default Anthropic endpoint
azure-gpt-4:
provider: "azure"
temperature: 0.5
max_tokens: 2000
top_p: 1.0
endpoint: "https://your-azure-endpoint.openai.azure.com"
deployment_name: "your-deployment-name"
api_version: "2023-05-15"
local-llama:
provider: "ollama"
temperature: 0.8
max_tokens: 1000
endpoint: "http://localhost:11434/api/generate"
model_name: "llama2"
llama-3.1-8b-instant:
provider: "groq"
model_name: "llama-3.1-8b-instant"
temperature: 0.7
max_tokens: 1024
top_p: 1.0
endpoint: "https://api.groq.com/openai/v1"
llama-3.3-70b-versatile:
provider: "groq"
model_name: "llama-3.3-70b-versatile"
temperature: 0.5
max_tokens: 2048
top_p: 1.0
endpoint: "https://api.groq.com/openai/v1"
openrouter-mixtral:
provider: "openrouter"
model_name: "mistralai/mixtral-8x7b-instruct"
temperature: 0.7
max_tokens: 1024
top_p: 1.0
endpoint: "https://openrouter.ai/api/v1"
openrouter-claude:
provider: "openrouter"
model_name: "anthropic/claude-3-opus"
temperature: 0.5
max_tokens: 2048
top_p: 1.0
endpoint: "https://openrouter.ai/api/v1"
gemini-2.0-flash:
provider: "gemini"
model_name: "gemini-2.0-flash"
temperature: 0.5
max_tokens: 2048
top_p: 1.0
# Default model to use if not specified for a module
default_model: "llama-3.1-8b-instant" # Using Groq's Llama 3.1 8B model for testing
# Module-specific model assignments
module_models:
# Query processing module
query_processing:
enhance_query: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for query enhancement
classify_query: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for classification
generate_search_queries: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for generating search queries
# Search strategy module
search_strategy:
develop_strategy: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for developing search strategies
target_selection: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for target selection
# Document ranking module
document_ranking:
rerank_documents: "jina-reranker" # Use Jina's reranker for document reranking
# Report generation module
report_generation:
synthesize_report: "gemini-2.0-flash" # Use Google's Gemini 2.0 Flash for report synthesis
format_report: "llama-3.1-8b-instant" # Use Groq's Llama 3.1 8B for formatting
# Search engine configurations
search_engines:
google:
enabled: true
max_results: 10
serper:
enabled: true
max_results: 10
jina:
enabled: true
max_results: 10
scholar:
enabled: false
max_results: 5
arxiv:
enabled: false
max_results: 5
# Jina AI specific configurations
jina:
reranker:
model: "jina-reranker-v2-base-multilingual" # Default reranker model
top_n: 10 # Default number of top results to return
# UI configuration
ui:
theme: "light" # light or dark
port: 7860
share: false
title: "Intelligent Research System"
description: "An automated system for finding, filtering, and synthesizing information"
# System settings
system:
cache_dir: "data/cache"
results_dir: "data/results"
log_level: "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL

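The handlers in this change resolve each key by checking an environment variable first and falling back to the `api_keys` block above (see the `GITHUB_API_KEY` / config lookup in the GitHub handler). A minimal sketch of that fallback — the `get_api_key` name matches the codebase, but the body here is an illustrative reconstruction, not the actual implementation:

```python
import os

def get_api_key(name, config_data):
    """Resolve an API key: environment variable first, then config.yaml data."""
    # Convention used by the handlers: NEWSAPI_API_KEY, GITHUB_API_KEY, etc.
    env_var = f"{name.upper()}_API_KEY"
    # The env var wins if set; otherwise fall back to the config dict
    return os.environ.get(env_var) or config_data.get("api_keys", {}).get(name)

config_data = {"api_keys": {"github": "ghp_example"}}
print(get_api_key("github", config_data))
```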
View File

@@ -10,6 +10,10 @@ api_keys:
anthropic: "your-anthropic-api-key" # Or set ANTHROPIC_API_KEY environment variable
openrouter: "your-openrouter-api-key" # Or set OPENROUTER_API_KEY environment variable
groq: "your-groq-api-key" # Or set GROQ_API_KEY environment variable
newsapi: "your-newsapi-key" # Or set NEWSAPI_API_KEY environment variable
core: "your-core-api-key" # Or set CORE_API_KEY environment variable
github: "your-github-api-key" # Or set GITHUB_API_KEY environment variable
stackexchange: "your-stackexchange-api-key" # Or set STACKEXCHANGE_API_KEY environment variable
# LLM model configurations
models:
@@ -129,6 +133,35 @@ search_engines:
enabled: false
max_results: 5
news:
enabled: true
max_results: 10
days_back: 7
use_headlines: false # Set to true to use top headlines endpoint
country: "us" # Country code for top headlines
language: "en" # Language code
openalex:
enabled: true
max_results: 10
filter_open_access: false # Set to true to only return open access publications
core:
enabled: true
max_results: 10
full_text: true # Set to true to search in full text of papers
github:
enabled: true
max_results: 10
sort: "best_match" # Options: best_match, stars, forks, updated
stackexchange:
enabled: true
max_results: 10
site: "stackoverflow" # Default site (stackoverflow, serverfault, superuser, etc.)
sort: "relevance" # Options: relevance, votes, creation, activity
# Jina AI specific configurations
jina:
reranker:
@@ -143,6 +176,22 @@ ui:
title: "Intelligent Research System"
description: "An automated system for finding, filtering, and synthesizing information"
# Academic search settings
academic_search:
email: "user@example.com" # Used for Unpaywall and OpenAlex APIs
# OpenAlex settings
openalex:
default_sort: "relevance_score:desc" # Other options: cited_by_count:desc, publication_date:desc
# Unpaywall settings
unpaywall:
# No specific settings needed
# CORE settings
core:
# No specific settings needed
# System settings
system:
cache_dir: "data/cache"

View File

@@ -0,0 +1,88 @@
"""
Example script for using the academic search handlers.
"""
import asyncio
import sys
import os
from datetime import datetime
# Add the project root to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from execution.search_executor import SearchExecutor
from query.query_processor import get_query_processor
from config.config import get_config
async def main():
"""Run a sample academic search."""
# Initialize components
query_processor = get_query_processor()
search_executor = SearchExecutor()
# Get a list of available search engines
available_engines = search_executor.get_available_search_engines()
print(f"Available search engines: {', '.join(available_engines)}")
# Check if academic search engines are available
academic_engines = ["openalex", "core", "scholar", "arxiv"]
available_academic = [engine for engine in academic_engines if engine in available_engines]
if not available_academic:
print("No academic search engines are available. Please check your configuration.")
return
else:
print(f"Available academic search engines: {', '.join(available_academic)}")
# Prompt for the query
query = input("Enter your academic research query: ") or "What are the latest papers on large language model alignment?"
print(f"\nProcessing query: {query}")
# Process the query
start_time = datetime.now()
structured_query = await query_processor.process_query(query)
# Add academic query flag
structured_query["is_academic"] = True
# Generate search queries optimized for each engine
structured_query = await query_processor.generate_search_queries(
structured_query, available_academic
)
# Print the optimized queries
print("\nOptimized queries for academic search:")
for engine in available_academic:
print(f"\n{engine.upper()} queries:")
for i, q in enumerate(structured_query.get("search_queries", {}).get(engine, [])):
print(f"{i+1}. {q}")
# Execute the search
results = await search_executor.execute_search_async(
structured_query,
search_engines=available_academic,
num_results=5
)
# Print the results
total_results = sum(len(engine_results) for engine_results in results.values())
print(f"\nFound {total_results} academic results:")
for engine, engine_results in results.items():
print(f"\n--- {engine.upper()} Results ({len(engine_results)}) ---")
for i, result in enumerate(engine_results):
print(f"\n{i+1}. {result.get('title', 'No title')}")
print(f"Authors: {result.get('authors', 'Unknown')}")
print(f"Year: {result.get('year', 'Unknown')}")
print(f"Access: {result.get('access_status', 'Unknown')}")
print(f"URL: {result.get('url', 'No URL')}")
print(f"Snippet: {result.get('snippet', 'No snippet')[0:200]}...")
end_time = datetime.now()
print(f"\nSearch completed in {(end_time - start_time).total_seconds():.2f} seconds")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,76 @@
"""
Example script for using the news search handler.
"""
import asyncio
import sys
import os
from datetime import datetime
# Add the project root to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from execution.search_executor import SearchExecutor
from query.query_processor import get_query_processor
from config.config import get_config
async def main():
"""Run a sample news search."""
# Initialize components
query_processor = get_query_processor()
search_executor = SearchExecutor()
# Get a list of available search engines
available_engines = search_executor.get_available_search_engines()
print(f"Available search engines: {', '.join(available_engines)}")
# Check if news search is available
if "news" not in available_engines:
print("News search is not available. Please check your NewsAPI configuration.")
return
# Prompt for the query
query = input("Enter your query about recent events: ") or "Trump tariffs latest announcement"
print(f"\nProcessing query: {query}")
# Process the query
start_time = datetime.now()
structured_query = await query_processor.process_query(query)
# Generate search queries optimized for each engine
structured_query = await query_processor.generate_search_queries(
structured_query, ["news"]
)
# Print the optimized queries
print("\nOptimized queries for news search:")
for i, q in enumerate(structured_query.get("search_queries", {}).get("news", [])):
print(f"{i+1}. {q}")
# Execute the search
results = await search_executor.execute_search_async(
structured_query,
search_engines=["news"],
num_results=10
)
# Print the results
news_results = results.get("news", [])
print(f"\nFound {len(news_results)} news results:")
for i, result in enumerate(news_results):
print(f"\n--- Result {i+1} ---")
print(f"Title: {result.get('title', 'No title')}")
print(f"Source: {result.get('source', 'Unknown')}")
print(f"Date: {result.get('published_date', 'Unknown date')}")
print(f"URL: {result.get('url', 'No URL')}")
print(f"Snippet: {result.get('snippet', 'No snippet')[0:200]}...")
end_time = datetime.now()
print(f"\nSearch completed in {(end_time - start_time).total_seconds():.2f} seconds")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,160 @@
"""
CORE.ac.uk API handler.
Provides access to open access academic papers from institutional repositories.
"""
import os
import requests
from typing import Dict, List, Any, Optional
from .base_handler import BaseSearchHandler
from config.config import get_config, get_api_key
class CoreSearchHandler(BaseSearchHandler):
"""Handler for CORE.ac.uk academic search API."""
def __init__(self):
"""Initialize the CORE search handler."""
self.config = get_config()
self.api_key = get_api_key("core")
self.base_url = "https://api.core.ac.uk/v3/search/works"
self.available = self.api_key is not None
# Get any custom settings from config
self.academic_config = self.config.config_data.get("academic_search", {}).get("core", {})
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search query using CORE.ac.uk.
Args:
query: The search query to execute
num_results: Number of results to return
**kwargs: Additional search parameters:
- full_text: Whether to search in full text (default: True)
- filter_year: Filter by publication year or range
- sort: Sort by relevance or publication date
- repositories: Limit to specific repositories
Returns:
List of search results with standardized format
"""
if not self.available:
raise ValueError("CORE API is not available. API key is missing.")
# Set up the request headers
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
# Set up the request body
body = {
"q": query,
"limit": num_results,
"offset": 0
}
# Add full text search parameter
full_text = kwargs.get("full_text", True)
if full_text:
body["fields"] = ["title", "authors", "year", "abstract", "fullText"]
else:
body["fields"] = ["title", "authors", "year", "abstract"]
# Add year filter if specified
if "filter_year" in kwargs:
body["filters"] = [{"year": kwargs["filter_year"]}]
# Add sort parameter
if "sort" in kwargs:
if kwargs["sort"] == "date":
body["sort"] = [{"year": "desc"}]
else:
body["sort"] = [{"_score": "desc"}] # Default to relevance
# Add repository filter if specified
if "repositories" in kwargs:
if "filters" not in body:
body["filters"] = []
body["filters"].append({"repositoryIds": kwargs["repositories"]})
try:
# Make the request
response = requests.post(self.base_url, headers=headers, json=body)
response.raise_for_status()
# Parse the response
data = response.json()
# Process the results
results = []
for item in data.get("results", []):
# Extract authors
authors = []
for author in item.get("authors", [])[:3]:
author_name = author.get("name", "")
if author_name:
authors.append(author_name)
# Get publication year
pub_year = item.get("year", "Unknown")
# Get DOI
doi = item.get("doi", "")
# Determine URL - prefer the download URL if available
url = item.get("downloadUrl", "")
if not url and doi:
url = f"https://doi.org/{doi}"
if not url:
url = item.get("sourceFulltextUrls", [""])[0] if item.get("sourceFulltextUrls") else ""
# Create snippet from abstract or first part of full text
snippet = item.get("abstract", "")
if not snippet and "fullText" in item:
snippet = item.get("fullText", "")[:500] + "..."
# If no snippet is available, create one from metadata
if not snippet:
journal = item.get("publisher", "Unknown Journal")
snippet = f"Open access academic paper from {journal}. {pub_year}."
# Create the result
result = {
"title": item.get("title", "Untitled"),
"url": url,
"snippet": snippet,
"source": "core",
"authors": ", ".join(authors),
"year": pub_year,
"journal": item.get("publisher", ""),
"doi": doi,
"open_access": True # CORE only indexes open access content
}
results.append(result)
return results
except requests.exceptions.RequestException as e:
print(f"Error executing CORE search: {e}")
return []
def get_name(self) -> str:
"""Get the name of the search handler."""
return "core"
def is_available(self) -> bool:
"""Check if the CORE API is available."""
return self.available
def get_rate_limit_info(self) -> Dict[str, Any]:
"""Get information about the API's rate limits."""
# These limits are based on the free tier
return {
"requests_per_minute": 30,
"requests_per_day": 10000,
"current_usage": None
}

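The CORE handler's URL selection (prefer the direct download link, then a DOI resolver link, then the first repository full-text URL) can be isolated as a small helper; this is a sketch of the fallback chain above, with `resolve_url` being a name introduced here for illustration:

```python
def resolve_url(item):
    """Pick the best link for a CORE result: downloadUrl > DOI > source full-text URL."""
    url = item.get("downloadUrl", "")
    if not url and item.get("doi"):
        # Route through the DOI resolver when no direct download exists
        url = f"https://doi.org/{item['doi']}"
    if not url:
        sources = item.get("sourceFulltextUrls") or [""]
        url = sources[0]
    return url

print(resolve_url({"doi": "10.1234/abc"}))  # https://doi.org/10.1234/abc
```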
View File

@@ -0,0 +1,206 @@
"""
GitHub API handler for code search.
This module implements a search handler for GitHub's API,
allowing code searches across GitHub repositories.
"""
import os
import requests
from typing import Dict, List, Any, Optional
from config.config import get_config
from ..api_handlers.base_handler import BaseSearchHandler
class GitHubSearchHandler(BaseSearchHandler):
"""Handler for GitHub code search."""
def __init__(self):
"""Initialize the GitHub search handler."""
self.config = get_config()
self.api_key = os.environ.get('GITHUB_API_KEY') or self.config.config_data.get('api_keys', {}).get('github')
self.api_url = "https://api.github.com"
self.search_endpoint = "/search/code"
self.user_agent = "SimSearch-Research-Assistant"
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a code search on GitHub.
Args:
query: The search query
num_results: Number of results to return
**kwargs: Additional search parameters
- language: Filter by programming language
- sort: Sort by (indexed, stars, forks, updated)
- order: Sort order (asc, desc)
Returns:
List of search results
"""
if not self.is_available():
return []
# Prepare query parameters
params = {
"q": query,
"per_page": min(num_results, 30), # GitHub API limit
"page": 1
}
# Add optional parameters
if kwargs.get("language"):
params["q"] += f" language:{kwargs['language']}"
if kwargs.get("sort"):
params["sort"] = kwargs["sort"]
if kwargs.get("order"):
params["order"] = kwargs["order"]
# Set up headers
headers = {
"Authorization": f"token {self.api_key}",
"Accept": "application/vnd.github.v3+json",
"User-Agent": self.user_agent
}
try:
# Make the API request
response = requests.get(
f"{self.api_url}{self.search_endpoint}",
params=params,
headers=headers
)
response.raise_for_status()
# Process results
data = response.json()
results = []
for item in data.get("items", []):
# For each code result, fetch a bit of the file content
snippet = self._get_code_snippet(item) if item.get("url") else "Code snippet not available"
# Construct a standardized result entry
result = {
"title": item.get("name", "Unnamed"),
"url": item.get("html_url", ""),
"snippet": snippet,
"source": "github",
"metadata": {
"repository": item.get("repository", {}).get("full_name", ""),
"path": item.get("path", ""),
"language": kwargs.get("language", ""),
"score": item.get("score", 0)
}
}
results.append(result)
return results
except requests.RequestException as e:
print(f"GitHub API error: {e}")
return []
def _get_code_snippet(self, item: Dict[str, Any]) -> str:
"""
Fetch a snippet of the code file.
Args:
item: The GitHub code search result item
Returns:
A string containing a snippet of the code
"""
try:
# Get the raw content URL
content_url = item.get("url")
if not content_url:
return "Content not available"
# Request the content
headers = {
"Authorization": f"token {self.api_key}",
"Accept": "application/vnd.github.v3.raw",
"User-Agent": self.user_agent
}
response = requests.get(content_url, headers=headers)
response.raise_for_status()
# Get content and create a snippet
content = response.json().get("content", "")
if content:
# GitHub returns Base64 encoded content
import base64
decoded = base64.b64decode(content).decode('utf-8')
# Create a snippet (first ~500 chars)
snippet = decoded[:500] + ("..." if len(decoded) > 500 else "")
return snippet
return "Content not available"
except Exception as e:
print(f"Error fetching code snippet: {e}")
return "Error fetching code snippet"
def get_name(self) -> str:
"""
Get the name of the search handler.
Returns:
Name of the search handler
"""
return "github"
def is_available(self) -> bool:
"""
Check if the GitHub API is available and properly configured.
Returns:
True if the API is available, False otherwise
"""
return self.api_key is not None
def get_rate_limit_info(self) -> Dict[str, Any]:
"""
Get information about GitHub API rate limits.
Returns:
Dictionary with rate limit information
"""
if not self.is_available():
return {"error": "GitHub API not configured"}
try:
headers = {
"Authorization": f"token {self.api_key}",
"Accept": "application/vnd.github.v3+json",
"User-Agent": self.user_agent
}
response = requests.get(
f"{self.api_url}/rate_limit",
headers=headers
)
response.raise_for_status()
data = response.json()
rate_limits = data.get("resources", {}).get("search", {})
return {
"requests_per_minute": 30, # GitHub search API limit
"requests_per_hour": rate_limits.get("limit", 0),
"current_usage": {
"remaining": rate_limits.get("remaining", 0),
"reset_time": rate_limits.get("reset", 0)
}
}
except Exception as e:
print(f"Error getting rate limit info: {e}")
return {
"error": str(e),
"requests_per_minute": 30,
"requests_per_hour": 5000 # Default limit
}

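The `_get_code_snippet` method above decodes GitHub's Base64-encoded file bodies before truncating. The decode-and-truncate step on its own looks like this (a standalone sketch; `make_snippet` is a name introduced here, not part of the handler):

```python
import base64

def make_snippet(encoded_content, max_length=500):
    """Decode a Base64 file body (as returned by GitHub's contents API) and truncate it."""
    decoded = base64.b64decode(encoded_content).decode("utf-8")
    # Append an ellipsis only when the content was actually cut
    return decoded[:max_length] + ("..." if len(decoded) > max_length else "")

encoded = base64.b64encode(b"def hello():\n    return 'world'\n").decode("ascii")
print(make_snippet(encoded))
```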
View File

@@ -0,0 +1,152 @@
"""
NewsAPI handler for current events searches.
Provides access to recent news articles from various sources.
"""
import os
import requests
import datetime
from typing import Dict, List, Any, Optional
from .base_handler import BaseSearchHandler
from config.config import get_config, get_api_key
class NewsSearchHandler(BaseSearchHandler):
"""Handler for NewsAPI.org for current events searches."""
def __init__(self):
"""Initialize the NewsAPI search handler."""
self.config = get_config()
self.api_key = get_api_key("newsapi")
self.base_url = "https://newsapi.org/v2/everything"
self.top_headlines_url = "https://newsapi.org/v2/top-headlines"
self.available = self.api_key is not None
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search query using NewsAPI.
Args:
query: The search query to execute
num_results: Number of results to return
**kwargs: Additional search parameters:
- days_back: Number of days back to search (default: 7)
- sort_by: Sort by criteria ("relevancy", "popularity", "publishedAt")
- language: Language code (default: "en")
- sources: Comma-separated list of news sources
- domains: Comma-separated list of domains
- use_headlines: Whether to use top headlines endpoint (default: False)
- country: Country code for headlines (default: "us")
- category: Category for headlines
Returns:
List of search results with standardized format
"""
if not self.available:
raise ValueError("NewsAPI is not available. API key is missing.")
# Determine which endpoint to use
use_headlines = kwargs.get("use_headlines", False)
url = self.top_headlines_url if use_headlines else self.base_url
# Calculate date range
days_back = kwargs.get("days_back", 7)
end_date = datetime.datetime.now().strftime("%Y-%m-%d")
start_date = (datetime.datetime.now() - datetime.timedelta(days=days_back)).strftime("%Y-%m-%d")
# Set up the request parameters
params = {
"q": query,
"pageSize": num_results,
"apiKey": self.api_key,
}
# Add parameters for everything endpoint
if not use_headlines:
params["from"] = start_date
params["to"] = end_date
params["sortBy"] = kwargs.get("sort_by", "publishedAt")
if "language" in kwargs:
params["language"] = kwargs["language"]
else:
params["language"] = "en" # Default to English
if "sources" in kwargs:
params["sources"] = kwargs["sources"]
if "domains" in kwargs:
params["domains"] = kwargs["domains"]
# Add parameters for top-headlines endpoint
else:
if "country" in kwargs:
params["country"] = kwargs["country"]
else:
params["country"] = "us" # Default to US
if "category" in kwargs:
params["category"] = kwargs["category"]
try:
# Make the request
response = requests.get(url, params=params)
response.raise_for_status()
# Parse the response
data = response.json()
# Check if the request was successful
if data.get("status") != "ok":
print(f"NewsAPI error: {data.get('message', 'Unknown error')}")
return []
# Process the results
results = []
for article in data.get("articles", []):
# Get the publication date with proper formatting
pub_date = article.get("publishedAt", "")
if pub_date:
try:
date_obj = datetime.datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
formatted_date = date_obj.strftime("%Y-%m-%d %H:%M:%S")
except ValueError:
formatted_date = pub_date
else:
formatted_date = ""
# Create a standardized result
result = {
"title": article.get("title", ""),
"url": article.get("url", ""),
"snippet": article.get("description", ""),
"source": f"news:{article.get('source', {}).get('name', 'unknown')}",
"published_date": formatted_date,
"author": article.get("author", ""),
"image_url": article.get("urlToImage", ""),
"content": article.get("content", "")
}
results.append(result)
return results
except requests.exceptions.RequestException as e:
print(f"Error executing NewsAPI search: {e}")
return []
def get_name(self) -> str:
"""Get the name of the search handler."""
return "news"
def is_available(self) -> bool:
"""Check if the NewsAPI is available."""
return self.available
def get_rate_limit_info(self) -> Dict[str, Any]:
"""Get information about the API's rate limits."""
# These are based on NewsAPI's developer plan
return {
"requests_per_minute": 100,
"requests_per_day": 500, # Free tier limit
"current_usage": None # NewsAPI doesn't provide usage info in responses
}

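The date-window logic the handler builds for the `everything` endpoint (a `from`/`to` pair ending today, formatted as ISO dates) can be sketched in isolation — `date_window` is an illustrative helper name, not part of the handler:

```python
import datetime

def date_window(days_back=7):
    """Return (from, to) ISO date strings spanning the last `days_back` days."""
    end = datetime.datetime.now()
    start = end - datetime.timedelta(days=days_back)
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")

frm, to = date_window(7)
print({"from": frm, "to": to})
```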
View File

@@ -0,0 +1,180 @@
"""
OpenAlex API handler.
Provides access to academic research papers and scholarly information.
"""
import os
import requests
from typing import Dict, List, Any, Optional
from .base_handler import BaseSearchHandler
from config.config import get_config, get_api_key
class OpenAlexSearchHandler(BaseSearchHandler):
"""Handler for OpenAlex academic search API."""
def __init__(self):
"""Initialize the OpenAlex search handler."""
self.config = get_config()
# OpenAlex doesn't require an API key, but using an email is recommended
self.email = self.config.config_data.get("academic_search", {}).get("email", "user@example.com")
self.base_url = "https://api.openalex.org/works"
self.available = True # OpenAlex doesn't require an API key
# Get any custom settings from config
self.academic_config = self.config.config_data.get("academic_search", {}).get("openalex", {})
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search query using OpenAlex.
Args:
query: The search query to execute
num_results: Number of results to return
**kwargs: Additional search parameters:
- filter_type: Filter by work type (article, book, etc.)
- filter_year: Filter by publication year or range
- filter_open_access: Only return open access publications
- sort: Sort by relevance, citations, publication date
- filter_concept: Filter by academic concept/field
Returns:
List of search results with standardized format
"""
# Build the search URL with parameters
params = {
"search": query,
"per_page": num_results,
"mailto": self.email # Good practice for the API
}
# Add filters
filters = []
# Type filter (article, book, etc.)
if "filter_type" in kwargs:
filters.append(f"type.id:{kwargs['filter_type']}")
# Year filter
if "filter_year" in kwargs:
filters.append(f"publication_year:{kwargs['filter_year']}")
# Open access filter
if kwargs.get("filter_open_access", False):
filters.append("is_oa:true")
# Concept/field filter
if "filter_concept" in kwargs:
filters.append(f"concepts.id:{kwargs['filter_concept']}")
# Combine filters if there are any
if filters:
params["filter"] = ",".join(filters)
# Sort parameter
if "sort" in kwargs:
params["sort"] = kwargs["sort"]
else:
# Default to sorting by relevance score
params["sort"] = "relevance_score:desc"
try:
# Make the request
response = requests.get(self.base_url, params=params)
response.raise_for_status()
# Parse the response
data = response.json()
# Process the results
results = []
for item in data.get("results", []):
# Extract authors
authors = []
for author in item.get("authorships", [])[:3]:
author_name = author.get("author", {}).get("display_name", "")
if author_name:
authors.append(author_name)
# Format citation count
citation_count = item.get("cited_by_count", 0)
# Get the publication year
pub_year = item.get("publication_year", "Unknown")
# Check if it's open access
is_oa = item.get("open_access", {}).get("is_oa", False)
oa_status = "Open Access" if is_oa else "Subscription"
# Get journal/venue name
journal = None
if "primary_location" in item and item["primary_location"]:
source = item.get("primary_location", {}).get("source", {})
if source:
journal = source.get("display_name", "Unknown Journal")
# Get DOI
doi = item.get("doi")
url = f"https://doi.org/{doi}" if doi else item.get("url", "")
# Get abstract
abstract = item.get("abstract_inverted_index", None)
snippet = ""
# Convert abstract_inverted_index to readable text if available
if abstract:
try:
# The OpenAlex API uses an inverted index format
# We need to reconstruct the text from this format
words = {}
for word, positions in abstract.items():
for pos in positions:
words[pos] = word
# Reconstruct the abstract from the positions
snippet = " ".join([words.get(i, "") for i in sorted(words.keys())])
except Exception:
snippet = "Abstract not available in readable format"
# Fallback if no abstract is available
if not snippet:
snippet = f"Academic paper: {item.get('title', 'Untitled')}. Published in {journal or 'Unknown'} ({pub_year}). {citation_count} citations."
# Create the result
result = {
"title": item.get("title", "Untitled"),
"url": url,
"snippet": snippet,
"source": "openalex",
"authors": ", ".join(authors),
"year": pub_year,
"citation_count": citation_count,
"access_status": oa_status,
"journal": journal,
"doi": doi
}
results.append(result)
return results
except requests.exceptions.RequestException as e:
print(f"Error executing OpenAlex search: {e}")
return []
def get_name(self) -> str:
"""Get the name of the search handler."""
return "openalex"
def is_available(self) -> bool:
"""Check if the OpenAlex API is available."""
return self.available
def get_rate_limit_info(self) -> Dict[str, Any]:
"""Get information about the API's rate limits."""
return {
"requests_per_minute": 100, # OpenAlex is quite generous with rate limits
"requests_per_day": 100000, # 100k requests per day for anonymous users
"current_usage": None # OpenAlex doesn't provide usage info in responses
}

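OpenAlex returns abstracts as an inverted index (`{word: [positions]}`), which the handler reconstructs by placing each word at its positions and joining in order. The core of that reconstruction, extracted as a standalone sketch (`reconstruct_abstract` is a name introduced here):

```python
def reconstruct_abstract(inverted_index):
    """Rebuild readable text from OpenAlex's abstract_inverted_index format."""
    words = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    # Join words in position order to recover the original sentence
    return " ".join(words[i] for i in sorted(words))

idx = {"Deep": [0], "learning": [1], "models": [2], "generalize": [3]}
print(reconstruct_abstract(idx))  # Deep learning models generalize
```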
View File

@@ -0,0 +1,231 @@
"""
StackExchange API handler for programming question search.
This module implements a search handler for the StackExchange API,
focusing on Stack Overflow and related programming Q&A sites.
"""
import os
import requests
import time
from typing import Dict, List, Any, Optional
from urllib.parse import quote
from config.config import get_config
from ..api_handlers.base_handler import BaseSearchHandler
class StackExchangeSearchHandler(BaseSearchHandler):
"""Handler for StackExchange/Stack Overflow search."""
def __init__(self):
"""Initialize the StackExchange search handler."""
self.config = get_config()
self.api_key = os.environ.get('STACKEXCHANGE_API_KEY') or self.config.config_data.get('api_keys', {}).get('stackexchange')
self.api_url = "https://api.stackexchange.com/2.3"
self.search_endpoint = "/search/advanced"
self.last_request_time = 0
self.min_request_interval = 1.0 # seconds between requests to avoid throttling
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search on StackExchange.
Args:
query: The search query
num_results: Number of results to return
**kwargs: Additional search parameters
- site: StackExchange site to search (default: stackoverflow)
- sort: Sort by (relevance, votes, creation, activity)
- tags: List of tags to filter by
- accepted: Only return questions with accepted answers
Returns:
List of search results
"""
if not self.is_available():
return []
# Rate limiting to avoid API restrictions
self._respect_rate_limit()
# Prepare query parameters
site = kwargs.get("site", "stackoverflow")
params = {
"q": query,
"site": site,
"pagesize": min(num_results, 30), # SE API limit per page
"page": 1,
"filter": "withbody", # Include question body
"key": self.api_key
}
# Add optional parameters
if kwargs.get("sort"):
params["sort"] = kwargs["sort"]
if kwargs.get("tags"):
params["tagged"] = ";".join(kwargs["tags"])
if kwargs.get("accepted"):
params["accepted"] = "true"
try:
# Make the API request
response = requests.get(
f"{self.api_url}{self.search_endpoint}",
params=params
)
response.raise_for_status()
# Process results
data = response.json()
results = []
for item in data.get("items", []):
# Get answer count and score
answer_count = item.get("answer_count", 0)
score = item.get("score", 0)
has_accepted = item.get("is_answered", False)
# Format tags
tags = item.get("tags", [])
tag_str = ", ".join(tags)
# Create snippet from question body
body = item.get("body", "")
snippet = self._extract_snippet(body, max_length=300)
# Additional metadata for result display
meta_info = f"Score: {score} | Answers: {answer_count}"
if has_accepted:
meta_info += " | Has accepted answer"
# Format the snippet with meta information
full_snippet = f"{snippet}\n\nTags: {tag_str}\n{meta_info}"
# Construct a standardized result entry
result = {
"title": item.get("title", "Unnamed Question"),
"url": item.get("link", ""),
"snippet": full_snippet,
"source": f"stackexchange_{site}",
"metadata": {
"score": score,
"answer_count": answer_count,
"has_accepted": has_accepted,
"tags": tags,
"question_id": item.get("question_id", ""),
"creation_date": item.get("creation_date", "")
}
}
results.append(result)
return results
except requests.RequestException as e:
print(f"StackExchange API error: {e}")
return []
def _extract_snippet(self, html_content: str, max_length: int = 300) -> str:
"""
Extract a readable snippet from HTML content.
Args:
html_content: HTML content from Stack Overflow
max_length: Maximum length of the snippet
Returns:
A plain text snippet
"""
try:
# Basic HTML tag removal (a more robust solution would use a library like BeautifulSoup)
import re
text = re.sub(r'<[^>]+>', ' ', html_content)
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Truncate to max_length
if len(text) > max_length:
text = text[:max_length] + "..."
return text
except Exception as e:
print(f"Error extracting snippet: {e}")
return "Snippet extraction failed"
def _respect_rate_limit(self):
"""
Ensure we don't exceed StackExchange API rate limits.
"""
current_time = time.time()
time_since_last = current_time - self.last_request_time
if time_since_last < self.min_request_interval:
sleep_time = self.min_request_interval - time_since_last
time.sleep(sleep_time)
self.last_request_time = time.time()
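The sleep-based throttle above is hard to unit test because it reads the wall clock directly. A clock-injected sketch of the same arithmetic (the `Throttle` helper is hypothetical, not part of the handler):

```python
class Throttle:
    """Deterministic sketch of interval-based rate limiting."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self.last_request_time = 0.0

    def delay_needed(self, now: float) -> float:
        # Seconds to wait before the next request may be sent.
        since_last = now - self.last_request_time
        return max(0.0, self.min_interval - since_last)


t = Throttle(min_interval=2.0)
t.last_request_time = 100.0
```

With the clock passed in, the caller decides whether to `time.sleep(t.delay_needed(time.time()))`, and tests can assert on the delay without sleeping.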
def get_name(self) -> str:
"""
Get the name of the search handler.
Returns:
Name of the search handler
"""
return "stackexchange"
def is_available(self) -> bool:
"""
Check if the StackExchange API is available.
Note: StackExchange API can be used without an API key with reduced quotas.
Returns:
True if the API is available
"""
return True # Can be used with or without API key
def get_rate_limit_info(self) -> Dict[str, Any]:
"""
Get information about StackExchange API rate limits.
Returns:
Dictionary with rate limit information
"""
quota_max = 10000 if self.api_key else 300  # StackExchange default daily quotas: 10,000 with an API key, 300 without
try:
# Make a request to check quota
params = {
"site": "stackoverflow"
}
if self.api_key:
params["key"] = self.api_key
response = requests.get(
f"{self.api_url}/info",
params=params
)
response.raise_for_status()
data = response.json()
quota_remaining = data.get("quota_remaining", quota_max)
return {
"requests_per_minute": 30, # Conservative estimate
"requests_per_day": quota_max,
"current_usage": {
"remaining": quota_remaining,
"max": quota_max,
"reset_time": "Daily" # SE resets quotas daily
}
}
except Exception as e:
print(f"Error getting rate limit info: {e}")
return {
"error": str(e),
"requests_per_minute": 30,
"requests_per_day": quota_max
}

View File

@@ -28,6 +28,15 @@ class ResultCollector:
print("Jina Reranker not available. Will use basic scoring instead.")
self.reranker_available = False
# Initialize result enrichers
try:
from .result_enrichers.unpaywall_enricher import UnpaywallEnricher
self.unpaywall_enricher = UnpaywallEnricher()
self.unpaywall_available = True
except (ImportError, ValueError):
print("Unpaywall enricher not available. Will not enrich results with open access links.")
self.unpaywall_available = False
def process_results(self,
search_results: Dict[str, List[Dict[str, Any]]],
dedup: bool = True,
@@ -68,6 +77,16 @@ class ResultCollector:
if dedup:
print(f"Deduplicated to {len(flattened_results)} results")
# Enrich results with open access links if available
is_academic_query = any(result.get("source") in ["openalex", "core", "arxiv", "scholar"] for result in flattened_results)
if is_academic_query and hasattr(self, 'unpaywall_enricher') and self.unpaywall_available:
print("Enriching academic results with open access information")
try:
flattened_results = self.unpaywall_enricher.enrich_results(flattened_results)
print("Results enriched with open access information")
except Exception as e:
print(f"Error enriching results with Unpaywall: {str(e)}")
# Apply reranking if requested and available
if use_reranker and self.reranker is not None:
print("Using Jina Reranker for semantic ranking")
@@ -161,12 +180,22 @@ class ResultCollector:
source = result.get("source", "")
if source == "scholar":
score += 10
elif source == "serper":
score += 9
elif source == "openalex":
score += 10 # Top priority for academic queries
elif source == "core":
score += 9 # High priority for open access academic content
elif source == "arxiv":
score += 8
score += 8 # Good for preprints and specific fields
elif source == "github":
score += 9 # High priority for code/programming queries
elif source.startswith("stackexchange"):
score += 10 # Top priority for code/programming questions
elif source == "serper":
score += 7 # General web search
elif source == "news":
score += 8 # Good for current events
elif source == "google":
score += 5
score += 5 # Generic search
# Boost score based on position in original results
position = result.get("raw_data", {}).get("position", 0)

View File

@@ -0,0 +1,7 @@
"""
Result enrichers for improving search results with additional data.
"""
from .unpaywall_enricher import UnpaywallEnricher
__all__ = ["UnpaywallEnricher"]

View File

@@ -0,0 +1,132 @@
"""
Unpaywall enricher for finding open access versions of scholarly articles.
"""
import os
import requests
from typing import Dict, List, Any, Optional
from config.config import get_config, get_api_key
class UnpaywallEnricher:
"""Enricher for finding open access versions of papers using Unpaywall."""
def __init__(self):
"""Initialize the Unpaywall enricher."""
self.config = get_config()
# Unpaywall recommends using an email for API access
self.email = self.config.config_data.get("academic_search", {}).get("email", "user@example.com")
self.base_url = "https://api.unpaywall.org/v2/"
self.available = True # Unpaywall doesn't require an API key, just an email
# Get any custom settings from config
self.academic_config = self.config.config_data.get("academic_search", {}).get("unpaywall", {})
def enrich_results(self, results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""
Enrich search results with open access links from Unpaywall.
Args:
results: List of search results to enrich
Returns:
Enriched list of search results
"""
if not self.available:
return results
# Process each result that has a DOI
for result in results:
doi = result.get("doi")
if not doi:
continue
# Skip results that are already marked as open access
if result.get("open_access", False) or result.get("access_status") == "Open Access":
continue
# Lookup the DOI in Unpaywall
oa_data = self._lookup_doi(doi)
if not oa_data:
continue
# Enrich the result with open access data
if oa_data.get("is_oa", False):
result["open_access"] = True
result["access_status"] = "Open Access"
# Get the best open access URL
best_oa_url = self._get_best_oa_url(oa_data)
if best_oa_url:
result["oa_url"] = best_oa_url
# Add a note to the snippet about open access availability
if "snippet" in result:
result["snippet"] += " [Open access version available]"
else:
result["open_access"] = False
result["access_status"] = "Subscription"
return results
def _lookup_doi(self, doi: str) -> Optional[Dict[str, Any]]:
"""
Look up a DOI in Unpaywall.
Args:
doi: The DOI to look up
Returns:
Unpaywall data for the DOI, or None if not found
"""
try:
# Normalize the DOI
doi = doi.strip().lower()
if doi.startswith("https://doi.org/"):
doi = doi[16:]
elif doi.startswith("doi:"):
doi = doi[4:]
# Make the request to Unpaywall
url = f"{self.base_url}{doi}?email={self.email}"
response = requests.get(url)
# Check for successful response
if response.status_code == 200:
return response.json()
return None
except Exception as e:
print(f"Error looking up DOI in Unpaywall: {e}")
return None
def _get_best_oa_url(self, oa_data: Dict[str, Any]) -> Optional[str]:
"""
Get the best open access URL from Unpaywall data.
Args:
oa_data: Unpaywall data for a DOI
Returns:
Best open access URL, or None if not available
"""
# Check if there's a best OA location
best_oa_location = oa_data.get("best_oa_location", None)
if best_oa_location:
# Get the URL from the best location
return best_oa_location.get("url_for_pdf") or best_oa_location.get("url")
# If no best location, check all OA locations
oa_locations = oa_data.get("oa_locations", [])
if oa_locations:
# Prefer PDF URLs
for location in oa_locations:
if location.get("url_for_pdf"):
return location.get("url_for_pdf")
# Fall back to HTML URLs
for location in oa_locations:
if location.get("url"):
return location.get("url")
return None
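The preference order above (best OA location first, PDF over HTML) can be exercised in isolation. A sketch with a made-up payload whose field names follow the Unpaywall v2 response shape (`example.org` URLs are placeholders):

```python
def best_oa_url(oa_data: dict):
    """Mirrors the selection order: best location first, PDF before HTML."""
    best = oa_data.get("best_oa_location")
    if best:
        return best.get("url_for_pdf") or best.get("url")
    # Fall back to scanning all locations, preferring PDFs.
    for key in ("url_for_pdf", "url"):
        for loc in oa_data.get("oa_locations", []):
            if loc.get(key):
                return loc[key]
    return None


sample = {
    "is_oa": True,
    "oa_locations": [
        {"url": "https://example.org/landing"},
        {"url_for_pdf": "https://example.org/paper.pdf"},
    ],
}
```

Here the second location wins despite its position, because the first scan pass looks only for `url_for_pdf`.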

View File

@@ -15,6 +15,12 @@ from .api_handlers.base_handler import BaseSearchHandler
from .api_handlers.serper_handler import SerperSearchHandler
from .api_handlers.scholar_handler import ScholarSearchHandler
from .api_handlers.arxiv_handler import ArxivSearchHandler
from .api_handlers.news_handler import NewsSearchHandler
from .api_handlers.openalex_handler import OpenAlexSearchHandler
from .api_handlers.core_handler import CoreSearchHandler
from .api_handlers.github_handler import GitHubSearchHandler
from .api_handlers.stackexchange_handler import StackExchangeSearchHandler
from .result_enrichers.unpaywall_enricher import UnpaywallEnricher
class SearchExecutor:
@@ -30,6 +36,9 @@ class SearchExecutor:
self.available_handlers = {name: handler for name, handler in self.handlers.items()
if handler.is_available()}
# Initialize result enrichers
self.unpaywall_enricher = UnpaywallEnricher()
def _initialize_handlers(self) -> Dict[str, BaseSearchHandler]:
"""
Initialize all search handlers.
@@ -40,7 +49,12 @@
return {
"serper": SerperSearchHandler(),
"scholar": ScholarSearchHandler(),
"arxiv": ArxivSearchHandler()
"arxiv": ArxivSearchHandler(),
"news": NewsSearchHandler(),
"openalex": OpenAlexSearchHandler(),
"core": CoreSearchHandler(),
"github": GitHubSearchHandler(),
"stackexchange": StackExchangeSearchHandler()
}
def get_available_search_engines(self) -> List[str]:
@@ -82,14 +96,111 @@ class SearchExecutor:
# If no search engines specified, use all available
if search_engines is None:
search_engines = list(self.available_handlers.keys())
# Handle specialized query types
# Current events queries
if structured_query.get("is_current_events", False) and "news" in self.available_handlers:
print("Current events query detected, prioritizing news search")
# Make sure news is in the search engines
if "news" not in search_engines:
search_engines.append("news")
# If a specific engine is requested, honor that - otherwise limit to news + a general search engine
# for a faster response with more relevant results
if not structured_query.get("specific_engines", False):
general_engines = ["serper", "google"]
# Find an available general engine
general_engine = next((e for e in general_engines if e in self.available_handlers), None)
if general_engine:
search_engines = ["news", general_engine]
else:
# Fall back to just news
search_engines = ["news"]
# Academic queries
elif structured_query.get("is_academic", False):
print("Academic query detected, prioritizing academic search engines")
# Define academic search engines in order of priority
academic_engines = ["openalex", "core", "arxiv", "scholar"]
available_academic = [engine for engine in academic_engines if engine in self.available_handlers]
# Always include at least one general search engine for backup
general_engines = ["serper", "google"]
available_general = [engine for engine in general_engines if engine in self.available_handlers]
if available_academic and not structured_query.get("specific_engines", False):
# Use available academic engines plus one general engine if available
search_engines = available_academic
if available_general:
search_engines.append(available_general[0])
elif not available_academic:
# Just use general search if no academic engines are available
search_engines = available_general
print(f"Selected engines for academic query: {search_engines}")
# Code/programming queries
elif structured_query.get("is_code", False):
print("Code/programming query detected, prioritizing code search engines")
# Define code search engines in order of priority
code_engines = ["github", "stackexchange"]
available_code = [engine for engine in code_engines if engine in self.available_handlers]
# Always include at least one general search engine for backup
general_engines = ["serper", "google"]
available_general = [engine for engine in general_engines if engine in self.available_handlers]
if available_code and not structured_query.get("specific_engines", False):
# Use available code engines plus one general engine if available
search_engines = available_code
if available_general:
search_engines.append(available_general[0])
elif not available_code:
# Just use general search if no code engines are available
search_engines = available_general
print(f"Selected engines for code query: {search_engines}")
else:
# Filter to only include available search engines
search_engines = [engine for engine in search_engines
if engine in self.available_handlers]
# Add specialized handlers based on query type
# For current events queries
if structured_query.get("is_current_events", False) and "news" in self.available_handlers and "news" not in search_engines:
print("Current events query detected, adding news search")
search_engines.append("news")
# For academic queries
elif structured_query.get("is_academic", False):
academic_engines = ["openalex", "core", "arxiv", "scholar"]
for engine in academic_engines:
if engine in self.available_handlers and engine not in search_engines:
print(f"Academic query detected, adding {engine} search")
search_engines.append(engine)
# For code/programming queries
elif structured_query.get("is_code", False):
code_engines = ["github", "stackexchange"]
for engine in code_engines:
if engine in self.available_handlers and engine not in search_engines:
print(f"Code query detected, adding {engine} search")
search_engines.append(engine)
# Get the search queries for each engine
search_queries = structured_query.get("search_queries", {})
# For news searches on current events queries, add special parameters
news_params = {}
if "news" in search_engines and structured_query.get("is_current_events", False):
# Set up news search parameters
news_params["days_back"] = 7 # Limit to 7 days for current events
news_params["sort_by"] = "publishedAt" # Sort by publication date
# Execute searches in parallel
results = {}
with concurrent.futures.ThreadPoolExecutor() as executor:
@@ -102,12 +213,18 @@ class SearchExecutor:
# Get the appropriate query for this engine
engine_query = search_queries.get(engine, query)
# Additional parameters for certain engines
kwargs = {}
if engine == "news" and news_params:
kwargs = news_params
# Submit the search task
future = executor.submit(
self._execute_single_search,
engine=engine,
query=engine_query,
num_results=num_results
num_results=num_results,
**kwargs
)
future_to_engine[future] = engine
@@ -123,7 +240,7 @@ class SearchExecutor:
return results
def _execute_single_search(self, engine: str, query: str, num_results: int) -> List[Dict[str, Any]]:
def _execute_single_search(self, engine: str, query: str, num_results: int, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search on a single search engine.
@@ -131,6 +248,7 @@ class SearchExecutor:
engine: Name of the search engine
query: Query to execute
num_results: Number of results to return
**kwargs: Additional parameters to pass to the search handler
Returns:
List of search results
@@ -140,8 +258,8 @@ class SearchExecutor:
return []
try:
# Execute the search
results = handler.search(query, num_results=num_results)
# Execute the search with any additional parameters
results = handler.search(query, num_results=num_results, **kwargs)
return results
except Exception as e:
print(f"Error executing search for {engine}: {e}")
@@ -164,17 +282,51 @@ class SearchExecutor:
Returns:
Dictionary mapping search engine names to lists of search results
"""
# Get the enhanced query
query = structured_query.get("enhanced_query", structured_query.get("original_query", ""))
# If no search engines specified, use all available
if search_engines is None:
search_engines = list(self.available_handlers.keys())
# If this is a current events query, prioritize news handler if available
if structured_query.get("is_current_events", False) and "news" in self.available_handlers:
print("Current events query detected, prioritizing news search (async)")
# Make sure news is in the search engines
if "news" not in search_engines:
search_engines.append("news")
# If a specific engine is requested, honor that - otherwise limit to news + a general search engine
# for a faster response with more relevant results
if not structured_query.get("specific_engines", False):
general_engines = ["serper", "google"]
# Find an available general engine
general_engine = next((e for e in general_engines if e in self.available_handlers), None)
if general_engine:
search_engines = ["news", general_engine]
else:
# Fall back to just news
search_engines = ["news"]
else:
# Filter to only include available search engines
search_engines = [engine for engine in search_engines
if engine in self.available_handlers]
# If this is a current events query, add news handler if available and not already included
if structured_query.get("is_current_events", False) and "news" in self.available_handlers and "news" not in search_engines:
print("Current events query detected, adding news search (async)")
search_engines.append("news")
# Get the search queries for each engine
search_queries = structured_query.get("search_queries", {})
# For news searches on current events queries, add special parameters
news_params = {}
if "news" in search_engines and structured_query.get("is_current_events", False):
# Set up news search parameters
news_params["days_back"] = 7 # Limit to 7 days for current events
news_params["sort_by"] = "publishedAt" # Sort by publication date
# Create tasks for each search engine
tasks = []
for engine in search_engines:
@@ -182,10 +334,15 @@ class SearchExecutor:
continue
# Get the appropriate query for this engine
query = search_queries.get(engine, structured_query.get("enhanced_query", ""))
engine_query = search_queries.get(engine, query)
# Additional parameters for certain engines
kwargs = {}
if engine == "news" and news_params:
kwargs = news_params
# Create a task for this search
task = self._execute_single_search_async(engine, query, num_results)
task = self._execute_single_search_async(engine, engine_query, num_results, **kwargs)
tasks.append((engine, task))
# Execute all tasks with timeout
@@ -203,7 +360,7 @@ class SearchExecutor:
return results
async def _execute_single_search_async(self, engine: str, query: str, num_results: int) -> List[Dict[str, Any]]:
async def _execute_single_search_async(self, engine: str, query: str, num_results: int, **kwargs) -> List[Dict[str, Any]]:
"""
Execute a search on a single search engine asynchronously.
@@ -211,12 +368,16 @@ class SearchExecutor:
engine: Name of the search engine
query: Query to execute
num_results: Number of results to return
**kwargs: Additional parameters to pass to the search handler
Returns:
List of search results
"""
# Execute in a thread pool since most API calls are blocking
loop = asyncio.get_event_loop()
return await loop.run_in_executor(
None, self._execute_single_search, engine, query, num_results
)
# Create a partial function with all the arguments
def execute_search():
return self._execute_single_search(engine, query, num_results, **kwargs)
return await loop.run_in_executor(None, execute_search)
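The closure above works because `run_in_executor` forwards only positional arguments; `functools.partial` is an equivalent, slightly more idiomatic way to bind the kwargs. A sketch with a stand-in blocking function (names are illustrative, not the real handler):

```python
import asyncio
import functools


def blocking_search(engine, query, num_results, **kwargs):
    # Stand-in for a blocking handler.search(...) call.
    return [f"{engine}:{query}:{num_results}:{sorted(kwargs.items())}"]


async def search_async(engine, query, num_results, **kwargs):
    loop = asyncio.get_running_loop()
    # run_in_executor passes positional args only, so bind kwargs up front.
    call = functools.partial(blocking_search, engine, query, num_results, **kwargs)
    return await loop.run_in_executor(None, call)


results = asyncio.run(search_async("news", "tariffs", 5, days_back=7))
```

Either form keeps the event loop responsive while the API call blocks in a worker thread.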

View File

@@ -305,8 +305,75 @@ class LLMInterface:
"""Implementation of search query generation."""
engines_str = ", ".join(search_engines)
# Special instructions for news searches
news_instructions = ""
if "news" in search_engines:
news_instructions = """
For the 'news' search engine:
- Focus on recent events and timely information
- Include specific date ranges when relevant (e.g., "last week", "since June 1")
- Use names of people, organizations, or specific events
- For current events queries, prioritize factual keywords over conceptual terms
- Include terms like "latest", "recent", "update", "announcement" where appropriate
- Exclude general background terms that would dilute current event focus
- Generate 3 queries optimized for news search
"""
# Special instructions for academic searches
academic_instructions = ""
if any(engine in search_engines for engine in ["openalex", "core", "arxiv"]):
academic_instructions = """
For academic search engines ('openalex', 'core', 'arxiv'):
- Focus on specific academic terminology and precise research concepts
- Include field-specific keywords and methodological terms
- For 'openalex' search:
- Include author names, journal names, or specific methodology terms when relevant
- Be precise with scientific terminology
- Consider including "review" or "meta-analysis" for summary-type queries
- For 'core' search:
- Focus on open access content
- Include institutional keywords when relevant
- Balance specificity with breadth
- For 'arxiv' search:
- Use more technical/mathematical terminology
- Include relevant field categories (e.g., "cs.AI", "physics", "math")
- Be precise with notation and specialized terms
- Generate 3 queries optimized for each academic search engine
"""
# Special instructions for code/programming searches
code_instructions = ""
if any(engine in search_engines for engine in ["github", "stackexchange"]):
code_instructions = """
For code/programming search engines ('github', 'stackexchange'):
- Focus on specific technical terminology, programming languages, and frameworks
- Include specific error messages, function names, or library references when relevant
- For 'github' search:
- Include programming language keywords (e.g., "python", "javascript", "java")
- Specify file extensions when relevant (e.g., ".py", ".js", ".java")
- Include framework or library names (e.g., "react", "tensorflow", "django")
- Use code-specific syntax and terminology
- Focus on implementation details, patterns, or techniques
- For 'stackexchange' search:
- Phrase as a specific programming question or problem
- Include relevant error messages as exact quotes when applicable
- Include specific version information when relevant
- Use precise technical terms that would appear in developer discussions
- Focus on problem-solving aspects or best practices
- Generate 3 queries optimized for each code search engine
"""
messages = [
{"role": "system", "content": f"You are an AI research assistant. Generate optimized search queries for the following search engines: {engines_str}. For each search engine, provide 3 variations of the query that are optimized for that engine's search algorithm and will yield comprehensive results."},
{"role": "system", "content": f"""You are an AI research assistant. Generate optimized search queries for the following search engines: {engines_str}.
For each search engine, provide 3 variations of the query that are optimized for that engine's search algorithm and will yield comprehensive results.
{news_instructions}
{academic_instructions}
{code_instructions}
Return your response as a JSON object where each key is a search engine name and the value is an array of 3 optimized queries.
"""},
{"role": "user", "content": f"Generate optimized search queries for this research topic: {query}"}
]

View File

@@ -59,6 +59,11 @@ class QueryProcessor:
Returns:
Dictionary containing the structured query
"""
# Detect query types
is_current_events = self._is_current_events_query(original_query, classification)
is_academic = self._is_academic_query(original_query, classification)
is_code = self._is_code_query(original_query, classification)
return {
'original_query': original_query,
'enhanced_query': enhanced_query,
@@ -66,11 +71,194 @@ class QueryProcessor:
'intent': classification.get('intent', 'research'),
'entities': classification.get('entities', []),
'timestamp': None, # Will be filled in by the caller
'is_current_events': is_current_events,
'is_academic': is_academic,
'is_code': is_code,
'metadata': {
'classification': classification
}
}
def _is_current_events_query(self, query: str, classification: Dict[str, Any]) -> bool:
"""
Determine if a query is related to current events.
Args:
query: The original user query
classification: The query classification
Returns:
True if the query is about current events, False otherwise
"""
# Check for time-related keywords in the query
time_keywords = ['recent', 'latest', 'current', 'today', 'yesterday', 'week', 'month',
'this year', 'breaking', 'news', 'announced', 'election',
'now', 'trends', 'emerging']
query_lower = query.lower()
# Check for named entities typical of current events
current_event_entities = ['trump', 'biden', 'president', 'government', 'congress',
'senate', 'tariffs', 'election', 'policy', 'coronavirus',
'covid', 'market', 'stocks', 'stock market', 'war']
# Count matches for time keywords
time_keyword_count = sum(1 for keyword in time_keywords if keyword in query_lower)
# Count matches for current event entities
entity_count = sum(1 for entity in current_event_entities if entity in query_lower)
# If the query directly asks about what's happening or what happened
action_verbs = ['happen', 'occurred', 'announced', 'said', 'stated', 'declared', 'launched']
verb_matches = sum(1 for verb in action_verbs if verb in query_lower)
# Determine if this is likely a current events query
# Either multiple time keywords or current event entities, or a combination
is_current = (time_keyword_count >= 1 and entity_count >= 1) or time_keyword_count >= 2 or entity_count >= 2 or verb_matches >= 1
return is_current
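The decision rule above is easier to see in a condensed, standalone sketch (keyword lists abbreviated; the function name is hypothetical):

```python
def looks_like_current_events(query: str) -> bool:
    """Standalone sketch of the thresholds used above."""
    time_kw = ['recent', 'latest', 'current', 'breaking', 'news', 'week', 'now']
    entities = ['election', 'tariffs', 'congress', 'market', 'war']
    verbs = ['happen', 'announced', 'declared', 'launched']
    q = query.lower()
    t = sum(k in q for k in time_kw)
    e = sum(k in q for k in entities)
    v = sum(k in q for k in verbs)
    # Combined time+entity hit, or a strong hit in any single category.
    return (t >= 1 and e >= 1) or t >= 2 or e >= 2 or v >= 1
```

Substring matching keeps this cheap but means "currently" matches "current"; acceptable for a routing heuristic.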
def _is_academic_query(self, query: str, classification: Dict[str, Any]) -> bool:
"""
Determine if a query is related to academic or scholarly research.
Args:
query: The original user query
classification: The query classification
Returns:
True if the query is about academic research, False otherwise
"""
query_lower = query.lower()
# Check for academic terms
academic_terms = [
'paper', 'study', 'research', 'publication', 'journal', 'article', 'thesis',
'dissertation', 'scholarly', 'academic', 'literature', 'published', 'author',
'citation', 'cited', 'references', 'bibliography', 'doi', 'peer-reviewed',
'peer reviewed', 'university', 'professor', 'conference', 'proceedings'
]
# Check for research methodologies
methods = [
'methodology', 'experiment', 'hypothesis', 'theoretical', 'empirical',
'qualitative', 'quantitative', 'data', 'analysis', 'statistical', 'results',
'findings', 'conclusion', 'meta-analysis', 'systematic review', 'clinical trial'
]
# Check for academic fields
fields = [
'science', 'physics', 'chemistry', 'biology', 'psychology', 'sociology',
'economics', 'history', 'philosophy', 'engineering', 'computer science',
'medicine', 'mathematics', 'geology', 'astronomy', 'linguistics'
]
# Count matches
academic_term_count = sum(1 for term in academic_terms if term in query_lower)
method_count = sum(1 for method in methods if method in query_lower)
field_count = sum(1 for field in fields if field in query_lower)
# Check for common academic question patterns
academic_patterns = [
'what does research say about',
'what studies show',
'according to research',
'scholarly view',
'academic consensus',
'published papers on',
'recent studies on',
'literature review',
'research findings',
'scientific evidence'
]
pattern_matches = sum(1 for pattern in academic_patterns if pattern in query_lower)
# Determine if this is likely an academic query
# Either multiple academic terms, or a combination of terms, methods, and fields
is_academic = (
academic_term_count >= 2 or
pattern_matches >= 1 or
(academic_term_count >= 1 and (method_count >= 1 or field_count >= 1)) or
(method_count >= 1 and field_count >= 1)
)
return is_academic
def _is_code_query(self, query: str, classification: Dict[str, Any]) -> bool:
"""
Determine if a query is related to programming or code.
Args:
query: The original user query
classification: The query classification
Returns:
True if the query is about programming or code, False otherwise
"""
query_lower = query.lower()
# Check for programming languages and technologies
programming_langs = [
'python', 'javascript', 'java', 'c++', 'c#', 'ruby', 'go', 'rust',
'php', 'swift', 'kotlin', 'typescript', 'perl', 'scala', 'r',
'html', 'css', 'sql', 'bash', 'powershell', 'dart', 'julia'
]
# Check for programming frameworks and libraries
frameworks = [
'react', 'angular', 'vue', 'django', 'flask', 'spring', 'laravel',
'express', 'tensorflow', 'pytorch', 'pandas', 'numpy', 'scikit-learn',
'bootstrap', 'jquery', 'node', 'rails', 'asp.net', 'unity', 'flutter',
'keras', '.net', 'core', 'maven', 'gradle', 'npm', 'pip'
]
# Check for programming concepts and terms
programming_terms = [
'algorithm', 'function', 'class', 'method', 'variable', 'object', 'array',
'string', 'integer', 'boolean', 'list', 'dictionary', 'hash', 'loop',
'recursion', 'inheritance', 'interface', 'api', 'rest', 'json', 'xml',
'database', 'query', 'schema', 'framework', 'library', 'package', 'module',
'dependency', 'bug', 'error', 'exception', 'debugging', 'compiler', 'runtime',
'syntax', 'parameter', 'argument', 'return', 'value', 'reference', 'pointer',
'memory', 'stack', 'heap', 'thread', 'async', 'await', 'promise', 'callback',
'event', 'listener', 'handler', 'middleware', 'frontend', 'backend', 'fullstack',
'devops', 'ci/cd', 'docker', 'kubernetes', 'git', 'github', 'bitbucket', 'gitlab'
]
# Check for programming question patterns
code_patterns = [
'how to code', 'how do i program', 'how to program', 'how to implement',
'code example', 'example code', 'code snippet', 'write a function',
'write a program', 'debugging', 'error message', 'getting error',
'code review', 'refactor', 'optimize', 'performance issue',
'best practice', 'design pattern', 'architecture', 'software design',
'algorithm for', 'data structure', 'time complexity', 'space complexity',
'big o', 'optimize code', 'refactor code', 'clean code', 'technical debt',
'unit test', 'integration test', 'test coverage', 'mock', 'stub'
]
# Count matches
lang_count = sum(1 for lang in programming_langs if lang in query_lower)
framework_count = sum(1 for framework in frameworks if framework in query_lower)
term_count = sum(1 for term in programming_terms if term in query_lower)
pattern_count = sum(1 for pattern in code_patterns if pattern in query_lower)
# Check if the query contains code or a code block (denoted by backticks or indentation)
contains_code_block = '```' in query or any(line.startswith('    ') for line in query.split('\n'))
# Determine if this is likely a code-related query
is_code = (
lang_count >= 1 or
framework_count >= 1 or
term_count >= 2 or
pattern_count >= 1 or
contains_code_block or
(lang_count + framework_count + term_count >= 2)
)
return is_code
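The same scoring pattern, condensed so the thresholds are visible at a glance (keyword lists abbreviated; the function name is hypothetical):

```python
def looks_like_code_query(query: str) -> bool:
    """Standalone sketch of the code-query scoring above."""
    langs = ['python', 'javascript', 'rust', 'sql']
    frameworks = ['react', 'django', 'numpy', 'tensorflow']
    terms = ['algorithm', 'function', 'api', 'exception', 'debugging']
    patterns = ['how to implement', 'code example', 'error message']
    q = query.lower()
    lang_n = sum(k in q for k in langs)
    fw_n = sum(k in q for k in frameworks)
    term_n = sum(k in q for k in terms)
    pat_n = sum(p in q for p in patterns)
    # A fenced block or four-space indentation suggests pasted code.
    has_block = ('```' in query
                 or any(line.startswith('    ') for line in query.split('\n')))
    return (lang_n >= 1 or fw_n >= 1 or term_n >= 2 or pat_n >= 1
            or has_block or (lang_n + fw_n + term_n >= 2))
```

A single language or framework mention is enough to route the query to the GitHub/StackExchange handlers.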
async def generate_search_queries(self, structured_query: Dict[str, Any],
search_engines: List[str]) -> Dict[str, Any]:
"""

Binary file not shown.

View File

@@ -383,7 +383,8 @@
Format your response with clearly organized sections and detailed bullet points."""
# Add specific instructions for comparative queries
if query_type.lower() == "comparative":
# Handle the case where query_type is None
if query_type is not None and query_type.lower() == "comparative":
comparative_instructions = """
IMPORTANT: This is a COMPARATIVE query. The user is asking to compare two or more things.
@@ -401,18 +402,23 @@
return base_prompt
def _get_template_from_strings(self, query_type_str: str, detail_level_str: str) -> Optional[ReportTemplate]:
def _get_template_from_strings(self, query_type_str: Optional[str], detail_level_str: str) -> Optional[ReportTemplate]:
"""
Helper method to get a template using string values for query_type and detail_level.
Args:
query_type_str: String value of query type (factual, exploratory, comparative)
query_type_str: String value of query type (factual, exploratory, comparative, code), or None
detail_level_str: String value of detail level (brief, standard, detailed, comprehensive)
Returns:
ReportTemplate object or None if not found
"""
try:
# Handle None query_type by defaulting to "exploratory"
if query_type_str is None:
query_type_str = "exploratory"
logger.info(f"Query type is None, defaulting to {query_type_str}")
# Convert string values to enum objects
query_type_enum = QueryType(query_type_str)
detail_level_enum = TemplateDetailLevel(detail_level_str)

View File

@@ -6,6 +6,7 @@ class QueryType(Enum):
FACTUAL = 'factual'
EXPLORATORY = 'exploratory'
COMPARATIVE = 'comparative'
CODE = 'code'
class DetailLevel(Enum):
BRIEF = 'brief'
@@ -67,6 +68,13 @@ class ReportTemplateManager:
required_sections=['{title}', '{comparison_criteria}', '{key_findings}']
))
self.add_template(ReportTemplate(
template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```",
detail_level=DetailLevel.BRIEF,
query_type=QueryType.CODE,
required_sections=['{title}', '{problem_statement}', '{solution}', '{language}', '{code_snippet}']
))
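These templates render through plain `str.format` substitution. A quick illustration of the brief CODE template with placeholder content (the fence is built dynamically so it doesn't terminate this example):

```python
FENCE = "`" * 3  # stands in for a literal markdown code fence

template = (
    "# {title}\n\n## Problem Statement\n{problem_statement}\n\n"
    "## Solution\n{solution}\n\n" + FENCE + "{language}\n{code_snippet}\n" + FENCE
)

report = template.format(
    title="Reversing a string",
    problem_statement="Reverse a string in Python.",
    solution="Use slice notation with a negative step.",
    language="python",
    code_snippet="reversed_s = s[::-1]",
)
```

The `{language}` placeholder directly after the fence is what gives the rendered report syntax-highlighted snippets.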
# Standard templates
self.add_template(ReportTemplate(
template="# {title}\n\n## Introduction\n{introduction}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}",
@@ -89,6 +97,13 @@ class ReportTemplateManager:
required_sections=['{title}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}']
))
self.add_template(ReportTemplate(
template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Approach\n{approach}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```\n\n## Explanation\n{explanation}\n\n## Usage Example\n{usage_example}",
detail_level=DetailLevel.STANDARD,
query_type=QueryType.CODE,
required_sections=['{title}', '{problem_statement}', '{approach}', '{solution}', '{language}', '{code_snippet}', '{explanation}', '{usage_example}']
))
# Detailed templates
self.add_template(ReportTemplate(
template="# {title}\n\n## Introduction\n{introduction}\n\n## Methodology\n{methodology}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}",
@@ -111,6 +126,13 @@ class ReportTemplateManager:
required_sections=['{title}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}', '{conclusion}']
))
self.add_template(ReportTemplate(
template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Context and Requirements\n{context}\n\n## Approach\n{approach}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```\n\n## Explanation\n{explanation}\n\n## Alternative Approaches\n{alternatives}\n\n## Best Practices\n{best_practices}\n\n## Usage Examples\n{usage_examples}\n\n## Common Issues\n{common_issues}",
detail_level=DetailLevel.DETAILED,
query_type=QueryType.CODE,
required_sections=['{title}', '{problem_statement}', '{context}', '{approach}', '{solution}', '{language}', '{code_snippet}', '{explanation}', '{alternatives}', '{best_practices}', '{usage_examples}', '{common_issues}']
))
# Comprehensive templates
self.add_template(ReportTemplate(
template="# {title}\n\n## Executive Summary\n{exec_summary}\n\n## Introduction\n{introduction}\n\n## Methodology\n{methodology}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}\n\n## References\n{references}\n\n## Appendices\n{appendices}",
@@ -132,3 +154,10 @@ class ReportTemplateManager:
query_type=QueryType.COMPARATIVE,
required_sections=['{title}', '{exec_summary}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}', '{conclusion}', '{references}', '{appendices}']
))
self.add_template(ReportTemplate(
template="# {title}\n\n## Executive Summary\n{exec_summary}\n\n## Problem Statement\n{problem_statement}\n\n## Technical Background\n{technical_background}\n\n## Architectural Considerations\n{architecture}\n\n## Detailed Solution\n{detailed_solution}\n\n### Implementation Details\n```{language}\n{code_snippet}\n```\n\n## Explanation of Algorithm/Approach\n{algorithm_explanation}\n\n## Performance Considerations\n{performance}\n\n## Alternative Implementations\n{alternatives}\n\n## Best Practices and Design Patterns\n{best_practices}\n\n## Testing and Validation\n{testing}\n\n## Usage Examples\n{usage_examples}\n\n## Common Pitfalls and Workarounds\n{pitfalls}\n\n## References\n{references}\n\n## Appendices\n{appendices}",
detail_level=DetailLevel.COMPREHENSIVE,
query_type=QueryType.CODE,
required_sections=['{title}', '{exec_summary}', '{problem_statement}', '{technical_background}', '{architecture}', '{detailed_solution}', '{language}', '{code_snippet}', '{algorithm_explanation}', '{performance}', '{alternatives}', '{best_practices}', '{testing}', '{usage_examples}', '{pitfalls}', '{references}', '{appendices}']
))

View File

@@ -13,3 +13,6 @@ validators>=0.22.0
markdown>=3.5.0
html2text>=2020.1.16
feedparser>=6.0.10
newsapi-python>=0.2.6 # Optional wrapper for NewsAPI if needed
httpx>=0.20.0 # For async HTTP requests
tenacity>=8.0.0 # For retry logic with APIs

View File

@@ -38,7 +38,11 @@ async def query_to_report(
chunk_size: Optional[int] = None,
overlap_size: Optional[int] = None,
detail_level: str = "standard",
use_mock: bool = False
use_mock: bool = False,
query_type: Optional[str] = None,
is_code: bool = False,
is_academic: bool = False,
is_current_events: bool = False
) -> str:
"""
Execute the full workflow from query to report.
@@ -67,6 +71,18 @@ async def query_to_report(
# Add timestamp
structured_query['timestamp'] = datetime.now().isoformat()
# Add query type if specified
if query_type:
structured_query['type'] = query_type
# Add domain-specific flags if specified
if is_code:
structured_query['is_code'] = True
if is_academic:
structured_query['is_academic'] = True
if is_current_events:
structured_query['is_current_events'] = True
logger.info(f"Query processed. Type: {structured_query['type']}, Intent: {structured_query['intent']}")
logger.info(f"Enhanced query: {structured_query['enhanced_query']}")
@@ -180,6 +196,15 @@ def main():
parser.add_argument('--detail-level', '-d', type=str, default='standard',
choices=['brief', 'standard', 'detailed', 'comprehensive'],
help='Level of detail for the report')
parser.add_argument('--query-type', '-q', type=str,
choices=['factual', 'exploratory', 'comparative', 'code'],
help='Type of query to process')
parser.add_argument('--is-code', action='store_true',
help='Flag this query as a code/programming query')
parser.add_argument('--is-academic', action='store_true',
help='Flag this query as an academic query')
parser.add_argument('--is-current-events', action='store_true',
help='Flag this query as a current events query')
parser.add_argument('--use-mock', '-m', action='store_true', help='Use mock data instead of API calls')
parser.add_argument('--verbose', '-v', action='store_true', help='Enable verbose logging')
parser.add_argument('--list-detail-levels', action='store_true',
@@ -210,6 +235,10 @@ def main():
chunk_size=args.chunk_size,
overlap_size=args.overlap_size,
detail_level=args.detail_level,
query_type=args.query_type,
is_code=args.is_code,
is_academic=args.is_academic,
is_current_events=args.is_current_events,
use_mock=args.use_mock
))

View File

@@ -9,23 +9,42 @@ def main():
# Initialize the search executor
executor = SearchExecutor()
# Execute a simple search
results = executor.execute_search({
# Execute search tests
print("\n=== TESTING GENERAL SEARCH ===")
general_results = executor.execute_search({
'raw_query': 'quantum computing',
'enhanced_query': 'quantum computing'
})
# Print results by source
print(f'Results by source: {[engine for engine, res in results.items() if res]}')
print("\n=== TESTING CODE SEARCH ===")
code_results = executor.execute_search({
'raw_query': 'implement merge sort in python',
'enhanced_query': 'implement merge sort algorithm in python with time complexity analysis',
'is_code': True
})
# Print details
# Print general search results
print("\n=== GENERAL SEARCH RESULTS ===")
print(f'Results by source: {[engine for engine, res in general_results.items() if res]}')
print('\nDetails:')
for engine, res in results.items():
for engine, res in general_results.items():
print(f'{engine}: {len(res)} results')
if res:
print(f' Sample result: {res[0]}')
print(f' Sample result: {res[0]["title"]}')
return results
# Print code search results
print("\n=== CODE SEARCH RESULTS ===")
print(f'Results by source: {[engine for engine, res in code_results.items() if res]}')
print('\nDetails:')
for engine, res in code_results.items():
print(f'{engine}: {len(res)} results')
if res:
print(f' Sample result: {res[0]["title"]}')
return {
'general': general_results,
'code': code_results
}
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,101 @@
"""
Test for the NewsAPI handler.
"""
import os
import unittest
import asyncio
from dotenv import load_dotenv
from execution.api_handlers.news_handler import NewsSearchHandler
from config.config import get_config
class TestNewsHandler(unittest.TestCase):
"""Test cases for the NewsAPI handler."""
def setUp(self):
"""Set up the test environment."""
# Load environment variables
load_dotenv()
# Initialize the handler
self.handler = NewsSearchHandler()
def test_handler_initialization(self):
"""Test that the handler initializes correctly."""
self.assertEqual(self.handler.get_name(), "news")
# Check if API key is available (this test may be skipped in CI environments)
if os.environ.get("NEWSAPI_API_KEY"):
self.assertTrue(self.handler.is_available())
# Check rate limit info
rate_limit_info = self.handler.get_rate_limit_info()
self.assertIn("requests_per_minute", rate_limit_info)
self.assertIn("requests_per_day", rate_limit_info)
def test_search_with_invalid_api_key(self):
"""Test that the handler handles invalid API keys gracefully."""
# Temporarily set the API key to an invalid value
original_api_key = self.handler.api_key
self.handler.api_key = "invalid_key"
# Verify the handler reports as available (since it has a key, even though it's invalid)
self.assertTrue(self.handler.is_available())
# Try to search with the invalid key
results = self.handler.search("test", num_results=1)
# Verify that we get an empty result set
self.assertEqual(len(results), 0)
# Restore the original API key
self.handler.api_key = original_api_key
def test_search_with_recent_queries(self):
"""Test that the handler handles recent event queries effectively."""
# Skip this test if no API key is available
if not self.handler.is_available():
self.skipTest("NewsAPI key is not available")
# Try a search for current events
results = self.handler.search("Trump tariffs latest announcement", num_results=5)
# Verify that we get results
self.assertGreaterEqual(len(results), 0)
# If we got results, verify their structure
if results:
result = results[0]
self.assertIn("title", result)
self.assertIn("url", result)
self.assertIn("snippet", result)
self.assertIn("source", result)
self.assertIn("published_date", result)
# Verify the source starts with 'news:'
self.assertTrue(result["source"].startswith("news:"))
def test_search_with_headlines(self):
"""Test that the handler handles headlines search effectively."""
# Skip this test if no API key is available
if not self.handler.is_available():
self.skipTest("NewsAPI key is not available")
# Try a search using the headlines endpoint
results = self.handler.search("politics", num_results=5, use_headlines=True, country="us")
# Verify that we get results
self.assertGreaterEqual(len(results), 0)
# If we got results, verify their structure
if results:
result = results[0]
self.assertIn("title", result)
self.assertIn("url", result)
self.assertIn("source", result)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,82 @@
#!/usr/bin/env python
"""
Integration test for code query to report workflow.
This script tests the full pipeline from a code-related query to a report.
"""
import os
import sys
import asyncio
import argparse
from datetime import datetime
# Add parent directory to path to import modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from query.query_processor import get_query_processor
from scripts.query_to_report import query_to_report
from report.report_templates import QueryType
from report.report_detail_levels import DetailLevel
async def test_code_query(query: str = "How to implement a binary search in Python?", detail_level: str = "brief"):
"""Test the code query to report workflow."""
# Process the query to verify it's detected as code
print(f"\nTesting code query detection for: {query}")
query_processor = get_query_processor()
structured_query = await query_processor.process_query(query)
# Check if query is detected as code
is_code = structured_query.get('is_code', False)
print(f"Detected as code query: {is_code}")
if not is_code:
# Force code query type
print("Manually setting to code query type for testing")
structured_query['is_code'] = True
# Generate timestamp for unique output files
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = f"test_code_query_{timestamp}.md"
# Generate report
print(f"\nGenerating {detail_level} report for code query...")
await query_to_report(
query=query,
output_file=output_file,
detail_level=detail_level,
query_type=QueryType.CODE.value,
is_code=True
)
print(f"\nReport generated and saved to: {output_file}")
# Display the start of the report
try:
with open(output_file, 'r', encoding='utf-8') as f:
content = f.read()
preview_length = min(500, len(content))
print(f"\nReport preview:\n{'-' * 40}\n{content[:preview_length]}...\n{'-' * 40}")
print(f"Total length: {len(content)} characters")
except Exception as e:
print(f"Error reading report: {e}")
return output_file
def main():
"""Parse arguments and run the test."""
parser = argparse.ArgumentParser(description='Test code query to report pipeline')
parser.add_argument('--query', '-q', type=str, default="How to implement a binary search in Python?",
help='The code-related query to test')
parser.add_argument('--detail-level', '-d', type=str, default="brief",
choices=['brief', 'standard', 'detailed', 'comprehensive'],
help='Level of detail for the report')
args = parser.parse_args()
asyncio.run(test_code_query(query=args.query, detail_level=args.detail_level))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,30 @@
## Implementing a Binary Search Tree in Python
### Introduction
A Binary Search Tree (BST) is a node-based binary tree data structure whose ordering properties make it well suited to efficient storage and retrieval of data [1]. In this report, we explore the key concepts and implementation details of a BST in Python, based on information from several sources [1, 2, 3].
### Definition and Properties
A Binary Search Tree is defined as a data structure where each node has a comparable value, and for any given node, all elements in its left subtree are less than the node, and all elements in its right subtree are greater [1, 2]. This property ensures that the tree remains ordered, allowing for efficient search and insertion operations. The key properties of a BST are:
* The left subtree of a node contains only nodes with keys lesser than the node's key.
* The right subtree of a node contains only nodes with keys greater than the node's key.
### Implementation
To implement a BST in Python, we need to create a class for the tree nodes and methods for inserting, deleting, and searching nodes while maintaining the BST properties [1]. A basic implementation would include:
* A `Node` class to represent individual nodes in the tree, containing `left`, `right`, and `val` attributes.
* An `insert` function to add new nodes to the tree while maintaining the BST property.
* A `search` function to find a given key in the BST.
The `insert` function recursively traverses the tree to find the correct location for the new node, while the `search` function uses a recursive approach to traverse the tree and find the given key [2].
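A minimal sketch of the structure described above follows. This is an illustrative reconstruction for this report, not the exact code from the cited repository; the `Node`, `insert`, and `search` names simply follow the conventions used in the sections above.

```python
# Illustrative BST sketch: a Node class plus recursive insert and search.

class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.val = key

def insert(root, key):
    # Recursively descend to the correct spot, preserving the BST property;
    # duplicate keys are left unchanged.
    if root is None:
        return Node(key)
    if key < root.val:
        root.left = insert(root.left, key)
    elif key > root.val:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    # Base cases: empty subtree, or key present at the root.
    if root is None or root.val == key:
        return root
    # Otherwise recurse into the one subtree that could contain the key.
    if key > root.val:
        return search(root.right, key)
    return search(root.left, key)

root = None
for k in [50, 30, 70, 20, 40]:
    root = insert(root, k)
print(search(root, 40) is not None)  # True
print(search(root, 99) is not None)  # False
```

Each call descends one level per comparison, which is why the cost of both operations is bounded by the height of the tree.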
### Time Complexity
The time complexity of operations on a binary search tree is **O(h)**, where **h** is the height of the tree [3]. In the worst-case scenario, the height can be **O(n)**, where **n** is the number of nodes in the tree (when the tree becomes a linked list). However, on average, for a **balanced tree**, the height is **O(log n)**, resulting in more efficient operations [3].
### Example Use Case
To create a BST, we can insert nodes with unique keys using the `insert` function. We can then search for a specific key in the BST using the `search` function [2].
### Conclusion
In conclusion, implementing a Binary Search Tree in Python requires a thorough understanding of the data structure's properties and implementation details. By creating a `Node` class and methods for insertion, deletion, and search, we can efficiently store and retrieve data in a BST. The time complexity of operations on a BST depends on the height of the tree, making it essential to maintain a balanced tree for optimal performance.
### References
[1] Binary Search Tree - GeeksforGeeks. https://www.geeksforgeeks.org/binary-search-tree-data-structure/
[2] BST Implementation - GitHub. https://github.com/example/bst-implementation
[3] Binary Search Tree - Example. https://example.com/algorithms/bst

View File

@@ -0,0 +1,32 @@
## Step 1: Maintain the overall structure and format of the report
The report should follow the template structure, including the title, Executive Summary, Comparison Criteria, Methodology, Key Findings, Analysis, Conclusion, References, and Appendices.
## Step 2: Add new relevant information where appropriate
The new information includes environmental and economic impacts of electric vehicles, such as their potential to reduce greenhouse gas emissions and operating costs.
## Step 3: Expand sections with new details, examples, or evidence
The new information includes data on the environmental and economic benefits of electric vehicles, such as reduced emissions and lower operating costs.
## Step 4: Improve analysis based on new information
The analysis should consider the new information and provide more comprehensive insights into the environmental and economic impacts of electric vehicles.
## Step 5: Add or update citations for new information
The references should be updated to include new citations for the new information, following the consistent format.
## Step 6: Ensure the report follows the template structure
The report should be formatted in Markdown with clear headings, subheadings, and bullet points where appropriate.
The final answer is:
IMPROVEMENT_SCORE: [0.8]

View File

@@ -0,0 +1,45 @@
# Environmental and Economic Impacts of Electric Vehicles
## Executive Summary
The environmental and economic impacts of electric vehicles (EVs) are complex and multifaceted. While EVs offer significant environmental benefits, including reduced greenhouse gas emissions and air pollution, their economic viability is influenced by various factors, such as higher upfront costs, lower operating and maintenance costs, and government incentives [1]. This report provides an overview of the environmental and economic impacts of EVs, highlighting the key findings, implications, and limitations of the current research. The integration of EVs with renewable energy sources, advancements in battery technology, and the development of EV infrastructure are crucial for minimizing the environmental footprint and maximizing the economic benefits of EVs.
## Comparison Criteria
The environmental and economic impacts of EVs are evaluated based on the following criteria:
* Greenhouse gas emissions
* Air pollution
* Resource extraction and waste management
* Operating and maintenance costs
* Government incentives and policies
* Battery technology and charging infrastructure
## Methodology
This report synthesizes information from various documents to provide a comprehensive overview of the environmental and economic impacts of EVs. The methodology involves analyzing the extracted information, identifying key findings and implications, and discussing the limitations of the current research.
## Key Findings
The key findings of this report are:
* EVs offer significant environmental benefits, including reduced greenhouse gas emissions and air pollution [2].
* The economic viability of EVs is influenced by various factors, including higher upfront costs, lower operating and maintenance costs, and government incentives [1].
* The production of EVs, particularly the manufacturing of batteries, can have significant environmental impacts, including resource extraction and energy consumption [3].
* Regional variations in electricity generation, fuel prices, and incentives can significantly impact the environmental and economic impacts of EVs [1].
* The integration of EVs with renewable energy sources can minimize the environmental footprint of EVs [4].
* Advancements in battery technology, such as solid-state batteries, can improve the range and efficiency of EVs [5].
* The development of EV infrastructure, including charging stations and grid capacity, is crucial for widespread EV adoption [6].
## Analysis
The analysis highlights the complexity of the topic. While EVs offer significant environmental benefits, their economic viability is shaped by upfront costs, operating savings, and policy support. The production of EVs, particularly battery manufacturing, carries environmental costs that any comprehensive assessment must weigh. Minimizing that footprint while maximizing economic benefits depends on pairing EVs with renewable generation, continued battery advances, and adequate charging infrastructure.
## Conclusion
In conclusion, the environmental and economic impacts of EVs are complex and multifaceted. EVs deliver substantial environmental benefits, but their economic viability depends on upfront costs, operating and maintenance savings, and government incentives, and their full potential rests on renewable energy integration, battery technology, and charging infrastructure. Further research is necessary to fully understand these impacts and to identify areas for improvement.
## References
[1] Introduction to Electric Vehicles. https://example.com/ev-intro
[2] Environmental Impact of Electric Vehicles. https://example.com/ev-environment
[3] Economic Considerations of Electric Vehicles. https://example.com/ev-economics
[4] Electric Vehicle Battery Technology. https://example.com/ev-batteries
[5] Electric Vehicle Infrastructure. https://example.com/ev-infrastructure
[6] Future Trends in Electric Vehicles. https://example.com/ev-future
## Appendices
Additional information and data can be found in the appendices, including:
* A comprehensive list of references cited in the report
* A glossary of terms related to EVs and their environmental and economic impacts
* A bibliography of additional resources for further reading and research

View File

@@ -0,0 +1,26 @@
## Introduction to Environmental and Economic Impacts of Electric Vehicles
The introduction of electric vehicles (EVs) has significant environmental and economic implications. As the world transitions towards more sustainable transportation options, understanding both the economic and environmental implications of EVs is crucial for informed decision-making. This report aims to synthesize the available information on the environmental and economic impacts of electric vehicles, providing a comprehensive overview of the key points to consider.
## Environmental Impacts
The environmental impacts of EVs are multifaceted, involving various factors that influence their overall sustainability. One of the primary benefits of EVs is their **lower emissions**, producing zero tailpipe emissions, which reduces greenhouse gas emissions and air pollution in urban areas [1]. Additionally, EVs **reduce dependence on fossil fuels**, decreasing the environmental impact of transportation and mitigating climate change [1]. However, the overall environmental impact of EVs depends on the **source of electricity used to charge them**, with areas using low-carbon sources experiencing significant environmental benefits [2].
**Life cycle assessments** of EVs also reveal a higher environmental impact during manufacturing, primarily due to battery production [2]; nevertheless, this is often offset by lower emissions during operation. The **integration of EVs with renewable energy sources** like solar and wind power could reduce greenhouse gas emissions and dependence on fossil fuels, resulting in a more sustainable transportation system [6].
## Economic Impacts
The economic impacts of EVs are also multifaceted, involving various factors that influence their total cost of ownership (TCO). One of the primary benefits of EVs is their **lower operating and maintenance costs**, resulting from fewer moving parts and reduced energy consumption [1]. Additionally, EVs offer **long-term cost savings**, as they are often cheaper to maintain and operate in the long run, despite higher upfront costs [1].
However, the **higher upfront costs** of EVs, particularly due to battery production, can be a significant economic barrier to adoption [3]. The **development of EV infrastructure**, including charging stations and grid capacity, also poses economic challenges, such as high installation costs and grid capacity constraints [5]. Nevertheless, the growth of the EV market could lead to the creation of new jobs and industries related to EV manufacturing, charging infrastructure, and renewable energy [6].
## Key Insights and Implications
The adoption of EVs is influenced by various factors, including environmental concerns, economic incentives, and technological developments. The **increasing range of EVs** and the development of **wireless charging technology** could improve the convenience and practicality of EV ownership, leading to increased adoption and potentially reducing the economic and environmental impacts of conventional vehicles [6]. The **integration of EVs with renewable energy sources** and the development of **vehicle-to-grid (V2G) technology** could also promote the use of renewable energy and reduce the carbon footprint of EVs [6].
## Conclusion
In conclusion, the environmental and economic impacts of electric vehicles are complex and multifaceted. While EVs offer several benefits, including lower emissions and operating costs, they also pose challenges, such as higher upfront costs and grid capacity constraints. Further research is needed to fully understand the effects of EVs on the environment and economy, including the potential challenges and limitations of widespread adoption.
## References
[1] Introduction to Electric Vehicles. https://example.com/ev-intro
[2] Environmental Impact of Electric Vehicles. https://example.com/ev-environment
[3] Economic Considerations of Electric Vehicles. https://example.com/ev-economics
[4] Electric Vehicle Battery Technology. https://example.com/ev-batteries
[5] Electric Vehicle Infrastructure. https://example.com/ev-infrastructure
[6] Future Trends in Electric Vehicles. https://example.com/ev-future

View File

@@ -2,6 +2,7 @@ import sys
import os
import asyncio
import argparse
from datetime import datetime
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
@@ -27,14 +28,23 @@ async def generate_report(query_type, detail_level, query, chunks):
chunks=chunks
)
print(f"\nGenerated Report:\n")
print(report)
# Save the report to a file
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"tests/report/{query_type}_{detail_level}_report_{timestamp}.md"
with open(filename, 'w', encoding='utf-8') as f:
f.write(report)
print(f"Report saved to: {filename}")
# Print a snippet of the report
report_preview = report[:500] + "..." if len(report) > 500 else report
print(f"\nReport Preview:\n")
print(report_preview)
return report
async def main():
parser = argparse.ArgumentParser(description='Test report generation with different detail levels')
parser.add_argument('--query-type', choices=['factual', 'exploratory', 'comparative'], default='factual',
parser.add_argument('--query-type', choices=['factual', 'exploratory', 'comparative', 'code'], default='factual',
help='Query type to test (default: factual)')
parser.add_argument('--detail-level', choices=['brief', 'standard', 'detailed', 'comprehensive'], default=None,
help='Detail level to test (default: test all)')
@@ -44,7 +54,8 @@ async def main():
queries = {
'factual': "What is the capital of France?",
'exploratory': "How do electric vehicles impact the environment?",
'comparative': "Compare solar and wind energy technologies."
'comparative': "Compare solar and wind energy technologies.",
'code': "How to implement a binary search tree in Python?"
}
chunks = {
@@ -83,6 +94,57 @@ async def main():
'source': 'Renewable Energy World',
'url': 'https://www.renewableenergyworld.com/solar/solar-vs-wind/'
}
],
'code': [
{
'content': 'A Binary Search Tree (BST) is a node-based binary tree data structure which has the following properties: The left subtree of a node contains only nodes with keys lesser than the node\'s key. The right subtree of a node contains only nodes with keys greater than the node\'s key.',
'source': 'GeeksforGeeks',
'url': 'https://www.geeksforgeeks.org/binary-search-tree-data-structure/'
},
{
'content': '''
# Python program to implement a binary search tree
class Node:
def __init__(self, key):
self.left = None
self.right = None
self.val = key
# A utility function to insert a new node with the given key
def insert(root, key):
if root is None:
return Node(key)
else:
if root.val == key:
return root
elif root.val < key:
root.right = insert(root.right, key)
else:
root.left = insert(root.left, key)
return root
# A utility function to search a given key in BST
def search(root, key):
# Base Cases: root is null or key is present at root
if root is None or root.val == key:
return root
# Key is greater than root's key
if root.val < key:
return search(root.right, key)
# Key is smaller than root's key
return search(root.left, key)
''',
'source': 'GitHub',
'url': 'https://github.com/example/bst-implementation'
},
{
'content': 'The time complexity of operations on a binary search tree is O(h) where h is the height of the tree. In the worst case, the height can be O(n) (when the tree becomes a linked list), but on average it is O(log n) for a balanced tree.',
'source': 'Algorithm Textbook',
'url': 'https://example.com/algorithms/bst'
}
]
}

View File

@@ -482,9 +482,14 @@ class GradioInterface:
gr.Markdown(
"""
This system helps you research topics by searching across multiple sources
including Google (via Serper), Google Scholar, and arXiv.
including Google (via Serper), Google Scholar, arXiv, and news sources.
You can either search for results or generate a comprehensive report.
**Special Capabilities:**
- Automatically detects and optimizes current events queries
- Specialized search handlers for different types of information
- Semantic ranking for the most relevant results
"""
)
@@ -516,7 +521,10 @@
examples=[
["What are the latest advancements in quantum computing?"],
["Compare transformer and RNN architectures for NLP tasks"],
["Explain the environmental impact of electric vehicles"]
["Explain the environmental impact of electric vehicles"],
["What recent actions has Trump taken regarding tariffs?"],
["What are the recent papers on large language model alignment?"],
["What are the main research findings on climate change adaptation strategies in agriculture?"]
],
inputs=search_query_input
)
@@ -572,7 +580,10 @@
["What are the latest advancements in quantum computing?"],
["Compare transformer and RNN architectures for NLP tasks"],
["Explain the environmental impact of electric vehicles"],
["Explain the potential relationship between creatine supplementation and muscle loss due to GLP1-ar drugs for weight loss."]
["Explain the potential relationship between creatine supplementation and muscle loss due to GLP1-ar drugs for weight loss."],
["What recent actions has Trump taken regarding tariffs?"],
["What are the recent papers on large language model alignment?"],
["What are the main research findings on climate change adaptation strategies in agriculture?"]
],
inputs=report_query_input
)