# Paper System

A tool to fetch, filter, and process arXiv papers using LLM-based criteria.

The system consists of three main components:

1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates papers against the specified criteria using an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers

## Installation

1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:

```bash
go build -o paper-system
```

## Configuration

The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

## Usage

The system can operate in two modes:

### 1. ArXiv Fetch Mode

Fetches papers from arXiv, processes them with the LLM, and generates markdown output:

```bash
./paper-system \
  -start 20240101 \
  -end 20240131 \
  -search cs.AI \
  -criteria criteria.txt \
  -output papers.md
```

Required flags:

- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- `-criteria`: Path to filter criteria file

Optional flags:

- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)

### 2. Input JSON Mode

Processes an existing JSON file of papers (useful for running different criteria against the same dataset):

```bash
./paper-system \
  -input-json papers.json \
  -criteria new-criteria.txt \
  -output results.md
```

Required flags:

- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file

Optional flags:

- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)

## Input/Output Files

### Criteria File Format

Create a text file with your evaluation criteria. Example:

```
Please evaluate this paper based on the following criteria:

1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?

Respond with a JSON object containing:
{
    "decision": "ACCEPT" or "REJECT",
    "explanation": "Detailed reasoning for the decision"
}
```

### Output Format

The system generates dated output files when using arXiv queries:

1. **`YYYYMMDD-YYYYMMDD-CATEGORY-papers.json`**: Raw arXiv results
2. **`YYYYMMDD-YYYYMMDD-CATEGORY-papers.md`**: Final filtered results

When using `-input-json`, specify the output name with `-output`.

Example files:

```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```

```markdown
# Accepted Papers

## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345

**Abstract:**
> Paper abstract...

**Decision Explanation:** Meets criteria for practical applications...

---

# Rejected Papers
...
```

## Workflow Examples

### Basic Usage

```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt

# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```

### Multiple Evaluations

```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md

# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```

### Fetching More Papers

```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```

## Error Handling

The system includes several safeguards:

- Validates all required parameters
- Ensures max-results is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging

## Notes

- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing
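
To illustrate the batch size mentioned in the notes above, here is a minimal sketch of how a list of papers might be split into groups of 32 before being sent to the LLM. It is not taken from the project's source: the `Paper` struct simply mirrors the fields in the example JSON, and `splitBatches` and `batchSize` are hypothetical names chosen for this example.

```go
package main

import "fmt"

// Paper mirrors the fields shown in the example input JSON.
// The struct itself is an assumption made for this sketch.
type Paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// batchSize matches the batch size stated in the notes (32 papers).
const batchSize = 32

// splitBatches (hypothetical helper) divides papers into consecutive
// slices of at most size items. The last batch may be smaller.
func splitBatches(papers []Paper, size int) [][]Paper {
	if size <= 0 {
		return nil
	}
	var batches [][]Paper
	for start := 0; start < len(papers); start += size {
		end := start + size
		if end > len(papers) {
			end = len(papers)
		}
		batches = append(batches, papers[start:end])
	}
	return batches
}

func main() {
	// 100 is the documented default for -max-results.
	papers := make([]Paper, 100)
	for i, b := range splitBatches(papers, batchSize) {
		fmt.Printf("batch %d: %d papers\n", i+1, len(b))
	}
}
```

With the default of 100 papers, this yields three full batches of 32 and a final batch of 4.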