Paper System
A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components:
- arxiv-processor: Fetches papers from arXiv based on category and date range
- llm_processor: Evaluates papers using specified criteria through an LLM
- json2md: Generates formatted markdown output of accepted/rejected papers
Installation
- Ensure you have Go installed (1.20 or later)
- Clone this repository
- Build the system:
go build -o paper-system
Configuration
The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:
export OPENROUTER_API_KEY=your-api-key
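A program can check for the key at startup and fail early with a clear message. The following is a minimal sketch, not the project's actual code; `apiKeyFrom` is a hypothetical helper that takes the lookup function as a parameter so it can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// apiKeyFrom (hypothetical helper) reads OPENROUTER_API_KEY through the
// supplied lookup function and rejects a missing or blank value.
func apiKeyFrom(getenv func(string) string) (string, error) {
	key := strings.TrimSpace(getenv("OPENROUTER_API_KEY"))
	if key == "" {
		return "", errors.New("OPENROUTER_API_KEY is not set")
	}
	return key, nil
}

func main() {
	if _, err := apiKeyFrom(os.Getenv); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("API key found")
}
```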
Usage
The system can operate in two modes:
1. ArXiv Fetch Mode
Fetches papers from arXiv, evaluates them with an LLM, and generates markdown output:
./paper-system \
-start 20240101 \
-end 20240131 \
-search cs.AI \
-criteria criteria.txt \
-output papers.md
Required flags:
- -start: Start date in YYYYMMDD format
- -end: End date in YYYYMMDD format
- -search: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- -criteria: Path to filter criteria file
Optional flags:
- -output: Output markdown file path (default: papers.md)
- -model: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- -max-results: Maximum number of papers to retrieve (default: 100, max: 2000)
2. Input JSON Mode
Processes an existing JSON file of papers (useful for running different criteria against the same dataset):
./paper-system \
-input-json papers.json \
-criteria new-criteria.txt \
-output results.md
Required flags:
- -input-json: Path to input JSON file
- -criteria: Path to filter criteria file
Optional flags:
- -output: Output markdown file path (default: papers.md)
- -model: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
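The two modes and their flags could be parsed and validated with Go's standard `flag` package roughly as follows. This is a sketch under assumptions, not the tool's actual source: the `options` struct, the `parseArgs` function, and the error messages are hypothetical, while the flag names, defaults, and limits are taken from the lists above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
	"time"
)

// options mirrors the documented flags; the struct itself is hypothetical.
type options struct {
	start, end, search string
	criteria, output   string
	model, inputJSON   string
	maxResults         int
}

// parseArgs parses the flags and enforces the documented rules:
// required flags per mode, no mixing of modes, max-results in 1..2000.
func parseArgs(args []string) (*options, error) {
	fs := flag.NewFlagSet("paper-system", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // errors are returned instead of printed
	o := &options{}
	fs.StringVar(&o.start, "start", "", "start date in YYYYMMDD format")
	fs.StringVar(&o.end, "end", "", "end date in YYYYMMDD format")
	fs.StringVar(&o.search, "search", "", "arXiv category or search query")
	fs.StringVar(&o.criteria, "criteria", "", "path to filter criteria file")
	fs.StringVar(&o.inputJSON, "input-json", "", "path to input JSON file")
	fs.StringVar(&o.output, "output", "papers.md", "output markdown file path")
	fs.StringVar(&o.model, "model", "nvidia/llama-3.1-nemotron-70b-instruct", "LLM model to use")
	fs.IntVar(&o.maxResults, "max-results", 100, "maximum number of papers to retrieve")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if o.criteria == "" {
		return nil, errors.New("-criteria is required")
	}
	// Input JSON mode: fetch flags must not be present.
	if o.inputJSON != "" {
		if o.start != "" || o.end != "" || o.search != "" {
			return nil, errors.New("-input-json cannot be combined with -start, -end, or -search")
		}
		return o, nil
	}
	// ArXiv fetch mode: all three fetch flags are required.
	if o.start == "" || o.end == "" || o.search == "" {
		return nil, errors.New("-start, -end, and -search are required in fetch mode")
	}
	for _, d := range []string{o.start, o.end} {
		if _, err := time.Parse("20060102", d); err != nil {
			return nil, fmt.Errorf("invalid date %q: want YYYYMMDD", d)
		}
	}
	if o.maxResults < 1 || o.maxResults > 2000 {
		return nil, fmt.Errorf("-max-results must be between 1 and 2000, got %d", o.maxResults)
	}
	return o, nil
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", *o)
}
```

Go's `flag` package accepts both `-flag` and `--flag`, so either spelling works on the command line.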
Input/Output Files
Criteria File Format
Create a text file with your evaluation criteria. Example:
Please evaluate this paper based on the following criteria:
1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?
Respond with a JSON object containing:
{
"decision": "ACCEPT" or "REJECT",
"explanation": "Detailed reasoning for the decision"
}
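A response in this shape can be decoded and checked with a few lines of Go. This is a minimal sketch assuming the model returns exactly the JSON object requested above; the `decision` type and `parseDecision` function are hypothetical names, not part of the tool.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decision matches the JSON object the criteria file asks the LLM to return.
type decision struct {
	Decision    string `json:"decision"`
	Explanation string `json:"explanation"`
}

// parseDecision decodes the raw response and rejects anything other than
// ACCEPT or REJECT (case and surrounding whitespace are normalized first).
func parseDecision(raw string) (decision, error) {
	var d decision
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		return d, fmt.Errorf("invalid JSON: %w", err)
	}
	d.Decision = strings.ToUpper(strings.TrimSpace(d.Decision))
	if d.Decision != "ACCEPT" && d.Decision != "REJECT" {
		return d, fmt.Errorf("unexpected decision %q", d.Decision)
	}
	return d, nil
}

func main() {
	d, err := parseDecision(`{"decision": "accept", "explanation": "Clear real-world use."}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.Decision, "-", d.Explanation)
}
```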
Output Format
The system generates dated output files when using arXiv queries:
- YYYYMMDD-YYYYMMDD-CATEGORY-papers.json: Raw arXiv results
- YYYYMMDD-YYYYMMDD-CATEGORY-papers.md: Final filtered results
When using -input-json, specify the output name with -output.
Example files:
[
{
"title": "Paper Title",
"abstract": "Paper abstract...",
"arxiv_id": "2401.12345",
"authors": ["Author 1", "Author 2"]
}
]
# Accepted Papers
## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345
**Abstract:**
> Paper abstract...
**Decision Explanation:** Meets criteria for practical applications...
---
# Rejected Papers
...
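The mapping from the JSON records to the markdown layout shown above can be sketched as follows. The field names come from the example JSON; the `evaluated` wrapper, the `renderMarkdown` function, and the exact blank-line spacing are assumptions made for illustration and may differ from what json2md actually emits.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// paper matches one entry of the example JSON file.
type paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// evaluated (hypothetical) pairs a paper with the LLM's verdict.
type evaluated struct {
	paper
	Accepted    bool
	Explanation string
}

// renderMarkdown writes accepted papers first, then rejected ones.
func renderMarkdown(papers []evaluated) string {
	sections := []struct {
		title    string
		accepted bool
	}{{"Accepted Papers", true}, {"Rejected Papers", false}}

	var b strings.Builder
	for _, s := range sections {
		fmt.Fprintf(&b, "# %s\n\n", s.title)
		for _, p := range papers {
			if p.Accepted != s.accepted {
				continue
			}
			fmt.Fprintf(&b, "## [%s](https://arxiv.org/abs/%s)\n\n", p.Title, p.ArxivID)
			fmt.Fprintf(&b, "**arXiv ID:** %s\n\n", p.ArxivID)
			fmt.Fprintf(&b, "**Abstract:**\n> %s\n\n", p.Abstract)
			fmt.Fprintf(&b, "**Decision Explanation:** %s\n\n---\n\n", p.Explanation)
		}
	}
	return b.String()
}

func main() {
	raw := `[{"title": "Paper Title", "abstract": "Paper abstract...",
	          "arxiv_id": "2401.12345", "authors": ["Author 1", "Author 2"]}]`
	var papers []paper
	if err := json.Unmarshal([]byte(raw), &papers); err != nil {
		panic(err)
	}
	// Mark every paper as accepted purely for demonstration.
	var results []evaluated
	for _, p := range papers {
		results = append(results, evaluated{paper: p, Accepted: true,
			Explanation: "Meets criteria for practical applications..."})
	}
	fmt.Print(renderMarkdown(results))
}
```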
Workflow Examples
Basic Usage
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt
# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
Multiple Evaluations
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md
# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
Fetching More Papers
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
Error Handling
The system includes several safeguards:
- Validates all required parameters
- Ensures max-results is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging
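The retry safeguard can be illustrated with a small helper. This README does not state how many attempts the tool makes or how long it waits between them, so the attempt count and the doubling delay below are assumptions, and `retry` and `errTransient` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary LLM or network failure.
var errTransient = errors.New("transient failure")

// retry calls fn up to attempts times, doubling the delay after each
// failure, and returns the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```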
Notes
- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing
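Batching of this kind amounts to splitting the paper list into fixed-size slices. The sketch below shows the idea with a generic `chunk` helper (a hypothetical name; the processor's real implementation may differ), using the batch size of 32 mentioned above.

```go
package main

import "fmt"

// chunk splits items into consecutive slices of at most size elements.
// The final slice holds the remainder and may be shorter.
func chunk[T any](items []T, size int) [][]T {
	if size < 1 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	papers := make([]string, 100) // placeholder for 100 fetched papers
	for i, batch := range chunk(papers, 32) {
		fmt.Printf("batch %d: %d papers\n", i+1, len(batch))
	}
}
```

With the default of 100 papers, this yields three batches of 32 and a final batch of 4.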