# Paper System

A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components:

1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates papers using specified criteria through an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers

## Installation

1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:
```bash
go build -o paper-system
```

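Before building, it can help to confirm that the installed Go toolchain meets the 1.20 minimum. The following is a minimal sketch; the function name `go_version_ok` is hypothetical and not part of this repository, and it only parses the standard `go version` output line.

```bash
# Hypothetical helper (not part of paper-system): succeeds when the given
# `go version` output reports Go 1.20 or later.
go_version_ok() {
  # Extract "MAJOR MINOR" from a line such as "go version go1.21.3 linux/amd64".
  ver=$(printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
  [ -n "$ver" ] || return 1
  major=${ver% *}
  minor=${ver#* }
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 20 ]; }
}

# Usage: go_version_ok "$(go version)" && go build -o paper-system
```
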
## Configuration

The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

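Since the key is read from the environment, a wrapper script can fail early when it is missing. This is a sketch only; `require_api_key` is a hypothetical helper name, and how paper-system itself reports a missing key is not documented here.

```bash
# Hypothetical pre-flight check (not part of paper-system): fail early with a
# clear message when the API key is missing or empty.
require_api_key() {
  if [ -z "${OPENROUTER_API_KEY:-}" ]; then
    echo "OPENROUTER_API_KEY is not set; export it before running paper-system" >&2
    return 1
  fi
}

# Usage: require_api_key && ./paper-system -input-json papers.json -criteria criteria.txt
```
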
## Usage

The system can operate in two modes:

### 1. ArXiv Fetch Mode

Fetches papers from arXiv, processes them with an LLM, and generates markdown output:

```bash
./paper-system \
  -start 20240101 \
  -end 20240131 \
  -search cs.AI \
  -criteria criteria.txt \
  -output papers.md
```

Required flags:
- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- `-criteria`: Path to filter criteria file

Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)

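When the fetch-mode flags are generated by a script, the documented constraints (YYYYMMDD dates, `-max-results` between 1 and 2000) can be checked before invoking the tool. The sketch below is an assumption about how such a check might look; the function names are hypothetical, and the date check covers format and month/day ranges only, not calendar validity.

```bash
# Hypothetical pre-flight validators (not part of paper-system).

# Succeeds when $1 looks like a YYYYMMDD date. Only the format and the
# month/day ranges are checked, so 20240230 still passes.
is_yyyymmdd() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])$'
}

# Succeeds when both dates are well-formed and start is not after end.
valid_date_range() {
  is_yyyymmdd "$1" && is_yyyymmdd "$2" && [ "$1" -le "$2" ]
}

# Succeeds when $1 is an integer within the documented 1..2000 range.
valid_max_results() {
  printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]{0,3}$' && [ "$1" -le 2000 ]
}

# Usage:
#   valid_date_range 20240101 20240131 && valid_max_results 500 &&
#     ./paper-system -start 20240101 -end 20240131 -search cs.AI \
#       -criteria criteria.txt -max-results 500
```
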
### 2. Input JSON Mode

Process an existing JSON file of papers (useful for running different criteria against the same dataset):

```bash
./paper-system \
  -input-json papers.json \
  -criteria new-criteria.txt \
  -output results.md
```

Required flags:
- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file

Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)

## Input/Output Files

### Criteria File Format

Create a text file with your evaluation criteria. Example:
```
Please evaluate this paper based on the following criteria:

1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?

Respond with a JSON object containing:
{
  "decision": "ACCEPT" or "REJECT",
  "explanation": "Detailed reasoning for the decision"
}
```

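The example criteria file above asks the LLM to reply with `decision` and `explanation` fields. Assuming the rest of the pipeline relies on those two fields (an assumption; the parsing code is not shown here), a quick check that a new criteria file still requests them might look like this. The function name is hypothetical.

```bash
# Hypothetical lint (not part of paper-system): succeeds when the criteria
# file mentions both JSON field names used in the example response format.
criteria_requests_json_fields() {
  grep -q '"decision"' "$1" && grep -q '"explanation"' "$1"
}

# Usage: criteria_requests_json_fields criteria.txt || echo "criteria.txt may be missing the response format" >&2
```
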
### Output Format

The system generates two types of output:

1. **papers.json**: Raw paper data in JSON format (when fetching from arXiv)
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```

2. **papers.md**: Formatted markdown with accepted/rejected papers
```markdown
# Accepted Papers

## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345

**Abstract:**
> Paper abstract...

**Decision Explanation:** Meets criteria for practical applications...

---

# Rejected Papers
...
```

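Because `papers.json` is a plain array of objects with an `arxiv_id` field, it can be inspected with standard text tools before re-running an evaluation. The helpers below are a rough sketch with hypothetical names; they assume each paper object contains exactly one `arxiv_id` key, as in the example above, and a proper JSON parser would be more robust.

```bash
# Hypothetical inspection helpers (not part of paper-system).

# Print the number of papers, counted as occurrences of the "arxiv_id" key.
count_papers() {
  grep -o '"arxiv_id"' "$1" | wc -l | tr -d ' '
}

# Print one https://arxiv.org/abs/<id> URL per paper, in file order.
list_arxiv_urls() {
  grep -o '"arxiv_id"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
    sed 's|.*"\([^"]*\)"$|https://arxiv.org/abs/\1|'
}

# Usage: count_papers papers.json; list_arxiv_urls papers.json
```
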
## Workflow Examples

### Basic Usage
```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt

# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```

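After a run, the generated markdown can be summarized by counting paper headings under each top-level section. This sketch assumes the layout shown under Output Format (`# Accepted Papers` / `# Rejected Papers` sections, one `## [Title](url)` heading per paper); `summarize_results` is a hypothetical name.

```bash
# Hypothetical summary helper (not part of paper-system): counts the
# "## [Title](url)" headings under the accepted and rejected sections.
summarize_results() {
  awk '
    /^# Accepted Papers/ { section = "accepted"; next }
    /^# Rejected Papers/ { section = "rejected"; next }
    /^## \[/             { count[section]++ }
    END { printf "accepted=%d rejected=%d\n", count["accepted"], count["rejected"] }
  ' "$1"
}

# Usage: summarize_results papers.md
```
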
### Multiple Evaluations
```bash
# 1. Fetch papers and evaluate them with the first criteria file
#    (this run also writes papers.json)
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md

# 2. Run different criteria on the same papers, reusing papers.json
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```

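With more than two criteria files, the input JSON mode can be driven from a loop. The following is a minimal sketch: `output_for`, `run_all_criteria`, and the `PAPER_SYSTEM` override variable are hypothetical names introduced for illustration, and the `results-<name>.md` naming scheme is an arbitrary choice.

```bash
# Hypothetical batch runner (not part of paper-system).

# Derive an output name from a criteria file, e.g. criteria/strict.txt -> results-strict.md
output_for() {
  base=$(basename "$1")
  printf 'results-%s.md\n' "${base%.*}"
}

# run_all_criteria PAPERS_JSON CRITERIA_FILE...
# Runs input JSON mode once per criteria file and stops at the first failure.
run_all_criteria() {
  papers_json=$1
  shift
  for criteria in "$@"; do
    "${PAPER_SYSTEM:-./paper-system}" -input-json "$papers_json" \
      -criteria "$criteria" -output "$(output_for "$criteria")" || return 1
  done
}

# Usage: run_all_criteria papers.json criteria1.txt criteria2.txt
```
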
### Fetching More Papers
```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```

## Error Handling

The system includes several safeguards:

- Validates all required parameters
- Ensures `-max-results` is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging

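The built-in retry covers LLM processing; the retry count and delay it uses are not documented here. For failures of the whole command (for example a network error during the fetch), a caller could add an outer retry. The sketch below shows that general pattern only and is not the tool's internal logic; `retry` and `RETRY_DELAY` are hypothetical names.

```bash
# Hypothetical outer retry wrapper (not part of paper-system).
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs the command until it succeeds or MAX_ATTEMPTS is reached, sleeping
# RETRY_DELAY seconds (default 2) between attempts.
retry() {
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    [ "$attempt" -ge "$max_attempts" ] && return 1
    sleep "${RETRY_DELAY:-2}"
    attempt=$((attempt + 1))
  done
}

# Usage: retry 3 ./paper-system -input-json papers.json -criteria criteria.txt
```
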
## Notes

- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing
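
To get a feel for the batch size, the number of batches for a given paper count is the count divided by 32, rounded up. A small sketch of that arithmetic, with a hypothetical function name; whether one batch maps to one LLM request is not stated here.

```bash
# Hypothetical helper (not part of paper-system): number of batches needed
# for N papers at a given batch size (default 32), i.e. ceil(N / size).
batch_count() {
  papers=$1
  size=${2:-32}
  echo $(( (papers + size - 1) / size ))
}

# Usage: batch_count 100    # default -max-results
#        batch_count 2000   # maximum -max-results
```

With the default `-max-results` of 100 this gives 4 batches, and the maximum of 2000 gives 63.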