# Paper System

A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components:

  1. `arxiv-processor`: Fetches papers from arXiv based on category and date range
  2. `llm_processor`: Evaluates papers using specified criteria through an LLM
  3. `json2md`: Generates formatted markdown output of accepted/rejected papers
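
A single run chains these components in sequence. Below is a rough sketch of the data flow, assuming the components run as internal stages of one `./paper-system` invocation (as the usage examples later in this README suggest); file names are taken from those examples.

```bash
# Rough data flow of one end-to-end run (stage names on the left are the
# components listed above; file names match the examples in this README):
#
#   arxiv-processor  ->  papers.json       (raw arXiv results)
#   llm_processor    ->  temp_output.json  (ACCEPT/REJECT decisions, temporary)
#   json2md          ->  papers.md         (formatted markdown report)
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt
```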

## Installation

  1. Ensure you have Go installed (1.20 or later)
  2. Clone this repository
  3. Build the system:

```bash
go build -o paper-system
```
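
For reference, the full sequence might look like this (the clone URL and directory name below are placeholders; substitute your own):

```bash
git clone <repository-url>   # placeholder: use the actual clone URL
cd paper-system              # assumed directory name
go build -o paper-system
```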

## Configuration

The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```
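
To avoid re-exporting the key in every session, you can persist it in your shell profile. A bash example (adjust the profile file for your shell):

```bash
# Persist the key for future sessions, then reload the profile
echo 'export OPENROUTER_API_KEY=your-api-key' >> ~/.bashrc
source ~/.bashrc

# Quick check that the key is visible to the current shell
test -n "$OPENROUTER_API_KEY" && echo "OPENROUTER_API_KEY is set"
```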

## Usage

The system can operate in two modes:

### 1. ArXiv Fetch Mode

Fetches papers from arXiv, processes them with the LLM, and generates markdown output:

```bash
./paper-system \
  -start 20240101 \
  -end 20240131 \
  -search cs.AI \
  -criteria criteria.txt \
  -output papers.md
```

Required flags:

  - `-start`: Start date in YYYYMMDD format
  - `-end`: End date in YYYYMMDD format
  - `-search`: arXiv category/search query (e.g., `cs.AI`, `physics.comp-ph`)
  - `-criteria`: Path to filter criteria file

Optional flags:

  - `-output`: Output markdown file path (default: `papers.md`)
  - `-model`: LLM model to use (default: `nvidia/llama-3.1-nemotron-70b-instruct`)
  - `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)
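
For example, a fetch that overrides all of the optional flags (the model ID below is only an illustration; use any model available through OpenRouter):

```bash
./paper-system \
  -start 20240101 \
  -end 20240131 \
  -search cs.AI \
  -criteria criteria.txt \
  -output january-ai.md \
  -model openai/gpt-4o \
  -max-results 500
```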

### 2. Input JSON Mode

Processes an existing JSON file of papers (useful for running different criteria against the same dataset):

```bash
./paper-system \
  -input-json papers.json \
  -criteria new-criteria.txt \
  -output results.md
```

Required flags:

  - `-input-json`: Path to input JSON file
  - `-criteria`: Path to filter criteria file

Optional flags:

  - `-output`: Output markdown file path (default: `papers.md`)
  - `-model`: LLM model to use (default: `nvidia/llama-3.1-nemotron-70b-instruct`)
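
For instance, re-scoring an existing dataset with a different model (again, the model ID is only an illustration):

```bash
./paper-system \
  -input-json papers.json \
  -criteria criteria.txt \
  -model openai/gpt-4o \
  -output results-gpt4o.md
```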

## Input/Output Files

### Criteria File Format

Create a text file with your evaluation criteria. Example:

```text
Please evaluate this paper based on the following criteria:

1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?

Respond with a JSON object containing:
{
  "decision": "ACCEPT" or "REJECT",
  "explanation": "Detailed reasoning for the decision"
}
```

### Output Format

The system generates dated output files when using arXiv queries:

  1. `YYYYMMDD-YYYYMMDD-CATEGORY-papers.json`: Raw arXiv results
  2. `YYYYMMDD-YYYYMMDD-CATEGORY-papers.md`: Final filtered results
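
For the January 2024 `cs.AI` fetch shown above, the generated names would look roughly like this (the exact formatting of the category in the file name is an assumption):

```bash
ls
# 20240101-20240131-cs.AI-papers.json   (raw arXiv results)
# 20240101-20240131-cs.AI-papers.md     (final filtered results)
```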

When using `-input-json`, specify the output name with `-output`.

Example files:

`papers.json` (raw paper data):

```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```
`papers.md` (evaluation results):

```markdown
# Accepted Papers

## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345

**Abstract:**
> Paper abstract...

**Decision Explanation:** Meets criteria for practical applications...

---

# Rejected Papers
...
```

## Workflow Examples

### Basic Usage

```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt

# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```

### Multiple Evaluations

```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md

# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```

### Fetching More Papers

```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```

## Error Handling

The system includes several safeguards:

  - Validates all required parameters
  - Ensures `-max-results` is between 1 and 2000
  - Prevents mixing of arXiv and input JSON modes
  - Retries LLM processing on failure
  - Maintains temporary files for debugging

## Notes

  - The system preserves `papers.json` when fetching from arXiv, allowing for future reuse
  - Temporary files (`temp_input.json`, `temp_output.json`) are automatically cleaned up
  - The LLM processor uses a batch size of 32 papers for efficient processing