paper-system/arxiv-processor/README.md

# ArXiv Processor

A Go package for fetching and processing papers from arXiv.

## Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/arxiv-processor.git
cd arxiv-processor
```

2. Initialize the Go module (skip `go mod init` if the repository already contains a `go.mod` file):
```bash
go mod init arxiv-processor
go mod tidy
```

## Usage

### As a Library
To use the package in your Go application:

```go
package main

import (
    "fmt"
    "log"
    "time"

    // Assumed import path: the client lives in the arxiv/ subdirectory
    // (see Package Structure). Adjust it to match your module path.
    "github.com/yourusername/arxiv-processor/arxiv"
)

func main() {
    // Create client
    client := arxiv.NewClient()

    // Define search parameters.
    // "20060102" is Go's reference time layout for YYYYMMDD:
    // 2006 = year, 01 = month, 02 = day
    // Note: The arXiv API returns full timestamps including time of day,
    // but the search API only uses the date portion for filtering
    startDate, err := time.Parse("20060102", "20240101")
    if err != nil {
        log.Fatal(err)
    }
    endDate, err := time.Parse("20060102", "20240131")
    if err != nil {
        log.Fatal(err)
    }

    // Fetch papers.
    // FetchPapers returns all papers at once, after the API request
    // and any necessary pagination have completed.
    papers, err := client.FetchPapers("cat:cs.AI", startDate, endDate)
    if err != nil {
        log.Fatal(err)
    }

    // Use papers directly (in-memory)
    for _, paper := range papers {
        fmt.Printf("Title: %s\n", paper.Title)
        fmt.Printf("Abstract: %s\n", paper.Abstract)
    }

    // Optionally save papers to file
    if err := arxiv.SavePapers("papers.json", papers); err != nil {
        log.Fatal(err)
    }
}
```

Note: The example above also writes the results to a file. To work with the results in memory only:
1. Remove the `SavePapers` call.
2. Use the returned `papers` slice directly; it holds all paper data as Go structs.
3. If you need JSON, encode the slice with `json.Marshal(papers)`.
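
The in-memory JSON step can be sketched as follows. This is a minimal illustration, not the package's own code: the `Paper` struct and the `papersToJSON` helper are hypothetical stand-ins, with JSON tags assumed from the example output shown under Program Output.

```go
package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// Paper is a hypothetical stand-in for the package's paper type.
// The JSON tags are assumed from the example output in this README.
type Paper struct {
    Title    string `json:"title"`
    Abstract string `json:"abstract"`
    ArxivID  string `json:"arxiv_id"`
}

// papersToJSON (hypothetical helper) encodes the slice as indented JSON
// without writing anything to disk.
func papersToJSON(papers []Paper) (string, error) {
    data, err := json.MarshalIndent(papers, "", "  ")
    if err != nil {
        return "", err
    }
    return string(data), nil
}

func main() {
    papers := []Paper{{
        Title:    "Sample Paper Title",
        Abstract: "This is a sample abstract...",
        ArxivID:  "2501.08565v1",
    }}

    out, err := papersToJSON(papers)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out)
}
```

`json.MarshalIndent` is used here only for readability; `json.Marshal` produces the same data in compact form.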

### Command Line Interface
To use the CLI:

```bash
go run main.go --search "cat:cs.AI" --date-range "YYYYMMDD-YYYYMMDD"
```

#### Command Line Options
- `--search`: Search query (e.g., "cat:cs.AI" for AI papers)
- `--date-range`: Date range in YYYYMMDD-YYYYMMDD format
- `--output`: Output file path (default: papers_data.json)

### Example: Fetch AI Papers
```bash
go run main.go --search "cat:cs.AI" --date-range "20250115-20250118"
```
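
A `--date-range` value like the one above has to be split into a start and an end date before it can be passed to `FetchPapers`. The sketch below shows one way to do that with the standard library; `parseDateRange` is a hypothetical helper, and the CLI's actual parsing may differ.

```go
package main

import (
    "fmt"
    "log"
    "strings"
    "time"
)

// dateLayout is Go's reference layout for YYYYMMDD.
const dateLayout = "20060102"

// parseDateRange (hypothetical helper) splits "YYYYMMDD-YYYYMMDD" into a
// start and end date and rejects malformed or reversed ranges.
func parseDateRange(s string) (start, end time.Time, err error) {
    from, to, ok := strings.Cut(s, "-")
    if !ok {
        return start, end, fmt.Errorf("date range %q: expected YYYYMMDD-YYYYMMDD", s)
    }
    if start, err = time.Parse(dateLayout, from); err != nil {
        return start, end, fmt.Errorf("start date: %w", err)
    }
    if end, err = time.Parse(dateLayout, to); err != nil {
        return start, end, fmt.Errorf("end date: %w", err)
    }
    if end.Before(start) {
        return start, end, fmt.Errorf("date range %q: end is before start", s)
    }
    return start, end, nil
}

func main() {
    start, end, err := parseDateRange("20250115-20250118")
    if err != nil {
        log.Fatal(err)
    }
    // Prints: 2025-01-15 2025-01-18
    fmt.Println(start.Format("2006-01-02"), end.Format("2006-01-02"))
}
```

Rejecting a reversed range early gives a clearer error than sending an empty window to the API.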

### Program Output
- Fetched papers are saved to `papers_data.json` by default; use `--output` to choose a different path.
- Example JSON structure:
```json
[
  {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract...",
    "arxiv_id": "2501.08565v1"
  }
]
```
- Each entry in the JSON file contains paper metadata, including:
  - Title (`title`)
  - Abstract (`abstract`)
  - arXiv ID (`arxiv_id`)
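
To consume the output file from another Go program, the JSON can be decoded back into structs. This is a sketch under the assumption that the file matches the example structure above; `Paper`, `parsePapers`, and `loadPapers` are hypothetical names, not part of the package.

```go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

// Paper is a hypothetical struct mirroring the example JSON structure.
type Paper struct {
    Title    string `json:"title"`
    Abstract string `json:"abstract"`
    ArxivID  string `json:"arxiv_id"`
}

// parsePapers (hypothetical helper) decodes a JSON array of papers.
func parsePapers(data []byte) ([]Paper, error) {
    var papers []Paper
    if err := json.Unmarshal(data, &papers); err != nil {
        return nil, fmt.Errorf("decode papers: %w", err)
    }
    return papers, nil
}

// loadPapers (hypothetical helper) reads a file such as papers_data.json
// and decodes its contents.
func loadPapers(path string) ([]Paper, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    return parsePapers(data)
}

func main() {
    // The sample mirrors the example above so the sketch runs without a file.
    sample := []byte(`[
  {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract...",
    "arxiv_id": "2501.08565v1"
  }
]`)

    papers, err := parsePapers(sample)
    if err != nil {
        log.Fatal(err)
    }
    for _, p := range papers {
        // Prints: 2501.08565v1: Sample Paper Title
        fmt.Printf("%s: %s\n", p.ArxivID, p.Title)
    }
}
```

With a real output file, call `loadPapers("papers_data.json")` instead of decoding the inline sample.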

## Configuration

### Environment Variables
- `ARXIV_MAX_RESULTS`: Maximum number of results to fetch (default: 100)
- `ARXIV_START_INDEX`: Start index for pagination (default: 0)
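
How the package reads these variables is not documented here. The sketch below shows one plausible approach, assuming each variable holds a non-negative integer and that invalid values fall back to the documented default; `parseIntOr` and `envInt` are hypothetical helpers.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
)

// parseIntOr (hypothetical helper) returns def when raw is empty,
// not an integer, or negative.
func parseIntOr(raw string, def int) int {
    n, err := strconv.Atoi(raw)
    if err != nil || n < 0 {
        return def
    }
    return n
}

// envInt (hypothetical helper) reads an integer environment variable,
// falling back to def when it is unset or invalid.
func envInt(name string, def int) int {
    return parseIntOr(os.Getenv(name), def)
}

func main() {
    // Defaults are taken from the list above.
    maxResults := envInt("ARXIV_MAX_RESULTS", 100)
    startIndex := envInt("ARXIV_START_INDEX", 0)
    fmt.Printf("max_results=%d start=%d\n", maxResults, startIndex)
}
```

Falling back silently keeps the sketch short; a stricter version could return an error for malformed values so that typos do not go unnoticed.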

## Package Structure

```
arxiv-processor/
├── arxiv/          # arXiv API client
├── storage/        # Data storage handlers
├── llm/            # LLM integration (TODO)
├── main.go         # Main entry point
└── README.md       # This file
```

## Contributing

1. Fork the repository
2. Create a new branch
3. Make your changes
4. Submit a pull request

## License

MIT License
Initial Commit; working system 2025-01-24 15:26:47 +00:00
