Steve White 9396e2da3a Initial Commit; working system		2025-01-24 09:26:47 -06:00
arxiv	Initial Commit; working system	2025-01-24 09:26:47 -06:00
storage	Initial Commit; working system	2025-01-24 09:26:47 -06:00
20250123-papers.json	Initial Commit; working system	2025-01-24 09:26:47 -06:00
README.md	Initial Commit; working system	2025-01-24 09:26:47 -06:00
arxiv-2501.11599v1.md	Initial Commit; working system	2025-01-24 09:26:47 -06:00
go.mod	Initial Commit; working system	2025-01-24 09:26:47 -06:00
main.go	Initial Commit; working system	2025-01-24 09:26:47 -06:00

README.md

ArXiv Processor

A Go package for fetching and processing papers from arXiv.

Installation

Clone the repository:

git clone https://github.com/yourusername/arxiv-processor.git
cd arxiv-processor

Download the dependencies (the repository already includes a go.mod file, so go mod init is not needed):

go mod tidy

Usage

As a Library

To use the package in your Go application:

package main

import (
    "fmt"
    "log"
    "time"

    // The client lives in the arxiv/ subdirectory. Adjust this path to
    // match the module path declared in go.mod.
    "github.com/yourusername/arxiv-processor/arxiv"
)

func main() {
    // Create client
    client := arxiv.NewClient()

    // Define the search window.
    // "20060102" is a layout built from Go's reference time:
    // 2006 = year
    // 01 = month
    // 02 = day
    // Note: The arXiv API returns full timestamps including time of day,
    // but the search API only uses the date portion for filtering
    startDate, err := time.Parse("20060102", "20240101")
    if err != nil {
        log.Fatal(err)
    }
    endDate, err := time.Parse("20060102", "20240131")
    if err != nil {
        log.Fatal(err)
    }

    // Fetch papers.
    // FetchPapers returns all papers at once, after the API request
    // and any necessary pagination have completed
    papers, err := client.FetchPapers("cat:cs.AI", startDate, endDate)
    if err != nil {
        log.Fatal(err)
    }

    // Use papers directly (in-memory)
    for _, paper := range papers {
        fmt.Printf("Title: %s\n", paper.Title)
        fmt.Printf("Abstract: %s\n", paper.Abstract)
    }

    // Optionally save papers to a file
    if err := arxiv.SavePapers("papers.json", papers); err != nil {
        log.Fatal(err)
    }
}

Note: In the example above, only the SavePapers call writes to disk. To work with the results in memory instead:

- Remove the SavePapers call.
- Use the returned papers slice directly; it holds all paper data as Go structs.
- Call json.Marshal(papers) if you need the data as JSON.
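
The in-memory route described in the note can be sketched as follows. This is a minimal illustration, not the package's own code: the Paper struct and the papersToJSON helper are hypothetical, and the field names and JSON tags are assumed from the example output shown later in this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Paper is a hypothetical stand-in for the package's paper type. The JSON
// tags are assumed from the example output in this README.
type Paper struct {
	Title    string `json:"title"`
	Abstract string `json:"abstract"`
	ArxivID  string `json:"arxiv_id"`
}

// papersToJSON encodes the slice as JSON in memory; nothing is written to disk.
func papersToJSON(papers []Paper) ([]byte, error) {
	return json.Marshal(papers)
}

func main() {
	papers := []Paper{{
		Title:    "Sample Paper Title",
		Abstract: "This is a sample abstract...",
		ArxivID:  "2501.08565v1",
	}}

	data, err := papersToJSON(papers)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(data))
}
```

If human-readable output matters more than size, json.MarshalIndent(papers, "", "  ") is a drop-in alternative.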

Command Line Interface

To use the CLI:

go run main.go --search "cat:cs.AI" --date-range "YYYYMMDD-YYYYMMDD"

Command Line Options

--search: Search query (e.g., "cat:cs.AI" for AI papers)
--date-range: Date range in YYYYMMDD-YYYYMMDD format
--output: Output file path (default: papers_data.json)
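
For readers who want to accept the same YYYYMMDD-YYYYMMDD format in their own code, one way to parse it is sketched below. The parseDateRange helper is hypothetical and is not taken from main.go; it assumes the two dates are separated by a single hyphen and rejects ranges whose end precedes the start.

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"
)

// parseDateRange is a hypothetical helper: it splits "YYYYMMDD-YYYYMMDD" at
// the hyphen and parses each half with the layout "20060102".
func parseDateRange(s string) (start, end time.Time, err error) {
	from, to, ok := strings.Cut(s, "-")
	if !ok {
		return start, end, fmt.Errorf("date range %q: want YYYYMMDD-YYYYMMDD", s)
	}
	if start, err = time.Parse("20060102", from); err != nil {
		return start, end, fmt.Errorf("start date: %w", err)
	}
	if end, err = time.Parse("20060102", to); err != nil {
		return start, end, fmt.Errorf("end date: %w", err)
	}
	if end.Before(start) {
		return start, end, fmt.Errorf("end date %s is before start date %s", to, from)
	}
	return start, end, nil
}

func main() {
	start, end, err := parseDateRange("20250115-20250118")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(start.Format("2006-01-02"), end.Format("2006-01-02"))
}
```

Validating the order of the two dates up front keeps a mistyped range from silently producing an empty result set.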

Example: Fetch AI Papers

go run main.go --search "cat:cs.AI" --date-range "20250115-20250118"

Program Output

Fetched papers are saved to papers_data.json by default, or to the path given with --output.
Example JSON structure:

[
  {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract...",
    "arxiv_id": "2501.08565v1"
  }
]

The JSON file contains paper metadata including:
- Title
- Abstract
- arXiv ID

Configuration

Environment Variables

ARXIV_MAX_RESULTS: Maximum number of results to fetch (default: 100)
ARXIV_START_INDEX: Start index for pagination (default: 0)

Package Structure

arxiv-processor/
├── arxiv/          # arXiv API client
├── storage/        # Data storage handlers
├── llm/            # LLM integration (TODO)
├── main.go         # Main entry point
└── README.md       # This file

Contributing

Fork the repository
Create a new branch
Make your changes
Submit a pull request

License

MIT License