# ArXiv Processor

A Go package for fetching and processing papers from arXiv.

## Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/arxiv-processor.git
cd arxiv-processor
```

2. Initialize the Go module (skip `go mod init` if the repository already contains a `go.mod` file, since the command fails when one exists):
```bash
go mod init arxiv-processor
go mod tidy
```

## Usage

### As a Library
To use the package in your Go application:

```go
package main

import (
	"fmt"
	"log"
	"time"

	// The client lives in the arxiv/ directory (see Package Structure).
	"github.com/yourusername/arxiv-processor/arxiv"
)

func main() {
	// Create client
	client := arxiv.NewClient()

	// Define search parameters.
	// "20060102" is a layout written in terms of Go's reference time
	// (Mon Jan 2 15:04:05 MST 2006):
	//   2006 = year
	//   01   = month
	//   02   = day
	// Note: The arXiv API returns full timestamps including time of day,
	// but the search API only uses the date portion for filtering
	startDate, err := time.Parse("20060102", "20240101")
	if err != nil {
		log.Fatal(err)
	}
	endDate, err := time.Parse("20060102", "20240131")
	if err != nil {
		log.Fatal(err)
	}

	// Fetch papers.
	// FetchPapers returns all papers at once, after the API request
	// and any necessary pagination have completed.
	papers, err := client.FetchPapers("cat:cs.AI", startDate, endDate)
	if err != nil {
		log.Fatal(err)
	}

	// Use papers directly (in-memory)
	for _, paper := range papers {
		fmt.Printf("Title: %s\n", paper.Title)
		fmt.Printf("Abstract: %s\n", paper.Abstract)
	}

	// Optionally save papers to file
	if err := arxiv.SavePapers("papers.json", papers); err != nil {
		log.Fatal(err)
	}
}
```

Note: The package currently writes to a file by default. To work with the results in memory only instead:
1. Remove the `SavePapers` call
2. Use the returned `papers` slice directly
3. The `papers` slice contains all paper data as Go structs
4. You can marshal to JSON using `json.Marshal(papers)` if needed

### Command Line Interface
To use the CLI:

```bash
go run main.go --search "cat:cs.AI" --date-range "YYYYMMDD-YYYYMMDD"
```

#### Command Line Options
- `--search`: Search query (e.g., "cat:cs.AI" for AI papers)
- `--date-range`: Date range in YYYYMMDD-YYYYMMDD format
- `--output`: Output file path (default: papers_data.json)

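To illustrate the `--date-range` format, here is a small sketch of how a `YYYYMMDD-YYYYMMDD` string could be split and validated. The `parseDateRange` helper is hypothetical and is not taken from this package; the CLI's actual parsing may differ.

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"
)

// dateLayout is the Go layout string for YYYYMMDD.
const dateLayout = "20060102"

// parseDateRange (hypothetical helper) splits "YYYYMMDD-YYYYMMDD" into a
// start and end date and rejects malformed or reversed ranges.
func parseDateRange(s string) (start, end time.Time, err error) {
	from, to, ok := strings.Cut(s, "-")
	if !ok {
		return start, end, fmt.Errorf("date range %q: want YYYYMMDD-YYYYMMDD", s)
	}
	if start, err = time.Parse(dateLayout, from); err != nil {
		return start, end, fmt.Errorf("start date: %w", err)
	}
	if end, err = time.Parse(dateLayout, to); err != nil {
		return start, end, fmt.Errorf("end date: %w", err)
	}
	if end.Before(start) {
		return start, end, fmt.Errorf("end date %s is before start date %s", to, from)
	}
	return start, end, nil
}

func main() {
	start, end, err := parseDateRange("20250115-20250118")
	if err != nil {
		log.Fatal(err)
	}
	// prints: 2025-01-15 to 2025-01-18
	fmt.Println(start.Format("2006-01-02"), "to", end.Format("2006-01-02"))
}
```

Rejecting a reversed range up front gives a clearer error than sending a query that can never match anything.
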
### Example: Fetch AI Papers
```bash
go run main.go --search "cat:cs.AI" --date-range "20250115-20250118"
```

### Program Output
- Fetched papers are saved to `papers_data.json` by default (use `--output` to choose a different path)
- Example JSON structure:
```json
[
  {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract...",
    "arxiv_id": "2501.08565v1"
  }
]
```
- The JSON file contains paper metadata including:
  - Title
  - Abstract
  - arXiv ID

## Configuration

### Environment Variables
- `ARXIV_MAX_RESULTS`: Maximum number of results to fetch (default: 100)
- `ARXIV_START_INDEX`: Start index for pagination (default: 0)

## Package Structure

```
arxiv-processor/
├── arxiv/          # arXiv API client
├── storage/        # Data storage handlers
├── llm/            # LLM integration (TODO)
├── main.go         # Main entry point
└── README.md       # This file
```

## Contributing

1. Fork the repository
2. Create a new branch
3. Make your changes
4. Submit a pull request

## License

MIT License