# Accepted Papers

## [Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design](https://arxiv.org/abs/2501.08603v2)

**arXiv ID:** 2501.08603v2

**Abstract:**

> Handcrafting heuristics for solving complex planning tasks (e.g., NP-hard combinatorial optimization (CO) problems) is a common practice but requires extensive domain knowledge. Recently, Large Language Model (LLM)-based automatic heuristics design (AHD) methods have shown promise in generating high-quality heuristics without manual intervention. Existing LLM-based AHD methods employ a population to maintain a fixed number of top-performing LLM-generated heuristics and introduce evolutionary computation (EC) to enhance the population iteratively. However, the population-based procedure brings greedy properties, often resulting in convergence to local optima. Instead, to more comprehensively explore the space of heuristics, we propose using Monte Carlo Tree Search (MCTS) for LLM-based heuristic evolution while preserving all LLM-generated heuristics in a tree structure. With a novel thought-alignment process and an exploration-decay technique, the proposed MCTS-AHD method delivers significantly higher-quality heuristics on various complex tasks. Our code is available at https://github.com/zz1358m/MCTS-AHD-master.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs in automatic heuristic design, including experimental results with quantitative metrics, and comparing its results to existing state-of-the-art techniques, demonstrating performance improvements.
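
To make the search procedure concrete, here is a minimal UCT-style sketch of evolving heuristics in a tree, assuming a hypothetical `propose` callable that wraps the LLM call refining a parent heuristic and an `evaluate` callable that scores a heuristic on benchmark instances; the paper's thought-alignment step is omitted, and its exploration decay would correspond to shrinking `c` over iterations:

```python
import math

class Node:
    """One LLM-generated heuristic, kept permanently in the tree."""
    def __init__(self, heuristic, parent=None):
        self.heuristic = heuristic          # e.g. source code of the heuristic
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0                    # running mean of evaluation scores

def uct_child(node, c):
    # Classic UCT: exploit high-scoring children, explore rarely visited ones.
    return max(node.children, key=lambda ch: ch.value
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)))

def mcts_step(root, propose, evaluate, c=1.4):
    node = root
    while node.children:                    # selection
        node = uct_child(node, c)
    child = Node(propose(node.heuristic), parent=node)   # expansion via LLM
    node.children.append(child)
    score = evaluate(child.heuristic)       # run heuristic on benchmarks
    while child is not None:                # backpropagation
        child.visits += 1
        child.value += (score - child.value) / child.visits
        child = child.parent
```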
---

## [Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning](https://arxiv.org/abs/2501.08897v1)

**arXiv ID:** 2501.08897v1

**Abstract:**

> Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, due to the intricate and often non-unique nomenclature of macromolecules. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs (KGs). By leveraging LLMs' powerful capabilities for extracting and recognizing chemical substance names, and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literature, extraction of reaction data, database querying, construction of retrosynthetic pathway trees, further expansion through the retrieval of additional literature, and recommendation of optimal reaction pathways. A novel Multi-branched Reaction Pathway Search (MBRPS) algorithm enables the exploration of all pathways, with a particular focus on multi-branched ones, helping LLMs overcome weak reasoning in multi-branched paths. This work represents the first attempt to develop a fully automated retrosynthesis planning agent tailored specifically for macromolecules powered by LLMs. Applied to polyimide synthesis, our new approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways, demonstrating its effectiveness and potential for broader applications.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs in knowledge graphs and retrieval-augmented generation, including experimental results with quantitative metrics, and comparing its results to state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.

---

## [Text Semantics to Flexible Design: A Residential Layout Generation Method Based on Stable Diffusion Model](https://arxiv.org/abs/2501.09279v1)

**arXiv ID:** 2501.09279v1

**Abstract:**

> Flexibility in AI-based residential layout design remains a significant challenge, as traditional methods like rule-based heuristics and graph-based generation often lack flexibility and require substantial design knowledge from users. To address these limitations, we propose a cross-modal design approach based on the Stable Diffusion model for generating flexible residential layouts. The method offers multiple input types for learning objectives, allowing users to specify both boundaries and layouts. It incorporates natural language as design constraints and introduces ControlNet to enable stable layout generation through two distinct pathways. We also present a scheme that encapsulates design expertise within a knowledge graph and translates it into natural language, providing an interpretable representation of design knowledge. This comprehensibility and diversity of input options enable professionals and non-professionals alike to directly express design requirements, enhancing flexibility and controllability. Finally, experiments verify that the proposed method offers better flexibility under multimodal constraints than state-of-the-art models, even when specific semantic information about room areas or connections is incomplete.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in residential layout generation, including experimental results with quantitative metrics, and comparing its results to state-of-the-art models. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.

---

## [SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs](https://arxiv.org/abs/2501.09316v1)

**arXiv ID:** 2501.09316v1

**Abstract:**

> Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLMs) restrict AI agents from effectively solving complex tasks that require long-horizon planning. Second, general-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent a SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision-making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision-making capabilities of AI agents in customer service scenarios based on SOPs.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, demonstrating experimental results with quantitative metrics, and comparing its results with state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications. The paper introduces a novel approach and demonstrates robust experimental validation, making it a strong candidate for acceptance.

---

## [Aligning Instruction Tuning with Pre-training](https://arxiv.org/abs/2501.09368v1)

**arXiv ID:** 2501.09368v1

**Abstract:**

> Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose *Aligning Instruction Tuning with Pre-training* (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs, including experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques, demonstrating performance improvements.
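
The coverage-shortfall idea can be illustrated with a small embedding-space sketch. Here `rewrite` stands in for an LLM call that turns raw pre-training text into an instruction-response pair, and the threshold-based gap test is an assumption of ours, not the paper's exact selection rule:

```python
import numpy as np

def coverage_gaps(pretrain_emb, instruct_emb, threshold=0.4):
    """Indices of pre-training chunks whose nearest instruction-tuning
    example is farther than `threshold` in cosine distance."""
    p = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    q = instruct_emb / np.linalg.norm(instruct_emb, axis=1, keepdims=True)
    nearest_dist = 1.0 - (p @ q.T).max(axis=1)
    return np.where(nearest_dist > threshold)[0]

def fill_gaps(chunks, gap_ids, rewrite):
    # Rewrite only the underrepresented chunks into instruction pairs.
    return [rewrite(chunks[i]) for i in gap_ids]
```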
---

## [Platform-Aware Mission Planning](https://arxiv.org/abs/2501.09632v1)

**arXiv ID:** 2501.09632v1

**Abstract:**

> Planning for autonomous systems typically requires reasoning with models at different levels of abstraction, and the harmonization of two competing sets of objectives: high-level mission goals that refer to an interaction of the system with the external environment, and low-level platform constraints that aim to preserve the integrity and the correct interaction of the subsystems. The complicated interplay between these two models makes it very hard to reason on the system as a whole, especially when the objective is to find plans with robustness guarantees, considering the non-deterministic behavior of the lower layers of the system.
>
> In this paper, we introduce the problem of Platform-Aware Mission Planning (PAMP), addressing it in the setting of temporal durative actions. The PAMP problem differs from standard temporal planning for its exists-forall nature: the high-level plan dealing with mission goals is required to satisfy safety and executability constraints, for all the possible non-deterministic executions of the low-level model of the platform and the environment. We propose two approaches for solving PAMP. The first baseline approach amalgamates the mission and platform levels, while the second is based on an abstraction-refinement loop that leverages the combination of a planner and a verification engine. We prove the soundness and completeness of the proposed approaches and validate them experimentally, demonstrating the importance of heterogeneous modeling and the superiority of the technique based on abstraction-refinement.

**Decision Explanation:** The paper meets criteria 2 and 3 by including experimental results with quantitative metrics and comparing its results with existing state-of-the-art techniques, demonstrating advancements in performance. It also meets criteria 4 by clearly describing its methodology and implementation, and touches on agentic AI by discussing autonomous systems and mission planning.

---

## [NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes](https://arxiv.org/abs/2501.09646v1)

**arXiv ID:** 2501.09646v1

**Abstract:**

> In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and advances in this field. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent's decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of decision-making models in non-stationary environments, presenting experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques. Additionally, it meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
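
The segregation of environment evolution from the agent's decision-making module can be pictured with a plain Gymnasium wrapper. This is an illustrative sketch of the concept, not NS-Gym's actual interface:

```python
import gymnasium as gym

class DriftingParamWrapper(gym.Wrapper):
    """Evolves one environment parameter on a fixed schedule, keeping
    the non-stationarity outside the agent's decision-making module."""
    def __init__(self, env, param, schedule):
        super().__init__(env)
        self.param = param            # attribute name on the underlying env
        self.schedule = schedule      # maps timestep -> parameter value
        self.t = 0

    def step(self, action):
        self.t += 1
        setattr(self.env.unwrapped, self.param, self.schedule(self.t))
        return self.env.step(action)

# Example: CartPole's gravity drifts upward over time.
env = DriftingParamWrapper(gym.make("CartPole-v1"), "gravity",
                           schedule=lambda t: 9.8 + 0.01 * t)
```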
---

## [Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation](https://arxiv.org/abs/2501.08523v1)

**arXiv ID:** 2501.08523v1

**Abstract:**

> The field of artificial intelligence has witnessed significant advancements in natural language processing, largely attributed to the capabilities of Large Language Models (LLMs). These models form the backbone of Agents designed to address long-context dependencies, particularly in Document-level Machine Translation (DocMT). DocMT presents unique challenges, with quality, consistency, and fluency being the key metrics for evaluation. Existing approaches, such as Doc2Doc and Doc2Sent, either omit sentences or compromise fluency. This paper introduces Doc-Guided Sent2Sent++, an Agent that employs an incremental sentence-level forced decoding strategy **to ensure every sentence is translated while enhancing the fluency of adjacent sentences**. Our Agent leverages a Doc-Guided Memory, focusing solely on the summary and its translation, which we find to be an efficient approach to maintaining consistency. Through extensive testing across multiple languages and domains, we demonstrate that Sent2Sent++ outperforms other methods in terms of quality, consistency, and fluency. The results indicate that our approach has achieved significant improvements in metrics such as s-COMET, d-COMET, LTCR-$1_f$, and document-level perplexity (d-ppl). The contributions of this paper include a detailed analysis of current DocMT research, the introduction of the Sent2Sent++ decoding method, the Doc-Guided Memory mechanism, and validation of its effectiveness across languages and domains.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in document-level machine translation, including experimental results with quantitative metrics, and comparing its results to state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.
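
A rough sketch of the decoding loop and doc-guided memory follows. `llm` is a hypothetical completion callable, and the paper's token-level forced decoding is approximated here by simply prompting one sentence at a time so none can be omitted:

```python
def translate_document(sentences, summary, llm):
    """Incremental sentence-level translation with a memory that holds
    only the document summary and its translation."""
    memory = {"summary": summary,
              "summary_mt": llm(f"Translate this summary: {summary}")}
    translations = []
    for sent in sentences:
        prompt = (f"Document summary: {memory['summary']}\n"
                  f"Summary translation: {memory['summary_mt']}\n"
                  f"Previous sentence translation: "
                  f"{translations[-1] if translations else '(none)'}\n"
                  f"Translate the next sentence fluently: {sent}")
        translations.append(llm(prompt))
    return translations
```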
---

## [LlamaRestTest: Effective REST API Testing with Small Language Models](https://arxiv.org/abs/2501.08598v1)

**arXiv ID:** 2501.08598v1

**Abstract:**

> Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on these specifications. Recent advancements in Natural Language Processing (NLP), particularly with Large Language Models (LLMs), have enhanced REST API testing by extracting actionable rules and generating input values from the human-readable portions of the specification. However, these advancements overlook the potential of continuously refining the identified rules and test inputs based on server responses. To address this limitation, we present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs and uncover parameter dependencies during the testing process by incorporating server responses. These LLMs are created by fine-tuning the Llama3-8b model, using mined datasets of REST API example values and inter-parameter dependencies. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results show that fine-tuning enables smaller LLMs to outperform larger models in detecting actionable rules and generating inputs for REST API testing. We evaluated configurations from the base Llama3-8B to fine-tuned versions and explored 2-bit, 4-bit, and 8-bit quantization for efficiency. LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications, and an ablation study highlights the impact of its novel components.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in REST API testing, including experimental results with quantitative metrics and comparisons with state-of-the-art techniques, demonstrating performance improvements.

---

## [AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL](https://arxiv.org/abs/2501.08600v1)

**arXiv ID:** 2501.08600v1

**Abstract:**

> As REST APIs have become widespread in modern web services, comprehensive testing of these APIs has become increasingly crucial. Due to the vast search space consisting of operations, parameters, and parameter values along with their complex dependencies and constraints, current testing tools suffer from low code coverage, leading to suboptimal fault detection. To address this limitation, we present a novel tool, AutoRestTest, which integrates the Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SODG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. AutoRestTest provides a command-line interface and continuous telemetry on successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary results.

**Decision Explanation:** The paper meets criteria 1, 2, and 4 by focusing on a practical application of LLMs in REST API testing, including experimental results, and clearly describing its methodology and implementation. It also introduces a novel approach by integrating LLMs with MARL, showing potential for innovative applications and performance improvements.

---

## [MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities](https://arxiv.org/abs/2501.08648v1)

**arXiv ID:** 2501.08648v1

**Abstract:**

> While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning, respectively). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we introduce MAGNET, an adaptation of decoder-only LLMs that enhances their ability to generate robust representations and infill missing text spans, while preserving their knowledge and text generation capabilities. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging future context, (3) retain the ability for open-ended text generation without exhibiting the repetition problem, and (4) preserve the knowledge gained by the LLM during pretraining.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs, including experimental results with quantitative metrics, and comparing its results with state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.
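
One plausible way to combine causal and bidirectional attention in a single mask is sketched below (our guess at the mechanism, not MAGNET's published implementation): positions inside designated spans, such as a span to infill, see each other in both directions, while everything else stays causal.

```python
import torch

def hybrid_attention_mask(seq_len, bidirectional_spans):
    """Boolean mask (True = may attend). Causal by default; tokens inside
    each (start, end) span additionally attend to each other both ways."""
    mask = torch.ones(seq_len, seq_len).tril().bool()
    for start, end in bidirectional_spans:      # end exclusive
        mask[start:end, start:end] = True
    return mask

# Tokens 3..6 form a bidirectional (e.g. infilling) span; the rest is causal.
print(hybrid_attention_mask(8, [(3, 7)]).int())
```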
---

## [Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model](https://arxiv.org/abs/2501.08878v1)

**arXiv ID:** 2501.08878v1

**Abstract:**

> Continual Learning seeks to develop a model capable of incrementally assimilating new information while retaining prior knowledge. However, current research predominantly addresses a straightforward learning context, wherein all data samples originate from a singular data domain. This paper shifts focus to a more complex and realistic learning environment, characterized by data samples sourced from multiple distinct domains. We tackle this intricate learning challenge by introducing a novel methodology, termed the Multi-Source Dynamic Expansion Model (MSDEM), which leverages various pre-trained models as backbones and progressively establishes new experts based on them to adapt to emerging tasks. Additionally, we propose an innovative dynamic expandable attention mechanism designed to selectively harness knowledge from multiple backbones, thereby accelerating the new task learning. Moreover, we introduce a dynamic graph weight router that strategically reuses all previously acquired parameters and representations for new task learning, maximizing the positive knowledge transfer effect, which further improves generalization performance. We conduct a comprehensive series of experiments, and the empirical findings indicate that our proposed approach achieves state-of-the-art performance.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of incremental learning, including experimental results with quantitative metrics, and comparing its results to state-of-the-art techniques, demonstrating advancements in performance.

---

## [Attention is All You Need Until You Need Retention](https://arxiv.org/abs/2501.09166v1)

**arXiv ID:** 2501.09166v1

**Abstract:**

> This work introduces a novel Retention Layer mechanism for Transformer based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real world challenges effectively. By emulating key aspects of human learning, this retention enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by introducing a novel Retention Layer mechanism for Transformer-based architectures, demonstrating potential performance improvements, and comparing its results to existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications. The paper's approach is novel, implementable with current standard tools, and has potential for reproducibility and robust experimental validation.

---

## [Guiding Retrieval using LLM-based Listwise Rankers](https://arxiv.org/abs/2501.09186v1)

**arXiv ID:** 2501.09186v1

**Abstract:**

> Large Language Models (LLMs) have shown strong promise as rerankers, especially in "listwise" settings where an LLM is prompted to rerank several search results at once. However, this "cascading" retrieve-and-rerank approach is limited by the bounded recall problem: relevant documents not retrieved initially are permanently excluded from the final ranking. Adaptive retrieval techniques address this problem, but do not work with listwise rerankers because they assume a document's score is computed independently from other documents. In this paper, we propose an adaptation of an existing adaptive retrieval method that supports the listwise setting and helps guide the retrieval process itself (thereby overcoming the bounded recall problem for LLM rerankers). Specifically, our proposed algorithm merges results both from the initial ranking and feedback documents provided by the most relevant documents seen up to that point. Through extensive experiments across diverse LLM rerankers, first-stage retrievers, and feedback sources, we demonstrate that our method can improve nDCG@10 by up to 13.23% and recall by 28.02%, all while keeping the total number of LLM inferences constant and overheads due to the adaptive process minimal. The work opens the door to leveraging LLM-based search in settings where the initial pool of results is limited, e.g., by legacy systems, or by the cost of deploying a semantic first stage.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs in retrieval-augmented generation, presenting experimental results with quantitative metrics, and comparing its results to existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
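
The merging loop might look like the following simplified sketch, where `retrieve` returns a ranked list of document ids for a query or (for feedback) a document, and `rerank` is one listwise LLM call; both are stand-ins, and the paper's exact merging policy differs in detail:

```python
def adaptive_listwise_search(query, retrieve, rerank, rounds=3, k=10):
    pool = list(retrieve(query))                # initial ranking
    for _ in range(rounds):
        ranked = rerank(query, pool[:k])        # one listwise LLM inference
        feedback = []
        for doc in ranked[:3]:                  # most relevant docs seen so far
            feedback.extend(retrieve(doc))      # neighbours of good documents
        seen = set(pool)
        pool = ranked + [d for d in feedback if d not in seen]
    return rerank(query, pool[:k])
```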
---

## [Perspective Transition of Large Language Models for Solving Subjective Tasks](https://arxiv.org/abs/2501.09265v1)

**arXiv ID:** 2501.09265v1

**Abstract:**

> Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. Different from objective tasks such as commonsense reasoning and arithmetic question-answering, the performance of LLMs on subjective tasks is still limited, where the perspective on the specific problem plays a crucial role in better interpreting the context and giving a proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives for the best way to solve the corresponding subjective problem. Through extensive experiments on a total of 12 subjective tasks using both closed-source and open-source LLMs including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used single fixed-perspective methods such as chain-of-thought prompting and expert prompting, and highlights the intricate ways that LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, including experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.
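
In-context perspective selection can be sketched as two chained calls; the prompt wording below is ours, not the paper's, and `llm` is a hypothetical completion callable:

```python
PERSPECTIVES = {
    "direct": "Answer the question directly.\n\nQuestion: {q}",
    "role": ("You are a domain expert on this topic. Answer from that "
             "expert's perspective.\n\nQuestion: {q}"),
    "third_person": ("Consider how a neutral third party would view this "
                     "question, then answer.\n\nQuestion: {q}"),
}

def perspective_transition(question, llm):
    # Step 1: let the model pick the perspective that fits the problem.
    choice = llm("Which perspective best fits this subjective question: "
                 f"direct, role, or third_person? Question: {question}\n"
                 "Reply with one word.").strip().lower()
    # Step 2: answer from the chosen perspective.
    template = PERSPECTIVES.get(choice, PERSPECTIVES["direct"])
    return llm(template.format(q=question))
```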
---

## [On Learning Informative Trajectory Embeddings for Imitation, Classification and Regression](https://arxiv.org/abs/2501.09327v1)

**arXiv ID:** 2501.09327v1

**Abstract:**

> In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at https://github.com/Erasmo1015/vte.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on real-world applications of sequential decision making, including experimental results with quantitative metrics, and comparing its results to traditional approaches. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.

---

## [Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment](https://arxiv.org/abs/2501.09620v1)

**arXiv ID:** 2501.09620v1

**Abstract:**

> Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs, including experimental results with quantitative metrics, and comparing its results to existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges involving LLMs.
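
The counterfactual-invariance idea can be written as a regularized pairwise reward loss. This is a sketch under our own assumptions (a `reward_model` mapping encoded responses to scalar rewards, and `chosen_cf` holding the same chosen responses with an irrelevant attribute perturbed, e.g. padded length), not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def causal_reward_loss(reward_model, chosen, rejected, chosen_cf, lam=0.1):
    r_c = reward_model(chosen)
    r_r = reward_model(rejected)
    # Standard Bradley-Terry pairwise preference loss.
    pairwise = -F.logsigmoid(r_c - r_r).mean()
    # Counterfactual invariance: the reward should not move when an
    # irrelevant variable in the chosen response is altered.
    invariance = (reward_model(chosen_cf) - r_c).pow(2).mean()
    return pairwise + lam * invariance
```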
---

## [The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models](https://arxiv.org/abs/2501.09653v1)

**arXiv ID:** 2501.09653v1

**Abstract:**

> The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.

**Decision Explanation:** The paper meets criteria 2 and 4 by providing a dataset for evaluating large language models and describing its methodology, and potentially meets criteria 3 by enabling fair comparisons with state-of-the-art techniques.

---

## [Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models](https://arxiv.org/abs/2501.09686v1)

**arXiv ID:** 2501.09686v1

**Abstract:**

> Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought", a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier: a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.

**Decision Explanation:** The paper meets criteria 1, 3, and 4 by focusing on practical applications of Large Language Models in reasoning tasks, comparing with state-of-the-art techniques, and clearly describing methodology and implementation details, making it a strong candidate for further review.

---

## [A Simple Aerial Detection Baseline of Multimodal Language Models](https://arxiv.org/abs/2501.09720v1)

**arXiv ID:** 2501.09720v1

**Abstract:**

> Multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in aerial detection, presenting experimental results with quantitative metrics, and comparing its results with state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.

---

## [Knowledge prompt chaining for semantic modeling](https://arxiv.org/abs/2501.08540v1)

**arXiv ID:** 2501.08540v1

**Abstract:**

> The task of building semantics for structured data such as CSV, JSON, and XML files is highly relevant in the knowledge representation field. Even though we have a vast amount of structured data on the internet, mapping it to domain ontologies to build semantics for it is still very challenging, as it requires the construction model to understand and learn graph-structured knowledge. Otherwise, the task requires human effort and cost. In this paper, we propose a novel automatic semantic modeling framework: Knowledge Prompt Chaining. It can serialize graph-structured knowledge and inject it into the LLMs properly in a Prompt Chaining architecture. Through this knowledge injection and prompt chaining, the model in our framework can learn the structure information and latent space of the graph and generate semantic labels and semantic graphs following the chains' instruction naturally. Based on experimental results, our method achieves better performance than existing leading techniques, despite using reduced structured input data.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs in knowledge graphs, presenting experimental results with quantitative metrics, and comparing its results to state-of-the-art techniques, demonstrating performance improvements.
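
A small sketch of serializing graph knowledge into chained prompts follows; the edge syntax and prompt wording are our own illustration, with `llm` a hypothetical completion callable:

```python
def serialize_graph(triples):
    """Linearize graph-structured knowledge, one edge per line."""
    return "\n".join(f"({s}) -[{p}]-> ({o})" for s, p, o in triples)

def knowledge_prompt_chain(columns, ontology_triples, llm):
    ctx = serialize_graph(ontology_triples)
    # Chain 1: semantic labels for each column of the structured file.
    labels = llm(f"Ontology:\n{ctx}\n\nAssign a semantic label from the "
                 f"ontology to each column: {', '.join(columns)}")
    # Chain 2: a semantic graph conditioned on chain 1's output.
    return llm(f"Ontology:\n{ctx}\n\nColumn labels:\n{labels}\n\n"
               "Connect the labels into a semantic graph, one edge per line.")
```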
---

## [LAMS: LLM-Driven Automatic Mode Switching for Assistive Teleoperation](https://arxiv.org/abs/2501.08558v1)

**arXiv ID:** 2501.08558v1

**Abstract:**

> Teleoperating high degrees-of-freedom (DoF) robotic manipulators via low-DoF controllers like joysticks often requires frequent switching between control modes, where each mode maps controller movements to specific robot actions. Manually performing this frequent switching can make teleoperation cumbersome and inefficient. On the other hand, existing automatic mode-switching solutions, such as heuristic-based or learning-based methods, are often task-specific and lack generalizability. In this paper, we introduce LLM-Driven Automatic Mode Switching (LAMS), a novel approach that leverages Large Language Models (LLMs) to automatically switch control modes based on task context. Unlike existing methods, LAMS requires no prior task demonstrations and incrementally improves by integrating user-generated mode-switching examples. We validate LAMS through an ablation study and a user study with 10 participants on complex, long-horizon tasks, demonstrating that LAMS effectively reduces manual mode switches, is preferred over alternative methods, and improves performance over time. The project website with supplementary materials is at https://lams-assistance.github.io/.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on a practical application of LLMs in assistive teleoperation, including experimental results with quantitative metrics, and comparing its results with existing methods. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
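
The mode-switching call can be pictured as a prompt that grows with user-generated examples; the mode names and wording here are hypothetical, not LAMS's actual set:

```python
MODES = ["translate-xy", "translate-z", "rotate-wrist", "gripper"]

def choose_mode(task_context, user_examples, llm):
    """Ask the LLM for the next joystick control mode; user corrections
    are appended as examples so the mapping improves over time."""
    shots = "\n".join(f"Context: {c} -> Mode: {m}" for c, m in user_examples)
    answer = llm("You map teleoperation context to a control mode.\n"
                 f"Modes: {', '.join(MODES)}\n{shots}\n"
                 f"Context: {task_context} -> Mode:").strip()
    return answer if answer in MODES else MODES[0]

# On a manual override, store the correction for future prompts:
# user_examples.append((task_context, corrected_mode))
```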
---

## [ANSR-DT: An Adaptive Neuro-Symbolic Learning and Reasoning Framework for Digital Twins](https://arxiv.org/abs/2501.08561v1)

**arXiv ID:** 2501.08561v1

**Abstract:**

> In this paper, we propose an Adaptive Neuro-Symbolic Learning Framework for digital twin technology called "ANSR-DT." Our approach combines pattern recognition algorithms with reinforcement learning and symbolic reasoning to enable real-time learning and adaptive intelligence. This integration enhances the understanding of the environment and promotes continuous learning, leading to better and more effective decision-making in real-time for applications that require human-machine collaboration. We evaluated the *ANSR-DT* framework for its ability to learn and adapt to dynamic patterns, observing significant improvements in decision accuracy, reliability, and interpretability when compared to existing state-of-the-art methods. However, challenges still exist in extracting and integrating symbolic rules in complex environments, which limits the full potential of our framework in heterogeneous settings. Moreover, our ongoing research aims to address this issue in the future by ensuring seamless integration of neural models at large. In addition, our open-source implementation promotes reproducibility and encourages future research to build on our foundational work.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of neuro-symbolic learning, presenting experimental results with quantitative metrics, and comparing its results to existing state-of-the-art methods. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.

---

## [RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation](https://arxiv.org/abs/2501.08617v1)

**arXiv ID:** 2501.08617v1

**Abstract:**

> Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, including experimental results with quantitative metrics, and comparing its results to state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
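
The hindsight loop reduces to: simulate the downstream consequence of each candidate response, then elicit feedback conditioned on that outcome. A schematic sketch, with `simulate` and `judge` as stand-ins for the model-simulated rollout and the hindsight evaluator:

```python
def hindsight_preferences(prompt, responses, simulate, judge):
    """Build (chosen, rejected) pairs from simulated-hindsight feedback,
    usable by PPO- or DPO-style preference optimization."""
    outcomes = [simulate(prompt, r) for r in responses]   # downstream effects
    scores = [judge(prompt, r, o) for r, o in zip(responses, outcomes)]
    best = max(range(len(responses)), key=scores.__getitem__)
    return [(responses[best], r)
            for i, r in enumerate(responses) if i != best]
```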
---

## [Leveraging LLM Agents for Translating Network Configurations](https://arxiv.org/abs/2501.08760v1)

**arXiv ID:** 2501.08760v1

**Abstract:**

> Configuration translation is a critical and frequent task in network operations. When a network device is damaged or outdated, administrators need to replace it to maintain service continuity. The replacement devices may originate from different vendors, necessitating configuration translation to ensure seamless network operation. However, translating configurations manually is a labor-intensive and error-prone process. In this paper, we propose an intent-based framework for translating network configuration with Large Language Model (LLM) Agents. The core of our approach is an Intent-based Retrieval Augmented Generation (IRAG) module that systematically splits a configuration file into fragments, extracts intents, and generates accurate translations. We also design a two-stage verification method to validate the syntactic and semantic correctness of the translated configurations. We implement and evaluate the proposed method on real-world network configurations. Experimental results show that our method achieves 97.74% syntax correctness, outperforming state-of-the-art methods in translation accuracy.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on a practical application of LLMs in network configuration translation, presenting experimental results with quantitative metrics, and comparing its results to state-of-the-art methods, demonstrating advancements in translation accuracy.

---

## [MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents](https://arxiv.org/abs/2501.08828v1)

**arXiv ID:** 2501.08828v1

**Abstract:**

> Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

**Decision Explanation:** The paper meets criteria 2 (Experimental Results and Quantitative Metrics) and 3 (Comparison with State-of-the-Art), and potentially meets criteria 1 (Practical Applications) and 4 (Methodology and Implementation Details), as it introduces a new benchmark for multi-modal document retrieval and presents experimental results with quantitative metrics.

---

## [Disentangling Exploration of Large Language Models by Optimal Exploitation](https://arxiv.org/abs/2501.08925v1)

**arXiv ID:** 2501.08925v1

**Abstract:**

> Exploration is a crucial skill for self-improvement and open-ended problem-solving. However, it remains uncertain whether large language models can effectively explore the state-space. Existing evaluations predominantly focus on the trade-off between exploration and exploitation, often assessed in multi-armed bandit problems. In contrast, this work isolates exploration as the sole objective, tasking the agent with delivering information that enhances future returns. For the evaluation, we propose to decompose missing rewards into exploration and exploitation components by measuring the optimal achievable return for the states already explored. Our experiments with various LLMs reveal that most models struggle to sufficiently explore the state-space and that weak exploration is insufficient. We observe a positive correlation between model size and exploration performance, with larger models demonstrating superior capabilities. Furthermore, we show that our decomposition provides insights into differences in behaviors driven by agent instructions during prompt engineering, offering a valuable tool for refining LLM performance in exploratory tasks.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, including experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges involving LLMs.
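
The proposed decomposition can be stated in a few lines: the missing reward splits into an exploitation gap (return left unclaimed within already-explored states) and an exploration gap (return reachable only through unexplored states). A sketch, where `best_return_from` is an oracle for the optimal achievable return starting from an explored state (an assumption for illustration):

```python
def decompose_missing_reward(optimal_return, achieved_return,
                             explored_states, best_return_from):
    reachable_opt = max(best_return_from(s) for s in explored_states)
    exploitation_gap = reachable_opt - achieved_return
    exploration_gap = optimal_return - reachable_opt
    # The two gaps sum to the total missing reward.
    return exploration_gap, exploitation_gap
```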
---
|
||
|
|
||
|
## [Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG](https://arxiv.org/abs/2501.09136v1)
**arXiv ID:** 2501.09136v1

**Abstract:**
> Large Language Models (LLMs) have revolutionized artificial intelligence (AI)
by enabling human-like text generation and natural language understanding.
However, their reliance on static training data limits their ability to respond
to dynamic, real-time queries, resulting in outdated or inaccurate outputs.
Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs
by integrating real-time data retrieval to provide contextually relevant and
up-to-date responses. Despite its promise, traditional RAG systems are
constrained by static workflows and lack the adaptability required for
multistep reasoning and complex task management.
Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these
limitations by embedding autonomous AI agents into the RAG pipeline. These
agents leverage agentic design patterns (reflection, planning, tool use, and
multiagent collaboration) to dynamically manage retrieval strategies,
iteratively refine contextual understanding, and adapt workflows to meet
complex task requirements. This integration enables Agentic RAG systems to
deliver unparalleled flexibility, scalability, and context awareness across
diverse applications.
This survey provides a comprehensive exploration of Agentic RAG, beginning
with its foundational principles and the evolution of RAG paradigms. It
presents a detailed taxonomy of Agentic RAG architectures, highlights key
applications in industries such as healthcare, finance, and education, and
examines practical implementation strategies. Additionally, it addresses
challenges in scaling these systems, ensuring ethical decision making, and
optimizing performance for real-world applications, while providing detailed
insights into frameworks and tools for implementing Agentic RAG.

**Decision Explanation:** The paper meets criteria 1, 3, 4, and 5, and discusses real-world applications and challenges involving LLMs, particularly in areas like retrieval-augmented generation and agentic AI, with a focus on practical implementation strategies and scalability.
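
To make the agentic distinction concrete, here is a minimal sketch of a reflect-and-retry retrieval loop; `retrieve`, `generate`, and `critique` are hypothetical stubs standing in for a vector search and two LLM calls, not an API from the survey:

```python
def retrieve(query: str) -> list[str]:
    return [f"document about {query}"]                     # stub: vector search

def generate(query: str, docs: list[str]) -> str:
    return f"answer to {query!r} using {len(docs)} docs"   # stub: grounded LLM call

def critique(query: str, answer: str, docs: list[str]) -> dict:
    return {"sufficient": True, "refined_query": query}    # stub: LLM self-reflection

def agentic_rag(query: str, max_rounds: int = 3) -> str:
    """Retrieve -> generate -> reflect, retrying with a refined query."""
    current_query, answer = query, ""
    for _ in range(max_rounds):
        docs = retrieve(current_query)
        answer = generate(query, docs)
        verdict = critique(query, answer, docs)
        if verdict["sufficient"]:
            return answer
        current_query = verdict["refined_query"]  # adapt the retrieval strategy
    return answer  # best effort after max_rounds

print(agentic_rag("What changed in RAG pipelines in 2024?"))
```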

---

## [AutoLoop: Fast Visual SLAM Fine-tuning through Agentic Curriculum
Learning](https://arxiv.org/abs/2501.09160v1)
**arXiv ID:** 2501.09160v1

**Abstract:**
> Current visual SLAM systems face significant challenges in balancing
computational efficiency with robust loop closure handling. Traditional
approaches require careful manual tuning and incur substantial computational
overhead, while learning-based methods either lack explicit loop closure
capabilities or implement them through computationally expensive methods. We
present AutoLoop, a novel approach that combines automated curriculum learning
with efficient fine-tuning for visual SLAM systems. Our method employs a DDPG
(Deep Deterministic Policy Gradient) agent to dynamically adjust loop closure
weights during training, eliminating the need for manual hyperparameter search
while significantly reducing the required training steps. The approach
pre-computes potential loop closure pairs offline and leverages them through an
agent-guided curriculum, allowing the model to adapt efficiently to new
scenarios. Experiments conducted on TartanAir for training and validated across
multiple benchmarks including KITTI, EuRoC, ICL-NUIM and TUM RGB-D demonstrate
that AutoLoop achieves comparable or superior performance while reducing
training time by an order of magnitude compared to traditional approaches.
AutoLoop provides a practical solution for rapid adaptation of visual SLAM
systems, automating the weight tuning process that traditionally requires
multiple manual iterations. Our results show that this automated curriculum
strategy not only accelerates training but also maintains or improves the
model's performance across diverse environmental conditions.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of automated curriculum learning for visual SLAM systems, including experimental results with quantitative metrics and comparison with state-of-the-art techniques, demonstrating significant improvements in training time and performance.
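
The control idea, stripped of SLAM specifics, is an agent that sets a loss weight from a reward signal. A toy sketch under invented losses and a trivial policy update (the paper uses a full DDPG agent, not this rule):

```python
import random

def train_with_adaptive_weight(epochs: int = 5) -> None:
    """Curriculum idea in miniature: an agent picks the loop-closure loss weight.

    Stand-ins: random numbers replace real SLAM losses, and a one-line update
    replaces the DDPG policy; only the feedback structure is illustrated.
    """
    random.seed(0)
    weight = 0.5
    for epoch in range(epochs):
        tracking_loss = random.random()       # placeholder for odometry loss
        loop_closure_loss = random.random()   # placeholder for loop-closure loss
        total = tracking_loss + weight * loop_closure_loss
        reward = -total                       # agent is rewarded for lower loss
        weight = min(1.0, max(0.0, weight + 0.1 * reward))  # toy policy step
        print(f"epoch {epoch}: weight={weight:.3f} total_loss={total:.3f}")

train_with_adaptive_weight()
```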

---

## [Clone-Robust AI Alignment](https://arxiv.org/abs/2501.09254v1)
**arXiv ID:** 2501.09254v1

**Abstract:**
> A key challenge in training Large Language Models (LLMs) is properly aligning
them with human preferences. Reinforcement Learning with Human Feedback (RLHF)
uses pairwise comparisons from human annotators to train reward functions and
has emerged as a popular alignment method. However, input datasets in RLHF are
not necessarily balanced in the types of questions and answers that are
included. Therefore, we want RLHF algorithms to perform well even when the set
of alternatives is not uniformly distributed. Drawing on insights from social
choice theory, we introduce robustness to approximate clones, a desirable
property of RLHF algorithms which requires that adding near-duplicate
alternatives does not significantly change the learned reward function. We
first demonstrate that the standard RLHF algorithm based on regularized maximum
likelihood estimation (MLE) fails to satisfy this property. We then propose the
weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE
by weighting alternatives based on their similarity to other alternatives. This
new algorithm guarantees robustness to approximate clones while preserving
desirable theoretical properties.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, including experimental results with quantitative metrics and comparing its results with existing state-of-the-art techniques. It also introduces a novel approach to reinforcement learning with human feedback, making it relevant for further review.
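
The proposed fix reweights the standard pairwise-preference MLE so that near-duplicate alternatives share influence. A minimal sketch with an assumed cosine-similarity clone rule; the weighting formula below is illustrative, not the paper's exact construction:

```python
import numpy as np

def clone_aware_weights(embeddings: np.ndarray) -> np.ndarray:
    """Down-weight alternatives that have many near-duplicates.

    Illustrative rule: weight_i = 1 / (number of alternatives whose cosine
    similarity to i exceeds a threshold), so a cluster of k clones carries
    roughly the influence of one alternative.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    clone_counts = (sim > 0.95).sum(axis=1)  # count includes self
    return 1.0 / clone_counts

def weighted_preference_loss(r_win, r_lose, w_win, w_lose) -> float:
    """Weighted negative log-likelihood of pairwise comparisons (logistic model)."""
    logits = np.asarray(r_win) - np.asarray(r_lose)
    pair_w = np.asarray(w_win) * np.asarray(w_lose)
    return float(-(pair_w * np.log(1.0 / (1.0 + np.exp(-logits)))).mean())

# Two near-identical alternatives get half weight each: [0.5, 0.5, 1.0].
emb = np.array([[1.0, 0.0], [0.999, 0.001], [0.0, 1.0]])
print(clone_aware_weights(emb))
```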

---

## [To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic
Retrieval Augmented Generation](https://arxiv.org/abs/2501.09292v1)
**arXiv ID:** 2501.09292v1

**Abstract:**
> Retrieval-Augmented Generation equips large language models with the
capability to retrieve external knowledge, thereby mitigating hallucinations by
incorporating information beyond the model's intrinsic abilities. However, most
prior works have focused on invoking retrieval deterministically, which makes
it unsuitable for tasks such as long-form question answering. Instead,
dynamically performing retrieval by invoking it only when the underlying LLM
lacks the required knowledge can be more efficient. In this context, we delve
deeper into the question, "To Retrieve or Not to Retrieve?" by exploring
multiple uncertainty detection methods. We evaluate these methods for the task
of long-form question answering, employing dynamic retrieval, and present our
comparisons. Our findings suggest that uncertainty detection metrics, such as
Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval
calls by almost half, with only a slight reduction in question-answering
accuracy.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in retrieval-augmented generation, presenting experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.
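
The gating idea fits in a few lines: sample several answers, measure their mutual agreement, and retrieve only when agreement is low. Plain token-set Jaccard below is a simplified stand-in for the paper's Degree Matrix Jaccard and Eccentricity metrics:

```python
from itertools import combinations

def mean_jaccard(samples: list[str]) -> float:
    """Average pairwise Jaccard similarity between token sets of sampled answers."""
    sets = [set(s.lower().split()) for s in samples]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def should_retrieve(samples: list[str], threshold: float = 0.5) -> bool:
    """Retrieve only when the model's sampled answers disagree (high uncertainty)."""
    return mean_jaccard(samples) < threshold

samples = ["Paris is the capital of France",
           "The capital of France is Paris",
           "France's capital city is Paris"]
print(should_retrieve(samples))  # consistent answers -> False, skip retrieval
```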

---

## [A Study of In-Context-Learning-Based Text-to-SQL Errors](https://arxiv.org/abs/2501.09310v1)
**arXiv ID:** 2501.09310v1

**Abstract:**
> Large language models (LLMs) have been adopted to perform text-to-SQL tasks,
utilizing their in-context learning (ICL) capability to translate natural
language questions into structured query language (SQL). However, such a
technique faces correctness problems and requires efficient repairing
solutions. In this paper, we conduct the first comprehensive study of
text-to-SQL errors. Our study covers four representative ICL-based techniques,
five basic repairing methods, two benchmarks, and two LLM settings. We find
that text-to-SQL errors are widespread and summarize 29 error types of 7
categories. We also find that existing repairing attempts have limited
correctness improvement at the cost of high computational overhead with many
mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL
error detection and repairing framework. The evaluation demonstrates that
MapleRepair outperforms existing solutions by repairing 13.8% more queries with
neglectable mis-repairs and 67.4% less overhead.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in text-to-SQL tasks, including experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques, demonstrating performance improvements.
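
The repair setting follows a detect-then-fix pattern. A minimal sketch of the control flow, with `repair` as a hypothetical LLM-based fixer; guarding repairs behind a symptom check is what keeps mis-repairs low, though MapleRepair's actual detectors are more sophisticated:

```python
import sqlite3

def execute_safely(db_path: str, sql: str):
    """Run a candidate query; return (rows, error) instead of raising."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def detect_and_repair(db_path: str, sql: str, repair) -> str:
    """Only rewrite queries that show a symptom (an error or an empty result).

    Queries that already run and return rows are left untouched, so the
    repairer cannot break them (the guard against mis-repairs).
    """
    rows, error = execute_safely(db_path, sql)
    if error is None and rows:
        return sql  # no symptom: do not risk a mis-repair
    return repair(sql, error)  # `repair` stands in for an LLM-based fixer

# Toy usage: a lambda plays the role of the repairing model.
fixed = detect_and_repair("example.db", "SELECT nme FROM users",
                          lambda q, e: q.replace("nme", "name"))
print(fixed)
```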

---

## [Rational Tuning of LLM Cascades via Probabilistic Modeling](https://arxiv.org/abs/2501.09345v1)
**arXiv ID:** 2501.09345v1

**Abstract:**
> Understanding the reliability of large language models (LLMs) has recently
garnered significant attention. Given LLMs' propensity to hallucinate, as well
as their high sensitivity to prompt design, it is already challenging to
predict the performance of an individual LLM. However, the problem becomes more
complex for compound LLM systems such as cascades, where in addition to each
model's standalone performance, we must understand how the error rates of
different models interact. In this paper, we present a probabilistic model for
the joint performance distribution of a sequence of LLMs, which enables a
framework for rationally tuning the confidence thresholds of an LLM cascade
using continuous optimization. Compared to selecting confidence thresholds
using grid search, our parametric Markov-copula model significantly improves
runtime scaling with respect to the length of the cascade and the desired
resolution of the cost-error curve, turning them from intractable into
low-order polynomial. In addition, the optimal thresholds computed using our
continuous optimization-based algorithm increasingly outperform those found via
grid search as cascade length grows, improving the area under the cost-error
curve by 1.9% on average for cascades consisting of at least three models.
Overall, our Markov-copula model provides a rational basis for tuning LLM
cascade performance and points to the potential of probabilistic methods in
analyzing LLM systems.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs, including experimental results with quantitative metrics, and comparing its results to existing state-of-the-art techniques, demonstrating performance improvements.
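
The object being tuned is simple to picture: a cascade defers to the next, larger model whenever the current model's confidence falls below its threshold. A minimal sketch of that inference rule, assuming each model returns an (answer, confidence) pair; the Markov-copula threshold optimization itself is not reproduced here:

```python
def cascade_infer(models, thresholds, query):
    """Return the first sufficiently confident answer, else the last model's.

    `models` maps query -> (answer, confidence), ordered cheapest to most
    capable; len(thresholds) == len(models) - 1.
    """
    for model, tau in zip(models[:-1], thresholds):
        answer, confidence = model(query)
        if confidence >= tau:
            return answer          # confident enough: stop early, save cost
    return models[-1](query)[0]    # the final model always answers

# Toy two-model cascade: a weak model defers hard queries to a strong one.
weak = lambda q: ("weak:" + q, 0.4 if "hard" in q else 0.9)
strong = lambda q: ("strong:" + q, 0.99)
print(cascade_infer([weak, strong], [0.7], "easy question"))
print(cascade_infer([weak, strong], [0.7], "hard question"))
```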

---

## [YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents
in Augmented Reality Tasks](https://arxiv.org/abs/2501.09355v1)
**arXiv ID:** 2501.09355v1

**Abstract:**
> Multimodal AI Agents are AI models that have the capability of interactively
and cooperatively assisting human users to solve day-to-day tasks. Augmented
Reality (AR) head worn devices can uniquely improve the user experience of
solving procedural day-to-day tasks by providing egocentric multimodal (audio
and video) observational capabilities to AI Agents. Such AR capabilities can
help AI Agents see and listen to actions that users take which can relate to
multimodal capabilities of human users. Existing AI Agents, either Large
Language Models (LLMs) or Multimodal Vision-Language Models (VLMs) are reactive
in nature, which means that models cannot take an action without reading or
listening to the human user's prompts. Proactivity of AI Agents on the other
hand can help the human user detect and correct any mistakes in agent observed
tasks, encourage users when they do tasks correctly or simply engage in
conversation with the user - akin to a human teaching or assisting a user. Our
proposed YET to Intervene (YETI) multimodal agent focuses on the research
question of identifying circumstances that may require the agent to intervene
proactively. This allows the agent to understand when it can intervene in a
conversation with human users that can help the user correct mistakes on tasks,
like cooking, using AR. Our YETI Agent learns scene understanding signals based
on interpretable notions of Structural Similarity (SSIM) on consecutive video
frames. We also define the alignment signal which the AI Agent can learn to
identify if the video frames corresponding to the user's actions on the task
are consistent with expected actions. These signals are used by our AI Agent to
determine when it should proactively intervene. We compare our results on the
instances of proactive intervention in the HoloAssist multimodal benchmark for
an expert agent guiding a user to complete procedural tasks.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models in augmented reality tasks, including experimental results with quantitative metrics and comparison with state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
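
The scene-change signal is interpretable and cheap. A minimal sketch of SSIM-gated intervention over consecutive grayscale frames using scikit-image; the threshold and the expected-action check are illustrative assumptions, not the paper's trained alignment signal:

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_change_signal(prev: np.ndarray, curr: np.ndarray) -> float:
    """1 - SSIM between consecutive grayscale frames: near 0 when nothing changed."""
    return 1.0 - structural_similarity(prev, curr, data_range=255)

def should_intervene(prev, curr, action_matches_expectation: bool,
                     change_threshold: float = 0.3) -> bool:
    """Intervene when the scene changed meaningfully but the observed action
    does not match the expected step (illustrative rule, not the paper's)."""
    return (frame_change_signal(prev, curr) > change_threshold
            and not action_matches_expectation)

rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, (64, 64)).astype(np.uint8)
frame_b = np.roll(frame_a, shift=8, axis=1)  # simulate motion between frames
print(should_intervene(frame_a, frame_b, action_matches_expectation=False))
```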

---

## [MoE$^2$: Optimizing Collaborative Inference for Edge Large Language
Models](https://arxiv.org/abs/2501.09410v1)
**arXiv ID:** 2501.09410v1

**Abstract:**
> Large language models (LLMs) have demonstrated remarkable capabilities across
a wide range of natural language processing tasks. Exploiting the heterogeneous
capabilities of edge LLMs is crucial for diverse emerging applications, as it
enables greater cost-effectiveness and reduced latency. In this work, we
introduce *Mixture-of-Edge-Experts (MoE$^2$)*, a novel collaborative
inference framework for edge LLMs. We formulate the joint gating and expert
selection problem to optimize inference performance under energy and latency
constraints. Unlike conventional MoE problems, LLM expert selection is
significantly more challenging due to the combinatorial nature and the
heterogeneity of edge LLMs across various attributes. To this end, we propose a
two-level expert selection mechanism through which we uncover an
optimality-preserving property of gating parameters across expert selections.
This property enables the decomposition of the training and selection
processes, significantly reducing complexity. Furthermore, we leverage the
objective's monotonicity and design a discrete monotonic optimization algorithm
for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX
Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results
validate the performance improvements of various LLM models and show that our
MoE$^2$ method can achieve optimal trade-offs among different delay and energy
budgets, and outperforms baselines under various system resource constraints.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of Large Language Models, including experimental results with quantitative metrics, and comparing its results with state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications. Additionally, the paper introduces a novel approach and demonstrates robust experimental validation.
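
At the outer level this is constrained subset selection over heterogeneous experts. A greedy utility-per-cost sketch under invented accuracy/latency/energy numbers; greedy is deliberately naive here, which is exactly why the paper designs a monotonic optimization algorithm instead:

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    accuracy: float    # expected task quality
    latency_ms: float
    energy_j: float

def select_experts(experts, latency_budget_ms: float, energy_budget_j: float):
    """Greedy utility-per-cost selection under joint latency/energy budgets.

    Illustrative baseline only: greedy is not optimal for combinatorial
    selection over heterogeneous edge devices.
    """
    chosen, lat, eng = [], 0.0, 0.0
    ranked = sorted(experts,
                    key=lambda e: e.accuracy / (e.latency_ms + e.energy_j),
                    reverse=True)
    for e in ranked:
        if lat + e.latency_ms <= latency_budget_ms and eng + e.energy_j <= energy_budget_j:
            chosen.append(e.name)
            lat += e.latency_ms
            eng += e.energy_j
    return chosen

pool = [Expert("jetson-7b", 0.71, 120, 8), Expert("rtx4090-70b", 0.86, 300, 40),
        Expert("jetson-3b", 0.62, 60, 4)]
print(select_experts(pool, latency_budget_ms=400, energy_budget_j=45))
```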

---

## [CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through
Category-Bounding](https://arxiv.org/abs/2501.09645v1)
**arXiv ID:** 2501.09645v1

**Abstract:**
> In today's assistant landscape, personalisation enhances interactions,
fosters long-term relationships, and deepens engagement. However, many systems
struggle with retaining user preferences, leading to repetitive user requests
and disengagement. Furthermore, the unregulated and opaque extraction of user
preferences in industry applications raises significant concerns about privacy
and trust, especially in regions with stringent regulations like Europe. In
response to these challenges, we propose a long-term memory system for voice
assistants, structured around predefined categories. This approach leverages
Large Language Models to efficiently extract, store, and retrieve preferences
within these categories, ensuring both personalisation and transparency. We
also introduce a synthetic multi-turn, multi-session conversation dataset
(CarMem), grounded in real industry data, tailored to an in-car voice assistant
setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to
.95 in preference extraction, depending on category granularity. Our
maintenance strategy reduces redundant preferences by 95% and contradictory
ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively,
the results demonstrate the system's suitability for industrial applications.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of LLMs in voice assistants, providing experimental results with quantitative metrics, and comparing its results to existing techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications and challenges.
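
Category-bounding means memory can only hold preferences under a predefined schema, which is what makes extraction auditable. A minimal sketch with invented category names; the LLM extraction step itself is omitted:

```python
ALLOWED_CATEGORIES = {"climate", "navigation", "media", "points_of_interest"}

class CategoryBoundedMemory:
    """Store at most one current preference per (category, key); reject
    anything outside the predefined schema instead of storing it opaquely."""

    def __init__(self) -> None:
        self.store: dict[tuple[str, str], str] = {}

    def add(self, category: str, key: str, value: str) -> bool:
        if category not in ALLOWED_CATEGORIES:
            return False                      # transparency: nothing off-schema
        self.store[(category, key)] = value   # overwrite = contradiction cleanup
        return True

    def retrieve(self, category: str) -> dict[str, str]:
        return {k: v for (c, k), v in self.store.items() if c == category}

mem = CategoryBoundedMemory()
mem.add("climate", "temperature", "21C")
mem.add("climate", "temperature", "19C")   # newer preference replaces older
print(mem.retrieve("climate"))             # {'temperature': '19C'}
```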

---

## [Authenticated Delegation and Authorized AI Agents](https://arxiv.org/abs/2501.09674v1)
**arXiv ID:** 2501.09674v1

**Abstract:**
> The rapid deployment of autonomous AI agents creates urgent challenges around
authorization, accountability, and access control in digital spaces. New
standards are needed to know whom AI agents act on behalf of and guide their
use appropriately, protecting online spaces while unlocking the value of task
delegation to autonomous agents. We introduce a novel framework for
authenticated, authorized, and auditable delegation of authority to AI agents,
where human users can securely delegate and restrict the permissions and scope
of agents while maintaining clear chains of accountability. This framework
builds on existing identification and access management protocols, extending
OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata,
maintaining compatibility with established authentication and web
infrastructure. Further, we propose a framework for translating flexible,
natural language permissions into auditable access control configurations,
enabling robust scoping of AI agent capabilities across diverse interaction
modalities. Taken together, this practical approach facilitates immediate
deployment of AI agents while addressing key security and accountability
concerns, working toward ensuring agentic AI systems perform only appropriate
actions and providing a tool for digital service providers to enable AI agent
interactions without risking harm from scalable interaction.

**Decision Explanation:** The paper meets criteria 1, 3, and 5 by focusing on practical applications of agentic AI, comparing its framework with existing protocols, and discussing real-world applications and challenges. It also introduces a novel approach to authenticated delegation and authorized AI agents, demonstrating potential for innovative applications and performance improvements.
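
The core artifact is a verifiable claim of the form "user U delegates scope S to agent A until time T". A minimal HMAC-signed sketch using only the standard library; the field names are illustrative and not the framework's actual OAuth 2.0 / OpenID Connect credential format:

```python
import hashlib, hmac, json, time

SECRET = b"shared-demo-secret"  # stand-in for a real key management setup

def issue_delegation(user: str, agent: str, scopes: list[str], ttl_s: int = 3600) -> dict:
    """Create a signed, time-limited, scope-restricted delegation claim."""
    claim = {"delegator": user, "agent": agent, "scopes": scopes,
             "exp": int(time.time()) + ttl_s}
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["sig"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return claim

def verify_delegation(token: dict, required_scope: str) -> bool:
    """Check signature, expiry, and that the requested action is in scope."""
    claim = {k: v for k, v in token.items() if k != "sig"}
    payload = json.dumps(claim, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, token["sig"])
            and token["exp"] > time.time()
            and required_scope in token["scopes"])

token = issue_delegation("alice", "shopping-agent", ["orders:create"])
print(verify_delegation(token, "orders:create"))  # True
print(verify_delegation(token, "payments:send"))  # False: out of delegated scope
```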

---

## [ADAGE: A generic two-layer framework for adaptive agent based modelling](https://arxiv.org/abs/2501.09429v1)
**arXiv ID:** 2501.09429v1

**Abstract:**
> Agent-based models (ABMs) are valuable for modelling complex, potentially
out-of-equilibria scenarios. However, ABMs have long suffered from the Lucas
critique, stating that agent behaviour should adapt to environmental changes.
Furthermore, the environment itself often adapts to these behavioural changes,
creating a complex bi-level adaptation problem. Recent progress integrating
multi-agent reinforcement learning into ABMs introduces adaptive agent
behaviour, beginning to address the first part of this critique, however, the
approaches are still relatively ad hoc, lacking a general formulation, and
furthermore, do not tackle the second aspect of simultaneously adapting
environmental level characteristics in addition to the agent behaviours. In
this work, we develop a generic two-layer framework for ADaptive AGEnt based
modelling (ADAGE) for addressing these problems. This framework formalises the
bi-level problem as a Stackelberg game with conditional behavioural policies,
providing a consolidated framework for adaptive agent-based modelling based on
solving a coupled set of non-linear equations. We demonstrate how this generic
approach encapsulates several common (previously viewed as distinct) ABM tasks,
such as policy design, calibration, scenario generation, and robust behavioural
learning under one unified framework. We provide example simulations on
multiple complex economic and financial environments, showing the strength of
the novel framework under these canonical settings, addressing long-standing
critiques of traditional ABMs.

**Decision Explanation:** The paper meets criteria 1, 2, and 3 by focusing on practical applications of adaptive agent-based modeling, demonstrating experimental results with quantitative metrics, and comparing its results with existing state-of-the-art techniques. It also meets criteria 4 and 5 by clearly describing its methodology and discussing real-world applications.
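
The bi-level structure is a leader (the environment) whose parameter conditions followers (the agents) that best-respond. A toy fixed-point iteration on invented quadratic payoffs; ADAGE solves the general problem with conditional policies rather than this simple alternation:

```python
def follower_best_response(env_param: float) -> float:
    # Toy agents: behaviour maximizing -(a - env_param)^2 is a = env_param.
    return env_param

def leader_update(env_param: float, agent_action: float, lr: float = 0.5) -> float:
    # Toy environment objective: keep aggregate behaviour near a target of 1.0.
    target = 1.0
    return env_param - lr * (agent_action - target)

env_param = 0.0
for _ in range(20):
    action = follower_best_response(env_param)    # inner level: agents adapt
    env_param = leader_update(env_param, action)  # outer level: environment adapts
print(round(env_param, 3), round(follower_best_response(env_param), 3))  # ~1.0, ~1.0
```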

---

# Rejected Papers
## [DualOpt: A Dual Divide-and-Optimize Algorithm for the Large-scale
Traveling Salesman Problem](https://arxiv.org/abs/2501.08565v1)
**arXiv ID:** 2501.08565v1

**Abstract:**
> This paper proposes a dual divide-and-optimize algorithm (DualOpt) for
solving the large-scale traveling salesman problem (TSP). DualOpt combines two
complementary strategies to improve both solution quality and computational
efficiency. The first strategy is a grid-based divide-and-conquer procedure
that partitions the TSP into smaller sub-problems, solving them in parallel and
iteratively refining the solution by merging nodes and partial routes. The
process continues until only one grid remains, yielding a high-quality initial
solution. The second strategy involves a path-based divide-and-optimize
procedure that further optimizes the solution by dividing it into sub-paths,
optimizing each using a neural solver, and merging them back to progressively
improve the overall solution. Extensive experiments conducted on two groups of
TSP benchmark instances, including randomly generated instances with up to
100,000 nodes and real-world datasets from TSPLIB, demonstrate the
effectiveness of DualOpt. The proposed DualOpt achieves highly competitive
results compared to 10 state-of-the-art algorithms in the literature. In
particular, DualOpt achieves an improvement gap up to 1.40% for the largest
instance TSP100K with a remarkable 104x speed-up over the leading heuristic
solver LKH3. Additionally, DualOpt demonstrates strong generalization on TSPLIB
benchmarks, confirming its capability to tackle diverse real-world TSP
applications.

**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, but rather proposes an algorithm for the traveling salesman problem.
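
The first of the two strategies is straightforward to illustrate: bucket cities into a grid so that each cell becomes a small sub-TSP. A minimal partitioning sketch (the merge/refine steps and the neural sub-path solver are omitted):

```python
import random
from collections import defaultdict

def grid_partition(cities, n_cells: int):
    """Assign each (x, y) city to a cell of an n_cells x n_cells grid."""
    xs, ys = zip(*cities)
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    cells = defaultdict(list)
    for x, y in cities:
        i = min(int((x - min_x) / (max_x - min_x) * n_cells), n_cells - 1)
        j = min(int((y - min_y) / (max_y - min_y) * n_cells), n_cells - 1)
        cells[(i, j)].append((x, y))
    return cells  # each cell is now a small sub-TSP, solvable in parallel

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(10_000)]
cells = grid_partition(cities, n_cells=10)
print(len(cells), max(len(c) for c in cells.values()))  # ~100 cells, ~100 cities each
```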

---

## [Development and Validation of the Provider Documentation Summarization
Quality Instrument for Large Language Models](https://arxiv.org/abs/2501.08977v1)
**arXiv ID:** 2501.08977v1

**Abstract:**
> As Large Language Models (LLMs) are integrated into electronic health record
(EHR) workflows, validated instruments are essential to evaluate their
performance before implementation. Existing instruments for provider
documentation quality are often unsuitable for the complexities of
LLM-generated text and lack validation on real-world data. The Provider
Documentation Summarization Quality Instrument (PDSQI-9) was developed to
evaluate LLM-generated clinical summaries. Multi-document summaries were
generated from real-world EHR data across multiple specialties using several
LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson
correlation for substantive validity, factor analysis and Cronbach's alpha for
structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for
generalizability, a semi-Delphi process for content validity, and comparisons
of high- versus low-quality summaries for discriminant validity. Seven
physician raters evaluated 779 summaries and answered 8,329 questions,
achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated
strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and
high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting
structural validity and generalizability. Factor analysis identified a 4-factor
model explaining 58% of the variance, representing organization, clarity,
accuracy, and utility. Substantive validity was supported by correlations
between note length and scores for Succinct (rho = -0.200, p = 0.029) and
Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high-
from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust
construct validity, supporting its use in clinical practice to evaluate
LLM-generated summaries and facilitate safer integration of LLMs into
healthcare workflows.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [AI-based Identity Fraud Detection: A Systematic Review](https://arxiv.org/abs/2501.09239v1)
**arXiv ID:** 2501.09239v1

**Abstract:**
> With the rapid development of digital services, a large volume of personally
identifiable information (PII) is stored online and is subject to cyberattacks
such as identity fraud. Most recently, the use of Artificial Intelligence
(AI)-enabled deep fake technologies has significantly increased the complexity
of identity fraud. Fraudsters may use these technologies to create highly
sophisticated counterfeit personal identification documents, photos and videos.
These advancements in the identity fraud landscape pose challenges for identity
fraud detection and society at large. There is a pressing need to review and
understand identity fraud detection methods, their limitations and potential
solutions. This research aims to address this important need by using the
well-known systematic literature review method. This paper reviewed a selected
set of 43 papers across 4 major academic literature databases. In particular,
the review results highlight two types of identity fraud prevention and
detection methods, along with in-depth and open challenges. The results were
also consolidated into a taxonomy of AI-based identity fraud detection and
prevention methods including key insights and trends. Overall, this paper
provides a foundational knowledge base to researchers and practitioners for
further research and development in this important area of digital identity
fraud.

**Decision Explanation:** The paper primarily focuses on identity fraud detection, which is related to law and social applications of AI, and does not meet the criteria of focusing on practical applications of Large Language Models (LLMs) or agentic AI.

---

## [AI in Support of Diversity and Inclusion](https://arxiv.org/abs/2501.09534v1)
**arXiv ID:** 2501.09534v1

**Abstract:**
> In this paper, we elaborate on how AI can support diversity and inclusion and
exemplify research projects conducted in that direction. We start by looking at
the challenges and progress in making large language models (LLMs) more
transparent, inclusive, and aware of social biases. Even though LLMs like
ChatGPT have impressive abilities, they struggle to understand different
cultural contexts and engage in meaningful, human-like conversations. A key
issue is that biases in language processing, especially in machine translation,
can reinforce inequality. Tackling these biases requires a multidisciplinary
approach to ensure AI promotes diversity, fairness, and inclusion. We also
highlight AI's role in identifying biased content in media, which is important
for improving representation. By detecting unequal portrayals of social groups,
AI can help challenge stereotypes and create more inclusive technologies.
Transparent AI algorithms, which clearly explain their decisions, are essential
for building trust and reducing bias in AI systems. We also stress AI systems
need diverse and inclusive training data. Projects like the Child Growth
Monitor show how using a wide range of data can help address real-world
problems like malnutrition and poverty. We present a project that demonstrates
how AI can be applied to monitor the role of search engines in spreading
disinformation about the LGBTQ+ community. Moreover, we discuss the SignON
project as an example of how technology can bridge communication gaps between
hearing and deaf people, emphasizing the importance of collaboration and mutual
trust in developing inclusive AI. Overall, with this paper, we advocate for AI
systems that are not only effective but also socially responsible, promoting
fair and inclusive interactions between humans and machines.

**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to Diversity, Social harm, and similar issues, which is an excluded topic.

---

## [Artificial Intelligence-Driven Clinical Decision Support Systems](https://arxiv.org/abs/2501.09628v1)
**arXiv ID:** 2501.09628v1

**Abstract:**
> As artificial intelligence (AI) becomes increasingly embedded in healthcare
delivery, this chapter explores the critical aspects of developing reliable and
ethical Clinical Decision Support Systems (CDSS). Beginning with the
fundamental transition from traditional statistical models to sophisticated
machine learning approaches, this work examines rigorous validation strategies
and performance assessment methods, including the crucial role of model
calibration and decision curve analysis. The chapter emphasizes that creating
trustworthy AI systems in healthcare requires more than just technical
accuracy; it demands careful consideration of fairness, explainability, and
privacy. The challenge of ensuring equitable healthcare delivery through AI is
stressed, discussing methods to identify and mitigate bias in clinical
predictive models. The chapter then delves into explainability as a cornerstone
of human-centered CDSS. This focus reflects the understanding that healthcare
professionals must not only trust AI recommendations but also comprehend their
underlying reasoning. The discussion advances to an analysis of privacy
vulnerabilities in medical AI systems, from data leakage in deep learning
models to sophisticated attacks against model explanations. The text explores
privacy-preservation strategies such as differential privacy and federated
learning, while acknowledging the inherent trade-offs between privacy
protection and model performance. This progression, from technical validation
to ethical considerations, reflects the multifaceted challenges of developing
AI systems that can be seamlessly and reliably integrated into daily clinical
practice while maintaining the highest standards of patient care and data
protection.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [Electronic Health Records: Towards Digital Twins in Healthcare](https://arxiv.org/abs/2501.09640v1)
**arXiv ID:** 2501.09640v1

**Abstract:**
> The pivotal shift from traditional paper-based records to sophisticated
Electronic Health Records (EHR) enabled systematic collection and analysis of
patient data through descriptive statistics, providing insight into patterns
and trends across patient populations. This evolution continued toward
predictive analytics, allowing healthcare providers to anticipate patient
outcomes and potential complications before they occur. This progression from
basic digital record-keeping to sophisticated predictive modelling and digital
twins reflects healthcare's broader evolution toward more integrated,
patient-centred approaches that combine data-driven insights with personalized
care delivery. This chapter explores the evolution and significance of
healthcare information systems, beginning with an examination of the
implementation of EHR in the UK and the USA. It provides a comprehensive
overview of the International Classification of Diseases (ICD) system, tracing
its development from ICD-9 to ICD-10. Central to this discussion is the
MIMIC-III database, a landmark achievement in healthcare data sharing and
arguably the most comprehensive critical care database freely available to
researchers worldwide. MIMIC-III has democratized access to high-quality
healthcare data, enabling unprecedented opportunities for research and
analysis. The chapter examines its structure, clinical outcome analysis
capabilities, and practical applications through case studies, with a
particular focus on mortality and length of stay metrics, vital signs
extraction, and ICD coding. Through detailed entity-relationship diagrams and
practical examples, the text illustrates MIMIC's complex data structure and
demonstrates how different querying approaches can lead to subtly different
results, emphasizing the critical importance of understanding the database's
architecture for accurate data extraction.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically Electronic Health Records, which is excluded according to the criteria.

---

## [The Goofus & Gallant Story Corpus for Practical Value Alignment](https://arxiv.org/abs/2501.09707v1)
**arXiv ID:** 2501.09707v1

**Abstract:**
> Values or principles are key elements of human society that influence people
to behave and function according to an accepted standard set of social rules to
maintain social order. As AI systems are becoming ubiquitous in human society,
it is a major concern that they could violate these norms or values and
potentially cause harm. Thus, to prevent intentional or unintentional harm, AI
systems are expected to take actions that align with these principles. Training
systems to exhibit this type of behavior is difficult and often requires a
specialized dataset. This work presents a multi-modal dataset illustrating
normative and non-normative behavior in real-life situations described through
natural language and artistic images. This training set contains curated sets
of images that are designed to teach young children about social principles. We
argue that this is an ideal dataset to use for training socially normative
agents given this fact.

**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to value alignment and social norms, which is not aligned with the specified criteria.

---

## [KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity
Recognition and Normalization for Dysmorphology Physical Examination Reports](https://arxiv.org/abs/2501.09744v1)
**arXiv ID:** 2501.09744v1

**Abstract:**
> The objective of BioCreative8 Track 3 is to extract phenotypic key medical
findings embedded within EHR texts and subsequently normalize these findings to
their Human Phenotype Ontology (HPO) terms. However, the presence of diverse
surface forms in phenotypic findings makes it challenging to accurately
normalize them to the correct HPO terms. To address this challenge, we explored
various models for named entity recognition and implemented data augmentation
techniques such as synonym marginalization to enhance the normalization step.
Our pipeline resulted in an exact extraction and normalization F1 score 2.6%
higher than the mean score of all submissions received in response to the
challenge. Furthermore, in terms of the normalization F1 score, our approach
surpassed the average performance by 1.9%. These findings contribute to the
advancement of automated medical data extraction and normalization techniques,
showcasing potential pathways for future research and application in the
biomedical domain.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [Adapting Whisper for Regional Dialects: Enhancing Public Services for
Vulnerable Populations in the United Kingdom](https://arxiv.org/abs/2501.08502v1)
**arXiv ID:** 2501.08502v1

**Abstract:**
> We collect novel data in the public service domain to evaluate the capability
of the state-of-the-art automatic speech recognition (ASR) models in capturing
regional differences in accents in the United Kingdom (UK), specifically
focusing on two accents from Scotland with distinct dialects. This study
addresses real-world problems where biased ASR models can lead to
miscommunication in public services, disadvantaging individuals with regional
accents, particularly those in vulnerable populations. We first examine the
out-of-the-box performance of the Whisper large-v3 model on a baseline dataset
and our data. We then explore the impact of fine-tuning Whisper on the
performance in the two UK regions and investigate the effectiveness of existing
model evaluation techniques for our real-world application through manual
inspection of model errors. We observe that the Whisper model has a higher word
error rate (WER) on our test datasets compared to the baseline data, and that
fine-tuning on a given dataset improves performance on the test dataset with
the same domain and accent. The fine-tuned models also appear to show improved
performance when applied to the test data outside of the region it was trained
on, suggesting that fine-tuned models may be transferable within parts of the
UK. Our manual analysis of model outputs reveals the benefits and drawbacks of
using WER as an evaluation metric and fine-tuning to adapt to regional
dialects.

**Decision Explanation:** The paper primarily focuses on automatic speech recognition and regional dialects, which does not meet the criteria for practical applications of Large Language Models, and does not demonstrate a clear connection to knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Mitigating Domain Shift in Federated Learning via Intra- and
Inter-Domain Prototypes](https://arxiv.org/abs/2501.08521v1)
**arXiv ID:** 2501.08521v1

**Abstract:**
> Federated Learning (FL) has emerged as a decentralized machine learning
technique, allowing clients to train a global model collaboratively without
sharing private data. However, most FL studies ignore the crucial challenge of
heterogeneous domains where each client has a distinct feature distribution,
which is common in real-world scenarios. Prototype learning, which leverages
the mean feature vectors within the same classes, has become a prominent
solution for federated learning under domain skew. However, existing federated
prototype learning methods only consider inter-domain prototypes on the server
and overlook intra-domain characteristics. In this work, we introduce a novel
federated prototype learning method, namely I$^2$PFL, which incorporates
$\textbf{I}$ntra-domain and $\textbf{I}$nter-domain $\textbf{P}$rototypes, to
mitigate domain shifts and learn a generalized global model across multiple
domains in federated learning. To construct intra-domain prototypes, we propose
feature alignment with MixUp-based augmented prototypes to capture the
diversity of local domains and enhance the generalization of local features.
Additionally, we introduce a reweighting mechanism for inter-domain prototypes
to generate generalized prototypes to provide inter-domain knowledge and reduce
domain skew across multiple clients. Extensive experiments on the Digits,
Office-10, and PACS datasets illustrate the superior performance of our method
compared to other baselines.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on federated learning and domain shift mitigation, without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.
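
Prototype learning itself is one line of linear algebra: a class prototype is the mean feature vector of that class. A minimal sketch of local prototypes plus a MixUp-style augmented variant; the interpolation rule is a simplified stand-in for the paper's construction:

```python
import numpy as np

def class_prototypes(features: np.ndarray, labels: np.ndarray) -> dict[int, np.ndarray]:
    """Mean feature vector per class: the basic federated-prototype payload."""
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def mixup_prototype(p: np.ndarray, other: np.ndarray, lam: float = 0.8) -> np.ndarray:
    """MixUp-style augmented prototype: convex blend of two class means
    (simplified illustration of augmentation for intra-domain diversity)."""
    return lam * p + (1.0 - lam) * other

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 16))
labels = rng.integers(0, 4, size=100)
protos = class_prototypes(feats, labels)
print(mixup_prototype(protos[0], protos[1]).shape)  # (16,)
```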

---

## [Dynamic Portfolio Optimization via Augmented DDPG with Quantum Price
Levels-Based Trading Strategy](https://arxiv.org/abs/2501.08528v1)
**arXiv ID:** 2501.08528v1

**Abstract:**
> With the development of deep learning, the Dynamic Portfolio Optimization
(DPO) problem has received a lot of attention in recent years, not only in the
field of finance but also in the field of deep learning. Some advanced research
in recent years has proposed the application of Deep Reinforcement Learning
(DRL) to the DPO problem, which was demonstrated to be more advantageous than
supervised learning in solving the DPO problem. However, there are still
certain unsolved issues: 1) DRL algorithms usually have the problems of slow
learning speed and high sample complexity, which is especially problematic when
dealing with complex financial data. 2) researchers use DRL simply for the
purpose of obtaining high returns, but pay little attention to the problem of
risk control and trading strategy, which will affect the stability of model
returns. In order to address these issues, in this study we revamped the
intrinsic structure of the model based on the Deep Deterministic Policy
Gradient (DDPG) and proposed the Augmented DDPG model. Besides, we also
proposed an innovative risk control strategy based on Quantum Price Levels
(QPLs) derived from Quantum Finance Theory (QFT). Our experimental results
revealed that our model has better profitability as well as risk control
ability with less sample complexity in the DPO problem compared to the baseline
models.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on financial applications and does not explicitly mention Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the key areas of interest.

---

## [The Devil is in Temporal Token: High Quality Video Reasoning
Segmentation](https://arxiv.org/abs/2501.08549v1)
**arXiv ID:** 2501.08549v1

**Abstract:**
> Existing methods for Video Reasoning Segmentation rely heavily on a single
special token to represent the object in the keyframe or the entire video,
inadequately capturing spatial complexity and inter-frame motion. To overcome
these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation
approach that leverages Multimodal Large Language Models (MLLMs) to inject rich
spatiotemporal features into hierarchical tokens. Our key innovations include a
Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS).
Specifically, we design frame-level <SEG> and temporal-level <TAK> tokens that
utilize MLLM's autoregressive learning to effectively capture both local and
global information. Subsequently, we apply a similarity-based weighted fusion
and frame selection strategy, then utilize SAM2 to perform keyframe
segmentation and propagation. To enhance keyframe localization accuracy, the
TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ
achieves state-of-the-art performance on ReVOS, surpassing VISA by
5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight
the strong temporal reasoning and segmentation capabilities of our method. Code
and model weights will be released at VRS-HQ.

**Decision Explanation:** The paper primarily focuses on video processing, which is one of the excluded areas according to the criteria.

---

## [Evaluating SAT and SMT Solvers on Large-Scale Sudoku Puzzles](https://arxiv.org/abs/2501.08569v1)
**arXiv ID:** 2501.08569v1

**Abstract:**
> Modern SMT solvers have revolutionized the approach to constraint
satisfaction problems by integrating advanced theory reasoning and encoding
techniques. In this work, we evaluate the performance of modern SMT solvers in
Z3, CVC5 and DPLL(T) against a standard SAT solver in DPLL. By benchmarking
these solvers on novel, diverse 25x25 Sudoku puzzles of various difficulty
levels created by our improved Sudoku generator, we examine the impact of
advanced theory reasoning and encoding techniques. Our findings demonstrate
that modern SMT solvers significantly outperform classical SAT solvers. This
work highlights the evolution of logical solvers and exemplifies the utility of
SMT solvers in addressing large-scale constraint satisfaction problems.

**Decision Explanation:** The paper does not meet the criteria as it focuses on evaluating SAT and SMT solvers on Sudoku puzzles, which does not involve Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.
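
The standard SMT encoding of Sudoku is short enough to show. A minimal sketch using the z3-solver Python bindings (an assumed toolchain; the paper benchmarks Z3, CVC5, and DPLL(T) on 25x25 instances, shrunk here to 9x9):

```python
from z3 import And, Distinct, Int, Solver, sat

def sudoku_solver(n: int = 9):
    """Encode an empty n x n Sudoku (n a perfect square) as integer constraints."""
    b = int(n ** 0.5)
    grid = [[Int(f"c_{r}_{c}") for c in range(n)] for r in range(n)]
    s = Solver()
    # Cell domains, then row / column / box all-different constraints.
    s.add([And(1 <= grid[r][c], grid[r][c] <= n) for r in range(n) for c in range(n)])
    s.add([Distinct(row) for row in grid])
    s.add([Distinct([grid[r][c] for r in range(n)]) for c in range(n)])
    s.add([Distinct([grid[br * b + i][bc * b + j] for i in range(b) for j in range(b)])
           for br in range(b) for bc in range(b)])
    return s, grid

s, grid = sudoku_solver(9)   # pass 25 for the paper's 25x25 setting
print(s.check() == sat)      # an empty board is satisfiable
```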

---

## [Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in
Large Language Models](https://arxiv.org/abs/2501.08618v1)
**arXiv ID:** 2501.08618v1

**Abstract:**
> All natural languages are structured hierarchically. In humans, this
structural restriction is neurologically coded: when two grammars are presented
with identical vocabularies, brain areas responsible for language processing
are only sensitive to hierarchical grammars. Using large language models
(LLMs), we investigate whether such functionally distinct hierarchical
processing regions can arise solely from exposure to large-scale language
distributions. We generate inputs using English, Italian, Japanese, or nonce
words, varying the underlying grammars to conform to either hierarchical or
linear/positional rules. Using these grammars, we first observe that language
models show distinct behaviors on hierarchical versus linearly structured
inputs. Then, we find that the components responsible for processing
hierarchical grammars are distinct from those that process linear grammars; we
causally verify this in ablation experiments. Finally, we observe that
hierarchy-selective components are also active on nonce grammars; this suggests
that hierarchy sensitivity is not tied to meaning, nor in-distribution inputs.

**Decision Explanation:** The paper does not meet the criteria for practical applications of Large Language Models, as it focuses on the internal mechanisms of language processing in LLMs rather than real-world applications or experimental results with quantitative metrics.

---

## [ViBidirectionMT-Eval: Machine Translation for Vietnamese-Chinese and
Vietnamese-Lao language pair](https://arxiv.org/abs/2501.08621v1)
**arXiv ID:** 2501.08621v1

**Abstract:**
> This paper presents the results of the VLSP 2022-2023 Machine Translation
Shared Tasks, focusing on Vietnamese-Chinese and Vietnamese-Lao machine
translation. The tasks were organized as part of the 9th and 10th annual
workshops on Vietnamese Language and Speech Processing (VLSP 2022, VLSP 2023).
The objective of the shared task was to build machine translation systems,
specifically targeting Vietnamese-Chinese and Vietnamese-Lao translation
(corresponding to 4 translation directions). The submissions were evaluated on
1,000 pairs for testing (news and general domains) using established metrics
like BLEU [11] and SacreBLEU [12]. Additionally, system outputs were also
evaluated with human judgment provided by experts in Chinese and Lao languages.
These human assessments played a crucial role in ranking the performance of the
machine translation models, ensuring a more comprehensive evaluation.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on machine translation, which is not explicitly mentioned as a relevant area, and does not clearly involve Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Reassessing the Role of Chain-of-Thought in Sentiment Analysis: Insights
and Limitations](https://arxiv.org/abs/2501.08641v1)
**arXiv ID:** 2501.08641v1

**Abstract:**
> The relationship between language and thought remains an unresolved
philosophical issue. Existing viewpoints can be broadly categorized into two
schools: one asserting their independence, and another arguing that language
constrains thought. In the context of large language models, this debate raises
a crucial question: Does a language model's grasp of semantic meaning depend on
thought processes? To explore this issue, we investigate whether reasoning
techniques can facilitate semantic understanding. Specifically, we
conceptualize thought as reasoning, employ chain-of-thought prompting as a
reasoning technique, and examine its impact on sentiment analysis tasks. The
experiments show that chain-of-thought has a minimal impact on sentiment
analysis tasks. Both the standard and chain-of-thought prompts focus on aspect
terms rather than sentiment in the generated content. Furthermore,
counterfactual experiments reveal that the model's handling of sentiment tasks
primarily depends on information from demonstrations. The experimental results
support the first viewpoint.

**Decision Explanation:** The paper does not meet the criteria for practical applications of Large Language Models, nor does it include experimental results with quantitative metrics that demonstrate performance improvements in areas like knowledge graphs or retrieval-augmented generation.

---

## [Application of Deep Reinforcement Learning to UAV Swarming for Ground
Surveillance](https://arxiv.org/abs/2501.08655v1)
**arXiv ID:** 2501.08655v1

**Abstract:**
> This paper summarizes in depth the state of the art of aerial swarms,
covering both classical and new reinforcement-learning-based approaches for
their management. Then, it proposes a hybrid AI system, integrating deep
reinforcement learning in a multi-agent centralized swarm architecture. The
proposed system is tailored to perform surveillance of a specific area,
searching and tracking ground targets, for security and law enforcement
applications. The swarm is governed by a central swarm controller responsible
for distributing different search and tracking tasks among the cooperating
UAVs. Each UAV agent is then controlled by a collection of cooperative
sub-agents, whose behaviors have been trained using different deep
reinforcement learning models, tailored for the different task types proposed
by the swarm controller. More specifically, proximal policy optimization (PPO)
algorithms were used to train the agents' behavior. In addition, several
metrics to assess the performance of the swarm in this application were
defined. The results obtained through simulation show that our system searches
the operation area effectively, acquires the targets in a reasonable time, and
is capable of tracking them continuously and consistently.

**Decision Explanation:** The paper primarily focuses on law enforcement applications and does not explicitly mention Large Language Models (LLMs) or their practical applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus not meeting the required criteria.

---


## [SPEQ: Stabilization Phases for Efficient Q-Learning in High
Update-To-Data Ratio Reinforcement Learning](https://arxiv.org/abs/2501.08669v1)
**arXiv ID:** 2501.08669v1

**Abstract:**
> A key challenge in Deep Reinforcement Learning is sample efficiency,
especially in real-world applications where collecting environment interactions
is expensive or risky. Recent off-policy algorithms improve sample efficiency
by increasing the Update-To-Data (UTD) ratio and performing more gradient
updates per environment interaction. While this improves sample efficiency, it
significantly increases computational cost due to the higher number of gradient
updates required. In this paper we propose a sample-efficient method to improve
computational efficiency by separating training into distinct learning phases
in order to exploit gradient updates more effectively. Our approach builds on
top of the Dropout Q-Functions (DroQ) algorithm and alternates between an
online, low UTD ratio training phase, and an offline stabilization phase.
During the stabilization phase, we fine-tune the Q-functions without collecting
new environment interactions. This process improves the effectiveness of the
replay buffer and reduces computational overhead. Our experimental results on
continuous control problems show that our method achieves results comparable to
state-of-the-art, high UTD ratio algorithms while requiring 56% fewer gradient
updates and 50% less training time than DroQ. Our approach offers an effective
and computationally economical solution while maintaining the same sample
efficiency as the more costly, high UTD ratio state-of-the-art.
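
The alternating schedule described above can be pictured as a plain training loop (a minimal sketch; `env_step` and `gradient_update` are illustrative stand-ins for a full DroQ-style agent, and the phase lengths are arbitrary placeholders):

```python
from typing import Callable

def speq_schedule(total_env_steps: int,
                  env_step: Callable[[], None],
                  gradient_update: Callable[[], None],
                  online_utd: int = 1,
                  stabilize_every: int = 10_000,
                  stabilize_updates: int = 50_000) -> None:
    """Alternate a low-UTD online phase with offline stabilization phases.

    Sketch of the schedule only: `env_step` collects one environment
    interaction into the replay buffer, and `gradient_update` performs one
    Q-function update from the buffer.
    """
    for step in range(1, total_env_steps + 1):
        env_step()                          # online: interact with the env
        for _ in range(online_utd):         # low update-to-data ratio
            gradient_update()
        if step % stabilize_every == 0:
            # Offline stabilization: fine-tune the Q-functions on the
            # existing replay buffer, collecting no new interactions.
            for _ in range(stabilize_updates):
                gradient_update()
```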

**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications, but rather on reinforcement learning and Q-learning, which is outside the specified scope.

---

## [ToMATO: Verbalizing the Mental States of Role-Playing LLMs for
Benchmarking Theory of Mind](https://arxiv.org/abs/2501.08838v1)
**arXiv ID:** 2501.08838v1

**Abstract:**
> Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in
three aspects: 1) they assess a limited range of mental states such as beliefs,
2) false beliefs are not comprehensively explored, and 3) the diverse
personality traits of characters are overlooked. To address these challenges,
we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over
conversations. ToMATO is generated via LLM-LLM conversations featuring
information asymmetry. By employing a prompting method that requires
role-playing LLMs to verbalize their thoughts before each utterance, we capture
both first- and second-order mental states across five categories: belief,
intention, desire, emotion, and knowledge. These verbalized thoughts serve as
answers to questions designed to assess the mental states of characters within
conversations. Furthermore, the information asymmetry introduced by hiding
thoughts from others induces the generation of false beliefs about various
mental states. Assigning distinct personality traits to LLMs further
diversifies both utterances and thoughts. ToMATO consists of 5.4k questions,
753 conversations, and 15 personality trait patterns. Our analysis shows that
this dataset construction approach frequently generates false beliefs due to
the information asymmetry between role-playing LLMs, and effectively reflects
diverse personalities. We evaluate nine LLMs on ToMATO and find that even
GPT-4o mini lags behind human performance, especially in understanding false
beliefs, and lacks robustness to various personality traits.

**Decision Explanation:** The paper primarily focuses on benchmarking Theory of Mind in role-playing LLMs, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and does not demonstrate clear real-world applications or comparisons with state-of-the-art techniques in these areas.

---

## [Exploring Task-Level Optimal Prompts for Visual In-Context Learning](https://arxiv.org/abs/2501.08841v1)
**arXiv ID:** 2501.08841v1

**Abstract:**
> With the development of Vision Foundation Models (VFMs) in recent years,
Visual In-Context Learning (VICL) has become a better choice compared to
modifying models in most scenarios. Different from retraining or fine-tuning a
model, VICL does not require modifications to the model's weights or
architecture, and only needs a prompt with demonstrations to teach the VFM how
to solve tasks. Currently, the significant computational cost of finding optimal
prompts for every test sample hinders the deployment of VICL, as determining
which demonstrations to use for constructing prompts is very costly. In this
paper, however, we find a counterintuitive phenomenon: most test samples
actually achieve optimal performance under the same prompts, and searching for
sample-level prompts only costs more time but results in completely identical
prompts. Therefore, we propose task-level prompting to reduce the cost of
searching for prompts during the inference stage and introduce two time-saving
yet effective task-level prompt search strategies. Extensive experimental
results show that our proposed method can identify near-optimal prompts and
reach the best VICL performance at a minimal cost that prior work has never
achieved.
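
The proposed shift from sample-level to task-level search amounts to choosing one prompt for the whole task by validation score. A minimal sketch of that selection (the `score` callable, which would run the VFM on a prompt/sample pair, is an assumption for illustration, not the paper's API):

```python
from typing import Callable, Sequence

def select_task_level_prompt(candidates: Sequence,
                             val_samples: Sequence,
                             score: Callable) -> object:
    """Return the single prompt with the best mean validation score.

    One search over candidate prompts replaces a per-test-sample search;
    the chosen prompt is then reused for every test sample.
    """
    def mean_score(prompt) -> float:
        return sum(score(prompt, s) for s in val_samples) / len(val_samples)

    return max(candidates, key=mean_score)
```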

**Decision Explanation:** The paper primarily focuses on visual in-context learning and prompt engineering for Vision Foundation Models, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Digital Phenotyping for Adolescent Mental Health: A Feasibility Study
Employing Machine Learning to Predict Mental Health Risk From Active and
Passive Smartphone Data](https://arxiv.org/abs/2501.08851v1)
**arXiv ID:** 2501.08851v1

**Abstract:**
> Background: Adolescents are particularly vulnerable to mental disorders, with
over 75% of cases manifesting before the age of 25. Research indicates that
only 18 to 34% of young people experiencing high levels of depression or
anxiety symptoms seek support. Digital tools leveraging smartphones offer
scalable and early intervention opportunities. Objective: Using a novel machine
learning framework, this study evaluated the feasibility of integrating active
and passive smartphone data to predict mental disorders in non-clinical
adolescents. Specifically, we investigated the utility of the Mindcraft app in
predicting risks for internalising and externalising disorders, eating
disorders, insomnia and suicidal ideation. Methods: Participants (N=103; mean
age 16.1 years) were recruited from three London schools. Participants
completed the Strengths and Difficulties Questionnaire, the Eating Disorders-15
Questionnaire, Sleep Condition Indicator Questionnaire and indicated the
presence/absence of suicidal ideation. They used the Mindcraft app for 14 days,
contributing active data via self-reports and passive data from smartphone
sensors. A contrastive pretraining phase was applied to enhance user-specific
feature stability, followed by supervised fine-tuning. The model evaluation
employed leave-one-subject-out cross-validation using balanced accuracy as the
primary metric. Results: The integration of active and passive data achieved
superior performance compared to individual data sources, with mean balanced
accuracies of 0.71 for SDQ-High risk, 0.67 for insomnia, 0.77 for suicidal
ideation and 0.70 for eating disorders. The contrastive learning framework
stabilised daily behavioural representations, enhancing predictive robustness.
This study demonstrates the potential of integrating active and passive
smartphone data with advanced machine-learning techniques for predicting mental
health risks.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically mental health, which is excluded according to the criteria.

---

## [Silent Abandonment in Text-Based Contact Centers: Identifying,
Quantifying, and Mitigating its Operational Impacts](https://arxiv.org/abs/2501.08869v2)
**arXiv ID:** 2501.08869v2

**Abstract:**
> In the quest to improve services, companies offer customers the option to
interact with agents via texting. Such contact centers face unique challenges
compared to traditional call centers, as measuring customer experience proxies
like abandonment and patience involves uncertainty. A key source of this
uncertainty is silent abandonment, where customers leave without notifying the
system, wasting agent time and leaving their status unclear. Silent abandonment
also obscures whether a customer was served or left. Our goals are to measure
the magnitude of silent abandonment and mitigate its effects. Classification
models show that 3%-70% of customers across 17 companies abandon silently. In
one study, 71.3% of abandoning customers did so silently, reducing agent
efficiency by 3.2% and system capacity by 15.3%, incurring $5,457 in annual
costs per agent. We develop an expectation-maximization (EM) algorithm to
estimate customer patience under uncertainty and identify influencing
covariates. We find that companies should use classification models to estimate
abandonment scope and our EM algorithm to assess patience. We suggest
strategies to operationally mitigate the impact of silent abandonment by
predicting suspected silent-abandonment behavior or changing service design.
Specifically, we show that while allowing customers to write while waiting in
the queue creates a missing data challenge, it also significantly increases
patience and reduces service time, leading to reduced abandonment and lower
staffing requirements.
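
To make the estimation step concrete, here is what one EM iteration looks like for right-censored patience times (a minimal sketch assuming exponentially distributed patience; the paper's EM additionally handles covariates and the uncertainty over whether a customer silently abandoned):

```python
import numpy as np

def em_censored_exponential(times: np.ndarray,
                            abandoned: np.ndarray,
                            n_iter: int = 100) -> float:
    """EM estimate of an exponential patience rate under right-censoring.

    For customers who abandoned, patience is observed exactly; for customers
    served after waiting times[i], patience is censored (it exceeds the wait).
    `abandoned` is a boolean array marking observed abandonments.
    """
    rate = 1.0 / times.mean()                       # crude initialization
    for _ in range(n_iter):
        # E-step: expected patience per customer. For a censored Exp(rate)
        # observation, E[T | T > t] = t + 1/rate by memorylessness.
        expected = np.where(abandoned, times, times + 1.0 / rate)
        # M-step: maximum-likelihood rate given the completed data.
        rate = len(times) / expected.sum()
    return rate
```

For the exponential case this iteration converges to the closed-form censored-data MLE; the value of EM in the paper's setting comes from the richer missing-data structure.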

**Decision Explanation:** The paper does not meet the criteria for practical applications of Large Language Models (LLMs) and does not mention LLMs, knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, focusing instead on operational impacts in text-based contact centers.

---

## [Projection Implicit Q-Learning with Support Constraint for Offline
Reinforcement Learning](https://arxiv.org/abs/2501.08907v1)
**arXiv ID:** 2501.08907v1

**Abstract:**
> Offline Reinforcement Learning (RL) faces a critical challenge of
extrapolation errors caused by out-of-distribution (OOD) actions. The Implicit
Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample
learning, effectively mitigating the risks associated with OOD actions.
However, the fixed hyperparameter in policy evaluation and the density-based
policy improvement method limit its overall efficiency. In this paper, we
propose Proj-IQL, a projective IQL algorithm enhanced with a support
constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step
approach to a multi-step approach through vector projection, while maintaining
the in-sample learning and expectile regression framework. In the policy
improvement phase, Proj-IQL introduces a support constraint that is better
aligned with the policy evaluation approach. Furthermore, we theoretically
demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a
progressively more rigorous criterion for superior actions. Empirical results
demonstrate that Proj-IQL achieves state-of-the-art performance on D4RL
benchmarks, especially in challenging navigation domains.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on offline reinforcement learning and does not explicitly mention Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Modeling Melt Pool Features and Spatter Using Symbolic Regression and
Machine Learning](https://arxiv.org/abs/2501.08922v1)
**arXiv ID:** 2501.08922v1

**Abstract:**
> Additive manufacturing (AM) is a rapidly evolving technology that has
attracted applications across a wide range of fields due to its ability to
fabricate complex geometries. However, one of the key challenges in AM is
achieving consistent print quality. This inconsistency is often attributed to
uncontrolled melt pool dynamics, partly caused by spatter, which can lead to
defects. Therefore, capturing and controlling the evolution of the melt pool is
crucial for enhancing process stability and part quality. In this study, we
developed a framework to support decision-making in AM operations, facilitating
quality control and minimizing defects via machine learning (ML) and polynomial
symbolic regression models. We implemented experimentally validated
computational tools as a cost-effective approach to collect large datasets from
laser powder bed fusion (LPBF) processes. For a dataset consisting of 281
process conditions, parameters such as melt pool dimensions (length, width,
depth), melt pool geometry (area, volume), and volume indicated as spatter were
extracted. Using machine learning (ML) and polynomial symbolic regression
models, a high R2 of over 95% was achieved in predicting the melt pool
dimensions and geometry features for both the training and testing datasets,
with either process conditions (power and velocity) or melt pool dimensions as
the model inputs. In the case of volume indicated as spatter, R2 improved after
logarithmically transforming the model inputs, which were either the process
conditions or the melt pool dimensions. Among the investigated ML models, the
ExtraTree model achieved the highest R2 values of 96.7% and 87.5%.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on additive manufacturing and machine learning applications, without any mention of Large Language Models (LLMs) or related areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Visual WetlandBirds Dataset: Bird Species Identification and Behavior
Recognition in Videos](https://arxiv.org/abs/2501.08931v1)
**arXiv ID:** 2501.08931v1

**Abstract:**
> The current biodiversity loss crisis makes animal monitoring a relevant field
of study. In light of this, data collected through monitoring can provide
essential insights and information for decision-making aimed at preserving
global biodiversity. Despite the importance of such data, there is a notable
scarcity of datasets featuring videos of birds, and none of the existing
datasets offer detailed annotations of bird behaviors in video format. In
response to this gap, our study introduces the first fine-grained video dataset
specifically designed for bird behavior detection and species classification.
This dataset addresses the need for comprehensive bird video datasets and
provides detailed data on bird actions, facilitating the development of deep
learning models to recognize these, similar to the advancements made in human
action recognition. The proposed dataset comprises 178 videos recorded in
Spanish wetlands, capturing 13 different bird species performing 7 distinct
behavior classes. In addition, we also present baseline results using
state-of-the-art models on two tasks: bird behavior recognition and species
classification.

**Decision Explanation:** The paper primarily focuses on video processing, which is excluded according to the criteria, and does not demonstrate a clear connection to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Analyzing the Ethical Logic of Six Large Language Models](https://arxiv.org/abs/2501.08951v1)
**arXiv ID:** 2501.08951v1

**Abstract:**
> This study examines the ethical reasoning of six prominent generative large
language models: OpenAI GPT-4o, Meta LLaMA 3.1, Perplexity, Anthropic Claude
3.5 Sonnet, Google Gemini, and Mistral 7B. The research explores how these
models articulate and apply ethical logic, particularly in response to moral
dilemmas such as the Trolley Problem and the Heinz Dilemma. Departing from
traditional alignment studies, the study adopts an explainability-transparency
framework, prompting models to explain their ethical reasoning. This approach
is analyzed through three established ethical typologies: the
consequentialist-deontological analytic, Moral Foundations Theory, and the
Kohlberg Stages of Moral Development Model. Findings reveal that LLMs exhibit
largely convergent ethical logic, marked by a rationalist, consequentialist
emphasis, with decisions often prioritizing harm minimization and fairness.
Despite similarities in pre-training and model architecture, a mixture of
nuanced and significant differences in ethical reasoning emerge across models,
reflecting variations in fine-tuning and post-training processes. The models
consistently display erudition, caution, and self-awareness, presenting ethical
reasoning akin to a graduate-level discourse in moral philosophy. In striking
uniformity, these systems all describe their ethical reasoning as more
sophisticated than what is characteristic of typical human moral logic.

**Decision Explanation:** The paper primarily focuses on the ethical reasoning of Large Language Models, which falls under responsible AI and AI ethics, an explicitly excluded topic under the criteria.

---

## [Kolmogorov-Arnold Networks for Time Series Granger Causality Inference](https://arxiv.org/abs/2501.08958v1)
**arXiv ID:** 2501.08958v1

**Abstract:**
> We introduce Granger Causality Kolmogorov-Arnold Networks (GCKAN), an
innovative architecture that extends the recently proposed Kolmogorov-Arnold
Networks (KAN) to the domain of causal inference. By extracting base weights
from KAN layers and incorporating a sparsity-inducing penalty along with
ridge regularization, GCKAN infers Granger causality from time series while
enabling automatic time lag selection. Additionally, we propose an algorithm
leveraging time-reversed Granger causality to enhance inference accuracy. The
algorithm compares prediction and sparsity-inducing losses derived from the
original and time-reversed series, automatically selecting the causal
relationship with the higher score or integrating the results to mitigate
spurious connectivities. Comprehensive experiments conducted on Lorenz-96, gene
regulatory networks, fMRI BOLD signals, and VAR datasets demonstrate that the
proposed model achieves competitive performance to state-of-the-art methods in
inferring Granger causality from nonlinear, high-dimensional, and
limited-sample time series.
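
For readers new to the setup, the underlying inference problem can be illustrated with a much simpler linear stand-in: fit a lagged regression with a sparsity penalty and read Granger-causal links off the nonzero coefficients. The sketch below is that VAR-plus-lasso baseline, not the paper's KAN-based method:

```python
import numpy as np
from sklearn.linear_model import Lasso

def granger_matrix(series: np.ndarray, max_lag: int = 5,
                   alpha: float = 0.05) -> np.ndarray:
    """Sparse linear Granger-causality baseline (VAR + lasso).

    `series` has shape (T, n_vars). causes[j, i] is True when series j is
    flagged as Granger-causing series i, i.e., some lagged coefficient of
    j survives the lasso when predicting i.
    """
    T, n = series.shape
    # Column block k holds all variables at lag k+1, aligned with targets.
    X = np.hstack([series[max_lag - k - 1:T - k - 1] for k in range(max_lag)])
    causes = np.zeros((n, n), dtype=bool)
    for i in range(n):
        model = Lasso(alpha=alpha).fit(X, series[max_lag:, i])
        coefs = model.coef_.reshape(max_lag, n)   # rows: lags, cols: sources
        causes[:, i] = (np.abs(coefs) > 1e-8).any(axis=0)
    return causes
```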

**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications, and instead proposes a new architecture for time series Granger causality inference, which is outside the specified scope.

---

## [An analysis of data variation and bias in image-based dermatological
datasets for machine learning classification](https://arxiv.org/abs/2501.08962v1)
**arXiv ID:** 2501.08962v1

**Abstract:**
> AI algorithms have become valuable in aiding professionals in healthcare. The
increasing confidence obtained by these models is helpful in critical decision
demands. In clinical dermatology, classification models can detect malignant
lesions on patients' skin using only RGB images as input. However, most
learning-based methods employ data acquired from dermoscopic datasets for
training, which are large and validated by a gold standard. Clinical models aim
to deal with classification on users' smartphone cameras, which do not provide
the resolution available with dermoscopy. Clinical applications also bring new
challenges: captures from uncontrolled environments, skin tone variations,
viewpoint changes, noise in data and labels, and unbalanced classes. A possible
alternative would be to use transfer learning to deal with the clinical images.
However, as the number of samples is low, it can degrade the model's
performance, since the source distribution used in training differs from the
test set. This work aims to evaluate the gap between dermoscopic and clinical
samples and understand how the dataset variations impact training. It assesses
the main differences between distributions that disturb the model's prediction.
Finally, from experiments on different architectures, we argue how to combine
the data from divergent distributions, decreasing the impact on the model's
final accuracy.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [How Do Generative Models Draw a Software Engineer? A Case Study on
Stable Diffusion Bias](https://arxiv.org/abs/2501.09014v1)
**arXiv ID:** 2501.09014v1

**Abstract:**
> Generative models are nowadays widely used to generate graphical content used
for multiple purposes, e.g. web, art, advertisement. However, it has been shown
that the images generated by these models could reinforce societal biases
already existing in specific contexts. In this paper, we focus on understanding
if this is the case when one generates images related to various software
engineering tasks. In fact, the Software Engineering (SE) community is not
immune from gender and ethnicity disparities, which could be amplified by the
use of these models. Hence, if used without awareness, artificially
generated images could reinforce these biases in the SE domain. Specifically,
we perform an extensive empirical evaluation of the gender and ethnicity bias
exposed by three versions of the Stable Diffusion (SD) model (a very popular
open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We
obtain 6,720 images by feeding each model with two sets of prompts describing
different software-related tasks: one set includes the Software Engineer
keyword, and one set does not include any specification of the person
performing the task. Next, we evaluate the gender and ethnicity disparities in
the generated images. Results show how all models are significantly biased
towards male figures when representing software engineers. On the contrary,
while SD 2 and SD XL are strongly biased towards White figures, SD 3 is
slightly more biased towards Asian figures. Nevertheless, all models
significantly under-represent Black and Arab figures, regardless of the prompt
style used. The results of our analysis highlight severe concerns about
adopting those models to generate content for SE tasks and open the field for
future research on bias mitigation in this context.

**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to diversity and social harm, which is one of the excluded topics.

---

## [TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive
Learning for Unsupervised Person Re-identification](https://arxiv.org/abs/2501.09044v1)
**arXiv ID:** 2501.09044v1

**Abstract:**
> This paper proposes the ViT Token Constraint and Multi-scale Memory bank
(TCMM) method to address patch noise and feature inconsistency in
unsupervised person re-identification works. Many excellent methods use ViT
features to obtain pseudo labels and clustering prototypes, then train the
model with contrastive learning. However, ViT processes images by performing
patch embedding, which inevitably introduces noise in patches and may
compromise the performance of the re-identification model. On the other hand,
previous memory bank based contrastive methods may lead to data inconsistency
due to the limitation of batch size. Furthermore, existing pseudo label methods
often discard outlier samples that are difficult to cluster. This sacrifices
the potential value of outlier samples, leading to limited model diversity and
robustness. This paper introduces the ViT Token Constraint to mitigate the
damage caused by patch noise to the ViT architecture. The proposed Multi-scale
Memory enhances the exploration of outlier samples and maintains feature
consistency. Experimental results demonstrate that our system achieves
state-of-the-art performance on common benchmarks. The project is available at
https://github.com/andy412510/TCMM.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on person re-identification, which is a computer vision task, and does not involve Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Generating Realistic Synthetic Head Rotation Data for Extended Reality
using Deep Learning](https://arxiv.org/abs/2501.09050v1)
**arXiv ID:** 2501.09050v1

**Abstract:**
> Extended Reality is a revolutionary method of delivering multimedia content
to users. A large contributor to its popularity is the sense of immersion and
interactivity enabled by having real-world motion reflected in the virtual
experience accurately and immediately. This user motion, mainly caused by head
rotations, induces several technical challenges. For instance, which content is
generated and transmitted depends heavily on where the user is looking.
Seamless systems, taking user motion into account proactively, will therefore
require accurate predictions of upcoming rotations. Training and evaluating
such predictors requires vast amounts of orientational input data, which is
expensive to gather, as it requires human test subjects. A more feasible
approach is to gather a modest dataset through test subjects, and then extend
it to a more sizeable set using synthetic data generation methods. In this
work, we present a head rotation time series generator based on TimeGAN, an
extension of the well-known Generative Adversarial Network, designed
specifically for generating time series. This approach is able to extend a
dataset of head rotations with new samples closely matching the distribution of
the measured time series.

**Decision Explanation:** The paper primarily focuses on generating synthetic head rotation data for Extended Reality using deep learning, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Polyp detection in colonoscopy images using YOLOv11](https://arxiv.org/abs/2501.09051v1)
**arXiv ID:** 2501.09051v1

**Abstract:**
> Colorectal cancer (CRC) is one of the most commonly diagnosed cancers all
over the world. It starts as a polyp in the inner lining of the colon. To
prevent CRC, early polyp detection is required. Colonoscopy is used for the
inspection of the colon. Generally, the images taken by the camera placed at
the tip of the endoscope are analyzed manually by experts. Various
traditional machine learning models have been used with the rise of machine
learning. Recently, deep learning models have shown more effectiveness in polyp
detection due to their superiority in generalizing and learning small features.
These deep learning models for object detection can be segregated into two
different types: single-stage and two-stage. Generally, two-stage models have
higher accuracy than single-stage ones, but single-stage models have lower
inference time. Hence, single-stage models are easy to use for quick object
detection. YOLO is one of the single-stage models used successfully for polyp
detection. It has drawn the attention of researchers because of its lower
inference time. Researchers have used different versions of YOLO so far,
and with each newer version, the accuracy of the model is increasing. This
paper aims to assess the effectiveness of the recently released YOLOv11 in
detecting polyps. We analyzed the performance of all five models of YOLOv11
(YOLO11n, YOLO11s, YOLO11m, YOLO11l, YOLO11x) with the Kvasir dataset for
training and testing. Two different versions of the dataset were used. The
first consisted of the original dataset, and the other was created using
augmentation techniques. The performance of all the models with these two
versions of the dataset has been analysed.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically polyp detection in colonoscopy images, which is not aligned with the specified criteria.

---

## [Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language
Models through Simulation and Task Decomposition](https://arxiv.org/abs/2501.09056v1)
**arXiv ID:** 2501.09056v1

**Abstract:**
> Theory of Mind (ToM) is the ability to understand and reflect on the mental
states of others. Although this capability is crucial for human interaction,
testing on Large Language Models (LLMs) reveals that they possess only a
rudimentary understanding of it. Although the most capable closed-source LLMs
have come close to human performance on some ToM tasks, they still perform
poorly on complex variations of the task that involve more structured
reasoning. In this work, we utilize the concept of "pretend-play", or
"Simulation Theory" from cognitive psychology to propose "Decompose-ToM":
an LLM-based inference algorithm that improves model performance on complex ToM
tasks. We recursively simulate user perspectives and decompose the ToM task
into a simpler set of functions: subject identification, question-reframing,
world model updation, and knowledge availability. We test the algorithm on
higher-order ToM tasks and a task testing for ToM capabilities in a
conversational setting, demonstrating that our approach shows significant
improvement across models compared to baseline methods while requiring minimal
prompt tuning across tasks and no additional model training.
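
The decomposition lends itself to a pipeline of one LLM call per sub-function. A minimal sketch (the `llm` callable and every prompt here are illustrative assumptions, not the paper's prompts):

```python
from typing import Callable

def decompose_tom(story: str, question: str,
                  llm: Callable[[str], str]) -> str:
    """Answer a theory-of-mind question by simulating the agent's perspective.

    Mirrors the abstract's sub-functions: subject identification, question
    reframing, world-model filtering, and answering under restricted
    knowledge. `llm` is an assumed text-in/text-out completion function.
    """
    subject = llm(f"Whose mental state is being asked about?\n"
                  f"Story: {story}\nQuestion: {question}\nAnswer with a name.")
    reframed = llm(f"Rewrite this question from {subject}'s point of view "
                   f"as a first-person question:\n{question}")
    visible = llm(f"From the story below, keep only the events that "
                  f"{subject} could plausibly observe or know about:\n{story}")
    return llm(f"You are {subject}. You know only the following:\n{visible}\n"
               f"Answer the question: {reframed}")
```

For higher-order questions ("what does A think B believes?"), the same routine would be applied recursively to the simulated perspective.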

**Decision Explanation:** The paper primarily focuses on enhancing Theory of Mind reasoning in Large Language Models, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and does not clearly discuss real-world applications or challenges involving LLMs.

---

## [Average-Reward Reinforcement Learning with Entropy Regularization](https://arxiv.org/abs/2501.09080v1)
**arXiv ID:** 2501.09080v1

**Abstract:**
> The average-reward formulation of reinforcement learning (RL) has drawn
increased interest in recent years due to its ability to solve
temporally-extended problems without discounting. Independently, RL algorithms
have benefited from entropy-regularization: an approach used to make the
optimal policy stochastic, thereby more robust to noise. Despite the distinct
benefits of the two approaches, the combination of entropy regularization with
an average-reward objective is not well-studied in the literature and there has
been limited development of algorithms for this setting. To address this gap in
the field, we develop algorithms for solving entropy-regularized average-reward
RL problems with function approximation. We experimentally validate our method,
comparing it with existing algorithms on standard benchmarks for RL.
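
For context, entropy-regularized average-reward control is commonly organized around a soft Bellman equation of roughly the following form, where $\rho^*$ is the optimal average reward and $\tau$ the entropy temperature (a standard formulation reconstructed for illustration; the paper's exact objective may differ):

```latex
\rho^* + Q(s,a) = r(s,a)
  + \mathbb{E}_{s'}\!\left[\tau \log \sum_{a'} \exp\!\big(Q(s',a')/\tau\big)\right],
\qquad
\pi^*(a \mid s) \propto \exp\!\big(Q(s,a)/\tau\big).
```

The log-sum-exp term plays the role of a soft maximum, and the average reward $\rho^*$ replaces the discount factor as the quantity that keeps the recursion well-posed.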

**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their practical applications, and instead discusses reinforcement learning with entropy regularization, which does not meet the specified criteria.

---

## [Inferring Transition Dynamics from Value Functions](https://arxiv.org/abs/2501.09081v1)
**arXiv ID:** 2501.09081v1

**Abstract:**
> In reinforcement learning, the value function is typically trained to solve
the Bellman equation, which connects the current value to future values. This
temporal dependency hints that the value function may contain implicit
information about the environment's transition dynamics. By rearranging the
Bellman equation, we show that a converged value function encodes a model of
the underlying dynamics of the environment. We build on this insight to propose
a simple method for inferring dynamics models directly from the value function,
potentially mitigating the need for explicit model learning. Furthermore, we
explore the challenges of next-state identifiability, discussing conditions
under which the inferred dynamics model is well-defined. Our work provides a
theoretical foundation for leveraging value functions in dynamics modeling and
opens a new avenue for bridging model-free and model-based reinforcement
learning.
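
The rearrangement the abstract alludes to is short enough to state for the discounted case with a state-dependent reward (a one-line reconstruction, not copied from the paper):

```latex
V^\pi(s) = r(s) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s)}\big[V^\pi(s')\big]
\quad\Longrightarrow\quad
\mathbb{E}_{s' \sim P(\cdot \mid s)}\big[V^\pi(s')\big]
  = \frac{V^\pi(s) - r(s)}{\gamma},
```

so a converged value function pins down the expected next-state value and, under the identifiability conditions the paper discusses, constrains the next state itself.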

**Decision Explanation:** The paper does not meet the criteria as it focuses on reinforcement learning and dynamics modeling, which is not directly related to Large Language Models (LLMs) or their practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Mantis Shrimp: Exploring Photometric Band Utilization in Computer Vision
Networks for Photometric Redshift Estimation](https://arxiv.org/abs/2501.09112v1)
**arXiv ID:** 2501.09112v1

**Abstract:**
> We present Mantis Shrimp, a multi-survey deep learning model for photometric
redshift estimation that fuses ultra-violet (GALEX), optical (PanSTARRS), and
infrared (UnWISE) imagery. Machine learning is now an established approach for
photometric redshift estimation, with generally acknowledged higher performance
in areas with a high density of spectroscopically identified galaxies over
template-based methods. Multiple works have shown that image-based
convolutional neural networks can outperform tabular-based color/magnitude
models. In comparison to tabular models, image models have additional design
complexities: it is largely unknown how to fuse inputs from different
instruments which have different resolutions or noise properties. The Mantis
Shrimp model estimates the conditional density estimate of redshift using
cutout images. The density estimates are well calibrated and the point
estimates perform well in the distribution of available spectroscopically
confirmed galaxies, with bias = 1e-2, scatter (NMAD = 2.44e-2), and a
catastrophic outlier rate of $\eta$ = 17.53%. We find that early fusion
approaches (e.g., resampling and stacking images from different instruments)
match the performance of late fusion approaches (e.g., concatenating latent
space representations), so that the design choice ultimately is left to the
user. Finally, we study how the models learn to use information across bands,
finding evidence that our models successfully incorporate information from all
surveys. The applicability of our model to the analysis of large populations of
galaxies is limited by the speed of downloading cutouts from external servers;
however, our model could be useful in smaller studies such as generating priors
over redshift for stellar population synthesis.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on computer vision and photometric redshift estimation, which is not related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Generative Medical Image Anonymization Based on Latent Code Projection
and Optimization](https://arxiv.org/abs/2501.09114v1)
**arXiv ID:** 2501.09114v1

**Abstract:**
> Medical image anonymization aims to protect patient privacy by removing
identifying information, while preserving the data utility to solve downstream
tasks. In this paper, we address the medical image anonymization problem with a
two-stage solution: latent code projection and optimization. In the projection
stage, we design a streamlined encoder to project input images into a latent
space and propose a co-training scheme to enhance the projection process. In
the optimization stage, we refine the latent code using two deep loss functions
designed to address the trade-off between identity protection and data utility
dedicated to medical images. Through a comprehensive set of qualitative and
quantitative experiments, we showcase the effectiveness of our approach on the
MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that
can serve as a training set for detecting lung pathologies. Source codes are
available at https://github.com/Huiyu-Li/GMIA.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A
study on Lithuanian History](https://arxiv.org/abs/2501.09154v1)
**arXiv ID:** 2501.09154v1

**Abstract:**
> In this work, we evaluated Lithuanian and general history knowledge of
multilingual Large Language Models (LLMs) on a multiple-choice
question-answering task. The models were tested on a dataset of Lithuanian
national and general history questions translated into Baltic, Nordic, and
other languages (English, Ukrainian, Arabic) to assess the knowledge sharing
from culturally and historically connected groups. We evaluated GPT-4o,
LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral
7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b).
Our results show that GPT-4o consistently outperformed all other models
across language groups, with slightly better results for Baltic and Nordic
languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b
performed well but showed weaker alignment with Baltic languages. Smaller
models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN 7B, LLaMa3.1 8B, and LLaMa3 8b)
demonstrated gaps with LT-related alignment with Baltic languages while
performing better on Nordic and other languages. The Nordic fine-tuned models
did not surpass multilingual models, indicating that shared cultural or
historical context alone does not guarantee better performance.

**Decision Explanation:** The paper primarily focuses on evaluating multilingual Large Language Models for Baltic and Nordic languages, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and also does not discuss real-world applications or challenges involving LLMs in autonomous or agentic AI scenarios.

---

## [The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and
Lithuanian Short Answer Matching](https://arxiv.org/abs/2501.09164v1)
**arXiv ID:** 2501.09164v1

**Abstract:**
> In this work, we address the challenge of evaluating large language models
(LLMs) on the short answer matching task for the Latvian and Lithuanian
languages. We introduce novel datasets consisting of 502 Latvian and 690
Lithuanian question-answer pairs. For each question-answer pair, we generated
matched and non-matched answers using a set of alteration rules specifically
designed to introduce small but meaningful changes in the text. These generated
answers serve as test cases to assess the ability of LLMs to detect subtle
differences in matching of the original answers. A subset of the datasets was
manually verified for quality and accuracy. Our results show that while larger
LLMs, such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect
performance in distinguishing matched and non-matched answers, smaller models
show more variance. For instance, LLaMa3.1 8b and EuroLLM 9b benefited from
few-shot examples, while Mistral Nemo 12b underperformed on detection of subtle
text alterations, particularly in Lithuanian, even with additional examples.
QWEN2.5 7b and Mistral 7b were able to obtain strong performance comparable to
the larger 70b models in zero- and few-shot experiments. Moreover, the
performance of Mistral 7b was weaker in few-shot experiments.

**Decision Explanation:** The paper primarily focuses on evaluating LLMs on short answer matching tasks for specific languages, which does not clearly demonstrate practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and does not explicitly address real-world challenges or novel approaches in these areas.

---

## [Grounding Text-To-Image Diffusion Models For Controlled High-Quality
Image Generation](https://arxiv.org/abs/2501.09194v1)
**arXiv ID:** 2501.09194v1

**Abstract:**
> Large-scale text-to-image (T2I) diffusion models have demonstrated
outstanding performance in synthesizing diverse high-quality visuals from
natural language text captions. Multiple layout-to-image models have been
developed to control the generation process by utilizing a broad array of
layouts such as segmentation maps, edges, and human keypoints. In this work, we
present ObjectDiffusion, a model that takes inspiration from the top
cutting-edge image generative frameworks to seamlessly condition T2I models
with new bounding-box capabilities. Specifically, we make substantial
modifications to the network architecture introduced in ControlNet to integrate
it with the condition processing and injection techniques proposed in GLIGEN.
ObjectDiffusion is initialized with pretraining parameters to leverage the
generation knowledge obtained from training on large-scale datasets. We
fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on
the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR
of 44.5, and a FID of 19.8, outperforming the current SOTA model trained on
open-source datasets in all three metrics. ObjectDiffusion demonstrates
a distinctive capability in synthesizing diverse, high-quality, high-fidelity
images that seamlessly conform to the semantic and spatial control layout.
Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits
remarkable grounding abilities in closed-set and open-set settings across a
wide variety of contexts. The qualitative assessment verifies the ability of
ObjectDiffusion to generate multiple objects of different sizes and locations.

**Decision Explanation:** The paper primarily focuses on image generation using text-to-image diffusion models, which does not meet the criteria of having practical applications of Large Language Models (LLMs) in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Interpretable Droplet Digital PCR Assay for Trustworthy Molecular
Diagnostics](https://arxiv.org/abs/2501.09218v1)
**arXiv ID:** 2501.09218v1

**Abstract:**
> Accurate molecular quantification is essential for advancing research and
diagnostics in fields such as infectious diseases, cancer biology, and genetic
disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for
achieving absolute quantification. While computational ddPCR technologies have
advanced significantly, achieving automatic interpretation and consistent
adaptability across diverse operational environments remains a challenge. To
address these limitations, we introduce the intelligent interpretable droplet
digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end
predictive models (for droplet segmentation and classification) with the GPT-4o
multimodal large language model (MLLM, for context-aware explanations and
recommendations) to automate and enhance ddPCR image analysis. This approach
surpasses the state-of-the-art models, affording 99.05% accuracy in processing
complex ddPCR images containing over 300 droplets per image with varying
signal-to-noise ratios (SNRs). By combining specialized neural networks and
large language models, the I2ddPCR assay offers a robust and adaptable solution
for absolute molecular quantification, achieving a sensitivity capable of
detecting low-abundance targets as low as 90.32 copies/µL. Furthermore, it
improves the model's transparency through detailed explanations and
troubleshooting guidance, empowering users to make informed decisions. This
innovative framework has the potential to benefit molecular diagnostics,
disease research, and clinical applications, especially in
resource-constrained settings.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---

## [SEAL: Entangled White-box Watermarks on Low-Rank Adaptation](https://arxiv.org/abs/2501.09284v1)
**arXiv ID:** 2501.09284v1

**Abstract:**
> Recently, LoRA and its variants have become the de facto strategy for
training and sharing task-specific versions of large pretrained models, thanks
to their efficiency and simplicity. However, the issue of copyright protection
for LoRA weights, especially through watermark-based techniques, remains
underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on
LoRA weights), a universal white-box watermarking for LoRA. SEAL embeds a
secret, non-trainable matrix between trainable LoRA weights, serving as a
passport to claim ownership. SEAL then entangles the passport with the LoRA
weights through training, without an extra loss for entanglement, and
distributes the finetuned weights after hiding the passport. When applying
SEAL, we observed no performance degradation across commonsense reasoning,
textual/visual instruction tuning, and text-to-image synthesis tasks. We
demonstrate that SEAL is robust against a variety of known attacks: removal,
obfuscation, and ambiguity attacks.
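
The core mechanism, a fixed passport matrix entangled between the two trainable LoRA factors, is compact enough to sketch (a minimal PyTorch sketch based only on the abstract; class and attribute names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class SealLoRALinear(nn.Module):
    """LoRA adapter with a frozen 'passport' matrix between its factors.

    The update is Delta-W = B @ passport @ A, where only A and B are
    trained; the passport is derived from a secret seed and registered as
    a non-trainable buffer.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, secret_seed: int = 0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # frozen pretrained weight
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # trainable
        g = torch.Generator().manual_seed(secret_seed)
        self.register_buffer("passport", torch.randn(rank, rank, generator=g))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.B @ self.passport @ self.A   # gradients reach B and A
        return self.base(x) + x @ delta_w.T
```

The sketch only shows the forward pass; the paper's hiding and ownership-verification protocol is not reproduced here.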

**Decision Explanation:** The paper primarily focuses on copyright protection and watermarking for LoRA weights, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and also does not discuss real-world applications or challenges involving LLMs in autonomous or agentic AI scenarios.

---

## [Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against
Model Extraction Attacks](https://arxiv.org/abs/2501.09328v1)
**arXiv ID:** 2501.09328v1

**Abstract:**
> Developing high-performance deep learning models is resource-intensive,
leading model owners to utilize Machine Learning as a Service (MLaaS) platforms
instead of publicly releasing their models. However, malicious users may
exploit query interfaces to execute model extraction attacks, reconstructing
the target model's functionality locally. While prior research has investigated
triggerable watermarking techniques for asserting ownership, existing methods
face significant challenges: (1) most approaches require additional training,
resulting in high overhead and limited flexibility, and (2) they often fail to
account for advanced attackers, leaving them vulnerable to adaptive attacks.
In this paper, we propose Neural Honeytrace, a robust plug-and-play
watermarking framework against model extraction attacks. We first formulate a
watermark transmission model from an information-theoretic perspective,
providing an interpretable account of the principles and limitations of
existing triggerable watermarking. Guided by the model, we further introduce:
(1) a similarity-based training-free watermarking method for plug-and-play and
flexible watermarking, and (2) a distribution-based multi-step watermark
information transmission strategy for robust watermarking. Comprehensive
experiments on four datasets demonstrate that Neural Honeytrace outperforms
previous methods in efficiency and resisting adaptive attacks. Neural
Honeytrace reduces the average number of samples required for a worst-case
t-Test-based copyright claim from $12,000$ to $200$ with zero training cost.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on model extraction attacks and watermarking, which is not directly related to the practical applications of Large Language Models (LLMs) or their integration with technologies like knowledge graphs.

---

## [Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained
Analysis](https://arxiv.org/abs/2501.09333v1)
**arXiv ID:** 2501.09333v1

**Abstract:**
> We present a simple usage of pre-trained Vision Transformers (ViTs) for
fine-grained analysis, aiming to identify and localize the traits that
distinguish visually similar categories, such as different bird species or dog
breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to
extract localized, informative features. However, using saliency maps like
Grad-CAM can hardly point out the traits: they often locate the whole object by
a blurred, coarse heatmap, not traits. We propose a novel approach, Prompt Class
Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific
prompts to a pre-trained ViT and uses the corresponding outputs for
classification. To classify an image correctly, the true-class prompt must
attend to the unique image patches not seen in other classes' images, i.e.,
traits. As such, the true class's multi-head attention maps reveal traits and
their locations. Implementation-wise, Prompt-CAM is almost a free lunch,
obtained by simply modifying the prediction head of Visual Prompt Tuning (VPT).
This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other
interpretable methods that design specific models and training processes. It is
even simpler than the recently published INterpretable TRansformer (INTR),
whose encoder-decoder architecture prevents it from leveraging pre-trained
ViTs. Extensive empirical studies on a dozen datasets from various domains
(e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate
Prompt-CAM's superior interpretation capability.

**Decision Explanation:** The paper primarily focuses on computer vision and image analysis using Vision Transformers, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Style4Rec: Enhancing Transformer-based E-commerce Recommendation Systems
with Style and Shopping Cart Information](https://arxiv.org/abs/2501.09354v1)
**arXiv ID:** 2501.09354v1

**Abstract:**
> Understanding users' product preferences is essential to the efficacy of a
recommendation system. Precision marketing leverages users' historical data to
discern these preferences and recommends products that align with them.
However, recent browsing and purchase records might better reflect current
purchasing inclinations. Transformer-based recommendation systems have made
strides in sequential recommendation tasks, but they often fall short in
utilizing product image style information and shopping cart data effectively.
In light of this, we propose Style4Rec, a transformer-based e-commerce
recommendation system that harnesses style and shopping cart information to
enhance existing transformer-based sequential product recommendation systems.
Style4Rec represents a significant step forward in personalized e-commerce
recommendations, outperforming benchmarks across various evaluation metrics.
Style4Rec resulted in notable improvements: HR@5 increased from 0.681 to 0.735,
NDCG@5 increased from 0.594 to 0.674, and MRR@5 increased from 0.559 to 0.654.
We tested our model using an e-commerce dataset from our partnering company and
found that it exceeded established transformer-based sequential recommendation
benchmarks across various evaluation metrics. Thus, Style4Rec presents a
significant step forward in personalized e-commerce recommendation systems.
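
For reference, the reported numbers are standard top-K ranking metrics; for a single test interaction with one ground-truth item they reduce to the following (a minimal sketch; in practice the values are averaged over all test interactions):

```python
import math

def ranking_metrics_at_k(ranked_items: list, target, k: int = 5):
    """HR@k, NDCG@k, and MRR@k for one test interaction with a single
    ground-truth item."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(target)          # 0-based position of the hit
    hr = 1.0                            # hit ratio: target appears in top k
    ndcg = 1.0 / math.log2(rank + 2)    # DCG of a single hit; ideal DCG = 1
    mrr = 1.0 / (rank + 1)              # reciprocal rank
    return hr, ndcg, mrr
```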

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on e-commerce recommendation systems, which is not explicitly mentioned in the criteria, and lacks clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Predicting Air Temperature from Volumetric Urban Morphology with Machine
Learning](https://arxiv.org/abs/2501.09469v1)
**arXiv ID:** 2501.09469v1

**Abstract:**
> In this study, we firstly introduce a method that converts CityGML data into
voxels which works efficiently and fast in high resolution for large scale
datasets such as cities but by sacrificing some building details to overcome
the limitations of previous voxelization methodologies that have been
computationally intensive and inefficient at transforming large-scale urban
areas into voxel representations for high resolution. Those voxelized 3D city
data from multiple cities and corresponding air temperature data are used to
develop a machine learning model. Before the model training, Gaussian blurring
is implemented on input data to consider spatial relationships; as a result, the
correlation rate between air temperature and volumetric building morphology is
also increased after the Gaussian blurring. After the model training, the
prediction results are not just evaluated with Mean Square Error (MSE) but some
image similarity metrics such as Structural Similarity Index Measure (SSIM) and
Learned Perceptual Image Patch Similarity (LPIPS) that are able to detect and
consider spatial relations during the evaluation process. This trained model is
capable of predicting the spatial distribution of air temperature by using
building volume information of the corresponding pixel as input. By doing so, this
research aims to assist urban planners in incorporating environmental
parameters into their planning strategies, thereby facilitating more
sustainable and inhabitable urban environments.
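
The Gaussian-blurring step described above can be illustrated with a minimal sketch, assuming a gridded building-volume input; the array shape and sigma are ours, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Illustrative sketch, not the paper's pipeline.
rng = np.random.default_rng(0)
building_volume = rng.random((128, 128))   # stand-in for a rasterized volume grid

# Smooth the input so each pixel also carries information about its
# neighborhood, letting a per-pixel model see spatial context.
blurred = gaussian_filter(building_volume, sigma=3.0)
```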
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on predicting air temperature using machine learning and volumetric urban morphology, which is not related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Class Incremental Fault Diagnosis under Limited Fault Data via
Supervised Contrastive Knowledge Distillation](https://arxiv.org/abs/2501.09525v1)
**arXiv ID:** 2501.09525v1

**Abstract:**
> Class-incremental fault diagnosis requires a model to adapt to new fault
classes while retaining previous knowledge. However, limited research exists
for imbalanced and long-tailed data. Extracting discriminative features from
few-shot fault data is challenging, and adding new fault classes often demands
costly model retraining. Moreover, incremental training of existing methods
risks catastrophic forgetting, and severe class imbalance can bias the model's
decisions toward normal classes. To tackle these issues, we introduce a
Supervised Contrastive knowledge distiLlation for class Incremental Fault
Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge
distillation for improved representation learning capability and less
forgetting, a novel prioritized exemplar selection method for sample replay to
alleviate catastrophic forgetting, and the Random Forest Classifier to address
the class imbalance. Extensive experimentation on simulated and real-world
industrial datasets across various imbalance ratios demonstrates the
superiority of SCLIFD over existing approaches. Our code can be found at
https://github.com/Zhang-Henry/SCLIFD_TII.
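
As background for the framework's first component, a minimal NumPy sketch of the supervised contrastive loss (Khosla et al.) that such distillation schemes build on is shown below; this is our illustration, not the SCLIFD code:

```python
import numpy as np

# Illustrative sketch of the supervised contrastive loss, not SCLIFD's code.
def supervised_contrastive_loss(z, labels, tau=0.1):
    """SupCon loss on L2-normalized embeddings z, one anchor per sample."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    mask_self = np.eye(n, dtype=bool)
    loss, count = 0.0, 0
    for i in range(n):
        positives = (labels == labels[i]) & ~mask_self[i]
        if not positives.any():
            continue
        # Log-normalizer of anchor i over all other samples.
        log_denom = np.log(np.exp(sim[i][~mask_self[i]]).sum())
        loss += -(sim[i][positives] - log_denom).mean()
        count += 1
    return loss / count

z = np.random.default_rng(0).normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
print(supervised_contrastive_loss(z, labels))
```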
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on fault diagnosis, which is not a specified area of interest, and does not explicitly mention Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Text-driven Adaptation of Foundation Models for Few-shot Surgical
Workflow Analysis](https://arxiv.org/abs/2501.09555v1)
**arXiv ID:** 2501.09555v1

**Abstract:**
> Purpose: Surgical workflow analysis is crucial for improving surgical
efficiency and safety. However, previous studies rely heavily on large-scale
annotated datasets, posing challenges in cost, scalability, and reliance on
expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven
Adaptation), designed to handle various surgical workflow analysis tasks with
minimal paired image-label data.
Methods: Our approach has two key components. First, Few-shot selection-based
modality alignment selects a small subset of images and aligns their embeddings
with text embeddings from the downstream task, bridging the modality gap.
Second, Text-driven adaptation leverages only text data to train a decoder,
eliminating the need for paired image-text data. This decoder is then applied
to aligned image embeddings, enabling image-related tasks without explicit
image-text pairs.
Results: We evaluate our approach to generative tasks (image captioning) and
discriminative tasks (triplet recognition and phase recognition). Results show
that Surg-FTDA outperforms baselines and generalizes well across downstream
tasks.
Conclusion: We propose a text-driven adaptation approach that mitigates the
modality gap and handles multiple downstream tasks in surgical workflow
analysis, with minimal reliance on large annotated datasets. The code and
dataset will be released in https://github.com/TingxuanSix/Surg-FTDA.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is excluded according to the criteria.

---
## [IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale
derived from Instrumented Timed Up and Go test in stroke patients](https://arxiv.org/abs/2501.09595v1)
**arXiv ID:** 2501.09595v1

**Abstract:**
> Effective fall risk assessment is critical for post-stroke patients. The
present study proposes a novel, data-informed fall risk assessment method based
on the instrumented Timed Up and Go (ITUG) test data, bringing in many mobility
measures that traditional clinical scales fail to capture. IFRA, which stands
for Instrumented Fall Risk Assessment, has been developed using a two-step
process: first, features with the highest predictive power among those
collected in an ITUG test have been identified using machine learning
techniques; then, a strategy is proposed to stratify patients into low, medium,
or high-risk strata. The dataset used in our analysis consists of 142
participants, out of which 93 were used for training (15 synthetically
generated), 17 for validation and 32 to test the resulting IFRA scale (22
non-fallers and 10 fallers). Features considered in the IFRA scale include gait
speed, vertical acceleration during sit-to-walk transition, and turning angular
velocity, which align well with established literature on the risk of fall in
neurological patients. In a comparison with traditional clinical scales such as
the traditional Timed Up & Go and the Mini-BESTest, IFRA demonstrates
competitive performance, being the only scale to correctly assign more than
half of the fallers to the high-risk stratum (Fischer's Exact test p = 0.004).
Despite the dataset's limited size, this is the first proof-of-concept study to
pave the way for future evidence regarding the use of the IFRA tool for continuous
patient monitoring and fall prevention both in clinical stroke rehabilitation
and at home post-discharge.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically fall risk assessment in stroke patients, which is excluded according to the criteria.

---
## [Reducing the Sensitivity of Neural Physics Simulators to Mesh Topology
via Pretraining](https://arxiv.org/abs/2501.09597v1)
**arXiv ID:** 2501.09597v1

**Abstract:**
> Meshes are used to represent complex objects in high fidelity physics
simulators across a variety of domains, such as radar sensing and aerodynamics.
There is growing interest in using neural networks to accelerate physics
simulations, and also a growing body of work on applying neural networks
directly to irregular mesh data. Since multiple mesh topologies can represent
the same object, mesh augmentation is typically required to handle topological
variation when training neural networks. Due to the sensitivity of physics
simulators to small changes in mesh shape, it is challenging to use these
augmentations when training neural network-based physics simulators. In this
work, we show that variations in mesh topology can significantly reduce the
performance of neural network simulators. We evaluate whether pretraining can
be used to address this issue, and find that employing an established
autoencoder pretraining technique with graph embedding models reduces the
sensitivity of neural network simulators to variations in mesh topology.
Finally, we highlight future research directions that may further reduce neural
simulator sensitivity to mesh topology.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on neural physics simulators and mesh topology, with no clear connection to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Monte Carlo Tree Search with Velocity Obstacles for safe and efficient
motion planning in dynamic environments](https://arxiv.org/abs/2501.09649v1)
**arXiv ID:** 2501.09649v1

**Abstract:**
> Online motion planning is a challenging problem for intelligent robots moving
in dense environments with dynamic obstacles, e.g., crowds. In this work, we
propose a novel approach for optimal and safe online motion planning with
minimal information about dynamic obstacles. Specifically, our approach
requires only the current position of the obstacles and their maximum speed,
but it does not need any information about their exact trajectories or dynamic
model. The proposed methodology combines Monte Carlo Tree Search (MCTS), for
online optimal planning via model simulations, with Velocity Obstacles (VO),
for obstacle avoidance. We perform experiments in a cluttered simulated
environment with walls, and up to 40 dynamic obstacles moving with random
velocities and directions. With an ablation study, we show the key contribution
of VO in scaling up the efficiency of MCTS, selecting the safest and most
rewarding actions in the tree of simulations. Moreover, we show the superiority
of our methodology with respect to state-of-the-art planners, including
Non-linear Model Predictive Control (NMPC), in terms of improved collision
rate, computational and task performance.
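
The VO component can be illustrated with a conservative velocity-feasibility check that uses exactly the information the paper assumes (current obstacle position and maximum speed); the function below is our sketch, not the authors' implementation. Inside MCTS, a check like this can prune unsafe candidate actions before they are ever simulated:

```python
import numpy as np

# Illustrative sketch, not the authors' VO implementation.
def velocity_is_safe(p_robot, v_candidate, p_obs, v_obs_max,
                     radius_sum, horizon=2.0, steps=20):
    """Reject v_candidate if the obstacle *could* reach the robot within
    the horizon, knowing only its current position and maximum speed."""
    for t in np.linspace(0.0, horizon, steps + 1)[1:]:
        robot_pos = p_robot + v_candidate * t
        # Worst case: the obstacle moves straight at the robot at full speed.
        gap = np.linalg.norm(robot_pos - p_obs) - v_obs_max * t
        if gap <= radius_sum:
            return False
    return True

print(velocity_is_safe(np.zeros(2), np.array([1.0, 0.0]),
                       np.array([5.0, 0.0]), v_obs_max=1.0, radius_sum=0.5))
```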
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on motion planning in dynamic environments, which does not involve Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP
Evaluation Benchmark](https://arxiv.org/abs/2501.09672v1)
**arXiv ID:** 2501.09672v1

**Abstract:**
> The proliferation of Vision-Language Models (VLMs) in the past several years
calls for rigorous and comprehensive evaluation methods and benchmarks. This
work analyzes existing VLM evaluation techniques, including automated metrics,
AI-based assessments, and human evaluations across diverse tasks. We first
introduce Robin - a novel suite of VLMs that we built by combining Large
Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use
Robin to identify shortcomings of current evaluation approaches across scales.
Next, to overcome the identified limitations, we introduce CHIRP - a new long
form response benchmark we developed for more robust and complete VLM
evaluation. We provide open access to the Robin training code, model suite, and
CHIRP benchmark to promote reproducibility and advance VLM research.

**Decision Explanation:** The paper primarily focuses on Vision-Language Models and their evaluation, which does not meet the criteria of focusing on Large Language Models, particularly in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Cueless EEG imagined speech for subject identification: dataset and
benchmarks](https://arxiv.org/abs/2501.09700v1)
**arXiv ID:** 2501.09700v1

**Abstract:**
> Electroencephalogram (EEG) signals have emerged as a promising modality for
biometric identification. While previous studies have explored the use of
imagined speech with semantically meaningful words for subject identification,
most have relied on additional visual or auditory cues. In this study, we
introduce a cueless EEG-based imagined speech paradigm, where subjects imagine
the pronunciation of semantically meaningful words without any external cues.
This innovative approach addresses the limitations of prior methods by
requiring subjects to select and imagine words from a predefined list
naturally. The dataset comprises over 4,350 trials from 11 subjects across five
sessions. We assess a variety of classification methods, including traditional
machine learning techniques such as Support Vector Machines (SVM) and XGBoost,
as well as time-series foundation models and deep learning architectures
specifically designed for EEG classification, such as EEG Conformer and Shallow
ConvNet. A session-based hold-out validation strategy was employed to ensure
reliable evaluation and prevent data leakage. Our results demonstrate
outstanding classification accuracy, reaching 97.93%. These findings highlight
the potential of cueless EEG paradigms for secure and reliable subject
identification in real-world applications, such as brain-computer interfaces
(BCIs).

**Decision Explanation:** The paper primarily focuses on biometric identification using EEG signals and does not meet the criteria related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [CyberMentor: AI Powered Learning Tool Platform to Address Diverse
Student Needs in Cybersecurity Education](https://arxiv.org/abs/2501.09709v1)
**arXiv ID:** 2501.09709v1

**Abstract:**
> Many non-traditional students in cybersecurity programs often lack access to
advice from peers, family members and professors, which can hinder their
educational experiences. Additionally, these students may not fully benefit
from various LLM-powered AI assistants due to issues like content relevance,
locality of advice, minimum expertise, and timing. This paper addresses these
challenges by introducing an application designed to provide comprehensive
support by answering questions related to knowledge, skills, and career
preparation advice tailored to the needs of these students. We developed a
learning tool platform, CyberMentor, to address the diverse needs and pain
points of students majoring in cybersecurity. Powered by agentic workflow and
Generative Large Language Models (LLMs), the platform leverages
Retrieval-Augmented Generation (RAG) for accurate and contextually relevant
information retrieval to achieve accessibility and personalization. We
demonstrated its value in addressing knowledge requirements for cybersecurity
education and for career marketability, in tackling skill requirements for
analytical and programming assignments, and in delivering real time on demand
learning support. Using three use scenarios, we showcased CyberMentor in
facilitating knowledge acquisition and career preparation and providing
seamless skill-based guidance and support. We also employed the LangChain
prompt-based evaluation methodology to evaluate the platform's impact,
confirming its strong performance in helpfulness, correctness, and
completeness. These results underscore the system's ability to support students
in developing practical cybersecurity skills while improving equity and
sustainability within higher education. Furthermore, CyberMentor's open-source
design allows for adaptation across other disciplines, fostering educational
innovation and broadening its potential impact.
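
The retrieval step that such a RAG-based platform depends on can be sketched generically as follows; the `embed` stub, document snippets, and prompt template are our illustrations, not CyberMentor's actual components:

```python
import numpy as np

# Illustrative RAG retrieval sketch; embed() is a stand-in, not a real model.
def embed(texts):
    """Stand-in for an embedding model: one deterministic unit vector per text."""
    out = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % 2**32)
        v = rng.normal(size=64)
        out.append(v / np.linalg.norm(v))
    return np.array(out)

def retrieve(query, documents, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    doc_vecs = embed(documents)
    q_vec = embed([query])[0]
    order = np.argsort(doc_vecs @ q_vec)[::-1]
    return [documents[i] for i in order[:k]]

docs = ["Course catalog: required cybersecurity classes ...",
        "Career guide: SOC analyst interview preparation ...",
        "Lab manual: setting up a packet-capture exercise ..."]
context = "\n".join(retrieve("How do I prepare for a SOC analyst role?", docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```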
**Decision Explanation:** The paper primarily focuses on educational applications of AI, which are not explicitly excluded but align only loosely with the preferred areas of knowledge graphs, retrieval-augmented generation, and agentic AI beyond the education setting; it also touches on diversity and equity, a social application of AI.

---
## [Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation](https://arxiv.org/abs/2501.09755v1)
**arXiv ID:** 2501.09755v1

**Abstract:**
> Visual tokenization via auto-encoding empowers state-of-the-art image and
video generative models by compressing pixels into a latent space. Although
scaling Transformer-based generators has been central to recent advances, the
tokenizer component itself is rarely scaled, leaving open questions about how
auto-encoder design choices influence both its objective of reconstruction and
downstream generative performance. Our work aims to conduct an exploration of
scaling in auto-encoders to fill in this blank. To facilitate this exploration,
we replace the typical convolutional backbone with an enhanced Vision
Transformer architecture for Tokenization (ViTok). We train ViTok on
large-scale image and video datasets far exceeding ImageNet-1K, removing data
constraints on tokenizer scaling. We first study how scaling the auto-encoder
bottleneck affects both reconstruction and generation -- and find that while it
is highly correlated with reconstruction, its relationship with generation is
more complex. We next explored the effect of separately scaling the
auto-encoders' encoder and decoder on reconstruction and generation
performance. Crucially, we find that scaling the encoder yields minimal gains
for either reconstruction or generation, while scaling the decoder boosts
reconstruction but the benefits for generation are mixed. Building on our
exploration, we design ViTok as a lightweight auto-encoder that achieves
competitive performance with state-of-the-art auto-encoders on ImageNet-1K and
COCO reconstruction tasks (256p and 512p) while outperforming existing
auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x
fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates
competitive performance on image generation for ImageNet-1K and sets new
state-of-the-art benchmarks for class-conditional video generation on UCF-101.

**Decision Explanation:** The paper primarily focuses on video processing and image generation, which are not the main areas of interest, and does not clearly meet the criteria related to Large Language Models, knowledge graphs, or agentic AI.

---
## [Exploring the Efficacy of Meta-Learning: Unveiling Superior Data
Diversity Utilization of MAML Over Pre-training](https://arxiv.org/abs/2501.08506v1)
**arXiv ID:** 2501.08506v1

**Abstract:**
> Currently, data and model size dominate the narrative in the training of
super-large, powerful models. However, there has been a lack of exploration on
the effect of other attributes of the training dataset on model performance. We
hypothesize that dataset diversity can impact the performance of vision models.
Our study shows positive correlations between test set accuracy and data
diversity, providing an argument for furthering the research of dataset
attributes beyond size. We analyzed pre-training and model-agnostic
meta-learning methods on twelve popular visual datasets (e.g., Omniglot,
CIFAR-FS, Aircraft) and five model configurations, including MAML variants with
different numbers of inner gradient steps and supervised learning. We show
moderate to strong positive correlations (R-squared: 0.15-0.42) between
accuracy and data diversity and weaker but significant correlations (R-squared:
~0.2) between loss and diversity. These findings support our hypothesis and
demonstrate a promising way for a deeper exploration of how formal data
diversity influences model performance. This initial study highlights the
potential of (Task2Vec) data diversity as a valuable measure in the rapidly
evolving field of large-scale learning and emphasizes that understanding the
dataset is key to building more powerful and generalizable models.

**Decision Explanation:** The paper primarily focuses on vision models and dataset diversity, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Easing Seasickness through Attention Redirection with a
Mindfulness-Based Brain-Computer Interface](https://arxiv.org/abs/2501.08518v1)
**arXiv ID:** 2501.08518v1

**Abstract:**
> Seasickness is a prevalent issue that adversely impacts both passenger
experiences and the operational efficiency of maritime crews. While techniques
that redirect attention have proven effective in alleviating motion sickness
symptoms in terrestrial environments, applying similar strategies to manage
seasickness poses unique challenges due to the prolonged and intense motion
environment associated with maritime travel. In this study, we propose a
mindfulness brain-computer interface (BCI), specifically designed to redirect
attention with the aim of mitigating seasickness symptoms in real-world
settings. Our system utilizes a single-channel headband to capture prefrontal
EEG signals, which are then wirelessly transmitted to computing devices for the
assessment of mindfulness states. The results are transferred into real-time
feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in
attentional focus from physiological discomfort to mindfulness practices. A
total of 43 individuals participated in a real-world maritime experiment
consisting of three sessions: a real-feedback mindfulness session, a resting
session, and a pseudofeedback mindfulness session. Notably, 81.39% of
participants reported that the mindfulness BCI intervention was effective, and
there was a significant reduction in the severity of seasickness, as measured
by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in
the theta/beta ratio, corresponding with the alleviation of seasickness
symptoms. A decrease in overall EEG band power during the real-feedback
mindfulness session suggests that the mindfulness BCI fosters a more tranquil
and downregulated state of brain activity. Together, this study presents a
novel nonpharmacological, portable, and effective approach for seasickness
intervention, with the potential to enhance the cruising experience for both
passengers and crews.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on a medical application of AI, specifically using a brain-computer interface to alleviate seasickness, which is not within the specified areas of interest such as knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Reinforcement Learning-Enhanced Procedural Generation for Dynamic
Narrative-Driven AR Experiences](https://arxiv.org/abs/2501.08552v1)
**arXiv ID:** 2501.08552v1

**Abstract:**
> Procedural Content Generation (PCG) is widely used to create scalable and
diverse environments in games. However, existing methods, such as the Wave
Function Collapse (WFC) algorithm, are often limited to static scenarios and
lack the adaptability required for dynamic, narrative-driven applications,
particularly in augmented reality (AR) games. This paper presents a
reinforcement learning-enhanced WFC framework designed for mobile AR
environments. By integrating environment-specific rules and dynamic tile weight
adjustments informed by reinforcement learning (RL), the proposed method
generates maps that are both contextually coherent and responsive to gameplay
needs. Comparative evaluations and user studies demonstrate that the framework
achieves superior map quality and delivers immersive experiences, making it
well-suited for narrative-driven AR games. Additionally, the method holds
promise for broader applications in education, simulation training, and
immersive extended reality (XR) experiences, where dynamic and adaptive
environments are critical.
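
The core mechanism, RL-adjusted tile weights steering WFC's collapse step, can be sketched as follows; the state, action, and reward definitions in the paper are richer than this illustration:

```python
import numpy as np

# Illustrative sketch of RL-weighted WFC collapse, not the paper's framework.
rng = np.random.default_rng(0)
tile_weights = np.ones(4)          # one weight per tile type, tuned by RL

def collapse_cell(allowed):
    """Pick a tile for a cell among its allowed candidates,
    sampling in proportion to the learned weights."""
    w = tile_weights * allowed     # mask out tiles forbidden by adjacency rules
    return rng.choice(len(w), p=w / w.sum())

def update_weights(tile, reward, lr=0.1):
    """Crude policy-style update: reinforce tiles that led to good maps."""
    tile_weights[tile] *= np.exp(lr * reward)

tile = collapse_cell(allowed=np.array([1, 1, 0, 1]))
update_weights(tile, reward=+1.0)  # e.g., playtest feedback or coherence score
```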
**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications, and instead discusses reinforcement learning-enhanced procedural generation for dynamic narrative-driven AR experiences, which does not meet the primary criteria.

---
## [Towards Lightweight and Stable Zero-shot TTS with Self-distilled
Representation Disentanglement](https://arxiv.org/abs/2501.08566v1)
**arXiv ID:** 2501.08566v1

**Abstract:**
> Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized
voice customization through voice cloning. However, current methods for
achieving zero-shot TTS heavily rely on large model scales and extensive
training datasets to ensure satisfactory performance and generalizability
across various speakers. This raises concerns regarding both deployment costs
and data security. In this paper, we present a lightweight and stable zero-shot
TTS system. We introduce a novel TTS architecture designed to effectively model
linguistic content and various speaker attributes from source speech and prompt
speech, respectively. Furthermore, we present a two-stage self-distillation
framework that constructs parallel data pairs for effectively disentangling
linguistic content and speakers from the perspective of training data.
Extensive experiments show that our system exhibits excellent performance and
superior stability on the zero-shot TTS tasks. Moreover, it shows markedly
superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and
GPU, respectively.

**Decision Explanation:** The paper primarily focuses on Text-To-Speech synthesis, which does not meet the criteria of practical applications of Large Language Models in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Sound Scene Synthesis at the DCASE 2024 Challenge](https://arxiv.org/abs/2501.08587v1)
**arXiv ID:** 2501.08587v1

**Abstract:**
> This paper presents Task 7 at the DCASE 2024 Challenge: sound scene
synthesis. Recent advances in sound synthesis and generative models have
enabled the creation of realistic and diverse audio content. We introduce a
standardized evaluation framework for comparing different sound scene synthesis
systems, incorporating both objective and subjective metrics. The challenge
attracted four submissions, which are evaluated using the Fréchet Audio
Distance (FAD) and human perceptual ratings. Our analysis reveals significant
insights into the current capabilities and limitations of sound scene synthesis
systems, while also highlighting areas for future improvement in this rapidly
evolving field.

**Decision Explanation:** The paper primarily focuses on sound scene synthesis, which does not meet the criteria of having practical applications of Large Language Models (LLMs) or addressing areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [OpenMLDB: A Real-Time Relational Data Feature Computation System for
Online ML](https://arxiv.org/abs/2501.08591v1)
**arXiv ID:** 2501.08591v1

**Abstract:**
> Efficient and consistent feature computation is crucial for a wide range of
online ML applications. Typically, feature computation is divided into two
distinct phases, i.e., offline stage for model training and online stage for
model serving. These phases often rely on execution engines with different
interface languages and function implementations, causing significant
inconsistencies. Moreover, many online ML features involve complex time-series
computations (e.g., functions over varied-length table windows) that differ
from standard streaming and analytical queries. Existing data processing
systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for
these computations, making them unsuitable for real-time online ML applications
that demand timely feature updates.
This paper presents OpenMLDB, a feature computation system deployed in
4Paradigm's SageOne platform and over 100 real scenarios. Technically, OpenMLDB
first employs a unified query plan generator for consistent computation results
across the offline and online stages, significantly reducing feature deployment
overhead. Second, OpenMLDB provides an online execution engine that resolves
performance bottlenecks caused by long window computations (via
pre-aggregation) and multi-table window unions (via data self-adjusting). It
also provides a high-performance offline execution engine with window parallel
optimization and time-aware data skew resolving. Third, OpenMLDB features a
compact data format and stream-focused indexing to maximize memory usage and
accelerate data access. Evaluations in testing and real workloads reveal
significant performance improvements and resource savings compared to the
baseline systems. The open community of OpenMLDB now has over 150 contributors
and gained 1.6k stars on GitHub.
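
The pre-aggregation idea for long windows, answering most of a window from maintained partial aggregates and scanning raw rows only at the edges, can be sketched like this (our simplification, not OpenMLDB's engine code):

```python
from collections import defaultdict

# Illustrative sketch of window pre-aggregation, not OpenMLDB's implementation.
BUCKET = 60_000  # bucket width in ms: one partial aggregate per minute

bucket_sums = defaultdict(float)   # incrementally maintained at ingest time
raw_rows = defaultdict(list)       # kept only to patch the window edges

def ingest(ts, value):
    bucket_sums[ts // BUCKET] += value
    raw_rows[ts // BUCKET].append((ts, value))

def window_sum(now, window_ms):
    start = now - window_ms
    lo, hi = start // BUCKET, now // BUCKET
    # Fully covered interior buckets are answered from pre-aggregates...
    total = sum(bucket_sums[b] for b in range(lo + 1, hi))
    # ...and only the (at most two) partially covered edge buckets scan raw rows.
    for b in {lo, hi}:
        total += sum(v for ts, v in raw_rows[b] if start <= ts <= now)
    return total
```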
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on a real-time relational data feature computation system for online ML, without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Fine-grained Spatio-temporal Event Prediction with Self-adaptive Anchor
Graph](https://arxiv.org/abs/2501.08653v1)
**arXiv ID:** 2501.08653v1

**Abstract:**
> Event prediction tasks often handle spatio-temporal data distributed in a
large spatial area. Different regions in the area exhibit different
characteristics while having latent correlations. This spatial heterogeneity
and correlations greatly affect the spatio-temporal distributions of event
occurrences, which has not been addressed by state-of-the-art models. Learning
spatial dependencies of events in a continuous space is challenging due to its
fine granularity and a lack of prior knowledge. In this work, we propose a
novel Graph Spatio-Temporal Point Process (GSTPP) model for fine-grained event
prediction. It adopts an encoder-decoder architecture that jointly models the
state dynamics of spatially localized regions using neural Ordinary
Differential Equations (ODEs). The state evolution is built on the foundation
of a novel Self-Adaptive Anchor Graph (SAAG) that captures spatial
dependencies. By adaptively localizing the anchor nodes in the space and
jointly constructing the correlation edges between them, the SAAG enhances the
model's ability of learning complex spatial event patterns. The proposed GSTPP
model greatly improves the accuracy of fine-grained event prediction. Extensive
experimental results show that our method greatly improves the prediction
accuracy over existing spatio-temporal event prediction approaches.

**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications, and instead discusses spatio-temporal event prediction using a Graph Spatio-Temporal Point Process model.

---
## [Self-supervised Transformation Learning for Equivariant Representations](https://arxiv.org/abs/2501.08712v1)
**arXiv ID:** 2501.08712v1

**Abstract:**
> Unsupervised representation learning has significantly advanced various
machine learning tasks. In the computer vision domain, state-of-the-art
approaches utilize transformations like random crop and color jitter to achieve
invariant representations, embedding semantically the same inputs despite
transformations. However, this can degrade performance in tasks requiring
precise features, such as localization or flower classification. To address
this, recent research incorporates equivariant representation learning, which
captures transformation-sensitive information. However, current methods depend
on transformation labels and thus struggle with interdependency and complex
transformations. We propose Self-supervised Transformation Learning (STL),
replacing transformation labels with transformation representations derived
from image pairs. The proposed method ensures transformation representation is
image-invariant and learns corresponding equivariant transformations, enhancing
performance without increased batch complexity. We demonstrate the approach's
effectiveness across diverse classification and detection tasks, outperforming
existing methods in 7 out of 11 benchmarks and excelling in detection. By
integrating complex transformations like AugMix, unusable by prior equivariant
methods, this approach enhances performance across tasks, underscoring its
adaptability and resilience. Additionally, its compatibility with various base
models highlights its flexibility and broad applicability. The code is
available at https://github.com/jaemyung-u/stl.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on computer vision and image processing, which is not within the specified areas of interest such as knowledge graphs, retrieval-augmented generation, or agentic AI, and does not explicitly mention Large Language Models (LLMs) or their applications.

---
## [How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in
Software Engineering](https://arxiv.org/abs/2501.08774v1)
**arXiv ID:** 2501.08774v1

**Abstract:**
> Artificial intelligence (AI), including large language models and generative
AI, is emerging as a significant force in software development, offering
developers powerful tools that span the entire development lifecycle. Although
software engineering research has extensively studied AI tools in software
development, the specific types of interactions between developers and these
AI-powered tools have only recently begun to receive attention. Understanding
and improving these interactions has the potential to improve productivity,
trust, and efficiency in AI-driven workflows. In this paper, we propose a
taxonomy of interaction types between developers and AI tools, identifying
eleven distinct interaction types, such as auto-complete code suggestions,
command-driven actions, and conversational assistance. Building on this
taxonomy, we outline a research agenda focused on optimizing AI interactions,
improving developer control, and addressing trust and usability challenges in
AI-assisted development. By establishing a structured foundation for studying
developer-AI interactions, this paper aims to stimulate research on creating
more effective, adaptive AI tools for software development.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on the interaction between developers and AI tools in software engineering, without clear practical applications of Large Language Models, experimental results, or comparison with state-of-the-art techniques in areas like knowledge graphs or retrieval-augmented generation.

---
## [Networked Agents in the Dark: Team Value Learning under Partial
Observability](https://arxiv.org/abs/2501.08778v1)
**arXiv ID:** 2501.08778v1

**Abstract:**
> We propose a novel cooperative multi-agent reinforcement learning (MARL)
approach for networked agents. In contrast to previous methods that rely on
complete state information or joint observations, our agents must learn how to
reach shared objectives under partial observability. During training, they
collect individual rewards and approximate a team value function through local
communication, resulting in cooperative behavior. To describe our problem, we
introduce the networked dynamic partially observable Markov game framework,
where agents communicate over a switching topology communication network. Our
distributed method, DNA-MARL, uses a consensus mechanism for local
communication and gradient descent for local computation. DNA-MARL increases
the range of the possible applications of networked agents, being well-suited
for real world domains that impose privacy and where the messages may not reach
their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our
results highlight the superior performance of DNA-MARL over previous methods.
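
The consensus mechanism at the heart of DNA-MARL, neighbors repeatedly averaging their local estimates over a communication graph, reduces to a step like the following sketch (ours, with a fixed topology for simplicity):

```python
import numpy as np

# Illustrative consensus-averaging sketch, not the DNA-MARL implementation.
def consensus_step(values, adjacency, mix=0.5):
    """One round of local averaging: each agent moves its estimate toward
    the mean of its neighbors' estimates, using only local messages."""
    values = np.asarray(values, dtype=float)
    new = values.copy()
    for i, row in enumerate(adjacency):
        neighbors = np.flatnonzero(row)
        if neighbors.size:
            new[i] = (1 - mix) * values[i] + mix * values[neighbors].mean()
    return new

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # line topology
v = np.array([0.0, 1.0, 2.0])
for _ in range(20):
    v = consensus_step(v, A)
print(v)  # local estimates contract toward a common team value
```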
**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications, and instead proposes a novel cooperative multi-agent reinforcement learning approach, which does not meet the primary criteria.

---
## [XMusic: Towards a Generalized and Controllable Symbolic Music Generation
Framework](https://arxiv.org/abs/2501.08809v1)
**arXiv ID:** 2501.08809v1

**Abstract:**
> In recent years, remarkable advancements in artificial intelligence-generated
content (AIGC) have been achieved in the fields of image synthesis and text
generation, generating content comparable to that produced by humans. However,
the quality of AI-generated music has not yet reached this standard, primarily
due to the challenge of effectively controlling musical emotions and ensuring
high-quality outputs. This paper presents a generalized symbolic music
generation framework, XMusic, which supports flexible prompts (i.e., images,
videos, texts, tags, and humming) to generate emotionally controllable and
high-quality symbolic music. XMusic consists of two core components, XProjector
and XComposer. XProjector parses the prompts of various modalities into
symbolic music elements (i.e., emotions, genres, rhythms and notes) within the
projection space to generate matching music. XComposer contains a Generator and
a Selector. The Generator generates emotionally controllable and melodious
music based on our innovative symbolic music representation, whereas the
Selector identifies high-quality symbolic music by constructing a multi-task
learning scheme involving quality assessment, emotion recognition, and genre
recognition tasks. In addition, we build XMIDI, a large-scale symbolic music
dataset that contains 108,023 MIDI files annotated with precise emotion and
genre labels. Objective and subjective evaluations show that XMusic
significantly outperforms the current state-of-the-art methods with impressive
music quality. Our XMusic has been awarded as one of the nine Highlights of
Collectibles at WAIC 2023. The project homepage of XMusic is
https://xmusic-project.github.io.

**Decision Explanation:** The paper primarily focuses on music generation, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and also does not align with the specified areas of interest.

---
## [SAIF: A Comprehensive Framework for Evaluating the Risks of Generative
AI in the Public Sector](https://arxiv.org/abs/2501.08814v1)
**arXiv ID:** 2501.08814v1

**Abstract:**
> The rapid adoption of generative AI in the public sector, encompassing
diverse applications ranging from automated public assistance to welfare
services and immigration processes, highlights its transformative potential
while underscoring the pressing need for thorough risk assessments. Despite its
growing presence, evaluations of risks associated with AI-driven systems in the
public sector remain insufficiently explored. Building upon an established
taxonomy of AI risks derived from diverse government policies and corporate
guidelines, we investigate the critical risks posed by generative AI in the
public sector while extending the scope to account for its multimodal
capabilities. In addition, we propose a Systematic dAta generatIon Framework
for evaluating the risks of generative AI (SAIF). SAIF involves four key
stages: breaking down risks, designing scenarios, applying jailbreak methods,
and exploring prompt types. It ensures the systematic and consistent generation
of prompt data, facilitating a comprehensive evaluation while providing a solid
foundation for mitigating the risks. Furthermore, SAIF is designed to
accommodate emerging jailbreak methods and evolving prompt types, thereby
enabling effective responses to unforeseen risk scenarios. We believe that this
study can play a crucial role in fostering the safe and responsible integration
of generative AI into the public sector.

**Decision Explanation:** The paper primarily focuses on responsible AI application and AI ethics in the public sector, which is an excluded topic.

---
## [IDEA: Image Description Enhanced CLIP-Adapter](https://arxiv.org/abs/2501.08816v1)
**arXiv ID:** 2501.08816v1

**Abstract:**
> CLIP (Contrastive Language-Image Pre-training) has attained great success in
pattern recognition and computer vision. Transferring CLIP to downstream tasks
(e.g. zero- or few-shot classification) is a hot topic in multimodal learning.
However, current studies primarily focus on either prompt learning for text or
adapter tuning for vision, without fully exploiting the complementary
information and correlations among image-text pairs. In this paper, we propose
an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to
few-shot image classification tasks. This method captures fine-grained features
by leveraging both visual features and textual descriptions of images. IDEA is
a training-free method for CLIP, and it can be comparable to or even exceeds
state-of-the-art models on multiple tasks. Furthermore, we introduce
Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable
components (i.e., a projector and a learnable latent space), further enhancing
the model's performance and achieving SOTA results on 11 datasets. As one
important contribution, we employ the Llama model and design a comprehensive
pipeline to generate textual descriptions for images of 11 datasets, resulting
in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are
released at https://github.com/FourierAI/IDEA.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on computer vision and image classification tasks, without clear applications of Large Language Models (LLMs) in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [Automatic tuning of communication protocols for vehicular ad hoc
networks using metaheuristics](https://arxiv.org/abs/2501.08847v1)
**arXiv ID:** 2501.08847v1

**Abstract:**
> The emerging field of vehicular ad hoc networks (VANETs) deals with a set of
communicating vehicles which are able to spontaneously interconnect without any
pre-existing infrastructure. In such kind of networks, it is crucial to make an
optimal configuration of the communication protocols previously to the final
network deployment. This way, a human designer can obtain an optimal QoS of the
network beforehand. The problem we consider in this work lies in configuring
the File Transfer protocol Configuration (FTC) with the aim of optimizing the
transmission time, the number of lost packets, and the amount of data
transferred in realistic VANET scenarios. We face the FTC with five
representative state-of-the-art optimization techniques and compare their
performance. These algorithms are: Particle Swarm Optimization (PSO),
Differential Evolution (DE), Genetic Algorithm (GA), Evolutionary Strategy
(ES), and Simulated Annealing (SA). For our tests, two typical environment
instances of VANETs for Urban and Highway scenarios have been defined. The
experiments using ns-2 (a well-known realistic VANET simulator) reveal that
PSO outperforms all the compared algorithms for both studied VANET instances.
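
For reference, the canonical PSO loop that protocol-tuning studies of this kind build on looks as follows (textbook form with a toy objective, not the paper's exact configuration):

```python
import numpy as np

# Textbook PSO sketch; the objective is a toy stand-in for the FTC problem.
def pso_minimize(f, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))      # particle positions
    v = np.zeros_like(x)                            # particle velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    gbest = pbest[pbest_val.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Velocity blends inertia, pull toward personal best, pull toward global best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()]
    return gbest, pbest_val.min()

# Toy stand-in for an FTC-style objective (e.g., transmission time + losses):
print(pso_minimize(lambda p: ((p - 0.3) ** 2).sum(), dim=3))
```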
**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---
## [RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning](https://arxiv.org/abs/2501.08848v1)
**arXiv ID:** 2501.08848v1

**Abstract:**
> Network simulation is pivotal in network modeling, assisting with tasks
ranging from capacity planning to performance estimation. Traditional
approaches such as Discrete Event Simulation (DES) face limitations in terms of
computational cost and accuracy. This paper introduces RouteNet-Gauss, a novel
integration of a testbed network with a Machine Learning (ML) model to address
these challenges. By using the testbed as a hardware accelerator,
RouteNet-Gauss generates training datasets rapidly and simulates network
scenarios with high fidelity to real-world conditions. Experimental results
show that RouteNet-Gauss significantly reduces prediction errors by up to 95%
and achieves a 488x speedup in inference time compared to state-of-the-art
DES-based methods. RouteNet-Gauss's modular architecture is dynamically
constructed based on the specific characteristics of the network scenario, such
as topology and routing. This enables it to understand and generalize to
different network configurations beyond those seen during training, including
networks up to 10x larger. Additionally, it supports Temporal Aggregated
Performance Estimation (TAPE), providing configurable temporal granularity and
maintaining high accuracy in flow performance metrics. This approach shows
promise in improving both simulation efficiency and accuracy, offering a
valuable tool for network operators.

**Decision Explanation:** The paper primarily focuses on network modeling and simulation, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Graph Counterfactual Explainable AI via Latent Space Traversal](https://arxiv.org/abs/2501.08850v1)
**arXiv ID:** 2501.08850v1

**Abstract:**
> Explaining the predictions of a deep neural network is a nontrivial task, yet
high-quality explanations for predictions are often a prerequisite for
practitioners to trust these models. Counterfactual explanations aim to explain
predictions by finding the "nearest" in-distribution alternative input whose
prediction changes in a pre-specified way. However, it remains an open question
how to define this nearest alternative input, whose solution depends on both
the domain (e.g. images, graphs, tabular data, etc.) and the specific
application considered. For graphs, this problem is complicated i) by their
discrete nature, as opposed to the continuous nature of state-of-the-art graph
classifiers; and ii) by the node permutation group acting on the graphs. We
propose a method to generate counterfactual explanations for any differentiable
black-box graph classifier, utilizing a case-specific permutation equivariant
graph variational autoencoder. We generate counterfactual explanations in a
continuous fashion by traversing the latent space of the autoencoder across the
classification boundary of the classifier, allowing for seamless integration of
discrete graph structure and continuous graph attributes. We empirically
validate the approach on three graph datasets, showing that our model is
consistently high-performing and more robust than the baselines.
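
The latent-space traversal itself can be sketched generically: starting from the latent code of the input graph, step toward the classifier's decision boundary and decode the first point whose prediction flips. In the sketch below, `decode` and `classify` are assumed black boxes standing in for the paper's equivariant VAE and graph classifier:

```python
import numpy as np

# Generic latent-traversal sketch; decode/classify are assumed black boxes.
def counterfactual_by_traversal(z0, direction, decode, classify,
                                target_class, step=0.05, max_steps=200):
    """Walk the latent space along `direction` and return the first decoded
    graph whose predicted class flips to `target_class`."""
    direction = direction / np.linalg.norm(direction)
    z = z0.copy()
    for _ in range(max_steps):
        z = z + step * direction
        graph = decode(z)                  # e.g., a permutation-equivariant decoder
        if classify(graph) == target_class:
            return graph, z                # nearest crossing found on this ray
    return None, z                         # boundary not reached on this ray
```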
**Decision Explanation:** The paper does not meet the criteria for practical applications of Large Language Models (LLMs) and does not mention LLMs, knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, focusing instead on explainable AI for graph classifiers.

---
## [ARMOR: Shielding Unlearnable Examples against Data Augmentation](https://arxiv.org/abs/2501.08862v1)
**arXiv ID:** 2501.08862v1

**Abstract:**
> Private data, when published online, may be collected by unauthorized parties
to train deep neural networks (DNNs). To protect privacy, defensive noises can
be added to original samples to degrade their learnability by DNNs. Recently,
unlearnable examples are proposed to minimize the training loss such that the
model learns almost nothing. However, raw data are often pre-processed before
being used for training, which may restore the private information of protected
data. In this paper, we reveal the data privacy violation induced by data
augmentation, a commonly used data pre-processing technique to improve model
generalization capability, which is the first of its kind as far as we are
concerned. We demonstrate that data augmentation can significantly raise the
accuracy of the model trained on unlearnable examples from 21.3% to 66.1%. To
address this issue, we propose a defense framework, dubbed ARMOR, to protect
data privacy from potential breaches of data augmentation. To overcome the
difficulty of having no access to the model training process, we design a
non-local module-assisted surrogate model that better captures the effect of
data augmentation. In addition, we design a surrogate augmentation selection
strategy that maximizes distribution alignment between augmented and
non-augmented samples, to choose the optimal augmentation strategy for each
class. We also use a dynamic step size adjustment algorithm to enhance the
defensive noise generation process. Extensive experiments are conducted on 4
datasets and 5 data augmentation methods to verify the performance of ARMOR.
Comparisons with 6 state-of-the-art defense methods have demonstrated that
ARMOR can preserve the unlearnability of protected private data under data
augmentation. ARMOR reduces the test accuracy of the model trained on augmented
protected samples by as much as 60% more than baselines.

**Decision Explanation:** The paper primarily focuses on protecting data privacy from potential breaches of data augmentation, which does not meet the criteria of practical applications of Large Language Models (LLMs) or real-world applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Karatsuba Matrix Multiplication and its Efficient Custom Hardware
Implementations](https://arxiv.org/abs/2501.08889v1)
**arXiv ID:** 2501.08889v1

**Abstract:**
> While the Karatsuba algorithm reduces the complexity of large integer
multiplication, the extra additions required minimize its benefits for smaller
integers of more commonly-used bitwidths. In this work, we propose the
extension of the scalar Karatsuba multiplication algorithm to matrix
multiplication, showing how this maintains the reduction in multiplication
complexity of the original Karatsuba algorithm while reducing the complexity of
the extra additions. Furthermore, we propose new matrix multiplication hardware
architectures for efficiently exploiting this extension of the Karatsuba
algorithm in custom hardware. We show that the proposed algorithm and hardware
architectures can provide real area or execution time improvements for integer
matrix multiplication compared to scalar Karatsuba or conventional matrix
multiplication algorithms, while also supporting implementation through proven
systolic array and conventional multiplier architectures at the core. We
provide a complexity analysis of the algorithm and architectures and evaluate
the proposed designs both in isolation and in an end-to-end deep learning
accelerator system compared to baseline designs and prior state-of-the-art
works implemented on the same type of compute platform, demonstrating their
ability to increase the performance-per-area of matrix multiplication hardware.
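
One natural reading of the construction, applying the scalar Karatsuba split to the bitwidth of the matrix elements so that a product of full-width integer matrices costs three half-width matrix multiplications instead of four, can be verified in a few lines (our illustration; the paper's hardware mapping is more involved):

```python
import numpy as np

# Our illustration of element-bitwidth Karatsuba splitting, not the paper's code.
def karatsuba_matmul(A, B, half_bits=8):
    """Multiply integer matrices by Karatsuba-splitting each element into
    high/low halves: 3 half-width matrix products instead of 4."""
    mask = (1 << half_bits) - 1
    A_hi, A_lo = A >> half_bits, A & mask
    B_hi, B_lo = B >> half_bits, B & mask
    P_hi = A_hi @ B_hi                     # high * high
    P_lo = A_lo @ B_lo                     # low * low
    P_mid = (A_hi + A_lo) @ (B_hi + B_lo)  # Karatsuba's combined term
    # Cross terms recovered as P_mid - P_hi - P_lo, as in scalar Karatsuba.
    return (P_hi << (2 * half_bits)) + ((P_mid - P_hi - P_lo) << half_bits) + P_lo

rng = np.random.default_rng(0)
A = rng.integers(0, 1 << 16, (4, 4), dtype=np.int64)
B = rng.integers(0, 1 << 16, (4, 4), dtype=np.int64)
assert np.array_equal(karatsuba_matmul(A, B), A @ B)
```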
**Decision Explanation:** The paper does not meet the criteria as it focuses on matrix multiplication and custom hardware implementations, which are not related to Large Language Models (LLMs) or their practical applications.

---
## [Computing Game Symmetries and Equilibria That Respect Them](https://arxiv.org/abs/2501.08905v1)
**arXiv ID:** 2501.08905v1

**Abstract:**
> Strategic interactions can be represented more concisely, and analyzed and
solved more efficiently, if we are aware of the symmetries within the
multiagent system. Symmetries also have conceptual implications, for example
for equilibrium selection. We study the computational complexity of identifying
and using symmetries. Using the classical framework of normal-form games, we
consider game symmetries that can be across some or all players and/or actions.
We find a strong connection between game symmetries and graph automorphisms,
yielding graph automorphism and graph isomorphism completeness results for
characterizing the symmetries present in a game. On the other hand, we also
show that the problem becomes polynomial-time solvable when we restrict the
consideration of actions in one of two ways.
Next, we investigate when exactly game symmetries can be successfully
leveraged for Nash equilibrium computation. We show that finding a Nash
equilibrium that respects a given set of symmetries is PPAD- and CLS-complete
in general-sum and team games respectively -- that is, exactly as hard as
Brouwer fixed point and gradient descent problems. Finally, we present
polynomial-time methods for the special cases where we are aware of a vast
number of symmetries, or where the game is two-player zero-sum and we do not
even know the symmetries.

**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications, and does not mention areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---
## [Trusted Machine Learning Models Unlock Private Inference for Problems
Currently Infeasible with Cryptography](https://arxiv.org/abs/2501.08970v1)
**arXiv ID:** 2501.08970v1

**Abstract:**
> We often interact with untrusted parties. Prioritization of privacy can limit
the effectiveness of these interactions, as achieving certain goals
necessitates sharing private data. Traditionally, addressing this challenge has
involved either seeking trusted intermediaries or constructing cryptographic
protocols that restrict how much data is revealed, such as multi-party
computations or zero-knowledge proofs. While significant advances have been
made in scaling cryptographic approaches, they remain limited in terms of the
size and complexity of applications they can be used for. In this paper, we
argue that capable machine learning models can fulfill the role of a trusted
third party, thus enabling secure computations for applications that were
previously infeasible. In particular, we describe Trusted Capable Model
Environments (TCMEs) as an alternative approach for scaling secure computation,
where capable machine learning model(s) interact under input/output
constraints, with explicit information flow control and explicit statelessness.
This approach aims to achieve a balance between privacy and computational
efficiency, enabling private inference where classical cryptographic solutions
are currently infeasible. We describe a number of use cases that are enabled by
TCME, and show that even some simple classic cryptographic problems can already
be solved with TCME. Finally, we outline current limitations and discuss the
path forward in implementing them.

**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications, and instead discusses machine learning models for private inference, which does not meet the primary criteria.

---

## [Personality Modeling for Persuasion of Misinformation using AI Agent](https://arxiv.org/abs/2501.08985v1)
**arXiv ID:** 2501.08985v1

**Abstract:**
> The proliferation of misinformation on social media platforms has highlighted
the need to understand how individual personality traits influence
susceptibility to and propagation of misinformation. This study employs an
innovative agent-based modeling approach to investigate the relationship
between personality traits and misinformation dynamics. Using six AI agents
embodying different dimensions of the Big Five personality traits
(Extraversion, Agreeableness, and Neuroticism), we simulated interactions
across six diverse misinformation topics. The experiment, implemented through
the AgentScope framework using the GLM-4-Flash model, generated 90 unique
interactions, revealing complex patterns in how personality combinations affect
persuasion and resistance to misinformation. Our findings demonstrate that
analytical and critical personality traits enhance effectiveness in
evidence-based discussions, while non-aggressive persuasion strategies show
unexpected success in misinformation correction. Notably, agents with critical
traits achieved a 59.4% success rate in HIV-related misinformation discussions,
while those employing non-aggressive approaches maintained consistent
persuasion rates above 40% across different personality combinations. The study
also revealed a non-transitive pattern in persuasion effectiveness, challenging
conventional assumptions about personality-based influence. These results
provide crucial insights for developing personality-aware interventions in
digital environments and suggest that effective misinformation countermeasures
should prioritize emotional connection and trust-building over confrontational
approaches. The findings contribute to both theoretical understanding of
personality-misinformation dynamics and practical strategies for combating
misinformation in social media contexts.

**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to misinformation, which is not a preferred area of focus according to the criteria.

---

## [AI-RAN: Transforming RAN with AI-driven Computing Infrastructure](https://arxiv.org/abs/2501.09007v1)
**arXiv ID:** 2501.09007v1

**Abstract:**
> The radio access network (RAN) landscape is undergoing a transformative shift
from traditional, communication-centric infrastructures towards converged
compute-communication platforms. This article introduces AI-RAN which
integrates both RAN and artificial intelligence (AI) workloads on the same
infrastructure. By doing so, AI-RAN not only meets the performance demands of
future networks but also improves asset utilization. We begin by examining how
RANs have evolved beyond mobile broadband towards AI-RAN and articulating
manifestations of AI-RAN into three forms: AI-for-RAN, AI-on-RAN, and
AI-and-RAN. Next, we identify the key requirements and enablers for the
convergence of communication and computing in AI-RAN. We then provide a
reference architecture for advancing AI-RAN from concept to practice. To
illustrate the practical potential of AI-RAN, we present a proof-of-concept
that concurrently processes RAN and AI workloads utilizing NVIDIA Grace-Hopper
GH200 servers. Finally, we conclude the article by outlining future work
directions to guide further developments of AI-RAN.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on the integration of AI with radio access networks (RAN) and does not explicitly mention Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Multimodal LLMs Can Reason about Aesthetics in Zero-Shot](https://arxiv.org/abs/2501.09012v1)
**arXiv ID:** 2501.09012v1

**Abstract:**
> We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability
shall be elicited to evaluate the aesthetics of artworks. To facilitate this
investigation, we construct MM-StyleBench, a novel high-quality dataset for
benchmarking artistic stylization. We then develop a principled method for
human preference modeling and perform a systematic correlation analysis between
MLLMs' responses and human preference. Our experiments reveal an inherent
hallucination issue of MLLMs in art evaluation, associated with response
subjectivity. ArtCoT is proposed, demonstrating that art-specific task
decomposition and the use of concrete language boost MLLMs' reasoning ability
for aesthetics. Our findings offer valuable insights into MLLMs for art and can
benefit a wide range of downstream applications, such as style transfer and
artistic image generation. Code available at
https://github.com/songrise/MLLM4Art.

**Decision Explanation:** The paper primarily focuses on the application of Multimodal LLMs in evaluating the aesthetics of artworks, which does not meet the criteria of practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and its subject matter does not otherwise suggest strong relevance to the desired topics.

---

## [Spatio-Temporal Foundation Models: Vision, Challenges, and Opportunities](https://arxiv.org/abs/2501.09045v1)
**arXiv ID:** 2501.09045v1

**Abstract:**
> Foundation models have revolutionized artificial intelligence, setting new
benchmarks in performance and enabling transformative capabilities across a
wide range of vision and language tasks. However, despite the prevalence of
spatio-temporal data in critical domains such as transportation, public health,
and environmental monitoring, spatio-temporal foundation models (STFMs) have
not yet achieved comparable success. In this paper, we articulate a vision for
the future of STFMs, outlining their essential characteristics and the
generalization capabilities necessary for broad applicability. We critically
assess the current state of research, identifying gaps relative to these ideal
traits, and highlight key challenges that impede their progress. Finally, we
explore potential opportunities and directions to advance research towards the
aim of effective and broadly applicable STFMs.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on vision tasks and spatio-temporal foundation models, which are not directly related to Large Language Models (LLMs) or their practical applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI
Reconstruction](https://arxiv.org/abs/2501.09049v1)
**arXiv ID:** 2501.09049v1

**Abstract:**
> Dynamic MRI reconstruction, one of inverse problems, has seen a surge by the
use of deep learning techniques. Especially, the practical difficulty of
obtaining ground truth data has led to the emergence of unsupervised learning
approaches. A recent promising method among them is implicit neural
representation (INR), which defines the data as a continuous function that maps
coordinate values to the corresponding signal values. This allows for filling
in missing information only with incomplete measurements and solving the
inverse problem effectively. Nevertheless, previous works incorporating this
method have faced drawbacks such as long optimization time and the need for
extensive hyperparameter tuning. To address these issues, we propose
Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction
that captures the spatial and temporal continuity of dynamic MRI data in the
image domain and explicitly incorporates the temporal redundancy of the data
into the model structure. As a result, DA-INR outperforms other models in
reconstruction quality even at extreme undersampling ratios while significantly
reducing optimization time and requiring minimal hyperparameter tuning.

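The underlying INR idea is compact: fit a small network that maps sampled coordinates to measured signal values, then query it at arbitrary unmeasured coordinates to fill in missing data. A minimal, generic coordinate-MLP sketch in PyTorch (not DA-INR itself, which adds the dynamic-aware structure described above; `TinyINR` and the random stand-in data are illustrative):

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Minimal implicit neural representation: (x, y, t) -> signal value."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):            # coords: (N, 3), scaled to [-1, 1]
        return self.net(coords)

# Fit only where measurements exist; afterwards the model can be
# queried at any coordinate, which is what "fills in" missing data.
model = TinyINR()
coords = torch.rand(128, 3) * 2 - 1       # stand-in for sampled locations
values = torch.rand(128, 1)               # stand-in for measured signal
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ((model(coords) - values) ** 2).mean()
    loss.backward()
    opt.step()
```
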
**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically dynamic MRI reconstruction, which is not aligned with the specified criteria.

---

## [SteLLA: A Structured Grading System Using LLMs with RAG](https://arxiv.org/abs/2501.09092v1)
**arXiv ID:** 2501.09092v1

**Abstract:**
> Large Language Models (LLMs) have shown strong general capabilities in many
applications. However, how to make them reliable tools for some specific tasks
such as automated short answer grading (ASAG) remains a challenge. We present
SteLLA (Structured Grading System Using LLMs with RAG) in which a) Retrieval
Augmented Generation (RAG) approach is used to empower LLMs specifically on the
ASAG task by extracting structured information from the highly relevant and
reliable external knowledge based on the instructor-provided reference answer
and rubric, b) an LLM performs a structured and question-answering-based
evaluation of student answers to provide analytical grades and feedback. A
real-world dataset that contains students' answers in an exam was collected
from a college-level Biology course. Experiments show that our proposed system
can achieve substantial agreement with the human grader while providing
break-down grades and feedback on all the knowledge points examined in the
problem. A qualitative and error analysis of the feedback generated by GPT4
shows that GPT4 is good at capturing facts while it may be prone to inferring
too much implication from the given text in the grading task, which provides
insights into the usage of LLMs in the ASAG system.

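The two-stage pattern the abstract describes (ground the model in rubric-derived context, then ask for a per-knowledge-point judgment) can be sketched schematically as below. `call_llm`, `grade_answer`, and the prompt wording are placeholders introduced here, not the paper's implementation; a real RAG system would retrieve the reference material from an indexed store rather than passing it in directly:

```python
def grade_answer(student_answer, reference_answer, rubric_points, call_llm):
    """Rubric-grounded, question-answering-style grading (schematic only).

    call_llm: any text-completion function (e.g. a wrapper around an API
    or a local model); it is a stand-in, not part of the SteLLA system.
    """
    results = []
    for point in rubric_points:
        # "Retrieved" context here is just the instructor-provided
        # reference answer plus one rubric point.
        prompt = (
            f"Reference answer: {reference_answer}\n"
            f"Rubric point: {point}\n"
            f"Student answer: {student_answer}\n"
            "Does the student answer satisfy this rubric point? "
            "Reply YES or NO, then one sentence of feedback."
        )
        reply = call_llm(prompt)
        results.append((point, reply.strip().upper().startswith("YES"), reply))
    # Break-down results per point, plus an aggregate analytical score.
    score = sum(ok for _, ok, _ in results) / len(results)
    return score, results
```
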
**Decision Explanation:** The paper primarily focuses on educational applications, specifically automated short answer grading, which is not explicitly mentioned as a desired area of focus, and it does not clearly meet the criteria related to knowledge graphs, retrieval-augmented generation, or agentic AI in a way that aligns with the specified areas of interest.

---

## [Tracking the Takes and Trajectories of English-Language News Narratives
across Trustworthy and Worrisome Websites](https://arxiv.org/abs/2501.09102v1)
**arXiv ID:** 2501.09102v1

**Abstract:**
> Understanding how misleading and outright false information enters news
ecosystems remains a difficult challenge that requires tracking how narratives
spread across thousands of fringe and mainstream news websites. To do this, we
introduce a system that utilizes encoder-based large language models and
zero-shot stance detection to scalably identify and track news narratives and
their attitudes across over 4,000 factually unreliable, mixed-reliability, and
factually reliable English-language news websites. Running our system over an
18 month period, we track the spread of 146K news stories. Using network-based
inference via the NETINF algorithm, we show that the paths of news
narratives and the stances of websites toward particular entities can be used
to uncover slanted propaganda networks (e.g., anti-vaccine and anti-Ukraine)
and to identify the most influential websites in spreading these attitudes in
the broader news ecosystem. We hope that increased visibility into our
distributed news ecosystem can help with the reporting and fact-checking of
propaganda and disinformation.

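Zero-shot stance detection of the kind described here is commonly built on NLI-style encoder models. A minimal illustration with an off-the-shelf Hugging Face pipeline; the paper's own models, prompts, and label set may differ:

```python
from transformers import pipeline

# An NLI model repurposed for zero-shot classification: each candidate
# label is scored as a hypothesis against the input text.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

headline = "Officials announce a new vaccination campaign for the fall."
stances = ["supports vaccination", "opposes vaccination",
           "neutral on vaccination"]

result = classifier(headline, candidate_labels=stances)
print(result["labels"][0], result["scores"][0])   # top stance and its score
```
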
**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to misinformation and disinformation, which is similar to issues of social harm, and does not meet the criteria of having practical applications of Large Language Models in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [A Non-autoregressive Model for Joint STT and TTS](https://arxiv.org/abs/2501.09104v1)
**arXiv ID:** 2501.09104v1

**Abstract:**
> In this paper, we take a step towards jointly modeling automatic speech
recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way.
We develop a novel multimodal framework capable of handling the speech and text
modalities as input either individually or together. The proposed model can
also be trained with unpaired speech or text data owing to its multimodal
nature. We further propose an iterative refinement strategy to improve the STT
and TTS performance of our model such that the partial hypothesis at the output
can be fed back to the input of our model, thus iteratively improving both STT
and TTS predictions. We show that our joint model can effectively perform both
STT and TTS tasks, outperforming the STT-specific baseline in all tasks and
performing competitively with the TTS-specific baseline across a wide range of
evaluation metrics.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on speech recognition and synthesis, which does not involve Large Language Models or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Towards Understanding Extrapolation: a Causal Lens](https://arxiv.org/abs/2501.09163v1)
**arXiv ID:** 2501.09163v1

**Abstract:**
> Canonical work handling distribution shifts typically necessitates an entire
target distribution that lands inside the training distribution. However,
practical scenarios often involve only a handful of target samples, potentially
lying outside the training support, which requires the capability of
extrapolation. In this work, we aim to provide a theoretical understanding of
when extrapolation is possible and offer principled methods to achieve it
without requiring an on-support target distribution. To this end, we formulate
the extrapolation problem with a latent-variable model that embodies the
minimal change principle in causal mechanisms. Under this formulation, we cast
the extrapolation problem into a latent-variable identification problem. We
provide realistic conditions on shift properties and the estimation objectives
that lead to identification even when only one off-support target sample is
available, tackling the most challenging scenarios. Our theory reveals the
intricate interplay between the underlying manifold's smoothness and the shift
properties. We showcase how our theoretical results inform the design of
practical adaptation algorithms. Through experiments on both synthetic and
real-world data, we validate our theoretical findings and their practical
implications.

**Decision Explanation:** The paper does not explicitly focus on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the primary criteria for selection.

---

## [A Blockchain-Enabled Approach to Cross-Border Compliance and Trust](https://arxiv.org/abs/2501.09182v1)
**arXiv ID:** 2501.09182v1

**Abstract:**
> As artificial intelligence (AI) systems become increasingly integral to
critical infrastructure and global operations, the need for a unified,
trustworthy governance framework is more urgent than ever. This paper proposes
a novel approach to AI governance, utilizing blockchain and distributed ledger
technologies (DLT) to establish a decentralized, globally recognized framework
that ensures security, privacy, and trustworthiness of AI systems across
borders. The paper presents specific implementation scenarios within the
financial sector, outlines a phased deployment timeline over the next decade,
and addresses potential challenges with solutions grounded in current research.
By synthesizing advancements in blockchain, AI ethics, and cybersecurity, this
paper offers a comprehensive roadmap for a decentralized AI governance
framework capable of adapting to the complex and evolving landscape of global
AI regulation.

**Decision Explanation:** The paper primarily focuses on AI governance, blockchain, and ethics, which does not meet the criteria of having practical applications of Large Language Models (LLMs) or discussing real-world applications and challenges involving LLMs.

---

## [Patch-aware Vector Quantized Codebook Learning for Unsupervised Visual
Defect Detection](https://arxiv.org/abs/2501.09187v1)
**arXiv ID:** 2501.09187v1

**Abstract:**
> Unsupervised visual defect detection is critical in industrial applications,
requiring a representation space that captures normal data features while
detecting deviations. Achieving a balance between expressiveness and
compactness is challenging; an overly expressive space risks inefficiency and
mode collapse, impairing detection accuracy. We propose a novel approach using
an enhanced VQ-VAE framework optimized for unsupervised defect detection. Our
model introduces a patch-aware dynamic code assignment scheme, enabling
context-sensitive code allocation to optimize spatial representation. This
strategy enhances normal-defect distinction and improves detection accuracy
during inference. Experiments on MVTecAD, BTAD, and MTSD datasets show our
method achieves state-of-the-art performance.

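The building block being modified here is the standard VQ-VAE code-assignment step, in which each patch embedding is snapped to its nearest codebook entry; the paper's contribution is making that assignment patch-aware and dynamic. A minimal sketch of the vanilla step (`vq_assign` is a name introduced here):

```python
import numpy as np

def vq_assign(patch_embeddings, codebook):
    """Standard VQ step: map each embedding to its nearest codebook entry.

    patch_embeddings: (N, d) array; codebook: (K, d) array.
    Returns chosen code indices and the quantized embeddings.
    """
    d2 = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)          # nearest codebook entry per patch
    return codes, codebook[codes]

rng = np.random.default_rng(0)
codes, quantized = vq_assign(rng.normal(size=(16, 8)),   # 16 patch embeddings
                             rng.normal(size=(32, 8)))   # 32 codebook entries
```
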
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on unsupervised visual defect detection, which is not directly related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Adaptive Law-Based Transformation (ALT): A Lightweight Feature
Representation for Time Series Classification](https://arxiv.org/abs/2501.09217v1)
**arXiv ID:** 2501.09217v1

**Abstract:**
> Time series classification (TSC) is fundamental in numerous domains,
including finance, healthcare, and environmental monitoring. However,
traditional TSC methods often struggle with the inherent complexity and
variability of time series data. Building on our previous work with the linear
law-based transformation (LLT) - which improved classification accuracy by
transforming the feature space based on key data patterns - we introduce
adaptive law-based transformation (ALT). ALT enhances LLT by incorporating
variable-length shifted time windows, enabling it to capture distinguishing
patterns of various lengths and thereby handle complex time series more
effectively. By mapping features into a linearly separable space, ALT provides
a fast, robust, and transparent solution that achieves state-of-the-art
performance with only a few hyperparameters.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on time series classification and does not mention Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Foundations of Large Language Models](https://arxiv.org/abs/2501.09223v1)
**arXiv ID:** 2501.09223v1

**Abstract:**
> This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on foundational concepts of large language models rather than practical applications, experimental results, or comparisons with state-of-the-art techniques.

---

## [Large Language Model is Secretly a Protein Sequence Optimizer](https://arxiv.org/abs/2501.09274v1)
**arXiv ID:** 2501.09274v1

**Abstract:**
> We consider the protein sequence engineering problem, which aims to find
protein sequences with high fitness levels, starting from a given wild-type
sequence. Directed evolution has been a dominating paradigm in this field which
has an iterative process to generate variants and select via experimental
feedback. We demonstrate large language models (LLMs), despite being trained on
massive texts, are secretly protein sequence optimizers. With a directed
evolutionary method, LLM can perform protein engineering through Pareto and
experiment-budget constrained optimization, demonstrating success on both
synthetic and experimental fitness landscapes.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, specifically protein sequence engineering, which is excluded according to the criteria.

---

## [LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport](https://arxiv.org/abs/2501.09291v1)
**arXiv ID:** 2501.09291v1

**Abstract:**
> Automated audio captioning is a task that generates textual descriptions for
audio content, and recent studies have explored using visual information to
enhance captioning quality. However, current methods often fail to effectively
fuse audio and visual data, missing important semantic cues from each modality.
To address this, we introduce LAVCap, a large language model (LLM)-based
audio-visual captioning framework that effectively integrates visual
information with audio to improve audio captioning performance. LAVCap employs
an optimal transport-based alignment loss to bridge the modality gap between
audio and visual features, enabling more effective semantic extraction.
Additionally, we propose an optimal transport attention module that enhances
audio-visual fusion using an optimal transport assignment map. Combined with
the optimal training strategy, experimental results demonstrate that each
component of our framework is effective. LAVCap outperforms existing
state-of-the-art methods on the AudioCaps dataset, without relying on large
datasets or post-processing. Code is available at
https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.

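Optimal-transport alignment of the kind named in the abstract is typically computed with entropic regularization via Sinkhorn iterations: given a cost matrix between audio and visual token features, the transport plan acts as a soft cross-modal alignment, and the inner product of plan and cost gives an alignment loss. A generic sketch under those assumptions; the paper's exact formulation may differ:

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropic OT between two uniform distributions via Sinkhorn scaling."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]        # soft audio-visual alignment

audio = np.random.randn(10, 64)               # audio token features
visual = np.random.randn(12, 64)              # visual token features
cost = ((audio[:, None] - visual[None]) ** 2).sum(-1)
cost /= cost.max()                             # keep exp(-cost/eps) stable
plan = sinkhorn_plan(cost)
alignment_loss = (plan * cost).sum()           # transport cost as the loss
```
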
**Decision Explanation:** The paper primarily focuses on audio-visual captioning, which does not meet the criteria of having practical applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, and also does not discuss real-world applications or challenges involving LLMs in autonomous or agentic AI scenarios.

---

## [Understanding Mental Health Content on Social Media and Its Effect
Towards Suicidal Ideation](https://arxiv.org/abs/2501.09309v1)
**arXiv ID:** 2501.09309v1

**Abstract:**
> This review underscores the critical need for effective strategies to
identify and support individuals with suicidal ideation, exploiting
technological innovations in ML and DL to further suicide prevention efforts.
The study details the application of these technologies in analyzing vast
amounts of unstructured social media data to detect linguistic patterns,
keywords, phrases, tones, and contextual cues associated with suicidal
thoughts. It explores various ML and DL models like SVMs, CNNs, LSTM, neural
networks, and their effectiveness in interpreting complex data patterns and
emotional nuances within text data. The review discusses the potential of these
technologies to serve as a life-saving tool by identifying at-risk individuals
through their digital traces. Furthermore, it evaluates the real-world
effectiveness, limitations, and ethical considerations of employing these
technologies for suicide prevention, stressing the importance of responsible
development and usage. The study aims to fill critical knowledge gaps by
analyzing recent studies, methodologies, tools, and techniques in this field.
It highlights the importance of synthesizing current literature to inform
practical tools and suicide prevention efforts, guiding innovation in reliable,
ethical systems for early intervention. This research synthesis evaluates the
intersection of technology and mental health, advocating for the ethical and
responsible application of ML, DL, and NLP to offer life-saving potential
worldwide while addressing challenges like generalizability, biases, privacy,
and the need for further research to ensure these technologies do not
exacerbate existing inequities and harms.

**Decision Explanation:** The paper primarily focuses on social applications of AI in regard to mental health and suicidal ideation, which is one of the excluded topics.

---

## [Shape-Based Single Object Classification Using Ensemble Method
Classifiers](https://arxiv.org/abs/2501.09311v1)
**arXiv ID:** 2501.09311v1

**Abstract:**
> Nowadays, more and more images are available. Annotation and retrieval of the
images pose classification problems, where each class is defined as the group
of database images labelled with a common semantic label. Various systems have
been proposed for content-based retrieval, as well as for image classification
and indexing. In this paper, a hierarchical classification framework has been
proposed for bridging the semantic gap effectively and achieving multi-category
image classification. A well known pre-processing and post-processing method
was used and applied to three problems; image segmentation, object
identification and image classification. The method was applied to classify
single object images from Amazon and Google datasets. The classification was
tested for four different classifiers; BayesNetwork (BN), Random Forest (RF),
Bagging and Vote. The estimated classification accuracies ranged from 20% to
99% (using 10-fold cross validation). The Bagging classifier presents the best
performance, followed by the Random Forest classifier.

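The evaluation protocol described (four classifiers compared under 10-fold cross-validation) is straightforward to reproduce in spirit. A compact scikit-learn analogue of the Weka-style setup, with GaussianNB standing in for BayesNetwork and a bundled dataset standing in for the paper's Amazon/Google shape features:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)        # stand-in for the shape features
models = {
    "BN (approx.)": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100),
    "Bagging": BaggingClassifier(n_estimators=100),
    "Vote": VotingClassifier([("nb", GaussianNB()),
                              ("rf", RandomForestClassifier(n_estimators=100))]),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
    print(f"{name}: {scores.mean():.3f}")
```
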
**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on image classification and does not involve Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [ELM-DeepONets: Backpropagation-Free Training of Deep Operator Networks
via Extreme Learning Machines](https://arxiv.org/abs/2501.09395v1)
**arXiv ID:** 2501.09395v1

**Abstract:**
> Deep Operator Networks (DeepONets) are among the most prominent frameworks
for operator learning, grounded in the universal approximation theorem for
operators. However, training DeepONets typically requires significant
computational resources. To address this limitation, we propose ELM-DeepONets,
an Extreme Learning Machine (ELM) framework for DeepONets that leverages the
backpropagation-free nature of ELM. By reformulating DeepONet training as a
least-squares problem for newly introduced parameters, the ELM-DeepONet
approach significantly reduces training complexity. Validation on benchmark
problems, including nonlinear ODEs and PDEs, demonstrates that the proposed
method not only achieves superior accuracy but also drastically reduces
computational costs. This work offers a scalable and efficient alternative for
operator learning in scientific computing.

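The key ELM mechanism is that hidden-layer parameters are drawn randomly and frozen, so training reduces to a single linear least-squares solve for the output weights, with no backpropagation. A toy single-layer illustration of that idea on a regression problem (the paper applies it to the DeepONet branch/trunk structure, which this sketch does not reproduce):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))          # inputs
y = np.sin(3 * X).ravel()                 # target function

# Random, frozen hidden layer (the ELM part): only the output weights
# are "trained", by solving a linear least-squares problem.
W, b = rng.normal(size=(1, 64)), rng.normal(size=64)
H = np.tanh(X @ W + b)                    # hidden features, never updated
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

y_hat = H @ beta
print(np.abs(y_hat - y).max())            # fit error of the toy model
```
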
**Decision Explanation:** The paper does not meet the criteria as it focuses on Deep Operator Networks and Extreme Learning Machines, which are not directly related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Dynamic Neural Style Transfer for Artistic Image Generation using VGG19](https://arxiv.org/abs/2501.09420v1)
**arXiv ID:** 2501.09420v1

**Abstract:**
> Throughout history, humans have created remarkable works of art, but
artificial intelligence has only recently started to make strides in generating
visually compelling art. Breakthroughs in the past few years have focused on
using convolutional neural networks (CNNs) to separate and manipulate the
content and style of images, applying texture synthesis techniques.
Nevertheless, a number of current techniques continue to encounter obstacles,
including lengthy processing times, restricted choices of style images, and the
inability to modify the weight ratio of styles. We proposed a neural style
transfer system that can add various artistic styles to a desired image to
address these constraints allowing flexible adjustments to style weight ratios
and reducing processing time. The system uses the VGG19 model for feature
extraction, ensuring high-quality, flexible stylization without compromising
content integrity.

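In VGG-based style transfer, style is captured by Gram matrices of feature maps and content by deeper activations; the adjustable weight ratios the abstract mentions enter as per-layer coefficients on the style terms. A minimal sketch of that loss in the standard Gatys-style formulation (not the paper's exact system):

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of one feature map (C, H, W): channel co-activations."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)
    return flat @ flat.T / (C * H * W)

def transfer_loss(gen_feats, content_feat, style_feats, style_weights,
                  content_weight=1.0):
    """Weighted content + style objective over VGG feature maps."""
    loss = content_weight * F.mse_loss(gen_feats[-1], content_feat)
    for w, g, s in zip(style_weights, gen_feats, style_feats):
        loss = loss + w * F.mse_loss(gram(g), gram(s))  # adjustable ratios
    return loss

# Toy usage with random tensors standing in for VGG19 activations.
gen = [torch.randn(8, 16, 16) for _ in range(3)]
sty = [torch.randn(8, 16, 16) for _ in range(3)]
loss = transfer_loss(gen, torch.randn(8, 16, 16), sty, [1.0, 0.5, 0.25])
```
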
**Decision Explanation:** The paper primarily focuses on image generation using convolutional neural networks, which does not meet the criteria of focusing on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and
Mitigation Strategy](https://arxiv.org/abs/2501.09431v1)
**arXiv ID:** 2501.09431v1

**Abstract:**
> While large language models (LLMs) present significant potential for
supporting numerous real-world applications and delivering positive social
impacts, they still face significant challenges in terms of the inherent risk
of privacy leakage, hallucinated outputs, and value misalignment, and can be
maliciously used for generating toxic content and unethical purposes after
being jailbroken. Therefore, in this survey, we present a comprehensive review
of recent advancements aimed at mitigating these issues, organized across the
four phases of LLM development and usage: data collecting and pre-training,
fine-tuning and alignment, prompting and reasoning, and post-processing and
auditing. We elaborate on the recent advances for enhancing the performance of
LLMs in terms of privacy protection, hallucination reduction, value alignment,
toxicity elimination, and jailbreak defenses. In contrast to previous surveys
that focus on a single dimension of responsible LLMs, this survey presents a
unified framework that encompasses these diverse dimensions, providing a
comprehensive view of enhancing LLMs to better serve real-world applications.

**Decision Explanation:** The paper primarily focuses on responsible AI application or AI ethics, which is one of the excluded topics.

---

## [Solving the unsolvable: Translating case law in Hong Kong](https://arxiv.org/abs/2501.09444v1)
**arXiv ID:** 2501.09444v1

**Abstract:**
> This paper addresses the challenges of translating case law under Hong Kong's
bilingual legal system. It highlights the initial success of translating all
written statutes into Chinese before the 1997 handover, a task mandated by the
Basic Law. The effort involved significant collaboration among legal,
linguistic, and translation experts, resulting in a comprehensive and
culturally appropriate bilingual legal system. However, translating case law
remains a significant challenge due to the sheer volume and continuous growth
of judicial decisions. The paper critiques the government's and judiciary's
sporadic and uncoordinated efforts to translate case law, contrasting it with
the thorough approach previously taken for statute translation. Although the
government acknowledges the importance of legal bilingualism, it lacks a
sustainable strategy for translating case law. The Judiciary's position that
translating all judgments is unnecessary, unrealistic, and not cost-effective
is analyzed and critiqued for its impact on legal transparency and public
trust. A proposed solution involves leveraging machine translation technology
through a human-machine interactive translation platform, which undergoes two
major transitions. Initially based on a neural model, the platform transitions
to using a large language model for improved translation accuracy. Furthermore,
it evolves from a single-agent system to a multi-agent system, incorporating
Translator, Annotator, and Proofreader agents. This multi-agent approach,
supported by a grant, aims to facilitate efficient, high-quality translation of
judicial judgments by integrating advanced artificial intelligence and
continuous feedback mechanisms, thus better meeting the needs of a bilingual
legal system.

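The Translator/Annotator/Proofreader design is a sequential multi-agent pipeline in which each agent refines the previous agent's output. A schematic sketch under that reading; `call_llm` and the role prompts are illustrative placeholders, not the platform's actual implementation:

```python
ROLES = {
    "Translator": "Translate this Hong Kong judgment excerpt into Chinese:\n",
    "Annotator": "Annotate legal terms and flag uncertain renderings:\n",
    "Proofreader": "Proofread the annotated translation; output a final version:\n",
}

def translate_judgment(text, call_llm):
    """Pass text through Translator -> Annotator -> Proofreader in order.

    call_llm: any chat/completion function; a stand-in introduced here.
    """
    output = text
    for role, instruction in ROLES.items():
        output = call_llm(instruction + output)   # each agent refines the last
    return output
```
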
**Decision Explanation:** The paper primarily focuses on law, which is an excluded topic, and does not meet the required criteria for practical applications of Large Language Models in areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [RE-POSE: Synergizing Reinforcement Learning-Based Partitioning and
Offloading for Edge Object Detection](https://arxiv.org/abs/2501.09465v1)
**arXiv ID:** 2501.09465v1

**Abstract:**
> Object detection plays a crucial role in smart video analysis, with
applications ranging from autonomous driving and security to smart cities.
However, achieving real-time object detection on edge devices presents
significant challenges due to their limited computational resources and the
high demands of deep neural network (DNN)-based detection models, particularly
when processing high-resolution video. Conventional strategies, such as input
down-sampling and network up-scaling, often compromise detection accuracy for
faster performance or lead to higher inference latency. To address these
issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven
Partitioning and Edge Offloading framework designed to optimize the
accuracy-latency trade-off in resource-constrained edge environments. Our
approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that
partitions video frames into non-uniform blocks based on object distribution
and the computational characteristics of DNNs. Furthermore, a parallel edge
offloading scheme is implemented to distribute these blocks across multiple
edge servers for concurrent processing. Experimental evaluations show that
RE-POSE significantly enhances detection accuracy and reduces inference
latency, surpassing existing methods.

**Decision Explanation:** The paper primarily focuses on video processing, which is one of the excluded areas, and does not explicitly mention Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [MonoSOWA: Scalable monocular 3D Object detector Without human
Annotations](https://arxiv.org/abs/2501.09481v1)
**arXiv ID:** 2501.09481v1

**Abstract:**
> Detecting the three-dimensional position and orientation of objects using a
single RGB camera is a foundational task in computer vision with many important
applications. Traditionally, 3D object detection methods are trained in a
fully-supervised setup, requiring vast amounts of human annotations, which are
laborious, costly, and do not scale well with the ever-increasing amounts of
data being captured.
In this paper, we present the first method to train 3D object detectors for
monocular RGB cameras without domain-specific human annotations, thus making
orders of magnitude more data available for training. Thanks to newly proposed
Canonical Object Space, the method can not only exploit data across a variety
of datasets and camera setups to train a single 3D detector, but unlike
previous work it also works out of the box in previously unseen camera setups.
All this is crucial for practical applications, where the data and cameras are
extremely heterogeneous.
The method is evaluated on two standard autonomous driving datasets, where it
outperforms previous works, which, unlike our method, still rely on 2D human
annotations.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on computer vision and 3D object detection, which is not related to Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [MatrixNet: Learning over symmetry groups using learned group
representations](https://arxiv.org/abs/2501.09571v1)
**arXiv ID:** 2501.09571v1

**Abstract:**
> Group theory has been used in machine learning to provide a theoretically
grounded approach for incorporating known symmetry transformations in tasks
from robotics to protein modeling. In these applications, equivariant neural
networks use known symmetry groups with predefined representations to learn
over geometric input data. We propose MatrixNet, a neural network architecture
that learns matrix representations of group element inputs instead of using
predefined representations. MatrixNet achieves higher sample efficiency and
generalization over several standard baselines in prediction tasks over
several finite groups and the Artin braid group. We also show that MatrixNet
respects group relations allowing generalization to group elements of greater
word length than in the training set.

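A schematic reading of the idea: assign each group generator a learnable matrix, and represent a group element given as a word over generators by the product of those matrices, so that matrix multiplication mirrors group composition and relations can be enforced or learned. A minimal sketch under that assumption (`LearnedMatrixRep` is introduced here; it is not the paper's architecture):

```python
import torch
import torch.nn as nn

class LearnedMatrixRep(nn.Module):
    """One learnable matrix per generator; a word maps to their product."""

    def __init__(self, n_generators, dim=8):
        super().__init__()
        init = torch.eye(dim).repeat(n_generators, 1, 1)
        self.mats = nn.Parameter(init + 0.01 * torch.randn(n_generators,
                                                           dim, dim))

    def forward(self, word):
        """word: sequence of generator indices, e.g. [0, 2, 1]."""
        out = torch.eye(self.mats.shape[-1])
        for g in word:
            out = out @ self.mats[g]      # composition = matrix product
        return out

rep = LearnedMatrixRep(n_generators=3)
features = rep([0, 2, 1]).flatten()       # feed to a downstream predictor
```

Because the representation of a longer word is just a longer matrix product of the same learned factors, this construction naturally extends to word lengths beyond those seen in training, which is the generalization property the abstract highlights.
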
**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, and does not include experimental results with quantitative metrics related to LLMs.

---

## [Managed-Retention Memory: A New Class of Memory for the AI Era](https://arxiv.org/abs/2501.09605v1)
**arXiv ID:** 2501.09605v1

**Abstract:**
> AI clusters today are one of the major uses of High Bandwidth Memory (HBM).
However, HBM is suboptimal for AI workloads for several reasons. Analysis shows
HBM is overprovisioned on write performance, but underprovisioned on density
and read bandwidth, and also has significant energy per bit overheads. It is
also expensive, with lower yield than DRAM due to manufacturing complexity. We
propose a new memory class: Managed-Retention Memory (MRM), which is more
optimized to store key data structures for AI inference workloads. We believe
that MRM may finally provide a path to viability for technologies that were
originally proposed to support Storage Class Memory (SCM). These technologies
traditionally offered long-term persistence (10+ years) but provided poor IO
performance and/or endurance. MRM makes different trade-offs, and by
understanding the workload IO patterns, MRM foregoes long-term data retention
and write performance for better potential performance on the metrics important
for these workloads.

**Decision Explanation:** The paper does not meet the criteria as it primarily focuses on memory optimization for AI workloads, without explicit mention of Large Language Models, knowledge graphs, retrieval-augmented generation, or agentic AI, and lacks experimental results with quantitative metrics.

---

## [Incorporating Quantum Advantage in Quantum Circuit Generation through
Genetic Programming](https://arxiv.org/abs/2501.09682v1)
**arXiv ID:** 2501.09682v1

**Abstract:**
> Designing efficient quantum circuits that leverage quantum advantage compared
to classical computing has become increasingly critical. Genetic algorithms
have shown potential in generating such circuits through artificial evolution.
However, integrating quantum advantage into the fitness function of these
algorithms remains unexplored. In this paper, we aim to enhance the efficiency
of quantum circuit design by proposing two novel approaches for incorporating
quantum advantage metrics into the fitness function of genetic algorithms. We
evaluate our approaches based on the Bernstein-Vazirani Problem and the
Unstructured Database Search Problem as test cases. The results demonstrate
that our approaches not only improve the convergence speed of the genetic
algorithm but also produce circuits comparable to expert-designed solutions.
Our findings suggest that automated quantum circuit design using genetic
algorithms that incorporate a measure of quantum advantage is a promising
approach to accelerating the development of quantum algorithms.

**Decision Explanation:** The paper does not meet the criteria as it focuses on quantum circuit generation and genetic programming, which is unrelated to Large Language Models (LLMs) and their practical applications.

---

## [Reward-Guided Controlled Generation for Inference-Time Alignment in
Diffusion Models: Tutorial and Review](https://arxiv.org/abs/2501.09685v1)
**arXiv ID:** 2501.09685v1

**Abstract:**
> This tutorial provides an in-depth guide on inference-time guidance and
alignment methods for optimizing downstream reward functions in diffusion
models. While diffusion models are renowned for their generative modeling
capabilities, practical applications in fields such as biology often require
sample generation that maximizes specific metrics (e.g., stability, affinity in
proteins, closeness to target structures). In these scenarios, diffusion models
can be adapted not only to generate realistic samples but also to explicitly
maximize desired measures at inference time without fine-tuning. This tutorial
explores the foundational aspects of such inference-time algorithms. We review
these methods from a unified perspective, demonstrating that current techniques
-- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling,
and classifier guidance -- aim to approximate soft optimal denoising processes
(a.k.a. policies in RL) that combine pre-trained denoising processes with value
functions serving as look-ahead functions that predict from intermediate states
to terminal rewards. Within this framework, we present several novel algorithms
not yet covered in the literature. Furthermore, we discuss (1) fine-tuning
methods combined with inference-time techniques, (2) inference-time algorithms
based on search algorithms such as Monte Carlo tree search, which have received
limited attention in current research, and (3) connections between
inference-time algorithms in language models and diffusion models. The code of
this tutorial on protein design is available at
https://github.com/masa-ue/AlignInversePro

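The unifying view (a pre-trained denoiser combined with a value function used as a look-ahead) is easiest to see in the SMC variant: propagate a population of particles with the denoiser, then importance-resample them in proportion to exp(value/λ). A generic sketch with stub `denoise_step` and `value_fn` arguments, illustrative rather than the tutorial's code:

```python
import numpy as np

def smc_guided_sampling(denoise_step, value_fn, particles, n_steps,
                        lam=0.1, seed=0):
    """SMC-style inference-time guidance for a diffusion sampler.

    denoise_step(x, t): one reverse-diffusion step of a pre-trained model.
    value_fn(x): look-ahead estimate of terminal reward per particle.
    """
    rng = np.random.default_rng(seed)
    x = particles                                    # (N, d) particles
    for t in range(n_steps, 0, -1):
        x = denoise_step(x, t)                       # propagate with the prior
        logw = value_fn(x) / lam                     # soft-optimal weights
        w = np.exp(logw - logw.max())                # stabilized exp
        w /= w.sum()
        x = x[rng.choice(len(x), size=len(x), p=w)]  # resample toward reward
    return x

# Toy usage: the "denoiser" just adds shrinking noise; the reward
# prefers samples near the point (1, 1).
x0 = np.random.randn(64, 2)
out = smc_guided_sampling(lambda x, t: x + 0.1 * np.random.randn(*x.shape),
                          lambda x: -((x - 1.0) ** 2).sum(axis=1),
                          x0, n_steps=20)
```
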
**Decision Explanation:** The paper primarily focuses on diffusion models and their applications in biology, which does not meet the criteria of focusing on Large Language Models (LLMs) or their practical applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Practical Continual Forgetting for Pre-trained Vision Models](https://arxiv.org/abs/2501.09705v1)
**arXiv ID:** 2501.09705v1

**Abstract:**
> For privacy and security concerns, the need to erase unwanted information
from pre-trained vision models is becoming evident nowadays. In real-world
scenarios, erasure requests originate at any time from both users and model
owners, and these requests usually form a sequence. Therefore, under such a
setting, selective information is expected to be continuously removed from a
pre-trained model while maintaining the rest. We define this problem as
continual forgetting and identify three key challenges. (i) For unwanted
knowledge, efficient and effective deleting is crucial. (ii) For remaining
knowledge, the impact brought by the forgetting procedure should be minimal.
(iii) In real-world scenarios, the training samples may be scarce or partially
missing during the process of forgetting. To address them, we first propose
Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA
modules to fine-tune the FFN layers in Transformer blocks for each forgetting
task independently, and towards (ii), a simple group sparse regularization is
adopted, enabling automatic selection of specific LoRA groups and zeroing out
the others. To further extend GS-LoRA to more practical scenarios, we
incorporate prototype information as additional supervision and introduce a
more practical approach, GS-LoRA++. For each forgotten class, we move the
logits away from its original prototype. For the remaining classes, we pull the
logits closer to their respective prototypes. We conduct extensive experiments
on face recognition, object detection and image classification and demonstrate
that our method manages to forget specific classes with minimal impact on other
classes. Codes have been released on https://github.com/bjzhb666/GS-LoRA.

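The mechanism behind "automatic selection of specific LoRA groups and zeroing out the others" is a group-lasso-style penalty: an un-squared l2 norm per group, summed across groups, drives whole groups (not individual weights) exactly to zero. A minimal sketch of that regularizer (illustrative; not taken from the released GS-LoRA code):

```python
import torch

def group_sparse_penalty(lora_groups, alpha=1e-3):
    """Group lasso over LoRA modules.

    lora_groups: iterable of (A, B) low-rank factor pairs, one pair per
    group (e.g. per FFN layer). The un-squared l2 norm of each group
    pushes entire groups to zero, deactivating those LoRA modules.
    """
    penalty = torch.tensor(0.0)
    for A, B in lora_groups:
        penalty = penalty + torch.sqrt((A ** 2).sum() + (B ** 2).sum())
    return alpha * penalty

# Toy usage: three LoRA groups of rank 4; add the penalty to the task loss.
groups = [(torch.randn(64, 4), torch.randn(4, 64)) for _ in range(3)]
loss_reg = group_sparse_penalty(groups)
```
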
**Decision Explanation:** The paper primarily focuses on vision models and does not meet the criteria related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Parallel multi-objective metaheuristics for smart communications in
vehicular networks](https://arxiv.org/abs/2501.09725v1)
**arXiv ID:** 2501.09725v1

**Abstract:**
> This article analyzes the use of two parallel multi-objective soft computing
algorithms to automatically search for high-quality settings of the Ad hoc On
Demand Vector routing protocol for vehicular networks. These methods are based
on an evolutionary algorithm and on a swarm intelligence approach. The
experimental analysis demonstrates that the configurations computed by our
optimization algorithms outperform other state-of-the-art optimized ones. In
turn, the computational efficiency achieved by all the parallel versions is
greater than 87%. Therefore, the line of work presented in this article
represents an efficient framework to improve vehicular communications.

**Decision Explanation:** The paper does not meet the criteria as it does not focus on Large Language Models (LLMs) or their applications, and instead discusses optimization algorithms for vehicular networks.

---

## [Benchmarking Robustness of Contrastive Learning Models for Medical
Image-Report Retrieval](https://arxiv.org/abs/2501.09134v1)
**arXiv ID:** 2501.09134v1

**Abstract:**
> Medical images and reports offer invaluable insights into patient health. The
heterogeneity and complexity of these data hinder effective analysis. To bridge
this gap, we investigate contrastive learning models for cross-domain
retrieval, which associates medical images with their corresponding clinical
reports. This study benchmarks the robustness of four state-of-the-art
contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We
introduce an occlusion retrieval task to evaluate model performance under
varying levels of image corruption. Our findings reveal that all evaluated
models are highly sensitive to out-of-distribution data, as evidenced by the
proportional decrease in performance with increasing occlusion levels. While
MedCLIP exhibits slightly more robustness, its overall performance remains
significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a
general-purpose dataset, struggles with medical image-report retrieval,
highlighting the importance of domain-specific training data. The evaluation of
this work suggests that more effort needs to be spent on improving the
robustness of these models. By addressing these limitations, we can develop
more reliable cross-domain retrieval models for medical applications.

**Decision Explanation:** The paper primarily focuses on medical applications of AI, which is an excluded topic according to the criteria.

---

## [Quantum-Enhanced Transformers for Robust Acoustic Scene Classification
in IoT Environments](https://arxiv.org/abs/2501.09394v1)
**arXiv ID:** 2501.09394v1

**Abstract:**
> The proliferation of Internet of Things (IoT) devices equipped with acoustic
sensors necessitates robust acoustic scene classification (ASC) capabilities,
even in noisy and data-limited environments. Traditional machine learning
methods often struggle to generalize effectively under such conditions. To
address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene
Classifier that leverages the power of quantum-inspired transformers. By
integrating quantum concepts like superposition and entanglement, Q-ASC
achieves superior feature learning and enhanced noise resilience compared to
classical models. Furthermore, we introduce a Quantum Variational Autoencoder
(QVAE) based data augmentation technique to mitigate the challenge of limited
labeled data in IoT deployments. Extensive evaluations on the Tampere
University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset
demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5%
under challenging conditions, outperforming state-of-the-art methods by over 5%
in the best case. This research paves the way for deploying intelligent
acoustic sensing in IoT networks, with potential applications in smart homes,
industrial monitoring, and environmental surveillance, even in adverse acoustic
environments.

**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their practical applications, and instead explores quantum-inspired transformers for acoustic scene classification, which is outside the specified criteria.

---

## [Metric Learning with Progressive Self-Distillation for Audio-Visual
Embedding Learning](https://arxiv.org/abs/2501.09608v1)
**arXiv ID:** 2501.09608v1

**Abstract:**
> Metric learning projects samples into an embedded space, where similarities
and dissimilarities are quantified based on their learned representations.
However, existing methods often rely on label-guided representation learning,
where representations of different modalities, such as audio and visual data,
are aligned based on annotated labels. This approach tends to underutilize
latent complex features and potential relationships inherent in the
distributions of audio and visual data that are not directly tied to the
labels, resulting in suboptimal performance in audio-visual embedding learning.
To address this issue, we propose a novel architecture that integrates
cross-modal triplet loss with progressive self-distillation. Our method
enhances representation learning by leveraging inherent distributions and
dynamically refining soft audio-visual alignments -- probabilistic alignments
between audio and visual data that capture the inherent relationships beyond
explicit labels. Specifically, the model distills audio-visual
distribution-based knowledge from annotated labels in a subset of each batch.
This self-distilled knowledge is used t

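The cross-modal triplet component referenced above has a standard batch form: matched audio-visual pairs act as anchor-positive pairs, and every other item in the batch serves as a negative under a hinge margin. A minimal sketch of that base loss; the paper's addition, progressive self-distillation over soft alignments, is omitted here:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet(audio_emb, visual_emb, margin=0.2):
    """Batch triplet loss across modalities.

    audio_emb, visual_emb: (N, d); row i of each is a matched pair.
    """
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    sim = a @ v.T                               # (N, N) cosine similarities
    pos = sim.diag().unsqueeze(1)               # matched-pair similarity
    hinge = (sim - pos + margin).clamp(min=0)   # margin violations
    mask = ~torch.eye(len(a), dtype=torch.bool) # keep only the negatives
    return hinge[mask].mean()

loss = cross_modal_triplet(torch.randn(8, 32), torch.randn(8, 32))
```
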
**Decision Explanation:** The paper does not focus on Large Language Models (LLMs) or their applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the primary criteria for selection.

---