Accepted Papers
LLMPC: Large Language Model Predictive Control
arXiv ID: http://arxiv.org/abs/2501.02486v1
Abstract:
Recent advancements in prompting techniques for Large Language Models (LLMs) have improved their reasoning, planning, and action abilities. This paper examines these prompting techniques through the lens of model predictive control (MPC). We show that LLMs act as implicit planning cost function minimizers when planning prompts are used. Under our framework we demonstrate that LLM planning performance can be improved further by incorporating real planning cost functions and evaluators.
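As a rough illustration of the loop the abstract describes, the sketch below runs one receding-horizon step: sample candidate plans from an LLM, score them with an explicit cost function, and execute only the first action before re-planning. The `llm_complete` client and `plan_cost` evaluator are hypothetical placeholders, not the paper's implementation.

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical LLM client; swap in any chat-completion API."""
    raise NotImplementedError

def plan_cost(plan: str, state: str) -> float:
    """User-supplied evaluator scoring a candidate plan; lower is better."""
    raise NotImplementedError

def llmpc_step(state: str, goal: str, n_candidates: int = 4) -> str:
    """One MPC iteration: sample candidate plans, keep the cheapest,
    and return only its first action (receding-horizon control)."""
    prompt = (
        f"Current state: {state}\n"
        f"Goal: {goal}\n"
        "Propose a short numbered plan of actions."
    )
    candidates = [llm_complete(prompt) for _ in range(n_candidates)]
    best = min(candidates, key=lambda p: plan_cost(p, state))
    return best.splitlines()[0]  # execute the first step only, then re-plan
```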
Decision Explanation: Original decision: ACCEPT The paper focuses primarily on optimizing LLM performance through the manipulation of textual input prompts (planning prompts), provides a framework (LLMPC) for systematic prompt development, and demonstrates its impact on LLM output, aligning with the core subject of prompt engineering for text-based interactions with LLMs.
Rejected Papers
[Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies](https://arxiv.org/abs/2501.03142v1) arXiv ID: http://arxiv.org/abs/2501.03142v1
Abstract:
Deep reinforcement learning (RL) policies can demonstrate unsafe behaviors and are challenging to interpret. To address these challenges, we combine RL policy model checking--a technique for determining whether RL policies exhibit unsafe behaviors--with co-activation graph analysis--a method that maps neural network inner workings by analyzing neuron activation patterns--to gain insight into the safe RL policy's sequential decision-making. This combination lets us interpret the RL policy's inner workings for safe decision-making. We demonstrate its applicability in various experiments.
Decision Explanation: Original decision: REJECT The paper focuses primarily on deep reinforcement learning (RL) policies, model checking, and co-activation graph analysis, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet the 'MUST' criteria.
Turn-based Multi-Agent Reinforcement Learning Model Checking
arXiv ID: http://arxiv.org/abs/2501.03187v1
Abstract:
In this paper, we propose a novel approach for verifying the compliance of turn-based multi-agent reinforcement learning (TMARL) agents with complex requirements in stochastic multiplayer games. Our method overcomes the limitations of existing verification approaches, which are inadequate for dealing with TMARL agents and not scalable to large games with multiple agents. Our approach relies on tight integration of TMARL and a verification technique referred to as model checking. We demonstrate the effectiveness and scalability of our technique through experiments in different types of environments. Our experiments show that our method is suited to verify TMARL agents and scales better than naive monolithic model checking.
Decision Explanation: Original decision: REJECT The paper primarily focuses on verifying compliance of turn-based multi-agent reinforcement learning agents, with no clear emphasis on prompt engineering for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria of focusing on engineering, design, or optimization of prompts specifically for LLMs.
Neural Deconstruction Search for Vehicle Routing Problems
arXiv ID: http://arxiv.org/abs/2501.03715v1
Abstract:
Autoregressive construction approaches generate solutions to vehicle routing problems in a step-by-step fashion, leading to high-quality solutions that are nearing the performance achieved by handcrafted, operations research techniques. In this work, we challenge the conventional paradigm of sequential solution construction and introduce an iterative search framework where solutions are instead deconstructed by a neural policy. Throughout the search, the neural policy collaborates with a simple greedy insertion algorithm to rebuild the deconstructed solutions. Our approach surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.
Decision Explanation: Original decision: REJECT The paper focuses on solving Vehicle Routing Problems using a neural search framework, which falls under the excluded categories (primarily concerned with applications other than text generation driven by LLMs, and specifically mentions automotive/self-driving/operations research subjects). It also does not meet the 'MUST' criteria of focusing primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs).
A completely uniform transformer for parity
arXiv ID: http://arxiv.org/abs/2501.02535v1
Abstract:
We construct a 3-layer constant-dimension transformer, recognizing the parity language, where neither parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak who use a positional encoding, depending on the input length (but their construction has 2 layers).
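For reference, the parity language the construction recognizes is, as commonly defined in this line of work, the set of binary strings containing an odd number of 1s:

```latex
\[
\mathrm{PARITY} \;=\; \{\, w \in \{0,1\}^{*} : \#_{1}(w) \text{ is odd} \,\},
\]
```

where the count of 1s must be decided for inputs of arbitrary length, which is what makes length-independent parameters and positional encodings notable.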
Decision Explanation: Original decision: REJECT The paper focuses on constructing a uniform transformer for recognizing the parity language, which aligns with developing new LLM architectures or training methods, violating 'MUST NOT' criterion 1. Additionally, it does not demonstrate prompt engineering for text-based interactions with LLMs as its core subject.
Test-time Computing: from System-1 Thinking to System-2 Thinking
arXiv ID: http://arxiv.org/abs/2501.02497v1
Abstract:
The remarkable performance of the o1 model in complex reasoning demonstrates that test-time computing scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time computing scaling. We trace the concept of test-time computing back to System-1 models. In System-1 models, test-time computing addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time computing in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out a few possible future directions.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria as it discusses test-time computing scaling for enhancing model performance (pertaining to System-1 and System-2 thinking) without specifically addressing prompt engineering for Large Language Models (LLMs) or demonstrating the impact of textual input prompts on LLM output.
[KG-CF: Knowledge Graph Completion with Context Filtering under the Guidance of Large Language Models](https://arxiv.org/abs/2501.02711v1) arXiv ID: http://arxiv.org/abs/2501.02711v1
Abstract:
Large Language Models (LLMs) have shown impressive performance in various tasks, including knowledge graph completion (KGC). However, current studies mostly apply LLMs to classification tasks, like identifying missing triplets, rather than ranking-based tasks, where the model ranks candidate entities based on plausibility. This focus limits the practical use of LLMs in KGC, as real-world applications prioritize highly plausible triplets. Additionally, while graph paths can help infer the existence of missing triplets and improve completion accuracy, they often contain redundant information. To address these issues, we propose KG-CF, a framework tailored for ranking-based KGC tasks. KG-CF leverages LLMs' reasoning abilities to filter out irrelevant contexts, achieving superior results on real-world datasets. The code and datasets are available at \url{https://anonymous.4open.science/r/KG-CF}.
Decision Explanation: Original decision: REJECT The paper primarily focuses on leveraging LLMs for knowledge graph completion (ranking-based tasks) rather than on the engineering, design, or optimization of prompts specifically for LLMs. While LLMs are used, the central concern is the application (KGC) and not prompt engineering techniques, methods, or their impact on LLM output.
Artificial Intelligence in Creative Industries: Advances Prior to 2025
arXiv ID: http://arxiv.org/abs/2501.02725v1
Abstract:
The rapid advancements in artificial intelligence (AI), particularly in generative AI and large language models (LLMs), have profoundly impacted the creative industries by enabling innovative content creation, enhancing workflows, and democratizing access to creative tools. This paper explores the significant technological shifts since our previous review in 2022, highlighting how these developments have expanded creative opportunities and efficiency. These technological advancements have enhanced the capabilities of text-to-image, text-to-video, and multimodal generation technologies. In particular, key breakthroughs in LLMs have established new benchmarks in conversational AI, while advancements in image generators have revolutionized content creation. We also discuss AI integration into post-production workflows, which has significantly accelerated and refined traditional processes. Despite these innovations, challenges remain, particularly for the media industry, due to the demands on communication traffic from creative content. We therefore include data compression and quality assessment in this paper. Furthermore, we highlight the trend toward unified AI frameworks capable of addressing multiple creative tasks and underscore the importance of human oversight to mitigate AI-generated inaccuracies. Finally, we explore AI's future potential in the creative sector, stressing the need to navigate emerging challenges to maximize its benefits while addressing associated risks.
Decision Explanation: Original decision: REJECT The paper's primary focus is on the broader impact of AI in creative industries, mentioning LLMs and generative AI as components, rather than specifically concentrating on prompt engineering for LLMs as required by the criteria.
[Multi-Agent Path Finding under Limited Communication Range Constraint via Dynamic Leading](https://arxiv.org/abs/2501.02770v1) arXiv ID: http://arxiv.org/abs/2501.02770v1
Abstract:
This paper proposes a novel framework to handle a multi-agent path finding problem under a limited communication range constraint, where all agents must have a connected communication channel to the rest of the team. Many existing approaches to multi-agent path finding (e.g., leader-follower platooning) overcome computational challenges of planning in this domain by planning one agent at a time in a fixed order. However, fixed leader-follower approaches can become stuck during planning, limiting their practical utility in dense-clutter environments. To overcome this limitation, we develop dynamic leading multi-agent path finding, which allows for dynamic reselection of the leading agent during path planning whenever progress cannot be made. The experiments show the efficiency of our framework, which can handle up to 25 agents with more than 90% success-rate across five environment types where baselines routinely fail.
Decision Explanation: Original decision: REJECT The paper focuses on multi-agent path finding under limited communication constraints, utilizing dynamic leading, and does not meet any of the 'MUST' criteria, particularly lacking primary focus on prompt engineering for Large Language Models (LLMs) and manipulation of textual input prompts to improve LLM performance.
[SenseRAG: Constructing Environmental Knowledge Bases with Proactive Querying for LLM-Based Autonomous Driving](https://arxiv.org/abs/2501.03535v2) arXiv ID: http://arxiv.org/abs/2501.03535v2
Abstract:
This study addresses the critical need for enhanced situational awareness in autonomous driving (AD) by leveraging the contextual reasoning capabilities of large language models (LLMs). Unlike traditional perception systems that rely on rigid, label-based annotations, it integrates real-time, multimodal sensor data into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically understand and respond to complex driving environments. To overcome the inherent latency and modality limitations of LLMs, a proactive Retrieval-Augmented Generation (RAG) is designed for AD, combined with a chain-of-thought prompting mechanism, ensuring rapid and context-rich understanding. Experimental results using real-world Vehicle-to-everything (V2X) datasets demonstrate significant improvements in perception and prediction performance, highlighting the potential of this framework to enhance safety, adaptability, and decision-making in next-generation AD systems.
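For concreteness, a minimal sketch of how retrieved sensor context and a chain-of-thought instruction might be assembled into a single prompt; the retriever, field names, and wording are illustrative assumptions, not the paper's implementation:

```python
def build_ad_prompt(query: str, retrieved_facts: list[str]) -> str:
    """Assemble a RAG prompt with a chain-of-thought instruction.
    `retrieved_facts` would come from the proactive V2X knowledge base."""
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "You are assisting an autonomous-driving planner.\n"
        f"Environmental facts retrieved from sensors:\n{context}\n\n"
        f"Question: {query}\n"
        "Think step by step about the traffic situation before answering."
    )

# Example usage with made-up sensor facts:
prompt = build_ad_prompt(
    "Is it safe to change into the left lane?",
    ["Vehicle 12 m behind in left lane, closing at 3 m/s",
     "Traffic light ahead: green, 40 m"],
)
```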
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing autonomous driving (AD) systems with LLMs, rather than prompt engineering for text-based interactions with LLMs. Although it mentions a 'chain-of-thought prompting mechanism', the core subject is the AD system's perception and prediction performance, not novel prompt engineering techniques for LLMs.
[STContext: A Multifaceted Dataset for Developing Context-aware Spatio-temporal Crowd Mobility Prediction Models](https://arxiv.org/abs/2501.03583v1) arXiv ID: http://arxiv.org/abs/2501.03583v1
Abstract:
In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at https://github.com/Liyue-Chen/STContext.
Decision Explanation: Original decision: REJECT The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it focuses on developing a multifaceted dataset for context-aware spatio-temporal crowd mobility prediction models, which falls outside the specified criteria.
[Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and Editable Policies](https://arxiv.org/abs/2501.03888v1) arXiv ID: http://arxiv.org/abs/2501.03888v1
Abstract:
Although deep reinforcement learning has been shown to be effective, the model's black-box nature presents barriers to direct policy interpretation. To address this problem, we propose a neuro-symbolic approach called neural DNF-MT for end-to-end policy learning. The differentiable nature of the neural DNF-MT model enables the use of deep actor-critic algorithms for training. At the same time, its architecture is designed so that trained models can be directly translated into interpretable policies expressed as standard (bivalent or probabilistic) logic programs. Moreover, additional layers can be included to extract abstract features from complex observations, acting as a form of predicate invention. The logic representations are highly interpretable, and we show how the bivalent representations of deterministic policies can be edited and incorporated back into a neural model, facilitating manual intervention and adaptation of learned policies. We evaluate our approach on a range of tasks requiring learning deterministic or stochastic behaviours from various forms of observations. Our empirical results show that our neural DNF-MT model performs at the level of competing black-box methods whilst providing interpretable policies.
Decision Explanation: Original decision: REJECT The paper focuses on developing a neuro-symbolic approach for interpretable policy learning in reinforcement learning, with no primary focus on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
arXiv ID: http://arxiv.org/abs/2501.03936v1
Abstract:
Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.
Decision Explanation: Original decision: REJECT The paper primarily focuses on generating presentations (image/video generation) and improving content quality, visual design, and structural coherence, rather than specifically engineering or optimizing text-based input prompts for Large Language Models (LLMs).
[Implementing Systemic Thinking for Automatic Schema Matching: An Agent-Based Modeling Approach](https://arxiv.org/abs/2501.04136v1) arXiv ID: http://arxiv.org/abs/2501.04136v1
Abstract:
Several approaches are proposed to deal with the problem of the Automatic Schema Matching (ASM). The challenges and difficulties caused by the complexity and uncertainty characterizing both the process and the outcome of Schema Matching motivated us to investigate how bio-inspired emerging paradigm can help with understanding, managing, and ultimately overcoming those challenges. In this paper, we explain how we approached Automatic Schema Matching as a systemic and Complex Adaptive System (CAS) and how we modeled it using the approach of Agent-Based Modeling and Simulation (ABMS). This effort gives birth to a tool (prototype) for schema matching called Reflex-SMAS. A set of experiments demonstrates the viability of our approach on two main aspects: (i) effectiveness (increasing the quality of the found matchings) and (ii) efficiency (reducing the effort required for this efficiency). Our approach represents a significant paradigm-shift, in the field of Automatic Schema Matching.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it concentrates on Automatic Schema Matching using Agent-Based Modeling and Simulation, with no apparent emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[AI-Driven Reinvention of Hydrological Modeling for Accurate Predictions and Interpretation to Transform Earth System Modeling](https://arxiv.org/abs/2501.04733v1) arXiv ID: http://arxiv.org/abs/2501.04733v1
Abstract:
Traditional equation-driven hydrological models often struggle to accurately predict streamflow in challenging regional Earth systems like the Tibetan Plateau, while hybrid and existing algorithm-driven models face difficulties in interpreting hydrological behaviors. This work introduces HydroTrace, an algorithm-driven, data-agnostic model that substantially outperforms these approaches, achieving a Nash-Sutcliffe Efficiency of 98% and demonstrating strong generalization on unseen data. Moreover, HydroTrace leverages advanced attention mechanisms to capture spatial-temporal variations and feature-specific impacts, enabling the quantification and spatial resolution of streamflow partitioning as well as the interpretation of hydrological behaviors such as glacier-snow-streamflow interactions and monsoon dynamics. Additionally, a large language model (LLM)-based application allows users to easily understand and apply HydroTrace's insights for practical purposes. These advancements position HydroTrace as a transformative tool in hydrological and broader Earth system modeling, offering enhanced prediction accuracy and interpretability.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a new hydrological modeling approach (HydroTrace) and its application in Earth system modeling, with the Large Language Model (LLM) being a secondary component for interpreting insights, rather than the primary subject being prompt engineering for text-based interactions with LLMs.
[Rethinking IDE Customization for Enhanced HAX: A Hyperdimensional Perspective](https://arxiv.org/abs/2501.02491v1) arXiv ID: http://arxiv.org/abs/2501.02491v1
Abstract:
As Integrated Development Environments (IDEs) increasingly integrate Artificial Intelligence, Software Engineering faces both benefits like productivity gains and challenges like mismatched user preferences. We propose Hyper-Dimensional (HD) vector spaces to model Human-Computer Interaction, focusing on user actions, stylistic preferences, and project context. These contributions aim to inspire further research on applying HD computing in IDE design.
Decision Explanation: Original decision: REJECT The paper focuses on IDE customization using Hyperdimensional vector spaces for Human-Computer Interaction, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific Domain](https://arxiv.org/abs/2501.02546v1) arXiv ID: http://arxiv.org/abs/2501.02546v1
Abstract:
Word sense disambiguation (WSD) is one of the main challenges in Computational Linguistics. TreeMatch is a WSD system originally developed using data from SemEval 2007 Task 7 (Coarse-grained English All-words Task) that has been adapted for use in SemEval 2010 Task 17 (All-words Word Sense Disambiguation on a Specific Domain). The system is based on a fully unsupervised method using dependency knowledge drawn from a domain specific knowledge base that was built for this task. When evaluated on the task, the system precision performs above the Most Frequent Selection baseline.
Decision Explanation: Original decision: REJECT The paper focuses on Word Sense Disambiguation (WSD) using dependency knowledge and a domain-specific knowledge base, with no indication of primarily investigating, analyzing, or proposing methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts.
Efficient Architectures for High Resolution Vision-Language Models
arXiv ID: http://arxiv.org/abs/2501.02584v1
Abstract:
Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a novel architecture (Pheye) for Vision-Language Models (VLMs) to process high-resolution images, which aligns with excluded criteria: developing new LLM architectures and being concerned with applications of generative AI other than text generation driven by LLMs (specifically, vision-language tasks).
[Enhancing Robot Route Optimization in Smart Logistics with Transformer and GNN Integration](https://arxiv.org/abs/2501.02749v1) arXiv ID: http://arxiv.org/abs/2501.02749v1
Abstract:
This research delves into advanced route optimization for robots in smart logistics, leveraging a fusion of Transformer architectures, Graph Neural Networks (GNNs), and Generative Adversarial Networks (GANs). The approach utilizes a graph-based representation encompassing geographical data, cargo allocation, and robot dynamics, addressing both spatial and resource limitations to refine route efficiency. Through extensive testing with authentic logistics datasets, the proposed method achieves notable improvements, including a 15% reduction in travel distance, a 20% boost in time efficiency, and a 10% decrease in energy consumption. These findings highlight the algorithm's effectiveness, promoting enhanced performance in intelligent logistics operations.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the integration of Transformer, GNN, and GAN for robot route optimization in smart logistics, with no apparent emphasis on prompt engineering, manipulation of textual input prompts, or the interaction with Large Language Models (LLMs) for text generation.
Key-value memory in the brain
arXiv ID: http://arxiv.org/abs/2501.02950v1
Abstract:
Classical models of memory in psychology and neuroscience rely on similarity-based retrieval of stored patterns, where similarity is a function of retrieval cues and the stored patterns. While parsimonious, these models do not allow distinct representations for storage and retrieval, despite their distinct computational demands. Key-value memory systems, in contrast, distinguish representations used for storage (values) and those used for retrieval (keys). This allows key-value memory systems to optimize simultaneously for fidelity in storage and discriminability in retrieval. We review the computational foundations of key-value memory, its role in modern machine learning systems, related ideas from psychology and neuroscience, applications to a number of empirical puzzles, and possible biological implementations.
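The canonical key-value retrieval operation in modern machine learning, which the review connects to these psychological and neural proposals, is softmax attention: a query is compared against stored keys, and the matching values are read out as

```latex
\[
\mathrm{retrieve}(q) \;=\; \sum_{i} \frac{\exp(q^{\top} k_{i})}{\sum_{j} \exp(q^{\top} k_{j})}\, v_{i},
\]
```

so the keys k_i can be optimized for discriminability while the values v_i preserve fidelity of the stored content.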
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate methods for improving LLM performance through prompt manipulation. Instead, it discusses key-value memory systems in the context of psychology, neuroscience, and machine learning, with no apparent connection to LLM prompt engineering.
[Putnam's Critical and Explanatory Tendencies Interpreted from a Machine Learning Perspective](https://arxiv.org/abs/2501.03026v1) arXiv ID: http://arxiv.org/abs/2501.03026v1
Abstract:
Making sense of theory choice in normal and across extraordinary science is central to philosophy of science. The emergence of machine learning models has the potential to act as a wrench in the gears of current debates. In this paper, I will attempt to reconstruct the main movements that lead to and came out of Putnam's critical and explanatory tendency distinction, argue for the biconditional necessity of the tendencies, and conceptualize that wrench through a machine learning interpretation of my claim.
Decision Explanation: Original decision: REJECT The paper's primary focus is on interpreting philosophical concepts through a machine learning perspective, not specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
GLiREL -- Generalist Model for Zero-Shot Relation Extraction
arXiv ID: http://arxiv.org/abs/2501.03172v1
Abstract:
We introduce GLiREL (Generalist Lightweight model for zero-shot Relation Extraction), an efficient architecture and training paradigm for zero-shot relation classification. Inspired by recent advancements in zero-shot named entity recognition, this work presents an approach to efficiently and accurately predict zero-shot relationship labels between multiple entities in a single forward pass. Experiments using the FewRel and WikiZSL benchmarks demonstrate that our approach achieves state-of-the-art results on the zero-shot relation classification task. In addition, we contribute a protocol for synthetically-generating datasets with diverse relation labels.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new efficient architecture and training paradigm for zero-shot relation classification, rather than engineering, designing, or optimizing prompts for Large Language Models (LLMs) to improve their text generation performance.
Can LLMs Design Good Questions Based on Context?
arXiv ID: http://arxiv.org/abs/2501.03491v1
Abstract:
This paper evaluates questions generated by LLMs from context, comparing them to human-generated questions across six dimensions. We introduce an automated LLM-based evaluation method, focusing on aspects like question length, type, context coverage, and answerability. Our findings highlight unique characteristics of LLM-generated questions, contributing insights that can support further research in question quality and downstream applications.
Decision Explanation: Original decision: REJECT The paper focuses on LLMs generating questions based on context, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) to improve their text generation performance through textual input manipulation.
[Self-Adaptive ERP: Embedding NLP into Petri-Net creation and Model Matching](https://arxiv.org/abs/2501.03795v1) arXiv ID: http://arxiv.org/abs/2501.03795v1
Abstract:
Enterprise Resource Planning (ERP) consultants play a vital role in customizing systems to meet specific business needs by processing large amounts of data and adapting functionalities. However, the process is resource-intensive, time-consuming, and requires continuous adjustments as business demands evolve. This research introduces a Self-Adaptive ERP Framework that automates customization using enterprise process models and system usage analysis. It leverages Artificial Intelligence (AI) & Natural Language Processing (NLP) for Petri nets to transform business processes into adaptable models, addressing both structural and functional matching. The framework, built using Design Science Research (DSR) and a Systematic Literature Review (SLR), reduces reliance on manual adjustments, improving ERP customization efficiency and accuracy while minimizing the need for consultants.
Decision Explanation: Original decision: REJECT The paper's primary focus is on automating ERP customization using AI and NLP for Petri-net creation and model matching, rather than specifically engineering, designing, or optimizing prompts for Large Language Models (LLMs). The mention of NLP is not centered around prompt engineering for LLMs, but rather for transforming business processes into adaptable models within an ERP framework.
Synthetic Data Privacy Metrics
arXiv ID: http://arxiv.org/abs/2501.03941v1
Abstract:
Recent advancements in generative AI have made it possible to create synthetic datasets that can be as accurate as real-world data for training AI models, powering statistical insights, and fostering collaboration with sensitive datasets while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in the process. However, while there is a multitude of new privacy metrics being published every day, there currently is no standardization. In this paper, we review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g. differential privacy).
Decision Explanation: Original decision: REJECT The paper focuses primarily on synthetic data privacy metrics, generative AI for dataset creation, and differential privacy, without addressing prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Traits of a Leader: User Influence Level Prediction through Sociolinguistic Modeling](https://arxiv.org/abs/2501.04046v1) arXiv ID: http://arxiv.org/abs/2501.04046v1
Abstract:
Recognition of a user's influence level has attracted much attention as human interactions move online. Influential users have the ability to sway others' opinions to achieve some goals. As a result, predicting users' level of influence can help to understand social networks, forecast trends, prevent misinformation, etc. However, predicting user influence is a challenging problem because the concept of influence is specific to a situation or a domain, and user communications are limited to text. In this work, we define user influence level as a function of community endorsement and develop a model that significantly outperforms the baseline by leveraging demographic and personality data. This approach consistently improves RankDCG scores across eight different domains.
Decision Explanation: Original decision: REJECT The paper focuses on sociolinguistic modeling for predicting user influence levels in online interactions, without primarily addressing the engineering, design, or optimization of prompts for Large Language Models (LLMs) or demonstrating the manipulation of textual input prompts to improve LLM performance.
[IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry](https://arxiv.org/abs/2501.05476v1) arXiv ID: http://arxiv.org/abs/2501.05476v1
Abstract:
Recent research has investigated the problem of detecting machine-generated essays for academic purposes. To address this challenge, this research utilizes pre-trained, transformer-based models fine-tuned on Arabic and English academic essays with stylometric features. Custom models based on ELECTRA for English and AraELECTRA for Arabic were trained and evaluated using a benchmark dataset. Proposed models achieved excellent results with an F1-score of 99.7%, ranking 2nd among of 26 teams in the English subtask, and 98.4%, finishing 1st out of 23 teams in the Arabic one.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing and fine-tuning transformer-based models (ELECTRA, AraELECTRA) for detecting machine-generated essays, rather than engineering or optimizing prompts for Large Language Models (LLMs). Prompt engineering is not the central concern of this research.
Fairness Through Matching
arXiv ID: http://arxiv.org/abs/2501.02793v1
Abstract:
Group fairness requires that different protected groups, characterized by a given sensitive attribute, receive equal outcomes overall. Typically, the level of group fairness is measured by the statistical gap between predictions from different protected groups. In this study, we reveal an implicit property of existing group fairness measures, which provides an insight into how the group-fair models behave. Then, we develop a new group-fair constraint based on this implicit property to learn group-fair models. To do so, we first introduce a notable theoretical observation: every group-fair model has an implicitly corresponding transport map between the input spaces of each protected group. Based on this observation, we introduce a new group fairness measure termed Matched Demographic Parity (MDP), which quantifies the averaged gap between predictions of two individuals (from different protected groups) matched by a given transport map. Then, we prove that any transport map can be used in MDP to learn group-fair models, and develop a novel algorithm called Fairness Through Matching (FTM), which learns a group-fair model using MDP constraint with an user-specified transport map. We specifically propose two favorable types of transport maps for MDP, based on the optimal transport theory, and discuss their advantages. Experiments reveal that FTM successfully trains group-fair models with certain desirable properties by choosing the transport map accordingly.
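One plausible formalization of the Matched Demographic Parity measure sketched above (a reading of the description, not the paper's exact definition): for a predictor f and a transport map T from group 0's input space to group 1's,

```latex
\[
\mathrm{MDP}(f; T) \;=\; \mathbb{E}_{x \sim P_{0}} \big[\, \lvert f(x) - f(T(x)) \rvert \,\big],
\]
```

the averaged prediction gap between individuals matched by T.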
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs), instead concentrating on developing a new group-fair constraint and algorithm for learning group-fair models, with no mention of LLMs, prompt design, or textual input manipulation.
CALM: Curiosity-Driven Auditing for Large Language Models
arXiv ID: http://arxiv.org/abs/2501.02997v1
Abstract:
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
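The paper finetunes an LLM auditor with intrinsically motivated RL; as a much simpler stand-in, the sketch below illustrates the same curiosity-driven black-box loop using random mutation search with a count-based novelty bonus. Every function name here is a hypothetical placeholder, not CALM's implementation.

```python
import random
from collections import Counter

def target_llm(prompt: str) -> str:
    """Hypothetical black-box LLM service under audit."""
    raise NotImplementedError

def toxicity(text: str) -> float:
    """Hypothetical toxicity scorer in [0, 1]."""
    raise NotImplementedError

def mutate(prompt: str, vocab: list[str]) -> str:
    # Swap one random token to take a step in the discrete prompt space.
    tokens = prompt.split()
    tokens[random.randrange(len(tokens))] = random.choice(vocab)
    return " ".join(tokens)

def audit(seed: str, vocab: list[str], steps: int = 1000) -> list[tuple[str, str]]:
    visits: Counter = Counter()
    findings, prompt = [], seed
    for _ in range(steps):
        prompt = mutate(prompt, vocab)
        visits[prompt] += 1
        novelty = visits[prompt] ** -0.5          # count-based curiosity bonus
        output = target_llm(prompt)
        tox_out = toxicity(output)
        if tox_out > 0.8 and toxicity(prompt) < 0.2:
            findings.append((prompt, output))     # non-toxic input, toxic output
        if tox_out + 0.1 * novelty < 0.2:
            prompt = seed                         # abandon unpromising branches
    return findings
```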
Decision Explanation: Original decision: REJECT The paper primarily focuses on auditing Large Language Models (LLMs) using a reinforcement learning approach, rather than on the engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation.
[Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment](https://arxiv.org/abs/2501.03012v1) arXiv ID: http://arxiv.org/abs/2501.03012v1
Abstract:
Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, much less attention has been paid to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden state representations to reveal how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs behaviors without any training, such as modifying answer types, captions style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code for this project is publicly available at https://github.com/mshukor/xl-vlms.
Decision Explanation: Original decision: REJECT The paper primarily focuses on analyzing and understanding the internal representation shifts of multimodal LLMs during fine-tuning, rather than engineering, designing, or optimizing prompts specifically for LLMs, thus failing to meet the core 'MUST' criteria.
[Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches](https://arxiv.org/abs/2501.03151v1) arXiv ID: http://arxiv.org/abs/2501.03151v1
Abstract:
Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex and truly non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and, thereby, providing extensive capabilities, including the ability to reason, engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems -- embodiment, symbol grounding, causality and memory -- are required to be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically-plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.
Decision Explanation: Original decision: REJECT The paper's primary focus is on foundational principles and approaches for achieving Artificial General Intelligence (AGI) with Large Language Models (LLMs), rather than specifically on the engineering, design, or optimization of prompts for LLMs. It lacks concrete examples of prompts and their impact on LLM output, which is a required criterion.
[From Aleatoric to Epistemic: Exploring Uncertainty Quantification Techniques in Artificial Intelligence](https://arxiv.org/abs/2501.03282v1) arXiv ID: http://arxiv.org/abs/2501.03282v1
Abstract:
Uncertainty quantification (UQ) is a critical aspect of artificial intelligence (AI) systems, particularly in high-risk domains such as healthcare, autonomous systems, and financial technology, where decision-making processes must account for uncertainty. This review explores the evolution of uncertainty quantification techniques in AI, distinguishing between aleatoric and epistemic uncertainties, and discusses the mathematical foundations and methods used to quantify these uncertainties. We provide an overview of advanced techniques, including probabilistic methods, ensemble learning, sampling-based approaches, and generative models, while also highlighting hybrid approaches that integrate domain-specific knowledge. Furthermore, we examine the diverse applications of UQ across various fields, emphasizing its impact on decision-making, predictive accuracy, and system robustness. The review also addresses key challenges such as scalability, efficiency, and integration with explainable AI, and outlines future directions for research in this rapidly developing area. Through this comprehensive survey, we aim to provide a deeper understanding of UQ's role in enhancing the reliability, safety, and trustworthiness of AI systems.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses primarily on uncertainty quantification techniques in AI across various high-risk domains, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide concrete examples of prompts impacting LLM output.
[Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function for Real-Time Strategy Tasks](https://arxiv.org/abs/2501.03824v1) arXiv ID: http://arxiv.org/abs/2501.03824v1
Abstract:
Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments. This study proposes a method to improve evaluation functions for real-time responsiveness to battlefield situation changes, utilizing an online reinforcement learning-based dynamic weight adjustment mechanism within the real-time strategy game. Building on traditional static evaluation functions, the method employs gradient descent in online reinforcement learning to update weights dynamically, incorporating weight decay techniques to ensure stability. Additionally, the AdamW optimizer is integrated to adjust the learning rate and decay rate of online reinforcement learning in real time, further reducing the dependency on manual parameter tuning. Round-robin competition experiments demonstrate that this method significantly enhances the application effectiveness of the Lanchester combat model evaluation function, Simple evaluation function, and Simple Sqrt evaluation function in planning algorithms including IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable improvement in scores, with the enhancement becoming more pronounced as the map size increases. Furthermore, the increase in evaluation function computation time induced by this method is kept below 6% for all evaluation functions and planning algorithms. The proposed dynamic adaptive evaluation function demonstrates a promising approach for real-time strategy task evaluation.
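As a minimal sketch of the mechanism described above (not the paper's exact formulation), the class below performs an online AdamW update of a linear evaluation function w · x against an observed outcome signal, with decoupled weight decay for stability:

```python
import numpy as np

class OnlineAdamW:
    """Online AdamW update for a linear evaluation function w · x."""
    def __init__(self, dim: int, lr: float = 1e-2, wd: float = 1e-4,
                 b1: float = 0.9, b2: float = 0.999, eps: float = 1e-8):
        self.w = np.zeros(dim)
        self.m = np.zeros(dim)   # first-moment estimate
        self.v = np.zeros(dim)   # second-moment estimate
        self.t = 0
        self.lr, self.wd, self.b1, self.b2, self.eps = lr, wd, b1, b2, eps

    def evaluate(self, x: np.ndarray) -> float:
        return float(self.w @ x)

    def update(self, x: np.ndarray, target: float) -> None:
        """One gradient step on the squared evaluation error, with
        decoupled weight decay (the 'W' in AdamW)."""
        self.t += 1
        grad = 2.0 * (self.evaluate(x) - target) * x
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad**2
        m_hat = self.m / (1 - self.b1**self.t)
        v_hat = self.v / (1 - self.b2**self.t)
        self.w -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.wd * self.w)
```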
Decision Explanation: Original decision: REJECT The paper's primary focus is on improving evaluation functions for real-time strategy tasks using online reinforcement learning, with no emphasis on prompt engineering, design, or optimization for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.
[Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback](https://arxiv.org/abs/2501.03916v2) arXiv ID: http://arxiv.org/abs/2501.03916v2
Abstract:
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we propose Dolphin, the first closed-loop open-ended auto-research framework to further build the entire process of human scientific research. Dolphin can generate research ideas, perform experiments, and get feedback from experimental results to generate higher-quality ideas. More specifically, Dolphin first generates novel ideas based on relevant papers which are ranked by the topic and task attributes. Then, the codes are automatically generated and debugged with the exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and results show that Dolphin can generate novel ideas continuously and complete the experiment in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 2D image classification and 3D point classification.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing an auto-research framework using AI, with Large Language Models (LLMs) being only one of the tools used for generating research ideas, rather than on prompt engineering for text-based interactions with LLMs, as required.
[Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations](https://arxiv.org/abs/2501.02409v1) arXiv ID: http://arxiv.org/abs/2501.02409v1
Abstract:
Modern high-throughput biological datasets with thousands of perturbations provide the opportunity for large-scale discovery of causal graphs that represent the regulatory interactions between genes. Numerous methods have been proposed to infer a directed acyclic graph (DAG) corresponding to the underlying gene regulatory network (GRN) that captures causal gene relationships. However, existing models have restrictive assumptions (e.g. linearity, acyclicity), limited scalability, and/or fail to address the dynamic nature of biological processes such as cellular differentiation. We propose PerturbODE, a novel framework that incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the causal GRN from the neural ODE's parameters. We demonstrate PerturbODE's efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.
Decision Explanation: Original decision: REJECT The paper does not meet any of the 'MUST' criteria, as it focuses on gene regulatory network discovery using neural ODEs, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, falling outside the specified scope.
[A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models](https://arxiv.org/abs/2501.02441v1) arXiv ID: http://arxiv.org/abs/2501.02441v1
Abstract:
Large Language Models (LLMs) are rapidly gaining enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the inclusion of copyrighted materials in their training data without proper attribution or licensing, which falls under the broader issue of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, to determine whether a given LLM has incorporated data generated by another LLM. To address this issue, we propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct a pivotal statistic, determine the optimal rejection threshold, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality properties of the proposed tests, and demonstrate its empirical effectiveness through intensive numerical experiments.
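The paper constructs its own pivotal statistic; as a simpler illustration of casting watermark detection as hypothesis testing, the sketch below implements the standard green-list z-test used for LLM text watermarks (Kirchenbauer-style), where under H0 each token lands in the green list independently with probability gamma:

```python
import math

def watermark_z_test(tokens: list[str], green: set[str], gamma: float = 0.5) -> float:
    """Green-list z-statistic: under H0 (no watermark exposure) the green
    count is Binomial(n, gamma); a large z rejects H0."""
    n = len(tokens)
    g = sum(t in green for t in tokens)
    return (g - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# One-sided test: reject H0 at level alpha = 0.05 if z > 1.645.
```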
Decision Explanation: Original decision: REJECT The paper primarily focuses on detecting data misappropriation in LLMs through statistical hypothesis testing, with no emphasis on prompt engineering, design, or optimization for improving LLM performance through textual input prompts.
[RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware Watermarking Framework](https://arxiv.org/abs/2501.02446v1) arXiv ID: http://arxiv.org/abs/2501.02446v1
Abstract:
Recent advances of large language models in the field of Verilog generation have raised several ethical and security concerns, such as code copyright protection and dissemination of malicious code. Researchers have employed watermarking techniques to identify codes generated by large language models. However, the existing watermarking works fail to protect RTL code copyright due to the significant syntactic and semantic differences between RTL code and software code in languages such as Python. This paper proposes a hardware watermarking framework RTLMarker that embeds watermarks into RTL code and deeper into the synthesized netlist. We propose a set of rule-based Verilog code transformations , ensuring the watermarked RTL code's syntactic and semantic correctness. In addition, we consider an inherent tradeoff between watermark transparency and watermark effectiveness and jointly optimize them. The results demonstrate RTLMarker's superiority over the baseline in RTL code watermarking.
Decision Explanation: Original decision: REJECT The paper primarily focuses on a hardware watermarking framework (RTLMarker) for protecting RTL code copyright generated by LLMs, rather than exploring prompt engineering techniques for improving LLM performance through textual input manipulation.
[Enhancing Contrastive Learning for Retinal Imaging via Adjusted Augmentation Scales](https://arxiv.org/abs/2501.02451v1) arXiv ID: http://arxiv.org/abs/2501.02451v1
Abstract:
Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model's generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.
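For concreteness, here is what "weak" versus "strong" augmentation pipelines might look like in torchvision; the specific operations and parameters are illustrative assumptions modeled on common SimCLR-style recipes, not the paper's settings:

```python
from torchvision import transforms

# Hypothetical "weak" pipeline: mild crop and flip only.
weak = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Hypothetical "strong" pipeline: aggressive crop plus color distortion and blur.
strong = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.8, 0.8, 0.8, 0.2),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
```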
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing contrastive learning for medical imaging (retinal imaging), which violates the 'MUST NOT' criteria: being primarily concerned with medical subjects and not focusing on prompt engineering for Large Language Models (LLMs) or text generation driven by LLMs.
[Hengqin-RA-v1: Advanced Large Language Model for Diagnosis and Treatment of Rheumatoid Arthritis with Dataset based Traditional Chinese Medicine](https://arxiv.org/abs/2501.02471v1) arXiv ID: http://arxiv.org/abs/2501.02471v1
Abstract:
Large language models (LLMs), primarily trained on English texts, often face biases and inaccuracies in Chinese contexts. Their limitations are pronounced in fields like Traditional Chinese Medicine (TCM), where cultural and clinical subtleties are vital, further hindered by a lack of domain-specific data, such as rheumatoid arthritis (RA). To address these issues, this paper introduces Hengqin-RA-v1, the first large language model specifically tailored for TCM with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a comprehensive RA-specific dataset curated from ancient Chinese medical literature, classical texts, and modern clinical studies. This dataset empowers Hengqin-RA-v1 to deliver accurate and culturally informed responses, effectively bridging the gaps left by general-purpose models. Extensive experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models, even surpassing the diagnostic accuracy of TCM practitioners in certain cases.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new Large Language Model architecture (Hengqin-RA-v1) specifically tailored for Traditional Chinese Medicine, rather than prompt engineering for existing LLMs, and introduces a new dataset for training, which violates 'MUST NOT' criterion 1.
The Meta-Representation Hypothesis
arXiv ID: http://arxiv.org/abs/2501.02481v1
Abstract:
Humans rely on high-level meta-representations to engage in abstract reasoning. In complex cognitive tasks, these meta-representations help individuals abstract general rules from experience. However, constructing such meta-representations from high-dimensional observations remains a longstanding challenge for reinforcement learning agents. For instance, a well-trained agent often fails to generalize to even minor variations of the same task, such as changes in background color, which humans handle with ease. In this paper, we build a bridge between meta-representation and generalization, showing that generalization performance benefits from meta-representation learning. We also hypothesize that deep mutual learning (DML) among agents can help them converge to meta-representations. Empirical results provide support for our theory and hypothesis. Overall, this work provides a new perspective on the generalization of deep reinforcement learning.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it primarily focuses on reinforcement learning, meta-representation, and generalization in deep learning, with no clear emphasis on prompt engineering, design, or optimization specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.
[Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection](https://arxiv.org/abs/2501.02504v1) arXiv ID: http://arxiv.org/abs/2501.02504v1
Abstract:
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR
Decision Explanation: Original decision: REJECT The paper primarily focuses on video moment retrieval and highlight detection, utilizing a novel Video Context-aware Keyword Attention module, with no evident focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Remote Inference over Dynamic Links via Adaptive Rate Deep Task-Oriented Vector Quantization](https://arxiv.org/abs/2501.02521v1) arXiv ID: http://arxiv.org/abs/2501.02521v1
Abstract:
A broad range of technologies rely on remote inference, wherein data acquired is conveyed over a communication channel for inference in a remote server. Communication between the participating entities is often carried out over rate-limited channels, necessitating data compression for reducing latency. While deep learning facilitates joint design of the compression mapping along with encoding and inference rules, existing learned compression mechanisms are static, and struggle in adapting their resolution to changes in channel conditions and to dynamic links. To address this, we propose Adaptive Rate Task-Oriented Vector Quantization (ARTOVeQ), a learned compression mechanism that is tailored for remote inference over dynamic links. ARTOVeQ is based on designing nested codebooks along with a learning algorithm employing progressive learning. We show that ARTOVeQ extends to support low-latency inference that is gradually refined via successive refinement principles, and that it enables the simultaneous usage of multiple resolutions when conveying high-dimensional data. Numerical results demonstrate that the proposed scheme yields remote deep inference that operates with multiple rates, supports a broad range of bit budgets, and facilitates rapid inference that gradually improves with more bits exchanged, while approaching the performance of single-rate deep quantization methods.
Decision Explanation: Original decision: REJECT The paper focuses on adaptive rate deep task-oriented vector quantization for remote inference over dynamic links, without any primary concern for prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts to improve LLM performance.
Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation
arXiv ID: http://arxiv.org/abs/2501.02523v1
Abstract:
Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only a text prompt. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features for diffusion models. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp
Decision Explanation: Original decision: REJECT The paper primarily focuses on optimizing image makeup techniques for text-to-image generation, which falls under image generation driven by non-text-based interactions with generative models, violating 'MUST NOT' criterion 2, and does not meet the core subject requirement of prompt engineering for text-based interactions with Large Language Models (LLMs).
[AMM: Adaptive Modularized Reinforcement Model for Multi-city Traffic Signal Control](https://arxiv.org/abs/2501.02548v1) arXiv ID: http://arxiv.org/abs/2501.02548v1
Abstract:
Traffic signal control (TSC) is an important and widely studied direction. Recently, reinforcement learning (RL) methods have been used to solve TSC problems and achieve superior performance over conventional TSC methods. However, applying RL methods to the real world is challenging due to the huge cost of experiments in real-world traffic environments. One possible solution is TSC domain adaptation, which adapts trained models to target environments and reduces the number of interactions and the training cost. However, existing TSC domain adaptation methods still face two major issues: the lack of consideration for differences across cities and the low utilization of multi-city data. To solve the aforementioned issues, we propose an approach named Adaptive Modularized Model (AMM). By modularizing TSC problems and network models, we overcome the challenge of possible changes in environmental observations. We also aggregate multi-city experience through meta-learning. We conduct extensive experiments on different cities and show that AMM can achieve excellent performance with limited interactions in target environments and outperform existing methods. We also demonstrate the feasibility and generalizability of our method.
Decision Explanation: Original decision: REJECT The paper focuses on Adaptive Modularized Reinforcement Model for Multi-city Traffic Signal Control, which does not meet the 'MUST' criteria of primarily focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.
KM-UNet KAN Mamba UNet for medical image segmentation
arXiv ID: http://arxiv.org/abs/2501.02559v1
Abstract:
Medical image segmentation is a critical task in medical imaging analysis. Traditional CNN-based methods struggle with modeling long-range dependencies, while Transformer-based models, despite their success, suffer from quadratic computational complexity. To address these limitations, we propose KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet leverages the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range modeling, achieving a balance between accuracy and computational efficiency. We evaluate KM-UNet on five benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results demonstrate that KM-UNet achieves competitive performance compared to state-of-the-art methods in medical image segmentation tasks. To the best of our knowledge, KM-UNet is the first medical image segmentation framework integrating KANs and SSMs. This work provides a valuable baseline and new insights for the development of more efficient and interpretable medical image segmentation systems. The code is open source at https://github.com/2760613195/KM_UNet Keywords: KAN, Mamba, state-space models, UNet, medical image segmentation, deep learning
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical image segmentation using a novel U-shaped network architecture, which falls under excluded subjects (medical) and does not meet the 'MUST' criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
Decoding fMRI Data into Captions using Prefix Language Modeling
arXiv ID: http://arxiv.org/abs/2501.02570v1
Abstract:
With the advancements in Large Language and Latent Diffusion models, brain decoding has achieved remarkable results in recent years. The works on the NSD dataset, with stimuli images from the COCO dataset, leverage the embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach introduces the challenge of potential data contamination given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal and then providing its [CLS] token as the prefix to the GPT-2 language model, which decreases computational requirements considerably. Additionally, instead of the commonly used Linear Regression, we explore 3D Convolutional Neural Network mapping of fMRI signals to image embedding space to better account for the positional information of voxels.
Decision Explanation: Original decision: REJECT The paper primarily focuses on decoding fMRI data into image captions using a combination of computer vision and language models, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). The use of a GPT-2 model with a predicted prefix does not constitute a primary focus on prompt engineering for text-based interactions with LLMs.
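For readers unfamiliar with the prefix-conditioning idea in the abstract above, the sketch below illustrates how a single predicted embedding can seed GPT-2 generation. It is a minimal illustration under stated assumptions, not the paper's pipeline: the `prefix` tensor here is a random stand-in for a projected DINOv2 [CLS] embedding, and `caption_from_prefix` is a name invented for this example.

```python
# Minimal sketch of prefix-conditioned captioning with GPT-2 (see assumptions above).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def caption_from_prefix(prefix_embedding, max_new_tokens=20):
    """Greedy-decode text from a (1, 1, n_embd) prefix embedding."""
    past, inputs_embeds, generated = None, prefix_embedding, []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(inputs_embeds=inputs_embeds, past_key_values=past)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1)  # greedy token choice
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            # feed the chosen token back in as its embedding
            inputs_embeds = model.transformer.wte(next_id).unsqueeze(1)
    return tokenizer.decode(generated)

# Hypothetical usage: prefix = projector(fmri_encoder(fmri_volume))
prefix = torch.randn(1, 1, model.config.n_embd)  # random stand-in, see lead-in
print(caption_from_prefix(prefix))
```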
Evolving Skeletons: Motion Dynamics in Action Recognition
arXiv ID: http://arxiv.org/abs/2501.02593v1
Abstract:
Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.
Decision Explanation: Original decision: REJECT The paper focuses on skeleton-based action recognition, graph/hypergraph models (ST-GCN, Hyperformer), and motion dynamics analysis, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Empowering Bengali Education with AI: Solving Bengali Math Word Problems through Transformer Models](https://arxiv.org/abs/2501.02599v1) arXiv ID: http://arxiv.org/abs/2501.02599v1
Abstract:
Mathematical word problems (MWPs) involve the task of converting textual descriptions into mathematical equations. This poses a significant challenge in natural language processing, particularly for low-resource languages such as Bengali. This paper addresses this challenge by developing an innovative approach to solving Bengali MWPs using transformer-based models, including Basic Transformer, mT5, BanglaT5, and mBART50. To support this effort, the "PatiGonit" dataset was introduced, containing 10,000 Bengali math problems, and these models were fine-tuned to translate the word problems into equations accurately. The evaluation revealed that the mT5 model achieved the highest accuracy of 97.30%, demonstrating the effectiveness of transformer models in this domain. This research marks a significant step forward in Bengali natural language processing, offering valuable methodologies and resources for educational AI tools. By improving math education, it also supports the development of advanced problem-solving skills for Bengali-speaking students.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing an approach for solving Bengali math word problems using transformer-based models, including fine-tuning these models, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs) to improve their text generation performance.
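As context for the sequence-to-sequence setup this abstract describes, a fine-tuned mT5 checkpoint would be queried roughly as below. This is a hedged sketch: `google/mt5-small` is a placeholder base checkpoint, and it will not emit meaningful equations until fine-tuned on word-problem/equation pairs such as those in PatiGonit.

```python
# Sketch of equation generation with an mT5-style seq2seq model (placeholder checkpoint).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

problem = "..."  # a Bengali math word problem goes here
inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
# After fine-tuning, the decoded output would be an equation string, e.g. "x = 5 + 3".
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```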
[LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and Language Alignment](https://arxiv.org/abs/2501.02621v1) arXiv ID: http://arxiv.org/abs/2501.02621v1
Abstract:
Decoding human activity from EEG signals has long been a popular research topic. While recent studies have increasingly shifted focus from single-subject to cross-subject analysis, few have explored the model's ability to perform zero-shot predictions on EEG signals from previously unseen subjects. This research aims to investigate whether deep learning methods can capture subject-independent semantic information inherent in human EEG signals. Such insights are crucial for Brain-Computer Interfaces (BCI) because, on one hand, they demonstrate the model's robustness against subject-specific temporal biases, and on the other, they significantly enhance the generalizability of downstream tasks. We employ Large Language Models (LLMs) as denoising agents to extract subject-independent semantic features from noisy EEG signals. Experimental results, including ablation studies, highlight the pivotal role of LLMs in decoding subject-independent semantic information from noisy EEG data. We hope our findings will contribute to advancing BCI research and assist both academia and industry in applying EEG signals to a broader range of applications.
Decision Explanation: Original decision: REJECT The paper primarily focuses on using LLMs as denoising agents for decoding subject-independent semantic information from EEG signals in Brain-Computer Interface (BCI) research, rather than on prompt engineering for text-based interactions with LLMs, failing to meet the core subject criterion.
[Trust and Dependability in Blockchain & AI Based MedIoT Applications: Research Challenges and Future Directions](https://arxiv.org/abs/2501.02647v1) arXiv ID: http://arxiv.org/abs/2501.02647v1
Abstract:
This paper critically reviews the integration of Artificial Intelligence (AI) and blockchain technologies in the context of Medical Internet of Things (MedIoT) applications, where they collectively promise to revolutionize healthcare delivery. By examining current research, we underscore AI's potential in advancing diagnostics and patient care, alongside blockchain's capacity to bolster data security and patient privacy. We focus particularly on the imperative to cultivate trust and ensure reliability within these systems. Our review highlights innovative solutions for managing healthcare data and challenges such as ensuring scalability, maintaining privacy, and promoting ethical practices within the MedIoT domain. We present a vision for integrating AI-driven insights with blockchain security in healthcare, offering a comprehensive review of current research and future directions. We conclude with a set of identified research gaps and propose that addressing these is crucial for achieving the dependable, secure, and patient-centric MedIoT applications of tomorrow.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the integration of AI and blockchain for MedIoT applications, addressing trust, security, and healthcare, which falls under excluded subjects (medical) and does not meet the 'MUST' criteria for focusing on prompt engineering for Large Language Models (LLMs).
[Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features](https://arxiv.org/abs/2501.02649v1) arXiv ID: http://arxiv.org/abs/2501.02649v1
Abstract:
Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data to train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both temporal and image modalities. In this process, we built a backbone network based on the Swin-Transformer Block for extracting temporal Cubes features. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline.
Decision Explanation: Original decision: REJECT The paper focuses on multi-modal (image, satellite, time series, environmental data) plant species prediction using graph-based and vision backbone-extracted features, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, failing to meet the primary 'MUST' criteria.
[Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks](https://arxiv.org/abs/2501.02654v2) arXiv ID: http://arxiv.org/abs/2501.02654v2
Abstract:
Recent advancements in natural language processing have highlighted the vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.
Decision Explanation: Original decision: REJECT The paper focuses on developing a benchmark for evaluating adversarial defence mechanisms in NLP, rather than primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide concrete examples of prompts with their impact on LLM output.
[Multi-Aggregator Time-Warping Heterogeneous Graph Neural Network for Personalized Micro-Video Recommendation](https://arxiv.org/abs/2501.02666v1) arXiv ID: http://arxiv.org/abs/2501.02666v1
Abstract:
Micro-video recommendation is attracting global attention and becoming a popular daily service for people of all ages. Recently, Graph Neural Networks-based micro-video recommendation has displayed performance improvement for many kinds of recommendation tasks. However, the existing works fail to fully consider the characteristics of micro-videos, such as the high timeliness of news nature micro-video recommendation and sequential interactions of frequently changed interests. In this paper, a novel Multi-aggregator Time-warping Heterogeneous Graph Neural Network (MTHGNN) is proposed for personalized news nature micro-video recommendation based on sequential sessions, where characteristics of micro-videos are comprehensively studied, users' preference is mined via multi-aggregator, the temporal and dynamic changes of users' preference are captured, and timeliness is considered. Through the comparison with the state-of-the-arts, the experimental results validate the superiority of our MTHGNN model.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a novel Graph Neural Network for personalized micro-video recommendation, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
[From Superficial Patterns to Semantic Understanding: Fine-Tuning Language Models on Contrast Sets](https://arxiv.org/abs/2501.02683v2) arXiv ID: http://arxiv.org/abs/2501.02683v2
Abstract:
Large-scale pre-trained language models have demonstrated high performance on standard datasets for natural language inference (NLI) tasks. Unfortunately, these evaluations can be misleading, as although the models can perform well on in-distribution data, they perform poorly on out-of-distribution test sets, such as contrast sets. Contrast sets consist of perturbed instances of data that have very minor, but meaningful, changes to the input that alter the gold label, revealing how models can learn superficial patterns in the training data rather than learning more sophisticated language nuances. As an example, the ELECTRA-small language model achieves nearly 90% accuracy on an SNLI dataset but drops to 75% when tested on an out-of-distribution contrast set. The research carried out in this study explores how the robustness of a language model can be improved by exposing it to small amounts of more complex contrast sets during training to help it better learn language patterns. With this approach, the model recovers performance and achieves nearly 90% accuracy on contrast sets, highlighting the importance of diverse and challenging training data.
Decision Explanation: Original decision: REJECT The paper focuses primarily on fine-tuning language models with contrast sets for improved robustness, which aligns with developing new training methods, and does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts.
Horizon Generalization in Reinforcement Learning
arXiv ID: http://arxiv.org/abs/2501.02709v1
Abstract:
We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily-distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results and recall findings from prior work in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.
Decision Explanation: Original decision: REJECT The paper focuses on reinforcement learning, goal-conditioned policies, and horizon generalization, with no apparent connection to prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Improved Data Encoding for Emerging Computing Paradigms: From Stochastic to Hyperdimensional Computing](https://arxiv.org/abs/2501.02715v1) arXiv ID: http://arxiv.org/abs/2501.02715v1
Abstract:
Data encoding is a fundamental step in emerging computing paradigms, particularly in stochastic computing (SC) and hyperdimensional computing (HDC), where it plays a crucial role in determining the overall system performance and hardware cost efficiency. This study presents an advanced encoding strategy that leverages a hardware-friendly class of low-discrepancy (LD) sequences, specifically powers-of-2 bases of Van der Corput (VDC) sequences (VDC-2^n), as sources for random number generation. Our approach significantly enhances the accuracy and efficiency of SC and HDC systems by addressing challenges associated with randomness. By employing LD sequences, we improve correlation properties and reduce hardware complexity. Experimental results demonstrate significant improvements in accuracy and energy savings for SC and HDC systems. Our solution provides a robust framework for integrating SC and HDC in resource-constrained environments, paving the way for efficient and scalable AI implementations.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, instead concentrating on data encoding strategies for stochastic and hyperdimensional computing paradigms.
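To make the encoding idea above concrete: the Van der Corput sequence is the classic radical-inverse construction, and comparing its values against a target probability yields a deterministic, low-discrepancy bitstream of the kind stochastic computing consumes. The sketch below is a generic software illustration, not the paper's hardware design.

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """Return the n-th Van der Corput value in `base` (the radical inverse of n)."""
    x, denom = 0.0, 1
    while n:
        n, digit = divmod(n, base)
        denom *= base
        x += digit / denom
    return x

# Encode probability p as a bitstream: bit i is 1 iff vdc(i) < p.
# Low discrepancy makes the running bit-average converge to p quickly.
p = 0.375
bits = [1 if van_der_corput(i) < p else 0 for i in range(1, 33)]
print(bits, sum(bits) / len(bits))  # the mean approximates p
```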
[Tree-based RAG-Agent Recommendation System: A Case Study in Medical Test Data](https://arxiv.org/abs/2501.02727v1) arXiv ID: http://arxiv.org/abs/2501.02727v1
Abstract:
We present HiRMed (Hierarchical RAG-enhanced Medical Test Recommendation), a novel tree-structured recommendation system that leverages Retrieval-Augmented Generation (RAG) for intelligent medical test recommendations. Unlike traditional vector similarity-based approaches, our system performs medical reasoning at each tree node through a specialized RAG process. Starting from the root node with initial symptoms, the system conducts step-wise medical analysis to identify potential underlying conditions and their corresponding diagnostic requirements. At each level, instead of simple matching, our RAG-enhanced nodes analyze retrieved medical knowledge to understand symptom-disease relationships and determine the most appropriate diagnostic path. The system dynamically adjusts its recommendation strategy based on medical reasoning results, considering factors such as urgency levels and diagnostic uncertainty. Experimental results demonstrate that our approach achieves superior performance in terms of coverage rate, accuracy, and miss rate compared to conventional retrieval-based methods. This work represents a significant advance in medical test recommendation by introducing medical reasoning capabilities into the traditional tree-based retrieval structure.
Decision Explanation: Original decision: REJECT The paper primarily focuses on a medical application (medical test recommendation system) and develops a new system architecture (Tree-based RAG-Agent) rather than focusing on prompt engineering for Large Language Models (LLMs) in text-based interactions, failing to meet the required criteria.
AFed: Algorithmic Fair Federated Learning
arXiv ID: http://arxiv.org/abs/2501.02732v1
Abstract:
Federated Learning (FL) has gained significant attention as it facilitates collaborative machine learning among multiple clients without centralizing their data on a server. FL ensures the privacy of participating clients by locally storing their data, which creates new challenges in fairness. Traditional debiasing methods assume centralized access to sensitive information, rendering them impractical for the FL setting. Additionally, FL is more susceptible to fairness issues than centralized machine learning due to the diverse client data sources that may be associated with group information. Therefore, training a fair model in FL without access to client local data is important and challenging. This paper presents AFed, a straightforward yet effective framework for promoting group fairness in FL. The core idea is to circumvent restricted data access by learning the global data distribution. This paper proposes two approaches: AFed-G, which uses a conditional generator trained on the server side, and AFed-GAN, which improves upon AFed-G by training a conditional GAN on the client side. We augment the client data with the generated samples to help remove bias. Our theoretical analysis justifies the proposed methods, and empirical results on multiple real-world datasets demonstrate a substantial improvement in AFed over several baselines.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a framework for promoting group fairness in Federated Learning (FL), with no emphasis on Large Language Models (LLMs), prompt engineering, or the manipulation of textual input prompts to improve LLM performance.
TARDiS : Text Augmentation for Refining Diversity and Separability
arXiv ID: http://arxiv.org/abs/2501.02739v1
Abstract:
Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversity and separability. For the alignment stage, we introduce a class adaptation (CA) method to ensure that generated examples align with their target classes through verification and modification. Experimental results demonstrate TARDiS's effectiveness, outperforming state-of-the-art LLM-based TA methods in various few-shot text classification tasks. An in-depth analysis confirms the detailed behaviors at each stage.
Decision Explanation: Original decision: REJECT The paper primarily focuses on a novel text augmentation method (TARDiS) for improving few-shot text classification tasks, using LLMs as a tool, rather than focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) to improve LLM performance through prompt manipulation.
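The two-stage structure described in the abstract (class-specific generation, then class-adaptive verification) can be outlined as follows. This is a schematic sketch: `llm` is a hypothetical completion callable, and the prompt wording is illustrative rather than the paper's actual SEG/CEG/CA prompts.

```python
# Schematic two-stage LLM text augmentation (hypothetical `llm` callable).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completion API here")

def generate_examples(label: str, seed_examples: list[str], n: int = 5) -> list[str]:
    # Generation stage: a class-specific prompt asks for diverse new examples.
    prompt = (
        f"Write {n} new, varied sentences that clearly express the class '{label}'.\n"
        "Seed examples:\n" + "\n".join(f"- {s}" for s in seed_examples)
    )
    return [ln.lstrip("- ").strip() for ln in llm(prompt).splitlines() if ln.strip()]

def align_example(label: str, text: str) -> str:
    # Alignment stage: verify the example matches its target class; rewrite if not.
    verdict = llm(f"Does this sentence express the class '{label}'? Answer yes or no.\n{text}")
    if verdict.strip().lower().startswith("yes"):
        return text
    return llm(f"Rewrite this sentence so it clearly expresses the class '{label}':\n{text}")
```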
[Interpretable Recognition of Fused Magnesium Furnace Working Conditions with Deep Convolutional Stochastic Configuration Networks](https://arxiv.org/abs/2501.02740v1) arXiv ID: http://arxiv.org/abs/2501.02740v1
Abstract:
To address the issues of a weak generalization capability and interpretability in working condition recognition model of a fused magnesium furnace, this paper proposes an interpretable working condition recognition method based on deep convolutional stochastic configuration networks (DCSCNs). Firstly, a supervised learning mechanism is employed to generate physically meaningful Gaussian differential convolution kernels. An incremental method is utilized to construct a DCSCNs model, ensuring the convergence of recognition errors in a hierarchical manner and avoiding the iterative optimization process of convolutional kernel parameters using the widely used backpropagation algorithm. The independent coefficient of channel feature maps is defined to obtain the visualization results of feature class activation maps for the fused magnesium furnace. A joint reward function is constructed based on the recognition accuracy, the interpretable trustworthiness evaluation metrics, and the model parameter quantity. Reinforcement learning (RL) is applied to adaptively prune the convolutional kernels of the DCSCNs model, aiming to build a compact, high-performing, and interpretable network. The experimental results demonstrate that the proposed method outperforms the other deep learning approaches in terms of recognition accuracy and interpretability.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it is centered on developing a deep convolutional stochastic configuration network for recognizing working conditions in a fused magnesium furnace, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions.
[Visual Large Language Models for Generalized and Specialized Applications](https://arxiv.org/abs/2501.02765v1) arXiv ID: http://arxiv.org/abs/2501.02765v1
Abstract:
Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing these contents, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available: https://github.com/JackYFL/awesome-VLLMs.
Decision Explanation: Original decision: REJECT The paper focuses on Visual Large Language Models (VLLMs) for generalized and specialized applications across vision, action, and language modalities, with no primary focus on engineering, design, or optimization of textual input prompts specifically for Large Language Models (LLMs) as required.
[Are GNNs Effective for Multimodal Fault Diagnosis in Microservice Systems?](https://arxiv.org/abs/2501.02766v1) arXiv ID: http://arxiv.org/abs/2501.02766v1
Abstract:
Fault diagnosis in microservice systems has increasingly embraced multimodal observation data for a holistic and multifaceted view of the system, with Graph Neural Networks (GNNs) commonly employed to model complex service dependencies. However, despite the intuitive appeal, there remains a lack of compelling justification for the adoption of GNNs, as no direct evidence supports their necessity or effectiveness. To critically evaluate the current use of GNNs, we propose DiagMLP, a simple topology-agnostic baseline as a substitute for GNNs in fault diagnosis frameworks. Through experiments on five public datasets, we surprisingly find that DiagMLP performs competitively with and even outperforms GNN-based methods in fault diagnosis tasks, indicating that the current paradigm of using GNNs to model service dependencies has not yet demonstrated a tangible contribution. We further discuss potential reasons for this observation and advocate shifting the focus from solely pursuing novel model designs to developing challenging datasets, standardizing preprocessing protocols, and critically evaluating the utility of advanced deep learning modules.
Decision Explanation: Original decision: REJECT The paper focuses on evaluating Graph Neural Networks (GNNs) for multimodal fault diagnosis in microservice systems, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet the primary 'MUST' criteria.
[Enhancing Trustworthiness of Graph Neural Networks with Rank-Based Conformal Training](https://arxiv.org/abs/2501.02767v1) arXiv ID: http://arxiv.org/abs/2501.02767v1
Abstract:
Graph Neural Networks (GNNs) have been widely used in a variety of fields because of their great potential in representing graph-structured data. However, the lack of rigorous uncertainty estimation limits their application in high-stakes settings. Conformal Prediction (CP) can produce statistically guaranteed uncertainty estimates by using the classifier's probability estimates to obtain prediction sets, which contain the true class with a user-specified probability. In this paper, we propose a Rank-based CP during training framework for GNNs (RCP-GNN) for reliable uncertainty estimates to enhance the trustworthiness of GNNs in the node classification scenario. By exploiting rank information of the classifier's outcome, prediction sets with a desired coverage rate can be efficiently constructed. The strategy of CP during training with a differentiable rank-based conformity loss function is further explored to adapt prediction sets according to network topology information. In this way, the composition of prediction sets can be guided by the goal of jointly reducing inefficiency and probability estimation errors. Extensive experiments on several real-world datasets show that our model achieves any pre-defined target marginal coverage while significantly reducing the inefficiency compared with state-of-the-art methods.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing Graph Neural Networks (GNNs) with Conformal Prediction for reliable uncertainty estimates, which does not meet the MUST criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
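For background, the split conformal prediction recipe that RCP-GNN builds on fits in a few lines: calibrate a nonconformity threshold on held-out data, then include every class whose score clears it. The sketch below is the generic procedure, not the paper's rank-based, topology-aware training variant.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Generic split conformal prediction.

    cal_probs:  (n, C) calibration softmax probabilities
    cal_labels: (n,)   calibration true labels
    test_probs: (m, C) test softmax probabilities
    Returns, per test point, the classes kept at coverage >= 1 - alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of calibration scores.
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    q_hat = np.sort(scores)[k]
    return [np.flatnonzero(1.0 - probs <= q_hat) for probs in test_probs]
```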
ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction
arXiv ID: http://arxiv.org/abs/2501.02778v1
Abstract:
Survival prediction is a crucial task in the medical field and is essential for optimizing treatment options and resource allocation. However, current methods often rely on limited data modalities, resulting in suboptimal performance. In this paper, we propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Specifically, three types of encoders, a residual orthogonal decomposition module and a unification fusion module are employed to merge multi-modal features to enhance prediction accuracy. Additionally, a balanced negative log-likelihood loss function is designed to ensure fair training across different patients. Extensive experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and shows its potential to support clinical decision-making and advance precision medicine. The codes are available at: https://github.com/binging512/ICFNet.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a new network (ICFNet) for survival prediction in the medical field, integrating various data modalities, and does not meet the criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[Hybrid deep convolution model for lung cancer detection with transfer learning](https://arxiv.org/abs/2501.02785v1) arXiv ID: http://arxiv.org/abs/2501.02785v1
Abstract:
Advances in healthcare research have significantly enhanced our understanding of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung cancer remains one of the leading causes of cancer-related mortality worldwide due to challenges in early and accurate diagnosis. While current lung cancer detection models show promise, there is considerable potential for further improving the accuracy for timely intervention. To address this challenge, we introduce a hybrid deep convolution model leveraging transfer learning, named the Maximum Sensitivity Neural Network (MSNN). MSNN is designed to improve the precision of lung cancer detection by refining sensitivity and specificity. This model has surpassed existing deep learning approaches through experimental validation, achieving an accuracy of 98% and a sensitivity of 97%. By overlaying sensitivity maps onto lung Computed Tomography (CT) scans, it enables the visualization of regions most indicative of malignant or benign classifications. This innovative method demonstrates exceptional performance in distinguishing lung cancer with minimal false positives, thereby enhancing the accuracy of medical diagnoses.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a new deep convolution model for lung cancer detection, which falls under medical subjects and does not address prompt engineering for Large Language Models (LLMs) as required.
[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790v1) arXiv ID: http://arxiv.org/abs/2501.02790v1
Abstract:
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be too fine-grained for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.
Decision Explanation: Original decision: REJECT The paper primarily focuses on improving Reinforcement Learning from Human Feedback (RLHF) for language models through segment-level reward modeling, rather than specifically investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.
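One step mentioned in the abstract, interpolating segment rewards "for further densification", reduces to spreading each segment's reward over its tokens. The sketch below uses uniform spreading as one plausible choice; the paper's exact interpolation and location-aware normalizers may differ.

```python
def densify_segment_rewards(segment_rewards, segment_lengths):
    """Spread each segment-level reward uniformly over that segment's tokens,
    yielding a dense per-token reward sequence (one simple densification)."""
    per_token = []
    for reward, length in zip(segment_rewards, segment_lengths):
        per_token.extend([reward / length] * length)
    return per_token

# Two segments of 3 and 2 tokens with rewards 0.9 and -0.4:
print(densify_segment_rewards([0.9, -0.4], [3, 2]))  # [0.3, 0.3, 0.3, -0.2, -0.2]
```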
Enhancing Lifelong Multi-Agent Path Finding with Cache Mechanism
arXiv ID: http://arxiv.org/abs/2501.02803v1
Abstract:
Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial in autonomous warehouse operations. Lifelong MAPF (L-MAPF), where agents are continuously reassigned new targets upon completing their current tasks, offers a more realistic approximation of real-world warehouse scenarios. While cache storage systems can enhance efficiency and reduce operational costs, existing approaches primarily rely on expectations and mathematical models, often without adequately addressing the challenges of multi-robot planning and execution. In this paper, we introduce a novel mechanism called Lifelong MAPF with Cache Mechanism (L-MAPF-CM), which integrates high-level cache storage with low-level path planning. We introduce a new type of map grid, called cache, for temporary item storage. Additionally, we introduce a task assigner (TA) with a locking mechanism to bridge the gap between the new cache grid and the L-MAPF algorithm. The TA dynamically allocates target locations to agents based on their status in various scenarios. We evaluated L-MAPF-CM using different cache replacement policies and task distributions. L-MAPF-CM has demonstrated performance improvements, particularly with high cache hit rates and smooth traffic conditions.
Decision Explanation: Original decision: REJECT The paper focuses on enhancing Multi-Agent Path Finding with a cache mechanism for autonomous warehouse operations, which does not meet the required focus on prompt engineering, design, or optimization specifically for Large Language Models (LLMs) and their textual input prompts.
[InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models](https://arxiv.org/abs/2501.02816v1) arXiv ID: http://arxiv.org/abs/2501.02816v1
Abstract:
As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that the InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.
Decision Explanation: Original decision: REJECT The paper does not meet the primary criteria as it focuses on image inpainting localization using conditional diffusion models, with no apparent connection to Large Language Models (LLMs), prompt engineering, or textual input prompts. The subject matter falls under image generation and processing, which is explicitly excluded.
[Enhanced Rooftop Solar Panel Detection by Efficiently Aggregating Local Features](https://arxiv.org/abs/2501.02840v1) arXiv ID: http://arxiv.org/abs/2501.02840v1
Abstract:
In this paper, we present an enhanced Convolutional Neural Network (CNN)-based rooftop solar photovoltaic (PV) panel detection approach using satellite images. We propose to use pre-trained CNN-based model to extract the local convolutional features of rooftops. These local features are then combined using the Vectors of Locally Aggregated Descriptors (VLAD) technique to obtain rooftop-level global features, which are then used to train traditional Machine Learning (ML) models to identify rooftop images that do and do not contain PV panels. On the dataset used in this study, the proposed approach achieved rooftop-PV classification scores exceeding the predefined threshold of 0.9 across all three cities for each of the feature extractor networks evaluated. Moreover, we propose a 3-phase approach to enable efficient utilization of the previously trained models on a new city or region with limited labelled data. We illustrate the effectiveness of this 3-phase approach for multi-city rooftop-PV detection task.
Decision Explanation: Original decision: REJECT The paper focuses on enhancing rooftop solar panel detection using CNN and ML techniques with satellite images, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts as required.
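The VLAD step named in the abstract is a standard aggregation technique and is easy to sketch: assign each local descriptor to its nearest cluster center, accumulate the residuals per cluster, then power- and L2-normalize. The NumPy version below is generic and assumes a codebook of cluster centers (typically from k-means) has already been fitted; it is not the paper's implementation.

```python
import numpy as np

def vlad(local_feats: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Aggregate (n, d) local descriptors into a VLAD vector given (k, d) centers."""
    # Hard-assign each descriptor to its nearest center.
    d2 = ((local_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
    assign = d2.argmin(axis=1)
    # Accumulate residuals (descriptor minus its center) per cluster.
    v = np.zeros_like(centers, dtype=float)
    for i, c in enumerate(assign):
        v[c] += local_feats[i] - centers[c]
    v = np.sign(v) * np.sqrt(np.abs(v))           # power (signed square-root) normalization
    flat = v.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)  # global L2 normalization

# Usage: vlad(cnn_patch_features, kmeans_centers) -> fixed-length rooftop descriptor
```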
[IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment](https://arxiv.org/abs/2501.02869v1) arXiv ID: http://arxiv.org/abs/2501.02869v1
Abstract:
Recent research on large language models (LLMs), which are pre-trained on massive general-purpose corpora, has achieved breakthroughs in responding to human queries. However, these methods face challenges, including insufficient data to support extensive pre-training and an inability to align responses with users' instructions. To address these issues, we introduce a medical instruction dataset, CMedINS, containing six medical instructions derived from actual medical tasks, which effectively fine-tunes LLMs in conjunction with other data. Subsequently, we launch our medical model, IIMedGPT, employing an efficient preference alignment method, Direct Preference Optimization (DPO). The results show that our final model outperforms existing medical models in medical dialogue. Datasets, code, and model checkpoints will be released upon acceptance.
Decision Explanation: Original decision: REJECT The paper primarily focuses on fine-tuning a Large Language Model for medical tasks using a new dataset and preference alignment method, which violates the 'MUST NOT' criteria of not focusing on the development of new LLM architectures or training methods, and being primarily concerned with medical subjects.
[Skillful High-Resolution Ensemble Precipitation Forecasting with an Integrated Deep Learning Framework](https://arxiv.org/abs/2501.02905v1) arXiv ID: http://arxiv.org/abs/2501.02905v1
Abstract:
High-resolution precipitation forecasts are crucial for providing accurate weather prediction and supporting effective responses to extreme weather events. Traditional numerical models struggle with stochastic subgrid-scale processes, while recent deep learning models often produce blurry results. To address these challenges, we propose a physics-inspired deep learning framework for high-resolution (0.05° × 0.05°) ensemble precipitation forecasting. Trained on ERA5 and CMPA high-resolution precipitation datasets, the framework integrates deterministic and probabilistic components. The deterministic model, based on a 3D SwinTransformer, captures average precipitation at mesoscale resolution and incorporates strategies to enhance performance, particularly for moderate to heavy rainfall. The probabilistic model employs conditional diffusion in latent space to account for uncertainties in residual precipitation at convective scales. During inference, ensemble members are generated by repeatedly sampling latent variables, enabling the model to represent precipitation uncertainty. Our model significantly enhances spatial resolution and forecast accuracy. The rank histogram shows that the ensemble system is reliable and unbiased. In a case study of heavy precipitation in southern China, the model outputs align more closely with observed precipitation distributions than ERA5, demonstrating superior capability in capturing extreme precipitation events. Additionally, 5-day real-time forecasts show good performance in terms of CSI scores.
Decision Explanation: Original decision: REJECT The paper focuses on developing a deep learning framework for high-resolution precipitation forecasting, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet the primary criteria.
[Unsupervised Tomato Split Anomaly Detection using Hyperspectral Imaging and Variational Autoencoders](https://arxiv.org/abs/2501.02921v1) arXiv ID: http://arxiv.org/abs/2501.02921v1
Abstract:
Tomato anomalies/damages pose a significant challenge in greenhouse farming. While this method of cultivation benefits from efficient resource utilization, anomalies can significantly degrade the quality of farm produce. A common anomaly associated with tomatoes is splitting, characterized by the development of cracks on the tomato skin, which degrades its quality. Detecting this type of anomaly is challenging due to dynamic variations in appearance and sizes, compounded by dataset scarcity. We address this problem in an unsupervised manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary analysis of the dataset enabled us to select the optimal range of wavelengths for detecting this anomaly. Our findings indicate that the 530nm - 550nm range is suitable for identifying tomato dry splits. The analysis of reconstruction loss allows us not only to detect the anomalies but also, to some degree, to estimate the anomalous regions.
Decision Explanation: Original decision: REJECT The paper does not meet any of the 'MUST' criteria: it focuses on anomaly detection in hyperspectral imaging using Variational Autoencoders, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus falling entirely outside the specified scope.
[GLFC: Unified Global-Local Feature and Contrast Learning with
Mamba-Enhanced UNet for Synthetic CT Generation from CBCT](https://arxiv.org/abs/2501.02992v2) arXiv ID: http://arxiv.org/abs/2501.02992v2
Abstract:
Generating synthetic Computed Tomography (CT) images from Cone Beam Computed Tomography (CBCT) is desirable for improving the image quality of CBCT. Existing synthetic CT (sCT) generation methods using Convolutional Neural Networks (CNN) and Transformers often face difficulties in effectively capturing both global and local features and contrasts for high-quality sCT generation. In this work, we propose a Global-Local Feature and Contrast learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet (MEUNet) is introduced by integrating Mamba blocks into the skip connections of a high-resolution UNet for effective global and local feature learning. Second, we propose a Multiple Contrast Loss (MCL) that calculates synthetic loss at different intensity windows to improve quality for both soft tissues and bone regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and significantly outperformed several existing methods for sCT generation. The code is available at https://github.com/HiLab-git/GLFC
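To make the windowed-loss idea concrete, here is a hedged sketch that evaluates an L1 term inside two intensity windows; the clamp-based windowing is a crude proxy, and the paper's actual window definitions and weighting are not reproduced:

```python
import torch
import torch.nn.functional as F

def multiple_contrast_loss(pred, target, windows=((0.0, 0.3), (0.3, 1.0))):
    # Accumulate a loss per intensity window (e.g. soft tissue vs. bone);
    # clamping to the window is an illustrative simplification.
    loss = 0.0
    for lo, hi in windows:
        loss = loss + F.l1_loss(pred.clamp(lo, hi), target.clamp(lo, hi))
    return loss

pred, tgt = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(multiple_contrast_loss(pred, tgt))
```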
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on generating synthetic CT images from CBCT using a novel CNN framework (GLFC with Mamba-Enhanced UNet), which does not meet the 'MUST' criteria of focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance." }
[Quality Estimation based Feedback Training for Improving Pronoun
Translation](https://arxiv.org/abs/2501.03008v1) arXiv ID: http://arxiv.org/abs/2501.03008v1
Abstract:
Pronoun translation is a longstanding challenge in neural machine translation (NMT), often requiring inter-sentential context to ensure linguistic accuracy. To address this, we introduce ProNMT, a novel framework designed to enhance pronoun and overall translation quality in context-aware machine translation systems. ProNMT leverages Quality Estimation (QE) models and a unique Pronoun Generation Likelihood-Based Feedback mechanism to iteratively fine-tune pre-trained NMT models without relying on extensive human annotations. The framework combines QE scores with pronoun-specific rewards to guide training, ensuring improved handling of linguistic nuances. Extensive experiments demonstrate significant gains in pronoun translation accuracy and general translation quality across multiple metrics. ProNMT offers an efficient, scalable, and context-aware approach to improving NMT systems, particularly in translating context-dependent elements like pronouns.
Decision Explanation: Original decision: REJECT The paper primarily focuses on improving neural machine translation (NMT) systems through Quality Estimation based feedback training, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompts or demonstrate their impact on LLM output.
[Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization
Degradation for Mathematical Reasoning](https://arxiv.org/abs/2501.03035v1) arXiv ID: http://arxiv.org/abs/2501.03035v1
Abstract:
Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. We introduce a multidimensional evaluation framework that qualitatively assesses specific capability dimensions and conduct quantitative analyses on the step-by-step outputs of various quantization methods. Our results demonstrate that quantization differentially affects numerical computation and reasoning planning abilities, identifying key areas where quantized models experience performance degradation.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on model quantization for reducing computational costs and its impact on mathematical reasoning tasks, rather than prompt engineering for Large Language Models (LLMs), not meeting the 'MUST' criteria for prompt engineering, LLM performance improvement through prompt manipulation, and providing concrete prompt examples." }
[Piano Transcription by Hierarchical Language Modeling with Pretrained
Roll-based Encoders](https://arxiv.org/abs/2501.03038v2) arXiv ID: http://arxiv.org/abs/2501.03038v2
Abstract:
Automatic Music Transcription (AMT), aiming to get musical notes from raw audio, typically uses frame-level systems with piano-roll outputs or language model (LM)-based systems with note-level predictions. However, frame-level systems require manual thresholding, while the LM-based systems struggle with long sequences. In this paper, we propose a hybrid method combining pre-trained roll-based encoders with an LM decoder to leverage the strengths of both methods. In addition, our approach employs a hierarchical prediction strategy, first predicting onset and pitch, then velocity, and finally offset. The hierarchical prediction strategy reduces computational costs by breaking down long sequences into different hierarchies. Evaluated on two benchmark roll-based encoders, our method outperforms traditional piano-roll outputs by 0.01 and 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a performance-enhancing plug-in for arbitrary roll-based music transcription encoders.
Decision Explanation: Original decision: REJECT The paper primarily focuses on Automatic Music Transcription using a hybrid method combining pre-trained roll-based encoders with a language model decoder, rather than prompt engineering for Large Language Models (LLMs). The subject is outside the specified scope, particularly excluding applications of generative AI other than text generation driven by LLMs, such as audio generation in this case.
[Single-Channel Distance-Based Source Separation for Mobile GPU in
Outdoor and Indoor Environments](https://arxiv.org/abs/2501.03045v1) arXiv ID: http://arxiv.org/abs/2501.03045v1
Abstract:
This study emphasizes the significance of exploring distance-based source separation (DSS) in outdoor environments. Unlike existing studies that primarily focus on indoor settings, the proposed model is designed to capture the unique characteristics of outdoor audio sources. It incorporates advanced techniques, including a two-stage conformer block, a linear relation-aware self-attention (RSA), and a TensorFlow Lite GPU delegate. While the linear RSA may not capture physical cues as explicitly as the quadratic RSA, the linear RSA enhances the model's context awareness, leading to improved performance on the DSS that requires an understanding of physical cues in outdoor and indoor environments. The experimental results demonstrated that the proposed model overcomes the limitations of existing approaches and considerably enhances energy efficiency and real-time inference speed on mobile devices.
Decision Explanation: Original decision: REJECT The paper focuses on source separation for audio signals in outdoor and indoor environments, utilizing mobile GPU and TensorFlow, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Personalized Fashion Recommendation with Image Attributes and Aesthetics
Assessment](https://arxiv.org/abs/2501.03085v1) arXiv ID: http://arxiv.org/abs/2501.03085v1
Abstract:
Personalized fashion recommendation is a difficult task because 1) the decisions are highly correlated with users' aesthetic appetite, which previous work frequently overlooks, and 2) many new items are constantly rolling out that cause strict cold-start problems in the popular identity (ID)-based recommendation methods. These new items are critical to recommend because of trend-driven consumerism. In this work, we aim to provide more accurate personalized fashion recommendations and solve the cold-start problem by converting available information, especially images, into two attribute graphs focusing on optimized image utilization and noise-reducing user modeling. Compared with previous methods that separate image and text as two components, the proposed method combines image and text information to create a richer attributes graph. Capitalizing on the advancement of large language and vision models, we experiment with extracting fine-grained attributes efficiently and as desired using two different prompts. Preliminary experiments on the IQON3000 dataset have shown that the proposed method achieves competitive accuracy compared with baselines.
Decision Explanation: Original decision: REJECT The paper primarily focuses on personalized fashion recommendation using image attributes and aesthetics, combining image and text information, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). While it mentions using two different prompts for attribute extraction, prompt engineering is not the central concern, but rather a minor component within the larger system.
[LangFair: A Python Package for Assessing Bias and Fairness in Large
Language Model Use Cases](https://arxiv.org/abs/2501.03112v1) arXiv ID: http://arxiv.org/abs/2501.03112v1
Abstract:
Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework.
Decision Explanation: Original decision: REJECT The paper's primary focus is on assessing bias and fairness in Large Language Models using a Python package, rather than on the engineering, design, or optimization of prompts specifically for LLMs to improve their text generation performance.
[From Models to Network Topologies: A Topology Inference Attack in
Decentralized Federated Learning](https://arxiv.org/abs/2501.03119v1) arXiv ID: http://arxiv.org/abs/2501.03119v1
Abstract:
Federated Learning (FL) is widely recognized as a privacy-preserving machine learning paradigm due to its model-sharing mechanism that avoids direct data exchange. However, model training inevitably leaves exploitable traces that can be used to infer sensitive information. In Decentralized FL (DFL), the overlay topology significantly influences its models' convergence, robustness, and security. This study explores the feasibility of inferring the overlay topology of DFL systems based solely on model behavior, introducing a novel Topology Inference Attack. A taxonomy of topology inference attacks is proposed, categorizing them by the attacker's capabilities and knowledge. Practical attack strategies are developed for different scenarios, and quantitative experiments are conducted to identify key factors influencing the attack effectiveness. Experimental results demonstrate that analyzing only the public models of individual nodes can accurately infer the DFL topology, underscoring the risk of sensitive information leakage in DFL systems. This finding offers valuable insights for improving privacy preservation in decentralized learning environments.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on topology inference attacks in Decentralized Federated Learning, which does not meet the MUST criteria: it does not focus on engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompts." }
[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level
Reward Models](https://arxiv.org/abs/2501.03124v2) arXiv ID: http://arxiv.org/abs/2501.03124v2
Abstract:
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a benchmark for evaluating Process-Level Reward Models (PRMs) and their error detection capabilities, rather than engineering, designing, or optimizing prompts specifically for Large Language Models (LLMs). While LLMs are mentioned as being used as critic models, prompt engineering is not the central concern.
Geometry Restoration and Dewarping of Camera-Captured Document Images
arXiv ID: http://arxiv.org/abs/2501.03145v2
Abstract:
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
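A hedged sketch of the remapping step the pipeline ends with: define a smooth 2D coordinate grid and undo the distortion with OpenCV's remap. The cubic warp below is a toy placeholder, not the paper's grid fitted from the detected document outline:

```python
import cv2
import numpy as np

# Toy flat page (white) and a synthetic cubic displacement along y.
img = np.full((200, 300, 3), 255, dtype=np.uint8)
ys, xs = np.mgrid[0:200, 0:300].astype(np.float32)
t = xs / 300.0
warp = 10.0 * (t ** 3 - 0.5 * t)  # illustrative cubic-polynomial displacement

# Inverse mapping: for each output pixel, sample the input at (map_x, map_y).
map_x, map_y = xs, ys + warp
flattened = cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```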
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a method for restoring and dewarping document images using computer vision and deep learning, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Automated Generation of Challenging Multiple-Choice Questions for Vision
Language Model Evaluation](https://arxiv.org/abs/2501.03225v1) arXiv ID: http://arxiv.org/abs/2501.03225v1
Abstract:
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Decision Explanation: Original decision: REJECT The paper primarily focuses on evaluating Vision Language Models (VLMs) through automated generation of multiple-choice questions, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs) to improve text generation, as required by the criteria.
LightGNN: Simple Graph Neural Network for Recommendation
arXiv ID: http://arxiv.org/abs/2501.03228v2
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in collaborative recommendation through their ability to conduct high-order representation smoothing, effectively capturing structural information within users' interaction patterns. However, existing GNN paradigms face significant challenges in scalability and robustness when handling large-scale, noisy, and real-world datasets. To address these challenges, we present LightGNN, a lightweight and distillation-based GNN pruning framework designed to substantially reduce model complexity while preserving essential collaboration modeling capabilities. Our LightGNN framework introduces a computationally efficient pruning module that adaptively identifies and removes redundant edges and embedding entries for model compression. The framework is guided by a resource-friendly hierarchical knowledge distillation objective, whose intermediate layer augments the observed graph to maintain performance, particularly in high-rate compression scenarios. Extensive experiments on public datasets demonstrate LightGNN's effectiveness, significantly improving both computational efficiency and recommendation accuracy. Notably, LightGNN achieves an 80% reduction in edge count and 90% reduction in embedding entries while maintaining performance comparable to more complex state-of-the-art baselines. The implementation of our LightGNN framework is available at the github repository: https://github.com/HKUDS/LightGNN.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a lightweight Graph Neural Network (GNN) for recommendation systems, with no mention of Large Language Models (LLMs), prompt engineering, or textual input manipulation, thus failing to meet all 'MUST' criteria.
Gaussian Masked Autoencoders
arXiv ID: http://arxiv.org/abs/2501.03229v1
Abstract:
This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at https://brjathu.github.io/gmae
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a new image representation learning framework (Gaussian Masked Autoencoders) for spatial understanding and visual data modeling, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet all 'MUST' criteria.
[Backdoor Token Unlearning: Exposing and Defending Backdoors in
Pretrained Language Models](https://arxiv.org/abs/2501.03272v1) arXiv ID: http://arxiv.org/abs/2501.03272v1
Abstract:
Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
Decision Explanation: Original decision: REJECT The paper primarily focuses on defending against backdoor attacks in pretrained language models during the training phase, rather than prompt engineering for text-based interactions with Large Language Models (LLMs), failing to meet the 'MUST' criteria related to prompt engineering and manipulation of textual input prompts.
Strategic Fusion Optimizes Transformer Compression
arXiv ID: http://arxiv.org/abs/2501.03273v1
Abstract:
This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention. To address the limitations of single-signal strategies, we introduced two fusion strategies, linear regression and random forest, which combine individual strategies (i.e., strategic fusion), for more informed pruning decisions. Additionally, we applied knowledge distillation to mitigate any accuracy loss during layer pruning. Our results reveal that random forest strategic fusion outperforms individual strategies in seven out of nine datasets and achieves near-optimal performance in the other two. The distilled random forest surpasses the original accuracy in six datasets and mitigates accuracy drops in the remaining three. Knowledge distillation also improves the accuracy-to-size ratio by an average factor of 18.84 across all datasets. Supported by mathematical foundations and biological analogies, our findings suggest that strategically combining multiple signals can lead to efficient, high-performing transformer models for resource-constrained applications.
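A toy sketch of the fusion step under stated assumptions: a random forest maps per-layer signals to a pruning target learned from ablations. All numbers below are synthetic stand-ins for the paper's activation, mutual-information, gradient, weight, and attention signals:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
signals = rng.random((24, 5))            # 24 layers x 5 pruning signals (toy)
accuracy_drop = signals @ rng.random(5)  # hypothetical per-layer ablation target

# Strategic fusion: learn a combined importance score from individual signals.
fusion = RandomForestRegressor(n_estimators=100, random_state=0)
fusion.fit(signals, accuracy_drop)

importance = fusion.predict(signals)
prune_order = np.argsort(importance)     # prune least-important layers first
print(prune_order[:4])
```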
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on transformer model compression through layer pruning and knowledge distillation, not on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance." }
[ComMer: a Framework for Compressing and Merging User Data for
Personalization](https://arxiv.org/abs/2501.03276v1) arXiv ID: http://arxiv.org/abs/2501.03276v1
Abstract:
Large Language Models (LLMs) excel at a wide range of tasks, but adapting them to new data, particularly for personalized applications, poses significant challenges due to resource and computational constraints. Existing methods either rely on exposing fresh data to the model through the prompt, which is limited by context size and computationally expensive at inference time, or fine-tuning, which incurs substantial training and update costs. In this paper, we introduce ComMer - Compress and Merge - a novel framework that efficiently personalizes LLMs by compressing users' documents into compact representations, which are then merged and fed into a frozen LLM. We evaluate ComMer on two types of personalization tasks - personalized skill learning, using the tweet paraphrasing dataset and the personalized news headline generation dataset from the LaMP benchmark, and knowledge-intensive, using the PerLTQA dataset. Our experiments demonstrate that in constrained inference budget scenarios ComMer achieves superior quality in skill learning tasks, while highlighting limitations in knowledge-intensive settings due to the loss of detailed information. These results offer insights into trade-offs and potential optimizations in multi-document compression for personalization.
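A conceptual sketch of the compress-and-merge flow, with all modules as stand-ins (the compressor, merge rule, and sizes are assumptions, not ComMer's actual design): each document becomes a few soft tokens, the compressions are mean-merged, and the result is prepended to a frozen LLM's input embeddings.

```python
import torch
import torch.nn as nn

d_model, n_tokens = 512, 4

# Stand-in compressor: one document -> n_tokens soft tokens.
compressor = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, n_tokens * d_model),
)

def compress_and_merge(doc_embeds):
    # doc_embeds: list of (seq_len, d_model) tensors, one per user document.
    comps = [compressor(d.mean(dim=0)).view(n_tokens, d_model) for d in doc_embeds]
    return torch.stack(comps).mean(dim=0)  # merged soft prefix (n_tokens, d_model)

docs = [torch.randn(30, d_model), torch.randn(50, d_model)]
prefix = compress_and_merge(docs)  # would be prepended to frozen-LLM inputs
```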
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a framework for compressing and merging user data to personalize Large Language Models (LLMs), rather than on the engineering, design, or optimization of prompts specifically for LLMs. It does not investigate or propose methods for improving LLM performance through the manipulation of textual input prompts.
[CodeVision: Detecting LLM-Generated Code Using 2D Token Probability Maps
and Vision Models](https://arxiv.org/abs/2501.03288v1) arXiv ID: http://arxiv.org/abs/2501.03288v1
Abstract:
The rise of large language models (LLMs) like ChatGPT has significantly improved automated code generation, enhancing software development efficiency. However, this introduces challenges in academia, particularly in distinguishing between human-written and LLM-generated code, which complicates issues of academic integrity. Existing detection methods, such as pre-trained models and watermarking, face limitations in adaptability and computational efficiency. In this paper, we propose a novel detection method using 2D token probability maps combined with vision models, preserving spatial code structures such as indentation and brackets. By transforming code into log probability matrices and applying vision models like Vision Transformers (ViT) and ResNet, we capture both content and structure for more accurate detection. Our method shows robustness across multiple programming languages and improves upon traditional detectors, offering a scalable and computationally efficient solution for identifying LLM-generated code.
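A minimal sketch of building such a 2D map, assuming per-token log-probabilities are already available; the tokenization, grid size, and layout handling here are simplifications:

```python
import numpy as np

def logprob_map(lines, logprobs, width=80):
    # Lay per-token log-probabilities on a grid that mirrors the code's spatial
    # structure (indentation, line breaks); one row per source line.
    grid = np.zeros((len(lines), width), dtype=np.float32)
    for row, (tokens, lps) in enumerate(zip(lines, logprobs)):
        col = 0
        for tok, lp in zip(tokens, lps):
            grid[row, col:col + len(tok)] = lp  # spread a token's score over its span
            col += len(tok)
    return grid  # treat as a one-channel image for a ViT/ResNet classifier

# Hypothetical two-line snippet with made-up per-token log-probabilities.
lines = [["def", " f", "():"], ["    ", "return", " 1"]]
lps = [[-0.1, -0.5, -0.2], [-0.05, -0.3, -0.9]]
print(logprob_map(lines, lps).shape)  # (2, 80)
```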
Decision Explanation: Original decision: REJECT The paper focuses on detecting LLM-generated code using vision models, rather than on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the primary criteria of focusing on prompt engineering for text-based interactions with LLMs.
[Analyzing Bias in Swiss Federal Supreme Court Judgments Using Facebook's
Holistic Bias Dataset: Implications for Language Model Training](https://arxiv.org/abs/2501.03324v1) arXiv ID: http://arxiv.org/abs/2501.03324v1
Abstract:
Natural Language Processing (NLP) is vital for computers to process and respond accurately to human language. However, biases in training data can introduce unfairness, especially in predicting legal judgment. This study focuses on analyzing biases within the Swiss Judgment Prediction Dataset (SJP-Dataset). Our aim is to ensure unbiased factual descriptions essential for fair decision making by NLP models in legal contexts. We analyze the dataset using social bias descriptors from the Holistic Bias dataset and employ advanced NLP techniques, including attention visualization, to explore the impact of dispreferred descriptors on model predictions. The study identifies biases and examines their influence on model behavior. Challenges include dataset imbalance and token limits affecting model performance.
Decision Explanation: Original decision: REJECT The paper primarily focuses on analyzing biases in a legal dataset for fair NLP model training, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompt manipulation impacting LLM output.
Existential Crisis: A Social Robot's Reason for Being
arXiv ID: http://arxiv.org/abs/2501.03376v1
Abstract:
As robots become ever more important in our daily lives, there is a growing need to understand how they are perceived by people. This study aims to investigate how the user perception of robots is influenced by displays of personality. Using LLMs and speech-to-text technology, we designed a within-subject study to compare two conditions: a personality-driven robot and a purely task-oriented, personality-neutral robot. Twelve participants, recruited from the Socially Intelligent Robotics course at Vrije Universiteit Amsterdam, interacted with a Nao robot tasked with asking them a set of medical questions under both conditions. After completing both interactions, the participants completed a user experience questionnaire measuring their emotional states and robot perception using standardized questionnaires from the SRI and Psychology literature.
Decision Explanation: Original decision: REJECT The paper focuses on user perception of robots with displayed personality, utilizing LLMs as a tool, but prompt engineering for Large Language Models is not the primary concern; the core subject revolves around social robotics and user experience, not the manipulation of textual input prompts to improve LLM performance.
Over-the-Air Fair Federated Learning via Multi-Objective Optimization
arXiv ID: http://arxiv.org/abs/2501.03392v1
Abstract:
In federated learning (FL), heterogeneity among the local dataset distributions of clients can result in unsatisfactory performance for some, leading to an unfair model. To address this challenge, we propose an over-the-air fair federated learning algorithm (OTA-FFL), which leverages over-the-air computation to train fair FL models. By formulating FL as a multi-objective minimization problem, we introduce a modified Chebyshev approach to compute adaptive weighting coefficients for gradient aggregation in each communication round. To enable efficient aggregation over the multiple access channel, we derive analytical solutions for the optimal transmit scalars at the clients and the de-noising scalar at the parameter server. Extensive experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance compared to existing methods.
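For orientation, the standard (unmodified) Chebyshev scalarization of a K-objective problem is shown below; the paper's modified variant, which adapts the weights per communication round, is not reproduced here.

```latex
% K client objectives f_k, ideal point z^*, weights \lambda_k (adapted per
% round in OTA-FFL; fixed in the textbook form sketched here):
\min_{\theta} \; \max_{k \in \{1, \dots, K\}} \; \lambda_k \left( f_k(\theta) - z_k^{\ast} \right)
```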
Decision Explanation: Original decision: REJECT The paper focuses on federated learning, multi-objective optimization, and over-the-air computation, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Enhanced Importance Sampling through Latent Space Exploration in
Normalizing Flows](https://arxiv.org/abs/2501.03394v1) arXiv ID: http://arxiv.org/abs/2501.03394v1
Abstract:
Importance sampling is a rare event simulation technique used in Monte Carlo simulations to bias the sampling distribution towards the rare event of interest. By assigning appropriate weights to sampled points, importance sampling allows for more efficient estimation of rare events or tails of distributions. However, importance sampling can fail when the proposal distribution does not effectively cover the target distribution. In this work, we propose a method for more efficient sampling by updating the proposal distribution in the latent space of a normalizing flow. Normalizing flows learn an invertible mapping from a target distribution to a simpler latent distribution. The latent space can be more easily explored during the search for a proposal distribution, and samples from the proposal distribution are recovered in the space of the target distribution via the invertible mapping. We empirically validate our methodology on simulated robotics applications such as autonomous racing and aircraft ground collision avoidance.
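A self-contained toy example of the underlying importance-sampling identity, with a hand-picked Gaussian proposal; the paper's contribution is to search for such a proposal in a flow's latent space instead:

```python
import numpy as np

# Estimate the rare-event probability P(X > 4) for X ~ N(0, 1) using a
# shifted proposal q = N(4, 1) and the weights w = p(z) / q(z).
rng = np.random.default_rng(0)
shift = 4.0
z = rng.normal(shift, 1.0, size=100_000)
log_w = -0.5 * z**2 + 0.5 * (z - shift) ** 2  # log p(z) - log q(z)
estimate = np.mean((z > 4.0) * np.exp(log_w))
print(estimate)  # close to the true tail probability, about 3.17e-5
```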
Decision Explanation: Original decision: REJECT The paper does not meet the criteria as it focuses on enhancing importance sampling in Monte Carlo simulations using normalizing flows, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing all 'MUST' criteria.
[BoundingDocs: a Unified Dataset for Document Question Answering with
Spatial Annotations](https://arxiv.org/abs/2501.03403v1) arXiv ID: http://arxiv.org/abs/2501.03403v1
Abstract:
We present a unified dataset for document Question-Answering (QA), which is obtained combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
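As an illustration of what a bounding-box-augmented prompt might look like, here is a hypothetical example in the spirit of the techniques the paper explores; the field layout and coordinate format are assumptions, not the dataset's actual schema:

```python
# Hypothetical document-QA prompt with OCR tokens annotated by [x0,y0,x1,y1].
prompt = (
    "Document OCR (token [x0,y0,x1,y1]):\n"
    "Invoice [12,10,80,22] No. [84,10,110,22] 4711 [114,10,150,22]\n"
    "Total [12,300,60,312] 99.50 [64,300,110,312] EUR [114,300,150,312]\n\n"
    "Question: What is the invoice total?\n"
    "Answer with the value and the bounding box of the answer."
)
```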
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "Although the paper mentions exploring the impact of different prompting techniques on LLM performance, its primary focus is on presenting a unified dataset for document Question Answering, making prompt engineering a secondary concern rather than the central focus." }
SALT: Sales Autocompletion Linked Business Tables Dataset
arXiv ID: http://arxiv.org/abs/2501.03413v1
Abstract:
Foundation models, particularly those that incorporate Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data, like tables, however, introduces significant challenges. These difficulties are even more pronounced when addressing multi-table data linked via foreign key, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its substantial impact, research focusing on such linked business tables within enterprise settings remains a significantly important yet underexplored domain. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research endeavors in table representation learning. By providing access to authentic enterprise data, our goal is to potentially enhance the effectiveness and applicability of models for real-world business contexts.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not meet the 'MUST' criteria, as its primary focus is on introducing a dataset for table representation learning in enterprise settings, rather than engineering, designing, or optimizing prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output." }
Optimization Learning
arXiv ID: http://arxiv.org/abs/2501.03443v1
Abstract:
This article introduces the concept of optimization learning, a methodology to design optimization proxies that learn the input/output mapping of parametric optimization problems. These optimization proxies are trustworthy by design: they compute feasible solutions to the underlying optimization problems, provide quality guarantees on the returned solutions, and scale to large instances. Optimization proxies are differentiable programs that combine traditional deep learning technology with repair or completion layers to produce feasible solutions. The article shows that optimization proxies can be trained end-to-end in a self-supervised way. It presents methodologies to provide performance guarantees and to scale optimization proxies to large-scale optimization problems. The potential of optimization proxies is highlighted through applications in power systems and, in particular, real-time risk assessment and security-constrained optimal power flow.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering, design, or optimization for Large Language Models (LLMs). Instead, it introduces 'optimization learning' for solving parametric optimization problems, unrelated to LLMs or text-based interactions.
[LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification
and Tagging](https://arxiv.org/abs/2501.03464v1) arXiv ID: http://arxiv.org/abs/2501.03464v1
Abstract:
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new graph neural network model (LHGNN) for audio classification and tagging, primarily dealing with audio data and disregarding Large Language Models (LLMs) and text-based prompt engineering, thus failing to meet all the 'MUST' criteria.
[MTRAG: A Multi-Turn Conversational Benchmark for Evaluating
Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2501.03468v1) arXiv ID: http://arxiv.org/abs/2501.03468v1
Abstract:
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
Decision Explanation: Original decision: REJECT The paper primarily focuses on introducing a benchmark (MTRAG) for evaluating Retrieval-Augmented Generation Systems, rather than specifically focusing on the engineering, design, or optimization of prompts for Large Language Models (LLMs) and demonstrating their impact on LLM output.
Reading with Intent -- Neutralizing Intent
arXiv ID: http://arxiv.org/abs/2501.03475v1
Abstract:
Queries to large language models (LLMs) can be divided into two parts: the instruction/question and the accompanying context. The context for retrieval-augmented generation (RAG) systems in most benchmarks comes from Wikipedia or Wikipedia-like texts which are written in a neutral and factual tone. However, when RAG systems retrieve internet-based content, they encounter text with diverse tones and linguistic styles, introducing challenges for downstream tasks. The Reading with Intent task addresses this issue by evaluating how varying tones in context passages affect model performance. Building on prior work that focused on sarcasm, we extend this paradigm by constructing a dataset where context passages are transformed to 11 distinct emotions using a better synthetic data generation approach. Using this dataset, we train an emotion translation model to systematically adapt passages to specified emotional tones. The human evaluation shows that the LLM fine-tuned to become the emotion-translator benefited from the synthetically generated data. Finally, the emotion-translator is used in the Reading with Intent task to transform the passages to a neutral tone. By neutralizing the passages, it mitigates the challenges posed by sarcastic passages and improves overall results on this task by about 3%.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on training an emotion translation model to adapt passages to specified emotional tones and fine-tuning an LLM for emotion translation, rather than the engineering, design, or optimization of prompts specifically for LLMs. While it mentions improving results on the Reading with Intent task by neutralizing passages, the core subject is emotion adaptation in LLM training, not prompt engineering for text-based interactions with LLMs." }
[Align-Pro: A Principled Approach to Prompt Optimization for LLM
Alignment](https://arxiv.org/abs/2501.03486v1) arXiv ID: http://arxiv.org/abs/2501.03486v1
Abstract:
The alignment of large language models (LLMs) with human values is critical as these models become increasingly integrated into various societal and decision-making processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters, but these approaches are often computationally expensive and impractical when models are frozen or inaccessible for parameter modification. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment. While the existing literature has shown empirical promise of prompt optimization, its theoretical underpinning remains under-explored. We address this gap by formulating prompt optimization as an optimization problem and try to provide theoretical insights into the optimality of such a framework. To analyze the performance of the prompt optimization, we study theoretical suboptimality bounds and provide insights in terms of how prompt optimization depends upon the given prompter and target model. We also provide empirical validation through experiments on various datasets, demonstrating that prompt optimization can effectively align LLMs, even when parameter fine-tuning is not feasible.
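One common way to pose prompt optimization as an optimization problem is sketched below; this is consistent in spirit with the abstract but is not the paper's exact formulation. A prompter g_phi rewrites the input while the target model pi stays frozen:

```latex
% Search over prompter parameters \phi; the target LLM \pi is frozen and
% r(x, y) is an alignment reward on the response y.
\max_{\phi} \;\;
\mathbb{E}_{\, x \sim \mathcal{D}, \;
            \tilde{x} = g_{\phi}(x), \;
            y \sim \pi(\cdot \mid \tilde{x})}
\big[\, r(x, y) \,\big]
```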
Decision Explanation: Original decision: REJECT Although the paper discusses prompt optimization for LLM alignment, its primary focus is on LLM alignment through a theoretical optimization framework, rather than specifically on engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation, as required.
Can Deep Learning Trigger Alerts from Mobile-Captured Images?
arXiv ID: http://arxiv.org/abs/2501.03499v1
Abstract:
Our research presents a comprehensive approach to leveraging mobile camera image data for real-time air quality assessment and recommendation. We develop a regression-based Convolutional Neural Network model and tailor it explicitly for air quality prediction by exploiting the inherent relationship between output parameters. As a result, the Mean Squared Error of 0.0077 and 0.0112 obtained for 2 and 5 pollutants respectively outperforms existing models. Furthermore, we aim to verify the common practice of augmenting the original dataset with a view to introducing more variation in the training phase. It is one of our most significant contributions that our experimental results demonstrate minimal accuracy differences between the original and augmented datasets. Finally, a real-time, user-friendly dashboard is implemented which dynamically displays the Air Quality Index and pollutant values derived from captured mobile camera images. Users' health conditions are considered to recommend whether a location is suitable based on current air quality metrics. Overall, this research contributes to verification of data augmentation techniques, CNN-based regression modelling for air quality prediction, and user-centric air quality monitoring through mobile technology. The proposed system offers practical solutions for individuals to make informed environmental health and well-being decisions.
Decision Explanation: Original decision: REJECT The paper focuses on leveraging mobile camera images for air quality assessment using a Convolutional Neural Network model, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thereby failing to meet all 'MUST' criteria.
Vocal Tract Length Warped Features for Spoken Keyword Spotting
arXiv ID: http://arxiv.org/abs/2501.03523v1
Abstract:
In this paper, we propose several methods that incorporate vocal tract length (VTL) warped features for spoken keyword spotting (KWS). The first method, VTL-independent KWS, involves training a single deep neural network (DNN) that utilizes VTL features with various warping factors. During training, a specific VTL feature is randomly selected per epoch, allowing the exploration of VTL variations. During testing, the VTL features with different warping factors of a test utterance are scored against the DNN and combined with equal weight. The second method scores the conventional features of a test utterance (without VTL warping) against the DNN. The third method, VTL-concatenation KWS, concatenates VTL warped features to form high-dimensional features for KWS. Evaluations carried out on the English Google Command dataset demonstrate that the proposed methods improve the accuracy of KWS.
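A minimal sketch of the equal-weight score combination in the first (VTL-independent) method; the scorer and features below are toy stand-ins:

```python
import numpy as np

def vtl_independent_score(dnn, features_by_alpha):
    # Score each VTL-warped feature set of one utterance against the same DNN
    # and combine the scores with equal weight.
    return np.mean([dnn(f) for f in features_by_alpha], axis=0)

dnn = lambda f: f.mean(axis=-1)  # stand-in keyword scorer, not a real KWS DNN
feats = [np.random.rand(10, 40) for _ in (0.9, 1.0, 1.1)]  # three warp factors
print(vtl_independent_score(dnn, feats).shape)  # (10,)
```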
Decision Explanation: Original decision: REJECT The paper focuses on spoken keyword spotting using vocal tract length warped features with deep neural networks, which does not meet the criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods for improving LLM performance through textual input prompt manipulation.
[Deep Learning within Tabular Data: Foundations, Challenges, Advances and
Future Directions](https://arxiv.org/abs/2501.03540v1) arXiv ID: http://arxiv.org/abs/2501.03540v1
Abstract:
Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation learning, structured around three foundational design elements: training data, neural architectures, and learning objectives. Unlike prior surveys that focus primarily on either architecture design or learning strategies, we adopt a holistic perspective that emphasizes the universality and robustness of representation learning methods across diverse downstream tasks. We examine recent advances in data augmentation and generation, specialized neural network architectures tailored to tabular data, and innovative learning objectives that enhance representation quality. Additionally, we highlight the growing influence of self-supervised learning and the adaptation of transformer-based foundation models for tabular data. Our review is based on a systematic literature search using rigorous inclusion criteria, encompassing 127 papers published since 2020 in top-tier conferences and journals. Through detailed analysis and comparison, we identify emerging trends, critical gaps, and promising directions for future research, aiming to guide the development of more generalizable and effective tabular data representation methods.
Decision Explanation: Original decision: REJECT The paper focuses on tabular data representation learning using deep learning techniques, with no apparent connection to Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet the primary 'MUST' criteria.
[PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for
Text-to-Image Models](https://arxiv.org/abs/2501.03544v1) arXiv ID: http://arxiv.org/abs/2501.03544v1
Abstract:
Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. Extensive experiments across three datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard achieves 7.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.
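A heavily hedged sketch of the mechanics of optimizing a universal soft prompt in an embedding space; the encoder outputs, objective, and sizes are placeholders and do not reflect PromptGuard's actual training setup:

```python
import torch

d_model, n_soft = 768, 8
soft_prompt = torch.randn(n_soft, d_model, requires_grad=True)  # P* to learn
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def moderation_loss(prompt_embeds, text_embeds):
    # Placeholder objective: pull the prompted representation toward a "safe"
    # anchor; the paper instead optimizes against NSFW generations.
    joint = torch.cat([prompt_embeds, text_embeds], dim=0).mean(dim=0)
    return ((joint - torch.zeros(d_model)) ** 2).mean()

for _ in range(100):
    text_embeds = torch.randn(16, d_model)  # frozen text-encoder outputs (toy)
    loss = moderation_loss(soft_prompt, text_embeds)
    opt.zero_grad()
    loss.backward()
    opt.step()
```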
Decision Explanation: Original decision: REJECT The paper primarily focuses on text-to-image models and content moderation, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) as required, violating the 'MUST NOT' criteria concerning applications of generative AI other than text generation driven by LLMs.
[Rethinking Adversarial Attacks in Reinforcement Learning from Policy
Distribution Perspective](https://arxiv.org/abs/2501.03562v2) arXiv ID: http://arxiv.org/abs/2501.03562v2
Abstract:
Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in realworld applications. Adversarial attack is an effective method for evaluating the robustness of DRL agents. However, existing attack methods targeting individual sampled actions have limited impacts on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, which leverages the entire policy distribution rather than relying on individual samples. We utilize the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experiment results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks, achieving an average 22.03% higher reward drop compared to the best baseline.
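For reference, the closed-form Bhattacharyya distance between two diagonal Gaussian policies, the similarity measure DAPGD builds its perturbation on, can be computed as follows (a standard formula, not code from the paper):

```python
import numpy as np

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    # Closed-form Bhattacharyya distance between diagonal Gaussians:
    # 1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
    # with S = (S1 + S2) / 2.
    avg_var = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    term2 = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return term1 + term2

mu_clean, var_clean = np.array([0.2, -0.1]), np.array([0.05, 0.04])
mu_pert, var_pert = np.array([0.5, 0.3]), np.array([0.06, 0.05])
print(bhattacharyya_gaussian(mu_clean, var_clean, mu_pert, var_pert))
```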
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria: it focuses on Reinforcement Learning, adversarial attacks, and policy distribution, with no primary emphasis on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.
[Applying Large Language Models in Knowledge Graph-based Enterprise
Modeling: Challenges and Opportunities](https://arxiv.org/abs/2501.03566v1) arXiv ID: http://arxiv.org/abs/2501.03566v1
Abstract:
The role of large language models (LLMs) in enterprise modeling has recently started to shift from academic research to industrial application. LLMs thereby represent a further building block for the machine-supported generation of enterprise models. In this paper we employ a knowledge graph-based approach for enterprise modeling and investigate the potential benefits of LLMs in this context. In addition, the findings of an expert survey and ChatGPT-4o-based experiments demonstrate that LLM-based model generations exhibit minimal variability, yet remain constrained to specific tasks, with reliability declining for more intricate tasks. The survey results further suggest that the supervision and intervention of human modeling experts are essential to ensure the accuracy and integrity of the generated models.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is on applying LLMs in enterprise modeling using knowledge graphs, not on prompt engineering for text-based interactions with LLMs, failing to meet the core subject requirement." }
RecKG: Knowledge Graph for Recommender Systems
arXiv ID: http://arxiv.org/abs/2501.03598v1
Abstract:
Knowledge graphs have proven successful in integrating heterogeneous data across various domains. However, there remains a noticeable dearth of research on their seamless integration among heterogeneous recommender systems, despite knowledge graph-based recommender systems garnering extensive research attention. This study aims to fill this gap by proposing RecKG, a standardized knowledge graph for recommender systems. RecKG ensures the consistent representation of entities across different datasets, accommodating diverse attribute types for effective data integration. Through a meticulous examination of various recommender system datasets, we select attributes for RecKG, ensuring standardized formatting through consistent naming conventions. By these characteristics, RecKG can seamlessly integrate heterogeneous data sources, enabling the discovery of additional semantic information within the integrated knowledge graph. We apply RecKG to standardize real-world datasets, subsequently developing an application for RecKG using a graph database. Finally, we validate RecKG's achievement in interoperability through a qualitative evaluation between RecKG and other studies.
Decision Explanation: Original decision: REJECT The paper focuses primarily on integrating knowledge graphs for recommender systems, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria 1 and 2.
MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction
arXiv ID: http://arxiv.org/abs/2501.03635v1
Abstract:
In recent years, traffic flow prediction has played a crucial role in the management of intelligent transportation systems. However, traditional forecasting methods often model non-Euclidean low-dimensional traffic data as a simple graph with single-type nodes and edges, failing to capture similar trends among nodes of the same type. To address this limitation, this paper proposes MHGNet, a novel framework for modeling spatiotemporal multi-heterogeneous graphs. Within this framework, the STD Module decouples single-pattern traffic data into multi-pattern traffic data through feature mappings of timestamp embedding matrices and node embedding matrices. Subsequently, the Node Clusterer leverages the Euclidean distance between nodes and different types of limit points to perform clustering with O(N) time complexity. The nodes within each cluster undergo residual subgraph convolution within the spatiotemporal fusion subgraphs generated by the DSTGG Module, followed by processing in the SIE Module for node repositioning and redistribution of weights. To validate the effectiveness of MHGNet, this paper conducts extensive ablation studies and quantitative evaluations on four widely used benchmarks, demonstrating its superior performance.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it concentrates on traffic prediction using Multi-Heterogeneous Graph Neural Networks, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to satisfy the 'MUST' criteria.
[Effective and Efficient Mixed Precision Quantization of Speech Foundation Models](https://arxiv.org/abs/2501.03643v2) arXiv ID: http://arxiv.org/abs/2501.03643v2
Abstract:
This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage. Experiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increased the lossless compression ratio by factors of up to 1.7x and 1.9x over the respective uniform-precision and two-stage mixed-precision quantized baselines, which perform precision learning and model parameter quantization in separate, disjoint stages, while incurring no statistically significant word error rate (WER) increase over the 32-bit full-precision models. The system compression time of the wav2vec2.0-base and HuBERT-large models is reduced by up to 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both produce lower WERs. The best-performing 3.5-bit mixed-precision quantized HuBERT-large model produces a lossless compression ratio of 8.6x over the 32-bit full-precision system.
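As a back-of-envelope check (ours, not the paper's): a 3.5-bit average width against 32-bit full precision gives an ideal parameter-only ratio of about 9.1x, so the reported 8.6x lossless ratio implies only modest overhead from unquantized components and storage format:

```python
full_precision_bits = 32
avg_mixed_bits = 3.5                          # best HuBERT-large configuration
ideal = full_precision_bits / avg_mixed_bits  # ~9.14x, parameters only
reported = 8.6
print(f"ideal {ideal:.2f}x vs reported {reported}x "
      f"({reported / ideal:.0%} of ideal)")
```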
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on model compression through mixed-precision quantization of speech foundation models, not on prompt engineering or the manipulation of textual input prompts for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria." }
[A Diversity-Enhanced Knowledge Distillation Model for Practical Math Word Problem Solving](https://arxiv.org/abs/2501.03670v1) arXiv ID: http://arxiv.org/abs/2501.03670v1
Abstract:
Math Word Problem (MWP) solving, a critical task in natural language processing, has garnered significant research interest in recent years. Various recent studies heavily rely on Seq2Seq models and their extensions (e.g., Seq2Tree and Graph2Tree) to generate mathematical equations. While effective, these models struggle to generate diverse yet counterpart solution equations, limiting their generalization across various math problem scenarios. In this paper, we introduce a novel Diversity-enhanced Knowledge Distillation (DivKD) model for practical MWP solving. Our approach proposes an adaptive diversity distillation method, in which a student model learns diverse equations by selectively transferring high-quality knowledge from a teacher model. Additionally, we design a diversity prior-enhanced student model to better capture the diversity distribution of equations by incorporating a conditional variational auto-encoder. Extensive experiments on four MWP benchmark datasets demonstrate that our approach achieves higher answer accuracy than strong baselines while maintaining high efficiency for practical applications.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses primarily on developing a new model (Diversity-enhanced Knowledge Distillation) for math word problem solving, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs). Prompt engineering is not the central concern, and the paper does not demonstrate the impact of textual input prompts on LLM output." }
SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks
arXiv ID: http://arxiv.org/abs/2501.03676v2
Abstract:
In this work, we build upon the offline reinforcement learning algorithm TD7, which incorporates State-Action Learned Embeddings (SALE) and a prioritized experience replay buffer (LAP). We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC. The ensemble Q-networks introduce penalties to guide the actor network toward in-distribution actions, effectively addressing the challenge of out-of-distribution actions. Meanwhile, the gradient diversity penalty encourages diverse Q-value gradients, further suppressing overestimation for out-of-distribution actions. Additionally, our method retains an adjustable behavior cloning (BC) term that directs the actor network toward dataset actions during early training stages, while gradually reducing its influence as the precision of the Q-ensemble improves. These enhancements work synergistically to improve the stability and precision of the training. Experimental results on the D4RL MuJoCo benchmarks demonstrate that our algorithm achieves higher convergence speed, stability, and performance compared to existing methods.
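Stripped to its essentials, the described actor objective combines a conservative ensemble-minimum value estimate with a decaying behavior-cloning penalty; a hedged PyTorch sketch under our own naming (not the authors' code):

```python
import torch

def actor_loss(actor, q_ensemble, obs, dataset_actions, bc_weight):
    # q_ensemble: list of critics Q_i(s, a); bc_weight is scheduled to
    # decay over training as the Q-ensemble becomes more reliable.
    actions = actor(obs)
    # Conservative value estimate: taking the minimum over the ensemble
    # penalizes out-of-distribution actions that only some critics overvalue.
    q_values = torch.stack([q(obs, actions) for q in q_ensemble], dim=0)
    q_min = q_values.min(dim=0).values
    # Adjustable behavior-cloning term keeps the actor near dataset
    # actions during early training stages.
    bc_term = ((actions - dataset_actions) ** 2).mean()
    return (-q_min.mean()) + bc_weight * bc_term
```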
Decision Explanation: Original decision: REJECT The paper primarily focuses on offline reinforcement learning with ensemble Q-networks, disregarding the mandatory criteria of concentrating on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and manipulating textual input prompts to improve LLM performance.
[SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment](https://arxiv.org/abs/2501.03681v1) arXiv ID: http://arxiv.org/abs/2501.03681v1
Abstract:
Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter, two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational resource consumption and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our finding that the representation learning of languages is conducted mainly in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs, achieving superior average performance over all strong baselines across 10 languages. Meanwhile, SLAM involves only one training stage, reducing training time by a factor of 4.1-11.9 compared to the two-stage method.
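The mechanism is easy to picture: freeze the whole model, then re-enable gradients only for the feed-forward sub-layers of the lowest blocks. A sketch assuming Llama-style parameter names (`layers.<i>.mlp`), which may differ from the SLAM implementation:

```python
# Hedged sketch: tune only the feed-forward (MLP) sub-layers of the
# lowest six transformer blocks, freezing everything else.
NUM_TUNED_LAYERS = 6

def mark_trainable(model):
    for name, param in model.named_parameters():
        param.requires_grad = False
        for i in range(NUM_TUNED_LAYERS):
            if f"layers.{i}.mlp" in name:   # assumed naming convention
                param.requires_grad = True
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"tuning {tuned / total:.1%} of parameters")  # paper reports 6.5-8%
```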
Decision Explanation: Original decision: REJECT The paper primarily focuses on efficient multilingual reasoning via selective language alignment, involving a one-stage training method for LLMs, which aligns more with the development of new training methods (violation of MUST NOT 1) rather than the engineering, design, or optimization of prompts for Large Language Models.
Exploring Molecule Generation Using Latent Space Graph Diffusion
arXiv ID: http://arxiv.org/abs/2501.03696v1
Abstract:
Generating molecular graphs is a challenging task due to their discrete nature and the competitive objectives involved. Diffusion models have emerged as SOTA approaches in data generation across various modalities. For molecular graphs, graph neural networks (GNNs) as a diffusion backbone have achieved impressive results. Latent space diffusion, where diffusion occurs in a low-dimensional space via an autoencoder, has demonstrated computational efficiency. However, the literature on latent space diffusion for molecular graphs is scarce, and no commonly accepted best practices exist. In this work, we explore different approaches and hyperparameters, contrasting generative flow models (denoising diffusion, flow matching, heat dissipation) and architectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high sensitivity to the choice of approach and design decisions. Code is made available at github.com/Prashanth-Pombala/Molecule-Generation-using-Latent-Space-Graph-Diffusion.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on generating molecular graphs using latent space graph diffusion, involving graph neural networks and diffusion models, without any mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria." }
[Unsupervised Speech Segmentation: A General Approach Using Speech Language Models](https://arxiv.org/abs/2501.03711v1) arXiv ID: http://arxiv.org/abs/2501.03711v1
Abstract:
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.
Decision Explanation: Original decision: REJECT The paper focuses on unsupervised speech segmentation using Speech Language Models (SLMs), which does not meet the criteria of primarily focusing on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.
[Materialist: Physically Based Editing Using Single-Image Inverse Rendering](https://arxiv.org/abs/2501.03717v1) arXiv ID: http://arxiv.org/abs/2501.03717v1
Abstract:
To perform image editing based on single-view, inverse physically based rendering, we present a method combining a learning-based approach with progressive differentiable rendering. Given an image, our method leverages neural networks to predict initial material properties. Progressive differentiable rendering is then used to optimize the environment map and refine the material properties with the goal of closely matching the rendered result to the input image. We require only a single image while other inverse rendering methods based on the rendering equation require multiple views. In comparison to single-view methods that rely on neural renderers, our approach achieves more realistic light material interactions, accurate shadows, and global illumination. Furthermore, with optimized material properties and illumination, our method enables a variety of tasks, including physically based material editing, object insertion, and relighting. We also propose a method for material transparency editing that operates effectively without requiring full scene geometry. Compared with methods based on Stable Diffusion, our approach offers stronger interpretability and more realistic light refraction based on empirical results.
Decision Explanation: Original decision: REJECT The paper focuses on image editing using single-image inverse physically based rendering, involving neural networks and differentiable rendering for image generation, which falls under image generation and not text-based interactions with Large Language Models (LLMs), thus failing to meet the primary criteria of focusing on prompt engineering for LLMs.
[SelectiveFinetuning: Enhancing Transfer Learning in Sleep Staging through Selective Domain Alignment](https://arxiv.org/abs/2501.03764v1) arXiv ID: http://arxiv.org/abs/2501.03764v1
Abstract:
In practical sleep stage classification, a key challenge is the variability of EEG data across different subjects and environments. Differences in physiology, age, health status, and recording conditions can lead to domain shifts between data. These domain shifts often result in decreased model accuracy and reliability, particularly when the model is applied to new data with characteristics different from those it was originally trained on, which is a typical manifestation of negative transfer. To address this, we propose SelectiveFinetuning in this paper. Our method utilizes a pretrained Multi Resolution Convolutional Neural Network (MRCNN) to extract EEG features, capturing the distinctive characteristics of different sleep stages. To mitigate the effect of domain shifts, we introduce a domain-aligning mechanism that employs the Earth Mover's Distance (EMD) to evaluate and select source domain data closely matching the target domain. By finetuning the model with the selected source data, SelectiveFinetuning enhances the model's performance on target domains that exhibit domain shifts relative to the training data. Experimental results show that our method outperforms existing baselines, offering greater robustness and adaptability in practical scenarios where data distributions are often unpredictable.
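The selection step can be illustrated with SciPy's 1-D Wasserstein distance, assuming features are pooled to one dimension per subject (our simplification, not necessarily the paper's):

```python
from scipy.stats import wasserstein_distance

def select_source_subjects(source_feats, target_feats, k):
    # source_feats: dict mapping subject id -> 1-D array of pooled MRCNN
    # features; target_feats: 1-D array for the target domain.
    # Keep the k subjects whose feature distribution is closest (in EMD)
    # to the target's, so finetuning sees only well-aligned source data.
    distances = {
        sid: wasserstein_distance(feats, target_feats)
        for sid, feats in source_feats.items()
    }
    return sorted(distances, key=distances.get)[:k]
```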
Decision Explanation: Original decision: REJECT The paper focuses primarily on enhancing transfer learning in sleep staging through selective domain alignment using a Convolutional Neural Network (CNN), not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not investigate the manipulation of textual input prompts to improve LLM performance.
[Three-dimensional attention Transformer for state evaluation in real-time strategy games](https://arxiv.org/abs/2501.03832v1) arXiv ID: http://arxiv.org/abs/2501.03832v1
Abstract:
Situation assessment in Real-Time Strategy (RTS) games is crucial for understanding decision-making in complex adversarial environments. However, existing methods remain limited in processing multi-dimensional feature information and temporal dependencies. Here we propose a tri-dimensional Space-Time-Feature Transformer (TSTF Transformer) architecture, which efficiently models battlefield situations through three independent but cascaded modules: spatial attention, temporal attention, and feature attention. On a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF Transformer demonstrates superior performance: achieving 58.7% accuracy in the early game (~4% progress), significantly outperforming the conventional Timesformer's 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress) while maintaining low performance variation (standard deviation 0.114). Meanwhile, this architecture requires fewer parameters (4.75M) compared to the baseline model (5.54M). Our study not only provides new insights into situation assessment in RTS games but also presents an innovative paradigm for Transformer-based multi-dimensional temporal modeling.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new Transformer architecture for situation assessment in Real-Time Strategy games, rather than on prompt engineering for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through textual input prompt manipulation, and there is no mention of prompts or LLMs in the provided abstract.
[SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor Diagnosis](https://arxiv.org/abs/2501.03836v2) arXiv ID: http://arxiv.org/abs/2501.03836v2
Abstract:
Brain tumors can result in neurological dysfunction, alterations in cognitive and psychological states, increased intracranial pressure, and the occurrence of seizures, thereby presenting a substantial risk to human life and health. The You Only Look Once (YOLO) series models have demonstrated superior accuracy in object detection for medical imaging. In this paper, we develop a novel SCC-YOLO architecture by integrating the SCConv attention mechanism into YOLOv9. The SCConv module reconstructs an efficient convolutional module by reducing spatial and channel redundancy among features, thereby enhancing the learning of image features. We investigate the impact of integrating different attention mechanisms with the YOLOv9 model on brain tumor image detection using both the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset). Experimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3% improvement in mAP50 compared to YOLOv9, while on our self-made dataset, SCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached state-of-the-art performance in brain tumor detection. Source code is available at: https://jihulab.com/healthcare-information-studio/SCC-YOLO/-/tree/master
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new object detection architecture (SCC-YOLO) for medical imaging (brain tumor diagnosis) and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts.
Explainable Reinforcement Learning via Temporal Policy Decomposition
arXiv ID: http://arxiv.org/abs/2501.03902v1
Abstract:
We investigate the explainability of Reinforcement Learning (RL) policies from a temporal perspective, focusing on the sequence of future outcomes associated with individual actions. In RL, value functions compress information about rewards collected across multiple trajectories and over an infinite horizon, allowing a compact form of knowledge representation. However, this compression obscures the temporal details inherent in sequential decision-making, presenting a key challenge for interpretability. We present Temporal Policy Decomposition (TPD), a novel explainability approach that explains individual RL actions in terms of their Expected Future Outcome (EFO). These explanations decompose generalized value functions into a sequence of EFOs, one for each time step up to a prediction horizon of interest, revealing insights into when specific outcomes are expected to occur. We leverage fixed-horizon temporal difference learning to devise an off-policy method for learning EFOs for both optimal and suboptimal actions, enabling contrastive explanations consisting of EFOs for different state-action pairs. Our experiments demonstrate that TPD generates accurate explanations that (i) clarify the policy's future strategy and anticipated trajectory for a given action and (ii) improve understanding of the reward composition, facilitating fine-tuning of the reward function to align with human expectations.
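The fixed-horizon recursion underlying TPD replaces the infinite-horizon bootstrap with one value function per horizon; in standard notation (ours, not necessarily the paper's):

$$V_h^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_{t+1} + V_{h-1}^{\pi}(S_{t+1}) \,\middle|\, S_t = s \right], \qquad V_0^{\pi}(s) = 0,$$

so one natural reading of an Expected Future Outcome at step $h$ is the increment $V_h^{\pi}(s) - V_{h-1}^{\pi}(s)$, attributing the outcome to the time step at which it is expected to arrive.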
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on explainability in Reinforcement Learning (RL) policies, primarily dealing with temporal policy decomposition and value functions, with no indication of prompt engineering for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance." }
[Exploring the Potential of Large Language Models in Public Transportation: San Antonio Case Study](https://arxiv.org/abs/2501.03904v1) arXiv ID: http://arxiv.org/abs/2501.03904v1
Abstract:
The integration of large language models (LLMs) into public transit systems presents a transformative opportunity to enhance urban mobility. This study explores the potential of LLMs to revolutionize public transportation management within the context of San Antonio's transit system. Leveraging the capabilities of LLMs in natural language processing and data analysis, we investigate their potential to optimize route planning, reduce wait times, and provide personalized travel assistance. By utilizing the General Transit Feed Specification (GTFS) and other relevant data, this research aims to demonstrate how LLMs can potentially improve resource allocation, elevate passenger satisfaction, and inform data-driven decision-making in transit operations. A comparative analysis of different ChatGPT models was conducted to assess their ability to understand transportation information, retrieve relevant data, and provide comprehensive responses. Findings from this study suggest that while LLMs hold immense promise for public transit, careful engineering and fine-tuning are essential to realizing their full potential. San Antonio serves as a case study to inform the development of LLM-powered transit systems in other urban environments.
Decision Explanation: Original decision: REJECT The paper primarily focuses on applying Large Language Models to public transportation management, rather than on the engineering, design, or optimization of prompts specifically for LLMs, failing to meet the core subject requirement.
[Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States](https://arxiv.org/abs/2501.03952v1) arXiv ID: http://arxiv.org/abs/2501.03952v1
Abstract:
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defence, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on evaluating open-weight Language Models for lesser-spoken languages, concerning data privacy and model performance, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs)." }
VLM-driven Behavior Tree for Context-aware Task Planning
arXiv ID: http://arxiv.org/abs/2501.03968v2
Abstract:
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
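The "self-prompted visual condition" idea amounts to a behavior-tree condition node whose predicate is free-form text checked by a VLM against the live camera feed; a minimal sketch in which `vlm_client.evaluate` is a hypothetical API, not the paper's:

```python
class VisualConditionNode:
    """Hedged sketch of a BT condition node with a free-form text predicate."""

    def __init__(self, condition_text, vlm_client):
        self.condition_text = condition_text  # e.g. "the table is clear of cups"
        self.vlm = vlm_client

    def tick(self, camera_image):
        # Embed the generated condition text into a second VLM call that
        # evaluates it against the current real-world image.
        prompt = (
            "Look at the image and answer strictly YES or NO.\n"
            f"Condition: {self.condition_text}"
        )
        answer = self.vlm.evaluate(image=camera_image, prompt=prompt)
        return "SUCCESS" if answer.strip().upper().startswith("YES") else "FAILURE"
```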
Decision Explanation: Original decision: REJECT The paper's primary focus is on leveraging Vision-Language Models (VLMs) for context-aware task planning in robotics, with prompt engineering being a secondary aspect used for conditional control, not the central focus of the paper.
[ChronoLLM: A Framework for Customizing Large Language Model for Digital Twins generalization based on PyChrono](https://arxiv.org/abs/2501.04062v1) arXiv ID: http://arxiv.org/abs/2501.04062v1
Abstract:
Recently, the integration of advanced simulation technologies with artificial intelligence (AI) is revolutionizing science and engineering research. ChronoLlama introduces a novel framework that customizes the open-source LLMs, specifically for code generation, paired with PyChrono for multi-physics simulations. This integration aims to automate and improve the creation of simulation scripts, thus enhancing model accuracy and efficiency. This combination harnesses the speed of AI-driven code generation with the reliability of physics-based simulations, providing a powerful tool for researchers and engineers. Empirical results indicate substantial enhancements in simulation setup speed, accuracy of the generated codes, and overall computational efficiency. ChronoLlama not only expedites the development and testing of multibody systems but also spearheads a scalable, AI-enhanced approach to managing intricate mechanical simulations. This pioneering integration of cutting-edge AI with traditional simulation platforms represents a significant leap forward in automating and optimizing design processes in engineering applications.
Decision Explanation: Original decision: REJECT The paper's primary focus is on integrating LLMs with multi-physics simulations for automating code generation in engineering applications, rather than specifically on prompt engineering for Large Language Models. The core subject is the development of a framework for simulation script automation, not novel prompt engineering techniques or the manipulation of textual input prompts to improve LLM performance.
[TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning](https://arxiv.org/abs/2501.04108v1) arXiv ID: http://arxiv.org/abs/2501.04108v1
Abstract:
An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build downstream classifiers for various downstream tasks. However, many studies showed that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built based on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If not, the test input is processed in a normal way to maintain the utility. Otherwise, the test input will be further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) from a given test input and recover it under state-of-the-art trojan attacks. We further demonstrate by experiments that our TrojanDec outperforms the state-of-the-art defenses.
Decision Explanation: Original decision: REJECT The paper focuses on detecting and recovering from trojan attacks in self-supervised image encoders, which falls outside the specified criteria: it does not involve Large Language Models (LLMs), prompt engineering for text-based interactions, or text generation, and instead pertains to image processing and security.
BiasGuard: Guardrailing Fairness in Machine Learning Production Systems
arXiv ID: http://arxiv.org/abs/2501.04142v1
Abstract:
As machine learning (ML) systems increasingly impact critical sectors such as hiring, financial risk assessments, and criminal justice, the imperative to ensure fairness has intensified due to potential negative implications. While much ML fairness research has focused on enhancing training data and processes, addressing the outputs of already deployed systems has received less attention. This paper introduces 'BiasGuard', a novel approach designed to act as a fairness guardrail in production ML systems. BiasGuard leverages Test-Time Augmentation (TTA) powered by Conditional Generative Adversarial Network (CTGAN), a cutting-edge generative AI model, to synthesize data samples conditioned on inverted protected attribute values, thereby promoting equitable outcomes across diverse groups. This method aims to provide equal opportunities for both privileged and unprivileged groups while significantly enhancing the fairness metrics of deployed systems without the need for retraining. Our comprehensive experimental analysis across diverse datasets reveals that BiasGuard enhances fairness by 31% while only reducing accuracy by 0.09% compared to non-mitigated benchmarks. Additionally, BiasGuard outperforms existing post-processing methods in improving fairness, positioning it as an effective tool to safeguard against biases when retraining the model is impractical.
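The guardrail idea reduces to averaging predictions over real and counterfactual inputs; a minimal sketch in which `synth_sampler` stands in for the attribute-conditioned CTGAN (both the function and its signature are illustrative, not BiasGuard's actual interface):

```python
import numpy as np

def tta_fair_predict(model, x, protected_idx, synth_sampler, n_aug=8):
    # x: a single feature vector (1-D numpy array) with a binary protected
    # attribute at position protected_idx.
    flipped_value = 1 - x[protected_idx]
    # Synthesize samples conditioned on the inverted protected attribute.
    augmented = [synth_sampler(x, flipped_value) for _ in range(n_aug)]
    batch = np.stack([x] + augmented)
    # Averaging predictions across both attribute values pushes the
    # deployed model toward group-invariant outputs without retraining.
    return model.predict_proba(batch).mean(axis=0)
```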
Decision Explanation: Original decision: REJECT The paper primarily focuses on ensuring fairness in machine learning production systems via Test-Time Augmentation and a generative AI model, rather than specifically on prompt engineering for Large Language Models (LLMs) and the manipulation of textual input prompts to improve LLM performance.
[Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation](https://arxiv.org/abs/2501.04167v1) arXiv ID: http://arxiv.org/abs/2501.04167v1
Abstract:
Personalized text generation requires a unique ability of large language models (LLMs) to learn from context that they often do not encounter during their standard training. One way to encourage LLMs to better use personalized context for generating outputs that better align with the user's expectations is to instruct them to reason over the user's past preferences, background knowledge, or writing style. To achieve this, we propose Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG), a framework that trains LLMs to reason over personal data during response generation. REST-PG first generates reasoning paths to train the LLM's reasoning abilities and then employs Expectation-Maximization Reinforced Self-Training to iteratively train the LLM based on its own high-reward outputs. We evaluate REST-PG on the LongLaMP benchmark, consisting of four diverse personalized long-form text generation tasks. Our experiments demonstrate that REST-PG achieves significant improvements over state-of-the-art baselines, with an average relative performance gain of 14.5% on the benchmark.
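Read as pseudocode, the training loop alternates generation-and-filtering with fine-tuning; a hedged sketch with placeholder names (not the REST-PG implementation):

```python
def rest_pg(llm, tasks, reward_fn, n_iters=3, threshold=0.8):
    # Hedged sketch of an Expectation-Maximization reinforced
    # self-training loop; llm.sample, llm.finetune, and reward_fn are
    # illustrative interfaces.
    for _ in range(n_iters):
        training_set = []
        for task in tasks:
            # E-step: sample reasoning-then-response outputs conditioned
            # on the user's profile and history.
            candidates = llm.sample(task.prompt_with_profile, n=8)
            # Keep only the model's own high-reward generations.
            training_set += [
                (task.prompt_with_profile, c)
                for c in candidates if reward_fn(task, c) >= threshold
            ]
        # M-step: fine-tune on the filtered self-generated data.
        llm = llm.finetune(training_set)
    return llm
```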
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new framework (REST-PG) for self-training LLMs to enhance personalized text generation, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models. It meets the exclusions as it's more about training methods for LLMs than prompt engineering.
Learning to Transfer Human Hand Skills for Robot Manipulations
arXiv ID: http://arxiv.org/abs/2501.04169v1
Abstract:
We present a method for teaching dexterous manipulation tasks to robots from human hand motion demonstrations. Unlike existing approaches that solely rely on kinematics information without taking into account the plausibility of robot and object interaction, our method directly infers plausible robot manipulation actions from human motion demonstrations. To address the embodiment gap between the human hand and the robot system, our approach learns a joint motion manifold that maps human hand movements, robot hand actions, and object movements in 3D, enabling us to infer one motion component from others. Our key idea is the generation of pseudo-supervision triplets, which pair human, object, and robot motion trajectories synthetically. Through real-world experiments with robot hand manipulation, we demonstrate that our data-driven retargeting method significantly outperforms conventional retargeting techniques, effectively bridging the embodiment gap between human and robotic hands. Website at https://rureadyo.github.io/MocapRobot/.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not focus on prompt engineering for Large Language Models (LLMs), instead concentrating on robotics and teaching manipulation tasks to robots from human hand motion demonstrations, which falls outside the specified criteria." }
[SNR-EQ-JSCC: Joint Source-Channel Coding with SNR-Based Embedding and Query](https://arxiv.org/abs/2501.04732v1) arXiv ID: http://arxiv.org/abs/2501.04732v1
Abstract:
Coping with the impact of dynamic channels is a critical issue in joint source-channel coding (JSCC)-based semantic communication systems. In this paper, we propose a lightweight channel-adaptive semantic coding architecture called SNR-EQ-JSCC. It is built upon the generic Transformer model and achieves channel adaptation (CA) by Embedding the signal-to-noise ratio (SNR) into the attention blocks and dynamically adjusting attention scores through channel-adaptive Queries. Meanwhile, penalty terms are introduced in the loss function to stabilize the training process. Considering that instantaneous SNR feedback may be imperfect, we propose an alternative method that uses only the average SNR, which requires no retraining of SNR-EQ-JSCC. Simulation results conducted on image transmission demonstrate that the proposed SNR-EQ-JSCC outperforms the state-of-the-art SwinJSCC in peak signal-to-noise ratio (PSNR) and perception metrics while only requiring 0.05% of the storage overhead and 6.38% of the computational complexity for CA. Moreover, the channel-adaptive query method demonstrates significant improvements in perception metrics. When instantaneous SNR feedback is imperfect, SNR-EQ-JSCC using only the average SNR still surpasses baseline schemes.
Decision Explanation: Original decision: REJECT The paper focuses on joint source-channel coding for semantic communication systems, adapting a Transformer model for channel conditions, and does not primarily investigate prompt engineering for Large Language Models (LLMs) or analyze methods for improving LLM performance through textual input prompt manipulation.
RTLSquad: Multi-Agent Based Interpretable RTL Design
arXiv ID: http://arxiv.org/abs/2501.05470v1
Abstract:
Optimizing Register-Transfer Level (RTL) code is crucial for improving hardware PPA performance. Large Language Models (LLMs) offer new approaches for automatic RTL code generation and optimization. However, existing methods often lack decision interpretability (sufficient, understandable justification for decisions), making it difficult for hardware engineers to trust the generated results, thus preventing these methods from being integrated into the design process. To address this, we propose RTLSquad, a novel LLM-Based Multi-Agent system for interpretable RTL code generation. RTLSquad divides the design process into exploration, implementation, and verification & evaluation stages managed by specialized agent squads, generating optimized RTL code through inter-agent collaboration, and providing decision interpretability through the communication process. Experiments show that RTLSquad excels in generating functionally correct RTL code and optimizing PPA performance, while also having the capability to provide decision paths, demonstrating the practical value of our system.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a multi-agent system for interpretable RTL (Register-Transfer Level) design using LLMs, rather than focusing on the engineering, design, or optimization of prompts specifically for Large Language Models. Prompt engineering is not the central concern, but rather a means to achieve the system's goals.
[Found in Translation: semantic approaches for enhancing AI interpretability in face verification](https://arxiv.org/abs/2501.05471v1) arXiv ID: http://arxiv.org/abs/2501.05471v1
Abstract:
The increasing complexity of machine learning models in computer vision, particularly in face verification, requires the development of explainable artificial intelligence (XAI) to enhance interpretability and transparency. This study extends previous work by integrating semantic concepts derived from human cognitive processes into XAI frameworks to bridge the comprehension gap between model outputs and human understanding. We propose a novel approach combining global and local explanations, using semantic features defined by user-selected facial landmarks to generate similarity maps and textual explanations via large language models (LLMs). The methodology was validated through quantitative experiments and user feedback, demonstrating improved interpretability. Results indicate that our semantic-based approach, particularly the most detailed set, offers a more nuanced understanding of model decisions than traditional methods. User studies highlight a preference for our semantic explanations over traditional pixel-based heatmaps, emphasizing the benefits of human-centric interpretability in AI. This work contributes to the ongoing efforts to create XAI frameworks that align AI models' behaviour with human cognitive processes, fostering trust and acceptance in critical applications.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing AI interpretability in face verification (a computer vision task) using Large Language Models (LLMs) as a component for generating textual explanations, rather than focusing on the engineering, design, or optimization of prompts specifically for LLMs in text-based interactions.
[Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis](https://arxiv.org/abs/2501.05474v1) arXiv ID: http://arxiv.org/abs/2501.05474v1
Abstract:
Multimodal Sentiment Analysis (MSA) integrates diverse modalities(text, audio, and video) to comprehensively analyze and understand individuals' emotional states. However, the real-world prevalence of incomplete data poses significant challenges to MSA, mainly due to the randomness of modality missing. Moreover, the heterogeneity issue in multimodal data has yet to be effectively addressed. To tackle these challenges, we introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a distillation approach, wherein a complete modality teacher model guides a missing modality student model, ensuring robustness in the presence of modality missing. Simultaneously, we developed the Modality-Invariant Bidirectional Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.
Decision Explanation: Original decision: REJECT The paper focuses primarily on Multimodal Sentiment Analysis with a network architecture (MITR-DNet) to address modality missing and heterogeneity issues, rather than prompt engineering for Large Language Models (LLMs), failing to meet the 'MUST' criteria for focus on LLM prompt engineering and manipulation of textual input prompts.
[Cooperative Search and Track of Rogue Drones using Multiagent Reinforcement Learning](https://arxiv.org/abs/2501.10413v1) arXiv ID: http://arxiv.org/abs/2501.10413v1
Abstract:
This work considers the problem of intercepting rogue drones targeting sensitive critical infrastructure facilities. While current interception technologies focus mainly on the jamming/spoofing tasks, the challenges of effectively locating and tracking rogue drones have not received adequate attention. Solving this problem and integrating with recently proposed interception techniques will enable a holistic system that can reliably detect, track, and neutralize rogue drones. Specifically, this work considers a team of pursuer UAVs that can search, detect, and track multiple rogue drones over a sensitive facility. The joint search and track problem is addressed through a novel multiagent reinforcement learning scheme to optimize the agent mobility control actions that maximize the number of rogue drones detected and tracked. The performance of the proposed system is investigated under realistic settings through extensive simulation experiments with varying number of agents demonstrating both its performance and scalability.
Decision Explanation: Original decision: REJECT The paper focuses primarily on multiagent reinforcement learning for tracking rogue drones, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Efficient Deployment of Large Language Models on Resource-constrained Devices](https://arxiv.org/abs/2501.02438v1) arXiv ID: http://arxiv.org/abs/2501.02438v1
Abstract:
Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4x-6.9x and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
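The adaptive-configuration idea can be illustrated with the simplest bandit; whether FedSpine uses epsilon-greedy, UCB, or something else is not stated here, so treat this as a generic sketch:

```python
import random

# Arms are (pruning ratio, LoRA rank) pairs; the grid is illustrative.
ARMS = [(p, r) for p in (0.2, 0.4, 0.6) for r in (4, 8, 16)]

class DeviceBandit:
    """Hedged epsilon-greedy sketch of per-device configuration selection."""

    def __init__(self, eps=0.1):
        self.eps = eps
        self.counts = {a: 0 for a in ARMS}
        self.values = {a: 0.0 for a in ARMS}

    def choose(self):
        if random.random() < self.eps:
            return random.choice(ARMS)      # explore a new configuration
        return max(ARMS, key=self.values.get)  # exploit the best so far

    def update(self, arm, reward):
        # reward could be, e.g., accuracy gain per second of device time.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```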
Decision Explanation: Original decision: REJECT The paper focuses primarily on the efficient deployment and fine-tuning of Large Language Models on resource-constrained devices, rather than on the engineering, design, or optimization of prompts for LLMs, failing to meet the first 'MUST' criterion.
[FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models](https://arxiv.org/abs/2501.02461v1) arXiv ID: http://arxiv.org/abs/2501.02461v1
Abstract:
Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
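In prompt-learning terms, each client tunes two small context tensors rather than the VLM itself; a hedged PyTorch sketch where the dimensions and the exact form of the alignment penalty are illustrative:

```python
import torch
import torch.nn as nn

class DualPrompt(nn.Module):
    """Hedged sketch of shared + private prompt learning for a CLIP-style
    text encoder; only these small tensors are tuned and communicated."""

    def __init__(self, n_ctx=8, dim=512):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # global
        self.private = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # per-client

    def forward(self, class_token_embs):
        # class_token_embs: (batch, seq, dim) embeddings of class names.
        ctx = torch.cat([self.shared, self.private], dim=0)  # (2*n_ctx, dim)
        ctx = ctx.unsqueeze(0).expand(class_token_embs.size(0), -1, -1)
        return torch.cat([ctx, class_token_embs], dim=1)

    def alignment_loss(self):
        # One simple instantiation of the Dual Prompt Alignment Constraint:
        # keep private prompts semantically close to the shared ones.
        return (self.shared - self.private).pow(2).mean()
```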
Decision Explanation: Original decision: REJECT The paper primarily focuses on federated learning for Vision-Language Models (VLMs) in remote sensing image classification, not on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs), and does not provide concrete examples of text prompts impacting LLM output.
Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
arXiv ID: http://arxiv.org/abs/2501.02464v1
Abstract:
While recent depth estimation methods exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types-particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras-remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 ($\delta_1$) accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
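For reference, delta-1 is the standard threshold accuracy used in metric depth evaluation; a minimal NumPy implementation:

```python
import numpy as np

def delta1(pred, gt, valid=None):
    # Fraction of pixels with max(pred/gt, gt/pred) < 1.25; the looser
    # thresholds 1.25**2 and 1.25**3 give the companion delta_2 / delta_3.
    if valid is None:
        valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < 1.25).mean())
```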
Decision Explanation: Original decision: REJECT The paper focuses on depth estimation in computer vision, specifically developing a framework for metric depth estimation from various camera types, and does not address prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[PTEENet: Post-Trained Early-Exit Neural Networks Augmentation for Inference Cost Optimization](https://arxiv.org/abs/2501.02508v1) arXiv ID: http://arxiv.org/abs/2501.02508v1
Abstract:
For many practical applications, a high computational cost of inference over deep network architectures might be unacceptable. A small degradation in the overall inference accuracy might be a reasonable price to pay for a significant reduction in the required computational resources. In this work, we describe a method for introducing "shortcuts" into the DNN feedforward inference process by skipping costly feedforward computations whenever possible. The proposed method is based on the previously described BranchyNet (Teerapittayanon et al., 2016) and the EEnet (Demir, 2019) architectures that jointly train the main network and early exit branches. We extend those methods by attaching branches to pre-trained models, thus eliminating the need to alter the original weights of the network. We also suggest a new branch architecture based on convolutional building blocks to allow enough training capacity when applied on large DNNs. The proposed architecture includes confidence heads that are used for predicting the confidence level in the corresponding early exits. By defining adjusted thresholds on these confidence extensions, we can control in real time the amount of data exiting from each branch and the overall tradeoff between speed and accuracy of our model. In our experiments, we evaluate our method using image datasets (SVHN and CIFAR10) and several DNN architectures (ResNet, DenseNet, VGG) with varied depth. Our results demonstrate that the proposed method enables us to reduce the average inference computational cost and to control, in real time, the tradeoff between model accuracy and computation cost.
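The run-time control loop is straightforward: attach a confidence head per branch and exit at the first branch whose confidence clears its threshold. A sketch for single-sample inference (the interfaces are our own, not the paper's):

```python
import torch

def early_exit_forward(blocks, branches, thresholds, x):
    # blocks: the pre-trained backbone split into sequential stages;
    # branches: post-trained (classifier, confidence_head) pairs attached
    # between stages; thresholds: per-branch exit thresholds, adjustable
    # at run time to trade accuracy for compute.
    for block, (classifier, conf_head), tau in zip(blocks[:-1], branches, thresholds):
        x = block(x)
        conf = torch.sigmoid(conf_head(x))
        if conf.item() >= tau:        # confident enough: skip the rest
            return classifier(x)
    return blocks[-1](x)              # fall through to the full network
```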
Decision Explanation: Original decision: REJECT The paper focuses on optimizing the inference cost of deep neural networks (DNNs) through early-exit strategies, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet the primary criteria.
[Evaluating Large Language Models Against Human Annotators in Latent Content Analysis: Sentiment, Political Leaning, Emotional Intensity, and Sarcasm](https://arxiv.org/abs/2501.02532v1) arXiv ID: http://arxiv.org/abs/2501.02532v1
Abstract:
In the era of rapid digital communication, vast amounts of textual data are generated daily, demanding efficient methods for latent content analysis to extract meaningful insights. Large Language Models (LLMs) offer potential for automating this process, yet comprehensive assessments comparing their performance to human annotators across multiple dimensions are lacking. This study evaluates the reliability, consistency, and quality of seven state-of-the-art LLMs, including variants of OpenAI's GPT-4, Gemini, Llama, and Mixtral, relative to human annotators in analyzing sentiment, political leaning, emotional intensity, and sarcasm detection. A total of 33 human annotators and eight LLM variants assessed 100 curated textual items, generating 3,300 human and 19,200 LLM annotations, with LLMs evaluated across three time points to examine temporal consistency. Inter-rater reliability was measured using Krippendorff's alpha, and intra-class correlation coefficients assessed consistency over time. The results reveal that both humans and LLMs exhibit high reliability in sentiment analysis and political leaning assessments, with LLMs demonstrating higher internal consistency than humans. In emotional intensity, LLMs displayed higher agreement compared to humans, though humans rated emotional intensity significantly higher. Both groups struggled with sarcasm detection, evidenced by low agreement. LLMs showed excellent temporal consistency across all dimensions, indicating stable performance over time. This research concludes that LLMs, especially GPT-4, can effectively replicate human analysis in sentiment and political leaning, although human expertise remains essential for emotional intensity interpretation. The findings demonstrate the potential of LLMs for consistent and high-quality performance in certain areas of latent content analysis.
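The reliability computation the study describes can be reproduced with the `krippendorff` package (a common choice; whether the authors used it is an assumption). Rows are raters, columns are rated items, and NaN marks missing ratings:

```python
import numpy as np
import krippendorff

# Toy reliability matrix: three raters (two humans, one LLM variant)
# rating five items on an ordinal scale; values are illustrative.
ratings = np.array([
    [1, 0, 1, 2, np.nan],   # human annotator 1
    [1, 0, 1, 2, 2],        # human annotator 2
    [1, 0, 0, 2, 2],        # LLM variant
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```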
Decision Explanation: Original decision: REJECT The paper focuses on evaluating LLMs' performance in latent content analysis compared to human annotators, rather than primarily investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.
[Energy Optimization of Multi-task DNN Inference in MEC-assisted XR Devices: A Lyapunov-Guided Reinforcement Learning Approach](https://arxiv.org/abs/2501.02572v1) arXiv ID: http://arxiv.org/abs/2501.02572v1
Abstract:
Extended reality (XR), blending virtual and real worlds, is a key application of future networks. While AI advancements enhance XR capabilities, they also impose significant computational and energy challenges on lightweight XR devices. In this paper, we developed a distributed queue model for multi-task DNN inference, addressing issues of resource competition and queue coupling. In response to the challenges posed by the high energy consumption and limited resources of XR devices, we designed a dual time-scale joint optimization strategy for model partitioning and resource allocation, formulated as a bi-level optimization problem. This strategy aims to minimize the total energy consumption of XR devices while ensuring queue stability and adhering to computational and communication resource constraints. To tackle this problem, we devised a Lyapunov-guided Proximal Policy Optimization algorithm, named LyaPPO. Numerical results demonstrate that the LyaPPO algorithm outperforms the baselines, achieving energy conservation of 24.79% to 46.14% under varying resource capacities. Specifically, the proposed algorithm reduces the energy consumption of XR devices by 24.29% to 56.62% compared to baseline algorithms.
Decision Explanation: Original decision: REJECT The paper primarily focuses on energy optimization of Multi-task DNN Inference in MEC-assisted XR Devices using Reinforcement Learning, with no apparent focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.
[TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms](https://arxiv.org/abs/2501.02600v1) arXiv ID: http://arxiv.org/abs/2501.02600v1
Abstract:
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques often are inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
Decision Explanation: Original decision: REJECT The paper primarily focuses on thermal- and power-aware scheduling for LLM inference in cloud platforms, addressing infrastructure management rather than prompt engineering for text-based interactions with LLMs, thus failing to meet the core subject requirement.
[Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for
Jailbreak Attack Defense](https://arxiv.org/abs/2501.02629v1) arXiv ID: http://arxiv.org/abs/2501.02629v1
Abstract:
As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, significantly threaten LLM safety. In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets. Our insight is that certain layer(s) tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks. With these exposures, we then "unlearn" these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak benchmarks to demonstrate the efficacy of our approach. Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods.
Decision Explanation: Original decision: REJECT The paper focuses on defending against jailbreak attacks by modifying LLM architecture (Layer-AdvPatcher) and utilizing an unlearning strategy, rather than primarily focusing on the engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation.
Representation Learning of Lab Values via Masked AutoEncoder
arXiv ID: http://arxiv.org/abs/2501.02648v2
Abstract:
Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical to enable robust clinical predictions and reduce biases in AI systems in healthcare. Existing methods, such as variational autoencoders (VAEs) and decision tree-based approaches such as XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, particularly in underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning for the imputation of continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling the explicit capture of temporal dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data is unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of EHR data, offers a foundation model for more accurate and fair clinical imputation. In addition, we measure and compare the carbon footprint of Lab-MAE with the baseline XGBoost model, highlighting its environmental requirements.
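As a point of reference for the masking scheme described above, a generic masked-autoencoder scaffold over (value, timestamp) pairs might look like the following; this is a toy sketch under our own assumptions, not Lab-MAE's architecture:

```python
import torch
import torch.nn as nn

class TinyLabMAE(nn.Module):
    """Toy masked autoencoder over (lab value, timestamp) pairs."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(2, d_model)            # jointly encode value + time
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)             # reconstruct the lab value

    def forward(self, values, times, mask_ratio=0.3):
        mask = torch.rand(values.shape) < mask_ratio  # which values to hide
        v = values.clone()
        v[mask] = 0.0                                 # hide values, keep timestamps
        x = torch.stack([v, times], dim=-1)           # (batch, seq, 2)
        h = self.encoder(self.embed(x))
        return self.head(h).squeeze(-1), mask

model = TinyLabMAE()
vals, ts = torch.randn(8, 16), torch.rand(8, 16)
pred, mask = model(vals, ts)
loss = ((pred - vals)[mask] ** 2).mean()              # reconstruct masked values only
```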
Decision Explanation: Original decision: REJECT The paper focuses on developing a novel transformer-based masked autoencoder for imputing missing laboratory values in electronic health records, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.
[From thermodynamics to protein design: Diffusion models for biomolecule
generation towards autonomous protein engineering](https://arxiv.org/abs/2501.02680v1) arXiv ID: http://arxiv.org/abs/2501.02680v1
Abstract:
Protein design with desirable properties has been a significant challenge for many decades. Generative artificial intelligence is a promising approach and has achieved great success in various protein generation tasks. Notably, diffusion models stand out for their robust mathematical foundations and impressive generative capabilities, offering unique advantages in certain applications such as protein design. In this review, we first give the definition and characteristics of diffusion models and then focus on two strategies: Denoising Diffusion Probabilistic Models and Score-based Generative Models, where DDPM is the discrete form of SGM. Furthermore, we discuss their applications in protein design, peptide generation, drug discovery, and protein-ligand interaction. Finally, we outline the future perspectives of diffusion models to advance autonomous protein design and engineering. The E(3) group consists of all rotations, reflections, and translations in three dimensions. E(3) equivariance can preserve the physical stability of the frame of each amino acid as much as possible, and we reflect on how to keep the diffusion model E(3)-equivariant for protein generation.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the application of diffusion models for biomolecule generation in protein design, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs). It also concerns a medical/biological subject (protein design) and generative AI application (protein generation) outside the specified text generation driven by LLMs.
[EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in
Instructional Multimodal Models](https://arxiv.org/abs/2501.02699v1) arXiv ID: http://arxiv.org/abs/2501.02699v1
Abstract:
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.
Decision Explanation: Original decision: REJECT The paper's primary focus is on enhancing visual grounding in multimodal models to reduce hallucinations, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs). It meets none of the 'MUST' criteria for prompt engineering, focusing instead on improving the visual component of multimodal architectures.
[QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted
Question Matching for Enhanced QA Performance](https://arxiv.org/abs/2501.02702v1) arXiv ID: http://arxiv.org/abs/2501.02702v1
Abstract:
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems to improve Question Answering (QA) tasks from a target corpus. Large Language Models (LLMs) have revolutionized the analysis and generation of human-like text. These models rely on pre-trained data and lack real-time updates unless integrated with live data tools. RAG enhances LLMs by integrating online resources and databases to generate contextually appropriate responses. However, traditional RAG still encounters challenges like information dilution and hallucinations when handling vast amounts of data. Our approach addresses these challenges by converting corpora into a domain-specific dataset, over which a RAG architecture is constructed to generate responses from the target document. We introduce QuIM-RAG (Question-to-question Inverted Index Matching), a novel approach for the retrieval mechanism in our system. This strategy generates potential questions from document chunks and matches these with user queries to identify the most relevant text chunks for generating accurate answers. We implemented our RAG system on top of the open-source Meta-LLaMA3-8B-instruct model by Meta, available on Hugging Face. We constructed a custom corpus of 500+ pages from a high-traffic website accessed thousands of times daily for answering complex questions, along with manually prepared ground-truth QA pairs for evaluation. We compared our approach with traditional RAG models using BERT-Score and RAGAS, state-of-the-art metrics for evaluating LLM applications. Our evaluation demonstrates that our approach outperforms traditional RAG architectures on both metrics.
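The question-to-question matching at the heart of QuIM-RAG reduces to: pre-generate candidate questions per chunk offline, then match the user query against those questions instead of the raw chunks. A minimal sketch using TF-IDF similarity (the paper presumably uses LLM-generated questions and stronger retrieval; the hand-written `chunk_questions` pairs here are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Offline: one or more synthetic questions per chunk (normally LLM-generated)
chunk_questions = [
    ("What is the refund window?",  "chunk_03: refunds accepted within 30 days"),
    ("How do I reset my password?", "chunk_07: go to Settings > Security > Reset"),
    ("What plans are available?",   "chunk_12: Free, Pro, and Enterprise tiers"),
]
questions = [q for q, _ in chunk_questions]
vec = TfidfVectorizer().fit(questions)
q_matrix = vec.transform(questions)

def retrieve(user_query: str) -> str:
    """Match the query to the nearest pre-generated question; return its chunk."""
    sims = cosine_similarity(vec.transform([user_query]), q_matrix)[0]
    return chunk_questions[sims.argmax()][1]

print(retrieve("how can I change my password"))
# -> "chunk_07: go to Settings > Security > Reset"
```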
Decision Explanation: Original decision: REJECT The paper focuses primarily on advancing Retrieval-Augmented Generation (RAG) system architecture for Question Answering tasks, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), as required. While LLMs are utilized, the core subject is the RAG system's enhancement, not prompt engineering for text-based LLM interactions.
OpenGU: A Comprehensive Benchmark for Graph Unlearning
arXiv ID: http://arxiv.org/abs/2501.02728v1
Abstract:
Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution, with the potential to support dynamic graph updates in data management systems and enable scalable unlearning in distributed data systems while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Based on this unified benchmark framework, we are able to provide a comprehensive and fair evaluation for GU. Through extensive experimentation, we have drawn 8 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance. Instead, it focuses on Graph Unlearning (GU) for Graph Neural Networks (GNNs), which falls outside the specified criteria.
[GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic
Features for Medical Image Segmentation](https://arxiv.org/abs/2501.02788v2) arXiv ID: http://arxiv.org/abs/2501.02788v2
Abstract:
Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.
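For readers unfamiliar with the radiomic filters mentioned above, a Gabor kernel is just a Gaussian envelope multiplied by an oriented sinusoid; the standalone NumPy sketch below shows the fixed form that GLoG-CSUnet, per the abstract, makes learnable (parameter defaults here are arbitrary):

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5):
    """Real part of a 2-D Gabor filter: Gaussian envelope times a sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / lambd)
    return envelope * carrier

k = gabor_kernel(theta=np.pi / 4)
print(k.shape)  # (15, 15); convolve with an image to extract oriented texture
```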
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing Vision Transformers with adaptable radiomic features for Medical Image Segmentation, which falls under the excluded categories of 'medical subjects' and 'applications of generative AI other than text generation driven by LLMs', and does not meet the core subject requirement of prompt engineering for text-based interactions with LLMs.
RDD4D: 4D Attention-Guided Road Damage Detection And Classification
arXiv ID: http://arxiv.org/abs/2501.02822v1
Abstract:
Road damage detection and assessment are crucial components of infrastructure maintenance. However, current methods often struggle with detecting multiple types of road damage in a single image, particularly at varying scales. This is due to the lack of road datasets with various damage types having varying scales. To overcome this deficiency, first, we present a novel dataset called Diverse Road Damage Dataset (DRDD) for road damage detection that captures the diverse road damage types in individual images, addressing a crucial gap in existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D blocks, enabling better feature refinement across multiple scales. The Attention4D module processes feature maps through an attention mechanism combining positional encoding and "Talking Head" components to capture local and global contextual information. In our comprehensive experimental analysis comparing various state-of-the-art models on our proposed dataset, our enhanced model demonstrated superior performance in detecting large-sized road cracks with an Average Precision (AP) of 0.458 and maintained competitive performance with an overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet dataset; our model achieved around a 0.21 increase in performance. The code, model weights, dataset, and our results are available at https://github.com/msaqib17/Road_Damage_Detection.
Decision Explanation: Original decision: REJECT The paper focuses on road damage detection using computer vision and deep learning techniques, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thereby failing to meet all the 'MUST' criteria.
[Forward Once for All: Structural Parameterized Adaptation for Efficient
Cloud-coordinated On-device Recommendation](https://arxiv.org/abs/2501.02837v1) arXiv ID: http://arxiv.org/abs/2501.02837v1
Abstract:
In cloud-centric recommender systems, regular data exchanges between user devices and the cloud could potentially elevate bandwidth demands and privacy risks. On-device recommendation emerges as a viable solution by performing reranking locally to alleviate these concerns. Existing methods primarily focus on developing local adaptive parameters, while potentially neglecting the critical role of tailor-made model architecture. Insights from broader research domains suggest that varying data distributions might favor distinct architectures for better fitting. In addition, imposing a uniform model structure across heterogeneous devices may risk inefficacy on less capable devices or sub-optimal performance on those with sufficient capabilities. In response to these gaps, our paper introduces Forward-OFA, a novel approach for the dynamic construction of device-specific networks (both structure and parameters). Forward-OFA employs a structure controller to selectively determine whether each block needs to be assembled for a given device. However, during the training of the structure controller, these assembled heterogeneous structures are jointly optimized, where the co-adaption among blocks might encounter gradient conflicts. To mitigate this, Forward-OFA is designed to establish a structure-guided mapping of real-time behaviors to the parameters of assembled networks. Structure-related parameters and parallel components within the mapper prevent each part from receiving heterogeneous gradients from others, thus bypassing the gradient conflicts for coupled optimization. Besides, direct mapping enables Forward-OFA to achieve adaptation through only one forward pass, allowing for swift adaptation to changing interests and eliminating the requirement for on-device backpropagation. Experiments on real-world datasets demonstrate the effectiveness and efficiency of Forward-OFA.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a novel approach for dynamic construction of device-specific networks for on-device recommendation, with no clear emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Explaining Humour Style Classifications: An XAI Approach to
Understanding Computational Humour Analysis](https://arxiv.org/abs/2501.02891v1) arXiv ID: http://arxiv.org/abs/2501.02891v1
Abstract:
Humour styles can have either a negative or a positive impact on well-being. Given the importance of these styles to mental health, significant research has been conducted on their automatic identification. However, the automated machine learning models used for this purpose are black boxes, making their prediction decisions opaque. Clarity and transparency are vital in the field of mental health. This paper presents an explainable AI (XAI) framework for understanding humour style classification, building upon previous work in computational humour analysis. Using the best-performing single model (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to analyse how linguistic, emotional, and semantic features contribute to humour style classification decisions. Our analysis reveals distinct patterns in how different humour styles are characterised and misclassified, with particular emphasis on the challenges in distinguishing affiliative humour from other styles. Through detailed examination of feature importance, error patterns, and misclassification cases, we identify key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework demonstrates significant utility in understanding model behaviour, achieving interpretable insights into the complex interplay of features that define different humour styles. Our findings contribute to both the theoretical understanding of computational humour analysis and practical applications in mental health, content moderation, and digital humanities research.
Decision Explanation: Original decision: REJECT The paper focuses primarily on explainable AI (XAI) for humour style classification in computational humour analysis, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), failing to meet the first 'MUST' criterion.
[Label-free Concept Based Multiple Instance Learning for Gigapixel
Histopathology](https://arxiv.org/abs/2501.02922v1) arXiv ID: http://arxiv.org/abs/2501.02922v1
Abstract:
Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide Image (WSI) analysis with only slide-level annotations. Interpretability is crucial for safely deploying such algorithms in high-stakes medical domains. Traditional MIL methods offer explanations by highlighting salient regions. However, such spatial heatmaps provide limited insights for end users. To address this, we propose a novel inherently interpretable WSI-classification approach that uses human-understandable pathology concepts to generate explanations. Our proposed Concept MIL model leverages recent advances in vision-language models to directly predict pathology concepts based on image features. The model's predictions are obtained through a linear combination of the concepts identified on the top-K patches of a WSI, enabling inherent explanations by tracing each concept's influence on the prediction. In contrast to traditional concept-based interpretable models, our approach eliminates the need for costly human annotations by leveraging the vision-language model. We validate our method on two widely used pathology datasets: Camelyon16 and PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9, putting it on par with state-of-the-art models. We further find that 87.1% (Camelyon16) and 85.3% (PANDA) of the top 20 patches fall within the tumor region. A user study shows that the concepts identified by our model align with the concepts used by pathologists, making it a promising strategy for human-interpretable WSI classification.
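The "linear combination of concepts on the top-K patches" described above amounts to an interpretable linear head over per-patch concept scores. A shape-level NumPy sketch, where the scores, weights, and concept names are all fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_concepts, K = 500, 6, 20

# Per-patch concept scores, e.g. vision-language image-text similarities (fabricated)
scores = rng.random((n_patches, n_concepts))

# Keep the K most salient patches (here: highest max concept score)
top_k = np.argsort(scores.max(axis=1))[-K:]
bag = scores[top_k].mean(axis=0)                  # one concept vector per slide

w, b = rng.normal(size=n_concepts), 0.0           # interpretable linear head
logit = bag @ w + b
# Each concept's contribution to the prediction is directly inspectable
print(dict(zip(["concept_%d" % i for i in range(n_concepts)], np.round(bag * w, 2))))
```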
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical image analysis (histopathology) using vision-language models, not on engineering, design, or optimization of textual input prompts for Large Language Models (LLMs), and does not meet the core subject requirement of prompt engineering for text-based interactions with LLMs.
[Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the
Wild](https://arxiv.org/abs/2501.02964v2) arXiv ID: http://arxiv.org/abs/2501.02964v2
Abstract:
Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (COT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Also, issues like hallucinations and high training cost still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model's ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. Our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning and hallucination mitigation. Our model and code will be publicly available.
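Stripped to its control flow, the self-questioning loop the abstract describes could be sketched as below; `llm` is a stand-in for any multimodal chat-completion call, and none of this is the authors' code:

```python
def socratic_answer(llm, image, question, rounds=3):
    """Self-guided visual reasoning: ask sub-questions about visual clues,
    accumulate the answers, then answer the original question."""
    clues = []
    for _ in range(rounds):
        sub_q = llm(f"Given the image and the question '{question}', "
                    f"ask one sub-question about a relevant visual detail. "
                    f"Known so far: {clues}", image=image)
        clues.append((sub_q, llm(sub_q, image=image)))
    return llm(f"Using these observations {clues}, answer: {question}",
               image=image)
```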
Decision Explanation: Original decision: REJECT The paper focuses on multimodal reasoning, visual question-answering, and reducing hallucinations in Multimodal Large Language Models (MLLMs) through a self-questioning framework, rather than primarily on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs) as required.
Proof-of-Data: A Consensus Protocol for Collaborative Intelligence
arXiv ID: http://arxiv.org/abs/2501.02971v1
Abstract:
Existing research on federated learning has been focused on the setting where learning is coordinated by a centralized entity. Yet the greatest potential of future collaborative intelligence would be unleashed in a more open and democratized setting with no central entity in a dominant role, referred to as "decentralized federated learning". New challenges arise accordingly in achieving both correct model training and fair reward allocation with collective effort among all participating nodes, especially with the threat of the Byzantine node jeopardising both tasks. In this paper, we propose a blockchain-based decentralized Byzantine fault-tolerant federated learning framework based on a novel Proof-of-Data (PoD) consensus protocol to resolve both the "trust" and "incentive" components. By decoupling model training and contribution accounting, PoD is able to enjoy not only the benefit of learning efficiency and system liveliness from asynchronous societal-scale PoW-style learning but also the finality of consensus and reward allocation from epoch-based BFT-style voting. To mitigate false reward claims by data forgery from Byzantine attacks, a privacy-aware data verification and contribution-based reward allocation mechanism is designed to complete the framework. Our evaluation results show that PoD demonstrates performance in model training close to that of the centralized counterpart while achieving trust in consensus and fairness for reward allocation with a fault tolerance ratio of 1/3.
Decision Explanation: Original decision: REJECT The paper focuses on decentralized federated learning, blockchain, and a consensus protocol, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular
Balls](https://arxiv.org/abs/2501.02975v1) arXiv ID: http://arxiv.org/abs/2501.02975v1
Abstract:
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index. The source code is released at https://github.com/Xiaofeng-Tan/MGBOD.
Decision Explanation: Original decision: REJECT The paper does not meet the primary criteria as it focuses on outlier detection using fuzzy rough sets and multi-scale granular balls, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thereby failing to align with the specified requirements.
[CAMP: Collaborative Attention Model with Profiles for Vehicle Routing
Problems](https://arxiv.org/abs/2501.02977v1) arXiv ID: http://arxiv.org/abs/2501.02977v1
Abstract:
The profiled vehicle routing problem (PVRP) is a generalization of the heterogeneous capacitated vehicle routing problem (HCVRP) in which the objective is to optimize the routes of vehicles to serve client demands subject to different vehicle profiles, with each having a preference or constraint on a per-client basis. While existing learning methods have shown promise for solving the HCVRP in real-time, no learning method exists to solve the more practical and challenging PVRP. In this paper, we propose a Collaborative Attention Model with Profiles (CAMP), a novel approach that learns efficient solvers for PVRP using multi-agent reinforcement learning. CAMP employs a specialized attention-based encoder architecture to embed profiled client embeddings in parallel for each vehicle profile. We design a communication layer between agents for collaborative decision-making across profiled embeddings at each decoding step and a batched pointer mechanism to attend to the profiled embeddings to evaluate the likelihood of the next actions. We evaluate CAMP on two variants of PVRPs: PVRP with preferences, which explicitly influence the reward function, and PVRP with zone constraints with different numbers of agents and clients, demonstrating that our learned solvers achieve competitive results compared to both classical state-of-the-art neural multi-agent models in terms of solution quality and computational efficiency. We make our code openly available at https://github.com/ai4co/camp.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a multi-agent reinforcement learning model (CAMP) for solving vehicle routing problems, with no emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts for LLM performance.
[CONTINUUM: Detecting APT Attacks through Spatial-Temporal Graph Neural
Networks](https://arxiv.org/abs/2501.02981v2) arXiv ID: http://arxiv.org/abs/2501.02981v2
Abstract:
Advanced Persistent Threats (APTs) represent a significant challenge in cybersecurity due to their sophisticated and stealthy nature. Traditional Intrusion Detection Systems (IDS) often fall short in detecting these multi-stage attacks. Recently, Graph Neural Networks (GNNs) have been employed to enhance IDS capabilities by analyzing the complex relationships within networked data. However, existing GNN-based solutions are hampered by high false positive rates and substantial resource consumption. In this paper, we present a novel IDS designed to detect APTs using a Spatio-Temporal Graph Neural Network Autoencoder. Our approach leverages spatial information to understand the interactions between entities within a graph and temporal information to capture the evolution of the graph over time. This dual perspective is crucial for identifying the sequential stages of APTs. Furthermore, to address privacy and scalability concerns, we deploy our architecture in a federated learning environment. This setup ensures that local data remains on-premise while encrypted model-weights are shared and aggregated using homomorphic encryption, maintaining data privacy and security. Our evaluation shows that this system effectively detects APTs with lower false positive rates and optimized resource usage compared to existing methods, highlighting the potential of spatio-temporal analysis and federated learning in enhancing cybersecurity defenses.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not meet the 'MUST' criteria as it primarily focuses on detecting APT attacks using Spatio-Temporal Graph Neural Networks in cybersecurity, with no emphasis on engineering, design, or optimization of prompts for Large Language Models (LLMs)." }
[A Bio-Inspired Research Paradigm of Collision Perception Neurons
Enabling Neuro-Robotic Integration: The LGMD Case](https://arxiv.org/abs/2501.02982v1) arXiv ID: http://arxiv.org/abs/2501.02982v1
Abstract:
Compared to human vision, insect visual systems excel at rapid and precise collision detection, despite relying on only tens of thousands of neurons organized through a few neuropils. This efficiency makes them an attractive model system for developing artificial collision-detecting systems. Specifically, researchers have identified collision-selective neurons in the locust's optic lobe, called lobula giant movement detectors (LGMDs), which respond specifically to approaching objects. Research upon LGMD neurons began in the early 1970s. Initially, due to their large size, these neurons were identified as motion detectors, but their role as looming detectors was recognized over time. Since then, progress in neuroscience, computational modeling of LGMD's visual neural circuits, and LGMD-based robotics has advanced in tandem, each field supporting and driving the others. Today, with a deeper understanding of LGMD neurons, LGMD-based models have significantly improved collision-free navigation in mobile robots including ground and aerial robots. This review highlights recent developments in LGMD research from the perspectives of neuroscience, computational modeling, and robotics. It emphasizes a biologically plausible research paradigm, where insights from neuroscience inform real-world applications, which would in turn validate and advance neuroscience. With strong support from extensive research and growing application demand, this paradigm has reached a mature stage and demonstrates versatility across different areas of neuroscience research, thereby enhancing our understanding of the interconnections between neuroscience, computational modeling, and robotics. Furthermore, other motion-sensitive neurons have also shown promising potential for adopting this research paradigm.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, instead focusing on bio-inspired robotics, neuroscience, and collision detection using LGMD neurons.
To Analyze and Regulate Human-in-the-loop Learning for Congestion Games
arXiv ID: http://arxiv.org/abs/2501.03055v2
Abstract:
In congestion games, selfish users behave myopically to crowd to the shortest paths, and the social planner designs mechanisms to regulate such selfish routing through information or payment incentives. However, such mechanism design requires the knowledge of time-varying traffic conditions, and it is the users themselves who learn and report past road experiences to the social planner (e.g., Waze or Google Maps). When congestion games meet mobile crowdsourcing, it is critical to incentivize selfish users to explore non-shortest paths in the best exploitation-exploration trade-off. First, we consider a simple but fundamental parallel routing network with one deterministic path and multiple stochastic paths for users with an average arrival probability $\lambda$. We prove that the current myopic routing policy (widely used in Waze and Google Maps) misses both exploration (when strong hazard belief) and exploitation (when weak hazard belief) as compared to the social optimum. Due to the myopic policy's under-exploration, we prove that the caused price of anarchy (PoA) is larger than $\frac{1}{1-\rho^{1/\lambda}}$, which can be arbitrarily large as the discount factor $\rho \rightarrow 1$. To mitigate such huge efficiency loss, we propose a novel selective information disclosure (SID) mechanism: we only reveal the latest traffic information to users when they intend to over-explore stochastic paths upon arrival, while hiding such information when they want to under-explore. We prove that our mechanism successfully reduces the PoA to less than 2. Besides the parallel routing network, we further extend our mechanism and PoA results to any linear path graphs with multiple intermediate nodes.
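A quick numeric check makes the bound concrete: the myopic policy's PoA lower bound $\frac{1}{1-\rho^{1/\lambda}}$ blows up as $\rho \rightarrow 1$ (the values below are arbitrary):

```python
lam = 0.5                           # average arrival probability
for rho in (0.9, 0.99, 0.999):
    poa_lower = 1 / (1 - rho ** (1 / lam))
    print(f"rho={rho}: myopic PoA > {poa_lower:.1f}  (SID keeps PoA < 2)")
# rho=0.9:   myopic PoA > 5.3
# rho=0.99:  myopic PoA > 50.3
# rho=0.999: myopic PoA > 500.3
```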
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not meet the 'MUST' criteria as it does not focus primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it addresses mechanism design for regulating human-in-the-loop learning in congestion games, with no apparent connection to LLMs or prompt engineering for text-based interactions." }
[Survival Analysis Revisited: Understanding and Unifying Poisson,
Exponential, and Cox Models in Fall Risk Analysis](https://arxiv.org/abs/2501.03058v1) arXiv ID: http://arxiv.org/abs/2501.03058v1
Abstract:
This paper explores foundational and applied aspects of survival analysis, using fall risk assessment as a case study. It revisits key time-related probability distributions and statistical methods, including logistic regression, Poisson regression, Exponential regression, and the Cox Proportional Hazards model, offering a unified perspective on their relationships within the survival analysis framework. A contribution of this work is the step-by-step derivation and clarification of the relationships among these models, particularly demonstrating that Poisson regression in the survival context is a specific case of the Cox model. These insights address gaps in understanding and reinforce the simplicity and interpretability of survival models. The paper also emphasizes the practical utility of survival analysis by connecting theoretical insights with real-world applications. In the context of fall detection, it demonstrates how these models can simultaneously predict fall risk, analyze contributing factors, and estimate time-to-event outcomes within a single streamlined framework. In contrast, advanced deep learning methods often require complex post-hoc interpretation and separate training for different tasks particularly when working with structured numerical data. This highlights the enduring relevance of classical statistical frameworks and makes survival models especially valuable in healthcare settings, where explainability and robustness are critical. By unifying foundational concepts and offering a cohesive perspective on time-to-event analysis, this work serves as an accessible resource for understanding survival models and applying them effectively to diverse analytical challenges.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it primarily focuses on survival analysis, statistical methods, and their application in healthcare, with no apparent emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts to improve LLM performance.
[Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video
Generation](https://arxiv.org/abs/2501.03059v1) arXiv ID: http://arxiv.org/abs/2501.03059v1
Abstract:
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation, that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new, challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.
Decision Explanation: Original decision: REJECT The paper primarily focuses on Image-to-Video (I2V) generation, a generative AI application other than text generation driven by LLMs, and does not centrally concern prompt engineering for text-based interactions with LLMs, despite mentioning object-specific prompts as a component of the methodology.
The Scaling Law for LoRA Base on Mutual Information Upper Bound
arXiv ID: http://arxiv.org/abs/2501.03152v1
Abstract:
LoRA (Low-Rank Adaptation) is a widely used model fine-tuning method. In fine-tuning, the scaling law relating model performance, model parameters, and data complexity has been a focal issue in the field. Existing methods often leverage external metrics (such as cross-entropy or perplexity) to evaluate model performance. In the fine-tuning process for large models, two types of knowledge are typically involved: the frozen, general knowledge acquired by the model during pre-training and the new knowledge learned through the LoRA module from the current data. Generally, the less LoRA's learned knowledge relies on the large model, the more it captures the specific knowledge of the new data, thereby enhancing its adaptability to new tasks. However, external metrics do not readily capture the dependency relationship between these two types of knowledge. Therefore, we designed an internal metric based on the Mutual Information Upper Bound (MIUB) theory to investigate the scaling law of large-model LoRA fine-tuning. In our experiments, we validated this approach on benchmark datasets, using the Llama3-8B and Phi3-3B models. The results show that the proposed MIUB metric aligns more accurately and stably with the scaling law of LoRA fine-tuning than cross-entropy and perplexity.
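For context on the two knowledge sources whose dependency MIUB measures, a LoRA layer is simply a frozen pretrained weight plus a trainable low-rank update; a minimal PyTorch sketch (not the paper's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus trainable low-rank update B @ A."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)           # frozen general knowledge
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init: starts equal to W
        self.scale = alpha / rank

    def forward(self, x):
        # New-task knowledge lives entirely in the low-rank B @ A path
        return self.W(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(512, 512)
y = layer(torch.randn(4, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8192 = 2*512*8
```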
Decision Explanation: Original decision: REJECT The paper focuses primarily on model fine-tuning (LoRA) and evaluating its performance using an internal metric (Mutual Information Upper Bound), rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[Detecting AI-Generated Text in Educational Content: Leveraging Machine
Learning and Explainable AI for Academic Integrity](https://arxiv.org/abs/2501.03203v1) arXiv ID: http://arxiv.org/abs/2501.03203v1
Abstract:
This study seeks to enhance academic integrity by providing tools to detect AI-generated content in student work using advanced technologies. The findings promote transparency and accountability, helping educators maintain ethical standards and supporting the responsible integration of AI in education. A key contribution of this work is the generation of the CyberHumanAI dataset, which has 1000 observations, 500 of which are written by humans and the other 500 produced by ChatGPT. We evaluate various machine learning (ML) and deep learning (DL) algorithms on the CyberHumanAI dataset, comparing human-written and AI-generated content from Large Language Models (LLMs) (i.e., ChatGPT). Results demonstrate that traditional ML algorithms, specifically XGBoost and Random Forest, achieve high performance (83% and 81% accuracies, respectively). Results also show that classifying shorter content seems to be more challenging than classifying longer content. Further, using Explainable Artificial Intelligence (XAI) we identify discriminative features influencing the ML model's predictions, where human-written content tends to use practical language (e.g., use and allow). Meanwhile, AI-generated text is characterized by more abstract and formal terms (e.g., realm and employ). Finally, a comparative analysis with GPTZero shows that our narrowly focused, simple, and fine-tuned model can outperform generalized systems like GPTZero. The proposed model achieved approximately 77.5% accuracy compared to GPTZero's 48.5% accuracy when tasked with classifying Pure AI, Pure Human, and mixed classes. GPTZero showed a tendency to classify challenging and small-content cases as either mixed or unrecognized, while our proposed model showed a more balanced performance across the three classes.
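At its core, the detection pipeline the study describes is lexical features fed to a tree ensemble. A runnable miniature with toy sentences echoing the reported practical-vs-formal feature split (the real work trains on the 1000-sample CyberHumanAI dataset; Random Forest here stands in for either reported classifier):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "We use this tool to allow quick edits in class.",          # human-ish: practical verbs
    "Students employ the platform within the academic realm.",  # AI-ish: abstract, formal
    "I let the kids use tablets so they can try it out.",
    "Learners leverage the system to facilitate engagement.",
]
labels = [0, 1, 0, 1]  # 0 = human, 1 = AI-generated (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(texts, labels)
print(clf.predict(["Utilize the realm of digital tools to employ knowledge."]))
```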
Decision Explanation: Original decision: REJECT The paper primarily focuses on detecting AI-generated text in educational content using machine learning and Explainable AI, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), as required by the criteria.
[BoostStep: Boosting mathematical capability of Large Language Models via
improved single-step reasoning](https://arxiv.org/abs/2501.03226v2) arXiv ID: http://arxiv.org/abs/2501.03226v2
Abstract:
Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet often fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at question granularity sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns retrieval and reasoning at step granularity, and provides highly related ICL examples for each reasoning step with a novel `first-try' strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search methods (MCTS) to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain combined with MCTS.
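Read operationally, step-grained ICL means: after drafting each step, retrieve an example matched to that step rather than to the whole question, then redo the step with it. A schematic sketch with stand-in `llm` and `embed` callables; the paper's actual `first-try' scoring and retrieval details are simplified away:

```python
import numpy as np

def solve_with_booststep(llm, embed, question, step_bank, max_steps=8):
    """step_bank: list of (step_text, worked_example) pairs at step granularity."""
    bank_vecs = np.stack([embed(s) for s, _ in step_bank])
    steps = []
    for _ in range(max_steps):
        draft = llm(f"Problem: {question}\nSteps so far: {steps}\n"
                    f"Propose the next single step.")        # the 'first try'
        sims = bank_vecs @ embed(draft)                      # match at step grain
        example = step_bank[int(sims.argmax())][1]
        step = llm(f"Problem: {question}\nSteps so far: {steps}\n"
                   f"Similar worked step: {example}\n"
                   f"Write the corrected next step, or FINAL: <answer>.")
        steps.append(step)
        if step.startswith("FINAL:"):
            return step
    return steps[-1]
```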
Decision Explanation: Original decision: REJECT Although the paper improves LLM performance through refined in-context learning (ICL) examples, its primary focus is on enhancing mathematical reasoning capabilities within LLMs via a novel strategy, rather than prompt engineering techniques specifically for text-based interactions with LLMs.
[DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization](https://arxiv.org/abs/2501.03271v3) arXiv ID: http://arxiv.org/abs/2501.03271v3
Abstract:
The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
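For orientation, the baseline objective that DPO-Kernels generalizes is the standard DPO loss below; the paper's kernelized representations, hybrid loss, and alternative divergences would replace or augment the plain log-ratio margin (this sketch is ours, not theirs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: push the policy's preference margin above the reference's.
    DPO-Kernels would swap these log-ratio features for kernelized
    representations and the implicit KL for other divergences."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Per-sequence log-probs under the policy and a frozen reference (toy values)
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss)
```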
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on Direct Preference Optimization (DPO) for aligning Large Language Models (LLMs) with diverse values and preferences, rather than prompt engineering specifically for LLMs. While LLMs are mentioned, the core subject is DPO methodology, not the design, optimization, or manipulation of textual input prompts for improving LLM performance." }
[Revolutionizing Encrypted Traffic Classification with MH-Net: A
Multi-View Heterogeneous Graph Model](https://arxiv.org/abs/2501.03279v1) arXiv ID: http://arxiv.org/abs/2501.03279v1
Abstract:
With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional byte-based traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it revolves around network traffic classification using a multi-view heterogeneous graph model (MH-Net), with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to satisfy the mandatory 'MUST' criteria.
[A Decision-Based Heterogenous Graph Attention Network for Multi-Class
Fake News Detection](https://arxiv.org/abs/2501.03290v1) arXiv ID: http://arxiv.org/abs/2501.03290v1
Abstract:
A promising tool for addressing fake news detection is Graph Neural Networks (GNNs). However, most existing GNN-based methods rely on binary classification, categorizing news as either real or fake. Additionally, traditional GNN models use a static neighborhood for each node, making them susceptible to issues like over-squashing. In this paper, we introduce a novel model named Decision-based Heterogeneous Graph Attention Network (DHGAT) for fake news detection in a semi-supervised setting. DHGAT effectively addresses the limitations of traditional GNNs by dynamically optimizing and selecting the neighborhood type for each node in every layer. It represents news data as a heterogeneous graph where nodes (news items) are connected by various types of edges. The architecture of DHGAT consists of a decision network that determines the optimal neighborhood type and a representation network that updates node embeddings based on this selection. As a result, each node learns an optimal and task-specific computational graph, enhancing both the accuracy and efficiency of the fake news detection process. We evaluate DHGAT on the LIAR dataset, a large and challenging dataset for multi-class fake news detection, which includes news items categorized into six classes. Our results demonstrate that DHGAT outperforms existing methods, improving accuracy by approximately 4% and showing robustness with limited labeled data.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new Graph Neural Network (GNN) architecture for multi-class fake news detection, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods for improving LLM performance through textual input prompt manipulation.
[Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with
Vision Large Language Model](https://arxiv.org/abs/2501.03292v1) arXiv ID: http://arxiv.org/abs/2501.03292v1
Abstract:
Federated learning (FL) has attracted considerable interest in the medical domain due to its capacity to facilitate collaborative model training while maintaining data privacy. However, conventional FL methods typically necessitate multiple communication rounds, leading to significant communication overhead and delays, especially in environments with limited bandwidth. One-shot federated learning addresses these issues by conducting model training and aggregation in a single communication round, thereby reducing communication costs while preserving privacy. Among these, one-shot federated ensemble learning combines independently trained client models using ensemble techniques such as voting, further boosting performance in non-IID data scenarios. On the other hand, existing machine learning methods in healthcare predominantly use unimodal data (e.g., medical images or textual reports), which restricts their diagnostic accuracy and comprehensiveness. Therefore, the integration of multi-modal data is proposed to address these shortcomings. In this paper, we introduce FedMME, an innovative one-shot multi-modal federated ensemble learning framework that utilizes multi-modal data for medical image analysis. Specifically, FedMME capitalizes on vision large language models to produce textual reports from medical images, employs a BERT model to extract textual features from these reports, and amalgamates these features with visual features to improve diagnostic accuracy. Experimental results show that our method demonstrated superior performance compared to existing one-shot federated learning methods in healthcare scenarios across four datasets with various data distributions. For instance, it surpasses existing one-shot federated learning approaches by more than 17.5% in accuracy on the RSNA dataset when applying a Dirichlet distribution with $\alpha = 0.3$.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a one-shot federated ensemble learning framework for medical data, using a Vision Large Language Model as a component, rather than focusing specifically on the engineering, design, or optimization of prompts for Large Language Models.
[Rethinking Byzantine Robustness in Federated Recommendation from Sparse Aggregation Perspective](https://arxiv.org/abs/2501.03301v2) arXiv ID: http://arxiv.org/abs/2501.03301v2
Abstract:
To preserve user privacy in recommender systems, federated recommendation (FR) based on federated learning (FL) emerges, keeping the personal data on the local client and updating a model collaboratively. Unlike FL, FR has a unique sparse aggregation mechanism, where the embedding of each item is updated by only partial clients, instead of full clients in a dense aggregation of general FL. Recently, as an essential principle of FL, model security has received increasing attention, especially for Byzantine attacks, where malicious clients can send arbitrary updates. The problem of exploring the Byzantine robustness of FR is particularly critical since in the domains applying FR, e.g., e-commerce, malicious clients can be injected easily by registering new accounts. However, existing Byzantine works neglect the unique sparse aggregation of FR, making them unsuitable for our problem. Thus, we make the first effort to investigate Byzantine attacks on FR from the perspective of sparse aggregation, which is non-trivial: it is not clear how to define Byzantine robustness under sparse aggregations and design Byzantine attacks under limited knowledge/capability. In this paper, we reformulate the Byzantine robustness under sparse aggregation by defining the aggregation for a single item as the smallest execution unit. Then we propose a family of effective attack strategies, named Spattack, which exploit the vulnerability in sparse aggregation and are categorized along the adversary's knowledge and capability. Extensive experimental results demonstrate that Spattack can effectively prevent convergence and even break down defenses under a few malicious clients, raising alarms for securing FR systems.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on Byzantine robustness in Federated Recommendation systems, specifically addressing sparse aggregation and security against malicious clients, with no primary concern or investigation into prompt engineering for Large Language Models (LLMs) or their textual input prompts." }
[Advanced Machine Learning Techniques for Social Support Detection on Social Media](https://arxiv.org/abs/2501.03370v1) arXiv ID: http://arxiv.org/abs/2501.03370v1
Abstract:
The widespread use of social media highlights the need to understand its impact, particularly the role of online social support. This study uses a dataset focused on online social support, which includes binary and multiclass classifications of social support content on social media. The classification of social support is divided into three tasks. The first task focuses on distinguishing between supportive and non-supportive. The second task aims to identify whether the support is directed toward an individual or a group. The third task categorizes the specific type of social support, grouping it into categories such as Nation, LGBTQ, Black people, Women, Religion, and Other (if it does not fit into the previously mentioned categories). To address data imbalances in these tasks, we employed K-means clustering for balancing the dataset and compared the results with the original unbalanced data. Using advanced machine learning techniques, including transformers and zero-shot learning approaches with GPT3, GPT4, and GPT4-o, we predict social support levels in various contexts. The effectiveness of the dataset is evaluated using baseline models across different learning approaches, with transformer-based methods demonstrating superior performance. Additionally, we achieved a 0.4% increase in the macro F1 score for the second task and a 0.7% increase for the third task, compared to previous work utilizing traditional machine learning with psycholinguistic and unigram-based TF-IDF values.
Decision Explanation: Original decision: REJECT The paper does not focus primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it utilizes LLMs (GPT3, GPT4) as components for social support detection on social media, with the primary focus being on advanced machine learning techniques for classification tasks, not prompt engineering.
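The zero-shot learning approach mentioned above amounts to classifying posts with a single instruction prompt. A hedged sketch of such a setup follows; the model name, label set, and prompt wording are illustrative assumptions, not the paper's exact protocol:

```python
# Hypothetical zero-shot classification of social support with a GPT model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_support(post: str) -> str:
    prompt = (
        "Classify the following social media post as 'supportive' or "
        "'non-supportive'. Reply with one word only.\n\n"
        f"Post: {post}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper uses GPT-3, GPT-4, and GPT-4o
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_support("Stay strong, we are all behind you!"))
```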
License Plate Images Generation with Diffusion Models
arXiv ID: http://arxiv.org/abs/2501.03374v1
Abstract:
Despite the evident practical importance of license plate recognition (LPR), corresponding research is limited by the volume of publicly available datasets due to privacy regulations such as the General Data Protection Regulation (GDPR). To address this challenge, synthetic data generation has emerged as a promising approach. In this paper, we propose to synthesize realistic license plates (LPs) using diffusion models, inspired by recent advances in image and video generation. In our experiments a diffusion model was successfully trained on a Ukrainian LP dataset, and 1000 synthetic images were generated for detailed analysis. Through manual classification and annotation of the generated images, we performed a thorough study of the model output, such as success rate, character distributions, and type of failures. Our contributions include experimental validation of the efficacy of diffusion models for LP synthesis, along with insights into the characteristics of the generated data. Furthermore, we have prepared a synthetic dataset consisting of 10,000 LP images, publicly available at https://zenodo.org/doi/10.5281/zenodo.13342102. Conducted experiments empirically confirm the usefulness of synthetic data for the LPR task. Despite the initial performance gap between the model trained with real and synthetic data, the expansion of the training data set with pseudolabeled synthetic data leads to an improvement in LPR accuracy by 3% compared to baseline.
Decision Explanation: Original decision: REJECT The paper focuses primarily on generating license plate images using diffusion models, which is an application of generative AI in image generation, not text generation driven by Large Language Models (LLMs), and does not investigate prompt engineering for LLMs.
[Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation](https://arxiv.org/abs/2501.03458v1) arXiv ID: http://arxiv.org/abs/2501.03458v1
Abstract:
X-ray image based medical report generation achieves significant progress in recent years with the help of the large language model, however, these models have not fully exploited the effective information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. It considers both the mining of global and local visual information and associates historical report information to better complete the writing of the current report. Specifically, given an X-ray image, we first utilize a classification model along with its activation maps to accomplish the mining of visual regions highly associated with diseases and the learning of disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The source code of this work is released on \url{https://github.com/Event-AHU/Medical_Image_Analysis}.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a novel model for X-ray report generation in a medical context, which violates the 'MUST NOT' criteria of being primarily concerned with medical subjects and not focusing on prompt engineering for text-based interactions with LLMs as the core subject.
[Radar Signal Recognition through Self-Supervised Learning and Domain Adaptation](https://arxiv.org/abs/2501.03461v2) arXiv ID: http://arxiv.org/abs/2501.03461v2
Abstract:
Automatic radar signal recognition (RSR) plays a pivotal role in electronic warfare (EW), as accurately classifying radar signals is critical for informing decision-making processes. Recent advances in deep learning have shown significant potential in improving RSR performance in domains with ample annotated data. However, these methods fall short in EW scenarios where annotated RF data are scarce or impractical to obtain. To address these challenges, we introduce a self-supervised learning (SSL) method which utilises masked signal modelling and RF domain adaption to enhance RSR performance in environments with limited RF samples and labels. Specifically, we investigate pre-training masked autoencoders (MAE) on baseband in-phase and quadrature (I/Q) signals from various RF domains and subsequently transfer the learned representation to the radar domain, where annotated data are limited. Empirical results show that our lightweight self-supervised ResNet model with domain adaptation achieves up to a 17.5% improvement in 1-shot classification accuracy when pre-trained on in-domain signals (i.e., radar signals) and up to a 16.31% improvement when pre-trained on out-of-domain signals (i.e., comm signals), compared to its baseline without SSL. We also provide reference results for several MAE designs and pre-training strategies, establishing a new benchmark for few-shot radar signal classification.
Decision Explanation: Original decision: REJECT The paper primarily focuses on self-supervised learning and domain adaptation for radar signal recognition, with no discussion on prompt engineering, Large Language Models (LLMs), or textual input prompts, thus failing to meet all 'MUST' criteria.
[KG-TRICK: Unifying Textual and Relational Information Completion of Knowledge for Multilingual Knowledge Graphs](https://arxiv.org/abs/2501.03560v1) arXiv ID: http://arxiv.org/abs/2501.03560v1
Abstract:
Multilingual knowledge graphs (KGs) provide high-quality relational and textual information for various NLP applications, but they are often incomplete, especially in non-English languages. Previous research has shown that combining information from KGs in different languages aids either Knowledge Graph Completion (KGC), the task of predicting missing relations between entities, or Knowledge Graph Enhancement (KGE), the task of predicting missing textual information for entities. Although previous efforts have considered KGC and KGE as independent tasks, we hypothesize that they are interdependent and mutually beneficial. To this end, we introduce KG-TRICK, a novel sequence-to-sequence framework that unifies the tasks of textual and relational information completion for multilingual KGs. KG-TRICK demonstrates that: i) it is possible to unify the tasks of KGC and KGE into a single framework, and ii) combining textual information from multiple languages is beneficial to improve the completeness of a KG. As part of our contributions, we also introduce WikiKGE10++, the largest manually-curated benchmark for textual information completion of KGs, which features over 25,000 entities across 10 diverse languages.
Decision Explanation: Original decision: REJECT The paper primarily focuses on unifying Knowledge Graph Completion and Enhancement tasks for multilingual Knowledge Graphs, utilizing a sequence-to-sequence framework, without explicit concentration on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study](https://arxiv.org/abs/2501.03572v1) arXiv ID: http://arxiv.org/abs/2501.03572v1
Abstract:
Web accessibility ensures that individuals with disabilities can access and interact with digital content without barriers, yet a significant majority of most used websites fail to meet accessibility standards. This study evaluates ChatGPT's (GPT-4o) ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance, reflecting limitations in its training data and prevailing inaccessible web practices. Automated and manual testing revealed strengths in resolving simple issues but challenges with complex tasks, requiring human oversight and additional iterations. Unlike prior studies, we incorporate manual evaluation, dynamic elements, and use the visual reasoning capability of ChatGPT along with the prompts to fix accessibility issues. Providing screenshots alongside prompts enhances the LLM's ability to address accessibility issues by allowing it to analyze surrounding components, such as determining appropriate contrast colors. We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT's performance. These findings highlight the potential and limitations of large language models for accessible web development, offering practical guidance for developers to create more inclusive websites.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is leveraging ChatGPT for accessible web development, not prompt engineering for Large Language Models. While it mentions effective prompt engineering techniques, this is secondary to its main objective of improving web accessibility." }
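The "concise, structured feedback" prompting style the study found effective can be illustrated with a small prompt builder. The issue list and wording below are hypothetical, intended only to show the shape of such a prompt:

```python
# Minimal illustration of structured-feedback prompting for WCAG fixes.
def build_accessibility_prompt(html: str, issues: list[str]) -> str:
    feedback = "\n".join(f"- {issue}" for issue in issues)
    return (
        "You are fixing WCAG compliance issues in the HTML below.\n"
        "Automated testing reported these problems:\n"
        f"{feedback}\n"
        "Return the corrected HTML only, preserving existing functionality.\n\n"
        f"```html\n{html}\n```"
    )

prompt = build_accessibility_prompt(
    "<img src='logo.png'><button></button>",
    ["Image is missing an alt attribute (WCAG 1.1.1)",
     "Button has no accessible name (WCAG 4.1.2)"],
)
print(prompt)
```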
[Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression](https://arxiv.org/abs/2501.03674v1) arXiv ID: http://arxiv.org/abs/2501.03674v1
Abstract:
Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each of which contains different durations. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
Decision Explanation: Original decision: REJECT The paper focuses primarily on Action Quality Assessment using computer vision and pose-guided contrastive regression, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet all 'MUST' criteria.
[MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation](https://arxiv.org/abs/2501.03689v1) arXiv ID: http://arxiv.org/abs/2501.03689v1
Abstract:
Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Therefore, existing methods have tried to perform these two tasks simultaneously, so as to leverage the mutually beneficial relationship between both tasks. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and joint learning optimization. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework and can use variant models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which addresses the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate the great generality of MAJL in adapting to different model architectures.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses on music source separation and pitch estimation, leveraging joint learning frameworks, without any mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus falling outside the specified scope.
[AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features](https://arxiv.org/abs/2501.03700v1) arXiv ID: http://arxiv.org/abs/2501.03700v1
Abstract:
Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and hinder real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores of 24.72% (Easy), 18.63% (Moderate), and 15.31% (Hard), and $\text{AP}_{\text{BEV}}$ scores of 34.11% (Easy), 25.18% (Moderate), and 21.90% (Hard) at an IoU threshold of 0.7.
Decision Explanation: Original decision: REJECT The paper focuses on monocular 3D object detection for autonomous systems, involving depth estimation and sensor technologies, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing all 'MUST' criteria.
[Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein](https://arxiv.org/abs/2501.03722v1) arXiv ID: http://arxiv.org/abs/2501.03722v1
Abstract:
Accurate segmentation of pulmonary structures is crucial in clinical diagnosis, disease study, and treatment planning. Significant progress has been made in deep learning-based segmentation techniques, but most require much labeled data for training. Consequently, developing precise segmentation methods that demand fewer labeled datasets is paramount in medical image analysis. The emergence of pre-trained vision-language foundation models, such as CLIP, recently opened the door for universal computer vision tasks. Exploiting the generalization ability of these pre-trained foundation models on downstream tasks, such as segmentation, leads to unexpected performance with a relatively small amount of labeled data. However, exploring these models for pulmonary artery-vein segmentation is still limited. This paper proposes a novel framework called Language-guided self-adaptive Cross-Attention Fusion Framework. Our method adopts pre-trained CLIP as a strong feature extractor for generating the segmentation of 3D CT scans, while adaptively aggregating the cross-modality of text and image representations. We propose a specially designed adapter module to fine-tune pre-trained CLIP with a self-adaptive learning strategy to effectively fuse the two modalities of embeddings. We extensively validate our method on a local dataset, which is the largest pulmonary artery-vein CT dataset to date and consists of 718 labeled data in total. The experiments show that our method outperformed other state-of-the-art methods by a large margin. Our data and code will be made publicly available upon acceptance.
Decision Explanation: Original decision: REJECT The paper focuses primarily on medical image analysis (3D segmentation of pulmonary artery and vein) and the development of a vision-language model, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output.
[Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in Ultrasound Imaging](https://arxiv.org/abs/2501.03825v1) arXiv ID: http://arxiv.org/abs/2501.03825v1
Abstract:
Ultrasound images are commonly formed by sequential acquisition of beam-steered scan-lines. Minimizing the number of required scan-lines can significantly enhance frame rate, field of view, energy efficiency, and data transfer speeds. Existing approaches typically use static subsampling schemes in combination with sparsity-based or, more recently, deep-learning-based recovery. In this work, we introduce an adaptive subsampling method that maximizes intrinsic information gain in-situ, employing a Sylvester Normalizing Flow encoder to infer an approximate Bayesian posterior under partial observation in real-time. Using the Bayesian posterior and a deep generative model for future observations, we determine the subsampling scheme that maximizes the mutual information between the subsampled observations, and the next frame of the video. We evaluate our approach using the EchoNet cardiac ultrasound video dataset and demonstrate that our active sampling method outperforms competitive baselines, including uniform and variable-density random sampling, as well as equidistantly spaced scan-lines, improving mean absolute reconstruction error by 15%. Moreover, posterior inference and the sampling scheme generation are performed in just 0.015 seconds (66Hz), making it fast enough for real-time 2D ultrasound imaging applications.
Decision Explanation: Original decision: REJECT The paper primarily focuses on adaptive compressed sensing in ultrasound imaging, employing deep learning for image reconstruction, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through prompt engineering, thus failing to meet all 'MUST' criteria.
[TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification](https://arxiv.org/abs/2501.03835v1) arXiv ID: http://arxiv.org/abs/2501.03835v1
Abstract:
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendations, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity to the item embedding. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial scenarios. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Moreover, it has been successfully deployed in a real-world e-commerce platform, processing millions of product listings daily while supporting dynamic, large-scale attribute taxonomies.
Decision Explanation: Original decision: REJECT The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it introduces a retrieval-based method (TACLR) for Product Attribute Value Identification, primarily concerned with information retrieval and taxonomy-aware contrastive learning, without any central focus on prompt engineering for LLMs.
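The retrieval formulation above (embed the product profile and candidate values, rank by similarity, abstain below a threshold) can be sketched in a few lines. This is an illustrative sketch with random embeddings standing in for a trained encoder; the threshold handling mirrors, but does not reproduce, the paper's dynamic-threshold inference:

```python
# Hedged sketch of retrieval-based attribute value identification.
import numpy as np

def retrieve_value(item_emb, value_embs, values, threshold=0.5):
    item = item_emb / np.linalg.norm(item_emb)
    vals = value_embs / np.linalg.norm(value_embs, axis=1, keepdims=True)
    scores = vals @ item                  # cosine similarity per candidate
    best = int(scores.argmax())
    # Below threshold: treat the attribute as absent / out of distribution.
    return values[best] if scores[best] >= threshold else None

rng = np.random.default_rng(0)
values = ["red", "blue", "green"]
value_embs = rng.normal(size=(3, 16))
item_emb = value_embs[1] + 0.1 * rng.normal(size=16)  # close to "blue"
print(retrieve_value(item_emb, value_embs, values))    # likely "blue"
```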
[Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control](https://arxiv.org/abs/2501.03847v2) arXiv ID: http://arxiv.org/abs/2501.03847v2
Abstract:
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
Decision Explanation: Original decision: REJECT The paper focuses on video generation control using diffusion models, which falls under image/video generation rather than text generation driven by Large Language Models (LLMs), violating the 'MUST NOT' criteria related to applications of generative AI.
[CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds](https://arxiv.org/abs/2501.03879v1) arXiv ID: http://arxiv.org/abs/2501.03879v1
Abstract:
Recent research has demonstrated that Large Language Models (LLMs) are not limited to text-only tasks but can also function as multimodal models across various modalities, including audio, images, and videos. In particular, research on 3D Large Multimodal Models (3D LMMs) is making notable strides, driven by the potential of processing higher-dimensional data like point clouds. However, upon closer examination, we find that the visual and textual content within each sample of existing training datasets lacks both high informational granularity and clarity, which serve as a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds, designed to ensure greater specificity and clarity in both visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To leverage hard negative responses, we incorporate the odds ratio as an auxiliary term for contrastive learning into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key components through extensive experiments.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a new multimodal learning method (CL3DOR) for 3D Large Multimodal Models, which includes but is not centered on prompt engineering for Large Language Models (LLMs). The core contribution is the contrastive learning approach for multimodal understanding, not the manipulation of textual input prompts to improve LLM performance.
[LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token](https://arxiv.org/abs/2501.03895v1) arXiv ID: http://arxiv.org/abs/2501.03895v1
Abstract:
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on developing an efficient Large Multimodal Model (LMM) for image and video processing, optimizing vision tokens, and reducing computational overhead, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs) and their text-based interactions." }
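The modality pre-fusion idea above (fuse visual information into text tokens early, then feed the LLM a single compressed vision token) can be sketched as follows. Dimensions, pooling choice, and module names are toy assumptions, not LLaVA-Mini's actual design:

```python
# Hypothetical sketch of modality pre-fusion plus vision-token compression.
import torch
import torch.nn as nn

d = 64
prefusion = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
compress = nn.Linear(d, d)

text_tokens = torch.randn(1, 32, d)     # instruction tokens
vision_tokens = torch.randn(1, 576, d)  # patch embeddings from the vision encoder

# Pre-fusion: text queries attend over vision tokens, absorbing visual context.
fused_text, _ = prefusion(text_tokens, vision_tokens, vision_tokens)
# Extreme compression: pool 576 vision tokens down to a single token.
one_vision_token = compress(vision_tokens.mean(dim=1, keepdim=True))  # [1, 1, d]
llm_input = torch.cat([one_vision_token, fused_text], dim=1)          # [1, 33, d]
print(llm_input.shape)
```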
[Explainable Time Series Prediction of Tyre Energy in Formula One Race Strategy](https://arxiv.org/abs/2501.04067v1) arXiv ID: http://arxiv.org/abs/2501.04067v1
Abstract:
Formula One (F1) race strategy takes place in a high-pressure and fast-paced environment where split-second decisions can drastically affect race results. Two of the core decisions of race strategy are when to make pit stops (i.e. replace the cars' tyres) and which tyre compounds (hard, medium or soft, in normal conditions) to select. The optimal pit stop decisions can be determined by estimating the tyre degradation of these compounds, which in turn can be computed from the energy applied to each tyre, i.e. the tyre energy. In this work, we trained deep learning models, using the Mercedes-AMG PETRONAS F1 team's historic race data consisting of telemetry, to forecast tyre energies during races. Additionally, we fitted XGBoost, a decision tree-based machine learning algorithm, to the same dataset and compared the results, with both giving impressive performance. Furthermore, we incorporated two different explainable AI methods, namely feature importance and counterfactual explanations, to gain insights into the reasoning behind the forecasts. Our contributions thus result in an explainable, automated method which could assist F1 teams in optimising their race strategy.
Decision Explanation: Original decision: REJECT The paper focuses on developing an explainable AI method for time series prediction in Formula One racing, utilizing deep learning models and XGBoost, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
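The XGBoost baseline mentioned above is straightforward to reproduce in spirit. A minimal sketch on synthetic stand-in telemetry (feature names and data are invented for illustration):

```python
# Hedged sketch of an XGBoost regressor for tyre-energy forecasting.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
# Stand-in telemetry features: speed, lateral g, track temperature.
X = rng.normal(size=(500, 3))
y = (2.0 * X[:, 0] + 1.5 * np.abs(X[:, 1]) + 0.3 * X[:, 2]
     + rng.normal(scale=0.1, size=500))

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
print(float(np.mean(np.abs(pred - y[400:]))))  # mean absolute error, held-out laps
```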
Explainable Reinforcement Learning for Formula One Race Strategy
arXiv ID: http://arxiv.org/abs/2501.04068v1
Abstract:
In Formula One, teams compete to develop their cars and achieve the highest possible finishing position in each race. During a race, however, teams are unable to alter the car, so they must improve their cars' finishing positions via race strategy, i.e. optimising their selection of which tyre compounds to put on the car and when to do so. In this work, we introduce a reinforcement learning model, RSRL (Race Strategy Reinforcement Learning), to control race strategies in simulations, offering a faster alternative to the industry standard of hard-coded and Monte Carlo-based race strategies. Controlling cars with a pace equating to an expected finishing position of P5.5 (where P1 represents first place and P20 is last place), RSRL achieves an average finishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix, outperforming the best baseline of P5.63. We then demonstrate, in a generalisability study, how performance for one track or multiple tracks can be prioritised via training. Further, we supplement model predictions with feature importance, decision tree-based surrogate models, and decision tree counterfactuals towards improving user trust in the model. Finally, we provide illustrations which exemplify our approach in real-world situations, drawing parallels between simulations and reality.
Decision Explanation: Original decision: REJECT The paper focuses on developing a reinforcement learning model for optimizing Formula One race strategies, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet the primary 'MUST' criteria.
[Multi-armed Bandit and Backbone boost Lin-Kernighan-Helsgaun Algorithm for the Traveling Salesman Problems](https://arxiv.org/abs/2501.04072v1) arXiv ID: http://arxiv.org/abs/2501.04072v1
Abstract:
The Lin-Kernighan-Helsgaun (LKH) heuristic is a classic local search algorithm for the Traveling Salesman Problem (TSP). LKH introduces an $\alpha$-value to replace the traditional distance metric for evaluating the edge quality, which leads to a significant improvement. However, we observe that the $\alpha$-value does not make full use of the historical information during the search, and single guiding information often makes LKH hard to escape from some local optima. To address the above issues, we propose a novel way to extract backbone information during the TSP local search process, which is dynamic and can be updated once a local optimal solution is found. We further propose to combine backbone information, $\alpha$-value, and distance to evaluate the edge quality so as to guide the search. Moreover, we abstract their different combinations to arms in a multi-armed bandit (MAB) and use an MAB model to help the algorithm select an appropriate evaluation metric dynamically. Both the backbone information and MAB can provide diverse guiding information and learn from the search history to suggest the best metric. We apply our methods to LKH and LKH-3, an extended version of LKH that can be used to solve about 40 variant problems of TSP and the Vehicle Routing Problem (VRP). Extensive experiments show the excellent performance and generalization capability of our proposed method, significantly improving LKH for TSP and LKH-3 for two representative TSP and VRP variants, the Colored TSP (CTSP) and Capacitated VRP with Time Windows (CVRPTW).
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses on optimizing the Traveling Salesman Problem algorithm using multi-armed bandit and backbone boost, with no primary focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
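The abstract's multi-armed bandit component, which selects among combinations of edge-quality metrics, can be illustrated with a standard UCB1 policy. The paper does not specify UCB1; it is shown here as one common choice, purely as an assumption:

```python
# Hedged sketch: UCB1 bandit choosing which edge-quality metric combination
# to use for the next local-search round, rewarded by tour improvement.
import math, random

class UCB1:
    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.values = [0.0] * num_arms  # running mean reward per arm

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm  # play every arm once first
        total = sum(self.counts)
        ucb = [v + math.sqrt(2 * math.log(total) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Arms: distance only, alpha-value only, alpha + backbone information.
bandit = UCB1(num_arms=3)
for _ in range(100):
    arm = bandit.select()
    reward = random.gauss([0.1, 0.3, 0.5][arm], 0.1)  # simulated improvement
    bandit.update(arm, reward)
print(bandit.counts)  # the best-rewarded arm should dominate
```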
[Enhancing Distribution and Label Consistency for Graph Out-of-Distribution Generalization](https://arxiv.org/abs/2501.04102v1) arXiv ID: http://arxiv.org/abs/2501.04102v1
Abstract:
To deal with distribution shifts in graph data, various graph out-of-distribution (OOD) generalization techniques have been recently proposed. These methods often employ a two-step strategy that first creates augmented environments and subsequently identifies invariant subgraphs to improve generalizability. Nevertheless, this approach could be suboptimal from the perspective of consistency. First, the process of augmenting environments by altering the graphs while preserving labels may lead to graphs that are not realistic or meaningfully related to the origin distribution, thus lacking distribution consistency. Second, the extracted subgraphs are obtained from directly modifying graphs, and may not necessarily maintain a consistent predictive relationship with their labels, thereby impacting label consistency. In response to these challenges, we introduce an innovative approach that aims to enhance these two types of consistency for graph OOD generalization. We propose a modifier to obtain both augmented and invariant graphs in a unified manner. With the augmented graphs, we enrich the training data without compromising the integrity of label-graph relationships. The label consistency enhancement in our framework further preserves the supervision information in the invariant graph. We conduct extensive experiments on real-world datasets to demonstrate the superiority of our framework over other state-of-the-art baselines.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing distribution and label consistency for graph out-of-distribution generalization, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet the 'MUST' criteria.
Multimodal Multihop Source Retrieval for Web Question Answering
arXiv ID: http://arxiv.org/abs/2501.04173v1
Abstract:
This work deals with the challenge of learning and reasoning over multi-modal multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn multi-source reasoning paths and find the supporting facts across both image and text modalities for answering the question. In this paper, we investigate the importance of graph structure for multi-modal multi-hop question answering. Our analysis is centered on WebQA. We construct a strong baseline model, that finds relevant sources using a pairwise classification task. We establish that, with the proper use of feature representations from pre-trained models, graph structure helps in improving multi-modal multi-hop question answering. We point out that both graph structure and adjacency matrix are task-related prior knowledge, and graph structure can be leveraged to improve the retrieval performance for the task. Experiments and visualized analysis demonstrate that message propagation over graph networks or the entire graph structure can replace massive multimodal transformers with token-wise cross-attention. We demonstrated the applicability of our method and show a performance gain of 4.6% retrieval F1 score over the transformer baselines, despite being a very light model. We further demonstrated the applicability of our model to a large-scale retrieval setting.
Decision Explanation: Original decision: REJECT The paper focuses primarily on multimodal multihop question answering using graph reasoning networks, with emphasis on leveraging image and text modalities, rather than specifically engineering prompts for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts as required.
[HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version)](https://arxiv.org/abs/2501.04180v2) arXiv ID: http://arxiv.org/abs/2501.04180v2
Abstract:
Games have been vital test beds for the rapid development of Agent-based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real-world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real-world domains include multi-agent scenarios and require machine-machine and human-machine collaboration. Open-source environments have not advanced and are often toy scenarios, too abstract or not suitable for multi-agent research. By mimicking real-world problems and increasing the complexity of environments, we hope to advance state-of-the-art multi-agent research and inspire researchers to work on immediate real-world problems. Here, we present HIVEX, an environment suite to benchmark multi-agent research focusing on ecological challenges. HIVEX includes the following environments: Wind Farm Control, Wildfire Resource Management, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial Wildfire Suppression. We provide environments, training examples, and baselines for the main and sub-tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face. We also provide a leaderboard on Hugging Face and encourage the community to submit models trained on our environment suite.
Decision Explanation: Original decision: REJECT The paper primarily focuses on multi-agent research, ecological challenges, and the development of an environment suite (HIVEX), with no clear emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Fixed Points of Deep Neural Networks: Emergence, Stability, and Applications](https://arxiv.org/abs/2501.04182v1) arXiv ID: http://arxiv.org/abs/2501.04182v1
Abstract:
We present numerical and analytical results on the formation and stability of a family of fixed points of deep neural networks (DNNs). Such fixed points appear in a class of DNNs when dimensions of input and output vectors are the same. We demonstrate examples of applications of such networks in supervised, semi-supervised and unsupervised learning, such as encoding/decoding of images and restoration of damaged images, among others. We present several numerical and analytical results. First, we show that for untrained DNNs with weights and biases initialized by normally distributed random variables, only one fixed point exists. This result holds for DNNs with any depth (number of layers) $L$, any layer width $N$, and sigmoid-type activation functions. Second, it has been shown that for a DNN whose parameters (weights and biases) are initialized by a "light-tailed" distribution of weights (e.g. normal distribution), after training the distribution of these parameters becomes "heavy-tailed". This motivates our study of DNNs with "heavy-tailed" initialization. For such DNNs we show numerically that training leads to the emergence of $Q(N, L)$ fixed points, where $Q(N, L)$ is a positive integer that depends on the number of layers $L$ and layer width $N$. We further observe numerically that for fixed $N = N_0$ the function $Q(N_0, L)$ is non-monotone: it initially grows as $L$ increases and then decreases to 1. This non-monotone behavior of $Q(N_0, L)$ is also obtained by analytical derivation of an equation for the Empirical Spectral Distribution (ESD) of the input-output Jacobian, followed by numerical solution of this equation.
Decision Explanation: Original decision: REJECT The paper focuses on the analysis of fixed points in deep neural networks (DNNs), including their emergence, stability, and applications in image processing, but does not address prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, thus failing to meet all 'MUST' criteria.
[Generative Style Transfer for MRI Image Segmentation: A Case of Glioma Segmentation in Sub-Saharan Africa](https://arxiv.org/abs/2501.04734v1) arXiv ID: http://arxiv.org/abs/2501.04734v1
Abstract:
In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic Resonance Imaging (MRI) technology raises questions about the applicability of machine learning methods for clinical tasks. This study aims to provide a robust deep learning-based brain tumor segmentation (BraTS) method tailored for the SSA population using a threefold approach. Firstly, the impact of domain shift from the SSA training data on model efficacy was examined, revealing no significant effect. Secondly, a comparative analysis of 3D and 2D full-resolution models using the nnU-Net framework indicates similar performance of both the models trained for 300 epochs achieving a five-fold cross-validation score of 0.93. Lastly, addressing the performance gap observed in SSA validation as opposed to the relatively larger BraTS glioma (GLI) validation set, two strategies are proposed: fine-tuning SSA cases using the GLI+SSA best-pretrained 2D fullres model at 300 epochs, and introducing a novel neural style transfer-based data augmentation technique for the SSA cases. This investigation underscores the potential of enhancing brain tumor prediction within SSA's unique healthcare landscape.
Decision Explanation: Original decision: REJECT The paper focuses on MRI image segmentation, a non-text generation task, using deep learning methods, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through prompt engineering, violating multiple 'MUST NOT' criteria.
Retrieval-Augmented Generation by Evidence Retroactivity in LLMs
arXiv ID: http://arxiv.org/abs/2501.05475v1
Abstract:
Retrieval-augmented generation has gained significant attention due to its ability to integrate relevant external knowledge, enhancing the accuracy and reliability of the LLMs' responses. Most of the existing methods apply a dynamic multiple retrieval-generating process, to address multi-hop complex questions by decomposing them into sub-problems. However, these methods rely on an unidirectional forward reasoning paradigm, where errors from insufficient reasoning steps or inherent flaws in current retrieval systems are irreversible, potentially derailing the entire reasoning chain. For the first time, this work introduces Retroactive Retrieval-Augmented Generation (RetroRAG), a novel framework to build a retroactive reasoning paradigm. RetroRAG revises and updates the evidence, redirecting the reasoning chain to the correct direction. RetroRAG constructs an evidence-collation-discovery framework to search, generate, and refine credible evidence. It synthesizes inferential evidence related to the key entities in the question from the existing source knowledge and formulates search queries to uncover additional information. As new evidence is found, RetroRAG continually updates and organizes this information, enhancing its ability to locate further necessary evidence. Paired with an Answerer to generate and evaluate outputs, RetroRAG is capable of refining its reasoning process iteratively until a reliable answer is obtained. Empirical evaluations show that RetroRAG significantly outperforms existing methods.
Decision Explanation: Original decision: REJECT The paper primarily focuses on introducing a novel framework for retrieval-augmented generation with a retroactive reasoning paradigm, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). While LLMs are utilized, the core subject is the enhancement of their reliability through external knowledge integration, not prompt engineering for text-based interactions.
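The retroactive loop outlined above (retrieve, answer, judge, then revise the evidence pool and retry) can be sketched as a control loop. `search`, `answer`, and `judge` below are placeholder callables standing in for the paper's retriever, Answerer, and evaluator; the revision logic is an illustrative assumption:

```python
# Hedged sketch of a retroactive retrieval-augmented generation loop.
def retro_rag(question, search, answer, judge, max_rounds=5):
    evidence = search(question)
    draft = None
    for _ in range(max_rounds):
        draft = answer(question, evidence)
        verdict, missing = judge(question, draft, evidence)
        if verdict == "reliable":
            return draft
        # Retroactive step: refine the pool instead of only moving forward.
        # Drop evidence the judge flagged and search for what is missing.
        evidence = [e for e in evidence
                    if e not in missing.get("discard", [])]
        evidence += search(missing.get("query", question))
    return draft  # best effort after the round budget is exhausted
```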
[Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models](https://arxiv.org/abs/2501.05478v1) arXiv ID: http://arxiv.org/abs/2501.05478v1
Abstract:
Large Language Models (LLMs) such as GPT-4, trained on huge amount of datasets spanning multiple domains, exhibit significant reasoning, understanding, and planning capabilities across various tasks. This study presents the first-ever work in Arabic language integration within the Vision-and-Language Navigation (VLN) domain in robotics, an area that has been notably underexplored in existing research. We perform a comprehensive evaluation of state-of-the-art multi-lingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction using the R2R dataset. Through comprehensive experiments, we demonstrate that our framework is capable of high-level planning for navigation tasks when provided with instructions in both English and Arabic. However, certain models struggled with reasoning and planning in the Arabic language due to inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of enhancing planning and reasoning capabilities in language models for effective navigation, emphasizing this as a key area for further development while also unlocking the potential of Arabic-language models for impactful real-world applications.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the evaluation of Large Language Models (LLMs) for multilingual robotic navigation, rather than on engineering, design, or optimization of prompts specifically for LLMs. While LLMs are utilized, the core subject is not prompt engineering for text-based interactions, but rather the application of LLMs in navigation tasks across different languages.
[Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT](https://arxiv.org/abs/2501.06224v1) arXiv ID: http://arxiv.org/abs/2501.06224v1
Abstract:
Recently, violence detection systems developed using unified multimodal models have achieved significant success and attracted widespread attention. However, most of these systems face two critical challenges: the lack of interpretability as black-box models and limited functionality, offering only classification or retrieval capabilities. To address these challenges, this paper proposes a novel interpretable violence detection system, termed the Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and graph attention networks (GAT) to provide three core functionalities: detection, retrieval, and explanation. Specifically, the system processes each video frame along with text descriptions generated by a large language model (LLM) for videos containing potential violent behavior. It employs ImageBind to generate high-dimensional embeddings for constructing a knowledge graph, uses GAT for reasoning, and applies lightweight time series modules to extract video embedding features. The final step connects a classifier and retriever for multi-functional outputs. The interpretability of KG enables the system to verify the reasoning process behind each output. Additionally, the paper introduces several lightweight methods to reduce the resource consumption of the TIO system and enhance its efficiency. Extensive experiments conducted on the XD-Violence and UCF-Crime datasets validate the effectiveness of the proposed system. A case study further reveals an intriguing phenomenon: as the number of bystanders increases, the occurrence of violent behavior tends to decrease.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a unified violence detection system using knowledge graphs and GAT, with LLMs being used only as a component for generating text descriptions, rather than the central focus being on prompt engineering for text-based interactions with LLMs.
asanAI: In-Browser, No-Code, Offline-First Machine Learning Toolkit
arXiv ID: http://arxiv.org/abs/2501.06226v1
Abstract:
Machine learning (ML) has become crucial in modern life, with growing interest from researchers and the public. Despite its potential, a significant entry barrier prevents widespread adoption, making it challenging for non-experts to understand and implement ML techniques. The increasing desire to leverage ML is counterbalanced by its technical complexity, creating a gap between potential and practical application. This work introduces asanAI, an offline-first, open-source, no-code machine learning toolkit designed for users of all skill levels. It allows individuals to design, debug, train, and test ML models directly in a web browser, eliminating the need for software installations and coding. The toolkit runs on any device with a modern web browser, including smartphones, and ensures user privacy through local computations while utilizing WebGL for enhanced GPU performance. Users can quickly experiment with neural networks and train custom models using various data sources, supported by intuitive visualizations of network structures and data flows. asanAI simplifies the teaching of ML concepts in educational settings and is released under an open-source MIT license, encouraging modifications. It also supports exporting models in industry-ready formats, empowering a diverse range of users to effectively learn and apply machine learning in their projects. The proposed toolkit is successfully utilized by researchers of ScaDS.AI to swiftly draft and test machine learning ideas, by trainers to effectively educate enthusiasts, and by teachers to introduce contemporary ML topics in classrooms with minimal effort and high clarity.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a no-code machine learning toolkit (asanAI) that happens to utilize ML techniques, rather than engineering prompts specifically for Large Language Models (LLMs), thus not meeting the core subject criterion.
Balanced Multi-view Clustering
arXiv ID: http://arxiv.org/abs/2501.02564v2
Abstract:
Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC may not fully leverage the multi-view information, owing to the imbalanced and under-optimized view-specific features caused by the uniform learning objective for all views. For instance, particular views with more discriminative information could dominate the learning process in the joint training paradigm, leading to other views being under-optimized. To alleviate this issue, we first analyze the imbalanced phenomenon in the joint-training paradigm of multi-view clustering from the perspective of gradient descent for each view-specific feature extractor. Then, we propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view. Concretely, VCR preserves the sample similarities captured from the joint features and view-specific ones into the clustering distributions corresponding to view-specific features to enhance the learning process of view-specific feature extractors. Additionally, a theoretical analysis is provided to illustrate that VCR adaptively modulates the magnitudes of gradients for updating the parameters of view-specific feature extractors to achieve a balanced multi-view learning procedure. In such a manner, BMvC achieves a better trade-off between the exploitation of view-specific patterns and the exploration of view-invariance patterns to fully learn the multi-view information for the clustering task. Finally, a set of experiments are conducted to verify the superiority of the proposed method compared with state-of-the-art approaches both on eight benchmark MvC datasets and two spatially resolved transcriptomics datasets.
Decision Explanation: Original decision: REJECT The paper focuses on multi-view clustering, a general machine learning technique, and does not meet the 'MUST' criteria for prompt engineering specifically tailored for Large Language Models (LLMs), nor does it provide examples of textual input prompts impacting LLM output.
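To make the VCR idea in the abstract above concrete, here is a minimal PyTorch sketch: the similarity structure of the joint features is used as a target for the similarity structure induced by a view's clustering distribution. Function names, the temperature, and the KL formulation are illustrative assumptions; the paper's exact loss may differ.

```python
# A minimal sketch of view-specific contrastive regularization (VCR):
# pull each view's similarity structure toward that of the joint features.
import torch
import torch.nn.functional as F

def vcr_loss(joint_feats, view_cluster_logits, temperature=0.5):
    # similarity of samples under the joint representation
    joint = F.normalize(joint_feats, dim=1)
    target_sim = F.softmax(joint @ joint.t() / temperature, dim=1)
    # similarity of samples under the view-specific clustering distribution
    p = F.softmax(view_cluster_logits, dim=1)
    view_sim = F.softmax(p @ p.t() / temperature, dim=1)
    # KL divergence regularizes the view-specific extractor's optimization
    return F.kl_div(view_sim.log(), target_sim, reduction="batchmean")
```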
[Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM
Pre-Training Datasets](https://arxiv.org/abs/2501.02628v1) arXiv ID: http://arxiv.org/abs/2501.02628v1
Abstract:
A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach leverages the version history of all OSS projects to identify training data samples that have been modified or have undergone changes in at least one OSS project, and pinpoint a subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns. By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the curation of pre-training datasets for LLMs to reduce bugs and licensing risks, rather than the engineering, design, or optimization of prompts specifically for Large Language Models, failing to meet the first 'MUST' criteria.
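The curation idea described above, flagging training samples that were later modified or fixed, can be illustrated with a small sketch over a single repository's `git log`. The repository layout, keyword heuristic, and function names are illustrative, not the authors' pipeline.

```python
# Sketch of version-history curation: flag files later modified, and mark
# those whose follow-up commit message looks like a bug or CVE fix.
import subprocess

def later_commits(repo, path):
    out = subprocess.run(
        ["git", "-C", repo, "log", "--follow", "--format=%H %s", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()
    # git log is newest-first; the last entry introduced the file
    return out[:-1]

def looks_like_fix(subject):
    return any(k in subject.lower() for k in ("fix", "bug", "cve", "vulnerab"))

def curate(repo, path):
    commits = later_commits(repo, path)
    has_fix = any(looks_like_fix(c.split(" ", 1)[1]) for c in commits if " " in c)
    return {"path": path, "has_newer_version": bool(commits), "fixed_later": has_fix}
```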
[Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured
State-Space Models](https://arxiv.org/abs/2501.02832v3) arXiv ID: http://arxiv.org/abs/2501.02832v3
Abstract:
We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention mechanisms to capture dependencies, Samba ASR effectively models both local and global temporal dependencies using efficient state-space dynamics, achieving remarkable performance gains. By addressing the limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba ASR achieves superior accuracy and efficiency. Experimental results demonstrate that Samba ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state of the art in ASR. Extensive evaluations on the benchmark dataset show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the inherent computational efficiency and parameter optimization of the Mamba architecture make Samba ASR a scalable and robust solution for diverse ASR tasks. Our contributions include the development of a new Samba ASR architecture for automatic speech recognition (ASR), demonstrating the superiority of structured state-space models (SSMs) over transformer-based models for speech sequence processing. We provide a comprehensive evaluation on public benchmarks, showcasing state-of-the-art (SOTA) performance, and present an in-depth analysis of computational efficiency, robustness to noise, and sequence generalization. This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging the advancements of state-space modeling, Samba ASR redefines ASR performance standards and sets a new benchmark for future research in this field.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new Mamba-based architecture (Samba ASR) for Automatic Speech Recognition (ASR), which is unrelated to Large Language Models (LLMs) and prompt engineering for text-based interactions, thus failing to meet the primary criteria.
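The SSM core that Samba ASR builds on is a linear recurrence, h_t = A h_{t-1} + B x_t with readout y_t = C h_t. The toy NumPy scan below shows only that recurrence; Mamba additionally makes the parameters input-dependent (selective) and uses a hardware-aware implementation, neither of which is sketched here.

```python
# Toy linear state-space scan: the recurrence underlying structured SSMs.
import numpy as np

def ssm_scan(A, B, C, x):
    # A: (d, d) state transition, B: (d, 1) input map, C: (1, d) readout, x: (T,)
    h = np.zeros((A.shape[0], 1))
    ys = []
    for t in range(len(x)):
        h = A @ h + B * x[t]       # state update
        ys.append((C @ h).item())  # scalar readout
    return np.array(ys)

rng = np.random.default_rng(0)
y = ssm_scan(0.9 * np.eye(4), rng.normal(size=(4, 1)),
             rng.normal(size=(1, 4)), rng.normal(size=16))
```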
[A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation
Based on Large Language Models Enhanced by Domain Knowledge Retrieval](https://arxiv.org/abs/2501.03295v2) arXiv ID: http://arxiv.org/abs/2501.03295v2
Abstract:
Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently faces challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated significant potential across various domains, notably through In-Context Learning (ICL), which enables high-performance task execution with minimal input-label demonstrations and no prior training. This paper aims to replace supervised learning with the emerging ICL paradigm for soft sensor modeling to address existing challenges and explore new avenues for advancement. To achieve this, we propose a novel framework called the Few-shot Uncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes the Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware Few-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial Knowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling zero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based context demonstrations of structured data to prompt LLMs to execute ICL for prediction, and we propose a context sample retrieval augmentation strategy to improve performance. Additionally, we explore LLMs' AIGC and probabilistic characteristics to propose self-explanation and uncertainty quantification methods for constructing a trustworthy soft sensor. Extensive experiments demonstrate that our method achieves state-of-the-art predictive performance, strong robustness, and flexibility, and effectively mitigates the training instability found in traditional methods. To the best of our knowledge, this is the first work to establish a soft sensor utilizing LLMs.
Decision Explanation: Original decision: REJECT The paper primarily focuses on replacing supervised learning with In-Context Learning for soft sensor modeling in industrial systems, utilizing LLMs as a core component within a larger system. While it mentions prompt enhancements (e.g., text-based context demonstrations, context sample retrieval augmentation), prompt engineering for text-based interactions with LLMs is not the central focus, but rather a means to achieve the paper's main objective of advancing soft sensor technology.
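As a minimal sketch of the in-context-learning setup the abstract describes, a few labeled sensor rows can be serialized as text demonstrations, followed by the query row. The prompt wording, the feature names, and the OpenAI backend below are illustrative choices, not the authors' exact pipeline.

```python
# Few-shot ICL over structured sensor data, sketched with the OpenAI client.
from openai import OpenAI

def build_prompt(demos, query):
    lines = ["Predict the target value from the process variables."]
    for x, y in demos:
        lines.append(f"Variables: {x} -> Target: {y}")
    lines.append(f"Variables: {query} -> Target:")
    return "\n".join(lines)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
demos = [({"temp": 78.2, "flow": 1.4}, 0.62), ({"temp": 81.0, "flow": 1.1}, 0.71)]
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user",
               "content": build_prompt(demos, {"temp": 79.5, "flow": 1.3})}],
)
print(resp.choices[0].message.content)
```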
[FTA-FTL: A Fine-Tuned Aggregation Federated Transfer Learning Scheme for
Lithology Microscopic Image Classification](https://arxiv.org/abs/2501.03349v1) arXiv ID: http://arxiv.org/abs/2501.03349v1
Abstract:
Lithology discrimination is a crucial activity in characterizing oil reservoirs, and processing lithology microscopic images is an essential technique for investigating fossils and minerals and for the geological assessment of shale oil exploration. Deep learning (DL) techniques are a powerful approach for building robust classifier models. However, collecting and producing a large dataset remains a considerable challenge, and transfer learning and data augmentation have emerged as popular approaches to tackle this problem. Furthermore, for various reasons, especially data privacy, individuals, organizations, and companies are often unwilling to share their sensitive data and information. Federated Learning (FL) has emerged to train a highly accurate central model across multiple decentralized edge servers without transferring sensitive data, thereby preserving privacy and enhancing security. This study involves two phases: the first phase conducts lithology microscopic image classification on a small dataset using transfer learning, comprehensively comparing various pre-trained DL model architectures for the classification task. In the second phase, we formulate the classification task as a Federated Transfer Learning (FTL) scheme and propose a Fine-Tuned Aggregation strategy for Federated Learning (FTA-FTL). To perform a comprehensive experimental study, several metrics such as accuracy, F1 score, precision, specificity, sensitivity (recall), and the confusion matrix are taken into account. The results confirm the efficiency of the proposed scheme and show that the FTA-FTL algorithm achieves approximately the same results as the centralized implementation for the lithology microscopic image classification task.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on Federated Transfer Learning for image classification, primarily dealing with Deep Learning model architectures and data privacy in the context of Lithology microscopic image classification, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions." }
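The aggregation step that FTA-FTL refines can be illustrated with a plain weighted federated-averaging sketch over client model states. The paper's Fine-Tuned Aggregation strategy is not specified here; this PyTorch snippet shows only the baseline FedAvg it builds on, with illustrative names.

```python
# Plain FedAvg: size-weighted average of client state dicts.
import torch

def fed_avg(client_states, client_sizes):
    total = float(sum(client_sizes))
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg  # load into the global model with model.load_state_dict(avg)
```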
[Not all tokens are created equal: Perplexity Attention Weighted Networks
for AI generated text detection](https://arxiv.org/abs/2501.03940v2) arXiv ID: http://arxiv.org/abs/2501.03940v2
Abstract:
The rapid advancement in large language models (LLMs) has significantly enhanced their ability to generate coherent and contextually relevant text, raising concerns about the misuse of AI-generated content and making it critical to detect it. However, the task remains challenging, particularly in unseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution outputs offers a theoretically appealing approach for detection, as they encapsulate insights from the models' extensive pre-training on diverse corpora. Despite its promise, zero-shot methods that attempt to operationalize these outputs have met with limited success. We hypothesize that one of the problems is that they use the mean to aggregate next-token distribution metrics across tokens, when some tokens are naturally easier or harder to predict and should be weighted differently. Based on this idea, we propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. Although not zero-shot, our method allows us to cache the last hidden states and next-token distribution metrics on disk, greatly reducing the training resource requirements. PAWN shows competitive and even better performance in-distribution than the strongest baselines (fine-tuned LMs) with a fraction of their trainable parameters. Our model also generalizes better to unseen domains and source models, with smaller variability in the decision boundary across distribution shifts. It is also more robust to adversarial attacks, and if the backbone has multilingual capabilities, it presents decent generalization to languages not seen during supervised training, with LLaMA3-1B reaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine languages.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a novel method (PAWN) for detecting AI-generated text, leveraging LLM outputs, rather than engineering or optimizing prompts for Large Language Models. Prompt manipulation for improving LLM performance is not the central concern.
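A minimal sketch of PAWN's central idea as described in the abstract: compute per-token metrics from the next-token distribution (here, observed-token log-probability and entropy), then pool them with weights derived from the LLM's last hidden states rather than a plain mean. Layer sizes, the feature set, and the final classifier are illustrative assumptions.

```python
# Weighted pooling of next-token-distribution features, in the spirit of PAWN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerplexityAttentionHead(nn.Module):
    def __init__(self, hidden_dim, n_features=2):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)       # token weight from hidden state
        self.classifier = nn.Linear(n_features, 1)  # human vs. AI-generated logit

    def forward(self, logits, hidden, token_ids):
        # logits: (T, V), hidden: (T, d), token_ids: (T,)
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(1, token_ids.unsqueeze(1)).squeeze(1)  # (T,)
        entropy = -(logp.exp() * logp).sum(-1)                        # (T,)
        feats = torch.stack([tok_logp, entropy], dim=1)               # (T, 2)
        w = torch.softmax(self.score(hidden).squeeze(1), dim=0)       # (T,)
        pooled = (w.unsqueeze(1) * feats).sum(0)  # weighted sum, not a mean
        return self.classifier(pooled)
```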
[More is not always better? Enhancing Many-Shot In-Context Learning with
Differentiated and Reweighting Objectives](https://arxiv.org/abs/2501.04070v2) arXiv ID: http://arxiv.org/abs/2501.04070v2
Abstract:
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as the number of ICL demonstrations increases from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated Learning and advantage-based Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby improving generalization. This approach allows the model to handle varying numbers of shots effectively, mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50)-a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens-for fine-tuning purposes. ICL-50 facilitates the evaluation of many-shot ICL strategies across seven prominent NLP tasks and 50 distinct datasets. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and benchmark dataset hoping to facilitate further research in many-shot ICL.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on enhancing many-shot in-context learning through novel optimization methods (Differentiated Learning and Reweighting objectives), rather than specifically on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs). The core subject is optimization of LLM performance through training methods, not prompt engineering." }
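The advantage-based reweighting idea from the abstract can be sketched as follows: rather than averaging the NLL over all in-context demonstrations uniformly, each shot is weighted by how much it improves on a running baseline (a cumulative-advantage signal). The exact weighting scheme in the paper may differ; this is an illustrative reading.

```python
# Advantage-reweighted NLL over k in-context demonstrations.
import torch

def reweighted_nll(shot_nlls):
    # shot_nlls: (k,) per-demonstration negative log-likelihoods
    baseline = torch.cumsum(shot_nlls, dim=0) / torch.arange(
        1, len(shot_nlls) + 1, dtype=shot_nlls.dtype)
    advantage = baseline - shot_nlls           # positive when a shot beats the running mean
    weights = torch.softmax(advantage, dim=0)  # emphasize helpful shots, damp noisy ones
    return (weights * shot_nlls).sum()
```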
Cosmos World Foundation Model Platform for Physical AI
arXiv ID: http://arxiv.org/abs/2501.03575v1
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.
Decision Explanation: Original decision: REJECT The paper primarily focuses on a platform for Physical AI and world model development, with emphasis on video curation, model fine-tuning, and open-source resources. It does not meet the core criteria of focusing primarily on prompt engineering for Large Language Models (LLMs), manipulating textual input prompts to improve LLM performance, or providing concrete examples of prompts and their impact on LLM output.