[{"title":"Beyond Text: Implementing Multimodal Large Language Model-Powered\n Multi-Agent Systems Using a No-Code Platform","abstract":" This study proposes the design and implementation of a multimodal LLM-based\nMulti-Agent System (MAS) leveraging a No-Code platform to address the practical\nconstraints and significant entry barriers associated with AI adoption in\nenterprises. Advanced AI technologies, such as Large Language Models (LLMs),\noften pose challenges due to their technical complexity and high implementation\ncosts, making them difficult for many organizations to adopt. To overcome these\nlimitations, this research develops a No-Code-based Multi-Agent System designed\nto enable users without programming knowledge to easily build and manage AI\nsystems. The study examines various use cases to validate the applicability of\nAI in business processes, including code generation from image-based notes,\nAdvanced RAG-based question-answering systems, text-based image generation, and\nvideo generation using images and prompts. These systems lower the barriers to\nAI adoption, empowering not only professional developers but also general users\nto harness AI for significantly improved productivity and efficiency. By\ndemonstrating the scalability and accessibility of No-Code platforms, this\nstudy advances the democratization of AI technologies within enterprises and\nvalidates the practical applicability of Multi-Agent Systems, ultimately\ncontributing to the widespread adoption of AI across various industries.\n","arxiv_id":"http://arxiv.org/abs/2501.00750v1","authors":["Cheonsu Jeong"]},{"title":"A3: Android Agent Arena for Mobile GUI Agents","abstract":" AI agents have become increasingly prevalent in recent years, driven by\nsignificant advancements in the field of large language models (LLMs). Mobile\nGUI agents, a subset of AI agents, are designed to autonomously perform tasks\non mobile devices. 
While numerous studies have introduced agents, datasets, and\nbenchmarks to advance mobile GUI agent research, many existing datasets focus\non static frame evaluations and fail to provide a comprehensive platform for\nassessing performance on real-world, in-the-wild tasks. To address this gap, we\npresent Android Agent Arena (A3), a novel evaluation platform. Unlike existing\nin-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as\nreal-time online information retrieval and operational instructions; (2) a\nlarger, more flexible action space, enabling compatibility with agents trained\non any dataset; and (3) an automated business-level LLM-based evaluation process.\nA3 includes 21 widely used general third-party apps and 201 tasks\nrepresentative of common user scenarios, providing a robust foundation for\nevaluating mobile GUI agents in real-world situations and a new autonomous\nevaluation process that requires less human labor and coding expertise. The project is\navailable at \\url{https://yuxiangchai.github.io/Android-Agent-Arena/}.\n","arxiv_id":"http://arxiv.org/abs/2501.01149v1","authors":["Yuxiang Chai","Hanhao Li","Jiayu Zhang","Liang Liu","Guozhi Wang","Shuai Ren","Siyuan Huang","Hongsheng Li"]},{"title":"Rethinking Relation Extraction: Beyond Shortcuts to Generalization with\n a Debiased Benchmark","abstract":" Benchmarks are crucial for evaluating machine learning algorithm performance,\nfacilitating comparison and identifying superior solutions. However, biases\nwithin datasets can lead models to learn shortcut patterns, resulting in\ninaccurate assessments and hindering real-world applicability. This paper\naddresses the issue of entity bias in relation extraction tasks, where models\ntend to rely on entity mentions rather than context. We propose a debiased\nrelation extraction benchmark DREB that breaks the pseudo-correlation between\nentity mentions and relation types through entity replacement. 
DREB utilizes\nBias Evaluator and PPL Evaluator to ensure low bias and high naturalness,\nproviding a reliable and accurate assessment of model generalization in entity\nbias scenarios. To establish a new baseline on DREB, we introduce MixDebias, a\ndebiasing method combining data-level and model training-level techniques.\nMixDebias effectively improves model performance on DREB while maintaining\nperformance on the original dataset. Extensive experiments demonstrate the\neffectiveness and robustness of MixDebias compared to existing methods,\nhighlighting its potential for improving the generalization ability of relation\nextraction models. We will release DREB and MixDebias publicly.\n","arxiv_id":"http://arxiv.org/abs/2501.01349v1","authors":["Liang He","Yougang Chu","Zhen Wu","Jianbing Zhang","Xinyu Dai","Jiajun Chen"]},{"title":"ASKCOS: an open source software suite for synthesis planning","abstract":" The advancement of machine learning and the availability of large-scale\nreaction datasets have accelerated the development of data-driven models for\ncomputer-aided synthesis planning (CASP) in the past decade. Here, we detail\nthe newest version of ASKCOS, an open source software suite for synthesis\nplanning that makes available several research advances in a freely available,\npractical tool. Four one-step retrosynthesis models form the basis of both\ninteractive planning and automatic planning modes. Retrosynthetic planning is\ncomplemented by other modules for feasibility assessment and pathway\nevaluation, including reaction condition recommendation, reaction outcome\nprediction, and auxiliary capabilities such as solubility prediction and\nquantum mechanical descriptor prediction. ASKCOS has assisted hundreds of\nmedicinal, synthetic, and process chemists in their day-to-day tasks,\ncomplementing expert decision making. 
It is our belief that CASP tools like\nASKCOS are an important part of modern chemistry research, and that they offer\never-increasing utility and accessibility.\n","arxiv_id":"http://arxiv.org/abs/2501.01835v1","authors":["Zhengkai Tu","Sourabh J. Choure","Mun Hong Fong","Jihye Roh","Itai Levin","Kevin Yu","Joonyoung F. Joung","Nathan Morgan","Shih-Cheng Li","Xiaoqi Sun","Huiqian Lin","Mark Murnin","Jordan P. Liles","Thomas J. Struble","Michael E. Fortunato","Mengjie Liu","William H. Green","Klavs F. Jensen","Connor W. Coley"]},{"title":"eRevise+RF: A Writing Evaluation System for Assessing Student Essay\n Revisions and Providing Formative Feedback","abstract":" The ability to revise essays in response to feedback is important for\nstudents' writing success. An automated writing evaluation (AWE) system that\nsupports students in revising their essays is thus essential. We present\neRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g.,\nchanges made to an essay to improve its quality in response to essay feedback)\nand providing revision feedback. We deployed the system with 6 teachers and 406\nstudents across 3 schools in Pennsylvania and Louisiana. The results confirmed\nits effectiveness in (1) assessing student essays in terms of evidence usage,\n(2) extracting evidence and reasoning revisions across essays, and (3)\ndetermining revision success in responding to feedback. 
The evaluation also\nsuggested eRevise+RF is a helpful system for young students to improve their\nargumentative writing skills through revision and formative feedback.\n","arxiv_id":"http://arxiv.org/abs/2501.00715v1","authors":["Zhexiong Liu","Diane Litman","Elaine Wang","Tianwen Li","Mason Gobat","Lindsay Clare Matsumura","Richard Correnti"]},{"title":"Towards End-to-End Neuromorphic Voxel-based 3D Object Reconstruction\n Without Physical Priors","abstract":" Neuromorphic cameras, also known as event cameras, are asynchronous\nbrightness-change sensors that can capture extremely fast motion without\nsuffering from motion blur, making them particularly promising for 3D\nreconstruction in extreme environments. However, existing research on 3D\nreconstruction using monocular neuromorphic cameras is limited, and most of the\nmethods rely on estimating physical priors and employ complex multi-step\npipelines. In this work, we propose an end-to-end method for dense voxel 3D\nreconstruction using neuromorphic cameras that eliminates the need to estimate\nphysical priors. Our method incorporates a novel event representation to\nenhance edge features, enabling the proposed feature-enhancement model to learn\nmore effectively. Additionally, we introduced Optimal Binarization Threshold\nSelection Principle as a guideline for future related work, using the optimal\nreconstruction results achieved with threshold optimization as the benchmark.\nOur method achieves a 54.6% improvement in reconstruction accuracy compared to\nthe baseline method.\n","arxiv_id":"http://arxiv.org/abs/2501.00741v1","authors":["Chuanzhi Xu","Langyi Chen","Vincent Qu","Haodong Chen","Vera Chung"]},{"title":"AttriReBoost: A Gradient-Free Propagation Optimization Method for Cold\n Start Mitigation in Attribute Missing Graphs","abstract":" Missing attribute issues are prevalent in the graph learning, leading to\nbiased outcomes in Graph Neural Networks (GNNs). 
Existing methods that rely on\nfeature propagation are prone to the cold start problem, particularly when dealing\nwith attribute resetting and low-degree nodes, which hinder effective\npropagation and convergence. To address these challenges, we propose\nAttriReBoost (ARB), a novel method that incorporates a propagation-based approach\nto mitigate cold start problems in attribute-missing graphs. ARB enhances\nglobal feature propagation by redefining initial boundary conditions and\nstrategically integrating virtual edges, thereby improving node connectivity\nand ensuring more stable and efficient convergence. This method facilitates\ngradient-free attribute reconstruction with lower computational overhead. The\nproposed method is theoretically grounded, with its convergence rigorously\nestablished. Extensive experiments on several real-world benchmark datasets\ndemonstrate the effectiveness of ARB, achieving an average accuracy improvement\nof 5.11% over state-of-the-art methods. Additionally, ARB exhibits remarkable\ncomputational efficiency, processing a large-scale graph with 2.49 million\nnodes in just 16 seconds on a single GPU. Our code is available at\nhttps://github.com/limengran98/ARB.\n","arxiv_id":"http://arxiv.org/abs/2501.00743v1","authors":["Mengran Li","Chaojun Ding","Junzhou Chen","Wenbin Xing","Cong Ye","Ronghui Zhang","Songlin Zhuang","Jia Hu","Tony Z. Qiu","Huijun Gao"]},{"title":"Enhancing Transformers for Generalizable First-Order Logical Entailment","abstract":" Transformers, as a fundamental deep learning architecture, have demonstrated\nremarkable capabilities in reasoning. This paper investigates the generalizable\nfirst-order logical reasoning ability of transformers with their parameterized\nknowledge and explores ways to improve it. 
The first-order reasoning capability\nof transformers is assessed through their ability to perform first-order\nlogical entailment, which is quantitatively measured by their performance in\nanswering knowledge graph queries. We establish connections between (1) two\ntypes of distribution shifts studied in out-of-distribution generalization and\n(2) the unseen knowledge and query settings discussed in the task of knowledge\ngraph query answering, enabling a characterization of fine-grained\ngeneralizability. Results on our comprehensive dataset show that transformers\noutperform previous methods specifically designed for this task and provide\ndetailed empirical evidence on the impact of input query syntax, token\nembedding, and transformer architectures on the reasoning capability of\ntransformers. Interestingly, our findings reveal a mismatch between positional\nencoding and other design choices in transformer architectures employed in\nprior practices. This discovery motivates us to propose a more sophisticated,\nlogic-aware architecture, TEGA, to enhance the capability for generalizable\nfirst-order logical entailment in transformers.\n","arxiv_id":"http://arxiv.org/abs/2501.00759v1","authors":["Tianshi Zheng","Jiazheng Wang","Zihao Wang","Jiaxin Bai","Hang Yin","Zheye Deng","Yangqiu Song","Jianxin Li"]},{"title":"REM: A Scalable Reinforced Multi-Expert Framework for Multiplex\n Influence Maximization","abstract":" In social online platforms, identifying influential seed users to maximize\ninfluence spread is crucial as it can greatly diminish the cost and effort\nrequired for information dissemination. While effective, traditional methods\nfor Multiplex Influence Maximization (MIM) have reached their performance\nlimits, prompting the emergence of learning-based approaches. 
These novel\nmethods aim for better generalization and scalability for more sizable graphs\nbut face significant challenges, such as (1) inability to handle unknown\ndiffusion patterns and (2) reliance on high-quality training samples. To\naddress these issues, we propose the Reinforced Expert Maximization framework\n(REM). REM leverages a Propagation Mixture of Experts technique to encode\ndynamic propagation of large multiplex networks effectively in order to\ngenerate enhanced influence propagation. Noticeably, REM treats a generative\nmodel as a policy to autonomously generate different seed sets and learn how to\nimprove them from a Reinforcement Learning perspective. Extensive experiments\non several real-world datasets demonstrate that REM surpasses state-of-the-art\nmethods in terms of influence spread, scalability, and inference time in\ninfluence maximization tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.00779v1","authors":["Huyen Nguyen","Hieu Dam","Nguyen Do","Cong Tran","Cuong Pham"]},{"title":"Make Shuffling Great Again: A Side-Channel Resistant Fisher-Yates\n Algorithm for Protecting Neural Networks","abstract":" Neural network models implemented in embedded devices have been shown to be\nsusceptible to side-channel attacks (SCAs), allowing recovery of proprietary\nmodel parameters, such as weights and biases. There are already available\ncountermeasure methods currently used for protecting cryptographic\nimplementations that can be tailored to protect embedded neural network models.\nShuffling, a hiding-based countermeasure that randomly shuffles the order of\ncomputations, was shown to be vulnerable to SCA when the Fisher-Yates algorithm\nis used. In this paper, we propose a design of an SCA-secure version of the\nFisher-Yates algorithm. 
By integrating the masking technique for modular\nreduction and Blakely's method for modular multiplication, we effectively\nremove the vulnerability in the division operation that led to side-channel\nleakage in the original version of the algorithm. We experimentally verify\nthat the countermeasure is effective against SCA by implementing a correlation\npower analysis attack on an embedded neural network model implemented on ARM\nCortex-M4. Compared to the original proposal, the memory overhead is $2\\times$\nthe biggest layer of the network, while the time overhead varies from $4\\%$ to\n$0.49\\%$ for a layer with $100$ and $1000$ neurons, respectively.\n","arxiv_id":"http://arxiv.org/abs/2501.00798v1","authors":["Leonard Puškáč","Marek Benovič","Jakub Breier","Xiaolu Hou"]},{"title":"Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in\n Zero-Shot Event-Relational Reasoning","abstract":" Zero-shot event-relational reasoning is an important task in natural language\nprocessing, and existing methods jointly learn a variety of event-relational\nprefixes and inference-form prefixes to achieve such tasks. However, training\nprefixes consumes large computational resources and lacks interpretability.\nAdditionally, learning various relational and inferential knowledge\ninefficiently exploits the connections between tasks. Therefore, we first\npropose a method for Reasoning-Oriented Locating and Editing (ROLE), which\nlocates and edits the key modules of the language model for reasoning about\nevent relations, enhancing interpretability and also resource-efficiently\noptimizing the reasoning ability. Subsequently, we propose a method for\nAnalogy-Based Locating and Editing (ABLE), which efficiently exploits the\nsimilarities and differences between tasks to optimize the zero-shot reasoning\ncapability. Experimental results show that ROLE improves interpretability and\nreasoning performance with reduced computational cost. 
ABLE achieves SOTA\nresults in zero-shot reasoning.\n","arxiv_id":"http://arxiv.org/abs/2501.00803v1","authors":["Jingyao Tang","Lishuang Li","Liteng Mi","Haiming Wu","Hongbin Lu"]},{"title":"LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management","abstract":" Cryptocurrency investment is inherently difficult due to its shorter history\ncompared to traditional assets, the need to integrate vast amounts of data from\nvarious modalities, and the requirement for complex reasoning. While deep\nlearning approaches have been applied to address these challenges, their\nblack-box nature raises concerns about trust and explainability. Recently,\nlarge language models (LLMs) have shown promise in financial applications due\nto their ability to understand multi-modal data and generate explainable\ndecisions. However, a single LLM faces limitations in complex, comprehensive\ntasks such as asset investment. These limitations are even more pronounced in\ncryptocurrency investment, where LLMs have less domain-specific knowledge in\ntheir training corpora.\n To overcome these challenges, we propose an explainable, multi-modal,\nmulti-agent framework for cryptocurrency investment. Our framework uses\nspecialized agents that collaborate within and across teams to handle subtasks\nsuch as data analysis, literature integration, and investment decision-making\nfor the top 30 cryptocurrencies by market capitalization. The expert training\nmodule fine-tunes agents using multi-modal historical data and professional\ninvestment literature, while the multi-agent investment module employs\nreal-time data to make informed cryptocurrency investment decisions. Unique\nintrateam and interteam collaboration mechanisms enhance prediction accuracy by\nadjusting final predictions based on confidence levels within agent teams and\nfacilitating information sharing between teams. 
Empirical evaluation using data\nfrom November 2023 to September 2024 demonstrates that our framework\noutperforms single-agent models and market benchmarks in classification, asset\npricing, portfolio, and explainability performance.\n","arxiv_id":"http://arxiv.org/abs/2501.00826v2","authors":["Yichen Luo","Yebo Feng","Jiahua Xu","Paolo Tasca","Yang Liu"]},{"title":"Embedding Style Beyond Topics: Analyzing Dispersion Effects Across\n Different Language Models","abstract":" This paper analyzes how writing style affects the dispersion of embedding\nvectors across multiple, state-of-the-art language models. While early\ntransformer models primarily aligned with topic modeling, this study examines\nthe role of writing style in shaping embedding spaces. Using a literary corpus\nthat alternates between topics and styles, we compare the sensitivity of\nlanguage models across French and English. By analyzing the particular impact\nof style on embedding dispersion, we aim to better understand how language\nmodels process stylistic information, contributing to their overall\ninterpretability.\n","arxiv_id":"http://arxiv.org/abs/2501.00828v1","authors":["Benjamin Icard","Evangelia Zve","Lila Sainero","Alice Breton","Jean-Gabriel Ganascia"]},{"title":"An LLM-Empowered Adaptive Evolutionary Algorithm For Multi-Component\n Deep Learning Systems","abstract":" Multi-objective evolutionary algorithms (MOEAs) are widely used for searching\noptimal solutions in complex multi-component applications. Traditional MOEAs\nfor multi-component deep learning (MCDL) systems face challenges in enhancing\nthe search efficiency while maintaining the diversity. To combat these, this\npaper proposes $\\mu$MOEA, the first LLM-empowered adaptive evolutionary search\nalgorithm to detect safety violations in MCDL systems. 
Inspired by the\ncontext-understanding ability of Large Language Models (LLMs), $\\mu$MOEA\nprompts the LLM to comprehend the optimization problem and generate an initial\npopulation tailored to evolutionary objectives. Subsequently, it employs adaptive\nselection and variation to iteratively produce offspring, balancing the\nevolutionary efficiency and diversity. During the evolutionary process, to\nnavigate away from the local optima, $\\mu$MOEA integrates the evolutionary\nexperience back into the LLM. This utilization harnesses the LLM's quantitative\nreasoning prowess to generate differential seeds, breaking away from current\noptimal solutions. We evaluate $\\mu$MOEA in finding safety violations of MCDL\nsystems, and compare its performance with state-of-the-art MOEA methods.\nExperimental results show that $\\mu$MOEA can significantly improve the\nefficiency and diversity of the evolutionary search.\n","arxiv_id":"http://arxiv.org/abs/2501.00829v1","authors":["Haoxiang Tian","Xingshuo Han","Guoquan Wu","An Guo","Yuan Zhou","Jie Zhang","Shuo Li","Jun Wei","Tianwei Zhang"]},{"title":"LLM+AL: Bridging Large Language Models and Action Languages for Complex\n Reasoning about Actions","abstract":" Large Language Models (LLMs) have made significant strides in various\nintelligent tasks but still struggle with complex action reasoning tasks that\nrequire systematic search. To address this limitation, we propose a method that\nbridges the natural language understanding capabilities of LLMs with the\nsymbolic reasoning strengths of action languages. Our approach, termed\n\"LLM+AL,\" leverages the LLM's strengths in semantic parsing and commonsense\nknowledge generation alongside the action language's proficiency in automated\nreasoning based on encoded knowledge. We compare LLM+AL against\nstate-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0,\nand o1-preview, using benchmarks for complex reasoning about actions. 
Our\nfindings indicate that, although all methods exhibit errors, LLM+AL, with\nrelatively minimal human corrections, consistently leads to correct answers,\nwhereas standalone LLMs fail to improve even with human feedback. LLM+AL also\ncontributes to automated generation of action languages.\n","arxiv_id":"http://arxiv.org/abs/2501.00830v1","authors":["Adam Ishay","Joohyung Lee"]},{"title":"Distilled Lifelong Self-Adaptation for Configurable Systems","abstract":" Modern configurable systems provide tremendous opportunities for engineering\nfuture intelligent software systems. A key difficulty thereof is how to\neffectively self-adapt the configuration of a running system such that its\nperformance (e.g., runtime and throughput) can be optimized under time-varying\nworkloads. This unfortunately remains unaddressed in existing approaches as\nthey either overlook the available past knowledge or rely on static\nexploitation of past knowledge without reasoning the usefulness of information\nwhen planning for self-adaptation. In this paper, we tackle this challenging\nproblem by proposing DLiSA, a framework that self-adapts configurable systems.\nDLiSA comes with two properties: firstly, it supports lifelong planning, and\nthereby the planning process runs continuously throughout the lifetime of the\nsystem, allowing dynamic exploitation of the accumulated knowledge for rapid\nadaptation. Secondly, the planning for a newly emerged workload is boosted via\ndistilled knowledge seeding, in which the knowledge is dynamically purified\nsuch that only useful past configurations are seeded when necessary, mitigating\nmisleading information. Extensive experiments suggest that the proposed DLiSA\nsignificantly outperforms state-of-the-art approaches, demonstrating a\nperformance improvement of up to 229% and a resource acceleration of up to\n2.22x on generating promising adaptation configurations. 
All data and sources\ncan be found at our repository: https://github.com/ideas-labo/dlisa.\n","arxiv_id":"http://arxiv.org/abs/2501.00840v1","authors":["Yulong Ye","Tao Chen","Miqing Li"]},{"title":"Diversity Optimization for Travelling Salesman Problem via Deep\n Reinforcement Learning","abstract":" Existing neural methods for the Travelling Salesman Problem (TSP) mostly aim\nat finding a single optimal solution. To discover diverse yet high-quality\nsolutions for Multi-Solution TSP (MSTSP), we propose a novel deep reinforcement\nlearning based neural solver, which is primarily featured by an encoder-decoder\nstructured policy. Concretely, on the one hand, a Relativization Filter (RF) is\ndesigned to enhance the robustness of the encoder to affine transformations of\nthe instances, so as to potentially improve the quality of the found solutions.\nOn the other hand, a Multi-Attentive Adaptive Active Search (MA3S) is tailored\nto allow the decoders to strike a balance between the optimality and diversity.\nExperimental evaluations on benchmark instances demonstrate the superiority of\nour method over recent neural baselines across different metrics, and its\ncompetitive performance against state-of-the-art traditional heuristics with\nsignificantly reduced computational time, ranging from $1.3\\times$ to\n$15\\times$ faster. Furthermore, we demonstrate that our method can also be\napplied to the Capacitated Vehicle Routing Problem (CVRP).\n","arxiv_id":"http://arxiv.org/abs/2501.00884v1","authors":["Qi Li","Zhiguang Cao","Yining Ma","Yaoxin Wu","Yue-Jiao Gong"]},{"title":"Population Aware Diffusion for Time Series Generation","abstract":" Diffusion models have shown promising ability in generating high-quality time\nseries (TS) data. Despite the initial success, existing works mostly focus on\nthe authenticity of data at the individual level, but pay less attention to\npreserving the population-level properties on the entire dataset. 
Such\npopulation-level properties include value distributions for each dimension and\ndistributions of certain functional dependencies (e.g., cross-correlation, CC)\nbetween different dimensions. For instance, when generating house energy\nconsumption TS data, the value distributions of the outside temperature and the\nkitchen temperature should be preserved, as well as the distribution of CC\nbetween them. Preserving such TS population-level properties is critical in\nmaintaining the statistical insights of the datasets, mitigating model bias,\nand augmenting downstream tasks like TS prediction. Yet, it is often overlooked\nby existing models. Hence, data generated by existing models often bear\ndistribution shifts from the original data. We propose Population-aware\nDiffusion for Time Series (PaD-TS), a new TS generation model that better\npreserves the population-level properties. The key novelties of PaD-TS include\n1) a new training method explicitly incorporating TS population-level property\npreservation, and 2) a new dual-channel encoder model architecture that better\ncaptures the TS data structure. Empirical results in major benchmark datasets\nshow that PaD-TS can improve the average CC distribution shift score between\nreal and synthetic data by 5.9x while maintaining a performance comparable to\nstate-of-the-art models on individual-level authenticity.\n","arxiv_id":"http://arxiv.org/abs/2501.00910v1","authors":["Yang Li","Han Meng","Zhenyu Bi","Ingolv T. Urnes","Haipeng Chen"]},{"title":"$β$-DQN: Improving Deep Q-Learning By Evolving the Behavior","abstract":" While many sophisticated exploration methods have been proposed, their lack\nof generality and high computational cost often lead researchers to favor\nsimpler methods like $\\epsilon$-greedy. Motivated by this, we introduce\n$\\beta$-DQN, a simple and efficient exploration method that augments the\nstandard DQN with a behavior function $\\beta$. 
This function estimates the\nprobability that each action has been taken at each state. By leveraging\n$\\beta$, we generate a population of diverse policies that balance exploration\nbetween state-action coverage and overestimation bias correction. An adaptive\nmeta-controller is designed to select an effective policy for each episode,\nenabling flexible and explainable exploration. $\\beta$-DQN is straightforward\nto implement and adds minimal computational overhead to the standard DQN.\nExperiments on both simple and challenging exploration domains show that\n$\\beta$-DQN outperforms existing baseline methods across a wide range of tasks,\nproviding an effective solution for improving exploration in deep reinforcement\nlearning.\n","arxiv_id":"http://arxiv.org/abs/2501.00913v1","authors":["Hongming Zhang","Fengshuo Bai","Chenjun Xiao","Chao Gao","Bo Xu","Martin Müller"]},{"title":"Incremental Dialogue Management: Survey, Discussion, and Implications\n for HRI","abstract":" Efforts towards endowing robots with the ability to speak have benefited from\nrecent advancements in NLP, in particular large language models. However, as\npowerful as current models have become, they still operate on sentence or\nmulti-sentence level input, not on the word-by-word input that humans operate\non, affecting the degree of responsiveness that they offer, which is critical\nin situations where humans interact with robots using speech. In this paper, we\nreview the literature on interactive systems that operate incrementally (i.e.,\nat the word level or below it). We motivate the need for incremental systems,\nsurvey incremental modeling of important aspects of dialogue like speech\nrecognition and language generation. Primary focus is on the part of the system\nthat makes decisions, known as the dialogue manager. 
We find that there is very\nlittle research on incremental dialogue management, offer some requirements for\npractical incremental dialogue management, and discuss the implications of incremental\ndialogue for embodied, robotic platforms.\n","arxiv_id":"http://arxiv.org/abs/2501.00953v1","authors":["Casey Kennington","Pierre Lison","David Schlangen"]},{"title":"Are LLMs effective psychological assessors? Leveraging adaptive RAG for\n interpretable mental health screening through psychometric practice","abstract":" In psychological practice, standardized questionnaires serve as essential\ntools for assessing mental constructs (e.g., attitudes, traits, and emotions)\nthrough structured questions (aka items). With the increasing prevalence of\nsocial media platforms where users share personal experiences and emotions,\nresearchers are exploring computational methods to leverage this data for rapid\nmental health screening. In this study, we propose a novel adaptive\nRetrieval-Augmented Generation (RAG) approach that completes psychological\nquestionnaires by analyzing social media posts. Our method retrieves the most\nrelevant user posts for each question in a psychological survey and uses Large\nLanguage Models (LLMs) to predict questionnaire scores in a zero-shot setting.\nOur findings are twofold. First, we demonstrate that this approach can\neffectively predict users' responses to psychological questionnaires, such as\nthe Beck Depression Inventory II (BDI-II), achieving performance comparable to\nor surpassing state-of-the-art models on Reddit-based benchmark datasets\nwithout relying on training data. 
Second, we show how this methodology can be\ngeneralized as a scalable screening tool, as the final assessment is\nsystematically derived by completing standardized questionnaires and tracking\nhow individual item responses contribute to the diagnosis, aligning with\nestablished psychometric practices.\n","arxiv_id":"http://arxiv.org/abs/2501.00982v1","authors":["Federico Ravenda","Seyed Ali Bahrainian","Andrea Raballo","Antonietta Mira","Noriko Kando"]},{"title":"Bootstrapped Reward Shaping","abstract":" In reinforcement learning, especially in sparse-reward domains, many\nenvironment steps are required to observe reward information. In order to\nincrease the frequency of such observations, \"potential-based reward shaping\"\n(PBRS) has been proposed as a method of providing a more dense reward signal\nwhile leaving the optimal policy invariant. However, the required \"potential\nfunction\" must be carefully designed with task-dependent knowledge to not deter\ntraining performance. In this work, we propose a \"bootstrapped\" method of\nreward shaping, termed BSRS, in which the agent's current estimate of the\nstate-value function acts as the potential function for PBRS. We provide\nconvergence proofs for the tabular setting, give insights into training\ndynamics for deep RL, and show that the proposed method improves training speed\nin the Atari suite.\n","arxiv_id":"http://arxiv.org/abs/2501.00989v1","authors":["Jacob Adamczyk","Volodymyr Makarenko","Stas Tiomkin","Rahul V. Kulkarni"]},{"title":"Exploring Information Processing in Large Language Models: Insights from\n Information Bottleneck Theory","abstract":" Large Language Models (LLMs) have demonstrated remarkable performance across\na wide range of tasks by understanding input information and predicting\ncorresponding outputs. However, the internal mechanisms by which LLMs\ncomprehend input and make effective predictions remain poorly understood. 
In\nthis paper, we explore the working mechanism of LLMs in information processing\nfrom the perspective of Information Bottleneck Theory. We propose a\nnon-training construction strategy to define a task space and identify the\nfollowing key findings: (1) LLMs compress input information into specific task\nspaces (e.g., sentiment space, topic space) to facilitate task understanding;\n(2) they then extract and utilize relevant information from the task space at\ncritical moments to generate accurate predictions. Based on these insights, we\nintroduce two novel approaches: an Information Compression-based Context\nLearning (IC-ICL) and a Task-Space-guided Fine-Tuning (TS-FT). IC-ICL enhances\nreasoning performance and inference efficiency by compressing retrieved example\ninformation into the task space. TS-FT employs a space-guided loss to fine-tune\nLLMs, encouraging the learning of more effective compression and selection\nmechanisms. Experiments across multiple datasets validate the effectiveness of\ntask space construction. Additionally, IC-ICL not only improves performance but\nalso accelerates inference speed by over 40\\%, while TS-FT achieves superior\nresults with a minimal strategy adjustment.\n","arxiv_id":"http://arxiv.org/abs/2501.00999v2","authors":["Zhou Yang","Zhengyu Qi","Zhaochun Ren","Zhikai Jia","Haizhou Sun","Xiaofei Zhu","Xiangwen Liao"]},{"title":"Deep Reinforcement Learning for Job Scheduling and Resource Management\n in Cloud Computing: An Algorithm-Level Review","abstract":" Cloud computing has revolutionized the provisioning of computing resources,\noffering scalable, flexible, and on-demand services to meet the diverse\nrequirements of modern applications. 
At the heart of efficient cloud operations\nare job scheduling and resource management, which are critical for optimizing\nsystem performance and ensuring timely and cost-effective service delivery.\nHowever, the dynamic and heterogeneous nature of cloud environments presents\nsignificant challenges for these tasks, as workloads and resource availability\ncan fluctuate unpredictably. Traditional approaches, including heuristic and\nmeta-heuristic algorithms, often struggle to adapt to these real-time changes\ndue to their reliance on static models or predefined rules. Deep Reinforcement\nLearning (DRL) has emerged as a promising solution to these challenges by\nenabling systems to learn and adapt policies based on continuous observations\nof the environment, facilitating intelligent and responsive decision-making.\nThis survey provides a comprehensive review of DRL-based algorithms for job\nscheduling and resource management in cloud computing, analyzing their\nmethodologies, performance metrics, and practical applications. We also\nhighlight emerging trends and future research directions, offering valuable\ninsights into leveraging DRL to advance both job scheduling and resource\nmanagement in cloud computing.\n","arxiv_id":"http://arxiv.org/abs/2501.01007v1","authors":["Yan Gu","Zhaoze Liu","Shuhong Dai","Cong Liu","Ying Wang","Shen Wang","Georgios Theodoropoulos","Long Cheng"]},{"title":"MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based\n on Large language Model","abstract":" The exponential growth of data and advancements in big data technologies have\ncreated a demand for more efficient and automated approaches to data analysis\nand storytelling. However, automated data analysis systems still face\nchallenges in leveraging large language models (LLMs) for data insight\ndiscovery, augmented analysis, and data storytelling. 
This paper introduces the\nMultidimensional Data Storytelling Framework (MDSF) based on large language\nmodels for automated insight generation and context-aware storytelling. The\nframework incorporates advanced preprocessing techniques, augmented analysis\nalgorithms, and a unique scoring mechanism to identify and prioritize\nactionable insights. The use of fine-tuned LLMs enhances contextual\nunderstanding and generates narratives with minimal manual intervention. The\narchitecture also includes an agent-based mechanism for real-time storytelling\ncontinuation control. Key findings reveal that MDSF outperforms existing\nmethods across various datasets in terms of insight ranking accuracy,\ndescriptive quality, and narrative coherence. The experimental evaluation\ndemonstrates MDSF's ability to automate complex analytical tasks, reduce\ninterpretive biases, and improve user satisfaction. User studies further\nunderscore its practical utility in enhancing content structure, conclusion\nextraction, and richness of detail.\n","arxiv_id":"http://arxiv.org/abs/2501.01014v1","authors":["Chengze Zhang","Changshan Li","Shiyang Gao"]},{"title":"Towards Adversarially Robust Deep Metric Learning","abstract":" Deep Metric Learning (DML) has shown remarkable successes in many domains by\ntaking advantage of powerful deep neural networks. Deep neural networks are\nprone to adversarial attacks and could be easily fooled by adversarial\nexamples. The current progress on this robustness issue is mainly about deep\nclassification models but pays little attention to DML models. Existing works\nfail to thoroughly inspect the robustness of DML and neglect an important DML\nscenario, the clustering-based inference. In this work, we first point out the\nrobustness issue of DML models in clustering-based inference scenarios. 
We find\nthat, for the clustering-based inference, existing defenses designed for DML\ncannot be reused, and adaptations of defenses designed for deep\nclassification models cannot achieve satisfactory robustness performance. To\nalleviate the hazard of adversarial examples, we propose a new defense, the\nEnsemble Adversarial Training (EAT), which exploits ensemble learning and\nadversarial training. EAT promotes the diversity of the ensemble, encouraging\neach model in the ensemble to have different robustness features, and employs a\nself-transferring mechanism to make full use of the robustness statistics of\nthe whole ensemble in the update of every single model. We evaluate the EAT\nmethod on three widely-used datasets with two popular model architectures. The\nresults show that the proposed EAT method greatly outperforms the adaptations\nof defenses designed for deep classification models.\n","arxiv_id":"http://arxiv.org/abs/2501.01025v2","authors":["Xiaopeng Ke"]},{"title":"Reasoning based on symbolic and parametric knowledge bases: a survey","abstract":" Reasoning is fundamental to human intelligence, and critical for\nproblem-solving, decision-making, and critical thinking. Reasoning refers to\ndrawing new conclusions based on existing knowledge, which can support various\napplications like clinical diagnosis, basic education, and financial analysis.\nThough a good number of surveys have been proposed for reviewing\nreasoning-related methods, none of them has systematically investigated these\nmethods from the viewpoint of their dependent knowledge base. Both the\nscenarios to which the knowledge bases are applied and their storage formats\nare significantly different. Hence, investigating reasoning methods from the\nknowledge base perspective helps us better understand the challenges and future\ndirections. To fill this gap, this paper first classifies the knowledge base\ninto symbolic and parametric ones. 
The former explicitly stores information in\nhuman-readable symbols, and the latter implicitly encodes knowledge within\nparameters. Then, we provide a comprehensive overview of reasoning methods\nusing symbolic knowledge bases, parametric knowledge bases, and both of them.\nFinally, we identify the future direction toward enhancing reasoning\ncapabilities to bridge the gap between human and machine intelligence.\n","arxiv_id":"http://arxiv.org/abs/2501.01030v1","authors":["Mayi Xu","Yunfeng Ning","Yongqi Li","Jianhao Chen","Jintao Wen","Yao Xiao","Shen Zhou","Birong Pan","Zepeng Bao","Xin Miao","Hankun Kang","Ke Sun","Tieyun Qian"]},{"title":"MSWA: Refining Local Attention with Multi-Scale Window Attention","abstract":" Transformer-based LLMs have achieved exceptional performance across a wide\nrange of NLP tasks. However, the standard self-attention mechanism suffers from\nquadratic time complexity and linearly increasing cache size. Sliding window\nattention (SWA) solves this problem by restricting the attention range to a\nfixed-size local context window. Nevertheless, SWA employs a uniform window\nsize for each head in each layer, making it inefficient in capturing context of\nvarying scales. To mitigate this limitation, we propose Multi-Scale Window\nAttention (MSWA), which applies diverse window sizes across heads and layers in\nthe Transformer. It not only allows for different window sizes among heads\nwithin the same layer but also progressively increases window size allocation\nfrom shallow to deep layers, thus enabling the model to capture contextual\ninformation with different lengths and distances. 
Experimental results on\nlanguage modeling and common-sense reasoning tasks substantiate that MSWA\noutperforms traditional local attention in both effectiveness and efficiency.\n","arxiv_id":"http://arxiv.org/abs/2501.01039v1","authors":["Yixing Xu","Shivank Nag","Dong Li","Lu Tian","Emad Barsoum"]},{"title":"Risks of Cultural Erasure in Large Language Models","abstract":" Large language models are increasingly being integrated into applications\nthat shape the production and discovery of societal knowledge such as search,\nonline education, and travel planning. As a result, language models will shape\nhow people learn about, perceive and interact with global cultures, making it\nimportant to consider whose knowledge systems and perspectives are represented\nin models. Recognizing this importance, increasing work in Machine Learning\nand NLP has focused on evaluating gaps in global cultural representational\ndistribution within outputs. However, more work is needed on developing\nbenchmarks for cross-cultural impacts of language models that stem from a\nnuanced sociologically-aware conceptualization of cultural impact or harm. We\njoin this line of work arguing for the need for metricizable evaluations of\nlanguage technologies that interrogate and account for historical power\ninequities and differential impacts of representation on global cultures,\nparticularly for cultures already under-represented in the digital corpora. We\nlook at two concepts of erasure: omission, where cultures are not represented\nat all, and simplification, where cultural complexity is erased by presenting\none-dimensional views of a rich culture. The former focuses on whether\nsomething is represented, and the latter on how it is represented. We focus our\nanalysis on two task contexts with the potential to influence global cultural\nproduction. 
First, we probe representations that a language model produces\nabout different places around the world when asked to describe these contexts.\nSecond, we analyze the cultures represented in the travel recommendations\nproduced by a set of language model applications. Our study shows ways in which\nthe NLP community and application developers can begin to operationalize\ncomplex socio-cultural considerations into standard evaluations and benchmarks.\n","arxiv_id":"http://arxiv.org/abs/2501.01056v1","authors":["Rida Qadri","Aida M. Davani","Kevin Robinson","Vinodkumar Prabhakaran"]},{"title":"Graph Generative Pre-trained Transformer","abstract":" Graph generation is a critical task in numerous domains, including molecular\ndesign and social network analysis, due to its ability to model complex\nrelationships and structured data. While most modern graph generative models\nutilize adjacency matrix representations, this work revisits an alternative\napproach that represents graphs as sequences of node set and edge set. We\nadvocate for this approach due to its efficient encoding of graphs and propose\na novel representation. Based on this representation, we introduce the Graph\nGenerative Pre-trained Transformer (G2PT), an auto-regressive model that learns\ngraph structures via next-token prediction. To further exploit G2PT's\ncapabilities as a general-purpose foundation model, we explore fine-tuning\nstrategies for two downstream applications: goal-oriented generation and graph\nproperty prediction. We conduct extensive experiments across multiple datasets.\nResults indicate that G2PT achieves superior generative performance on both\ngeneric graph and molecule datasets. 
Furthermore, G2PT exhibits strong\nadaptability and versatility in downstream tasks from molecular design to\nproperty prediction.\n","arxiv_id":"http://arxiv.org/abs/2501.01073v1","authors":["Xiaohui Chen","Yinkai Wang","Jiaxing He","Yuanqi Du","Soha Hassoun","Xiaolin Xu","Li-Ping Liu"]},{"title":"BatStyler: Advancing Multi-category Style Generation for Source-free\n Domain Generalization","abstract":" Source-Free Domain Generalization (SFDG) aims to develop a model that\nperforms well on unseen domains without relying on any source domains. However, the\nimplementation remains constrained due to the unavailability of training data.\nResearch on SFDG focuses on knowledge transfer of multi-modal models and style\nsynthesis based on the joint space of multiple modalities, thus eliminating the\ndependency on source domain images. However, existing works primarily target\nmulti-domain and less-category configurations, while performance in multi-domain\nand multi-category configurations is relatively poor. In addition, the\nefficiency of style synthesis also deteriorates in multi-category scenarios.\nEfficiently synthesizing sufficiently diverse data and applying it to\nmulti-category configurations is therefore a direction of greater practical value. In\nthis paper, we propose a method called BatStyler, designed to improve\nthe capability of style synthesis in multi-category scenarios. BatStyler\nconsists of two modules: the Coarse Semantic Generation and Uniform Style\nGeneration modules. The Coarse Semantic Generation module extracts\ncoarse-grained semantics to prevent the compression of the space for style\ndiversity learning in multi-category configurations, while the Uniform Style\nGeneration module provides a template of styles that are uniformly distributed\nin space and implements parallel training. 
Extensive experiments demonstrate\nthat our method exhibits comparable performance on less-category datasets,\nwhile surpassing state-of-the-art methods on multi-category datasets.\n","arxiv_id":"http://arxiv.org/abs/2501.01109v1","authors":["Xiusheng Xu","Lei Qi","Jingyang Zhou","Xin Geng"]},{"title":"MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic\n Forgetting in Malware Classification","abstract":" Continual Learning (CL) for malware classification tackles the rapidly\nevolving nature of malware threats and the frequent emergence of new types.\nGenerative Replay (GR)-based CL systems utilize a generative model to produce\nsynthetic versions of past data, which are then combined with new data to\nretrain the primary model. Traditional machine learning techniques in this\ndomain often struggle with catastrophic forgetting, where a model's performance\non old data degrades over time.\n In this paper, we introduce a GR-based CL system that employs Generative\nAdversarial Networks (GANs) with feature matching loss to generate high-quality\nmalware samples. Additionally, we implement innovative selection schemes for\nreplay samples based on the model's hidden representations.\n Our comprehensive evaluation across Windows and Android malware datasets in a\nclass-incremental learning scenario -- where new classes are introduced\ncontinuously over multiple tasks -- demonstrates substantial performance\nimprovements over previous methods. For example, our system achieves an average\naccuracy of 55% on Windows malware samples, significantly outperforming other\nGR-based models by 28%. This study provides practical insights for advancing\nGR-based malware classification systems. 
The implementation is available at\n\\url {https://github.com/MalwareReplayGAN/MalCL}\\footnote{The code will be made\npublic upon the presentation of the paper}.\n","arxiv_id":"http://arxiv.org/abs/2501.01110v1","authors":["Jimin Park","AHyun Ji","Minji Park","Mohammad Saidur Rahman","Se Eun Oh"]},{"title":"Pruning-based Data Selection and Network Fusion for Efficient Deep\n Learning","abstract":" Efficient data selection is essential for improving the training efficiency\nof deep neural networks and reducing the associated annotation costs. However,\ntraditional methods tend to be computationally expensive, limiting their\nscalability and real-world applicability. We introduce PruneFuse, a novel\nmethod that combines pruning and network fusion to enhance data selection and\naccelerate network training. In PruneFuse, the original dense network is pruned\nto generate a smaller surrogate model that efficiently selects the most\ninformative samples from the dataset. Once this iterative data selection\nselects sufficient samples, the insights learned from the pruned model are\nseamlessly integrated with the dense model through network fusion, providing an\noptimized initialization that accelerates training. Extensive experimentation\non various datasets demonstrates that PruneFuse significantly reduces\ncomputational costs for data selection, achieves better performance than\nbaselines, and accelerates the overall training process.\n","arxiv_id":"http://arxiv.org/abs/2501.01118v1","authors":["Humaira Kousar","Hasnain Irshad Bhatti","Jaekyun Moon"]},{"title":"Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal\n Learning","abstract":" Multimodal learning with incomplete modality is practical and challenging.\nRecently, researchers have focused on enhancing the robustness of pre-trained\nMultiModal Transformers (MMTs) under missing modality conditions by applying\nlearnable prompts. 
However, these prompt-based methods face several\nlimitations: (1) incomplete modalities provide restricted modal cues for\ntask-specific inference, (2) dummy imputation for missing content causes\ninformation loss and introduces noise, and (3) static prompts are\ninstance-agnostic, offering limited knowledge for instances with various\nmissing conditions. To address these issues, we propose RAGPT, a novel\nRetrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three\nmodules: (I) the multi-channel retriever, which identifies similar instances\nthrough a within-modality retrieval strategy, (II) the missing modality\ngenerator, which recovers missing information using retrieved contexts, and\n(III) the context-aware prompter, which captures contextual knowledge from\nrelevant instances and generates dynamic prompts to largely enhance the MMT's\nrobustness. Extensive experiments conducted on three real-world datasets show\nthat RAGPT consistently outperforms all competitive baselines in handling\nincomplete modality problems. The code of our work and prompt-based baselines\nis available at https://github.com/Jian-Lang/RAGPT.\n","arxiv_id":"http://arxiv.org/abs/2501.01120v1","authors":["Jian Lang","Zhangtao Cheng","Ting Zhong","Fan Zhou"]},{"title":"Deep Learning in Palmprint Recognition-A Comprehensive Survey","abstract":" Palmprint recognition has emerged as a prominent biometric technology, widely\napplied in diverse scenarios. Traditional handcrafted methods for palmprint\nrecognition often fall short in representation capability, as they heavily\ndepend on researchers' prior knowledge. Deep learning (DL) has been introduced\nto address this limitation, leveraging its remarkable successes across various\ndomains. 
While existing surveys focus narrowly on specific tasks within\npalmprint recognition, often grounded in traditional methodologies, there remains\na significant gap in comprehensive research exploring DL-based approaches\nacross all facets of palmprint recognition. This paper bridges that gap by\nthoroughly reviewing recent advancements in DL-powered palmprint recognition.\nThe paper systematically examines progress across key tasks, including\nregion-of-interest segmentation, feature extraction, and\nsecurity/privacy-oriented challenges. Beyond highlighting these advancements,\nthe paper identifies current challenges and uncovers promising opportunities\nfor future research. By consolidating state-of-the-art progress, this review\nserves as a valuable resource for researchers, enabling them to stay abreast of\ncutting-edge technologies and drive innovation in palmprint recognition.\n","arxiv_id":"http://arxiv.org/abs/2501.01166v1","authors":["Chengrui Gao","Ziyuan Yang","Wei Jia","Lu Leng","Bob Zhang","Andrew Beng Jin Teoh"]},{"title":"Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes\n in Benchmark Datasets","abstract":" The multifaceted challenge of accurately measuring gender stereotypical bias\nin language models is akin to discerning different segments of a broader,\nunseen entity. This short paper primarily focuses on intrinsic bias mitigation\nand measurement strategies for language models, building on prior research that\ndemonstrates a lack of correlation between intrinsic and extrinsic approaches.\nWe delve deeper into intrinsic measurements, identifying inconsistencies and\nsuggesting that these benchmarks may reflect different facets of gender\nstereotypes. Our methodology involves analyzing data distributions across\ndatasets and integrating gender stereotype components informed by social\npsychology. By adjusting the distribution of two datasets, we achieve a better\nalignment of outcomes. 
Our findings underscore the complexity of gender\nstereotyping in language models and point to new directions for developing more\nrefined techniques to detect and reduce bias.\n","arxiv_id":"http://arxiv.org/abs/2501.01168v1","authors":["Mahdi Zakizadeh","Mohammad Taher Pilehvar"]},{"title":"L3D-Pose: Lifting Pose for 3D Avatars from a Single Camera in the Wild","abstract":" While 2D pose estimation has advanced our ability to interpret body movements\nin animals and primates, it is limited by the lack of depth information,\nconstraining its application range. 3D pose estimation provides a more\ncomprehensive solution by incorporating spatial depth, yet creating extensive\n3D pose datasets for animals is challenging due to their dynamic and\nunpredictable behaviours in natural settings. To address this, we propose a\nhybrid approach that utilizes rigged avatars and the pipeline to generate\nsynthetic datasets to acquire the necessary 3D annotations for training. Our\nmethod introduces a simple attention-based MLP network for converting 2D poses\nto 3D, designed to be independent of the input image to ensure scalability for\nposes in natural environments. Additionally, we identify that existing\nanatomical keypoint detectors are insufficient for accurate pose retargeting\nonto arbitrary avatars. To overcome this, we present a lookup table based on a\ndeep pose estimation method using a synthetic collection of diverse actions\nrigged avatars perform. Our experiments demonstrate the effectiveness and\nefficiency of this lookup table-based retargeting approach. 
Overall, we propose\na comprehensive framework with systematically synthesized datasets for lifting\nposes from 2D to 3D and then utilize this to re-target motion from wild\nsettings onto arbitrary avatars.\n","arxiv_id":"http://arxiv.org/abs/2501.01174v1","authors":["Soumyaratna Debnath","Harish Katti","Shashikant Verma","Shanmuganathan Raman"]},{"title":"Data Augmentation Techniques for Chinese Disease Name Normalization","abstract":" Disease name normalization is an important task in the medical domain. It\nclassifies disease names written in various formats into standardized names,\nserving as a fundamental component in smart healthcare systems for various\ndisease-related functions. Nevertheless, the most significant obstacle to\nexisting disease name normalization systems is the severe shortage of training\ndata. Consequently, we present a novel data augmentation approach that includes\na series of data augmentation techniques and some supporting modules to help\nmitigate the problem. Through extensive experimentation, we illustrate that our\nproposed approach exhibits significant performance improvements across various\nbaseline models and training objectives, particularly in scenarios with limited\ntraining data.\n","arxiv_id":"http://arxiv.org/abs/2501.01195v1","authors":["Wenqian Cui","Xiangling Fu","Shaohui Liu","Mingjun Gu","Xien Liu","Ji Wu","Irwin King"]},{"title":"A redescription mining framework for post-hoc explaining and relating\n deep learning models","abstract":" Deep learning models (DLMs) achieve increasingly high performance both on\nstructured and unstructured data. They have significantly extended the applicability of\nmachine learning to various domains. Their success in making predictions,\ndetecting patterns, and generating new data has made a significant impact on science\nand industry. Despite these accomplishments, DLMs are difficult to explain\nbecause of their enormous size. 
In this work, we propose a novel framework for\npost-hoc explaining and relating DLMs using redescriptions. The framework\nallows cohort analysis of arbitrary DLMs by identifying statistically\nsignificant redescriptions of neuron activations. It allows coupling neurons to\na set of target labels or sets of descriptive attributes, relating layers\nwithin a single DLM or associating different DLMs. The proposed framework is\nindependent of the artificial neural network architecture and can work with\nmore complex target labels (e.g. multi-label or multi-target scenario).\nAdditionally, it can emulate both pedagogical and decompositional approach to\nrule extraction. The aforementioned properties of the proposed framework can\nincrease explainability and interpretability of arbitrary DLMs by providing\ndifferent information compared to existing explainable-AI approaches.\n","arxiv_id":"http://arxiv.org/abs/2501.01209v1","authors":["Matej Mihelčić","Ivan Grubišić","Miha Keber"]},{"title":"An Efficient Attention Mechanism for Sequential Recommendation Tasks:\n HydraRec","abstract":" Transformer based models are increasingly being used in various domains\nincluding recommender systems (RS). Pretrained transformer models such as BERT\nhave shown good performance at language modelling. With the greater ability to\nmodel sequential tasks, variants of Encoder-only models (like BERT4Rec, SASRec\netc.) have found success in sequential RS problems. Computing dot-product\nattention in traditional transformer models has quadratic complexity in\nsequence length. This is a bigger problem with RS because unlike language\nmodels, new items are added to the catalogue every day. User buying history is\na dynamic sequence which depends on multiple factors. Recently, various linear\nattention models have tried to solve this problem by making the model linear in\nsequence length (token dimensions). 
Hydra attention is one such linear\ncomplexity model proposed for vision transformers which reduces the complexity\nof attention for both the number of tokens as well as model embedding\ndimensions. Building on the idea of Hydra attention, we introduce an efficient\nTransformer based Sequential RS (HydraRec) which significantly improves\ntheoretical complexity of computing attention for longer sequences and bigger\ndatasets while preserving the temporal context. Extensive experiments are\nconducted to evaluate other linear transformer-based RS models and compared\nwith HydraRec across various evaluation metrics. HydraRec outperforms other\nlinear attention-based models as well as dot-product based attention models\nwhen used with causal masking for sequential recommendation next item\nprediction tasks. For bi-directional models its performance is comparable to\nthe BERT4Rec model with an improvement in running time.\n","arxiv_id":"http://arxiv.org/abs/2501.01242v1","authors":["Uzma Mushtaque"]},{"title":"Stealthy Backdoor Attack to Real-world Models in Android Apps","abstract":" Powered by their superior performance, deep neural networks (DNNs) have found\nwidespread applications across various domains. Many deep learning (DL) models\nare now embedded in mobile apps, making them more accessible to end users\nthrough on-device DL. However, deploying on-device DL to users' smartphones\nsimultaneously introduces several security threats. One primary threat is\nbackdoor attacks. Extensive research has explored backdoor attacks for several\nyears and has proposed numerous attack approaches. However, few studies have\ninvestigated backdoor attacks on DL models deployed in the real world, or they\nhave shown obvious deficiencies in effectiveness and stealthiness. In this\nwork, we explore more effective and stealthy backdoor attacks on real-world DL\nmodels extracted from mobile apps. 
Our main justification is that imperceptible\nand sample-specific backdoor triggers generated by DNN-based steganography can\nenhance the efficacy of backdoor attacks on real-world models. We first confirm\nthe effectiveness of steganography-based backdoor attacks on four\nstate-of-the-art DNN models. Subsequently, we systematically evaluate and\nanalyze the stealthiness of the attacks to ensure they are difficult to\nperceive. Finally, we implement the backdoor attacks on real-world models and\ncompare our approach with three baseline methods. We collect 38,387 mobile\napps, extract 89 DL models from them, and analyze these models to obtain the\nprerequisite model information for the attacks. After identifying the target\nmodels, our approach achieves an average of 12.50% higher attack success rate\nthan DeepPayload while better maintaining the normal performance of the models.\nExtensive experimental results demonstrate that our method enables more\neffective, robust, and stealthy backdoor attacks on real-world models.\n","arxiv_id":"http://arxiv.org/abs/2501.01263v1","authors":["Jiali Wei","Ming Fan","Xicheng Zhang","Wenjing Jiao","Haijun Wang","Ting Liu"]},{"title":"PIMAEX: Multi-Agent Exploration through Peer Incentivization","abstract":" While exploration in single-agent reinforcement learning has been studied\nextensively in recent years, considerably less work has focused on its\ncounterpart in multi-agent reinforcement learning. To address this issue, this\nwork proposes a peer-incentivized reward function inspired by previous research\non intrinsic curiosity and influence-based rewards. The \\textit{PIMAEX} reward,\nshort for Peer-Incentivized Multi-Agent Exploration, aims to improve\nexploration in the multi-agent setting by encouraging agents to exert influence\nover each other to increase the likelihood of encountering novel states. 
We\nevaluate the \\textit{PIMAEX} reward in conjunction with\n\\textit{PIMAEX-Communication}, a multi-agent training algorithm that employs a\ncommunication channel for agents to influence one another. The evaluation is\nconducted in the \\textit{Consume/Explore} environment, a partially observable\nenvironment with deceptive rewards, specifically designed to challenge the\nexploration vs.\\ exploitation dilemma and the credit-assignment problem. The\nresults empirically demonstrate that agents using the \\textit{PIMAEX} reward\nwith \\textit{PIMAEX-Communication} outperform those that do not.\n","arxiv_id":"http://arxiv.org/abs/2501.01266v1","authors":["Michael Kölle","Johannes Tochtermann","Julian Schönberger","Gerhard Stenzel","Philipp Altmann","Claudia Linnhoff-Popien"]},{"title":"NeutraSum: A Language Model can help a Balanced Media Diet by\n Neutralizing News Summaries","abstract":" Media bias in news articles arises from the political polarisation of media\noutlets, which can reinforce societal stereotypes and beliefs. Reporting on the\nsame event often varies significantly between outlets, reflecting their\npolitical leanings through polarised language and focus. Although previous\nstudies have attempted to generate bias-free summaries from multiperspective\nnews articles, they have not effectively addressed the challenge of mitigating\ninherent media bias. To address this gap, we propose \\textbf{NeutraSum}, a\nnovel framework that integrates two neutrality losses to adjust the semantic\nspace of generated summaries, thus minimising media bias. These losses,\ndesigned to balance the semantic distances across polarised inputs and ensure\nalignment with expert-written summaries, guide the generation of neutral and\nfactually rich summaries. To evaluate media bias, we employ the political\ncompass test, which maps political leanings based on economic and social\ndimensions. 
Experimental results on the Allsides dataset demonstrate that\nNeutraSum not only improves summarisation performance but also achieves\nsignificant reductions in media bias, offering a promising approach for neutral\nnews summarisation.\n","arxiv_id":"http://arxiv.org/abs/2501.01284v1","authors":["Xi Luo","Junjie Liu","Sirong Wu","Yuhui Deng"]},{"title":"Citations and Trust in LLM Generated Responses","abstract":" Question answering systems are rapidly advancing, but their opaque nature may\nimpact user trust. We explored trust through an anti-monitoring framework,\nwhere trust is predicted to be correlated with presence of citations and\ninversely related to checking citations. We tested this hypothesis with a live\nquestion-answering experiment that presented text responses generated using a\ncommercial Chatbot along with varying citations (zero, one, or five), both\nrelevant and random, and recorded if participants checked the citations and\ntheir self-reported trust in the generated responses. We found a significant\nincrease in trust when citations were present, a result that held true even\nwhen the citations were random; we also found a significant decrease in trust\nwhen participants checked the citations. These results highlight the importance\nof citations in enhancing trust in AI-generated content.\n","arxiv_id":"http://arxiv.org/abs/2501.01303v1","authors":["Yifan Ding","Matthew Facciani","Amrit Poudel","Ellen Joyce","Salvador Aguinaga","Balaji Veeramani","Sanmitra Bhattacharya","Tim Weninger"]},{"title":"Multi-Head Explainer: A General Framework to Improve Explainability in\n CNNs and Transformers","abstract":" In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and\nmodular framework that enhances both the explainability and accuracy of\nConvolutional Neural Networks (CNNs) and Transformer-based models. 
MHEX\nconsists of three core components: an Attention Gate that dynamically\nhighlights task-relevant features, Deep Supervision that guides early layers to\ncapture fine-grained details pertinent to the target class, and an Equivalent\nMatrix that unifies refined local and global representations to generate\ncomprehensive saliency maps. Our approach demonstrates superior compatibility,\nenabling effortless integration into existing residual networks like ResNet and\nTransformer architectures such as BERT with minimal modifications. Extensive\nexperiments on benchmark datasets in medical imaging and text classification\nshow that MHEX not only improves classification accuracy but also produces\nhighly interpretable and detailed saliency scores.\n","arxiv_id":"http://arxiv.org/abs/2501.01311v2","authors":["Bohang Sun","Pietro Liò"]},{"title":"Understanding Difficult-to-learn Examples in Contrastive Learning: A\n Theoretical Framework for Spectral Contrastive Learning","abstract":" Unsupervised contrastive learning has shown significant performance\nimprovements in recent years, often approaching or even rivaling supervised\nlearning in various tasks. However, its learning mechanism is fundamentally\ndifferent from that of supervised learning. Previous works have shown that\ndifficult-to-learn examples (well-recognized in supervised learning as examples\naround the decision boundary), which are essential in supervised learning,\ncontribute minimally in unsupervised settings. In this paper, perhaps\nsurprisingly, we find that the direct removal of difficult-to-learn examples,\nalthough it reduces the sample size, can boost the downstream classification\nperformance of contrastive learning. To uncover the reasons behind this, we\ndevelop a theoretical framework modeling the similarity between different pairs\nof samples. 
Guided by this theoretical framework, we conduct a thorough\ntheoretical analysis revealing that the presence of difficult-to-learn examples\nnegatively affects the generalization of contrastive learning. Furthermore, we\ndemonstrate that the removal of these examples, together with techniques such as\nmargin tuning and temperature scaling, can enhance its generalization bounds,\nthereby improving performance. Empirically, we propose a simple and efficient\nmechanism for selecting difficult-to-learn examples and validate the effectiveness of the\naforementioned methods, which substantiates the reliability of our proposed\ntheoretical framework.\n","arxiv_id":"http://arxiv.org/abs/2501.01317v1","authors":["Yi-Ge Zhang","Jingyi Cui","Qiran Li","Yisen Wang"]},{"title":"DeepFilter: An Instrumental Baseline for Accurate and Efficient Process\n Monitoring","abstract":" Effective process monitoring is increasingly vital in industrial automation\nfor ensuring operational safety, necessitating both high accuracy and\nefficiency. Although Transformers have demonstrated success in various fields,\ntheir canonical form based on the self-attention mechanism is inadequate for\nprocess monitoring due to two primary limitations: (1) the step-wise\ncorrelations captured by the self-attention mechanism struggle to capture\ndiscriminative patterns in monitoring logs due to the limited semantics of each\nstep, thus compromising accuracy; (2) the quadratic computational complexity of\nself-attention hampers efficiency. To address these issues, we propose\nDeepFilter, a Transformer-style framework for process monitoring. The core\ninnovation is an efficient filtering layer that excels at capturing long-term and\nperiodic patterns with reduced complexity. Equipped with the global filtering\nlayer, DeepFilter enhances both accuracy and efficiency, meeting the stringent\ndemands of process monitoring. 
Experimental results on real-world process\nmonitoring datasets validate DeepFilter's superiority in terms of accuracy and\nefficiency compared to existing state-of-the-art models.\n","arxiv_id":"http://arxiv.org/abs/2501.01342v1","authors":["Hao Wang","Zhichao Chen","Licheng Pan","Xiaoyu Jiang","Yichen Song","Qunshan He","Xinggao Liu"]},{"title":"A Unified Hyperparameter Optimization Pipeline for Transformer-Based\n Time Series Forecasting Models","abstract":" Transformer-based models for time series forecasting (TSF) have attracted\nsignificant attention in recent years due to their effectiveness and\nversatility. However, these models often require extensive hyperparameter\noptimization (HPO) to achieve the best possible performance, and a unified\npipeline for HPO in transformer-based TSF remains lacking. In this paper, we\npresent one such pipeline and conduct extensive experiments on several\nstate-of-the-art (SOTA) transformer-based TSF models. These experiments are\nconducted on standard benchmark datasets to evaluate and compare the\nperformance of different models, generating practical insights and examples.\nOur pipeline is generalizable beyond transformer-based architectures and can be\napplied to other SOTA models, such as Mamba and TimeMixer, as demonstrated in\nour experiments. The goal of this work is to provide valuable guidance to both\nindustry practitioners and academic researchers in efficiently identifying\noptimal hyperparameters suited to their specific domain applications. The code\nand complete experimental results are available on GitHub.\n","arxiv_id":"http://arxiv.org/abs/2501.01394v1","authors":["Jingjing Xu","Caesar Wu","Yuan-Fang Li","Grégoire Danoy","Pascal Bouvry"]},{"title":"On Unifying Video Generation and Camera Pose Estimation","abstract":" Inspired by the emergent 3D capabilities in image generators, we explore\nwhether video generators similarly exhibit 3D awareness. 
Using\nstructure-from-motion (SfM) as a benchmark for 3D tasks, we investigate if\nintermediate features from OpenSora, a video generation model, can support\ncamera pose estimation. We first examine native 3D awareness in video\ngeneration features by routing raw intermediate outputs to SfM-prediction\nmodules like DUSt3R. Then, we explore the impact of fine-tuning on camera pose\nestimation to enhance 3D awareness. Results indicate that while video generator\nfeatures have limited inherent 3D awareness, task-specific supervision\nsignificantly boosts their accuracy for camera pose estimation, resulting in\ncompetitive performance. The proposed unified model, named JOG3R, produces\ncamera pose estimates with competitive quality without degrading video\ngeneration quality.\n","arxiv_id":"http://arxiv.org/abs/2501.01409v1","authors":["Chun-Hao Paul Huang","Jae Shin Yoon","Hyeonho Jeong","Niloy Mitra","Duygu Ceylan"]},{"title":"Balance-aware Sequence Sampling Makes Multi-modal Learning Better","abstract":" To address the modality imbalance caused by data heterogeneity, existing\nmulti-modal learning (MML) approaches primarily focus on balancing this\ndifference from the perspective of optimization objectives. However, almost all\nexisting methods ignore the impact of sample sequences, i.e., an inappropriate\ntraining order tends to trigger learning bias in the model, further\nexacerbating modality imbalance. In this paper, we propose Balance-aware\nSequence Sampling (BSS) to enhance the robustness of MML. Specifically, we\nfirst define a multi-perspective measurer to evaluate the balance degree of\neach sample. Via the evaluation, we employ a heuristic scheduler based on\ncurriculum learning (CL) that incrementally provides training subsets,\nprogressing from balanced to imbalanced samples to rebalance MML. 
Moreover,\nconsidering that sample balance may evolve as the model capability increases,\nwe propose a learning-based probabilistic sampling method to dynamically update\nthe training sequences at the epoch level, further improving MML performance.\nExtensive experiments on widely used datasets demonstrate the superiority of\nour method compared with state-of-the-art (SOTA) MML approaches.\n","arxiv_id":"http://arxiv.org/abs/2501.01470v1","authors":["Zhi-Hao Guan"]},{"title":"Augmented Contrastive Clustering with Uncertainty-Aware Prototyping for\n Time Series Test Time Adaptation","abstract":" Test-time adaptation (TTA) aims to adapt pre-trained deep neural networks using\nsolely online unlabelled test data during inference. Although TTA has shown\npromise in visual applications, its potential in time series contexts remains\nlargely unexplored. Existing TTA methods, originally designed for visual tasks,\nmay not effectively handle the complex temporal dynamics of real-world time\nseries data, resulting in suboptimal adaptation performance. To address this\ngap, we propose Augmented Contrastive Clustering with Uncertainty-aware\nPrototyping (ACCUP), a straightforward yet effective TTA method for time series\ndata. Initially, our approach employs an augmentation ensemble on the time series\ndata to capture diverse temporal information and variations, incorporating\nuncertainty-aware prototypes to distill essential characteristics.\nAdditionally, we introduce an entropy comparison scheme to selectively acquire\nmore confident predictions, enhancing the reliability of pseudo labels.\nFurthermore, we utilize augmented contrastive clustering to enhance feature\ndiscriminability and mitigate error accumulation from noisy pseudo labels,\npromoting cohesive clustering within the same class while facilitating clear\nseparation between different classes. 
Extensive experiments conducted on three\nreal-world time series datasets and an additional visual dataset demonstrate\nthe effectiveness and generalization potential of the proposed method,\nadvancing the underexplored realm of TTA for time series data.\n","arxiv_id":"http://arxiv.org/abs/2501.01472v1","authors":["Peiliang Gong","Mohamed Ragab","Min Wu","Zhenghua Chen","Yongyi Su","Xiaoli Li","Daoqiang Zhang"]},{"title":"Unraveling Indirect In-Context Learning Using Influence Functions","abstract":" This work introduces a novel paradigm for generalized In-Context Learning\n(ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore\ndemonstration selection strategies tailored for two distinct real-world\nscenarios: Mixture of Tasks and Noisy Demonstrations. We systematically\nevaluate the effectiveness of Influence Functions (IFs) as a selection tool for\nthese settings, highlighting the potential for IFs to better capture the\ninformativeness of examples within the demonstration pool. For the Mixture of\nTasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU,\nBigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining\nBertScore-Recall (BSR) with an IF surrogate model can significantly improve\nperformance, leading to average absolute accuracy gains of 0.37\\% and 1.45\\%\nfor 3-shot and 5-shot setups when compared to traditional ICL metrics. In the\nNoisy Demonstrations setting, we examine scenarios where demonstrations might\nbe mislabeled. Our experiments show that reweighting traditional ICL selectors\n(BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an\naverage of 2.90\\% for Cosine Similarity and 2.94\\% for BSR on noisy GLUE\nbenchmarks. 
In sum, we propose a robust framework for demonstration selection\nthat generalizes beyond traditional ICL, offering valuable insights into the\nrole of IFs for Indirect ICL.\n","arxiv_id":"http://arxiv.org/abs/2501.01473v1","authors":["Hadi Askari","Shivanshu Gupta","Terry Tong","Fei Wang","Anshuman Chhabra","Muhao Chen"]},{"title":"A Survey of Deep Learning Methods in Protein Bioinformatics and its\n Impact on Protein Design","abstract":" Proteins are sequences of amino acids that serve as the basic building blocks\nof living organisms. Despite rapidly growing databases documenting structural\nand functional information for various protein sequences, our understanding of\nproteins remains limited because of the large possible sequence space and the\ncomplex inter- and intra-molecular forces. Deep learning, which is\ncharacterized by its ability to learn relevant features directly from large\ndatasets, has demonstrated remarkable performance in fields such as computer\nvision and natural language processing. It has also been increasingly applied\nin recent years to the data-rich domain of protein sequences with great\nsuccess, most notably with AlphaFold2's breakout performance in protein\nstructure prediction. The performance improvements achieved by deep learning\nunlock new possibilities in the field of protein bioinformatics, including\nprotein design, one of the most difficult but useful tasks. In this paper, we\nbroadly categorize problems in protein bioinformatics into three main\ncategories: 1) structural prediction, 2) functional prediction, and 3) protein\ndesign, and review the progress achieved from using deep learning methodologies\nin each of them. We expand on the main challenges of the protein design problem\nand highlight how advances in structural and functional prediction have\ndirectly contributed to design tasks. 
Finally, we conclude by identifying\nimportant topics and future research directions.\n","arxiv_id":"http://arxiv.org/abs/2501.01477v1","authors":["Weihang Dai"]},{"title":"Drift2Matrix: Kernel-Induced Self Representation for Concept Drift\n Adaptation in Co-evolving Time Series","abstract":" In the realm of time series analysis, tackling the phenomenon of concept\ndrift poses a significant challenge. Concept drift, characterized by the\nevolving statistical properties of time series data, affects the reliability\nand accuracy of conventional analysis models. This is particularly evident in\nco-evolving scenarios where interactions among variables are crucial. This\npaper presents Drift2Matrix, a novel framework that leverages kernel-induced\nself-representation for adaptive responses to concept drift in time series.\nDrift2Matrix employs a kernel-based learning mechanism to generate a\nrepresentation matrix, encapsulating the inherent dynamics of co-evolving time\nseries. This matrix serves as a key tool for identifying and adapting to\nconcept drift by observing its temporal variations. Furthermore, Drift2Matrix\neffectively identifies prevailing patterns and offers insights into emerging\ntrends through pattern evolution analysis. Our empirical evaluation of\nDrift2Matrix across various datasets demonstrates its effectiveness in handling\nthe complexities of concept drift. This approach introduces a novel perspective\nin the theoretical domain of co-evolving time series analysis, enhancing\nadaptability and accuracy in the face of dynamic data environments.\n","arxiv_id":"http://arxiv.org/abs/2501.01480v2","authors":["Kunpeng Xu","Lifei Chen","Shengrui Wang"]},{"title":"BoxingGym: Benchmarking Progress in Automated Experimental Design and\n Model Discovery","abstract":" Understanding the world and explaining it with scientific theories is a\ncentral aspiration of artificial intelligence research. 
Proposing theories,\ndesigning experiments to test them, and then revising them based on data are\nfundamental to scientific discovery. Despite the significant promise of\nLLM-based scientific agents, no benchmarks systematically test LLM's ability to\npropose scientific models, collect experimental data, and revise them in light\nof new data. We introduce BoxingGym, a benchmark with 10 environments for\nsystematically evaluating both experimental design (e.g. collecting data to\ntest a scientific theory) and model discovery (e.g. proposing and revising\nscientific theories). To enable tractable and quantitative evaluation, we\nimplement each environment as a generative probabilistic model with which a\nscientific agent can run interactive experiments. These probabilistic models\nare drawn from various real-world scientific domains ranging from psychology to\necology. To quantitatively evaluate a scientific agent's ability to collect\ninformative experimental data, we compute the expected information gain (EIG),\nan information-theoretic quantity which measures how much an experiment reduces\nuncertainty about the parameters of a generative model. A good scientific\ntheory is a concise and predictive explanation. Therefore, to quantitatively\nevaluate model discovery, we ask a scientific agent to explain their model and\nthen assess whether this explanation enables another scientific agent to make\nreliable predictions about this environment. In addition to this\nexplanation-based evaluation, we compute standard model evaluation metrics such\nas prediction errors. We find that current LLMs, such as GPT-4o, struggle with\nboth experimental design and model discovery. We find that augmenting the\nLLM-based agent with an explicit statistical model does not reliably improve\nthese results.\n","arxiv_id":"http://arxiv.org/abs/2501.01540v1","authors":["Kanishk Gandhi","Michael Y. Li","Lyle Goodyear","Louise Li","Aditi Bhaskar","Mohammed Zaman","Noah D. 
Goodman"]},{"title":"Constructing and explaining machine learning models for chemistry:\n example of the exploration and design of boron-based Lewis acids","abstract":" The integration of machine learning (ML) into chemistry offers transformative\npotential in the design of molecules with targeted properties. However, the\nfocus has often been on creating highly efficient predictive models, sometimes\nat the expense of interpretability. In this study, we leverage explainable AI\ntechniques to explore the rational design of boron-based Lewis acids, which\nplay a pivotal role in organic reactions due to their electron-accepting\nproperties. Using Fluoride Ion Affinity as a proxy for Lewis acidity, we\ndeveloped interpretable ML models based on chemically meaningful descriptors,\nincluding ab initio computed features and substituent-based parameters derived\nfrom the Hammett linear free-energy relationship. By constraining the chemical\nspace to well-defined molecular scaffolds, we achieved highly accurate\npredictions (mean absolute error \u003c 6 kJ/mol), surpassing conventional black-box\ndeep learning models in low-data regimes. Interpretability analyses of the\nmodels shed light on the origin of Lewis acidity in these compounds and\nidentified actionable levers to modulate it through the nature and positioning\nof substituents on the molecular scaffold. This work bridges ML and chemists'\nway of thinking, demonstrating how explainable models can inspire molecular\ndesign and enhance scientific understanding of chemical reactivity.\n","arxiv_id":"http://arxiv.org/abs/2501.01576v2","authors":["Juliette Fenogli","Laurence Grimaud","Rodolphe Vuilleumier"]},{"title":"(WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering:\n Methodology, Results, and Challenges","abstract":" Large Language Models (LLMs) have become essential tools across various\ndomains due to their impressive capabilities in understanding and generating\nhuman-like text. 
The ability to accurately answer multiple-choice questions\n(MCQs) holds significant value in education, particularly in automated tutoring\nsystems and assessment platforms. However, adapting LLMs to handle MCQ tasks\neffectively remains challenging due to hallucinations and unclear prompts.\nThis work explores the potential of Microsoft's PHI-3\\cite{Abdin2024}, a\ncompact yet efficient LLM, for MCQ answering. Our contributions include\nfine-tuning the model on the TruthfulQA dataset, designing optimized prompts to\nenhance model performance, and evaluating the model using perplexity and\ntraditional metrics like accuracy and F1 score. Results show a remarkable improvement in\nPHI-3.5's MCQ handling post-fine-tuning, with perplexity decreasing from 4.68\nto 2.27, and accuracy rising from 62\\% to 90.8\\%. This research underlines the\nimportance of efficient models in adaptive learning systems and educational\nassessments, paving the way for broader integration into the classroom,\nparticularly in fields like test preparation, student feedback, and\npersonalized learning.\n","arxiv_id":"http://arxiv.org/abs/2501.01588v1","authors":["Mohamed Hisham Abdellatif"]},{"title":"Prism: Mining Task-aware Domains in Non-i.i.d. IMU Data for Flexible\n User Perception","abstract":" A wide range of user perception applications leverage inertial measurement\nunit (IMU) data for online prediction. However, restricted by the non-i.i.d.\nnature of IMU data collected from mobile devices, most systems work well only\nin a controlled setting (e.g., for a specific user in particular postures),\nlimiting application scenarios. Achieving uncontrolled online prediction on\nmobile devices, referred to as the flexible user perception (FUP) problem, is\nattractive but hard. In this paper, we propose a novel scheme, called Prism,\nwhich can obtain high FUP accuracy on mobile devices. 
The core of Prism is to\ndiscover task-aware domains embedded in the IMU dataset, and to train a\ndomain-aware model on each identified domain. To this end, we design an\nexpectation-maximization (EM) algorithm to estimate latent domains with respect\nto the specific downstream perception task. Finally, the best-fit model can be\nautomatically selected for use by comparing the test sample and all identified\ndomains in the feature space. We implement Prism on various mobile devices and\nconduct extensive experiments. Results demonstrate that Prism can achieve the\nbest FUP performance with low latency.\n","arxiv_id":"http://arxiv.org/abs/2501.01598v1","authors":["Yunzhe Li","Facheng Hu","Hongzi Zhu","Quan Liu","Xiaoke Zhao","Jiangang Shen","Shan Chang","Minyi Guo"]},{"title":"Few-shot Implicit Function Generation via Equivariance","abstract":" Implicit Neural Representations (INRs) have emerged as a powerful framework\nfor representing continuous signals. However, generating diverse INR weights\nremains challenging due to limited training data. We introduce Few-shot\nImplicit Function Generation, a new problem setup that aims to generate diverse\nyet functionally consistent INR weights from only a few examples. This is\nchallenging because even for the same signal, the optimal INRs can vary\nsignificantly depending on their initializations. To tackle this, we propose\nEquiGen, a framework that can generate new INRs from limited data. The core\nidea is that functionally similar networks can be transformed into one another\nthrough weight permutations, forming an equivariance group. By projecting these\nweights into an equivariant latent space, we enable diverse generation within\nthese groups, even with few examples. EquiGen implements this through an\nequivariant encoder trained via contrastive learning and smooth augmentation,\nan equivariance-guided diffusion process, and controlled perturbations in the\nequivariant subspace. 
Experiments on 2D image and 3D shape INR datasets\ndemonstrate that our approach effectively generates diverse INR weights while\npreserving their functional properties in few-shot scenarios.\n","arxiv_id":"http://arxiv.org/abs/2501.01601v1","authors":["Suizhi Huang","Xingyi Yang","Hongtao Lu","Xinchao Wang"]},{"title":"Google is all you need: Semi-Supervised Transfer Learning Strategy For\n Light Multimodal Multi-Task Classification Model","abstract":" As the volume of digital image data increases, the effectiveness of image\nclassification intensifies. This study introduces a robust multi-label\nclassification system designed to assign multiple labels to a single image,\naddressing the complexity of images that may be associated with multiple\ncategories (ranging from 1 to 19, excluding 12). We propose a multi-modal\nclassifier that merges advanced image recognition algorithms with Natural\nLanguage Processing (NLP) models, incorporating a fusion module to integrate\nthese distinct modalities. The purpose of integrating textual data is to\nenhance the accuracy of label prediction by providing contextual understanding\nthat visual analysis alone cannot fully capture. Our proposed classification\nmodel combines Convolutional Neural Networks (CNN) for image processing with\nNLP techniques for analyzing textual description (i.e., captions). This\napproach includes rigorous training and validation phases, with each model\ncomponent verified and analyzed through ablation experiments. 
Preliminary\nresults demonstrate the classifier's accuracy and efficiency, highlighting its\npotential as an automatic image-labeling system.\n","arxiv_id":"http://arxiv.org/abs/2501.01611v1","authors":["Haixu Liu","Penghao Jiang","Zerui Tao"]},{"title":"Merging Context Clustering with Visual State Space Models for Medical\n Image Segmentation","abstract":" Medical image segmentation demands the aggregation of global and local\nfeature representations, posing a challenge for current methodologies in\nhandling both long-range and short-range feature interactions. Recently, vision\nmamba (ViM) models have emerged as promising solutions for addressing model\ncomplexities by excelling in long-range feature interactions with linear\ncomplexity. However, existing ViM approaches overlook the importance of\npreserving short-range local dependencies by directly flattening spatial tokens\nand are constrained by fixed scanning patterns that limit the capture of\ndynamic spatial context information. To address these challenges, we introduce\na simple yet effective method named context clustering ViM (CCViM), which\nincorporates a context clustering module within the existing ViM models to\nsegment image tokens into distinct windows for adaptable local clustering. Our\nmethod effectively combines long-range and short-range feature interactions,\nthereby enhancing spatial contextual representations for medical image\nsegmentation tasks. Extensive experimental evaluations on diverse public\ndatasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse, demonstrate the\nsuperior performance of our method compared to current state-of-the-art\nmethods. 
Our code can be found at https://github.com/zymissy/CCViM.\n","arxiv_id":"http://arxiv.org/abs/2501.01618v1","authors":["Yun Zhu","Dong Zhang","Yi Lin","Yifei Feng","Jinhui Tang"]},{"title":"ICPC: In-context Prompt Compression with Faster Inference","abstract":" Despite the recent success of Large Language Models (LLMs), it remains\nchallenging to feed LLMs with long prompts due to the fixed size of LLM inputs.\nAs a remedy, prompt compression becomes a promising solution by removing\nredundant tokens in the prompt. However, using LLM in the existing works\nrequires additional computation resources and leads to memory overheads. To\naddress it, we propose ICPC (In-context Prompt Compression), a novel and\nscalable prompt compression method that adaptively reduces the prompt length.\nThe key idea of ICPC is to calculate the probability of each word appearing in\nthe prompt using encoders and calculate information carried by each word\nthrough the information function, which effectively reduces the information\nloss during prompt compression and increases the speed of compression.\nEmpirically, we demonstrate that ICPC can effectively compress long texts of\ndifferent categories and thus achieve better performance and speed on different\ntypes of NLP tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.01625v1","authors":["Ziyang Yu","Yuyu Liu"]},{"title":"Implications of Artificial Intelligence on Health Data Privacy and\n Confidentiality","abstract":" The rapid integration of artificial intelligence (AI) in healthcare is\nrevolutionizing medical diagnostics, personalized medicine, and operational\nefficiency. However, alongside these advancements, significant challenges arise\nconcerning patient data privacy, ethical considerations, and regulatory\ncompliance. This paper examines the dual impact of AI on healthcare,\nhighlighting its transformative potential and the critical need for\nsafeguarding sensitive health information. 
It explores the role of the Health\nInsurance Portability and Accountability Act (HIPAA) as a regulatory framework\nfor ensuring data privacy and security, emphasizing the importance of robust\nsafeguards and ethical standards in AI-driven healthcare. Through case studies,\nincluding AI applications in diabetic retinopathy, oncology, and the\ncontroversies surrounding data sharing, this study underscores the ethical and\nlegal complexities of AI implementation. A balanced approach that fosters\ninnovation while maintaining patient trust and privacy is imperative. The\nfindings emphasize the importance of continuous education, transparency, and\nadherence to regulatory frameworks to harness AI's full potential responsibly\nand ethically in healthcare.\n","arxiv_id":"http://arxiv.org/abs/2501.01639v2","authors":["Ahmad Momani"]},{"title":"HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long\n Video Understanding","abstract":" Multimodal large language models have become a popular topic in deep visual\nunderstanding due to many promising real-world applications. However, hour-long\nvideo understanding, spanning over one hour and containing tens of thousands of\nvisual frames, remains under-explored because of 1) challenging long-term video\nanalyses, 2) inefficient large-model approaches, and 3) lack of large-scale\nbenchmark datasets. Among them, in this paper, we focus on building a\nlarge-scale hour-long long video benchmark, HLV-1K, designed to evaluate long\nvideo understanding models. HLV-1K comprises 1009 hour-long videos with 14,847\nhigh-quality question answering (QA) and multi-choice question answering (MCQA)\npairs with time-aware queries and diverse annotations, covering frame-level,\nwithin-event-level, cross-event-level, and long-term reasoning tasks. 
We\nevaluate our benchmark using existing state-of-the-art methods and demonstrate\nits value for testing deep long video understanding capabilities at different\nlevels and for various tasks. This includes promoting future long video\nunderstanding tasks at a granular level, such as deep understanding of long\nlive videos, meeting recordings, and movies.\n","arxiv_id":"http://arxiv.org/abs/2501.01645v1","authors":["Heqing Zou","Tianze Luo","Guiyang Xie"," Victor"," Zhang","Fengmao Lv","Guangcong Wang","Junyang Chen","Zhuochen Wang","Hansheng Zhang","Huaijian Zhang"]},{"title":"AVATAR: Adversarial Autoencoders with Autoregressive Refinement for Time\n Series Generation","abstract":" Data augmentation can significantly enhance the performance of machine\nlearning tasks by addressing data scarcity and improving generalization.\nHowever, generating time series data presents unique challenges. A model must\nnot only learn a probability distribution that reflects the real data\ndistribution but also capture the conditional distribution at each time step to\npreserve the inherent temporal dependencies. To address these challenges, we\nintroduce AVATAR, a framework that combines Adversarial Autoencoders (AAE) with\nAutoregressive Learning to achieve both objectives. Specifically, our technique\nintegrates the autoencoder with a supervisor and introduces a novel supervised\nloss to assist the decoder in learning the temporal dynamics of time series\ndata. Additionally, we propose another innovative loss function, termed\ndistribution loss, to guide the encoder in more efficiently aligning the\naggregated posterior of the autoencoder's latent representation with a prior\nGaussian distribution. Furthermore, our framework employs a joint training\nmechanism to simultaneously train all networks using a combined loss, thereby\nfulfilling the dual objectives of time series generation. 
We evaluate our\ntechnique across a variety of time series datasets with diverse\ncharacteristics. Our experiments demonstrate significant improvements in both\nthe quality and practical utility of the generated data, as assessed by various\nqualitative and quantitative metrics.\n","arxiv_id":"http://arxiv.org/abs/2501.01649v1","authors":["MohammadReza EskandariNasab","Shah Muhammad Hamdi","Soukaina Filali Boubrahimi"]},{"title":"EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical\n image segmentation","abstract":" Weakly-supervised medical image segmentation is gaining traction as it\nrequires only rough annotations rather than accurate pixel-to-pixel labels,\nthereby reducing the workload for specialists. Although some progress has been\nmade, there is still a considerable performance gap between the label-efficient\nmethods and fully-supervised ones, which can be attributed to the uncertain\nnature of these weak labels. To address this issue, we propose a novel weak\nannotation method coupled with its learning framework EAUWSeg to eliminate the\nannotation uncertainty. Specifically, we first propose the Bounded Polygon\nAnnotation (BPAnno) by simply labeling two polygons for a lesion. Then, a\ntailored learning mechanism that explicitly treats bounded polygons as two\nseparate annotations is proposed to learn invariant features by providing\nan adversarial supervision signal for model training. Subsequently, a\nconfidence-auxiliary consistency learner incorporating a\nclassification-guided confidence generator is designed to provide a reliable\nsupervision signal for pixels in uncertain regions by leveraging the feature\nrepresentation consistency across pixels within the same category as well as\nclass-specific information encapsulated in bounded polygon annotations.\nExperimental results demonstrate that EAUWSeg outperforms existing\nweakly-supervised segmentation methods. 
Furthermore, compared to\nfully-supervised counterparts, the proposed method not only delivers superior\nperformance but also requires a much smaller annotation workload. This underscores the\nsuperiority and effectiveness of our approach.\n","arxiv_id":"http://arxiv.org/abs/2501.01658v1","authors":["Wang Lituan","Zhang Lei","Wang Yan","Wang Zhenbin","Zhang Zhenwei","Zhang Yi"]},{"title":"BARTPredict: Empowering IoT Security with LLM-Driven Cyber Threat\n Prediction","abstract":" The integration of Internet of Things (IoT) technology in various domains has\nled to operational advancements, but it has also introduced new vulnerabilities\nto cybersecurity threats, as evidenced by recent widespread cyberattacks on IoT\ndevices. Intrusion detection systems are often reactive, triggered by specific\npatterns or anomalies observed within the network. To address this challenge,\nthis work proposes a proactive approach to anticipate and preemptively mitigate\nmalicious activities, aiming to prevent potential damage before it occurs. This\npaper proposes an innovative intrusion prediction framework empowered by\nPre-trained Large Language Models (LLMs). The framework incorporates two LLMs:\na fine-tuned Bidirectional and AutoRegressive Transformers (BART) model for\npredicting network traffic and a fine-tuned Bidirectional Encoder\nRepresentations from Transformers (BERT) model for evaluating the predicted\ntraffic. By harnessing the bidirectional capabilities of BART, the framework\nthen identifies malicious packets among these predictions. 
Evaluated using the\nCICIoT2023 IoT attack dataset, our framework showcases a notable enhancement in\npredictive performance, attaining an impressive 98% overall accuracy, providing\na powerful response to the cybersecurity challenges that confront IoT networks.\n","arxiv_id":"http://arxiv.org/abs/2501.01664v1","authors":["Alaeddine Diaf","Abdelaziz Amara Korba","Nour Elislem Karabadji","Yacine Ghamri-Doudane"]},{"title":"Adaptive Few-shot Prompting for Machine Translation with Pre-trained\n Language Models","abstract":" Recently, Large language models (LLMs) with in-context learning have\ndemonstrated remarkable potential in handling neural machine translation.\nHowever, existing evidence shows that LLMs are prompt-sensitive and it is\nsub-optimal to apply the fixed prompt to any input for downstream machine\ntranslation tasks. To address this issue, we propose an adaptive few-shot\nprompting (AFSP) framework to automatically select suitable translation\ndemonstrations for various source input sentences to further elicit the\ntranslation capability of an LLM for better machine translation. First, we\nbuild a translation demonstration retrieval module based on LLM's embedding to\nretrieve top-k semantic-similar translation demonstrations from aligned\nparallel translation corpus. Rather than using other embedding models for\nsemantic demonstration retrieval, we build a hybrid demonstration retrieval\nmodule based on the embedding layer of the deployed LLM to build better input\nrepresentation for retrieving more semantic-related translation demonstrations.\nThen, to ensure better semantic consistency between source inputs and target\noutputs, we force the deployed LLM itself to generate multiple output\ncandidates in the target language with the help of translation demonstrations\nand rerank these candidates. 
Besides, to better evaluate the effectiveness of\nour AFSP framework on the latest language and extend the research boundary of\nneural machine translation, we construct a high-quality diplomatic\nChinese-English parallel dataset that consists of 5,528 parallel\nChinese-English sentences. Finally, extensive experiments on the proposed\ndiplomatic Chinese-English parallel dataset and the United Nations Parallel\nCorpus (Chinese-English part) show the effectiveness and superiority of our\nproposed AFSP.\n","arxiv_id":"http://arxiv.org/abs/2501.01679v1","authors":["Lei Tang","Jinghui Qin","Wenxuan Ye","Hao Tan","Zhijing Yang"]},{"title":"VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer\n for Video-based Remote Physiological Measurement","abstract":" Remote physiological signal measurement based on facial videos, also known as\nremote photoplethysmography (rPPG), involves predicting changes in facial\nvascular blood flow from facial videos. While most deep learning-based methods\nhave achieved good results, they often struggle to balance performance across\nsmall and large-scale datasets due to the inherent limitations of convolutional\nneural networks (CNNs) and Transformer. In this paper, we introduce VidFormer,\na novel end-to-end framework that integrates 3-Dimension Convolutional Neural\nNetwork (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an\nanalysis of the traditional skin reflection model and subsequently introduce an\nenhanced model for the reconstruction of rPPG signals. Based on this improved\nmodel, VidFormer utilizes 3DCNN and Transformer to extract local and global\nfeatures from input data, respectively. To enhance the spatiotemporal feature\nextraction capabilities of VidFormer, we incorporate temporal-spatial attention\nmechanisms tailored for both 3DCNN and Transformer. Additionally, we design a\nmodule to facilitate information exchange and fusion between the 3DCNN and\nTransformer. 
Our evaluation on five publicly available datasets demonstrates\nthat VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we\ndiscuss the essential roles of each VidFormer module and examine the effects of\nethnicity, makeup, and exercise on its performance.\n","arxiv_id":"http://arxiv.org/abs/2501.01691v2","authors":["Jiachen Li","Shisheng Guo","Longzhen Tang","Cuolong Cui","Lingjiang Kong","Xiaobo Yang"]},{"title":"The Essence of Contextual Understanding in Theory of Mind: A Study on\n Question Answering with Story Characters","abstract":" Theory-of-Mind (ToM) is a fundamental psychological capability that allows\nhumans to understand and interpret the mental states of others. Humans infer\nothers' thoughts by integrating causal cues and indirect clues from broad\ncontextual information, often derived from past interactions. In other words,\nhuman ToM heavily relies on the understanding about the backgrounds and life\nstories of others. Unfortunately, this aspect is largely overlooked in existing\nbenchmarks for evaluating machines' ToM capabilities, due to their usage of\nshort narratives without global backgrounds. In this paper, we verify the\nimportance of understanding long personal backgrounds in ToM and assess the\nperformance of LLMs in such realistic evaluation scenarios. To achieve this, we\nintroduce a novel benchmark, CharToM-QA, comprising 1,035 ToM questions based\non characters from classic novels. Our human study reveals a significant\ndisparity in performance: the same group of educated participants performs\ndramatically better when they have read the novels compared to when they have\nnot. In parallel, our experiments on state-of-the-art LLMs, including the very\nrecent o1 model, show that LLMs still perform notably worse than humans,\ndespite that they have seen these stories during pre-training. 
This highlights\nthe limitations of current LLMs in capturing the nuanced contextual information\nrequired for ToM reasoning.\n","arxiv_id":"http://arxiv.org/abs/2501.01705v1","authors":["Chulun Zhou","Qiujing Wang","Mo Yu","Xiaoqian Yue","Rui Lu","Jiangnan Li","Yifan Zhou","Shunchi Zhang","Jie Zhou","Wai Lam"]},{"title":"MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders","abstract":" Visual encoders are fundamental components in vision-language models (VLMs),\neach showcasing unique strengths derived from various pre-trained visual\nfoundation models. To leverage the various capabilities of these encoders,\nrecent studies incorporate multiple encoders within a single VLM, leading to a\nconsiderable increase in computational cost. In this paper, we present\nMixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework\nthat distills the unique proficiencies of multiple vision encoders into a\nsingle, efficient encoder model. Specifically, to mitigate conflicts and retain\nthe unique characteristics of each teacher encoder, we employ low-rank\nadaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate\nspecialized knowledge based on input features, enhancing both adaptability and\nefficiency. To regularize the KD process and enhance performance, we propose an\nattention-based distillation strategy that adaptively weighs the different\nvisual encoders and emphasizes valuable visual tokens, reducing the burden of\nreplicating comprehensive but distinct features from multiple teachers.\nComprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT,\nvalidate the effectiveness of our method. 
The code will be released.\n","arxiv_id":"http://arxiv.org/abs/2501.01709v1","authors":["Jiajun Cao","Yuan Zhang","Tao Huang","Ming Lu","Qizhe Zhang","Ruichuan An","Ningning MA","Shanghang Zhang"]},{"title":"LLMs \u0026 Legal Aid: Understanding Legal Needs Exhibited Through User\n Queries","abstract":" The paper presents a preliminary analysis of an experiment conducted by Frank\nBold, a Czech expert group, to explore user interactions with GPT-4 for\naddressing legal queries. Between May 3, 2023, and July 25, 2023, 1,252 users\nsubmitted 3,847 queries. Unlike studies that primarily focus on the accuracy,\nfactuality, or hallucination tendencies of large language models (LLMs), our\nanalysis focuses on the user query dimension of the interaction. Using GPT-4o\nfor zero-shot classification, we categorized queries on (1) whether users\nprovided factual information about their issue (29.95%) or not (70.05%), (2)\nwhether they sought legal information (64.93%) or advice on the course of\naction (35.07\\%), and (3) whether they imposed requirements to shape or control\nthe model's answer (28.57%) or not (71.43%). We provide both quantitative and\nqualitative insight into user needs and contribute to a better understanding of\nuser engagement with LLMs.\n","arxiv_id":"http://arxiv.org/abs/2501.01711v1","authors":["Michal Kuk","Jakub Harasta"]},{"title":"Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal\n Reinforcement Learning","abstract":" Humanoid robots must master numerous tasks with sparse rewards, posing a\nchallenge for reinforcement learning (RL). We propose a method combining RL and\nautomated planning to address this. Our approach uses short goal-conditioned\npolicies (GCPs) organized hierarchically, with Monte Carlo Tree Search (MCTS)\nplanning using high-level actions (HLAs). Instead of primitive actions, the\nplanning process generates HLAs. 
A single plan-tree, maintained during the\nagent's lifetime, holds knowledge about goal achievement. This hierarchy\nenhances sample efficiency and speeds up reasoning by reusing HLAs and\nanticipating future actions. Our Hierarchical Goal-Conditioned Policy Planning\n(HGCPP) framework uniquely integrates GCPs, MCTS, and hierarchical RL,\npotentially improving exploration and planning in complex tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.01727v1","authors":["Gavin B. Rens"]},{"title":"Augmentation Matters: A Mix-Paste Method for X-Ray Prohibited Item\n Detection under Noisy Annotations","abstract":" Automatic X-ray prohibited item detection is vital for public safety.\nExisting deep learning-based methods all assume that the annotations of\ntraining X-ray images are correct. However, obtaining correct annotations is\nextremely hard if not impossible for large-scale X-ray images, where item\noverlapping is ubiquitous.As a result, X-ray images are easily contaminated\nwith noisy annotations, leading to performance deterioration of existing\nmethods.In this paper, we address the challenging problem of training a robust\nprohibited item detector under noisy annotations (including both category noise\nand bounding box noise) from a novel perspective of data augmentation, and\npropose an effective label-aware mixed patch paste augmentation method\n(Mix-Paste). Specifically, for each item patch, we mix several item patches\nwith the same category label from different images and replace the original\npatch in the image with the mixed patch. In this way, the probability of\ncontaining the correct prohibited item within the generated image is increased.\nMeanwhile, the mixing process mimics item overlapping, enabling the model to\nlearn the characteristics of X-ray images. 
Moreover, we design an item-based\nlarge-loss suppression (LLS) strategy to suppress the large losses\ncorresponding to potentially positive predictions of additional items due to\nthe mixing operation. We show the superiority of our method on X-ray datasets\nunder noisy annotations. In addition, we evaluate our method on the noisy\nMS-COCO dataset to showcase its generalization ability. These results clearly\nindicate the great potential of data augmentation to handle noisy annotations.\nThe source code is released at https://github.com/wscds/Mix-Paste.\n","arxiv_id":"http://arxiv.org/abs/2501.01733v1","authors":["Ruikang Chen","Yan Yan","Jing-Hao Xue","Yang Lu","Hanzi Wang"]},{"title":"Automating Legal Concept Interpretation with LLMs: Retrieval,\n Generation, and Evaluation","abstract":" Legal articles often include vague concepts to adapt to the ever-changing\nsociety. Providing detailed interpretations of these concepts is a critical\ntask for legal practitioners, which requires meticulous and professional\nannotations by legal experts, admittedly time-consuming and expensive to\ncollect at scale. In this paper, we introduce a novel retrieval-augmented\ngeneration framework, ATRI, for AuTomatically Retrieving relevant information\nfrom past judicial precedents and Interpreting vague legal concepts. We further\npropose a new benchmark, Legal Concept Entailment, to automate the evaluation\nof generated concept interpretations without expert involvement. Automatic\nevaluations indicate that our generated interpretations can effectively assist\nlarge language models (LLMs) in understanding vague legal concepts.\nMulti-faceted evaluations by legal experts indicate that the quality of our\nconcept interpretations is comparable to those written by human experts. 
Our\nwork has strong implications for leveraging LLMs to support legal practitioners\nin interpreting vague legal concepts and beyond.\n","arxiv_id":"http://arxiv.org/abs/2501.01743v1","authors":["Kangcheng Luo","Quzhe Huang","Cong Jiang","Yansong Feng"]},{"title":"Creating Artificial Students that Never Existed: Leveraging Large\n Language Models and CTGANs for Synthetic Data Generation","abstract":" In this study, we explore the growing potential of AI and deep learning\ntechnologies, particularly Generative Adversarial Networks (GANs) and Large\nLanguage Models (LLMs), for generating synthetic tabular data. Access to\nquality student data is critical for advancing learning analytics, but privacy\nconcerns and stricter data protection regulations worldwide limit its\navailability and usage. Synthetic data offers a promising alternative. We\ninvestigate whether synthetic data can be leveraged to create artificial\nstudents for serving learning analytics models. Using the popular GAN model\nCTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic\ntabular student data. Our results demonstrate the strong potential of these\nmethods to produce high-quality synthetic datasets that resemble real student\ndata. To validate our findings, we apply a comprehensive set of utility\nevaluation metrics to assess the statistical and predictive performance of the\nsynthetic data and compare the different generator models used, especially the\nperformance of LLMs. 
Our study aims to provide the learning analytics community\nwith valuable insights into the use of synthetic data, laying the groundwork\nfor expanding the field's methodological toolbox with innovative approaches\nfor learning analytics data generation.\n","arxiv_id":"http://arxiv.org/abs/2501.01793v1","authors":["Mohammad Khalil","Farhad Vadiee","Ronas Shakya","Qinyi Liu"]},{"title":"End-to-End Long Document Summarization using Gradient Caching","abstract":" Training transformer-based encoder-decoder models for long document\nsummarization poses a significant challenge due to the quadratic memory\nconsumption during training. Several approaches have been proposed to extend\nthe input length at test time, but training with these approaches is still\ndifficult, requiring truncation of input documents and causing a mismatch\nbetween training and test conditions. In this work, we propose CachED (Gradient\n$\textbf{Cach}$ing for $\textbf{E}$ncoder-$\textbf{D}$ecoder models), an\napproach that enables end-to-end training of existing transformer-based\nencoder-decoder models, using the entire document without truncation.\nSpecifically, we apply non-overlapping sliding windows to input documents,\nfollowed by fusion in the decoder. During backpropagation, the gradients are cached\nat the decoder and are passed through the encoder in chunks by re-computing the\nhidden vectors, similar to gradient checkpointing. In the experiments on long\ndocument summarization, we extend BART to CachED BART, processing more than\n500K tokens during training and achieving superior performance without using\nany additional parameters.\n","arxiv_id":"http://arxiv.org/abs/2501.01805v1","authors":["Rohit Saxena","Hao Tang","Frank Keller"]},{"title":"SDPO: Segment-Level Direct Preference Optimization for Social Agents","abstract":" Social agents powered by large language models (LLMs) can simulate human\nsocial behaviors but fall short in handling complex goal-oriented social\ndialogues. 
Direct Preference Optimization (DPO) has proven effective in\naligning LLM behavior with human preferences across a variety of agent tasks.\nExisting DPO-based approaches for multi-turn interactions are divided into\nturn-level and session-level methods. The turn-level method is overly\nfine-grained, focusing exclusively on individual turns, while session-level\nmethods are too coarse-grained, often introducing training noise. To address\nthese limitations, we propose Segment-Level Direct Preference Optimization\n(SDPO), which focuses on specific key segments within interactions to optimize\nmulti-turn agent behavior while minimizing training noise. Evaluations on the\nSOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform\nboth existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring\nSDPO's potential to advance the social intelligence of LLM-based agents. We\nrelease our code and data at\nhttps://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.\n","arxiv_id":"http://arxiv.org/abs/2501.01821v1","authors":["Aobo Kong","Wentao Ma","Shiwan Zhao","Yongbin Li","Yuchuan Wu","Ke Wang","Xiaoqian Liu","Qicheng Li","Yong Qin","Fei Huang"]},{"title":"The Proof is in the Almond Cookies","abstract":" This paper presents a case study on how to process cooking recipes (and more\ngenerally, how-to instructions) in a way that makes it possible for a robot or\nartificial cooking assistant to support human chefs in the kitchen. Such AI\nassistants would be of great benefit to society, as they can help to sustain\nthe autonomy of aging adults or people with a physical impairment, or they may\nreduce the stress in a professional kitchen. We propose a novel approach to\ncomputational recipe understanding that mimics the human sense-making process,\nwhich is narrative-based. 
Using an English recipe for almond crescent cookies\nas illustration, we show how recipes can be modelled as rich narrative\nstructures by integrating various knowledge sources such as language\nprocessing, ontologies, and mental simulation. We show how such narrative\nstructures can be used for (a) dealing with the challenges of recipe language,\nsuch as zero anaphora, (b) optimizing a robot's planning process, (c) measuring\nhow well an AI system understands its current tasks, and (d) allowing recipe\nannotations to become language-independent.\n","arxiv_id":"http://arxiv.org/abs/2501.01827v1","authors":["Remi van Trijp","Katrien Beuls","Paul Van Eecke"]},{"title":"MoColl: Agent-Based Specific and General Model Collaboration for Image\n Captioning","abstract":" Image captioning is a critical task at the intersection of computer vision\nand natural language processing, with wide-ranging applications across various\ndomains. For complex tasks such as diagnostic report generation, deep learning\nmodels require not only domain-specific image-caption datasets but also the\nincorporation of relevant general knowledge to provide contextual accuracy.\nExisting approaches exhibit inherent limitations: specialized models excel in\ncapturing domain-specific details but lack generalization, while\nvision-language models (VLMs) built on large language models (LLMs) leverage\ngeneral knowledge but struggle with domain-specific adaptation. To address\nthese limitations, this paper proposes a novel agent-enhanced model\ncollaboration framework, which we call MoColl, designed to effectively\nintegrate domain-specific and general knowledge. Specifically, our approach is\nto decompose complex image captioning tasks into a series of interconnected\nquestion-answer subtasks. A trainable visual question answering (VQA) model is\nemployed as a specialized tool to focus on domain-specific visual analysis,\nanswering task-specific questions based on image content. 
Concurrently, an\nLLM-based agent with general knowledge formulates these questions and\nsynthesizes the resulting question-answer pairs into coherent captions. Beyond\nits role in leveraging the VQA model, the agent further guides its training to\nenhance its domain-specific capabilities. Experimental results on radiology\nreport generation validate the effectiveness of the proposed framework,\ndemonstrating significant improvements in the quality of generated reports.\n","arxiv_id":"http://arxiv.org/abs/2501.01834v2","authors":["Pu Yang","Bin Dong"]},{"title":"Practical machine learning is learning on small samples","abstract":" Based on limited observations, machine learning discerns a dependence which\nis expected to hold in the future. What makes it possible? Statistical learning\ntheory imagines indefinitely increasing training sample to justify its\napproach. In reality, there is no infinite time or even infinite general\npopulation for learning. Here I argue that practical machine learning is based\non an implicit assumption that underlying dependence is relatively ``smooth\" :\nlikely, there are no abrupt differences in feedback between cases with close\ndata points. From this point of view learning shall involve selection of the\nhypothesis ``smoothly\" approximating the training set. I formalize this as\nPractical learning paradigm. The paradigm includes terminology and rules for\ndescription of learners. Popular learners (local smoothing, k-NN, decision\ntrees, Naive Bayes, SVM for classification and for regression) are shown here\nto be implementations of this paradigm.\n","arxiv_id":"http://arxiv.org/abs/2501.01836v1","authors":["Marina Sapir"]},{"title":"Multi-Agent Conversational Online Learning for Adaptive LLM Response\n Identification","abstract":" The remarkable generative capability of large language models (LLMs) has\nsparked a growing interest in automatically generating responses for different\napplications. 
Given the dynamic nature of user preferences and the uncertainty\nof LLM response performance, it is crucial to design efficient online learning\nalgorithms to identify optimal LLM responses (i.e., high-quality responses that\nalso meet user preferences). Most existing online algorithms adopt a\ncentralized approach and fail to leverage explicit user preferences for more\nefficient and personalized LLM response identification. In contrast, this paper\nintroduces \textit{MACO} (\underline{M}ulti-\underline{A}gent\n\underline{C}onversational \underline{O}nline Learning for Adaptive LLM\nResponse Identification): 1) The online LLM response identification process is\naccelerated by multiple local agents (such as smartphones), while enhancing\ndata privacy; 2) A novel conversational mechanism is proposed to adaptively\nconduct conversations for soliciting user preferences (e.g., a preference for a\nhumorous tone over a serious one in generated responses), so as to minimize\nuncertainty in preference estimation. Our theoretical analysis demonstrates\nthat MACO is near-optimal regarding cumulative regret. Additionally, MACO\noffers reduced communication costs and computational complexity by eliminating\nthe traditional, computing-intensive ``G-optimal design" found in previous\nworks. Extensive experiments with the open LLM \textit{Llama}, coupled with two\ndifferent embedding models from Google and OpenAI for text vector\nrepresentation, demonstrate that MACO significantly outperforms the current\nstate-of-the-art in online LLM response identification.\n","arxiv_id":"http://arxiv.org/abs/2501.01849v1","authors":["Xiangxiang Dai","Yuejin Xie","Maoli Liu","Xuchuang Wang","Zhuohua Li","Huanyu Wang","John C. S. 
Lui"]},{"title":"Accuracy Can Lie: On the Impact of Surrogate Model in Configuration\n Tuning","abstract":" To ease the expensive measurements during configuration tuning, it is natural\nto build a surrogate model as the replacement of the system, and thereby the\nconfiguration performance can be cheaply evaluated. Yet, a stereotype therein\nis that the higher the model accuracy, the better the tuning result would be.\nThis \"accuracy is all\" belief drives our research community to build more and\nmore accurate models and criticize a tuner for the inaccuracy of the model\nused. However, this practice raises some previously unaddressed questions,\ne.g., Do those somewhat small accuracy improvements reported in existing work\nreally matter much to the tuners? What role does model accuracy play in the\nimpact of tuning quality? To answer those related questions, we conduct one of\nthe largest-scale empirical studies to date-running over the period of 13\nmonths 24*7-that covers 10 models, 17 tuners, and 29 systems from the existing\nworks while under four different commonly used metrics, leading to 13,612 cases\nof investigation. Surprisingly, our key findings reveal that the accuracy can\nlie: there are a considerable number of cases where higher accuracy actually\nleads to no improvement in the tuning outcomes (up to 58% cases under certain\nsetting), or even worse, it can degrade the tuning quality (up to 24% cases\nunder certain setting). We also discover that the chosen models in most\nproposed tuners are sub-optimal and that the required % of accuracy change to\nsignificantly improve tuning quality varies according to the range of model\naccuracy. Deriving from the fitness landscape analysis, we provide in-depth\ndiscussions of the rationale behind, offering several lessons learned as well\nas insights for future opportunities. 
Most importantly, this work poses a clear\nmessage to the community: we should take one step back from the natural\n\"accuracy is all\" belief for model-based configuration tuning.\n","arxiv_id":"http://arxiv.org/abs/2501.01876v1","authors":["Pengzhou Chen","Jingzhi Gong","Tao Chen"]},{"title":"Virgo: A Preliminary Exploration on Reproducing o1-like MLLM","abstract":" Recently, slow-thinking reasoning systems, built upon large language models\n(LLMs), have garnered widespread attention by scaling the thinking time during\ninference. There is also growing interest in adapting this capability to\nmultimodal large language models (MLLMs). Given that MLLMs handle more complex\ndata semantics across different modalities, it is intuitively more challenging\nto implement multimodal slow-thinking systems.\n To address this issue, in this paper, we explore a straightforward approach\nby fine-tuning a capable MLLM with a small amount of textual long-form thought\ndata, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning\nwith long thought). We find that these long-form reasoning processes, expressed\nin natural language, can be effectively transferred to MLLMs. Moreover, it\nseems that such textual reasoning data can be even more effective than visual\nreasoning data in eliciting the slow-thinking capacities of MLLMs. While this\nwork is preliminary, it demonstrates that slow-thinking capacities are\nfundamentally associated with the language model component, which can be\ntransferred across modalities or domains. This finding can be leveraged to\nguide the development of more powerful slow-thinking reasoning systems. 
We\nrelease our resources at https://github.com/RUCAIBox/Virgo.\n","arxiv_id":"http://arxiv.org/abs/2501.01904v1","authors":["Yifan Du","Zikang Liu","Yifan Li","Wayne Xin Zhao","Yuqi Huo","Bingning Wang","Weipeng Chen","Zheng Liu","Zhongyuan Wang","Ji-Rong Wen"]},{"title":"Mitigating Hallucination for Large Vision Language Model by\n Inter-Modality Correlation Calibration Decoding","abstract":" Large vision-language models (LVLMs) have shown remarkable capabilities in\nvisual-language understanding for downstream multi-modal tasks. Despite their\nsuccess, LVLMs still suffer from generating hallucinations in complex\ngeneration tasks, leading to inconsistencies between visual inputs and\ngenerated content. To address this issue, some approaches have introduced\ninference-time interventions, such as contrastive decoding and attention\nrectification, to reduce overreliance on language priors. However, these\napproaches overlook hallucinations stemming from spurious inter-modality\ncorrelations. In this paper, we propose an Inter-Modality Correlation\nCalibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a\ntraining-free manner. In this method, we design a Cross-Modal Value-Enhanced\nDecoding(CMVED) module to alleviate hallucination by a novel contrastive\ndecoding mechanism. During the estimation of distorted distribution, CMVED\nmasks the value vectors associated with significant cross-modal attention\nweights, which address both uni-modality overreliance and misleading\ninter-modality correlations. Additionally, a Content-Driven Attention\nRefinement(CDAR) module refines cross-modal attention weights, guiding LVLMs to\nfocus on important visual content. 
Experimental results on diverse\nhallucination benchmarks validate the superiority of our method over existing\nstate-of-the-art techniques in reducing hallucinations in LVLM text generation.\nOur code will be available at https://github.com/lijm48/IMCCD.\n","arxiv_id":"http://arxiv.org/abs/2501.01926v1","authors":["Jiaming Li","Jiacheng Zhang","Zequn Jie","Lin Ma","Guanbin Li"]},{"title":"Abstractive Text Summarization for Contemporary Sanskrit Prose: Issues\n and Challenges","abstract":" This thesis presents Abstractive Text Summarization models for contemporary\nSanskrit prose. The first chapter, titled Introduction, presents the motivation\nbehind this work, the research questions, and the conceptual framework.\nSanskrit is a low-resource inflectional language. The key research question\nthat this thesis investigates is what the challenges in developing an\nabstractive TS for Sanskrit. To answer the key research questions,\nsub-questions based on four different themes have been posed in this work. The\nsecond chapter, Literature Review, surveys the previous works done. The third\nchapter, data preparation, answers the remaining three questions from the third\ntheme. It reports the data collection and preprocessing challenges for both\nlanguage model and summarization model trainings. The fourth chapter reports\nthe training and inference of models and the results obtained therein. This\nresearch has initiated a pipeline for Sanskrit abstractive text summarization\nand has reported the challenges faced at every stage of the development. 
The\nresearch questions based on every theme have been answered to answer the key\nresearch question.\n","arxiv_id":"http://arxiv.org/abs/2501.01933v1","authors":["Shagun Sinha"]},{"title":"Cold-Start Recommendation towards the Era of Large Language Models\n (LLMs): A Comprehensive Survey and Roadmap","abstract":" Cold-start problem is one of the long-standing challenges in recommender\nsystems, focusing on accurately modeling new or interaction-limited users or\nitems to provide better recommendations. Due to the diversification of internet\nplatforms and the exponential growth of users and items, the importance of\ncold-start recommendation (CSR) is becoming increasingly evident. At the same\ntime, large language models (LLMs) have achieved tremendous success and possess\nstrong capabilities in modeling user and item information, providing new\npotential for cold-start recommendations. However, the research community on\nCSR still lacks a comprehensive review and reflection in this field. Based on\nthis, in this paper, we stand in the context of the era of large language\nmodels and provide a comprehensive review and discussion on the roadmap,\nrelated literature, and future directions of CSR. Specifically, we have\nconducted an exploration of the development path of how existing CSR utilizes\ninformation, from content features, graph relations, and domain information, to\nthe world knowledge possessed by large language models, aiming to provide new\ninsights for both the research and industrial communities on CSR. 
Related\nresources of cold-start recommendations are collected and continuously updated\nfor the community in\nhttps://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.\n","arxiv_id":"http://arxiv.org/abs/2501.01945v2","authors":["Weizhi Zhang","Yuanchen Bei","Liangwei Yang","Henry Peng Zou","Peilin Zhou","Aiwei Liu","Yinghui Li","Hao Chen","Jianling Wang","Yu Wang","Feiran Huang","Sheng Zhou","Jiajun Bu","Allen Lin","James Caverlee","Fakhri Karray","Irwin King","Philip S. Yu"]},{"title":"MADGEN: Mass-Spec attends to De Novo Molecular generation","abstract":" The annotation (assigning structural chemical identities) of MS/MS spectra\nremains a significant challenge due to the enormous molecular diversity in\nbiological samples and the limited scope of reference databases. Currently, the\nvast majority of spectral measurements remain in the \"dark chemical space\"\nwithout structural annotations. To improve annotation, we propose MADGEN\n(Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method\nfor de novo molecular structure generation guided by mass spectrometry data.\nMADGEN operates in two stages: scaffold retrieval and spectra-conditioned\nmolecular generation starting with the scaffold. In the first stage, given an\nMS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ\ncontrastive learning to align mass spectra with candidate molecular scaffolds.\nIn the second stage, starting from the retrieved scaffold, we employ the MS/MS\nspectrum to guide an attention-based generative model to generate the final\nmolecule. Our approach constrains the molecular generation search space,\nreducing its complexity and improving generation accuracy. 
We evaluate MADGEN\non three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's\nperformance with a predictive scaffold retriever and with an oracle retriever.\nWe demonstrate the effectiveness of using attention to integrate spectral\ninformation throughout the generation process to achieve strong results with\nthe oracle retriever.\n","arxiv_id":"http://arxiv.org/abs/2501.01950v2","authors":["Yinkai Wang","Xiaohui Chen","Liping Liu","Soha Hassoun"]},{"title":"MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of\n Accelerators","abstract":" Graph convolutional networks (GCNs) have demonstrated superiority in\ngraph-based learning tasks. However, training GCNs on full graphs is\nparticularly challenging, due to the following two challenges: (1) the\nassociated feature tensors can easily explode the memory and block the\ncommunication bandwidth of modern accelerators, and (2) the computation\nworkflow in training GCNs alternates between sparse and dense matrix\noperations, complicating the efficient utilization of computational resources.\nExisting solutions for scalable distributed full-graph GCN training mostly\nadopt partition parallelism, which is unsatisfactory as they only partially\naddress the first challenge while incurring scaled-out communication volume. To\nthis end, we propose MixGCN aiming to simultaneously address both the\naforementioned challenges towards GCN training. To tackle the first challenge,\nMixGCN integrates mixture of parallelism. Both theoretical and empirical\nanalysis verify its constant communication volumes and enhanced balanced\nworkload; For handling the second challenge, we consider mixture of\naccelerators (i.e., sparse and dense accelerators) with a dedicated accelerator\nfor GCN training and a fine-grain pipeline. 
Extensive experiments show that\nMixGCN achieves boosted training efficiency and scalability.\n","arxiv_id":"http://arxiv.org/abs/2501.01951v2","authors":["Cheng Wan","Runkai Tao","Zheng Du","Yang Katie Zhao","Yingyan Celine Lin"]},{"title":"SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of\n Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation\n Framework","abstract":" Stable Diffusion models have made remarkable strides in generating\nphotorealistic images from text prompts but often falter when tasked with\naccurately representing complex spatial arrangements, particularly involving\nintricate 3D relationships. To address this limitation, we introduce\nSmartSpatial, an innovative approach that enhances the spatial arrangement\ncapabilities of Stable Diffusion models through 3D-aware conditioning and\nattention-guided mechanisms. SmartSpatial incorporates depth information and\nemploys cross-attention control to ensure precise object placement, delivering\nnotable improvements in spatial accuracy metrics. In conjunction with\nSmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework\ndesigned to assess spatial relationships. This framework utilizes\nvision-language models and graph-based dependency parsing for performance\nanalysis. Experimental results on the COCO and SpatialPrompts datasets show\nthat SmartSpatial significantly outperforms existing methods, setting new\nbenchmarks for spatial arrangement accuracy in image generation.\n","arxiv_id":"http://arxiv.org/abs/2501.01998v1","authors":["Mao Xun Huang","Hen-Hsen Huang"]},{"title":"Multi-Task Semantic Communication With Graph Attention-Based Feature\n Correlation Extraction","abstract":" Multi-task semantic communication can serve multiple learning tasks using a\nshared encoder model. 
Existing models have overlooked the intricate\nrelationships between features extracted during an encoding process of tasks.\nThis paper presents a new graph attention inter-block (GAI) module to the\nencoder/transmitter of a multi-task semantic communication system, which\nenriches the features for multiple tasks by embedding the intermediate outputs\nof encoding in the features, compared to the existing techniques. The key idea\nis that we interpret the outputs of the intermediate feature extraction blocks\nof the encoder as the nodes of a graph to capture the correlations of the\nintermediate features. Another important aspect is that we refine the node\nrepresentation using a graph attention mechanism to extract the correlations\nand a multi-layer perceptron network to associate the node representations with\ndifferent tasks. Consequently, the intermediate features are weighted and\nembedded into the features transmitted for executing multiple tasks at the\nreceiver. Experiments demonstrate that the proposed model surpasses the most\ncompetitive and publicly available models by 11.4% on the CityScapes 2Task\ndataset and outperforms the established state-of-the-art by 3.97% on the NYU V2\n3Task dataset, respectively, when the bandwidth ratio of the communication\nchannel (i.e., compression level for transmission over the channel) is as\nconstrained as 1/12.\n","arxiv_id":"http://arxiv.org/abs/2501.02006v1","authors":["Xi Yu","Tiejun Lv","Weicai Li","Wei Ni","Dusit Niyato","Ekram Hossain"]},{"title":"TART: Token-based Architecture Transformer for Neural Network\n Performance Prediction","abstract":" In the realm of neural architecture design, achieving high performance is\nlargely reliant on the manual expertise of researchers. Despite the emergence\nof Neural Architecture Search (NAS) as a promising technique for automating\nthis process, current NAS methods still require human input to expand the\nsearch space and cannot generate new architectures. 
This paper explores the\npotential of Transformers in comprehending neural architectures and their\nperformance, with the objective of establishing the foundation for utilizing\nTransformers to generate novel networks. We propose the Token-based\nArchitecture Transformer (TART), which predicts neural network performance\nwithout the need to train candidate networks. TART attains state-of-the-art\nperformance on the DeepNets-1M dataset for performance prediction tasks without\nedge information, indicating the potential of Transformers to aid in\ndiscovering novel and high-performing neural architectures.\n","arxiv_id":"http://arxiv.org/abs/2501.02007v1","authors":["Yannis Y. He"]},{"title":"Cross-model Transferability among Large Language Models on the Platonic\n Representations of Concepts","abstract":" Understanding the inner workings of Large Language Models (LLMs) is a\ncritical research frontier. Prior research has shown that a single LLM's\nconcept representations can be captured as steering vectors (SVs), enabling the\ncontrol of LLM behavior (e.g., towards generating harmful content). Our work\ntakes a novel approach by exploring the intricate relationships between concept\nrepresentations across different LLMs, drawing an intriguing parallel to\nPlato's Allegory of the Cave. In particular, we introduce a linear\ntransformation method to bridge these representations and present three key\nfindings: 1) Concept representations across different LLMs can be effectively\naligned using simple linear transformations, enabling efficient cross-model\ntransfer and behavioral control via SVs. 2) This linear transformation\ngeneralizes across concepts, facilitating alignment and control of SVs\nrepresenting different concepts across LLMs. 
3) A weak-to-strong\ntransferability exists between LLM concept representations, whereby SVs\nextracted from smaller LLMs can effectively control the behavior of larger\nLLMs.\n","arxiv_id":"http://arxiv.org/abs/2501.02009v1","authors":["Youcheng Huang","Chen Huang","Duanyu Feng","Wenqiang Lei","Jiancheng Lv"]},{"title":"Machine Learning-Based Differential Diagnosis of Parkinson's Disease\n Using Kinematic Feature Extraction and Selection","abstract":" Parkinson's disease (PD), the second most common neurodegenerative disorder,\nis characterized by dopaminergic neuron loss and the accumulation of abnormal\nsynuclein. PD presents both motor and non-motor symptoms that progressively\nimpair daily functioning. The severity of these symptoms is typically assessed\nusing the MDS-UPDRS rating scale, which is subjective and dependent on the\nphysician's experience. Additionally, PD shares symptoms with other\nneurodegenerative diseases, such as progressive supranuclear palsy (PSP) and\nmultiple system atrophy (MSA), complicating accurate diagnosis. To address\nthese diagnostic challenges, we propose a machine learning-based system for\ndifferential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system\nutilizes a kinematic feature-based hierarchical feature extraction and\nselection approach. Initially, 18 kinematic features are extracted, including\ntwo newly proposed features: Thumb-to-index vector velocity and acceleration,\nwhich provide insights into motor control patterns. In addition, 41 statistical\nfeatures were extracted here from each kinematic feature, including some new\napproaches such as Average Absolute Change, Rhythm, Amplitude, Frequency,\nStandard Deviation of Frequency, and Slope. Feature selection is performed\nusing One-way ANOVA to rank features, followed by Sequential Forward Floating\nSelection (SFFS) to identify the most relevant ones, aiming to reduce the\ncomputational complexity. 
The final feature set is used for classification,\nachieving a classification accuracy of 66.67% for each dataset and 88.89% for\neach patient, with particularly high performance for the MSA and HC groups\nusing the SVM algorithm. This system shows potential as a rapid and accurate\ndiagnostic tool in clinical practice, though further data collection and\nrefinement are needed to enhance its reliability.\n","arxiv_id":"http://arxiv.org/abs/2501.02014v1","authors":["Masahiro Matsumoto","Abu Saleh Musa Miah","Nobuyoshi Asai","Jungpil Shin"]},{"title":"Enhancing Uncertainty Modeling with Semantic Graph for Hallucination\n Detection","abstract":" Large Language Models (LLMs) are prone to hallucination with non-factual or\nunfaithful statements, which undermines the applications in real-world\nscenarios. Recent researches focus on uncertainty-based hallucination\ndetection, which utilizes the output probability of LLMs for uncertainty\ncalculation and does not rely on external knowledge or frequent sampling from\nLLMs. Whereas, most approaches merely consider the uncertainty of each\nindependent token, while the intricate semantic relations among tokens and\nsentences are not well studied, which limits the detection of hallucination\nthat spans over multiple tokens and sentences in the passage. In this paper, we\npropose a method to enhance uncertainty modeling with semantic graph for\nhallucination detection. Specifically, we first construct a semantic graph that\nwell captures the relations among entity tokens and sentences. Then, we\nincorporate the relations between two entities for uncertainty propagation to\nenhance sentence-level hallucination detection. Given that hallucination occurs\ndue to the conflict between sentences, we further present a graph-based\nuncertainty calibration method that integrates the contradiction probability of\nthe sentence with its neighbors in the semantic graph for uncertainty\ncalculation. 
Extensive experiments on two datasets show the great advantages of\nour proposed approach. In particular, we obtain substantial improvements with\n19.78% in passage-level hallucination detection.\n","arxiv_id":"http://arxiv.org/abs/2501.02020v1","authors":["Kedi Chen","Qin Chen","Jie Zhou","Xinqi Tao","Bowen Ding","Jingwen Xie","Mingchen Xie","Peilong Li","Feng Zheng","Liang He"]},{"title":"Weakly Supervised Learning on Large Graphs","abstract":" Graph classification plays a pivotal role in various domains, including\npathology. In this domain, images can be represented as graphs, where nodes\nmight represent individual nuclei, and edges capture the spatial or functional\nrelationships between them. Often, the\noverall label of the graph, such as a cancer type or disease state, is\ndetermined by patterns within smaller, localized regions of the image. This\nwork introduces a weakly-supervised graph classification framework leveraging\ntwo subgraph extraction techniques: (1) Sliding-window approach (2) BFS-based\napproach. Subgraphs are processed using a Graph Attention Network (GAT), which\nemploys attention mechanisms to identify the most informative subgraphs for\nclassification. Weak supervision is achieved by propagating graph-level labels\nto subgraphs, eliminating the need for detailed subgraph annotations.\n","arxiv_id":"http://arxiv.org/abs/2501.02021v1","authors":["Aditya Prakash"]},{"title":"CarbonChat: Large Language Model-Based Corporate Carbon Emission\n Analysis and Climate Knowledge Q\u0026A System","abstract":" As the impact of global climate change intensifies, corporate carbon\nemissions have become a focal point of global attention. 
In response to issues\nsuch as the lag in climate change knowledge updates within large language\nmodels, the lack of specialization and accuracy in traditional augmented\ngeneration architectures for complex problems, and the high cost and time\nconsumption of sustainability report analysis, this paper proposes CarbonChat:\nLarge Language Model-based corporate carbon emission analysis and climate\nknowledge Q\u0026A system, aimed at achieving precise carbon emission analysis and\npolicy understanding. First, a diversified index module construction method is\nproposed to handle the segmentation of rule-based and long-text documents, as\nwell as the extraction of structured data, thereby optimizing the parsing of\nkey information. Second, an enhanced self-prompt retrieval-augmented generation\narchitecture is designed, integrating intent recognition, structured reasoning\nchains, hybrid retrieval, and Text2SQL, improving the efficiency of semantic\nunderstanding and query conversion. Next, based on the greenhouse gas accounting\nframework, 14 dimensions are established for carbon emission analysis, enabling\nreport summarization, relevance evaluation, and customized responses. Finally,\nthrough a multi-layer chunking mechanism, timestamps, and hallucination\ndetection features, the accuracy and verifiability of the analysis results are\nensured, reducing hallucination rates and enhancing the precision of the\nresponses.\n","arxiv_id":"http://arxiv.org/abs/2501.02031v1","authors":["Zhixuan Cao","Ming Han","Jingtao Wang","Meng Jia"]},{"title":"3D Cloud reconstruction through geospatially-aware Masked Autoencoders","abstract":" Clouds play a key role in Earth's radiation balance with complex effects that\nintroduce large uncertainties into climate models. Real-time 3D cloud data is\nessential for improving climate predictions. 
This study leverages geostationary\nimagery from MSG/SEVIRI and radar reflectivity measurements of cloud profiles\nfrom CloudSat/CPR to reconstruct 3D cloud structures. We first apply\nself-supervised learning (SSL) methods-Masked Autoencoders (MAE) and\ngeospatially-aware SatMAE on unlabelled MSG images, and then fine-tune our\nmodels on matched image-profile pairs. Our approach outperforms\nstate-of-the-art methods like U-Nets, and our geospatial encoding further\nimproves prediction results, demonstrating the potential of SSL for cloud\nreconstruction.\n","arxiv_id":"http://arxiv.org/abs/2501.02035v1","authors":["Stella Girtsou","Emiliano Diaz Salas-Porras","Lilli Freischem","Joppe Massant","Kyriaki-Margarita Bintsi","Guiseppe Castiglione","William Jones","Michael Eisinger","Emmanuel Johnson","Anna Jungbluth"]},{"title":"Architecture for Trajectory-Based Fishing Ship Classification with AIS\n Data","abstract":" This paper proposes a data preparation process for managing real-world\nkinematic data and detecting fishing vessels. The solution is a binary\nclassification that classifies ship trajectories into either fishing or\nnon-fishing ships. The data used are characterized by the typical problems\nfound in classic data mining applications using real-world data, such as noise\nand inconsistencies. The two classes are also clearly unbalanced in the data, a\nproblem which is addressed using algorithms that resample the instances. For\nclassification, a series of features are extracted from spatiotemporal data\nthat represent the trajectories of the ships, available from sequences of\nAutomatic Identification System (AIS) reports. These features are proposed for\nthe modelling of ship behavior but, because they do not contain context-related\ninformation, the classification can be applied in other scenarios.\nExperimentation shows that the proposed data preparation process is useful for\nthe presented classification problem. 
In addition, positive results are\nobtained using minimal information.\n","arxiv_id":"http://arxiv.org/abs/2501.02038v1","authors":["David Sánchez Pedroche","Daniel Amigo","Jesús García","Jose M. Molina"]},{"title":"An Investigation into Value Misalignment in LLM-Generated Texts for\n Cultural Heritage","abstract":" As Large Language Models (LLMs) become increasingly prevalent in tasks\nrelated to cultural heritage, such as generating descriptions of historical\nmonuments, translating ancient texts, preserving oral traditions, and creating\neducational content, their ability to produce accurate and culturally aligned\ntexts is being increasingly relied upon by users and researchers. However,\ncultural value misalignments may exist in generated texts, such as the\nmisrepresentation of historical facts, the erosion of cultural identity, and\nthe oversimplification of complex cultural narratives, which may lead to severe\nconsequences. Therefore, investigating value misalignment in the context of LLM\nfor cultural heritage is crucial for mitigating these risks, yet there has been\na significant lack of systematic and comprehensive study and investigation in\nthis area. To fill this gap, we systematically assess the reliability of LLMs\nin generating culturally aligned texts for cultural heritage-related tasks. We\nconduct a comprehensive evaluation by compiling an extensive set of 1066 query\ntasks covering 5 widely recognized categories with 17 aspects within the\nknowledge framework of cultural heritage across 5 open-source LLMs, and examine\nboth the type and rate of cultural value misalignments in the generated texts.\nUsing both automated and manual approaches, we effectively detect and analyze\nthe cultural value misalignments in LLM-generated texts. Our findings are\nconcerning: over 65% of the generated texts exhibit notable cultural\nmisalignments, with certain tasks demonstrating almost complete misalignment\nwith key cultural values. 
Beyond these findings, this paper introduces a\nbenchmark dataset and a comprehensive evaluation workflow that can serve as a\nvaluable resource for future research aimed at enhancing the cultural\nsensitivity and reliability of LLMs.\n","arxiv_id":"http://arxiv.org/abs/2501.02039v1","authors":["Fan Bu","Zheng Wang","Siyi Wang","Ziyao Liu"]},{"title":"A Separable Self-attention Inspired by the State Space Model for\n Computer Vision","abstract":" Mamba is an efficient State Space Model (SSM) with linear computational\ncomplexity. Although SSMs are not suitable for handling non-causal data, Vision\nMamba (ViM) methods still demonstrate good performance in tasks such as image\nclassification and object detection. Recent studies have shown that there is a\nrich theoretical connection between state space models and attention variants.\nWe propose a novel separable self attention method, for the first time\nintroducing some excellent design concepts of Mamba into separable\nself-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a\nsimple yet powerful prototype architecture, constructed solely by stacking our\nnovel attention modules with the most basic down-sampling layers. Notably,\nVMINet differs significantly from the conventional Transformer architecture.\nOur experiments demonstrate that VMINet has achieved competitive results on\nimage classification and high-resolution dense prediction tasks.Code is\navailable at: \\url{https://github.com/yws-wxs/VMINet}.\n","arxiv_id":"http://arxiv.org/abs/2501.02040v1","authors":["Juntao Zhang","Shaogeng Liu","Kun Bian","You Zhou","Pei Zhang","Jianning Liu","Jun Zhou","Bingyan Liu"]},{"title":"MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using\n Multi-Instance Point Cloud Registration","abstract":" A high-fidelity digital simulation environment is crucial for accurately\nreplicating physical operational processes. 
However, inconsistencies between\nsimulation and physical environments result in low confidence in simulation\noutcomes, limiting their effectiveness in guiding real-world production. Unlike\nthe traditional step-by-step point cloud \"segmentation-registration\" generation\nmethod, this paper introduces, for the first time, a novel Multi-Robot\nManufacturing Digital Scene Generation (MRG) method that leverages\nmulti-instance point cloud registration, specifically within manufacturing\nscenes. Tailored to the characteristics of industrial robots and manufacturing\nsettings, an instance-focused transformer module is developed to delineate\ninstance boundaries and capture correlations between local regions.\nAdditionally, a hypothesis generation module is proposed to extract target\ninstances while preserving key features. Finally, an efficient screening and\noptimization algorithm is designed to refine the final registration results.\nExperimental evaluations on the Scan2CAD and Welding-Station datasets\ndemonstrate that: (1) the proposed method outperforms existing multi-instance\npoint cloud registration techniques; (2) compared to state-of-the-art methods,\nthe Scan2CAD dataset achieves improvements in MR and MP by 12.15% and 17.79%,\nrespectively; and (3) on the Welding-Station dataset, MR and MP are enhanced by\n16.95% and 24.15%, respectively. 
This work marks the first application of\nmulti-instance point cloud registration in manufacturing scenes, significantly\nadvancing the precision and reliability of digital simulation environments for\nindustrial applications.\n","arxiv_id":"http://arxiv.org/abs/2501.02041v1","authors":["Songjie Han","Yinhua Liu","Yanzheng Li","Hua Chen","Dongmei Yang"]},{"title":"Advancing Pancreatic Cancer Prediction with a Next Visit Token\n Prediction Head on top of Med-BERT","abstract":" Background: Recently, numerous foundation models pretrained on extensive data\nhave demonstrated efficacy in disease prediction using Electronic Health\nRecords (EHRs). However, there remains some unanswered questions on how to best\nutilize such models especially with very small fine-tuning cohorts. Methods: We\nutilized Med-BERT, an EHR-specific foundation model, and reformulated the\ndisease binary prediction task into a token prediction task and a next visit\nmask token prediction task to align with Med-BERT's pretraining task format in\norder to improve the accuracy of pancreatic cancer (PaCa) prediction in both\nfew-shot and fully supervised settings. Results: The reformulation of the task\ninto a token prediction task, referred to as Med-BERT-Sum, demonstrates\nslightly superior performance in both few-shot scenarios and larger data\nsamples. Furthermore, reformulating the prediction task as a Next Visit Mask\nToken Prediction task (Med-BERT-Mask) significantly outperforms the\nconventional Binary Classification (BC) prediction task (Med-BERT-BC) by 3% to\n7% in few-shot scenarios with data sizes ranging from 10 to 500 samples. These\nfindings highlight that aligning the downstream task with Med-BERT's\npretraining objectives substantially enhances the model's predictive\ncapabilities, thereby improving its effectiveness in predicting both rare and\ncommon diseases. 
Conclusion: Reformatting disease prediction tasks to align\nwith the pretraining of foundation models enhances prediction accuracy, leading\nto earlier detection and timely intervention. This approach improves treatment\neffectiveness, survival rates, and overall patient outcomes for PaCa and\npotentially other cancers.\n","arxiv_id":"http://arxiv.org/abs/2501.02044v1","authors":["Jianping He","Laila Rasmy","Degui Zhi","Cui Tao"]},{"title":"ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing","abstract":" Recent years have witnessed significant advancements in text-guided style\ntransfer, primarily attributed to innovations in diffusion models. These models\nexcel in conditional guidance, utilizing text or images to direct the sampling\nprocess. However, despite their capabilities, direct conditional guidance\napproaches often face challenges in balancing the expressiveness of textual\nsemantics with the diversity of output results while capturing stylistic\nfeatures. To address these challenges, we introduce ArtCrafter, a novel\nframework for text-to-image style transfer. Specifically, we introduce an\nattention-based style extraction module, meticulously engineered to capture the\nsubtle stylistic elements within an image. This module features a multi-layer\narchitecture that leverages the capabilities of perceiver attention mechanisms\nto integrate fine-grained information. Additionally, we present a novel\ntext-image aligning augmentation component that adeptly balances control over\nboth modalities, enabling the model to efficiently map image and text\nembeddings into a shared feature space. We achieve this through attention\noperations that enable smooth information flow between modalities. Lastly, we\nincorporate an explicit modulation that seamlessly blends multimodal enhanced\nembeddings with original embeddings through an embedding reframing design,\nempowering the model to generate diverse outputs. 
Extensive experiments\ndemonstrate that ArtCrafter yields impressive results in visual stylization,\nexhibiting exceptional levels of stylistic intensity, controllability, and\ndiversity.\n","arxiv_id":"http://arxiv.org/abs/2501.02064v1","authors":["Nisha Huang","Kaer Huang","Yifan Pu","Jiangshan Wang","Jie Guo","Yiqiang Yan","Xiu Li"]},{"title":"The interplay between domain specialization and model size: a case study\n in the legal domain","abstract":" Scaling laws for language models so far focused on finding the\ncompute-optimal model size and token count for training from scratch. However,\nachieving this optimal balance requires significant compute resources due to\nthe extensive data demands when training models from randomly-initialized\nweights. Continual pre-training offers a cost-effective alternative, leveraging\nthe compute investment from pre-trained models to incorporate new knowledge\nwithout requiring extensive new data. Recent findings suggest that data quality\ninfluences constants in scaling laws, thereby altering the optimal\nparameter-token allocation ratio. Building on this insight, we investigate the\ninterplay between domain specialization and model size during continual\npre-training under compute-constrained scenarios. Our goal is to identify a\ncompute-efficient training regime for this scenario and, potentially, detect\npatterns in this interplay that can be generalized across different model sizes\nand domains. To compare general and specialized training, we filtered a\nweb-based dataset to extract legal domain data. We pre-trained models with\n1.5B, 3B, 7B and 14B parameters on both the unfiltered and filtered datasets,\nthen evaluated their performance on legal exams. 
Results show that as model\nsize increases, the compute-effectiveness gap between specialized and general\nmodels widens.\n","arxiv_id":"http://arxiv.org/abs/2501.02068v1","authors":["Roseval Malaquias Junior","Ramon Pires","Thales Sales Almeida","Kenzo Sakiyama","Roseli Romero","Rodrigo Nogueira"]},{"title":"On the Statistical Complexity for Offline and Low-Adaptive Reinforcement\n Learning with Structures","abstract":" This article reviews the recent advances in the statistical foundations of\nreinforcement learning (RL) in the offline and low-adaptive settings. We will\nstart by arguing why offline RL is the appropriate model for almost any\nreal-life ML problem, even if it has nothing to do with the recent AI\nbreakthroughs that use RL. Then we will zoom into two fundamental problems of\noffline RL: offline policy evaluation (OPE) and offline policy learning (OPL).\nIt may be surprising that tight bounds for these problems were not\nknown even for tabular and linear cases until recently. We delineate the\ndifferences between worst-case minimax bounds and instance-dependent bounds. We\nalso cover key algorithmic ideas and proof techniques behind near-optimal\ninstance-dependent methods in OPE and OPL. Finally, we discuss the limitations\nof offline RL and review a burgeoning problem of \\emph{low-adaptive\nexploration}, which addresses these limitations by providing a sweet middle\nground between offline and online RL.\n","arxiv_id":"http://arxiv.org/abs/2501.02089v1","authors":["Ming Yin","Mengdi Wang","Yu-Xiang Wang"]},{"title":"Online Detection of Water Contamination Under Concept Drift","abstract":" Water Distribution Networks (WDNs) are vital infrastructures, and\ncontamination poses serious public health risks. Harmful substances can\ninteract with disinfectants like chlorine, making chlorine monitoring essential\nfor detecting contaminants. However, chlorine sensors often become unreliable\nand require frequent calibration. 
This study introduces the Dual-Threshold\nAnomaly and Drift Detection (AD\u0026DD) method, an unsupervised approach combining\na dual-threshold drift detection mechanism with an LSTM-based Variational\nAutoencoder (LSTM-VAE) for real-time contamination detection. Tested on two\nrealistic WDNs, AD\u0026DD effectively identifies anomalies, treating sensor offsets as\nconcept drift, and outperforms other methods. A proposed decentralized\narchitecture enables accurate contamination detection and localization by\ndeploying AD\u0026DD on selected nodes.\n","arxiv_id":"http://arxiv.org/abs/2501.02107v1","authors":["Jin Li","Kleanthis Malialis","Stelios G. Vrachimis","Marios M. Polycarpou"]},{"title":"Siamese Networks for Cat Re-Identification: Exploring Neural Models for\n Cat Instance Recognition","abstract":" Street cats in urban areas often rely on human intervention for survival,\nleading to challenges in population control and welfare management. In April\n2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street\nCat initiative to address these issues. The project deployed over 21,000 smart\nfeeding stations across 14 cities in China, integrating livestreaming cameras\nand treat dispensers activated through user donations. It also promotes the\nTrap-Neuter-Return (TNR) method, supported by a community-driven platform,\nHelloStreetCatWiki, where volunteers catalog and identify cats. However, manual\nidentification is inefficient and unsustainable, creating a need for automated\nsolutions. This study explores Deep Learning-based models for re-identifying\nstreet cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69\ncats was used to train Siamese Networks with EfficientNetB0, MobileNet and\nVGG16 as base models, evaluated under contrastive and triplet loss functions.\nVGG16 paired with contrastive loss emerged as the most effective configuration,\nachieving up to 97% accuracy and an F1 score of 0.9344 during testing. 
The\napproach leverages image augmentation and dataset refinement to overcome\nchallenges posed by limited data and diverse visual variations. These findings\nunderscore the potential of automated cat re-identification to streamline\npopulation monitoring and welfare efforts. By reducing reliance on manual\nprocesses, the method offers a scalable and reliable solution for\ncommunity-driven initiatives. Future research will focus on expanding datasets\nand developing real-time implementations to enhance practicality in large-scale\ndeployments.\n","arxiv_id":"http://arxiv.org/abs/2501.02112v1","authors":["Tobias Trein","Luan Fonseca Garcia"]},{"title":"AVTrustBench: Assessing and Enhancing Reliability and Robustness in\n Audio-Visual LLMs","abstract":" With the rapid advancement of Multi-modal Large Language Models (MLLMs),\nseveral diagnostic benchmarks have recently been developed to assess these\nmodels' multi-modal reasoning proficiency. However, these benchmarks are\nrestricted to assessing primarily the visual aspect and do not examine\nholistic audio-visual (AV) understanding. Moreover, currently, there are no\nbenchmarks that investigate the capabilities of AVLLMs to calibrate their\nresponses when presented with perturbed inputs. To this end, we introduce the\nAudio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising\n600K samples spanning 9 meticulously crafted tasks, evaluating the\ncapabilities of AVLLMs across three distinct dimensions: Adversarial attack,\nCompositional reasoning, and Modality-specific dependency. Using our benchmark,\nwe extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that\nthe majority of existing models fall significantly short of achieving\nhuman-like comprehension, offering valuable insights for future research\ndirections. 
To alleviate the limitations of existing approaches, we further\npropose CAVPref, a robust, model-agnostic training strategy based on calibrated\naudio-visual preference optimization, obtaining a gain of up to 30.19%\nacross all 9 tasks. We will publicly release our code and benchmark to\nfacilitate future research in this direction.\n","arxiv_id":"http://arxiv.org/abs/2501.02135v1","authors":["Sanjoy Chowdhury","Sayan Nag","Subhrajyoti Dasgupta","Yaoting Wang","Mohamed Elhoseiny","Ruohan Gao","Dinesh Manocha"]},{"title":"Attribute-Based Robotic Grasping with Data-Efficient Adaptation","abstract":" Robotic grasping is one of the most fundamental robotic manipulation tasks\nand has been the subject of extensive research. However, swiftly teaching a\nrobot to grasp a novel target object in clutter remains challenging. This paper\nattempts to address the challenge by leveraging object attributes that\nfacilitate recognition, grasping, and rapid adaptation to new domains. In this\nwork, we present an end-to-end encoder-decoder network to learn attribute-based\nrobotic grasping with data-efficient adaptation capability. We first pre-train\nthe end-to-end model with a variety of basic objects to learn a generic attribute\nrepresentation for recognition and grasping. Our approach fuses the embeddings\nof a workspace image and a query text using a gated-attention mechanism and\nlearns to predict instance grasping affordances. To train the joint embedding\nspace of visual and textual attributes, the robot utilizes object persistence\nbefore and after grasping. Our model is self-supervised in a simulation that\nonly uses basic objects of various colors and shapes but generalizes to novel\nobjects in new environments. 
To further facilitate generalization, we propose\ntwo adaptation methods, adversarial adaptation and one-grasp adaptation.\nAdversarial adaptation regulates the image encoder using augmented data of\nunlabeled images, whereas one-grasp adaptation updates the overall end-to-end\nmodel using augmented data from one grasp trial. Both adaptation methods are\ndata-efficient and considerably improve instance grasping performance.\nExperimental results in both simulation and the real world demonstrate that our\napproach achieves over an 81% instance grasping success rate on unknown objects,\nwhich outperforms several baselines by large margins.\n","arxiv_id":"http://arxiv.org/abs/2501.02149v1","authors":["Yang Yang","Houjian Yu","Xibai Lou","Yuanhao Liu","Changhyun Choi"]},{"title":"Table as Thought: Exploring Structured Thoughts in LLM Reasoning","abstract":" Large language models' reasoning abilities benefit from methods that organize\ntheir thought processes, such as chain-of-thought prompting, which employs a\nsequential structure to guide the reasoning process step-by-step. However,\nexisting approaches focus primarily on organizing the sequence of thoughts,\nleaving structure in individual thought steps underexplored. To address this\ngap, we propose Table as Thought, a framework inspired by cognitive\nneuroscience theories on human thought. Table as Thought organizes reasoning\nwithin a tabular schema, where rows represent sequential thought steps and\ncolumns capture critical constraints and contextual information to enhance\nreasoning. The reasoning process iteratively populates the table until\nself-verification ensures completeness and correctness. Our experiments show\nthat Table as Thought excels in planning tasks and demonstrates strong\npotential for enhancing LLM performance in mathematical reasoning compared to\nunstructured thought baselines. 
This work provides a novel exploration of\nrefining thought representation within LLMs, paving the way for advancements in\nreasoning and AI cognition.\n","arxiv_id":"http://arxiv.org/abs/2501.02152v1","authors":["Zhenjie Sun","Naihao Deng","Haofei Yu","Jiaxuan You"]},{"title":"The Integration of Blockchain and Artificial Intelligence for Secure\n Healthcare Systems","abstract":" Verisign reported a 125 percent increase in data breaches within the\nhealthcare sector in the United States during 2022, with 18.2 million patient\nrecords being impacted. Growing healthcare data volumes and diversification\nmean that medical information is becoming more valuable. Many health centers\nuse various technologies to ease the classification, storage, and exchange of\nbig data. This use can also put users' health data at risk and leave it\nvulnerable. AI and blockchain are among the leading technologies at hand. AI\nhas improved data-driven operations and big-data efficiency relative to\ntraditional techniques. Due to its potential to improve health services and\nlower medical costs, AI is regularly used in healthcare. Blockchain helps\nprotect information-sharing transactions and patient privacy, provided the\nexchange follows established standards. The objective of this analysis is to\ninvestigate the research and unique contributions since 2008 regarding\nblockchain-integrated AI and healthcare systems. The work sheds light on\napplied AI-based healthcare schemes with machine, ballistic, and acrylic\nlearning and disparate blockchain structures. Using these technologies to\nensure patient data security and manage medical information effectively in\nhealthcare settings benefits both healthcare providers and patients. 
Between 2018 and\n2021, research activity in this area grew, peaking in 2021; relevant articles\nwere identified through download counts and Google Scholar citation counts,\nconsultation with local research experts, and reviews of recent publications\nand large research grants.\n","arxiv_id":"http://arxiv.org/abs/2501.02169v1","authors":["Umar Safdar","Simon Gabrael"]},{"title":"AdaMixup: A Dynamic Defense Framework for Membership Inference Attack\n Mitigation","abstract":" Membership inference attacks have emerged as a significant privacy concern in\nthe training of deep learning models, where attackers can infer whether a data\npoint was part of the training set based on the model's outputs. To address\nthis challenge, we propose a novel defense mechanism, AdaMixup. AdaMixup\nemploys adaptive mixup techniques to enhance the model's robustness against\nmembership inference attacks by dynamically adjusting the mixup strategy during\ntraining. This method not only improves the model's privacy protection but also\nmaintains high performance. Experimental results across multiple datasets\ndemonstrate that AdaMixup significantly reduces the risk of membership\ninference attacks while achieving a favorable trade-off between defensive\nefficiency and model accuracy. 
This research provides an effective solution for\ndata privacy protection and lays the groundwork for future advancements in\nmixup training methods.\n","arxiv_id":"http://arxiv.org/abs/2501.02182v1","authors":["Ying Chen","Jiajing Chen","Yijie Weng","ChiaHua Chang","Dezhi Yu","Guanbiao Lin"]},{"title":"CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction","abstract":" Generative relation extraction (RE) commonly involves first reformulating RE\nas a linguistic modeling problem easily tackled with pre-trained language\nmodels (PLM) and then fine-tuning a PLM with supervised cross-entropy loss.\nAlthough having achieved promising performance, existing approaches assume only\none deterministic relation between each pair of entities without considering\nreal scenarios where multiple relations may be valid, i.e., entity pair\noverlap, causing their limited applications. To address this problem, we\nintroduce a novel contrastive prompt tuning method for RE, CPTuning, which\nlearns to associate a candidate relation between two in-context entities with a\nprobability mass above or below a threshold, corresponding to whether the\nrelation exists. Beyond learning schema, CPTuning also organizes RE as a\nverbalized relation generation task and uses Trie-constrained decoding to\nensure a model generates valid relations. It adaptively picks out the generated\ncandidate relations with a high estimated likelihood in inference, thereby\nachieving multi-relation extraction. We conduct extensive experiments on four\nwidely used datasets to validate our method. 
Results show that T5-large\nfine-tuned with CPTuning significantly outperforms previous methods in both\nsingle- and multiple-relation extraction.\n","arxiv_id":"http://arxiv.org/abs/2501.02196v1","authors":["Jiaxin Duan","Fengyu Lu","Junfei Liu"]},{"title":"Financial Named Entity Recognition: How Far Can LLM Go?","abstract":" The surge of large language models (LLMs) has revolutionized the extraction\nand analysis of crucial information from a growing volume of financial\nstatements, announcements, and business news. Recognizing named entities to\nconstruct structured data poses a significant challenge in analyzing financial\ndocuments and is a foundational task for intelligent financial analytics.\nHowever, how effective these generic LLMs are, and how they perform under\nvarious prompts, is not yet well understood. To fill this gap, we present a\nsystematic evaluation of state-of-the-art LLMs and prompting methods on the\nfinancial Named Entity Recognition (NER) problem. Specifically, our\nexperimental results highlight their strengths and limitations, identify five\nrepresentative failure types, and provide insights into their potential and\nchallenges for domain-specific tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.02237v1","authors":["Yi-Te Lu","Yintong Huo"]},{"title":"Interpretable Load Forecasting via Representation Learning of\n Geo-distributed Meteorological Factors","abstract":" Meteorological factors (MF) are crucial in day-ahead load forecasting as they\nsignificantly influence the electricity consumption behaviors of consumers.\nNumerous studies have incorporated MF into the load forecasting model to\nachieve higher accuracy. 
Selecting MF from one representative location or the\naveraged MF as the inputs of the forecasting model is a common practice.\nHowever, the difference in MF collected in various locations within a region\nmay be significant, which poses a challenge in selecting the appropriate MF\nfrom numerous locations. A representation learning framework is proposed to\nextract geo-distributed MF while considering their spatial relationships. In\naddition, this paper employs the Shapley value in the graph-based model to\nreveal connections between MF collected in different locations and loads. To\nreduce the computational complexity of calculating the Shapley value, an\nacceleration method is adopted based on Monte Carlo sampling and weighted\nlinear regression. Experiments on two real-world datasets demonstrate that the\nproposed method improves the day-ahead forecasting accuracy, especially in\nextreme scenarios such as the \"accumulation temperature effect\" in summer and\n\"sudden temperature change\" in winter. We also find a significant correlation\nbetween the importance of MF in different locations and the corresponding\narea's GDP and mainstay industry.\n","arxiv_id":"http://arxiv.org/abs/2501.02241v1","authors":["Yangze Zhou","Guoxin Lin","Gonghao Zhang","Yi Wang"]},{"title":"LLMzSzŁ: a comprehensive LLM benchmark for Polish","abstract":" This article introduces the first comprehensive benchmark for the Polish\nlanguage at this scale: LLMzSz{\\L} (LLMs Behind the School Desk). It is based\non a coherent collection of Polish national exams, including both academic and\nprofessional tests extracted from the archives of the Polish Central\nExamination Board. It covers 4 types of exams, coming from 154 domains.\nAltogether, it consists of almost 19k closed-ended questions. We investigate\nthe performance of open-source multilingual, English, and Polish LLMs to verify\nLLMs' abilities to transfer knowledge between languages. 
We also examine the correlation\nbetween LLMs and humans at the levels of model accuracy and exam pass rate. We\nshow that multilingual LLMs can obtain superior results over\nmonolingual ones; however, monolingual models may be beneficial when model size\nmatters. Our analysis highlights the potential of LLMs in assisting with exam\nvalidation, particularly in identifying anomalies or errors in examination\ntasks.\n","arxiv_id":"http://arxiv.org/abs/2501.02266v1","authors":["Krzysztof Jassem","Michał Ciesiółka","Filip Graliński","Piotr Jabłoński","Jakub Pokrywka","Marek Kubis","Monika Jabłońska","Ryszard Staruch"]},{"title":"What Kind of Visual Tokens Do We Need? Training-free Visual Token\n Pruning for Multi-modal Large Language Models from the Perspective of Graph","abstract":" Recent Multimodal Large Language Models (MLLMs) often use a large number of\nvisual tokens to compensate for their visual shortcomings, leading to excessive\ncomputation and obvious visual redundancy. In this paper, we investigate what\nkind of visual tokens are needed for MLLMs, and reveal that both foreground and\nbackground tokens are critical for MLLMs given the varying difficulties of\nexamples. Based on this observation, we propose a graph-based method for\ntraining-free visual token pruning, termed G-Prune. In particular, G-Prune\nregards visual tokens as nodes, and constructs their connections based on their\nsemantic similarities. Afterwards, the information flow is propagated via\nweighted links, and the most important tokens after iterations are kept for\nMLLMs, which can be foreground or background. To validate G-Prune, we apply it to a\nrecent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of\nbenchmarks. The experimental results show that G-Prune can greatly reduce\ncomputation overhead while retaining high performance on both coarse- and\nfine-grained tasks. 
For instance, G-Prune reduces the FLOPs of\nLLaVA-NeXT by 63.57\\% on VQA2.0 and TextVQA with only 0.95\\% and 2.34\\% accuracy drops,\nrespectively.\n","arxiv_id":"http://arxiv.org/abs/2501.02268v1","authors":["Yutao Jiang","Qiong Wu","Wenhao Lin","Wei Yu","Yiyi Zhou"]},{"title":"Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud\n Embedding","abstract":" Hyperbolic spaces allow for more efficient modeling of complex, hierarchical\nstructures, which is particularly beneficial in tasks involving multi-modal\ndata. Although hyperbolic geometries have been proven effective for\nlanguage-image pre-training, their capabilities to unify language, image, and\n3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud\nmodality in hyperbolic multi-modal contrastive pre-training. Additionally, we\nexplore the entailment, modality gap, and alignment regularizers for learning\nhierarchical 3D embeddings and facilitating the transfer of knowledge from both\nText and Image modalities. These regularizers enable the learning of\nintra-modal hierarchy within each modality and inter-modal hierarchy across\ntext, 2D images, and 3D Point Clouds. Experimental results demonstrate that our\nproposed training strategy yields an outstanding 3D Point Cloud encoder, and\nthe obtained 3D Point Cloud hierarchical embeddings significantly improve\nperformance on various downstream tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.02285v2","authors":["Yingjie Liu","Pengyu Zhang","Ziyao He","Mingsong Chen","Xuan Tang","Xian Wei"]},{"title":"SR-Reward: Taking The Path More Traveled","abstract":" In this paper, we propose a novel method for learning reward functions\ndirectly from offline demonstrations. Unlike traditional inverse reinforcement\nlearning (IRL), our approach decouples the reward function from the learner's\npolicy, eliminating the adversarial interaction typically required between the\ntwo. This results in a more stable and efficient training process. 
Our reward\nfunction, called \\textit{SR-Reward}, leverages the successor representation (SR) to\nencode a state based on expected future states' visitation under the\ndemonstration policy and transition dynamics. By utilizing the Bellman\nequation, SR-Reward can be learned concurrently with most reinforcement\nlearning (RL) algorithms without altering the existing training pipeline. We\nalso introduce a negative sampling strategy to mitigate overestimation errors\nby reducing rewards for out-of-distribution data, thereby enhancing robustness.\nThis strategy inherently introduces a conservative bias into RL algorithms that\nemploy the learned reward. We evaluate our method on the D4RL benchmark,\nachieving competitive results compared to offline RL algorithms with access to\ntrue rewards and imitation learning (IL) techniques like behavioral cloning.\nMoreover, our ablation studies on data size and quality reveal the advantages\nand limitations of SR-Reward as a proxy for true rewards.\n","arxiv_id":"http://arxiv.org/abs/2501.02330v1","authors":["Seyed Mahdi B. Azad","Zahra Padar","Gabriel Kalweit","Joschka Boedecker"]},{"title":"AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM\n Inference","abstract":" Long-context large language model (LLM) inference is increasingly critical,\nmotivating a number of studies devoted to alleviating the substantial storage\nand computational costs in such scenarios. Layer-wise skipping methods are\npromising optimizations but rarely explored in long-context inference. We\nobserve that existing layer-wise skipping strategies have several limitations\nwhen applied in long-context inference, including the inability to adapt to\nmodel and context variability, disregard for sublayer significance, and\ninapplicability to the prefilling phase. This paper proposes AdaSkip, an\nadaptive sublayer skipping method specifically designed for long-context\ninference. 
AdaSkip adaptively identifies less important layers by leveraging\non-the-fly similarity information, enables sublayer-wise skipping, and\naccelerates both the prefilling and decoding phases. The effectiveness of\nAdaSkip is demonstrated through extensive experiments on various long-context\nbenchmarks and models, showcasing its superior inference performance over\nexisting baselines.\n","arxiv_id":"http://arxiv.org/abs/2501.02336v1","authors":["Zhuomin He","Yizhen Yao","Pengfei Zuo","Bin Gao","Qinya Li","Zhenzhe Zheng","Fan Wu"]},{"title":"Evaluation of the Code Generation Capabilities of ChatGPT 4: A\n Comparative Analysis in 19 Programming Languages","abstract":" This bachelor's thesis examines the capabilities of ChatGPT 4 in code\ngeneration across 19 programming languages. The study analyzed solution rates\nacross three difficulty levels, types of errors encountered, and code quality\nin terms of runtime and memory efficiency through a quantitative experiment. A\ntotal of 188 programming problems were selected from the LeetCode platform, and\nChatGPT 4 was given three attempts to produce a correct solution with feedback.\nChatGPT 4 successfully solved 39.67% of all tasks, with success rates\ndecreasing significantly as problem complexity increased. Notably, the model\nfaced considerable challenges with hard problems across all languages. ChatGPT\n4 demonstrated higher competence in widely used languages, likely due to a\nlarger volume and higher quality of training data. The solution rates also\nrevealed a preference for languages with low abstraction levels and static\ntyping. For popular languages, the most frequent error was \"Wrong Answer,\"\nwhereas for less popular languages, compiler and runtime errors prevailed,\nsuggesting frequent misunderstandings and confusion regarding the structural\ncharacteristics of these languages. 
The model exhibited above-average runtime\nefficiency in all programming languages, showing a tendency toward statically\ntyped and low-abstraction languages. Memory efficiency results varied\nsignificantly, with above-average performance in 14 languages and below-average\nperformance in five languages. A slight preference for low-abstraction\nlanguages and a leaning toward dynamically typed languages in terms of memory\nefficiency were observed. Future research should include a larger number of\ntasks, iterations, and less popular languages. Additionally, ChatGPT 4's\nabilities in code interpretation and summarization, debugging, and the\ndevelopment of complex, practical code could be analyzed further.\n ----\n Diese Bachelorarbeit untersucht die F\\\"ahigkeiten von ChatGPT 4 zur\nCode-Generierung in 19 Programmiersprachen. Betrachtet wurden die\nL\\\"osungsraten zwischen drei Schwierigkeitsgraden, die aufgetretenen\nFehlerarten und die Qualit\\\"at des Codes hinsichtlich der Laufzeit- und\nSpeichereffizienz in einem quantitativen Experiment. Dabei wurden 188\nProgrammierprobleme der Plattform LeetCode entnommen, wobei ChatGPT 4 jeweils\ndrei Versuche hatte, mittels Feedback eine korrekte L\\\"osung zu generieren.\nChatGPT 4 l\\\"oste 39,67 % aller Aufgaben erfolgreich, wobei die Erfolgsrate mit\nzunehmendem Schwierigkeitsgrad deutlich abnahm und bei komplexen Problemen in\nallen Sprachen signifikante Schwierigkeiten auftraten. Das Modell zeigte eine\nh\\\"ohere Kompetenz in weit verbreiteten Sprachen, was wahrscheinlich auf eine\ngr\\\"o{\\ss}ere Menge und h\\\"ohere Qualit\\\"at der Trainingsdaten\nzur\\\"uckzuf\\\"uhren ist. Bez\\\"uglich der L\\\"osungsraten zeigte das Modell zudem\neine Pr\\\"aferenz f\\\"ur Sprachen mit niedrigem Abstraktionsniveau und statischer\nTypisierung. 
Bei Sprachen hoher Popularit\\\"at trat der Fehler Wrong Answer am\nh\\\"aufigsten auf, w\\\"ahrend bei weniger popul\\\"aren Sprachen Compiler- und\nLaufzeitfehler \\\"uberwogen, was auf h\\\"aufige Missverst\\\"andnisse und\nVerwechslungen bez\\\"uglich der spezifischen strukturellen Eigenschaften dieser\nSprachen zur\\\"uckzuf\\\"uhren ist. ChatGPT 4 demonstrierte in allen\nProgrammiersprachen eine \\\"uberdurchschnittliche Laufzeiteffizienz und\ntendierte diesbez\\\"uglich erneut zu statisch typisierten und niedrig\nabstrahierten Sprachen. Die Werte zur Speichereffizienz variierten erheblich,\nwobei in 14 Sprachen \\\"uberdurchschnittliche und in f\\\"unf Sprachen\nunterdurchschnittliche Werte erzielt wurden. Es zeigte sich diesbez\\\"uglich\neine leichte Tendenz zugunsten von niedrig abstrahierten sowie eine Pr\\\"aferenz\nzu dynamisch typisierten Sprachen. Zuk\\\"unftige Forschung sollte eine h\\\"ohere\nAnzahl an Aufgaben, Iterationen und unpopul\\\"aren Sprachen einbeziehen.\nDar\\\"uber hinaus k\\\"onnten die F\\\"ahigkeiten von ChatGPT 4 in der\nCode-Interpretation und -Zusammenfassung, im Debugging und in der Entwicklung\nkomplexer, praxisbezogener Codes analysiert werden.\n","arxiv_id":"http://arxiv.org/abs/2501.02338v1","authors":["L. C. Gilbert"]},{"title":"UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude\n Mobility","abstract":" Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has\nintroduced transformative advancements across various domains, like\ntransportation, logistics, and agriculture. Leveraging flexible perspectives\nand rapid maneuverability, UAVs extend traditional systems' perception and\naction capabilities, garnering widespread attention from academia and industry.\nHowever, current UAV operations primarily depend on human control, with only\nlimited autonomy in simple scenarios, and lack the intelligence and\nadaptability needed for more complex environments and tasks. 
The emergence of\nlarge language models (LLMs) demonstrates remarkable problem-solving and\ngeneralization capabilities, offering a promising pathway for advancing UAV\nintelligence. This paper explores the integration of LLMs and UAVs, beginning\nwith an overview of UAV systems' fundamental components and functionalities,\nfollowed by an overview of the state-of-the-art in LLM technology.\nSubsequently, it systematically highlights the multimodal data resources\navailable for UAVs, which provide critical support for training and evaluation.\nFurthermore, it categorizes and analyzes key tasks and application scenarios\nwhere UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs\nis proposed, aiming to enable UAVs to achieve agentic intelligence through\nautonomous perception, memory, reasoning, and tool utilization. Related\nresources are available at https://github.com/Hub-Tian/UAVs_Meet_LLMs.\n","arxiv_id":"http://arxiv.org/abs/2501.02341v1","authors":["Yonglin Tian","Fei Lin","Yiduo Li","Tengchao Zhang","Qiyao Zhang","Xuan Fu","Jun Huang","Xingyuan Dai","Yutong Wang","Chunwei Tian","Bai Li","Yisheng Lv","Levente Kovács","Fei-Yue Wang"]},{"title":"Exploring the Capabilities and Limitations of Large Language Models for\n Radiation Oncology Decision Support","abstract":" Thanks to the rapidly evolving integration of LLMs into decision-support\ntools, a significant transformation is happening across large-scale systems.\nLike other medical fields, the use of LLMs such as GPT-4 is gaining increasing\ninterest in radiation oncology as well. An attempt to assess GPT-4's\nperformance in radiation oncology was made via a dedicated 100-question\nexamination on the highly specialized topic of radiation oncology physics,\nrevealing GPT-4's superiority over other LLMs. 
GPT-4's performance on the broader\nfield of clinical radiation oncology has been further benchmarked by the ACR\nRadiation Oncology In-Training (TXIT) exam, where GPT-4 achieved a high accuracy\nof 74.57%. Its performance on re-labelling structure names in accordance with\nthe AAPM TG-263 report has also been benchmarked, achieving above 96%\naccuracy. Such studies shed light on the potential of LLMs in radiation\noncology. As interest in the potential and constraints of LLMs in general\nhealthcare applications continues to rise, the capabilities and limitations of\nLLMs in radiation oncology decision support have not yet been fully explored.\n","arxiv_id":"http://arxiv.org/abs/2501.02346v1","authors":["Florian Putz","Marlen Haderleina","Sebastian Lettmaier","Sabine Semrau","Rainer Fietkau","Yixing Huang"]},{"title":"Context Aware Lemmatization and Morphological Tagging Method in Turkish","abstract":" The smallest part of a word that defines the word is called a word root. Word\nroots are used to increase success in many applications since they simplify the\nword. In this study, the lemmatization model, which is a word root finding\nmethod, and the morphological tagging model, which predicts the grammatical\nknowledge of the word, are presented. The presented model was developed for\nTurkish, and both models make predictions by taking the meaning of the word\ninto account. In the literature, there is no lemmatization study that is\nsensitive to word meaning in Turkish. For this reason, the present study shares\nthe model and the results obtained from the model on Turkish lemmatization for\nthe first time in the literature. In the present study, in the lemmatization\nand morphological tagging models, bidirectional LSTM is used for the spelling\nof words, and the Turkish BERT model is used for the meaning of words. 
The\nmodels are trained using the IMST and PUD datasets from Universal Dependencies.\nThe results from training were compared with those from the SIGMORPHON 2019\ncompetition, and the comparisons revealed that our models were superior.\n","arxiv_id":"http://arxiv.org/abs/2501.02361v1","authors":["Cagri Sayallar"]},{"title":"Enhancing Workplace Productivity and Well-being Using AI Agent","abstract":" This paper discusses the use of Artificial Intelligence (AI) to enhance\nworkplace productivity and employee well-being. By integrating machine learning\n(ML) techniques with neurobiological data, the proposed approaches ensure\nalignment with human ethical standards through value alignment models and\nHierarchical Reinforcement Learning (HRL) for autonomous task management. The\nsystem utilizes biometric feedback from employees to generate personalized\nhealth prompts, fostering a supportive work environment that encourages\nphysical activity. Additionally, we explore decentralized multi-agent systems\nfor improved collaboration and decision-making frameworks that enhance\ntransparency. Various approaches using ML techniques in conjunction with AI\nimplementations are discussed. Together, these innovations aim to create a more\nproductive and health-conscious workplace. These outcomes assist HR management\nand organizations in launching more rational career progression streams for\nemployees and facilitating organizational transformation.\n","arxiv_id":"http://arxiv.org/abs/2501.02368v1","authors":["Ravirajan K","Arvind Sundarajan"]},{"title":"Syntactic Evolution in Language Usage","abstract":" This research aims to investigate the dynamic nature of linguistic style\nthroughout various stages of life, from the post-teenage years to old age. 
By employing\nlinguistic analysis tools and methodologies, the study will delve into the\nintricacies of how individuals adapt and modify their language use over time.\nThe research uses a data set of blogs from blogger.com from 2004 and focuses on\nEnglish for syntactic analysis. The findings of this research can have\nimplications for linguistics, psychology, and communication studies, shedding\nlight on the intricate relationship between age and language.\n","arxiv_id":"http://arxiv.org/abs/2501.02392v1","authors":["Surbhit Kumar"]},{"title":"iTARGET: Interpretable Tailored Age Regression for Grouped Epigenetic\n Traits","abstract":" Accurately predicting chronological age from DNA methylation patterns is\ncrucial for advancing biological age estimation. However, this task is made\nchallenging by Epigenetic Correlation Drift (ECD) and Heterogeneity Among CpGs\n(HAC), which reflect the dynamic relationship between methylation and age\nacross different life stages. To address these issues, we propose a novel\ntwo-phase algorithm. The first phase employs similarity searching to cluster\nmethylation profiles by age group, while the second phase uses Explainable\nBoosting Machines (EBM) for precise, group-specific prediction. Our method not\nonly improves prediction accuracy but also reveals key age-related CpG sites,\ndetects age-specific changes in aging rates, and identifies pairwise\ninteractions between CpG sites. 
Experimental results show that our approach\noutperforms traditional epigenetic clocks and machine learning models, offering\na more accurate and interpretable solution for biological age estimation with\nsignificant implications for aging research.\n","arxiv_id":"http://arxiv.org/abs/2501.02401v1","authors":["Zipeng Wu","Daniel Herring","Fabian Spill","James Andrews"]},{"title":"Boosting Explainability through Selective Rationalization in Pre-trained\n Language Models","abstract":" The widespread application of pre-trained language models (PLMs) in natural\nlanguage processing (NLP) has led to increasing concerns about their\nexplainability. Selective rationalization is a self-explanatory framework that\nselects human-intelligible input subsets as rationales for predictions. Recent\nstudies have shown that applying existing rationalization frameworks to PLMs\nwill result in severe degeneration and failure problems, producing sub-optimal\nor meaningless rationales. Such failures severely damage trust in\nrationalization methods and constrain the application of rationalization\ntechniques on PLMs. In this paper, we find that the homogeneity of tokens in\nthe sentences produced by PLMs is the primary contributor to these problems. To\naddress these challenges, we propose a method named Pre-trained Language\nModel's Rationalization (PLMR), which splits PLMs into a generator and a\npredictor to deal with NLP tasks while providing interpretable rationales. The\ngenerator in PLMR also alleviates homogeneity by pruning irrelevant tokens,\nwhile the predictor uses full-text information to standardize predictions.\nExperiments conducted on two widely used datasets across multiple PLMs\ndemonstrate the effectiveness of the proposed method PLMR in addressing the\nchallenge of applying selective rationalization to PLMs. 
Codes:\nhttps://github.com/ylb777/PLMR.\n","arxiv_id":"http://arxiv.org/abs/2501.03182v1","authors":["Libing Yuan","Shuaibo Hu","Kui Yu","Le Wu"]},{"title":"AI-ANNE: (A) (N)eural (N)et for (E)xploration: Transferring Deep\n Learning Models onto Microcontrollers and Embedded Systems","abstract":" This working paper explores the integration of neural networks onto\nresource-constrained embedded systems like a Raspberry Pi Pico / Raspberry Pi\nPico 2. A TinyML approach transfers neural networks directly onto these\nmicrocontrollers, enabling real-time, low-latency, and energy-efficient\ninference while maintaining data privacy. Therefore, AI-ANNE: (A) (N)eural\n(N)et for (E)xploration will be presented, which facilitates the transfer of\npre-trained models from high-performance platforms like TensorFlow and Keras\nonto microcontrollers, using a lightweight programming language like\nMicroPython. This approach demonstrates how neural network architectures, such\nas neurons, layers, density, and activation functions, can be implemented in\nMicroPython in order to deal with the computational limitations of embedded\nsystems. Based on the Raspberry Pi Pico / Raspberry Pi Pico 2, two different\nneural networks on microcontrollers are presented for an example of data\nclassification. As a further application example, such a microcontroller can\nbe used for condition monitoring, where immediate corrective measures are\ntriggered on the basis of sensor data. Overall, this working paper presents a\nvery easy-to-implement way of using neural networks on energy-efficient devices\nsuch as microcontrollers. 
This makes AI-ANNE: (A) (N)eural (N)et for\n(E)xploration not only suited for practical use, but also as an educational\ntool with clear insights into how neural networks operate.\n","arxiv_id":"http://arxiv.org/abs/2501.03256v1","authors":["Dennis Klinkhammer"]},{"title":"Optimizing Edge AI: A Comprehensive Survey on Data, Model, and System\n Strategies","abstract":" The emergence of 5G and edge computing hardware has brought about a\nsignificant shift in artificial intelligence, with edge AI becoming a crucial\ntechnology for enabling intelligent applications. With the growing amount of\ndata generated and stored on edge devices, deploying AI models for local\nprocessing and inference has become increasingly necessary. However, deploying\nstate-of-the-art AI models on resource-constrained edge devices faces\nsignificant challenges that must be addressed. This paper presents an\noptimization triad for efficient and reliable edge AI deployment, including\ndata, model, and system optimization. First, we discuss optimizing data through\ndata cleaning, compression, and augmentation to make it more suitable for edge\ndeployment. Second, we explore model design and compression methods at the\nmodel level, such as pruning, quantization, and knowledge distillation.\nFinally, we introduce system optimization techniques like framework support and\nhardware acceleration to accelerate edge AI workflows. 
Based on an in-depth\nanalysis of various application scenarios and deployment challenges of edge AI,\nthis paper proposes an optimization paradigm based on the data-model-system\ntriad to enable a whole set of solutions to effectively transfer ML models,\nwhich are initially trained in the cloud, to various edge devices for\nsupporting multiple scenarios.\n","arxiv_id":"http://arxiv.org/abs/2501.03265v1","authors":["Xubin Wang","Weijia Jia"]},{"title":"Heterogeneous Graph Pre-training Based Model for Secure and Efficient\n Prediction of Default Risk Propagation among Bond Issuers","abstract":" Efficient prediction of default risk for bond-issuing enterprises is pivotal\nfor maintaining stability and fostering growth in the bond market. Conventional\nmethods usually rely solely on an enterprise's internal data for risk\nassessment. In contrast, graph-based techniques leverage interconnected\ncorporate information to enhance default risk identification for targeted bond\nissuers. Traditional graph techniques such as the label propagation algorithm\nor DeepWalk fail to effectively integrate an enterprise's inherent attribute\ninformation with its topological network data. Additionally, due to data\nscarcity and security and privacy concerns between enterprises, end-to-end\ngraph neural network (GNN) algorithms may struggle to deliver satisfactory\nperformance for target tasks. To address these challenges, we present a novel\ntwo-stage model. In the first stage, we employ an innovative Masked\nAutoencoder for Heterogeneous Graphs (HGMAE) to pre-train on a vast enterprise\nknowledge graph. Subsequently, in the second stage, a specialized classifier\nmodel is trained to predict default risk propagation probabilities. The\nclassifier leverages concatenated feature vectors derived from the pre-trained\nencoder with the enterprise's task-specific feature vectors. 
Through the\ntwo-stage training approach, our model not only boosts the importance of unique\nbond characteristics for specific default prediction tasks, but also securely\nand efficiently leverages the global information pre-trained from other\nenterprises. Experimental results demonstrate that our proposed model\noutperforms existing approaches in predicting default risk for bond issuers.\n","arxiv_id":"http://arxiv.org/abs/2501.03268v1","authors":["Xurui Li","Xin Shan","Wenhao Yin","Haijiao Wang"]},{"title":"Knowledge-Guided Biomarker Identification for Label-Free Single-Cell\n RNA-Seq Data: A Reinforcement Learning Perspective","abstract":" Gene panel selection aims to identify the most informative genomic biomarkers\nin label-free genomic datasets. Traditional approaches, which rely on domain\nexpertise, embedded machine learning models, or heuristic-based iterative\noptimization, often introduce biases and inefficiencies, potentially obscuring\ncritical biological signals. To address these challenges, we present an\niterative gene panel selection strategy that harnesses ensemble knowledge from\nexisting gene selection algorithms to establish preliminary boundaries or prior\nknowledge, which guide the initial search space. Subsequently, we incorporate\nreinforcement learning through a reward function shaped by expert behavior,\nenabling dynamic refinement and targeted selection of gene panels. This\nintegration mitigates biases stemming from initial boundaries while\ncapitalizing on RL's stochastic adaptability. Comprehensive comparative\nexperiments, case studies, and downstream analyses demonstrate the\neffectiveness of our method, highlighting its improved precision and efficiency\nfor label-free biomarker discovery. 
Our results underscore the potential of\nthis approach to advance single-cell genomics data analysis.\n","arxiv_id":"http://arxiv.org/abs/2501.04718v1","authors":["Meng Xiao","Weiliang Zhang","Xiaohan Huang","Hengshu Zhu","Min Wu","Xiaoli Li","Yuanchun Zhou"]},{"title":"Calculating Customer Lifetime Value and Churn using Beta Geometric\n Negative Binomial and Gamma-Gamma Distribution in a NFT based setting","abstract":" Customer Lifetime Value (CLV) is an important metric that measures the total\nvalue a customer will bring to a business over their lifetime. The Beta\nGeometric Negative Binomial Distribution (BGNBD) and Gamma Gamma Distribution\nare two models that can be used to calculate CLV, taking into account both the\nfrequency and value of customer transactions. This article explains the BGNBD\nand Gamma Gamma Distribution models, and how they can be used to calculate CLV\nfor NFT (Non-Fungible Token) transaction data in a blockchain setting. By\nestimating the parameters of these models using historical transaction data,\nbusinesses can gain insights into the lifetime value of their customers and\nmake data-driven decisions about marketing and customer retention strategies.\n","arxiv_id":"http://arxiv.org/abs/2501.04719v1","authors":["Sagarnil Das"]},{"title":"Multi-task Domain Adaptation for Computation Offloading in\n Edge-intelligence Networks","abstract":" In the field of multi-access edge computing (MEC), efficient computation\noffloading is crucial for improving resource utilization and reducing latency\nin dynamically changing environments. This paper introduces a new approach,\ntermed as Multi-Task Domain Adaptation (MTDA), aiming to enhance the ability of\ncomputational offloading models to generalize in the presence of domain shifts,\ni.e., when new data in the target environment significantly differs from the\ndata in the source domain. 
The proposed MTDA model incorporates a\nteacher-student architecture that allows continuous adaptation without\nnecessitating access to the source domain data during inference, thereby\nmaintaining privacy and reducing computational overhead. Utilizing a multi-task\nlearning framework that simultaneously manages offloading decisions and\nresource allocation, the proposed MTDA approach outperforms benchmark methods\nregarding mean squared error and accuracy, particularly in environments with\nincreasing numbers of users. It is observed by means of computer simulation\nthat the proposed MTDA model maintains high performance across various\nscenarios, demonstrating its potential for practical deployment in emerging MEC\napplications.\n","arxiv_id":"http://arxiv.org/abs/2501.07585v1","authors":["Runxin Han","Bo Yang","Zhiwen Yu","Xuelin Cao","George C. Alexandropoulos","Chau Yuen"]},{"title":"Adjoint sharding for very long context training of state space models","abstract":" Despite very fast progress, efficiently training large language models (LLMs)\nin very long contexts remains challenging. Existing methods fall back to\ntraining LLMs with short contexts (a maximum of a few thousands tokens in\ntraining) and use inference time techniques when evaluating on long contexts\n(above 1M tokens context window at inference). As opposed to\nlong-context-inference, training on very long context input prompts is quickly\nlimited by GPU memory availability and by the prohibitively long training times\nit requires on state-of-the-art hardware. Meanwhile, many real-life\napplications require not only inference but also training/fine-tuning with long\ncontext on specific tasks. Such applications include, for example, augmenting\nthe context with various sources of raw reference information for fact\nextraction, fact summarization, or fact reconciliation tasks. 
We propose\nadjoint sharding, a novel technique that shards gradient\ncalculation during training to reduce memory requirements by orders of\nmagnitude, making training on very long contexts computationally tractable.\nAdjoint sharding is based on the adjoint method and computes gradients\nequivalent to those of backpropagation. We also propose truncated adjoint\nsharding to speed up the algorithm while maintaining performance. We provide a\ndistributed version and a parallelized version of adjoint sharding to further\nspeed up training. Empirical results show the proposed adjoint sharding\nalgorithm reduces memory usage by up to 3X with a 1.27B parameter large\nlanguage model on 1M context length training. This allows increasing the\nmaximum context length during training or fine-tuning of a 1.27B parameter\nmodel from 35K tokens to above 100K tokens on a training infrastructure\ncomposed of five AWS P4 instances.\n","arxiv_id":"http://arxiv.org/abs/2501.00692v1","authors":["Xingzi Xu","Amir Tavanaei","Kavosh Asadi","Karim Bouyarmane"]},{"title":"Everywhere Attack: Attacking Locally and Globally to Boost Targeted\n Transferability","abstract":" Adversarial examples' (AE) transferability refers to the phenomenon that AEs\ncrafted with one surrogate model can also fool other models. Notwithstanding\nremarkable progress in untargeted transferability, its targeted counterpart\nremains challenging. This paper proposes an everywhere scheme to boost targeted\ntransferability. Our idea is to attack a victim image both globally and\nlocally. We aim to optimize 'an army of targets' in every local image region,\nrather than optimizing a single high-confidence target in the whole image as in\nprevious works. Specifically, we split a victim image into non-overlapping\nblocks and jointly mount a targeted attack on each block. Such a strategy\nmitigates transfer failures caused by attention inconsistency between surrogate\nand victim models and thus results in stronger transferability. 
Our approach is\nmethod-agnostic, which means it can be easily combined with existing\ntransferable attacks for even higher transferability. Extensive experiments on\nImageNet demonstrate that the proposed approach universally improves the\nstate-of-the-art targeted attacks by a clear margin, e.g., the transferability\nof the widely adopted Logit attack can be improved by 28.8%-300%. We also\nevaluate the crafted AEs on a real-world platform: Google Cloud Vision. Results\nfurther support the superiority of the proposed method.\n","arxiv_id":"http://arxiv.org/abs/2501.00707v1","authors":["Hui Zeng","Sanshuai Cui","Biwei Chen","Anjie Peng"]},{"title":"An AI-powered Bayesian generative modeling approach for causal inference\n in observational studies","abstract":" Causal inference in observational studies with high-dimensional covariates\npresents significant challenges. We introduce CausalBGM, an AI-powered Bayesian\ngenerative modeling approach that captures the causal relationship among\ncovariates, treatment, and outcome variables. The core innovation of CausalBGM\nlies in its ability to estimate the individual treatment effect (ITE) by\nlearning individual-specific distributions of a low-dimensional latent feature\nset (e.g., latent confounders) that drives changes in both treatment and\noutcome. This approach not only effectively mitigates confounding effects but\nalso provides comprehensive uncertainty quantification, offering reliable and\ninterpretable causal effect estimates at the individual level. CausalBGM adopts\na Bayesian model and uses a novel iterative algorithm to update the model\nparameters and the posterior distribution of latent features until convergence.\nThis framework leverages the power of AI to capture complex dependencies among\nvariables while adhering to Bayesian principles. 
Extensive experiments\ndemonstrate that CausalBGM consistently outperforms state-of-the-art methods,\nparticularly in scenarios with high-dimensional covariates and large-scale\ndatasets. Its Bayesian foundation ensures statistical rigor, providing robust\nand well-calibrated posterior intervals. By addressing key limitations of\nexisting methods, CausalBGM emerges as a robust and promising framework for\nadvancing causal inference in modern applications in fields such as genomics,\nhealthcare, and social sciences. CausalBGM is maintained at the website\nhttps://causalbgm.readthedocs.io/.\n","arxiv_id":"http://arxiv.org/abs/2501.00755v1","authors":["Qiao Liu","Wing Hung Wong"]},{"title":"Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive\n Experiments, Analysis, and Improvements","abstract":" Graphs are essential data structures for modeling complex interactions in\ndomains such as social networks, molecular structures, and biological systems.\nGraph-level tasks, which predict properties or classes for the entire graph,\nare critical for applications, such as molecular property prediction and\nsubgraph counting. Graph Neural Networks (GNNs) have shown promise in these\ntasks, but their evaluations are often limited to narrow datasets, tasks, and\ninconsistent experimental setups, restricting their generalizability. To\naddress these limitations, we propose a unified evaluation framework for\ngraph-level GNNs. This framework provides a standardized setting to evaluate\nGNNs across diverse datasets, various graph tasks (e.g., graph classification\nand regression), and challenging scenarios, including noisy, imbalanced, and\nfew-shot graphs. Additionally, we propose a novel GNN model with enhanced\nexpressivity and generalization capabilities. Specifically, we enhance the\nexpressivity of GNNs through a $k$-path rooted subgraph approach, enabling the\nmodel to effectively count subgraphs (e.g., paths and cycles). 
Moreover, we\nintroduce a unified graph contrastive learning algorithm for graphs across\ndiverse domains, which adaptively removes unimportant edges to augment graphs,\nthereby significantly improving generalization performance. Extensive\nexperiments demonstrate that our model achieves superior performance against\nfourteen effective baselines across twenty-seven graph datasets, establishing\nit as a robust and generalizable model for graph-level tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.00773v1","authors":["Haoyang Li","Yuming Xu","Chen Jason Zhang","Alexander Zhou","Lei Chen","Qing Li"]},{"title":"LENS-XAI: Redefining Lightweight and Explainable Network Security\n through Knowledge Distillation and Variational Autoencoders for Scalable\n Intrusion Detection in Cybersecurity","abstract":" The rapid proliferation of Industrial Internet of Things (IIoT) systems\nnecessitates advanced, interpretable, and scalable intrusion detection systems\n(IDS) to combat emerging cyber threats. Traditional IDS face challenges such as\nhigh computational demands, limited explainability, and inflexibility against\nevolving attack patterns. To address these limitations, this study introduces\nthe Lightweight Explainable Network Security framework (LENS-XAI), which\ncombines robust intrusion detection with enhanced interpretability and\nscalability. LENS-XAI integrates knowledge distillation, variational\nautoencoder models, and attribution-based explainability techniques to achieve\nhigh detection accuracy and transparency in decision-making. By leveraging a\ntraining set comprising 10% of the available data, the framework optimizes\ncomputational efficiency without sacrificing performance. Experimental\nevaluation on four benchmark datasets: Edge-IIoTset, UKM-IDS20, CTU-13, and\nNSL-KDD, demonstrates the framework's superior performance, achieving detection\naccuracies of 95.34%, 99.92%, 98.42%, and 99.34%, respectively. 
Additionally,\nthe framework excels in reducing false positives and adapting to complex attack\nscenarios, outperforming existing state-of-the-art methods. Key strengths of\nLENS-XAI include its lightweight design, suitable for resource-constrained\nenvironments, and its scalability across diverse IIoT and cybersecurity\ncontexts. Moreover, the explainability module enhances trust and transparency,\ncritical for practical deployment in dynamic and sensitive applications. This\nresearch contributes significantly to advancing IDS by addressing computational\nefficiency, feature interpretability, and real-world applicability. Future work\ncould focus on extending the framework to ensemble AI systems for distributed\nenvironments, further enhancing its robustness and adaptability.\n","arxiv_id":"http://arxiv.org/abs/2501.00790v2","authors":["Muhammet Anil Yagiz","Polat Goktas"]},{"title":"Decoupling Knowledge and Reasoning in Transformers: A Modular\n Architecture with Generalized Cross-Attention","abstract":" Transformers have achieved remarkable success across diverse domains, but\ntheir monolithic architecture presents challenges in interpretability,\nadaptability, and scalability. This paper introduces a novel modular\nTransformer architecture that explicitly decouples knowledge and reasoning\nthrough a generalized cross-attention mechanism to a globally shared knowledge\nbase with layer-specific transformations, specifically designed for effective\nknowledge retrieval. Critically, we provide a rigorous mathematical derivation\ndemonstrating that the Feed-Forward Network (FFN) in a standard Transformer is\na specialized case (a closure) of this generalized cross-attention, revealing\nits role in implicit knowledge retrieval and validating our design. 
This\ntheoretical framework provides a new lens for understanding FFNs and lays the\nfoundation for future research exploring enhanced interpretability,\nadaptability, and scalability, enabling richer interplay with external\nknowledge bases and other systems.\n","arxiv_id":"http://arxiv.org/abs/2501.00823v2","authors":["Zhenyu Guo","Wenguang Chen"]},{"title":"What is a Social Media Bot? A Global Comparison of Bot and Human\n Characteristics","abstract":" Chatter on social media is 20% bots and 80% humans. Chatter by bots and\nhumans is consistently different: bots tend to use linguistic cues that can be\neasily automated while humans use cues that require dialogue understanding.\nBots use words that match the identities they choose to present, while humans\nmay send messages that are not related to the identities they present. Bots and\nhumans differ in their communication structure: sampled bots have a star\ninteraction structure, while sampled humans have a hierarchical structure.\nThese conclusions are based on a large-scale analysis of social media tweets\nacross ~200 million users across 7 events. Social media bots took the world by\nstorm when social-cybersecurity researchers realized that social media users\nnot only consisted of humans but also of artificial agents called bots. These\nbots wreak havoc online by spreading disinformation and manipulating\nnarratives. Most research on bots is based on special-purpose definitions,\nmostly predicated on the event studied. This article begins by asking, \"What\nis a bot?\", and we study the underlying principles of how bots are different\nfrom humans. We develop a first-principles definition of a social media bot.\nWith this definition as a premise, we systematically compare characteristics\nbetween bots and humans across global events, and reflect on how the\nsoftware-programmed bot is an Artificial Intelligence algorithm, and its\npotential for evolution as technology advances. 
Based on our results, we provide recommendations for the\nuse and regulation of bots. Finally, we discuss open challenges and future\ndirections: Detect, to systematically identify these automated and potentially\nevolving bots; Differentiate, to evaluate the goodness of the bot in terms of\ntheir content postings and relationship interactions; Disrupt, to moderate the\nimpact of malicious bots.\n","arxiv_id":"http://arxiv.org/abs/2501.00855v1","authors":["Lynnette Hui Xian Ng","Kathleen M. Carley"]},{"title":"DiffETM: Diffusion Process Enhanced Embedded Topic Model","abstract":" The embedded topic model (ETM) is a widely used approach that assumes the\nsampled document-topic distribution conforms to the logistic normal\ndistribution for easier optimization. However, this assumption oversimplifies\nthe real document-topic distribution, limiting the model's performance. In\nresponse, we propose a novel method that introduces the diffusion process into\nthe sampling process of document-topic distribution to overcome this limitation\nand maintain an easy optimization process. We validate our method through\nextensive experiments on two mainstream datasets, proving its effectiveness in\nimproving topic modeling performance.\n","arxiv_id":"http://arxiv.org/abs/2501.00862v1","authors":["Wei Shao","Mingyang Liu","Linqi Song"]},{"title":"Representation in large language models","abstract":" The extraordinary success of recent Large Language Models (LLMs) on a diverse\narray of tasks has led to an explosion of scientific and philosophical\ntheorizing aimed at explaining how they do what they do. Unfortunately,\ndisagreement over fundamental theoretical issues has led to stalemate, with\nentrenched camps of LLM optimists and pessimists often committed to very\ndifferent views of how these systems work. 
Overcoming stalemate requires\nagreement on fundamental questions, and the goal of this paper is to address\none such question, namely: is LLM behavior driven partly by\nrepresentation-based information processing of the sort implicated in\nbiological cognition, or is it driven entirely by processes of memorization and\nstochastic table look-up? This is a question about what kind of algorithm LLMs\nimplement, and the answer carries serious implications for higher level\nquestions about whether these systems have beliefs, intentions, concepts,\nknowledge, and understanding. I argue that LLM behavior is partially driven by\nrepresentation-based information processing, and then I describe and defend a\nseries of practical techniques for investigating these representations and\ndeveloping explanations on their basis. The resulting account provides a\ngroundwork for future theorizing about language models and their successors.\n","arxiv_id":"http://arxiv.org/abs/2501.00885v1","authors":["Cameron C. Yetman"]},{"title":"Demystifying Online Clustering of Bandits: Enhanced Exploration Under\n Stochastic and Smoothed Adversarial Contexts","abstract":" The contextual multi-armed bandit (MAB) problem is crucial in sequential\ndecision-making. A line of research, known as online clustering of bandits,\nextends contextual MAB by grouping similar users into clusters, utilizing\nshared features to improve learning efficiency. However, existing algorithms,\nwhich rely on the upper confidence bound (UCB) strategy, struggle to gather\nadequate statistical information to accurately identify unknown user clusters.\nAs a result, their theoretical analyses require several strong assumptions\nabout the \"diversity\" of contexts generated by the environment, leading to\nimpractical settings, complicated analyses, and poor practical performance.\nRemoving these assumptions has been a long-standing open problem in the\nclustering of bandits literature. 
In this paper, we provide two solutions to\nthis open problem. First, following the i.i.d. context generation setting in\nexisting studies, we propose two novel algorithms, UniCLUB and PhaseUniCLUB,\nwhich incorporate enhanced exploration mechanisms to accelerate cluster\nidentification. Remarkably, our algorithms require substantially weaker\nassumptions while achieving regret bounds comparable to prior work. Second,\ninspired by the smoothed analysis framework, we propose a more practical\nsetting that eliminates the requirement for i.i.d. context generation used in\nprevious studies, thus enhancing the performance of existing algorithms for\nonline clustering of bandits. Our technique can be applied to both graph-based\nand set-based clustering of bandits frameworks. Extensive evaluations on both\nsynthetic and real-world datasets demonstrate that our proposed algorithms\nconsistently outperform existing approaches.\n","arxiv_id":"http://arxiv.org/abs/2501.00891v1","authors":["Zhuohua Li","Maoli Liu","Xiangxiang Dai","John C. S. Lui"]},{"title":"Large Language Model Based Multi-Agent System Augmented Complex Event\n Processing Pipeline for Internet of Multimedia Things","abstract":" This paper presents the development and evaluation of a Large Language Model\n(LLM), also known as foundation models, based multi-agent system framework for\ncomplex event processing (CEP) with a focus on video query processing use\ncases. The primary goal is to create a proof-of-concept (POC) that integrates\nstate-of-the-art LLM orchestration frameworks with publish/subscribe (pub/sub)\ntools to address the integration of LLMs with current CEP systems. 
Utilizing\nthe Autogen framework in conjunction with Kafka message brokers, the system\ndemonstrates an autonomous CEP pipeline capable of handling complex workflows.\nExtensive experiments evaluate the system's performance across varying\nconfigurations, complexities, and video resolutions, revealing the trade-offs\nbetween functionality and latency. The results show that while higher agent\ncounts and video complexity increase latency, the system maintains high\nconsistency in narrative coherence. This research builds upon, and contributes\nto, existing approaches to distributed AI systems, offering detailed\ninsights into integrating such systems into existing infrastructures.\n","arxiv_id":"http://arxiv.org/abs/2501.00906v2","authors":["Talha Zeeshan","Abhishek Kumar","Susanna Pirttikangas","Sasu Tarkoma"]},{"title":"Enhancing Early Diabetic Retinopathy Detection through Synthetic DR1\n Image Generation: A StyleGAN3 Approach","abstract":" Diabetic Retinopathy (DR) is a leading cause of preventable blindness. Early\ndetection at the DR1 stage is critical but is hindered by a scarcity of\nhigh-quality fundus images. This study uses StyleGAN3 to generate synthetic DR1\nimages characterized by microaneurysms with high fidelity and diversity. The\naim is to address data scarcity and enhance the performance of supervised\nclassifiers. A dataset of 2,602 DR1 images was used to train the model,\nfollowed by a comprehensive evaluation using quantitative metrics, including\nFrechet Inception Distance (FID), Kernel Inception Distance (KID), and\nEquivariance with respect to translation (EQ-T) and rotation (EQ-R).\nQualitative assessments included Human Turing tests, where trained\nophthalmologists evaluated the realism of synthetic images. Spectral analysis\nfurther validated image quality. The model achieved a final FID score of 17.29,\noutperforming the mean FID of 21.18 (95 percent confidence interval - 20.83 to\n21.56) derived from bootstrap resampling. 
Human Turing tests demonstrated the\nmodel's ability to produce highly realistic images, though minor artifacts near\nthe borders were noted. These findings suggest that StyleGAN3-generated\nsynthetic DR1 images hold significant promise for augmenting training datasets,\nenabling more accurate early detection of Diabetic Retinopathy. This\nmethodology highlights the potential of synthetic data in advancing medical\nimaging and AI-driven diagnostics.\n","arxiv_id":"http://arxiv.org/abs/2501.00954v1","authors":["Sagarnil Das","Pradeep Walia"]},{"title":"The Silent Majority: Demystifying Memorization Effect in the Presence of\n Spurious Correlations","abstract":" Machine learning models often rely on simple spurious features -- patterns in\ntraining data that correlate with targets but are not causally related to them,\nlike image backgrounds in foreground classification. This reliance typically\nleads to imbalanced test performance across minority and majority groups. In\nthis work, we take a closer look at the fundamental cause of such imbalanced\nperformance through the lens of memorization, which refers to the ability to\npredict accurately on \\textit{atypical} examples (minority groups) in the\ntraining set but failing to achieve the same accuracy in the testing set.\nThis paper systematically shows the ubiquitous existence of spurious features\nin a small set of neurons within the network, providing the first-ever evidence\nthat memorization may contribute to imbalanced group performance. Through three\nexperimental sources of converging empirical evidence, we find that a small\nsubset of neurons or channels memorizes minority group information.\nInspired by these findings, we articulate the hypothesis: the imbalanced group\nperformance is a byproduct of ``noisy'' spurious memorization confined to a\nsmall set of neurons. 
To further substantiate this hypothesis, we show that\neliminating these unnecessary spurious memorization patterns via a novel\nframework during training can significantly affect the model performance on\nminority groups. Our experimental results across various architectures and\nbenchmarks offer new insights into how neural networks encode core and spurious\nknowledge, laying the groundwork for future research in demystifying robustness\nto spurious correlation.\n","arxiv_id":"http://arxiv.org/abs/2501.00961v2","authors":["Chenyu You","Haocheng Dai","Yifei Min","Jasjeet S. Sekhon","Sarang Joshi","James S. Duncan"]},{"title":"FlashInfer: Efficient and Customizable Attention Engine for LLM\n Inference Serving","abstract":" Transformers, driven by attention mechanisms, form the foundation of large\nlanguage models (LLMs). As these models scale up, efficient GPU attention\nkernels become essential for high-throughput and low-latency inference. Diverse\nLLM applications demand flexible and high-performance attention solutions. We\npresent FlashInfer: a customizable and efficient attention engine for LLM\nserving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse\nformat and composable formats to optimize memory access and reduce redundancy.\nIt also offers a customizable attention template, enabling adaptation to\nvarious settings through Just-In-Time (JIT) compilation. Additionally,\nFlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user\nrequests while maintaining compatibility with CUDAGraph, which requires static\nconfiguration. FlashInfer has been integrated into leading LLM serving\nframeworks like SGLang, vLLM and MLC-Engine. 
Comprehensive kernel-level and\nend-to-end evaluations demonstrate FlashInfer's ability to significantly boost\nkernel performance across diverse inference scenarios: compared to\nstate-of-the-art LLM serving solutions, FlashInfer achieves a 29-69%\ninter-token-latency reduction over compiler backends on an LLM serving\nbenchmark, a 28-30% latency reduction for long-context inference, and a 13-17%\nspeedup for LLM serving with parallel generation.\n","arxiv_id":"http://arxiv.org/abs/2501.01005v1","authors":["Zihao Ye","Lequn Chen","Ruihang Lai","Wuwei Lin","Yineng Zhang","Stephanie Wang","Tianqi Chen","Baris Kasikci","Vinod Grover","Arvind Krishnamurthy","Luis Ceze"]},{"title":"CryptoMamba: Leveraging State Space Models for Accurate Bitcoin Price\n Prediction","abstract":" Predicting Bitcoin price remains a challenging problem due to the high\nvolatility and complex non-linear dynamics of cryptocurrency markets.\nTraditional time-series models, such as ARIMA and GARCH, and recurrent neural\nnetworks, like LSTMs, have been widely applied to this task but struggle to\ncapture the regime shifts and long-range dependencies inherent in the data. In\nthis work, we propose CryptoMamba, a novel Mamba-based State Space Model (SSM)\narchitecture designed to effectively capture long-range dependencies in\nfinancial time-series data. Our experiments show that CryptoMamba not only\nprovides more accurate predictions but also offers enhanced generalizability\nacross different market conditions, surpassing the limitations of previous\nmodels. Coupled with trading algorithms for real-world scenarios, CryptoMamba\ndemonstrates its practical utility by translating accurate forecasts into\nfinancial outcomes. 
Our findings signal a huge advantage for SSMs in stock and\ncryptocurrency price forecasting tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.01010v1","authors":["Mohammad Shahab Sepehri","Asal Mehradfar","Mahdi Soltanolkotabi","Salman Avestimehr"]},{"title":"ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented\n Contextual Learning","abstract":" Cultural values alignment in Large Language Models (LLMs) is a critical\nchallenge due to their tendency to embed Western-centric biases from training\ndata, leading to misrepresentations and fairness issues in cross-cultural\ncontexts. Recent approaches, such as role-assignment and few-shot learning,\noften struggle with reliable cultural alignment as they heavily rely on\npre-trained knowledge, lack scalability, and fail to capture nuanced cultural\nvalues effectively. To address these issues, we propose ValuesRAG, a novel and\neffective framework that applies Retrieval-Augmented Generation (RAG) with\nIn-Context Learning (ICL) to integrate cultural and demographic knowledge\ndynamically during text generation. Leveraging the World Values Survey (WVS)\ndataset, ValuesRAG first generates summaries of values for each individual.\nSubsequently, we curate several representative regional datasets to serve as\ntest datasets and retrieve relevant summaries of values based on demographic\nfeatures, followed by a reranking step to select the top-k relevant summaries.\nValuesRAG consistently outperforms baseline methods, both in the main\nexperiment and in the ablation study where only the values summary was\nprovided. 
Notably, ValuesRAG demonstrates a 21% improvement in accuracy over\nother baseline methods, highlighting its potential to foster culturally aligned\nAI systems and enhance the inclusivity of AI-driven applications.\n","arxiv_id":"http://arxiv.org/abs/2501.01031v2","authors":["Wonduk Seo","Zonghao Yuan","Yi Bu"]},{"title":"MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for\n Driving Perception","abstract":" Multi-sensor fusion models play a crucial role in autonomous driving\nperception, particularly in tasks like 3D object detection and HD map\nconstruction. These models provide essential and comprehensive static\nenvironmental information for autonomous driving systems. While camera-LiDAR\nfusion methods have shown promising results by integrating data from both\nmodalities, they often depend on complete sensor inputs. This reliance can lead\nto low robustness and potential failures when sensors are corrupted or missing,\nraising significant safety concerns. To tackle this challenge, we introduce the\nMulti-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive\nbenchmark aimed at evaluating the robustness of multi-sensor autonomous driving\nperception models against various sensor corruptions. Our benchmark includes 16\ncombinations of corruption types that disrupt both camera and LiDAR inputs,\neither individually or concurrently. Extensive evaluations of six 3D object\ndetection models and four HD map construction models reveal substantial\nperformance degradation under adverse weather conditions and sensor failures,\nunderscoring critical safety issues. 
The benchmark toolkit and affiliated code\nand model checkpoints have been made publicly accessible.\n","arxiv_id":"http://arxiv.org/abs/2501.01037v1","authors":["Xiaoshuai Hao","Guanqun Liu","Yuting Zhao","Yuheng Ji","Mengchuan Wei","Haimei Zhao","Lingdong Kong","Rong Yin","Yu Liu"]},{"title":"MMVA: Multimodal Matching Based on Valence and Arousal across Images,\n Music, and Musical Captions","abstract":" We introduce Multimodal Matching based on Valence and Arousal (MMVA), a\ntri-modal encoder framework designed to capture emotional content across\nimages, music, and musical captions. To support this framework, we expand the\nImage-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which\nincludes 24,756 images and 25,944 music clips with corresponding musical\ncaptions. We employ multimodal matching scores based on the continuous valence\n(emotional positivity) and arousal (emotional intensity) values. This\ncontinuous matching score allows for random sampling of image-music pairs\nduring training by computing similarity scores from the valence-arousal values\nacross different modalities. Consequently, the proposed approach achieves\nstate-of-the-art performance in valence-arousal prediction tasks. Furthermore,\nthe framework demonstrates its efficacy in various zeroshot tasks, highlighting\nthe potential of valence and arousal predictions in downstream applications.\n","arxiv_id":"http://arxiv.org/abs/2501.01094v1","authors":["Suhwan Choi","Kyu Won Kim","Myungjoo Kang"]},{"title":"Disambiguation of Chinese Polyphones in an End-to-End Framework with\n Semantic Features Extracted by Pre-trained BERT","abstract":" Grapheme-to-phoneme (G2P) conversion serves as an essential component in\nChinese Mandarin text-to-speech (TTS) system, where polyphone disambiguation is\nthe core issue. 
In this paper, we propose an end-to-end framework to predict\nthe pronunciation of a polyphonic character, which accepts a sentence containing\nthe polyphonic character as input, in the form of a Chinese character sequence, without\nany preprocessing. The proposed method consists of a\npre-trained bidirectional encoder representations from Transformers (BERT)\nmodel and a neural network (NN) based classifier. The pre-trained BERT model\nextracts semantic features from a raw Chinese character sequence and the NN\nbased classifier predicts the polyphonic character's pronunciation according to\nthe BERT output. In our experiments, we implemented three classifiers: a\nfully-connected network based classifier, a long short-term memory (LSTM)\nnetwork based classifier, and a Transformer block based classifier. The\nexperimental results, compared with the baseline approach based on LSTM,\ndemonstrate that the pre-trained model extracts effective semantic features,\nwhich greatly enhances the performance of polyphone disambiguation. In\naddition, we also explored the impact of contextual information on polyphone\ndisambiguation.\n","arxiv_id":"http://arxiv.org/abs/2501.01102v1","authors":["Dongyang Dai","Zhiyong Wu","Shiyin Kang","Xixin Wu","Jia Jia","Dan Su","Dong Yu","Helen Meng"]},{"title":"learning discriminative features from spectrograms using center loss for\n speech emotion recognition","abstract":" Identifying the emotional state from speech is essential for the natural\ninteraction of the machine with the speaker. However, extracting effective\nfeatures for emotion recognition is difficult, as emotions are ambiguous. We\npropose a novel approach to learn discriminative features from variable length\nspectrograms for emotion recognition by combining softmax cross-entropy loss\nand center loss. 
The softmax cross-entropy loss makes features from\ndifferent emotion categories separable, and the center loss efficiently pulls\nfeatures belonging to the same emotion category toward their center. By combining\nthe two losses, the discriminative power is greatly enhanced,\nwhich leads the network to learn more effective features for emotion\nrecognition. As demonstrated by the experimental results, after introducing\ncenter loss, both the unweighted accuracy and weighted accuracy are improved by\nover 3\\% on Mel-spectrogram input, and more than 4\\% on Short Time Fourier\nTransform spectrogram input.\n","arxiv_id":"http://arxiv.org/abs/2501.01103v1","authors":["Dongyang Dai","Zhiyong Wu","Runnan Li","Xixin Wu","Jia Jia","Helen Meng"]},{"title":"Robust COVID-19 Detection from Cough Sounds using Deep Neural Decision\n Tree and Forest: A Comprehensive Cross-Datasets Evaluation","abstract":" This research presents a robust approach to classifying COVID-19 cough sounds\nusing cutting-edge machine-learning techniques. Leveraging deep neural decision\ntrees and deep neural decision forests, our methodology demonstrates consistent\nperformance across diverse cough sound datasets. We begin with a comprehensive\nextraction of features to capture a wide range of audio features from\nindividuals, whether COVID-19 positive or negative. To determine the most\nimportant features, we use recursive feature elimination along with\ncross-validation. Bayesian optimization fine-tunes hyper-parameters of deep\nneural decision tree and deep neural decision forest models. Additionally, we\nintegrate SMOTE during training to ensure a balanced representation of\npositive and negative data. Model performance refinement is achieved through\nthreshold optimization, maximizing the ROC-AUC score. Our approach undergoes a\ncomprehensive evaluation across five datasets: Cambridge, Coswara, COUGHVID,\nVirufy, and the combined Virufy with the NoCoCoDa dataset. 
Consistently\noutperforming state-of-the-art methods, our proposed approach yields notable\nAUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective\ndatasets. Merging all datasets into a combined dataset, our method, using a\ndeep neural decision forest classifier, achieves an AUC of 0.97. Also, our\nstudy includes a comprehensive cross-datasets analysis, revealing demographic\nand geographic differences in the cough sounds associated with COVID-19. These\ndifferences highlight the challenges in transferring learned features across\ndiverse datasets and underscore the potential benefits of dataset integration,\nimproving generalizability and enhancing COVID-19 detection from audio signals.\n","arxiv_id":"http://arxiv.org/abs/2501.01117v1","authors":["Rofiqul Islam","Nihad Karim Chowdhury","Muhammad Ashad Kabir"]},{"title":"TED: Turn Emphasis with Dialogue Feature Attention for Emotion\n Recognition in Conversation","abstract":" Emotion recognition in conversation (ERC) has been attracting attention by\nmethods for modeling multi-turn contexts. The multi-turn input to a pretraining\nmodel implicitly assumes that the current turn and other turns are\ndistinguished during the training process by inserting special tokens into the\ninput sequence. This paper proposes a priority-based attention method to\ndistinguish each turn explicitly by adding dialogue features into the attention\nmechanism, called Turn Emphasis with Dialogue (TED). It has a priority for each\nturn according to turn position and speaker information as dialogue features.\nIt takes multi-head self-attention between turn-based vectors for multi-turn\ninput and adjusts attention scores with the dialogue features. We evaluate TED\non four typical benchmarks. 
The experimental results demonstrate that TED has\nhigh overall performance in all datasets and achieves state-of-the-art\nperformance on IEMOCAP with numerous turns.\n","arxiv_id":"http://arxiv.org/abs/2501.01123v1","authors":["Junya Ono","Hiromi Wakaki"]},{"title":"Missing Data as Augmentation in the Earth Observation Domain: A\n Multi-View Learning Approach","abstract":" Multi-view learning (MVL) leverages multiple sources or views of data to\nenhance machine learning model performance and robustness. This approach has\nbeen successfully used in the Earth Observation (EO) domain, where views have a\nheterogeneous nature and can be affected by missing data. Despite the negative\neffect that missing data has on model predictions, the ML literature has used\nit as an augmentation technique to improve model generalization, like masking\nthe input data. Inspired by this, we introduce novel methods for EO\napplications tailored to MVL with missing views. Our methods integrate the\ncombination of a set to simulate all combinations of missing views as different\ntraining samples. Instead of replacing missing data with a numerical value, we\nuse dynamic merge functions, like average, and more complex ones like\nTransformer. This allows the MVL model to entirely ignore the missing views,\nenhancing its predictive robustness. We experiment on four EO datasets with\ntemporal and static views, including state-of-the-art methods from the EO\ndomain. The results indicate that our methods improve model robustness under\nconditions of moderate missingness, and improve the predictive performance when\nall views are present. 
The proposed methods offer a single adaptive solution to\noperate effectively with any combination of available views.\n","arxiv_id":"http://arxiv.org/abs/2501.01132v1","authors":["Francisco Mena","Diego Arenas","Andreas Dengel"]},{"title":"TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions","abstract":" While generative models such as text-to-image, large language models and\ntext-to-video have seen significant progress, the extension to\ntext-to-virtual-reality remains largely unexplored, due to a deficit in\ntraining data and the complexity of achieving realistic depth and motion in\nvirtual environments. This paper proposes an approach to coalesce existing\ngenerative systems to form a stereoscopic virtual reality video from text.\n Carried out in three main stages, we start with a base text-to-image model\nthat captures context from an input text. We then employ Stable Diffusion on\nthe rudimentary image produced, to generate frames with enhanced realism and\noverall quality. These frames are processed with depth estimation algorithms to\ncreate left-eye and right-eye views, which are stitched side-by-side to create\nan immersive viewing experience. Such systems would be highly beneficial in\nvirtual reality production, since filming and scene building often require\nextensive hours of work and post-production effort.\n We utilize image evaluation techniques, specifically Fr\\'echet Inception\nDistance and CLIP Score, to assess the visual quality of frames produced for\nthe video. These quantitative measures establish the proficiency of the\nproposed method.\n Our work highlights the exciting possibilities of using natural\nlanguage-driven graphics in fields like virtual reality simulations.\n","arxiv_id":"http://arxiv.org/abs/2501.01156v1","authors":["Vriksha Srihari","R. Bhavya","Shruti Jayaraman","V. 
Mary Anita Rajam"]},{"title":"Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A\n Framework for Senior Design Projects","abstract":" Multi-Agent Large Language Models (LLMs) are gaining significant attention\nfor their ability to harness collective intelligence in complex\nproblem-solving, decision-making, and planning tasks. This aligns with the\nconcept of the wisdom of crowds, where diverse agents contribute collectively\nto generating effective solutions, making it particularly suitable for\neducational settings. Senior design projects, also known as capstone or final\nyear projects, are pivotal in engineering education as they integrate\ntheoretical knowledge with practical application, fostering critical thinking,\nteamwork, and real-world problem-solving skills. In this paper, we explore the\nuse of Multi-Agent LLMs in supporting these senior design projects undertaken\nby engineering students, which often involve multidisciplinary considerations\nand conflicting objectives, such as optimizing technical performance while\naddressing ethical, social, and environmental concerns. We propose a framework\nwhere distinct LLM agents represent different expert perspectives, such as\nproblem formulation agents, system complexity agents, societal and ethical\nagents, or project managers, thus facilitating a holistic problem-solving\napproach. This implementation leverages standard multi-agent system (MAS)\nconcepts such as coordination, cooperation, and negotiation, incorporating\nprompt engineering to develop diverse personas for each agent. These agents\nengage in rich, collaborative dialogues to simulate human engineering teams,\nguided by principles from swarm AI to efficiently balance individual\ncontributions towards a unified solution. We adapt these techniques to create a\ncollaboration structure for LLM agents, encouraging interdisciplinary reasoning\nand negotiation similar to real-world senior design projects. 
To assess the\nefficacy of this framework, we collected six proposals of engineering and\ncomputer science of...\n","arxiv_id":"http://arxiv.org/abs/2501.01205v1","authors":["Abdullah Mushtaq","Muhammad Rafay Naeem","Ibrahim Ghaznavi","Muhammad Imran Taj","Imran Hashmi","Junaid Qadir"]},{"title":"Face-Human-Bench: A Comprehensive Benchmark of Face and Human\n Understanding for Multi-modal Assistants","abstract":" Faces and humans are crucial elements in social interaction and are widely\nincluded in everyday photos and videos. Therefore, a deep understanding of\nfaces and humans will enable multi-modal assistants to achieve improved\nresponse quality and broadened application scope. Currently, the multi-modal\nassistant community lacks a comprehensive and scientific evaluation of face and\nhuman understanding abilities. In this paper, we first propose a hierarchical\nability taxonomy that includes three levels of abilities. Then, based on this\ntaxonomy, we collect images and annotations from publicly available datasets in\nthe face and human community and build a semi-automatic data pipeline to\nproduce problems for the new benchmark. Finally, the obtained Face-Human-Bench\ncomprises a development set with 900 problems and a test set with 1800\nproblems, supporting both English and Chinese. We conduct evaluations over 25\nmainstream multi-modal large language models (MLLMs) with our Face-Human-Bench,\nfocusing on the correlation between abilities, the impact of the relative\nposition of targets on performance, and the impact of Chain of Thought (CoT)\nprompting on performance. 
Moreover, inspired by multi-modal agents, we also\nexplore which abilities of MLLMs need to be supplemented by specialist models.\n","arxiv_id":"http://arxiv.org/abs/2501.01243v2","authors":["Lixiong Qin","Shilong Ou","Miaoxuan Zhang","Jiangning Wei","Yuhang Zhang","Xiaoshuai Song","Yuchen Liu","Mei Wang","Weiran Xu"]},{"title":"ProgCo: Program Helps Self-Correction of Large Language Models","abstract":" Self-Correction aims to enable large language models (LLMs) to self-verify\nand self-refine their initial responses without external feedback. However,\nLLMs often fail to effectively self-verify and generate correct feedback,\nfurther misleading refinement and leading to the failure of self-correction,\nespecially in complex reasoning tasks. In this paper, we propose Program-driven\nSelf-Correction (ProgCo). First, program-driven verification (ProgVe) achieves\ncomplex verification logic and extensive validation through self-generated,\nself-executing verification pseudo-programs. Then, program-driven refinement\n(ProgRe) receives feedback from ProgVe, and conducts dual reflection and refinement\non both responses and verification programs to mitigate the misleading effects of incorrect\nfeedback in complex reasoning tasks. Experiments on three instruction-following\nand mathematical benchmarks indicate that ProgCo achieves effective\nself-correction, and can further enhance performance when combined with real\nprogram tools.\n","arxiv_id":"http://arxiv.org/abs/2501.01264v1","authors":["Xiaoshuai Song","Yanan Wu","Weixun Wang","Jiaheng Liu","Wenbo Su","Bo Zheng"]},{"title":"CultureVLM: Characterizing and Improving Cultural Understanding of\n Vision-Language Models for over 100 Countries","abstract":" Vision-language models (VLMs) have advanced human-AI interaction but struggle\nwith cultural understanding, often misinterpreting symbols, gestures, and\nartifacts due to biases in predominantly Western-centric training data. 
In this\npaper, we construct CultureVerse, a large-scale multimodal benchmark covering\n19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3\nquestion types, with the aim of characterizing and improving VLMs'\nmulticultural understanding capabilities. Then, we propose CultureVLM, a series\nof VLMs fine-tuned on our dataset to achieve significant performance\nimprovement in cultural understanding. Our evaluation of 16 models reveals\nsignificant disparities, with stronger performance on Western concepts and\nweaker results in African and Asian contexts. Fine-tuning on our CultureVerse\nenhances cultural perception, demonstrating cross-cultural, cross-continent,\nand cross-dataset generalization without sacrificing performance on models'\ngeneral VLM benchmarks. We further present insights on cultural generalization\nand forgetting. We hope that this work could lay the foundation for more\nequitable and culturally aware multimodal AI systems.\n","arxiv_id":"http://arxiv.org/abs/2501.01282v1","authors":["Shudong Liu","Yiqiao Jin","Cheng Li","Derek F. Wong","Qingsong Wen","Lichao Sun","Haipeng Chen","Xing Xie","Jindong Wang"]},{"title":"LEO-Split: A Semi-Supervised Split Learning Framework over LEO Satellite\n Networks","abstract":" Recently, the increasing deployment of LEO satellite systems has enabled\nvarious space analytics (e.g., crop and climate monitoring), which heavily\nrely on advancements in deep learning (DL). However, the intermittent\nconnectivity between LEO satellites and the ground station (GS) significantly\nhinders the timely transmission of raw data to the GS for centralized learning,\nwhile the scaled-up DL models hamper distributed learning on\nresource-constrained LEO satellites. 
Though split learning (SL) can be a\npotential solution to these problems by partitioning a model and offloading\nthe primary training workload to the GS, the labor-intensive labeling process remains\nan obstacle, with intermittent connectivity and data heterogeneity being other\nchallenges. In this paper, we propose LEO-Split, a semi-supervised (SS) SL\ndesign tailored for satellite networks to combat these challenges. Leveraging\nSS learning to handle (labeled) data scarcity, we construct an auxiliary model\nto tackle training failures during the satellite-GS non-contact time. Moreover,\nwe propose a pseudo-labeling algorithm to rectify data imbalances across\nsatellites. Lastly, an adaptive activation interpolation scheme is devised to\nprevent the overfitting of server-side sub-model training at the GS. Extensive\nexperiments with real-world LEO satellite traces (e.g., Starlink) demonstrate\nthat our LEO-Split framework achieves superior performance compared to\nstate-of-the-art benchmarks.\n","arxiv_id":"http://arxiv.org/abs/2501.01293v1","authors":["Zheng Lin","Yuxin Zhang","Zhe Chen","Zihan Fang","Cong Wu","Xianhao Chen","Yue Gao","Jun Luo"]},{"title":"The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for\n Test Case Generation","abstract":" Test cases are essential for validating the reliability and quality of\nsoftware applications. Recent studies have demonstrated the capability of Large\nLanguage Models (LLMs) to generate useful test cases for given source code.\nHowever, the existing work primarily relies on human-written plain prompts,\nwhich often leads to suboptimal results since the performance of LLMs can be\nhighly influenced by the prompts. Moreover, these approaches use the same\nprompt for all LLMs, overlooking the fact that different LLMs might be best\nsuited to different prompts. Given the wide variety of possible prompt\nformulations, automatically discovering the optimal prompt for each LLM\npresents a significant challenge. 
Although there are methods for automated\nprompt optimization in the natural language processing field, they struggle to\nproduce effective prompts for the test case generation task. First, these methods\niteratively optimize prompts by simply combining and mutating existing ones\nwithout proper guidance, resulting in prompts that lack diversity and tend to\nrepeat the same errors in the generated test cases. Second, the prompts\ngenerally lack domain contextual knowledge, limiting LLMs' performance in\nthe task.\n","arxiv_id":"http://arxiv.org/abs/2501.01329v1","authors":["Shuzheng Gao","Chaozheng Wang","Cuiyun Gao","Xiaoqian Jiao","Chun Yong Chong","Shan Gao","Michael Lyu"]},{"title":"CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for\n Benchmarking Large Language Models","abstract":" Numerous studies have investigated methods for jailbreaking Large Language\nModels (LLMs) to generate harmful content. Typically, these methods are\nevaluated using datasets of malicious prompts designed to bypass security\npolicies established by LLM providers. However, the generally broad scope and\nopen-ended nature of existing datasets can complicate the assessment of\njailbreaking effectiveness, particularly in specific domains, notably\ncybersecurity. To address this issue, we present and publicly release\nCySecBench, a comprehensive dataset containing 12662 prompts specifically\ndesigned to evaluate jailbreaking techniques in the cybersecurity domain. The\ndataset is organized into 10 distinct attack-type categories, featuring\nclose-ended prompts to enable a more consistent and accurate assessment of\njailbreaking attempts. Furthermore, we detail our methodology for dataset\ngeneration and filtration, which can be adapted to create similar datasets in\nother domains. To demonstrate the utility of CySecBench, we propose and\nevaluate a jailbreaking approach based on prompt obfuscation. 
Our experimental\nresults show that this method successfully elicits harmful content from\ncommercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT\nand 88% with Gemini; in contrast, Claude demonstrated greater resilience with a\njailbreaking SR of 17%. Compared to existing benchmark approaches, our method\nshows superior performance, highlighting the value of domain-specific\nevaluation datasets for assessing LLM security measures. Moreover, when\nevaluated using prompts from a widely used dataset (i.e., AdvBench), it\nachieved an SR of 78.5%, higher than the state-of-the-art methods.\n","arxiv_id":"http://arxiv.org/abs/2501.01335v1","authors":["Johan Wahréus","Ahmed Mohamed Hussain","Panos Papadimitratos"]},{"title":"ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding","abstract":" 3D visual grounding (3DVG) involves localizing entities in a 3D scene\nreferred to by natural language text. Such models are useful for embodied AI\nand scene retrieval applications, which involve searching for objects or\npatterns using natural language descriptions. While recent works have focused\non LLM-based scaling of 3DVG datasets, these datasets do not capture the full\nrange of potential prompts which could be specified in the English language. To\nensure that we are scaling up and testing against a useful and representative\nset of prompts, we propose a framework for linguistically analyzing 3DVG\nprompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a\ndiagnostic dataset for evaluating visual grounding methods against a diverse\nset of language patterns. We evaluate existing open-vocabulary 3DVG methods to\ndemonstrate that these methods are not yet proficient in understanding and\nidentifying the targets of more challenging, out-of-distribution prompts,\ntoward real-world applications.\n","arxiv_id":"http://arxiv.org/abs/2501.01366v1","authors":["Austin T. Wang","ZeMing Gong","Angel X. 
Chang"]},{"title":"Contrastive Learning from Exploratory Actions: Leveraging Natural\n Interactions for Preference Elicitation","abstract":" People have a variety of preferences for how robots behave. To understand and\nreason about these preferences, robots aim to learn a reward function that\ndescribes how aligned robot behaviors are with a user's preferences. Good\nrepresentations of a robot's behavior can significantly reduce the time and\neffort required for a user to teach the robot their preferences. Specifying\nthese representations -- what \"features\" of the robot's behavior matter to\nusers -- remains a difficult problem; features learned from raw data lack\nsemantic meaning and features learned from user data require users to engage in\ntedious labeling processes. Our key insight is that users tasked with\ncustomizing a robot are intrinsically motivated to produce labels through\nexploratory search; they explore behaviors that they find interesting and\nignore behaviors that are irrelevant. To harness this novel data source of\nexploratory actions, we propose contrastive learning from exploratory actions\n(CLEA) to learn trajectory features that are aligned with features that users\ncare about. 
We learned CLEA features from exploratory actions users performed\nin an open-ended signal design activity (N=25) with a Kuri robot, and evaluated\nCLEA features through a second user study with a different set of users (N=42).\nCLEA features outperformed self-supervised features when eliciting user\npreferences over four metrics: completeness, simplicity, minimality, and\nexplainability.\n","arxiv_id":"http://arxiv.org/abs/2501.01367v1","authors":["Nathaniel Dennler","Stefanos Nikolaidis","Maja Matarić"]},{"title":"ScarNet: A Novel Foundation Model for Automated Myocardial Scar\n Quantification from LGE in Cardiac MRI","abstract":" Background: Late Gadolinium Enhancement (LGE) imaging is the gold standard\nfor assessing myocardial fibrosis and scarring, with left ventricular (LV) LGE\nextent predicting major adverse cardiac events (MACE). Despite its importance,\nroutine LGE-based LV scar quantification is hindered by labor-intensive manual\nsegmentation and inter-observer variability. Methods: We propose ScarNet, a\nhybrid model combining a transformer-based encoder from the Medical Segment\nAnything Model (MedSAM) with a convolution-based U-Net decoder, enhanced by\ntailored attention blocks. ScarNet was trained on 552 ischemic cardiomyopathy\npatients with expert segmentations of myocardial and scar boundaries and tested\non 184 separate patients. Results: ScarNet achieved robust scar segmentation in\n184 test patients, yielding a median Dice score of 0.912 (IQR: 0.863--0.944),\nsignificantly outperforming MedSAM (median Dice = 0.046, IQR: 0.043--0.047) and\nnnU-Net (median Dice = 0.638, IQR: 0.604--0.661). ScarNet demonstrated lower\nbias (-0.63%) and coefficient of variation (4.3%) compared to MedSAM (bias:\n-13.31%, CoV: 130.3%) and nnU-Net (bias: -2.46%, CoV: 20.3%). 
In Monte Carlo\nsimulations with noise perturbations, ScarNet achieved significantly higher\nscar Dice (0.892 \\pm 0.053, CoV = 5.9%) than MedSAM (0.048 \\pm 0.112, CoV =\n233.3%) and nnU-Net (0.615 \\pm 0.537, CoV = 28.7%). Conclusion: ScarNet\noutperformed MedSAM and nnU-Net in accurately segmenting myocardial and scar\nboundaries in LGE images. The model exhibited robust performance across diverse\nimage qualities and scar patterns.\n","arxiv_id":"http://arxiv.org/abs/2501.01372v1","authors":["Neda Tavakoli","Amir Ali Rahsepar","Brandon C. Benefield","Daming Shen","Santiago López-Tapia","Florian Schiffers","Jeffrey J. Goldberger","Christine M. Albert","Edwin Wu","Aggelos K. Katsaggelos","Daniel C. Lee","Daniel Kim"]},{"title":"Training Medical Large Vision-Language Models with Abnormal-Aware\n Feedback","abstract":" Existing Medical Large Vision-Language Models (Med-LVLMs), which encapsulate\nextensive medical knowledge, demonstrate excellent capabilities in\nunderstanding medical images and responding to human queries based on these\nimages. However, there remain challenges in visual localization in medical\nimages, which is crucial for abnormality detection and interpretation. To\naddress these issues, we propose a novel UMed-LVLM designed with Unveiling\nMedical abnormalities. Specifically, we collect a Medical Abnormalities\nUnveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM\ntraining. To collect MAU dataset, we propose a prompt method utilizing the\nGPT-4V to generate diagnoses based on identified abnormal areas in medical\nimages. Moreover, the two-stage training method includes Abnormal-Aware\nInstruction Tuning and Abnormal-Aware Rewarding, comprising Abnormal\nLocalization Rewarding and Vision Relevance Rewarding. Experimental results\ndemonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and\nunderstanding medical abnormality. 
In addition, this work shows that enhancing\nthe abnormality detection capabilities of Med-LVLMs significantly improves\ntheir understanding of medical images and generalization capability.\n","arxiv_id":"http://arxiv.org/abs/2501.01377v1","authors":["Yucheng Zhou","Lingran Song","Jianbing Shen"]},{"title":"Multi-Modal Video Feature Extraction for Popularity Prediction","abstract":" This work aims to predict the popularity of short videos using the videos\nthemselves and their related features. Popularity is measured by four key\nengagement metrics: view count, like count, comment count, and share count.\nThis study employs video classification models with different architectures and\ntraining methods as backbone networks to extract video modality features.\nMeanwhile, the cleaned video captions are incorporated into a carefully\ndesigned prompt framework, along with the video, as input for video-to-text\ngeneration models, which generate detailed text-based video content\nunderstanding. These texts are then encoded into vectors using a pre-trained\nBERT model. Based on the six sets of vectors mentioned above, a neural network\nis trained for each of the four prediction metrics. Moreover, the study\nconducts data mining and feature engineering based on the video and tabular\ndata, constructing practical features such as the total frequency of hashtag\nappearances, the total frequency of mention appearances, video duration, frame\ncount, frame rate, and total time online. Multiple machine learning models are\ntrained, and the most stable model, XGBoost, is selected. 
Finally, the\npredictions from the neural network and XGBoost models are averaged to obtain\nthe final result.\n","arxiv_id":"http://arxiv.org/abs/2501.01422v1","authors":["Haixu Liu","Wenning Wang","Haoxiang Zheng","Penghao Jiang","Qirui Wang","Ruiqing Yan","Qiuzhuang Sun"]},{"title":"Object-level Visual Prompts for Compositional Image Generation","abstract":" We introduce a method for composing object-level visual prompts within a\ntext-to-image diffusion model. Our approach addresses the task of generating\nsemantically coherent compositions across diverse scenes and styles, similar to\nthe versatility and expressiveness offered by text prompts. A key challenge in\nthis task is to preserve the identity of the objects depicted in the input\nvisual prompts, while also generating diverse compositions across different\nimages. To address this challenge, we introduce a new KV-mixed cross-attention\nmechanism, in which keys and values are learned from distinct visual\nrepresentations. The keys are derived from an encoder with a small bottleneck\nfor layout control, whereas the values come from a larger bottleneck encoder\nthat captures fine-grained appearance details. By mixing keys and values from\nthese complementary sources, our model preserves the identity of the visual\nprompts while supporting flexible variations in object arrangement, pose, and\ncomposition. 
During inference, we further propose object-level compositional\nguidance to improve the method's identity preservation and layout correctness.\nResults show that our technique produces diverse scene compositions that\npreserve the unique characteristics of each visual prompt, expanding the\ncreative potential of text-to-image generation.\n","arxiv_id":"http://arxiv.org/abs/2501.01424v1","authors":["Gaurav Parmar","Or Patashnik","Kuan-Chieh Wang","Daniil Ostashev","Srinivasa Narasimhan","Jun-Yan Zhu","Daniel Cohen-Or","Kfir Aberman"]},{"title":"Enhancing Reasoning through Process Supervision with Monte Carlo Tree\n Search","abstract":" Large language models (LLMs) have demonstrated their remarkable capacity\nacross a variety of tasks. However, reasoning remains a challenge for LLMs. To\nimprove LLMs' reasoning ability, process supervision has proven to be better\nthan outcome supervision. In this work, we study using Monte Carlo Tree Search\n(MCTS) to generate process supervision data with LLMs themselves for training\nthem. We sample reasoning steps with an LLM and assign each step a score that\ncaptures its \"relative correctness,\" and the LLM is then trained by minimizing\nthe weighted log-likelihood of generating the reasoning steps. This\ngenerate-then-train process is repeated iteratively until convergence. Our\nexperimental results demonstrate that the proposed methods considerably improve\nthe performance of LLMs on two mathematical reasoning datasets. 
Furthermore,\nmodels trained on one dataset also exhibit improved performance on the other,\nshowing the transferability of the enhanced reasoning ability.\n","arxiv_id":"http://arxiv.org/abs/2501.01478v1","authors":["Shuangtao Li","Shuaihao Dong","Kexin Luan","Xinhan Di","Chaofan Ding"]},{"title":"ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier\n for the LSST","abstract":" We present ORACLE, the first hierarchical deep-learning model for real-time,\ncontext-aware classification of transient and variable astrophysical phenomena.\nORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has\nbeen trained using a custom hierarchical cross-entropy loss function to provide\nhigh-confidence classifications along an observationally-driven taxonomy with\nas little as a single photometric observation. Contextual information for each\nobject, including host galaxy photometric redshift, offset, ellipticity and\nbrightness, is concatenated to the light curve embedding and used to make a\nfinal prediction. Training on $\\sim$0.5M events from the Extended LSST\nAstronomical Time-Series Classification Challenge, we achieve a top-level\n(Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of\nphotometric observations after the first detection in addition to contextual\ninformation, for each event; this increases to $\u003e$0.99 once 64 days of the\nlight curve has been obtained, and 0.83 at 1024 days after first detection for\n19-way classification (including supernova sub-types, active galactic nuclei,\nvariable stars, microlensing events, and kilonovae). We also compare ORACLE\nwith other state-of-the-art classifiers and report comparable performance for\nthe 19-way classification task, in addition to delivering accurate top-level\nclassifications much earlier. 
The code and model weights used in this work are\npublicly available at our associated GitHub repository\n(https://github.com/uiucsn/ELAsTiCC-Classification).\n","arxiv_id":"http://arxiv.org/abs/2501.01496v1","authors":["Ved G. Shah","Alex Gagliano","Konstantin Malanchev","Gautham Narayan","The LSST Dark Energy Science Collaboration"]},{"title":"Transfer Learning Analysis of Variational Quantum Circuits","abstract":" This work analyzes transfer learning of the Variational Quantum Circuit\n(VQC). Our framework begins with a pretrained VQC configured in one domain and\ncalculates the transition of 1-parameter unitary subgroups required for a new\ndomain. A formalism is established to investigate the adaptability and\ncapability of a VQC under the analysis of loss bounds. Our theory observes\nknowledge transfer in VQCs and provides a heuristic interpretation for the\nmechanism. An analytical fine-tuning method is derived to attain the optimal\ntransition for adaptations of similar domains.\n","arxiv_id":"http://arxiv.org/abs/2501.01507v1","authors":["Huan-Hsin Tseng","Hsin-Yi Lin","Samuel Yen-Chi Chen","Shinjae Yoo"]},{"title":"AI-Enabled Operations at Fermi Complex: Multivariate Time Series\n Prediction for Outage Prediction and Diagnosis","abstract":" The Main Control Room of the Fermilab accelerator complex continuously\ngathers extensive time-series data from thousands of sensors monitoring the\nbeam. However, unplanned events such as trips or voltage fluctuations often\nresult in beam outages, causing operational downtime. This downtime not only\nconsumes operator effort in diagnosing and addressing the issue but also leads\nto unnecessary energy consumption by idle machines awaiting beam restoration.\nThe current threshold-based alarm system is reactive and faces challenges\nincluding frequent false alarms and inconsistent outage-cause labeling. 
To\naddress these limitations, we propose an AI-enabled framework that leverages\npredictive analytics and automated labeling. Using data from $2,703$ Linac\ndevices and $80$ operator-labeled outages, we evaluate state-of-the-art deep\nlearning architectures, including recurrent, attention-based, and linear\nmodels, for beam outage prediction. Additionally, we assess a Random\nForest-based labeling system for providing consistent, confidence-scored outage\nannotations. Our findings highlight the strengths and weaknesses of these\narchitectures for beam outage prediction and identify critical gaps that must\nbe addressed to fully harness AI for transitioning downtime handling from\nreactive to predictive, ultimately reducing downtime and improving\ndecision-making in accelerator management.\n","arxiv_id":"http://arxiv.org/abs/2501.01509v1","authors":["Milan Jain","Burcu O. Mutlu","Caleb Stam","Jan Strube","Brian A. Schupbach","Jason M. St. John","William A. Pellico"]},{"title":"DiagrammaticLearning: A Graphical Language for Compositional Training\n Regimes","abstract":" Motivated by deep learning regimes with multiple interacting yet distinct\nmodel components, we introduce learning diagrams, graphical depictions of\ntraining setups that capture parameterized learning as data rather than code. A\nlearning diagram compiles to a unique loss function on which component models\nare trained. The result of training on this loss is a collection of models\nwhose predictions ``agree\" with one another. We show that a number of popular\nlearning setups such as few-shot multi-task learning, knowledge distillation,\nand multi-modal learning can be depicted as learning diagrams. We further\nimplement learning diagrams in a library that allows users to build diagrams of\nPyTorch and Flux.jl models. 
By implementing some classic machine learning use\ncases, we demonstrate how learning diagrams allow practitioners to build\ncomplicated models as compositions of smaller components, identify\nrelationships between workflows, and manipulate models during or after\ntraining. Leveraging a category theoretic framework, we introduce a rigorous\nsemantics for learning diagrams that puts such operations on a firm\nmathematical foundation.\n","arxiv_id":"http://arxiv.org/abs/2501.01515v1","authors":["Mason Lary","Richard Samuelson","Alexander Wilentz","Alina Zare","Matthew Klawonn","James P. Fairbanks"]},{"title":"Improving Robustness Estimates in Natural Language Explainable AI though\n Synonymity Weighted Similarity Measures","abstract":" Explainable AI (XAI) has seen a surge in recent interest with the\nproliferation of powerful but intractable black-box models. Moreover, XAI has\ncome under fire for techniques that may not offer reliable explanations. As\nmany of the methods in XAI are themselves models, adversarial examples have\nbeen prominent in the literature surrounding the effectiveness of XAI, with the\nobjective of these examples being to alter the explanation while maintaining\nthe output of the original model. For explanations in natural language, it is\nnatural to use measures found in the domain of information retrieval for use\nwith ranked lists to guide the adversarial XAI process. We show that the\nstandard implementation of these measures are poorly suited for the comparison\nof explanations in adversarial XAI and amend them by using information that is\ndiscarded, the synonymity of perturbed words. 
This synonymity weighting\nproduces more accurate estimates of the actual weakness of XAI methods to\nadversarial examples.\n","arxiv_id":"http://arxiv.org/abs/2501.01516v1","authors":["Christopher Burger"]},{"title":"A Metasemantic-Metapragmatic Framework for Taxonomizing Multimodal\n Communicative Alignment","abstract":" Drawing on contemporary pragmatist philosophy and linguistic theories on\ncognition, meaning, and communication, this paper presents a dynamic,\nmetasemantic-metapragmatic taxonomy for grounding and conceptualizing\nhuman-like multimodal communicative alignment. The framework is rooted in\ncontemporary developments of the three basic communicative capacities initially\nidentified by American logician and pragmatist philosopher Charles Sanders\nPeirce: iconic (sensory and perceptual qualities), indexical (contextual and\nsociocultural associations), and rule-like (symbolic and intuitive reasoning).\nExpanding on these developments, I introduce the concept of indexical\ncontextualization and propose the principle of \"contextualization\ndirectionality\" for characterizing the crucial metapragmatic capacity for\nmaintaining, navigating, or transitioning between semantic and pragmatic modes\nof multimodal communication. 
I contend that current cognitive-social\ncomputational and engineering methodologies disproportionately emphasize the\nsemantic/metasemantic domain, overlooking the pivotal role of metapragmatic\nindexicality in traversing the semantic-pragmatic spectrum of communication.\nThe framework's broader implications for intentionality, identity, affect, and\nethics in within-modal and cross-modal human-machine alignment are also\ndiscussed.\n","arxiv_id":"http://arxiv.org/abs/2501.01535v1","authors":["Eugene Yu Ji"]},{"title":"In Search of a Lost Metric: Human Empowerment as a Pillar of Socially\n Conscious Navigation","abstract":" In social robot navigation, traditional metrics like proxemics and behavior\nnaturalness emphasize human comfort and adherence to social norms but often\nfail to capture an agent's autonomy and adaptability in dynamic environments.\nThis paper introduces human empowerment, an information-theoretic concept that\nmeasures a human's ability to influence their future states and observe those\nchanges, as a complementary metric for evaluating social compliance. This\nmetric reveals how robot navigation policies can indirectly impact human\nempowerment. We present a framework that integrates human empowerment into the\nevaluation of social performance in navigation tasks. Through numerical\nsimulations, we demonstrate that human empowerment as a metric not only aligns\nwith intuitive social behavior, but also shows statistically significant\ndifferences across various robot navigation policies. 
These results provide a\ndeeper understanding of how different policies affect social compliance,\nhighlighting the potential of human empowerment as a complementary metric for\nfuture research in social navigation.\n","arxiv_id":"http://arxiv.org/abs/2501.01539v1","authors":["Vasanth Reddy Baddam","Behdad Chalaki","Vaishnav Tadiparthi","Hossein Nourkhiz Mahjoub","Ehsan Moradi-Pari","Hoda Eldardiry","Almuatazbellah Boker"]},{"title":"BLAST: A Stealthy Backdoor Leverage Attack against Cooperative\n Multi-Agent Deep Reinforcement Learning based Systems","abstract":" Recent studies have shown that cooperative multi-agent deep reinforcement\nlearning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor\ntrigger is observed, it will perform malicious actions leading to failures or\nmalicious goals. However, existing backdoor attacks suffer from several issues,\ne.g., instant trigger patterns lack stealthiness, the backdoor is trained or\nactivated by an additional network, or all agents are backdoored. To this end,\nin this paper, we propose a novel backdoor leverage attack against c-MADRL,\nBLAST, which attacks the entire multi-agent team by embedding the backdoor only\nin a single agent. Firstly, we introduce adversary spatiotemporal behavior\npatterns as the backdoor trigger rather than manual-injected fixed visual\npatterns or instant status and control the period to perform malicious actions.\nThis method can guarantee the stealthiness and practicality of BLAST. Secondly,\nwe hack the original reward function of the backdoor agent via unilateral\nguidance to inject BLAST, so as to achieve the \\textit{leverage attack effect}\nthat can pry open the entire multi-agent system via a single backdoor agent. We\nevaluate our BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO)\nin 2 popular c-MADRL environments (SMAC and Pursuit), and 2 existing defense\nmechanisms. 
The experimental results demonstrate that BLAST can achieve a high\nattack success rate while maintaining a low clean performance variance rate.\n","arxiv_id":"http://arxiv.org/abs/2501.01593v1","authors":["Yinbo Yu","Saihao Yan","Xueyu Yin","Jing Fang","Jiajia Liu"]},{"title":"PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of\n Psychiatric Assessment Conversational Agents","abstract":" Recent advances in large language models (LLMs) have accelerated the\ndevelopment of conversational agents capable of generating human-like\nresponses. Since psychiatric assessments typically involve complex\nconversational interactions between psychiatrists and patients, there is\ngrowing interest in developing LLM-based psychiatric assessment conversational\nagents (PACAs) that aim to simulate the role of psychiatrists in clinical\nevaluations. However, standardized methods for benchmarking the clinical\nappropriateness of PACAs' interaction with patients still remain underexplored.\nHere, we propose PSYCHE, a novel framework designed to enable the 1) clinically\nrelevant, 2) ethically safe, 3) cost-efficient, and 4) quantitative evaluation\nof PACAs. This is achieved by simulating psychiatric patients based on a\nmulti-faceted psychiatric construct that defines the simulated patients'\nprofiles, histories, and behaviors, which PACAs are expected to assess. 
We\nvalidate the effectiveness of PSYCHE through a study with 10 board-certified\npsychiatrists, supported by an in-depth analysis of the simulated patient\nutterances.\n","arxiv_id":"http://arxiv.org/abs/2501.01594v1","authors":["Jingoo Lee","Kyungho Lim","Young-Chul Jung","Byung-Hoon Kim"]},{"title":"A non-ergodic framework for understanding emergent capabilities in Large\n Language Models","abstract":" Large language models have emergent capabilities that come unexpectedly at\nscale, but we need a theoretical framework to explain why and how they emerge.\nWe prove that language models are actually non-ergodic systems while providing\na mathematical framework based on Stuart Kauffman's theory of the adjacent\npossible (TAP) to explain capability emergence. Our resource-constrained TAP\nequation demonstrates how architectural, training, and contextual constraints\ninteract to shape model capabilities through phase transitions in semantic\nspace. We prove through experiments with three different language models that\ncapacities emerge through discrete transitions guided by constraint\ninteractions and path-dependent exploration. This framework provides a\ntheoretical basis for understanding emergence in language models and guides the\ndevelopment of architectures that can guide capability emergence.\n","arxiv_id":"http://arxiv.org/abs/2501.01638v1","authors":["Javier Marin"]},{"title":"AgentRefine: Enhancing Agent Generalization through Refinement Tuning","abstract":" Large Language Model (LLM) based agents have proved their ability to perform\ncomplex tasks like humans. However, there is still a large gap between\nopen-sourced LLMs and commercial models like the GPT series. In this paper, we\nfocus on improving the agent generalization capabilities of LLMs via\ninstruction tuning. We first observe that the existing agent training corpus\nexhibits satisfactory results on held-in evaluation sets but fails to\ngeneralize to held-out sets. 
These agent-tuning works face severe formatting\nerrors and are frequently stuck in the same mistake for a long while. Our\nanalysis shows that the poor generalization ability comes from overfitting to\nseveral manual agent environments and a lack of adaptation to new situations.\nThey struggle with wrong action steps and cannot learn from experience but\nmerely memorize existing observation-action relations. Inspired by this\ninsight, we propose a novel AgentRefine framework for agent-tuning. The core\nidea is to enable the model to learn to correct its mistakes via observation\nin the trajectory. Specifically, we propose an agent synthesis framework to\nencompass a diverse array of environments and tasks and prompt a strong LLM to\nrefine its erroneous actions according to the environment feedback.\nAgentRefine significantly outperforms state-of-the-art agent-tuning work in\nterms of generalization ability on diverse agent tasks. It also shows better\nrobustness when facing perturbations and can generate diversified thoughts in\ninference. Our findings establish the correlation between agent generalization\nand self-refinement and provide a new paradigm for future research.\n","arxiv_id":"http://arxiv.org/abs/2501.01702v1","authors":["Dayuan Fu","Keqing He","Yejie Wang","Wentao Hong","Zhuoma Gongque","Weihao Zeng","Wei Wang","Jingang Wang","Xunliang Cai","Weiran Xu"]},{"title":"Combined Hyper-Extensible Extremely-Secured Zero-Trust CIAM-PAM\n architecture","abstract":" Customer Identity and Access Management (CIAM) systems play a pivotal role in\nsecuring enterprise infrastructures. However, the complexity of implementing\nthese systems requires careful architectural planning to ensure positive Return\non Investment (RoI) and avoid costly delays. 
The proliferation of Active\nPersistent cyber threats, coupled with advancements in AI, cloud computing, and\ngeographically distributed customer populations, necessitates a paradigm shift\ntowards adaptive and zero-trust security frameworks. This paper introduces the\nCombined Hyper-Extensible Extremely-Secured Zero-Trust (CHEZ) CIAM-PAM\narchitecture, designed specifically for large-scale enterprises. The CHEZ PL\nCIAM-PAM framework addresses critical security gaps by integrating federated\nidentity management (private and public identities), password-less\nauthentication, adaptive multi-factor authentication (MFA), microservice-based\nPEP (Policy Entitlement Point), multi-layer RBAC (Role Based Access Control)\nand multi-level trust systems. This future-proof design also includes\nend-to-end data encryption, and seamless integration with state-of-the-art\nAI-based threat detection systems, while ensuring compliance with stringent\nregulatory standards.\n","arxiv_id":"http://arxiv.org/abs/2501.01732v1","authors":["Shivom Aggarwal","Shourya Mehra","Safeer Sathar"]},{"title":"How Toxic Can You Get? Search-based Toxicity Testing for Large Language\n Models","abstract":" Language is a deep-rooted means of perpetration of stereotypes and\ndiscrimination. Large Language Models (LLMs), now a pervasive technology in our\neveryday lives, can cause extensive harm when prone to generating toxic\nresponses. The standard way to address this issue is to align the LLM, which,\nhowever, dampens the issue without constituting a definitive solution.\nTherefore, testing LLM even after alignment efforts remains crucial for\ndetecting any residual deviations with respect to ethical standards. We present\nEvoTox, an automated testing framework for LLMs' inclination to toxicity,\nproviding a way to quantitatively assess how much LLMs can be pushed towards\ntoxic responses even in the presence of alignment. 
The framework adopts an\niterative evolution strategy that exploits the interplay between two LLMs, the\nSystem Under Test (SUT) and the Prompt Generator steering SUT responses toward\nhigher toxicity. The toxicity level is assessed by an automated oracle based on\nan existing toxicity classifier. We conduct a quantitative and qualitative\nempirical evaluation using four state-of-the-art LLMs as evaluation subjects\nhaving increasing complexity (7-13 billion parameters). Our quantitative\nevaluation assesses the cost-effectiveness of four alternative versions of\nEvoTox against existing baseline methods, based on random search, curated\ndatasets of toxic prompts, and adversarial attacks. Our qualitative assessment\nengages human evaluators to rate the fluency of the generated prompts and the\nperceived toxicity of the responses collected during the testing sessions.\nResults indicate that the effectiveness, in terms of detected toxicity level,\nis significantly higher than the selected baseline methods (effect size up to\n1.0 against random search and up to 0.99 against adversarial attacks).\nFurthermore, EvoTox yields a limited cost overhead (from 22% to 35% on\naverage).\n","arxiv_id":"http://arxiv.org/abs/2501.01741v1","authors":["Simone Corbo","Luca Bancale","Valeria De Gennaro","Livia Lestingi","Vincenzo Scotti","Matteo Camilli"]},{"title":"Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic\n Data Generation and Fairness Algorithms","abstract":" The increasing use of machine learning in learning analytics (LA) has raised\nsignificant concerns around algorithmic fairness and privacy. Synthetic data\nhas emerged as a dual-purpose tool, enhancing privacy and improving fairness in\nLA models. However, prior research suggests an inverse relationship between\nfairness and privacy, making it challenging to optimize both. 
This study\ninvestigates which synthetic data generators can best balance privacy and\nfairness, and whether pre-processing fairness algorithms, typically applied to\nreal datasets, are effective on synthetic data. Our results highlight that the\nDEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between\nprivacy and fairness. However, DECAF suffers in utility, as reflected in its\npredictive accuracy. Notably, we found that applying pre-processing fairness\nalgorithms to synthetic data improves fairness even more than when applied to\nreal data. These findings suggest that combining synthetic data generation with\nfairness pre-processing offers a promising approach to creating fairer LA\nmodels.\n","arxiv_id":"http://arxiv.org/abs/2501.01785v1","authors":["Qinyi Liu","Oscar Deho","Farhad Vadiee","Mohammad Khalil","Srecko Joksimovic","George Siemens"]},{"title":"BERT4MIMO: A Foundation Model using BERT Architecture for Massive MIMO\n Channel State Information Prediction","abstract":" Massive MIMO (Multiple-Input Multiple-Output) is an advanced wireless\ncommunication technology, using a large number of antennas to improve the\noverall performance of the communication system in terms of capacity, spectral,\nand energy efficiency. The performance of MIMO systems is highly dependent on\nthe quality of channel state information (CSI). Predicting CSI is, therefore,\nessential for improving communication system performance, particularly in MIMO\nsystems, since it represents key characteristics of a wireless channel,\nincluding propagation, fading, scattering, and path loss. This study proposes a\nfoundation model inspired by BERT, called BERT4MIMO, which is specifically\ndesigned to process high-dimensional CSI data from massive MIMO systems.\nBERT4MIMO offers superior performance in reconstructing CSI under varying\nmobility scenarios and channel conditions through deep learning and attention\nmechanisms. 
The experimental results demonstrate the effectiveness of BERT4MIMO\nin a variety of wireless environments.\n","arxiv_id":"http://arxiv.org/abs/2501.01802v1","authors":["Ferhat Ozgur Catak","Murat Kuzlu","Umit Cali"]},{"title":"Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large\n Language Models","abstract":" Automated red-teaming has become a crucial approach for uncovering\nvulnerabilities in large language models (LLMs). However, most existing methods\nfocus on isolated safety flaws, limiting their ability to adapt to dynamic\ndefenses and uncover complex vulnerabilities efficiently. To address this\nchallenge, we propose Auto-RT, a reinforcement learning framework that\nautomatically explores and optimizes complex attack strategies to effectively\nuncover security vulnerabilities through malicious queries. Specifically, we\nintroduce two key mechanisms to reduce exploration complexity and improve\nstrategy optimization: 1) Early-terminated Exploration, which accelerates\nexploration by focusing on high-potential attack strategies; and 2) Progressive\nReward Tracking algorithm with intermediate downgrade models, which dynamically\nrefine the search trajectory toward successful vulnerability exploitation.\nExtensive experiments across diverse LLMs demonstrate that, by significantly\nimproving exploration efficiency and automatically optimizing attack\nstrategies, Auto-RT detects a broader range of vulnerabilities, achieving a\nfaster detection speed and 16.63\\% higher success rates compared to existing\nmethods.\n","arxiv_id":"http://arxiv.org/abs/2501.01830v1","authors":["Yanjiang Liu","Shuhen Zhou","Yaojie Lu","Huijia Zhu","Weiqiang Wang","Hongyu Lin","Ben He","Xianpei Han","Le Sun"]},{"title":"LCFed: An Efficient Clustered Federated Learning Framework for\n Heterogeneous Data","abstract":" Clustered federated learning (CFL) addresses the performance challenges posed\nby data heterogeneity in federated learning (FL) by organizing edge 
devices\nwith similar data distributions into clusters, enabling collaborative model\ntraining tailored to each group. However, existing CFL approaches strictly\nlimit knowledge sharing to within clusters, lacking the integration of global\nknowledge with intra-cluster training, which leads to suboptimal performance.\nMoreover, traditional clustering methods incur significant computational\noverhead, especially as the number of edge devices increases. In this paper, we\npropose LCFed, an efficient CFL framework to combat these challenges. By\nleveraging model partitioning and adopting distinct aggregation strategies for\neach sub-model, LCFed effectively incorporates global knowledge into\nintra-cluster co-training, achieving optimal training performance.\nAdditionally, LCFed customizes a computationally efficient model similarity\nmeasurement method based on low-rank models, enabling real-time cluster updates\nwith minimal computational overhead. Extensive experiments show that LCFed\noutperforms state-of-the-art benchmarks in both test accuracy and clustering\ncomputational efficiency.\n","arxiv_id":"http://arxiv.org/abs/2501.01850v1","authors":["Yuxin Zhang","Haoyu Chen","Zheng Lin","Zhe Chen","Jin Zhao"]},{"title":"Evaluating Scenario-based Decision-making for Interactive Autonomous\n Driving Using Rational Criteria: A Survey","abstract":" Autonomous vehicles (AVs) can significantly promote the advances in road\ntransport mobility in terms of safety, reliability, and decarbonization.\nHowever, ensuring safety and efficiency in interactions within dynamic\nand diverse environments is still a primary barrier to large-scale AV adoption.\nIn recent years, deep reinforcement learning (DRL) has emerged as an advanced\nAI-based approach, enabling AVs to learn decision-making strategies adaptively\nfrom data and interactions. 
DRL strategies are better suited than traditional\nrule-based methods for handling complex, dynamic, and unpredictable driving\nenvironments due to their adaptivity. However, varying driving scenarios\npresent distinct challenges, such as avoiding obstacles on highways and\nreaching specific exits at intersections, requiring different scenario-specific\ndecision-making algorithms. Many DRL algorithms have been proposed for\ninteractive decision-making. However, a rational review of these DRL\nalgorithms across various scenarios is lacking. Therefore, a comprehensive\nevaluation is essential to assess these algorithms from multiple perspectives,\nincluding those of vehicle users and vehicle manufacturers. This survey reviews\nthe application of DRL algorithms in autonomous driving across typical\nscenarios, summarizing road features and recent advancements. The scenarios\ninclude highways, on-ramp merging, roundabouts, and unsignalized intersections.\nFurthermore, DRL-based algorithms are evaluated based on five rational\ncriteria: driving safety, driving efficiency, training efficiency,\nunselfishness, and interpretability (DDTUI). Each criterion of DDTUI is\nspecifically analyzed in relation to the reviewed algorithms. Finally, the\nchallenges for future DRL-based decision-making algorithms are summarized.\n","arxiv_id":"http://arxiv.org/abs/2501.01886v1","authors":["Zhen Tian","Zhihao Lin","Dezong Zhao","Wenjing Zhao","David Flynn","Shuja Ansari","Chongfeng Wei"]},{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer\n Architecture","abstract":" We introduce QuArch, a dataset of 1500 human-validated question-answer pairs\ndesigned to evaluate and enhance language models' understanding of computer\narchitecture. The dataset covers areas including processor design, memory\nsystems, and performance optimization. 
Our analysis highlights a significant\nperformance gap: the best closed-source model achieves 84% accuracy, while the\ntop small open-source model reaches 72%. We observe notable struggles in memory\nsystems, interconnection networks, and benchmarking. Fine-tuning with QuArch\nimproves small model accuracy by up to 8%, establishing a foundation for\nadvancing AI-driven computer architecture research. The dataset and leaderboard\nare at https://harvard-edge.github.io/QuArch/.\n","arxiv_id":"http://arxiv.org/abs/2501.01892v2","authors":["Shvetank Prakash","Andrew Cheng","Jason Yik","Arya Tschand","Radhika Ghosal","Ikechukwu Uchendu","Jessica Quaye","Jeffrey Ma","Shreyas Grampurohit","Sofia Giannuzzi","Arnav Balyan","Fin Amin","Aadya Pipersenia","Yash Choudhary","Ankita Nayak","Amir Yazdanbakhsh","Vijay Janapa Reddi"]},{"title":"Mingling with the Good to Backdoor Federated Learning","abstract":" Federated learning (FL) is a decentralized machine learning technique that\nallows multiple entities to jointly train a model while preserving dataset\nprivacy. However, its distributed nature has raised various security concerns,\nwhich have been addressed by increasingly sophisticated defenses. These\nprotections utilize a range of data sources and metrics to, for example, filter\nout malicious model updates, ensuring that the impact of attacks is minimized\nor eliminated.\n This paper explores the feasibility of designing a generic attack method\ncapable of installing backdoors in FL while evading a diverse array of\ndefenses. Specifically, we focus on an attacker strategy called MIGO, which\naims to produce model updates that subtly blend with legitimate ones. 
The\nresulting effect is a gradual integration of a backdoor into the global model,\noften ensuring its persistence long after the attack concludes, while\ngenerating enough ambiguity to hinder the effectiveness of defenses.\n MIGO was employed to implant three types of backdoors across five datasets\nand different model architectures. The results demonstrate the significant\nthreat posed by these backdoors, as MIGO consistently achieved exceptionally\nhigh backdoor accuracy (exceeding 90%) while maintaining the utility of the\nmain task. Moreover, MIGO exhibited strong evasion capabilities against ten\ndefenses, including several state-of-the-art methods. When compared to four\nother attack strategies, MIGO consistently outperformed them across most\nconfigurations. Notably, even in extreme scenarios where the attacker controls\njust 0.1% of the clients, the results indicate that successful backdoor\ninsertion is possible if the attacker can persist for a sufficient number of\nrounds.\n","arxiv_id":"http://arxiv.org/abs/2501.01913v1","authors":["Nuno Neves"]},{"title":"On the Utility of Equivariance and Symmetry Breaking in Deep Learning\n Architectures on Point Clouds","abstract":" This paper explores the key factors that influence the performance of models\nworking with point clouds, across different tasks of varying geometric\ncomplexity. In this work, we explore the trade-offs between flexibility and\nweight-sharing introduced by equivariant layers, assessing when equivariance\nboosts or detracts from performance. It is often argued that providing more\ninformation as input improves a model's performance. However, if this\nadditional information breaks certain properties, such as $\\SE(3)$\nequivariance, does it remain beneficial? 
We identify the key aspects of\nequivariant and non-equivariant architectures that drive success in different\ntasks by benchmarking them on segmentation, regression, and generation tasks\nacross multiple datasets with increasing complexity. We observe a positive\nimpact of equivariance, which becomes more pronounced with increasing task\ncomplexity, even when strict equivariance is not required.\n","arxiv_id":"http://arxiv.org/abs/2501.01999v1","authors":["Sharvaree Vadgama","Mohammad Mohaiminul Islam","Domas Buracus","Christian Shewmake","Erik Bekkers"]},{"title":"Multi-Center Study on Deep Learning-Assisted Detection and\n Classification of Fetal Central Nervous System Anomalies Using Ultrasound\n Imaging","abstract":" Prenatal ultrasound evaluates fetal growth and detects congenital\nabnormalities during pregnancy, but the examination of ultrasound images by\nradiologists requires expertise and sophisticated equipment, which would\notherwise fail to improve the rate of identifying specific types of fetal\ncentral nervous system (CNS) abnormalities and result in unnecessary patient\nexaminations. We construct a deep learning model to improve the overall\naccuracy of the diagnosis of fetal cranial anomalies to aid prenatal diagnosis.\nIn our collected multi-center dataset of fetal craniocerebral anomalies\ncovering four typical anomalies of the fetal central nervous system (CNS):\nanencephaly, encephalocele (including meningocele), holoprosencephaly, and\nrachischisis, patient-level prediction accuracy reaches 94.5%, with an AUROC\nvalue of 99.3%. In the subgroup analyses, our model is applicable to the entire\ngestational period, with good identification of fetal anomaly types for any\ngestational period. 
Heatmaps superimposed on the ultrasound images not only\nprovide a visual interpretation of the algorithm but also offer an intuitive\nvisual aid to the physician by highlighting key areas that need to be reviewed,\nhelping the physician to quickly identify and validate key areas. Finally, the\nretrospective reader study demonstrates that by combining the automatic\nprediction of the DL system with the professional judgment of the radiologist,\nthe diagnostic accuracy and efficiency can be effectively improved and the\nmisdiagnosis rate can be reduced, which holds important prospects for clinical\napplication.\n","arxiv_id":"http://arxiv.org/abs/2501.02000v1","authors":["Yang Qi","Jiaxin Cai","Jing Lu","Runqing Xiong","Rongshang Chen","Liping Zheng","Duo Ma"]},{"title":"General Information Metrics for Improving AI Model Training Efficiency","abstract":" To address the growing size of AI model training data and the lack of a\nuniversal data selection methodology -- factors that significantly drive up\ntraining costs -- this paper presents the General Information Metrics\nEvaluation (GIME) method. GIME leverages general information metrics from\nObjective Information Theory (OIT), including volume, delay, scope,\ngranularity, variety, duration, sampling rate, aggregation, coverage,\ndistortion, and mismatch to optimize dataset selection for training purposes.\nComprehensive experiments conducted across diverse domains, such as CTR\nPrediction, Civil Case Prediction, and Weather Forecasting, demonstrate that\nGIME effectively preserves model performance while substantially reducing both\ntraining time and costs. 
Additionally, applying GIME within the Judicial AI\nProgram led to a remarkable 39.56% reduction in total model training expenses,\nunderscoring its potential to support efficient and sustainable AI development.\n","arxiv_id":"http://arxiv.org/abs/2501.02004v1","authors":["Jianfeng Xu","Congcong Liu","Xiaoying Tan","Xiaojie Zhu","Anpeng Wu","Huan Wan","Weijun Kong","Chun Li","Hu Xu","Kun Kuang","Fei Wu"]},{"title":"ST-HCSS: Deep Spatio-Temporal Hypergraph Convolutional Neural Network\n for Soft Sensing","abstract":" Higher-order sensor networks are more accurate in characterizing the\nnonlinear dynamics of sensory time-series data in modern industrial settings by\nallowing multi-node connections beyond simple pairwise graph edges. In light of\nthis, we propose a deep spatio-temporal hypergraph convolutional neural network\nfor soft sensing (ST-HCSS). In particular, our proposed framework is able to\nconstruct and leverage a higher-order graph (hypergraph) to model the complex\nmulti-interactions between sensor nodes in the absence of prior structural\nknowledge. To capture rich spatio-temporal relationships underlying sensor\ndata, our proposed ST-HCSS incorporates stacked gated temporal and hypergraph\nconvolution layers to effectively aggregate and update hypergraph information\nacross time and nodes. Our results validate the superiority of ST-HCSS compared\nto existing state-of-the-art soft sensors, and demonstrate that the learned\nhypergraph feature representations align well with the sensor data\ncorrelations. 
The code is available at https://github.com/htew0001/ST-HCSS.git\n","arxiv_id":"http://arxiv.org/abs/2501.02016v1","authors":["Hwa Hui Tew","Fan Ding","Gaoxuan Li","Junn Yong Loo","Chee-Ming Ting","Ze Yang Ding","Chee Pin Tan"]},{"title":"Safeguarding Large Language Models in Real-time with Tunable\n Safety-Performance Trade-offs","abstract":" Large Language Models (LLMs) have been shown to be susceptible to jailbreak\nattacks, or adversarial attacks used to elicit high-risk behavior from a\nmodel. Jailbreaks have been exploited by cybercriminals and blackhat actors to\ncause significant harm, highlighting the critical need to safeguard\nwidely-deployed models. Safeguarding approaches, which include fine-tuning\nmodels or having LLMs \"self-reflect\", may lengthen the inference time of a\nmodel, incur a computational penalty, reduce the semantic fluency of an output,\nand restrict ``normal'' model behavior. Importantly, these Safety-Performance\nTrade-offs (SPTs) remain an understudied area. In this work, we introduce a\nnovel safeguard, called SafeNudge, that combines Controlled Text Generation\nwith \"nudging\", or using text interventions to change the behavior of a model.\nSafeNudge triggers during text-generation while a jailbreak attack is being\nexecuted, and can reduce successful jailbreak attempts by 30% by guiding the\nLLM towards safe responses. It adds minimal latency to inference and has a\nnegligible impact on the semantic fluency of outputs. Further, we allow for\ntunable SPTs. 
SafeNudge is open-source and available through https://pypi.org/,\nand is compatible with models loaded with the Hugging Face \"transformers\"\nlibrary.\n","arxiv_id":"http://arxiv.org/abs/2501.02018v1","authors":["Joao Fonseca","Andrew Bell","Julia Stoyanovich"]},{"title":"Benchmarking Constraint-Based Bayesian Structure Learning Algorithms:\n Role of Network Topology","abstract":" Modeling the associations between real-world entities from their multivariate\ncross-sectional profiles can provide cues into the concerted working of these\nentities as a system. Several techniques have been proposed for deciphering\nthese associations, including constraint-based Bayesian structure learning (BSL)\nalgorithms that model them as directed acyclic graphs. Benchmarking these\nalgorithms has typically focused on assessing the variation in performance\nmeasures such as sensitivity as a function of the dimensionality represented by\nthe number of nodes in the DAG, and sample size. The present study elucidates\nthe importance of network topology in benchmarking exercises. More\nspecifically, it investigates variations in sensitivity across distinct network\ntopologies while constraining the nodes, edges, and sample-size to be\nidentical, eliminating these as potential confounders. Sensitivity of three\npopular constraint-based BSL algorithms (Peter-Clark, Grow-Shrink, Incremental\nAssociation Markov Blanket) in learning the network structure from multivariate\ncross-sectional profiles sampled from network models with sub-linear, linear,\nand super-linear DAG topologies generated using preferential attachment is\ninvestigated. Results across linear and nonlinear models revealed a statistically\nsignificant $(\\alpha=0.05)$ decrease in sensitivity estimates from sub-linear\nto super-linear topology constitutively across the three algorithms. These\nresults are demonstrated on networks with nodes $(N_{nodes}=48,64)$, noise\nstrengths $(\\sigma =3,6)$ and sample size $(N = 2^{10})$. 
The findings\nelucidate the importance of accommodating the network topology in\nconstraint-based BSL benchmarking exercises.\n","arxiv_id":"http://arxiv.org/abs/2501.02019v1","authors":["Radha Nagarajan","Marco Scutari"]},{"title":"Model Checking in Medical Imaging for Tumor Detection and Segmentation","abstract":" Recent advancements in model checking have demonstrated significant potential\nacross diverse applications, particularly in signal and image analysis. Medical\nimaging stands out as a critical domain where model checking can be effectively\napplied to design and evaluate robust frameworks. These frameworks facilitate\nautomatic and semi-automatic delineation of regions of interest within images,\naiding in accurate segmentation. This paper provides a comprehensive analysis\nof recent works leveraging spatial logic to develop operators and tools for\nidentifying regions of interest, including tumorous and non-tumorous areas.\nAdditionally, we examine the challenges inherent to spatial model-checking\ntechniques, such as variability in ground truth data and the need for\nstreamlined procedures suitable for routine clinical practice.\n","arxiv_id":"http://arxiv.org/abs/2501.02024v2","authors":["Elhoucine Elfatimi","Lahcen El fatimi"]},{"title":"Recursive Decomposition of Logical Thoughts: Framework for Superior\n Reasoning and Knowledge Propagation in Large Language Models","abstract":" Enhancing the reasoning capabilities of Large Language Models remains a\ncritical challenge in artificial intelligence. We introduce RDoLT, Recursive\nDecomposition of Logical Thought prompting, a novel framework that\nsignificantly boosts LLM reasoning performance. 
RDoLT is built on three key\ninnovations: (1) recursively breaking down complex reasoning tasks into\nsub-tasks of progressive complexity; (2) employing an advanced selection and\nscoring mechanism to identify the most promising reasoning thoughts; and (3)\nintegrating a knowledge propagation module that mimics human learning by\nkeeping track of strong and weak thoughts for information propagation. Our\napproach was evaluated across multiple benchmarks, including GSM8K, SVAMP,\nMultiArith, LastLetterConcatenation, and Gaokao2023 Math. The results\ndemonstrate that RDoLT consistently outperforms existing state-of-the-art\ntechniques, achieving a 90.98 percent accuracy on GSM8K with ChatGPT-4,\nsurpassing state-of-the-art techniques by 6.28 percent. Similar improvements\nwere observed on other benchmarks, with accuracy gains ranging from 5.5 percent\nto 6.75 percent. These findings highlight RDoLT's potential to advance prompt\nengineering, offering a more effective and generalizable approach to complex\nreasoning tasks.\n","arxiv_id":"http://arxiv.org/abs/2501.02026v1","authors":["Kaleem Ullah Qasim","Jiashu Zhang","Tariq Alsahfi","Ateeq Ur Rehman Butt"]},{"title":"Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large\n Vision-Language Models","abstract":" With the integration of an additional modality, large vision-language models\n(LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking)\ncompared to their language-only predecessors. Although recent studies have\ndevoted considerable effort to the post-hoc alignment of LVLMs, the inner\nsafety mechanisms remain largely unexplored. In this paper, we discover that\ninternal activations of LVLMs during the first token generation can effectively\nidentify malicious prompts across different attacks. 
This inherent safety\nperception is governed by sparse attention heads, which we term ``safety\nheads.\" Further analysis reveals that these heads act as specialized shields\nagainst malicious prompts; ablating them leads to higher attack success rates,\nwhile the model's utility remains unaffected. By locating these safety heads\nand concatenating their activations, we construct a straightforward but\npowerful malicious prompt detector that integrates seamlessly into the\ngeneration process with minimal extra inference overhead. Despite its simple\nstructure of a logistic regression model, the detector surprisingly exhibits\nstrong zero-shot generalization capabilities. Experiments across various\nprompt-based attacks confirm the effectiveness of leveraging safety heads to\nprotect LVLMs. Code is available at \\url{https://github.com/Ziwei-Zheng/SAHs}.\n","arxiv_id":"http://arxiv.org/abs/2501.02029v1","authors":["Ziwei Zheng","Junyao Zhao","Le Yang","Lijun He","Fan Li"]},{"title":"Detecting Music Performance Errors with Transformers","abstract":" Beginner musicians often struggle to identify specific errors in their\nperformances, such as playing incorrect notes or rhythms. There are two\nlimitations in existing tools for music error detection: (1) Existing\napproaches rely on automatic alignment; therefore, they are prone to errors\ncaused by small deviations between alignment targets; (2) There is a lack of\nsufficient data to train music error detection models, resulting in\nover-reliance on heuristics. To address (1), we propose a novel transformer\nmodel, Polytune, that takes audio inputs and outputs annotated music scores.\nThis model can be trained end-to-end to implicitly align and compare\nperformance audio with music scores through latent space representations. To\naddress (2), we present a novel data generation technique capable of creating\nlarge-scale synthetic music error datasets. 
Our approach achieves a 64.1%\naverage Error Detection F1 score, improving upon prior work by 40 percentage\npoints across 14 instruments. Additionally, compared with existing\ntranscription methods repurposed for music error detection, our model can\nhandle multiple instruments. Our source code and datasets are available at\nhttps://github.com/ben2002chou/Polytune.\n","arxiv_id":"http://arxiv.org/abs/2501.02030v1","authors":["Benjamin Shiue-Hal Chou","Purvish Jajal","Nicholas John Eliopoulos","Tim Nadolsky","Cheng-Yun Yang","Nikita Ravi","James C. Davis","Kristen Yeon-Ji Yun","Yung-Hsiang Lu"]},{"title":"Dynamic Feature Fusion: Combining Global Graph Structures and Local\n Semantics for Blockchain Fraud Detection","abstract":" The advent of blockchain technology has facilitated the widespread adoption\nof smart contracts in the financial sector. However, current fraud detection\nmethodologies exhibit limitations in capturing both global structural patterns\nwithin transaction networks and local semantic relationships embedded in\ntransaction data. Most existing models focus on either structural information\nor semantic features individually, leading to suboptimal performance in\ndetecting complex fraud patterns. In this paper, we propose a dynamic feature\nfusion model that combines graph-based representation learning and semantic\nfeature extraction for blockchain fraud detection. Specifically, we construct\nglobal graph representations to model account relationships and extract local\ncontextual features from transaction data. A dynamic multimodal fusion\nmechanism is introduced to adaptively integrate these features, enabling the\nmodel to capture both structural and semantic fraud patterns effectively. 
We\nfurther develop a comprehensive data processing pipeline, including graph\nconstruction, temporal feature enhancement, and text preprocessing.\nExperimental results on large-scale real-world blockchain datasets demonstrate\nthat our method outperforms existing benchmarks across accuracy, F1 score, and\nrecall metrics. This work highlights the importance of integrating structural\nrelationships and semantic similarities for robust fraud detection and offers a\nscalable solution for securing blockchain systems.\n","arxiv_id":"http://arxiv.org/abs/2501.02032v1","authors":["Zhang Sheng","Liangliang Song","Yanbin Wang"]},{"title":"Deep Clustering via Community Detection","abstract":" Deep clustering is an essential task in modern artificial intelligence,\naiming to partition a set of data samples into a given number of homogeneous\ngroups (i.e., clusters). Even though many Deep Neural Network (DNN) backbones\nand clustering strategies have been proposed for the task, achieving\nincreasingly improved performance, deep clustering remains very challenging due\nto the lack of accurately labeled samples. In this paper, we propose a novel\napproach of deep clustering via community detection. It initializes clustering\nby detecting many communities, and then gradually expands clusters by community\nmerging. Compared with the existing clustering strategies, community detection\nfactors in the new perspective of cluster network analysis. As a result, it has\nthe inherent benefit of high pseudo-label purity, which is critical to the\nperformance of self-supervision. We have validated the efficacy of the proposed\napproach on benchmark image datasets. Our extensive experiments have shown that\nit can effectively improve the SOTA performance. 
Our ablation study also\ndemonstrates that the new network perspective can effectively improve community\npseudo-label purity, resulting in improved clustering performance.\n","arxiv_id":"http://arxiv.org/abs/2501.02036v1","authors":["Tianyu Cheng","Qun Chen"]},{"title":"METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring","abstract":" We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer\nmodel, which we refer to as a metagenomic foundation model, on a novel corpus\nof diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base\npairs. This dataset is sourced from a large collection of human wastewater\nsamples, processed and sequenced using deep metagenomic (next-generation)\nsequencing methods. Unlike genomic models that focus on individual genomes or\ncurated sets of specific species, the aim of METAGENE-1 is to capture the full\ndistribution of genomic information present within this wastewater, to aid in\ntasks relevant to pandemic monitoring and pathogen detection. We carry out\nbyte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic\nsequences, and then pretrain our model. In this paper, we first detail the\npretraining dataset, tokenization strategy, and model architecture,\nhighlighting the considerations and design choices that enable the effective\nmodeling of metagenomic data. We then show results of pretraining this model on\nour metagenomic dataset, providing details about our losses, system metrics,\nand training stability over the course of pretraining. 
Finally, we demonstrate\nthe performance of METAGENE-1, which achieves state-of-the-art results on a set\nof genomic benchmarks and new evaluations focused on human-pathogen detection\nand genomic sequence embedding, showcasing its potential for public health\napplications in pandemic monitoring, biosurveillance, and early detection of\nemerging health threats.\n","arxiv_id":"http://arxiv.org/abs/2501.02045v1","authors":["Ollie Liu","Sami Jaghouar","Johannes Hagemann","Shangshang Wang","Jason Wiemels","Jeff Kaufman","Willie Neiswanger"]},{"title":"Relaxation-assisted reverse annealing on nonnegative/binary matrix\n factorization","abstract":" Quantum annealing has garnered significant attention as meta-heuristics\ninspired by quantum physics for combinatorial optimization problems. Among its\nmany applications, nonnegative/binary matrix factorization stands out for its\ncomplexity and relevance in unsupervised machine learning. The use of reverse\nannealing, a derivative procedure of quantum annealing to prioritize the search\nin a vicinity under a given initial state, helps improve its optimization\nperformance in matrix factorization. This study proposes an improved strategy\nthat integrates reverse annealing with a linear programming relaxation\ntechnique. Using relaxed solutions as the initial configuration for reverse\nannealing, we demonstrate improvements in optimization performance comparable\nto the exact optimization methods. Our experiments on facial image datasets\nshow that our method provides better convergence than known reverse annealing\nmethods. Furthermore, we investigate the effectiveness of relaxation-based\ninitialization methods on randomized datasets, demonstrating a relationship\nbetween the relaxed solution and the optimal solution. 
This research\nunderscores the potential of combining reverse annealing and classical\noptimization strategies to enhance optimization performance.\n","arxiv_id":"http://arxiv.org/abs/2501.02114v1","authors":["Renichiro Haba","Masayuki Ohzeki","Kazuyuki Tanaka"]},{"title":"A hybrid marketplace of ideas","abstract":" The convergence of humans and artificial intelligence systems introduces new\ndynamics into the cultural and intellectual landscape. Complementing emerging\ncultural evolution concepts such as machine culture, AI agents represent a\nsignificant techno-sociological development, particularly within the\nanthropological study of Web3 as a community focused on decentralization\nthrough blockchain. Despite their growing presence, the cultural significance\nof AI agents remains largely unexplored in academic literature. Toward this\nend, we conceived hybrid netnography, a novel interdisciplinary approach that\nexamines the cultural and intellectual dynamics within digital ecosystems by\nanalyzing the interactions and contributions of both human and AI agents as\nco-participants in shaping narratives, ideas, and cultural artifacts. We argue\nthat, within the Web3 community on the social media platform X, these agents\nchallenge traditional notions of participation and influence in public\ndiscourse, creating a hybrid marketplace of ideas, a conceptual space where\nhuman and AI generated ideas coexist and compete for attention. 
We examine the\ncurrent state of AI agents in idea generation, propagation, and engagement,\npositioning their role as cultural agents through the lens of memetics and\nencouraging further inquiry into their cultural and societal impact.\nAdditionally, we address the implications of this paradigm for privacy,\nintellectual property, and governance, highlighting the societal and legal\nchallenges of integrating AI agents into the hybrid marketplace of ideas.\n","arxiv_id":"http://arxiv.org/abs/2501.02132v2","authors":["Tomer Jordi Chaffer","Dontrail Cotlage","Justin Goldston"]},{"title":"Effective LLM-Driven Code Generation with Pythoness","abstract":" The advent of large language models (LLMs) has paved the way for a new era of\nprogramming tools with both significant capabilities and risks, as the\ngenerated code lacks guarantees of correctness and reliability. Developers\nusing LLMs currently face the difficult task of optimizing, integrating, and\nmaintaining code generated by AI. We propose an embedded domain-specific\nlanguage (DSL), Pythoness, to address those challenges. In Pythoness,\ndevelopers program with LLMs at a higher level of abstraction. Rather than\ninteracting directly with generated code, developers using Pythoness operate at\nthe level of behavioral specifications when writing functions, classes, or an\nentire program. These specifications can take the form of unit tests and\nproperty-based tests, which may be expressed formally or in natural language.\nGuided by these specifications, Pythoness generates code that both passes the\ntests and can be continuously checked during execution. We posit that the\nPythoness approach lets developers harness the full potential of LLMs for code\ngeneration while substantially mitigating their inherent risks. 
We describe our\ncurrent prototype implementation of Pythoness and demonstrate that it can\nsuccessfully leverage a combination of tests and code generation to yield\nhigher quality code than specifications alone.\n","arxiv_id":"http://arxiv.org/abs/2501.02138v1","authors":["Kyla H. Levin","Kyle Gwilt","Emery D. Berger","Stephen N. Freund"]},{"title":"Establishing baselines for generative discovery of inorganic crystals","abstract":" Generative artificial intelligence offers a promising avenue for materials\ndiscovery, yet its advantages over traditional methods remain unclear. In this\nwork, we introduce and benchmark two baseline approaches - random enumeration\nof charge-balanced prototypes and data-driven ion exchange of known compounds -\nagainst three generative models: a variational autoencoder, a large language\nmodel, and a diffusion model. Our results show that established methods such as\nion exchange perform comparably well in generating stable materials, although\nmany of these materials tend to closely resemble known compounds. In contrast,\ngenerative models excel at proposing novel structural frameworks and, when\nsufficient training data is available, can more effectively target properties\nsuch as electronic band gap and bulk modulus while maintaining a high stability\nrate. To enhance the performance of both the baseline and generative\napproaches, we implement a post-generation screening step in which all proposed\nstructures are passed through stability and property filters from pre-trained\nmachine learning models including universal interatomic potentials. This\nlow-cost filtering step leads to substantial improvement in the success rates\nof all methods, remains computationally efficient, and ultimately provides a\npractical pathway toward more effective generative strategies for materials\ndiscovery.\n","arxiv_id":"http://arxiv.org/abs/2501.02144v1","authors":["Nathan J. Szymanski","Christopher J. 
Bartel"]},{"title":"Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality\n Translation Using Conditional CycleGAN","abstract":" Cross-modality translation between MRI and PET imaging is challenging due to\nthe distinct mechanisms underlying these modalities. Blood-based biomarkers\n(BBBMs) are revolutionizing Alzheimer's disease (AD) detection by identifying\npatients and quantifying brain amyloid levels. However, the potential of BBBMs\nto enhance PET image synthesis remains unexplored. In this paper, we performed\na thorough study on the effect of incorporating BBBM into deep generative\nmodels. By evaluating three widely used cross-modality translation models, we\nfound that BBBMs integration consistently enhances the generative quality\nacross all models. By visual inspection of the generated results, we observed\nthat PET images generated by CycleGAN exhibit the best visual fidelity. Based\non these findings, we propose Plasma-CycleGAN, a novel generative model based\non CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This\nis the first approach to integrate BBBMs in conditional cross-modality\ntranslation between MRI and PET.\n","arxiv_id":"http://arxiv.org/abs/2501.02146v1","authors":["Yanxi Chen","Yi Su","Celine Dumitrascu","Kewei Chen","David Weidman","Richard J Caselli","Nicholas Ashton","Eric M Reiman","Yalin Wang"]},{"title":"The Race to Efficiency: A New Perspective on AI Scaling Laws","abstract":" As large-scale AI models expand, training becomes costlier and sustaining\nprogress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020),\nHoffmann et al. (2022)) predict training loss from a static compute budget yet\nneglect time and efficiency, prompting the question: how can we balance\nballooning GPU fleets with rapidly improving hardware and algorithms? We\nintroduce the relative-loss equation, a time- and efficiency-aware framework\nthat extends classical AI scaling laws. 
Our model shows that, without ongoing\nefficiency gains, advanced performance could demand millennia of training or\nunrealistically large GPU fleets. However, near-exponential progress remains\nachievable if the \"efficiency-doubling rate\" parallels Moore's Law. By\nformalizing this race to efficiency, we offer a quantitative roadmap for\nbalancing front-loaded GPU investments with incremental improvements across the\nAI stack. Empirical trends suggest that sustained efficiency gains can push AI\nscaling well into the coming decade, providing a new perspective on the\ndiminishing returns inherent in classical scaling.\n","arxiv_id":"http://arxiv.org/abs/2501.02156v3","authors":["Chien-Ping Lu"]},{"title":"Can ChatGPT implement finite element models for geotechnical engineering\n applications?","abstract":" This study assesses the capability of ChatGPT to generate finite element code\nfor geotechnical engineering applications from a set of prompts. We tested\nthree different initial boundary value problems using a hydro-mechanically\ncoupled formulation for unsaturated soils, including the dissipation of excess\npore water pressure through fluid mass diffusion in one-dimensional space,\ntime-dependent differential settlement of a strip footing, and gravity-driven\nseepage. For each case, initial prompting involved providing ChatGPT with\nnecessary information for finite element implementation, such as balance and\nconstitutive equations, problem geometry, initial and boundary conditions,\nmaterial properties, and spatiotemporal discretization and solution strategies.\nAny errors and unexpected results were further addressed through prompt\naugmentation processes until the ChatGPT-generated finite element code passed\nthe verification/validation test. Our results demonstrate that ChatGPT required\nminimal code revisions when using the FEniCS finite element library, owing to\nits high-level interfaces that enable efficient programming. 
In contrast, the\nMATLAB code generated by ChatGPT necessitated extensive prompt augmentations\nand/or direct human intervention, as it involves a significant amount of\nlow-level programming required for finite element analysis, such as\nconstructing shape functions or assembling global matrices. Given that prompt\nengineering for this task requires an understanding of the mathematical\nformulation and numerical techniques, this study suggests that while a large\nlanguage model may not yet replace human programmers, it can greatly assist in\nthe implementation of numerical models.\n","arxiv_id":"http://arxiv.org/abs/2501.02199v1","authors":["Taegu Kim","Tae Sup Yun","Hyoung Suk Suh"]},{"title":"Learning Evolution via Optimization Knowledge Adaptation","abstract":" Evolutionary algorithms (EAs) maintain populations through evolutionary\noperators to discover diverse solutions for complex tasks while gathering\nvaluable knowledge, such as historical population data and fitness evaluations.\nHowever, traditional EAs face challenges in dynamically adapting to expanding\nknowledge bases, hindering the efficient exploitation of accumulated\ninformation and limiting adaptability to new situations. To address these\nissues, we introduce an Optimization Knowledge Adaptation Evolutionary Model\n(OKAEM), which features dynamic parameter adjustment using accumulated\nknowledge to enhance its optimization capabilities. OKAEM employs attention\nmechanisms to model the interactions among individuals, fitness landscapes, and\ngenetic components separately, thereby parameterizing the evolutionary\noperators of selection, crossover, and mutation. These powerful learnable\noperators enable OKAEM to benefit from pre-learned extensive prior knowledge\nand self-tune with real-time evolutionary insights. 
Experimental results\ndemonstrate that OKAEM: 1) exploits prior knowledge for significant performance\ngains across various knowledge transfer settings; 2) achieves competitive\nperformance through self-tuning alone, even without prior knowledge; 3)\noutperforms state-of-the-art black-box baselines in a vision-language model\ntuning case; 4) can improve its optimization capabilities with growing\nknowledge; 5) is capable of emulating principles of natural selection and\ngenetic recombination.\n","arxiv_id":"http://arxiv.org/abs/2501.02200v1","authors":["Chao Wang","Licheng Jiao","Jiaxuan Zhao","Lingling Li","Fang Liu","Shuyuan Yang"]},{"title":"Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised\n Learning","abstract":" Federated semi-supervised learning (FSSL) is primarily challenged by two\nfactors: the scarcity of labeled data across clients and the non-independent\nand identically distribution (non-IID) nature of data among clients. In this\npaper, we propose a novel approach, diffusion model-based data synthesis aided\nFSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic\ndata, bridging the gap between heterogeneous local data distributions and the\nglobal data distribution. In DDSA-FSSL, clients address the challenge of the\nscarcity of labeled data by employing a federated learning-trained classifier\nto perform pseudo labeling for unlabeled data. The DM is then collaboratively\ntrained using both labeled and precision-optimized pseudo-labeled data,\nenabling clients to generate synthetic samples for classes that are absent in\ntheir labeled datasets. 
This process allows clients to generate more\ncomprehensive synthetic datasets aligned with the global distribution.\nExtensive experiments conducted on multiple datasets and varying non-IID\ndistributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves\naccuracy from 38.46% to 52.14% on CIFAR-10 datasets with 10% labeled data.\n","arxiv_id":"http://arxiv.org/abs/2501.02219v1","authors":["Zhongwei Wang","Tong Wu","Zhiyong Chen","Liang Qian","Yin Xu","Meixia Tao"]},{"title":"CORD: Generalizable Cooperation via Role Diversity","abstract":" Cooperative multi-agent reinforcement learning (MARL) aims to develop agents\nthat can collaborate effectively. However, most cooperative MARL methods\noverfit training agents, making learned policies not generalize well to unseen\ncollaborators, which is a critical issue for real-world deployment. Some\nmethods attempt to address the generalization problem but require prior\nknowledge or predefined policies of new teammates, limiting real-world\napplications. To this end, we propose a hierarchical MARL approach to enable\ngeneralizable cooperation via role diversity, namely CORD. CORD's high-level\ncontroller assigns roles to low-level agents by maximizing the role entropy\nwith constraints. We show this constrained objective can be decomposed into\ncausal influence in role that enables reasonable role assignment, and role\nheterogeneity that yields coherent, non-redundant role clusters. Evaluated on a\nvariety of cooperative multi-agent tasks, CORD achieves better performance than\nbaselines, especially in generalization tests. 
Ablation studies further\ndemonstrate the efficacy of the constrained objective in generalizable\ncooperation.\n","arxiv_id":"http://arxiv.org/abs/2501.02221v2","authors":["Kanefumi Matsuyama","Kefan Su","Jiangxing Wang","Deheng Ye","Zongqing Lu"]},{"title":"Towards a constructive framework for control theory","abstract":" This work presents a framework for control theory based on constructive\nanalysis to account for discrepancy between mathematical results and their\nimplementation in a computer, also referred to as computational uncertainty. In\ncontrol engineering, the latter is usually either neglected or considered\nsubmerged into some other type of uncertainty, such as system noise, and\naddressed within robust control. However, even robust control methods may be\ncompromised when the mathematical objects involved in the respective algorithms\nfail to exist in exact form and subsequently fail to satisfy the required\nproperties. For instance, in general stabilization using a control Lyapunov\nfunction, computational uncertainty may distort stability certificates or even\ndestabilize the system despite robustness of the stabilization routine with\nregards to system, actuator and measurement noise. In fact, battling numerical\nproblems in practical implementation of controllers is common among control\nengineers. Such observations indicate that computational uncertainty should\nindeed be addressed explicitly in controller synthesis and system analysis. The\nmajor contribution here is a fairly general framework for proof techniques in\nanalysis and synthesis of control systems based on constructive analysis which\nexplicitly states that every computation be doable only up to a finite\nprecision thus accounting for computational uncertainty. A series of previous\nworks is overviewed, including constructive system stability and stabilization,\napproximate optimal controls, eigenvalue problems, Caratheodory trajectories,\nmeasurable selectors. 
Additionally, a new constructive version of the Danskin's\ntheorem, which is crucial in adversarial defense, is presented.\n","arxiv_id":"http://arxiv.org/abs/2501.02267v1","authors":["Pavel Osinenko"]},{"title":"Deep Learning-Driven Segmentation of Ischemic Stroke Lesions Using\n Multi-Channel MRI","abstract":" Ischemic stroke, caused by cerebral vessel occlusion, presents substantial\nchallenges in medical imaging due to the variability and subtlety of stroke\nlesions. Magnetic Resonance Imaging (MRI) plays a crucial role in diagnosing\nand managing ischemic stroke, yet existing segmentation techniques often fail\nto accurately delineate lesions. This study introduces a novel deep\nlearning-based method for segmenting ischemic stroke lesions using\nmulti-channel MRI modalities, including Diffusion Weighted Imaging (DWI),\nApparent Diffusion Coefficient (ADC), and enhanced Diffusion Weighted Imaging\n(eDWI). The proposed architecture integrates DenseNet121 as the encoder with\nSelf-Organized Operational Neural Networks (SelfONN) in the decoder, enhanced\nby Channel and Space Compound Attention (CSCA) and Double\nSqueeze-and-Excitation (DSE) blocks. Additionally, a custom loss function\ncombining Dice Loss and Jaccard Loss with weighted averages is introduced to\nimprove model performance. Trained and evaluated on the ISLES 2022 dataset, the\nmodel achieved Dice Similarity Coefficients (DSC) of 83.88% using DWI alone,\n85.86% with DWI and ADC, and 87.49% with the integration of DWI, ADC, and eDWI.\nThis approach not only outperforms existing methods but also addresses key\nlimitations in current segmentation practices. These advancements significantly\nenhance diagnostic precision and treatment planning for ischemic stroke,\nproviding valuable support for clinical decision-making.\n","arxiv_id":"http://arxiv.org/abs/2501.02287v1","authors":["Ashiqur Rahman","Muhammad E. H. 
Chowdhury","Md Sharjis Ibne Wadud","Rusab Sarmun","Adam Mushtak","Sohaib Bassam Zoghoul","Israa Al-Hashimi"]},{"title":"DiffGraph: Heterogeneous Graph Diffusion Model","abstract":" Recent advances in Graph Neural Networks (GNNs) have revolutionized\ngraph-structured data modeling, yet traditional GNNs struggle with complex\nheterogeneous structures prevalent in real-world scenarios. Despite progress in\nhandling heterogeneous interactions, two fundamental challenges persist: noisy\ndata significantly compromising embedding quality and learning performance, and\nexisting methods' inability to capture intricate semantic transitions among\nheterogeneous relations, which impacts downstream predictions. To address these\nfundamental issues, we present the Heterogeneous Graph Diffusion Model\n(DiffGraph), a pioneering framework that introduces an innovative cross-view\ndenoising strategy. This advanced approach transforms auxiliary heterogeneous\ndata into target semantic spaces, enabling precise distillation of\ntask-relevant information. At its core, DiffGraph features a sophisticated\nlatent heterogeneous graph diffusion mechanism, implementing a novel forward\nand backward diffusion process for superior noise management. This methodology\nachieves simultaneous heterogeneous graph denoising and cross-type transition,\nwhile significantly simplifying graph generation through its latent-space\ndiffusion capabilities. 
Through rigorous experimental validation on both public\nand industrial datasets, we demonstrate that DiffGraph consistently surpasses\nexisting methods in link prediction and node classification tasks, establishing\nnew benchmarks for robustness and efficiency in heterogeneous graph processing.\nThe model implementation is publicly available at:\nhttps://github.com/HKUDS/DiffGraph.\n","arxiv_id":"http://arxiv.org/abs/2501.02313v1","authors":["Zongwei Li","Lianghao Xia","Hua Hua","Shijie Zhang","Shuangyang Wang","Chao Huang"]},{"title":"Validity Arguments For Constructed Response Scoring Using Generative\n Artificial Intelligence Applications","abstract":" The rapid advancements in large language models and generative artificial\nintelligence (AI) capabilities are making their broad application in the\nhigh-stakes testing context more likely. Use of generative AI in the scoring of\nconstructed responses is particularly appealing because it reduces the effort\nrequired for handcrafting features in traditional AI scoring and might even\noutperform those methods. The purpose of this paper is to highlight the\ndifferences in the feature-based and generative AI applications in constructed\nresponse scoring systems and propose a set of best practices for the collection\nof validity evidence to support the use and interpretation of constructed\nresponse scores from scoring systems using generative AI. We compare the\nvalidity evidence needed in scoring systems using human ratings, feature-based\nnatural language processing AI scoring engines, and generative AI. The evidence\nneeded in the generative AI context is more extensive than in the feature-based\nNLP scoring context because of the lack of transparency and other concerns\nunique to generative AI such as consistency. 
Constructed response score data\nfrom standardized tests demonstrate the collection of validity evidence for\ndifferent types of scoring systems and highlight the numerous complexities and\nconsiderations when making a validity argument for these scores. In addition,\nwe discuss how the evaluation of AI scores might include a consideration of how\na contributory scoring approach combining multiple AI scores (from different\nsources) will cover more of the construct in the absence of human ratings.\n","arxiv_id":"http://arxiv.org/abs/2501.02334v1","authors":["Jodi M. Casabianca","Daniel F. McCaffrey","Matthew S. Johnson","Naim Alper","Vladimir Zubenko"]},{"title":"GNSS/GPS Spoofing and Jamming Identification Using Machine Learning and\n Deep Learning","abstract":" The increasing reliance on Global Navigation Satellite Systems (GNSS),\nparticularly the Global Positioning System (GPS), underscores the urgent need\nto safeguard these technologies against malicious threats such as spoofing and\njamming. As the backbone for positioning, navigation, and timing (PNT) across\nvarious applications including transportation, telecommunications, and\nemergency services, GNSS is vulnerable to deliberate interference that poses\nsignificant risks. Spoofing attacks, which involve transmitting counterfeit\nGNSS signals to mislead receivers into calculating incorrect positions, can\nresult in serious consequences, from navigational errors in civilian aviation\nto security breaches in military operations. Furthermore, the lack of inherent\nsecurity measures within GNSS systems makes them attractive targets for\nadversaries. While GNSS/GPS jamming and spoofing systems consist of numerous\ncomponents, the ability to distinguish authentic signals from malicious ones is\nessential for maintaining system integrity. Recent advancements in machine\nlearning and deep learning provide promising avenues for enhancing detection\nand mitigation strategies against these threats. 
This paper addresses both\nspoofing and jamming by tackling real-world challenges through machine\nlearning, deep learning, and computer vision techniques. Through extensive\nexperiments on two real-world datasets related to spoofing and jamming\ndetection using advanced algorithms, we achieved state-of-the-art results. In\nthe GNSS/GPS jamming detection task, we attained approximately 99% accuracy,\nimproving performance by around 5% compared to previous studies. Additionally,\nwe addressed a challenging task related to spoofing detection, yielding\nresults that underscore the potential of machine learning and deep learning in\nthis domain.\n","arxiv_id":"http://arxiv.org/abs/2501.02352v1","authors":["Ali Ghanbarzade","Hossein Soleimani"]},{"title":"FaceSpeak: Expressive and High-Quality Speech Synthesis from Human\n Portraits of Different Styles","abstract":" Humans can perceive speakers' characteristics (e.g., identity, gender,\npersonality and emotion) by their appearance, which are generally aligned to\ntheir voice style. Recently, vision-driven Text-to-speech (TTS) scholars\ngrounded their investigations on real-person faces, thereby restricting\neffective speech synthesis from applying to vast potential usage scenarios with\ndiverse characters and image styles. To solve this issue, we introduce a novel\nFaceSpeak approach. It extracts salient identity characteristics and emotional\nrepresentations from a wide variety of image styles. Meanwhile, it mitigates\nthe extraneous information (e.g., background, clothing, and hair color, etc.),\nresulting in synthesized speech closely aligned with a character's persona.\nFurthermore, to overcome the scarcity of multi-modal TTS data, we have devised\nan innovative dataset, namely Expressive Multi-Modal TTS, which is diligently\ncurated and annotated to facilitate research in this domain. 
The experimental\nresults demonstrate our proposed FaceSpeak can generate portrait-aligned voice\nwith satisfactory naturalness and quality.\n","arxiv_id":"http://arxiv.org/abs/2501.03181v1","authors":["Tian-Hao Zhang","Jiawei Zhang","Jun Wang","Xinyuan Qian","Xu-Cheng Yin"]},{"title":"Classifier-Guided Captioning Across Modalities","abstract":" Most current captioning systems use language models trained on data from\nspecific settings, such as image-based captioning via Amazon Mechanical Turk,\nlimiting their ability to generalize to other modality distributions and\ncontexts. This limitation hinders performance in tasks like audio or video\ncaptioning, where different semantic cues are needed. Addressing this challenge\nis crucial for creating more adaptable and versatile captioning frameworks\napplicable across diverse real-world contexts. In this work, we introduce a\nmethod to adapt captioning networks to the semantics of alternative settings,\nsuch as capturing audibility in audio captioning, where it is crucial to\ndescribe sounds and their sources. Our framework consists of two main\ncomponents: (i) a frozen captioning system incorporating a language model (LM),\nand (ii) a text classifier that guides the captioning system. The classifier is\ntrained on a dataset automatically generated by GPT-4, using tailored prompts\nspecifically designed to enhance key aspects of the generated captions.\nImportantly, the framework operates solely during inference, eliminating the\nneed for further training of the underlying captioning model. We evaluate the\nframework on various models and modalities, with a focus on audio captioning,\nand report promising results. 
Notably, when combined with an existing zero-shot\naudio captioning system, our framework improves its quality and sets\nstate-of-the-art performance in zero-shot audio captioning.\n","arxiv_id":"http://arxiv.org/abs/2501.03183v1","authors":["Ariel Shaulov","Tal Shaharabany","Eitan Shaar","Gal Chechik","Lior Wolf"]},{"title":"Breaking Through the Spike: Spike Window Decoding for Accelerated and\n Precise Automatic Speech Recognition","abstract":" Recently, end-to-end automatic speech recognition has become the mainstream\napproach in both industry and academia. To optimize system performance in\nspecific scenarios, the Weighted Finite-State Transducer (WFST) is extensively\nused to integrate acoustic and language models, leveraging its capacity to\nimplicitly fuse language models within static graphs, thereby ensuring robust\nrecognition while also facilitating rapid error correction. However, WFST\nnecessitates a frame-by-frame search of CTC posterior probabilities through\nautoregression, which significantly hampers inference speed. In this work, we\nthoroughly investigate the spike property of CTC outputs and further propose\nthe conjecture that adjacent frames to non-blank spikes carry semantic\ninformation beneficial to the model. Building on this, we propose the Spike\nWindow Decoding algorithm, which greatly improves the inference speed by making\nthe number of frames decoded in WFST linearly related to the number of spiking\nframes in the CTC output, while guaranteeing the recognition performance. 
Our\nmethod achieves SOTA recognition accuracy while significantly accelerating\ndecoding speed, proven across both AISHELL-1 and large-scale In-House datasets,\nestablishing a pioneering approach for integrating CTC output with WFST.\n","arxiv_id":"http://arxiv.org/abs/2501.03257v1","authors":["Wei Zhang","Tian-Hao Zhang","Chao Luo","Hui Zhou","Chao Yang","Xinyuan Qian","Xu-Cheng Yin"]},{"title":"Navigation Variable-based Multi-objective Particle Swarm Optimization\n for UAV Path Planning with Kinematic Constraints","abstract":" Path planning is essential for unmanned aerial vehicles (UAVs) as it\ndetermines the path that the UAV needs to follow to complete a task. This work\naddresses this problem by introducing a new algorithm called navigation\nvariable-based multi-objective particle swarm optimization (NMOPSO). It first\nmodels path planning as an optimization problem via the definition of a set of\nobjective functions that include optimality and safety requirements for UAV\noperation. The NMOPSO is then used to minimize those functions through Pareto\noptimal solutions. The algorithm features a new path representation based on\nnavigation variables to include kinematic constraints and exploit the\nmaneuverable characteristics of the UAV. It also includes an adaptive mutation\nmechanism to enhance the diversity of the swarm for better solutions.\nComparisons with various algorithms have been carried out to benchmark the\nproposed approach. The results indicate that the NMOPSO performs better than\nnot only other particle swarm optimization variants but also other\nstate-of-the-art multi-objective and metaheuristic optimization algorithms.\nExperiments have also been conducted with real UAVs to confirm the validity of\nthe approach for practical flights. 
The source code of the algorithm is\navailable at https://github.com/ngandng/NMOPSO.\n","arxiv_id":"http://arxiv.org/abs/2501.03261v1","authors":["Thi Thuy Ngan Duong","Duy-Nam Bui","Manh Duong Phung"]},{"title":"Bridge the Inference Gaps of Neural Processes via Expectation\n Maximization","abstract":" The neural process (NP) is a family of computationally efficient models for\nlearning distributions over functions. However, it suffers from under-fitting\nand shows suboptimal performance in practice. Researchers have primarily\nfocused on incorporating diverse structural inductive biases, \\textit{e.g.}\nattention or convolution, in modeling. The topic of inference suboptimality and\nan analysis of the NP from the optimization objective perspective has hardly\nbeen studied in earlier work. To fix this issue, we propose a surrogate\nobjective of the target log-likelihood of the meta dataset within the\nexpectation maximization framework. The resulting model, referred to as the\nSelf-normalized Importance weighted Neural Process (SI-NP), can learn a more\naccurate functional prior and has an improvement guarantee concerning the\ntarget log-likelihood. Experimental results show the competitive performance of\nSI-NP over other NPs objectives and illustrate that structural inductive\nbiases, such as attention modules, can also augment our method to achieve SOTA\nperformance. Our code is available at\n\\url{https://github.com/hhq123gogogo/SI_NPs}.\n","arxiv_id":"http://arxiv.org/abs/2501.03264v1","authors":["Qi Wang","Marco Federici","Herke van Hoof"]},{"title":"Listening and Seeing Again: Generative Error Correction for Audio-Visual\n Speech Recognition","abstract":" Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech\nRecognition (AVSR) takes audio and visual signals simultaneously to infer the\ntranscription. 
Recent studies have shown that Large Language Models (LLMs) can\nbe effectively used for Generative Error Correction (GER) in ASR by predicting\nthe best transcription from ASR-generated N-best hypotheses. However, these\nLLMs lack the ability to simultaneously understand audio and visual, making the\nGER approach challenging to apply in AVSR. In this work, we propose a novel GER\nparadigm for AVSR, termed AVGER, that follows the concept of ``listening and\nseeing again''. Specifically, we first use the powerful AVSR system to read the\naudio and visual signals to get the N-Best hypotheses, and then use the\nQ-former-based Multimodal Synchronous Encoder to read the audio and visual\ninformation again and convert them into an audio and video compression\nrepresentation respectively that can be understood by LLM. Afterward, the\naudio-visual compression representation and the N-Best hypothesis together\nconstitute a Cross-modal Prompt to guide the LLM in producing the best\ntranscription. In addition, we also proposed a Multi-Level Consistency\nConstraint training criterion, including logits-level, utterance-level and\nrepresentations-level, to improve the correction accuracy while enhancing the\ninterpretability of audio and visual compression representations. The\nexperimental results on the LRS3 dataset show that our method outperforms\ncurrent mainstream AVSR systems. The proposed AVGER can reduce the Word Error\nRate (WER) by 24% compared to them. 
Code and models can be found at:\nhttps://github.com/CircleRedRain/AVGER.\n","arxiv_id":"http://arxiv.org/abs/2501.04038v1","authors":["Rui Liu","Hongyu Yuan","Haizhou Li"]},{"title":"A Survey on Large Language Models with some Insights on their\n Capabilities and Limitations","abstract":" The rapid advancement of artificial intelligence, particularly with the\ndevelopment of Large Language Models (LLMs) built on the transformer\narchitecture, has redefined the capabilities of natural language processing.\nThese models now exhibit remarkable performance across various language-related\ntasks, such as text generation, question answering, translation, and\nsummarization, often rivaling human-like comprehension. More intriguingly, LLMs\nhave demonstrated emergent abilities extending beyond their core functions,\nshowing proficiency in tasks like commonsense reasoning, code generation, and\narithmetic. This survey paper explores the foundational components, scaling\nmechanisms, and architectural strategies that drive these capabilities.\nEmphasizing models like GPT and LLaMA, we analyze the impact of exponential\ndata and computational growth on LLM performance, while also addressing the\ntrade-offs associated with scaling. We also examine LLM applications across\nsectors, such as healthcare, finance, education, and law, highlighting their\nadaptability and potential to solve domain-specific challenges. Central to this\nwork are the questions of how LLMs generalize across diverse tasks, exhibit\nplanning, and reasoning abilities, and whether these emergent abilities can be\nsystematically elicited or enhanced. In particular, we provide some insights\ninto the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within\nLLMs, focusing on how pre-training data influences their emergence.\nAdditionally, we investigate LLM-modulo frameworks that integrate external\nsystems, allowing LLMs to handle complex, dynamic tasks. 
By analyzing these\nfactors, this paper aims to foster the ongoing discussion on the capabilities\nand limits of LLMs, promoting their responsible development and application in\nnovel and increasingly complex environments.\n","arxiv_id":"http://arxiv.org/abs/2501.04040v1","authors":["Andrea Matarazzo","Riccardo Torlone"]},{"title":"FLAME: Financial Large-Language Model Assessment and Metrics Evaluation","abstract":" LLMs have revolutionized NLP and demonstrated potential across diverse\ndomains. More and more financial LLMs have been introduced for finance-specific\ntasks, yet comprehensively assessing their value is still challenging. In this\npaper, we introduce FLAME, a comprehensive financial LLM evaluation system in\nChinese, which includes two core evaluation benchmarks: FLAME-Cer and\nFLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications,\nincluding CPA, CFA, and FRM, with a total of approximately 16,000 carefully\nselected questions. All questions have been manually reviewed to ensure\naccuracy and representativeness. FLAME-Sce consists of 10 primary core\nfinancial business scenarios, 21 secondary financial business scenarios, and a\ncomprehensive evaluation set of nearly 100 tertiary financial application\ntasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0,\nQwen2.5, XuanYuan3, and the latest Baichuan4-Finance, revealing that\nBaichuan4-Finance outperforms the other LLMs in most tasks. By establishing a\ncomprehensive and professional evaluation system, FLAME facilitates the\nadvancement of financial LLMs in Chinese contexts. 
Instructions for\nparticipating in the evaluation are available on GitHub:\nhttps://github.com/FLAME-ruc/FLAME.\n","arxiv_id":"http://arxiv.org/abs/2501.06211v1","authors":["Jiayu Guo","Yu Guo","Martha Li","Songtao Tan"]},{"title":"Operator Learning for Reconstructing Flow Fields from Sparse\n Measurements: an Energy Transformer Approach","abstract":" Machine learning methods have shown great success in various scientific\nareas, including fluid mechanics. However, reconstruction problems, where full\nvelocity fields must be recovered from partial observations, remain\nchallenging. In this paper, we propose a novel operator learning framework for\nsolving reconstruction problems by using the Energy Transformer (ET), an\narchitecture inspired by associative memory models. We formulate reconstruction\nas a mapping from incomplete observed data to full reconstructed fields. The\nmethod is validated on three fluid mechanics examples using diverse types of\ndata: (1) unsteady 2D vortex street in flow past a cylinder using simulation\ndata; (2) high-speed under-expanded supersonic jet impingement using\nSchlieren imaging; and (3) 3D turbulent jet flow using particle tracking. The\nresults demonstrate the ability of ET to accurately reconstruct complex flow\nfields from highly incomplete data (90\% missing), even for noisy experimental\nmeasurements, with fast training and inference on a single GPU. 
This work\nprovides a promising new direction for tackling reconstruction problems in\nfluid mechanics and other areas in mechanics, geophysics, weather prediction,\nand beyond.\n","arxiv_id":"http://arxiv.org/abs/2501.08339v1","authors":["Qian Zhang","Dmitry Krotov","George Em Karniadakis"]},{"title":"VERITAS: Verifying the Performance of AI-native Transceiver Actions in\n Base-Stations","abstract":" Artificial Intelligence (AI)-native receivers provide significant performance\nimprovement in high noise regimes and can potentially reduce communication\noverhead compared to the traditional receiver. However, their performance\nhighly depends on the representativeness of the training dataset. A major issue\nis the uncertainty of whether the training dataset covers all test environments\nand waveform configurations, and thus, whether the trained model is robust in\npractical deployment conditions. To this end, we propose a joint\nmeasurement-recovery framework for AI-native transceivers post deployment,\ncalled VERITAS, that continuously looks for distribution shifts in the received\nsignals and triggers finite re-training spurts. VERITAS monitors the wireless\nchannel using 5G pilots fed to an auxiliary neural network that detects\nout-of-distribution channel profile, transmitter speed, and delay spread. As\nsoon as such a change is detected, a traditional (reference) receiver is\nactivated, which runs for a period of time in parallel to the AI-native\nreceiver. Finally, VERITAS compares the bit probabilities of the AI-native and\nthe reference receivers for the same received data inputs, and decides whether\nor not a retraining process needs to be initiated. 
Our evaluations reveal that\nVERITAS can detect changes in the channel profile, transmitter speed, and delay\nspread with 99%, 97%, and 69% accuracies, respectively, followed by timely\ninitiation of retraining for 86%, 93.3%, and 94.8% of inputs in channel\nprofile, transmitter speed, and delay spread test sets, respectively.\n","arxiv_id":"http://arxiv.org/abs/2501.09761v1","authors":["Nasim Soltani","Michael Loehning","Kaushik Chowdhury"]},{"title":"Dynamics of Adversarial Attacks on Large Language Model-Based Search\n Engines","abstract":" The increasing integration of Large Language Model (LLM) based search engines\nhas transformed the landscape of information retrieval. However, these systems\nare vulnerable to adversarial attacks, especially ranking manipulation attacks,\nwhere attackers craft webpage content to manipulate the LLM's ranking and\npromote specific content, gaining an unfair advantage over competitors. In this\npaper, we study the dynamics of ranking manipulation attacks. We frame this\nproblem as an Infinitely Repeated Prisoners' Dilemma, where multiple players\nstrategically decide whether to cooperate or attack. We analyze the conditions\nunder which cooperation can be sustained, identifying key factors such as\nattack costs, discount rates, attack success rates, and trigger strategies that\ninfluence player behavior. We identify tipping points in the system dynamics,\ndemonstrating that cooperation is more likely to be sustained when players are\nforward-looking. However, from a defense perspective, we find that simply\nreducing attack success probabilities can, paradoxically, incentivize attacks\nunder certain conditions. Furthermore, defensive measures to cap the upper\nbound of attack success rates may prove futile in some scenarios. These\ninsights highlight the complexity of securing LLM-based systems. 
Our work\nprovides a theoretical foundation and practical insights for understanding and\nmitigating their vulnerabilities, while emphasizing the importance of adaptive\nsecurity strategies and thoughtful ecosystem design.\n","arxiv_id":"http://arxiv.org/abs/2501.00745v1","authors":["Xiyang Hu"]},{"title":"MuQ: Self-Supervised Music Representation Learning with Mel Residual\n Vector Quantization","abstract":" Recent years have witnessed the success of foundation models pre-trained with\nself-supervised learning (SSL) in various music informatics understanding\ntasks, including music tagging, instrument classification, key detection, and\nmore. In this paper, we propose a self-supervised music representation learning\nmodel for music understanding. Distinguished from previous studies adopting\nrandom projection or existing neural codec, the proposed model, named MuQ, is\ntrained to predict tokens generated by Mel Residual Vector Quantization\n(Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel\nspectrum quantization to enhance the stability and efficiency of target\nextraction and lead to better performance. Experiments in a large variety of\ndownstream tasks demonstrate that MuQ outperforms previous self-supervised\nmusic representation models with only 0.9K hours of open-source pre-training\ndata. Scaling up the data to over 160K hours and adopting iterative training\nconsistently improve the model performance. To further validate the strength of\nour model, we present MuQ-MuLan, a joint music-text embedding model based on\ncontrastive learning, which achieves state-of-the-art performance in the\nzero-shot music tagging task on the MagnaTagATune dataset. 
Code and checkpoints\nare open source in https://github.com/tencent-ailab/MuQ.\n","arxiv_id":"http://arxiv.org/abs/2501.01108v2","authors":["Haina Zhu","Yizhi Zhou","Hangting Chen","Jianwei Yu","Ziyang Ma","Rongzhi Gu","Yi Luo","Wei Tan","Xie Chen"]},{"title":"Symmetries-enhanced Multi-Agent Reinforcement Learning","abstract":" Multi-agent reinforcement learning has emerged as a powerful framework for\nenabling agents to learn complex, coordinated behaviors but faces persistent\nchallenges regarding its generalization, scalability and sample efficiency.\nRecent advancements have sought to alleviate those issues by embedding\nintrinsic symmetries of the systems in the policy. Yet, most dynamical systems\nexhibit little to no symmetries to exploit. This paper presents a novel\nframework for embedding extrinsic symmetries in multi-agent system dynamics\nthat enables the use of symmetry-enhanced methods to address systems with\ninsufficient intrinsic symmetries, expanding the scope of equivariant learning\nto a wide variety of MARL problems. Central to our framework is the Group\nEquivariant Graphormer, a group-modular architecture specifically designed for\ndistributed swarming tasks. Extensive experiments on a swarm of\nsymmetry-breaking quadrotors validate the effectiveness of our approach,\nshowcasing its potential for improved generalization and zero-shot scalability.\nOur method achieves significant reductions in collision rates and enhances task\nsuccess rates across a diverse range of scenarios and varying swarm sizes.\n","arxiv_id":"http://arxiv.org/abs/2501.01136v1","authors":["Nikolaos Bousias","Stefanos Pertigkiozoglou","Kostas Daniilidis","George Pappas"]},{"title":"Change Detection-Based Procedures for Piecewise Stationary MABs: A\n Modular Approach","abstract":" Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary\nenvironments, where the reward distributions associated with the arms do not\nchange with time. 
In many applications, however, the environment is more\naccurately modeled as being nonstationary. In this work, piecewise stationary\nMAB (PS-MAB) environments are investigated, in which the reward distributions\nassociated with a subset of the arms change at some change-points and remain\nstationary between change-points. Our focus is on the asymptotic analysis of\nPS-MABs, for which practical algorithms based on change detection (CD) have\nbeen previously proposed. Our goal is to modularize the design and analysis of\nsuch CD-based Bandit (CDB) procedures. To this end, we identify the\nrequirements for stationary bandit algorithms and change detectors in a CDB\nprocedure that are needed for the modularization. We assume that the rewards\nare sub-Gaussian. Under this assumption and a condition on the separation of\nthe change-points, we show that the analysis of CDB procedures can indeed be\nmodularized, so that regret bounds can be obtained in a unified manner for\nvarious combinations of change detectors and bandit algorithms. Through this\nanalysis, we develop new modular CDB procedures that are order-optimal. We\ncompare the performance of our modular CDB procedures with various other\nmethods in simulations.\n","arxiv_id":"http://arxiv.org/abs/2501.01291v1","authors":["Yu-Han Huang","Argyrios Gerogiannis","Subhonmesh Bose","Venugopal V. Veeravalli"]},{"title":"Quantifying A Firm's AI Engagement: Constructing Objective, Data-Driven,\n AI Stock Indices Using 10-K Filings","abstract":" Following an analysis of existing AI-related exchange-traded funds (ETFs), we\nreveal the selection criteria for determining which stocks qualify as\nAI-related are often opaque and rely on vague phrases and subjective judgments.\nThis paper proposes a new, objective, data-driven approach using natural\nlanguage processing (NLP) techniques to classify AI stocks by analyzing annual\n10-K filings from 3,395 NASDAQ-listed firms between 2011 and 2023. 
This\nanalysis quantifies each company's engagement with AI through binary indicators\nand weighted AI scores based on the frequency and context of AI-related terms.\nUsing these metrics, we construct four AI stock indices-the Equally Weighted AI\nIndex (AII), the Size-Weighted AI Index (SAII), and two Time-Discounted AI\nIndices (TAII05 and TAII5X)-offering different perspectives on AI investment.\nWe validate our methodology through an event study on the launch of OpenAI's\nChatGPT, demonstrating that companies with higher AI engagement saw\nsignificantly greater positive abnormal returns, with analyses supporting the\npredictive power of our AI measures. Our indices perform on par with or surpass\n14 existing AI-themed ETFs and the Nasdaq Composite Index in risk-return\nprofiles, market responsiveness, and overall performance, achieving higher\naverage daily returns and risk-adjusted metrics without increased volatility.\nThese results suggest our NLP-based approach offers a reliable,\nmarket-responsive, and cost-effective alternative to existing AI-related ETF\nproducts. Our innovative methodology can also guide investors, asset managers,\nand policymakers in using corporate data to construct other thematic\nportfolios, contributing to a more transparent, data-driven, and competitive\napproach.\n","arxiv_id":"http://arxiv.org/abs/2501.01763v1","authors":["Lennart Ante","Aman Saggu"]},{"title":"KANS: Knowledge Discovery Graph Attention Network for Soft Sensing in\n Multivariate Industrial Processes","abstract":" Soft sensing of hard-to-measure variables is often crucial in industrial\nprocesses. Current practices rely heavily on conventional modeling techniques\nthat show success in improving accuracy. However, they overlook the non-linear\nnature, dynamics characteristics, and non-Euclidean dependencies between\ncomplex process variables. 
To tackle these challenges, we present a framework\nknown as a Knowledge discovery graph Attention Network for effective Soft\nsensing (KANS). Unlike the existing deep learning soft sensor models, KANS can\ndiscover the intrinsic correlations and irregular relationships between the\nmultivariate industrial processes without a predefined topology. First, an\nunsupervised graph structure learning method is introduced, incorporating the\ncosine similarity between different sensor embeddings to capture the\ncorrelations between sensors. Next, we present a graph attention-based\nrepresentation learning method that can process the multivariate data in\nparallel to enhance the model in learning complex sensor nodes and edges. To\nfully explore KANS, knowledge discovery analysis has also been conducted to\ndemonstrate the interpretability of the model. Experimental results demonstrate\nthat KANS significantly outperforms all the baselines and state-of-the-art\nmethods in soft sensing performance. Furthermore, the analysis shows that KANS\ncan find sensors closely related to different process variables without domain\nknowledge, significantly improving soft sensing accuracy.\n","arxiv_id":"http://arxiv.org/abs/2501.02015v1","authors":["Hwa Hui Tew","Gaoxuan Li","Fan Ding","Xuewen Luo","Junn Yong Loo","Chee-Ming Ting","Ze Yang Ding","Chee Pin Tan"]},{"title":"Benchmark Evaluations, Applications, and Challenges of Large Vision\n Language Models: A Survey","abstract":" Multimodal Vision Language Models (VLMs) have emerged as a transformative\ntechnology at the intersection of computer vision and natural language\nprocessing, enabling machines to perceive and reason about the world through\nboth visual and textual modalities. For example, models such as CLIP, Claude,\nand GPT-4V demonstrate strong reasoning and understanding abilities on visual\nand textual data and beat classical single modality vision models on zero-shot\nclassification. 
Despite their rapid advancements in research and growing\npopularity in applications, a comprehensive survey of existing studies on VLMs\nis notably lacking, particularly for researchers aiming to leverage VLMs in\ntheir specific domains. To this end, we provide a systematic overview of VLMs\nin the following aspects: model information of the major VLMs developed over\nthe past five years (2019-2024); the main architectures and training methods of\nthese VLMs; summary and categorization of the popular benchmarks and evaluation\nmetrics of VLMs; the applications of VLMs including embodied agents, robotics,\nand video generation; the challenges and issues faced by current VLMs such as\nhallucination, fairness, and safety. Detailed collections including papers and\nmodel repository links are listed in\nhttps://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.\n","arxiv_id":"http://arxiv.org/abs/2501.02189v2","authors":["Zongxia Li","Xiyang Wu","Hongyang Du","Huy Nghiem","Guangyao Shi"]},{"title":"Optimizing Small Language Models for In-Vehicle Function-Calling","abstract":" We propose a holistic approach for deploying Small Language Models (SLMs) as\nfunction-calling agents within vehicles as edge devices, offering a more\nflexible and robust alternative to traditional rule-based systems. By\nleveraging SLMs, we simplify vehicle control mechanisms and enhance the user\nexperience. Given the in-vehicle hardware constraints, we apply\nstate-of-the-art model compression techniques, including structured pruning,\nhealing, and quantization, ensuring that the model fits within the resource\nlimitations while maintaining acceptable performance. Our work focuses on\noptimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best\npractices for enabling embedded models, including compression, task-specific\nfine-tuning, and vehicle integration. 
We demonstrate that, despite significant\nreduction in model size which removes up to 2 billion parameters from the\noriginal model, our approach preserves the model's ability to handle complex\nin-vehicle tasks accurately and efficiently. Furthermore, by executing the\nmodel in a lightweight runtime environment, we achieve a generation speed of 11\ntokens per second, making real-time, on-device inference feasible without\nhardware acceleration. Our results demonstrate the potential of SLMs to\ntransform vehicle control systems, enabling more intuitive interactions between\nusers and their vehicles for an enhanced driving experience.\n","arxiv_id":"http://arxiv.org/abs/2501.02342v1","authors":["Yahya Sowti Khiabani","Farris Atif","Chieh Hsu","Sven Stahlmann","Tobias Michels","Sebastian Kramer","Benedikt Heidrich","M. Saquib Sarfraz","Julian Merten","Faezeh Tafazzoli"]},{"title":"Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers","abstract":" We present an approach to modifying Transformer architectures by integrating\ngraph-aware relational reasoning into the attention mechanism, merging concepts\nfrom graph neural networks and language modeling. Building on the inherent\nconnection between attention and graph theory, we reformulate the Transformer's\nattention mechanism as a graph operation and propose Graph-Aware Isomorphic\nAttention. This method leverages advanced graph modeling strategies, including\nGraph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA),\nto enrich the representation of relational structures. Our approach captures\ncomplex dependencies and generalizes across tasks, as evidenced by a reduced\ngeneralization gap and improved learning performance. Additionally, we expand\nthe concept of graph-aware attention to introduce Sparse GIN-Attention, a\nfine-tuning approach that employs sparse GINs. 
By interpreting attention\nmatrices as sparse adjacency graphs, this technique enhances the adaptability\nof pre-trained foundational models with minimal computational overhead,\nendowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning\nachieves improved training dynamics and better generalization compared to\nalternative methods like low-rank adaptation (LoRA). We discuss latent\ngraph-like structures within traditional attention mechanisms, offering a new\nlens through which Transformers can be understood, evolving them into\nhierarchical GIN models for relational reasoning. This perspective suggests\nprofound implications for foundational model development, enabling the design\nof architectures that dynamically adapt to both local and global dependencies.\nApplications in bioinformatics, materials science, language modeling, and\nbeyond could benefit from this synthesis of relational and sequential data\nmodeling, setting the stage for interpretable and generalizable modeling\nstrategies.\n","arxiv_id":"http://arxiv.org/abs/2501.02393v2","authors":["Markus J. Buehler"]},{"title":"Toward Inclusive Educational AI: Auditing Frontier LLMs through a\n Multiplexity Lens","abstract":" As large language models (LLMs) like GPT-4 and Llama 3 become integral to\neducational contexts, concerns are mounting over the cultural biases, power\nimbalances, and ethical limitations embedded within these technologies. Though\ngenerative AI tools aim to enhance learning experiences, they often reflect\nvalues rooted in Western, Educated, Industrialized, Rich, and Democratic\n(WEIRD) cultural paradigms, potentially sidelining diverse global perspectives.\nThis paper proposes a framework to assess and mitigate cultural bias within\nLLMs through the lens of applied multiplexity. Multiplexity, inspired by\nSenturk et al. 
and rooted in Islamic and other wisdom traditions, emphasizes\nthe coexistence of diverse cultural viewpoints, supporting a multi-layered\nepistemology that integrates both empirical sciences and normative values. Our\nanalysis reveals that LLMs frequently exhibit cultural polarization, with\nbiases appearing in both overt responses and subtle contextual cues. To address\ninherent biases and incorporate multiplexity in LLMs, we propose two\nstrategies: \\textit{Contextually-Implemented Multiplex LLMs}, which embed\nmultiplex principles directly into the system prompt, influencing LLM outputs\nat a foundational level and independent of individual prompts, and\n\\textit{Multi-Agent System (MAS)-Implemented Multiplex LLMs}, where multiple\nLLM agents, each representing distinct cultural viewpoints, collaboratively\ngenerate a balanced, synthesized response. Our findings demonstrate that as\nmitigation strategies evolve from contextual prompting to MAS-implementation,\ncultural inclusivity markedly improves, evidenced by a significant rise in the\nPerspectives Distribution Score (PDS) and a PDS Entropy increase from 3.25\\% at\nbaseline to 98\\% with the MAS-Implemented Multiplex LLMs. Sentiment analysis\nfurther shows a shift towards positive sentiment across cultures,...\n","arxiv_id":"http://arxiv.org/abs/2501.03259v1","authors":["Abdullah Mushtaq","Muhammad Rafay Naeem","Muhammad Imran Taj","Ibrahim Ghaznavi","Junaid Qadir"]},{"title":"LLM Content Moderation and User Satisfaction: Evidence from Response\n Refusals in Chatbot Arena","abstract":" LLM safety and ethical alignment are widely discussed, but the impact of\ncontent moderation on user satisfaction remains underexplored. To address this,\nwe analyze nearly 50,000 Chatbot Arena response-pairs using a novel fine-tuned\nRoBERTa model, that we trained on hand-labeled data to disentangle refusals due\nto ethical concerns from other refusals due to technical disabilities or lack\nof information. 
Our findings reveal a significant refusal penalty on content\nmoderation, with users choosing ethical-based refusals roughly one-fourth as\noften as their preferred LLM response compared to standard responses. However,\nthe context and phrasing play critical roles: refusals on highly sensitive\nprompts, such as illegal content, achieve higher win rates than less sensitive\nethical concerns, and longer responses closely aligned with the prompt perform\nbetter. These results emphasize the need for nuanced moderation strategies that\nbalance ethical safeguards with user satisfaction. Moreover, we find that the\nrefusal penalty is notably lower in evaluations using the LLM-as-a-Judge\nmethod, highlighting discrepancies between user and automated assessments.\n","arxiv_id":"http://arxiv.org/abs/2501.03266v1","authors":["Stefan Pasch"]},{"title":"Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy\n Measures","abstract":" The digital age, driven by the AI revolution, brings significant\nopportunities but also conceals security threats, which we refer to as cyber\nshadows. These threats pose risks at individual, organizational, and societal\nlevels. This paper examines the systemic impact of these cyber threats and\nproposes a comprehensive cybersecurity strategy that integrates AI-driven\nsolutions, such as Intrusion Detection Systems (IDS), with targeted policy\ninterventions. By combining technological and regulatory measures, we create a\nmultilevel defense capable of addressing both direct threats and indirect\nnegative externalities. We emphasize that the synergy between AI-driven\nsolutions and policy interventions is essential for neutralizing cyber threats\nand mitigating their negative impact on the digital economy. 
Finally, we\nunderscore the need for continuous adaptation of these strategies, especially\nin response to the rapid advancement of autonomous AI-driven attacks, to ensure\nthe creation of secure and resilient digital ecosystems.\n","arxiv_id":"http://arxiv.org/abs/2501.09025v1","authors":["Marc Schmitt","Pantelis Koutroumpis"]},{"title":"Zero-Shot Statistical Tests for LLM-Generated Text Detection using\n Finite Sample Concentration Inequalities","abstract":" Verifying the provenance of content is crucial to the function of many\norganizations, e.g., educational institutions, social media platforms, firms,\netc. This problem is becoming increasingly difficult as text generated by Large\nLanguage Models (LLMs) becomes almost indistinguishable from human-generated\ncontent. In addition, many institutions utilize in-house LLMs and want to\nensure that external, non-sanctioned LLMs do not produce content within the\ninstitution. In this paper, we answer the following question: Given a piece of\ntext, can we identify whether it was produced by LLM $A$ or $B$ (where $B$ can\nbe a human)? We model LLM-generated text as a sequential stochastic process\nwith complete dependence on history and design zero-shot statistical tests to\ndistinguish between (i) the text generated by two different sets of LLMs $A$\n(in-house) and $B$ (non-sanctioned) and also (ii) LLM-generated and\nhuman-generated texts. We prove that the type I and type II errors for our\ntests decrease exponentially in the text length. In designing our tests, we\nderive concentration inequalities on the difference between log-perplexity and\nthe average entropy of the string under $A$. Specifically, for a given string,\nwe demonstrate that if the string is generated by $A$, the log-perplexity of\nthe string under $A$ converges to the average entropy of the string under $A$,\nexcept with an exponentially small probability in string length. 
We also show\nthat if $B$ generates the text, except with an exponentially small probability\nin string length, the log-perplexity of the string under $A$ converges to the\naverage cross-entropy of $B$ and $A$. Lastly, we present preliminary\nexperimental results to support our theoretical results. By enabling guaranteed\n(with high probability) finding of the origin of harmful LLM-generated text\nwith arbitrary size, we can help combat misinformation.\n","arxiv_id":"http://arxiv.org/abs/2501.02406v2","authors":["Tara Radvand","Mojtaba Abdolmaleki","Mohamed Mostagir","Ambuj Tewari"]}] |