Showing new listings for Tuesday, 5 August 2025
- [1] arXiv:2508.01255 [pdf, html, other]
Title: TestWeaver: Execution-aware, Feedback-driven Regression Testing Generation with Large Language Models
Cuong Chi Le, Cuong Duc Van, Tung Duy Vu, Thai Minh Pham Vu, Hoang Nhat Phan, Huy Nhat Phan, Tien N. Nguyen
Subjects: Software Engineering (cs.SE)
Regression testing ensures that code changes do not unintentionally break existing functionality. While recent advances in large language models (LLMs) have shown promise in automating test generation for regression testing, they often suffer from limited reasoning about program execution, resulting in stagnated coverage growth - a phenomenon known as the coverage plateau. In this paper, we present TestWeaver, a novel LLM-based approach that integrates lightweight program analysis to guide test generation more effectively. TestWeaver introduces three key innovations: (1) it reduces hallucinations and improves focus by supplying the LLM with the backward slice from the target line instead of full program context; (2) it identifies and incorporates close test cases - those that share control-flow similarities with the path to the target line - to provide execution context within the LLM's context window; and (3) it enhances the LLM's reasoning with in-line execution annotations that encode variable states as comments along executed paths. By equipping LLMs with these targeted and contextualized inputs, TestWeaver improves coverage-guided test generation and mitigates redundant explorations. Empirical results demonstrate that TestWeaver accelerates code coverage growth and generates more effective regression test cases than existing LLM-based approaches.
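To make the notion of a backward slice concrete, here is a minimal sketch in Python. It is not TestWeaver's slicer, whose design the abstract does not describe. It assumes straight-line code made only of simple assignments and expression statements (no branches, loops, or augmented assignments), and the names `backward_slice` and `PROGRAM` are hypothetical.

```python
import ast

def backward_slice(source: str, target_line: int) -> list[int]:
    """Return the lines the target line depends on (straight-line code only)."""
    tree = ast.parse(source)
    # For each top-level statement, record the names it defines and the names it reads.
    stmts = []
    for node in tree.body:
        names = [n for n in ast.walk(node) if isinstance(n, ast.Name)]
        defs = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        uses = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        stmts.append((node.lineno, defs, uses))

    needed = set()  # variables whose defining statement we still have to find
    kept = []
    # Walk backwards from the target, keeping statements that define a needed variable.
    for lineno, defs, uses in reversed(stmts):
        if lineno > target_line:
            continue
        if lineno == target_line or defs & needed:
            kept.append(lineno)
            needed -= defs
            needed |= uses
    return sorted(kept)

# Hypothetical six-line program; lines 2 and 4 do not influence line 6.
PROGRAM = """a = 1
b = 2
c = a + 5
d = b * 2
e = c - 1
print(e)
"""

print(backward_slice(PROGRAM, 6))  # [1, 3, 5, 6]
```

The slice for line 6 drops the statements that only feed `d`, which is the kind of context reduction the abstract attributes to supplying a slice instead of the full program.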
- [2] arXiv:2508.01337 [pdf, html, other]
Title: Screencast-Based Analysis of User-Perceived GUI Responsiveness
Subjects: Software Engineering (cs.SE)
GUI responsiveness is critical for a positive user experience in mobile applications. Even brief delays in visual feedback can frustrate users and lead to negative reviews. However, detecting and quantifying such user-perceived delays remains challenging, especially in industrial testing pipelines that evaluate thousands of apps daily across diverse devices and OS versions. Existing techniques based on static analysis or system metrics, while useful, may not accurately capture user-perceived issues or scale effectively.
In this experience paper, we present \tool, a lightweight and black-box technique that measures GUI responsiveness directly from mobile screencasts -- video recordings captured during automated GUI testing. \tool detects user interactions and visual delays, helping developers identify GUI performance issues that affect the user experience. It uses computer vision to detect user interactions and analyzes frame-level visual changes to compute two key metrics: response time (from user action to first visual feedback) and finish time (until visual feedback stabilizes). We evaluate \tool on a manually annotated benchmark of 2,458 interactions from 64 popular Android apps. \tool achieves 0.96 precision and 0.93 recall in detecting interactions, and measures response and finish times within 50 ms and 100 ms error, respectively, for over 89% of interactions. The tool has been deployed in an industrial testing pipeline and analyzes thousands of screencasts daily, uncovering responsiveness issues missed by traditional tools and improving performance debugging efficiency.
- [3] arXiv:2508.01357 [pdf, html, other]
Title: HyClone: Bridging LLM Understanding and Dynamic Execution for Semantic Code Clone Detection
Subjects: Software Engineering (cs.SE)
Code clone detection is a critical task in software engineering, aimed at identifying duplicated or similar code fragments within or across software systems. Traditional methods often fail to capture functional equivalence, particularly for semantic clones (Type 4), where code fragments implement identical functionality despite differing syntactic structures. Recent advances in large language models (LLMs) have shown promise in understanding code semantics. However, directly applying LLMs to code clone detection yields suboptimal results due to their sensitivity to syntactic differences. To address these challenges, we propose a novel two-stage framework that combines LLM-based screening with execution-based validation for detecting semantic clones in Python programs. In the first stage, an LLM evaluates code pairs to filter out obvious non-clones based on semantic analysis. For pairs not identified as clones, the second stage employs an execution-based validation approach, utilizing LLM-generated test inputs to assess functional equivalence through cross-execution validation. Our experimental evaluation demonstrates significant improvements in precision, recall, and F1-score compared to direct LLM-based detection, highlighting the framework's effectiveness in identifying semantic clones. Future work includes exploring cross-language clone detection and optimizing the framework for large-scale applications.
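The execution-based stage can be pictured with a small sketch: run both candidates on the same inputs and compare what they return. This is only an illustration under stated assumptions, not HyClone's implementation. The test inputs are hand-written here rather than LLM-generated, and `cross_execute` and the three `sum_*` functions are hypothetical names.

```python
def cross_execute(f, g, inputs):
    """Return True if f and g agree (same result or same exception type) on every input."""
    def run(fn, args):
        try:
            return ("ok", fn(*args))
        except Exception as exc:  # an exception counts as observable behaviour too
            return ("error", type(exc).__name__)
    return all(run(f, args) == run(g, args) for args in inputs)

# Two syntactically different functions with identical behaviour (a Type-4 style pair).
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

# A look-alike that differs on negative numbers.
def sum_abs(xs):
    return sum(abs(x) for x in xs)

# Hand-written stand-ins for generated test inputs; each tuple is one argument list.
inputs = [([1, 2, 3],), ([],), ([-1, 4],)]

print(cross_execute(sum_loop, sum_builtin, inputs))  # True
print(cross_execute(sum_loop, sum_abs, inputs))      # False
```

Agreement on a finite input set is evidence of functional equivalence rather than proof, so the quality of the inputs matters: without the `[-1, 4]` case, `sum_abs` would pass as a clone.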
- [4] arXiv:2508.01358 [pdf, html, other]
Title: An Empirical Validation of Open Source Repository Stability Metrics
Subjects: Software Engineering (cs.SE)
Over the past few decades, open source software has been continuously integrated into software supply chains worldwide, drastically increasing reliance and dependence. Because of the role this software plays, it is important to understand ways to measure and promote its stability and potential for sustainability. Recent work proposed the use of control theory to understand repository stability and evaluate repositories' ability to return to equilibrium after a disturbance such as the introduction of a new feature request, a spike in bug reports, or even the influx or departure of contributors. This approach leverages commit frequency patterns, issue resolution rate, pull request merge rate, and community activity engagement to provide a Composite Stability Index (CSI). While this framework has theoretical foundations, there is no empirical validation of the CSI in practice. In this paper, we present the first empirical validation of the proposed CSI by experimenting with 100 highly ranked GitHub repositories. Our results suggest that (1) sampling the commit frequency pattern weekly instead of daily is a more feasible measure of commit frequency stability across repositories and (2) improved statistical inference (swapping the mean for the median), particularly when ascertaining resolution and review times in issues and pull requests, improves the overall issue and pull request stability index. Drawing on our empirical dataset, we also derive data-driven half-width parameters that better align stability scores with real project behavior. These findings both confirm the viability of a control-theoretic lens on open-source health and provide concrete, evidence-backed applications for real-world project monitoring tools.
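The effect of swapping the mean for the median can be illustrated with a few lines of Python. The numbers below are invented for illustration and do not come from the paper's dataset; `resolution_days` is a hypothetical list of issue resolution times in which one issue stayed open for months.

```python
from statistics import mean, median

# Made-up issue resolution times in days: six quick fixes and one long-lived issue.
resolution_days = [1, 2, 2, 3, 3, 4, 180]

mean_days = mean(resolution_days)      # pulled far upward by the single outlier
median_days = median(resolution_days)  # stays at the typical value

print(f"mean:   {mean_days:.1f} days")  # mean:   27.9 days
print(f"median: {median_days} days")    # median: 3 days
```

With skewed data like this, the mean describes none of the issues well, while the median still reflects typical maintainer behaviour, which is one plausible reading of why the abstract reports the median as the better basis for the index.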
- [5] arXiv:2508.01430 [pdf, html, other]
Title: From Technical Excellence to Practical Adoption: Lessons Learned Building an ML-Enhanced Trace Analysis Tool
Subjects: Software Engineering (cs.SE)
System tracing has become essential for understanding complex software behavior in modern systems, yet sophisticated trace analysis tools face significant adoption gaps in industrial settings. Through a year-long collaboration with Ericsson Montréal, developing TMLL (Trace-Server Machine Learning Library, now in the Eclipse Foundation), we investigated barriers to trace analysis adoption. Contrary to assumptions about complexity or automation needs, practitioners struggled with translating expert knowledge into actionable insights, integrating analysis into their workflows, and trusting automated results they could not validate. We identified what we called the Excellence Paradox: technical excellence can actively impede adoption when conflicting with usability, transparency, and practitioner trust. TMLL addresses this through adoption-focused design that embeds expert knowledge in interfaces, provides transparent explanations, and enables incremental adoption. Validation through Ericsson's experts' feedback, Eclipse Foundation's integration, and a survey of 40 industry and academic professionals revealed consistent patterns: survey results showed that 77.5% prioritize quality and trust in results over technical sophistication, while 67.5% prefer semi-automated analysis with user control, findings supported by qualitative feedback from industrial collaboration and external peer review. Results validate three core principles: cognitive compatibility, embedded expertise, and transparency-based trust. This challenges conventional capability-focused tool development, demonstrating that sustainable adoption requires reorientation toward adoption-focused design with actionable implications for automated software engineering tools.
- [6] arXiv:2508.01443 [pdf, html, other]
Title: Tuning LLM-based Code Optimization via Meta-Prompting: An Industrial Perspective
Jingzhi Gong, Rafail Giavrimis, Paul Brookes, Vardan Voskanyan, Fan Wu, Mari Ashiga, Matthew Truscott, Mike Basios, Leslie Kanthan, Jie Xu, Zheng Wang
Comments: Submitted to ASE'25 Industry Showcase
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
There is a growing interest in leveraging large language models (LLMs) for automated code optimization. However, industrial platforms deploying multiple LLMs face a critical challenge: prompts optimized for one LLM often fail with others, requiring expensive model-specific prompt engineering. This cross-model prompt engineering bottleneck severely limits the practical deployment of multi-LLM optimization systems in production environments. To address this, we introduce Meta-Prompted Code Optimization (MPCO), a framework that automatically generates high-quality, task-specific prompts across diverse LLMs while maintaining industrial efficiency requirements. MPCO leverages meta-prompting to dynamically synthesize context-aware optimization prompts by integrating project metadata, task requirements, and LLM-specific contexts, and it seamlessly deploys on the ARTEMIS industrial platform for automated validation and scaling.
Our comprehensive evaluation on five real-world codebases with 366 hours of runtime benchmarking demonstrates MPCO's effectiveness: it achieves overall performance improvements up to 19.06% with the best statistical rank across all systems compared to baseline methods. Analysis shows that 96% of the top-performing optimizations stem from meaningful edits. Through systematic ablation studies and meta-prompter sensitivity analysis, we identify that comprehensive context integration is essential for effective meta-prompting, and that all three major LLMs can serve effectively as meta-prompters, providing actionable insights for industrial practitioners.
- [7] arXiv:2508.01472 [pdf, html, other]
Title: Directed Grammar-Based Test Generation
Lukas Kirschner (Saarland University and University of Luxembourg), Ezekiel Soremekun (Singapore University of Technology and Design)
Comments: 21 pages, 10 figures, 13 tables, submitted to IEEE Transactions on Software Engineering, for replication package, see this http URL
Subjects: Software Engineering (cs.SE)
To effectively test complex software, it is important to generate goal-specific inputs, i.e., inputs that achieve a specific testing goal. However, most state-of-the-art test generators are not designed to target specific goals. Notably, grammar-based test generators, which (randomly) produce syntactically valid inputs via an input specification (i.e., grammar), have a low probability of achieving an arbitrary testing goal. This work addresses this challenge by proposing an automated test generation approach (called FdLoop) which iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs. Given a testing goal, FdLoop iteratively selects, evolves and learns the input distribution of goal-specific test inputs via test feedback and a probabilistic grammar. We concretize FdLoop for four testing goals, namely unique code coverage, input-to-code complexity, program failures (exceptions) and long execution time. We evaluate FdLoop using three (3) well-known input formats (JSON, CSS and JavaScript) and 20 open-source software projects. In most (86%) settings, FdLoop outperforms all five tested baselines, namely the baseline grammar-based test generators (random, probabilistic and inverse-probabilistic methods), EvoGFuzz and DynaMosa. FdLoop is (up to) twice (2X) as effective as the best baseline (EvoGFuzz) in inducing erroneous behaviors. In addition, we show that the main components of FdLoop (i.e., input mutator, grammar mutator and test feedback) contribute positively to its effectiveness. Finally, our evaluation demonstrates that FdLoop effectively achieves single testing goals (revealing erroneous behaviors, generating complex inputs, or inducing long execution time) and scales to multiple testing goals across varying parameter settings.
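The general idea of learning a probabilistic grammar from goal-achieving inputs can be sketched as follows. This is a toy loop under stated assumptions, not FdLoop itself: the grammar, the stand-in goal ("nesting depth of at least 2"), the add-one smoothing, and all function names are hypothetical, and the input and grammar mutators mentioned in the abstract are omitted.

```python
import random
from collections import Counter

# Toy grammar for nested lists of binary digits (hypothetical, for illustration only).
GRAMMAR = {
    "<value>": [("<digit>",), ("[", "<items>", "]")],
    "<items>": [("<value>",), ("<value>", ",", "<items>")],
    "<digit>": [("0",), ("1",)],
}

def uniform_weights(grammar):
    return {nt: [1.0] * len(alts) for nt, alts in grammar.items()}

def generate(symbol, weights, rng, used, depth=0, max_depth=8):
    """Expand a symbol, recording in `used` how often each production fires."""
    if symbol not in GRAMMAR:
        return symbol  # terminal
    alts = GRAMMAR[symbol]
    if depth >= max_depth:
        index = 0  # the first alternative of every rule is the shortest, so this terminates
    else:
        index = rng.choices(range(len(alts)), weights=weights[symbol])[0]
    used[(symbol, index)] += 1
    return "".join(generate(s, weights, rng, used, depth + 1, max_depth)
                   for s in alts[index])

def nesting_depth(text):
    depth = best = 0
    for ch in text:
        if ch == "[":
            depth += 1
            best = max(best, depth)
        elif ch == "]":
            depth -= 1
    return best

def learn_weights(grammar, derivations):
    """Re-estimate production probabilities from the selected inputs (add-one smoothing)."""
    counts = Counter()
    for used in derivations:
        counts.update(used)
    weights = {}
    for nt, alts in grammar.items():
        raw = [counts[(nt, i)] + 1.0 for i in range(len(alts))]
        weights[nt] = [r / sum(raw) for r in raw]
    return weights

rng = random.Random(0)
weights = uniform_weights(GRAMMAR)
for _ in range(3):  # generate, select by feedback, learn, repeat
    batch = []
    for _ in range(50):
        used = Counter()
        batch.append((generate("<value>", weights, rng, used), used))
    selected = [used for text, used in batch if nesting_depth(text) >= 2]
    weights = learn_weights(GRAMMAR, selected)

print(weights["<value>"])
```

Each round shifts probability mass toward the productions that appeared in inputs meeting the goal, so later batches are sampled from a distribution biased toward that goal rather than from the uniform grammar.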
- [8] arXiv:2508.01489 [pdf, other]
Title: GitHub Marketplace: Driving Automation and Fostering Innovation in Software Development
Comments: SANER 2025 journal first paper
Subjects: Software Engineering (cs.SE)
GitHub, a central hub for collaborative software development, has revolutionized the open-source software (OSS) ecosystem through its GitHub Marketplace, a platform launched in 2017 to host automation tools aimed at enhancing the efficiency and scalability of software projects. As the adoption of automation in OSS production grows, understanding the trends, characteristics, and underlying dynamics of this marketplace has become vital. Furthermore, despite the rich repository of academic research on software automation, a disconnect persists between academia and industry practices. This study seeks to bridge this gap by providing a systematic analysis of the GitHub Marketplace, comparing trends observed in industry tools with advancements reported in academic literature, and identifying areas where academia can contribute to practical innovation.
- [9] arXiv:2508.01492 [pdf, html, other]
Title: OpenLambdaVerse: A Dataset and Analysis of Open-Source Serverless Applications
Comments: 8 pages, 7 figures, 13th IEEE International Conference on Cloud Engineering (IC2E 2025, accepted, to appear)
Subjects: Software Engineering (cs.SE)
Function-as-a-Service (FaaS) is at the core of serverless computing, enabling developers to easily deploy applications without managing computing resources. With an Infrastructure-as-Code (IaC) approach, frameworks like the Serverless Framework use YAML configurations to define and deploy APIs, tasks, workflows, and event-driven applications on cloud providers, promoting zero-friction development. As with any rapidly evolving ecosystem, there is a need for updated insights into how these tools are used in real-world projects. Building on the methodology established by the Wonderless dataset for serverless computing (and applying multiple new filtering steps), OpenLambdaVerse addresses this gap by creating a dataset of current GitHub repositories that use the Serverless Framework in applications that contain one or more AWS Lambda functions. We then analyze and characterize this dataset to get an understanding of the state-of-the-art in serverless architectures based on this stack. Through this analysis we gain important insights on the size and complexity of current applications, which languages and runtimes they employ, how the functions are triggered, the maturity of the projects, and their security practices (or lack thereof). OpenLambdaVerse thus offers a valuable, up-to-date resource for both practitioners and researchers who seek to better understand evolving serverless workloads.
- [10] arXiv:2508.01523 [pdf, html, other]
Title: Exploring Direct Instruction and Summary-Mediated Prompting in LLM-Assisted Code Modification
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
This paper presents a study of using large language models (LLMs) in modifying existing code. While LLMs for generating code have been widely studied, their role in code modification remains less understood. Although "prompting" serves as the primary interface for developers to communicate intents to LLMs, constructing effective prompts for code modification introduces challenges different from generation. Prior work suggests that natural language summaries may help scaffold this process, yet such approaches have been validated primarily in narrow domains like SQL rewriting. This study investigates two prompting strategies for LLM-assisted code modification: Direct Instruction Prompting, where developers describe changes explicitly in free-form language, and Summary-Mediated Prompting, where changes are made by editing the generated summaries of the code. We conducted an exploratory study with 15 developers who completed modification tasks using both techniques across multiple scenarios. Our findings suggest that developers followed an iterative workflow: understanding the code, localizing the edit, and validating outputs through execution or semantic reasoning. Each prompting strategy presented trade-offs: direct instruction prompting was more flexible and easier to specify, while summary-mediated prompting supported comprehension, prompt scaffolding, and control. Developers' choice of strategy was shaped by task goals and context, including urgency, maintainability, learning intent, and code familiarity. These findings highlight the need for more usable prompt interactions, including adjustable summary granularity, reliable summary-code traceability, and consistency in generated summaries.
- [11] arXiv:2508.01550 [pdf, html, other]
Title: RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
Zhilong Chen, Chengzong Zhao, Boyuan Chen, Dayi Lin, Yihao Chen, Arthur Leung, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Ahmed E. Hassan
Subjects: Software Engineering (cs.SE)
Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. Our key contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on SWE-Bench-Verified, establishing new state-of-the-art for ≤8B non-thinking LLMs; (2) 7,304 executable environments auto-generated from real GitHub commits with zero manual intervention; (3) 14× storage reduction (1.4GB → 102MB per instance) via intelligent dependency management and image pruning; (4) >70% faster evaluation using a Ray-powered distributed RepoForge harness; (5) 19,000× cheaper labeling through our automated SPICE difficulty assessment technique. By unifying storage-efficient sandboxing, Ray-powered evaluation harness, automated data generation, SPICE-based labeling, and bubble-free RL scaffold, we demonstrate that even ≤8B models can reach new state-of-the-art performance on demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical bottlenecks in SWE agent training: high storage costs of container-based evaluation, inefficient sequential reward pipelines, limited availability of high-quality training data, expensive manual labeling, and multi-turn RL pipeline bottlenecks.
- [12] arXiv:2508.01974 [pdf, html, other]
Title: Flow Sensitivity without Control Flow Graph: An Efficient Andersen-Style Flow-Sensitive Pointer Analysis
Subjects: Software Engineering (cs.SE); Programming Languages (cs.PL)
Flow-sensitive pointer analysis constitutes an essential component of precise program analysis for accurately modeling pointer behaviors by incorporating control flows. Flow-sensitive pointer analysis is extensively used in alias analysis, taint analysis, program understanding, compiler optimization, etc. Existing flow-sensitive pointer analysis approaches, which are conducted based on control flow graphs, have significantly advanced the precision of pointer analysis via sophisticated techniques to leverage control flow information. However, they inevitably suffer from computational inefficiencies when resolving points-to information due to the inherently complex structures of control flow graphs. We present CG-FSPTA, a Flow-Sensitive Constraint Graph (FSConsG) based flow-sensitive pointer analysis to overcome the inefficiency of control-flow-graph-based analysis. CG-FSPTA uses a flow-sensitive variant to leverage the structural advantages of set-constraint graphs (which are commonly used in flow-insensitive pointer analysis) while keeping the flow sensitivity of variable definitions and uses, allowing the incorporation of sophisticated graph optimization and dynamic solving techniques. In this way, CG-FSPTA achieves significant efficiency improvements while keeping the precision of flow-sensitive analysis. Experimental evaluations on benchmark programs demonstrate that CG-FSPTA significantly reduces both memory usage and execution time while maintaining precision. In particular, by solving in the FSConsG, CG-FSPTA achieves an average memory reduction of 33.05% and accelerates flow-sensitive pointer analysis by 7.27x compared to the state-of-the-art method. These experimental results underscore the efficacy of CG-FSPTA as a scalable solution to analyze large-scale software systems, establishing a robust foundation for future advancements in efficient program analysis frameworks.
- [13] arXiv:2508.02023 [pdf, other]
Title: PCREQ: Automated Inference of Compatible Requirements for Python Third-party Library Upgrades
Comments: 52 pages, 33 figures
Subjects: Software Engineering (cs.SE)
Python third-party libraries (TPLs) are essential in modern software development, but upgrades often cause compatibility issues, leading to system failures. These issues fall into two categories: version compatibility issues (VCIs) and code compatibility issues (CCIs). Existing tools mainly detect dependency conflicts but overlook code-level incompatibilities, with no solution fully automating the inference of compatible versions for both VCIs and CCIs. To fill this gap, we propose PCREQ, the first approach to automatically infer compatible requirements by combining version and code compatibility analysis. PCREQ integrates six modules: knowledge acquisition, version compatibility assessment, invoked APIs and modules extraction, code compatibility assessment, version change, and missing TPL completion. PCREQ collects candidate versions, checks for conflicts, identifies API usage, evaluates code compatibility, and iteratively adjusts versions to generate a compatible this http URL with a detailed repair report. To evaluate PCREQ, we construct REQBench, a large-scale benchmark with 2,095 upgrade test cases (including 406 unsolvable by pip). Results show PCREQ achieves a 94.03% inference success rate, outperforming PyEGo (37.02%), ReadPyE (37.16%), and LLM-based approaches (GPT-4o, DeepSeek V3/R1) by 18-20%. PCREQ processes each case from REQBench in 60.79s on average, demonstrating practical efficiency. PCREQ significantly reduces manual effort in troubleshooting upgrades, advancing Python dependency maintenance automation.
- [14] arXiv:2508.02144 [pdf, html, other]
Title: BiFuzz: A Two-Stage Fuzzing Tool for Open-World Video Games
Comments: 4 pages, 5 figures
Subjects: Software Engineering (cs.SE)
Open-world video games present a broader search space than other games, posing challenges for test automation. Fuzzing, which generates new inputs by mutating an initial input, is commonly used to uncover failures. In this study, we proposed BiFuzz, a two-stage fuzzer designed for automated testing of open-world video games, and investigated its effectiveness. The results revealed that BiFuzz mutated the overall strategy of gameplay and test cases, including actual movement paths, step by step. Consequently, BiFuzz can detect 'stucking' failures. The tool and its video are at this http URL.
- [15] arXiv:2508.02167 [pdf, html, other]
Title: An MLIR-based Compilation Framework for Control Flow Management on CGRAs
Subjects: Software Engineering (cs.SE)
Coarse Grained Reconfigurable Arrays (CGRAs) present both high flexibility and efficiency, making them well-suited for the acceleration of intensive workloads. Nevertheless, a key barrier towards their widespread adoption is posed by CGRA compilation, which must cope with a multi-dimensional space spanning both the spatial and the temporal domains. Indeed, state-of-the-art compilers are limited in scope as they mostly deal with the data flow of applications, while having little or no support for control flow. Hence, they mostly target the mapping of single loops and/or delegate the management of control flow divergences to ad-hoc hardware units.
Conversely, in this paper we show that control flow can be effectively managed and optimized at the compilation level, allowing for a broad set of applications to be targeted while being hardware-agnostic and achieving high performance. We embody our methodology in a modular compilation framework consisting of transformation and optimization passes, enabling support for applications with arbitrary control flows running on abstract CGRA meshes. We also introduce a novel mapping methodology that acts as a compilation back-end, addressing the limitations in available CGRA hardware resources and guaranteeing a feasible solution in the compilation process. Our framework achieves up to 2.1X speedups over state-of-the-art approaches, purely through compilation optimizations.
- [16] arXiv:2508.02176 [pdf, html, other]
Title: Highly Interactive Testing for Uninterrupted Development Flow
Comments: 12 pages, ICFP-2025
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Highly interactive development environments (HIDEs) enable uninterrupted development flow through continuous program evolution and rapid hypothesis checking. However, traditional testing approaches -- typically executed separately via CLI -- isolate tests from HIDE tooling (interactive debuggers, value and stack inspectors, etc.) and introduce disruptive delays due to coarse execution granularity and lack of runtime context. This disconnect breaks development flow by exceeding critical attention thresholds. In this paper we present a library that provides runtime representation for tests, allowing tight integration with HIDEs, and enabling immediate access to HIDE tooling in the context of test failure. We then describe development workflows enhanced with testing and demonstrate how they achieve subsecond test reexecution times crucial for maintaining developer focus.
- [17] arXiv:2508.02233 [pdf, other]
Title: A Methodological Framework for LLM-Based Mining of Software Repositories
Subjects: Software Engineering (cs.SE)
Large Language Models (LLMs) are increasingly used in software engineering research, offering new opportunities for automating repository mining tasks. However, despite their growing popularity, the methodological integration of LLMs into Mining Software Repositories (MSR) remains poorly understood. Existing studies tend to focus on specific capabilities or performance benchmarks, providing limited insight into how researchers utilize LLMs across the full research pipeline. To address this gap, we conduct a mixed-method study that combines a rapid review and questionnaire survey in the field of LLM4MSR. We investigate (1) the approaches and (2) the threats that affect the empirical rigor of researchers involved in this field. Our findings reveal 15 methodological approaches, nine main threats, and 25 mitigation strategies. Building on these findings, we present PRIMES 2.0, a refined empirical framework organized into six stages, comprising 23 methodological substeps, each mapped to specific threats and corresponding mitigation strategies, providing prescriptive and adaptive support throughout the lifecycle of LLM-based MSR studies. Our work contributes to establishing a more transparent and reproducible foundation for LLM-based MSR research.
- [18] arXiv:2508.02279 [pdf, html, other]
Title: Dialogue Systems Engineering: A Survey and Future Directions
Comments: 18 pages, 2 figures
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
This paper proposes to refer to the field of software engineering related to the life cycle of dialogue systems as Dialogue Systems Engineering, and surveys this field while also discussing its future directions. With the advancement of large language models, the core technologies underlying dialogue systems have significantly progressed. As a result, dialogue system technology is now expected to be applied to solving various societal issues and in business contexts. To achieve this, it is important to build, operate, and continuously improve dialogue systems correctly and efficiently. Accordingly, in addition to applying existing software engineering knowledge, it is becoming increasingly important to evolve software engineering tailored specifically to dialogue systems. In this paper, we enumerate the knowledge areas of dialogue systems engineering based on those of software engineering, as defined in the Software Engineering Body of Knowledge (SWEBOK) Version 4.0, and survey each area. Based on this survey, we identify unexplored topics in each area and discuss the future direction of dialogue systems engineering.
- [19] arXiv:2508.02335 [pdf, other]
Title: Interoperable verification and dissemination of software assets in repositories using COAR Notify
Comments: 8 pages. Presented at the 20th International Conference on Open Repositories, June 15-18 2025, Chicago, Illinois, USA
Subjects: Software Engineering (cs.SE); Digital Libraries (cs.DL)
The discoverability, attribution, and reusability of open research software are often hindered by its obscurity within academic manuscripts. To address this, the SoFAIR project (2024-2025) introduces a comprehensive workflow leveraging machine learning tools for extracting software mentions from research papers. The project integrates repository systems, authors, and services like HAL and Software Heritage to ensure proper archiving, citation, and accessibility of research software in alignment with FAIR principles. To enable interoperable communication across the various systems we present an integration of the COAR Notify Protocol, which facilitates automated, interoperable communication among repositories and authors to validate and disseminate software mentions. This paper outlines the SoFAIR workflow and the implementation of the COAR Notify Protocol, emphasising its potential to enhance the visibility and credibility of research software as first-class bibliographic records.
- [20] arXiv:2508.02338 [pdf, html, other]
Title: Vision Language Model-based Testing of Industrial Autonomous Mobile Robots
Subjects: Software Engineering (cs.SE); Robotics (cs.RO)
Autonomous Mobile Robots (AMRs) are deployed in diverse environments (e.g., warehouses, retail spaces, and offices), where they work alongside humans. Given that human behavior can be unpredictable and that AMRs may not have been trained to handle all possible unknown and uncertain behaviors, it is important to test AMRs under a wide range of human interactions to ensure their safe behavior. Moreover, testing in real environments with actual AMRs and humans is often costly, impractical, and potentially hazardous (e.g., it could result in human injury). To this end, we propose a Vision Language Model (VLM)-based testing approach (RVSG) for industrial AMRs developed by PAL Robotics in Spain. Based on the functional and safety requirements, RVSG uses the VLM to generate diverse human behaviors that violate these requirements. We evaluated RVSG with several requirements and navigation routes in a simulator using the latest AMR from PAL Robotics. Our results show that, compared with the baseline, RVSG can effectively generate requirement-violating scenarios. Moreover, RVSG-generated scenarios increase variability in robot behavior, thereby helping reveal their uncertain behaviors.
- [21] arXiv:2508.02397 [pdf, other]
-
Title: JC-Finder: Detecting Java Clone-based Third-Party Library by Class-level Tree Analysis
Lida Zhao, Chaofan Li, Yueming Wu, Lyuye Zhang, Jiahui Wu, Chengwei Liu, Sen Chen, Yutao Hu, Zhengzi Xu, Yi Liu, Jingquan Ge, Jun Sun, Yang Liu
Subjects: Software Engineering (cs.SE)
While reusing third-party libraries (TPL) facilitates software development, its chaotic management has brought great threats to software maintenance and the unauthorized use of source code also raises ethical problems such as misconduct on copyrighted code. To identify TPL reuse in projects, Software Composition Analysis (SCA) is employed, and two categories of SCA techniques are used based on how TPLs are introduced: clone-based SCA and package-manager-based SCA (PM-based SCA). Although introducing TPLs by clones is prevalent in Java, no clone-based SCA tools are specially designed for Java. Also, directly applying clone-based SCA techniques from other tools is problematic. To fill this gap, we introduce JC-Finder, a novel clone-based SCA tool that aims to accurately and comprehensively identify instances of TPL reuse introduced by source code clones in Java projects. JC-Finder achieves both accuracy and efficiency in identifying TPL reuse from code cloning by capturing features at the class level, maintaining inter-function relationships, and excluding trivial or duplicated elements. To evaluate the efficiency of JC-Finder, we applied it to 9,965 most popular Maven libraries as reference data and tested the TPL reuse of 1,000 GitHub projects. The result shows that JC-Finder achieved an F1-score of 0.818, outperforming the other function-level tool by 0.427. The average time taken for resolving TPL reuse is 14.2 seconds, which is approximately 9 times faster than the other tool. We further applied JC-Finder to 7,947 GitHub projects, revealing TPL reuse by code clones in 789 projects (about 9.89% of all projects) and identifying a total of 2,142 TPLs. JC-Finder successfully detects 26.20% more TPLs that are not explicitly declared in package managers.
- [22] arXiv:2508.02407 [pdf, other]
-
Title: Quantum Machine Learning-based Test Oracle for Autonomous Mobile Robots
Subjects: Software Engineering (cs.SE)
Robots are increasingly becoming part of our daily lives, interacting with both the environment and humans to perform their tasks. The software of such robots often undergoes upgrades, for example, to add new functionalities, fix bugs, or delete obsolete functionalities. As a result, regression testing of robot software becomes necessary. However, determining the expected correct behavior of robots (i.e., a test oracle) is challenging due to the potentially unknown environments in which the robots must operate. To address this challenge, machine learning (ML)-based test oracles present a viable solution. This paper reports on the development of a test oracle to support regression testing of autonomous mobile robots built by PAL Robotics (Spain), using quantum machine learning (QML), which enables faster training and the construction of more precise test oracles. Specifically, we propose a hybrid framework, QuReBot, that combines quantum reservoir computing (QRC) with a simple neural network, inspired by residual connections, to predict the expected behavior of a robot. Results show that QRC alone fails to converge in our case, yielding high prediction error. In contrast, QuReBot converges and achieves a 15% reduction in prediction error compared to the classical neural network baseline. Finally, we further examine QuReBot under different configurations and offer practical guidance on optimal settings to support future robot software testing.
- [23] arXiv:2508.02455 [pdf, html, other]
-
Title: TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Token-level code completion is one of the most critical features in modern Integrated Development Environments (IDEs). It assists developers by suggesting relevant identifiers and APIs during coding. While completions are typically derived from static analysis, their usefulness depends heavily on how they are ranked, as correct predictions buried deep in the list are rarely seen by users. Most current systems rely on hand-crafted heuristics or lightweight machine learning models trained on user logs, which can be further improved to capture context information and generalize across projects and coding styles. In this work, we propose a new scoring approach to ranking static completions using language models in a lightweight and model-agnostic way. Our method organizes all valid completions into a prefix tree and performs a single greedy decoding pass to collect token-level scores across the tree. This enables a precise token-aware ranking without needing beam search, prompt engineering, or model adaptations. The approach is fast, architecture-agnostic, and compatible with already deployed models for code completion. These findings highlight a practical and effective pathway for integrating language models into already existing tools within IDEs, and ultimately providing smarter and more responsive developer assistance.
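The prefix-tree idea in this abstract can be illustrated with a toy sketch. It assumes completions are already tokenized, and `toy_scorer` is a hypothetical stand-in for a language model's token log-probability. The sketch only shows how a trie lets a shared prefix be scored once for all completions below it; it does not reproduce the single greedy decoding pass the abstract describes.

```python
from typing import Callable, Dict, List, Tuple

Token = str
Scorer = Callable[[Tuple[Token, ...], Token], float]


class TrieNode:
    """One node of a prefix tree over tokenized completions."""

    def __init__(self) -> None:
        self.children: Dict[Token, "TrieNode"] = {}
        self.is_end = False  # True if a completion ends at this node


def build_trie(completions: List[Tuple[Token, ...]]) -> TrieNode:
    root = TrieNode()
    for tokens in completions:
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True
    return root


def rank_completions(root: TrieNode, scorer: Scorer):
    """Score each trie edge once and rank completions by summed score.

    Returns (ranked list of (tokens, score), number of scorer calls).
    """
    calls = 0
    results = []
    stack = [(root, (), 0.0)]
    while stack:
        node, prefix, score = stack.pop()
        if node.is_end:
            results.append((prefix, score))
        for tok, child in node.children.items():
            calls += 1  # one scorer call per edge, shared by all descendants
            stack.append((child, prefix + (tok,), score + scorer(prefix, tok)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results, calls


def toy_scorer(prefix: Tuple[Token, ...], token: Token) -> float:
    # Hypothetical stand-in for a model log-probability: shorter tokens score higher.
    return -float(len(token))


completions = [("get", "Name"), ("get", "Value"), ("set", "Name")]
ranked, calls = rank_completions(build_trie(completions), toy_scorer)
```

Here the three completions contain six tokens in total, but the trie has only five edges because `get` is shared, so the scorer is called five times.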
- [24] arXiv:2508.02473 [pdf, html, other]
-
Title: An Efficient and Adaptive Next Edit Suggestion Framework with Zero Human Instructions in IDEs
Xinfang Chen, Siyang Xiao, Xianying Zhu, Junhong Xie, Ming Liang, Dajun Chen, Wei Jiang, Yong Li, Peng Di
Comments: 13 pages
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Code editing, including modifying, refactoring, and maintaining existing code, is the most frequent task in software development and has garnered significant attention from AI-powered tools. However, existing solutions that translate explicit natural language instructions into code edits face critical limitations, such as heavy reliance on human instruction input and high latency, which hinder their effective integration into a developer's workflow. We observe that developers' habitual behaviors and coding objectives are often reflected in their historical editing patterns, making this data key to addressing existing limitations. To leverage these insights, we propose NES (Next Edit Suggestion), an LLM-driven code editing framework that delivers an instruction-free and low-latency experience. Built on a dual-model architecture and trained with our high-quality SFT and DAPO datasets, NES enhances productivity by understanding developer intent while optimizing inference to minimize latency. NES is a scalable, industry-ready solution with a continuous Tab key interaction workflow, seamlessly adopted by a FinTech company with over 20,000 developers. Evaluations on real-world datasets show NES achieves 75.6% and 81.6% accuracy in two tasks of predicting next edit locations, alongside 91.36% ES and 27.7% EMR for intent-aligned edits, outperforming SOTA models. Our open-sourced SFT and DAPO datasets have been demonstrated to enhance the performance of open-source CodeLLMs. The demonstration of NES is available at this http URL.
- [25] arXiv:2508.02487 [pdf, html, other]
-
Title: Commit Stability as a Signal for Risk in Open-Source Projects
Subjects: Software Engineering (cs.SE)
Open source software (OSS) generates trillions of dollars in economic value and has become essential to technical infrastructures worldwide. As organizations increasingly depend on OSS, understanding project evolution is critical. While existing metrics provide insights into project health, one dimension remains understudied: project resilience -- the ability to return to normal operations after disturbances such as contributor departures, security vulnerabilities, and bug report spikes. We hypothesize that stable commit patterns reflect underlying project characteristics such as mature governance, sustained contributors, and robust development processes that enable resilience. Building on the Composite Stability Index (CSI) framework, we empirically validate commit frequency patterns across 100 highly ranked repositories. Our findings reveal that only 2% of repositories exhibit daily stability, 29% achieve weekly stability, and 50% demonstrate monthly stability, while half remain unstable across all temporal levels. Programming languages and blockchain applications were the most stable. We identified two exemplary repositories that achieved stability at all three granularities, whose governance models, CI cadence, and release policies could serve as reference frameworks. We observed that large yearly commit throughput does not necessarily correlate with stability. Beyond commits, stability can be enriched with issue-resolution times, PR merge rates, and community-engagement metrics to broaden resilience assessment and sharpen stability-based risk evaluation.
- [26] arXiv:2508.02497 [pdf, html, other]
-
Title: Bridging Language Gaps in Open-Source Documentation with Large-Language-Model Translation
Subjects: Software Engineering (cs.SE)
While open source communities attract diverse contributors globally, few repositories provide essential documentation in languages other than English. Large language models (LLMs) have demonstrated remarkable capabilities in software engineering tasks and translations across domains. However, little is known about LLM capabilities in translating open-source technical documentation, which mixes natural language, code, URLs, and markdown formatting. To understand the need and potential for LLMs in technical documentation translation, we evaluated community translation activity and English-to-German translations of 50 README files using OpenAI's ChatGPT 4 and Anthropic's Claude. We found scarce translation activity, mostly in larger repositories and community-driven in nature. LLM performance comparison suggests they can provide accurate translations. However, analysis revealed fidelity challenges: both models struggled to preserve structural components (e.g., hyperlinks) and exhibited formatting inconsistencies. These findings highlight both promise and challenges of LLM-assisted documentation internationalization. As a first step toward translation-aware continuous integration pipelines, we introduce TRIFID, an early-stage translation fidelity scoring framework that automatically checks how well translations preserve code, links, and formatting. Our efforts provide a foundation for automated LLM-driven support for creating and maintaining open source documentation.
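The kind of structural check mentioned in this abstract (whether a translation preserves code and links) can be sketched as follows. This is a minimal illustration and not TRIFID itself: the function names, the regular expressions, and the scoring rule (fraction of source items found verbatim in the translation) are all assumptions made for the example.

```python
import re
from collections import Counter

FENCE = "`" * 3  # Markdown code-fence delimiter, built programmatically
URL_RE = re.compile(r"https?://[^\s)>\]]+")
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def preserved_fraction(source_items, translated_items):
    """Fraction of source items that also appear verbatim in the translation."""
    src = Counter(source_items)
    if not src:
        return 1.0  # nothing to preserve
    dst = Counter(translated_items)
    kept = sum(min(count, dst[item]) for item, count in src.items())
    return kept / sum(src.values())


def fidelity_report(source: str, translation: str) -> dict:
    """Hypothetical per-category preservation scores in [0, 1]."""
    return {
        "urls": preserved_fraction(URL_RE.findall(source), URL_RE.findall(translation)),
        "code_blocks": preserved_fraction(
            FENCE_RE.findall(source), FENCE_RE.findall(translation)
        ),
    }


source = (
    "See https://example.org/docs and run:\n"
    + FENCE + "\npip install demo\n" + FENCE + "\n"
)
# The translation keeps the link but wrongly translates the command itself.
translation = (
    "Siehe https://example.org/docs und starte:\n"
    + FENCE + "\npip installiere demo\n" + FENCE + "\n"
)
report = fidelity_report(source, translation)
```

In this example the link survives but the code block was altered, so the report flags the code-block category while leaving the URL score at its maximum.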
- [27] arXiv:2508.02541 [pdf, other]
-
Title: Automatic Identification of Machine Learning-Specific Code Smells
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Machine learning (ML) has rapidly grown in popularity, becoming vital to many industries. Currently, the research on code smells in ML applications lacks tools and studies that address the identification and validity of ML-specific code smells. This work investigates suitable methods and tools to design and develop a static code analysis tool (MLpylint) based on code smell criteria. This research employed the Design Science Methodology. In the problem identification phase, a literature review was conducted to identify ML-specific code smells. In solution design, a secondary literature review and consultations with experts were performed to select methods and tools for implementing the tool. We evaluated the tool on data from 160 open-source ML applications sourced from GitHub. We also conducted a static validation through an expert survey involving 15 ML professionals. The results indicate the effectiveness and usefulness of the MLpylint. We aim to extend our current approach by investigating ways to introduce MLpylint seamlessly into development workflows, fostering a more productive and innovative developer environment.
- [28] arXiv:2508.02611 [pdf, html, other]
-
Title: Meta-RAG on Large Codebases Using Code Summarization
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Large Language Model (LLM) systems have been at the forefront of applied Artificial Intelligence (AI) research in a multitude of domains. One such domain is software development, where researchers have pushed the automation of a number of code tasks through LLM agents. Software development is a complex ecosystem that stretches far beyond code implementation and well into the realm of code maintenance. In this paper, we propose a multi-agent system to localize bugs in large pre-existing codebases using information retrieval and LLMs. Our system introduces a novel Retrieval Augmented Generation (RAG) approach, Meta-RAG, where we utilize summaries to condense codebases by an average of 79.8% into a compact, structured, natural language representation. We then use an LLM agent to determine which parts of the codebase are critical for bug resolution, i.e., bug localization. We demonstrate the usefulness of Meta-RAG through evaluation with the SWE-bench Lite dataset. Meta-RAG scores 84.67% and 53.0% for file-level and function-level correct localization rates, respectively, achieving state-of-the-art performance.
New submissions (showing 28 of 28 entries)
- [29] arXiv:2508.00843 (cross-list from cs.HC) [pdf, html, other]
-
Title: Generative AI for CAD Automation: Leveraging Large Language Models for 3D Modelling
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Large Language Models (LLMs) are revolutionizing industries by enhancing efficiency, scalability, and innovation. This paper investigates the potential of LLMs in automating Computer-Aided Design (CAD) workflows by integrating FreeCAD with an LLM as a CAD design tool. Traditional CAD processes are often complex and require specialized sketching skills, posing challenges for rapid prototyping and generative design. We propose a framework where LLMs generate initial CAD scripts from natural language descriptions, which are then executed and refined iteratively based on error feedback. Through a series of experiments with increasing complexity, we assess the effectiveness of this approach. Our findings reveal that LLMs perform well for simple to moderately complex designs but struggle with highly constrained models, necessitating multiple refinements. The study highlights the need for improved memory retrieval, adaptive prompt engineering, and hybrid AI techniques to enhance script robustness. Future directions include integrating cloud-based execution and exploring advanced LLM capabilities to further streamline CAD automation. This work underscores the transformative potential of LLMs in design workflows while identifying critical areas for future development.
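The generate, execute, and refine loop described in this abstract can be sketched in a few lines. Everything here is an assumption for illustration: `refine_until_runs` and `fake_generate` are hypothetical names, the generator is a stub rather than a real LLM call, and plain Python `exec` stands in for running a script inside FreeCAD. Generated code should only ever be executed in a sandboxed environment.

```python
import traceback


def refine_until_runs(generate, description, max_rounds=3):
    """Ask `generate` for a script, run it, and feed errors back until it runs.

    `generate(description, feedback)` is a hypothetical callable standing in
    for an LLM; `feedback` is None on the first round, then the last error.
    Returns (script, rounds_used), or (None, max_rounds) if no script ran.
    """
    feedback = None
    for attempt in range(1, max_rounds + 1):
        script = generate(description, feedback)
        try:
            # Stand-in for executing the script in the CAD tool.
            exec(compile(script, "<generated>", "exec"), {})
            return script, attempt
        except Exception:
            feedback = traceback.format_exc(limit=1)
    return None, max_rounds


def fake_generate(description, feedback):
    # Stub generator: the first draft has a bug, the second one fixes it.
    if feedback is None:
        return "box = undefined_name"
    return "box = {'length': 10}"


script, rounds = refine_until_runs(fake_generate, "a 10 mm box")
```

With the stub generator, the first script raises a `NameError`, the error text is passed back as feedback, and the second script runs cleanly.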
- [30] arXiv:2508.00858 (cross-list from cs.LG) [pdf, html, other]
-
Title: Deploying Geospatial Foundation Models in the Real World: Lessons from WorldCereal
Christina Butsko, Kristof Van Tricht, Gabriel Tseng, Giorgia Milli, David Rolnick, Ruben Cartuyvels, Inbal Becker Reshef, Zoltan Szantoi, Hannah Kerner
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
The increasing availability of geospatial foundation models has the potential to transform remote sensing applications such as land cover classification, environmental monitoring, and change detection. Despite promising benchmark results, the deployment of these models in operational settings is challenging and rare. Standardized evaluation tasks often fail to capture real-world complexities relevant for end-user adoption such as data heterogeneity, resource constraints, and application-specific requirements. This paper presents a structured approach to integrate geospatial foundation models into operational mapping systems. Our protocol has three key steps: defining application requirements, adapting the model to domain-specific data and conducting rigorous empirical testing. Using the Presto model in a case study for crop mapping, we demonstrate that fine-tuning a pre-trained model significantly improves performance over conventional supervised methods. Our results highlight the model's strong spatial and temporal generalization capabilities. Our protocol provides a replicable blueprint for practitioners and lays the groundwork for future research to operationalize foundation models in diverse remote sensing applications. Application of the protocol to the WorldCereal global crop-mapping system showcases the framework's scalability.
- [31] arXiv:2508.00952 (cross-list from cs.CY) [pdf, other]
-
Title: Academic Vibe Coding: Opportunities for Accelerating Research in an Era of Resource Constraint
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Academic laboratories face mounting resource constraints: budgets are tightening, grant overheads are potentially being capped, and the market rate for data-science talent significantly outstrips university compensation. Vibe coding, which is structured, prompt-driven code generation with large language models (LLMs) embedded in reproducible workflows, offers one pragmatic response. It aims to compress the idea-to-analysis timeline, reduce staffing pressure on specialized data roles, and maintain rigorous, version-controlled outputs. This article defines the vibe coding concept, situates it against the current academic resourcing crisis, details a beginner-friendly toolchain for its implementation, and analyzes inherent limitations that necessitate governance and mindful application.
- [32] arXiv:2508.01249 (cross-list from cs.CR) [pdf, html, other]
-
Title: AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non-transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose a novel insight that treats the agent runtime traces as structured programs with analyzable semantics. Thus, we present AgentArmor, a program analysis framework that converts agent traces into graph intermediate representation-based structured program dependency representations (e.g., CFG, DFG, and PDG) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent's working traces as graph-based intermediate representations with control flow and data flow described within; (2) a property registry that attaches security-relevant metadata of interacted tools and data; and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis over sensitive data flow, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark; the results show that AgentArmor achieves a TPR of 95.75% with an FPR of only 3.66%. Our results demonstrate AgentArmor's ability to detect prompt injection vulnerabilities and enforce fine-grained security constraints.
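The idea of checking a policy over a data-flow graph reconstructed from an agent trace can be illustrated with a toy sketch. It is not AgentArmor's implementation: simple reachability from untrusted nodes stands in for the type system, and the node names, the `edges` format, and both function names are hypothetical.

```python
from collections import deque


def propagate_untrusted(edges, untrusted_sources):
    """Return every node reachable from an untrusted source along data-flow edges."""
    tainted = set(untrusted_sources)
    queue = deque(untrusted_sources)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, ()):
            if succ not in tainted:
                tainted.add(succ)
                queue.append(succ)
    return tainted


def policy_violations(edges, untrusted_sources, sensitive_sinks):
    """Sensitive sinks that receive data derived from an untrusted source."""
    tainted = propagate_untrusted(edges, untrusted_sources)
    return sorted(tainted & set(sensitive_sinks))


# Hypothetical data-flow graph for one agent run: node -> nodes it flows into.
edges = {
    "user_prompt": ["plan"],
    "plan": ["read_calendar.args"],
    "web_page": ["summary"],
    "summary": ["send_email.body"],
}
violations = policy_violations(
    edges,
    untrusted_sources=["web_page"],
    sensitive_sinks=["send_email.body", "read_calendar.args"],
)
```

In this example the email body is derived from fetched web content, so it is flagged, while the calendar arguments come only from the user's prompt and pass the check.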
- [33] arXiv:2508.01451 (cross-list from cs.CR) [pdf, other]
-
Title: Think Broad, Act Narrow: CWE Identification with Multi-Agent Large Language Models
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Machine learning and large language models (LLMs) for vulnerability detection have received significant attention in recent years. Unfortunately, state-of-the-art techniques show that LLMs are unsuccessful in even distinguishing the vulnerable function from its benign counterpart, due to three main problems: (1) vulnerability detection requires deep analysis, which LLMs often struggle with when making a one-shot prediction; (2) existing techniques typically perform function-level analysis, whereas effective vulnerability detection requires contextual information beyond the function scope; and (3) the focus on binary classification can result in identifying a vulnerability but associating it with the wrong security weakness (CWE), which may mislead developers. We propose a novel multi-agent LLM approach to address the challenges of identifying CWEs. This approach consists of three steps: (1) a team of LLM agents performs an exhaustive search for potential CWEs in the function under review, (2) another team of agents identifies relevant external context to support or refute each candidate CWE, and (3) a final agent makes informed acceptance or rejection decisions for each CWE based on the gathered context. A preliminary evaluation of our approach shows promising results. In the PrimeVul dataset, Step 1 correctly identifies the appropriate CWE in 40.9% of the studied vulnerable functions. We further evaluated the full pipeline on ten synthetic programs and found that incorporating context information significantly reduced false positives from 6 to 9 CWEs to just 1 to 2, while still correctly identifying the true CWE in 9 out of 10 cases.
- [34] arXiv:2508.01494 (cross-list from cs.DC) [pdf, html, other]
-
Title: An Analysis of HPC and Edge Architectures in the Cloud
Comments: 8 pages, 10 figures, accepted at 2nd Workshop on Accelerated HPC in the Cloud-Edge Continuum 2025, held in conjunction with 13th IEEE International Conference on Cloud Engineering (IC2E 2025)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
We analyze a recently published dataset of 396 real-world cloud architectures deployed on AWS, from companies belonging to a wide range of industries. From this dataset, we identify those architectures that contain HPC or edge components and characterize their designs. Specifically, we investigate the prevalence and interplay of AWS services within these architectures, examine the types of storage systems employed, assess architectural complexity and the use of machine learning services, and discuss the implications of our findings and how representative these results are of HPC and edge architectures in the cloud. This characterization provides valuable insights into current industry practices and trends in building robust and scalable HPC and edge solutions in the cloud continuum, and can be valuable for those seeking to better understand how these architectures are being built and to guide new research.
- [35] arXiv:2508.01655 (cross-list from cs.CR) [pdf, html, other]
-
Title: JSidentify-V2: Leveraging Dynamic Memory Fingerprinting for Mini-Game Plagiarism Detection
Zhihao Li, Chaozheng Wang, Zongjie Li, Xinyong Peng, Qun Xia, Haochuan Lu, Ting Xiong, Shuzheng Gao, Cuiyun Gao, Shuai Wang, Yuetang Deng, Huafeng Ma
Comments: 12 pages
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
The explosive growth of mini-game platforms has led to widespread code plagiarism, where malicious users access popular games' source code and republish them with modifications. While existing static analysis tools can detect simple obfuscation techniques like variable renaming and dead code injection, they fail against sophisticated deep obfuscation methods such as encrypted code with local or cloud-based decryption keys that completely destroy code structure and render traditional Abstract Syntax Tree analysis ineffective. To address these challenges, we present JSidentify-V2, a novel dynamic analysis framework that detects mini-game plagiarism by capturing memory invariants during program execution. Our key insight is that while obfuscation can severely distort static code characteristics, runtime memory behavior patterns remain relatively stable. JSidentify-V2 employs a four-stage pipeline: (1) static pre-analysis and instrumentation to identify potential memory invariants, (2) adaptive hot object slicing to maximize execution coverage of critical code segments, (3) Memory Dependency Graph construction to represent behavioral fingerprints resilient to obfuscation, and (4) graph-based similarity analysis for plagiarism detection.
We evaluate JSidentify-V2 against eight obfuscation methods on a comprehensive dataset of 1,200 mini-games ...
- [36] arXiv:2508.01750 (cross-list from cs.CR) [pdf, html, other]
-
Title: LLM-Assisted Model-Based Fuzzing of Protocol Implementations
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Testing network protocol implementations is critical for ensuring the reliability, security, and interoperability of distributed systems. Faults in protocol behavior can lead to vulnerabilities and system failures, especially in real-time and mission-critical applications. A common approach to protocol testing involves constructing Markovian models that capture the state transitions and expected behaviors of the protocol. However, building such models typically requires significant domain expertise and manual effort, making the process time-consuming and difficult to scale across diverse protocols and implementations.
We propose a novel method that leverages large language models (LLMs) to automatically generate sequences for testing network protocol implementations. Our approach begins by defining the full set of possible protocol states, from which the LLM selects a subset to model the target implementation. Using this state-based model, we prompt the LLM to generate code that produces sequences of states. This program serves as a protocol-specific sequences generator. The sequences generator then generates test inputs to call the protocol implementation under various conditions. We evaluated our approach on three widely used network protocol implementations and successfully identified 12 previously unknown vulnerabilities. We have reported them to the respective developers for confirmation. This demonstrates the practical effectiveness of our LLM-assisted fuzzing framework in uncovering real-world security issues.
- [37] arXiv:2508.01856 (cross-list from cs.DC) [pdf, other]
-
Title: Efficient Byzantine Consensus Mechanism Based on Reputation in IoT Blockchain
Journal-ref: Hindawi Wireless Communications and Mobile Computing 2021
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Databases (cs.DB); Software Engineering (cs.SE)
Blockchain technology has advanced rapidly in recent years and is now widely used in a variety of fields. Blockchain appears to be one of the best solutions for managing massive heterogeneous devices while achieving advanced data security and data reputation, particularly in the field of large-scale IoT (Internet of Things) networks. Despite the numerous advantages, there are still challenges while deploying IoT applications on blockchain systems due to the limited storage, power, and computing capability of IoT devices, and some of these problems are caused by the consensus algorithm, which plays a significant role in blockchain systems by ensuring overall system reliability and robustness. Nonetheless, most existing consensus algorithms are prone to poor node reliability, low transaction per second (TPS) rates, and scalability issues. Aiming at some critical problems in the existing consensus algorithms, this paper proposes the Efficient Byzantine Reputation-based Consensus (EBRC) mechanism to resolve the issues raised above. In comparison to traditional algorithms, we reinvented ways to evaluate node reliability and robustness and manage active nodes. Our experiments show that the EBRC algorithm has lower consensus delay, higher throughput, improved security, and lower verification costs. It offers new reference ideas for solving the Internet of Things+blockchain+Internet court construction problem.
- [38] arXiv:2508.01863 (cross-list from cs.CR) [pdf, html, other]
-
Title: Hard-Earned Lessons in Access Control at Scale: Enforcing Identity and Policy Across Trust Boundaries with Reverse Proxies and mTLS
Comments: 6 pages, 3 figures
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Software Engineering (cs.SE)
In today's enterprise environment, traditional access methods such as Virtual Private Networks (VPNs) and application-specific Single Sign-On (SSO) often fall short when it comes to securely scaling access for a distributed and dynamic workforce. This paper presents our experience implementing a modern, Zero Trust-aligned architecture that leverages a reverse proxy integrated with Mutual TLS (mTLS) and centralized SSO, along with the key challenges we encountered and lessons learned during its deployment and scaling. This multidimensional solution involves both per-device and per-user authentication, centralized enforcement of security policies, and comprehensive observability, hence enabling organizations to deliver secure and seamless access to their internal applications.
- [39] arXiv:2508.02359 (cross-list from eess.SP) [pdf, other]
-
Title: Toward a reliable PWM-based light-emitting diode visual stimulus for improved SSVEP response with minimal visual fatigue
Journal-ref: The Journal of Engineering (JoE) 2017
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Steady state visual evoked response (SSVEP) is widely used in visual-based diagnosis and applications such as brain computer interfacing due to its high information transfer rate and the capability to activate commands through simple gaze control. However, one major impediment in using a flashing visual stimulus to obtain SSVEP is eye fatigue, which prevents continued long-term use and hence practical deployment. This, combined with the difficulty of establishing precise pulse-width modulation (PWM), which results in poorer accuracy, warrants the development of an appropriate approach to solve these issues. Various studies have suggested using high-frequency visual stimuli to reduce visual fatigue for the user, but this results in poor response performance. Here, the authors study the use of extremely high duty-cycles in the stimulus in the hope of solving these constraints. Electroencephalogram data was recorded with PWM duty-cycles of 50 to 95% generated by precise custom-made light-emitting diode hardware. The ten subjects tested reported that increasing duty-cycles caused less visual strain at all frequency values, and the SSVEP exhibited a subject-independent peak response at a duty-cycle of 85%. This could pave the way for increased usage of SSVEP for practical applications.
- [40] arXiv:2508.02427 (cross-list from cs.AI) [pdf, html, other]
-
Title: CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
Tung-Thuy Pham, Duy-Quan Luong, Minh-Quan Duong, Trung-Hieu Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
- [41] arXiv:2508.02470 (cross-list from cs.HC) [pdf, html, other]
-
Title: AIAP: A No-Code Workflow Builder for Non-Experts with Natural Language and Multi-Agent Collaboration
Comments: 14 pages, 6 figures
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
While many tools are available for designing AI, non-experts still face challenges in clearly expressing their intent and managing system complexity. We introduce AIAP, a no-code platform that integrates natural language input with visual workflows. AIAP leverages a coordinated multi-agent system to decompose ambiguous user instructions into modular, actionable steps, hidden from users behind a unified interface. A user study involving 32 participants showed that AIAP's AI-generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improved participants' ability to develop services intuitively. These findings highlight that natural language-based visual programming significantly reduces barriers and enhances user experience in AI service design.
- [42] arXiv:2508.02609 (cross-list from cs.LG) [pdf, html, other]
-
Title: Entity Representation Learning Through Onsite-Offsite Graph for Pinterest Ads
Authors: Jiayin Jin, Zhimeng Pan, Yang Tang, Jiarui Feng, Kungang Li, Chongyuan Xiang, Jiacheng Li, Runze Su, Siping Ji, Han Sun, Ling Leng, Prathibha Deshikachar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Graph Neural Networks (GNNs) have been extensively applied to industry recommendation systems, as seen in models like GraphSage, TwHIM, and LiGNN. In these works, graphs were constructed based on users' activities on the platforms, and various graph models were developed to effectively learn node embeddings. In addition to users' onsite activities, their offsite conversions are crucial for Ads models to capture their shopping interest. To better leverage offsite conversion data and explore the connection between onsite and offsite activities, we constructed a large-scale heterogeneous graph based on users' onsite ad interactions and opt-in offsite conversion activities. Furthermore, we introduced TransRA (TransR with Anchors), a novel Knowledge Graph Embedding (KGE) model, to more efficiently integrate graph embeddings into Ads ranking models. However, our Ads ranking models initially struggled to directly incorporate KGE, and only modest gains were observed during offline experiments. To address this challenge, we employed the Large ID Embedding Table technique and developed an attention-based KGE finetuning approach within the Ads ranking models. As a result, we observed a significant AUC lift in Click-Through Rate (CTR) and Conversion Rate (CVR) prediction models. Moreover, this framework has been deployed in Pinterest's Ads Engagement Model and contributed to a 2.69% CTR lift and a 1.34% CPC reduction. We believe the techniques presented in this paper can be leveraged by other large-scale industrial models.
Cross submissions (showing 14 of 14 entries)
- [43] arXiv:2305.16092 (replaced) [pdf, html, other]
-
Title: AI Techniques in the Microservices Life-Cycle: A Systematic Mapping Study
Authors: Sergio Moreschini, Shahrzad Pour, Ivan Lanese, Daniel Balouek-Thomert, Justus Bogner, Xiaozhou Li, Fabiano Pecorelli, Jacopo Soldani, Eddy Truyen, Davide Taibi
Comments: Accepted for publication at Computing (Springer)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The use of AI in microservices (MSs) is an emerging field, as indicated by a substantial number of surveys. However, these surveys each focus on a specific problem using specific AI techniques and therefore do not fully capture the growth of research and the rise and disappearance of trends. In our systematic mapping study, we take an exhaustive approach to reveal all possible connections between the use of AI techniques for improving any quality attribute (QA) of MSs during the DevOps phases. Our results include 16 research themes that connect to the intersection of particular QAs, AI domains, and DevOps phases. Moreover, by mapping identified future research challenges and relevant industry domains, we show that many studies aim to deliver prototypes to be automated at a later stage, with the goal of providing exploitable products in a number of key industry domains.
- [44] arXiv:2405.04861 (replaced) [pdf, html, other]
-
Title: Refactoring Deep Learning Code: A Study of Practices and Unsatisfied Tool Needs
Comments: 12 pages, 6 figures, ICSME25 accept
Subjects: Software Engineering (cs.SE)
With the rapid development of deep learning, the implementation of intricate algorithms and substantial data processing have become standard elements of deep learning projects. As a result, the code becomes progressively more complex as the software evolves, making it difficult to maintain and understand. Existing studies have investigated the impact of refactoring on software quality within traditional software. However, insight into code refactoring in the context of deep learning is still lacking. This study endeavors to fill this knowledge gap by empirically examining the current state of code refactoring in the deep learning realm and practitioners' views on refactoring. We first manually analyzed the commit history of five popular and well-maintained deep learning projects (e.g., PyTorch). We mined 4,921 refactoring practices in historical commits, measured how different types and elements of refactoring operations are distributed, and found that the distribution of refactoring operation types in deep learning projects differs from that in traditional Java software. We then surveyed 159 practitioners about their views of code refactoring in deep learning projects and their expectations of current refactoring tools. The survey results showed that refactoring research and the development of related tools in the field of deep learning are crucial for improving project maintainability and code quality, and that current refactoring tools do not adequately meet the needs of practitioners. Lastly, we provided our perspective on the future advancement of refactoring tools and offered suggestions for developers' development practices.
- [45] arXiv:2409.15204 (replaced) [pdf, html, other]
-
Title: RAMBO: Enhancing RAG-based Repository-Level Method Body Completion
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Code completion is essential in software development, helping developers by predicting code snippets based on context. Among completion tasks, Method Body Completion (MBC) is particularly challenging as it involves generating complete method bodies based on their signatures and context. This task becomes significantly harder in large repositories, where method bodies must integrate repository-specific elements such as custom APIs, inter-module dependencies, and project-specific conventions. In this paper, we introduce RAMBO, a novel RAG-based approach for repository-level MBC. Instead of retrieving similar method bodies, RAMBO identifies essential repository-specific elements, such as classes, methods, and variables/fields, and their relevant usages. By incorporating these elements and their relevant usages into the code generation process, RAMBO ensures more accurate and contextually relevant method bodies. Our experimental results with leading code LLMs across 40 Java projects show that RAMBO significantly outperformed the state-of-the-art repository-level MBC approaches, with improvements of up to 46% in BLEU, 57% in CodeBLEU, 36% in Compilation Rate, and up to 3X in Exact Match. Notably, RAMBO surpassed the RepoCoder Oracle method by up to 12% in Exact Match, setting a new benchmark for repository-level MBC.
- [46] arXiv:2410.08676 (replaced) [pdf, html, other]
-
Title: Bridging Developer Needs and Feasible Features for AI Assistants in IDEs
Comments: 11 pages, 2 figures, 1 table; submitted to ASE Industry Showcase 2025
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Despite the increasing presence of AI assistants in Integrated Development Environments, it remains unclear what developers actually need from these tools and which features are likely to be implemented in practice. To investigate this gap, we conducted a two-phase study. First, we interviewed 35 professional developers from three user groups (Adopters, Churners, and Non-Users) to uncover unmet needs and expectations. Our analysis revealed five key areas: Technology Improvement, Interaction, and Alignment, as well as Simplifying Skill Building, and Programming Tasks. We then examined the feasibility of addressing selected needs through an internal prediction market involving 102 practitioners. The results demonstrate a strong alignment between the developers' needs and the practitioners' judgment for features focused on implementation and context awareness. However, features related to proactivity and maintenance remain both underestimated and technically unaddressed. Our findings reveal gaps in current AI support and provide practical directions for developing more effective and sustainable in-IDE AI systems.
- [47] arXiv:2412.02410 (replaced) [pdf, html, other]
-
Title: AutoPLC: Generating Vendor-Aware Structured Text for Programmable Logic Controllers
Authors: Donghao Yang, Aolang Wu, Tianyi Zhang, Li Zhang, Fang Liu, Xiaoli Lian, Yuming Ren, Jiaji Tian, Xiaoyin Che
Comments: 12 pages, 3 figures. Replaces "A Multi-Agent Framework for Extensible Structured Text Generation in PLCs" with an updated AutoPLC framework and new experiments
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Among the programming languages for Programmable Logic Controllers (PLCs), Structured Text (ST) is widely adopted for industrial automation due to its expressiveness and flexibility. However, major vendors implement ST with proprietary extensions and hardware-specific libraries - Siemens' SCL and CODESYS' ST each differ in syntax and functionality. This fragmentation forces engineers to relearn implementation details across platforms, creating substantial productivity barriers. To address this challenge, we developed AutoPLC, a framework capable of automatically generating vendor-aware ST code directly from natural language requirements. Our solution begins by building two essential knowledge sources tailored to each vendor's specifications: a structured API library containing platform-exclusive functions, and an annotated case database that captures real-world implementation experience. Building on these foundations, we created a four-stage generation process that combines step-wise planning (enhanced with a lightweight natural language state machine support for control logic), contextual case retrieval using LLM-based reranking, API recommendation guided by industrial data, and dynamic validation through direct interaction with vendor IDEs. Implemented for Siemens TIA Portal and the CODESYS platform, AutoPLC achieves 90%+ compilation success on our 914-task benchmark (covering general-purpose and process control functions), outperforming all selected baselines, at an average cost of only $0.13 per task. Experienced PLC engineers positively assessed the practical utility of the generated code, including cases that failed compilation. We open-source our framework at this http URL.
- [48] arXiv:2412.15305 (replaced) [pdf, html, other]
-
Title: Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling
Comments: This idea was first submitted to the NeurIPS Workshop "System 2 Reasoning At Scale" in September 2024. Its OpenReview: this http URL. It was then submitted to NAACL 2025 in October 2024, which is recorded in: this http URL. This paper has now been accepted for publication in ACL 2025 Findings
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Solving complex reasoning tasks is a key real-world application of agents. Thanks to the pretraining of Large Language Models (LLMs) on code data, recent approaches like CodeAct successfully use code as LLM agents' action, achieving good results. However, CodeAct greedily generates the next action's code block by relying on fragmented thoughts, resulting in inconsistency and instability. Moreover, CodeAct lacks action-related ground-truth (GT), making its supervision signals and termination conditions questionable in multi-turn interactions. To address these issues, we first introduce a simple yet effective end-to-end code generation paradigm, CodeProgram, which leverages code's systematic logic to align with global reasoning and enable cohesive problem-solving. Then, we propose Tree-of-Code (ToC), which self-grows CodeProgram nodes based on the executable nature of the code and enables self-supervision in a GT-free scenario. Experimental results on two datasets using ten popular zero-shot LLMs show ToC remarkably boosts accuracy by nearly 20% over CodeAct with less than 1/4 turns. Several LLMs even perform better on one-turn CodeProgram than on multi-turn CodeAct. To further investigate the trade-off between efficacy and efficiency, we test different ToC tree sizes and exploration mechanisms. We also highlight the potential of ToC's end-to-end data generation for supervised and reinforced fine-tuning.
- [49] arXiv:2501.12934 (replaced) [pdf, html, other]
-
Title: Correctness Assessment of Code Generated by Large Language Models Using Internal Representations
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Ensuring the correctness of code generated by Large Language Models (LLMs) presents a significant challenge in AI-driven software development. Existing methods predominantly rely on black-box (closed-box) approaches that evaluate correctness post-generation, failing to utilize the rich insights embedded in the LLMs' internal states during code generation. In this paper, we introduce OPENIA, a novel white-box (open-box) framework that leverages these internal representations to assess the correctness of LLM-generated code. OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across diverse code generation benchmarks. Our empirical analysis reveals that these internal representations encode latent information, which strongly correlates with the correctness of the generated code. Building on these insights, OPENIA uses a white-box/open-box approach to make informed predictions about code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches. Experimental results demonstrate that OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-Scores with up to a 2X improvement in standalone code generation and a 46% enhancement in repository-specific scenarios. By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance mechanisms in LLM-assisted code generation.
- [50] arXiv:2502.10374 (replaced) [pdf, html, other]
-
Title: Robustness tests for biomedical foundation models should tailor to specifications
Authors: R. Patrick Xian, Noah R. Baker, Tom David, Qiming Cui, A. Jay Holmgren, Stefan Bauer, Madhumita Sushil, Reza Abbasi-Asl
Comments: revised version, for associated repo see this http URL
Subjects: Software Engineering (cs.SE); Computers and Society (cs.CY)
The rise of biomedical foundation models creates new hurdles in model testing and authorization given their broad capabilities and susceptibility to complex distribution shifts. We suggest tailoring robustness tests according to task-dependent priorities and propose to integrate granular notions of robustness in a predefined specification to guide implementation. Our approach facilitates the standardization of robustness assessments in the model lifecycle and connects abstract AI regulatory frameworks with concrete testing procedures.
- [51] arXiv:2502.19166 (replaced) [pdf, html, other]
-
Title: CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
Comments: Accepted as an ACL 2025 Industry Track paper (15 pages)
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation. CodeIF data and code are publicly available: this http URL
- [52] arXiv:2503.21735 (replaced) [pdf, html, other]
-
Title: GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Ensuring reliable software release decisions is critical in safety-critical domains such as automotive manufacturing. Release validation relies on large tabular datasets, yet manual analysis is slow, costly, and error-prone. While Large Language Models (LLMs) offer promising automation potential, they face challenges in analytical reasoning, structured data handling, and ambiguity resolution. This paper introduces GateLens, an LLM-based system for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and generates optimized Python code. Unlike traditional multi-agent or planning-based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability. Experimental results show that GateLens outperforms the existing Chain-of-Thought (CoT) + Self-Consistency (SC) based system on real-world datasets, particularly in handling complex and ambiguous queries. Ablation studies confirm the essential role of the RA layer. Industrial deployment shows over 80% reduction in analysis time while maintaining high accuracy across test result interpretation, impact assessment, and release candidate evaluation. GateLens operates effectively in zero-shot settings without requiring few-shot examples or agent orchestration. This work advances deployable LLM system design by identifying key architectural features-intermediate formal representations, execution efficiency, and low configuration overhead-crucial for safety-critical industrial applications.
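To make the idea of an intermediate relational-algebra layer concrete, here is a minimal sketch of how a question such as "which components have failed tests?" could be expressed as a selection followed by a projection and then run as plain Python. The table, column names, and helper functions (`select`, `project`, `failed_components`) are hypothetical; this is not GateLens's actual representation or generated code.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Relational selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows: List[Row], columns: List[str]) -> List[Row]:
    """Relational projection (pi): keep the named columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append({c: row[c] for c in columns})
    return result


# Hypothetical test-result table, invented for this example.
TEST_RESULTS: List[Row] = [
    {"test_id": "T1", "component": "brakes", "status": "failed"},
    {"test_id": "T2", "component": "brakes", "status": "passed"},
    {"test_id": "T3", "component": "infotainment", "status": "failed"},
    {"test_id": "T4", "component": "brakes", "status": "failed"},
]


def failed_components(rows: List[Row]) -> List[Row]:
    """Evaluate pi_component(sigma_{status = 'failed'}(rows))."""
    return project(select(rows, lambda r: r["status"] == "failed"), ["component"])


if __name__ == "__main__":
    print(failed_components(TEST_RESULTS))
```

Writing the query as an explicit expression before generating code gives a small, inspectable artifact that can be checked independently of the final Python.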
- [53] arXiv:2504.08113 (replaced) [pdf, html, other]
-
Title: Test Amplification for REST APIs via Single and Multi-Agent LLM Systems
Subjects: Software Engineering (cs.SE)
REST APIs (Representational State Transfer Application Programming Interfaces) play a vital role in modern cloud-native applications. As these APIs grow in complexity and scale, ensuring their correctness and robustness becomes increasingly important. Automated testing is essential for identifying hidden bugs, particularly those that appear in edge cases or under unexpected inputs. However, creating comprehensive and effective test suites for REST APIs is challenging and often demands significant effort. In this paper, we investigate the use of large language model (LLM) systems, both single-agent and multi-agent setups, for amplifying existing REST API test suites. These systems generate additional test cases that aim to push the boundaries of the API, uncovering behaviors that might otherwise go untested. We present a comparative evaluation of the two approaches across several dimensions, including test coverage, bug detection effectiveness, and practical considerations such as computational cost and energy usage. Our evaluation demonstrates increased API coverage, identification of numerous bugs in the API under test, and insights into the computational cost and energy consumption of both approaches.
- [54] arXiv:2504.14026 (replaced) [pdf, html, other]
-
Title: Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem
Comments: 12 pages, 2 figures, 4 tables
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Practitioners often struggle with the overwhelming number of security practices outlined in cybersecurity frameworks for risk mitigation. Given the limited budget, time, and resources, practitioners want to prioritize the adoption of security practices based on empirical evidence. The goal of this study is to assist practitioners and policymakers in making informed decisions on which security practices to adopt by evaluating the relationship between software security practices adoption and security outcome metrics. To do this, we analyzed the adoption of security practices and their impact on security outcome metrics across 145K npm packages. We selected the OpenSSF Scorecard metrics to automatically measure the adoption of security practices in npm GitHub repositories. We also investigated project-level security outcome metrics: the number of open vulnerabilities (Vul_Count), mean time to remediate (MTTR) vulnerabilities in dependencies, and mean time to update (MTTU) dependencies. We conducted regression and causal analysis using 11 Scorecard metrics and the aggregated Scorecard score (computed by aggregating individual security practice scores) as predictors and Vul_Count, MTTR, and MTTU as target variables. Our findings reveal that aggregated adoption of security practices is associated with 5.2 fewer vulnerabilities, 216.8 days faster MTTR, and 52.3 days faster MTTU. Repository characteristics have an impact on security practice effectiveness: repositories with high security practice adoption, especially those that are mature, actively maintained, large in size, have many contributors, few dependencies, and high download volumes, tend to exhibit better outcomes compared to smaller or inactive repositories.
- [55] arXiv:2507.09108 (replaced) [pdf, html, other]
-
Title: SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
Authors: Aaditya Bhatia, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, Ahmed E. Hassan
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE's design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE's potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
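The cost figures quoted in this abstract can be put on a per-instance basis with a few lines of arithmetic. The sketch below uses only the two totals stated above ($100,000 and $5.10 per 1,000 instances); the function name `cost_summary` is an illustrative assumption.

```python
def cost_summary(manual_total: float, automated_total: float, instances: int) -> dict:
    """Per-instance costs and the manual-to-automated cost ratio."""
    if instances <= 0 or automated_total <= 0:
        raise ValueError("instances and automated_total must be positive")
    return {
        "manual_per_instance": manual_total / instances,
        "automated_per_instance": automated_total / instances,
        "reduction_factor": manual_total / automated_total,
    }


if __name__ == "__main__":
    summary = cost_summary(manual_total=100_000.0, automated_total=5.10, instances=1_000)
    for name, value in summary.items():
        print(f"{name}: {value:,.4f}")
```

On these figures the manual cost is about $100 per instance, the automated cost is roughly half a cent per instance, and the reduction is on the order of 20,000-fold.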
- [56] arXiv:2507.11146 (replaced) [pdf, html, other]
-
Title: Automata Models for Effective Bug Pattern Description
Comments: Accepted to the ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems (MODELS 2025)
Subjects: Software Engineering (cs.SE)
Debugging complex systems is a crucial yet time-consuming task. This paper presents the use of automata learning and testing techniques to obtain concise and informative bug descriptions. We introduce the concepts of Failure Explanations (FE), Eventual Failure Explanations (EFE), and Early Detection (ED) to provide meaningful summaries of failing behavior patterns. By factoring out irrelevant information and focusing on essential test patterns, our approach aims to enhance bug detection and understanding. We evaluate our methods using various test patterns and real-world benchmarks, demonstrating their effectiveness in producing compact and informative bug descriptions.
- [57] arXiv:2507.11671 (replaced) [pdf, html, other]
-
Title: Decision Models for Selecting Architecture Patterns and Strategies in Quantum Software Systems
Authors: Mst Shamima Aktar, Peng Liang, Muhammad Waseem, Amjed Tahir, Mojtaba Shahin, Muhammad Azeem Akbar, Arif Ali Khan, Aakash Ahmad, Musengamana Jean de Dieu, Ruiyin Li
Comments: 49 pages, 10 images, 16 tables, Manuscript submitted to a journal (2025)
Subjects: Software Engineering (cs.SE)
Quantum software represents disruptive technologies in terms of quantum-specific software systems, services, and applications, which leverage the principles of quantum mechanics via programmable quantum bits (Qubits) that manipulate quantum gates (QuGates) to achieve quantum supremacy in computing. Quantum software architecture enables quantum software developers to abstract away implementation-specific details (i.e., mapping of Qubits and QuGates to high-level architectural components and connectors). Architectural patterns and strategies can provide reusable knowledge and best practices to engineer quantum software systems effectively and efficiently. However, quantum software practitioners face significant challenges in selecting and implementing appropriate patterns and strategies due to the complexity of quantum software systems and the lack of guidelines. To address these challenges, this study proposes decision models for selecting patterns and strategies in six critical design areas in quantum software systems: Communication, Decomposition, Data Processing, Fault Tolerance, Integration and Optimization, and Algorithm Implementation. These decision models are constructed based on data collected from both a mining study (i.e., GitHub and Stack Exchange) and a Systematic Literature Review, which were used to identify relevant patterns and strategies with their involved Quality Attributes (QAs). We then conducted semi-structured interviews with 16 quantum software practitioners to evaluate the familiarity, understandability, completeness, and usefulness of the proposed decision models. The results show that the proposed decision models can aid practitioners in selecting suitable patterns and strategies to address the challenges related to the architecture design of quantum software systems. The dataset is available at [6], allowing the community to reproduce and build upon our findings.
- [58] arXiv:2507.15343 (replaced) [pdf, html, other]
-
Title: StackTrans: From Large Language Model to Large Pushdown Automata Model
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture the Chomsky hierarchy, such as regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchies and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
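One common way to make push and pop differentiable is to treat the next stack as a probability-weighted blend of the "pushed", "popped", and "unchanged" stacks. The sketch below shows that generic soft-stack update in plain Python; it is an assumption about the general technique, not StackTrans's actual formulation, and `soft_stack_update` is a hypothetical name.

```python
from typing import List

Vector = List[float]
Stack = List[Vector]  # fixed depth; index 0 is the top of the stack


def soft_stack_update(stack: Stack, new_item: Vector,
                      p_push: float, p_pop: float, p_noop: float) -> Stack:
    """Blend the push, pop, and no-op outcomes using action probabilities.

    Every output entry is a weighted sum of inputs, so the update is
    differentiable with respect to both the stack contents and the weights.
    """
    if abs(p_push + p_pop + p_noop - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    zero = [0.0] * len(new_item)
    pushed = [new_item] + stack[:-1]  # shift down; the bottom slot falls off
    popped = stack[1:] + [zero]       # shift up; pad the bottom with zeros
    return [
        [p_push * a + p_pop * b + p_noop * c
         for a, b, c in zip(pushed[i], popped[i], stack[i])]
        for i in range(len(stack))
    ]


if __name__ == "__main__":
    stack = [[1.0, 0.0], [0.0, 1.0]]
    # A 50/50 mix of push and pop yields a superposition of both outcomes.
    print(soft_stack_update(stack, [2.0, 2.0], p_push=0.5, p_pop=0.5, p_noop=0.0))
```

With one-hot probabilities the update reduces to an ordinary push or pop; in a learned model the probabilities would typically come from a softmax over the current hidden state.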
- [59] arXiv:2507.15822 (replaced) [pdf, html, other]
-
Title: Do AI models help produce verified bug fixes?
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Among areas of software engineering where AI techniques -- particularly, Large Language Models -- seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory corrections to software bugs. Does this expectation materialize in practice? How do we find out, making sure that proposed corrections actually work? If programmers have access to LLMs, how do they actually use them to complement their own skills?
To answer these questions, we took advantage of the availability of a program-proving environment, which formally determines the correctness of proposed fixes, to conduct a study of program debugging with two randomly assigned groups of programmers, one with access to LLMs and the other without, both validating their answers through the proof tools. The methodology relied on a division into general research questions (Goals in the Goal-Query-Metric approach), specific elements admitting specific answers (Queries), and measurements supporting these answers (Metrics). While applied so far to a limited sample size, the results are a first step towards delineating a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
These results caused surprise as compared to what one might expect from the use of AI for debugging and APR. The contributions also include: a detailed methodology for experiments in the use of LLMs for debugging, which other projects can reuse; a fine-grain analysis of programmer behavior, made possible by the use of full-session recording; a definition of patterns of use of LLMs, with 7 distinct categories; and validated advice for getting the best of LLMs for debugging and Automatic Program Repair.
- [60] arXiv:2507.22223 (replaced) [pdf, html, other]
-
Title: Secure coding for web applications: Frameworks, challenges, and the role of LLMs
Comments: 11 pages, 5 figures, 3 tables, 6 listings
Subjects: Software Engineering (cs.SE)
Secure coding is a critical yet often overlooked practice in software development. Despite extensive awareness efforts, real-world adoption remains inconsistent due to organizational, educational, and technical barriers. This paper provides a comprehensive review of secure coding practices across major frameworks and domains, including web development, DevSecOps, and cloud security. It introduces a structured framework comparison and categorizes threats aligned with the OWASP Top 10. Additionally, we explore the rising role of Large Language Models (LLMs) in evaluating and recommending secure code, presenting a reproducible case study across four major vulnerability types. This paper offers practical insights for researchers, developers, and educators on integrating secure coding into real-world development processes.
- [61] arXiv:2508.00198 (replaced) [pdf, other]
-
Title: Testing the Untestable? An Empirical Study on the Testing Process of LLM-Powered Software Systems
Subjects: Software Engineering (cs.SE)
Background: Software systems powered by large language models are becoming a routine part of everyday technologies, supporting applications across a wide range of domains. In software engineering, many studies have focused on how LLMs support tasks such as code generation, debugging, and documentation. However, there has been limited focus on how full systems that integrate LLMs are tested during development. Aims: This study explores how LLM-powered systems are tested in the context of real-world application development. Method: We conducted an exploratory case study using 99 individual reports written by students who built and deployed LLM-powered applications as part of a university course. Each report was independently analyzed using thematic analysis, supported by a structured coding process. Results: Testing strategies combined manual and automated methods to evaluate both system logic and model behavior. Common practices included exploratory testing, unit testing, and prompt iteration. Reported challenges included integration failures, unpredictable outputs, prompt sensitivity, hallucinations, and uncertainty about correctness. Conclusions: Testing LLM-powered systems required adaptations to traditional verification methods, blending source-level reasoning with behavior-aware evaluations. These findings provide evidence on the practical context of testing generative components in software systems.
- [62] arXiv:2508.00738 (replaced) [pdf, html, other]
-
Title: Tool-Assisted Conformance Checking to Reference Process Models
Subjects: Software Engineering (cs.SE); Formal Languages and Automata Theory (cs.FL)
Reference models convey best practices and standards. The reference frameworks necessitate conformance checks to ensure adherence to established guidelines and principles, which is crucial for maintaining quality and consistency in various processes. This paper explores automated conformance checks for concrete process models against reference models using causal dependency analysis of tasks and events. Existing notions of conformance checking for process models focus on verifying process execution traces and lack the expressiveness and automation needed for semantic model comparison, leaving this question unresolved. We integrate our approach into a broader semantic framework for defining reference model conformance. We outline an algorithm for reference process model conformance checking, evaluate it through a case study, and discuss its strengths and limitations. Our research provides a tool-assisted solution enhancing accuracy and flexibility in process model conformance verification.
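As a rough illustration of what checking causal dependencies between tasks can look like, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes a deliberately simplified notion of conformance in which every "a must precede b" dependency of the reference model has to be reachable between the corresponding tasks of the concrete model. The function names, the edge-list representation, and the task `mapping` are all hypothetical.

```python
def reachability(edges):
    """Return all (a, b) pairs such that b is reachable from a (transitive closure)."""
    nodes = {n for edge in edges for n in edge}
    successors = {n: set() for n in nodes}
    for a, b in edges:
        successors[a].add(b)

    closure = set()
    for start in nodes:
        stack, seen = list(successors[start]), set()
        while stack:  # iterative depth-first search from `start`
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors[node])
        closure |= {(start, node) for node in seen}
    return closure


def missing_dependencies(reference_edges, concrete_edges, mapping):
    """List reference dependencies that the concrete model does not preserve.

    `mapping` (an assumption of this sketch) sends each reference task to the
    concrete task that implements it; unmapped tasks count as violations.
    """
    required = reachability(reference_edges)
    present = reachability(concrete_edges)
    return sorted(
        (a, b)
        for a, b in required
        if (mapping.get(a), mapping.get(b)) not in present
    )


reference = [("order", "pay"), ("pay", "ship")]
mapping = {"order": "receive_order", "pay": "take_payment", "ship": "dispatch"}

conforming = [
    ("receive_order", "check_stock"),
    ("check_stock", "take_payment"),
    ("take_payment", "dispatch"),
]
violating = [("receive_order", "dispatch"), ("dispatch", "take_payment")]

print(missing_dependencies(reference, conforming, mapping))
print(missing_dependencies(reference, violating, mapping))
```

In this toy setting the first model may add extra tasks (such as `check_stock`) and still conform, while the second dispatches before payment and so loses the reference dependency from `pay` to `ship`.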
- [63] arXiv:2401.10969 (replaced) [pdf, other]
-
Title: MacroSwarm: A Field-based Compositional Framework for Swarm Programming
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Swarm behaviour engineering is an area of research that seeks to investigate methods and techniques for coordinating computation and action within groups of simple agents to achieve complex global goals like pattern formation, collective movement, clustering, and distributed sensing. Despite recent progress in the analysis and engineering of swarms (of drones, robots, vehicles), there is still a need for general design and implementation methods and tools that can be used to define complex swarm behaviour in a principled way. To contribute to this quest, this article proposes a new field-based coordination approach, called MacroSwarm, to design and program swarm behaviour in terms of reusable and fully composable functional blocks embedding collective computation and coordination. Based on the macroprogramming paradigm of aggregate computing, MacroSwarm builds on the idea of expressing each swarm behaviour block as a pure function, mapping sensing fields into actuation goal fields, e.g., including movement vectors. In order to demonstrate the expressiveness, compositionality, and practicality of MacroSwarm as a framework for swarm programming, we perform a variety of simulations covering common patterns of flocking, pattern formation, and collective decision-making. The implications of the inherent self-stabilisation properties of field-based computations in MacroSwarm are discussed, which formally guarantee some resilience properties and guided the design of the library.
- [64] arXiv:2502.02194 (replaced) [pdf, html, other]
-
Title: Understanding User Mental Models in AI-Driven Code Completion Tools: Insights from an Elicitation Study
Subjects: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Integrated Development Environments increasingly implement AI-powered code completion tools (CCTs), which promise to enhance developer efficiency, accuracy, and productivity. However, interaction challenges with CCTs persist, mainly due to mismatches between developers' mental models and the unpredictable behavior of AI-generated suggestions, an aspect underexplored in the literature. We conducted an elicitation study with 56 developers using co-design workshops to elicit their mental models when interacting with CCTs. Several important findings emerged that could inform interaction design for CCTs. For example, developers expressed diverse preferences on when and how code suggestions should be triggered (proactive, manual, hybrid), where and how they are displayed (inline, sidebar, popup, chatbot), and the level of detail they contain. Developers also need to be able to customize activation timing, display modality, suggestion granularity, and explanation content, to better fit the CCT to their preferences. To demonstrate the feasibility of these and the other guidelines that emerged during the study, we developed ATHENA, a proof-of-concept CCT that dynamically adapts to developers' coding preferences and environments, ensuring seamless integration into diverse workflows.
- [65] arXiv:2502.20528 (replaced) [pdf, html, other]
-
Title: ConfuGuard: Using Metadata to Detect Active and Stealthy Package Confusion Attacks Accurately and at Scale
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Package confusion attacks such as typosquatting threaten software supply chains. Attackers make packages with names that syntactically or semantically resemble legitimate ones, tricking engineers into installing malware. While prior work has developed defenses against package confusions in some software package registries, notably NPM, PyPI, and RubyGems, gaps remain: high false-positive rates, generalization to more software package ecosystems, and insights from real-world deployment.
In this work, we introduce ConfuGuard, a state-of-the-art detector for package confusion threats. We begin by presenting the first empirical analysis of benign signals derived from prior package confusion data, uncovering their threat patterns, engineering practices, and measurable attributes. Advancing existing detectors, we leverage package metadata to distinguish benign packages, and extend support from three to seven software package registries. Our approach significantly reduces false positive rates (from 80% to 28%), at the cost of an additional 14s average latency to filter out benign packages by analyzing the package metadata. ConfuGuard is used in production at our industry partner, whose analysts have already confirmed 630 real attacks detected by ConfuGuard.
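To make the idea of a syntactically confusable package name concrete, the following Python sketch flags names within a small edit distance of a popular package, then applies a metadata check to suppress likely benign matches. It is only an illustration under stated assumptions, not ConfuGuard's detector: the distance threshold, the `maintainer` field, and the "same maintainer means benign" rule are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def flag_confusions(candidate, popular, max_distance=2):
    """Return the popular names that `candidate` nearly matches but does not equal."""
    return [
        name for name in popular
        if name != candidate and edit_distance(candidate, name) <= max_distance
    ]


def is_probably_benign(candidate_meta, target_meta):
    """Hypothetical metadata rule: a shared, non-empty maintainer suggests a benign fork."""
    maintainer = candidate_meta.get("maintainer")
    return bool(maintainer) and maintainer == target_meta.get("maintainer")


popular = ["requests", "numpy"]
print(flag_confusions("reqeusts", popular))
print(is_probably_benign({"maintainer": "alice"}, {"maintainer": "alice"}))
```

A name-similarity check alone tends to over-flag legitimate packages, which is why a second, metadata-based stage is sketched here as a separate function.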