车讯:2020年前发布 疑似新一代Giulietta谍照
百度 从使用方式来看,汽车由过去的私人拥有和私人使用为主,向私人拥有和共享--同拥有相结合的科学合理使用转变未来出行。See recent articles
Showing new listings for Tuesday, 5 August 2025
- [1] arXiv:2508.00904 [pdf, html, other]
-
Title: Forecasting LLM Inference Performance via Hardware-Agnostic Analytical ModelingComments: 10 pages, 9 figuresSubjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Large language models (LLMs) have been increasingly deployed as local agents on personal devices with CPUs, NPUs and integrated GPUs. However, forecasting inference performance on devices with such heterogeneity remains challenging due to the dynamic compute and memory demands. Existing approaches rely on GPU benchmarking or machine learning-based latency predictors, which are often hardware-specific and lack generalizability. To this end, we introduce LIFE, a lightweight and modular analytical framework that is comprised of modular analytical model of operators, configurable to characterize LLM inference workloads in a hardware and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attentions, and operator fusion, on performance metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT) and tokens-per-second (TPS). LIFE enables performance forecasting using only hardware specifications, such as TOPS and memory bandwidth, without requiring extensive dataset benchmarking. We validate LIFE's forecasting with inference on AMD Ryzen CPUs, NPUs, iGPUs and NVIDIA V100 GPUs, with Llama2-7B variants, demonstrating the utility of LIFE in forecasting LLM performance through lens of system efficiency to enable efficient LLM deployment across different hardware platforms.
New submissions (showing 1 of 1 entries)
- [2] arXiv:1905.13011 (cross-list from cs.DB) [pdf, other]
-
Title: Don't Persist All : Efficient Persistent Data StructuresComments: 10 pages, 12 figuresSubjects: Databases (cs.DB); Hardware Architecture (cs.AR); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
Data structures used in software development have inbuilt redundancy to improve software reliability and to speed up performance. Examples include a Doubly Linked List which allows a faster deletion due to the presence of the previous pointer. With the introduction of Persistent Memory, storing the redundant data fields into persistent memory adds a significant write overhead, and reduces performance. In this work, we focus on three data structures - Doubly Linked List, B+Tree and Hashmap, and showcase alternate partly persistent implementations where we only store a limited set of data fields to persistent memory. After a crash/restart, we use the persistent data fields to recreate the data structures along with the redundant data fields. We compare our implementation with the base implementation and show that we achieve speedups around 5-20% for some data structures, and up to 165% for a flush-dominated data structure.
- [3] arXiv:2508.01506 (cross-list from cs.LG) [pdf, html, other]
-
Title: FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank ModelsZishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai LiComments: Technical ReportSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language models (LLMs) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, scaling with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments.
We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoid materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incur no accuracy loss with upstreaming compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs. - [4] arXiv:2508.01635 (cross-list from cs.LG) [pdf, html, other]
-
Title: Learning Unified System Representations for Microservice Tail Latency PredictionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Microservice architectures have become the de facto standard for building scalable cloud-native applications, yet their distributed nature introduces significant challenges in performance monitoring and resource management. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise and fail to reflect the holistic behavior of complex, concurrent workloads. In contrast, window-level P95 tail latency provides a stable and meaningful signal that captures both system-wide trends and user-perceived performance degradation. We identify two key shortcomings in existing methods: (i) inadequate handling of heterogeneous data, where traffic-side features propagate across service dependencies and resource-side signals reflect localized bottlenecks, and (ii) the lack of principled architectural designs that effectively distinguish and integrate these complementary modalities. To address these challenges, we propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features. USRFNet employs GNNs to capture service interactions and request propagation patterns, while gMLP modules independently model cluster resource dynamics. These representations are then fused into a unified system embedding to predict window-level P95 latency with high accuracy. We evaluate USRFNet on real-world microservice benchmarks under large-scale stress testing conditions, demonstrating substantial improvements in prediction accuracy over state-of-the-art baselines.
Cross submissions (showing 3 of 3 entries)
- [5] arXiv:2507.21895 (replaced) [pdf, html, other]
-
Title: Beamforming-based Achievable Rate Maximization in ISAC System for Multi-UAV NetworkingSubjects: Performance (cs.PF)
Airborne mobile Integrated Sensing and Communication (ISAC) base stations have garnered significant attention recently, with ISAC technology being a crucial application for 6G networks. Since ISAC can sense potential mobile communication users, this paper studies an effective scheme for a multi-UAV network tailored for emergency communication. In this paper, we develop a temporal-assisted frame structure utilizing integrated omnidirectional and directional beampattern to facilitate efficient and frequent searching, with extended Kalman filtering (EKF) as an aid to beam alignment. Further, we address an optimization problem to maximize the total achievable rate per slot by jointly designing UAV beamforming, load management, and UAV direction planning, all while adhering to the constraints of the predicted beam coverage. Given the problem NP-hard, we introduce three robust mechanisms for its resolution: an enhanced distributed Successive Convex Approximation (SCA)-Iterative Rank Minimization (IRM) algorithm, an coalition game approach, and a Fermat point search method. In particular, the proposed SCA-IRM algorithm decomposes the original complex optimization problem into several sub-problems and assigns them equally to each UAV, so as to realize distributed computing and improve computational efficiency. Our proposed simulations demonstrate the improved system performance in terms of communication rate, fairness, and sensing accuracy, providing design guidelines of UAV-assisted emergency communication networking.
- [6] arXiv:2506.02666 (replaced) [pdf, html, other]
-
Title: Spatially Correlated multi-RIS Communication: The Effect of Inter-Operator InterferenceComments: Submitted to IEEE Journal. arXiv admin note: text overlap with arXiv:2403.00349Subjects: Information Theory (cs.IT); Performance (cs.PF)
A multi-operator wireless communication system is studied where each operator is equipped with a reconfigurable intelligent surface (RIS) to enhance its communication quality. RISs controlled by different operators affect the system performance of one another due to the inherently rapid phase shift adjustments that occur on an independent basis. The system performance of such a communication scenario is analytically studied for the practical case where spatial correlation occurs at RIS of arbitrary size. The proposed framework is quite general since it is analyzed under Nakagami-$m$ channel fading conditions. Finally, the derived analytical results are verified via numerical and simulation trials as well as some new and useful engineering outcomes are revealed.
- [7] arXiv:2506.09226 (replaced) [pdf, html, other]
-
Title: Terabyte-Scale Analytics in the Blink of an EyeSubjects: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
For the past two decades, the DB community has devoted substantial research to take advantage of cheap clusters of machines for distributed data analytics -- we believe that we are at the beginning of a paradigm shift. The scaling laws and popularity of AI models lead to the deployment of incredibly powerful GPU clusters in commercial data centers. Compared to CPU-only solutions, these clusters deliver impressive improvements in per-node compute, memory bandwidth, and inter-node interconnect performance. In this paper, we study the problem of scaling analytical SQL queries on distributed clusters of GPUs, with the stated goal of establishing an upper bound on the likely performance gains. To do so, we build a prototype designed to maximize performance by leveraging ML/HPC best practices, such as group communication primitives for cross-device data movements. This allows us to conduct thorough performance experimentation to point our community towards a massive performance opportunity of at least 60$\times$. To make these gains more relatable, before you can blink twice, our system can run all 22 queries of TPC-H at a 1TB scale factor!
- [8] arXiv:2506.10872 (replaced) [pdf, html, other]
-
Title: The Gittins Index: A Design Principle for Decision-Making Under UncertaintySubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Performance (cs.PF); Probability (math.PR); Machine Learning (stat.ML)
The Gittins index is a tool that optimally solves a variety of decision-making problems involving uncertainty, including multi-armed bandit problems, minimizing mean latency in queues, and search problems like the Pandora's box model. However, despite the above examples and later extensions thereof, the space of problems that the Gittins index can solve perfectly optimally is limited, and its definition is rather subtle compared to those of other multi-armed bandit algorithms. As a result, the Gittins index is often regarded as being primarily a concept of theoretical importance, rather than a practical tool for solving decision-making problems.
The aim of this tutorial is to demonstrate that the Gittins index can be fruitfully applied to practical problems. We start by giving an example-driven introduction to the Gittins index, then walk through several examples of problems it solves - some optimally, some suboptimally but still with excellent performance. Two practical highlights in the latter category are applying the Gittins index to Bayesian optimization, and applying the Gittins index to minimizing tail latency in queues. - [9] arXiv:2507.14959 (replaced) [pdf, html, other]
-
Title: Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded DevicesSaeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans VandierendonckSubjects: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at this http URL.