The recent wave of research focusing on machine intelligence (machine learning and artificial intelligence) and its applications has been fuelled by both hardware improvements and deep learning frameworks that simplify the design and training of neural models. Advances in AI also accelerate research towards Reinforcement Learning (RL), where dynamic control mechanisms are designed to tackle complex tasks. Further, machine learning based optimisation, such as Bayesian Optimisation, is gaining traction in the computer systems community where optimisation needs to scale with complex and large parameter spaces; areas of interest range from hyperparameter tuning to system configuration tuning.

The EuroMLSys workshop will provide a platform for discussing emerging trends in building frameworks, programming models, optimisation algorithms, and software engineering tools to support AI/ML applications. At the same time, using ML for building such frameworks or optimisation tools will be discussed. EuroMLSys aims to bridge the gap between AI research and practice, through a technical program of fresh ideas on software infrastructure, tools, design principles, and theory/algorithms (including issues of instability, data efficiency, etc.), from a systems perspective. We will also explore potential applications that will take advantage of ML.


Key dates

  • Paper submission deadline (extended): March 12, 2023 (23:59 AoE)
  • Acceptance notification: April 10, 2023
  • Final paper due: April 16, 2023
  • Workshop: May 8, 2023 (full-day workshop)

Call for Papers

EuroMLSys is an interdisciplinary workshop that brings together researchers in computer architecture, systems and machine learning, along with practitioners who are active in these emerging areas.

Topics of interest include, but are not limited to, the following:

  • Scheduling algorithms for data processing clusters
  • Custom hardware for machine learning
  • Programming languages for machine learning
  • Benchmarking systems (for machine learning algorithms)
  • Synthetic input data generation for training
  • Systems for training and serving machine learning models at scale
  • Graph neural networks
  • Neural network compression and pruning in systems
  • Systems for incremental learning algorithms
  • Large scale distributed learning algorithms in practice
  • Database systems for large scale learning
  • Model understanding tools (debugging, visualisation, etc.)
  • Systems for model-free and model-based Reinforcement Learning
  • Optimisation in end-to-end deep learning
  • System optimisation using Bayesian Optimisation
  • Acceleration of model building (e.g., imitation learning in RL)
  • Use of probabilistic models in ML/AI applications
  • Learning models for inferring network attacks, device/service fingerprinting, congestion, etc.
  • Techniques to collect and analyze network data in a privacy-preserving manner
  • Learning models to capture network events and control actions
  • Machine learning in networking (e.g., use of Deep RL in networking)
  • Analysis of distributed ML algorithms
  • Semantics for distributed ML languages
  • Probabilistic modelling for distributed ML algorithms
  • Synchronisation and state control of distributed ML algorithms

Accepted papers will be published in the ACM Digital Library (you can opt out of this).


The ACM proceedings will be available in the ACM Digital Library on May 8, 2023.

Please join the Slack for questions and discussion. Anybody can join.

Program timezone is CEST (UTC+2:00).

09:00 Opening
09:15 Session 1: Model, Training and Optimisation - (15mins presentations)
Actionable Data Insights for Machine Learning Nils Braun (Apple) Artificial Intelligence (AI) and Machine Learning (ML) have made tremendous progress in the recent decade and have become ubiquitous in almost all application domains. Many recent advancements in the ease-of-use of ML frameworks and the low-code model training automations have further reduced the threshold for ML model building. As ML algorithms and pre-trained models become commodities, curating the appropriate training datasets and model evaluations remain critical challenges. However, these tasks are labor-intensive and require ML practitioners to have bespoke data skills. Based on the feedback from different ML projects, we built ADIML (Actionable Data Insights for ML) – a holistic data toolset. The goal is to democratize data-centric ML approaches by removing big data and distributed system barriers for engineers. We show in several case studies how the application of ADIML has helped solve specific data challenges and shorten the time to obtain actionable insights.
Dynamic Stashing Quantization for Efficient Transformer Training Guo Yang (University of Cambridge) Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of computations and memory accesses required for LLM training makes them prohibitively expensive in terms of hardware cost, and thus challenging to deploy in use cases such as on-device learning. In this paper, motivated by the observation that LLM training is memory-bound, we propose a novel dynamic quantization strategy, termed Dynamic Stashing Quantization (DSQ), that puts a special focus on reducing the memory operations, but also enjoys the other benefits of low precision training, such as the reduced arithmetic cost. We conduct a thorough study on two translation tasks (trained-from-scratch) and three classification tasks (fine-tuning). DSQ reduces the amount of arithmetic operations by 20.95X and the number of DRAM operations by 2.55X on IWSLT17 compared to the standard 16-bit fixed-point, which is widely used in on-device learning.
Towards A Platform for Model Training on Dynamic Datasets  Maximilian Böther (ETHZ) Machine learning (ML) is often applied in use cases where training data evolves and/or grows over time. Training must incorporate data changes for high model quality, however this is often challenging and expensive due to large datasets and models. In contrast, ML researchers often train and evaluate ML models on static datasets or with artificial assumptions about data dynamics. This gap between research and practice is largely due to (i) the absence of an open-source platform that manages dynamic datasets at scale and supports pluggable policies for when and what data to train on, and (ii) the lack of representative open-source benchmarks for ML training on dynamic datasets. To address this gap, we propose to design a platform that enables ML researchers and practitioners to explore training and data selection policies, while alleviating the burdens of managing large dynamic datasets and orchestrating recurring training jobs. We also propose to build an accompanying benchmark suite that integrates public dynamic datasets and ML models from a variety of representative use cases.
Profiling and Monitoring Deep Learning Training Tasks Ehsan Yousefzadeh-Asl-Miandoab (IT University of Copenhagen) The embarrassingly parallel nature of deep learning training tasks makes CPU-GPU co-processors the primary commodity hardware for them. The computing and memory requirements of these tasks, however, do not always align well with the available GPU resources. It is, therefore, important to monitor and profile the behavior of training tasks on co-processors to understand better the requirements of different use cases. In this paper, our goal is to shed more light on the variety of tools for profiling and monitoring deep learning training tasks on server-grade NVIDIA GPUs. In addition to surveying the main characteristics of the tools, we analyze the functional limitations and overheads of each tool by using both a light and a heavy training scenario. Our results show that monitoring tools like nvidia-smi and dcgm can be integrated with resource managers for online decision making thanks to their low overheads. On the other hand, one has to be careful about the set of metrics to correctly reason about the GPU utilization. When it comes to profiling, each tool has its time to shine; a framework-based or system-wide GPU profiler can first detect the frequent kernels or bottlenecks, and then, a lower-level GPU profiler can focus on particular kernels at the micro-architectural-level.
MCTS-GEB: Monte Carlo Tree Search is a Good E-graph Builder Guoliang He (University of Cambridge) Rewrite systems [11, 16, 18] have been widely employing equality saturation [15], which is an optimisation methodology that uses a saturated e-graph to represent all possible sequences of rewrite simultaneously, and then extracts the optimal one. As such, optimal results can be achieved by avoiding the phase-ordering problem. However, we observe that when the e-graph is not saturated, it cannot represent all possible rewrite opportunities and therefore the phase-ordering problem is re-introduced during the construction phase of the e-graph. To address this problem, we propose MCTS-GEB, a domain-general rewrite system that applies reinforcement learning (RL) to e-graph construction. At its core, MCTS-GEB uses a Monte Carlo Tree Search (MCTS) [4] to efficiently plan for the optimal e-graph construction, and therefore it can effectively eliminate the phase-ordering problem at the construction phase and achieve better performance within a reasonable time. Evaluation in two different domains shows MCTS-GEB can outperform the state-of-the-art rewrite systems by up to 49x, while the optimisation can generally take less than an hour, indicating MCTS-GEB is a promising building block for the future generation of rewrite systems.
10:00 Coffee Break
10:30 Session 2: Decentralised Learning, Federated Learning - (15mins presentations)
Decentralized Learning Made Easy with DecentralizePy Rishi Sharma (EPFL) Decentralized learning (DL) has gained prominence for its potential benefits in terms of scalability, privacy, and fault tolerance. It consists of many nodes that coordinate without a central server and exchange millions of parameters in the inherently iterative process of machine learning (ML) training. In addition, these nodes are connected in complex and potentially dynamic topologies. Assessing the intricate dynamics of such networks is clearly not an easy task. Often in literature, researchers resort to simulated environments that do not scale and fail to capture practical and crucial behaviors, including the ones associated to parallelism, data transfer, network delays, and wall-clock time. In this paper, we propose DecentralizePy, a distributed framework for decentralized ML, which allows for the emulation of large-scale learning networks in arbitrary topologies. We demonstrate the capabilities of DecentralizePy by deploying techniques such as sparsification and secure aggregation on top of several topologies, including dynamic networks with more than one thousand nodes.
Towards Practical Few-shot Federated NLP Dongqi Cai (Beiyou Shenzhen Institute) Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is often scarce, presenting an additional challenge. To address these challenges, we first introduce a data generator for federated few-shot learning tasks, which encompasses the quantity and skewness of scarce labeled data in a realistic setting. Subsequently, we propose AUG-FedPrompt, a prompt-based federated learning system that exploits abundant unlabeled data for data augmentation. Our experiments indicate that AUG-FedPrompt can perform on par with full-set fine-tuning with a limited amount of labeled data. However, such competitive performance comes at a significant system cost.
Towards Robust and Bias-free Federated Learning Ousmane Touat (LIRIS INSA Lyon) Federated learning (FL) is an exciting machine learning approach where multiple devices collaboratively train a model without sharing their raw data. The FL system is vulnerable to the action of Byzantine clients sending arbitrary model updates, and the trained model may exhibit prediction bias towards specific groups. However, FL mechanisms tackling robustness and bias mitigation have contradicting objectives, motivating the question of building a FL system that comprehensively combines both objectives. In this paper, we first survey state-of-the-art approaches to robustness to Byzantine behavior and bias mitigation and analyze their respective objectives. Then, we conduct an empirical evaluation to illustrate the interplay between state-of-the-art FL robustness mechanisms and FL bias mitigation mechanisms. Specifically, we show that classical robust FL methods may inadvertently filter out benign FL clients that have statistically rare data, particularly for minority groups. Finally, we derive research directions for building more robust and bias-free FL systems.
Gradient-less Federated Gradient Boosting Tree with Learnable Learning Rate Chenyang Ma (University of Cambridge) The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the needs to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induce per-node level communication frequency and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost which does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both communication rounds and communication overhead by factors ranging from 25x to 700x.
Distributed Training for Speech Recognition using Local Knowledge Aggregation and Knowledge Distillation in Heterogeneous Systems Valentin Radu (U. Sheffield) Data privacy and protection are crucial issues for any automatic speech recognition (ASR) system when relying on client generated data for training. The best protection is achieved when training is distributed closer to the client local data, rather than centralising the training. However, distributed training suffers from system heterogeneity, due to clients having unequal computation resources, and data heterogeneity, due to training data being non-independent and identically distributed (non-IID). To tackle these challenges, we introduce FedKAD, a Federated Learning (FL) framework that uses local Knowledge Aggregation over top level feature maps and Knowledge Distillation. We show that our FedKAD achieves better communication efficiency than standard FL methods that use uniform models, due to transferring parameters of smaller size client models, and overall better accuracy than FedMD, an alternative KD-based approach designed for heterogeneous data. Our work enables faster, cheaper and more inclusive participation of clients in heterogeneous distributed training.
12:15 Poster Elevator Pitch
Best of both, Structured and Unstructured Sparsity in Neural Networks Sven Wagner (Bosch Sicherheitssysteme GmbH)
TSMix: time series data augmentation by mixing sources Artjom Joosen (Huawei)
Toward Pattern-based Model Selection for Cloud Resource Forecasting Georgia Christofidi & Konstantinos Papaioannou (IMDEA Software Institute)
Can Fair Federated Learning Reduce the need for Personalisation? Alex Iacob (University of Cambridge)
A First Look at the Impact of Distillation Hyper-Parameters in Federated Knowledge Distillation Norah Alballa (KAUST)
Causal fault localisation in dataflow systems Andrei Paleyes (University of Cambridge)
Accelerating Model Training: Performance Antipatterns Eliminator Framework Ravi Singh (TCS Research)
TinyMLOps for real-time ultra-low power MCUs applied to frame-based event classification Minh Tri Lê (Inria Grenoble Rhône-Alpes)
Scalable High-Performance Architecture for Evolving Recommender System  Ravi Singh (TCS Research)
13:00 Lunch Break / Poster Session
14:30 Session 3: Service Functions, TinyML, CDN - (15mins presentations)
FoldFormer: sequence folding and seasonal attention for fine-grained long-term FaaS forecasting Luke Darlow (Huawei) Fine-grained long-term (FGLT) time series forecasting is a fundamental challenge in Function as a Service (FaaS) platforms. The data that FaaS function requests produce are fine-grained (per-second/minute), often have daily periodicity, and are persistent over the long term. Forecasting in the FGLT data regime is challenging, and Transformer models can scale poorly for long sequences. We propose FoldFormer that combines several novel elements – time-to-latent folding, seasonal attention, and convolutions over FFT representations – as a new solution for FGLT forecasting of FaaS function requests. FoldFormer is designed to efficiently consume very fine-grained multi-day data with nearly no additional model, memory, or compute overhead, when compared to consuming coarse-grained data. We show either state-of-the-art or competitive performance for per-minute function requests on the top 5 most requested functions for three data sources, including two in-house Huawei Cloud sources and Azure 2019. We also show state-of-the-art performance at per-second granularity — a regime that critically limits most other methods.
Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems Alireza Sanaee (Queen Mary University of London) The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of requests, imposing changes in their computing resources. Failing to right-size computing resources results in either latency service level objectives (SLOs) violations or wasted computing resources. Adapting to dynamic workloads considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose an adaptation mechanism, InfAdapter, that proactively selects a set of ML model variants with their resource allocations to meet latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter decreases SLO violation and costs up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).
Robust and Tiny Binary Neural Networks using Gradient-based Explainability Methods Muhammad Sabih (Friedrich-Alexander) Binary neural networks (BNNs) are a highly resource-efficient variant of neural networks. The efficiency of BNNs for tiny machine learning (TinyML) systems can be enhanced by structured pruning and making BNNs robust to faults. This fault tolerance can be traded off for energy consumption, latency, or cost when used with approximate memory systems. For pruning, magnitude-based heuristics are not useful because the weights in a BNN can either be -1 or +1. Global pruning of BNNs has not been studied well so far. Thus, in this paper, we explore gradient-based ranking criteria for pruning BNNs and use them in combination with a sensitivity analysis. For robustness, the state-of-the-art is to train the BNNs with bit-flips in what is known as fault-aware training. We propose a method to guide fault-aware training using gradient-based explainability methods. This allows us to obtain robust and efficient BNNs for deployment on tiny devices. Experiments on audio and image processing applications show that our proposed approach outperforms the existing approaches, making it useful for obtaining efficient and robust models for a slight degradation in accuracy. This makes our approach valuable for many TinyML use cases.
Illuminating the hidden challenges of data-driven CDNs Theophilus A. Benson (CMU) While data-driven CDNs have the potential to provide unparalleled performance and availability improvements, they open up an intricate and exciting tapestry of previously unaddressed problems. This paper highlights these problems, explores existing solutions, and identifies open research questions for each direction. We also present a strawman approach, Guard-Rails, that embodies preliminary techniques that can be used to help safeguard data-driven CDNs against the identified perils.
15:30 Poster Session
16:00 Coffee Break
16:30 Keynote: Next-Generation Domain-Specific Accelerators: From Hardware to System Sophia Shao (UC Berkeley) Decades of exponential growth in computing have transformed the way our society operates. As the benefits of traditional technology scaling fade, the computing industry has started developing vertically integrated systems with specialized accelerators to deliver improved performance and energy efficiency. In fact, domain-specific accelerators have become a key component in today’s systems-on-chip (SoCs) and systems-on-package (SoPs), driving active research and product development to build novel accelerators for emerging applications such as machine learning, robotics, cryptography, and many more, entering a golden age for computer architecture. The natural evolution of this trend will lead to an increasing volume and diversity of accelerators on future computing platforms. In this talk, I will discuss challenges and opportunities for the next-generation of domain-specific accelerators, with a special focus on system-level implications of designing, integrating, and scheduling of future heterogeneous platforms.
18:00 Wrapup and Closing


Papers must be submitted electronically as PDF files, formatted for 8.5x11-inch paper. Papers must be no longer than 6 pages in the ACM double-column format (10-pt font). References do not count toward the 6-page limit. Submitted papers must use the official ACM Master article template.

Submissions will be single-blind.

Submit your paper at:


  • Sophia Shao

    16:30 Sophia Shao, University of California, Berkeley

    Next-Generation Domain-Specific Accelerators: From Hardware to System


    Bio: Professor Sophia Shao is an Assistant Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley. Previously, she was a Senior Research Scientist at NVIDIA and received her Ph.D. degree in 2016 from Harvard University. Her research interests are in the area of computer architecture, with a special focus on domain-specific architecture, deep-learning accelerators, and high-productivity hardware design methodology. Her work has been awarded the Best Paper Award at DAC’2021, the Best Paper Award at JSSC’2020, a Best Paper Award at MICRO’2019, a Research Highlight of Communications of ACM (2021), Top Picks in Computer Architecture (2014), and Honorable Mentions (2019*2). Her Ph.D. dissertation was nominated by Harvard for the ACM Doctoral Dissertation Award. She is a recipient of an NSF CAREER Award, the 2022 IEEE TCCA Young Computer Architect Award, an Intel Rising Star Faculty Award, a Google Faculty Rising Star Award in System Research, a Facebook Research Award, and the inaugural Dr. Sudhakar Yalamanchili Award. Her personal webpage is



Workshop and TPC Chairs

Technical Program Committee

  • Aaron Zhao, Imperial College London
  • Ahmed M. Abdelmoniem, Queen Mary University of London
  • Alexandros Koliousis, Northeastern University London and Institute for Experiential AI
  • Amir Payberah, KTH
  • Amitabha Roy,
  • Chi Zhang, Brandeis University
  • Daniel Goodman, Oracle
  • Daniel Mendoza, Stanford University
  • Davide Sanvito, NEC Laboratories Europe
  • Dawei Li, Amazon
  • Deepak George Thomas, Iowa State University
  • Dimitris Chatzopoulos, University College Dublin
  • Fiodar Kazhamiaka, Stanford University
  • Guilherme H. Apostolo, Vrije Universiteit Amsterdam
  • Guoliang He, University of Cambridge
  • Hamed Haddadi, Imperial College London
  • Jenny Huang, NVIDIA
  • Jon Crowcroft, University of Cambridge
  • Jose Cano, University of Glasgow
  • Junru Shao, OctoML
  • Keshav Santhanam, Stanford University
  • Liang Zhang, TigerGraph
  • Lianmin Zheng, UC Berkeley
  • Mengying Zhou, Fudan University
  • Nasrullah Sheikh, IBM Research Almaden
  • Nikolas Ioannou, Google
  • Paul Patras, University of Edinburgh
  • Peter Pietzuch, Imperial College London
  • Peter Triantafillou, University of Warwick
  • Pouya Hamadanian, MIT
  • Pratik Fegade, Google
  • Qian Li, Stanford University
  • Sam Ainsworth, University of Edinburgh
  • Sami Alabed, University of Cambridge
  • Shay Vargaftik, VMware Research
  • Stefano Cereda, Politecnico di Milano
  • Taiyi Wang, University of Cambridge
  • Thaleia Dimitra Doudali, IMDEA
  • Valentin Radu, University of Sheffield
  • Veljko Pejovic, University of Ljubljana
  • Xupeng Miao, Peking University
  • Yaniv Ben-Itzhak, VMware Research
  • Zheng Wang, University of Leeds
  • Zhihao Jia, CMU

Web Chair

  • Alexis Duque, Net AI


For any question(s) related to EuroMLSys 2023, please contact the TPC Chairs Eiko Yoneki and Luigi Nardi.

Follow us on Twitter: @euromlsys