Digitala Vetenskapliga Arkivet

Towards Adaptive Resource Management for HPC Workloads in Cloud Environments
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-1434-3042
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Maximizing resource efficiency is crucial when designing cloud-based systems, which are primarily built to meet specific quality-of-service requirements. Common optimization techniques include containerization, workflow orchestration, elasticity, and vertical scaling, all aimed at improving resource utilization and reducing costs. In contrast, on-premises high-performance computing systems prioritize maximum performance, typically relying on static resource allocation. While this approach offers certain advantages over cloud systems, it can be restrictive in handling the increasingly dynamic resource demands of tightly coupled HPC workloads, making adaptive resource management challenging.

This thesis explores the execution of high-performance workloads in cloud-based environments, investigating both horizontal and vertical scaling strategies as well as the feasibility of running HPC workflows in the cloud. Additionally, we evaluate the costs of deploying these workloads in containerized environments and examine the advantages of using object storage in cloud-based HPC systems.
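The two scaling strategies the abstract contrasts differ in where capacity is added: horizontal scaling adds more nodes, while vertical scaling grows each existing node's allocation in place. A toy sketch of the distinction (all function names and numbers are hypothetical illustrations, not taken from the thesis):

```python
# Toy illustration of horizontal vs. vertical scaling of a job's
# resource allocation. All values are hypothetical.

def scale_horizontally(nodes, cores_per_node, extra_nodes):
    """Scale out: add more nodes of the same size."""
    return (nodes + extra_nodes) * cores_per_node

def scale_vertically(nodes, cores_per_node, extra_cores_per_node):
    """Scale up: grow each existing node's allocation in place."""
    return nodes * (cores_per_node + extra_cores_per_node)

# A 4-node job with 8 cores per node can reach 48 cores either way:
print(scale_horizontally(4, 8, 2))  # 48 cores across 6 nodes
print(scale_vertically(4, 8, 4))    # 48 cores across the same 4 nodes
```

For tightly coupled HPC workloads the two routes are not interchangeable: scaling out changes the communication pattern (and may require repartitioning data), whereas scaling up keeps the process layout but is bounded by node capacity.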

Abstract [sv]

Att maximera resurseffektiviteten är avgörande vid utformningen av molnbaserade system, som främst byggs för att uppfylla specifika krav på tjänstekvalitet. Vanliga optimeringstekniker inkluderar containerisering, arbetsflödesorkestrering, elasticitet och vertikal skalning, med målet att förbättra resursutnyttjandet och minska kostnaderna. I kontrast fokuserar lokala högprestandaberäkningssystem (HPC) på maximal prestanda och förlitar sig oftast på statisk resursallokering. Även om denna strategi har vissa fördelar jämfört med molnlösningar, kan den vara begränsande när det gäller att hantera de allt mer dynamiska resursbehoven hos tätt sammankopplade HPC-arbetslaster, vilket gör adaptiv resursförvaltning utmanande. Denna avhandling undersöker körningen av högprestandaarbetslaster i molnbaserade miljöer, med fokus på både horisontell och vertikal skalning samt möjligheten att köra HPC-arbetsflöden i molnet. Dessutom kommer vi att analysera kostnaderna för att distribuera dessa arbetslaster i containeriserade miljöer och utvärdera fördelarna med att använda objektlagring i molnbaserade HPC-system.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2025, p. 91
Series
TRITA-EECS-AVL ; 2025:51
Keywords [en]
high-performance computing, resource adaptability, cloud computing, containers, horizontal scaling, vertical scaling, object storage
Keywords [sv]
Högprestandaberäkning, resursanpassningsförmåga, molnberäkning, containerisering, horisontell skalning, vertikal skalning, objektlagring
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-363164
ISBN: 978-91-8106-279-3 (print)
OAI: oai:DiVA.org:kth-363164
DiVA, id: diva2:1956653
Public defence
2025-06-02, E2, Lindstedtsvägen 3, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20250506

Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-05-06. Bibliographically approved
List of papers
1. Kub: Enabling Elastic HPC Workloads on Containerized Environments
2023 (English). In: Proceedings of the 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications, GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
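The trade-off the abstract describes, where a well-chosen scaling point must amortize the checkpointing and data-reshuffling overhead, can be sketched with a hypothetical back-of-the-envelope model. The function below is an illustration of the reasoning, not Kub's actual decision mechanism; all names and costs are invented:

```python
# Hypothetical model: expanding a running job only pays off if the
# speedup on the remaining work outweighs the one-time resize cost
# (checkpoint, restart, and data reshuffling).

def worth_scaling(remaining_steps, step_time_old, step_time_new, resize_overhead):
    """True if finishing on the enlarged allocation is faster overall."""
    time_if_unchanged = remaining_steps * step_time_old
    time_if_scaled = resize_overhead + remaining_steps * step_time_new
    return time_if_scaled < time_if_unchanged

# Early in the run, with much work left, the overhead amortizes:
print(worth_scaling(1000, 1.0, 0.6, 120))  # True
# Near the end of the run, the same overhead no longer pays off:
print(worth_scaling(100, 1.0, 0.6, 120))   # False
```

This is why the paper notes that optimal scaling points are application-specific: both the per-step speedup and the resize overhead depend on the workload.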

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
HPC, Cloud, scaling, Kubernetes, Elasticity, Malleability
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-339917 (URN)
10.1109/SBAC-PAD59825.2023.00031 (DOI)
2-s2.0-85178503556 (Scopus ID)
Conference
35th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2023, Porto Alegre, Brazil, October 17-20, 2023
Funder
European Commission
Note

Part of ISBN 979-8-3503-0548-7

QC 20231122

Available from: 2023-11-21. Created: 2023-11-21. Last updated: 2025-05-06. Bibliographically approved
2. Understanding Layered Portability from HPC to Cloud in Containerized Environments
2024 (English). In: Proceedings of the International Supercomputing Conference, Workshops, 2024. Conference paper, Published paper (Refereed)
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363169 (URN)
Conference
International Workshop on Converged Computing on Edge, Cloud, and HPC (WOCC ’24)
Note

QC 20250507

Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-05-07. Bibliographically approved
3. A GPU-Accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow
2023 (English). In: High Performance Computing: ISC High Performance 2023 International Workshops, Revised Selected Papers, Springer Nature, 2023, p. 193-206. Conference paper, Published paper (Refereed)
Abstract [en]

Complex workflows play a critical role in accelerating scientific discovery. In many scientific domains, efficient workflow management can lead to faster scientific output and broader user groups. Workflows that can leverage resources across the boundary between cloud and HPC are a strong driver for the convergence of HPC and cloud. This study investigates the transition and deployment of a GPU-accelerated molecular docking workflow that was designed for HPC systems onto a cloud-native environment with Kubernetes and Apache Airflow. The case study focuses on state-of-the-art molecular docking software for drug discovery. We provide a DAG-based implementation in Apache Airflow and technical details for GPU-accelerated deployment. We evaluated the workflow using the SWEETLEAD bioinformatics dataset and executed it in a cloud environment with heterogeneous computing resources. Our workflow can effectively overlap different stages when mapped onto different computing resources.
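The DAG-based structure the abstract describes can be illustrated with a minimal, self-contained sketch. The stage names below are hypothetical, and the actual workflow is expressed with Apache Airflow operators rather than this toy dependency graph; the point is only that independent stages (here, ligand and receptor preparation) can run concurrently on different resources, while the GPU docking stage waits for both:

```python
# A toy dependency graph for a docking-style pipeline, resolved with
# the standard library's topological sorter. Stage names are invented.
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on.
docking_dag = {
    "prepare_ligands": [],
    "prepare_receptor": [],
    "dock": ["prepare_ligands", "prepare_receptor"],  # GPU stage
    "score_poses": ["dock"],
    "collect_results": ["score_poses"],
}

# A valid execution order: both prepare_* stages come before dock.
order = list(TopologicalSorter(docking_dag).static_order())
print(order)
```

In Airflow the same shape would be written as operator dependencies; a scheduler can then overlap the ready stages on whatever heterogeneous resources are available.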

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Apache Airflow, Converged Computing, Drug Discovery, HPC and Cloud, HPC workflow, Kubernetes
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-337889 (URN)
10.1007/978-3-031-40843-4_15 (DOI)
2-s2.0-85171329143 (Scopus ID)
Conference
38th International Conference on High Performance Computing, ISC High Performance 2023, Hamburg, Germany, May 21 2023 - May 25 2023
Note

Part of ISBN 9783031408427

QC 20231010

Available from: 2023-10-10. Created: 2023-10-10. Last updated: 2025-05-06. Bibliographically approved
4. ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
2025 (English). In: 31st International European Conference on Parallel and Distributed Computing, 2025. Conference paper, Published paper (Refereed)
Keywords
Vertical scaling, HPC workloads, Cloud Computing, Resource Adaptivity, Memory Resource Provisioning
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363170 (URN)
Conference
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025
Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-05-09
5. LibCOS: Enabling Converged HPC and Cloud Data Stores with MPI
2023 (English). In: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2023, Association for Computing Machinery (ACM), 2023, p. 106-116. Conference paper, Published paper (Refereed)
Abstract [en]

Federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet. One popular method is REST APIs atop the HTTP protocol, with Amazon's S3 APIs being supported by most vendors. In contrast, HPC systems are contained within their networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted performance characterization of core S3 operations that enable individual and collective MPI I/O. Our evaluation in HACC, IOR, and BigSort shows that enabling diverse data stores on HPC and cloud storage is feasible and can be transparently achieved through the widely adopted MPI I/O. Also, we show that a native object storage system like Ceph could improve the scalability of I/O operations in parallel applications.
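One building block a library in this space needs is a mapping from byte-offset reads, as issued through MPI I/O, onto S3 ranged GET requests. The sketch below is a hypothetical illustration of that mapping, not LibCOS code; it only relies on the documented S3 Range header format, whose byte ranges are inclusive on both ends:

```python
# Hypothetical mapping from MPI-IO-style reads to S3 ranged GETs.
# S3 Range headers use the HTTP form "bytes=first-last", inclusive.

def s3_range_header(offset, count):
    """Translate a read of `count` bytes at `offset` into a Range value."""
    if count <= 0:
        raise ValueError("count must be positive")
    return f"bytes={offset}-{offset + count - 1}"

def partition_collective_read(total_bytes, nranks):
    """Split a collective read into one contiguous range per MPI rank."""
    base, rem = divmod(total_bytes, nranks)
    ranges, offset = [], 0
    for rank in range(nranks):
        count = base + (1 if rank < rem else 0)  # spread the remainder
        ranges.append(s3_range_header(offset, count))
        offset += count
    return ranges

print(s3_range_header(0, 1024))          # bytes=0-1023
print(partition_collective_read(10, 4))  # one range per rank
```

A real implementation would issue these ranges as parallel GET requests (one per rank or per stripe), which is also where an object store like Ceph can help I/O scalability: independent ranged GETs avoid the lock contention of a shared POSIX file.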

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
Ceph, MPI, object storage, parallel computing, S3
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-333389 (URN)
10.1145/3578178.3578236 (DOI)
2-s2.0-85149438253 (Scopus ID)
Conference
6th International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2023, Singapore, Singapore, Feb 27 2023 - Mar 2 2023
Note

Part of ISBN 9781450398060

QC 20230801

Available from: 2023-08-01. Created: 2023-08-01. Last updated: 2025-05-06. Bibliographically approved

Open Access in DiVA

fulltext (12252 kB), 65 downloads
File information
File name: FULLTEXT01.pdf. File size: 12252 kB. Checksum: SHA-512
452b7dabdb1d3dad9510a092f44e0e9d7dfc8674c5d01ae9e3d5bd32f8787ed2cc796c5bd0919f73f8013335c4cd351d6affd23c348445a1baf9197d7efbb38a
Type: fulltext. Mimetype: application/pdf

Search in DiVA

By author/editor
Araújo De Medeiros, Daniel
By organisation
Computational Science and Technology (CST)
Electrical Engineering, Electronic Engineering, Information Engineering

Total: 67 downloads
The number of downloads is the sum of all downloads of full texts. It may include, for example, previous versions that are no longer available.
