Springer Nature
Faster Segmented Sort on GPUs.zip (1.85 MB)

Artifact for Euro-Par 2023 Paper: "Faster Segmented Sort on GPUs"

posted on 2024-05-15, 09:54 authored by Robin Kobus, Johannes Nelgen, Valentin Henkys, Bertil Schmidt

This software contains the code needed to reproduce the results of the Euro-Par 2023 conference paper "Faster Segmented Sort on GPUs".

It includes implementations of our optimized segmented sort algorithm for four different GPUs, namely the NVIDIA RTX 4090, V100, A100 and GTX 1080, as described in the paper. For each of these implementations, a benchmark program is also provided to compare against ModernGPU, CUB and the original algorithm by Hou et al.

A key-only version for 64-bit and 32-bit keys on the GTX 1080 is also provided.

The artifact consists of a single zip archive containing the code and an overview document that explains how to use the software.
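For readers unfamiliar with the operation, the following is a minimal CPU sketch of what a segmented sort computes: each segment of the input is sorted independently, either as key-value pairs or as keys only. It is not taken from the artifact and does not reflect its GPU implementation or interface. The function names, the use of `int` keys and values, and the offsets convention (segment `i` covers indices `offsets[i]` up to but excluding `offsets[i + 1]`) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative CPU reference only; names and the offsets layout are assumptions.
// Segment s covers the index range [offsets[s], offsets[s + 1]).

// Sorts key-value pairs by key within each segment.
void segmented_sort_pairs_reference(std::vector<int>& keys,
                                    std::vector<int>& values,
                                    const std::vector<std::size_t>& offsets) {
    assert(keys.size() == values.size());
    assert(!offsets.empty() && offsets.front() == 0 &&
           offsets.back() == keys.size());

    std::vector<std::size_t> perm;
    std::vector<int> sorted_keys;
    std::vector<int> sorted_values;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        const std::size_t begin = offsets[s];
        const std::size_t end = offsets[s + 1];
        assert(begin <= end);

        // Build the segment's index list and sort it by key.
        // stable_sort keeps the input order of pairs with equal keys.
        perm.resize(end - begin);
        std::iota(perm.begin(), perm.end(), begin);
        std::stable_sort(perm.begin(), perm.end(),
                         [&](std::size_t a, std::size_t b) {
                             return keys[a] < keys[b];
                         });

        // Gather keys and values in sorted order, then write them back.
        sorted_keys.clear();
        sorted_values.clear();
        for (std::size_t i : perm) {
            sorted_keys.push_back(keys[i]);
            sorted_values.push_back(values[i]);
        }
        const auto pos = static_cast<std::ptrdiff_t>(begin);
        std::copy(sorted_keys.begin(), sorted_keys.end(), keys.begin() + pos);
        std::copy(sorted_values.begin(), sorted_values.end(),
                  values.begin() + pos);
    }
}

// Key-only variant: with no values to move, each segment is sorted in place.
void segmented_sort_keys_reference(std::vector<int>& keys,
                                   const std::vector<std::size_t>& offsets) {
    assert(!offsets.empty() && offsets.front() == 0 &&
           offsets.back() == keys.size());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1]);
        std::sort(keys.begin() + static_cast<std::ptrdiff_t>(offsets[s]),
                  keys.begin() + static_cast<std::ptrdiff_t>(offsets[s + 1]));
    }
}
```

A reference of this kind can be useful for checking the output of a GPU implementation on small inputs, including empty segments.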

Paper Abstract

Efficient parallel implementations of various sorting algorithms on modern hardware platforms are essential to numerous application areas. In this paper, we first measure the performance of the leading segmented sort implementation on CUDA-enabled GPUs and determine optimal setups using the resulting runtimes. Subsequently, we propose a number of changes that improve efficiency for segments of specific lengths. Furthermore, an alternative key-only version is introduced, that is specifically optimized to just sort keys instead of key-value pairs, which allows for further optimization. Performance is evaluated by comparing runtimes of the original algorithm with our improved version for segments of different lengths resulting in average speedups between 1.26x and 1.35x on four GPUs of different generations (Pascal, Volta, Ampere, Ada Lovelace). Furthermore, comparison to alternative segmented sort implementations from CUB and ModernGPU results in average speedups of at least 2.2x and 2.5x, respectively, across all tested architectures. To illustrate how our improved sorting algorithm can be beneficial in a practical application, we have integrated it into the MetaCache-GPU pipeline for metagenomic DNA classification resulting in speedups of up to 25.6% for the sorting step.



    European Conference on Parallel Processing


