Getting 70x Data Processing Speed with GPU vs CPU: A Performance Deep Dive

I began running my own tests after seeing some incredible performance figures from NVIDIA’s open-source data processing efforts using tools like RAPIDS and cuDF, which you can explore further here: NVIDIA RAPIDS Open Source for Data Processing. It occurred to me that while my existing data pipelines complete within the expected time frame, maybe I wasn’t getting all the performance I could out of the CPU. I needed a true baseline to understand the limits of my existing hardware before pursuing specialized solutions. I should point out that the existing benchmarks in the RAPIDS documentation are great, and I recommend checking them out. But the closer your test code is to what you actually run in production, the more likely your gains are to match the expected results.

The complete code and benchmarks from this analysis are available in my GitHub repository: gpu_rapids_demo. I hope it helps people get started with their own GPU acceleration experiments, though there are many other excellent resources and tutorials available on this topic as well.

The Optimized CPU Baseline: Leveraging Multi-Threading

To establish a competitive baseline, I benchmarked a core Extract, Transform, Load (ETL) workflow—complex string searches and multi-key aggregations—on a multi-million row CMS Open Payments Dataset. The test environment utilized a powerful modern processor: the 12th Gen Intel Core i9-12900K.

My initial test of the core ETL workflow resulted in a 2.79-second execution time. This impressive speed was achieved because the Python libraries, specifically Pandas’ C/C++ backends, automatically utilized multi-threading for I/O and vectorized operations. Based on rough estimates, running the initial CSV read without any multi-threading optimization would have taken over 20 seconds. By simply leveraging the optimized backends of Pandas, we gained a more than 5x speedup on the CPU before ever touching the GPU. This established a robust, highly optimized multi-threaded baseline against which the GPU would have to compete.
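
For reference, here is a minimal sketch of what that CPU baseline looks like as a pandas-only pipeline. The file path and column names are placeholders for illustration, not the exact ones from the repo.

# A minimal sketch of the CPU baseline, not the exact benchmark code.
# The file path and column names are placeholders for the CMS Open Payments data.
import time

import pandas as pd

start = time.perf_counter()

# Read the multi-million row CSV with pandas' optimized C backend.
df = pd.read_csv("open_payments.csv")

# Complex string search: payments whose description mentions "consulting".
mask = df["Nature_of_Payment"].str.contains("consulting", case=False, na=False)

# Multi-key aggregation over the filtered rows.
summary = (
    df[mask]
    .groupby(["Recipient_State", "Nature_of_Payment"])["Payment_Amount"]
    .agg(["count", "sum", "mean"])
)

print(f"CPU ETL time: {time.perf_counter() - start:.2f} s")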

The GPU Challenge: Understanding Architectural Prerequisites

Initially, the GPU appeared slower than this optimized CPU time, which was an early disappointment: I was not seeing the expected gains. Taking a step back, I realized the problem was my code, not the hardware. The initial code was built on a CPU-centric architecture; the GPU was not failing, it was being held to the wrong standard.

The seemingly poor GPU performance was caused by two major bottlenecks that were inadvertently included in the timer:

  1. Memory Copy Latency: The slow process of copying the large dataset from system RAM to the GPU’s Video RAM (VRAM)
  2. JIT Compilation Tax: The time spent on Just-In-Time (JIT) compilation of the GPU kernels on the first run

By correcting these code prerequisites—namely, pre-copying the data to VRAM outside the timer and adding a “warm-up” run to handle the JIT compilation tax—the GPU’s true parallel execution speed was revealed.
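
Here is roughly what that corrected measurement looks like with cuDF. The column names and the small helper function are placeholders for illustration, not the actual benchmark code.

# A sketch of the corrected GPU measurement with cuDF; column names and the
# helper function are placeholders, not the repo's actual code.
import time

import cudf

# 1. Pay the memory-copy cost outside the timer: load the data into VRAM first.
gdf = cudf.read_csv("open_payments.csv")

def etl(frame):
    # Same string search + multi-key aggregation, now on GPU dataframes.
    mask = frame["Nature_of_Payment"].str.contains("consulting")
    return (
        frame[mask]
        .groupby(["Recipient_State", "Nature_of_Payment"])["Payment_Amount"]
        .agg(["count", "sum", "mean"])
    )

# 2. Warm-up run so the one-time JIT compilation tax is not timed.
_ = etl(gdf)

# 3. Time only the steady-state parallel execution.
start = time.perf_counter()
result = etl(gdf)
print(f"GPU ETL time: {time.perf_counter() - start:.3f} s")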

Performance Results

Hardware                        | Execution Time | Speedup
Intel i9-12900K (Optimized CPU) | 2.79 seconds   | Baseline
GPU (cuDF)                      | 0.039 seconds  | 71.70x faster

The 71.70x improvement confirms a key finding: for heavy, highly vectorizable processing jobs where time is a factor, a GPU provides a dramatic, justified advantage that no amount of CPU optimization can achieve.

Making the Right Architectural Choice: CPU vs. GPU

This experience underscores a fundamental rule: choosing between CPU and GPU is always a right-tool-for-the-job question. Performance improvement begins with optimizing the hardware you already have, and the answer often isn’t obvious until you test.

The CPU, with its fewer but faster cores, excels at sequential operations, branching logic, and file I/O. Before seeking specialized hardware, you should always ensure you’re taking advantage of multi-threading for appropriate tasks, as demonstrated by the 5x speedup gained with Pandas alone. Techniques like lazy evaluation and data chunking remain invaluable strategies for initial data exploration, memory conservation, and building robust job pipelines. Sometimes chunking your data and processing it in smaller batches can give you the performance you need without any hardware changes.
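
As a quick illustration, here is a minimal chunking sketch with pandas; the path and column names are again placeholders.

# A minimal chunking sketch: process the CSV in fixed-size batches so memory
# stays flat on CPU-only hardware. Path and column names are placeholders.
import pandas as pd

totals = {}
for chunk in pd.read_csv("open_payments.csv", chunksize=500_000):
    # Aggregate each batch, then fold the partial result into running totals.
    partial = chunk.groupby("Recipient_State")["Payment_Amount"].sum()
    for state, amount in partial.items():
        totals[state] = totals.get(state, 0.0) + amount

print(sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5])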

The GPU, with its thousands of smaller, specialized cores, is the architecture for massive parallelism. For workloads that are highly vectorizable—like complex aggregation, string matching, or matrix multiplication—the CPU’s 16 cores are simply the wrong architecture. The GPU’s massive parallelism is built for production-level throughput. The 71x improvement achieved here far surpasses any incremental gain possible through further CPU-only optimization.

For mixed workloads, the CPU can orchestrate while the GPU processes. When data volume inevitably exceeds the VRAM of a single card, the solution is not more CPU threads but scaling out across multiple GPUs, orchestrated via frameworks like Dask-cuDF.
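
Here is a rough sketch of what that scaling step can look like with Dask-cuDF. The cluster settings and file pattern are illustrative only, and it assumes the dask-cuda package is installed for LocalCUDACluster.

# A rough sketch of scaling past one card with Dask-cuDF; the cluster settings
# and file pattern are illustrative, not taken from this benchmark.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One worker per local GPU: the CPU orchestrates, the GPUs do the work.
cluster = LocalCUDACluster()
client = Client(cluster)

# Each partition of the distributed dataframe lives in one worker's VRAM.
ddf = dask_cudf.read_csv("open_payments_*.csv")

summary = (
    ddf.groupby("Recipient_State")["Payment_Amount"]
    .sum()
    .compute()  # triggers distributed execution across the GPUs
)
print(summary.head())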

The key is to test your specific workload. Start with chunking and lazy evaluation on your existing hardware. If that’s not enough, try optimizing your data structures and algorithms. Only when you’ve exhausted those options should you consider GPU acceleration. But when you do need that level of parallelism, the performance gains can be transformational.

Getting Started with GPU Acceleration

If you’ve determined that GPU acceleration is right for your workload, the acceleration used in this benchmark comes from NVIDIA’s RAPIDS suite of open-source libraries. RAPIDS includes cuDF (GPU DataFrames), cuML (GPU Machine Learning), cuGraph (GPU Graph Analytics), and Dask-cuDF for multi-GPU scaling.
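
To give a feel for how the pieces fit together, here is a small sketch that pairs cuDF with cuML so the data never leaves the GPU. The feature columns are placeholders and KMeans is just an example estimator.

# A small sketch pairing cuDF with cuML: dataframe and model both stay on the GPU.
# The feature columns are placeholders, not real CMS Open Payments fields.
import cudf
from cuml.cluster import KMeans

gdf = cudf.read_csv("open_payments.csv")
features = gdf[["Payment_Amount", "Record_Count"]].astype("float32")

# cuML estimators follow the familiar scikit-learn interface.
model = KMeans(n_clusters=4, random_state=0)
labels = model.fit_predict(features)
print(labels.value_counts())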

The CUDA Toolkit, available free at developer.nvidia.com/cuda-toolkit, provides the foundation for all GPU computing. For those interested in writing custom GPU kernels without deep CUDA knowledge, Triton is an open-source language that makes GPU programming accessible to Python developers.
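
To give a flavor of Triton, here is the classic element-wise add kernel, adapted from Triton’s introductory material. It is a generic illustration rather than part of this benchmark, and it assumes PyTorch is available to allocate the device arrays.

# An illustrative Triton kernel (element-wise add), showing that custom GPU
# kernels can be written in Python-like syntax. Generic example, not benchmark code.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # enough blocks to cover n elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(1_000_000, device="cuda")
y = torch.rand(1_000_000, device="cuda")
print(torch.allclose(add(x, y), x + y))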

You can install the suite via conda (recommended) or pip:

# Using conda (recommended)
conda create -n rapids-env -c rapidsai -c conda-forge -c nvidia rapids=24.10 python=3.11 cuda-version=12.0

# OR using pip
pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12 dask-cudf-cu12 cuml-cu12
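
After installation, a quick sanity check is to import cuDF and run a trivial operation; if something like this prints a version and a sum, cuDF can see your GPU.

# A quick post-install sanity check: if this runs, cuDF can talk to the GPU.
import cudf

print(cudf.__version__)
s = cudf.Series([1, 2, 3])
print(s.sum())  # trivial reduction executed on the device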