As the use of Graphics Processing Units (GPUs) becomes increasingly prevalent in domains such as artificial intelligence, machine learning, and high-performance computing, it is essential to ensure that software applications running on these devices are reliable, efficient, and free from errors. CUDA is a popular programming model for GPUs developed by NVIDIA that allows developers to harness the power of parallel processing to accelerate their applications. However, analyzing CUDA code can be challenging due to its unique characteristics, such as inherent parallelism, hardware variants, and a strong focus on floating-point values. In this blog post, we'll explore some of the key challenges that static analysis systems face when trying to analyze CUDA code.
We have accepted these challenges and, starting with Axivion Suite 7.10, we fully support analyzing applications utilizing CUDA.
CUDA's Challenges for Code Analysis
Heavy Use of Templates
The usual benefits of using C++ templates, e.g., compile-time evaluation including type checking and constant folding, apply to CUDA code as well. Furthermore, the nature of CUDA lends itself particularly well to some of the benefits of templates:
- The ability to write generic code that can be reused for multiple types and configurations is very beneficial to CUDA applications as they must deal with different compute capabilities of the underlying hardware.
- Meta-programming techniques, or techniques for enabling or disabling code at compile time such as SFINAE, can help to reduce a binary's footprint in the presence of different compute capabilities and features.
In consequence, typical CUDA applications make heavy use of C++ templates. This is also true for the CUDA standard library and for several supporting libraries provided by NVIDIA, such as cuBLAS (for linear algebra) or cuDNN (for neural networks).
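As an illustration, consider the following sketch of a common template idiom in CUDA kernels; the kernel name and sizes are our own illustrative choices, not taken from any particular library. The block size is a template parameter, so the reduction loop operates on a compile-time constant:

// A minimal sketch of a common CUDA template idiom: a reduction kernel
// parameterized on its block size (assumed to be a power of two), so the
// loop below can be fully unrolled at compile time.
template <int BlockSize, typename T>
__global__ void block_sum(const T* in, T* out, int n) {
    __shared__ T buf[BlockSize];
    int tid = threadIdx.x;
    int i = blockIdx.x * BlockSize + tid;
    buf[tid] = (i < n) ? in[i] : T(0);
    __syncthreads();
    // BlockSize is a compile-time constant, so the compiler can fully
    // unroll this loop and fold away the bounds computation.
    #pragma unroll
    for (int stride = BlockSize / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

Every instantiation, e.g. block_sum<256, float>, is a separate kernel that the compiler, and by extension a static analyzer, must resolve, instantiate, and check individually.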
However, templates are heavy on the compiler and, by extension, on static analysis systems as well. Typically, compiling templates involves analyzing dependencies between them, resolving overloads, and finding appropriate instantiations. Static analyzers must take the same steps to appropriately analyze CUDA code, as the source code as written does not convey enough information to infer the desired properties.
Inherent Parallelism
Many advanced analysis techniques ultimately rely on propagating information through the control or data flow of an application. While this can already be complicated for sequential but branching control flow, analyzing the different possible thread interleavings is even more complex and requires considerably more computational resources. Yet, parallelism is what CUDA is all about: offloading computation to GPUs benefits from their high number of cores and their ability to parallelize.
This blows up the search space of potential program behaviors that an analysis tool must traverse to uncover problematic ones. Furthermore, new kinds of issues can arise in these systems and should be detected by static analysis systems: deadlocks, data races, or other incorrect usage of shared memory, as in the sketch below.
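As an illustration (the kernel and all names in it are hypothetical, not taken from any real code base), consider a kernel in which many threads update shared counters without synchronization:

// Hypothetical kernel illustrating a data race: many threads perform an
// unsynchronized read-modify-write on the same memory locations.
__global__ void racy_histogram(const int* data, int* bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bins[data[i]]++;                 // data race between threads
        // atomicAdd(&bins[data[i]], 1); // the race-free alternative
    }
}

An analyzer that tracks concurrent accesses should report the unsynchronized increment and could suggest the atomic alternative.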
Variants and Compute Capabilities
Compute capabilities refer to the architecture of NVIDIA GPUs and their ability to execute CUDA kernels. Each compute capability represents a specific generation of NVIDIA GPUs, with newer generations typically offering improved performance, power efficiency, and support for more advanced features.
To support both old and new hardware at the same time, CUDA applications have to take the compute capabilities into account. This can be done either at compile time, using conditional compilation and the __CUDA_ARCH__ macro, or dynamically at runtime, using cudaGetDeviceProperties(), as shown in the following examples:
void perform_operation(…) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 300
    // code using features available in newer architectures
#else
    // code for older architectures' capabilities
#endif
}
void perform_operation(…) {
    cudaDeviceProp prop;
    // Query the properties of the first CUDA device; production code
    // should also check the returned cudaError_t.
    cudaGetDeviceProperties(&prop, 0);
    if (prop.major * 10 + prop.minor >= 30) {
        // code using features available in newer architectures
    } else {
        // code for older architectures' capabilities
    }
}
Both approaches have consequences for static analysis systems. In the static case, where the compute capability is determined at compile time, analyzers must decide which branch to take when checking the code in question, or analyze both, which requires two separate analyses.
This is comparable, but not identical, to the increased complexity of analyzing software product lines. With product lines, however, it is usually sufficient to consider each individual realization separately, as they never interact directly. This is not the case for CUDA, where it is perfectly possible to compile different versions for different devices and use them within the same application. Analyzers therefore have to compile different versions of the application under analysis and analyze them simultaneously. The build step below illustrates how such variants come about.
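For illustration, a single nvcc invocation can embed device code for several compute capabilities into one "fat" binary; the file names here are placeholders:

# Embed device code for compute capabilities 7.0 and 8.0 into one binary.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     app.cu -o app

An analyzer that only considers one of the embedded variants would miss behaviors that occur when the same binary runs on the other device generation.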
By comparison, the dynamic case does not increase the number of variants that have to be considered, but instead increases the complexity of the control flow that has to be analyzed.
Floating Point Values
With their roots in graphics processing, GPUs, and by extension the AI-tailored chips by NVIDIA, work first and foremost on floating-point or even double-precision floating-point data. While this is appropriate for their primary use cases in scientific computing or machine learning, it is unusual for the domains in which static analysis systems are typically applied. In fact, the well-known MISRA ruleset for safety-critical systems discourages the use of floating-point numbers and arithmetic on them to a certain degree. Thus, tools are not well tuned to the analysis of floating-point values. Furthermore, floats are harder to analyze due to their special behavior when it comes to rounding, handling of precision, their representation in memory, and various other aspects.
However, the new focus on floating-point numbers driven by CUDA and various AI libraries and systems will also help improve results for other rulesets such as MISRA or CERT. At the same time, examples in these rulesets can be extended to CUDA to get a first impression of what properties can and should reasonably be checked.
As an example, CERT C (which is to some extent applicable to C++ as well) includes, among others, three rules regarding floating-point numbers that CUDA applications should adhere to as well; a sketch of the first follows after this list:
- FLP34-C: Ensure that floating-point conversions are within range of the new type, i.e., there is no precision or data loss when converting floating-point numbers. When thinking about the different compute capabilities, this could also take different precision on different devices into account.
- FLP36-C: Preserve precision when converting integral values to floating-point type, i.e., when creating floating-point numbers from integers, again ensure that there is no loss of data. As the ranges of exactly representable integers and floating-point numbers do not fully overlap, this cannot just be a simple comparison of smallest or largest values. Rather, we again need to consider the individual properties of individual platforms.
- FLP37-C: Do not use object representations to compare floating-point values. Again, a rule that increases in importance when taking systems with devices that use different floating-point representations into account. Such systems cannot rely on any knowledge about the bit-level representation of an individual floating-point number.
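As a minimal sketch of the kind of issue FLP34-C targets, here in device code (the kernel name and the defensive variant are our own illustrative choices), consider an unchecked narrowing conversion from double to float:

#include <cfloat>
#include <cmath>

// Hypothetical kernel narrowing double to float. If fabs(d) exceeds
// FLT_MAX, the unchecked cast yields +/-infinity -- exactly the kind of
// out-of-range conversion FLP34-C asks developers to prevent.
__global__ void narrow(const double* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double d = in[i];
        out[i] = static_cast<float>(d);  // diagnosable: may be out of range
        // A defensive variant checks the range first:
        // out[i] = (fabs(d) <= (double)FLT_MAX) ? (float)d : 0.0f /* handle */;
    }
}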
Conclusion
In conclusion, analyzing CUDA code presents several unique challenges for static analysis. By understanding the complexities of CUDA programming and adapting our analysis techniques accordingly, we have ensured that Axivion can help develop safety-critical applications with the highest level of quality and maintainability in mind, even and especially if they contain CUDA code. As the use of GPUs continues to grow in various safety-critical domains, the importance of developing robust and reliable CUDA applications will only increase. By addressing the challenges posed by CUDA, Axivion stays ahead of these developments and provides the static analysis tools needed to ensure that software applications meet the high standards required.
Learn more about Axivion for CUDA or request a demo from one of our experts.