NVIDIA Compute Visual Profiler Version 4.0
Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
Notice
BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:
ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE
BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND
SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND
EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY,
AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. These materials supersede and replace all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other
company and product names may be trademarks of the respective companies
with which they are associated.
Copyright (C) 2007-2011 by NVIDIA Corporation. All rights reserved.
PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.
List of supported features:
Execute a CUDA or OpenCL program (referred to as a Compute program in this document) with profiling enabled and view the profiler output as a table, with one row for each GPU method (a minimal example program is sketched below).
The main menu has the following entries: File, Profile, Session, Options, Window, and Help. See the descriptions below for details on the menu options.
The second line has four groups of toolbar icons.
Summary session information is displayed when a session is selected in the tree view.
Summary device information is displayed when a device is selected in the tree view.
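For concreteness, a minimal CUDA program of the kind that can be run under a profiling session is sketched below. The file name and kernel are illustrative only (they are not shipped with the profiler); on a compute capability 2.0 device its fully coalesced loads and stores should be reflected in counters such as "l2 read requests" and "global store transaction".

```
// vecadd.cu -- illustrative program to execute under the profiler.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // one coalesced global load per operand, one store
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();   // make sure the launch finishes before exit

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```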
Session context menu.
Session->Device context menu:
Counter | Description | Type | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 |
---|---|---|---|---|---|---|---|---|
branch | Number of branches taken by threads executing a kernel. This counter is incremented by one if at least one thread in a warp takes the branch. Note that barrier instructions (__syncthreads()) are also counted as branches. | SM | Y | Y | Y | Y | Y | Y |
divergent branch | Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data-dependent conditional branch. The counter is incremented by one at each point of divergence in a warp; see the example following this table. | SM | Y | Y | Y | Y | Y | Y |
instructions | Number of instructions executed. | SM | Y | Y | Y | Y | N | N |
warp serialize | If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. This counter gives the number of thread warps that serialize on address conflicts to either shared or constant memory. | SM | Y | Y | Y | Y | N | N |
sm cta launched | Number of thread blocks launched on a multiprocessor. | SM | Y | Y | Y | Y | Y | Y |
gld uncoalesced | Number of non-coalesced global memory loads. | TPC | Y | Y | N | N | N | N |
gld coalesced | Number of coalesced global memory loads. | TPC | Y | Y | N | N | N | N |
gld request | Number of global memory load requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. | TPC | N | N | Y | Y | Y | Y |
gld 32 byte | Number of 32 byte global memory load transactions. This increments by 1 for each 32 byte transaction. | TPC | N | N | Y | Y | N | N |
gld 64 byte | Number of 64 byte global memory load transactions. This increments by 1 for each 64 byte transaction. | TPC | N | N | Y | Y | N | N |
gld 128 byte | Number of 128 byte global memory load transactions. This increments by 1 for each 128 byte transaction. | TPC | N | N | Y | Y | N | N |
gst coalesced | Number of coalesced global memory stores. | TPC | Y | Y | N | N | N | N |
gst request | Number of global memory store requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. | TPC | N | N | Y | Y | Y | Y |
gst 32 byte | Number of 32 byte global memory store transactions. This increments by 2 for each 32 byte transaction. | TPC | N | N | Y | Y | N | N |
gst 64 byte | Number of 64 byte global memory store transactions. This increments by 4 for each 64 byte transaction. | TPC | N | N | Y | Y | N | N |
gst 128 byte | Number of 128 byte global memory store transactions. This increments by 8 for each 128 byte transaction. | TPC | N | N | Y | Y | N | N |
local load | Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction. | TPC | Y | Y | Y | Y | Y | Y |
local store | Number of local memory store transactions. This increments by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. This increments by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0 and higher. | TPC | Y | Y | Y | Y | Y | Y |
cta launched | Number of thread blocks launched on a TPC. | TPC | Y | Y | Y | Y | N | N |
texture cache hit | Number of texture cache hits. | TPC | Y | Y | Y | Y | N | N |
texture cache miss | Number of texture cache misses. | TPC | Y | Y | Y | Y | N | N |
prof triggers | There are eight such triggers that the user can profile. They are generic and can be inserted at any place in the code to collect related information; see the example following this table. | TPC | Y | Y | Y | Y | Y | Y |
shared load | Number of executed shared load instructions per warp on a multiprocessor. | SM | N | N | N | N | Y | Y |
shared store | Number of executed shared store instructions per warp on a multiprocessor. | SM | N | N | N | N | Y | Y |
instructions issued | Number of instructions issued including replays. | SM | N | N | N | N | Y | Y |
instructions executed | Number of instructions executed; does not include replays. | SM | N | N | N | N | Y | Y |
threads instruction executed | Number of instructions executed by all threads; does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction. | SM | N | N | N | N | Y | Y |
warps launched | Number of warps launched on a multiprocessor. | SM | N | N | N | N | Y | Y |
threads launched | Number of threads launched on a multiprocessor. | SM | N | N | N | N | Y | Y |
active cycles | Number of cycles a multiprocessor has at least one active warp. | SM | N | N | N | N | Y | Y |
active warps | Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 48. | SM | N | N | N | N | Y | Y |
l1 global load hit | Number of global load hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 global load miss | Number of global load misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local load hit | Number of local load hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local load miss | Number of local load misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local store hit | Number of local store hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local store miss | Number of local store misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 shared bank conflicts | Number of shared bank conflicts. | SM | N | N | N | N | Y | Y |
uncached global load transaction | Number of uncached global load transactions. Increments by 1 per transaction. Transaction size can be 32/64/128 bytes. Non-zero values are only seen when the L1 cache is disabled at compile time. Please refer to the CUDA Programming Guide (Section G.4.2) for how to disable the L1 cache. | SM | N | N | N | N | Y | Y |
global store transaction | Number of global store transactions. Increments by 1 per transaction. Transaction size can be 32/64/128 bytes. | SM | N | N | N | N | Y | Y |
l2 read requests | Number of read requests from L1 to L2 cache. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
l2 read texture requests | Number of read requests from texture cache to L2 cache. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
l2 write requests | Number of write requests from L1 to L2 cache. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
l2 read misses | Number of read misses in L2 cache. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
l2 write misses | Number of write misses in L2 cache. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
dram reads | Number of read requests to DRAM. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
dram writes | Number of write requests to DRAM. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
tex cache requests | Number of texture cache requests. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
tex cache misses | Number of texture cache misses. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
gld instructions 8bit | Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gld instructions 16bit | Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gld instructions 32bit | Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gld instructions 64bit | Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gld instructions 128bit | Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gst instructions 8bit | Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gst instructions 16bit | Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gst instructions 32bit | Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gst instructions 64bit | Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
gst instructions 128bit | Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks. | SW | N | N | N | N | Y | Y |
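The "divergent branch" and "prof triggers" counters above are easiest to see with a small experiment. The kernel below is a hypothetical example, not part of the profiler: the data-dependent branch makes neighboring threads of a warp follow different paths, which increments "divergent branch", and the __prof_trigger() device function (see the CUDA Programming Guide) feeds the "prof triggers" counters, one increment per warp that executes the call.

```
// divergence_demo.cu -- illustrative kernel for the "divergent branch" and
// "prof triggers" counters (hypothetical example, not part of the profiler).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergenceDemo(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Data-dependent branch: with alternating input parity, neighboring
        // threads of the same warp take different paths, so the
        // "divergent branch" counter increments.
        if (in[i] & 1) {
            out[i] = in[i] + 1;
            __prof_trigger(0);   // feeds the first prof trigger counter
        } else {
            out[i] = in[i] * 2;
            __prof_trigger(1);   // feeds the second prof trigger counter
        }
    }
}

int main()
{
    const int n = 1 << 20;
    int *h = new int[n];
    for (int i = 0; i < n; ++i)
        h[i] = i;   // alternating parity guarantees divergence in every warp

    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemcpy(in, h, n * sizeof(int), cudaMemcpyHostToDevice);

    divergenceDemo<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    delete[] h;
    return 0;
}
```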
Derived stats | Description | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 |
---|---|---|---|---|---|---|---|
glob mem read throughput | Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / (gputime * 1000). For compute capability >= 2.0 this is calculated as ((dram reads) * 32) / (gputime * 1000). | * | * | * | * | * | * |
glob mem write throughput | Global memory write throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / (gputime * 1000). For compute capability >= 2.0 this is calculated as ((dram writes) * 32) / (gputime * 1000). | * | * | * | * | * | * |
glob mem overall throughput | Global memory overall throughput in gigabytes per second. This is calculated as glob mem read throughput + glob mem write throughput. | * | * | * | * | * | * |
gld efficiency | Global load efficiency | NA | NA | 0-1 | 0-1 | NA | NA |
gst efficiency | Global store efficiency | NA | NA | 0-1 | 0-1 | NA | NA |
Instruction throughput | Instruction throughput ratio. This is the ratio of the achieved instruction rate to the peak single-issue instruction rate. The achieved instruction rate is calculated using the "instructions" profiler counter; the peak instruction rate is calculated from the GPU clock speed. When instruction dual-issue comes into play, this ratio can exceed 1. This is calculated as (instructions) / (gputime * clock frequency). | 0-1 | 0-1 | 0-1 | 0-1 | NA | NA |
retire ipc | Retired instructions per cycle. This is calculated as (instructions executed) / (active cycles). | NA | NA | NA | NA | 0-2 | 0-4 |
active warps/active cycles | The average number of warps that are active on a multiprocessor per cycle. This is calculated as (active warps) / (active cycles). This is supported only for GPUs with compute capability 2.0 and higher. | NA | NA | NA | NA | 0-48 | 0-48 |
l1 gld hit rate | This is calculated as 100 * (l1 global load hit) / ((l1 global load hit) + (l1 global load miss)). This is supported only for GPUs with compute capability 2.0 and higher. | NA | NA | NA | NA | 0-100 | 0-100 |
texture hit rate % | This is calculated as 100 * (tex cache requests - tex cache misses) / (tex cache requests). This is supported only for GPUs with compute capability 2.0 and higher. | NA | NA | NA | NA | 0-100 | 0-100 |
Ideal Instruction/Byte ratio | This is the ratio of the peak instruction throughput to the peak memory throughput of the CUDA device. It is a property of the device and is independent of the kernel. | NA | NA | NA | NA | * | * |
instruction/byte | This is the ratio of the total number of instructions issued by the kernel to the total number of bytes accessed by the kernel from global memory. If this ratio is greater than the Ideal Instruction/Byte ratio the kernel is compute bound; if it is less, the kernel is memory bound. This is calculated as (32 * instructions issued * #SM) / (32 * (l2 read requests + l2 write requests + l2 read texture requests)). | NA | NA | NA | NA | * | * |
Achieved Kernel Occupancy | This ratio gives the actual occupancy of the kernel based on the number of warps executing per cycle on the SM. It is the ratio of active warps to active cycles, divided by the maximum number of warps that can execute on an SM. This is calculated as (active warps / active cycles) / 48. | NA | NA | NA | NA | 0-1 | 0-1 |
Kernel requested global memory read throughput (GB/s) | The actual number of bytes requested in terms of loads by the kernel from global memory, divided by the kernel execution time. These requests are made as global load instructions, which can have word sizes of 8, 16, 32, 64 or 128 bits. This is calculated as (gld instructions 8bit + 2 * gld instructions 16bit + 4 * gld instructions 32bit + 8 * gld instructions 64bit + 16 * gld instructions 128bit) / (gputime * 1000). | NA | NA | NA | NA | * | * |
Kernel requested global memory write throughput (GB/s) | The actual number of bytes requested in terms of stores by the kernel to global memory, divided by the kernel execution time. These requests are made as global store instructions, which can have word sizes of 8, 16, 32, 64 or 128 bits. This is calculated as (gst instructions 8bit + 2 * gst instructions 16bit + 4 * gst instructions 32bit + 8 * gst instructions 64bit + 16 * gst instructions 128bit) / (gputime * 1000). | NA | NA | NA | NA | * | * |
Kernel requested global memory throughput (GB/s) | The combined kernel requested read and write memory throughput. This is calculated as (Kernel requested global memory read throughput + Kernel requested global memory write throughput). | NA | NA | NA | NA | * | * |
L1 cache read throughput (GB/s) | The throughput achieved while accessing data from the L1 cache. This is calculated as ((l1 global load hit + l1 local load hit) * 128 * #SM + l2 read requests * 32) / (gputime * 1000). | NA | NA | NA | NA | * | * |
L1 cache global hit ratio (%) | Percentage of hits that occur in the L1 cache while accessing global memory. This statistic will be zero when the L1 cache is disabled. This is calculated as (100 * l1 global load hit) / (l1 global load hit + l1 global load miss). | NA | NA | NA | NA | 0-100 | 0-100 |
Texture cache memory throughput (GB/s) | The memory throughput achieved while reading data from texture memory. This statistic will be zero when texture memory is not used. This is calculated as (#SM * tex cache sector queries * 32) / (gputime * 1000). | NA | NA | NA | NA | * | * |
Texture cache hit rate (%) | Percentage of hits that occur in the texture cache while accessing data from texture memory. This statistic will be zero when texture memory is not used. This is calculated as 100 * (tex cache requests - tex cache misses) / (tex cache requests). | NA | NA | NA | NA | 0-100 | 0-100 |
L2 cache texture memory read throughput (GB/s) | The throughput achieved while reading data from the L2 cache when a request for data residing in texture memory is made. This is calculated as (l2 read texture requests * 32) / (gputime * 1000). | NA | NA | NA | NA | * | * |
L2 cache global memory read throughput (GB/s) | The throughput achieved while reading data from the L2 cache when a request for data residing in global memory is made by L1. This is calculated as (l2 read requests * 32) / (gputime * 1000). | NA | NA | NA | NA | * | * |
L2 cache global memory write throughput (GB/s) | The throughput achieved while writing data to the L2 cache when a request to store data in global memory is made by L1. This is calculated as (l2 write requests * 32) / (gputime * 1000). | NA | NA | NA | NA | * | * |
L2 cache global memory throughput (GB/s) | The combined L2 cache read and write memory throughput. This is calculated as (L2 cache global memory read throughput + L2 cache global memory write throughput). | NA | NA | NA | NA | * | * |
L2 cache read hit ratio (%) | Percentage of hits that occur in the L2 cache while reading from global memory. This is calculated as 100 * (L2 cache global memory read throughput - glob mem read throughput) / (L2 cache global memory read throughput). | NA | NA | NA | NA | 0-100 | 0-100 |
L2 cache write hit ratio (%) | Percentage of hits that occur in the L2 cache while writing to global memory. This is calculated as 100 * (L2 cache global memory write throughput - glob mem write throughput) / (L2 cache global memory write throughput). | NA | NA | NA | NA | 0-100 | 0-100 |
Local memory bus traffic (%) | Percentage of bus traffic caused by accesses to local memory. This is calculated as (2 * l1 local load miss * 128 * 100) / ((l2 read requests + l2 write requests) * 32 / #SM). | NA | NA | NA | NA | 0-100 | 0-100 |
Global memory excess load (%) | The percentage of excess data fetched while making global memory load transactions. Ideally 0% excess loads are achieved when the kernel requested global memory read throughput equals the L2 cache read throughput, i.e. the number of bytes requested by the kernel in terms of reads equals the number of bytes actually fetched by the hardware during kernel execution to service the kernel. If this statistic is high, the access pattern for fetches is not coalesced and many extra bytes are fetched while serving the threads of the kernel. This is calculated as 100 - (100 * kernel requested global memory read throughput / L2 cache global memory read throughput). | NA | NA | NA | NA | 0-100 | 0-100 |
Global memory excess store (%) | The percentage of excess data accessed while making global memory store transactions. Ideally 0% excess stores are achieved when the kernel requested global memory write throughput equals the L2 cache write throughput, i.e. the number of bytes requested by the kernel in terms of stores equals the number of bytes actually accessed by the hardware during kernel execution to service the kernel. If this statistic is high, the access pattern for stores is not coalesced and many extra bytes are accessed during execution of the threads of the kernel. This is calculated as 100 - (100 * kernel requested global memory write throughput / L2 cache global memory write throughput). | NA | NA | NA | NA | 0-100 | 0-100 |
Peak global memory throughput (GB/s) | The peak memory throughput or bandwidth that can be achieved on the present CUDA device. This is a device property, and the memory throughput achieved by the kernel should be as close to this peak as possible. | * | * | * | * | * | * |
IPC - Instructions/Cycle | The number of instructions issued per cycle. This should be compared to the maximum IPC possible for the device. The range provided is for single precision floating point instructions. This is calculated as (instructions issued) / (active cycles). | NA | NA | NA | NA | 0-2 | 0-4 |
Divergent branches (%) | The percentage of branches causing divergence within a warp, among all branches present in the kernel. Divergence within a warp causes serialization in execution. This is calculated as (100 * divergent branch) / (divergent branch + branch). | 0-100 | 0-100 | 0-100 | 0-100 | 0-100 | 0-100 |
Control flow divergence (%) | Control flow divergence gives the percentage of thread instructions that were not executed by all threads in the warp, hence causing divergence. This should be as low as possible. This is calculated as 100 * ((32 * instructions executed) - threads instruction executed) / (32 * instructions executed). | NA | NA | NA | NA | 0-100 | 0-100 |
Replayed Instructions (%) | The percentage of instructions replayed during kernel execution. Replayed instructions are the difference between the number of instructions actually issued by the hardware and the number of instructions to be executed by the kernel. Ideally this should be zero. This is calculated as 100 * (instructions issued - instructions executed) / (instructions issued). | NA | NA | NA | NA | 0-100 | 0-100 |
Global memory replay (%) | Percentage of replayed instructions caused by global memory accesses. This is calculated as 100 * (l1 global load miss) / (instructions issued). | NA | NA | NA | NA | 0-100 | 0-100 |
Local memory replay (%) | Percentage of replayed instructions caused by local memory accesses. This is calculated as 100 * (l1 local load miss + l1 local store miss) / (instructions issued). | NA | NA | NA | NA | 0-100 | 0-100 |
Shared bank conflict replay (%) | Percentage of replayed instructions caused by shared memory bank conflicts. This is calculated as 100 * (l1 shared bank conflicts) / (instructions issued). | NA | NA | NA | NA | 0-100 | 0-100 |
Shared memory bank conflict per shared memory instruction (%) | An indication of the number of bank conflicts caused per shared memory instruction. This may exceed 100% if there are n-way bank conflicts or the data accessed is double precision. This is calculated as 100 * (l1 shared bank conflicts) / (shared load + shared store). | NA | NA | NA | NA | 0-100 | 0-100 |
SM activity (%) | Percentage of multiprocessor utilization. This is calculated as 100 * (active cycles) / (elapsed clocks). | NA | NA | NA | NA | 0-100 | 0-100 |
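As a worked example of the formulas above, the host-side sketch below evaluates a few of the derived statistics from raw counter values. All counter numbers are invented for illustration; gputime is assumed to be reported in microseconds (so bytes / (gputime * 1000) yields GB/s), and a compute capability 2.0 device with a 48-warp limit per SM is assumed.

```
// derived_stats.cu -- host-only code running hypothetical counter values
// through the formulas in the table above (illustrative numbers only).
#include <cstdio>

int main()
{
    // Hypothetical raw counter values for one kernel launch on a CC 2.0 device.
    double instructions_issued   = 1.2e6;
    double instructions_executed = 1.0e6;
    double active_cycles         = 8.0e5;
    double active_warps          = 2.4e7;
    double l2_read_requests      = 5.0e5;   // each covers a 32-byte access
    double gputime_us            = 950.0;   // kernel time in microseconds

    // IPC = (instructions issued) / (active cycles); range 0-2 on CC 2.0.
    std::printf("IPC                   : %.2f\n",
                instructions_issued / active_cycles);

    // retire ipc = (instructions executed) / (active cycles).
    std::printf("retire ipc            : %.2f\n",
                instructions_executed / active_cycles);

    // Achieved Kernel Occupancy = (active warps / active cycles) / 48.
    std::printf("achieved occupancy    : %.2f\n",
                (active_warps / active_cycles) / 48.0);

    // Replayed Instructions (%) = 100 * (issued - executed) / issued.
    std::printf("replayed instructions : %.1f %%\n",
                100.0 * (instructions_issued - instructions_executed)
                      / instructions_issued);

    // L2 cache global memory read throughput (GB/s) =
    // (l2 read requests * 32) / (gputime * 1000).
    std::printf("L2 read throughput    : %.2f GB/s\n",
                l2_read_requests * 32.0 / (gputime_us * 1000.0));
    return 0;
}
```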