
Code collected from the book *CUDA by Example* (《GPU高性能编程CUDA实战》)

The CUDA architecture introduced a new model designed specifically for GPU computing, intended to ease the limitations of early GPU computing, the very limitations that had kept earlier GPUs from being widely used for general-purpose computation.

The prerequisites for writing code in CUDA C are: (1) a CUDA-enabled graphics processor, i.e., an NVIDIA GPU with more than 256MB of graphics memory; (2) an NVIDIA device driver, which handles communication between applications and CUDA-enabled hardware; make sure the latest driver is installed, choosing the one that matches your graphics card and operating system; (3) the CUDA development toolkit, i.e., the CUDA Toolkit, which includes a compiler for GPU code; (4) a standard C compiler, i.e., a CPU compiler. A CUDA C application performs computation on two different processors and therefore needs two compilers: one compiles code for the GPU, the other for the CPU.

By convention, the CPU and the system's memory are called the host, while the GPU and its memory are called the device. A function that executes on the GPU device is called a kernel.

Summary of the restrictions on cudaMalloc(): (1) pointers allocated with cudaMalloc() can be passed to functions that execute on the device; (2) pointers allocated with cudaMalloc() can be used to read and write memory in device code; (3) pointers allocated with cudaMalloc() can be passed to functions that execute on the host; (4) pointers allocated with cudaMalloc() cannot be used to read or write memory in host code.

Memory allocated with cudaMalloc() cannot be released with the standard C free() function; to release it, call cudaFree().

Device pointers are used in exactly the same way as pointers in standard C. Host pointers can access memory only from host code, and device pointers can access memory only from device code.

Host code can access the contents of device memory by calling cudaMemcpy().
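The rules above amount to a simple round trip: allocate on the device, copy in, launch, copy out, free. A minimal sketch (the kernel name square is illustrative; HANDLE_ERROR-style error checking is omitted for brevity):

```cuda
#include <cstdio>

__global__ void square(int *p) { *p = (*p) * (*p); }  // device code may dereference a cudaMalloc'd pointer

int main()
{
    int h = 7, *dev_p = NULL;
    cudaMalloc((void**)&dev_p, sizeof(int));                     // allocate device memory
    cudaMemcpy(dev_p, &h, sizeof(int), cudaMemcpyHostToDevice);  // host touches device memory only via cudaMemcpy
    square<<<1, 1>>>(dev_p);                                     // pass the device pointer to a kernel
    cudaMemcpy(&h, dev_p, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h);   // prints 49 on a CUDA-capable machine
    cudaFree(dev_p);     // never free() a cudaMalloc'd pointer
    return 0;
}
```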

A single graphics card may contain two or more GPUs.

Code running on an integrated GPU can share memory with the CPU.

Only cards with compute capability 1.3 or higher support double-precision floating-point arithmetic.

The first parameter inside the angle brackets specifies the number of parallel blocks the device uses when executing the kernel.

The collection of parallel blocks is called a grid. A grid can be either a one-dimensional or a two-dimensional collection of blocks.

The GPU has a full memory management mechanism: it will forcibly terminate any process that violates its memory access rules.

When launching an array of blocks, no single dimension of that array may exceed 65,535. This is a hardware limitation; launching more blocks than this will cause the program to fail.

The CUDA runtime decomposes each block into multiple threads. To launch multiple parallel blocks, simply change the first parameter inside the angle brackets from 1 to the number of blocks you want. The second parameter inside the angle brackets specifies the number of threads the CUDA runtime creates per block.

The hardware limits the number of blocks to 65,535. The hardware likewise limits the number of threads per block with which a kernel may be launched: it may not exceed the value of the maxThreadsPerBlock field of the device properties structure. This value is not fixed; on some devices it is 512, on others 1,024.

The built-in variable blockDim is a constant for all blocks and holds the number of threads along each dimension of a block.

The built-in variable gridDim is a constant for all blocks and holds the size of the grid along each dimension, i.e., the number of blocks in the grid.

The built-in variable blockIdx contains the index of the block that is currently executing the device code.

The built-in variable threadIdx contains the index of the thread that is currently executing the device code.
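Taken together, these built-ins give every thread a unique global index; a common sketch (the kernel and array names are illustrative):

```cuda
__global__ void add(int n, const int *a, const int *b, int *c)
{
    // unique global index of this thread, then a grid-sized stride
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (tid < n) {           // grid-stride loop covers n larger than the thread count
        c[tid] = a[tid] + b[tid];
        tid += stride;
    }
}
```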

The CUDA runtime allows you to launch a two-dimensional grid in which every block is a three-dimensional array of threads.

CUDA C supports shared memory: adding the CUDA C keyword __shared__ to a variable declaration makes that variable reside in shared memory. The CUDA C compiler treats variables in shared memory differently from ordinary variables.

The CUDA architecture guarantees that no thread will execute an instruction past __syncthreads() until every thread in its block has executed __syncthreads().
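A typical use of __shared__ together with __syncthreads() is a per-block reduction, as in the dot-product example below; a minimal sketch (names illustrative, assumes a power-of-two block size of 256):

```cuda
#define THREADS 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS];            // one slot per thread in this block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                            // cache fully populated before anyone reads it

    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            cache[threadIdx.x] += cache[threadIdx.x + half];
        __syncthreads();                        // every thread in the block reaches this, or none do
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];             // one partial sum per block
}
```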

Because a GPU contains hundreds of arithmetic units, the performance bottleneck usually lies not in the chip's arithmetic throughput but in its memory bandwidth.

Constant memory holds data that will not change during kernel execution. NVIDIA hardware provides 64KB of constant memory and handles it differently from standard global memory. In some situations, replacing global memory with constant memory can effectively reduce the required memory bandwidth. To use constant memory, prefix the variable declaration with the __constant__ keyword.
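Constant memory is declared at file scope and filled from the host with cudaMemcpyToSymbol(), as the ray-tracing example (test11) does with its Sphere array; a minimal sketch (names illustrative):

```cuda
#define N 16

__constant__ float coeff[N];        // resides in the 64KB of constant memory

__global__ void scale(float *data)
{
    int tid = threadIdx.x;
    if (tid < N)
        data[tid] *= coeff[tid];    // read-only from device code
}

// Host side: note cudaMemcpyToSymbol rather than cudaMemcpy.
// float host_coeff[N] = { /* ... */ };
// cudaMemcpyToSymbol(coeff, host_coeff, sizeof(float) * N);
```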

In the CUDA architecture, a warp is a set of 32 threads that are "woven together" and execute in lockstep. At every line of the program, each thread in a warp executes the same instruction on different data.

Texture memory is another kind of read-only memory available to CUDA C programs. Like constant memory, texture memory is cached on chip, so in some situations it can reduce memory traffic and provide higher effective bandwidth. Texture caches are designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality.

NVIDIA refers to the set of features a GPU supports as its compute capability. A higher compute capability is a superset of a lower one.

Only GPUs with compute capability 1.1 or higher support atomic operations on global memory, and only compute capability 1.2 or higher supports atomic operations on shared memory. CUDA C supports a variety of atomic operations.
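atomicAdd() makes a read-modify-write indivisible, which is what the histogram example (test16) relies on; a minimal sketch (names illustrative):

```cuda
__global__ void histo(const unsigned char *buffer, long size, unsigned int *histo256)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        // many threads may hit the same bin at once; atomicAdd serializes the update
        atomicAdd(&histo256[buffer[i]], 1);
        i += stride;
    }
}
```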

The C library function malloc() allocates standard, pageable host memory, whereas cudaHostAlloc() allocates page-locked host memory. Page-locked memory, also called pinned memory or non-pageable memory, has an important property: the operating system will never page this memory out to disk, which guarantees that it always resides in physical memory. The OS can therefore safely let an application access the physical address of this memory, since it will never be destroyed or relocated.

Pinned memory is a double-edged sword. When you use it, you give up all the benefits of virtual memory. In particular, every page-locked allocation in the application consumes physical memory, because it cannot be swapped to disk. This means the system will run out of memory much sooner than it would with standard malloc() calls; the application may fail on machines with little physical memory, and it may degrade the performance of other applications running on the system.

The recommendation is to use page-locked memory only for the source or destination buffers of cudaMemcpy() calls, and to free it as soon as it is no longer needed rather than waiting until the application shuts down.
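In code, the only change from the pageable path is the allocation/free pair (compare cuda_malloc_test and cuda_host_alloc_test below); a minimal sketch:

```cuda
int *a = NULL;

// page-locked (pinned) host allocation; release with cudaFreeHost, not free()
cudaHostAlloc((void**)&a, SIZE * sizeof(int), cudaHostAllocDefault);

// ... use a as the source or destination of cudaMemcpy ...

cudaFreeHost(a);
```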

A CUDA stream represents a queue of GPU operations that execute in a specified order.

Zero-copy memory lets you avoid explicit copies between the CPU and GPU.

The performance benefit of zero-copy memory differs between discrete and integrated GPUs. On an integrated GPU, using zero-copy memory is usually a win, because the memory is physically shared with the host; declaring a buffer as zero-copy simply avoids an unnecessary copy. Still, all kinds of pinned memory carry the same limitation, and zero-copy memory is no exception: every pinned allocation consumes available physical memory, which eventually degrades system performance. On a discrete GPU, zero-copy memory brings a performance gain when the input and output buffers are each used exactly once. But because the GPU does not cache the contents of zero-copy memory, reading it multiple times ends up costing more than it saves, and you would have been better off copying the data to the GPU in the first place.
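Zero-copy buffers are pinned allocations made with the cudaHostAllocMapped flag and handed to kernels through a device alias of the host pointer; a minimal sketch (assumes cudaGetDeviceProperties reported canMapHostMemory, error checking omitted):

```cuda
float *host_a = NULL, *dev_a = NULL;

cudaSetDeviceFlags(cudaDeviceMapHost);          // enable host-memory mapping before any allocation

// pinned + mapped: visible to the GPU without an explicit cudaMemcpy
cudaHostAlloc((void**)&host_a, SIZE * sizeof(float),
              cudaHostAllocWriteCombined | cudaHostAllocMapped);

// obtain the device-side pointer that aliases host_a
cudaHostGetDevicePointer(&dev_a, host_a, 0);

// kernel<<<blocks, threads>>>(dev_a, ...);     // reads host memory across the bus

cudaFreeHost(host_a);
```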

The CUDA Toolkit includes two important libraries: (1) CUFFT, a fast Fourier transform (Fast Fourier Transform) library; (2) CUBLAS, a linear algebra library implementing the BLAS (Basic Linear Algebra Subprograms) interface.

NPP (NVIDIA Performance Primitives) is a library of CUDA-accelerated data-processing routines whose basic feature set focuses on image processing and video processing.

Create a CUDA test project named testCUDA. Besides the common files, the project adds three more files, testCUDA.cu, funset.cu, and funset.cuh, which contain most of the test code from the book:

testCUDA.cu:

#include "funset.cuh"
#include <iostream>
#include "book.h"
#include "cpu_bitmap.h"
#include "gpu_anim.h"

using namespace std;

int test1();  // simple addition of two numbers
int test2();  // query GPU device properties
int test3();  // sum two vectors using the block index
int test4();  // CUDA implementation of the Julia set
int test5();  // sum two vectors using the thread index
int test6();  // sum two vectors using both block and thread indices
int test7();  // CUDA implementation of the ripple animation
int test8();  // CUDA implementation of the dot product
int test9();  // Julia set, with the thread-synchronization function __syncthreads()
int test10(); // ray tracing without constant memory, timed with CUDA events
int test11(); // ray tracing with constant memory, timed with CUDA events
int test12(); // heat-transfer simulation using texture memory (has some problems)
int test13(); // heat-transfer simulation using 2D texture memory (has some problems)
int test14(); // ripple with CUDA + OpenGL
int test15(); // heat-transfer simulation with CUDA + OpenGL (has some problems)
int test16(); // histogram computed with the atomic function atomicAdd
int test17(); // using pinned (page-locked) memory
int test18(); // using a single stream
int test19(); // using multiple streams
int test20(); // dot product via zero-copy memory
int test21(); // dot product using multiple GPUs

int main(int argc, char* argv[])
{
    test21();
    cout << "ok!" << endl;
    return 0;
}

int test1()
{
    int a = 2, b = 3, c = 0;
    int* dev_c = NULL;

    HANDLE_ERROR(cudaMalloc((void**)&dev_c, sizeof(int)));

    // The angle brackets pass parameters to the CUDA compiler and runtime system.
    // They are not parameters to the device code; they tell the runtime how to
    // launch the device code. Parameters for the device code itself go inside
    // the parentheses, just like a standard function call.
    add<<<1, 1>>>(a, b, dev_c);

    HANDLE_ERROR(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("%d + %d = %d\n", a, b, c);
    cudaFree(dev_c);
    return 0;
}

int test2()
{
    int count = -1;
    HANDLE_ERROR(cudaGetDeviceCount(&count));
    printf("device count: %d\n", count);

    cudaDeviceProp prop;
    for (int i = 0; i < count; i++) {
        HANDLE_ERROR(cudaGetDeviceProperties(&prop, i));
        printf("   --- General Information for device %d ---\n", i);
        printf("Name:  %s\n", prop.name);
        printf("Compute capability:  %d.%d\n", prop.major, prop.minor);
        printf("Clock rate:  %d\n", prop.clockRate);
        printf("Device copy overlap:  ");
        if (prop.deviceOverlap) printf("Enabled\n");
        else printf("Disabled\n");
        printf("Kernel execution timeout :  ");
        if (prop.kernelExecTimeoutEnabled) printf("Enabled\n");
        else printf("Disabled\n");
        printf("   --- Memory Information for device %d ---\n", i);
        printf("Total global mem:  %ld\n", prop.totalGlobalMem);
        printf("Total constant Mem:  %ld\n", prop.totalConstMem);
        printf("Max mem pitch:  %ld\n", prop.memPitch);
        printf("Texture Alignment:  %ld\n", prop.textureAlignment);
        printf("   --- MP Information for device %d ---\n", i);
        printf("Multiprocessor count:  %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp:  %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp:  %d\n", prop.regsPerBlock);
        printf("Threads in warp:  %d\n", prop.warpSize);
        printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions:  (%d, %d, %d)\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions:  (%d, %d, %d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");
    }

    int dev;
    HANDLE_ERROR(cudaGetDevice(&dev));
    printf("ID of current CUDA device:  %d\n", dev);

    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;
    prop.minor = 3;
    HANDLE_ERROR(cudaChooseDevice(&dev, &prop));
    printf("ID of CUDA device closest to revision %d.%d:  %d\n", prop.major, prop.minor, dev);
    HANDLE_ERROR(cudaSetDevice(dev));
    return 0;
}

int test3()
{
    int a[NUM] = {0}, b[NUM] = {0}, c[NUM] = {0};
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, NUM * sizeof(int)));

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < NUM; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR(cudaMemcpy(dev_a, a, NUM * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, NUM * sizeof(int), cudaMemcpyHostToDevice));

    // the first parameter in the angle brackets is the number of parallel
    // blocks the device uses when executing the kernel
    add_blockIdx<<<NUM, 1>>>(dev_a, dev_b, dev_c);

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR(cudaMemcpy(c, dev_c, NUM * sizeof(int), cudaMemcpyDeviceToHost));

    // display the results
    for (int i = 0; i < NUM; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_c));
    return 0;
}

int test4()
{
    // globals needed by the update routine
    struct DataBlock {
        unsigned char* dev_bitmap;
    };

    DataBlock data;
    CPUBitmap bitmap(DIM, DIM, &data);
    unsigned char* dev_bitmap;

    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
    data.dev_bitmap = dev_bitmap;

    // declare a two-dimensional grid;
    // the dim3 type is a three-dimensional tuple used to specify the number of blocks to launch;
    // when a dim3 variable is initialized with two values, the CUDA runtime sets the third dimension to 1
    dim3 grid(DIM, DIM);
    kernel_julia<<<grid, 1>>>(dev_bitmap);

    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(dev_bitmap));

    bitmap.display_and_exit();
    return 0;
}

int test5()
{
    int a[NUM], b[NUM], c[NUM];
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, NUM * sizeof(int)));

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < NUM; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR(cudaMemcpy(dev_a, a, NUM * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, NUM * sizeof(int), cudaMemcpyHostToDevice));

    add_threadIdx<<<1, NUM>>>(dev_a, dev_b, dev_c);

    // copy the array 'c' from the GPU back to the CPU
    HANDLE_ERROR(cudaMemcpy(c, dev_c, NUM * sizeof(int), cudaMemcpyDeviceToHost));

    // display the results
    for (int i = 0; i < NUM; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

int test6()
{
    int a[NUM], b[NUM], c[NUM];
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, NUM * sizeof(int)));

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < NUM; i++) {
        a[i] = i;
        b[i] = i * i / 10;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR(cudaMemcpy(dev_a, a, NUM * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, NUM * sizeof(int), cudaMemcpyHostToDevice));

    add_blockIdx_threadIdx<<<128, 128>>>(dev_a, dev_b, dev_c);

    // copy the array 'c' from the GPU back to the CPU
    HANDLE_ERROR(cudaMemcpy(c, dev_c, NUM * sizeof(int), cudaMemcpyDeviceToHost));

    // verify that the GPU did the work we asked of it
    bool success = true;
    for (int i = 0; i < NUM; i++) {
        if ((a[i] + b[i]) != c[i]) {
            printf("error: %d + %d != %d\n", a[i], b[i], c[i]);
            success = false;
        }
    }
    if (success)
        printf("we did it!\n");

    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

int test7()
{
    DataBlock data;
    CPUAnimBitmap bitmap(DIM, DIM, &data);
    data.bitmap = &bitmap;

    HANDLE_ERROR(cudaMalloc((void**)&data.dev_bitmap, bitmap.image_size()));

    bitmap.anim_and_exit((void(*)(void*, int))generate_frame, (void(*)(void*))cleanup);
    return 0;
}

void generate_frame(DataBlock *d, int ticks)
{
    dim3 blocks(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    ripple_kernel<<<blocks, threads>>>(d->dev_bitmap, ticks);

    HANDLE_ERROR(cudaMemcpy(d->bitmap->get_ptr(), d->dev_bitmap, d->bitmap->image_size(), cudaMemcpyDeviceToHost));
}

// clean up memory allocated on the GPU
void cleanup(DataBlock *d)
{
    HANDLE_ERROR(cudaFree(d->dev_bitmap));
}

int test8()
{
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = (float*)malloc(NUM * sizeof(float));
    b = (float*)malloc(NUM * sizeof(float));
    partial_c = (float*)malloc(blocksPerGrid * sizeof(float));

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, NUM * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, NUM * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_partial_c, blocksPerGrid * sizeof(float)));

    // fill in the host memory with data
    for (int i = 0; i < NUM; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR(cudaMemcpy(dev_a, a, NUM * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, NUM * sizeof(float), cudaMemcpyHostToDevice));

    dot_kernel<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_partial_c);

    // copy the array 'partial_c' back from the GPU to the CPU
    HANDLE_ERROR(cudaMemcpy(partial_c, dev_partial_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost));

    // finish up on the CPU side
    c = 0;
    for (int i = 0; i < blocksPerGrid; i++) {
        c += partial_c[i];
    }

    // the dot product should equal twice the sum of the squares of 0..NUM-1;
    // closed-form solution for the sum of squares
#define sum_squares(x)  (x * (x + 1) * (2 * x + 1) / 6)
    printf("Does GPU value %.6g = %.6g?\n", c, 2 * sum_squares((float)(NUM - 1)));

    // free memory on the GPU side
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_partial_c));

    // free memory on the CPU side
    free(a);
    free(b);
    free(partial_c);
    return 0;
}

int test9()
{
    DataBlock data;
    CPUBitmap bitmap(DIM, DIM, &data);
    unsigned char *dev_bitmap;

    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
    data.dev_bitmap = dev_bitmap;

    dim3 grids(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    julia_kernel<<<grids, threads>>>(dev_bitmap);

    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(dev_bitmap));

    bitmap.display_and_exit();
    return 0;
}

int test10()
{
    DataBlock data;

    // capture the start time
    cudaEvent_t start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));
    HANDLE_ERROR(cudaEventRecord(start, 0));

    CPUBitmap bitmap(DIM, DIM, &data);
    unsigned char *dev_bitmap;
    Sphere *s;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
    // allocate memory for the Sphere dataset
    HANDLE_ERROR(cudaMalloc((void**)&s, sizeof(Sphere) * SPHERES));

    // allocate temp memory, initialize it, copy to memory on the GPU, then free our temp memory
    Sphere *temp_s = (Sphere*)malloc(sizeof(Sphere) * SPHERES);
    for (int i = 0; i < SPHERES; i++) {
        temp_s[i].r = rnd(1.0f);
        temp_s[i].g = rnd(1.0f);
        temp_s[i].b = rnd(1.0f);
        temp_s[i].x = rnd(1000.0f) - 500;
        temp_s[i].y = rnd(1000.0f) - 500;
        temp_s[i].z = rnd(1000.0f) - 500;
        temp_s[i].radius = rnd(100.0f) + 20;
    }
    HANDLE_ERROR(cudaMemcpy(s, temp_s, sizeof(Sphere) * SPHERES, cudaMemcpyHostToDevice));
    free(temp_s);

    // generate a bitmap from our sphere data
    dim3 grids(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    RayTracing_kernel<<<grids, threads>>>(s, dev_bitmap);

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));

    // get stop time, and display the timing results
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    printf("Time to generate:  %3.1f ms\n", elapsedTime);

    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));
    HANDLE_ERROR(cudaFree(dev_bitmap));
    HANDLE_ERROR(cudaFree(s));

    // display
    bitmap.display_and_exit();
    return 0;
}

int test11()
{
    DataBlock data;

    // capture the start time
    cudaEvent_t start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));
    HANDLE_ERROR(cudaEventRecord(start, 0));

    CPUBitmap bitmap(DIM, DIM, &data);
    unsigned char *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));

    // allocate temp memory, initialize it, copy to constant memory on the GPU, then free temp memory
    Sphere *temp_s = (Sphere*)malloc(sizeof(Sphere) * SPHERES);
    for (int i = 0; i < SPHERES; i++) {
        temp_s[i].r = rnd(1.0f);
        temp_s[i].g = rnd(1.0f);
        temp_s[i].b = rnd(1.0f);
        temp_s[i].x = rnd(1000.0f) - 500;
        temp_s[i].y = rnd(1000.0f) - 500;
        temp_s[i].z = rnd(1000.0f) - 500;
        temp_s[i].radius = rnd(100.0f) + 20;
    }
    HANDLE_ERROR(cudaMemcpyToSymbol(s, temp_s, sizeof(Sphere) * SPHERES));
    free(temp_s);

    // generate a bitmap from our sphere data
    dim3 grids(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    RayTracing_kernel<<<grids, threads>>>(dev_bitmap);

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));

    // get stop time, and display the timing results
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    printf("Time to generate:  %3.1f ms\n", elapsedTime);

    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));
    HANDLE_ERROR(cudaFree(dev_bitmap));

    // display
    bitmap.display_and_exit();
    return 0;
}

int test12()
{
    Heat_DataBlock data;
    CPUAnimBitmap bitmap(DIM, DIM, &data);
    data.bitmap = &bitmap;
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR(cudaEventCreate(&data.start));
    HANDLE_ERROR(cudaEventCreate(&data.stop));

    int imageSize = bitmap.image_size();
    HANDLE_ERROR(cudaMalloc((void**)&data.output_bitmap, imageSize));

    // assume float == 4 chars in size (i.e. rgba)
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_inSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_outSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_constSrc, imageSize));

    HANDLE_ERROR(cudaBindTexture(NULL, texConstSrc, data.dev_constSrc, imageSize));
    HANDLE_ERROR(cudaBindTexture(NULL, texIn, data.dev_inSrc, imageSize));
    HANDLE_ERROR(cudaBindTexture(NULL, texOut, data.dev_outSrc, imageSize));

    // initialize the constant data
    float *temp = (float*)malloc(imageSize);
    for (int i = 0; i < DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x > 300) && (x < 600) && (y > 310) && (y < 601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM * 100 + 100] = (MAX_TEMP + MIN_TEMP) / 2;
    temp[DIM * 700 + 100] = MIN_TEMP;
    temp[DIM * 300 + 300] = MIN_TEMP;
    temp[DIM * 200 + 700] = MIN_TEMP;
    for (int y = 800; y < 900; y++) {
        for (int x = 400; x < 500; x++) {
            temp[x + y * DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_constSrc, temp, imageSize, cudaMemcpyHostToDevice));

    // initialize the input data
    for (int y = 800; y < DIM; y++) {
        for (int x = 0; x < 200; x++) {
            temp[x + y * DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_inSrc, temp, imageSize, cudaMemcpyHostToDevice));
    free(temp);

    bitmap.anim_and_exit((void (*)(void*, int))Heat_anim_gpu, (void (*)(void*))Heat_anim_exit);
    return 0;
}

int test13()
{
    Heat_DataBlock data;
    CPUAnimBitmap bitmap(DIM, DIM, &data);
    data.bitmap = &bitmap;
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR(cudaEventCreate(&data.start));
    HANDLE_ERROR(cudaEventCreate(&data.stop));

    int imageSize = bitmap.image_size();
    HANDLE_ERROR(cudaMalloc((void**)&data.output_bitmap, imageSize));

    // assume float == 4 chars in size (i.e. rgba)
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_inSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_outSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_constSrc, imageSize));

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    HANDLE_ERROR(cudaBindTexture2D(NULL, texConstSrc2, data.dev_constSrc, desc, DIM, DIM, sizeof(float) * DIM));
    HANDLE_ERROR(cudaBindTexture2D(NULL, texIn2, data.dev_inSrc, desc, DIM, DIM, sizeof(float) * DIM));
    HANDLE_ERROR(cudaBindTexture2D(NULL, texOut2, data.dev_outSrc, desc, DIM, DIM, sizeof(float) * DIM));

    // initialize the constant data
    float *temp = (float*)malloc(imageSize);
    for (int i = 0; i < DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x > 300) && (x < 600) && (y > 310) && (y < 601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM * 100 + 100] = (MAX_TEMP + MIN_TEMP) / 2;
    temp[DIM * 700 + 100] = MIN_TEMP;
    temp[DIM * 300 + 300] = MIN_TEMP;
    temp[DIM * 200 + 700] = MIN_TEMP;
    for (int y = 800; y < 900; y++) {
        for (int x = 400; x < 500; x++) {
            temp[x + y * DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_constSrc, temp, imageSize, cudaMemcpyHostToDevice));

    // initialize the input data
    for (int y = 800; y < DIM; y++) {
        for (int x = 0; x < 200; x++) {
            temp[x + y * DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_inSrc, temp, imageSize, cudaMemcpyHostToDevice));
    free(temp);

    bitmap.anim_and_exit((void (*)(void*, int))anim_gpu, (void (*)(void*))anim_exit);
    return 0;
}

void Heat_anim_gpu(Heat_DataBlock *d, int ticks)
{
    HANDLE_ERROR(cudaEventRecord(d->start, 0));

    dim3 blocks(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    CPUAnimBitmap *bitmap = d->bitmap;

    // since tex is global and bound, we have to use a flag to
    // select which is in/out per iteration
    volatile bool dstOut = true;
    for (int i = 0; i < 90; i++) {
        float *in, *out;
        if (dstOut) {
            in  = d->dev_inSrc;
            out = d->dev_outSrc;
        } else {
            out = d->dev_inSrc;
            in  = d->dev_outSrc;
        }
        Heat_copy_const_kernel<<<blocks, threads>>>(in);
        Heat_blend_kernel<<<blocks, threads>>>(out, dstOut);
        dstOut = !dstOut;
    }
    float_to_color<<<blocks, threads>>>(d->output_bitmap, d->dev_inSrc);

    HANDLE_ERROR(cudaMemcpy(bitmap->get_ptr(), d->output_bitmap, bitmap->image_size(), cudaMemcpyDeviceToHost));

    HANDLE_ERROR(cudaEventRecord(d->stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(d->stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, d->start, d->stop));
    d->totalTime += elapsedTime;
    ++d->frames;
    printf("Average Time per frame:  %3.1f ms\n", d->totalTime / d->frames);
}

void anim_gpu(Heat_DataBlock *d, int ticks)
{
    HANDLE_ERROR(cudaEventRecord(d->start, 0));

    dim3 blocks(DIM / 16, DIM / 16);
    dim3 threads(16, 16);
    CPUAnimBitmap *bitmap = d->bitmap;

    // since tex is global and bound, we have to use a flag to
    // select which is in/out per iteration
    volatile bool dstOut = true;
    for (int i = 0; i < 90; i++) {
        float *in, *out;
        if (dstOut) {
            in  = d->dev_inSrc;
            out = d->dev_outSrc;
        } else {
            out = d->dev_inSrc;
            in  = d->dev_outSrc;
        }
        copy_const_kernel<<<blocks, threads>>>(in);
        blend_kernel<<<blocks, threads>>>(out, dstOut);
        dstOut = !dstOut;
    }
    float_to_color<<<blocks, threads>>>(d->output_bitmap, d->dev_inSrc);

    HANDLE_ERROR(cudaMemcpy(bitmap->get_ptr(), d->output_bitmap, bitmap->image_size(), cudaMemcpyDeviceToHost));

    HANDLE_ERROR(cudaEventRecord(d->stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(d->stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, d->start, d->stop));
    d->totalTime += elapsedTime;
    ++d->frames;
    printf("Average Time per frame:  %3.1f ms\n", d->totalTime / d->frames);
}

void Heat_anim_exit(Heat_DataBlock *d)
{
    cudaUnbindTexture(texIn);
    cudaUnbindTexture(texOut);
    cudaUnbindTexture(texConstSrc);
    HANDLE_ERROR(cudaFree(d->dev_inSrc));
    HANDLE_ERROR(cudaFree(d->dev_outSrc));
    HANDLE_ERROR(cudaFree(d->dev_constSrc));
    HANDLE_ERROR(cudaEventDestroy(d->start));
    HANDLE_ERROR(cudaEventDestroy(d->stop));
}

// clean up memory allocated on the GPU
void anim_exit(Heat_DataBlock *d)
{
    cudaUnbindTexture(texIn2);
    cudaUnbindTexture(texOut2);
    cudaUnbindTexture(texConstSrc2);
    HANDLE_ERROR(cudaFree(d->dev_inSrc));
    HANDLE_ERROR(cudaFree(d->dev_outSrc));
    HANDLE_ERROR(cudaFree(d->dev_constSrc));
    HANDLE_ERROR(cudaEventDestroy(d->start));
    HANDLE_ERROR(cudaEventDestroy(d->stop));
}

int test14()
{
    GPUAnimBitmap bitmap(DIM, DIM, NULL);
    bitmap.anim_and_exit((void (*)(uchar4*, void*, int))generate_frame_opengl, NULL);
    return 0;
}

int test15()
{
    DataBlock_opengl data;
    GPUAnimBitmap bitmap(DIM, DIM, &data);
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR(cudaEventCreate(&data.start));
    HANDLE_ERROR(cudaEventCreate(&data.stop));

    int imageSize = bitmap.image_size();

    // assume float == 4 chars in size (i.e. rgba)
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_inSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_outSrc, imageSize));
    HANDLE_ERROR(cudaMalloc((void**)&data.dev_constSrc, imageSize));

    HANDLE_ERROR(cudaBindTexture(NULL, texConstSrc, data.dev_constSrc, imageSize));
    HANDLE_ERROR(cudaBindTexture(NULL, texIn, data.dev_inSrc, imageSize));
    HANDLE_ERROR(cudaBindTexture(NULL, texOut, data.dev_outSrc, imageSize));

    // initialize the constant data
    float *temp = (float*)malloc(imageSize);
    for (int i = 0; i < DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x > 300) && (x < 600) && (y > 310) && (y < 601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM * 100 + 100] = (MAX_TEMP + MIN_TEMP) / 2;
    temp[DIM * 700 + 100] = MIN_TEMP;
    temp[DIM * 300 + 300] = MIN_TEMP;
    temp[DIM * 200 + 700] = MIN_TEMP;
    for (int y = 800; y < 900; y++) {
        for (int x = 400; x < 500; x++) {
            temp[x + y * DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_constSrc, temp, imageSize, cudaMemcpyHostToDevice));

    // initialize the input data
    for (int y = 800; y < DIM; y++) {
        for (int x = 0; x < 200; x++) {
            temp[x + y * DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR(cudaMemcpy(data.dev_inSrc, temp, imageSize, cudaMemcpyHostToDevice));
    free(temp);

    bitmap.anim_and_exit((void (*)(uchar4*, void*, int))anim_gpu_opengl, (void (*)(void*))anim_exit_opengl);
    return 0;
}

void anim_gpu_opengl(uchar4* outputBitmap, DataBlock_opengl *d, int ticks)
{
    HANDLE_ERROR(cudaEventRecord(d->start, 0));

    dim3 blocks(DIM / 16, DIM / 16);
    dim3 threads(16, 16);

    // since tex is global and bound, we have to use a flag to select which is in/out per iteration
    volatile bool dstOut = true;
    for (int i = 0; i < 90; i++) {
        float *in, *out;
        if (dstOut) {
            in  = d->dev_inSrc;
            out = d->dev_outSrc;
        } else {
            out = d->dev_inSrc;
            in  = d->dev_outSrc;
        }
        Heat_copy_const_kernel_opengl<<<blocks, threads>>>(in);
        Heat_blend_kernel_opengl<<<blocks, threads>>>(out, dstOut);
        dstOut = !dstOut;
    }
    float_to_color<<<blocks, threads>>>(outputBitmap, d->dev_inSrc);

    HANDLE_ERROR(cudaEventRecord(d->stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(d->stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, d->start, d->stop));
    d->totalTime += elapsedTime;
    ++d->frames;
    printf("Average Time per frame:  %3.1f ms\n", d->totalTime / d->frames);
}

void anim_exit_opengl(DataBlock_opengl *d)
{
    HANDLE_ERROR(cudaUnbindTexture(texIn));
    HANDLE_ERROR(cudaUnbindTexture(texOut));
    HANDLE_ERROR(cudaUnbindTexture(texConstSrc));
    HANDLE_ERROR(cudaFree(d->dev_inSrc));
    HANDLE_ERROR(cudaFree(d->dev_outSrc));
    HANDLE_ERROR(cudaFree(d->dev_constSrc));
    HANDLE_ERROR(cudaEventDestroy(d->start));
    HANDLE_ERROR(cudaEventDestroy(d->stop));
}

int test16()
{
    unsigned char *buffer = (unsigned char*)big_random_block(SIZE);

    // capture the start time. starting the timer here means we include the cost of
    // all of the operations on the GPU. if the data were already on the GPU and we
    // just timed the kernel, the timing would drop from 74 ms to 15 ms. Very fast.
    cudaEvent_t start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));
    HANDLE_ERROR(cudaEventRecord(start, 0));

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR(cudaMalloc((void**)&dev_buffer, SIZE));
    HANDLE_ERROR(cudaMemcpy(dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMalloc((void**)&dev_histo, 256 * sizeof(int)));
    HANDLE_ERROR(cudaMemset(dev_histo, 0, 256 * sizeof(int)));

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp prop;
    HANDLE_ERROR(cudaGetDeviceProperties(&prop, 0));
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks * 2, 256>>>(dev_buffer, SIZE, dev_histo);

    unsigned int histo[256];
    HANDLE_ERROR(cudaMemcpy(histo, dev_histo, 256 * sizeof(int), cudaMemcpyDeviceToHost));

    // get stop time, and display the timing results
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    float elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    printf("Time to generate:  %3.1f ms\n", elapsedTime);

    long histoCount = 0;
    for (int i = 0; i < 256; i++) {
        histoCount += histo[i];
    }
    printf("Histogram Sum:  %ld\n", histoCount);

    // verify that we have the same counts via CPU
    for (int i = 0; i < SIZE; i++)
        histo[buffer[i]]--;
    for (int i = 0; i < 256; i++) {
        if (histo[i] != 0)
            printf("Failure at %d!\n", i);
    }

    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));
    cudaFree(dev_histo);
    cudaFree(dev_buffer);
    free(buffer);
    return 0;
}

float cuda_malloc_test(int size, bool up)
{
    cudaEvent_t start, stop;
    int *a, *dev_a;
    float elapsedTime;

    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));

    a = (int*)malloc(size * sizeof(*a));
    HANDLE_NULL(a);
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, size * sizeof(*dev_a)));

    HANDLE_ERROR(cudaEventRecord(start, 0));
    for (int i = 0; i < 100; i++) {
        if (up)
            HANDLE_ERROR(cudaMemcpy(dev_a, a, size * sizeof(*dev_a), cudaMemcpyHostToDevice));
        else
            HANDLE_ERROR(cudaMemcpy(a, dev_a, size * sizeof(*dev_a), cudaMemcpyDeviceToHost));
    }
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));

    free(a);
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));
    return elapsedTime;
}

float cuda_host_alloc_test(int size, bool up)
{
    cudaEvent_t start, stop;
    int *a, *dev_a;
    float elapsedTime;

    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));

    HANDLE_ERROR(cudaHostAlloc((void**)&a, size * sizeof(*a), cudaHostAllocDefault));
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, size * sizeof(*dev_a)));

    HANDLE_ERROR(cudaEventRecord(start, 0));
    for (int i = 0; i < 100; i++) {
        if (up)
            HANDLE_ERROR(cudaMemcpy(dev_a, a, size * sizeof(*a), cudaMemcpyHostToDevice));
        else
            HANDLE_ERROR(cudaMemcpy(a, dev_a, size * sizeof(*a), cudaMemcpyDeviceToHost));
    }
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));

    HANDLE_ERROR(cudaFreeHost(a));
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));
    return elapsedTime;
}

int test17()
{
    float elapsedTime;
    float MB = (float)100 * SIZE * sizeof(int) / 1024 / 1024;

    // try it with cudaMalloc
    elapsedTime = cuda_malloc_test(SIZE, true);
    printf("Time using cudaMalloc:  %3.1f ms\n", elapsedTime);
    printf("\tMB/s during copy up:  %3.1f\n", MB / (elapsedTime / 1000));

    elapsedTime = cuda_malloc_test(SIZE, false);
    printf("Time using cudaMalloc:  %3.1f ms\n", elapsedTime);
    printf("\tMB/s during copy down:  %3.1f\n", MB / (elapsedTime / 1000));

    // now try it with cudaHostAlloc
    elapsedTime = cuda_host_alloc_test(SIZE, true);
    printf("Time using cudaHostAlloc:  %3.1f ms\n", elapsedTime);
    printf("\tMB/s during copy up:  %3.1f\n", MB / (elapsedTime / 1000));

    elapsedTime = cuda_host_alloc_test(SIZE, false);
    printf("Time using cudaHostAlloc:  %3.1f ms\n", elapsedTime);
    printf("\tMB/s during copy down:  %3.1f\n", MB / (elapsedTime / 1000));
    return 0;
}

int test18()
{
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR(cudaGetDevice(&whichDevice));
    HANDLE_ERROR(cudaGetDeviceProperties(&prop, whichDevice));
    if (!prop.deviceOverlap) {
        printf("Device will not handle overlaps, so no speed up from streams\n");
        return 0;
    }

    cudaEvent_t start, stop;
    float elapsedTime;
    cudaStream_t stream;
    int *host_a, *host_b, *host_c;
    int *dev_a, *dev_b, *dev_c;

    // start the timers
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));

    // initialize the stream
    HANDLE_ERROR(cudaStreamCreate(&stream));

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, NUM * sizeof(int)));

    // allocate host locked memory, used to stream
    HANDLE_ERROR(cudaHostAlloc((void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));
    HANDLE_ERROR(cudaHostAlloc((void**)&host_b, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));
    HANDLE_ERROR(cudaHostAlloc((void**)&host_c, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));

    for (int i = 0; i < FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

    HANDLE_ERROR(cudaEventRecord(start, 0));
    // now loop over full data, in bite-sized chunks
    for (int i = 0; i < FULL_DATA_SIZE; i += NUM) {
        // copy the locked memory to the device, async
        HANDLE_ERROR(cudaMemcpyAsync(dev_a, host_a + i, NUM * sizeof(int), cudaMemcpyHostToDevice, stream));
        HANDLE_ERROR(cudaMemcpyAsync(dev_b, host_b + i, NUM * sizeof(int), cudaMemcpyHostToDevice, stream));

        singlestream_kernel<<<NUM / 256, 256, 0, stream>>>(dev_a, dev_b, dev_c);

        // copy the data from device to locked memory
        HANDLE_ERROR(cudaMemcpyAsync(host_c + i, dev_c, NUM * sizeof(int), cudaMemcpyDeviceToHost, stream));
    }
    // copy result chunk from locked to full buffer
    HANDLE_ERROR(cudaStreamSynchronize(stream));

    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    printf("Time taken:  %3.1f ms\n", elapsedTime);

    // cleanup the streams and memory
    HANDLE_ERROR(cudaFreeHost(host_a));
    HANDLE_ERROR(cudaFreeHost(host_b));
    HANDLE_ERROR(cudaFreeHost(host_c));
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_c));
    HANDLE_ERROR(cudaStreamDestroy(stream));
    return 0;
}

int test19()
{
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR(cudaGetDevice(&whichDevice));
    HANDLE_ERROR(cudaGetDeviceProperties(&prop, whichDevice));
    if (!prop.deviceOverlap) {
        printf("Device will not handle overlaps, so no speed up from streams\n");
        return 0;
    }

    // start the timers
    cudaEvent_t start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));

    // initialize the streams
    cudaStream_t stream0, stream1;
    HANDLE_ERROR(cudaStreamCreate(&stream0));
    HANDLE_ERROR(cudaStreamCreate(&stream1));

    int *host_a, *host_b, *host_c;
    int *dev_a0, *dev_b0, *dev_c0; // GPU memory allocated for stream 0
    int *dev_a1, *dev_b1, *dev_c1; // GPU memory allocated for stream 1

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a0, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b0, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c0, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_a1, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b1, NUM * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c1, NUM * sizeof(int)));

    // allocate host locked memory, used to stream
    HANDLE_ERROR(cudaHostAlloc((void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));
    HANDLE_ERROR(cudaHostAlloc((void**)&host_b, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));
    HANDLE_ERROR(cudaHostAlloc((void**)&host_c, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault));

    for (int i = 0; i < FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

    HANDLE_ERROR(cudaEventRecord(start, 0));
    // now loop over full data, in bite-sized chunks
    for (int i = 0; i < FULL_DATA_SIZE; i += NUM * 2) {
        // enqueue copies of a in stream0 and stream1;
        // copy the locked memory to the device asynchronously
        HANDLE_ERROR(cudaMemcpyAsync(dev_a0, host_a + i, NUM * sizeof(int), cudaMemcpyHostToDevice, stream0));
        HANDLE_ERROR(cudaMemcpyAsync(dev_a1, host_a + i + NUM, NUM * sizeof(int), cudaMemcpyHostToDevice, stream1));
        // enqueue copies of b in stream0 and stream1
        HANDLE_ERROR(cudaMemcpyAsync(dev_b0, host_b + i, NUM * sizeof(int), cudaMemcpyHostToDevice, stream0));
        HANDLE_ERROR(cudaMemcpyAsync(dev_b1, host_b + i + NUM, NUM * sizeof(int), cudaMemcpyHostToDevice, stream1));

        // enqueue kernels in stream0 and stream1
        singlestream_kernel<<<NUM / 256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
        singlestream_kernel<<<NUM / 256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);

        // enqueue copies of c from device to locked memory
        HANDLE_ERROR(cudaMemcpyAsync(host_c + i, dev_c0, NUM * sizeof(int), cudaMemcpyDeviceToHost, stream0));
        HANDLE_ERROR(cudaMemcpyAsync(host_c + i + NUM, dev_c1, NUM * sizeof(int), cudaMemcpyDeviceToHost, stream1));
    }

    float elapsedTime;
    HANDLE_ERROR(cudaStreamSynchronize(stream0));
    HANDLE_ERROR(cudaStreamSynchronize(stream1));
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    printf("Time taken:  %3.1f ms\n", elapsedTime);

    // cleanup the streams and memory
    HANDLE_ERROR(cudaFreeHost(host_a));
    HANDLE_ERROR(cudaFreeHost(host_b));
    HANDLE_ERROR(cudaFreeHost(host_c));
    HANDLE_ERROR(cudaFree(dev_a0));
    HANDLE_ERROR(cudaFree(dev_b0));
    HANDLE_ERROR(cudaFree(dev_c0));
    HANDLE_ERROR(cudaFree(dev_a1));
    HANDLE_ERROR(cudaFree(dev_b1));
    HANDLE_ERROR(cudaFree(dev_c1));
    HANDLE_ERROR(cudaStreamDestroy(stream0));
    HANDLE_ERROR(cudaStreamDestroy(stream1));
    return 0;
}float malloc_test(int size)
{cudaEvent_t start, stop;float *a, *b, c, *partial_c;float *dev_a, *dev_b, *dev_partial_c;float elapsedTime;HANDLE_ERROR(cudaEventCreate(&start));HANDLE_ERROR(cudaEventCreate(&stop));//allocate memory on the CPU sidea = (float*)malloc(size * sizeof(float));b = (float*)malloc(size * sizeof(float));partial_c = (float*)malloc(blocksPerGrid * sizeof(float));//allocate the memory on the GPUHANDLE_ERROR(cudaMalloc((void**)&dev_a, size * sizeof(float)));HANDLE_ERROR(cudaMalloc((void**)&dev_b, size * sizeof(float)));HANDLE_ERROR(cudaMalloc((void**)&dev_partial_c, blocksPerGrid * sizeof(float)));//fill in the host memory with datafor (int i=0; i<size; i++) {a[i] = i;b[i] = i * 2;}HANDLE_ERROR(cudaEventRecord(start, 0));//copy the arrays 'a' and 'b' to the GPUHANDLE_ERROR(cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice));HANDLE_ERROR(cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice)); dot_kernel<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b, dev_partial_c);//copy the array 'c' back from the GPU to the CPUHANDLE_ERROR(cudaMemcpy(partial_c, dev_partial_c,blocksPerGrid*sizeof(float), cudaMemcpyDeviceToHost));HANDLE_ERROR(cudaEventRecord(stop, 0));HANDLE_ERROR(cudaEventSynchronize(stop));HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime,start, stop));//finish up on the CPU sidec = 0;for (int i=0; i<blocksPerGrid; i++) {c += partial_c[i];}HANDLE_ERROR(cudaFree(dev_a));HANDLE_ERROR(cudaFree(dev_b));HANDLE_ERROR(cudaFree(dev_partial_c));//free memory on the CPU sidefree(a);free(b);free(partial_c);//free eventsHANDLE_ERROR(cudaEventDestroy(start));HANDLE_ERROR(cudaEventDestroy(stop));printf("Value calculated:  %f\n", c);return elapsedTime;
}float cuda_host_alloc_test(int size)
{cudaEvent_t start, stop;float *a, *b, c, *partial_c;float *dev_a, *dev_b, *dev_partial_c;float elapsedTime;HANDLE_ERROR(cudaEventCreate(&start));HANDLE_ERROR(cudaEventCreate(&stop));//allocate the memory on the CPUHANDLE_ERROR(cudaHostAlloc((void**)&a, size*sizeof(float), cudaHostAllocWriteCombined |cudaHostAllocMapped));HANDLE_ERROR(cudaHostAlloc((void**)&b, size*sizeof(float), cudaHostAllocWriteCombined |cudaHostAllocMapped));HANDLE_ERROR(cudaHostAlloc((void**)&partial_c, blocksPerGrid*sizeof(float), cudaHostAllocMapped));//find out the GPU pointersHANDLE_ERROR(cudaHostGetDevicePointer(&dev_a, a, 0));HANDLE_ERROR(cudaHostGetDevicePointer(&dev_b, b, 0));HANDLE_ERROR( cudaHostGetDevicePointer(&dev_partial_c, partial_c, 0));//fill in the host memory with datafor (int i=0; i<size; i++) {a[i] = i;b[i] = i*2;}HANDLE_ERROR(cudaEventRecord(start, 0));dot_kernel<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b, dev_partial_c);HANDLE_ERROR(cudaThreadSynchronize());HANDLE_ERROR(cudaEventRecord(stop, 0));HANDLE_ERROR(cudaEventSynchronize(stop));HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime,start, stop));//finish up on the CPU sidec = 0;for (int i=0; i<blocksPerGrid; i++) {c += partial_c[i];}HANDLE_ERROR(cudaFreeHost(a));HANDLE_ERROR(cudaFreeHost(b));HANDLE_ERROR(cudaFreeHost(partial_c));// free eventsHANDLE_ERROR(cudaEventDestroy(start));HANDLE_ERROR(cudaEventDestroy(stop));printf("Value calculated:  %f\n", c);return elapsedTime;
}int test20()
{cudaDeviceProp prop;int whichDevice;HANDLE_ERROR(cudaGetDevice(&whichDevice));HANDLE_ERROR(cudaGetDeviceProperties(&prop, whichDevice));if (prop.canMapHostMemory != 1) {printf( "Device can not map memory.\n" );return 0;}HANDLE_ERROR(cudaSetDeviceFlags(cudaDeviceMapHost));//try it with mallocfloat elapsedTime = malloc_test(NUM);printf("Time using cudaMalloc:  %3.1f ms\n", elapsedTime);//now try it with cudaHostAllocelapsedTime = cuda_host_alloc_test(NUM);printf("Time using cudaHostAlloc:  %3.1f ms\n", elapsedTime);return 0;
}void* routine(void *pvoidData)
{DataStruct *data = (DataStruct*)pvoidData;HANDLE_ERROR(cudaSetDevice(data->deviceID));int size = data->size;float *a, *b, c, *partial_c;float *dev_a, *dev_b, *dev_partial_c;//allocate memory on the CPU sidea = data->a;b = data->b;partial_c = (float*)malloc(blocksPerGrid * sizeof(float));//allocate the memory on the GPUHANDLE_ERROR(cudaMalloc((void**)&dev_a, size * sizeof(float)));HANDLE_ERROR(cudaMalloc((void**)&dev_b, size * sizeof(float)));HANDLE_ERROR(cudaMalloc((void**)&dev_partial_c, blocksPerGrid*sizeof(float)));//copy the arrays 'a' and 'b' to the GPUHANDLE_ERROR(cudaMemcpy(dev_a, a, size*sizeof(float), cudaMemcpyHostToDevice));HANDLE_ERROR(cudaMemcpy(dev_b, b, size*sizeof(float), cudaMemcpyHostToDevice)); dot_kernel<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b, dev_partial_c);//copy the array 'c' back from the GPU to the CPUHANDLE_ERROR(cudaMemcpy( partial_c, dev_partial_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost));//finish up on the CPU sidec = 0;for (int i=0; i<blocksPerGrid; i++) {c += partial_c[i];}HANDLE_ERROR(cudaFree(dev_a));HANDLE_ERROR(cudaFree(dev_b));HANDLE_ERROR(cudaFree(dev_partial_c));//free memory on the CPU sidefree(partial_c);data->returnValue = c;return 0;
}int test21()
{int deviceCount;HANDLE_ERROR(cudaGetDeviceCount(&deviceCount));if (deviceCount < 2) {printf("We need at least two compute 1.0 or greater devices, but only found %d\n", deviceCount);return 0;}float *a = (float*)malloc(sizeof(float) * NUM);HANDLE_NULL(a);float *b = (float*)malloc(sizeof(float) * NUM);HANDLE_NULL(b);//fill in the host memory with datafor (int i=0; i<NUM; i++) {a[i] = i;b[i] = i*2;}//prepare for multithreadDataStruct  data[2];data[0].deviceID = 0;data[0].size = NUM/2;data[0].a = a;data[0].b = b;data[1].deviceID = 1;data[1].size = NUM/2;data[1].a = a + NUM/2;data[1].b = b + NUM/2;CUTThread thread = start_thread(routine, &(data[0]));routine(&(data[1]));end_thread(thread);//free memory on the CPU sidefree(a);free(b);printf("Value calculated:  %f\n", data[0].returnValue + data[1].returnValue);return 0;
}

funset.cuh:

#ifndef _FUNSET_CUH_
#define _FUNSET_CUH_

#include <stdio.h>
#include "cpu_anim.h"

#define NUM  33 * 1024 * 1024 // alternatives: 1024*1024, 33*1024, 10
#define DIM 1024 // alternative: 1000
#define PI 3.1415926535897932f
#define imin(a, b) (a < b ? a : b)
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (NUM/2 + threadsPerBlock - 1) / threadsPerBlock); // imin(32, (NUM + threadsPerBlock - 1) / threadsPerBlock);
#define rnd(x) (x * rand() / RAND_MAX)
#define INF 2e10f
#define SPHERES 20
#define MAX_TEMP 1.0f
#define MIN_TEMP 0.0001f
#define SPEED 0.25f
#define SIZE (100*1024*1024)
#define FULL_DATA_SIZE (NUM*20)

// the __global__ qualifier tells the compiler that the function should be
// compiled to run on the device rather than on the host
__global__ void add(int a, int b, int* c);
__global__ void add_blockIdx(int* a, int* b, int* c);
__global__ void add_threadIdx(int* a, int* b, int* c);
__global__ void add_blockIdx_threadIdx(int* a, int* b, int* c);

struct cuComplex {
	float r, i;
	__device__ cuComplex(float a, float b) : r(a), i(b) {}
	__device__ float magnitude2(void) { return r * r + i * i; }
	__device__ cuComplex operator*(const cuComplex& a) { return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i); }
	__device__ cuComplex operator+(const cuComplex& a) { return cuComplex(r+a.r, i+a.i); }
};

__device__ int julia(int x, int y);
__global__ void kernel_julia(unsigned char *ptr);
__global__ void ripple_kernel(unsigned char *ptr, int ticks);

struct Sphere;
struct DataBlock {
	unsigned char *dev_bitmap;
	CPUAnimBitmap *bitmap;
	Sphere *s;
};

void generate_frame(DataBlock *d, int ticks);
void cleanup(DataBlock *d);
__global__ void dot_kernel(float *a, float *b, float *c);
__global__ void julia_kernel(unsigned char *ptr);

// model a sphere with a data structure
struct Sphere {
	float r, b, g;
	float radius;
	float x, y, z;
	__device__ float hit(float ox, float oy, float *n) {
		float dx = ox - x;
		float dy = oy - y;
		if (dx*dx + dy*dy < radius*radius) {
			float dz = sqrtf(radius*radius - dx*dx - dy*dy);
			*n = dz / sqrtf(radius * radius);
			return dz + z;
		}
		return -INF;
	}
};

// declared in constant memory; __constant__ restricts access to read-only
__constant__ Sphere s[SPHERES];

__global__ void RayTracing_kernel(Sphere *s, unsigned char *ptr);
__global__ void RayTracing_kernel(unsigned char *ptr);

// these exist on the GPU side
texture<float> texConstSrc, texIn, texOut;
texture<float, 2> texConstSrc2, texIn2, texOut2;

// this kernel takes in a 2-d array of floats; it updates the value-of-interest
// by a scaled value based on itself and its nearest neighbors
__global__ void Heat_blend_kernel(float *dst, bool dstOut);
__global__ void blend_kernel(float *dst, bool dstOut);
__global__ void Heat_copy_const_kernel(float *iptr);
__global__ void copy_const_kernel(float *iptr);

struct Heat_DataBlock {
	unsigned char *output_bitmap;
	float         *dev_inSrc;
	float         *dev_outSrc;
	float         *dev_constSrc;
	CPUAnimBitmap *bitmap;
	cudaEvent_t    start, stop;
	float          totalTime;
	float          frames;
};

// globals needed by the update routine
struct DataBlock_opengl {
	float       *dev_inSrc;
	float       *dev_outSrc;
	float       *dev_constSrc;
	cudaEvent_t  start, stop;
	float        totalTime;
	float        frames;
};

void Heat_anim_gpu(Heat_DataBlock *d, int ticks);
void anim_gpu(Heat_DataBlock *d, int ticks);
// clean up memory allocated on the GPU
void Heat_anim_exit(Heat_DataBlock *d);
void anim_exit(Heat_DataBlock *d);
void generate_frame_opengl(uchar4 *pixels, void*, int ticks);
__global__ void ripple_kernel_opengl(uchar4 *ptr, int ticks);
__global__ void Heat_blend_kernel_opengl(float *dst, bool dstOut);
__global__ void Heat_copy_const_kernel_opengl(float *iptr);
void anim_gpu_opengl(uchar4* outputBitmap, DataBlock_opengl *d, int ticks);
void anim_exit_opengl(DataBlock_opengl *d);
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo);
__global__ void singlestream_kernel(int *a, int *b, int *c);
__global__ void dot_kernel(int size, float *a, float *b, float *c);

struct DataStruct {
	int    deviceID;
	int    size;
	float *a;
	float *b;
	float  returnValue;
};

#endif // _FUNSET_CUH_

funset.cu:

#include "funset.cuh"
#include <stdio.h>

__global__ void add(int a, int b, int* c)
{
	*c = a + b;
}

// __global__: called from the host, runs on the device
__global__ void add_blockIdx(int* a, int* b, int* c)
{
	// compute the data at this index;
	// blockIdx is a built-in variable predefined by the CUDA runtime:
	// it holds the index of the thread block currently executing the device code
	int tid = blockIdx.x; // this thread handles the data at its thread id
	if (tid < NUM)
		c[tid] = a[tid] + b[tid];
}

// __device__: the code runs on the GPU rather than on the host;
// since this function is declared __device__, it can only be called
// from other __device__ functions or from __global__ functions
__device__ int julia(int x, int y)
{
	const float scale = 1.5;
	float jx = scale * (float)(DIM/2 - x) / (DIM/2);
	float jy = scale * (float)(DIM/2 - y) / (DIM/2);

	cuComplex c(-0.8, 0.156);
	cuComplex a(jx, jy);

	int i = 0;
	for (i = 0; i < 200; i++) {
		a = a * a + c;
		if (a.magnitude2() > 1000)
			return 0;
	}
	return 1;
}

__global__ void kernel_julia(unsigned char *ptr)
{
	// map from blockIdx to pixel position;
	// gridDim is a built-in variable: for all thread blocks it is a constant
	// holding the size of each dimension of the grid; here gridDim is (DIM, DIM)
	int x = blockIdx.x;
	int y = blockIdx.y;
	int offset = x + y * gridDim.x;

	// now calculate the value at that position
	int juliaValue = julia(x, y);
	ptr[offset*4 + 0] = 255 * juliaValue;
	ptr[offset*4 + 1] = 0;
	ptr[offset*4 + 2] = 0;
	ptr[offset*4 + 3] = 255;
}

__global__ void add_threadIdx(int* a, int* b, int* c)
{
	// index the data with the thread index rather than the block index (blockIdx.x)
	int tid = threadIdx.x;
	if (tid < NUM)
		c[tid] = a[tid] + b[tid];
}

__global__ void add_blockIdx_threadIdx(int* a, int* b, int* c)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	if (tid == 0) {
		printf("blockDim.x = %d, gridDim.x = %d\n", blockDim.x, gridDim.x);
	}
	while (tid < NUM) {
		c[tid] = a[tid] + b[tid];
		tid += blockDim.x * gridDim.x;
	}
}

__global__ void ripple_kernel(unsigned char *ptr, int ticks)
{
	// map from threadIdx/blockIdx to pixel position,
	// then linearize x and y into an offset in the output buffer
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	// now calculate the value at that position:
	// a sinusoidal "ripple" that varies with time
	float fx = x - DIM/2;
	float fy = y - DIM/2;
	float d = sqrtf(fx * fx + fy * fy);
	unsigned char grey = (unsigned char)(128.0f + 127.0f * cos(d/10.0f - ticks/7.0f) / (d/10.0f + 1.0f));
	ptr[offset*4 + 0] = grey;
	ptr[offset*4 + 1] = grey;
	ptr[offset*4 + 2] = grey;
	ptr[offset*4 + 3] = 255;
}

__global__ void dot_kernel(float *a, float *b, float *c)
{
	// a shared-memory buffer that holds each thread's running sum
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	int cacheIndex = threadIdx.x;

	float temp = 0;
	while (tid < NUM) {
		temp += a[tid] * b[tid];
		tid += blockDim.x * gridDim.x;
	}

	// set the cache values
	cache[cacheIndex] = temp;

	// synchronize threads in this block: every thread in the block must finish
	// the statements before __syncthreads() before any thread executes the next statement
	__syncthreads();

	// for reductions, threadsPerBlock must be a power of 2 because of the following code
	int i = blockDim.x / 2;
	while (i != 0) {
		if (cacheIndex < i)
			cache[cacheIndex] += cache[cacheIndex + i];
		// each iteration updates the shared variable cache; before the next
		// iteration starts, every thread's update must be complete
		__syncthreads();
		i /= 2;
	}

	if (cacheIndex == 0)
		c[blockIdx.x] = cache[0];
}

__global__ void julia_kernel(unsigned char *ptr)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	__shared__ float shared[16][16];

	// now calculate the value at that position
	const float period = 128.0f;
	shared[threadIdx.x][threadIdx.y] = 255 * (sinf(x*2.0f*PI / period) + 1.0f) *
		(sinf(y*2.0f*PI / period) + 1.0f) / 4.0f;

	// removing this __syncthreads() shows graphically what happens
	// when it doesn't exist; this is an example of why we need it
	__syncthreads();

	ptr[offset*4 + 0] = 0;
	ptr[offset*4 + 1] = shared[15 - threadIdx.x][15 - threadIdx.y];
	ptr[offset*4 + 2] = 0;
	ptr[offset*4 + 3] = 255;
}

__global__ void RayTracing_kernel(Sphere *s, unsigned char *ptr)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;
	float ox = (x - DIM/2);
	float oy = (y - DIM/2);

	float r = 0, g = 0, b = 0;
	float maxz = -INF;
	for (int i = 0; i < SPHERES; i++) {
		float n;
		float t = s[i].hit(ox, oy, &n);
		if (t > maxz) {
			float fscale = n;
			r = s[i].r * fscale;
			g = s[i].g * fscale;
			b = s[i].b * fscale;
			maxz = t;
		}
	}

	ptr[offset*4 + 0] = (int)(r * 255);
	ptr[offset*4 + 1] = (int)(g * 255);
	ptr[offset*4 + 2] = (int)(b * 255);
	ptr[offset*4 + 3] = 255;
}

__global__ void RayTracing_kernel(unsigned char *ptr)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;
	float ox = (x - DIM/2);
	float oy = (y - DIM/2);

	float r = 0, g = 0, b = 0;
	float maxz = -INF;
	for (int i = 0; i < SPHERES; i++) {
		float n;
		float t = s[i].hit(ox, oy, &n);
		if (t > maxz) {
			float fscale = n;
			r = s[i].r * fscale;
			g = s[i].g * fscale;
			b = s[i].b * fscale;
			maxz = t;
		}
	}

	ptr[offset*4 + 0] = (int)(r * 255);
	ptr[offset*4 + 1] = (int)(g * 255);
	ptr[offset*4 + 2] = (int)(b * 255);
	ptr[offset*4 + 3] = 255;
}

__global__ void Heat_blend_kernel(float *dst, bool dstOut)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	int left = offset - 1;
	int right = offset + 1;
	if (x == 0) left++;
	if (x == DIM-1) right--;

	int top = offset - DIM;
	int bottom = offset + DIM;
	if (y == 0) top += DIM;
	if (y == DIM-1) bottom -= DIM;

	float t, l, c, r, b;
	if (dstOut) {
		// tex1Dfetch is a compiler intrinsic that fetches from a texture in device memory
		t = tex1Dfetch(texIn, top);
		l = tex1Dfetch(texIn, left);
		c = tex1Dfetch(texIn, offset);
		r = tex1Dfetch(texIn, right);
		b = tex1Dfetch(texIn, bottom);
	} else {
		t = tex1Dfetch(texOut, top);
		l = tex1Dfetch(texOut, left);
		c = tex1Dfetch(texOut, offset);
		r = tex1Dfetch(texOut, right);
		b = tex1Dfetch(texOut, bottom);
	}
	dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

__global__ void blend_kernel(float *dst, bool dstOut)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	float t, l, c, r, b;
	if (dstOut) {
		t = tex2D(texIn2, x, y-1);
		l = tex2D(texIn2, x-1, y);
		c = tex2D(texIn2, x, y);
		r = tex2D(texIn2, x+1, y);
		b = tex2D(texIn2, x, y+1);
	} else {
		t = tex2D(texOut2, x, y-1);
		l = tex2D(texOut2, x-1, y);
		c = tex2D(texOut2, x, y);
		r = tex2D(texOut2, x+1, y);
		b = tex2D(texOut2, x, y+1);
	}
	dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

__global__ void Heat_copy_const_kernel(float *iptr)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	float c = tex1Dfetch(texConstSrc, offset);
	if (c != 0)
		iptr[offset] = c;
}

__global__ void copy_const_kernel(float *iptr)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	float c = tex2D(texConstSrc2, x, y);
	if (c != 0)
		iptr[offset] = c;
}

void generate_frame_opengl(uchar4 *pixels, void*, int ticks)
{
	dim3 grids(DIM / 16, DIM / 16);
	dim3 threads(16, 16);
	ripple_kernel_opengl<<<grids, threads>>>(pixels, ticks);
}

__global__ void ripple_kernel_opengl(uchar4 *ptr, int ticks)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	// now calculate the value at that position
	float fx = x - DIM / 2;
	float fy = y - DIM / 2;
	float d = sqrtf(fx * fx + fy * fy);
	unsigned char grey = (unsigned char)(128.0f + 127.0f * cos(d/10.0f - ticks/7.0f) / (d/10.0f + 1.0f));
	ptr[offset].x = grey;
	ptr[offset].y = grey;
	ptr[offset].z = grey;
	ptr[offset].w = 255;
}

__global__ void Heat_blend_kernel_opengl(float *dst, bool dstOut)
{
	// map from threadIdx/blockIdx to pixel position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	int left = offset - 1;
	int right = offset + 1;
	if (x == 0) left++;
	if (x == DIM-1) right--;

	int top = offset - DIM;
	int bottom = offset + DIM;
	if (y == 0) top += DIM;
	if (y == DIM-1) bottom -= DIM;

	float t, l, c, r, b;
	if (dstOut) {
		t = tex1Dfetch(texIn, top);
		l = tex1Dfetch(texIn, left);
		c = tex1Dfetch(texIn, offset);
		r = tex1Dfetch(texIn, right);
		b = tex1Dfetch(texIn, bottom);
	} else {
		t = tex1Dfetch(texOut, top);
		l = tex1Dfetch(texOut, left);
		c = tex1Dfetch(texOut, offset);
		r = tex1Dfetch(texOut, right);
		b = tex1Dfetch(texOut, bottom);
	}
	dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

__global__ void Heat_copy_const_kernel_opengl(float *iptr)
{
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;

	float c = tex1Dfetch(texConstSrc, offset);
	if (c != 0)
		iptr[offset] = c;
}

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
	// clear out the accumulation buffer called temp; since we are launched with
	// 256 threads, it is easy to clear that memory with one write per thread
	__shared__ unsigned int temp[256]; // shared-memory buffer
	temp[threadIdx.x] = 0;
	__syncthreads();

	// calculate the starting index and the offset to the next block that each thread will be processing
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;
	while (i < size) {
		atomicAdd(&temp[buffer[i]], 1);
		i += stride;
	}

	// sync the data from the above writes to shared memory, then add the shared
	// memory values to the values from the other thread blocks using global memory
	// atomic adds; same as before, since we have 256 threads, updating the global
	// histogram is just one write per thread!
	__syncthreads();
	atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}

__global__ void singlestream_kernel(int *a, int *b, int *c)
{
	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	if (idx < NUM) {
		int idx1 = (idx + 1) % 256;
		int idx2 = (idx + 2) % 256;
		float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
		float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
		c[idx] = (as + bs) / 2;
	}
}

__global__ void dot_kernel(int size, float *a, float *b, float *c)
{
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	int cacheIndex = threadIdx.x;

	float temp = 0;
	while (tid < size) {
		temp += a[tid] * b[tid];
		tid += blockDim.x * gridDim.x;
	}

	// set the cache values
	cache[cacheIndex] = temp;

	// synchronize threads in this block
	__syncthreads();

	// for reductions, threadsPerBlock must be a power of 2 because of the following code
	int i = blockDim.x / 2;
	while (i != 0) {
		if (cacheIndex < i)
			cache[cacheIndex] += cache[cacheIndex + i];
		__syncthreads();
		i /= 2;
	}

	if (cacheIndex == 0)
		c[blockIdx.x] = cache[0];
}


The notes and code above are compiled from the book *CUDA by Example* (《GPU高性能编程CUDA实战》).

GitHub:https://github.com/fengbingchun/CUDA_Test

