HIP Programming: Multiplying 1D Arrays and 2D Matrices (GPU Parallel Computing)
Contents
- One-Dimensional Array
- Two-Dimensional Matrix
- Summary
One-Dimensional Array
```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(cmd) \
{\
    hipError_t error = cmd;\
    if (error != hipSuccess) { \
        fprintf(stderr, "error: '%s'(%d) at %s:%d\n", hipGetErrorString(error), error, __FILE__, __LINE__); \
        exit(EXIT_FAILURE);\
    }\
}

// The kernel below runs on every launched thread in parallel.
// It is correct when the thread count is at least the number of array
// elements, but wrong when there are fewer threads than elements;
// use the second kernel in that case.
// template <typename T>
// __global__ void
// vector_square(T *C_d, T *A_d, size_t N)
// {
//     size_t i = (blockIdx.x * blockDim.x + threadIdx.x); // thread index, 0 .. totalThreads-1
//     // Only write when the thread index is within the array bounds;
//     // this guard keeps the result correct when more threads are
//     // launched than there are elements.
//     if (i < N) {
//         C_d[i] = A_d[i] * A_d[i];
//     }
// }

template <typename T>
__global__ void
vector_square(T *C_d, T *A_d, size_t N)
{
    size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); // thread index
    size_t totalThreads = blockDim.x * gridDim.x;            // total number of threads
    // This loop handles both cases:
    // - more threads than elements: each thread computes one element;
    // - fewer threads than elements: e.g. with 100 threads and 200 elements,
    //   thread 0 computes indices 0 and 100, so each thread computes two
    //   elements and the work is spread evenly across the threads.
    for (size_t i = offset; i < N; i += totalThreads) {
        C_d[i] = A_d[i] * A_d[i];
    }
}

int main(int argc, char *argv[])
{
    float *A_d, *C_d;
    float *A_h, *C_h;
    size_t N = 600;
    size_t Nbytes = N * sizeof(float);

    hipDeviceProp_t props;
    CHECK(hipGetDeviceProperties(&props, 0/*deviceID*/));
    printf("info: running on device %s, maxThreadsPerBlock %d\n", props.name, props.maxThreadsPerBlock);

    // Allocate the two host arrays.
    A_h = (float*)malloc(Nbytes);
    CHECK(A_h == 0 ? hipErrorOutOfMemory : hipSuccess);
    C_h = (float*)malloc(Nbytes);
    CHECK(C_h == 0 ? hipErrorOutOfMemory : hipSuccess);

    // Fill A_h with Phi + i.
    for (size_t i = 0; i < N; i++) {
        A_h[i] = 1.618f + i;
    }

    // Allocate the two device arrays.
    CHECK(hipMalloc(&A_d, Nbytes));
    CHECK(hipMalloc(&C_d, Nbytes));

    // Copy A_h to A_d on the device.
    CHECK(hipMemcpy(A_d, A_h, Nbytes, hipMemcpyHostToDevice));

    const unsigned blocks = 1;
    const unsigned threadsPerBlock = 700; // must not exceed 1024 threads per block

    // Launch the GPU computation; results are stored in C_d on the device.
    printf("info: launch 'vector_square' kernel, compute in GPU\n");
    hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);

    // Copy C_d back to the host.
    CHECK(hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost));

    // Verify the computed values on the host.
    printf("info: check result in host\n");
    for (size_t i = 0; i < N; i++) {
        if (C_h[i] != A_h[i] * A_h[i]) {
            printf("i= %zu C_h[i]= %f , A_h[i]= %f\n ", i, C_h[i], A_h[i]);
            CHECK(hipErrorUnknown);
        }
    }
    printf("PASSED!\n");
}
```
Two-Dimensional Matrix
```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define A_ROW_SIZE 700
#define A_COL_SIZE 5000
#define B_ROW_SIZE 5000
#define B_COL_SIZE 600
#define RAND_SEED 5000

// Each GPU thread computes elements of C as (row of A) * (column of B):
// for element i, the thread computes C[i,j] = (row i of A) * (column j of B).
__global__ void matrix_multiplication(double* A, double* B, double* C, size_t N)
{
    size_t offsetThread = (blockIdx.x * blockDim.x + threadIdx.x); // thread index
    size_t totalThreads = gridDim.x * blockDim.x;                  // total number of threads
    for (size_t i = offsetThread; i < N; i += totalThreads) {
        size_t c_row = i / B_COL_SIZE;
        size_t c_col = i % B_COL_SIZE;
        C[i] = 0;
        // For reference, with a 2D grid of 2D blocks the global index would be:
        // blockId = blockIdx.y * gridDim.x + blockIdx.x;
        // globalId = blockId * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
        // Walk one row of A, which pairs with one column of B.
        for (size_t j = 0; j < A_COL_SIZE; j++) {
            size_t indexA = c_row * A_COL_SIZE + j;
            size_t indexB = j * B_COL_SIZE + c_col;
            C[i] += A[indexA] * B[indexB];
        }
    }
}

// The matrix is stored as a flat 1D array; when printing, start a new line
// after every `col` elements (one line per row).
void printMatrix(double *array, int row, int col) {
    printf("array is: \n");
    int i = 0;
    while (i < row * col) {
        printf("%f,", array[i]);
        i++;
        if (i % col == 0) {
            printf("\n");
        }
    }
}

int main()
{
    double *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
    int N_A = A_ROW_SIZE * A_COL_SIZE;
    int N_B = B_ROW_SIZE * B_COL_SIZE;
    int N_C = A_ROW_SIZE * B_COL_SIZE;

    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);

    hipHostMalloc(&h_A, N_A * sizeof(double), hipHostMallocDefault);
    hipHostMalloc(&h_B, N_B * sizeof(double), hipHostMallocDefault);
    hipHostMalloc(&h_C, N_C * sizeof(double), hipHostMallocDefault);
    hipMalloc(&d_A, N_A * sizeof(double));
    hipMalloc(&d_B, N_B * sizeof(double));
    hipMalloc(&d_C, N_C * sizeof(double));

    for (int i = 0; i < N_A; i++)
        h_A[i] = rand() % RAND_SEED;
    //printMatrix(h_A, A_ROW_SIZE, A_COL_SIZE);
    for (int i = 0; i < N_B; i++)
        h_B[i] = rand() % RAND_SEED;
    //printMatrix(h_B, B_ROW_SIZE, B_COL_SIZE);

    hipMemcpy(d_A, h_A, N_A * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(d_B, h_B, N_B * sizeof(double), hipMemcpyHostToDevice);

    float time_gpu;
    clock_t start_cpu, stop_cpu;
    hipEvent_t start_GPU, stop_GPU;
    hipEventCreate(&start_GPU);
    hipEventCreate(&stop_GPU);
    hipEventRecord(start_GPU, 0);

    // Run the multiplication on the GPU.
    matrix_multiplication<<<600, 700>>>(d_A, d_B, d_C, A_ROW_SIZE * B_COL_SIZE);

    hipEventRecord(stop_GPU, 0);
    hipEventSynchronize(stop_GPU);
    hipEventElapsedTime(&time_gpu, start_GPU, stop_GPU);
    printf("\nThe time from GPU:\t%f(s)\n", time_gpu / 1000);
    hipDeviceSynchronize();
    hipEventDestroy(start_GPU);
    hipEventDestroy(stop_GPU);

    hipMemcpy(h_C, d_C, N_C * sizeof(double), hipMemcpyDeviceToHost);
    printf("\nkernel test:\n");
    //printMatrix(h_C, A_ROW_SIZE, B_COL_SIZE);
    printf("\nwaiting for cpu result ....\n");

    // Compute C once on the host and compare every element against the GPU
    // result; bail out if any element differs.
    start_cpu = clock();
    double *MatC_check = (double*)malloc(N_C * sizeof(double));
    for (int i = 0; i < N_C; i++) {
        size_t rowC = i / B_COL_SIZE;
        size_t colC = i % B_COL_SIZE;
        MatC_check[i] = 0;
        for (int index = 0; index < A_COL_SIZE; index++)
            MatC_check[i] += h_A[rowC * A_COL_SIZE + index] * h_B[index * B_COL_SIZE + colC];
        if (fabs(MatC_check[i] - h_C[i]) > 0.001) {
            printf("%d %f %f %f\n", i, MatC_check[i], h_C[i], fabs(MatC_check[i] - h_C[i]));
            return 0;
        }
    }
    stop_cpu = clock();
    printf("\nCPU time is %f(s)\n", (float)(stop_cpu - start_cpu) / CLOCKS_PER_SEC);
    //printMatrix(MatC_check, A_ROW_SIZE, B_COL_SIZE);
    printf("\ntest passed!\n");

    // Free all allocated memory.
    free(MatC_check);
    hipHostFree(h_A);
    hipHostFree(h_B);
    hipHostFree(h_C);
    hipFree(d_A);
    hipFree(d_B);
    hipFree(d_C);
    return 0;
}
```
Output

```
The time from GPU:	0.068484(s)

kernel test:

waiting for cpu result ....

CPU time is 6.171318(s)

test passed!
```
https://www.cnblogs.com/lin1216/p/12719836.html
Summary
The output above shows that for large-scale matrix computation like this, GPU parallel computing improves throughput dramatically over the CPU: here about 0.07 s versus 6.17 s. The larger the problem, the more pronounced the gap, which is why GPUs are used for supercomputing.