Detailed explanation of homework4 #17

Open
wants to merge 3 commits into
base: main
26 changes: 14 additions & 12 deletions README.md
@@ -1,18 +1,20 @@
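# High-Performance Parallel Programming and Optimization - Homework for Lecture 04

Submit your homework via pull request. It will be graded, but:

There is no completion certificate. Homework is only a way to assess what you have learned and to consolidate it, so there is no need to be nervous about the score :)
Do what you can. As long as this course teaches you something that yesterday's you did not know, that is a win; there is no need to compare yourself with others.
Please don't peek at other people's homework!

- Slides: https://github.com/parallel101/course
- Recordings: https://space.bilibili.com/263032155

There is no submission deadline :) Even after the course has ended I will still look at late submissions~ That said, it is best to finish before the next lecture airs.

- How to open a pull request: https://zhuanlan.zhihu.com/p/51199833
- How to set up an https proxy: https://www.jianshu.com/p/b481d2a42274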
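## Detailed explanation
1. size_t is typically equivalent to uint64_t on 64-bit systems and to uint32_t on 32-bit systems.
An index of this type therefore needs no movslq to sign-extend it from 32 to 64 bits, which is more efficient. It can also handle arrays larger than INT_MAX, so it is recommended to always use size_t for array sizes and indices.
2. Enable optimization: -O3
3. Floating-point arguments and return values are passed in the xmm registers. An xmm register is 128 bits wide and can hold 4 floats or 2 doubles.
4. SIMD (single instruction, multiple data) is the technique of processing several data elements with a single instruction. It can greatly increase the throughput of compute-bound programs.
5. AOS (array of structures): the attributes of each object are stored packed together. This matches the usual habits of object-oriented programming (OOP) but often hurts performance.
6. SOA (structure of arrays): each attribute is stored in its own array. This does not match OOP habits but often helps performance. Also known as data-oriented programming (DOP).
7. AOSOA: SOA is convenient for SIMD optimization, AOS is convenient for storing in conventional containers, and AOSOA gets both.
8. Align data to 16 or 64 bytes
9. Try #pragma omp simd
10. Hoist loop-invariant values out of the loop
11. Use #pragma unroll on small loop bodies (with GCC: #pragma GCC unroll N)
12. -ffast-math and -march=native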
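
To make items 5 to 7 concrete, here is a minimal sketch of the three layouts. All names (`ParticleAOS`, `ParticlesSOA`, `BlockAOSOA`, `N`, `W`) are hypothetical and not taken from the homework; the block width `W = 16` is just an assumed value that divides the particle count.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sizes (assumptions): N particles, AOSOA blocks of W particles.
constexpr std::size_t N = 48;
constexpr std::size_t W = 16;
static_assert(N % W == 0, "W must divide N in this sketch");

// AOS: one struct per particle, attributes interleaved in memory.
struct ParticleAOS { float x, y, z; };

// SOA: one array per attribute, so each attribute is contiguous.
struct ParticlesSOA { float x[N], y[N], z[N]; };

// AOSOA: an array of small SOA blocks, each holding W particles.
struct BlockAOSOA { float x[W], y[W], z[W]; };

// Sum of the x attribute over N particles in AOS layout (stride of 3 floats).
float sum_x_aos(const ParticleAOS *p) {
    assert(p != nullptr);
    float s = 0.0f;
    for (std::size_t i = 0; i < N; i++) s += p[i].x;
    return s;
}

// Same sum in SOA layout (unit stride, the easy case for SIMD).
float sum_x_soa(const ParticlesSOA &p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < N; i++) s += p.x[i];
    return s;
}

// Same sum in AOSOA layout: unit stride inside each block.
float sum_x_aosoa(const BlockAOSOA *blocks) {
    assert(blocks != nullptr);
    float s = 0.0f;
    for (std::size_t k = 0; k < N / W; k++)
        for (std::size_t j = 0; j < W; j++) s += blocks[k].x[j];
    return s;
}
```

Particle `i` lives at `aos[i]`, at `soa.x[i]`, or at `blocks[i / W].x[i % W]` respectively. The AOSOA blocks can be stored in an ordinary container such as `std::vector<BlockAOSOA>` while the inner loop still reads contiguous floats.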

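Items 9 and 10 can be sketched together as follows. This is a simplified one-dimensional illustration under stated assumptions, not the homework solution: the function `accel_x` and its parameters are hypothetical. It hoists the invariants `eps * eps` and `G * dt` out of the loop, and it sums into a local variable declared in a `reduction` clause, which is how `#pragma omp simd` expects an accumulation to be expressed. With GCC the pragma takes effect when compiling with `-fopenmp` or `-fopenmp-simd`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: velocity change along x of particle i caused by all n particles.
float accel_x(const float *px, const float *mass, std::size_t n, std::size_t i,
              float G, float dt, float eps) {
    assert(i < n);
    const float eps2 = eps * eps;  // loop invariant, computed once
    const float gdt = G * dt;      // loop invariant, computed once
    float ax = 0.0f;               // local accumulator, named in the reduction clause
    #pragma omp simd reduction(+:ax)
    for (std::size_t j = 0; j < n; j++) {
        float dx = px[j] - px[i];
        float d2 = dx * dx + eps2;
        float inv = 1.0f / (d2 * std::sqrt(d2));  // one division, then only multiplications
        ax += dx * mass[j] * (gdt * inv);
    }
    return ax;
}
```

A caller would then write something like `vx[i] += accel_x(px, mass, n, i, G, dt, eps);`, so the store to the array happens once per particle instead of once per inner iteration.
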
## Grading rules

150 changes: 112 additions & 38 deletions main.cpp
@@ -4,68 +4,142 @@
#include <chrono>
#include <cmath>

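// -march=native lets the compiler detect which instructions the current hardware supports
// Force the compiler to evaluate at compile time:
// use constexpr to make the compiler perform constant folding!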
constexpr float speedup = 1.0 / RAND_MAX;

float frand() {
return (float)rand() / RAND_MAX * 2 - 1;
return (float)rand() * speedup * 2 - 1; // speedup is 1 / RAND_MAX, so multiply by it
}
constexpr int length = 48;

// Stack storage cannot grow dynamically, which is why a data structure like vector
// has to live on the heap, while a fixed-length array can live on the stack

struct Star {
float px, py, pz;
float vx, vy, vz;
float mass;
float px[length];
float py[length];
float pz[length];
float vx[length];
float vy[length];
float vz[length];
float mass[length];
};

std::vector<Star> stars;
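// SOA: each attribute is stored in its own array
// Does not match object-oriented programming (OOP) habits, but often helps performance. Also known as data-oriented programming

// AOS: the attributes of each object are stored packed together
// Matches usual object-oriented programming (OOP) habits, but often hurts performance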
Star stars;

void init() {
for (int i = 0; i < 48; i++) {
stars.push_back({
frand(), frand(), frand(),
frand(), frand(), frand(),
frand() + 1,
});
stars.px[i] = frand();
stars.py[i] = frand();
stars.pz[i] = frand();
stars.vx[i] = frand();
stars.vy[i] = frand();
stars.vz[i] = frand();
stars.mass[i] = frand() + 1;
}
}

// Loop invariants: hoist them out of the loop
float G = 0.001;
float eps = 0.001;
float dt = 0.01;

void step() {
for (auto &star: stars) {
for (auto &other: stars) {
float dx = other.px - star.px;
float dy = other.py - star.py;
float dz = other.pz - star.pz;
float d2 = dx * dx + dy * dy + dz * dz + eps * eps;
d2 *= sqrt(d2);
star.vx += dx * other.mass * G * dt / d2;
star.vy += dy * other.mass * G * dt / d2;
star.vz += dz * other.mass * G * dt / d2;
size_t len = length;
float eps2 = eps * eps;
float gdt = G * dt;
#pragma GCC unroll 16
for (size_t i = 0 ; i < len; i++) {
float dxs[length];
float dys[length];
float dzs[length];
float d2s[length];
float ivf_d2s[length];
#pragma omp simd
for(size_t j=0; j < len; j++)
{
dxs[j] = stars.px[j] - stars.px[i];
}
#pragma omp simd
for(size_t j=0; j < len; j++)
{
dys[j] = stars.py[j] - stars.py[i];
}
#pragma omp simd
for(size_t j=0; j < len; j++)
{
dzs[j] = stars.pz[j] - stars.pz[i];
}
#pragma omp simd
for(size_t j=0; j<len; j++)
{
d2s[j] = dxs[j] * dxs[j] + dys[j] * dys[j] + dzs[j] * dzs[j] + eps2;
}
#pragma omp simd
for(size_t j=0; j<len; j++){
ivf_d2s[j] = 1.0f / (d2s[j] * std::sqrt(d2s[j])); // float literal avoids promotion to double
}
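// Sum into locals: under omp simd an accumulation must be declared as a reduction
float ax = 0.0f, ay = 0.0f, az = 0.0f;
#pragma omp simd reduction(+:ax)
for(size_t j=0; j<len; j++){
ax += dxs[j] * stars.mass[j] * (gdt * ivf_d2s[j]);
}
#pragma omp simd reduction(+:ay)
for(size_t j=0; j<len; j++){
ay += dys[j] * stars.mass[j] * (gdt * ivf_d2s[j]);
}
#pragma omp simd reduction(+:az)
for(size_t j=0; j<len; j++){
az += dzs[j] * stars.mass[j] * (gdt * ivf_d2s[j]);
}
stars.vx[i] += ax;
stars.vy[i] += ay;
stars.vz[i] += az;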
}
//pragma omp simd
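// Drawback of C/C++: pointers are too unconstrained, so several mutable references may alias the same object,
// whereas Rust forbids this at the language level, which lets the compiler optimize boldly.
#pragma omp simd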
for(size_t i=0; i<len; i++)
{
stars.px[i] += stars.vx[i] * dt ;
}
#pragma omp simd
for(size_t i=0; i < len; i++)
{
stars.py[i] += stars.vy[i] * dt ;
}
#pragma omp simd
for(size_t i = 0; i < len; i++)
{
stars.pz[i] += stars.vz[i] * dt;
}
}
for (auto &star: stars) {
star.px += star.vx * dt;
star.py += star.vy * dt;
star.pz += star.vz * dt;
}
}

// Conclusion: both the compiler and the CPU prefer sequential, contiguous access
float calc() {
float energy = 0;
for (auto &star: stars) {
float v2 = star.vx * star.vx + star.vy * star.vy + star.vz * star.vz;
energy += star.mass * v2 / 2;
for (auto &other: stars) {
float dx = other.px - star.px;
float dy = other.py - star.py;
float dz = other.pz - star.pz;
size_t len = length;
for (size_t i = 0; i < len; i++) {
float v2 = stars.vx[i] * stars.vx[i] + stars.vy[i]* stars.vy[i]+ stars.vz[i]* stars.vz[i];
energy += stars.mass[i] * v2 / 2;
#pragma GCC unroll 32
// Unrolling a small loop body may pay off, but avoid unrolling large loop bodies: the extra pressure on the instruction cache can make things slower!
for (size_t j=0; j < len; j++) {
float dx = stars.px[j] - stars.px[i];
float dy = stars.py[j] - stars.py[i];
float dz = stars.pz[j] - stars.pz[i];
float d2 = dx * dx + dy * dy + dz * dz + eps * eps;
energy -= other.mass * star.mass * G / sqrt(d2) / 2;
float ivf_d2 = 1.0f / (std::sqrt(d2) * 2);
energy -= stars.mass[j] * stars.mass[i] * (G * ivf_d2);
// Math optimization: turn the division into a multiplication
}
}
return energy;
}

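// The -ffast-math option lets GCC try bolder floating-point optimizations, which sometimes gives roughly a 2x speedup. The price is that its handling of NaN and infinity may differ from what the (stuffy) IEEE standard specifies.
// If you can guarantee that NaN and infinity never occur in your program, you can safely turn on -ffast-math.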
template <class Func>
long benchmark(Func const &func) {
auto t0 = std::chrono::steady_clock::now();
@@ -85,4 +159,4 @@ int main() {
printf("Final energy: %f\n", calc());
printf("Time elapsed: %ld ms\n", dt);
return 0;
}
}
Binary file added opt_main
Binary file not shown.
8 changes: 5 additions & 3 deletions run.sh
@@ -1,5 +1,7 @@
#!/bin/sh
set -e
cmake -B build
cmake --build build
build/main

g++ -std=c++17 -march=native -ffast-math -O3 -fopenmp main.cpp -o opt_main
#g++ -std=c++17 -march=native -O3 -fopenmp main.cpp -o opt_main
./opt_main