algorithm

Some tests and benchmarks of common algorithms

Build

Linux

To build the code in this repository, you need:

cmake 2.8 or later
gcc or clang. gcc must be >= 4.8 and clang must be >= 3.3.

git clone https://github.com/snnn/algorithm
cd algorithm
mkdir build
cd build
cmake ..
make
src/all_unitest

Windows

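To build the code in this repository, you need:

cmake 2.8 or later
Visual Studio 2012 or later

First check out the code with git (https://github.com/snnn/algorithm) or svn (https://github.com/snnn/algorithm/trunk), then generate the project files with cmake, then build.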

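Directory layout

/src Implementations of some common algorithms, each with unit tests.
/common A small operating system abstraction layer written to support both Linux and Windows: mainly mutex, condition variable, thread pool, snprintf, and so on.
/btree btree and hash index code extracted from an old version of berkeleydb. It is too dated to be of practical use; it was tidied up mainly for reading and study.
/google_benchmark A C++ benchmark framework from Google, with a few small modifications of mine.
/gtest Google's C++ unit test framework.
/zlib Third-party library, required by png.
/libpng Third-party library.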

The common directory

include/slib/threadpool.h and common/threadpool.cpp: a thread pool that supports timed tasks
include/slib/mutex_pthread.h, include/slib/mutex_win.h, common/mutex_pthread.cpp, common/mutex_win.cpp: mutex and condition variable

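Algorithm directory (incomplete)

src/salamin_pi.cpp: computes pi with the Brent-Salamin algorithm. I have not yet worked out the error bound, so the stopping condition is flawed.
src/heap.h: heap data structure. Follows the STL interface, with iterators and Allocator support.
src/fib.cpp: computes the n-th Fibonacci number, both iteratively and by divide-and-conquer exponentiation.
src/random.h: random number generator, somewhat better than the one in stdlib. Also provides a wrapper for the Intel hardware random number generator.
src/sort.h: several sorting algorithms (bubble sort, selection sort, heap sort, two-way merge sort) and binary search
src/quick_sort.h: quick sort
src/insert_sort.h: insertion sort
src/merge_sort_list_unitest.cpp: merge sort on a linked list
src/draw_binary_tree.h: draws a binary tree with the TR algorithm
src/topk.h: finds the k-th largest number
src/maxsubarray.h: maximum subarray sum, and the longest subarray with no repeated elements
src/lis.h: longest strictly increasing subsequence (LIS)
src/lcs.cpp: longest common subsequence (LCS)
src/edit_distance.cpp: edit distance
src/hash.cpp: a simple hash function, char[] -> uint32_t
src/TASLock.h: TAS spin-lock algorithm
src/TTASLock.h: TTAS spin-lock algorithm
src/stackword.h: generates all stack words (see the discussion of stack words in TAOCP)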
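
To make the two approaches listed for src/fib.cpp concrete, here is a minimal sketch. It is not the code in the repository: the names Mat2, mul, fib_iterative and fib_matrix_pow are hypothetical, and the divide-and-conquer power is assumed to be the usual 2x2 matrix power by repeated squaring.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical names; a sketch, not the code in src/fib.cpp.
using Mat2 = std::array<std::uint64_t, 4>;  // row-major 2x2 matrix

// Multiplies two 2x2 matrices (unsigned arithmetic wraps modulo 2^64).
inline Mat2 mul(const Mat2& x, const Mat2& y) {
    return {x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
            x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3]};
}

// Iterative method: n additions.
inline std::uint64_t fib_iterative(unsigned n) {
    assert(n <= 93);  // F(93) is the largest value that fits in 64 bits
    std::uint64_t a = 0, b = 1;  // F(0), F(1)
    for (unsigned i = 0; i < n; ++i) {
        const std::uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}

// Divide and conquer: raise [[1,1],[1,0]] to the n-th power by repeated
// squaring, using O(log n) matrix multiplications.
// The off-diagonal entry of the result is F(n).
inline std::uint64_t fib_matrix_pow(unsigned n) {
    assert(n <= 93);
    Mat2 result = {1, 0, 0, 1};  // identity matrix
    Mat2 base = {1, 1, 1, 0};
    while (n > 0) {
        if (n & 1u) result = mul(result, base);
        base = mul(base, base);
        n >>= 1;
    }
    return result[1];
}
```

Both functions should agree for every n up to 93, which makes one a convenient unit test oracle for the other.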

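Notes on benchmarking

A problem often has several algorithms that solve it, and an algorithm often has several implementations. Which one runs fastest depends on the concrete environment. So the best approach is to implement them all, actually run them, and then use statistical hypothesis testing to reach a conclusion.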
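
As one possible illustration of the hypothesis-testing step, the sketch below computes Welch's t statistic for two sets of timing samples. The source does not say which test it has in mind, so the choice of Welch's test and the names mean, sample_variance and welch_t are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sample mean of a set of timings.
inline double mean(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Unbiased sample variance (divides by n - 1).
inline double sample_variance(const std::vector<double>& xs) {
    const double m = mean(xs);
    double acc = 0.0;
    for (double x : xs) acc += (x - m) * (x - m);
    return acc / static_cast<double>(xs.size() - 1);
}

// Welch's t statistic for two independent samples with possibly unequal
// variances: t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b).
// The result is not finite if both samples have zero variance.
inline double welch_t(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() >= 2 && b.size() >= 2);
    const double se2 = sample_variance(a) / static_cast<double>(a.size()) +
                       sample_variance(b) / static_cast<double>(b.size());
    return (mean(a) - mean(b)) / std::sqrt(se2);
}
```

A large absolute value of t suggests the two implementations really differ in speed. Turning t into a p-value additionally requires the Student t distribution, which is left out here.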

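My benchmark framework code comes mainly from Google's open source project google/benchmark, which targets C/C++ programs. For Java programs I recommend caliper, also open-sourced by Google; I have used it for a long time and found it quite good. That said, micro benchmarks are easier to do in C/C++ than in Java, because there are fewer sources of interference.

Benchmarks usually track two times: CPU usage time and wall clock time.

Sources of CPU time:

int getrusage(int who, struct rusage *usage); the CPU time is ru_utime + ru_stime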
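
The getrusage call above could be wrapped roughly as follows. This is a sketch for POSIX systems; the function names cpu_time_us and measure_cpu_us are hypothetical, and RUSAGE_SELF (the whole process) is an assumed choice of the who argument.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

// Returns user + system CPU time consumed so far by this process,
// in microseconds, or -1 if getrusage fails.
inline long long cpu_time_us() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    const long long user =
        usage.ru_utime.tv_sec * 1000000LL + usage.ru_utime.tv_usec;
    const long long sys =
        usage.ru_stime.tv_sec * 1000000LL + usage.ru_stime.tv_usec;
    return user + sys;
}

// Hypothetical helper: CPU time in microseconds spent while running `work`.
template <typename F>
long long measure_cpu_us(F&& work) {
    const long long start = cpu_time_us();
    work();
    const long long stop = cpu_time_us();
    assert(start >= 0 && stop >= start);  // CPU time never decreases
    return stop - start;
}
```

Because the value covers the whole process, other running threads also contribute to it.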

Sources of wall clock time:

rdtsc
gettimeofday
clock_gettime
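
Of the three sources, clock_gettime is the easiest to show in a few lines. The sketch below assumes CLOCK_MONOTONIC, which the source does not specify; it is a common choice for benchmarks because it is not affected by changes to the system clock. The names wall_time_ns and measure_wall_ns are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Current monotonic time in nanoseconds, or -1 if clock_gettime fails.
inline std::int64_t wall_time_ns() {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return -1;
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Hypothetical helper: elapsed wall clock nanoseconds while running `work`.
template <typename F>
std::int64_t measure_wall_ns(F&& work) {
    const std::int64_t start = wall_time_ns();
    work();
    const std::int64_t stop = wall_time_ns();
    assert(start >= 0 && stop >= start);  // a monotonic clock never goes back
    return stop - start;
}
```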

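Things to take care of before running a benchmark:

Disable CPU frequency scaling: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sched_setscheduler: set the scheduling policy to FIFO; requires root privileges.
Align the test data to cache lines, so dynamic memory should be allocated with posix_memalign.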
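
The cache line alignment point can be sketched as below. The 64-byte line size is an assumption (typical for x86 CPUs, not stated in the source), and kCacheLineSize and alloc_aligned_ints are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// Allocates `count` ints starting on a cache line boundary.
// Returns nullptr on failure. Release the memory with free().
inline int* alloc_aligned_ints(std::size_t count) {
    void* p = nullptr;
    if (posix_memalign(&p, kCacheLineSize, count * sizeof(int)) != 0) {
        return nullptr;
    }
    // Postcondition: the address is a multiple of the cache line size.
    assert(reinterpret_cast<std::uintptr_t>(p) % kCacheLineSize == 0);
    return static_cast<int*>(p);
}
```

posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void*); 64 satisfies both.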

Some test results

Hardware environment 1: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz

Software environment 1: Ubuntu 14.04 LTS, llvm/clang 3.4 (without polly)

Compiler flags: -std=c++11 -pthread -Wall -Wextra -Wno-unused-parameter -stdlib=libc++ -O3 -mtune=native -march=native -flto -DNDEBUG -g3

Time units: 1 second = 1,000,000 microseconds = 1,000,000,000 nanoseconds. Nanoseconds is often abbreviated as ns below.

Sorting

Commands

Heap sort (my own implementation) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='BM_heap_sort' --benchmark_min_time=2

std::sort (libc++ STL) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='BM_std_sort' --benchmark_min_time=2

qsort (FreeBSD libc) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='qsort' --benchmark_min_time=2

Results

Benchmark	Time(ns)	CPU(ns)	Iterations
BM_heap_sort<int>/16	342	606	3300111
BM_heap_sort<int>/64	2352	2723	734670
BM_heap_sort<int>/512	29660	31726	63047
BM_heap_sort<int>/4k	308012	323559	6182
BM_heap_sort<int>/32k	3075102	3212472	623
BM_heap_sort<int>/256k	31398673	32604226	62
BM_heap_sort<int>/1024k	172382331	178276917	12
BM_std_sort<int>/16	184	452	4430101
BM_std_sort<int>/64	1210	1566	1277377
BM_std_sort<int>/512	15298	17196	116342
BM_std_sort<int>/4k	169104	183394	10906
BM_std_sort<int>/32k	1705724	1826278	1096
BM_std_sort<int>/256k	16603514	17616035	114
BM_std_sort<int>/1024k	76316858	80586680	25
BM_qsort_int/16	558	1152	1736306
BM_qsort_int/64	3123	4774	418924
BM_qsort_int/512	36114	48319	41413
BM_qsort_int/4k	376951	473376	4226
BM_qsort_int/32k	3716338	4515307	443
BM_qsort_int/256k	36471804	43464255	47
BM_qsort_int/1024k	174555536	204984700	10

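The number after "/" is the length of the array being sorted. For example, 4k means 4*1024 ints.

Conclusions

qsort is slow whether or not LTO (link-time optimization) is enabled. The cause may not be a worse algorithm but the limits of implementing it in C: no templates and less type information, which reduces the opportunities for inlining.

Heap sort and quick sort have the same average-case time complexity, but in practice heap sort is slower than quick sort by a constant factor (textbooks say the same).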
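
The inlining point about qsort can be seen in the shape of the two calls. The sketch below is an illustration, not the benchmark code from the repository; compare_int, sort_with_qsort and sort_with_std_sort are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Comparator for qsort. It receives untyped pointers and is passed to
// qsort as a function pointer, which makes inlining it harder.
inline int compare_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow that x - y could cause
}

// C style: element size and comparison are only known at run time.
inline void sort_with_qsort(std::vector<int>& v) {
    if (!v.empty()) std::qsort(v.data(), v.size(), sizeof(int), compare_int);
    assert(std::is_sorted(v.begin(), v.end()));
}

// C++ style: std::sort is a template, so the element type and operator<
// are visible to the compiler at the call site.
inline void sort_with_std_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}
```

Both functions produce the same ordering, so any timing difference between them comes from the implementation rather than from the result.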

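Integer increment

When multiple threads need to read and write the same integer, some synchronization strategy is required. You can use a mutex (pthread_mutex_t) or the CPU's atomic instructions.

With the CPU's atomic instructions on an x86 CPU, an i++ operation becomes a single lock xadd instruction, while a ++i operation becomes a lock xadd followed by an ordinary inc instruction (which is unrelated to thread synchronization).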
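
In C++11 the difference between the two forms shows up in their return values. The sketch below assumes std::atomic<int> is the atomic type in question, which the benchmark names suggest but the text does not state; the function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Post-increment returns the value before the increment.
// For std::atomic<int> it is equivalent to fetch_add(1).
inline int post_increment(std::atomic<int>& counter) {
    return counter++;
}

// Pre-increment returns the value after the increment.
// For std::atomic<int> it is equivalent to fetch_add(1) + 1; the extra
// addition acts on the returned copy, not on shared memory.
inline int pre_increment(std::atomic<int>& counter) {
    return ++counter;
}

// Walks through both forms on one counter and returns its final value.
inline int increment_demo() {
    std::atomic<int> counter{0};
    const int old_value = post_increment(counter);  // counter: 0 -> 1
    assert(old_value == 0);
    const int new_value = pre_increment(counter);   // counter: 1 -> 2
    assert(new_value == 2);
    (void)old_value;
    (void)new_value;
    return counter.load();
}
```

Which instructions the compiler actually emits depends on the compiler and on whether the returned value is used.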

Command

./mybenchmark --benchmark_filter='Int.*_single_thread' --benchmark_iterations=1000

Results

Benchmark	Time(ns)	CPU(ns)	Iterations
BM_Int_Inc_std_mutex_single_thread/8	197	205	1000
BM_Int_Inc_std_mutex_single_thread/64	1582	1593	1000
BM_Int_Inc_std_mutex_single_thread/64k	1544026	1560324	1000
BM_Int_Inc_atomic_int_single_thread/8	67	73	1000
BM_Int_Inc_atomic_int_single_thread/64	453	460	1000
BM_Int_Inc_atomic_int_single_thread/64k	478326	485313	1000

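The second column is wall clock time and the third column is CPU time.

Conclusion

On hardware environment 1, tested with a single thread, an atomic instruction takes about 7 ns and a mutex takes about 22 ns. The latter is about 3 times the former.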
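
A rough way to repeat this single-thread comparison without the benchmark framework is sketched below. It is not the code behind the numbers above: the names IncResult, time_mutex_inc and time_atomic_inc are hypothetical, and std::chrono::steady_clock stands in for the framework's timers.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>

struct IncResult {
    long long final_value;  // counter value after the loop
    double ns_per_op;       // average wall clock nanoseconds per increment
};

// Increments a plain counter n times, taking a std::mutex each time.
inline IncResult time_mutex_inc(std::size_t n) {
    assert(n > 0);
    std::mutex m;
    long long counter = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(m);
        ++counter;
    }
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {counter, ns / static_cast<double>(n)};
}

// Increments a std::atomic counter n times with fetch_add.
inline IncResult time_atomic_inc(std::size_t n) {
    assert(n > 0);
    std::atomic<long long> counter{0};
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        counter.fetch_add(1);
    }
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {counter.load(), ns / static_cast<double>(n)};
}
```

The absolute numbers will vary with the CPU, compiler, and flags, so only the ratio between the two results is worth comparing across machines.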
