A Discussion of Condition Variables

Author: Wang Dong (王东)

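1.1 What are condition variables and condition waits?

Put simply:

A condition variable is a synchronization mechanism built on state shared between threads. It involves two actions: one thread suspends itself while waiting for some condition to become true, and another thread makes the condition true and notifies the waiting thread to continue. To prevent races, a condition variable is always used together with a mutex.

Wikipedia defines it as follows: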

Conceptually a condition variable is a queue of threads, associated with a monitor, on which a thread may wait for some condition to become true. Thus each condition variable c is associated with an assertion P. While a thread is waiting on a condition variable, that thread is not considered to occupy the monitor, and so other threads may enter the monitor to change the monitor's state. In most types of monitors, these other threads may signal the condition variable c to indicate that assertion P is true in the current state[1].

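In other words, a condition variable is a special synchronization object: a queue of threads associated with a monitor (in practice, a mutex). Each condition variable is associated with an assertion P, which the threads in its queue are waiting to become true. While a thread is waiting on a condition variable it does not occupy the monitor, so other threads can enter the critical region and change the state the condition depends on.

A condition variable supports two basic operations:

- wait: a thread blocks on the condition variable until assertion P becomes true; while blocked, it does not hold the monitor;

- signal/notify: another thread, having made assertion P true, notifies the condition variable.

When one thread signals and another is woken, both want to occupy the monitor, and only one can. Which one gets it distinguishes blocking condition variables (priority goes to the signaled thread) from nonblocking condition variables (priority goes to the signaling thread) [1].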

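A typical scenario for a condition wait:

Several threads access a resource inside a critical region. If the condition for obtaining the resource is not met, a thread must wait until another thread releases the resource and signals the condition, at which point the waiting thread can continue. For example: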

Thread1:

    Lock(mutex)
    while (condition is false) {
        // Why while rather than if here?
        // See 1.2.1, on spurious wakeups.
        Cond_wait(cond, mutex, timeout)
    }
    DoSomething()
    Unlock(mutex)

Thread2:

    Lock(mutex)
    …
    condition is true
    Cond_signal(cond)
    Unlock(mutex)

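For example, Thread1 takes a connection from a pool of 50 connections. If all 50 are already in use, the thread must wait on a condition. When Thread2 finishes with a connection, it returns it to the pool and signals the condition, telling Thread1 that it can continue.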
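The connection-pool scenario can be sketched with POSIX threads roughly as follows. This is only an illustration under assumptions: the pool is reduced to a counter, and the names pool_acquire, pool_release, and pool_demo are hypothetical, not taken from any library.

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 50                 /* pool capacity, as in the text */

static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cond  = PTHREAD_COND_INITIALIZER;
static int in_use = 0;               /* connections currently handed out */

/* Thread1's side: block until a connection is free, then take it. */
void pool_acquire(void)
{
    pthread_mutex_lock(&pool_mutex);
    while (in_use == POOL_SIZE)      /* recheck the condition after every wakeup */
        pthread_cond_wait(&pool_cond, &pool_mutex);
    in_use++;
    pthread_mutex_unlock(&pool_mutex);
}

/* Thread2's side: return a connection and notify one waiter. */
void pool_release(void)
{
    pthread_mutex_lock(&pool_mutex);
    in_use--;
    pthread_cond_signal(&pool_cond);
    pthread_mutex_unlock(&pool_mutex);
}

static void *pool_waiter(void *arg)
{
    (void)arg;
    pool_acquire();                  /* blocks if the pool is still exhausted */
    return NULL;
}

/* Hypothetical driver: exhaust the pool, start a waiter, release one slot. */
int pool_demo(void)
{
    pthread_t t;
    for (int i = 0; i < POOL_SIZE; i++)
        pool_acquire();
    pthread_create(&t, NULL, pool_waiter, NULL);
    pool_release();                  /* lets the waiter take the freed slot */
    pthread_join(t, NULL);
    return in_use;                   /* the waiter now holds the freed slot */
}
```

Whether the waiter blocks before or after the release, it ends up holding the one slot that was returned, so the pool is full again when pool_demo returns.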

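1.1.1 Condition variables versus semaphores

A semaphore is a non-negative integer counter used for synchronization and mutual exclusion between processes or threads.

Semaphores provide the "P/V operations" synchronization mechanism.

The P operation acquires a resource: if the semaphore's value is greater than 0, it is decremented and the thread proceeds, holding the resource; otherwise the thread blocks (sleeps) until another thread releases the resource it is waiting for.

The V operation releases a resource: it increments the semaphore's value and wakes one thread that is blocked in a P operation.

In its simplest form, a binary semaphore, the value can only be 0 or 1, which makes it behave much like a mutex.

In a counting semaphore the value can be any non-negative integer and represents the number of available resources.

A semaphore and a mutex can be combined to synchronize and protect a pool: the mutex provides mutual exclusion, and the semaphore counts the resources.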

Thread acquiring a resource:

    sem_wait(semaphore1)
    Lock(mutex)
    …
    Unlock(mutex)
    sem_post(semaphore2)

Thread releasing a resource:

    sem_wait(semaphore2)
    Lock(mutex)
    …
    Unlock(mutex)
    sem_post(semaphore1)

This closely resembles the multithreaded producer/consumer model; semaphore2 is there to prevent releasing more resources than were acquired.
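A minimal sketch of this two-semaphore pool using POSIX unnamed semaphores might look like the following. It assumes a Linux-like system where sem_init is supported; the pool size of 4, the counter, and all function names are hypothetical choices for illustration.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define POOL_SLOTS 4                 /* hypothetical pool size */

static sem_t free_slots;             /* semaphore1: resources still available */
static sem_t used_slots;             /* semaphore2: resources handed out */
static pthread_mutex_t sem_pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static int sem_in_use = 0;           /* bookkeeping protected by the mutex */

void sem_pool_acquire(void)
{
    sem_wait(&free_slots);           /* P: blocks while no resource is free */
    pthread_mutex_lock(&sem_pool_mutex);
    sem_in_use++;
    pthread_mutex_unlock(&sem_pool_mutex);
    sem_post(&used_slots);           /* V: one more resource can be released */
}

void sem_pool_release(void)
{
    sem_wait(&used_slots);           /* blocks instead of over-releasing */
    pthread_mutex_lock(&sem_pool_mutex);
    sem_in_use--;
    pthread_mutex_unlock(&sem_pool_mutex);
    sem_post(&free_slots);           /* V: the resource is available again */
}

static void *sem_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        sem_pool_acquire();
        sem_pool_release();
    }
    return NULL;
}

/* Hypothetical driver: 8 threads contend for 4 slots. */
int sem_pool_demo(void)
{
    pthread_t t[8];
    int free_now = -1;

    sem_init(&free_slots, 0, POOL_SLOTS);
    sem_init(&used_slots, 0, 0);
    for (int i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, sem_worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);
    sem_getvalue(&free_slots, &free_now);
    return free_now;                 /* every slot has been returned */
}
```

Note that the semaphores do the counting and blocking, while the mutex only protects the shared bookkeeping, which matches the division of labor described above.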

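Compared with semaphores, condition variables can express more complex wait conditions. Conversely, a condition variable plus a mutex can implement a semaphore (note that Windows condition variables can synchronize only threads within one process, not separate processes).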
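To illustrate the second point, here is one way a counting semaphore could be built from a mutex and a condition variable. It is a sketch, not a replacement for sem_t; the type csem_t and its functions are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical counting semaphore built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  nonzero;         /* signaled when value becomes positive */
    unsigned        value;
} csem_t;

void csem_init(csem_t *s, unsigned initial)
{
    pthread_mutex_init(&s->mutex, NULL);
    pthread_cond_init(&s->nonzero, NULL);
    s->value = initial;
}

/* P operation: wait until the count is positive, then decrement it. */
void csem_wait(csem_t *s)
{
    pthread_mutex_lock(&s->mutex);
    while (s->value == 0)
        pthread_cond_wait(&s->nonzero, &s->mutex);
    s->value--;
    pthread_mutex_unlock(&s->mutex);
}

/* V operation: increment the count and wake one waiter. */
void csem_post(csem_t *s)
{
    pthread_mutex_lock(&s->mutex);
    s->value++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->mutex);
}

static csem_t ready;
static int handled = 0;

static void *csem_consumer(void *arg)
{
    (void)arg;
    csem_wait(&ready);               /* returns only after a post */
    handled = 1;
    return NULL;
}

/* Hypothetical driver: one post releases one waiting thread. */
int csem_demo(void)
{
    pthread_t t;
    csem_init(&ready, 0);
    pthread_create(&t, NULL, csem_consumer, NULL);
    csem_post(&ready);
    pthread_join(t, NULL);
    return handled;
}
```

The predicate here is simply value > 0; a condition variable could just as well guard a more complex predicate, which is the flexibility the paragraph above refers to.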

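The Posix.1 Rationale explains why semaphores are provided in addition to mutexes and condition variables: "Semaphores are provided in this standard primarily to provide a means of synchronization for processes; these processes may or may not share memory. Mutexes and condition variables are specified as synchronization mechanisms between threads; these threads always share (some) memory. Both are synchronization paradigms that have been in widespread use for a number of years. Each set of primitives is particularly well matched to certain problems." Although semaphores are intended for synchronization between processes and mutexes and condition variables for synchronization between threads, semaphores can also be used between threads, and mutexes and condition variables between processes. The choice should depend on the situation at hand. Semaphores are most useful for indicating the number of available resources [11].

My personal impression is that different origins produced two schools of thought. One favors condition variables and sees little use for semaphores (for example, the POSIX Threads model itself has no semaphores; POSIX semaphores were specified as well, but why were they not included from the start?). The other holds the opposite view (for example, Windows originally had semaphores but no condition variables).

Over time, both Linux and Windows have come to provide both.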

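1.2 What are the condition wait functions on Linux?

Linux provides the following functions for waiting on and notifying a condition variable:

- pthread_cond_timedwait(cond, mutex, abstime);

- pthread_cond_wait(cond, mutex);

- pthread_cond_signal(cond): unblocks at least one of the threads blocked on the condition variable;

- pthread_cond_broadcast(cond): unblocks all threads blocked on the condition variable.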
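One detail of pthread_cond_timedwait that is easy to get wrong is that abstime is an absolute deadline rather than a duration. The sketch below assumes a condition variable with default attributes (so the deadline is measured against CLOCK_REALTIME); the names wait_ready, set_ready, and tw_ready are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t tw_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tw_cond  = PTHREAD_COND_INITIALIZER;
static int tw_ready = 0;             /* the predicate being waited for */

/* Wait up to timeout_ms for tw_ready.
   Returns 0 if the predicate became true, or an error such as ETIMEDOUT. */
int wait_ready(long timeout_ms)
{
    struct timespec abstime;
    int rc = 0;

    /* Convert the relative timeout into an absolute deadline. */
    clock_gettime(CLOCK_REALTIME, &abstime);
    abstime.tv_sec  += timeout_ms / 1000;
    abstime.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (abstime.tv_nsec >= 1000000000L) {
        abstime.tv_sec++;
        abstime.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&tw_mutex);
    /* The deadline stays fixed, so looping after an early wakeup
       does not extend the total waiting time. */
    while (!tw_ready && rc == 0)
        rc = pthread_cond_timedwait(&tw_cond, &tw_mutex, &abstime);
    if (tw_ready)
        rc = 0;
    pthread_mutex_unlock(&tw_mutex);
    return rc;
}

/* Make the predicate true and notify one waiter. */
void set_ready(void)
{
    pthread_mutex_lock(&tw_mutex);
    tw_ready = 1;
    pthread_cond_signal(&tw_cond);
    pthread_mutex_unlock(&tw_mutex);
}
```

Calling wait_ready(50) before anyone has called set_ready() returns ETIMEDOUT after roughly 50 ms; once set_ready() has run, the same call returns 0 immediately.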

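When thread 1 calls pthread_cond_wait(), three things happen:

1. the mutex is unlocked;

2. the thread waits for the condition cond to be signaled;

3. once notified, the thread locks the mutex again.

Steps 1 and 2 happen together: unlocking the mutex and blocking on cond must be a single atomic operation.

With the mutex now unlocked, other threads can enter the critical region and modify the condition.

At this point pthread_cond_wait() has not yet returned. Waiting on cond is a blocking operation: the thread sleeps and consumes no CPU cycles until the condition is signaled [3].

Suppose thread 2 then locks the mutex, changes the condition, and calls pthread_cond_signal() to signal it. Thread 1 now wakes up and tries to lock the mutex. Because thread 2 has not unlocked it yet, thread 1 has to wait; only after thread 2 unlocks the mutex can thread 1 lock it and get on with its work.

This raises a question: how can thread 1, rather than some other thread, be the one that gets the mutex first? POSIX does not promise that it will. The pseudocode for pthread_cond_wait and pthread_cond_signal in [4] shows one possible implementation, in which signal makes a thread queued in wait runnable: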

    pthread_cond_wait(mutex, cond):
        value = cond->value;
        pthread_mutex_unlock(mutex);
        pthread_mutex_lock(cond->mutex);
        if (value == cond->value) {
            me->next_cond = cond->waiter;
            cond->waiter = me;
            pthread_mutex_unlock(cond->mutex);
            unable_to_run(me);
        } else
            pthread_mutex_unlock(cond->mutex);
        pthread_mutex_lock(mutex);

    pthread_cond_signal(cond):
        pthread_mutex_lock(cond->mutex);
        cond->value++;
        if (cond->waiter) {
            sleeper = cond->waiter;
            cond->waiter = sleeper->next_cond;
            able_to_run(sleeper);
        }
        pthread_mutex_unlock(cond->mutex);

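The following example shows condition variables in use [2]:

One or more threads increment count (inc_count), while another thread watches count and reports as soon as it reaches COUNT_LIMIT (watch_count).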

    void inc_count (void) {
        …
        pthread_mutex_lock(&count_mutex);
        count++;
        /* Check the threshold while holding the mutex and signal the watcher. */
        if (count == COUNT_LIMIT) {
            pthread_cond_signal(&count_threshold_cv);
            printf("inc_count(): thread %ld, count = %d Threshold reached.\n",
                   my_id, count);
        }
        printf("inc_count(): thread %ld, count = %d, unlocking mutex\n",
               my_id, count);
        pthread_mutex_unlock(&count_mutex);
        …
    }

    void watch_count (void) {
        …
        pthread_mutex_lock(&count_mutex);
        /* pthread_cond_wait unlocks count_mutex while blocked
           and relocks it before returning. */
        while (count<COUNT_LIMIT) {
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
            printf("watch_count(): thread %ld Condition signal received.\n", my_id);
            count += 125;
            printf("watch_count(): thread %ld count now = %d.\n", my_id, count);
        }
        pthread_mutex_unlock(&count_mutex);
        …
    }

1.2.1 A problem with condition variables: spurious wakeups

The Linux man page says the following [4]:

On a multi-processor, it may be impossible for an implementation of pthread_cond_signal() to avoid the unblocking of more than one thread blocked on a condition variable.

In other words, a single call to pthread_cond_signal() can cause several threads to return from pthread_cond_wait() or pthread_cond_timedwait(). This is known as a spurious wakeup:

The effect is that more than one thread can return from its call to pthread_cond_wait() or pthread_cond_timedwait() as a result of one call to pthread_cond_signal(). This effect is called "spurious wakeup". Note that the situation is self-correcting in that the number of threads that are so awakened is finite; for example, the next thread to call pthread_cond_wait() after the sequence of events above blocks.

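Spurious wakeups could be eliminated inside pthread_cond_wait itself, but paying that cost on every call for a rarely occurring fringe condition is not worthwhile, and the fix would reduce the concurrency of every higher-level synchronization operation built on top of it. Implementations of pthread_cond_wait therefore leave the problem unresolved: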

While this problem could be resolved, the loss of efficiency for a fringe condition that occurs only rarely is unacceptable, especially given that one has to check the predicate associated with a condition variable anyway. Correcting this problem would unnecessarily reduce the degree of concurrency in this basic building block for all higher-level synchronization operations.

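The standard remedy is therefore:

Change the test of the condition from if to while.

The while loop around pthread_cond_wait() checks the condition not only before waiting on the condition variable but also again after the wait returns.

That one extra check of the condition is enough to guard against spurious wakeups.

This is why pthread_cond_wait() is wrapped in a while loop that tests whether the condition is still false.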
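The effect of the loop can be seen in a small sketch where two consumers compete for a single item and the producer uses a broadcast, so that more threads are woken than there are items. The closing flag, the counters, and the function names are hypothetical, introduced only for this illustration.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;
static int items = 0;                /* items available to consume */
static int closing = 0;              /* set when no more items will arrive */
static int consumed = 0;             /* total number of items taken */

static void *loop_consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&q_mutex);
    /* With `if` instead of `while`, a thread woken by the broadcast after
       the other consumer took the item would continue with items == 0. */
    while (items == 0 && !closing)
        pthread_cond_wait(&q_cond, &q_mutex);
    if (items > 0) {                 /* woken for an item, not for shutdown */
        items--;
        consumed++;
    }
    pthread_mutex_unlock(&q_mutex);
    return NULL;
}

/* Hypothetical driver: one item, two consumers, two broadcasts. */
int loop_demo(void)
{
    pthread_t c[2];
    pthread_create(&c[0], NULL, loop_consumer, NULL);
    pthread_create(&c[1], NULL, loop_consumer, NULL);

    pthread_mutex_lock(&q_mutex);
    items = 1;                       /* one item, two potential takers */
    pthread_cond_broadcast(&q_cond); /* wakes every waiter */
    pthread_mutex_unlock(&q_mutex);

    pthread_mutex_lock(&q_mutex);
    closing = 1;                     /* lets the losing consumer exit */
    pthread_cond_broadcast(&q_cond);
    pthread_mutex_unlock(&q_mutex);

    pthread_join(c[0], NULL);
    pthread_join(c[1], NULL);
    return consumed;
}
```

Because each consumer rechecks the predicate after waking, exactly one of them takes the item regardless of how the threads are scheduled; the other goes back to waiting until the shutdown broadcast.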

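Interestingly, the same issue shows up almost everywhere: in the Linux description of condition waits, in POSIX Threads, in the Windows API (condition variables), in Java, and so on.

- The Linux man page says this about condition variables [4]:

Adding the while check is considered to make programs more robust, and IEEE Std 1003.1-2001 explicitly allows spurious wakeups.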

An added benefit of allowing spurious wakeups is that applications are forced to code a predicate-testing-loop around the condition wait. This also makes the application tolerate superfluous condition broadcasts or signals on the same condition variable that may be coded in some other part of the application. The resulting applications are thus more robust. Therefore, IEEE Std 1003.1-2001 explicitly documents that spurious wakeups may occur.

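- In POSIX Threads [5]:

David R. Butenhof attributes spurious wakeups on multiprocessor systems to race conditions [8], and argues that eliminating them completely would inherently slow down condition variable operations.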

“…, but on some multiprocessor systems, making condition wakeup completely predictable might substantially slow all condition variable operations. The race conditions that cause spurious wakeups should be considered rare”

- For Windows condition variables [6]:

MSDN states that spurious wakeups occur here too, so the condition must be rechecked.

Condition variables are subject to spurious wakeups (those not associated with an explicit wake) and stolen wakeups (another thread manages to run before the woken thread). Therefore, you should recheck a predicate (typically in a while loop) after a sleep operation returns.

- In Java [7], a wait is written as follows:

    synchronized (obj) {
        while (<condition does not hold>)
            obj.wait();
        ... // Perform action appropriate to condition
    }

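Effective Java makes the same point in Item 50: Never invoke wait outside a loop.

Spurious wakeup is clearly a real issue, yet it was only clarified in the third edition of the JLS, which was revised as part of JDK 5 development. The Javadoc for wait was updated in JDK 5: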

A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops.

Apparently, the spurious wakeup is an issue (I doubt that it is a well known issue) that intermediate to expert developers know it can happen but it just has been clarified in JLS third edition which has been revised as part of JDK 5 development. The javadoc of wait method in JDK 5 has also been updated

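1.3 What are the condition wait functions on Windows?

Surprisingly, before Vista and Windows Server 2008, Windows had no standard notion of a condition variable.

To implement condition variables and condition waits, Windows programs relied on a combination of primitives:

1. a mutex for mutual exclusion;

2. SignalObjectAndWait to implement the condition wait;

3. PulseEvent on an auto-reset event to signal the condition.

In effect, the condition is represented by an auto-reset event.

Specifically: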

Thread1:

    WaitForSingleObject(hMutex, INFINITE);
    if (condition is false) {
        // Releases hMutex and waits on hEvent as one operation.
        SignalObjectAndWait(hMutex, hEvent, INFINITE, FALSE);
        WaitForSingleObject(hMutex, INFINITE);
    }
    DoSomething();
    ReleaseMutex(hMutex);

Thread2:

    WaitForSingleObject(hMutex, INFINITE);
    PulseEvent(hEvent);
    ReleaseMutex(hMutex);

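SignalObjectAndWait does not do quite the same job as cond_wait.

SignalObjectAndWait does only two things:

1. it unlocks the mutex;

2. it waits for the condition to be signaled, as part of the same atomic operation.

Once signaled, it simply returns without waiting to reacquire the mutex, unlike pthread_cond_wait(). After SignalObjectAndWait, the thread must therefore lock the mutex again itself so that two threads cannot be inside the critical region at the same time.

That lock(mutex) step is a potential problem, because nothing guarantees that it acquires the mutex ahead of other threads. In fact, inserting a sleep() between SignalObjectAndWait() and the following WaitForSingleObject() lets another thread take the mutex first. So, just as on Linux, the condition should be tested in a loop.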

Thread1:

    WaitForSingleObject(hMutex, INFINITE);
    // Recheck the condition each time the mutex is reacquired.
    while (condition is false) {
        SignalObjectAndWait(hMutex, hEvent, INFINITE, FALSE);
        WaitForSingleObject(hMutex, INFINITE);
    }
    DoSomething();
    ReleaseMutex(hMutex);

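1.3.1 What is wrong with PulseEvent in a condition wait, and how can it be fixed?

PulseEvent is in fact unreliable: a thread that is nominally waiting can momentarily not be in the wait state, and PulseEvent() cannot wake a thread during such a moment.

For example, a kernel-mode APC can take a waiting thread out of the wait state for an instant: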

A thread waiting on a synchronization object can be momentarily removed from the wait state by a kernel-mode APC, and then returned to the wait state after the APC is complete. If the call to PulseEvent occurs during the time when the thread has been removed from the wait state, the thread will not be released because PulseEvent releases only those threads that are waiting at the moment it is called. Therefore, PulseEvent is unreliable and should not be used by new applications. Instead, use condition variables.

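SetEvent() can of course be used in place of PulseEvent(). On a manual-reset event it wakes all waiting threads, which makes it behave like notifyAll rather than notify (on an auto-reset event it releases only a single waiting thread).

The standard solution is to use Windows condition variables [9].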

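1.3.2 The standard approach on Windows

Condition variables were introduced by Microsoft in Vista and Windows Server 2008; Windows XP and Windows Server 2003 do not support them.

Like critical sections, condition variables are user-mode objects rather than kernel objects (which makes them efficient), and they work only between threads of the same process.

- WakeConditionVariable: wakes one thread waiting on the condition variable;

- WakeAllConditionVariable: wakes all threads waiting on the condition variable;

- SleepConditionVariableCS: releases a critical section and waits on the condition variable as one atomic operation;

- SleepConditionVariableSRW: releases an SRW lock and waits on the condition variable as one atomic operation.

Condition variables are therefore used on Windows as follows: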

Thread1:

    EnterCriticalSection(&CritSection);
    while (TestPredicate() == FALSE) {
        SleepConditionVariableCS(&ConditionVar, &CritSection, INFINITE);
    }
    DoSth();
    LeaveCriticalSection(&CritSection);

Thread2:

    EnterCriticalSection(&CritSection);
    WakeConditionVariable(&ConditionVar);
    LeaveCriticalSection(&CritSection);

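Spurious wakeups are possible here as well, so the condition must still be rechecked.

1.3.3 Other options on Windows

Condition variables are the best choice on Windows, but what about XP and Windows Server 2003, where they are unavailable?

There are two options:

1. Use pthreads-win32, a port of pthreads to Windows with its own runtime library. See: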

http://sourceware.org/pthreads-win32/

2. Write your own condition variable, or use an existing implementation.

See:

a) A Fair Monitor (Condition Variables) Implementation for Win32

http://thbecker.net/free_software_utilities/fair_monitor_for_win32/start_page.html

b) An implementation of condition variables on Windows

http://blog.csdn.net/leafarmy/archive/2009/03/31/4039548.aspx

1.4 References

1. Monitor (synchronization), Condition variable

http://en.wikipedia.org/wiki/Condition_variable

2. POSIX Threads Programming

https://computing.llnl.gov/tutorials/pthreads/#ConditionVariables

3. pthread_cond_wait() is too hard to understand

http://hi.baidu.com/nkhzj/blog/item/f5480d4f740f7f35aec3ab4b.html

4. pthread_cond_signal(3), Linux man page

http://linux.die.net/man/3/pthread_cond_signal

5. POSIX Threads, Spurious wakeup

http://en.wikipedia.org/wiki/Spurious_wakeup

6. Condition Variables

http://msdn.microsoft.com/en-us/library/ms682052(v=VS.85).aspx

7. The spurious wakeup problem in Java

http://www.devguli.com/blog/eng/spurious-wakeup/

8. Race condition

http://en.wikipedia.org/wiki/Race_condition

9. PulseEvent

http://msdn.microsoft.com/en-us/library/ms684914(VS.85).aspx

10. Windows, Condition Variables

http://msdn.microsoft.com/en-us/library/ms682052(v=VS.85).aspx

11. Interprocess communication (mutexes, condition variables, read-write locks, file locks, semaphores)

http://blog.csdn.net/ccskyer/archive/2010/12/24/6096710.aspx

12. The spurious wakeup problem in pthread_cond_wait

http://blog.chinaunix.net/u/12592/showart_2213910.html