[ previous ] [ next ] [ threads ]
 To :  yate@v...
 From :  Etoile =?iso-8859-15?q?Di=E8se?= <support@e...>
 Subject :  SIGSEGV in Yate, problem found
 Date :  Thu, 31 Aug 2006 08:58:31 +0200
Hello,

We thought that the segmentation faults Yate produced in our platform were due 
to heavy loads disturbing the Zaptel part.
We were wrong. The problem is that the ThreadPrivate class terminates by 
killing the thread where it runs. When a method kills its own thread (whether 
softly or hardly by pthread_cancel), this method becomes then orphan of any 
object. If this method tries to access an attribute or another method of the 
same object, it accesses a freed memory wich is forbidden. The memory filled 
with the code of this method is liberated whenever the method terminates.
Moreover, a "new" may create a new object at the same address as the old 
object freed by canceling it thread. And it happens. In this case, the still 
not ended method is integrated in the new object but is not aware of this 
modification of its environment.

We saw two problems in Yate that caused segmentation fault (quiet often under 
good load). The first one was detected by running Yate with Valgrind. Here is 
the Valgrind message :

==16122== 
==16122== Thread 9:
==16122== Invalid write of size 1
==16122==    at 0x4C5C09D: TelEngine::ThreadPrivate::cancel(bool) 
(Thread.cpp:315)
==16122==    by 0x4C5C48C: TelEngine::ThreadPrivate::pubdestroy() 
(Thread.cpp:260)
==16122==    by 0x4C5C604: TelEngine::Thread::~Thread() (Thread.cpp:472)
==16122==    by 0x4C68566: 
TelEngine::ThreadedSourcePrivate::~ThreadedSourcePrivate() 
(DataFormat.cpp:861)
==16122==    by 0x4C64BA1: TelEngine::ThreadedSource::stop() 
(DataFormat.cpp:876)
==16122==    by 0x5951151: (anonymous namespace)::WaveSource::~WaveSource() 
(in /usr/local/lib/yate/modules/wavefile.yate)
==16122==    by 0x4C531E9: TelEngine::RefObject::deref() 
(in /usr/local/lib/libyate.so.1.0.0)
==16122==    by 0x4C6666B: 
TelEngine::DataEndpoint::setSource(TelEngine::DataSource*) 
(DataFormat.cpp:751)
==16122==    by 0x59537B5: (anonymous 
namespace)::AttachHandler::received(TelEngine::Message&) 
(in /usr/local/lib/yate/modules/wavefile.yate)
==16122==    by 0x4C60786: 
TelEngine::MessageDispatcher::dispatch(TelEngine::Message&) 
(in /usr/local/lib/libyate.so.1.0.0)
==16122==    by 0x4C60AD1: TelEngine::MessageDispatcher::dequeueOne() 
(in /usr/local/lib/libyate.so.1.0.0)
==16122==    by 0x4C60B0B: TelEngine::MessageDispatcher::dequeue() 
(in /usr/local/lib/libyate.so.1.0.0)
==16122==  Address 0x55EC920 is 24 bytes inside a block of size 40 free'd
==16122==    at 0x4A18D1A: operator delete(void*) (vg_replace_malloc.c:244)
==16122==    by 0x4C5B801: TelEngine::ThreadPrivate::~ThreadPrivate() 
(Thread.cpp:226)
==16122==    by 0x4C5B9CB: TelEngine::ThreadPrivate::destroyFunc(void*) 
(Thread.cpp:417)
==16122==    by 0x4B226F7: __nptl_deallocate_tsd 
(in /lib64/tls/libpthread-0.10.so)
==16122==    by 0x4B229B1: start_thread (in /lib64/tls/libpthread-0.10.so)
==16122==    by 0x52C1462: clone (in /lib64/tls/libc-2.3.3.so)

Since we inserted some comments in the code, maybe the line numbers aren't 
correct, but it doesn't matter.
The reason of this message is that Yate modify the attribute m_running just 
after having killed the thread by pthread_cancel.
The second was detected by those Yate logs :

 ZapSource::~ZapSource() [0x5b8580]
 Thread::~Thread() [0x5b8610]
>>> ThreadPrivate::pubdestroy() 0x5b8610 'ZapSource' [0x5a7f20]
   ThreadPrivate::cleanup() 0x5b8610 'ZapSource' [0x5a7f20]
   Thread::cleanup() [0x5b8610]
   ThreadPrivate::cancel(false) 'ZapSource' [0x5a7f20]
   Thread::exit()
   ThreadPrivate::cleanupFunc(0x5a7f20)
   ThreadPrivate::cleanup() (nil) 'ZapSource' [0x5a7f20]
  >>> ThreadPrivate::destroyFunc(0x5a7f20)
     ThreadPrivate::destroy() 'ZapSource' [0x5a7f20]
     ThreadPrivate::cleanup() (nil) 'ZapSource' [0x5a7f20]
    >>> ThreadPrivate::~ThreadPrivate() (nil) 'ZapSource' [0x5a7f20]
    <<< ThreadPrivate::~ThreadPrivate()
  <<< ThreadPrivate::destroyFunc
   ConfChan::~ConfChan() conf/2401 [0x594b90]
   processLine 
'%% Matched message
   processLine 
'%%>message:115676438617389:1156764386:chan.attach::consumer=wave/record/-:maxlen=4800000:notify=machZap74-1156764051.115676438617389:source=wave/play//home/audio/message.slin'
   Created message 0x57c540 'chan.attach' [0x584a00]
   WaveSource::WaveSource("/home/audio/message.slin",0x57c130) 
[0x5b9640]
  >>> Thread::Thread("WaveSource",2) [0x5c4800]
    >>> ThreadPrivate::ThreadPrivate(0x5c4800,"WaveSource") [0x5a7f20]
    <<< ThreadPrivate::ThreadPrivate
  <<< Thread::Thread
   TRACE : allocation in 0x5b9640 of tmp = 0x5c4800
   ThreadPrivate::startFunc(0x5a7f20)
   ThreadPrivate::run() 'WaveSource' [0x5a7f20]
   ExtMod [0x584a00] message 'chan.notify' [0x5a0190] returning true
   ThreadPrivate::cancel(true) 'WaveSource' [0x5a7f20]
   ThreadPrivate terminating pthread 0x5a7f30 [0x5a7f20]
   ThreadPrivate::cleanupFunc(0x5a7f20)
   ThreadPrivate::cleanup() 0x5c4800 'WaveSource' [0x5a7f20]
   WaveSource [0x5b9640] cleanup, total=0
  >>> ThreadPrivate::destroyFunc(0x5a7f20)
     ThreadPrivate::destroy() 'WaveSource' [0x5a7f20]
     ThreadPrivate::cleanup() 0x5c4800 'WaveSource' [0x5a7f20]
    >>> ThreadPrivate::~ThreadPrivate() 0x5c4800 'WaveSource' [0x5a7f20]
       Thread::~Thread() [0x5c4800]
    <<< ThreadPrivate::~ThreadPrivate()
  <<< ThreadPrivate::destroyFunc
<<< ThreadPrivate::pubdestroy()

...

 WaveSource::~WaveSource() [0x5b9640] total=0 stamp=0
 TRACE : liberation in 0x5b9640 of tmp = 0x5c4800


In this situation, the pubdestroy method terminate its own thread softly and 
enters a loop that check the reality of this thread termination. Before it 
does it first test in the loop, a WaveFile object comes and create a new 
ThreadPrivate object which takes the address 0x5a7f20 newly freed by the 
thread cancelation. The problem is that pubdestroy is still running and finds 
that this 0x5a7f20 thread does not terminate (of course it has been replaced 
by a new one). At the end of the pubdestroy loop, the method pubdestroy then 
calls a hard cancel of the thread and kills the new one with all the objects 
inside which may be used by the WaveFile object 0x5b9640. The segmentation 
fault finally comes when the WaveFile object 0x5b9640 tries to delete this 
thread object (which was already freed by pthread_cancel and the functions 
that follow : destroyFunc). The solution is to give to pubdestroy the 
possibility to test if it makes part of a new object before taking the 
decision of hard canceling the thread. The flag m_cancel is a good indicator 
and has proven here that it works. This scenario may vary, but the basic 
reason stays always the same.

Here is the patch we made against the last CVS version to fix those problems, 
and I have waited several days before posting it to be sure there was no more 
problems of segmentation :


diff -Naur yate/engine/Thread.cpp yate.ed/engine/Thread.cpp
--- yate/engine/Thread.cpp	2006-08-28 10:32:30.000000000 +0200
+++ yate.ed/engine/Thread.cpp	2006-08-29 18:59:33.000000000 +0200
@@ -257,6 +257,8 @@
 		return;
 	    Thread::msleep(5,false);
 	}
+	if (!m_cancel)
+	    return;
 	if (!cancel(true))
 	    Debug(DebugWarn,"ThreadPrivate::pubdestroy() %p '%s' failed cancel 
[%p]",m_thread,m_name,this);
     }
@@ -303,6 +305,7 @@
 			return true;
 		}
 	    }
+	    m_running = false;
 #ifdef _WINDOWS
 	    Debug(DebugGoOn,"ThreadPrivate terminating win32 thread %lu 
[%p]",thread,this);
 	    ret = ::TerminateThread(reinterpret_cast(thread),0) != 0;
@@ -311,10 +314,10 @@
 	    ret = !::pthread_cancel(thread);
 #endif
 	    if (ret) {
-		m_running = false;
 		Thread::msleep(1);
 		return true;
 	    }
+	    m_running = true;
 	}
 	m_cancel = true;
     }

Regards,

-- 
Support Etoile Dièse



diff -Naur yate/engine/Thread.cpp yate.ed/engine/Thread.cpp
--- yate/engine/Thread.cpp	2006-08-28 10:32:30.000000000 +0200
+++ yate.ed/engine/Thread.cpp	2006-08-29 18:59:33.000000000 +0200
@@ -257,6 +257,8 @@
 		return;
 	    Thread::msleep(5,false);
 	}
+	if (!m_cancel)
+	    return;
 	if (!cancel(true))
 	    Debug(DebugWarn,"ThreadPrivate::pubdestroy() %p '%s' failed cancel [%p]",m_thread,m_name,this);
     }
@@ -303,6 +305,7 @@
 			return true;
 		}
 	    }
+	    m_running = false;
 #ifdef _WINDOWS
 	    Debug(DebugGoOn,"ThreadPrivate terminating win32 thread %lu [%p]",thread,this);
 	    ret = ::TerminateThread(reinterpret_cast(thread),0) != 0;
@@ -311,10 +314,10 @@
 	    ret = !::pthread_cancel(thread);
 #endif
 	    if (ret) {
-		m_running = false;
 		Thread::msleep(1);
 		return true;
 	    }
+	    m_running = true;
 	}
 	m_cancel = true;
     }