Userspace Thread Latency during IO operations

2018-06-30 10:08:50

I'm working on a project that uses an embedded Linux kernel, and I'm seeing thread latency problems when the system accesses flash memory.

My application is multithreaded, and some threads have to complete a given task in less than 500 ms. The problem is that these threads are sometimes "frozen" for more than 1 second, so my 500 ms deadline is exceeded.

This behaviour seems to be linked to flash writes, since it also occurs when I run a "dd" command from the shell to write continuously to the flash memory.

I tried various configurations:

Increased the priority of my real-time threads: SCHED_RR, priority=55.

Changed the IO scheduler: deadline => cfq (better: the failure occurs after 15 min instead of 3 min).
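For reference, the thread priority setting mentioned above is typically applied through the thread attributes. This is only a sketch: the helper name make_rr_attr is hypothetical, and the way the original application actually sets its priorities is not shown in this thread.

```c
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: prepare `attr` so that a thread created with it
 * runs under SCHED_RR at `priority`. Returns 0 on success, otherwise an
 * errno-style error code from the failing pthread call. */
int make_rr_attr(pthread_attr_t *attr, int priority)
{
    struct sched_param param;
    int rc;

    rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

    /* Without PTHREAD_EXPLICIT_SCHED the new thread inherits the
     * creator's scheduling and the settings below are ignored. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(attr, SCHED_RR);
    if (rc == 0) {
        memset(&param, 0, sizeof(param));
        param.sched_priority = priority;
        rc = pthread_attr_setschedparam(attr, &param);
    }
    if (rc != 0)
        pthread_attr_destroy(attr);
    return rc;
}
```

The attribute is then passed to pthread_create(), which usually needs root or CAP_SYS_NICE to start a real-time thread. Note that a high priority only decides who gets the CPU; it does not help a thread that is blocked waiting on IO or on a lock held by a blocked thread.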

Using ftrace, I could see that during the "freeze" some threads and processes are still running, with a lot of "idle" task time between the other tasks (each idle timeslot lasts more than 20 ms):

2 network threads (SCHED_RR, priority=50)

dd process

I don't understand:

Why all the other tasks are "locked" during all this time (sometimes while requesting a mutex, sometimes while calculating a simple 16-bit CRC).

Why so much idle time shows up in ftrace (between sched events) during this period.

Why higher application thread priorities don't solve the issue.

I suspect something related to IO management in the kernel, as if the kernel preempted every non-IO thread in order to do all the work related to IO (network, files, ...).

Does anybody have an idea of what might cause this latency?

My kernel settings:

Linux kernel version 2.6.39

Preempt option enabled

tickless

HZ=1000

CFQ scheduler (Default settings)

Edit:

As I'm not an expert, I'm sharing an ftrace capture (to be viewed with kernelshark): https://drive.google.com/file/d/0B6pJb20-D0D2NHZBUHJVRlV0aDg/view?usp=sharing

Maybe it will help you see what is really happening on my system.

In this capture I used an external "dd" command to reproduce behaviour similar to what I see with my application under nominal conditions.

The "hole" (the "freeze", during which there are no more custom ftrace markers from my application) is at these timestamps:

begin: 469.118370

end: 469.802940

Another, smaller "hole":

begin: 469.807644

end: 469.952975

I think this can be because the kernel has decided it must flush some filesystem metadata, or do other filesystem housekeeping, and must stall your process until it has done enough.

I had similar problems and used multi-threading and a userland buffer to absorb the stalls. See my old question and answer here.

An update on the status of this topic: we think we have found the root cause of the lock.

My company hired a Linux expert for 2 days.

Thanks to him we found that:

The locks were caused by the fact that all flash accesses are blocked while the kernel is flushing data.

This especially affected our logger module (used for timestamping...), which calls the syslog() function.

The syslog() function also blocks the calling process, even though the syslogd daemon is the process that actually accesses the flash... (we suspect that the Unix sockets used for syslog communication block until resources are available, like a shell pipe '|' when writing a lot of logs to a file located in flash).

The solution was to separate flash access from the real-time threads by moving all log/flash access into an isolated thread (with a custom non-blocking message queue as the communication channel).
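A minimal sketch of that kind of split, not our actual code: the names (log_queue, log_try_push, logger_thread) and the sizes are invented for illustration. Real-time threads call log_try_push(), which drops the message instead of waiting when the queue is full; only the logger thread ever calls syslog().

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>

#define LOG_SLOTS   64   /* assumed queue depth */
#define LOG_MSG_MAX 128  /* assumed maximum message length */

struct log_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    char msgs[LOG_SLOTS][LOG_MSG_MAX];
    unsigned head;     /* index of the oldest queued message */
    unsigned count;    /* number of queued messages */
    unsigned dropped;  /* messages discarded because the queue was full */
    int stop;
};

void log_queue_init(struct log_queue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Called from real-time threads. Never waits for free space and never
 * touches flash: returns 0 if queued, -1 if the message was dropped. */
int log_try_push(struct log_queue *q, const char *msg)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < LOG_SLOTS) {
        unsigned tail = (q->head + q->count) % LOG_SLOTS;
        snprintf(q->msgs[tail], LOG_MSG_MAX, "%s", msg); /* truncates long messages */
        q->count++;
        rc = 0;
        pthread_cond_signal(&q->nonempty);
    } else {
        q->dropped++;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Called from the logger thread. Waits for a message; returns 1 with the
 * message copied into `out`, or 0 once stopped and the queue is empty. */
int log_pop(struct log_queue *q, char out[LOG_MSG_MAX])
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->stop)
        pthread_cond_wait(&q->nonempty, &q->lock);
    if (q->count > 0) {
        memcpy(out, q->msgs[q->head], LOG_MSG_MAX);
        q->head = (q->head + 1) % LOG_SLOTS;
        q->count--;
        got = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

void log_queue_stop(struct log_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->stop = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Logger thread body: the only place allowed to block on syslog(). The
 * queue lock is released before the call, so a stalled syslog() does not
 * hold up the producers. */
void *logger_thread(void *arg)
{
    struct log_queue *q = arg;
    char msg[LOG_MSG_MAX];
    while (log_pop(q, msg))
        syslog(LOG_INFO, "%s", msg);
    return NULL;
}
```

The logger thread would be started with a normal (non-real-time) priority via pthread_create(&tid, NULL, logger_thread, &queue). The mutex is only held for a short copy, but if priority inversion on it is a concern, a priority-inheritance mutex (PTHREAD_PRIO_INHERIT) may be worth considering.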

And it seems to work!

I hadn't read blueshift's answer before, but it seems he was right ;-)
