Ceph – OSD timeout


If you do not have a low-latency network between your Ceph nodes, or you run Ceph across multiple datacenters, you may see this error in your OSD log:

2018-05-11 10:37:14.425098 7fd6213b8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd60c35e700' had timed out after 15

It means that some operations are taking longer than the 15-second timeout to be processed by the OSD. Unfortunately, when Ceph registers this error many times, it marks the OSD down, which causes a rebalance in your cluster for a few minutes (after a few minutes the OSD comes back up).
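To see which OSDs are affected and how often, the timeout messages can be counted per log file. This is only a minimal sketch: the function name count_op_tp_timeouts is hypothetical, and the log path in the example is an assumption based on the usual default location.

```shell
# Hypothetical helper: count osd_op_tp heartbeat timeouts in each given OSD log.
count_op_tp_timeouts() {
    for log in "$@"; do
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are no matches, so "|| true" keeps the loop going.
        n=$(grep -c "OSD::osd_op_tp thread .* had timed out" "$log" || true)
        printf '%s %s\n' "$n" "$log"
    done
}

# Example (log path is an assumption; adjust it to your installation):
# count_op_tp_timeouts /var/log/ceph/ceph-osd.*.log | sort -rn
```

Sorting the output numerically puts the OSD with the most timeouts at the top, which is a reasonable first candidate for a higher timeout.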

To solve this problem, it is a good idea to increase the osd_op_tp timeout, for example from the default of 15 seconds to 60. An easy way to do this on all OSDs at once is this command:

ceph tell osd.* injectargs '--osd_op_thread_timeout=60'

A better way is to increase the limit only on the OSD that is causing problems.
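Targeting a single OSD could look like the sketch below. The helper name set_op_thread_timeout and the CEPH override variable are hypothetical and only there for illustration; osd.7 is just an example ID. The underlying call is the same ceph tell ... injectargs command as above, with one OSD named instead of the wildcard.

```shell
# Hypothetical helper: raise osd_op_thread_timeout on one OSD only.
# CEPH may be overridden (e.g. CEPH=echo) to preview the command
# without touching the cluster.
set_op_thread_timeout() {
    osd_id=$1     # numeric OSD ID, e.g. 7
    seconds=$2    # new timeout in seconds, e.g. 60
    "${CEPH:-ceph}" tell "osd.${osd_id}" injectargs "--osd_op_thread_timeout=${seconds}"
}

# Example: raise the timeout to 60 seconds on osd.7 only.
# set_op_thread_timeout 7 60
```

Keep in mind that a value set with injectargs applies only to the running daemon and is lost when the OSD restarts. To keep it, the same option would also have to be set in the [osd] section of ceph.conf.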

Now, on an OSD node, check that the value was changed with this command (the admin socket file name contains the ID of the OSD running on that node):

ceph --admin-daemon /var/run/ceph/ceph-osd..asok config show | grep thread_timeout

Warning: Increasing the timeout can impact performance, but anything is better than an OSD that goes down every hour.

