CEPH – osd timeout


If the network between your CEPH nodes is not low-latency, or you are running CEPH across multiple datacenters, you may run into this error in your OSD log:

2018-05-11 10:37:14.425098 7fd6213b8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd60c35e700' had timed out after 15

It means that some operations are taking longer than expected to be processed by the OSD. Unfortunately, when CEPH registers this error many times, it marks the OSD down, which causes a rebalance in your cluster for a few minutes (after a few minutes the OSD comes back up).
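To judge how often an OSD hits this timeout before it gets marked down, it can help to count the matching lines in its log. The following is only a minimal sketch: `count_op_timeouts` is a hypothetical helper name, and the log path in the usage comment is an assumption about a typical installation, so adjust it to your setup.

```shell
# Count heartbeat timeouts of the osd_op_tp thread pool in one OSD log file.
# count_op_timeouts is a hypothetical helper, not part of CEPH.
count_op_timeouts() {
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" keeps the function from failing when nothing matches.
    grep -c "OSD::osd_op_tp thread .* had timed out after" "$1" || true
}

# Usage (the path is an assumption; adjust it to your OSD):
# count_op_timeouts /var/log/ceph/ceph-osd.7.log
```

A count that keeps growing on one OSD while the others stay at zero suggests that only that OSD needs a higher timeout.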

To solve this problem, it is a good idea to increase the osd_op_tp timeout, for example from the default of 15 seconds to 60. An easy way to do this at runtime on all OSDs is this command:

ceph tell osd.* injectargs '--osd_op_thread_timeout 60'

A better way is to increase the limit only on the OSD that is causing problems, by replacing osd.* with that OSD's name (for example osd.7).
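Targeting a single OSD could be wrapped in a small shell function like the one below. This is a sketch under stated assumptions: `set_op_timeout` and the `CEPH_BIN` variable are hypothetical names introduced here for illustration, and the function only builds the same `ceph tell ... injectargs` call used above for one OSD id.

```shell
# Raise the op thread timeout on a single OSD at runtime.
# set_op_timeout and CEPH_BIN are hypothetical names used for illustration.
set_op_timeout() {
    osd_id="$1"          # numeric OSD id, e.g. 7 for osd.7
    timeout="${2:-60}"   # seconds; falls back to 60, the example value in this post
    # CEPH_BIN lets you swap in another binary (e.g. echo for a dry run).
    "${CEPH_BIN:-ceph}" tell "osd.${osd_id}" injectargs "--osd_op_thread_timeout ${timeout}"
}

# Usage: raise the timeout to 60 seconds on osd.7 only
# set_op_timeout 7 60
```

Setting `CEPH_BIN=echo` before calling the function prints the command instead of running it, which is a cheap way to check what would be sent to the cluster.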

Now, on one of the OSD nodes, check that the value was changed with this command:

ceph --admin-daemon /var/run/ceph/ceph-osd..asok config show | grep thread_timeout

Warning: Increasing the timeout can impact performance, but anything is better than an OSD which goes down every hour.
