I occasionally hit the following KASSERT:
panic: Error VMBUS: Message Post Failed
cpuid = 7
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe011cbeb450
vpanic() at vpanic+0x182/frame 0xfffffe011cbeb4d0
kassert_panic() at kassert_panic+0x126/frame 0xfffffe011cbeb540
hv_vmbus_post_message() at hv_vmbus_post_message+0x1ce/frame 0xfffffe011cbeb570
hv_vmbus_channel_establish_gpadl() at hv_vmbus_channel_establish_gpadl+0x4e2/frame 0xfffffe011cbeb5d0
hv_nv_on_device_add() at hv_nv_on_device_add+0x670/frame 0xfffffe011cbeb640
hv_rf_on_device_add() at hv_rf_on_device_add+0x8f/frame 0xfffffe011cbeb6f0
netvsc_attach() at netvsc_attach+0x141c/frame 0xfffffe011cbeb830
device_attach() at device_attach+0x41d/frame 0xfffffe011cbeb890
hv_vmbus_child_device_register() at hv_vmbus_child_device_register+0x13a/frame 0xfffffe011cbeb980
vmbus_channel_on_offer_internal() at vmbus_channel_on_offer_internal+0x33c/frame 0xfffffe011cbeb9c0
work_item_callback() at work_item_callback+0x10/frame 0xfffffe011cbeb9e0
taskqueue_run_locked() at taskqueue_run_locked+0xf0/frame 0xfffffe011cbeba40
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe011cbeba70
fork_exit() at fork_exit+0x84/frame 0xfffffe011cbebab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe011cbebab0
So let's use a better retry method instead:
- retry more times;
- start with a slightly bigger delay and double the delay on each retry;
- use pause_sbt() instead of the busy-waiting DELAY().
Usually hv_vmbus_post_msg_via_msg_ipc() doesn't fail. I think it only fails
when creating GPADLs for large shared memory regions with the host. E.g., in
the above netvsc initialization code, a GPADL for a 15MB sendbuf is created,
which posts lots of messages to the host in a short period of time, and it
looks like the host (or the hypervisor) may have a throttling mechanism that
returns HV_STATUS_INSUFFICIENT_BUFFERS. In practice, we resolve the issue by
delay-and-retry. Since the GPADL setup is one-shot, retrying with a bigger
delay causes no performance issue; usually we only need to retry once or
twice.
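
For illustration only, the delay-and-retry scheme can be sketched in user-space
C. This is not the driver code: fake_post(), struct fake_host, sleep_us() and
post_with_backoff() are hypothetical names, the status values are placeholders,
nanosleep() stands in for a sleeping delay such as pause_sbt(), and the retry
count and initial delay are left to the caller rather than taken from the
actual change.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Placeholder status codes for this sketch, not the real Hyper-V values. */
enum post_status { POST_OK = 0, POST_INSUFFICIENT_BUFFERS = 1 };

/* Hypothetical stand-in for a throttling host: rejects the first N posts. */
struct fake_host {
	int failures_left;
	int calls;
};

static enum post_status
fake_post(struct fake_host *h)
{
	h->calls++;
	if (h->failures_left > 0) {
		h->failures_left--;
		return (POST_INSUFFICIENT_BUFFERS);
	}
	return (POST_OK);
}

/* Sleep instead of spinning; user-space substitute for a kernel pause. */
static void
sleep_us(uint64_t us)
{
	struct timespec ts;

	ts.tv_sec = (time_t)(us / 1000000);
	ts.tv_nsec = (long)((us % 1000000) * 1000);
	nanosleep(&ts, NULL);
}

/*
 * Try to post up to max_retries times. After each failure, sleep for
 * delay_us and then double the delay. Returns 0 on success, -1 if every
 * attempt failed. *slept_us reports the total time slept.
 */
static int
post_with_backoff(struct fake_host *h, int max_retries, uint64_t delay_us,
    uint64_t *slept_us)
{
	assert(max_retries > 0);
	*slept_us = 0;
	for (int i = 0; i < max_retries; i++) {
		if (fake_post(h) == POST_OK)
			return (0);
		sleep_us(delay_us);
		*slept_us += delay_us;
		delay_us *= 2;
	}
	return (-1);
}
```

With a fake host that rejects the first two posts and an initial delay of
100us, post_with_backoff() succeeds on the third attempt after sleeping
100us + 200us in total, which matches the "retry once or twice" case above.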