InfiniBand RDMA poor transfer bandwidth
In my application I use an InfiniBand infrastructure to send a stream of data from one server to another. To ease development I have used IP over InfiniBand, because I'm more familiar with socket programming. Until now the performance (max bandwidth) was good enough for me (I knew I wasn't getting the maximum achievable bandwidth); now I need to get more bandwidth out of that InfiniBand connection.
ib_write_bw claims that my maximum achievable bandwidth is around 1500 MB/s (I'm not getting 3000 MB/s because my card is installed in a PCIe 2.0 x8 slot).
So far so good. I coded my communication channel using ibverbs and RDMA, but I'm getting far less than the bandwidth I could get; I'm even getting a bit less bandwidth than with sockets, although at least my application doesn't use any CPU power:
ib_write_bw: 1500 MB/s
sockets: 700 MB/s <= One core of my system is at 100% during this test
ibvers+rdma: 600 MB/s <= No CPU is used at all during this test
It seems that the bottleneck is here:
// Scatter/gather entry describing the local buffer to transfer
ibv_sge sge;
sge.addr = (uintptr_t)memory_to_transfer;
sge.length = memory_to_transfer_size;
sge.lkey = memory_to_transfer_mr->lkey;

// RDMA write work request targeting the peer's registered memory
ibv_send_wr wr;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
    notifyError("Unable to ibv_post_send");
}
At this point the following code waits for completion:
// Wait for completion
ibv_cq *cq;
void *cq_context;
if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
    notifyError("Unable to get an ibv cq event");
}
ibv_ack_cq_events(cq, 1);
if (ibv_req_notify_cq(cq, 0) != 0) {
    notifyError("Unable to request CQ notification");
}
ibv_wc wc;
int myRet = ibv_poll_cq(cq, 1, &wc);
if (myRet > 1) {
    LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
}
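(As an aside, a more defensive version of this wait would re-arm the CQ before draining it in a loop and would check each completion's status; this is only a minimal sketch of that pattern, reusing cq, notifyError and LOG from above.)

// Sketch: re-arm the CQ, then drain every pending completion and check
// its status instead of polling exactly once.
if (ibv_req_notify_cq(cq, 0) != 0) {
    notifyError("Unable to request CQ notification");
}
ibv_wc wc;
int n;
while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
    if (wc.status != IBV_WC_SUCCESS) {
        LOG(WARNING) << "Work completion failed: "
                     << ibv_wc_status_str(wc.status);
    }
    // match wc.wr_id against the outstanding work request here
}
if (n < 0) {
    notifyError("ibv_poll_cq failed");
}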
The time from my ibv_post_send to when ibv_get_cq_event returns an event is 13.3 ms when transferring chunks of 8 MB, which works out to around 600 MB/s.
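(For reference, taking 8 MB as 8 × 2^20 bytes: 8,388,608 B / 0.0133 s ≈ 630 MB/s, i.e. roughly 600 MiB/s, well below the ~1500 MB/s that ib_write_bw reports.)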
To be more specific, here is what I do globally (in pseudocode):
Active Side:
post a message receive
rdma connection
wait for rdma connection event
<<at this point transfer tx flow starts>>
start:
register memory containing bytes to transfer
wait for remote memory region addr/key (I wait for an ibv_wc)
send data with ibv_post_send
post a message receive
wait for the ibv_post_send completion (I wait for an ibv_wc) (this takes 13.3 ms)
send message "DONE"
unregister memory
goto start
Passive Side:
post a message receive
rdma accept
wait for rdma connection event
<<at this point transfer rx flow starts>>
start:
register memory that has to receive the bytes
send addr/key of the registered memory
wait "DONE" message
unregister memory
post a message receive
goto start
Does anyone know what I'm doing wrong? Or what I can improve? I'm not affected by "Not Invented Here" syndrome, so I'm even open to throwing away what I have done so far and adopting something else. I only need a point-to-point contiguous transfer.
Based on your pseudocode, it looks as if you register and unregister a memory region for every transfer. I think that's probably the main reason things are slow: memory registration is a pretty expensive operation, so you want to do it as little as possible and reuse your memory region as much as possible. All the time spent registering memory is time that you don't spend transferring data.
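As an illustrative sketch of the register-once idea (pd stands for your existing protection domain; the function name and the fixed staging buffer are made up for the example, not taken from your code):

#include <infiniband/verbs.h>

// Sketch: register a staging buffer once at connection setup and reuse the
// same memory region (and its lkey/rkey) for every transfer.
ibv_mr *setupStagingMr(ibv_pd *pd, void *staging_buffer, size_t size)
{
    ibv_mr *mr = ibv_reg_mr(pd, staging_buffer, size,
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    // Per transfer: fill staging_buffer and post the RDMA write using
    // mr->lkey; ibv_reg_mr/ibv_dereg_mr never appear in the hot path.
    // Call ibv_dereg_mr(mr) once, when the whole stream is finished.
    return mr;   // NULL on failure
}

The same applies on the passive side: register the receive buffer once and keep advertising the same addr/rkey instead of re-registering for every chunk.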
That points to a second problem with your pseudocode: you are synchronously waiting for completion and not posting another work request until the previous one completes. That means that during the time from when the work request completes until you get the completion and post another request, the HCA is idle. You're much better off keeping multiple send/receive work requests in flight, so that when the HCA completes one work request, it can move on to the next one immediately.
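And a rough sketch of what keeping work in flight can look like, busy-polling for brevity (QUEUE_DEPTH, haveMoreChunks() and postNextRdmaWrite() are illustrative placeholders; cq and notifyError are the ones from your code):

// Sketch: keep several signaled RDMA writes outstanding and re-post as soon
// as completions are reaped, so the HCA never sits idle between chunks.
const int QUEUE_DEPTH = 4;              // illustrative depth
int outstanding = 0;

while (haveMoreChunks() || outstanding > 0) {
    // Top the pipeline up to QUEUE_DEPTH outstanding work requests
    while (outstanding < QUEUE_DEPTH && haveMoreChunks()) {
        postNextRdmaWrite();            // ibv_post_send for the next chunk
        ++outstanding;
    }

    // Reap whatever has completed, then loop back and post more
    ibv_wc wc[QUEUE_DEPTH];
    int n = ibv_poll_cq(cq, QUEUE_DEPTH, wc);
    if (n < 0) {
        notifyError("ibv_poll_cq failed");
        break;
    }
    for (int i = 0; i < n; ++i) {
        if (wc[i].status != IBV_WC_SUCCESS)
            notifyError("RDMA write failed");
        --outstanding;
    }
}

This assumes the QP was created with max_send_wr of at least QUEUE_DEPTH; the same idea carries over to the event-driven completion channel you already use.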
I solved the issue by allocating the buffers to be transmitted aligned to the page size. On my system the page size is 4K (the value returned by sysconf(_SC_PAGESIZE)). Doing so (while still doing the registration/unregistration per transfer) I'm now able to reach around 1400 MB/s.
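A minimal sketch of that kind of allocation, assuming posix_memalign is used to get the page-aligned buffer (any allocator that returns page-aligned memory would do):

#include <cstdlib>      // posix_memalign, free
#include <unistd.h>     // sysconf

// Allocate a transfer buffer aligned to the system page size (4K here).
void *allocPageAlignedBuffer(size_t size)
{
    void *buf = NULL;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (posix_memalign(&buf, page, size) != 0) {
        return NULL;    // allocation failed
    }
    return buf;         // release with free() when done
}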