If latency matters to you, NUMA affinity makes a real difference. If you run your test (e.g. ib_write_lat) on the same socket the FDR card is attached to, you can achieve latencies under 1 usec. Running on another socket will be ~20% slower. Throughput is not noticeably affected.
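As a rough illustration of the NUMA point, here is a minimal Python sketch that looks up which CPUs are local to the adapter and pins a benchmark to them. The device name `mlx4_0` and the helper names `parse_cpulist`, `hca_local_cpus` and `run_pinned` are my own assumptions for the example; check `ls /sys/class/infiniband` for the real device name on your machine.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def hca_local_cpus(device="mlx4_0"):
    """Return the CPUs that sysfs reports as local to the adapter's PCI slot.

    Assumption: the adapter is called mlx4_0; adjust for your system.
    """
    base = Path("/sys/class/infiniband") / device / "device"
    return parse_cpulist((base / "local_cpulist").read_text())


def run_pinned(argv, device="mlx4_0"):
    """Pin the current process to the adapter-local CPUs, then exec argv."""
    cpus = hca_local_cpus(device)
    if not cpus:
        raise RuntimeError("sysfs reported no local CPUs for " + device)
    os.sched_setaffinity(0, cpus)  # the affinity mask survives the exec below
    os.execvp(argv[0], argv)
```

Calling `run_pinned(["ib_write_lat"])` would start the server side of the test on the adapter's socket. This only pins CPUs; tools such as `numactl` or `taskset` are the usual way to do the same thing from the shell.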
IRQ affinity is also very important, and so are the BIOS settings.
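To see where the adapter's interrupts currently land, something like the following sketch can help. It assumes the adapter's interrupt lines in `/proc/interrupts` contain the string `mlx4` (adjust the match for your driver); the function names are made up for this example. Writing a new affinity needs root, and a running irqbalance daemon may overwrite whatever you set.

```python
from pathlib import Path


def irqs_matching(interrupts_text, needle):
    """Return IRQ numbers from /proc/interrupts-style text whose line mentions needle."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Numbered IRQ lines look like ' 64:  <counts>  <type>  <name>'.
        if sep and head.strip().isdigit() and needle in rest:
            irqs.append(int(head.strip()))
    return irqs


def show_affinity(needle="mlx4"):
    """Print the CPU list each matching IRQ is currently allowed to run on."""
    text = Path("/proc/interrupts").read_text()
    for irq in irqs_matching(text, needle):
        cpus = Path(f"/proc/irq/{irq}/smp_affinity_list").read_text().strip()
        print(f"IRQ {irq}: CPUs {cpus}")


def set_affinity(irq, cpulist):
    """Restrict one IRQ to the given CPU list, e.g. '0-7'. Requires root."""
    Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(cpulist + "\n")
```

The idea is to keep the adapter's IRQs on CPUs of the socket the card is attached to, i.e. the same set you pin the application to.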
There is a very nice, and relatively short, Tuning Guide published by Mellanox which I think is a must-read: http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf