Dear sirs,
I am encountering a problem when using RoCE mode with Hadoop 2.2.0; the details are as follows (English is not my mother language, so please pardon any improper wording):
CPU: Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz
Memory (DDR): 192 GB
NIC: Mellanox Technologies MT27520 Family
OS: SUSE 11 SP3 x86_64
Hadoop version: 2.2.0 or 2.4.1
Hadoop nodes: 2 datanodes
UDA version: 3.4.1-0.0
OFED: MLNX_OFED_LINUX-2.4-1.0.4-sles11sp3-x86_64.iso
driver: mlx4_en
version: 2.4-1.0.4 (Mar 8 2015)
firmware-version: 2.31.5270
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
I have configured the Mellanox NIC ports as Ethernet (not IB), so RoCE mode can be used.
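In case it matters, this is roughly how I confirm that the ports really report an Ethernet link layer (which is what RoCE needs). It is only a minimal libibverbs sketch showing the same information that ibv_devinfo prints; the file name, the choice of device index 0 and the compile command are my own assumptions, not anything from the Hadoop/UDA setup:

/* roce_check.c - print the link layer of each port on the first verbs device.
 * Minimal sketch; build with: gcc roce_check.c -o roce_check -libverbs
 */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0]) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(list[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_device_attr dev;
    if (ibv_query_device(ctx, &dev)) { fprintf(stderr, "ibv_query_device failed\n"); return 1; }

    for (uint8_t port = 1; port <= dev.phys_port_cnt; ++port) {
        struct ibv_port_attr pa;
        if (ibv_query_port(ctx, port, &pa))
            continue;
        /* IBV_LINK_LAYER_ETHERNET means the port runs as Ethernet, i.e. it is RoCE-capable */
        printf("%s port %u: link_layer=%s\n", ibv_get_device_name(list[0]), port,
               pa.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet" : "InfiniBand");
    }

    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}

On my nodes this reports Ethernet for both ports, which is why I believe RoCE should be usable.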
Test workload: HiBench TeraSort
1. Modify mapred-site.xml as follows:
......
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.mellanox.hadoop.mapred.UdaShuffleConsumerPlugin</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>
</property>
......
2. Modify yarn-site.xml as follows:
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>com.mellanox.hadoop.mapred.UdaShuffleHandler</value>
</property>
3. While running TeraSort, the console output looks fine, with no errors.
4. However, the performance is not improved compared with the normal socket (vanilla shuffle) test.
5. Tracing the syslog (for example: application_1433928017396_0005/container_1433928017396_0005_01_000143/syslog), I see some UDA errors, which I paste below.
PS: From the syslog I see that RDMA memory registration is performed twice, but only the first attempt succeeds; the second one fails!
I don't know why that second registration fails, and yet the terminal output still shows the reduce steps finishing. Does UDA fall back to the normal socket path when it hits this problem? (I also describe a small standalone verbs test after the syslog below.)
2015-06-10 20:33:31,486 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-06-10 20:33:31,566 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: com.mellanox.hadoop.mapred.UdaShuffleConsumerPlugin@6b4ffe93
2015-06-10 20:33:31,567 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: init - Using UdaShuffleConsumerPlugin
2015-06-10 20:33:31,571 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.job.shuffle.input.buffer.percent is deprecated. Instead, use mapreduce.reduce.shuffle.input.buffer.percent
2015-06-10 20:33:31,571 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Using JAVA Xmx with mapred.job.shuffle.input.buffer.percent to limit UDA shuffle memory
2015-06-10 20:33:31,571 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: RDMA shuffle memory is limited to 686MB
2015-06-10 20:33:31,571 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA: user prefer rdma.buf.size=1024KB
2015-06-10 20:33:31,572 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA: minimum rdma.buf.size=16KB
2015-06-10 20:33:31,572 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA has limited support for map task speculative execution
2015-06-10 20:33:31,572 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA: number of segments to fetch: 300
2015-06-10 20:33:31,575 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: going to execute C++ thru JNI with argc=: 12 cmd: [-w, 256, -r, 9011, -a, 1, -m, 1, -g, /home/f00318826/cloud/hadoop-2.2.0/logs/userlogs/application_1433928017396_0005/container_1433928017396_0005_01_000143/, -s, 1024]
2015-06-10 20:33:31,577 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: +++>>> invoking UdaBridge.startNative: isNetMerger=true
2015-06-10 20:33:31,577 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA version is 3.4.1-0.0 (Merger/NetMergerMain.cc:55)
2015-06-10 20:33:31,577 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Compiled on the Jun 8 2015, 11:29:01 (Merger/NetMergerMain.cc:56)
2015-06-10 20:33:31,577 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: size of rdma buffer as passed from java is 1048576 (Merger/NetMergerMain.cc:58)
2015-06-10 20:33:31,578 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: main initialization finished ret=0 (UdaBridge.cc:235)
2015-06-10 20:33:31,578 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: <<<+++ after UdaBridge.startNative ret=0
2015-06-10 20:33:31,580 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: mParams array is [300, job_1433928017396_0005, attempt_1433928017396_0005_r_000002_0, 0, 1048576, 16384, org.apache.hadoop.io.Text, null, 262144, 720332378, 1, /home/f00318826/cloud/hadoop-2.2.0/temp/nm-local-dir/usercache/f00318826/appcache/application_1433928017396_0005]
2015-06-10 20:33:31,580 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: UDA: sending INIT_COMMAND
2015-06-10 20:33:31,581 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: got params from java: hadoop_cmd->count=13, num_maps=300, job_id=job_1433928017396_0005, reduce_task_id=attempt_1433928017396_0005_r_000002_0, lpq_size=0, rdma.buf.size=1048576B, minBuffer=16384B, cmp_func=org.apache.hadoop.io.Text, comp_alg=null, comp_block_size=262144, shuffleMemorySize=720332378B (Merger/reducer.cc:65)
2015-06-10 20:33:31,582 INFO [Thread-9] org.apache.hadoop.mapred.ShuffleConsumerPlugin: C++ THREAD STARTED (by RdmaClient) and attached to JVM tid=0xd10f3700 (CommUtils/UdaUtil.cc:46)
2015-06-10 20:33:31,589 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Successfully init'ed device (DataNet/RDMAComm.cc:370)
2015-06-10 20:33:31,589 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Going to register memory with contig-pages (DataNet/RDMAComm.cc:123)
2015-06-10 20:33:31,589 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Going to register RDMA memory. size=650117120 (DataNet/RDMAComm.cc:69)
2015-06-10 20:33:31,773 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: After RDMA memory registration. size=650117120 (DataNet/RDMAComm.cc:107)
2015-06-10 20:33:31,778 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Successfully init'ed device (DataNet/RDMAComm.cc:370)
2015-06-10 20:33:31,778 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Going to register memory with contig-pages (DataNet/RDMAComm.cc:123)
2015-06-10 20:33:31,778 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Going to register RDMA memory. size=650117120 (DataNet/RDMAComm.cc:69)
2015-06-10 20:33:31,779 ERROR [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: ibv_reg_mr failed for memory of total_size=650117120 , MSG=Cannot allocate memory (errno=12), memory-pointer=140028050276032 (DataNet/RDMAComm.cc:98)
2015-06-10 20:33:31,779 ERROR [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: raising com/mellanox/hadoop/mapred/UdaRuntimeException to java side, with info=ibv_reg_mr failure (UdaBridge.cc:91)
2015-06-10 20:33:31,779 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: checking mapred.rdma.developer.mode...
2015-06-10 20:33:31,781 ERROR [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Critical failure has occured in UdaPlugin - We'll try to use vanilla as fallbackPlugin.
Exception is:com.mellanox.hadoop.mapred.UdaRuntimeException: This is a UDA exception invoked from C++ with the following info: ibv_reg_mr failure
And with the following C++ stacktrace:
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_ZN12UdaExceptionC2EPKc+0x30) [0x7f5ad20ff340]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z20netlev_init_rdma_memPPvmP10netlev_devm+0x114) [0x7f5ad20fbd94]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z16rdma_mem_managerPPvmP10netlev_dev+0x84) [0x7f5ad20fc064]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z10create_devP11ibv_contextP10netlev_ctxPFvP14progress_eventPvEPS5_l+0x65) [0x7f5ad20fd485]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z14map_ib_devicesP10netlev_ctxPFvP14progress_eventPvEPS3_l+0x7b) [0x7f5ad20fd66b]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_ZN10RdmaClient12register_memEP11memory_pool13double_buffer+0x25) [0x7f5ad20f81e5]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z15handle_init_msgP10hadoop_cmd+0x243) [0x7f5ad2108663]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(_Z23reduce_downcall_handlerRKSs+0x98) [0x7f5ad2108eb8]
/home/f00318826/cloud/hadoop-2.2.0/share/hadoop/mapreduce/libuda.so(Java_com_mellanox_hadoop_mapred_UdaBridge_doCommandNative+0xa3) [0x7f5ad210edb3]
[0x7f5ae0600f90]
at com.mellanox.hadoop.mapred.UdaBridge.doCommandNative(Native Method)
at com.mellanox.hadoop.mapred.UdaBridge.doCommand(UdaBridge.java:63)
at com.mellanox.hadoop.mapred.UdaPluginRT.<init>(UdaPlugin.java:316)
at com.mellanox.hadoop.mapred.UdaShuffleConsumerPluginShared.init(UdaShuffleConsumerPluginShared.java:191)
at com.mellanox.hadoop.mapred.UdaShuffleConsumerPluginShared.init(UdaShuffleConsumerPluginShared.java:200)
at com.mellanox.hadoop.mapred.UdaShuffleConsumerPlugin.init(UdaShuffleConsumerPlugin.java:64)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:379)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
2015-06-10 20:33:31,809 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=720332352, maxSingleShuffleLimit=180083088, mergeThreshold=475419360, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2015-06-10 20:33:31,811 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Succesfuly switched to Using fallbackPlugin
2015-06-10 20:33:31,811 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: fetchOutputs: Using fallbackPlugin
2015-06-10 20:33:31,812 INFO [main] org.apache.hadoop.mapred.ShuffleConsumerPlugin: createKVIterator: Using fallbackPlugin
2015-06-10 20:33:31,814 INFO [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: attempt_1433928017396_0005_r_000002_0 Thread started: EventFetcher for fetching Map Completion Events
2015-06-10 20:33:31,831 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning GPU4:13562 with 1 to fetcher#5
2015-06-10 20:33:31,832 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to GPU4:13562 to fetcher#5
2015-06-10 20:33:31,833 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning GPU5:13562 with 1 to fetcher#1
2015-06-10 20:33:31,833 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to GPU5:13562 to fetcher#1
2015-06-10 20:33:31,848 INFO [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: attempt_1433928017396_0005_r_000002_0: Got 139 new map-outputs
2015-06-10 20:33:31,931 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1433928017396_0005&reduce=2&map=attempt_1433928017396_0005_m_000008_0 sent hash and received reply
2015-06-10 20:33:31,931 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1433928017396_0005&reduce=2&map=attempt_1433928017396_0005_m_000000_0 sent hash and received reply
2015-06-10 20:33:31,939 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 about to shuffle output of map attempt_1433928017396_0005_m_000000_0 decomp: 14028770 len: 14028774 to MEMORY
2015-06-10 20:33:31,943 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle output of map attempt_1433928017396_0005_m_000008_0 decomp: 14038130 len: 14038134 to MEMORY
2015-06-10 20:33:31,963 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 14028770 bytes from map-output for attempt_1433928017396_0005_m_000000_0
2015-06-10 20:33:31,966 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14028770, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->28066900
2015-06-10 20:33:31,967 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: GPU4:13562 freed by fetcher#5 in 136ms
2015-06-10 20:33:31,967 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning GPU4:13562 with 75 to fetcher#5
2015-06-10 20:33:31,967 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 20 of 75 to GPU4:13562 to fetcher#5
2015-06-10 20:33:31,967 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 14038130 bytes from map-output for attempt_1433928017396_0005_m_000008_0
2015-06-10 20:33:31,968 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14038130, inMemoryMapOutputs.size() -> 2, commitMemory -> 14028770, usedMemory ->28066900
2015-06-10 20:33:31,968 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: GPU5:13562 freed by fetcher#1 in 135ms
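To try to narrow down the ibv_reg_mr failure outside of Hadoop, this is the small standalone libibverbs test I plan to run on the datanodes. It prints the locked-memory limit and then registers two 650117120-byte regions back to back, which is the size shown in the syslog. It is only a sketch of my own, not code taken from UDA: the file name and the access flags are my assumptions, and it uses plain malloc + ibv_reg_mr, so it does not reproduce the "contig-pages" registration path mentioned in the log.

/* reg2.c - try to register two 650117120-byte memory regions back to back,
 * mimicking the sizes in the UDA syslog. Sketch only; build with:
 *   gcc reg2.c -o reg2 -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>
#include <infiniband/verbs.h>

#define REG_SIZE 650117120UL   /* size reported at DataNet/RDMAComm.cc:69 */

int main(void)
{
    /* ibv_reg_mr pins memory, so the locked-memory limit of the process is
     * one of the things I want to rule out. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0]) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) { fprintf(stderr, "open_device/alloc_pd failed\n"); return 1; }

    struct ibv_mr *mr[2] = { NULL, NULL };
    for (int i = 0; i < 2; ++i) {
        void *buf = malloc(REG_SIZE);
        if (!buf) { fprintf(stderr, "malloc %d failed\n", i + 1); return 1; }
        memset(buf, 0, REG_SIZE);   /* touch the pages before registering */

        mr[i] = ibv_reg_mr(pd, buf, REG_SIZE,
                           IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE);
        if (!mr[i])
            fprintf(stderr, "registration %d failed: %s (errno=%d)\n",
                    i + 1, strerror(errno), errno);
        else
            printf("registration %d ok, lkey=0x%x\n", i + 1, mr[i]->lkey);
    }

    /* one-shot test: deregister whatever succeeded and clean up */
    for (int i = 0; i < 2; ++i)
        if (mr[i]) ibv_dereg_mr(mr[i]);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}

If the second registration also fails here with errno=12, my next guesses would be the locked-memory limit of the NodeManager/container processes or the mlx4_core MTT module parameters (log_num_mtt / log_mtts_per_seg), but I am not sure; please advise.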