Channel: Mellanox Interconnect Community: Message List

MXM ERROR failed to create send cq: Cannot allocate memory


I am trying to set up a small HPC cluster using Mellanox MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] cards and an M3601Q switch to work with Open MPI and SLURM.

 

I have read that I need to enable MXM support when compiling Open MPI. I have solved a lot of small problems along the way, but I am now stuck on this one: the Open MPI jobs crash and report that they cannot allocate memory:

 

mpirun noticed that process rank 3 with PID 2208 on node node01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
[1398427878.955592] [node02:2198 :0]      cib_ep.c:261  MXM  ERROR failed to create send cq: Cannot allocate memory
[1398427878.957108] [node02:2198 :0]      cib_ep.c:93   MXM  ERROR Failed to cancel async thread.
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
[1398427878.955621] [node02:2199 :0]      cib_ep.c:261  MXM  ERROR failed to create send cq: Cannot allocate memory
[1398427878.957105] [node02:2199 :0]      cib_ep.c:93   MXM  ERROR Failed to cancel async thread.
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
[1398427878.955550] [node02:2200 :0]      cib_ep.c:261  MXM  ERROR failed to create send cq: Cannot allocate memory
[1398427878.957063] [node02:2200 :0]      cib_ep.c:93   MXM  ERROR Failed to cancel async thread.
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
MXM: Got signal 15 (Terminated)
[1398427878.958152] [node02:2202 :0]      cib_ep.c:261  MXM  ERROR failed to create send cq: Cannot allocate memory
[1398427878.959497] [node02:2202 :0]      cib_ep.c:93   MXM  ERROR Failed to cancel async thread.
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
MXM: Got signal 15 (Terminated)
[1398427878.963706] [node02:2204 :0]      cib_ep.c:261  MXM  ERROR failed to create send cq: Cannot allocate memory
[1398427878.965081] [node02:2204 :0]      cib_ep.c:93   MXM  ERROR Failed to cancel async thread.
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
4 total processes killed (some possibly by mpirun during cleanup)

 

Moreover, when I try to run the jobs with the SLURM command srun, I get the following error:

 

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host:   node01
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:


  Local host:    node01
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

[I cut the error message, it is repetitive for each node]
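
The "Memlock limit: 65536" line above is the locked-memory limit (RLIMIT_MEMLOCK, in bytes) that the failing process actually ran with. As a minimal sketch for confirming this on each node, the following Python script prints the soft and hard limits of whatever process runs it. The helper name describe_memlock is hypothetical and only used for illustration; the resource module calls are from the Python standard library (Unix only).

```python
import resource


def describe_memlock(limit: int) -> str:
    """Return a human-readable form of an RLIMIT_MEMLOCK value given in bytes."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes ({limit // 1024} KiB)"


if __name__ == "__main__":
    # Limits are inherited from the parent process, so the values printed
    # here are the ones the launcher (shell, mpirun, srun) passed down.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft memlock limit:", describe_memlock(soft))
    print("hard memlock limit:", describe_memlock(hard))
```

Running the script once from a login shell and once through srun on the same node would show whether the two launch paths hand down different limits. A value of 65536 bytes (64 KiB) in either case matches the message above and is far below the "unlimited" setting that the Open MPI warning recommends.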

I have read here that I should modify the MTT values. I have followed the procedure, but I still get the same error. Does anyone know how to troubleshoot this?
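
For reference, the Open MPI FAQ linked in the error output gives the maximum registrable memory for mlx4 devices as (2^log_num_mtt) * (2^log_mtts_per_seg) * page size. The sketch below just evaluates that formula. It assumes the two values are the mlx4_core module parameters named log_num_mtt and log_mtts_per_seg (names not stated in this post) and a 4096-byte page; the numbers in the example are illustrative, not measurements from this cluster.

```python
def max_registrable_bytes(log_num_mtt: int, log_mtts_per_seg: int,
                          page_size: int = 4096) -> int:
    """Evaluate (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size in bytes."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def covers_ram(max_reg_bytes: int, ram_bytes: int, factor: int = 2) -> bool:
    """Check whether registrable memory is at least `factor` times physical RAM.

    The default factor of 2 follows the rule of thumb usually quoted for
    this setting; treat it as an assumption, not a hard requirement.
    """
    return max_reg_bytes >= factor * ram_bytes


if __name__ == "__main__":
    GIB = 1 << 30
    # Illustrative values only: log_num_mtt=20, log_mtts_per_seg=3.
    reg = max_registrable_bytes(20, 3)
    print(f"max registrable memory: {reg // GIB} GiB")
    print("enough for a 16 GiB node:", covers_ram(reg, 16 * GIB))
```

Note that the MTT settings and the memlock limit are separate constraints: raising the MTT values does not change a 64 KiB memlock limit, so both would need to be checked.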

 

 

Thanks in advance

Kind regards,

Andrea

 

NB: I have also compiled Open MPI with both SLURM support and MXM.

