Memory leaks when calling mpi_file_write_all #12677
@andymwood thank you for the report, I will have a look over the weekend.
@andymwood The backtrace shows:
If you have a chance, could you try the ring allgatherv algorithm, which does not involve temporary datatypes, with:
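For reference (the exact options are an assumption, not preserved above), selecting the ring allgatherv of the tuned collective component typically looks like this; the algorithm number should be double-checked with ompi_info, and the rank count and executable name are placeholders:

$ ompi_info --param coll tuned --level 9 | grep allgatherv
$ mpirun -n 4 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allgatherv_algorithm 3 ./test_ompi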
Please let us know if this works.
We talked about this today on Slack and have some leads. First, the datatype built in the allgatherv is contiguous as indicated by the dump below:
Tracking the code by hand also indicates that the datatype is correctly released in the allgatherv. However, as we only release data based on the object refcount, it is possible that if the PML/MTL still holds a reference to the datatype (which would be normal for pending send/recv communications), then the datatype will not be released there, but later, once the communications using it complete. @wenduwan recall that not long ago we had a bug in CM where the refcount of the datatype was not released properly: #12544. This PR is missing from the 4.x branch, so it could be the cause of the memory leaks. @andymwood can you try to run with the OB1 PML please? (add
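For reference, forcing the OB1 PML is typically done with an MCA option on the command line (rank count and executable name are placeholders):

$ mpirun -n 4 --mca pml ob1 ./test_ompi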
I can confirm that applying #12544 on v4.1.x mitigates this leak
With ...
Backport is approved: #12772. This should be included in the 4.1.7 release.
@wenduwan was that fix backported to the 5.0 branch as well?
@edgargabriel Yes, both main and v5.0.x have the fix.
Thanks for the responses. Setting these environment variables has no effect on the test problem I supplied.
This option reduces the number of leaks from mpi_file_write_all in the short test case to just one:
The full problem that I'm running takes much longer. I'll report back later.
Adding ...
Can you please give the inline patch below a try?
diff --git a/ompi/mca/common/ompio/common_ompio_aggregators.c b/ompi/mca/common/ompio/common_ompio_aggregators.c
index a6448f97e9..c31cbf5412 100644
--- a/ompi/mca/common/ompio/common_ompio_aggregators.c
+++ b/ompi/mca/common/ompio/common_ompio_aggregators.c
@@ -529,7 +529,7 @@ int mca_common_ompio_set_aggregator_props (struct ompio_file_t *fh,
}
fh->f_num_aggrs = fh->f_init_num_aggrs;
- fh->f_aggr_list = (int*) malloc ( fh->f_num_aggrs * sizeof(int));
+ fh->f_aggr_list = (int*) realloc( fh->f_aggr_list, fh->f_num_aggrs * sizeof(int));
if (NULL == fh->f_aggr_list ) {
opal_output (1, "OUT OF MEMORY\n");
return OMPI_ERR_OUT_OF_RESOURCE;
@edgargabriel
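For context, mca_common_ompio_set_aggregator_props appears to be called for every collective write, so a fresh malloc of f_aggr_list on each call, without freeing the previous buffer, would leak one array per mpi_file_write_all; the realloc above reuses the existing allocation instead. The fix commits referenced further down take a different route and free the array in the vulcan file_write_all path.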
@ggouaillardet Thank you, I think you are on the right track. Thanks again!
I've created another suppression file to suppress what I think are false positives, or leaks that don't matter.
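A Valgrind suppression entry generally has the following shape; this is a hypothetical example for one of the mpi_init-related reports mentioned in the original report (the entry name is arbitrary and the frame list is illustrative):

{
   suppress_ompi_init_leak
   Memcheck:Leak
   match-leak-kinds: definitely
   fun:malloc
   fun:opal_malloc
   ...
   fun:ompi_mpi_init
}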
With the attached suppression file and this option, the leaks are reduced to this:
==7355== 8 bytes in 1 blocks are definitely lost in loss record 7 of 59
==7355== at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==7355== by 0x6C61EA9: opal_malloc (malloc.c:101)
==7355== by 0x10A14CF8: ???
==7355== by 0x5783B29: query_2_0_0 (io_base_file_select.c:415)
==7355== by 0x5783B29: query (io_base_file_select.c:394)
==7355== by 0x5783B29: check_one_component (io_base_file_select.c:354)
==7355== by 0x5783B29: check_components (io_base_file_select.c:321)
==7355== by 0x5781B7F: mca_io_base_file_select (io_base_file_select.c:155)
==7355== by 0x55B0E53: ompi_file_open (file.c:135)
==7355== by 0x564D087: PMPI_File_open (pfile_open.c:109)
==7355== by 0x52F41AE: mpi_file_open (pfile_open_f.c:88)
==7355== by 0x4095A4: MAIN__ (test-leak.f90:20)
==7355== by 0x409411: main (in /home/andrew/open-mpi-bugs/memory-leaks/build/test_ompi)
==7355==
==7355== LEAK SUMMARY:
==7355== definitely lost: 8 bytes in 1 blocks
==7355== indirectly lost: 0 bytes in 0 blocks
==7355== possibly lost: 0 bytes in 0 blocks
==7355== still reachable: 0 bytes in 0 blocks
==7355== suppressed: 75,766 bytes in 145 blocks
==7355==
==7355== For counts of detected and suppressed errors, rerun with: -v
==7355== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 56 from 56)
I'm currently running the full problem to see if that fixes the out-of-memory error. I'll see what effect the suggested patch has later.
I've now tried this patch and it doesn't have any effect on the test case I provided.
@andymwood this is odd. With the patch and the program you initially provided:
Without the patch:
FWIW, I run on a Lustre filesystem. Note the leak is in ... you can ...
@edgargabriel I ran the reproducer under the debugger, and indeed,
@ggouaillardet you are technically correct, and I am fundamentally ok with the proposed change. I would, however, refrain from trying to cache the values from one call to another, since the user is technically allowed to set a different file view between subsequent calls. We can investigate whether something along those lines could be done, but I would suggest doing that separately, outside of this patch. I would like to make sure that we do not have any unintended side-effects.
@ggouaillardet would you like to file a PR with the fix, or shall I go ahead and do that?
@edgargabriel please go ahead, thanks!
we didn't correctly free the fh->f_aggr_list array in the vulcan file_write_all file. Thanks @andymwood for reporting the issue and @ggouaillardet for identifying the cause for the leak.
Fixes Issue open-mpi#12677 (at least partially)
Signed-off-by: Edgar Gabriel <[email protected]>
we didn't correctly free the fh->f_aggr_list array in the vulcan file_write_all file. Thanks @andymwood for reporting the issue and @ggouaillardet for identifying the cause for the leak.
Fixes Issue open-mpi#12677 (at least partially)
Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit 3fde6af)
I'm getting memory leaks when calling mpi_file_write_all, using Open MPI 4.1.5 built from a source downloaded from here. Eventually the process runs out of memory and is killed by the operating system. Valgrind also reports memory leaks from mpi_init and mpi_finalize, but I suspect it's the ones in mpi_file_write_all that are causing the major problems. I built Open MPI with Intel 2020.2, like so:
$ export FC=ifort CC=icc
$ ./configure --prefix=/home/andrew/install/Debug/openmpi-4.1.5 --with-psm2 --disable-psm2-version-check --with-slurm --with-pmi=/mnt/beegfs/software/slurm/19.05.2/../prod --enable-mem-debug --enable-debug
System
Details of the problem
The following code, test-leak.f90, reproduces the problem:
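The original listing is not reproduced here; as a hedged sketch, a reproducer of the kind described (repeatedly opening a file and calling mpi_file_write_all) might look like the following, where the loop count, buffer size, and file name are illustrative:

program test_leak
   use mpi
   implicit none
   integer :: ierr, rank, fh, i
   integer(kind=MPI_OFFSET_KIND) :: disp
   double precision :: buf(1024)

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   buf = 1.0d0

   ! Each iteration opens the shared file, sets a per-rank view and writes
   ! collectively; the reported leak grows with the number of
   ! mpi_file_write_all calls.
   do i = 1, 1000
      call mpi_file_open(MPI_COMM_WORLD, 'test.dat', &
                         MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
      disp = int(rank, MPI_OFFSET_KIND) * size(buf) * 8_MPI_OFFSET_KIND
      call mpi_file_set_view(fh, disp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                             'native', MPI_INFO_NULL, ierr)
      call mpi_file_write_all(fh, buf, size(buf), MPI_DOUBLE_PRECISION, &
                              MPI_STATUS_IGNORE, ierr)
      call mpi_file_close(fh, ierr)
   end do

   call mpi_finalize(ierr)
end program test_leak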
I'm using the following CMakeLists.txt to build the executable:
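The file itself is not shown; assuming a standard MPI Fortran setup producing the test_ompi executable seen in the valgrind backtraces, a minimal CMakeLists.txt along these lines would work:

cmake_minimum_required(VERSION 3.12)
project(test_ompi Fortran)
find_package(MPI REQUIRED COMPONENTS Fortran)
add_executable(test_ompi test-leak.f90)
target_link_libraries(test_ompi PRIVATE MPI::MPI_Fortran)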
Compile and run like so:
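The exact commands are not shown; assuming the CMake project above and the build/test_ompi paths that appear in the valgrind backtraces, something along these lines (the rank count is a placeholder):

$ cmake -S . -B build && cmake --build build
$ mpirun -n 4 valgrind --leak-check=full --show-leak-kinds=all ./build/test_ompi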
Here's some sample output from valgrind: