[ROOT-6909] Memory problems in LHCb GaudiPython Created: 20/Nov/14  Updated: 17/Dec/14  Resolved: 17/Dec/14

Status: Closed
Project: ROOT
Component/s: Cling, PyROOT
Affects Version/s: 6.02/01
Fix Version/s: 6.04.00, 6.02/03

Type: Bug Priority: Critical
Reporter: Roel Aaij Assignee: Axel Naumann
Resolution: Fixed Votes: 0
Labels: None
Environment:

SLC 6.6; LHCb software latest release or nightly build (lhcb-lcg-head)


Attachments: File gaudipython_valgrind.tar.bz2     File massif.out.1000evt.bz2     File massif.out.13151.bz2     File massif.out.2666.bz2     File massif_100ev_0dot1perc.tar.xz     File massif_100ev_1perc.tar.xz     File massif_10ev_1012.tar.xz    

 Description   

In a GaudiPython script, memory usage increases linearly, up to 5+ GB for a few thousand events. There is no such problem in ROOT 5.

I've run the script through valgrind with memcheck and massif: 100 events for memcheck and both 1 and 100 events for massif; the logs are attached here. Marco Clemencic should be able to set you up to run the script, if needed.
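
For reference, a minimal sketch of how such profiling runs could be launched - assuming the script is started with plain python and valgrind is on the PATH; the script name is a hypothetical placeholder:

    # Hedged sketch: run a GaudiPython script under valgrind's memcheck and massif.
    # "my_gaudi_script.py" is a hypothetical placeholder for the actual script.
    import subprocess

    script = ["python", "my_gaudi_script.py"]

    # memcheck: reports leaks and invalid memory accesses
    subprocess.call(["valgrind", "--tool=memcheck", "--leak-check=full"] + script)

    # massif: profiles heap usage over time; writes massif.out.<pid>
    subprocess.call(["valgrind", "--tool=massif"] + script)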



 Comments   
Comment by Axel Naumann [ 20/Nov/14 ]

Hi,

This is in the typedef lookup. I cannot reproduce it with any trivial example, whether succeeding, failing, or template-instantiating, so I will have to see the real code. Can someone let me know the details?

Cheers, Axel.

Comment by Axel Naumann [ 20/Nov/14 ]

Hi,

Commit 611876e (in v6-02-00-patches) should have drastically improved the situation. Can I get some new numbers with that one? massif on 100ev is enough; massif_visualizer will do the rest.

Cheers, Axel.

Comment by Marco Clemencic [ 21/Nov/14 ]

It looks much better now, but I still see a bit of a slope...

Comment by Axel Naumann [ 21/Nov/14 ]

Thanks!

That's still on 100 events? In the part I fixed I now see memory hoarding (this issue never involved leaking memory!) on the order of 60MB / 100 events, down from 340MB. That's at the noise level of massif (not my decision - all paths leading there end up in "all below threshold"). Let me know if you want me to pursue this further, in which case I need a more detailed massif report, e.g. with --threshold=0.1 (the percentage limit of the significance threshold). I can do that myself if you tell me how to set things up; otherwise please provide it.
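
For reference, the more detailed run could look roughly like this - a sketch along the lines of the one in the description, with the script name again a placeholder:

    # Hedged sketch: massif run with the significance threshold lowered to 0.1%
    import subprocess
    subprocess.call(["valgrind", "--tool=massif", "--threshold=0.1",
                     "python", "my_gaudi_script.py"])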

Another 17MB/100ev is caused by parsing.

And 26MB are again due to generic "below threshold" things (generic = not involving the fix I did).

Is this still critical for LHCb?

Cheers, Axel.

Comment by Marco Clemencic [ 21/Nov/14 ]

Hi,

yes, still 100 evt.

My worry is that it still looks like there is a small memory leak (or constant memory growth).

I'm running the 100 evt job with --threshold=0.1; then I'll try with more events to see whether it still grows or saturates.

Cheers
Marco

Comment by Marco Clemencic [ 21/Nov/14 ]

I added the massif output with the lower threshold.
I just started a job over 1000 evts... it will take some time...

Comment by Marco Clemencic [ 22/Nov/14 ]

Attached the massif output for 1000 evts. There is still a leak.

I do not know how many events are usually processed with this script, so I cannot tell how urgent it is to get it fixed.
I'll let Roel comment on it.

Cheers
Marco

Comment by Roel Aaij [ 24/Nov/14 ]

Like any Gaudi application, GaudiPython scripts should be able to run over something like a million events without problems. I think any kind of leak or hoarding should therefore be fixed.

Cheers, Roel

Comment by Axel Naumann [ 24/Nov/14 ]

Hi Roel,

Absolutely, no worries, we will fix that.

The question is how urgent it is, whether it should block the release 6.02.02 (which we'd like to cut this week) or whether it can wait for 6.02.03. That's your call: if you say that a lot of LHCb analyses will be hit by this bug then we will delay the release.

I am asking because this bug might need some synchronization between Wim (PyROOT) and me, which makes things noticeably slower, for instance due to time zone effects.

Cheers, Axel.

Comment by Roel Aaij [ 24/Nov/14 ]

Hi Axel,

How many people will be hit by this is a bit hard to judge. I'm not sure how many analyses use GaudiPython (and Bender, which is based on it) and also require the latest software stack. I'll ask around and let you know.

Cheers, Roel

Comment by Marco Clemencic [ 24/Nov/14 ]

Hi Axel,

you can tag with what you have at the moment.
We will pick up this release for a specific use case (and testing, of course), but it will not be used via Python... In any case, we will have to ask for a parallel release of ROOT 5 too, as a backup.

Cheers
Marco

Comment by Roel Aaij [ 27/Nov/14 ]

Massif output for a job with a slightly more complicated loop over objects in memory.

Comment by Roel Aaij [ 27/Nov/14 ]

Hi Axel,

I've run another job from a user who encountered problems. This script loops over two containers of objects in memory and gets them into Python, but doesn't do anything with them. The logs are in massif_100ev_1perc.tar.xz. The job uses nearly 5 GiB of memory after 100 events.
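
For illustration, a minimal sketch of the kind of loop described above, assuming GaudiPython's AppMgr/evtsvc interface; the job configuration is omitted and the container paths are hypothetical:

    # Hedged sketch: per event, retrieve two containers into Python and loop
    # over their contents without using them; job configuration is omitted and
    # the TES paths "/Event/A" and "/Event/B" are hypothetical placeholders.
    import GaudiPython

    appMgr = GaudiPython.AppMgr()
    evt = appMgr.evtsvc()

    for i in range(100):
        appMgr.run(1)                  # process one event
        container_a = evt['/Event/A']  # retrieve the containers into Python
        container_b = evt['/Event/B']
        for obj in container_a:        # iterate, but don't use the objects
            pass
        for obj in container_b:
            pass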

I forgot to set the threshold to 0.1%, so it's still at 1%. I'm rerunning with a lower threshold.

Cheers, Roel

Comment by Roel Aaij [ 27/Nov/14 ]

Hi Axel,

I've attached a new massif log for the slightly more complicated script, run with --threshold=0.1%.

Cheers, Roel

Comment by Axel Naumann [ 03/Dec/14 ]

Hi Roel,

Thanks! But I don't understand the output:

massif claims that cling::utils::TypeName::GetFullyQualifiedName() calls cling::Value::GetDtorWrapperPtr() - and I don't believe that. Now what...

  • Let's make sure that this is indeed not due to an inconsistent build: could you put a breakpoint at cling::Value::GetDtorWrapperPtr() and see whether it gets triggered during the event loop? You should be fine just ignoring the breakpoint 1000 times and then seeing what happens. If I am wrong (?!), you'll see a backtrace showing cling::utils::TypeName::GetFullyQualifiedName() - I'd like to see that one.
  • Please give me a reproducer - ideally without LHCb code because I'd like to make sure that it's not from an inconsistent build.

FYI - this is now my #1 priority. We can sit down together and debug if that helps.

Cheers, Axel.

Comment by Axel Naumann [ 03/Dec/14 ]

Hi Roel,

Sorry - my brain is only half functional and you happened to get the wrong half here.

It makes perfect sense; I'm working on a fix. I will rely on you to confirm that this future fix does get rid of the hoarding.

Cheers, Axel.

Comment by Axel Naumann [ 05/Dec/14 ]

Hi,

Could you try tomorrow's nightly? I tried your python script after my fix - super constant memory consumption until the lxplus CPU time watchdog killed me. Could you confirm, please?

(The fix is in the master and v6-02-00-patches.)

Cheers, Axel.

Comment by Roel Aaij [ 09/Dec/14 ]

Hi Axel,

The nightlies were broken during the weekend, so I tried with today's build.

Unfortunately I still see the same behaviour. I've checked that the correct version of ROOT is being picked up, so I'm not sure what's going wrong and what is different from when you run things. I'm running with massif again and will attach the log as soon as the job is finished.

Cheers, Roel

Comment by Roel Aaij [ 10/Dec/14 ]

Massif log with today's nightlies. GetFullyQualifiedName still seems to show up a lot, and the memory hoarding indeed doesn't look fixed.

Cheers, Roel

Comment by Axel Naumann [ 15/Dec/14 ]

Hi Roel,

I'm back. I ran your script for 38 minutes of CPU time and it did not increase its memory footprint - after I fixed my previous commit so that it actually works. Apologies for that.

This should now really be fixed in master and v6-02-00-patches. Could you please confirm tomorrow?

Cheers, Axel.

Comment by Axel Naumann [ 16/Dec/14 ]

Sorry, I forgot to push (instead of only committing) into v6-02-00-patches. So tomorrow it is, unless you can still get a new build today? I pushed at 8:38.

Axel.

Comment by Roel Aaij [ 17/Dec/14 ]

Hi Axel,

I'm happy to report that with today's nightly build, memory usage is constant, so this is fixed.

Thanks for chasing this down!

Roel

Comment by Axel Naumann [ 17/Dec/14 ]

Thanks for your report and your profiles, Roel!
