[ROOT-6698] Crash in PROOF lite Created: 16/Sep/14  Updated: 27/Sep/14  Resolved: 27/Sep/14

Status: Closed
Project: ROOT
Component/s: PROOF
Affects Version/s: 5.34/00
Fix Version/s: 6.02.00

Type: Bug
Priority: Critical
Reporter: Sebastian Uhl
Assignee: Gerardo Ganis
Resolution: Fixed
Votes: 0
Labels: None
Environment:

SLC6
ROOT 5.34.20, 5.34.21 and GIT branch v5-34-00-patches


Attachments: File prooflite.log    

 Description   

Hi,

since switching to ROOT 5.34.20, but also with the more recent versions 5.34.21 and the GIT branch v5-34-00-patches from this morning, I observe crashes of PROOF lite. I set up the processing like this:

TChain a("USR314");
TProof::Open("lite://");
a.SetProof();
a.Add("/nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.*.root");
a.Process("KinematicPlotsEta.C+");

The output on the console is

Info in <TProofLite::SetQueryRunning>: starting query: 1
Info in <TProofQueryResult::SetRunning>: nwrks: 4       
Looking up for exact location of files: OK (493 files)                 
Info in <TPacketizerAdaptive::TPacketizerAdaptive>: Setting max number of workers per node to 4
10:10:45 17043 Wrk-0.1 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W33_slot4_69611.root
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W33_slot4_69611.root - skipping
10:10:45 17043 Wrk-0.1 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70053.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70053.root - skipping
10:10:45 17047 Wrk-0.3 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70054.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70054.root - skipping
10:10:45 17043 Wrk-0.1 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70070.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70070.root - skipping
10:10:45 17041 Wrk-0.0 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70225.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W35_slot4_70225.root - skipping
10:10:45 17047 Wrk-0.3 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70525.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70525.root - skipping
10:10:45 17045 Wrk-0.2 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70526.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70526.root - skipping
10:10:45 17043 Wrk-0.1 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70726.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70726.root - skipping
10:10:46 17045 Wrk-0.2 | Error in <TDSet::GetEntries>: cannot find tree "USR314" in /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70939.root   
Error in <TPacketizerAdaptive::ValidateFiles>: cannot get entries for file: /nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W37_slot4_70939.root - skipping
Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 0.000000                                                                                         
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
entries: -1 (-1)                                                                                                                                                    
Info in <TProofLite::MarkBad>:                                                                                                                                      
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.1) as bad                                     
 +++ Reason: received kPROOF_FATAL                                                                                                                                  
 
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.1) as bad
 +++ Reason: received kPROOF_FATAL                                                                                             
 
 +++ Most likely your code crashed
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing                          
 +++                                                              
 +++ root [] TProof::Mgr("optiplex09.e18.physik.tu-muenchen.de")->GetSessionLogs()->Display("*")
 
 
entries: 66 (66)
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 28 (28)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 61 (61)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 46 (46)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 20 (20)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 36 (36)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 0.000000
0.2: caught exception triggered by signal '1' while processing dset:'TDSet:USR314', file:'/nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W33_slot4_69616.root' - check logs for possible stacktrace - last event: 7                                                                                                                                                                                                         
0.0: caught exception triggered by signal '1' while processing dset:'TDSet:USR314', file:'/nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W33_slot4_69623.root' - check logs for possible stacktrace - last event: 9                                                                                                                                                                                                         
Info in <TProofLite::MarkBad>:                                                                                                                                                                                     
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.0) as bad                                                                                    
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                                                                                                    
 
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.0) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                
 
 +++ Most likely your code crashed
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing                          
 +++                                                              
 +++ root [] TProof::Mgr("optiplex09.e18.physik.tu-muenchen.de")->GetSessionLogs()->Display("*")
 
 
entries: 28 (28)
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 10 (10)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 14 (14)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 0.000000
0.3: caught exception triggered by signal '1' while processing dset:'TDSet:USR314', file:'/nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.2008_W33_slot4_69643.root' - check logs for possible stacktrace - last event: 29                                                                                                                                                                                                        
Info in <TProofLite::MarkBad>:                                                                                                                                                                                     
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.3) as bad                                                                                    
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                                                                                                    
 
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.3) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                
 
 +++ Most likely your code crashed
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing                          
 +++                                                              
 +++ root [] TProof::Mgr("optiplex09.e18.physik.tu-muenchen.de")->GetSessionLogs()->Display("*")
 
 
entries: 85 (85)
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 30 (30)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 103 (103)                                                         
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 0.000000
Info in <TProofLite::MarkBad>:                                             
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.2) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                
 
 +++ Message from master at optiplex09.e18.physik.tu-muenchen.de : marking optiplex09.e18.physik.tu-muenchen.de:-1 (0.2) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)                                                                
 
 +++ Most likely your code crashed
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing                          
 +++                                                              
 +++ root [] TProof::Mgr("optiplex09.e18.physik.tu-muenchen.de")->GetSessionLogs()->Display("*")
 
 
entries: 91 (91)
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 5 (5)                                                             
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 33 (33)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 8 (8)                                                             
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
entries: 85 (85)                                                           
Error in <TPacketizerAdaptive::SplitPerHost>: Error removing a missing file
Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 0.000000
Lite-0: all output objects have been merged
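
When triaging console output like the above, it can help to list the input files the workers were reading when they caught the signal. The following is only a minimal sketch in standard C++ (no ROOT dependency), assuming the "caught exception triggered by signal" message format exactly as printed above; `crashedFile` is a hypothetical helper name, not part of ROOT or PROOF.

```cpp
#include <string>

// Hypothetical helper (not a ROOT API): given one line of console output,
// return the path quoted after "file:'" if the line is a
// "caught exception triggered by signal" message, or an empty string otherwise.
// Assumption: the message layout matches the console output shown above.
std::string crashedFile(const std::string& line) {
    // Ignore every line that is not a worker crash report.
    if (line.find("caught exception triggered by signal") == std::string::npos)
        return std::string();

    // The file name follows the literal marker file:' and ends at the next quote.
    const std::string key = "file:'";
    const std::string::size_type start = line.find(key);
    if (start == std::string::npos)
        return std::string();

    const std::string::size_type begin = start + key.size();
    const std::string::size_type end = line.find('\'', begin);
    if (end == std::string::npos)
        return std::string();

    return line.substr(begin, end - begin);
}
```

Calling `crashedFile` on each line of the console output and keeping the non-empty results gives the list of files that were open at crash time, which can then be compared with the per-file entry counts.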

The PROOF session log is attached.

The same script works on the same data without problems in 5.34.19. The crash only occurs when I process trees obtained from real data; for these roughly 500 trees, the number of entries ranges between 0 and 350 (with about half of the trees having fewer than 100 entries). If I process Monte Carlo data, in which case each tree contains 50001 events, no crash occurs. The crash also occurs with a single worker.

In the attached logfile there is the error message "Fatal: IsWriting() violated at line 3346 of `/tmp/suhl/root/io/io/src/TBufferFile.cxx'" for one process (however, when running the commands above multiple times, this message is not always present). TFile::SetCacheRead always occurs in the stack trace of at least one process.

Do you have an idea what might be wrong? Thanks, Sebastian



 Comments   
Comment by Gerardo Ganis [ 16/Sep/14 ]

Hi Sebastian,

Does it run without PROOF? I mean, can you process the TChain if you comment out the SetProof() call?

Gerri

Comment by Sebastian Uhl [ 16/Sep/14 ]

Hi Gerri,

simply calling

TChain a("USR314");
a.Add("/nfs/mds/user/suhl/analysis/slot4/filtered_eta/hist.*.root");
a.Process("KinematicPlotsEta.C+", "filtered");

also works.

Sebastian

Comment by Sebastian Uhl [ 16/Sep/14 ]

Running git bisect on the ROOT repository identifies 2201cac9d4b38c4f3a7f485cd64861ed4c7dabe1 as the commit causing the crash.

Comment by Gerardo Ganis [ 25/Sep/14 ]

Hi,
Sorry for the late reply.
So there must be an issue with the new way of setting the cache.
I'll try to debug it.
In the meantime, can you try setting

ProofPlayer.UseTreeCache 0

in your $HOME/.rootrc file?
This should turn off the cache and avoid the problem, if the diagnosis is correct.

Gerri

Comment by Sebastian Uhl [ 25/Sep/14 ]

Hi,
thanks for taking care of this.

Indeed, switching off the cache in .rootrc works around the problem.

Regards,
Sebastian

Comment by David Smith [ 26/Sep/14 ]

Hi Sebastian,

I believe this problem is now fixed on v5-34-00-patches; if you wouldn't mind and have the opportunity to try it, it would be good to confirm that it has definitely solved the problem you found.

Yours,
David

Comment by Sebastian Uhl [ 26/Sep/14 ]

Hi David,

your patch has indeed fixed this crash for me.

Thank you very much,
Sebastian

Comment by Gerardo Ganis [ 27/Sep/14 ]

Thanks to David Smith, this should be fixed in all relevant branches.

Generated at Sat Sep 21 06:37:53 CEST 2019 using Jira 7.13.1#713001-sha1:5e06076c2d215a6f699b7e5c90ab2fae7ba5a1ce.