Description
The current implementation of TTreeProcessorMT::Process uses one TChain per thread. From that TChain, the cluster boundaries are extracted and one task is created per cluster. For each task, threads construct a TTreeReader with their thread-local TChain and process the assigned range for that task. The performance of this implementation suffers when there are many files in the TChain, since threads spend a lot of time going through files in the TChain to find the one that contains the range they need to process. In particular, many calls to LoadTree are observed and this causes severe lock contention.
An alternative implementation spawns a super task for each file, which in turn spawns one sub-task for each cluster in that file. This helps threads keep locality on files, since the sub-tasks for the clusters of a file land in the local TBB queue of the thread that runs the super task. In this implementation, no TChain is used and every thread operates on a thread-local TTree with tree-local ranges. One advantage of this solution is that there is no first pass over the whole chain that opens all its files: each file is opened for the first time by the thread that needs to process it. An idle thread can still steal a task from another thread; in that case, the stealing thread simply opens the file that the stolen task refers to.
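The nesting of file-level and cluster-level tasks can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the TTreeProcessorMT code: `FileInfo`, `ProcessFiles` and `processRange` are hypothetical names, the cluster boundaries are passed in by hand instead of being read from a TTree, and `std::async` stands in for the TBB scheduler (so there is no local queue and no work stealing here).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one input file: its name and the tree-local
// [begin, end) entry range of each cluster. In ROOT these boundaries would
// come from the TTree stored in the file.
struct FileInfo {
    std::string name;
    std::vector<std::pair<std::int64_t, std::int64_t>> clusters;
};

using RangeFn = std::function<void(const std::string &, std::int64_t, std::int64_t)>;

// One super task per file; each super task spawns one sub-task per cluster.
// processRange may be called concurrently and must be thread-safe.
// Returns the total number of entries handed to processRange.
std::int64_t ProcessFiles(const std::vector<FileInfo> &files, const RangeFn &processRange)
{
    std::vector<std::future<std::int64_t>> fileTasks;
    for (const auto &file : files) {
        fileTasks.emplace_back(std::async(std::launch::async, [&file, &processRange] {
            // In the real case, the file would be opened here for the first
            // time, by the thread that runs this super task.
            std::vector<std::future<std::int64_t>> clusterTasks;
            for (const auto &c : file.clusters) {
                assert(c.first <= c.second);
                clusterTasks.emplace_back(std::async(std::launch::async, [&file, &processRange, c] {
                    processRange(file.name, c.first, c.second); // tree-local range
                    return c.second - c.first;
                }));
            }
            std::int64_t entriesInFile = 0;
            for (auto &t : clusterTasks)
                entriesInFile += t.get();
            return entriesInFile;
        }));
    }
    std::int64_t total = 0;
    for (auto &t : fileTasks)
        total += t.get();
    return total;
}
```

Note that no global entry numbers appear anywhere: each sub-task only sees a file name and a range local to the tree in that file, which is why no TChain is required in this case.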
The implementation described above does not fit the case where a TChain has friends and/or is restricted by a TEntryList. This case, although less common, needs to be covered as well, and it has to work on a TChain to which friends are added and to which global entry lists are applied. One way to do it is to create an initial TChain in the main thread and traverse it once, forcing it to store the entry offset at which each file in the chain starts. This "wise" TChain is then copied into each thread-local TChain, so that the offsets are not lost and threads do not pay the price of opening many files just to locate a certain range of the chain.
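To illustrate why known offsets remove the need to walk through files, here is a minimal sketch in plain C++. It does not use the TChain API: `ComputeOffsets` and `Locate` are hypothetical helpers, and the per-file entry counts are assumed to have been collected once in the main thread. With the offsets available, mapping a global entry to its file is a binary search that opens no file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Given the number of entries in each file of a chain, return the global
// entry at which each file starts. A final extra element holds the total
// number of entries, so the result has entriesPerFile.size() + 1 elements.
std::vector<std::int64_t> ComputeOffsets(const std::vector<std::int64_t> &entriesPerFile)
{
    std::vector<std::int64_t> offsets(entriesPerFile.size() + 1, 0);
    for (std::size_t i = 0; i < entriesPerFile.size(); ++i) {
        assert(entriesPerFile[i] >= 0);
        offsets[i + 1] = offsets[i] + entriesPerFile[i];
    }
    return offsets;
}

// Map a global entry number to (file index, entry local to that file).
// Empty files are skipped, since no global entry can belong to them.
std::pair<std::size_t, std::int64_t> Locate(const std::vector<std::int64_t> &offsets, std::int64_t globalEntry)
{
    assert(offsets.size() >= 2);
    assert(globalEntry >= 0 && globalEntry < offsets.back());
    // First offset strictly greater than globalEntry; the file we want is
    // the one just before it.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), globalEntry);
    auto fileIdx = static_cast<std::size_t>(std::distance(offsets.begin(), it)) - 1;
    return {fileIdx, globalEntry - offsets[fileIdx]};
}
```

Copying the offsets vector to every thread plays the role that copying the "wise" TChain plays in the proposal: each thread can translate a global range into a file and a local range without touching the files in between.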
The code needs to distinguish the general case (more performant) from the friend/entry list case and apply the corresponding implementation described above.