[ROOT-7500] uint64_t streaming in derived class across Mac/linux broken since v5.34/19 Created: 24/Jul/15  Updated: 31/Jul/15  Resolved: 31/Jul/15

Status: Closed
Project: ROOT
Component/s: I/O
Affects Version/s: 6.00.00, 6.00.01, 6.00.02, 6.02.00, 6.04.00, 6.02/01, 6.02/02, 5.34/24, 6.02/03, 5.34/25, 6.02/04, 5.34/26, 6.02/05, 6.02/08, 5.34/28, 5.34/30, 6.02/10, 6.02/12, 6.04/02, 5.34/32
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jason Detwiler Assignee: Philippe Canal
Resolution: Fixed Votes: 0
Labels: None
Environment:

Linux and Mac


Attachments: File FileError.tar.gz    
Development:

 Description   

We have software deployed on a linux cluster (PDSF at NERSC running Scientific Linux 6.4) and many users with Mac laptops. We found that our analysis files created on linux could not be opened on Macs, and vice versa: if we loaded our software libraries, root would crash as soon as we tried to call TFile::Open().

We narrowed the problem down to a couple of classes, which we noticed were all classes that used uint64_t in a base class. So I constructed a test case that eliminates any of our complicated code, and sure enough, it crashes in the same way.

So it appears that streaming is broken across platforms for classes that have a base class with a uint64_t data member. Please see the attached code and the associated Makefiles for the respective systems. The macro fileError.C, when run on Linux, creates a file that contains a tree with a branch "fI" that tree->Print() reports as a ULong_t, and the file can't be opened on a Mac if libMyLib.so is loaded first. When run on a Mac, the macro creates a file that contains a tree that also has a branch "fI", but this time tree->Print() reports it as a ULong64_t, and the file can't be opened on Linux if libMyLib.so is loaded first.

Interestingly, if the tree is filled with the base class instead of the derived class, the file opens just fine! However, fI is still given the different type (ULong_t vs ULong64_t) on the two different systems.

We did extensive testing to see when this problem started, and it occurs in root6 as well as in root_v5.34/19 and subsequent patches.

Please help! This is a big problem for us, and means that we have to fix it and reprocess all of our data before our disks fill up with data coming in.



 Comments   
Comment by Philippe Canal [ 29/Jul/15 ]

Hi,

This is fixed in the master and in the patch branches for 6.04, 6.02 and 5.34.

You can work around the problem by replacing uint64_t with either ULong64_t
or 'unsigned long long'.

Alternatively you can use (in Derived):

#ifdef __linux
  ClassDef(Derived,1);
#else
  ClassDef(Derived,2);
#endif

The problem was due to a missing null pointer check and additional search
for a base class StreamerInfo.

This is triggered by a fix (to prevent silent data corruption) that enforces
that whenever a base class layout changes, the derived class version (when
one is provided) must be updated. In your case, the use of the typedef
uint64_t means that, unbeknownst to you, the base class layout changes between
Mac and Linux.

Note that in some circumstances (when member-wise streaming the derived class),
if you do not apply one of the workarounds, you will still have a problem, which
will be diagnosed with a message similar to:

Warning in <CreateReadMemberWiseActions>:    The StreamerInfo of class Derived read from file file.linux.root
has the same version (=1) as the active class but a different checksum.
You should update the version to ClassDef(Derived,2).
The objects on this file might not be readable because:
The in-memory layout version 1 for class 'Derived' has a base class (Base) with checksum 213cd568 but the on-file layout version 1 recorded the checksum value 85a36aab for this base class (Base).

Cheers,
Philippe.

Comment by Jason Detwiler [ 30/Jul/15 ]

Thank you very much Philippe. We will try updating and verify that it fixes
our problem. However, can you say a few more words about the class layout
differences between mac and linux? In our code, the base class is defined
in a module of code that must stay ROOT-free "by decree" (not my choice),
so using ULong64_t is not an option. So we used uint64_t from <stdint.h> to
guarantee a 64-bit unsigned integer for several of our data fields,
independent of what system we are on. So I'm surprised that it would result
in a different layout on Mac vs Linux? Your suggestion of "unsigned long
long" would give us "at least 64 bits" but the exact bit depth would be
system-dependent, so there I would expect (if not now then maybe
eventually) cross-platform problems requiring #ifdef __linux directives and
the like. Why does uint64_t cause a class layout differences between mac
and linux that can be resolved by going to the system-dependent type
unsigned long long, rather than the other way around?

Thanks,
Jason

Comment by Philippe Canal [ 30/Jul/15 ]

Hi Jason,

It is relatively new that typedefs like 'uint64_t' are actually standard (one page says only since C++11), and they are not yet natively supported in the I/O (and will never be natively supported in v5.34). Consequently the I/O never actually sees the typedef and only sees the resulting underlying type (i.e. long on some platforms, long long on others), and thus the schema 'looks' different (although on file both long and long long are indeed stored using 64 bits).
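
To illustrate the point about the underlying type, here is a minimal ROOT-free sketch that reports which fundamental type uint64_t aliases on the platform where it is compiled. The names kU64IsUnsignedLong, kU64IsUnsignedLongLong and Uint64UnderlyingName are hypothetical and exist only for this example. On a typical 64-bit Linux the first flag is true, and on a Mac the second one is, which matches the ULong_t vs ULong64_t difference seen in tree->Print().

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper names, for illustration only; none of this is ROOT API.

// True when uint64_t is a typedef for 'unsigned long' on this platform.
constexpr bool kU64IsUnsignedLong =
    std::is_same<std::uint64_t, unsigned long>::value;

// True when uint64_t is a typedef for 'unsigned long long' on this platform.
constexpr bool kU64IsUnsignedLongLong =
    std::is_same<std::uint64_t, unsigned long long>::value;

// Name of the fundamental type that a type-based I/O layer would see
// in place of the typedef.
constexpr const char* Uint64UnderlyingName() {
    return kU64IsUnsignedLong       ? "unsigned long"
         : kU64IsUnsignedLongLong   ? "unsigned long long"
                                    : "other";
}

// The storage size is the same either way; only the type's identity differs.
static_assert(sizeof(std::uint64_t) == sizeof(unsigned long long),
              "uint64_t and unsigned long long are expected to have the same size");
```

Printing Uint64UnderlyingName() on each system is a quick way to confirm which type name ends up in the schema there.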

> Your suggestion of "unsigned long long" would give us "at least 64 bits" but the exact bit depth would be system-dependent

In practice (at least at the moment as far as I know), all implementations of long long are exactly 64 bits.

So until we can natively support the standard typedefs, and for the limited cases where you are using them in a base class and some of the derived classes have a ClassDef macro, I would still recommend using 'unsigned long long' (with a static assert to verify that sizeof(unsigned long long) is what you need it to be).
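
A minimal sketch of that suggestion, assuming a C++11 compiler. The struct Base and its member fI stand in for the real ROOT-free base class, and kULLValueBits is a hypothetical name used only here. The check uses std::numeric_limits rather than sizeof alone, so it verifies the number of value bits directly.

```cpp
#include <limits>

// Stand-in for the ROOT-free base class; the real class is in the attachment.
struct Base {
    unsigned long long fI;  // same fundamental type name on Linux and Mac
};

// Number of value bits in unsigned long long on this platform.
constexpr int kULLValueBits = std::numeric_limits<unsigned long long>::digits;

// Fail the build on any platform where the type is not exactly 64 bits wide.
static_assert(kULLValueBits == 64,
              "unsigned long long is expected to be exactly 64 bits");
```

If the assertion ever fires on a new platform, the build stops there instead of producing files with a silently different layout.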

Cheers,
Philippe.

Comment by Philippe Canal [ 30/Jul/15 ]

Hi Jason,

Thinking about it further, in this case the I/O layouts are the same (whether using long or long long) and thus there should not be any error message.

I.e. there is one more fix needed in ROOT,
and you should not have to make any change.

Cheers,
Philippe.

Comment by Jason Detwiler [ 30/Jul/15 ]

Thanks Philippe, I was hoping such a fix was possible. I'll wait for it.
Jason

Comment by Philippe Canal [ 31/Jul/15 ]

Hi Jason,

I uploaded the last part of the fix to the master and to the patch branches for v5.34, v6.02 and v6.04. Your existing code should now work as-is in all cases.

Cheers,
Philippe.

Generated at Sat Jul 20 03:05:06 CEST 2019 using Jira 7.13.1#713001-sha1:5e06076c2d215a6f699b7e5c90ab2fae7ba5a1ce.