Details
Bug
Resolution: Fixed
High
CernVM-FS 2.8.1
None
Bug report
ANY
Description
Hello,
A software analyst updated the /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so file in our software stack. Nearly every other program in the stack depends on libc, so this library is among the most frequently accessed files and may be held open by jobs and processes that run for a long time, sometimes around 30 days or more. We have done periodic libc updates in the past with no issue, but this update was done slightly differently: using Gentoo Prefix, the libc file was updated in place, whereas previously Nix was used, which creates new files, leaves the old files in place, and updates references using symlinks.
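To illustrate the difference between the two update styles, here is a minimal sketch on an ordinary local POSIX filesystem. It says nothing about CVMFS internals; the function names and file contents are hypothetical. It only shows that an in-place update keeps the same inode, while a replace-by-rename update creates a new inode and leaves the old content visible to anyone who still holds the file open.

```python
import os
import tempfile


def update_in_place(path: str, data: bytes) -> None:
    """Overwrite the existing file: same inode, content changes under readers."""
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(data)
        f.truncate()


def update_by_replacement(path: str, data: bytes) -> None:
    """Write a new file, then atomically rename it over the old name (new inode)."""
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)


def demo() -> dict:
    """Run both update styles while a reader holds the file open."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for name, updater in (("in_place", update_in_place),
                              ("replacement", update_by_replacement)):
            path = os.path.join(d, name + ".so")
            with open(path, "wb") as f:
                f.write(b"old library")
            # Stand-in for a long-running process that opened the file earlier.
            with open(path, "rb") as held_open:
                inode_before = os.stat(path).st_ino
                updater(path, b"new library")
                results[name] = {
                    "same_inode": os.stat(path).st_ino == inode_before,
                    "held_open_sees": held_open.read(),
                }
    return results


if __name__ == "__main__":
    for style, outcome in demo().items():
        print(style, outcome)
```

On a local filesystem the in-place case reports the same inode and the already-open handle reads the new bytes, while the replacement case reports a different inode and the open handle still reads the old bytes.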
The problem began shortly after the update: user jobs on some client nodes (approximately 10% of nodes at one major site were affected) started failing with errors such as "Illegal Instructions" and segmentation faults, e.g.
May 03 10:09:26 cdr97.int.cedar.computecanada.ca cvmfs2[11186]: (soft.computecanada.ca) switched to catalog revision 5142
May 03 10:09:26 cdr97.int.cedar.computecanada.ca kernel: timeout[4899]: segfault at 103ae75f6 ip 00002ad524007c50 sp 00007ffe65386bd8 error 6 in libc-2.30.so[2ad523efb000+14
May 03 10:09:27 cdr97.int.cedar.computecanada.ca kernel: timeout[4903]: segfault at 103ae75f6 ip 00002afd77740c50 sp 00007ffdc0e5f078 error 6 in libc-2.30.so[2afd77634000+14
This is the correct file content after the update:
sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32 /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
On problematic nodes, the file contained incorrect data. In fact, frequently checksumming the file showed that the contents were not only incorrect, but changing rapidly among several different outputs, while sometimes repeating the same values:
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
e2f306ba5017db93e4172c996781e532765e4727b1b87b8fedab9c2b8757e57c /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
251e1a6713edcf07a148705eaea1428740ab7a210c17f88374b89b444c26f472 /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
e2f306ba5017db93e4172c996781e532765e4727b1b87b8fedab9c2b8757e57c /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
[cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
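The repeated checksumming above can be automated. The following is a rough sketch, not a tool shipped with CVMFS; the function names, the default number of rounds, and the returned dictionary layout are all hypothetical. It hashes one path several times and tallies how many distinct digests appear, which makes a flapping file easy to spot.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tally_digests(path, rounds=20, expected=None):
    """Hash `path` `rounds` times and count each distinct digest seen."""
    counts = Counter(sha256_of(path) for _ in range(rounds))
    return {
        "distinct": len(counts),   # 1 means the content was stable
        "counts": counts,          # digest -> number of times observed
        "correct_reads": counts.get(expected, 0) if expected else None,
    }
```

A stable file yields `distinct == 1`; on the affected nodes described here one would expect several digests, with `correct_reads` possibly at zero when the known-good checksum is passed as `expected`.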
I checked the data files on all the stratum servers at the filesystem level and confirmed they were not corrupted. HTTP downloads of the same object were also intact. The sha256sum of the compressed data file was c81ce4..., and it decompresses to content with the correct sha256sum 9c5974...
# on each stratum 1
$ sha256sum /srv/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
c81ce4a062af64d286ae7194dd907fe899f4b19738d293a6ba2d9c0ef8d2916c /srv/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128

$ wget http://cvmfs-s1-east.computecanada.ca:8000/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
$ sha256sum 87fc35b5f9438edd49cf29ea739569e43404ec-shake128
c81ce4a062af64d286ae7194dd907fe899f4b19738d293a6ba2d9c0ef8d2916c 87fc35b5f9438edd49cf29ea739569e43404ec-shake128
$ openssl zlib -d -in 87fc35b5f9438edd49cf29ea739569e43404ec-shake128 | sha256sum
9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32 -
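The same server-side check can be done without `openssl zlib`, which is not available in every OpenSSL build. This is a small sketch using Python's standard `zlib` and `hashlib` modules; the function names are hypothetical, and it assumes the downloaded object is a plain zlib stream, as the `openssl zlib -d` step above suggests.

```python
import hashlib
import zlib


def verify_object(compressed: bytes, expected_sha256: str) -> bool:
    """Decompress a zlib stream and compare the SHA-256 of the result."""
    content = zlib.decompress(compressed)
    return hashlib.sha256(content).hexdigest() == expected_sha256


def verify_object_file(path: str, expected_sha256: str) -> bool:
    """Same check, reading the compressed object from a downloaded file."""
    with open(path, "rb") as f:
        return verify_object(f.read(), expected_sha256)
```

For the object in this report, `expected_sha256` would be the known-good digest starting with 9c5974 and `path` the file fetched with wget.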
At least two separate sites were affected, so I do not think it is Squid-related either.
Also, cvmfs_talk proxy group switch and host switch did not resolve the problem.
Looking closer at an affected client, the data file stored in the cache was correct, but the client still returned incorrect data on the mounted CVMFS repository:
[cdr97 ~]$ getfattr -n user.hash /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
getfattr: Removing leading '/' from absolute path names
# file: cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
user.hash="9287fc35b5f9438edd49cf29ea739569e43404ec-shake128"
[cdr97 ~]$ sudo sha256sum /var/lib/cvmfs/shared/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32 /var/lib/cvmfs/shared/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
[cdr97 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
46e119f50d097de9e3578e99eeea0991d789b11f377c364b4f0d64abb247e55e /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
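The cache-versus-mount comparison above could be scripted along these lines. This is a sketch only: the function names are hypothetical, and the path layout (first two characters of the `user.hash` value as a subdirectory, the rest as the file name) is inferred from the transcript above rather than taken from CVMFS documentation. Reading the cache directory normally requires root, as the `sudo` above indicates.

```python
import hashlib
import posixpath


def cache_object_path(cache_dir: str, user_hash: str) -> str:
    """Map a user.hash value to its cache file, as seen in the transcript."""
    return posixpath.join(cache_dir, user_hash[:2], user_hash[2:])


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_matches_mount(cache_dir: str, user_hash: str, mounted_path: str) -> bool:
    """True if the cached object and the file served on the mount are identical."""
    cached = cache_object_path(cache_dir, user_hash)
    return sha256_of(cached) == sha256_of(mounted_path)
```

On the affected node in this report, such a check would be expected to return False, since the cached object hashes to 9c5974... while the mounted file hashes to 46e119...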
What could explain this mismatch between the cached data and the data presented on the FUSE mount?
We found that `cvmfs_config umount` did not fix the problem, but `cvmfs_config killall` did, so perhaps the problem was related to the cachemgr process, or perhaps reference counting?
We also found that if the user processes which still held open the old libc file were killed, it allowed the updated libc file to be correctly served by the filesystem.
Affected nodes were running CVMFS client versions 2.7.3, 2.8.0, and 2.8.1.
Issue Links
- relates to CVM-2002 Make fuse library/package part of the bugreport tarball (Closed)