  1. CernVM
  2. CVM-2001

Large-scale corruption of a frequently used file in CVMFS

Details

    • Bug
    • Status: In Progress
    • High
    • Resolution: Unresolved
    • CernVM-FS 2.8.1
    • None
    • CVMFS, DOC
    • None
    • Bug report
    • ANY

    Description

      Hello,

      A software analyst updated the /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so file in our software stack. Nearly every other program in the stack depends on libc, so this library is among the most frequently accessed files and may be held open by jobs and processes that run for a long time, even 30 days or more. We have done periodic libc updates in the past with no issues, but this update was done slightly differently. With Gentoo Prefix, the libc file was updated in place. Previously Nix was used, which creates new files, leaves the old files in place, and updates references using symlinks.
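
      To illustrate why the two update styles can behave differently for processes that hold a file open, here is a minimal sketch on an ordinary local POSIX filesystem. It is only an analogy: it does not model how CVMFS publishes files or how the client serves them, and the file name and helper functions are made up for the example.

```python
import os
import tempfile


def read_via_open_handle(update):
    """Open a file, apply `update` to its path, then read via the old and a new handle."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "libdemo.so")  # hypothetical stand-in for a shared library
        with open(path, "wb") as f:
            f.write(b"old contents")
        with open(path, "rb") as held:  # plays the role of a long-running job
            update(path)
            with open(path, "rb") as fresh:
                return held.read(), fresh.read()


def update_in_place(path):
    # Rewrites the existing inode, so already-open handles observe the change.
    with open(path, "wb") as f:
        f.write(b"new contents")


def update_by_replace(path):
    # Writes a new inode and swaps the name, so already-open handles keep the old data.
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(b"new contents")
    os.replace(tmp, path)


in_place = read_via_open_handle(update_in_place)
replaced = read_via_open_handle(update_by_replace)
print("in place (held, fresh):", in_place)
print("replaced (held, fresh):", replaced)
```

      With the in-place update, the old handle and a fresh open both see the new bytes. With the replace-style update, the old handle keeps reading the old bytes while a fresh open sees the new ones.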

      The problem began shortly after the update. User jobs on some client nodes (approximately 10% of nodes at one major site were affected) started failing with errors such as "Illegal Instructions" and segmentation faults, e.g.

      May 03 10:09:26 cdr97.int.cedar.computecanada.ca cvmfs2[11186]: (soft.computecanada.ca) switched to catalog revision 5142
      May 03 10:09:26 cdr97.int.cedar.computecanada.ca kernel: timeout[4899]: segfault at 103ae75f6 ip 00002ad524007c50 sp 00007ffe65386bd8 error 6 in libc-2.30.so[2ad523efb000+14
      May 03 10:09:27 cdr97.int.cedar.computecanada.ca kernel: timeout[4903]: segfault at 103ae75f6 ip 00002afd77740c50 sp 00007ffdc0e5f078 error 6 in libc-2.30.so[2afd77634000+14
      

      This is the correct file content after the update:

      sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      

      On problematic nodes, the file contained incorrect data. In fact, repeatedly checksumming the file showed that the contents were not only incorrect but also changing rapidly among several different values, with some values recurring:

      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      e2f306ba5017db93e4172c996781e532765e4727b1b87b8fedab9c2b8757e57c  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      251e1a6713edcf07a148705eaea1428740ab7a210c17f88374b89b444c26f472  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      e2f306ba5017db93e4172c996781e532765e4727b1b87b8fedab9c2b8757e57c  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      [cdr994 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      640d78edba8da8a26de1313d3e04b4f0f52c4b05b3fdddec2ceb78b0a874049f  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
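
      The repeated check above could be scripted roughly as follows. This is a sketch, not the procedure we actually ran: the function names, the number of rounds, and the delay are arbitrary choices, and the demo runs against a temporary file rather than the CVMFS path.

```python
import hashlib
import tempfile
import time
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tally_digests(path, rounds=20, delay=0.0):
    """Checksum `path` repeatedly and count how often each digest appears."""
    counts = Counter()
    for _ in range(rounds):
        counts[sha256_of(path)] += 1
        time.sleep(delay)
    return counts


# Self-contained demo on a temporary file whose contents never change.
with tempfile.NamedTemporaryFile() as demo:
    demo.write(b"stable contents")
    demo.flush()
    demo_counts = tally_digests(demo.name, rounds=5)

for digest, n in demo_counts.most_common():
    print(n, digest)
```

      A file with stable contents yields exactly one digest. More than one entry in the tally reproduces the symptom shown in the transcript.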
      

      I checked the data files on all the stratum servers at the filesystem level and confirmed they were not corrupted. The same was true of HTTP downloads. The SHA-256 checksum of the compressed data file was c81ce4..., and the file decompresses to content with the correct checksum 9c5974...

      # on each stratum 1 
      $ sha256sum /srv/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
      c81ce4a062af64d286ae7194dd907fe899f4b19738d293a6ba2d9c0ef8d2916c  /srv/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
       
       
      $ wget http://cvmfs-s1-east.computecanada.ca:8000/cvmfs/soft.computecanada.ca/data/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
      $ sha256sum 87fc35b5f9438edd49cf29ea739569e43404ec-shake128 
      c81ce4a062af64d286ae7194dd907fe899f4b19738d293a6ba2d9c0ef8d2916c  87fc35b5f9438edd49cf29ea739569e43404ec-shake128
      $ openssl zlib -d -in 87fc35b5f9438edd49cf29ea739569e43404ec-shake128 | sha256sum
      9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32  -
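
      Not every OpenSSL build includes the zlib subcommand, so the same decompress-and-checksum step can be sketched in Python. This assumes, as the openssl command above implies, that the data object is a single zlib stream. The function name is made up, and the demo uses a locally compressed payload rather than the real object.

```python
import hashlib
import os
import tempfile
import zlib


def decompressed_sha256(object_path):
    """Decompress a zlib-compressed object and return the SHA-256 of its contents."""
    with open(object_path, "rb") as f:
        compressed = f.read()
    return hashlib.sha256(zlib.decompress(compressed)).hexdigest()


# Self-contained demo: compress a small payload, then verify it round-trips.
with tempfile.TemporaryDirectory() as d:
    demo_object = os.path.join(d, "demo-object")  # hypothetical object name
    with open(demo_object, "wb") as f:
        f.write(zlib.compress(b"payload"))
    demo_digest = decompressed_sha256(demo_object)

print(demo_digest)
```

      Pointed at the downloaded data object, the function should return the 9c5974... value if the object is intact.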
      

      At least two separate sites were affected, so I do not think it is Squid-related either.
      Switching the proxy group and the host with cvmfs_talk (proxy group switch, host switch) did not resolve the problem either.

      Looking closer at an affected client, the data file stored in the cache was correct, but the client still returned incorrect data on the mounted CVMFS repository:

      [cdr97 ~]$ getfattr -n user.hash   /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      getfattr: Removing leading '/' from absolute path names
      # file: cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      user.hash="9287fc35b5f9438edd49cf29ea739569e43404ec-shake128"
      [cdr97 ~]$ sudo sha256sum /var/lib/cvmfs/shared/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
      9c5974c8916b361c6facaad761bd4ea0bdf3cefb6964e9e6802e88666a2edd32  /var/lib/cvmfs/shared/92/87fc35b5f9438edd49cf29ea739569e43404ec-shake128
      [cdr97 ~]$ sha256sum /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
      46e119f50d097de9e3578e99eeea0991d789b11f377c364b4f0d64abb247e55e  /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc-2.30.so
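
      The comparison above can be sketched as a small helper. The mapping from the user.hash value to the cache path (first two characters become a subdirectory) is taken from the transcript. The function names are made up, os.getxattr is Linux-only, and reading the cache directory needs root, as with the sudo call above.

```python
import hashlib
import os


def cache_object_path(cache_dir, content_hash):
    """Map '9287fc...-shake128' to '<cache_dir>/92/87fc...-shake128'."""
    return f"{cache_dir}/{content_hash[:2]}/{content_hash[2:]}"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_mount_to_cache(mounted_path, cache_dir):
    """Return (digest via the mount, digest of the cached object) for one file."""
    content_hash = os.getxattr(mounted_path, "user.hash").decode()
    cached = cache_object_path(cache_dir, content_hash)
    return sha256_of(mounted_path), sha256_of(cached)


# Path mapping only; no CVMFS mount is needed for this part.
demo_path = cache_object_path(
    "/var/lib/cvmfs/shared",
    "9287fc35b5f9438edd49cf29ea739569e43404ec-shake128",
)
print(demo_path)
```

      On a healthy client the two digests returned by compare_mount_to_cache should be equal. On the affected node they differ, as shown in the transcript.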
      

      What could explain this mismatch between the cached data and the data presented on the FUSE mount?

      We found that `cvmfs_config umount` did not fix the problem, but `cvmfs_config killall` did, so perhaps the problem is related to the cachemgr process, or to reference counting?

      We also found that killing the user processes that still held the old libc file open allowed the filesystem to serve the updated libc file correctly.
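
      One way to list the processes that have a given file mapped is to scan /proc/<pid>/maps. The sketch below is Linux-specific and is not the method we used. The function names and the sample maps line are made up, and the scan cannot tell which version of the file a process holds, only that the path is mapped.

```python
import glob
import os


def maps_mentions(maps_text, target):
    """Return True if any mapping in a /proc/<pid>/maps dump is backed by `target`."""
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6:
            pathname = fields[5].strip()
            if pathname in (target, target + " (deleted)"):
                return True
    return False


def find_pids(target, proc_root="/proc"):
    """Return the sorted PIDs whose readable maps file mentions `target`."""
    pids = []
    for maps_file in glob.glob(os.path.join(proc_root, "[0-9]*", "maps")):
        try:
            with open(maps_file) as f:
                text = f.read()
        except OSError:  # the process exited, or we lack permission
            continue
        if maps_mentions(text, target):
            pids.append(int(maps_file.split(os.sep)[-2]))
    return sorted(pids)


# Made-up sample line, for illustration only.
SAMPLE = "7f0000000000-7f0000100000 r-xp 00000000 00:2d 1234 /opt/demo/lib/libdemo.so\n"
print(maps_mentions(SAMPLE, "/opt/demo/lib/libdemo.so"))
```

      Calling find_pids with the libc path as root would give the candidate processes. Those started before the update are the ones likely to hold the old file.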

      CVMFS client versions of affected nodes included 2.8.0, 2.8.1, 2.7.3.

            People

              rapopesc Radu Popescu
              rptaylor Ryan Taylor
