CernVM / CVM-1803

CVMFS crashing with "no space in cache"


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: CernVM-FS 2.6.2
    • Fix Version/s: CernVM-FS 2.7, CernVM-FS 2.6.4
    • Component/s: CVMFS
    • Labels:
      None
    • Environment:

      SLES12SP3-based OS (CLE 6.0.UP07) running on a Cray XC HPC platform.

    • Operating System:
      GNU/Linux
    • Platforms:
      ANY
    • Development:

    Description

      Hi,

      On Piz Daint, ATLAS has recently started running native Singularity containers alongside CMS. Since then, we have noticed odd errors on a few nodes, such as:

      2019-09-18T16:59:13.741481+02:00 c9-0c1s13n0 cvmfs_cache_ram[12601]: clean up cache until at most 4529848 KB is used
      2019-09-18T16:59:13.741493+02:00 c9-0c1s13n0 cvmfs_cache_ram[12601]: session 'atlas.cern.ch:memory': failed to start transaction (4 - no space in cache)
      2019-09-18T16:59:13.741505+02:00 c9-0c1s13n0 cvmfs2[12867]: (atlas.cern.ch) decompressing /data/83/6acc6fcc6ee04a6d7b963cf6c5192ba66dea38P, local IO error
      2019-09-18T16:59:13.741517+02:00 c9-0c1s13n0 cvmfs2[12867]: (atlas.cern.ch) failed to fetch Part of /repo/sw/software/22.0/AthenaExternals/22.0.1/InstallArea/x86_64-centos7-gcc8-opt/bin/prmon (hash: 836acc6fcc6ee04a6d7b963cf6c5192ba66dea38, error 1 [local I/O failure])
      2019-09-18T16:59:13.741530+02:00 c9-0c1s13n0 cvmfs_cache_ram[12601]: clean up cache until at most 4529848 KB is used
      2019-09-18T16:59:13.741541+02:00 c9-0c1s13n0 cvmfs_cache_ram[12601]: session 'atlas.cern.ch:memory': failed to start transaction (4 - no space in cache) 

      It seems that a few nodes can be recovered with:

      nid01829:~ # cvmfs_config reload atlas.cern.ch
      Connecting to CernVM-FS loader... done
      Entering maintenance mode
      Draining out kernel caches (60s)
      Blocking new file system calls
      Waiting for active file system calls
      Saving inode tracker
      Saving chunk tables
      Saving inode generation
      Saving inode generation
      Saving open files table
      Saving open files counter
      Unloading Fuse module
      Re-Loading Fuse module
      Restoring inode tracker...  done
      Restoring chunk tables...  done
      Restoring inode generation...  done
      Restoring open files table... done
      Restoring open files counter...  done
      Releasing saved glue buffer
      Releasing chunk tables
      Releasing saved inode generation info
      Releasing saved open files table
      Releasing open files counter
      Activating Fuse module

      after which the repo under /cvmfs/atlas.cern.ch is usable again.

      We run a rather particular setup with a layered cache: 6 GB in RAM plus a few TB of shared cache on a GPFS filesystem common to all nodes. The CVMFS workspace and lock files live on a tmpfs filesystem; we have no proof that it filled up, but it could have done so for a few seconds.
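      For context, a layered setup like this maps onto CernVM-FS's tiered cache configuration. The fragment below is only an illustrative sketch, not our production configuration: the cache instance names (hpc, memory, gpfs) and the GPFS path are invented for this example, and the parameter names follow my reading of the CernVM-FS client cache documentation, so they may differ between client versions.

```shell
# /etc/cvmfs/default.local -- illustrative sketch only, not the production config
CVMFS_CACHE_PRIMARY=hpc              # use the tiered cache instance defined below
CVMFS_CACHE_hpc_TYPE=tiered
CVMFS_CACHE_hpc_UPPER=memory         # fast layer: in-memory cache manager
CVMFS_CACHE_hpc_LOWER=gpfs           # slow layer: shared cache on GPFS

CVMFS_CACHE_memory_TYPE=ram
CVMFS_CACHE_memory_SIZE=6000         # MB; corresponds to the ~6 GB RAM layer

CVMFS_CACHE_gpfs_TYPE=posix
CVMFS_CACHE_gpfs_BASE=/gpfs/cvmfs-cache   # hypothetical shared cache path
```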

      We never saw this before, when ATLAS and LHCb ran containers from local Shifter images and CMS used Singularity containers off /cvmfs, so my guess is that 6 GB of RAM is not sufficient for the in-memory cache with two VOs pulling files. However, I wanted to check with you in case something else is going on. The bug report is attached for reference.

      How should we proceed?

      Thanks,

      Miguel

    Attachments

    Activity

    People

    • Assignee: jblomer Jakob Blomer
    • Reporter: mgila Miguel Gila
    • Votes: 0
    • Watchers: 4
