CernVM / CVM-1895

CVMFS repos hung and processes stuck in loop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Medium
    • Resolution: Not Needed
    • Affects Version/s: CernVM-FS 2.7.1
    • Fix Version/s: CernVM-FS 2.8
    • Component/s: CVMFS
    • Labels:
      None
    • Environment:

      CentOS7

    • Platforms:
      ANY
      Description

      We have been seeing occasional client issues where a repository becomes unavailable and cannot be recovered except by killing the associated processes.
      For example:

      $ ls /cvmfs/atlas.cern.ch
      ls: cannot access '/cvmfs/atlas.cern.ch': No such file or directory
      $ sudo mount -t cvmfs atlas.cern.ch  /mnt/
      Repository atlas.cern.ch is already mounted on /cvmfs/atlas.cern.ch
      

      The command 'sudo cvmfs_config umount atlas.cern.ch'
      appeared to complete successfully, but did not actually change anything.

      'sudo cvmfs_talk -i atlas.cern.ch revision' worked and showed the revision number.
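The inconsistent state described above (the mount table lists the repository, but the path is not accessible) can be detected with a small check. The following is a minimal sketch, not part of CernVM-FS: the function name `mount_state` and the labels it returns are hypothetical, and it assumes the mount table uses the usual /proc/mounts layout with the mountpoint in the second column.

```python
import os


def mount_state(mountpoint, mounts_text, accessible=None):
    """Classify a mountpoint from mount-table text (e.g. /proc/mounts).

    Hypothetical helper for illustration only. If `accessible` is None,
    the path is probed with os.path.isdir(); note that probing a hung
    FUSE mount may itself block.
    """
    listed = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mountpoint, fstype, options, ...
        if len(fields) >= 2 and fields[1] == mountpoint:
            listed = True
            break
    if accessible is None:
        accessible = os.path.isdir(mountpoint)
    if listed and not accessible:
        return "stale"        # listed as mounted, but not usable
    if listed:
        return "mounted"
    return "not mounted"
```

On an affected node it could be called as `mount_state("/cvmfs/atlas.cern.ch", open("/proc/mounts").read())`; a result of "stale" would match the symptoms shown in this report.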

      If I understand correctly (see attached bugreport), process 5411 is the watchdog pid for this repository, 5407 is the actual pid for atlas.cern.ch, and 5407 holds the pipe(write) that 5411 is reading from.
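One way to cross-check which pipe two processes share is to compare the symlink targets under /proc/<pid>/fd. This is a minimal, Linux-only sketch with hypothetical helper names; it is not taken from the bug report tooling, and reading another user's file descriptors normally requires root.

```python
import os


def fd_targets(pid):
    """Map each open fd number of `pid` to its /proc symlink target."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
    return targets


def shared_pipes(pid_a, pid_b):
    """Return pipe identifiers (e.g. 'pipe:[12345]') open in both processes."""
    pipes_a = {t for t in fd_targets(pid_a).values() if t.startswith("pipe:")}
    pipes_b = {t for t in fd_targets(pid_b).values() if t.startswith("pipe:")}
    return pipes_a & pipes_b
```

For the processes in this report, `shared_pipes(5407, 5411)` would list the pipes that both the repository process and the presumed watchdog have open. It does not show which end each process holds.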

      One of the threads of 5407 (pid 5420 in the trace below) seemed to be stuck in a loop: it keeps returning to the accept system call on fd 15, answering the same "mountpoint" request each time, and never finishes:

      $ sudo strace -f -p 5407
      strace: Process 5407 attached with 13 threads
      [pid  5423] read(16,  <unfinished ...>
      [pid  5422] read(16,  <unfinished ...>
      [pid  5421] accept(4,  <unfinished ...>
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5419] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5418] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5417] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5416] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5415] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
      [pid  5414] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5413] read(12,  <unfinished ...>
      [pid  5412] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid  5407] futex(0x7ffc1a3a2a50, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid  5415] <... restart_syscall resumed> ) = 0
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
      [pid  5420] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 40
      [pid  5420] recvfrom(40, "mountpoint", 512, 0, NULL, NULL) = 10
      [pid  5420] sendto(40, "/cvmfs/atlas.cern.ch\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
      [pid  5420] shutdown(40, SHUT_RDWR)     = 0
      [pid  5420] close(40)                   = 0
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5415] <... poll resumed> )        = 0 (Timeout)
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
      [pid  5420] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 40
      [pid  5420] recvfrom(40, "mountpoint", 512, 0, NULL, NULL) = 10
      [pid  5420] sendto(40, "/cvmfs/atlas.cern.ch\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
      [pid  5420] shutdown(40, SHUT_RDWR)     = 0
      [pid  5420] close(40)                   = 0
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5415] <... poll resumed> )        = 0 (Timeout)
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
      [pid  5420] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 40
      [pid  5420] recvfrom(40, "mountpoint", 512, 0, NULL, NULL) = 10
      [pid  5420] sendto(40, "/cvmfs/atlas.cern.ch\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
      [pid  5420] shutdown(40, SHUT_RDWR)     = 0
      [pid  5420] close(40)                   = 0
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5415] <... poll resumed> )        = 0 (Timeout)
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
      [pid  5420] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 40
      [pid  5420] recvfrom(40, "mountpoint", 512, 0, NULL, NULL) = 10
      [pid  5420] sendto(40, "/cvmfs/atlas.cern.ch\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
      [pid  5420] shutdown(40, SHUT_RDWR)     = 0
      [pid  5420] close(40)                   = 0
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5415] <... poll resumed> )        = 0 (Timeout)
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
      [pid  5420] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 40
      [pid  5420] recvfrom(40, "mountpoint", 512, 0, NULL, NULL) = 10
      [pid  5420] sendto(40, "/cvmfs/atlas.cern.ch\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
      [pid  5420] shutdown(40, SHUT_RDWR)     = 0
      [pid  5420] close(40)                   = 0
      [pid  5420] accept(15,  <unfinished ...>
      [pid  5415] <... poll resumed> )        = 0 (Timeout)
      [pid  5415] poll([{fd=23, events=POLLIN|POLLPRI}], 1, 60000^Cstrace: Process 5407 detached
      

      The file descriptor that the accept calls are made on (fd 15) is the repository's cvmfs_io socket:

      cvmfs2  5407 cvmfs   15u  unix 0xffff8ffb38b76c00      0t0     60568 ./cvmfs_io.atlas.cern.ch
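The exchange visible in the trace (connect to the Unix socket, send the command string, read the reply until the server shuts the connection down) can be reproduced with a few lines of Python. This is only a sketch inferred from the strace output above, not the cvmfs_talk implementation; the function name is hypothetical, and the full path of the socket is not shown in this report, so the caller has to supply it.

```python
import socket


def query_control_socket(socket_path, command, timeout=5.0):
    """Send one command to a Unix stream socket and return the full reply.

    Mirrors the pattern in the trace: the server reads the command,
    writes its answer, then shuts down and closes the connection.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)  # avoid blocking forever on a hung server
        sock.connect(socket_path)
        sock.sendall(command.encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

Calling it with the command "mountpoint" against the cvmfs_io.atlas.cern.ch socket should return the mountpoint string seen in the trace. The timeout is a design choice for this sketch, so that a hung server raises an exception instead of blocking the caller.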
      

              People

              Assignee:
              jblomer Jakob Blomer
              Reporter:
              rptaylor Ryan Taylor