Uploaded image for project: 'CernVM'
  1. CernVM
  2. CVM-810

The cvmfs_server is_snapshotting lock file doesn't get cleaned up on crash

    Details

    • Type: Bug
    • Status: Closed
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: CernVM-FS 2.1.20
    • Fix Version/s: CernVM-FS 2.2.0
    • Component/s: CVMFS
    • Labels:
      None
    • Platforms:
      x86_64-slc6-gcc48-opt

      Description

      If a system crashes during a cvmfs_server snapshot, the is_snapshotting lock file is left around and after restart the others wait forever. This could be fixed by putting a process id in the lock file and having new snapshot commands that come along do kill -0 on the process id to see if it is still running, if not clean out the lock file.

        Attachments

          Activity

          Hide
          bbockelm Brian Paul Bockelman added a comment -

          The problem is likely in the use of a file for locking in the first place - as opposed to POSIX or BSD locks (see 'man fcntl'), which would have prevented this in the first place. I can dredge up a few examples, but it's a pretty straightforward exercise.

          The algorithm proposed above is not particularly useful as PIDs are reused. What happens if the 'apache' process has the same PID as the old running process?

          Show
          bbockelm Brian Paul Bockelman added a comment - The problem is likely in the use of a file for locking in the first place - as opposed to POSIX or BSD locks (see 'man fcntl'), which would have prevented this in the first place. I can dredge up a few examples, but it's a pretty straightforward exercise. The algorithm proposed above is not particularly useful as PIDs are reused. What happens if the 'apache' process has the same PID as the old running process?
          Hide
          dwd Dave Dykstra added a comment -

          I'm not sure about the advantages or disadvantages of using fcntl, but I know that quite a few /etc/init.d scripts at least use the PID in the file trick. Maybe because it's a little challenging to use fcntl from a shell script, I don't know. You're right that there are rare cases where it doesn't work; I ran into that with frontier-squid and worked around it by looking at the program name linked to /proc/<pid>/exe. Anyway doing a PID in a file is much better than doing nothing.

          Show
          dwd Dave Dykstra added a comment - I'm not sure about the advantages or disadvantages of using fcntl, but I know that quite a few /etc/init.d scripts at least use the PID in the file trick. Maybe because it's a little challenging to use fcntl from a shell script, I don't know. You're right that there are rare cases where it doesn't work; I ran into that with frontier-squid and worked around it by looking at the program name linked to /proc/<pid>/exe. Anyway doing a PID in a file is much better than doing nothing.
          Hide
          bbockelm Brian Paul Bockelman added a comment -

          Yup - there's quite a few init scripts out there that implement things in various ways. [You might want to talk to Mine about what you describe in the frontier-squid init script; that's exactly what Gratia used to do and, in the end, it was considered a security issue.]

          Using proper locks is very beneficial in the end; even if it requires a bit of C code, I think it's well-worth the effort.

          Show
          bbockelm Brian Paul Bockelman added a comment - Yup - there's quite a few init scripts out there that implement things in various ways. [You might want to talk to Mine about what you describe in the frontier-squid init script; that's exactly what Gratia used to do and, in the end, it was considered a security issue.] Using proper locks is very beneficial in the end; even if it requires a bit of C code, I think it's well-worth the effort.
          Hide
          rmeusel Rene Meusel (Inactive) added a comment - - edited

          Created a pull request for this issue using "hardlink creation" as atomic operation and PIDs for stale lock detection.

          Show
          rmeusel Rene Meusel (Inactive) added a comment - - edited Created a pull request for this issue using "hardlink creation" as atomic operation and PIDs for stale lock detection.
          Hide
          dwd Dave Dykstra added a comment -

          I do not see this in the cvmfs-server 2.2.0 pre-release announcement. Did it not make it in? We hit this bug on the FNAL stratum 1 this week so I was thinking about asking them to upgrade, but it won't help if the fix didn't get in.

          Show
          dwd Dave Dykstra added a comment - I do not see this in the cvmfs-server 2.2.0 pre-release announcement . Did it not make it in? We hit this bug on the FNAL stratum 1 this week so I was thinking about asking them to upgrade, but it won't help if the fix didn't get in.
          Hide
          rmeusel Rene Meusel (Inactive) added a comment -

          I can confirm that the pull request I referenced above is part of the release. It got merged on August 17th and the released commit is from August 25th.
          Out of curiosity: How often do you see this problem as it should only appear on really nasty crashes, no?

          Show
          rmeusel Rene Meusel (Inactive) added a comment - I can confirm that the pull request I referenced above is part of the release. It got merged on August 17th and the released commit is from August 25th. Out of curiosity: How often do you see this problem as it should only appear on really nasty crashes, no?
          Hide
          dwd Dave Dykstra added a comment -

          Please update the announcement then.

          We saw this problem because of an HA master switch; that process does a hard abort of a snapshot in progress, and it left behind the lock.

          Show
          dwd Dave Dykstra added a comment - Please update the announcement then. We saw this problem because of an HA master switch; that process does a hard abort of a snapshot in progress, and it left behind the lock.
          Hide
          jblomer Jakob Blomer added a comment -

          Thanks for the notice! We updated the release notes.

          Show
          jblomer Jakob Blomer added a comment - Thanks for the notice! We updated the release notes.

            People

            • Assignee:
              rmeusel Rene Meusel (Inactive)
              Reporter:
              dwd Dave Dykstra
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: