[HTCondor-users] .update.ad problems after upgrade.

Hello, htcondor peoples,

My general strategy is Never Upgrade, because upgrading always causes
problems. It's unavoidable, of course, so on Friday I upgraded from
condor 8.6.5 to condor 8.8.9. Things seemed to go well over the weekend,
possibly because nobody was submitting jobs, but that didn't last.

Currently I'm seeing a LOT of these in my log files:

07/20/20 14:56:07 (pid:10322) Failed to open '.update.ad' to read update
ad: No such file or directory (2).

I'm also having users report jobs failing. Immediately following the
line above:

07/20/20 14:56:07 (pid:10322) All jobs have exited... starter exiting
07/20/20 14:56:07 (pid:10322) **** condor_starter (condor_STARTER) pid

From what I've seen, this file should be created in /var/condor/execute,
which definitely exists on the node in question, and I believe the
permissions are fine:

angrist-14 14:59:24$ ls -al /var/condor/execute/
total 8
drwxr-xr-x 2 condor bin  4096 Jul 20 14:56 .
drwxr-xr-x 6 root   root 4096 Jul 30  2019 ..

Google has not presented me with a wealth of fellow htcondor users
having this problem upon upgrade, so at this point I'm not positive this
IS a problem? Is it THE problem that's causing these jobs to fail? What
the heck can I do to diagnose/resolve this issue?
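
One way to check whether the '.update.ad' message really lines up with
the starter exits is to tally both per starter pid. Below is a minimal
sketch in Python, not a definitive tool. It assumes every log entry
starts with the "MM/DD/YY HH:MM:SS (pid:N)" prefix seen in the excerpts
above; the function name tally and the sample list are made up for
illustration.

```python
import re
from collections import defaultdict

# Prefix as seen in the excerpts above: "07/20/20 14:56:07 (pid:10322) message"
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")


def tally(lines):
    """Count '.update.ad' failures and note starter exits, keyed by pid.

    `tally` is a hypothetical helper, not part of HTCondor.
    """
    stats = defaultdict(lambda: {"update_ad_failures": 0, "exited": False})
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # wrapped continuation lines carry no prefix
        _, pid, msg = m.groups()
        if "Failed to open '.update.ad'" in msg:
            stats[pid]["update_ad_failures"] += 1
        elif "starter exiting" in msg:
            stats[pid]["exited"] = True
    return dict(stats)


# Sample input copied from the log lines quoted in this message.
sample = [
    "07/20/20 14:56:07 (pid:10322) Failed to open '.update.ad' to read update",
    "ad: No such file or directory (2).",
    "07/20/20 14:56:07 (pid:10322) All jobs have exited... starter exiting",
]
result = tally(sample)
print(result)
```

To run it against a real log, pass an open file object to tally() in
place of the sample list. Pids that exit without ever logging the
'.update.ad' failure would suggest the message is not the cause.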

Any help would be incredibly appreciated. The cluster is being lightly
used right now, but things may get really loud and angry if certain
students and researchers start using the cluster again soon.