
Re: [HTCondor-users] .update.ad problems after upgrade.



The warning messages about .update.ad are probably a red herring.  

In the 8.8 release it is very common to see a harmless warning message about reading the .update.ad, but it goes away once the Starter has fully initialized.  You did not see this message in 8.6 because 8.6 did not have update ads; they were added in the 8.7 series.

If the warning message continues to repeat after the job has started, then it is an indication of a real problem, but
even then it will not prevent jobs from running; it just means that changes like disk usage may not properly update
between Startd and Starter.

If jobs are failing to start, we need to look in the StartLog or StarterLog.* for some other indication of what the problem is.
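
One way to do that search is to filter the .update.ad warning out of the logs and read what is left. The following is only a minimal Python sketch, not an HTCondor tool: the line pattern is inferred from the log excerpts quoted below, the function name non_update_ad_lines is made up for illustration, and the log files to scan are whatever paths you pass on the command line.

```python
import re
import sys
from collections import defaultdict

# Line shape inferred from the excerpts in this thread (an assumption):
#   07/20/20 14:56:07 (pid:10322) message text
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")


def non_update_ad_lines(lines):
    """Group log messages by pid, skipping the '.update.ad' warning.

    Hypothetical helper for illustration; not part of HTCondor.
    Returns {pid: [(timestamp, message), ...]}.
    """
    by_pid = defaultdict(list)
    last_pid = None
    skipping = False
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            timestamp, pid, message = match.groups()
            last_pid = pid
            # Drop the warning that is expected while the Starter initializes.
            skipping = ".update.ad" in message
            if not skipping:
                by_pid[pid].append((timestamp, message))
        elif last_pid is not None and not skipping and line.strip():
            # A line without a timestamp is treated as a continuation
            # of the previous kept message.
            timestamp, message = by_pid[last_pid][-1]
            by_pid[last_pid][-1] = (timestamp, message + " " + line.strip())
    return dict(by_pid)


if __name__ == "__main__":
    # Pass the log files to scan as arguments, e.g. your StarterLog.* files.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for pid, entries in non_update_ad_lines(handle).items():
                for timestamp, message in entries:
                    print(f"{path} pid {pid} {timestamp} {message}")
```

Whatever remains for a given pid just before the "starter exiting" line is the place to start reading.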

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Amy Bush
Sent: Monday, July 20, 2020 3:03 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] .update.ad problems after upgrade.

Hello, htcondor peoples,

My general strategy is Never Upgrade, because upgrading always causes
problems. It's unavoidable, of course, so on Friday I upgraded from
condor 8.6.5 to condor 8.8.9. Things seemed to go well over the weekend,
possibly because nobody was submitting jobs, but that didn't last.

Currently I'm seeing a LOT of these in my log files:

07/20/20 14:56:07 (pid:10322) Failed to open '.update.ad' to read update
ad: No such file or directory (2).

I'm also having users report jobs failing. Immediately following the
line above:

07/20/20 14:56:07 (pid:10322) All jobs have exited... starter exiting
07/20/20 14:56:07 (pid:10322) **** condor_starter (condor_STARTER) pid
10322 EXITING WITH STATUS 0

From what I've seen, this file should be created in /var/condor/execute,
which definitely exists on the node in question, and I believe the
permissions are fine:

angrist-14 14:59:24$ ls -al /var/condor/execute/
total 8
drwxr-xr-x 2 condor bin  4096 Jul 20 14:56 .
drwxr-xr-x 6 root   root 4096 Jul 30  2019 ..

google has not presented me with a wealth of fellow htcondor users
having this problem upon upgrade, so at this point I'm not positive this
IS a problem? Is it THE problem that's causing these jobs to fail? What
the heck can I do to diagnose/resolve this issue?

Any help would be incredibly appreciated. The cluster is being lightly
used right now, but things may get really loud and angry if certain
student researchers start using the cluster again right now.

--
amy