
[condor-users] Dagman stalling with shadow exception messages?



Hi all, I'm trying to track down the cause of this intermittent problem 
we're having with Condor 6.6.1 on RedHat 8.  For some reason, every once 
in a while, a dagman job will get "stuck" in a state where it can no 
longer submit any jobs to any render machines.

In the last three months we've run well over 500 dagman jobs here (each 
with as many as 1600 individual jobs), and things generally work pretty 
well.  I'd say this has happened on at most 5% of the dagman jobs.  When 
it does happen, it happens after many of the jobs in the dag have already 
run, and when there are definitely resources available that should match 
the remaining jobs.  There are no dag dependencies that it could be 
waiting on.  There's plenty of RAM and disk space on both the submit host 
and the render hosts.  The only workaround seems to be to delete the dag 
job from the queue and re-submit the remaining jobs (which then proceed to 
run fine).
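In case it helps anyone reproduce the workaround, here's roughly what we do when a dag gets stuck. This is just a sketch -- the cluster id and dag file name are placeholders, and it assumes dagman has written its usual rescue file (dagfile.rescue) next to the original dag file when the job is removed:

```shell
# Placeholder values -- substitute the stuck dagman job's cluster id
# (from condor_q) and your actual DAG file.
CLUSTER=22190
DAGFILE=render.dag

# Remove the wedged dagman job from the queue; on exit, dagman writes
# a rescue DAG recording which nodes already completed.
condor_rm "$CLUSTER"

# Re-submit from the rescue DAG so the finished nodes are skipped and
# only the remaining jobs run.
condor_submit_dag "$DAGFILE.rescue"
```

After doing this, the remaining jobs match and run normally, which is part of why this looks like a dagman problem rather than a resource problem.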

Anyone else have a problem like this, or have we uncovered an obscure bug 
in dagman?

Thanks for any help....

-Mike



Here are snippets from some logs which may be relevant:


----dagman.out (this line repeated many times...):
4/6 21:00:24 Event: ULOG_SHADOW_EXCEPTION for Condor Job 
st006_comp_tk25__296-300 (22190.0.0)


-- ShadowLog on submit host:
4/6 21:00:27 Initializing a VANILLA shadow
4/6 21:00:27 (22190.0) (7173): Request to run on <192.168.1.111:32771> was 
ACCEPTED
4/6 21:00:27 (22190.0) (7173): ERROR "Can no longer talk to condor_starter 
on execute machine (192.168.1.111)" at line 63 in file NTreceivers.C
----------------------
Note: The above message is repeated for any render host that gets matched, 
and the hosts are definitely up and visible to the submit host.  In 
addition, that same render host will happily render other jobs from other 
dags in other people's queues.


-- StartLog on render host:
4/6 21:00:02 DaemonCore: Command received via UDP from host 
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling 
handler (command_handler)
4/6 21:00:02 vm1: Called deactivate_claim()
4/6 21:00:02 Starter pid 16818 exited with status 0
4/6 21:00:02 vm1: State change: starter exited
4/6 21:00:02 vm1: Changing activity: Busy -> Idle
4/6 21:00:02 DaemonCore: Command received via TCP from host 
<192.168.1.88:45808>
4/6 21:00:02 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), 
calling handler (command_handler)
4/6 21:00:02 vm1: Called deactivate_claim_forcibly()
4/6 21:00:02 DaemonCore: Command received via UDP from host 
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling 
handler (command_handler)
4/6 21:00:02 vm1: State change: received RELEASE_CLAIM command
4/6 21:00:02 vm1: Changing state and activity: Claimed/Idle -> 
Preempting/Vacating
4/6 21:00:02 vm1: State change: No preempting claim, returning to owner
4/6 21:00:02 vm1: Changing state and activity: Preempting/Vacating -> 
Owner/Idle
4/6 21:00:02 vm1: State change: IS_OWNER is false
4/6 21:00:02 vm1: Changing state: Owner -> Unclaimed
4/6 21:00:02 DaemonCore: Command received via UDP from host 
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling 
handler (command_handler)
4/6 21:00:02 Error: can't find resource with capability 
(<192.168.1.111:32771>#7698602094)
----------------------
Note: That last line puzzles me.  I don't know what the #7698602094 refers 
to.


