[Condor-users] Dagman lost track of one of its nodes
- Date: Fri, 19 Nov 2010 10:38:54 -0500
- From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Dagman lost track of one of its nodes
I was running a large DAG yesterday (100K nodes, on about 800 worker
nodes). After all the jobs finished, the dagman process was still in
the queue, but all the nodes were done.
A little digging revealed the answer. Dagman thought one node was
From the dagman.out file:
11/19/10 09:48:28 Pending DAG nodes:
11/19/10 09:48:28 Node e111_4805t3-2gmva1, Condor ID 9818391, status
and looking at the user job log:
000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <[ip
001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <[ip
006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728
The job never wrote event 005 (job terminated) to the user log when it exited.
But the Schedd saw it exit:
11/18/10 18:03:47 (pid:4586) Starting add_shadow_birthdate(9818391.0)
11/18/10 18:03:47 (pid:4586) Started shadow for job 9818391.0 on machine@xxxxxxxxxxxxx
<[ip removed]:37699?CCBID=[ip removed]:9639#6840> for
user@xxxxxxxxxxx, (shadow pid = 7471)
11/18/10 18:11:01 (pid:4586) Shadow pid 7471 for job 9818391.0 exited
with status 100
And from the shadow log:
11/18/10 18:03:47 Initializing a VANILLA shadow for job 9818391.0
11/18/10 18:03:47 (9818391.0) (7471): Request to run on machine@xxxxxxxxxxxxx
<[ip removed]:37699?CCBID=[ip removed]:9639#6840> was ACCEPTED
11/18/10 18:11:01 (9818391.0) (7471): Job 9818391.0 terminated: exited
with status 0
11/18/10 18:11:01 (9818391.0) (7471): **** condor_shadow
(condor_SHADOW) pid 7471 EXITING WITH STATUS 100
The job wrote out its stdout, stderr, and other job-specific files
okay. Why did this job get orphaned?
I've seen similar things happen with other recent jobs.
I'm running Condor 7.5.4. My gut instinct is that it's related to
running things over NFS. But it's been just one or two jobs in these
100K-node DAGs that seem to exhibit this behavior, and I haven't
really seen any other odd behavior. I don't see anything in the log
files that shows an error related to writing out the data.
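
For anyone who wants to check their own user job logs for the same symptom, here is a minimal Python sketch that lists jobs with a submit event (000) but no terminal event. It assumes event header lines have the form shown in the excerpt above ("NNN (cluster.proc.subproc) ..."), and it treats 005 (terminated) and 009 (aborted) as the terminal events. The function name find_unfinished_jobs is made up for illustration and is not part of Condor. It only reports which jobs are missing the event; it says nothing about why.

```python
import re

# Event header lines look like:
#   "001 (9818391.000.000) 11/18 18:03:49 Job executing on host: ..."
# Assumption: three-digit event code, then (cluster.proc.subproc).
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

SUBMIT = "000"
TERMINAL = {"005", "009"}  # 005 = job terminated, 009 = job aborted


def find_unfinished_jobs(lines):
    """Return 'cluster.proc' IDs that were submitted but never logged a terminal event."""
    submitted = set()
    finished = set()
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue  # continuation line of an event body, or the "..." separator
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        if code == SUBMIT:
            submitted.add(job)
        elif code in TERMINAL:
            finished.add(job)
    # Sort numerically by (cluster, proc) before formatting.
    return [f"{cluster}.{proc}" for cluster, proc in sorted(submitted - finished)]


# The three events quoted above for the orphaned job (lines truncated as in the post).
sample = [
    "000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <[ip",
    "001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <[ip",
    "006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728",
]
print(find_unfinished_jobs(sample))  # prints ['9818391.0']
```

To run it against a real log, pass an open file instead of the sample list, e.g. find_unfinished_jobs(open(path_to_user_log)). Jobs that are still legitimately running will also show up, so the output is only meaningful once everything should have finished.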