
[Condor-users] Dagman lost track of one of its nodes

I was running a large DAG yesterday (100K nodes, on about 800 worker nodes). After all the jobs finished, the dagman process was still in the queue, even though all the nodes were done. A little digging revealed the answer: Dagman thought one node was still running.

From the dagman.out file:
11/19/10 09:48:28 Pending DAG nodes:
11/19/10 09:48:28 Node e111_4805t3-2gmva1, Condor ID 9818391, status STATUS_SUBMITTED

and looking at the user job log:
000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <[ip removed]:40621>
001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <[ip removed]:37699?CCBID=[ip removed]:9639#6840>
006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728

The job never wrote out status 005 when it exited.
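
For anyone who wants to check their own logs for the same symptom, here is a minimal sketch of a scan for jobs that have a submit event (000) but no closing event. It assumes each event header starts its own line in the form shown above, and it treats 005 (terminated) and 009 (aborted) as the closing events. The function name and the sample lines are made up for illustration; this is not a Condor tool.

```python
import re

# Event header at the start of a line, e.g. "005 (9818391.000.000) 11/18 ..."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

# Assumption: these are the event codes that close out a job in the user log.
CLOSING_EVENTS = {"005", "009"}


def find_unfinished_jobs(log_lines):
    """Return sorted (cluster, proc) pairs that were submitted but never closed."""
    submitted = set()
    closed = set()
    for line in log_lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # continuation line or separator, not an event header
        event = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        if event == "000":
            submitted.add(job)
        elif event in CLOSING_EVENTS:
            closed.add(job)
    return sorted(submitted - closed)


if __name__ == "__main__":
    # Hypothetical sample: the first job mirrors the excerpt above,
    # the second is an invented job that finished normally.
    sample = [
        "000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <host>",
        "001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <host>",
        "006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728",
        "000 (9818392.000.000) 11/18 18:03:48 Job submitted from host: <host>",
        "005 (9818392.000.000) 11/18 18:10:00 Job terminated.",
    ]
    for cluster, proc in find_unfinished_jobs(sample):
        print(f"{cluster}.{proc} has no closing event")
```

To run it against a real log, pass an open file object instead of the sample list. Jobs that are still legitimately running will show up too, so it is only meaningful once the queue is otherwise empty.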

But the Schedd saw it exit:

11/18/10 18:03:47 (pid:4586) Starting add_shadow_birthdate(9818391.0)
11/18/10 18:03:47 (pid:4586) Started shadow for job 9818391.0 on machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840> for user@xxxxxxxxxxx, (shadow pid = 7471)
11/18/10 18:11:01 (pid:4586) Shadow pid 7471 for job 9818391.0 exited with status 100

11/18/10 18:03:47 Initializing a VANILLA shadow for job 9818391.0
11/18/10 18:03:47 (9818391.0) (7471): Request to run on machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840> was ACCEPTED
11/18/10 18:11:01 (9818391.0) (7471): Job 9818391.0 terminated: exited with status 0
11/18/10 18:11:01 (9818391.0) (7471): **** condor_shadow (condor_SHADOW) pid 7471 EXITING WITH STATUS 100

The job wrote out its stdout, stderr, and other job-specific files okay. Why did this job get orphaned?
I've seen similar things happen with other recent jobs.
I'm running Condor 7.5.4. My gut instinct is that it's related to running things over NFS. But it's been just one or two jobs in these 100K-node DAGs that seem to exhibit this behavior, and I haven't really seen any other odd behavior. I don't see anything in the log files that shows an error related to writing out the data.