
Re: [Condor-users] Dagman lost track of one of its nodes



Peter,

That does sound like a serious issue. A couple of questions:

- Did more than one job write to the same user log file? (See the sketch below.)
- Did you use locking?
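
By "the same user log file" I mean node submit descriptions that all point at one log, roughly like this sketch (the file names and paths here are made up, not taken from your DAG):

# nodeA.sub -- hypothetical DAG node submit file
executable = nodeA.sh
log        = /shared/nfs/dag_jobs.log    # same user log as the other nodes
output     = nodeA.out
error      = nodeA.err
queue

# nodeB.sub -- a second node appending events to the same user log
executable = nodeB.sh
log        = /shared/nfs/dag_jobs.log
output     = nodeB.out
error      = nodeB.err
queue

When many writers share one log, especially over NFS, a missing or garbled event is exactly the kind of symptom we would expect if locking is not working.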

Could you open a ticket with condor-admin@xxxxxxxxxxx and attach the complete user log file and any related dagman.out files? That would be extremely helpful to us in debugging the issue.

In case the files are too large to attach, could you put them online somewhere and send us a download link?

Thanks a lot!

-- Cathrin

On 11/19/2010 09:38 AM, Peter Doherty wrote:
I was running a large DAG yesterday (100K nodes, on about 800 worker
nodes). After all the jobs finished, the dagman process was still in the
queue, but all the nodes were done.
A little digging revealed the answer: DAGMan thought one node was still
running.

From the dagman.out file:
11/19/10 09:48:28 Pending DAG nodes:
11/19/10 09:48:28 Node e111_4805t3-2gmva1, Condor ID 9818391, status
STATUS_SUBMITTED


and looking at the user job log:
000 (9818391.000.000) 11/18 18:03:47 Job submitted from host: <[ip
removed]:40621>
001 (9818391.000.000) 11/18 18:03:49 Job executing on host: <[ip
removed]:37699?CCBID=[ip removed]:9639#6840>
006 (9818391.000.000) 11/18 18:08:57 Image size of job updated: 177728

The job never wrote a 005 (job terminated) event to the log when it exited.

But the Schedd saw it exit:

SchedLog:
11/18/10 18:03:47 (pid:4586) Starting add_shadow_birthdate(9818391.0)
11/18/10 18:03:47 (pid:4586) Started shadow for job 9818391.0 on
machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840>
for user@xxxxxxxxxxx, (shadow pid = 7471)
11/18/10 18:11:01 (pid:4586) Shadow pid 7471 for job 9818391.0 exited
with status 100

ShadowLog:
11/18/10 18:03:47 Initializing a VANILLA shadow for job 9818391.0
11/18/10 18:03:47 (9818391.0) (7471): Request to run on
machine@xxxxxxxxxxxxx <[ip removed]:37699?CCBID=[ip removed]:9639#6840>
was ACCEPTED
11/18/10 18:11:01 (9818391.0) (7471): Job 9818391.0 terminated: exited
with status 0
11/18/10 18:11:01 (9818391.0) (7471): **** condor_shadow (condor_SHADOW)
pid 7471 EXITING WITH STATUS 100
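
So the shadow saw the job terminate normally, but the user log never got the 005 event. A quick way to look for other nodes in the same state is to scan the user log for job IDs that logged an execute (001) event but never a terminate (005) event. This is just a rough helper sketch, not anything that ships with Condor; the script name and the regex are my own:

#!/usr/bin/env python
# find_unterminated.py (hypothetical helper): list jobs that logged an
# execute (001) event but never a terminate (005) event.
# Assumes the usual user log layout, where each event starts with a line
# like "001 (9818391.000.000) 11/18 18:03:49 ...".
import re
import sys

event_re = re.compile(r'^(\d{3}) \((\d+\.\d+)\.\d+\)')

executed = set()
terminated = set()
with open(sys.argv[1]) as log:
    for line in log:
        m = event_re.match(line)
        if not m:
            continue
        code, job = m.group(1), m.group(2)
        if code == '001':
            executed.add(job)
        elif code == '005':
            terminated.add(job)

for job in sorted(executed - terminated):
    print('no terminate (005) event for job %s' % job)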


The job wrote out its stdout, stderr, and other job-specific files okay.
Why did this job get orphaned?
I've seen similar things happen with other recent jobs.
I'm running Condor 7.5.4. My gut instinct is that it's related to
running things over NFS. But it's been just one or two jobs in these
100K-node DAGs that exhibit this behavior, and I haven't really
seen any other odd behavior. I don't see anything in the log files that
shows an error related to writing out the data.
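
If NFS locking is the suspect, the user log locking configuration may be worth a look. This is only a sketch of the knobs I would double-check, and the names and defaults should be verified against the manual for 7.5.4:

# condor_config sketch (verify against your version's manual)
# Lock the user log before each event is appended.
ENABLE_USERLOG_LOCKING = True
# If the user log lives on NFS, where file locking can be unreliable,
# keep the lock files on local disk instead of next to the log.
CREATE_LOCKS_ON_LOCAL_DISK = True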



--
Cathrin Weiss
Condor Project