
[Condor-users] DAGman problem



Hi,

I'm trying to submit a DAG to Condor. For the moment I'm only running
tests, so each job is just a HelloWorld.
My case is a simple "problem division" + "parallel processing" + "result
composition" workflow, so my DAG looks like this:

 JOB J1 job1.cmd
 JOB J2 job2.cmd
 JOB J3 job3.cmd
 PARENT J1 CHILD J2
 PARENT J2 CHILD J3

My 3 jobs are very similar, except that:
job1 calls "java Hello job1" (which logs "Hello World job1"),
job2 calls "java Hello job2" and is queued 5 times,
job3 calls "java Hello job3".
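
For readers following along, a submit file matching the description of job2
might look roughly like the sketch below. The actual job2.cmd is not shown in
this message, so the universe, the class file name, and the output/error/log
file names are all assumptions made for illustration.

```
# Hypothetical sketch of job2.cmd (not the poster's actual file).
# Assumes the Java universe; a vanilla-universe job that invokes
# "java" directly would be set up differently.
universe   = java
executable = Hello.class
# In the Java universe the first argument names the main class.
arguments  = Hello job2
# Assumed file names; $(Process) gives each queued instance its own files.
output     = job2.$(Process).out
error      = job2.$(Process).err
log        = job2.log
# "queued 5 times", as described above.
queue 5
```

With "queue 5", a single submission creates one cluster containing five
processes (for example 200.0 through 200.4), which is the ID pattern that
appears in the DAG log further down.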

job1 is executed correctly and logs:
Hello World
arg[0]:job1
job2 is then launched 5 times... but then I see 10 occurrences, because
apparently Condor retries each one, thinking the tasks failed. But when
I look at the log files, they contain what they should:
Hello World
arg[0]:job2
As the jobs rerun, the log files are emptied and then filled again. At
the end my log files are OK, meaning that the execution finished
correctly, but the 3rd job never runs.

Reading the DAG log, I found strange errors (I only include the last lines;
there are many more like them before):

5/16 15:49:29 Job submit failed after 6 tries.
5/16 15:49:29 Event: ULOG_EXECUTE for Unknown Job (196.4): ignoring...
5/16 15:49:29 Event: ULOG_JOB_TERMINATED for Unknown Job (196.4): ignoring...
5/16 15:49:29 ERROR: job J2: job ID in userlog submit event (200.0) doesn't match ID reported earlier by submit command (199.4)!  Trusting the userlog for now., but this is scary!
5/16 15:49:29 Event: ULOG_SUBMIT for Condor Job J2 (200.0)
5/16 15:49:29 ERROR: job J2: job ID in userlog submit event (200.1) doesn't match ID reported earlier by submit command (200.0)!  Trusting the userlog for now., but this is scary!
5/16 15:49:29 Event: ULOG_SUBMIT for Condor Job J2 (200.1)
5/16 15:49:29 ERROR: job J2: job ID in userlog submit event (200.2) doesn't match ID reported earlier by submit command (200.1)!  Trusting the userlog for now., but this is scary!
5/16 15:49:29 Event: ULOG_SUBMIT for Condor Job J2 (200.2)
5/16 15:49:29 ERROR: job J2: job ID in userlog submit event (200.3) doesn't match ID reported earlier by submit command (200.2)!  Trusting the userlog for now., but this is scary!
5/16 15:49:29 Event: ULOG_SUBMIT for Condor Job J2 (200.3)
5/16 15:49:29 ERROR: job J2: job ID in userlog submit event (200.4) doesn't match ID reported earlier by submit command (200.3)!  Trusting the userlog for now., but this is scary!
5/16 15:49:29 Event: ULOG_SUBMIT for Condor Job J2 (200.4)
5/16 15:49:29 Of 3 nodes total:
5/16 15:49:29  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
5/16 15:49:29   ===     ===      ===     ===     ===        ===      ===
5/16 15:49:29     1       0        0       0       0          1        1
5/16 15:49:29 ERROR: the following job(s) failed:
5/16 15:49:29 ---------------------- Job ----------------------
5/16 15:49:29       Node Name: J2
5/16 15:49:29          NodeID: 1
5/16 15:49:29     Node Status: STATUS_ERROR
5/16 15:49:29           Error: Job submit failed
5/16 15:49:29 Job Submit File: job2.cmd
5/16 15:49:29  Condor Job ID: (200.4.0)
5/16 15:49:29       Q_PARENTS: 0, <END>
5/16 15:49:29       Q_WAITING: <END>
5/16 15:49:29      Q_CHILDREN: 2, <END>
5/16 15:49:29 ---------------------------------------   <END>
5/16 15:49:29 Aborting DAG...
5/16 15:49:29 Writing Rescue DAG to Dag.rescue...
5/16 15:49:29 **** condor_scheduniv_exec.193.0 (condor_DAGMAN) EXITING WITH STATUS 1

I don't understand the job ID problem. Does anyone have a clue about it?

Thanks,

Matthieu Cargnelli