
[condor-users] Dagman script problem.



Hi 

I'm trying out DAGMan to shorten some jobs, and I'm having problems getting
Condor to run any of my pre/post scripts. 
The first job in the DAG runs fine, then the whole thing stops at the
first post script. 
Whether it's POST A or PRE B doesn't make any difference. 

The following is an example of one of the post scripts that run between
each job. 

-rwxr-xr-x    1 zhimei   users         178 Nov 27 17:04 copyPreStep2.sh 

#!/bin/sh

cp /home/step1/REVCON /home/step2/CONFIG
cp /home/step1/REVIVE /home/step2/REVOLD
cp /home/step1/FIELD /home/step2/FIELD
cp /home/step1/CONTROL /home/step2/CONTROL
/usr/bin/perl preStep1.pl
chmod a+w /home/zhimei/test-step/step2/*

exit 1

This script works just fine when run manually from the command line. The
perl line executes a perl script which edits the CONTROL file. 
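One thing that "works fine from the command line" does not show is the exit status the script hands back, and DAGMan treats any non-zero exit status from a PRE or POST script as a failure. The sketch below is illustrative only: `run_like_post_script` is a hypothetical stand-in whose body does nothing except end the way copyPreStep2.sh ends, so that the status can be printed with `$?`.

```shell
#!/bin/sh
# Hypothetical stand-in for copyPreStep2.sh: the work itself succeeds
# ("true" replaces the cp/perl/chmod lines), but the last line is "exit 1".
run_like_post_script() {
    ( true; exit 1 )
}

run_like_post_script
status=$?    # exit status of the stand-in, as a caller would see it

echo "exit status: $status"
```

Running the real script by hand and then typing `echo $?` at the prompt gives the same kind of check.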

Condor, however, cannot run it. Here is the error from the dagman.out
file. Job A runs fine, then there's an error at the job A post script: 

11/27 16:33:13 Submitting Job A ...
11/27 16:33:14  assigned Condor ID (4597.0.0)
11/27 16:33:15 Event: ULOG_SUBMIT for Job A (4597.0.0)
11/27 16:33:15 0/2 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
11/27 16:33:40 Event: ULOG_EXECUTE for Job A (4597.0.0)
11/27 16:33:50 Event: ULOG_IMAGE_SIZE for Job A (4597.0.0)
11/27 16:33:55 Event: ULOG_JOB_TERMINATED for Job A (4597.0.0)
11/27 16:33:55 Job A completed successfully.
11/27 16:54:17 Running POST script of Job A...
11/27 16:54:17 0/2 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
11/27 16:54:22 Event: ULOG_POST_SCRIPT_TERMINATED for Job A (4599.0.0)
11/27 16:54:22 POST Script of Job A failed with status 1
11/27 16:54:22 0/2 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
11/27 16:54:22 ERROR: the following job(s) failed:
11/27 16:54:22 ---------------------- Job ----------------------
11/27 16:54:22       Node Name: A
11/27 16:54:22          NodeID: 0
11/27 16:54:22     Node Status: STATUS_ERROR
11/27 16:54:22           Error: POST Script failed with status 1
11/27 16:54:22 Job Submit File: /home/step1/dl_1.sub
11/27 16:54:22     POST Script: /home/step2/copyPreStep2.sh
11/27 16:54:22   Condor Job ID: (4599.0.0)
11/27 16:54:22       Q_PARENTS: <END>
11/27 16:54:22       Q_WAITING: <END>
11/27 16:54:22      Q_CHILDREN: 1, <END>
11/27 16:54:22 ---------------------------------------  <END>
11/27 16:54:22 Writing Rescue DAG file...
11/27 16:54:22 **** condor_scheduniv_exec.4598.0 (condor_DAGMAN) EXITING WITH STATUS 1
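Since the log reports "failed with status 1", a possible shape for a script whose exit status reflects whether its steps actually worked is sketched below. This is a minimal sketch, not a tested fix: `WORK`, `SRC`, `DST`, and `copy_step` are hypothetical names, and the directories are created under a temporary directory only so the example can run anywhere.

```shell
#!/bin/sh
# Sketch only: same general shape as copyPreStep2.sh, with assumed paths.
WORK=$(mktemp -d)            # hypothetical scratch area for the demo
SRC="$WORK/step1"
DST="$WORK/step2"
mkdir -p "$SRC" "$DST"
echo "dummy" > "$SRC/REVCON" # stand-in for a real output file of step 1

copy_step() {
    # Return non-zero as soon as any command fails, zero otherwise.
    cp "$SRC/REVCON" "$DST/CONFIG" || return 1
    chmod a+w "$DST"/* || return 1
    return 0
}

copy_step
status=$?
echo "copy_step status: $status"

rm -rf "$WORK"               # clean up the demo directory
# A real PRE/POST script would finish with: exit $status
```

With this shape the script exits 0 only when every step succeeded, which is the status DAGMan treats as success for a PRE or POST script.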

Here's the DAG input file: 

Job  A  /home/zhimei/test-step/step1/dl_1.sub
Job  B  /home/zhimei/test-step/step2/dl_2.sub

Script POST A /home/zhimei/test-step/step2/copyPreStep2.sh
Script POST B /home/zhimei/test-step/step3/copyPreStep3.sh

PARENT A CHILD B

Here's an example submit file: 

Universe       =  vanilla
Requirements = OpSys == "WINNT50" && ARCH == "INTEL"
transfer_input_files = CONFIG FIELD CONTROL
executable = /home/zhimei/bin/dlpoly_new.exe
Error   = dlDAG.err.$(cluster)
Output  = dlDAG.out.$(cluster)
Log     = /home/DlDAG.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
TransferFiles = ON_EXIT
Initialdir = /home/step1
notification = NEVER
queue


I think I'm missing something fundamental here about Condor, DAGMan, and
how to get my scripts to work from Condor. 

Any ideas? 

Paul.