
[Condor-users] condor-C to DAGman problem




Hi,
   I am trying to use Condor-C with DAGMan, i.e. submitting a remote DAGMan job. As the Condor team suggested, I modified the dag_test.dag.condor.sub file and submitted it with condor_submit.
   However, the DAG job also ended very quickly, without submitting or running any jobs. The dag_test.dag.lib.err file is empty, but the dag_test.dag.dagman.out file contains the error below, which confuses me. I searched for this error on Google, but the results were not much help. Can anyone give some suggestions? Thank you very much!
   PS: I can submit a local DAGMan job and it runs well; the error only occurs in remote mode.
The error is as below:
9/1 14:25:15 Submitting Condor Node testA job(s)...
9/1 14:25:15 submitting: condor_submit -a dag_node_name' '=' 'testA -a +DAGManJobId' '=' '-1 -a DAGManJobId' '=' '-1 -a submit_event_notes' '=' 'DAG' 'Node:' 'testA -a +DAGParentNodeNames' '=' '"" testA.sub
9/1 14:25:16 From submit:
9/1 14:25:16 From submit: ERROR: Can't find address of local schedd
9/1 14:25:16 failed while reading from pipe.
9/1 14:25:16 Read so far: ERROR: Can't find address of local schedd
9/1 14:25:16 ERROR: submit attempt failed
9/1 14:25:16 submit command was: condor_submit -a dag_node_name' '=' 'testA -a +DAGManJobId' '=' '-1 -a DAGManJobId' '=' '-1 -a submit_event_notes' '=' 'DAG' 'Node:' 'testA -a +DAGParentNodeNames' '=' '"" testA.sub
9/1 14:25:16 Job submit try 2/6 failed, will try again in >= 2 seconds.

Here are the files I use:

DAG file:
JOB testA testA.sub
JOB testB testB.sub
JOB testC testC.sub
PARENT testA CHILD testB testC
PARENT testC CHILD testB

dag_test.dag.condor.sub file:
# Filename: dag_test.dag.condor.sub
# Generated by condor_submit_dag dag_test.dag
universe        = grid
grid_resource = condor L50.com L50**.com
executable        = C:\condor\bin\condor_dagman.exe
getenv                = True
output                = dag_test.dag.lib.out
error                = dag_test.dag.lib.err
log                = dag_test.dag.dagman.log
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove        = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool        = False
arguments        = -f -l . -Debug 3 -Lockfile dag_test.dag.lock -Condorlog DAGmantest.log.txt -Dag dag_test.dag -Rescue dag_test.dag.rescue
environment        = _CONDOR_DAGMAN_LOG=dag_test.dag.dagman.out|_CONDOR_MAX_DAGMAN_LOG=0
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files =DAGmantestA.bat,testA.sub,DAGmantestB.bat,testB.sub,DAGmantestC.bat,testC.sub,dag_test.dag
queue  

Job sub file (only testA is given here):
Universe = Vanilla
Executable =DAGmantestA.bat
GetEnv     = True
RunAsOwner = True
Log        = DAGmantest.log.txt
Error      = DAGmantest.bat.error.txt
Queue




"condor-admin response tracking system" <condor-admin@xxxxxxxxxxx>

08/31/2009 06:30 PM

Bitte antworten an
condor-admin@xxxxxxxxxxx

An
Tao.3.Chen@xxxxxxxxxxxxxxxxxxxxxxxxxxx
Kopie
Thema
Re: [condor-admin #19650] how to set  DAGman file with Condor-C(Grid Universe)





First, check your DAGMan log files, found at:
dag_test.dag.lib.out, dag_test.dag.lib.err,
dag_test.dag.dagman.log.  I expect they will tell you what went
wrong.  Furthermore, I'm guessing they'll tell you that DAGMan
couldn't find your submit files.

When I mentioned input and output files, I was referring to any
files used to submit jobs, or their resulting output.  I did
slightly misspeak: as long as your job is entirely contained in a
single directory, you should not need to specify output files;
Condor-C jobs will pull the output back.  (This is not true of
other grid universe types, for example gt2.)

In your DAG file, you have the following:
> JOB testA testA.sub
> JOB testB testB.sub
> JOB testC testC.sub

Your job needs testA.sub, testB.sub, and testC.sub, so these are
input files you need to specify in transfer_input_files.
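As a sketch (using the file names quoted in this thread; the DAG file itself also needs to travel), the relevant line of the DAGMan submit file would look like:

```
transfer_input_files = dag_test.dag,testA.sub,testB.sub,testC.sub
```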

Your individual job files look like this:
> Universe   = grid
> Executable = DAGmantestA.bat
> Log        = DAGmantestA.log.txt
> Error      = DAGmantestA.bat.error.txt
> grid_resource = condor L50**.com L50**.com
>
> +remote_jobuniverse = 5
> +remote_Executable =DAGmantestA.bat
> +remote_GetEnv     = True
> +remote_RunAsOwner = True
> +remote_Log        = DAGmantestA.log.txt
> +remote_Error      = DAGmantestA.bat.error.txt
> Queue

Your job needs DAGmantestA.bat as input, so add it to
transfer_input_files.
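Putting both points together, a minimal sketch of the complete transfer_input_files line (file names taken from the submit files quoted above) would be:

```
transfer_input_files = dag_test.dag,testA.sub,testB.sub,testC.sub,DAGmantestA.bat,DAGmantestB.bat,DAGmantestC.bat
```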

I'm also surprised that your individual jobs are grid jobs.  Is
this intentional?  The jobs will already be at L50**.com because
DAGMan itself is running at L50**.com.  You should be able to
specify universe=vanilla, delete the grid_resource, and delete
the +remote_* entries.
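For example, a vanilla-universe version of testA.sub along those lines (a sketch based on the file quoted above; testB.sub and testC.sub would be changed the same way) might read:

```
Universe   = vanilla
Executable = DAGmantestA.bat
GetEnv     = True
RunAsOwner = True
Log        = DAGmantestA.log.txt
Error      = DAGmantestA.bat.error.txt
Queue
```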


========================================
MESSAGE INFORMATION
========================================
* From: Alan De Smet <adesmet@xxxxxxxxxxx>
* Ticket Email List: Tao.3.Chen@xxxxxxxxxxxxxxxxxxxxxxxxxxx,