Hi!
We are running Condor 7.0.1.
I want to use DAG jobs, so I tested with two simple C++ programs:
prog 1 (un.cpp):

#include <cstdlib>   // for atoi()
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    int elmt = atoi(argv[1]);
    if (elmt < 10)
        cout << "1" << endl;
    else if (elmt < 30)
        cout << "2" << endl;
    else if (elmt < 60)
        cout << "3" << endl;
    else if (elmt < 100)
        cout << "4" << endl;
    else
        cout << "5" << endl;
    return 0;
}
prog 2 (deux.cpp):

#include <iostream>
using namespace std;

int main()
{
    // Read the category number directly as an int; the original
    // char buf[1] was too small to hold the digit plus the
    // terminating '\0' that cin >> buf writes.
    int elmt;
    cin >> elmt;
    switch (elmt)
    {
        case 1:
            cout << "Category 1" << endl;
            break;
        case 2:
            cout << "Category 2" << endl;
            break;
        case 3:
            cout << "Category 3" << endl;
            break;
        case 4:
            cout << "Category 4" << endl;
            break;
        case 5:
            cout << "Category 5" << endl;
            break;
        default:
            break;
    }
    return 0;
}
un submit file:

Universe = standard
Executable = one
Log = one.log
Output = one.out
Error = one.err
Arguments = 35
Queue
deux submit file:

Universe = standard
Executable = deux
Input = one.out
Output = deux.out
Log = un.log
Error = deux.err
Queue
dag input file (trois.dag):

JOB un un.sub
JOB deux deux.sub
PARENT un CHILD deux

Submitted with:

condor_submit_dag trois.dag
condor_q shows:
122.0 lvigilant 7/7 10:30 0+00:02:39 R 0 4.6 condor_dagman -f -
123.0 lvigilant 7/7 10:31 0+00:00:00 I 0 2.2 un
The first job runs perfectly, but it seems the scheduler can't start the second one... A later condor_q shows:
122.0 lvigilant 7/7 10:30 0+00:02:39 I 0 4.6 condor_dagman -f -
123.0 lvigilant 7/7 10:31 0+00:00:00 R 0 2.2 un
And in condor_q, condor_dagman eventually switches to the Idle state...
122.0 lvigilant 7/7 10:30 0+00:13:05 I 0 4.6 condor_dagman -f -
In the schedd log:
7/7 11:26:24 (pid:19051) scheduler universe job (122.0) pid 14775 exited with status 4
7/7 11:27:24 (pid:19051) FileLock::obtain(1) failed - errno 5 (Input/output error)
In the dagman log:
001 (122.000.000) 07/07 11:17:23 Job executing on host: <x.x.x.x:x>
...
004 (122.000.000) 07/07 11:19:24 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (122.000.000) 07/07 11:24:24 Job executing on host: <x.x.x.x:x>
...
004 (122.000.000) 07/07 11:26:24 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
In the .dagman.out:
7/7 11:24:24 ******************************************************
7/7 11:24:24 ** condor_scheduniv_exec.122.0 (CONDOR_DAGMAN) STARTING UP
7/7 11:24:24 ** /condor/bin/condor_dagman
7/7 11:24:24 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
7/7 11:24:24 ** $CondorPlatform: I386-LINUX_RHEL5 $
7/7 11:24:24 ** PID = 14712
7/7 11:24:24 ** Log last touched 7/7 11:18:08
7/7 11:24:24 ******************************************************
7/7 11:24:24 Using config source: /condor/etc/condor_config
7/7 11:24:24 Using local config sources:
7/7 11:24:24 /condor/local.grille/condor_config.local
7/7 11:24:24 DaemonCore: Command Socket at <10.6.200.190:43248>
7/7 11:24:24 DAGMAN_SUBMIT_DELAY setting: 0
7/7 11:24:24 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
7/7 11:24:25 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
7/7 11:24:25 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
7/7 11:24:25 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
7/7 11:24:25 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
7/7 11:24:25 DAGMAN_RETRY_NODE_FIRST setting: 0
7/7 11:24:25 DAGMAN_MAX_JOBS_IDLE setting: 0
7/7 11:24:25 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
7/7 11:24:25 DAGMAN_MUNGE_NODE_NAMES setting: 1
7/7 11:24:25 DAGMAN_DELETE_OLD_LOGS setting: 1
7/7 11:24:25 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
7/7 11:24:25 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
7/7 11:24:25 DAGMAN_ABORT_DUPLICATES setting: 1
7/7 11:24:25 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
7/7 11:24:25 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
7/7 11:24:25 argv[0] == "condor_scheduniv_exec.122.0"
7/7 11:24:25 argv[1] == "-Debug"
7/7 11:24:25 argv[2] == "3"
7/7 11:24:25 argv[3] == "-Lockfile"
7/7 11:24:25 argv[4] == "trois.dag.lock"
7/7 11:24:25 argv[5] == "-Condorlog"
7/7 11:24:25 argv[6] == "/auto_home/lvigilant/grille/dag/trois.log"
7/7 11:24:25 argv[7] == "-Dag"
7/7 11:24:25 argv[8] == "trois.dag"
7/7 11:24:25 argv[9] == "-Rescue"
7/7 11:24:25 argv[10] == "trois.dag.rescue"
7/7 11:24:25 DAG Lockfile will be written to trois.dag.lock
7/7 11:24:25 DAG Input file is trois.dag
7/7 11:24:25 Rescue DAG will be written to trois.dag.rescue
7/7 11:24:25 All DAG node user log files:
7/7 11:24:25 /auto_home/lvigilant/grille/dag/trois.log (Condor)
7/7 11:24:25 Parsing trois.dag ...
7/7 11:24:25 Dag contains 2 total jobs
7/7 11:24:25 Lock file trois.dag.lock detected,
7/7 11:24:25 Duplicate DAGMan PID 14584 is no longer alive; this DAGMan should continue.
7/7 11:24:25 Sleeping for 12 seconds to ensure ProcessId uniqueness
7/7 11:24:37 Bootstrapping...
7/7 11:24:37 Number of pre-completed nodes: 0
7/7 11:24:37 Running in RECOVERY mode...
7/7 11:25:37 FileLock::obtain(1) failed - errno 5 (Input/output error)
7/7 11:25:37 ERROR "Assertion ERROR on (m_is_locked)" at line 1125 in file read_user_log.C
Thanks in advance, all :)
VIGILANT Lionel
ISEM
University of Montpellier 2