
[Condor-users] Dagman job cannot start second node



Hi !

We are running on Condor 7.0.1.
I want to use DAG jobs, so I tested with two simple C++ programs:

prog 1: un
#include <iostream>
#include <cstdlib>   // atoi
using namespace std;

int main(int argc, char ** argv)
{
    // Map the numeric command-line argument to a bucket from 1 to 5
    // and write it on stdout (assumes an argument is supplied,
    // as with Arguments = 35 in the submit file).
    int elmt = atoi(argv[1]);

    if (elmt < 10)
        cout << "1" << endl;
    else if (elmt < 30)
        cout << "2" << endl;
    else if (elmt < 60)
        cout << "3" << endl;
    else if (elmt < 100)
        cout << "4" << endl;
    else
        cout << "5" << endl;

    return 0;
}


prog 2: deux
#include <iostream>
#include <string>
#include <cstdlib>   // atoi
using namespace std;

int main(int argc, char ** argv)
{
    // Read the bucket number produced by "un" from stdin.
    // (A std::string replaces the original char buf[1], which could not
    // even hold a one-character token plus its terminating NUL.)
    string buf;
    cin >> buf;

    int elmt = atoi(buf.c_str());
    switch (elmt)
    {
    case 1:
        cout << "Category 1" << endl;
        break;
    case 2:
        cout << "Category 2" << endl;
        break;
    case 3:
        cout << "Category 3" << endl;
        break;
    case 4:
        cout << "Category 4" << endl;
        break;
    case 5:
        cout << "Category 5" << endl;
        break;
    default:
        break;
    }

    return 0;
}
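Both executables are relinked for the standard universe with condor_compile, roughly like this (a sketch only; the compiler and the source/executable names are assumptions, adjust to the real ones):

condor_compile g++ -o un un.C
condor_compile g++ -o deux deux.C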

un submit file:
Universe = standard
Executable = one
Log = one.log
Output = one.out
Error = one.err
Arguments = 35
Queue

deux submit file:
Universe = standard
Executable = deux
Input = deux.out
Output = deux.out
Log = un.log
Error = deux.err
Queue

dag input file:
JOB un un.sub
JOB deux deux.sub
PARENT un CHILD deux
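
The DAG is only meant to reproduce this two-step pipeline, with the stdout of the first node fed to the stdin of the second (a sketch of the intent, assuming the binaries are named un and deux):

./un 35 > un.out      # writes "3"
./deux < un.out       # prints "Category 3"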


I submit the DAG:

condor_submit_dag trois.dag

condor_q
 122.0   lvigilant       7/7  10:30   0+00:02:39 R  0   4.6  condor_dagman -f -
 123.0   lvigilant       7/7  10:31   0+00:00:00 I  0   2.2  un

The first job runs perfectly, but it seems the scheduler can't start the second one:
 122.0   lvigilant       7/7  10:30   0+00:02:39 I  0   4.6  condor_dagman -f -
 123.0   lvigilant       7/7  10:31   0+00:00:00 R  0   2.2  un


And then, in condor_q, dagman switches to the Idle state:
122.0   lvigilant       7/7  10:30   0+00:13:05 I  0   4.6  condor_dagman -f -




In the schedd log:

7/7 11:26:24 (pid:19051) scheduler universe job (122.0) pid 14775 exited with status 4
7/7 11:27:24 (pid:19051) FileLock::obtain(1) failed - errno 5 (Input/output error)


In the dagman.log:
001 (122.000.000) 07/07 11:17:23 Job executing on host: <x.x.x.x:x>
...
004 (122.000.000) 07/07 11:19:24 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

001 (122.000.000) 07/07 11:24:24 Job executing on host: <x.x.x.x:x>
...
004 (122.000.000) 07/07 11:26:24 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...


In the .dagman.out:

7/7 11:24:24 ******************************************************
7/7 11:24:24 ** condor_scheduniv_exec.122.0 (CONDOR_DAGMAN) STARTING UP
7/7 11:24:24 ** /condor/bin/condor_dagman
7/7 11:24:24 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
7/7 11:24:24 ** $CondorPlatform: I386-LINUX_RHEL5 $
7/7 11:24:24 ** PID = 14712
7/7 11:24:24 ** Log last touched 7/7 11:18:08
7/7 11:24:24 ******************************************************
7/7 11:24:24 Using config source: /condor/etc/condor_config
7/7 11:24:24 Using local config sources:
7/7 11:24:24    /condor/local.grille/condor_config.local
7/7 11:24:24 DaemonCore: Command Socket at <10.6.200.190:43248>
7/7 11:24:24 DAGMAN_SUBMIT_DELAY setting: 0
7/7 11:24:24 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
7/7 11:24:25 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
7/7 11:24:25 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
7/7 11:24:25 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
7/7 11:24:25 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
7/7 11:24:25 DAGMAN_RETRY_NODE_FIRST setting: 0
7/7 11:24:25 DAGMAN_MAX_JOBS_IDLE setting: 0
7/7 11:24:25 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
7/7 11:24:25 DAGMAN_MUNGE_NODE_NAMES setting: 1
7/7 11:24:25 DAGMAN_DELETE_OLD_LOGS setting: 1
7/7 11:24:25 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
7/7 11:24:25 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
7/7 11:24:25 DAGMAN_ABORT_DUPLICATES setting: 1
7/7 11:24:25 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
7/7 11:24:25 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
7/7 11:24:25 argv[0] == "condor_scheduniv_exec.122.0"
7/7 11:24:25 argv[1] == "-Debug"
7/7 11:24:25 argv[2] == "3"
7/7 11:24:25 argv[3] == "-Lockfile"
7/7 11:24:25 argv[4] == "trois.dag.lock"
7/7 11:24:25 argv[5] == "-Condorlog"
7/7 11:24:25 argv[6] == "/auto_home/lvigilant/grille/dag/trois.log"
7/7 11:24:25 argv[7] == "-Dag"
7/7 11:24:25 argv[8] == "trois.dag"
7/7 11:24:25 argv[9] == "-Rescue"
7/7 11:24:25 argv[10] == "trois.dag.rescue"
7/7 11:24:25 DAG Lockfile will be written to trois.dag.lock
7/7 11:24:25 DAG Input file is trois.dag
7/7 11:24:25 Rescue DAG will be written to trois.dag.rescue
7/7 11:24:25 All DAG node user log files:
7/7 11:24:25   /auto_home/lvigilant/grille/dag/trois.log (Condor)
7/7 11:24:25 Parsing trois.dag ...
7/7 11:24:25 Dag contains 2 total jobs
7/7 11:24:25 Lock file trois.dag.lock detected,
7/7 11:24:25 Duplicate DAGMan PID 14584 is no longer alive; this DAGMan should continue.
7/7 11:24:25 Sleeping for 12 seconds to ensure ProcessId uniqueness
7/7 11:24:37 Bootstrapping...
7/7 11:24:37 Number of pre-completed nodes: 0
7/7 11:24:37 Running in RECOVERY mode...
7/7 11:25:37 FileLock::obtain(1) failed - errno 5 (Input/output error)
7/7 11:25:37 ERROR "Assertion ERROR on (m_is_locked)" at line 1125 in file read_user_log.C

Thanks in advance all :)



VIGILANT Lionel
ISEM
University of Montpellier 2




