
[Condor-users] Strange DAGMan behaviour



I ran into some strange DAGMan behaviour in a software system I
maintain.  The system submits a DAG that may recursively submit
another DAG, and so on.  The problem is that one of the DAGs
eventually fails without ever having run.  The first DAG
always succeeds, but the following DAGs seem to have about a 50-50
chance of success.  Sometimes it will iterate several times, but most
of the time, it fails on the first or second iteration.  In the DAGMan
log for the last DAG, I get this:

005 (518.000.000) 10/18 15:42:55 Job terminated.
        (0) Abnormal termination (signal 9)
        (0) No core file
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
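
For anyone trying to reproduce this, here is a minimal sketch of picking such events out of a log automatically.  It assumes only the event layout shown above (a "005" header line followed by an indented "Abnormal termination (signal N)" line).  The name find_abnormal_terminations is made up for illustration and is not part of Condor.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Condor): scan user-log text for
# "005 ... Job terminated." events that contain an abnormal-termination
# line.  Returns a list of hashrefs holding the job id and signal number.
sub find_abnormal_terminations {
    my ($text) = @_;
    my @found;
    my $current_job;    # job id of the 005 event being read, if any

    for my $line (split /\n/, $text) {
        if ($line =~ /^(\d{3}) \((\d+\.\d+\.\d+)\)/) {
            # Event header: remember the job id only for termination events.
            $current_job = ($1 eq '005') ? $2 : undef;
        }
        elsif (defined $current_job
               && $line =~ /Abnormal termination \(signal (\d+)\)/) {
            push @found, { job => $current_job, signal => $1 };
        }
    }
    return @found;
}

# Sample taken from the log excerpt above.
my $sample = <<'EOF';
005 (518.000.000) 10/18 15:42:55 Job terminated.
        (0) Abnormal termination (signal 9)
        (0) No core file
EOF

for my $event (find_abnormal_terminations($sample)) {
    print "job $event->{job} killed by signal $event->{signal}\n";
}
```

Running this over each test.log.N file would show which iteration was killed and by which signal.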

I've managed to reduce the system to a couple of simple Perl
scripts, and I get the same behaviour with them on the machine I'm
working on.  The scripts are really straightforward.  They need to be
run on a machine that is both a Condor Submit and Execute machine.
The behaviour is far from consistent, but it tends to fail earlier
rather than later.

I'm running Condor 6.7.12.

I can't come up with a reason for the behaviour, but I'm by no means a
Condor guru.  Any suggestions are welcome.  (And if you're going to
suggest that I use the Condor Perl module, I agree... but my
supervisor doesn't.)

Mark
#!/usr/bin/perl

# The script needs to be run on a machine that can both Submit and
# Execute Condor jobs. I use a Requirement to ensure that they are the
# same machine, but this should work as long as all accessible Execute
# machines are also Submit machines.

my $condor_machine = "your.condor.submit.and.execute.machine";
my $max_runs = 10;

my $run = 0;

if (defined($ARGV[0])) {
    $run = $ARGV[0];
}

if ($run < $max_runs) {
    my $next_run = $run + 1;

    
    open(TESTJOB, ">test.job.$run")
	or die "Couldn't open 'test.job.$run'.\n";

    print TESTJOB <<"EOF";
Universe = vanilla
Executable = testscript.sh
Requirements = Machine == "$condor_machine"
Log = test.log.$run
Output = test.out.$run
Error = test.error.$run
GetEnv = true
Notification = never
Queue
EOF
    close(TESTJOB);


    open(CHECKJOB, ">check.job.$run")
	or die "Couldn't open 'check.job.$run'.\n";

    print CHECKJOB <<"EOF";
Universe = vanilla
Executable = checkscript.pl
Arguments = $next_run
Requirements = Machine == "$condor_machine"
Log = test.log.$run
Output = check.out.$run
Error = check.error.$run
GetEnv = true
Notification = never
Queue
EOF
    close(CHECKJOB);


    open(DAG, ">test.dag.$run")
	or die "Couldn't open 'test.dag.$run'.\n";

    print DAG <<"EOF";
Job test test.job.$run
Job check check.job.$run
PARENT test CHILD check
EOF

    close(DAG);

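    my $retval = system("condor_submit_dag -notification never test.dag.$run");

    # Did condor_submit_dag fail?  system() returns -1 if the command could
    # not be started; otherwise the exit status is in the high byte.
    print $retval == -1 ? "could not run condor_submit_dag: $!" : $retval >> 8, "\n";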
}
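
The value returned by system() packs more than the exit code, so a command killed by a signal can be told apart from one that exited normally.  Below is a minimal sketch following the layout documented for Perl's system(); describe_status is a hypothetical helper name, not something from the original scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the value returned by system() into a
# human-readable description.
sub describe_status {
    my ($status) = @_;
    # -1 means the command could not be started at all.
    return "failed to execute" if $status == -1;
    # The low 7 bits hold the signal number, if the command was killed.
    return sprintf("died with signal %d", $status & 127) if $status & 127;
    # Otherwise the exit code is in the high byte.
    return sprintf("exited with value %d", $status >> 8);
}

print describe_status(0), "\n";         # exited with value 0
print describe_status(1 << 8), "\n";    # exited with value 1
print describe_status(9), "\n";         # died with signal 9
```

In the script above, describe_status($retval) could be printed in place of the bare number.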

Attachment: testscript.sh
Description: Bourne shell script