
[Condor-users] Problem with test job - possibly with output file



Hi,

I am new to Condor and I am trying to test job submission from an LCG site (Globus-based middleware)
to a Condor pool. Here is the "vanilla" test job:

#!/bin/bash

/bin/hostname
/bin/date

The job seems to fail. Here is the output from the starter log on the execute machine:

10/15 10:50:49 Submitting machine is "globus-lcg.it.uom.gr"
10/15 10:50:49 File transfer completed successfully.
10/15 10:50:50 Starting a VANILLA universe job with ID: 82.0
10/15 10:50:50 IWD: /home/condor/execute/dir_24729
10/15 10:50:50 Output file: /home/condor/execute/dir_24729/_condor_stdout
10/15 10:50:50 Error file: /home/condor/execute/dir_24729/_condor_stderr
10/15 10:50:50 About to exec /home/condor/execute/dir_24729/condor_exec.exe UI=000003:NS=0000000003:WM=000016:BH=0000000000:JSS=000012:LM=000018:LRMS=000000:APP=000000
10/15 10:50:50 Create_Process succeeded, pid=24733
10/15 10:50:50 Process exited, pid=24733, status=1
10/15 10:50:50 Got SIGQUIT.  Performing fast shutdown.
10/15 10:50:50 ShutdownFast all jobs.
10/15 10:50:51 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

Here is the output of the sched log on the submitting machine:


10/15 10:50:49 (pid:23454) Starting add_shadow_birthdate(82.0)
10/15 10:50:49 (pid:23454) Started shadow for job 82.0 on "<195.251.209.23:55245>", (shadow pid = 12092)
10/15 10:50:50 (pid:23454) Sent ad to central manager for dteam015@xxxxxxxxxxxxxxxxxxxx
10/15 10:50:50 (pid:23454) Sent ad to 1 collectors for dteam015@xxxxxxxxxxxxxxxxxxxx
10/15 10:50:51 (pid:23454) Shadow pid 12092 for job 82.0 exited with status 100
10/15 10:50:51 (pid:23454) match (<195.251.209.23:55245>#1191836703#123#...) out of jobs (cluster id 82); relinquishing
10/15 10:50:51 (pid:23454) Sent RELEASE_CLAIM to startd on <195.251.209.23:55245>
10/15 10:50:51 (pid:23454) Match record (<195.251.209.23:55245>, 82, -1) deleted
10/15 10:50:51 (pid:23454) statfs(/home/dteam015/gram_scratch_7BFyjXkbCS) failed: 13/Permission denied
10/15 10:50:51 (pid:23454) DaemonCore: Command received via TCP from host <195.251.209.23:52869>
10/15 10:50:51 (pid:23454) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/15 10:50:51 (pid:23454) Got VACATE_SERVICE from <195.251.209.23:52869>

What does this statfs "Permission denied" error mean? Has anyone seen something similar?
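
For reference, errno 13 on Linux is EACCES, and for statfs it means that search permission was denied on one of the directories leading to the path. A minimal sketch of how the blocking directory could be located is below. The function name first_blocked_dir is hypothetical (it is not a Condor tool), it assumes an absolute path, and it only reports what the account running it can see, so it would have to be run as the account that hits the error.

```shell
#!/bin/bash
# first_blocked_dir PATH
# Hypothetical helper (not part of Condor). Assumes PATH is absolute.
# Walks the directories leading to PATH and prints the first one that the
# current user cannot search (or that does not exist), returning 1.
# Prints nothing and returns 0 if every directory in the prefix is searchable.
first_blocked_dir() {
    local target="$1" current="" part
    local -a parts

    # statfs only needs search permission on the path prefix,
    # so the final component is dropped before walking.
    IFS='/' read -r -a parts <<< "$(dirname "$target")"

    # The root directory is always part of the prefix.
    if [ ! -x / ]; then
        echo "/"
        return 1
    fi

    for part in "${parts[@]}"; do
        [ -z "$part" ] && continue      # skip the empty field before the leading '/'
        current="$current/$part"
        if [ ! -x "$current" ]; then
            echo "$current"
            return 1
        fi
    done
    return 0
}

# Example call, using the path from the log above:
#   first_blocked_dir /home/dteam015/gram_scratch_7BFyjXkbCS
```

If the call prints a directory, that directory's mode and ownership (ls -ld) would be the first thing to compare between the working condor_submit case and the failing edg-job-submit case.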

Watching condor_status, I can see that a machine is matched for the test job: its state goes from 'Unclaimed' to 'Claimed' and back to 'Unclaimed'. Nothing happens beyond that, and
the activity is always 'Idle'.

I submit the job from the User Interface of the LCG site with 'edg-job-submit' and the job never finishes. When I run the job with condor_submit from the LCG Computing Element, it runs fine. Furthermore, when I run
'/bin/date' from the User Interface with 'globus-job-run', it runs OK.

Can anyone assist me with this?
Thanks in advance.

--
********************************************
Kostas Georgakopoulos - MSc, Systems and Network Administrator

E-mail		: kgeorga@xxxxxx
Office Tel.	: +30 2310 887973

Department Of Applied Informatics,
University Of Macedonia, Egnatias 156, Thessaloniki, Greece
********************************************