
[Condor-users] my jobs won't run on my pool :(



Hi everyone

I am having trouble running some jobs on my pool. Here are three examples: two from the net and one from our lab.


1st job : uname.sh (from http://condor.optena.com/display/CONDOR/mail/3277)
___________
 guiot@chagall:~/tmp/TestCondor/JobPerso$ more uname.cmd
Universe        = vanilla
Executable      = uname.sh
Output          = uname.out
Error           = uname.err
log             = uname.log

queue

guiot@chagall:~/tmp/TestCondor/JobPerso$ more uname.sh
#!/bin/bash

# Print the machine name we ran on
uname -n
guiot@chagall:~/tmp/TestCondor/JobPerso$  

--> The job does not run.
Excerpt from the uname.log file:
...
001 (110.000.000) 10/21 15:39:03 Job executing on host: <193.49.27.11:34130>
...
007 (110.000.000) 10/21 15:39:03 Shadow exception!
        Error from starter on vrubel.galaxy.ibpc.fr: Failed to execute '/ibpc/chagall/guiot/tmp/TestCondor/JobPerso/uname.sh condor_exec.exe': Permission denied
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

$tail /scratch/condor/log/SchedLog
10/21 15:18:59 (pid:3971) Starting add_shadow_birthdate(108.0)
10/21 15:18:59 (pid:3971) Started shadow for job 108.0 on "<193.49.27.11:34130>", (shadow pid = 21834)
10/21 15:18:59 (pid:3971) Shadow pid 21834 for job 108.0 exited with status 4
10/21 15:18:59 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:19:01 (pid:3971) Starting add_shadow_birthdate(108.0)
10/21 15:19:01 (pid:3971) Started shadow for job 108.0 on "<193.49.27.11:34130>", (shadow pid = 21838)
10/21 15:19:02 (pid:3971) Shadow pid 21838 for job 108.0 exited with status 4
10/21 15:19:02 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:19:02 (pid:3971) Match for cluster 108 has had 5 shadow exceptions, relinquishing.
10/21 15:19:02 (pid:3971) Sent RELEASE_CLAIM to startd on <193.49.27.11:34130>
10/21 15:19:02 (pid:3971) Match record (<193.49.27.11:34130>, 108, 0) deleted
10/21 15:19:03 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/21 15:19:03 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx

____________________________________________

2nd job : foo.condor (from http://www.csit.fsu.edu/~burkardt/f_src/condor/)
__________
guiot@chagall:~/tmp/TestCondor/JobPerso$ more foo.condor
universe = vanilla
executable = foo.csh
log = foo.log
output = foo.out
queue
guiot@chagall:~/tmp/TestCondor/JobPerso$ more foo.csh
#!/bin/csh
#
date
echo " "
echo "FOO.CSH."
echo "  A simple shell script that shows off."
#
foreach i (10 20 40)
  echo $i
end
#
echo "Current directory is " $PWD "."
#
echo " "
echo "FOO.CSH."
echo "  Normal end of execution."
echo " "
date
guiot@chagall:~/tmp/TestCondor/JobPerso$ 

--> Here is an excerpt from the foo.log file:
...
001 (109.000.000) 10/21 15:35:08 Job executing on host: <193.49.27.11:34130>
...
007 (109.000.000) 10/21 15:35:08 Shadow exception!
        Error from starter on vrubel.galaxy.ibpc.fr: Failed to execute '/ibpc/chagall/guiot/tmp/TestCondor/JobPerso/foo.csh condor_exec.exe': Permission denied
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

$tail /scratch/condor/log/SchedLog
10/21 15:35:10 (pid:3971) Starting add_shadow_birthdate(109.0)
10/21 15:35:10 (pid:3971) Started shadow for job 109.0 on "<193.49.27.11:34130>", (shadow pid = 22188)
10/21 15:35:10 (pid:3971) Shadow pid 22188 for job 109.0 exited with status 4
10/21 15:35:10 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:35:12 (pid:3971) Starting add_shadow_birthdate(109.0)
10/21 15:35:12 (pid:3971) Started shadow for job 109.0 on "<193.49.27.11:34130>", (shadow pid = 22191)
10/21 15:35:13 (pid:3971) Shadow pid 22191 for job 109.0 exited with status 4
10/21 15:35:13 (pid:3971) ERROR: Shadow exited with job exception code!
10/21 15:35:13 (pid:3971) Match for cluster 109 has had 5 shadow exceptions, relinquishing.
10/21 15:35:13 (pid:3971) Sent RELEASE_CLAIM to startd on <193.49.27.11:34130>
10/21 15:35:13 (pid:3971) Match record (<193.49.27.11:34130>, 109, 0) deleted
10/21 15:35:13 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.11:36789>
10/21 15:35:13 (pid:3971) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/21 15:35:13 (pid:3971) Got VACATE_SERVICE from <193.49.27.11:36789>
10/21 15:35:14 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/21 15:35:14 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx

___________________________________________________________________

These jobs were submitted as user "guiot" (me), and the daemons were started as user root. Which permission setting did I miss?
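
A quick check I can run on the execute machine, sketched under the assumption that the "Permission denied" comes from a missing execute bit on the script (it could equally be a directory on the path that the job's user cannot traverse). The helper name check_job_script is hypothetical and not part of Condor:

```shell
#!/bin/sh
# Hypothetical helper (not a Condor tool): report whether a job script
# exists and is executable by the user running this check.
check_job_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 2
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try: chmod +x $script)"
        return 1
    fi
    echo "executable: $script"
    return 0
}

# Check the two scripts named in the submit files above.
for s in uname.sh foo.csh; do
    check_job_script "$s" || true
done
```

Running it on vrubel rather than on chagall, as the user the job executes under, would show whether the shared path is usable from there.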

3rd job : this one is a bit different. One of my users had this job as a shell script, and I tried to convert it into a condor_submit description file.
___________
Original shell file : 
guiot@chagall:/run_cns_30$ more refine.csh
#!/bin/csh
## results will be stored here
setenv NEWIT /place/to/store/the/results

## project path
setenv RUN /path/to/the/project

## individual run.cns is stored here
setenv RUN_CNS /place/where/run.cns/is/located


## command line
/cns_solve_1.1/intel-i686-linux_g77/bin/cns_solve < /path/to/the/project/run1/cns/protocols/refine.inp >! refine.out

touch done
guiot@chagall:/run_cns_30$
____________
1st TEST : submit the shell script itself as the executable :
guiot@chagall:~/tmp/TestCondor/JobPerso$ more Benjamin1.cmd
####################
#
#  Test of Benjamin's program
#
####################

Universe = vanilla

Executable      = /run_cns_30/refine.csh


error           = Benjamin1.err
Log             = Benjamin1.log

queue
guiot@chagall:~/tmp/TestCondor/JobPerso$  

It runs, but for only a few seconds, and does not do any computation (it should last around 10 hours...).
___________
I tried a 2nd test with this submit file : 
guiot@chagall:~/tmp/TestCondor/JobPerso$ more Benjamin2.cmd
####################
#
# Test of Benjamin's program
#
####################

Universe        = vanilla

Executable      = /run_cns_30/refine.csh

environment     = NEWIT=/place/to/store/the/results;RUN=/path/to/the/project;RUN_CNS=/place/where/run.cns/is/located

arguments       = < /path/to/the/project/run1/cns/protocols/refine.inp
error           = Benjamin2.err
Log             = Benjamin2.log

queue
guiot@chagall:~/tmp/TestCondor/JobPerso$        
    
Exactly the same behavior : "runs" for a few seconds, but still no results.
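
One thing I suspect about this second test, stated as an assumption rather than a diagnosis: Condor hands the arguments string to the executable without going through a shell, so the "<" would arrive as an ordinary argument instead of redirecting standard input. The sketch below imitates that with a hypothetical show_args function standing in for the job:

```shell
#!/bin/sh
# Hypothetical stand-in for a job executable: print each argument it receives.
show_args() {
    for a in "$@"; do
        printf '[%s]\n' "$a"
    done
}

# Quoting '<' reproduces what happens when no shell interprets the line:
# the character is delivered as the first argument and no file is opened.
show_args '<' /path/to/the/project/run1/cns/protocols/refine.inp
```

If that is what happens, the submit file's "input" command would be the place to name a file for standard input. Since refine.csh already does its own redirection, the extra arguments are probably just ignored, which would not by itself explain the short run.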

________________________________

So, what could be the reason I can't run these jobs on my cluster?

I have run some other jobs perfectly fine (the examples that come with the Condor install package and the ones at http://www.usc.edu/hpcc/systems/condorv.php, in both the standard and vanilla universes), so why can't I run my OWN (useful...) jobs?

Thanks in advance for your help

Nicolas GUIOT

-----------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
------------------------------------------------