
[Condor-users] Weird behaviour of condor



Hi all,

I submitted a job that caused a lot of trouble in my pool...

Here is the submit file:
Universe = vanilla

Executable      = gogogo.sh
#output         = Out47.txt
output          = Junior.out
error           = Junior.err
Log             = Junior.log

notify_user     = guiot@xxxxxxx
notification    = always

requirements    =
queue

And here is the gogogo.sh file:
#!/bin/sh
`cd FichiersSimul && ./attract File1.pdb File2.pdb 47 > Out47.txt`
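A side note on the script itself: the backticks make the shell run the whole line as a command substitution and then try to execute whatever it prints. Since stdout is redirected to Out47.txt, the substitution comes back empty here, so the backticks are most likely just unnecessary. Whether this has anything to do with the scheduler problem described below is not established. A minimal sketch of the same wrapper without backticks, assuming the same directory and file names as above (run_in_dir is a hypothetical helper name, not part of Condor):

```shell
#!/bin/sh
# Sketch only: run a command inside a directory and capture its stdout in a
# file, without wrapping the line in backticks.
# run_in_dir is a hypothetical helper name chosen for this example.
run_in_dir() {
    dir=$1
    out=$2
    shift 2
    # Subshell: the cd does not leak into the caller, and a failed cd
    # aborts with a non-zero status instead of running the command elsewhere.
    ( cd "$dir" || exit 1; "$@" > "$out" )
}

# Intended use for the job above (left commented out so the sketch does
# nothing on its own):
# run_in_dir FichiersSimul Out47.txt ./attract File1.pdb File2.pdb 47
```

The output file name is resolved after the cd, so Out47.txt still ends up inside FichiersSimul as in the original script, and the exit status of the command is passed back to the caller.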

--> I have since changed this file (I found a way to modify the "> Out47.txt" part), but when I submitted the job, I never got the prompt back, as if the submission were waiting for the results. In another shell, I ran "condor_q": the job was listed (I don't remember whether it was Running, Idle, or something else). I finally tried to remove it, but that was impossible: it stayed in the condor_q output as "X".
I deleted the /scratch/condor/log/ShadowLock file.

But now, here is the BIG problem: condor_q no longer works, and I cannot even submit a new job:
guiot@chagall:$ condor_q

-- Failed to fetch ads from: <193.49.27.24:35171> : chagall.galaxy.ibpc.fr

guiot@chagall$

If I try to submit a job, it keeps displaying "Submitting job(s)", but nothing happens.

I tried to restart Condor on the submit machine, but nothing happens...


Any idea how to get me out of this s***?
Here are some log files, in case they help (job 123 is THE job that started all the problems...):

guiot@chagall$ tail  /scratch/condor/log/SchedLog 
Starting add_shadow_birthdate(123.0)
10/24 16:48:59 (pid:3971) Started shadow for job 123.0 on "<193.49.27.11:32772>", (shadow pid = 12449)
10/24 16:49:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 16:49:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 16:52:14 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.24:58919>
10/24 16:52:14 (pid:3971) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
10/24 16:52:34 (pid:3971) condor_read(): timeout reading buffer.
10/24 16:54:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 16:54:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 16:54:54 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.11:32805>
10/24 16:54:54 (pid:3971) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/24 16:54:54 (pid:3971) Got VACATE_SERVICE from <193.49.27.11:32805>
10/24 16:54:54 (pid:3971) Sent RELEASE_CLAIM to startd on <193.49.27.11:32772>
10/24 16:54:54 (pid:3971) Match record (<193.49.27.11:32772>, 123, 0) deleted
10/24 16:59:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 16:59:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 17:04:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 17:04:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 17:09:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 17:09:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 17:14:00 (pid:3971) Sent ad to central manager for guiot@xxxxxxxxxxxxxx
10/24 17:14:00 (pid:3971) Sent ad to 1 collectors for guiot@xxxxxxxxxxxxxx
10/24 17:17:27 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.24:59523>
10/24 17:17:27 (pid:3971) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
10/24 18:15:28 (pid:14639) ******************************************************
10/24 18:15:28 (pid:14639) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/24 18:15:28 (pid:14639) ** /ibpc/io/condor/sbin/condor_schedd
10/24 18:15:28 (pid:14639) ** $CondorVersion: 6.7.10 Aug  3 2005 $
10/24 18:15:28 (pid:14639) ** $CondorPlatform: I386-LINUX_RH9 $
10/24 18:15:28 (pid:14639) ** PID = 14639
10/24 18:15:28 (pid:14639) ******************************************************
10/24 18:15:28 (pid:14639) Using config file: /ibpc/io/condor/etc/condor_config
10/24 18:15:28 (pid:14639) Using local config files: /scratch/condor/condor_config.local
10/24 18:15:28 (pid:14639) DaemonCore: Command Socket at <193.49.27.24:59931>
10/25 11:15:55 (pid:2479) ******************************************************
10/25 11:15:55 (pid:2479) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/25 11:15:55 (pid:2479) ** /ibpc/io/condor/sbin/condor_schedd
10/25 11:15:55 (pid:2479) ** $CondorVersion: 6.7.10 Aug  3 2005 $
10/25 11:15:55 (pid:2479) ** $CondorPlatform: I386-LINUX_RH9 $
10/25 11:15:55 (pid:2479) ** PID = 2479
10/25 11:15:55 (pid:2479) ******************************************************
10/25 11:15:55 (pid:2479) Using config file: /ibpc/io/condor/etc/condor_config
10/25 11:15:55 (pid:2479) Using local config files: /scratch/condor/condor_config.local
10/25 11:15:55 (pid:2479) DaemonCore: Command Socket at <193.49.27.24:35171>
guiot@chagall:~/tmp/TestCondor/JobPerso$       

guiot@chagall$ tail /scratch/condor/log/ShadowLog
10/24 16:48:59 ******************************************************
10/24 16:48:59 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/24 16:48:59 ** /ibpc/io/condor/sbin/condor_shadow
10/24 16:48:59 ** $CondorVersion: 6.7.10 Aug  3 2005 $
10/24 16:48:59 ** $CondorPlatform: I386-LINUX_RH9 $
10/24 16:48:59 ** PID = 12449
10/24 16:48:59 ******************************************************
10/24 16:48:59 Using config file: /ibpc/io/condor/etc/condor_config
10/24 16:48:59 Using local config files: /scratch/condor/condor_config.local
10/24 16:48:59 DaemonCore: Command Socket at <193.49.27.24:58862>
10/24 16:48:59 Initializing a VANILLA shadow for job 123.0
10/24 16:48:59 (123.0) (12449): Request to run on <193.49.27.11:32772> was ACCEPTED

If you need more files, just ask, please.


                                   
Thanks in advance for your help
Nicolas GUIOT

-----------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
------------------------------------------------