
RE: [Condor-users] Shadow Exception !!!



Hi,
  It looks like you have a file-permissions problem writing to the log file. If you are running as condor, make sure condor owns all the files and directories.
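
One way to check the ownership is sketched below. This is only a rough sketch: the function names are made up for illustration, and the user name "condor" and the install path in the usage note are assumptions taken from this thread. It walks a directory tree and lists every path whose owner is not the expected user.

```python
import os
import pwd


def paths_not_owned_by(root, uid):
    """Return every path under root (root included) whose owner is not uid."""
    mismatched = []
    if os.lstat(root).st_uid != uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports a symlink's own owner rather than its target's
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched


def paths_not_owned_by_user(root, user):
    """Same check, but takes a user name (hypothetical helper).

    Raises KeyError if the user does not exist on this machine.
    """
    return paths_not_owned_by(root, pwd.getpwnam(user).pw_uid)
```

For example, paths_not_owned_by_user("/home/condor/condor-6.7.1", "condor") should return an empty list if condor owns everything under the install directory. Anything it does return is a candidate for chown.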
Kevan

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Condor Grib
Sent: 21 September 2004 12:07
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Shadow Exception !!!


I've installed the latest version of Condor on my PC,
and it's running OK under Linux (Red Hat 9). The
problem appears when I submit a job: after
running for a few seconds a shadow exception appears,
and then the job starts again until another shadow
exception stops it. I don't know what's happening; if I
run the program on one PC without Condor, it works
perfectly.
Looking at the logs, I first saw this in my job's
log:
000 (003.000.000) 09/21 13:40:49 Job submitted from host: <193.147.240.233:36284>
...
001 (003.000.000) 09/21 13:40:52 Job executing on host: <193.147.240.233:36286>
...
006 (003.000.000) 09/21 13:41:00 Image size of job updated: 1332
...
007 (003.000.000) 09/21 13:41:42 Shadow exception!
        Can no longer talk to condor_starter <193.147.240.233:36286>
        0  -  Run Bytes Sent By Job
        14829  -  Run Bytes Received By Job
*****************************************************

So I looked in the StarterLog, and this is what I
got:

Starterlog:

9/21 13:40:52 ******************************************************
9/21 13:40:52 ** condor_starter (CONDOR_STARTER) STARTING UP
9/21 13:40:52 ** /home/condor/condor-6.7.1/sbin/condor_starter
9/21 13:40:52 ** $CondorVersion: 6.7.1 Aug 10 2004 $
9/21 13:40:52 ** $CondorPlatform: I386-LINUX_RH9 $
9/21 13:40:52 ** PID = 15935
9/21 13:40:52 ******************************************************
9/21 13:40:52 Using config file: /home/condor/condor-6.7.1/etc/condor_config
9/21 13:40:52 Using local config files: /home/condor/condor-6.7.1/local.golem/condor_config.local
9/21 13:40:52 DaemonCore: Command Socket at <193.147.240.233:36308>
9/21 13:40:52 Done setting resource limits
9/21 13:40:52 Communicating with shadow <193.147.240.233:36306>
9/21 13:40:52 Submitting machine is "golem.imim.es"
9/21 13:40:52 File transfer completed successfully.
9/21 13:40:52 Starting a VANILLA universe job with ID: 3.0
9/21 13:40:52 IWD: /home/condor/condor-6.7.1/local.golem/execute/dir_15935
9/21 13:40:52 Output file: /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.out
9/21 13:40:52 Error file: /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.err
9/21 13:40:52 About to exec /home/condor/condor-6.7.1/local.golem/execute/dir_15935/condor_exec.exe
9/21 13:40:52 Create_Process succeeded, pid=15937
9/21 13:41:42 Process exited, pid=15937, status=0
9/21 13:41:42 ReliSock: put_file: Failed to open file /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.log, errno = 2.
9/21 13:41:42 ERROR "DoUpload: Failed to send file /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.log, exiting at 1408 " at line 1407 in file file_transfer.C
9/21 13:41:42 ShutdownFast all jobs.
*****************************************************

The ShadowLog:
9/21 13:40:52 ******************************************************
9/21 13:40:52 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/21 13:40:52 ** /home/condor/condor-6.7.1/sbin/condor_shadow
9/21 13:40:52 ** $CondorVersion: 6.7.1 Aug 10 2004 $
9/21 13:40:52 ** $CondorPlatform: I386-LINUX_RH9 $
9/21 13:40:52 ** PID = 15934
9/21 13:40:52 ******************************************************
9/21 13:40:52 Using config file: /home/condor/condor-6.7.1/etc/condor_config
9/21 13:40:52 Using local config files: /home/condor/condor-6.7.1/local.golem/condor_config.local
9/21 13:40:52 DaemonCore: Command Socket at <193.147.240.233:36306>
9/21 13:40:52 Initializing a VANILLA shadow for job 3.0
9/21 13:40:52 (3.0) (15934): Request to run on <193.147.240.233:36286> was ACCEPTED
9/21 13:41:42 (3.0) (15934): ERROR "Can no longer talk to condor_starter <193.147.240.233:36286>" at line 93 in file NTreceivers.C
*********************

Does anyone know where the problem is?

BTW, I have only one machine, so every time a
job is submitted it starts running immediately.
If you need more info, let me know.




		
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users