
Re: [Condor-users] Shadow Exception !!!




I have had this error before. You need to make sure that the user who executes "condor_submit" or "condor_submit_dag" has read/write privileges in its current working directory.
If you are on Linux this is probably the "condor" user account. Some people try to run Condor as root to overcome file permission problems, but Condor will not submit a job as user root, so do not try this.
In my scripts I always change to the directory where my Condor files are located, to ensure that I am in a directory that allows read and write privileges for the user.


In my case I was using PHP scripts to automatically generate and submit jobs.
So I had to tell PHP to switch to the directory where my Condor files were located and use that as its current working directory.
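
For anyone scripting this in another language, here is a rough Python sketch of the same idea (not the PHP I actually used). It checks that the job directory is readable and writable for the current user, then runs condor_submit with that directory as the working directory. The function names, the dry_run switch, and the submit file name "job.sub" are made up for illustration; adjust them to your setup.

```python
import os
import subprocess


def check_rw_dir(path):
    """Return True if the current user can read, write, and enter `path`."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)


def submit_from(job_dir, submit_file="job.sub", dry_run=False):
    """Run condor_submit with `job_dir` as the working directory.

    `job.sub` is a hypothetical submit file name. With dry_run=True the
    command is only returned, not executed.
    """
    if not check_rw_dir(job_dir):
        raise PermissionError(f"{job_dir} is missing or not readable/writable")
    cmd = ["condor_submit", submit_file]
    if dry_run:
        return cmd
    # cwd= starts the child process in job_dir, which has the same effect
    # as changing directory before calling condor_submit.
    subprocess.run(cmd, cwd=job_dir, check=True)
    return cmd
```

Calling submit_from("/path/to/jobs", dry_run=True) first is a cheap way to confirm the permission check passes for the account your script really runs as (for a web script this is often the web server's user, not your own login).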



Hope this helps out.


JW


Condor Grib wrote:


I've installed the latest version of Condor on my PC,
and it's running OK under Linux (Red Hat 9). The
problem appears when I submit a process to run: after
running for a few seconds a shadow exception appears,
and then the process starts again until another shadow
exception stops it. I don't know what's happening; if you
run the process on one PC without using Condor, it works
perfectly. Looking at the logs, I first saw this in the log of my
program:
000 (003.000.000) 09/21 13:40:49 Job submitted from host: <193.147.240.233:36284>
...
001 (003.000.000) 09/21 13:40:52 Job executing on host: <193.147.240.233:36286>
...
006 (003.000.000) 09/21 13:41:00 Image size of job updated: 1332
...
007 (003.000.000) 09/21 13:41:42 Shadow exception!
       Can no longer talk to condor_starter <193.147.240.233:36286>
       0  -  Run Bytes Sent By Job
       14829  -  Run Bytes Received By Job
*****************************************************

So I looked in the StarterLog, and this is what I
got:

Starterlog:

9/21 13:40:52 ******************************************************
9/21 13:40:52 ** condor_starter (CONDOR_STARTER) STARTING UP
9/21 13:40:52 ** /home/condor/condor-6.7.1/sbin/condor_starter
9/21 13:40:52 ** $CondorVersion: 6.7.1 Aug 10 2004 $
9/21 13:40:52 ** $CondorPlatform: I386-LINUX_RH9 $
9/21 13:40:52 ** PID = 15935
9/21 13:40:52 ******************************************************
9/21 13:40:52 Using config file: /home/condor/condor-6.7.1/etc/condor_config
9/21 13:40:52 Using local config files: /home/condor/condor-6.7.1/local.golem/condor_config.local
9/21 13:40:52 DaemonCore: Command Socket at <193.147.240.233:36308>
9/21 13:40:52 Done setting resource limits
9/21 13:40:52 Communicating with shadow <193.147.240.233:36306>
9/21 13:40:52 Submitting machine is "golem.imim.es"
9/21 13:40:52 File transfer completed successfully.
9/21 13:40:52 Starting a VANILLA universe job with ID: 3.0
9/21 13:40:52 IWD: /home/condor/condor-6.7.1/local.golem/execute/dir_15935
9/21 13:40:52 Output file: /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.out
9/21 13:40:52 Error file: /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.err
9/21 13:40:52 About to exec /home/condor/condor-6.7.1/local.golem/execute/dir_15935/condor_exec.exe
9/21 13:40:52 Create_Process succeeded, pid=15937
9/21 13:41:42 Process exited, pid=15937, status=0
9/21 13:41:42 ReliSock: put_file: Failed to open file /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.log, errno = 2.
9/21 13:41:42 ERROR "DoUpload: Failed to send file /home/condor/condor-6.7.1/local.golem/execute/dir_15935/2program.log, exiting at 1408" at line 1407 in file file_transfer.C
9/21 13:41:42 ShutdownFast all jobs.
*****************************************************

Shadowlog:

9/21 13:40:52 ******************************************************
9/21 13:40:52 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/21 13:40:52 ** /home/condor/condor-6.7.1/sbin/condor_shadow
9/21 13:40:52 ** $CondorVersion: 6.7.1 Aug 10 2004 $
9/21 13:40:52 ** $CondorPlatform: I386-LINUX_RH9 $
9/21 13:40:52 ** PID = 15934
9/21 13:40:52 ******************************************************
9/21 13:40:52 Using config file: /home/condor/condor-6.7.1/etc/condor_config
9/21 13:40:52 Using local config files: /home/condor/condor-6.7.1/local.golem/condor_config.local
9/21 13:40:52 DaemonCore: Command Socket at <193.147.240.233:36306>
9/21 13:40:52 Initializing a VANILLA shadow for job 3.0
9/21 13:40:52 (3.0) (15934): Request to run on <193.147.240.233:36286> was ACCEPTED
9/21 13:41:42 (3.0) (15934): ERROR "Can no longer talk to condor_starter <193.147.240.233:36286>" at line 93 in file NTreceivers.C
*********************

Does anyone know where the problem is?

BTW, I have only one machine, so every time a
process is sent it starts running immediately.
If you need more info, let me know.





_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users