
[Condor-users] Shadow exception!



Dear All,
     This is my first mail on this mailing list. I have just installed
Condor on AMD64 with Fedora Core 6. For testing purposes I have
only two computers in my pool, named:

ibm15 ---> is central manager, execute, submit node
ibm16 ---> is execute, submit node

When I submit jobs from ibm15, they execute on both "ibm15" and "ibm16"
without any error.

But when I submit jobs from "ibm16", the jobs that execute on ibm16 run
without any problem, while the jobs that execute on "ibm15" give an error
(please see new.log at the end of this mail).

I have also copied the "ShadowLog" file at the end of this email. I have
searched a lot on the internet and I think it is a permissions problem,
but even after trying different things I don't know how to correct it.
Any help or suggestions will be greatly appreciated.

Thanks
Asim

"new.log"
--------------------------------------------
...
001 (065.000.000) 08/01 17:18:45 Job executing on host: <202.241.97.56:52835>
...
005 (065.000.000) 08/01 17:18:45 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        824  -  Run Bytes Sent By Job
        4929913  -  Run Bytes Received By Job
        824  -  Total Bytes Sent By Job
        4929913  -  Total Bytes Received By Job
...
001 (057.000.000) 08/01 17:18:48 Job executing on host: <202.241.97.55:47678>
...
007 (057.000.000) 08/01 17:18:48 Shadow exception!
        Unable to talk to job: disconnected

        82  -  Run Bytes Sent By Job
        160  -  Run Bytes Received By Job
--------------------------------------------

"ShadowLog"
--------------------------------------------

8/1 17:19:29 (?.?) (10939):******* Standard Shadow starting up *******
8/1 17:19:29 (?.?) (10939):** $CondorVersion: 6.8.5 May 17 2007 $
8/1 17:19:29 (?.?) (10939):** $CondorPlatform: X86_64-LINUX_RHEL3 $
8/1 17:19:29 (?.?) (10939):*******************************************
8/1 17:19:29 (?.?) (10939):uid=0, euid=504, gid=0, egid=504
8/1 17:19:29 (?.?) (10939):Hostname = "<202.241.97.55:47678>", Job = 58.0
8/1 17:19:29 (58.0) (10939):Requesting Primary Starter
8/1 17:19:29 (58.0) (10939):Shadow: Request to run a job was ACCEPTED
8/1 17:19:29 (58.0) (10939):Shadow: RSC_SOCK connected, fd = 17
8/1 17:19:29 (58.0) (10939):Shadow: CLIENT_LOG connected, fd = 18
8/1 17:19:29 (58.0) (10939):My_Filesystem_Domain = "naregi.hokudai.ac.jp"
8/1 17:19:29 (58.0) (10939):My_UID_Domain = "naregi.hokudai.ac.jp"
8/1 17:19:37 (57.0) (10938):ERROR "Unable to talk to job: disconnected
" at line 135 in file receivers.C
8/1 17:19:37 (57.0) (10938):Shadow: DoCleanup: unlinking TmpCkpt '/scratch/condor/spool/cluster57.proc0.subproc0.tmp'
8/1 17:19:37 (57.0) (10938):Trying to unlink /scratch/condor/spool/cluster57.proc0.subproc0.tmp
8/1 17:19:39 (58.0) (10939):ERROR "Unable to talk to job: disconnected
" at line 135 in file receivers.C
8/1 17:19:39 (58.0) (10939):Shadow: DoCleanup: unlinking TmpCkpt '/scratch/condor/spool/cluster58.proc0.subproc0.tmp'
8/1 17:19:39 (58.0) (10939):Trying to unlink /scratch/condor/spool/cluster58.proc0.subproc0.tmp
-----------------------------------------
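
The event headers in new.log above appear to follow a fixed layout: a three-digit event code, the job id as (cluster.proc.subproc), a date, a time, and a message, with indented detail lines underneath. Assuming that layout holds for the whole log, a minimal Python sketch for pulling out the 007 "Shadow exception!" events might look like the following. The names parse_events, shadow_exceptions and SAMPLE are made up for illustration and are not part of Condor.

```python
import re

# Header layout inferred from the new.log excerpt above (an assumption), e.g.
# "007 (057.000.000) 08/01 17:18:48 Shadow exception!"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(text):
    """Return one dict per event header; detail lines attach to the preceding event."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            code, cluster, proc, _subproc, date, time, msg = m.groups()
            events.append({
                "code": code,
                "job": f"{int(cluster)}.{int(proc)}",  # "057.000" -> "57.0"
                "date": date,
                "time": time,
                "message": msg,
                "details": [],
            })
        elif events and line.strip() and line.strip() != "...":
            # Non-empty line that is neither a header nor the "..." separator
            events[-1]["details"].append(line.strip())
    return events


def shadow_exceptions(events):
    """Keep only events whose code is 007, as shown for 'Shadow exception!' above."""
    return [e for e in events if e["code"] == "007"]


# Excerpt copied from new.log above
SAMPLE = """\
001 (057.000.000) 08/01 17:18:48 Job executing on host: <202.241.97.55:47678>
...
007 (057.000.000) 08/01 17:18:48 Shadow exception!
        Unable to talk to job: disconnected

        82  -  Run Bytes Sent By Job
        160  -  Run Bytes Received By Job
"""

if __name__ == "__main__":
    for e in shadow_exceptions(parse_events(SAMPLE)):
        print(e["job"], e["time"], e["message"], "|", e["details"][0])
```

Run against the full user log, this would list which jobs hit the exception and when, which makes it easier to match them against the ShadowLog entries for the same job ids.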