
[Condor-users] job restarts because of Shadow



Hi,

 I have some "long" jobs (a few hours, never more than a day) that
restart again and again before they sometimes finish correctly.

 I checked the logs, and it seems the connection between the Startd
and the Shadow gets lost, which stops and restarts the job.

 I'd like to know if you have any hints for finding out precisely where
my problem comes from. If you also know a way to stop the Shadow's
communication during the job, that could help.

 BTW, I tried to understand some of the macros I saw in the logs.
One is SEC_DEFAULT_SESSION_DURATION, which is set "by default"
to 3600 in my logs, whereas the doc says "Defaults to 8640000 seconds
(100 days)". Could this be related?

                                      Best regards,

                                           Olivier.



ShadowLog.old:

condor_read(): Socket closed when trying to read buffer
1/5 12:40:57 (508.0) (3682): ERROR "Can no longer talk to condor_starter on execute machine (10.38.111.100)" at line 63 in file NTreceivers.C
1/5 12:40:57 PASSWD_CACHE_REFRESH is undefined, using default value of 300
1/5 12:40:57 ******************************************************
1/5 12:40:57 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/5 12:40:57 ** $CondorVersion: 6.6.5 May  3 2004 $
1/5 12:40:57 ** $CondorPlatform: I386-LINUX-RH9 $
1/5 12:40:57 ** PID = 21421
1/5 12:40:57 ******************************************************
1/5 12:40:57 Using config file: /etc/condor/condor_config
1/5 12:40:57 Using local config files: /var/condor/condor_config.local
1/5 12:40:57 DaemonCore: Command Socket at <10.38.111.42:47915>


And an excerpt from the log file of the job:

000 (498.000.000) 01/04 17:35:31 Job submitted from host: <10.38.111.42:42736>
...
001 (498.000.000) 01/04 17:35:36 Job executing on host: <10.38.111.100:32772>
...
006 (498.000.000) 01/04 17:35:44 Image size of job updated: 560376
...
006 (498.000.000) 01/04 17:55:44 Image size of job updated: 1278632
...
007 (498.000.000) 01/04 17:59:50 Shadow exception!
        Can no longer talk to condor_starter on execute machine (10.38.111.100)
        0  -  Run Bytes Sent By Job
        3408177  -  Run Bytes Received By Job
...
001 (498.000.000) 01/04 17:59:52 Job executing on host: <10.38.111.100:32772>
...
006 (498.000.000) 01/04 18:00:00 Image size of job updated: 560396
...
006 (498.000.000) 01/04 18:20:00 Image size of job updated: 1278652
...
005 (498.000.000) 01/04 22:55:36 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 04:54:26, Sys 0 00:00:56  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 04:54:26, Sys 0 00:00:56  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        302950624  -  Run Bytes Sent By Job
        3408177  -  Run Bytes Received By Job
        302950624  -  Total Bytes Sent By Job
        6816354  -  Total Bytes Received By Job
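
One way to check whether the restarts line up with a fixed timeout such as SEC_DEFAULT_SESSION_DURATION is to measure how long each run lasted before the Shadow exception. The following is only a minimal sketch, not part of Condor: it assumes the event header format shown in the job log above (event code, job id, date, time), and the function name `seconds_before_exception` and the `year` argument are made up for illustration. The log carries no year, so one has to be assumed.

```python
import re
from datetime import datetime

# Matches the first line of a job-log event, e.g.
# "001 (498.000.000) 01/04 17:35:36 Job executing on host: ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
)


def seconds_before_exception(log_text, year=2005):
    """Return [(job, seconds)] for each run that ended in a Shadow exception.

    'seconds' is the time between a "Job executing" event (001) and the
    next "Shadow exception" event (007) of the same job.
    The year is an assumption, because the log lines do not include one.
    """
    started = {}  # (cluster, proc) -> time of the latest 001 event
    gaps = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # continuation lines and "..." separators
        code = match.group(1)
        job = (match.group(2), match.group(3))
        when = datetime.strptime(
            f"{year}/{match.group(5)}", "%Y/%m/%d %H:%M:%S"
        )
        if code == "001":
            started[job] = when
        elif code == "007" and job in started:
            elapsed = when - started.pop(job)
            gaps.append((job, int(elapsed.total_seconds())))
    return gaps


# Event lines copied from the job log above.
SAMPLE = """\
001 (498.000.000) 01/04 17:35:36 Job executing on host: <10.38.111.100:32772>
...
007 (498.000.000) 01/04 17:59:50 Shadow exception!
        Can no longer talk to condor_starter on execute machine (10.38.111.100)
...
001 (498.000.000) 01/04 17:59:52 Job executing on host: <10.38.111.100:32772>
...
005 (498.000.000) 01/04 22:55:36 Job terminated.
"""

print(seconds_before_exception(SAMPLE))  # [(('498', '000'), 1454)]
```

For the single restart in this excerpt the run lasted 1454 seconds (about 24 minutes) before the exception, which does not match the 3600-second value by itself. Running the same check over the whole log file would show whether the interval is constant from one restart to the next.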