
[Condor-users] remote condor job never gets removed



A remote job submitted from a 6.7.18 machine (SuSE 9.3/x86_64) to a 6.7.19 schedd (SuSE 8.2/x86) completes, but it is never removed from the queue and its results are never returned to the submitting machine:

   $ cat remote_vanilla.sub
   universe = vanilla
   executable = vanilla.sh
   requirements = Arch == "INTEL"
   output = $(Cluster).$(Process).out
   error  = $(Cluster).$(Process).err
   should_transfer_files = YES
   when_to_transfer_output = ON_EXIT_OR_EVICT
   log = remote_vanilla.log
   notification = never
   queue
   $ condor_submit -remote cmhost -pool cmhost remote_vanilla.sub
   Submitting job(s).
   Logging submit event(s).
   1 job(s) submitted to cluster 61.
   Spooling data files for 1 jobs...
   $ condor_q -pool cmhost -name cmhost


   -- Schedd: cmhost.bestsystems.co.jp : <172.16.10.117:46010>
    ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     61.0   ajs             5/18 16:55   0+00:00:06 C  0   9.8  vanilla.sh

   0 jobs; 0 idle, 0 running, 0 held
   $ ls -l 61*
   -rw-r--r--  1 ajs users 0 May 18 16:55 61.0.err
   -rw-r--r--  1 ajs users 0 May 18 16:55 61.0.out
   $

The SchedLog shows the job completed but ends with an mrec error:

   5/18 16:38:32 Job 61.0 is finished
   5/18 16:38:32 Added data to SelfDrainingQueue job_is_finished_queue, now has 1 element(s)
   5/18 16:38:32 Registered timer for SelfDrainingQueue job_is_finished_queue, period: 0 (id: 52)
   5/18 16:38:32 Exited check_zombie( 15343, 0x0x856a504 )
   5/18 16:38:32
   5/18 16:38:32 ..................
   5/18 16:38:32 .. Shadow Recs (0/1)
   5/18 16:38:32 ..................

   5/18 16:38:32 Exited delete_shadow_rec( 15343 )
   5/18 16:38:32 -------- Begin starting jobs --------
   5/18 16:38:32 Job 61.-1: not runnable
   5/18 16:38:32 match (<172.16.10.117:46011>#1147937001#5) out of jobs (cluster id 61); relinquishing
   5/18 16:38:32 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
   5/18 16:38:32 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
   5/18 16:38:32 Called send_vacate( <172.16.10.117:46011>, 443 )
   5/18 16:38:32 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
   5/18 16:38:32 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
   5/18 16:38:32 Sent RELEASE_CLAIM to startd on <172.16.10.117:46011>
   5/18 16:38:32 Match record (<172.16.10.117:46011>, 61, -1) deleted
   5/18 16:38:32 ClaimId of deleted match: <172.16.10.117:46011>#1147937001#5
   5/18 16:38:32 -------- Done starting jobs --------
   5/18 16:38:32 Inside SelfDrainingQueue::timerHandler() for job_is_finished_queue
   5/18 16:38:32 Job cleanup for 61.0 will block, calling jobIsFinished() in a thread
   5/18 16:38:32 SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
   5/18 16:38:32 Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 52)
   5/18 16:38:32 DaemonCore: No more children processes to reap.
   5/18 16:38:32 jobIsFinished() completed, calling DestroyProc(61.0)
   5/18 16:38:32 SCHEDD_ROUND_ATTR_JobFinishedHookDone is undefined, using default value of 0
   5/18 16:38:32 Got VACATE_SERVICE from <172.16.10.117:47921>
   5/18 16:38:32 mrec for "<172.16.10.117:46011>#1147937001#5" not found -- match not deleted
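
For anyone who wants to trace the claim through a longer SchedLog, here is a minimal sketch of a filter that pulls out only the entries mentioning a given claim id. It is not part of Condor. The function names rejoin_entries and entries_mentioning are made up for illustration, and it assumes every log entry starts with a "M/D HH:MM:SS" timestamp, so any line without one is treated as the wrapped tail of the previous entry (as happens when a log is pasted into mail).

```python
import re

# Assumption: each SchedLog entry begins with a "M/D HH:MM:SS" timestamp.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}")


def rejoin_entries(text):
    """Merge lines without a timestamp into the entry that precedes them."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line  # wrapped tail of the previous entry
    return entries


def entries_mentioning(entries, needle):
    """Return the entries that contain the given substring."""
    return [entry for entry in entries if needle in entry]


# A few lines from the excerpt above, in their mail-wrapped form.
sample = """\
   5/18 16:38:32 Match record (<172.16.10.117:46011>, 61, -1) deleted
   5/18 16:38:32 ClaimId of deleted match:
   <172.16.10.117:46011>#1147937001#5
   5/18 16:38:32 Got VACATE_SERVICE from <172.16.10.117:47921>
   5/18 16:38:32 mrec for "<172.16.10.117:46011>#1147937001#5" not
   found -- match not deleted
"""

entries = rejoin_entries(sample)
claim_hits = entries_mentioning(entries, "#1147937001#5")
for entry in claim_hits:
    print(entry)
```

On the sample this leaves the "ClaimId of deleted match" entry and the later "mrec ... not found" entry, which is the ordering that makes the final message look like a lookup on a match record that had already been deleted.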

The submit machine and the remote schedd machine are each listed in the other's /etc/hosts. The submit machine's condor_config has the following authentication settings:

   SEC_CLIENT_AUTHENTICATION = OPTIONAL
   SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE

and the remote schedd condor_config has:

   SEC_DEFAULT_AUTHENTICATION = OPTIONAL
   SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE

Do I have a configuration problem?

Andrew

--
Andrew Stubbings
BestSystems, Inc.
Tel: +81 29 860 7080
E-mail: ajs@xxxxxxxxxxxxxxxxx
www.bestsystems.co.jp