
[Condor-users] Condor jobs leave directories in hosts/*/execute



I've just set up Condor 6.6.5 on a Linux cluster. Jobs appear to complete OK, but afterwards directories are left behind in the ~condor/hosts/hostname/execute directory.

#  find ./*/execute -mtime -1
./livlae/execute
./livlaf/execute
./livlaf/execute/dir_20562
./livlaf/execute/dir_20567
./livlah/execute
./livlah/execute/dir_4722
./livlai/execute
#
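
For tracking this over time, a minimal sketch of the same survey in Python. It assumes the leftover directories are named dir_<pid> after the starter process (dir_4722 on livlah matches "Starter pid 4722" in the StartLog below); the function name and return layout are only illustrative.

```python
import re
from pathlib import Path

# Assumed naming scheme: dir_<starter pid>, as seen in the find output.
DIR_PATTERN = re.compile(r"^dir_(\d+)$")


def leftover_dirs(hosts_root):
    """Return (host, pid, path) for every dir_<pid> under <hosts_root>/*/execute."""
    found = []
    for execute_dir in sorted(Path(hosts_root).glob("*/execute")):
        host = execute_dir.parent.name
        for entry in sorted(execute_dir.iterdir()):
            match = DIR_PATTERN.match(entry.name)
            if match and entry.is_dir():
                found.append((host, int(match.group(1)), entry))
    return found
```

Calling leftover_dirs on the ~condor/hosts directory should list the same dir_* entries as the find command, grouped by host. Whether a given pid is still running has to be checked on that host itself, since the directories sit on a shared filesystem.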

The only entries in the Condor logs that look exceptional are DEACTIVATE_CLAIM_FORCIBLY and "Error: can't find resource with capability" in StartLog, and "ERROR: the submitting host claims to be in our UidDomain (nerc-bidston.ac.uk), yet its hostname (bilag) does not match" in StarterLog.vm2. CONDOR_HOST is set to livlae.nerc-bidston.ac.uk, UID_DOMAIN and FILESYSTEM_DOMAIN are both set to nerc-bidston.ac.uk, and nslookup on bilag's address returns bilag.nerc-bidston.ac.uk.
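
To illustrate why the StarterLog message might appear even though DNS resolves correctly, here is a sketch of the kind of comparison it suggests. This is an assumption about the check, not Condor's actual code: it supposes the starter compares the hostname reported by the submitting machine against UID_DOMAIN as a domain suffix. The function name is hypothetical.

```python
def hostname_in_uid_domain(hostname, uid_domain):
    """Assumed check: the hostname must end with '.<uid_domain>'."""
    host = hostname.lower().rstrip(".")
    domain = uid_domain.lower().strip(".")
    return host.endswith("." + domain)


# The short name reported in the log fails the suffix test;
# the fully qualified name returned by nslookup passes it.
short_name_ok = hostname_in_uid_domain("bilag", "nerc-bidston.ac.uk")
full_name_ok = hostname_in_uid_domain("bilag.nerc-bidston.ac.uk", "nerc-bidston.ac.uk")
```

Under that assumption, the error would come from the submitting machine identifying itself as plain "bilag" rather than by its fully qualified name.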

How do I ensure jobs clean up after themselves? Are these messages related? If not, should I worry about them?

I haven't seen the same problem in a Solaris installation.

Any suggestions appreciated.

Dick

-----------------------------------------

bilag log $ tail -20 StartLog
7/21 11:35:11 vm2: Got universe "VANILLA" (5) from request classad
7/21 11:35:11 vm2: State change: claim-activation protocol successful
7/21 11:35:11 vm2: Changing activity: Idle -> Busy
7/21 11:35:45 DaemonCore: Command received via TCP from host <192.171.134.241:36396>
7/21 11:35:45 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/21 11:35:45 vm2: Called deactivate_claim_forcibly()
7/21 11:35:45 Starter pid 4722 exited with status 0
7/21 11:35:45 vm2: State change: starter exited
7/21 11:35:45 vm2: Changing activity: Busy -> Idle
7/21 11:35:45 DaemonCore: Command received via UDP from host <192.171.134.241:33597>
7/21 11:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
7/21 11:35:45 vm2: State change: received RELEASE_CLAIM command
7/21 11:35:45 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
7/21 11:35:45 vm2: State change: No preempting claim, returning to owner
7/21 11:35:45 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
7/21 11:35:45 vm2: State change: IS_OWNER is false
7/21 11:35:45 vm2: Changing state: Owner -> Unclaimed
7/21 11:35:45 DaemonCore: Command received via UDP from host <192.171.134.241:33597>
7/21 11:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
7/21 11:35:45 Error: can't find resource with capability (<192.171.134.112:34528>#1933771416)


bilag log $ tail -20 StarterLog.vm2
7/21 11:35:11 ** condor_starter (CONDOR_STARTER) STARTING UP
7/21 11:35:11 ** $CondorVersion: 6.6.5 May 3 2004 $
7/21 11:35:11 ** $CondorPlatform: I386-LINUX-RH9 $
7/21 11:35:11 ** PID = 4722
7/21 11:35:11 ******************************************************
7/21 11:35:11 Using config file: /users/condor/condor_config
7/21 11:35:11 Using local config files: /users/condor/hosts/livlah/condor_config.local
7/21 11:35:11 DaemonCore: Command Socket at <192.171.134.112:34537>
7/21 11:35:11 Done setting resource limits
7/21 11:35:11 Starter communicating with condor_shadow <192.171.134.241:36375>
7/21 11:35:11 Submitting machine is "bilag"
7/21 11:35:11 ERROR: the submitting host claims to be in our UidDomain (nerc-bidston.ac.uk), yet its hostname (bilag) does not match
7/21 11:35:11 Starting a VANILLA universe job with ID: 50.0
7/21 11:35:11 IWD: /users/susa/condor
7/21 11:35:11 About to exec /users/susa/condor/bigloop
7/21 11:35:11 Create_Process succeeded, pid=4725
7/21 11:35:45 Process exited, pid=4725, status=0
7/21 11:35:45 Got SIGQUIT. Performing fast shutdown.
7/21 11:35:45 ShutdownFast all jobs.
7/21 11:35:45 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
bilag log $


--
Richard Gillman
iTSS UNIX Systems Group, Maclean Building, Wallingford OX10 8BB
Tel: 01491 - 692 339