Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] Condor jobs leave directories in hosts/*/execute

Date: Wed, 21 Jul 2004 14:03:33 +0100
From: Richard Gillman <R.Gillman@xxxxxxxxxx>
Subject: [Condor-users] Condor jobs leave directories in hosts/*/execute

I've just set up Condor 6.6.5 on a Linux cluster. Jobs appear to complete OK, but after they finish, directories are left behind in the ~condor/hosts/hostname/execute directory.

#  find ./*/execute -mtime -1
./livlae/execute
./livlaf/execute
./livlaf/execute/dir_20562
./livlaf/execute/dir_20567
./livlah/execute
./livlah/execute/dir_4722
./livlai/execute
#

The only items in the condor logs that look exceptional are, in StartLog, DEACTIVATE_CLAIM_FORCIBLY and "Error: can't find resource with capability", and in StarterLog.vm2, "ERROR: the submitting host claims to be in our UidDomain (nerc-bidston.ac.uk), yet its hostname (bilag) does not match". I have CONDOR_HOST set to livlae.nerc-bidston.ac.uk; UID_DOMAIN and FILESYSTEM_DOMAIN are both set to nerc-bidston.ac.uk; nslookup on bilag's address gives bilag.nerc-bidston.ac.uk.
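As an illustration of the StarterLog complaint, here is a minimal sketch of a suffix check of the kind the error message describes. It assumes the test is simply "does the submitting host's name end in .UidDomain"; the function name host_in_uid_domain is hypothetical and this is not Condor's actual code.

```shell
# Hypothetical helper, not part of Condor: succeed if HOST ends in ".DOMAIN".
# Assumption: the starter's check is a plain suffix match on the host name.
host_in_uid_domain() {
    host=$1
    domain=$2
    case $host in
        *".$domain") return 0 ;;   # e.g. bilag.nerc-bidston.ac.uk
        *)           return 1 ;;   # e.g. the short name bilag
    esac
}
```

Under that assumption, the short name bilag fails the check against nerc-bidston.ac.uk, while bilag.nerc-bidston.ac.uk passes, which would be consistent with the message appearing even though nslookup returns the fully qualified name.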

How do I ensure jobs clean up after themselves? Are these messages related? If not, should I worry about them?

I haven't seen the same problem in a Solaris installation.

Any suggestions appreciated.

Dick

-----------------------------------------

bilag log $ tail -20 StartLog
7/21 11:35:11 vm2: Got universe "VANILLA" (5) from request classad
7/21 11:35:11 vm2: State change: claim-activation protocol successful
7/21 11:35:11 vm2: Changing activity: Idle -> Busy
7/21 11:35:45 DaemonCore: Command received via TCP from host <192.171.134.241:36396>
7/21 11:35:45 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/21 11:35:45 vm2: Called deactivate_claim_forcibly()
7/21 11:35:45 Starter pid 4722 exited with status 0
7/21 11:35:45 vm2: State change: starter exited
7/21 11:35:45 vm2: Changing activity: Busy -> Idle
7/21 11:35:45 DaemonCore: Command received via UDP from host <192.171.134.241:33597>
7/21 11:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
7/21 11:35:45 vm2: State change: received RELEASE_CLAIM command
7/21 11:35:45 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
7/21 11:35:45 vm2: State change: No preempting claim, returning to owner
7/21 11:35:45 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
7/21 11:35:45 vm2: State change: IS_OWNER is false
7/21 11:35:45 vm2: Changing state: Owner -> Unclaimed
7/21 11:35:45 DaemonCore: Command received via UDP from host <192.171.134.241:33597>
7/21 11:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
7/21 11:35:45 Error: can't find resource with capability (<192.171.134.112:34528>#1933771416)

bilag log $ tail -20 StarterLog.vm2
7/21 11:35:11 ** condor_starter (CONDOR_STARTER) STARTING UP
7/21 11:35:11 ** $CondorVersion: 6.6.5 May 3 2004 $
7/21 11:35:11 ** $CondorPlatform: I386-LINUX-RH9 $
7/21 11:35:11 ** PID = 4722
7/21 11:35:11 ******************************************************
7/21 11:35:11 Using config file: /users/condor/condor_config
7/21 11:35:11 Using local config files: /users/condor/hosts/livlah/condor_config.local
7/21 11:35:11 DaemonCore: Command Socket at <192.171.134.112:34537>
7/21 11:35:11 Done setting resource limits
7/21 11:35:11 Starter communicating with condor_shadow <192.171.134.241:36375>
7/21 11:35:11 Submitting machine is "bilag"
7/21 11:35:11 ERROR: the submitting host claims to be in our UidDomain (nerc-bidston.ac.uk), yet its hostname (bilag) does not match
7/21 11:35:11 Starting a VANILLA universe job with ID: 50.0
7/21 11:35:11 IWD: /users/susa/condor
7/21 11:35:11 About to exec /users/susa/condor/bigloop
7/21 11:35:11 Create_Process succeeded, pid=4725
7/21 11:35:45 Process exited, pid=4725, status=0
7/21 11:35:45 Got SIGQUIT. Performing fast shutdown.
7/21 11:35:45 ShutdownFast all jobs.
7/21 11:35:45 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
bilag log $
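The listing and logs suggest that each leftover directory is named after the starter's process ID: dir_4722 under livlah/execute matches "PID = 4722" in the StarterLog excerpt above. A minimal sketch for spotting such leftovers, assuming that naming holds in general; the function name list_stale_execute_dirs is hypothetical, and the function only reports directories without deleting them.

```shell
# Hypothetical helper, not part of Condor: print dir_<pid> entries under the
# given execute directory whose <pid> is not a live process on this host.
# It only reports; it never deletes anything.
list_stale_execute_dirs() {
    execute_dir=$1
    for d in "$execute_dir"/dir_*; do
        [ -d "$d" ] || continue          # skip an unmatched glob and non-directories
        pid=${d##*/dir_}                 # text after the final "/dir_"
        case $pid in
            ''|*[!0-9]*) continue ;;     # ignore names that are not dir_<digits>
        esac
        # kill -0 sends no signal; it only tests whether the pid can be signalled.
        if ! kill -0 "$pid" 2>/dev/null; then
            printf '%s\n' "$d"
        fi
    done
}
```

For example, list_stale_execute_dirs ~condor/hosts/livlah/execute would be run on livlah itself, because process IDs are only meaningful on the machine that owns the directory. It should be run as root: for a non-root user, kill -0 also fails on live processes owned by other users, which would make their directories look stale.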

--
Richard Gillman
iTSS UNIX Systems Group, Maclean Building, Wallingford OX10 8BB
Tel: 01491 - 692 339

Follow-Ups:
- Re: [Condor-users] Condor jobs leave directories in hosts/*/execute
  - From: Erik Paulson
