
[Condor-users] restarting jobs after they ran out of disk space.



Hi

I gave Condor ~12k jobs, and after ~11k of them had completed I ran out of disk space to write results to. I've since made more disk space available, but I can't get the remaining jobs to run: I just get 'no match found'. They should match, since they're the same as the ones that did run. If I submit them as new jobs they run, but I really don't want to have to work out which ones ran and which didn't.

Any ideas on how I can get these jobs going again?

Many thanks

extracts from logs...


[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/MasterLog
5/10 10:16:03 Preen pid is 29411
5/10 10:16:16 Child 29411 died, but not a daemon -- Ignored
5/11 10:16:03 Preen pid is 11290
5/11 10:16:16 Child 11290 died, but not a daemon -- Ignored
5/12 09:32:17 The SCHEDD (pid 22048) exited with status 4
5/12 09:32:17 Sending obituary for "/home/condor/condor/sbin/condor_schedd"
5/12 09:32:17 restarting /home/condor/condor/sbin/condor_schedd in 10 seconds
5/12 09:32:27 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 25242
5/12 09:44:08 The SCHEDD (pid 25242) exited with status 4
5/12 09:44:08 Sending obituary for "/home/condor/condor/sbin/condor_schedd"
5/12 09:44:08 restarting /home/condor/condor/sbin/condor_schedd in 10 seconds
5/12 09:44:18 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 25368
5/12 09:56:15 The SCHEDD (pid 25368) exited with status 4
5/12 09:56:15 Sending obituary for "/home/condor/condor/sbin/condor_schedd"
5/12 09:56:15 restarting /home/condor/condor/sbin/condor_schedd in 10 seconds
5/12 09:56:25 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 25472
5/12 10:00:29 The NEGOTIATOR (pid 24513) exited with status 4
5/12 10:00:29 Sending obituary for "/home/condor/condor/sbin/condor_negotiator"
5/12 10:00:29 restarting /home/condor/condor/sbin/condor_negotiator in 10 seconds
5/12 10:00:39 Started DaemonCore process "/home/condor/condor/sbin/condor_negotiator", pid and pgroup = 25510
5/12 10:00:39 The NEGOTIATOR (pid 25510) exited with status 4
5/12 10:00:39 Sending obituary for "/home/condor/condor/sbin/condor_negotiator"
5/12 10:00:39 restarting /home/condor/condor/sbin/condor_negotiator in 11 seconds
5/12 10:00:50 Started DaemonCore process "/home
5/15 09:12:59 Started DaemonCore process "/home/condor/condor/sbin/condor_negotiator", pid and pgroup = 8535
5/15 09:54:26 Preen pid is 8964
5/15 09:54:43 Child 8964 died, but not a daemon -- Ignored
5/15 14:24:38 Got SIGTERM. Performing graceful shutdown.
5/15 14:24:38 Sent SIGTERM to COLLECTOR (pid 8294)
5/15 14:24:38 Sent SIGTERM to NEGOTIATOR (pid 8535)
5/15 14:24:38 Sent SIGTERM to SCHEDD (pid 8296)
5/15 14:24:38 The COLLECTOR (pid 8294) exited with status 0
5/15 14:24:38 The NEGOTIATOR (pid 8535) exited with status 0
5/15 14:24:50 The SCHEDD (pid 8296) exited with status 0
5/15 14:24:50 All daemons are gone.  Exiting.
5/15 14:24:50 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
5/15 14:26:45 ******************************************************
5/15 14:26:45 ** condor_master (CONDOR_MASTER) STARTING UP
5/15 14:26:45 ** /home/condor/condor/sbin/condor_master
5/15 14:26:45 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/15 14:26:45 ** $CondorPlatform: I386-LINUX_RH9 $
5/15 14:26:45 ** PID = 11894
5/15 14:26:45 ******************************************************
5/15 14:26:45 Using config file: /users/condor/condor_config
5/15 14:26:45 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/15 14:26:45 DaemonCore: Command Socket at <192.168.0.2:34952>
5/15 14:26:45 Started DaemonCore process "/home/condor/condor/sbin/condor_collector", pid and pgroup = 11895
5/15 14:26:45 Started DaemonCore process "/home/condor/condor/sbin/condor_negotiator", pid and pgroup = 11896
5/15 14:26:45 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 11897
5/15 15:26:45 Preen pid is 12625
5/15 15:26:59 Child 12625 died, but not a daemon -- Ignored
[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/CollectorLog
Now in new log file /home/condor/hosts/galaxy/log/CollectorLog
5/16 12:02:16 (Sent 12 ads in response to query)
5/16 12:07:17 (Sent 27 ads in response to query)
5/16 12:07:17 Got QUERY_STARTD_PVT_ADS
5/16 12:07:17 (Sent 12 ads in response to query)
5/16 12:09:06 Housekeeper:  Ready to clean old ads
5/16 12:09:06   Cleaning StartdAds ...
5/16 12:09:06   Cleaning StartdPrivateAds ...
5/16 12:09:06   Cleaning ScheddAds ...
5/16 12:09:06   Cleaning SubmittorAds ...
5/16 12:09:06   Cleaning LicenseAds ...
5/16 12:09:06   Cleaning MasterAds ...
5/16 12:09:06   Cleaning CkptServerAds ...
5/16 12:09:06   Cleaning CollectorAds ...
5/16 12:09:06   Cleaning StorageAds ...
5/16 12:09:06 Housekeeper:  Done cleaning
[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/MatchLog
5/16 11:12:47 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:12:47 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:12:47 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:12:47 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:12:47 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:20:56 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:20:56 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:20:56 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:20:57 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:20:57 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:25:57 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:25:57 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:25:57 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:25:57 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:25:57 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:30:57 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:30:57 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:30:57 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:30:57 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:30:57 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:39:06 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:39:06 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:39:07 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:39:07 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:39:07 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:44:07 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:44:07 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:44:07 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:44:07 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:44:07 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:49:07 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:49:07 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:49:07 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:49:07 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:49:08 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/NegotiatorLog
5/16 11:57:16     Request 39166.00000:
5/16 11:57:16 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 11:57:16     Got NO_MORE_JOBS;  done negotiating
5/16 11:57:16 ---------- Finished Negotiation Cycle ----------
5/16 12:02:16 ---------- Started Negotiation Cycle ----------
5/16 12:02:16 Phase 1:  Obtaining ads from collector ...
5/16 12:02:16   Getting all public ads ...
5/16 12:02:16   Sorting 27 ads ...
5/16 12:02:16   Getting startd private ads ...
5/16 12:02:16 Got ads: 27 public and 12 private
5/16 12:02:16 Public ads include 1 submitter, 12 startd
5/16 12:02:16 Phase 2:  Performing accounting ...
5/16 12:02:16 Phase 3:  Sorting submitter ads by priority ...
5/16 12:02:16 Phase 4.1:  Negotiating with schedds ...
5/16 12:02:16 Negotiating with tanya@localdomain localhost at <192.168.0.2:34953>
5/16 12:02:16     Request 39144.00109:
5/16 12:02:16 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16     Request 39154.00252:
5/16 12:02:16 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16     Request 39164.00399:
5/16 12:02:16 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:16     Request 39165.00000:
5/16 12:02:16 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:17     Request 39166.00000:
5/16 12:02:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:02:17     Got NO_MORE_JOBS;  done negotiating
5/16 12:02:17 ---------- Finished Negotiation Cycle ----------
5/16 12:07:17 ---------- Started Negotiation Cycle ----------
5/16 12:07:17 Phase 1:  Obtaining ads from collector ...
5/16 12:07:17   Getting all public ads ...
5/16 12:07:17   Sorting 27 ads ...
5/16 12:07:17   Getting startd private ads ...
5/16 12:07:17 Got ads: 27 public and 12 private
5/16 12:07:17 Public ads include 1 submitter, 12 startd
5/16 12:07:17 Phase 2:  Performing accounting ...
5/16 12:07:17 Phase 3:  Sorting submitter ads by priority ...
5/16 12:07:17 Phase 4.1:  Negotiating with schedds ...
5/16 12:07:17 Negotiating with tanya@localdomain localhost at <192.168.0.2:34953>
5/16 12:07:17     Request 39144.00109:
5/16 12:07:17 Rejected 39144.109 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17     Request 39154.00252:
5/16 12:07:17 Rejected 39154.252 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17     Request 39164.00399:
5/16 12:07:17 Rejected 39164.399 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17     Request 39165.00000:
5/16 12:07:17 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17     Request 39166.00000:
5/16 12:07:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
5/16 12:07:17     Got NO_MORE_JOBS;  done negotiating
5/16 12:07:17 ---------- Finished Negotiation Cycle ----------
[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/ShadowLog
5/15 14:21:48 ******************************************************
5/15 14:21:48 ** condor_shadow (CONDOR_SHADOW) STARTING UP
5/15 14:21:48 ** /home/condor/condor/sbin/condor_shadow
5/15 14:21:48 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/15 14:21:48 ** $CondorPlatform: I386-LINUX_RH9 $
5/15 14:21:48 ** PID = 11840
5/15 14:21:48 ******************************************************
5/15 14:21:48 Using config file: /users/condor/condor/etc/condor_config
5/15 14:21:48 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/15 14:21:48 DaemonCore: Command Socket at <192.168.0.2:34925>
5/15 14:21:49 Initializing a VANILLA shadow
5/15 14:21:49 (39166.1) (11840): Request to run on <192.168.0.12:40259> was ACCEPTED
5/15 14:21:50 ******************************************************
5/15 14:21:50 ** condor_shadow (CONDOR_SHADOW) STARTING UP
5/15 14:21:50 ** /home/condor/condor/sbin/condor_shadow
5/15 14:21:50 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/15 14:21:50 ** $CondorPlatform: I386-LINUX_RH9 $
5/15 14:21:50 ** PID = 11843
5/15 14:21:50 ******************************************************
5/15 14:21:50 Using config file: /users/condor/condor/etc/condor_config
5/15 14:21:50 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/15 14:21:50 DaemonCore: Command Socket at <192.168.0.2:34929>
5/15 14:21:51 Initializing a VANILLA shadow
5/15 14:21:52 (39166.5) (11843): Request to run on <192.168.0.14:45528> was ACCEPTED
5/15 14:23:28 (39166.4) (11771): get_file(): Failed to open file /home/tanya/chel/pro_74_uni_chelicerata_3205.fsa.out, errno = 13.
5/15 14:23:28 (39166.4) (11771): ERROR "Can no longer talk to condor_starter on execute machine (192.168.0.14)" at line 63 in file NTreceivers.C
5/15 14:23:29 ******************************************************
5/15 14:23:29 ** condor_shadow (CONDOR_SHADOW) STARTING UP
5/15 14:23:29 ** /home/condor/condor/sbin/condor_shadow
5/15 14:23:29 ** $CondorVersion: 6.6.6 Jul 26 2004 $
5/15 14:23:29 ** $CondorPlatform: I386-LINUX_RH9 $
5/15 14:23:29 ** PID = 11856
5/15 14:23:29 ******************************************************
5/15 14:23:29 Using config file: /users/condor/condor/etc/condor_config
5/15 14:23:29 Using local config files: /users/condor/hosts/galaxy/condor_config.local
5/15 14:23:29 DaemonCore: Command Socket at <192.168.0.2:34933>
5/15 14:23:30 Initializing a VANILLA shadow
5/15 14:23:30 (39166.4) (11856): Request to run on <192.168.0.14:45528> was ACCEPTED
5/15 14:24:39 (39166.5) (11843): Job 39166.5 is being evicted
5/15 14:24:39 (39166.5) (11843): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
5/15 14:24:41 (39166.4) (11856): Job 39166.4 is being evicted
5/15 14:24:41 (39166.4) (11856): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
5/15 14:24:43 (39166.1) (11840): Job 39166.1 is being evicted
5/15 14:24:43 (39166.1) (11840): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
5/15 14:24:45 (39166.3) (11839): Job 39166.3 is being evicted
5/15 14:24:45 (39166.3) (11839): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
5/15 14:24:47 (39166.0) (11830): Job 39166.0 is being evicted
5/15 14:24:47 (39166.0) (11830): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
5/15 14:24:49 (39166.2) (11807): Job 39166.2 is being evicted
5/15 14:24:50 (39166.2) (11807): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
[root@galaxy chel]# tail -50 /home/condor/condor/local.galaxy/log/StartLog
1/10 07:59:59 ******************************************************
1/10 07:59:59 ** condor_startd (CONDOR_STARTD) STARTING UP
1/10 07:59:59 ** /home/condor/condor/sbin/condor_startd
1/10 07:59:59 ** $CondorVersion: 6.6.6 Jul 26 2004 $
1/10 07:59:59 ** $CondorPlatform: I386-LINUX_RH9 $
1/10 07:59:59 ** PID = 18119
1/10 07:59:59 ******************************************************
1/10 07:59:59 Using config file: /users/condor/condor_config
1/10 07:59:59 Using local config files: /users/condor/hosts/galaxy/condor_config.local
1/10 07:59:59 DaemonCore: Command Socket at <192.168.0.2:40573>
1/10 07:59:59 Error computing physical memory with calc_phys_mem().
               MEMORY parameter not defined in config file.
               Try setting MEMORY to the number of megabytes of RAM.
1/10 07:59:59 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
1/10 08:08:11 ******************************************************
1/10 08:08:11 ** condor_startd (CONDOR_STARTD) STARTING UP
1/10 08:08:11 ** /home/condor/condor/sbin/condor_startd
1/10 08:08:11 ** $CondorVersion: 6.6.6 Jul 26 2004 $
1/10 08:08:11 ** $CondorPlatform: I386-LINUX_RH9 $
1/10 08:08:11 ** PID = 18228
1/10 08:08:11 ******************************************************
1/10 08:08:11 Using config file: /users/condor/condor_config
1/10 08:08:11 Using local config files: /users/condor/hosts/galaxy/condor_config.local
1/10 08:08:11 DaemonCore: Command Socket at <192.168.0.2:40584>
1/10 08:08:11 Error computing physical memory with calc_phys_mem().
               MEMORY parameter not defined in config file.
               Try setting MEMORY to the number of megabytes of RAM.
1/10 08:08:11 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
1/10 08:59:09 ******************************************************
1/10 08:59:09 ** condor_startd (CONDOR_STARTD) STARTING UP
1/10 08:59:09 ** /home/condor/condor/sbin/condor_startd
1/10 08:59:09 ** $CondorVersion: 6.6.6 Jul 26 2004 $
1/10 08:59:09 ** $CondorPlatform: I386-LINUX_RH9 $
1/10 08:59:09 ** PID = 18819
1/10 08:59:09 ******************************************************
1/10 08:59:09 Using config file: /users/condor/condor_config
1/10 08:59:09 Using local config files: /users/condor/hosts/galaxy/condor_config.local
1/10 08:59:09 DaemonCore: Command Socket at <192.168.0.2:40643>
1/10 08:59:17 vm1: New machine resource allocated
1/10 08:59:17 vm2: New machine resource allocated
1/10 08:59:17 About to run initial benchmarks.
1/10 08:59:22 Completed initial benchmarks.
1/10 08:59:22 vm1: State change: IS_OWNER is false
1/10 08:59:22 vm1: Changing state: Owner -> Unclaimed
1/10 08:59:22 vm2: State change: IS_OWNER is false
1/10 08:59:22 vm2: Changing state: Owner -> Unclaimed
[root@galaxy chel]#
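
The MatchLog and NegotiatorLog extracts above repeat the same handful of rejections every negotiation cycle. As a rough sketch only, assuming the "Rejected <cluster>.<proc> ... no match found" line shape seen in those extracts, something like the following could collapse them into a count per job ID. The function name `rejected_jobs` and the `sample` text are made up for illustration; this is not a Condor tool.

```python
import re
from collections import Counter

# Assumed line shape, copied from the MatchLog extract above:
#   5/16 12:07:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found
# finditer() is used on the whole text so that entries which have been
# run together on one line are still picked up.
REJECTED = re.compile(r"Rejected (\d+\.\d+) \S+ .*?: no match found")


def rejected_jobs(text):
    """Return a Counter mapping 'cluster.proc' job IDs to rejection counts."""
    counts = Counter()
    for match in REJECTED.finditer(text):
        counts[match.group(1)] += 1
    return counts


# Hypothetical sample built from lines in the extract above.
sample = (
    "5/16 12:02:16 Rejected 39165.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found\n"
    "5/16 12:02:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found\n"
    "5/16 12:07:17 Rejected 39166.0 tanya@localdomain localhost <192.168.0.2:34953>: no match found\n"
)

for job_id, count in sorted(rejected_jobs(sample).items()):
    print(job_id, count)
```

Run against the real MatchLog, this would only show which job IDs are being rejected and how often; it does not say why they fail to match.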
