[Condor-users] Strange behaviour AND worker node jobs are keeping idle for ever



When I submit jobs on the master they execute, but when I submit them on the worker nodes they stay idle forever.

One more strange behaviour: while the master is executing a job, the worker reports the following (matched, but serving users with a better priority):
035.009:  Run analysis summary.  Of 2 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match, but are serving users with a better priority in the pool
      0 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job

Once the master has completed its pool jobs, the worker shows this instead (matched, but rejected for unknown reasons):
035.009:  Run analysis summary.  Of 2 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      2 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
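
To compare the two summaries above side by side, a small Python sketch like the following can pull out the per-reason counts. The function names `parse_analysis` and `nonzero_reasons` are only illustrations and are not part of Condor; the sample text is the second summary from this post, copied verbatim.

```python
import re

def parse_analysis(text):
    """Map each reason line of an analysis summary to its machine count."""
    counts = {}
    for line in text.splitlines():
        # Reason lines look like "<count> <reason text>"; the header line
        # ("035.009:  Run analysis summary. ...") does not match this shape.
        m = re.match(r"\s*(\d+)\s+(.*\S)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

def nonzero_reasons(counts):
    """Return only the reasons that apply to at least one machine."""
    return {reason: n for reason, n in counts.items() if n > 0}

# Second summary from this post, copied verbatim.
after_master_finished = """\
035.009:  Run analysis summary.  Of 2 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      2 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
"""

if __name__ == "__main__":
    print(nonzero_reasons(parse_analysis(after_master_finished)))
```

Running the same function on the first summary would show both machines under the "better priority" reason instead, which makes the change between the two states easy to see.
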
One more question. The log also contains these errors:

2/12 21:35:27 Error on stat(/dev/:0,0xfeffc370), errno = 2(No such file or directory)
2/12 21:35:27 Error on stat(/dev/mouse,0xfeffc530), errno = 2(No such file or directory)
Will they cause any problems?
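
For reference, the errno value in those two lines can be decoded with a short Python sketch. It only shows that errno 2 is `ENOENT`, meaning the path being checked does not exist on the machine; it says nothing about whether Condor actually needs those device files. The helper names `describe_errno` and `stat_status` are hypothetical.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

def stat_status(path):
    """Return 'exists' if stat() succeeds, else the symbolic errno name."""
    try:
        os.stat(path)
        return "exists"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))

if __name__ == "__main__":
    # errno value printed in the two stat() log lines
    print(describe_errno(2))
    # Path taken from the log; the result depends on the machine it runs on.
    print("/dev/mouse", "->", stat_status("/dev/mouse"))
```
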
 
The log files from my worker node are shown below.
##collector log


2/12 20:13:19 PASSWD_CACHE_REFRESH is undefined, using default value of 300

2/12 20:13:20 ******************************************************
2/12 20:13:20 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
2/12 20:13:20 ** /home/condor/condor/sbin/condor_collector
2/12 20:13:20 ** $CondorVersion: 6.6.10 Jun 13 2005 $
2/12 20:13:20 ** $CondorPlatform: I386-LINUX_RH9 $
2/12 20:13:20 ** PID = 3437
2/12 20:13:20 ******************************************************
2/12 20:13:20 Using config file: /home/condor/condor/etc/condor_config
2/12 20:13:20 Using local config files: /home/condor/condor_config.local
2/12 20:13:20 Current Socket bufsize=108k
2/12 20:13:20 Current Socket bufsize=16k
2/12 20:13:20 Reset OS socket buffer size to 255k
2/12 20:13:20 DaemonCore: Command Socket at <172.16.16.42:9618>
2/12 20:13:20 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
2/12 20:13:20 COLLECTOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:13:20 In ViewServer::Init()
2/12 20:13:20 In CollectorDaemon::Init()
2/12 20:13:20 In ViewServer::Config()
2/12 20:13:20 In CollectorDaemon::Config()
2/12 20:13:20 COLLECTOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:13:20 Will use UDP to update collector
2/12 20:13:20 No SocketCache, will refuse TCP updates
2/12 20:13:20 enable: Creating stats hash table
2/12 20:13:21 DaemonCore: in SendAliveToParent()
2/12 20:13:21 DaemonCore: attempting to connect to '<172.16.16.42:32791>'
2/12 20:13:21 COLLECTOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:28:20 Housekeeper:  Ready to clean old ads
2/12 20:28:20     Cleaning StartdAds ...
2/12 20:28:20     Cleaning StartdPrivateAds ...
2/12 20:28:20     Cleaning ScheddAds ...
2/12 20:28:20     Cleaning SubmittorAds ...
2/12 20:28:20     Cleaning LicenseAds ...
2/12 20:28:20     Cleaning MasterAds ...
2/12 20:28:20     Cleaning CkptServerAds ...
2/12 20:28:20     Cleaning CollectorAds ...
2/12 20:28:20     Cleaning StorageAds ...
2/12 20:28:20 Housekeeper:  Done cleaning
2/12 20:32:51 DaemonCore: in SendAliveToParent()

##negotiator log

2/12 20:13:20 PASSWD_CACHE_REFRESH is undefined, using default value of 300

2/12 20:13:20 ******************************************************
2/12 20:13:20 ** condor_negotiator (CONDOR_NEGOTIATOR) STARTING UP
2/12 20:13:20 ** /home/condor/condor/sbin/condor_negotiator
2/12 20:13:20 ** $CondorVersion: 6.6.10 Jun 13 2005 $
2/12 20:13:20 ** $CondorPlatform: I386-LINUX_RH9 $
2/12 20:13:20 ** PID = 3440
2/12 20:13:20 ******************************************************
2/12 20:13:20 Using config file: /home/condor/condor/etc/condor_config
2/12 20:13:20 Using local config files: /home/condor/condor_config.local
2/12 20:13:20 DaemonCore: Command Socket at <172.16.16.42:9614>
2/12 20:13:20 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
2/12 20:13:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:13:20 About to truncate log /home/condor/spool/Accountantnew.log
2/12 20:13:20 ACCOUNTANT_HOST = None (local)
2/12 20:13:20 NEGOTIATOR_INTERVAL = 300 sec
2/12 20:13:20 NEGOTIATOR_TIMEOUT = 30 sec
2/12 20:13:20 PREEMPTION_REQUIREMENTS = (CurrentTime - EnteredCurrentState) > (1 * (60 * 60)) && RemoteUserPrio > SubmittorPrio * 1.2
2/12 20:13:20 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
2/12 20:13:20 ---------- Started Negotiation Cycle ----------
2/12 20:13:20 Phase 1:  Obtaining ads from collector ...
2/12 20:13:20   Getting all public ads ...
2/12 20:13:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:13:20   Sorting 2 ads ...
2/12 20:13:20   Getting startd private ads ...
2/12 20:13:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:13:20 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/12 20:13:20 Couldn't fetch ads: communication error
2/12 20:13:20 Aborting negotiation cycle
2/12 20:13:21 DaemonCore: in SendAliveToParent()
2/12 20:13:21 DaemonCore: attempting to connect to '<172.16.16.42:32791>'
2/12 20:13:21 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:18:20 ---------- Started Negotiation Cycle ----------
2/12 20:18:20 Phase 1:  Obtaining ads from collector ...
2/12 20:18:20   Getting all public ads ...
2/12 20:18:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:18:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:18:20   Sorting 8 ads ...
2/12 20:18:20   Getting startd private ads ...
2/12 20:18:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:18:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:18:20 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/12 20:18:20 Couldn't fetch ads: communication error
2/12 20:18:20 Aborting negotiation cycle
2/12 20:23:20 ---------- Started Negotiation Cycle ----------
2/12 20:23:20 Phase 1:  Obtaining ads from collector ...
2/12 20:23:20   Getting all public ads ...
2/12 20:23:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:23:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:23:20   Sorting 8 ads ...
2/12 20:23:20   Getting startd private ads ...
2/12 20:23:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:23:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:23:20 condor_read(): Socket closed when trying to read buffer
2/12 20:23:20 Couldn't fetch ads: communication error
2/12 20:23:20 Aborting negotiation cycle
2/12 20:28:20 ---------- Started Negotiation Cycle ----------
2/12 20:28:20 Phase 1:  Obtaining ads from collector ...
2/12 20:28:20   Getting all public ads ...
2/12 20:28:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:28:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:28:20   Sorting 8 ads ...
2/12 20:28:20   Getting startd private ads ...
2/12 20:28:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:28:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:28:20 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/12 20:28:20 Couldn't fetch ads: communication error
2/12 20:28:20 Aborting negotiation cycle
2/12 20:32:51 DaemonCore: in SendAliveToParent()
2/12 20:32:51 DaemonCore: attempting to connect to '<172.16.16.42:32791>'
2/12 20:32:51 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:32:51 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:33:20 ---------- Started Negotiation Cycle ----------
2/12 20:33:20 Phase 1:  Obtaining ads from collector ...
2/12 20:33:20   Getting all public ads ...
2/12 20:33:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:33:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:33:20   Sorting 8 ads ...
2/12 20:33:20   Getting startd private ads ...
2/12 20:33:20 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/12 20:33:20 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
2/12 20:33:20 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/12 20:33:20 Couldn't fetch ads: communication error
2/12 20:33:20 Aborting negotiation cycle
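
Every negotiation cycle in the excerpt above is aborted right after "Getting startd private ads". On Linux, errno 104 is `ECONNRESET` (connection reset by peer). A rough Python sketch for tallying this from a longer log is given here; `summarize_cycles` is a hypothetical helper, and the sample is an abbreviated copy of two cycles from the log above.

```python
import re

CYCLE_START = "Started Negotiation Cycle"
CYCLE_ABORT = "Aborting negotiation cycle"

def summarize_cycles(log_text):
    """Count started and aborted cycles and collect any errno values seen."""
    started = 0
    aborted = 0
    errnos = []
    for line in log_text.splitlines():
        if CYCLE_START in line:
            started += 1
        elif CYCLE_ABORT in line:
            aborted += 1
        m = re.search(r"errno = (\d+)", line)
        if m:
            errnos.append(int(m.group(1)))
    return started, aborted, errnos

# Abbreviated copy of two cycles from the negotiator log in this post.
sample = """\
2/12 20:18:20 ---------- Started Negotiation Cycle ----------
2/12 20:18:20   Getting startd private ads ...
2/12 20:18:20 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/12 20:18:20 Couldn't fetch ads: communication error
2/12 20:18:20 Aborting negotiation cycle
2/12 20:23:20 ---------- Started Negotiation Cycle ----------
2/12 20:23:20   Getting startd private ads ...
2/12 20:23:20 condor_read(): Socket closed when trying to read buffer
2/12 20:23:20 Couldn't fetch ads: communication error
2/12 20:23:20 Aborting negotiation cycle
"""

if __name__ == "__main__":
    started, aborted, errnos = summarize_cycles(sample)
    print(f"started={started} aborted={aborted} errnos={errnos}")
```

If the started and aborted counts are equal, as they are here, the negotiator never got far enough in any cycle to match jobs.
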


##startd log

2/12 21:35:27 23597600 kbytes available for "/home/condor/execute"
2/12 21:35:27 Looking up RESERVED_DISK parameter
2/12 21:35:27 Reserving 5120 kbytes for file system
2/12 21:35:27 Disk space: 23592480
2/12 21:35:27 Error on stat(/dev/:0,0xfeffc370), errno = 2(No such file or directory)
2/12 21:35:27 Error on stat(/dev/mouse,0xfeffc530), errno = 2(No such file or directory)
2/12 21:35:27 Mouse IRQ: 12
2/12 21:35:27 Add 83825 mouse interrupts.  Total: 83825
2/12 21:35:27 Job wants old RSC/Ckpt starter, skipping /home/condor/condor/sbin/condor_starter
2/12 21:35:27 Job wants old RSC/Ckpt starter, skipping /home/condor/condor/sbin/condor_starter.pvm
2/12 21:35:27 Remote job ID is 74.552
2/12 21:35:27 exec_starter( srinivas.cse.com, 11, 12 ) : pid 4859
2/12 21:35:27 execl(/home/condor/condor/sbin/condor_starter.std, "condor_starter", srinivas.cse.com, 0)
2/12 21:35:27 Got RemoteUser (condor@xxxxxxxxxxxxxxxx) from request classad
2/12 21:35:27 Got universe "STANDARD" (1) from request classad
2/12 21:35:27 State change: claim-activation protocol successful
2/12 21:35:27 Changing activity: Idle -> Busy
2/12 21:35:27 DaemonCore: Command received via TCP from host <172.16.16.33:46285>
2/12 21:35:27 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/12 21:35:27 Called deactivate_claim_forcibly()
2/12 21:35:27 In Starter::kill() with pid 4859, sig 3 (SIGQUIT)
2/12 21:35:27 DaemonCore: No more children processes to reap.
2/12 21:35:27 Starter pid 4859 exited with status 0
2/12 21:35:27 Canceled hardkill-starter timer (1236)
2/12 21:35:27 ProcAPI::buildFamily failed: parent 4859 not found on system.
2/12 21:35:27 ProcAPI: pid 4859 does not exist.
2/12 21:35:27 State change: starter exited
2/12 21:35:27 Changing activity: Busy -> Idle
2/12 21:35:27 DaemonCore: Command received via TCP from host <172.16.16.33:46294>
2/12 21:35:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
2/12 21:35:27 Got activate_claim request from shadow (<172.16.16.33:46294>)
2/12 21:35:27 Read request ad and starter from shadow.
2/12 21:35:27 Swap space: 0
2/12 21:35:27 23598572 kbytes available for "/home/condor/execute"
2/12 21:35:27 Looking up RESERVED_DISK parameter
2/12 21:35:27 Reserving 5120 kbytes for file system
2/12 21:35:27 Disk space: 23593452

--
Thanks and regards,
Srinivas.Malyala