
Re: [HTCondor-users] Need help for job disconnection and reconnection failure! Argent...



This is my ShadowLog:
05/15/13 00:12:43 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/15/13 00:12:43 ** /opt/condor/sbin/condor_shadow
05/15/13 00:12:43 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/15/13 00:12:43 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/15/13 00:12:43 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:12:43 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:12:43 ** PID = 3650
05/15/13 00:12:43 ** Log last touched time unavailable (No such file or directory)
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 Using config source: /opt/condor/etc/condor_config
05/15/13 00:12:43 Using local config sources:
05/15/13 00:12:43    /opt/condor/etc/condor_config.local
05/15/13 00:12:43 DaemonCore: command socket at <192.168.1.100:40219?noUDP>
05/15/13 00:12:43 DaemonCore: private command socket at <192.168.1.100:40219>
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 Initializing a VANILLA shadow for job 3.0
05/15/13 00:12:43 (3.0) (3650): Request to run on slot1@xxxxxxxxxxxxxxx <10.255.255.254:44453> was ACCEPTED
05/15/13 00:12:43 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/15/13 00:12:43 ** /opt/condor/sbin/condor_shadow
05/15/13 00:12:43 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/15/13 00:12:43 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/15/13 00:12:43 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:12:43 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:12:43 ** PID = 3651
05/15/13 00:12:43 ** Log last touched 5/15 00:12:43
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 Using config source: /opt/condor/etc/condor_config
05/15/13 00:12:43 Using local config sources:
05/15/13 00:12:43    /opt/condor/etc/condor_config.local
05/15/13 00:12:43 DaemonCore: command socket at <192.168.1.100:40949?noUDP>
05/15/13 00:12:43 DaemonCore: private command socket at <192.168.1.100:40949>
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 Initializing a VANILLA shadow for job 3.1
05/15/13 00:12:43 (3.1) (3651): Request to run on slot2@xxxxxxxxxxxxxxx <10.255.255.254:44453> was ACCEPTED
05/15/13 00:12:43 (3.0) (3650): Can no longer talk to condor_starter <10.255.255.254:44453>
05/15/13 00:12:43 (3.0) (3650): Trying to reconnect to disconnected job
05/15/13 00:12:43 (3.0) (3650): LastJobLeaseRenewal: 1368547963 Wed May 15 00:12:43 2013
05/15/13 00:12:43 (3.0) (3650): JobLeaseDuration: 1200 seconds
05/15/13 00:12:43 (3.0) (3650): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.0) (3650): Attempting to locate disconnected starter
05/15/13 00:12:43 (3.0) (3650): Found starter: <10.255.255.254:45037>
05/15/13 00:12:43 (3.0) (3650): Attempting to reconnect to starter <10.255.255.254:45037>
05/15/13 00:12:43 (3.0) (3650): attempt to connect to <10.255.255.254:45037> failed: Connection refused (connect errno = 111).
05/15/13 00:12:43 (3.0) (3650): Attempt to reconnect failed: Failed to connect to starter <10.255.255.254:45037>
05/15/13 00:12:43 (3.0) (3650): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.0) (3650): Scheduling another attempt to reconnect in 8 seconds
05/15/13 00:12:43 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/15/13 00:12:43 ** /opt/condor/sbin/condor_shadow
05/15/13 00:12:43 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/15/13 00:12:43 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/15/13 00:12:43 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:12:43 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:12:43 ** PID = 3654
05/15/13 00:12:43 ** Log last touched 5/15 00:12:43
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 Using config source: /opt/condor/etc/condor_config
05/15/13 00:12:43 Using local config sources:
05/15/13 00:12:43    /opt/condor/etc/condor_config.local
05/15/13 00:12:43 DaemonCore: command socket at <192.168.1.100:41168?noUDP>
05/15/13 00:12:43 DaemonCore: private command socket at <192.168.1.100:41168>
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 Initializing a VANILLA shadow for job 3.2
05/15/13 00:12:43 (3.1) (3651): Can no longer talk to condor_starter <10.255.255.254:44453>
05/15/13 00:12:43 (3.1) (3651): Trying to reconnect to disconnected job
05/15/13 00:12:43 (3.1) (3651): LastJobLeaseRenewal: 1368547963 Wed May 15 00:12:43 2013
05/15/13 00:12:43 (3.1) (3651): JobLeaseDuration: 1200 seconds
05/15/13 00:12:43 (3.1) (3651): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.1) (3651): Attempting to locate disconnected starter
05/15/13 00:12:43 (3.2) (3654): Request to run on slot3@xxxxxxxxxxxxxxx <10.255.255.254:44453> was ACCEPTED
05/15/13 00:12:43 (3.1) (3651): Found starter: <10.255.255.254:48322>
05/15/13 00:12:43 (3.1) (3651): Attempting to reconnect to starter <10.255.255.254:48322>
05/15/13 00:12:43 (3.1) (3651): attempt to connect to <10.255.255.254:48322> failed: Connection refused (connect errno = 111).
05/15/13 00:12:43 (3.1) (3651): Attempt to reconnect failed: Failed to connect to starter <10.255.255.254:48322>
05/15/13 00:12:43 (3.1) (3651): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.1) (3651): Scheduling another attempt to reconnect in 8 seconds
05/15/13 00:12:43 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/15/13 00:12:43 ** /opt/condor/sbin/condor_shadow
05/15/13 00:12:43 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/15/13 00:12:43 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/15/13 00:12:43 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:12:43 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:12:43 ** PID = 3657
05/15/13 00:12:43 ** Log last touched 5/15 00:12:43
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 Using config source: /opt/condor/etc/condor_config
05/15/13 00:12:43 Using local config sources:
05/15/13 00:12:43    /opt/condor/etc/condor_config.local
05/15/13 00:12:43 DaemonCore: command socket at <192.168.1.100:41971?noUDP>
05/15/13 00:12:43 DaemonCore: private command socket at <192.168.1.100:41971>
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 Initializing a VANILLA shadow for job 3.3
05/15/13 00:12:43 (3.2) (3654): Can no longer talk to condor_starter <10.255.255.254:44453>
05/15/13 00:12:43 (3.2) (3654): Trying to reconnect to disconnected job
05/15/13 00:12:43 (3.2) (3654): LastJobLeaseRenewal: 1368547963 Wed May 15 00:12:43 2013
05/15/13 00:12:43 (3.2) (3654): JobLeaseDuration: 1200 seconds
05/15/13 00:12:43 (3.2) (3654): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.2) (3654): Attempting to locate disconnected starter
05/15/13 00:12:43 (3.3) (3657): Request to run on slot4@xxxxxxxxxxxxxxx <10.255.255.254:44453> was ACCEPTED
05/15/13 00:12:43 (3.2) (3654): Found starter: <10.255.255.254:41314>
05/15/13 00:12:43 (3.2) (3654): Attempting to reconnect to starter <10.255.255.254:41314>
05/15/13 00:12:43 (3.2) (3654): attempt to connect to <10.255.255.254:41314> failed: Connection refused (connect errno = 111).
05/15/13 00:12:43 (3.2) (3654): Attempt to reconnect failed: Failed to connect to starter <10.255.255.254:41314>
05/15/13 00:12:43 (3.2) (3654): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.2) (3654): Scheduling another attempt to reconnect in 8 seconds
05/15/13 00:12:43 (3.3) (3657): Can no longer talk to condor_starter <10.255.255.254:44453>
05/15/13 00:12:43 (3.3) (3657): Trying to reconnect to disconnected job
05/15/13 00:12:43 (3.3) (3657): LastJobLeaseRenewal: 1368547963 Wed May 15 00:12:43 2013
05/15/13 00:12:43 (3.3) (3657): JobLeaseDuration: 1200 seconds
05/15/13 00:12:43 (3.3) (3657): JobLeaseDuration remaining: 1200
05/15/13 00:12:43 (3.3) (3657): Attempting to locate disconnected starter
05/15/13 00:12:43 (3.3) (3657): locateStarter(): ClaimId (<10.255.255.254:44453>#1368547847#4#3f82cc534a2d381f56162e44d7889a0c7482bdc5) and GlobalJobId ( imagegrid.otitan.com#3.3#1368545597 ) not found
05/15/13 00:12:43 (3.3) (3657): Reconnect FAILED: Job not found at execution machine
05/15/13 00:12:43 (3.3) (3657): **** condor_shadow (condor_SHADOW) pid 3657 EXITING WITH STATUS 107
05/15/13 00:12:51 (3.0) (3650): Attempting to locate disconnected starter
05/15/13 00:12:51 (3.0) (3650): locateStarter(): ClaimId (<10.255.255.254:44453>#1368547847#1#9db520717396f32abb3af6508584d0d6acaffce8) and GlobalJobId ( imagegrid.otitan.com#3.0#1368545597 ) not found
05/15/13 00:12:51 (3.0) (3650): Reconnect FAILED: Job not found at execution machine
05/15/13 00:12:51 (3.0) (3650): **** condor_shadow (condor_SHADOW) pid 3650 EXITING WITH STATUS 107
05/15/13 00:12:51 (3.1) (3651): Attempting to locate disconnected starter
05/15/13 00:12:51 (3.1) (3651): locateStarter(): ClaimId (<10.255.255.254:44453>#1368547847#2#0663dd211fe1bb98a86788c49b4782b74d187c02) and GlobalJobId ( imagegrid.otitan.com#3.1#1368545597 ) not found
05/15/13 00:12:51 (3.1) (3651): Reconnect FAILED: Job not found at execution machine
05/15/13 00:12:51 (3.1) (3651): **** condor_shadow (condor_SHADOW) pid 3651 EXITING WITH STATUS 107
05/15/13 00:12:51 (3.2) (3654): Attempting to locate disconnected starter
05/15/13 00:12:51 (3.2) (3654): locateStarter(): ClaimId (<10.255.255.254:44453>#1368547847#3#5e07c4e9c596f6b520b92baadab1f2b1de3253fb) and GlobalJobId ( imagegrid.otitan.com#3.2#1368545597 ) not found
05/15/13 00:12:51 (3.2) (3654): Reconnect FAILED: Job not found at execution machine
05/15/13 00:12:51 (3.2) (3654): **** condor_shadow (condor_SHADOW) pid 3654 EXITING WITH STATUS 107
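The ShadowLog above is long, but the outcome for each job is on its "EXITING WITH STATUS" line. A minimal sketch, assuming Python is available on the submit host, that pulls those lines out; the function name `shadow_exits` and the regular expression are illustrative only and are not part of HTCondor:

```python
import re

# Matches shadow exit lines in the format shown in the ShadowLog above, e.g.
# "... (3.3) (3657): **** condor_shadow (condor_SHADOW) pid 3657 EXITING WITH STATUS 107"
EXIT_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): "
    r"\*\*\*\* condor_shadow .* EXITING WITH STATUS (?P<status>\d+)"
)


def shadow_exits(lines):
    """Return a dict mapping job id (e.g. "3.0") to the shadow's exit status."""
    exits = {}
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exits[match.group("job")] = int(match.group("status"))
    return exits


# Three lines copied from the ShadowLog above; only the first two are exit lines.
sample = [
    "05/15/13 00:12:43 (3.3) (3657): **** condor_shadow (condor_SHADOW) pid 3657 EXITING WITH STATUS 107",
    "05/15/13 00:12:51 (3.0) (3650): **** condor_shadow (condor_SHADOW) pid 3650 EXITING WITH STATUS 107",
    "05/15/13 00:12:51 (3.0) (3650): Reconnect FAILED: Job not found at execution machine",
]
print(shadow_exits(sample))
```

To run it over the real file, pass an open file object instead of `sample`; the path of the ShadowLog depends on the local installation.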
This is my SchedLog:
05/15/13 00:10:43 (pid:3588) Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:10:43 (pid:3588) Setting maximum accepts per cycle 8.
05/15/13 00:10:43 (pid:3588) ******************************************************
05/15/13 00:10:43 (pid:3588) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/15/13 00:10:43 (pid:3588) ** /opt/condor/sbin/condor_schedd
05/15/13 00:10:43 (pid:3588) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/15/13 00:10:43 (pid:3588) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/15/13 00:10:43 (pid:3588) ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:10:43 (pid:3588) ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:10:43 (pid:3588) ** PID = 3588
05/15/13 00:10:43 (pid:3588) ** Log last touched time unavailable (No such file or directory)
05/15/13 00:10:43 (pid:3588) ******************************************************
05/15/13 00:10:43 (pid:3588) Using config source: /opt/condor/etc/condor_config
05/15/13 00:10:43 (pid:3588) Using local config sources:
05/15/13 00:10:43 (pid:3588)    /opt/condor/etc/condor_config.local
05/15/13 00:10:43 (pid:3588) DaemonCore: command socket at <192.168.1.100:48906>
05/15/13 00:10:43 (pid:3588) DaemonCore: private command socket at <192.168.1.100:48906>
05/15/13 00:10:43 (pid:3588) Setting maximum accepts per cycle 8.
05/15/13 00:10:43 (pid:3588) History file rotation is enabled.
05/15/13 00:10:43 (pid:3588)   Maximum history file size is: 20971520 bytes
05/15/13 00:10:43 (pid:3588)   Number of rotated history files is: 2
05/15/13 00:10:48 (pid:3588) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
05/15/13 00:10:48 (pid:3588) Sent ad to central manager for kyle@local
05/15/13 00:10:48 (pid:3588) Sent ad to 1 collectors for kyle@local
05/15/13 00:10:57 (pid:3588) Number of Active Workers 1
05/15/13 00:10:57 (pid:3603) Number of Active Workers 0
05/15/13 00:12:43 (pid:3588) Using negotiation protocol: NEGOTIATE
05/15/13 00:12:43 (pid:3588) Negotiating for owner: kyle@local
05/15/13 00:12:43 (pid:3588) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,None,RemoteGroup,SubmitterGroup
05/15/13 00:12:43 (pid:3588) Checking consistency running and runnable jobs
05/15/13 00:12:43 (pid:3588) Tables are consistent
05/15/13 00:12:43 (pid:3588) Rebuilt prioritized runnable job list in 0.001s.
05/15/13 00:12:43 (pid:3588) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:43 (pid:3588) Starting add_shadow_birthdate(3.0)
05/15/13 00:12:43 (pid:3588) Started shadow for job 3.0 on slot1@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, (shadow pid = 3650)
05/15/13 00:12:43 (pid:3588) Completed REQUEST_CLAIM to startd slot2@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:43 (pid:3588) Starting add_shadow_birthdate(3.1)
05/15/13 00:12:43 (pid:3588) Started shadow for job 3.1 on slot2@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, (shadow pid = 3651)
05/15/13 00:12:43 (pid:3588) Finished negotiating for kyle in local pool: 4 matched, 0 rejected
05/15/13 00:12:43 (pid:3588) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
05/15/13 00:12:43 (pid:3588) Sent ad to central manager for kyle@local
05/15/13 00:12:43 (pid:3588) Sent ad to 1 collectors for kyle@local
05/15/13 00:12:43 (pid:3588) Completed REQUEST_CLAIM to startd slot3@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:43 (pid:3588) Starting add_shadow_birthdate(3.2)
05/15/13 00:12:43 (pid:3588) Started shadow for job 3.2 on slot3@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, (shadow pid = 3654)
05/15/13 00:12:43 (pid:3588) Completed REQUEST_CLAIM to startd slot4@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:43 (pid:3588) Starting add_shadow_birthdate(3.3)
05/15/13 00:12:43 (pid:3588) Started shadow for job 3.3 on slot4@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, (shadow pid = 3657)
05/15/13 00:12:43 (pid:3588) Shadow pid 3657 for job 3.3 exited with status 107
05/15/13 00:12:43 (pid:3588) Completed RELEASE_CLAIM to startd slot4@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:43 (pid:3588) Match record (slot4@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, 3.3) deleted
05/15/13 00:12:51 (pid:3588) Shadow pid 3650 for job 3.0 exited with status 107
05/15/13 00:12:51 (pid:3588) Completed RELEASE_CLAIM to startd slot1@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:51 (pid:3588) Match record (slot1@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, 3.0) deleted
05/15/13 00:12:51 (pid:3588) Shadow pid 3651 for job 3.1 exited with status 107
05/15/13 00:12:51 (pid:3588) Completed RELEASE_CLAIM to startd slot2@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:51 (pid:3588) Match record (slot2@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, 3.1) deleted
05/15/13 00:12:51 (pid:3588) Shadow pid 3654 for job 3.2 exited with status 107
05/15/13 00:12:51 (pid:3588) Completed RELEASE_CLAIM to startd slot3@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle
05/15/13 00:12:51 (pid:3588) Match record (slot3@xxxxxxxxxxxxxxx <10.255.255.254:44453> for kyle, 3.2) deleted
 
This is StarterLog.slot1:
05/15/13 00:12:43 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 ** condor_starter (CONDOR_STARTER) STARTING UP
05/15/13 00:12:43 ** /opt/condor/sbin/condor_starter
05/15/13 00:12:43 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
05/15/13 00:12:43 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
05/15/13 00:12:43 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:12:43 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:12:43 ** PID = 3296
05/15/13 00:12:43 ** Log last touched time unavailable (No such file or directory)
05/15/13 00:12:43 ******************************************************
05/15/13 00:12:43 Using config source: /opt/condor/etc/condor_config
05/15/13 00:12:43 Using local config sources:
05/15/13 00:12:43    /opt/condor/etc/condor_config.local
05/15/13 00:12:43 DaemonCore: command socket at <10.255.255.254:45037>
05/15/13 00:12:43 DaemonCore: private command socket at <10.255.255.254:45037>
05/15/13 00:12:43 Setting maximum accepts per cycle 8.
05/15/13 00:12:43 Communicating with shadow <192.168.1.100:40219?noUDP>
05/15/13 00:12:43 Submitting machine is "imagegrid.local"
05/15/13 00:12:43 setting the orig job name in starter
05/15/13 00:12:43 setting the orig job iwd in starter
05/15/13 00:12:43 passwd_cache::cache_uid(): getpwnam("kyle") failed: user not found
05/15/13 00:12:43 ERROR: Uid for "kyle" not found in passwd file and SOFT_UID_DOMAIN is False
05/15/13 00:12:43 ERROR: Failed to determine what user to run this job as, aborting
05/15/13 00:12:43 Failed to initialize JobInfoCommunicator, aborting
05/15/13 00:12:43 Unable to start job.
05/15/13 00:12:43 **** condor_starter (condor_STARTER) pid 3296 EXITING WITH STATUS 1
05/15/13 00:14:44 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/15/13 00:14:44 Setting maximum accepts per cycle 8.
05/15/13 00:14:44 ******************************************************
05/15/13 00:14:44 ** condor_starter (CONDOR_STARTER) STARTING UP
05/15/13 00:14:44 ** /opt/condor/sbin/condor_starter
05/15/13 00:14:44 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
05/15/13 00:14:44 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
05/15/13 00:14:44 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
05/15/13 00:14:44 ** $CondorPlatform: x86_64_rhap_6.3 $
05/15/13 00:14:44 ** PID = 3406
05/15/13 00:14:44 ** Log last touched 5/15 00:12:43
05/15/13 00:14:44 ******************************************************
05/15/13 00:14:44 Using config source: /opt/condor/etc/condor_config
05/15/13 00:14:44 Using local config sources:
05/15/13 00:14:44    /opt/condor/etc/condor_config.local
05/15/13 00:14:44 DaemonCore: command socket at <10.255.255.254:42117>
05/15/13 00:14:44 DaemonCore: private command socket at <10.255.255.254:42117>
05/15/13 00:14:44 Setting maximum accepts per cycle 8.
05/15/13 00:14:44 Communicating with shadow <192.168.1.100:40146?noUDP>
05/15/13 00:14:44 Submitting machine is "imagegrid.local"
05/15/13 00:14:44 setting the orig job name in starter
05/15/13 00:14:44 setting the orig job iwd in starter
05/15/13 00:14:44 passwd_cache::cache_uid(): getpwnam("kyle") failed: user not found
05/15/13 00:14:44 ERROR: Uid for "kyle" not found in passwd file and SOFT_UID_DOMAIN is False
05/15/13 00:14:44 ERROR: Failed to determine what user to run this job as, aborting
05/15/13 00:14:44 Failed to initialize JobInfoCommunicator, aborting
05/15/13 00:14:44 Unable to start job.
05/15/13 00:14:44 **** condor_starter (condor_STARTER) pid 3406 EXITING WITH STATUS 1
Should I create the same user on all compute nodes?
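The call that fails in the StarterLog is getpwnam("kyle"), so the same lookup can be repeated by hand on a compute node. A minimal sketch, assuming Python is installed there; the helper name `user_exists` is made up for illustration:

```python
import pwd


def user_exists(name):
    """Return True if `name` resolves through the local passwd database."""
    try:
        pwd.getpwnam(name)  # same lookup the starter reports as failing
    except KeyError:
        return False
    return True


if __name__ == "__main__":
    # "kyle" is the job owner named in the StarterLog above.
    print("kyle resolvable:", user_exists("kyle"))
```

If this prints False on a compute node, the account is not visible there, which would match the "user not found" error in the StarterLog.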


2013/5/14 钱晓明 <kyleqian@xxxxxxxxx>
I submitted jobs to my cluster, but none of them can run because they all get disconnected. Here is my Condor version (I am using Rocks to manage my cluster):
[kyle@imagegrid ~]$ condor_version
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
$CondorPlatform: x86_64_rhap_6.3 $
[kyle@imagegrid ~]$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:05
slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:06
slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:07
slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:08
slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:04
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:05
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.020   499  0+00:25:08
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:04
slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:14:41
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX    32     0       0        32       0          0        0
               Total    32     0       0        32       0          0        0
[kyle@imagegrid ~]$ condor_q
-- Submitter: imagegrid.otitan.com : <192.168.1.100:40073> : imagegrid.otitan.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
   2.0   kyle            5/14 23:24   0+00:00:00 I  0   0.0  showpwd.sh       
   2.1   kyle            5/14 23:24   0+00:00:08 I  0   0.0  showpwd.sh       
   2.2   kyle            5/14 23:24   0+00:00:17 I  0   0.0  showpwd.sh       
   2.3   kyle            5/14 23:24   0+00:00:01 I  0   0.0  showpwd.sh       
4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
 
The contents of my job log are:
[kyle@imagegrid ~]$ cat showpwd.log
000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.001.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.002.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.003.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:24:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.001.000) 05/14 23:24:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.002.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:24:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:25:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.000.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:26:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.002.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:26:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.001.000) 05/14 23:27:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:27:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
 
After submission, I can see that some slots become Claimed, but after a few seconds they go back to Unclaimed.
Here is my local configuration (generated by Rocks):
 
ALLOW_WRITE = $(HOSTALLOW_WRITE)
AMAZON_GAHP = $(SBIN)/amazon_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
COLLECTOR_NAME = Collector at imagegrid.otitan.com
COLLECTOR_SOCKET_CACHE_SIZE = 1000
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = imagegrid.otitan.com
CONDOR_IDS = 407.500
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
CONTINUE = True
DAEMON_LIST = MASTER, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = otitan.com
HIGHPORT = 50000
HOSTALLOW_WRITE = imagegrid.otitan.com, *.local, *.local
JAVA = /usr/bin/java
KILL = False
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
LOWPORT = 40000
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 10.255.255.254
PREEMPT = False
RANK = None
RELEASE_DIR = /opt/condor
SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
START = True
STARTD_EXPRS = $(STARTD_EXPRS)
SUSPEND = False
UID_DOMAIN = local
UPDATE_COLLECTOR_WITH_TCP = True
WANT_SUSPEND = False
WANT_VACATE = False
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
Can someone help me? Thanks!