
[Condor-users] Error with executing simple job via Condor




I wrote a simple Python executable to submit with Condor. The jobs are submitted and their state changes to Run for a second or so, but then changes back to Idle. If I wait, they are resubmitted and change to the Run state again, but they never actually run. The log files for each queued job never have any content. According to the ShadowLog, I get errno = 10054, which on Windows is WSAECONNRESET (the connection was reset by the remote side). All of our machines run Windows XP, including the central manager. As you can tell from the log, we are using NTSSPI and SSL. When I run condor_status, everything looks fine with regard to cores/slots and claimed and unclaimed machines. I am not seeing any errors in the MasterLog, and as far as I can tell everything looks OK.

Does anyone have any ideas about what might be causing this? We first set up Condor without SSL and did not have any issues; now we are working on a more secure configuration, which is likely causing the problems. This might not be related, but we also had our CM routed through a 100MB switch, while our network is 1GB. The CM was not working there, and we still cannot see two machines on this 100MB router. However, once we moved the CM off the 100MB router, we were able to see all machines in our pool (we are currently testing and working out the configuration, and therefore have only about 6 machines in our pool).

Thank you,
Mike

When I run the following command I get:
condor_q -analyze 88
088.009:  Run analysis summary.  Of 10 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      6 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
        Last successful match: Wed Apr 28 07:39:24 2010
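
For anyone reading a similar summary, the counts can be checked mechanically. The following is only a minimal sketch: the function name `summarize` and the regular expressions are illustrative assumptions based on the output pasted above, not part of Condor, and other Condor versions may format the summary differently.

```python
import re

# Summary text copied from the condor_q -analyze output above.
ANALYZE_OUTPUT = """\
088.009:  Run analysis summary.  Of 10 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      6 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
        Last successful match: Wed Apr 28 07:39:24 2010
"""


def summarize(text):
    """Return (total_machines, {reason: count}) parsed from the summary text.

    Hypothetical helper: it assumes the layout shown above.
    """
    total = None
    buckets = {}
    for line in text.splitlines():
        head = re.search(r"Of (\d+) machines", line)
        if head:
            total = int(head.group(1))
            continue
        # Count rows look like "<spaces><number> <reason text>".
        row = re.match(r"\s*(\d+) (\S.*)$", line)
        if row:
            buckets[row.group(2).strip()] = int(row.group(1))
    return total, buckets


TOTAL, BUCKETS = summarize(ANALYZE_OUTPUT)

if __name__ == "__main__":
    print("machines:", TOTAL, "accounted for:", sum(BUCKETS.values()))
    for reason, count in BUCKETS.items():
        if count:
            print(count, reason)
```

In this output the buckets add up to the stated total, so every slot is either serving a better-priority user or rejecting the job "for unknown reasons"; the latter group is the one to investigate.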

The ShadowLog on the submit machine looks like this (note that I used search and replace to mask accounts, IP addresses, and other info, but it should still make sense):
Command = 60008
04/28 07:19:52 (88.6) (488): SECMAN: startCommand succeeded.
04/28 07:19:52 (88.6) (488): Authorizing server '*/IP.39'.
04/28 07:19:52 (88.6) (488): SEND [1000] <IP.39:3385> <IP.39:1851>
04/28 07:19:52 (88.6) (488): SEND [164] <IP.39:3385> <IP.39:1851>
04/28 07:19:52 (88.6) (488): DaemonCore: Leaving SendAliveToParent() - success
04/28 07:19:52 (88.6) (488): Return from Timer handler 5 (DaemonCore::SendAliveToParent)
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=8,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): entering FileTransfer::Init
04/28 07:19:52 (88.6) (488): entering FileTransfer::SimpleInit
04/28 07:19:52 (88.6) (488): Entering FileTransfer::InitDownloadFilenameRemaps
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=4096,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=3153,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=580,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=29,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read() failed: recv() returned -1, errno = 10054 , reading 21 bytes from startd slot4@ExecuteMachine.
04/28 07:19:52 (88.6) (488): IO: Failed to read packet header
04/28 07:19:52 (88.6) (488): Stream::get(int) failed to read padding
04/28 07:19:52 (88.6) (488): Can no longer talk to condor_starter <IP.15:1849>
04/28 07:19:52 (88.6) (488): CLOSE <IP.39:3378> fd=1692
04/28 07:19:52 (88.6) (488): WriteUserLog: not initialized @ writeEvent()
04/28 07:19:52 (88.6) (488): Trying to reconnect job USER@xxxxxxxxxxxxxxxxx#88.6#1272460515
04/28 07:19:52 (88.6) (488): Trying to reconnect to disconnected job
04/28 07:19:52 (88.6) (488): LastJobLeaseRenewal: 1272460792 Wed Apr 28 07:19:52 2010
04/28 07:19:53 (88.6) (488): JobLeaseDuration: 1200 seconds
04/28 07:19:53 (88.6) (488): Resource slot4@ExecuteMachine changing state from STARTUP to RECONNECT
04/28 07:19:53 (88.6) (488): JobLeaseDuration remaining: 1199
04/28 07:19:53 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:53 (88.6) (488): Calling Timer handler 8 (RemoteResource::attemptReconnect())
04/28 07:19:53 (88.6) (488): Attempting to locate disconnected starter
04/28 07:19:53 (88.6) (488): gjid is USER@xxxxxxxxxxxxxxxxx#88.6#1272460515 claimid is <IP.15:1849>#1272040933#1100#...
04/28 07:19:53 (88.6) (488): CONNECT src="" fd=1676 dst=<IP.15:1849>
04/28 07:19:53 (88.6) (488): SECMAN: command 1200 CA_CMD to startd slot4@ExecuteMachine from TCP port 3397 (blocking).
04/28 07:19:53 (88.6) (488): SECMAN: using session ClientExecuteMachine:5060:1272460787:460 for {<IP.15:1849>,<1200>}.
04/28 07:19:53 (88.6) (488): SECMAN: found cached session id ClientExecuteMachine:5060:1272460787:460 for {<IP.15:1849>,<1200>}.
MyType = ""
TargetType = ""
OutgoingNegotiation = "REQUIRED"
Subsystem = "SHADOW"
Command = 444
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
Enact = "YES"
AuthMethodsList = "NTSSPI,SSL"
AuthMethods = "NTSSPI"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "YES"
Encryption = "YES"
Integrity = "YES"
SessionDuration = "86400"
UseSession = "YES"
Sid = "ClientExecuteMachine:5060:1272460787:460"
MyRemoteUserName = "USER"
ValidCommands = "60000,60008,60017,403,404,427,435,436,441,442,443,444,446,466,503,504,505,506,60004,1200,1000,5,60007,60011,448,452,457,470"
TriedAuthentication = TRUE
04/28 07:19:53 (88.6) (488): SECMAN: Security Policy:
MyType = ""
TargetType = ""
OutgoingNegotiation = "REQUIRED"
Subsystem = "SHADOW"
Command = 444
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
Enact = "YES"
AuthMethodsList = "NTSSPI,SSL"
AuthMethods = "NTSSPI"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "YES"
Encryption = "YES"
Integrity = "YES"
SessionDuration = "86400"
UseSession = "YES"
Sid = "ClientExecuteMachine:5060:1272460787:460"
MyRemoteUserName = "USER"
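
A log of this size is easier to triage by pulling out only the lines that report a socket error. The sketch below assumes the line layout seen in the excerpt above (`MM/DD HH:MM:SS (job) (pid): message`); the function name `find_socket_errors` and the small Winsock lookup table are illustrative and not part of Condor.

```python
import re

# A few common Winsock error codes (values defined by Windows Sockets).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Assumed ShadowLog line layout: "MM/DD HH:MM:SS (cluster.proc) (pid): message"
LINE_RE = re.compile(
    r"^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): "
    r"(?P<msg>.*)$"
)
ERRNO_RE = re.compile(r"errno = (\d+)")


def find_socket_errors(lines):
    """Return (timestamp, job id, errno, description) for each line with an errno."""
    hits = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # ClassAd dump lines such as 'Command = 444' have no prefix
        err = ERRNO_RE.search(parsed.group("msg"))
        if not err:
            continue
        code = int(err.group(1))
        hits.append((parsed.group("stamp"), parsed.group("job"), code,
                     WINSOCK_ERRORS.get(code, "unknown")))
    return hits


# Lines copied from the log excerpt above.
SAMPLE = [
    "04/28 07:19:52 (88.6) (488): condor_read(): select returned 1",
    "04/28 07:19:52 (88.6) (488): condor_read() failed: recv() returned -1, "
    "errno = 10054 , reading 21 bytes from startd slot4@ExecuteMachine.",
    "04/28 07:19:52 (88.6) (488): IO: Failed to read packet header",
    'AuthMethods = "NTSSPI"',
]

if __name__ == "__main__":
    for hit in find_socket_errors(SAMPLE):
        print(hit)
```

To scan a real file, pass an open file object instead of `SAMPLE`; the path to the ShadowLog depends on the local Condor configuration.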

- - - - - - - - - - - - - - - - - - - - - - - - - -
Michael O'Donnell
ADP Software Specialist, ASRC Management Services
USGS Fort Collins Science Center
2150 Centre Ave., Bldg C
Fort Collins, CO 80526

Phone: 970.226.9407
Fax: 970.226.9230
Email: odonnellm@xxxxxxxx