[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] submit jobs complete but then go back into queue




I have a simple submit job that executes on a client via Condor, but then returns to the Condor pool. I think the problem is related to the Condor daemon owners, the submit owner and the ad owner. Our daemons run under a service account because we are restricting the configuration files on a shared drive. I am not sure this is the best approach yet, but we are still trying to meet all of our security concerns.

We are using a small pool for testing, SSL, NTSSPI, All condor clients and CM is Windows XP as well (SP3). SSL is used for client authentication. If I run the Condor job under the condor service account (the same account that the condor daemons run as, the jobs exit the Queue properly. If I run the Condor job as my user account, the job executes, but then returns to the Queue as 'I', and eventually it will run again and again and so forth.

I tried to specify the owner class add in the submit file but when I do this the job sits in idle and never runs. Does anyone have any ideas as to what I might be doing wrong or have any recommendations as to how we should handle running the daemons under a different account. What does not make sense to me is that the state changes from idle to run after the job is submitted, but then can not exit the queue after it completes when the class ad (odonnellm) and submit owner (igskabcb-condoradmin) are different.

PS I do not see anything else out of the norm in the other log files (submit or execute machines).


Thank you for your help/suggestions,
Mike

Below are snapshots of the log files indicating the errors when the state of the job is trying to be changed.

Shadowlog (submit machine):
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=595,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): SECMAN: resume, other side is $CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $, NOT reauthenticating.
05/10 15:35:35 (248.0) (2088): SECMAN: about to enable message authenticator.
05/10 15:35:35 (248.0) (2088): SECMAN: successfully enabled message authenticator!
05/10 15:35:35 (248.0) (2088): SECMAN: about to enable encryption.
05/10 15:35:35 (248.0) (2088): SECMAN: successfully enabled encryption!
05/10 15:35:35 (248.0) (2088): SECMAN: startCommand succeeded.
05/10 15:35:35 (248.0) (2088): Authorizing server '*/IP'.
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=76,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(NumJobStarts, 1)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=92,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(LastJobLeaseRenewal, 1273527179)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=83,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(RemoteSysCpu, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=85,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(RemoteUserCpu, 22.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=77,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(ImageSize, 69116)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(ExitBySignal, FALSE)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=72,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(ExitCode, 0)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=85,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(TerminationPending, TRUE)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(CommittedTime, 1502)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=80,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(BytesSent, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=81,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=16,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): updateExprTree: Failed SetAttribute(BytesRecvd, 0.000000)
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=29,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=21,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_read(fd=1672 schedd at <IP:4905>,,size=8,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): condor_read(): fd=1672
05/10 15:35:35 (248.0) (2088): condor_read(): select returned 1
05/10 15:35:35 (248.0) (2088): condor_write(fd=1672 schedd at <IP:4905>,,size=29,timeout=300,flags=0)
05/10 15:35:35 (248.0) (2088): CLOSE <IP:2912> fd=1672
05/10 15:35:35 (248.0) (2088): Failed to perform final update to job queue!
05/10 15:35:35 (248.0) (2088): Maximum number of job cleanup retry attempts (SHADOW_MAX_JOB_CLEANUP_RETRIES=5) reached; Forcing job requeue!
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY: deleted: 00D18648
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY: deleted: 00D33CD0
05/10 15:35:35 (248.0) (2088): KEYCACHEENTRY: deleted: 00D217A8
05/10 15:35:35 (248.0) (2088): KEYCACHE: deleted: 00B7B6F0
05/10 15:35:35 (248.0) (2088): CLOSE <IP:2880> fd=1716
05/10 15:35:35 (248.0) (2088): CLOSE <127.0.0.1:2881> fd=1248
05/10 15:35:35 (248.0) (2088): CLOSE <127.0.0.1:2882> fd=1732
05/10 15:35:35 (248.0) (2088): **** condor_shadow (condor_SHADOW) pid 2088 EXITING WITH STATUS 107

SchedLog (Submit machine):
05/10 15:35:35 (pid:1844) ad owner: odonnellm, queue submit owner: IGSKBACB-condoradmin
05/10 15:35:35 (pid:1844) OwnerCheck(IGSKBACB-condoradmin) failed in SetAttribute for job 248.0
05/10 15:35:35 (pid:1844) condor_write(fd=1228 <IP:2912>,,size=37,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read(fd=1228 <IP:2912>,,size=21,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read(): fd=1228
05/10 15:35:35 (pid:1844) condor_read(): select returned 1
05/10 15:35:35 (pid:1844) condor_read(fd=1228 <IP:2912>,,size=71,timeout=20,flags=0)
05/10 15:35:35 (pid:1844) condor_read(): fd=1228
05/10 15:35:35 (pid:1844) condor_read(): select returned 1
...
05/10 16:10:08 (pid:1844) SECMAN: startCommand succeeded.
05/10 16:10:08 (pid:1844) Authorizing server '*/159.189.162.73'.
05/10 16:10:08 (pid:1844) SEND [1000] <IP:3265> <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) SEND [146] <IP:3265> <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) Completed RELEASE_CLAIM to startd at <159.189.162.73:1202>
05/10 16:10:08 (pid:1844) Match record (IGSKBACBLT214.gs.doi.net <IP2:1202>#1273511706#25#... for odonnellm@xxxxxxxxxx, 249.0) deleted
05/10 16:10:08 (pid:1844) IpVerify::FillHole: removed DAEMON-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) IpVerify::FillHole: removed WRITE-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) IpVerify::FillHole: open count at level READ for 159.189.162.73 now 1
05/10 16:10:08 (pid:1844) IpVerify::FillHole: removed READ-level opening for 159.189.162.73
05/10 16:10:08 (pid:1844) Deleting shadow rec for PID 4404, job (249.0)
05/10 16:10:08 (pid:1844) Entered check_zombie( 4404, 0x00DD5824, st=2 )
05/10 16:10:08 (pid:1844) Prioritized runnable job list will be rebuilt, because ClassAd attribute JobStatus=1 changed
05/10 16:10:09 (pid:1844) Marked job 249.0 as IDLE
05/10 16:10:09 (pid:1844) Exited check_zombie( 4404, 0x00DD5824 )
05/10 16:10:09 (pid:1844) Shadow does not have a match record, so did not remove it from the match
05/10 16:10:09 (pid:1844)
05/10 16:10:09 (pid:1844) ..................
05/10 16:10:09 (pid:1844) .. Shadow Recs (0/0)
05/10 16:10:09 (pid:1844) ..................