
[HTCondor-users] Job threads getting held.



Hi all,

I’m having some trouble getting HTCondor to work.

When a user submits a job, some of its processes get held, seemingly at random. For example, in a job with 254 processes, 54 get held.

Has this happened to anyone? Could someone help me?

The central manager and the nodes are all Windows Server 2008 R3 (x64).

Below are the logs.

Thanks!!

Alexandre


Here is the related output from SchedLog.log:

[…]
10/03/14 11:49:30 (pid:3420) Starting add_shadow_birthdate(6.237)
10/03/14 11:49:30 (pid:3420) Started shadow for job 6.237 on slot22@cluster09 <10.2.0.59:49163> for julieta, (shadow pid = 8692)
[…]
10/03/14 11:50:25 (pid:3420) Sent vacate command to <10.2.0.59:49163> for job 6.237
[…]
10/03/14 11:50:28 (pid:3420) Shadow pid 8692 for job 6.237 exited with status 112
10/03/14 11:50:28 (pid:3420) Putting job 6.237 on hold
[…]
10/03/14 17:04:37 (pid:3420) OwnerCheck(Administrator) failed in SetAttribute for job 6.237
[…]

And here is the related output from ShadowLog.log:

10/03/14 11:49:30 Initializing a VANILLA shadow for job 6.237
[…]
10/03/14 11:49:32 (6.237) (8692): Request to run on slot22@cluster09 <10.2.0.59:49163> was ACCEPTED
[…]
10/03/14 11:50:25 (6.237) (8692): perm: OpenProcessToken failed: 5
[…]
10/03/14 11:50:25 (6.237) (8692): perm::set_acls(C:\condor\spool\6\237\cluster6.proc237.subproc0): Unable to set file ACL(err=0).
[…]
10/03/14 11:50:25 (6.237) (8692): perm: OpenProcessToken failed: 5
[…]
10/03/14 11:50:25 (6.237) (8692): perm: SetNamedSecurityInfo(C:\condor\spool\6\237\cluster6.proc237.subproc0) failed (err=5)
[…]
10/03/14 11:50:25 (6.237) (8692): (6.237) Failed to chown C:\condor\spool\6\237\cluster6.proc237.subproc0 from to 8256944\8256608.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\fort.16, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 16 bytes of file transmission
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\fort.16: (errno 2) No such file or directory
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\GEOME, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\GEOME: (errno 2) No such file or directory
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\geome1, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 265438 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\geome1: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\h100238, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\h100238: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\i100238, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\i100238: (errno 2) No such file or directory
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\lado12, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 1061752 bytes of file transmission
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\lado12: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalp1, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 265438 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalp1: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalp2, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 265438 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalp2: (errno 2) No such file or directory
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalpm, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 265438 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\ovalpm: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\p100238, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\p100238: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\pontos, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\pontos: (errno 2) No such file or directory
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\RECEIVER, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 532350 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\RECEIVER: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\recfreq_amp, errno = 2: No such file or directory.
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\recfreq_amp: (errno 2) No such file or directory

10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\recfreq_ri, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\recfreq_ri: (errno 2) No such file or directory
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_horizontal, errno = 2: No such file or directory.
[…]
10/03/14 11:50:27 (6.237) (8692): get_file(): consumed 430828 bytes of file transmission
[…]
10/03/14 11:50:27 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_horizontal: (errno 2) No such file or directory
[…]
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_perpend, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 194176 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_perpend: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_vert1, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 162874 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\rec_vert1: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\res1_absorcao, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\res1_absorcao: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\res2_absorcao, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\res2_absorcao: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\RESULT, errno = 2: No such file or directory.
[…]
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 841921 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\RESULT: (errno 2) No such file or directory
[…]
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\s100238, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\s100238: (errno 2) No such file or directory
[…]
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\v100238, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 0 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\v100238: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\_condor_stderr, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 783 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\_condor_stderr: (errno 2) No such file or directory
10/03/14 11:50:28 (6.237) (8692): get_file(): Failed to open file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\_condor_stdout, errno = 2: No such file or directory.
10/03/14 11:50:28 (6.237) (8692): get_file(): consumed 4544 bytes of file transmission
10/03/14 11:50:28 (6.237) (8692): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
[…]
10/03/14 11:50:28 (6.237) (8692): Mock terminating job 6.237: exited_by_signal=FALSE, exit_code=8 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
[…]
10/03/14 11:50:28 (6.237) (8692): Job 6.237 going into Hold state (code 12,2): Error from slot22@cluster09: STARTER at 10.2.0.59 failed to send file(s) to <10.2.0.70:50914>; SHADOW at 10.2.0.70 failed to write to file C:\condor\spool\6\237\cluster6.proc237.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
[…]
10/03/14 11:50:28 (6.237) (8692): **** condor_shadow (condor_SHADOW) pid 8692 EXITING WITH STATUS 112
[…]
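
In case it helps anyone comparing held and non-held processes, here is a minimal Python sketch for condensing a ShadowLog excerpt like the one above into a per-job summary. The function name summarize_shadow_log and both regular expressions are my own assumptions, derived only from the log lines quoted in this message; they are not an official HTCondor log format or tool, and other versions may word these messages differently.

```python
import re
from collections import Counter

# Assumed patterns, modelled only on the ShadowLog lines quoted in this message.
OPEN_FAIL = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): get_file\(\): Failed to open file (\S+), errno = (\d+)"
)
HOLD = re.compile(r"Job (\d+\.\d+) going into Hold state \(code (\d+),(\d+)\)")


def summarize_shadow_log(text):
    """Return (failed, holds) for a ShadowLog excerpt.

    failed: Counter mapping (job id, errno) -> number of files that failed to open
    holds:  dict mapping job id -> (hold code, hold subcode)
    """
    failed = Counter()
    for job, _path, errno in OPEN_FAIL.findall(text):
        failed[(job, int(errno))] += 1
    holds = {job: (int(code), int(sub)) for job, code, sub in HOLD.findall(text)}
    return failed, holds


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize.py ShadowLog.log
    with open(sys.argv[1], errors="replace") as handle:
        failed, holds = summarize_shadow_log(handle.read())
    for (job, errno), count in sorted(failed.items()):
        print(f"job {job}: {count} file(s) failed to open (errno {errno})")
    for job, (code, sub) in sorted(holds.items()):
        print(f"job {job}: held with code {code}, subcode {sub}")
```

The patterns are matched against the whole text rather than line by line, so they still work if several log entries end up on one line.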