I'm not sure whether this is the right check, but I looked at the drive that holds the spool directory on every cluster node, and each has at least 100 GB free.
It seems that when other users submit jobs, Condor puts mine on hold for no apparent reason (no matter how long they have been running) and throws the error I mentioned in my previous email.
I increased the job priority, hoping this wouldn't happen again, but it didn't work!
Below is the shadow log for a job that was run twice and held both times with a shadow exception. I can't understand what is going on there!
01/31/13 10:14:21 Locale: English_United States.1252
01/31/13 10:14:21 Setting maximum accepts per cycle 8.
01/31/13 10:14:21 ******************************************************
01/31/13 10:14:21 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/31/13 10:14:21 ** C:\condor\bin\condor_shadow.exe
01/31/13 10:14:21 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/31/13 10:14:21 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/31/13 10:14:21 ** $CondorVersion: 7.8.2 Aug 08 2012 $
01/31/13 10:14:21 ** $CondorPlatform: x86_64_winnt_6.1 $
01/31/13 10:14:21 ** PID = 6020
01/31/13 10:14:21 ** Log last touched 1/30 20:07:10
01/31/13 10:14:21 ******************************************************
01/31/13 10:14:21 Using config source: C:\condor\condor_config
01/31/13 10:14:21 Using local config sources:
01/31/13 10:14:21 C:\condor/condor_config.local
01/31/13 10:14:21 DaemonCore: command socket at <xxx.xx.xxx.113:49429>
01/31/13 10:14:22 DaemonCore: private command socket at <xxx.xx.xxx.113:49429>
01/31/13 10:14:22 Setting maximum accepts per cycle 8.
01/31/13 10:14:22 Initializing a VANILLA shadow for job 46993.0
01/31/13 10:14:22 (46993.0) (6020): Request to run on slot1@xxx-Sim4 <xxx.xx.xxx.58:49186> was ACCEPTED
01/31/13 10:14:23 (46993.0) (6020): my_popen: CreateProcess failed
01/31/13 10:14:23 (46993.0) (6020): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/31/13 10:14:23 (46993.0) (6020): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/31/13 12:43:10 (46993.0) (6020): perm: OpenProcessToken failed: 5
01/31/13 12:43:10 (46993.0) (6020): perm::set_acls(C:\condor/spool\6993\0\cluster46993.proc0.subproc0): Unable to set file ACL(err=6).
01/31/13 12:43:10 (46993.0) (6020): perm: OpenProcessToken failed: 5
01/31/13 12:43:10 (46993.0) (6020): perm: SetNamedSecurityInfo(C:\condor/spool\6993\0\cluster46993.proc0.subproc0) failed (err=5)
01/31/13 12:43:10 (46993.0) (6020): (46993.0) Failed to chown C:\condor/spool\6993\0\cluster46993.proc0.subproc0 from to 27474704\27476512.
01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc, errno = 2: No such file or directory.
01/31/13 12:43:10 (46993.0) (6020): get_file(): consumed 92370 bytes of file transmission
01/31/13 12:43:10 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc: (errno 2) No such file or directory
01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py, errno = 2: No such file or directory.
01/31/13 12:43:10 (46993.0) (6020): get_file(): consumed 444187 bytes of file transmission
01/31/13 12:43:10 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py: (errno 2) No such file or directory
01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.col, errno = 2: No such file or directory.
01/31/13 12:43:11 (46993.0) (6020): get_file(): consumed 7171022 bytes of file transmission
01/31/13 12:43:11 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.col: (errno 2) No such file or directory
01/31/13 12:43:11 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.nl, errno = 2: No such file or directory.
01/31/13 12:43:15 (46993.0) (6020): get_file(): consumed 51599812 bytes of file transmission
01/31/13 12:43:15 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.nl: (errno 2) No such file or directory
01/31/13 12:43:16 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.row, errno = 2: No such file or directory.
01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 13587542 bytes of file transmission
01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.row: (errno 2) No such file or directory
01/31/13 12:43:17 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr, errno = 2: No such file or directory.
01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 2 bytes of file transmission
01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr: (errno 2) No such file or directory
01/31/13 12:43:17 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout, errno = 2: No such file or directory.
01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 0 bytes of file transmission
01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
01/31/13 12:43:17 (46993.0) (6020): Mock terminating job 46993.0: exited_by_signal=FALSE, exit_code=-1073741510 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
01/31/13 12:43:17 (46993.0) (6020): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-Sim4: STARTER at xxx.xx.xxx.58 failed to send file(s) to <xxx.xx.xxx.113:49429>; SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
01/31/13 12:43:17 (46993.0) (6020): **** condor_shadow (condor_SHADOW) pid 6020 EXITING WITH STATUS 112
01/31/13 12:57:09 Locale: English_United States.1252
01/31/13 12:57:09 Setting maximum accepts per cycle 8.
01/31/13 12:57:09 ******************************************************
01/31/13 12:57:09 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/31/13 12:57:09 ** C:\condor\bin\condor_shadow.exe
01/31/13 12:57:09 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/31/13 12:57:09 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/31/13 12:57:09 ** $CondorVersion: 7.8.2 Aug 08 2012 $
01/31/13 12:57:09 ** $CondorPlatform: x86_64_winnt_6.1 $
01/31/13 12:57:09 ** PID = 10712
01/31/13 12:57:09 ** Log last touched 1/31 12:43:17
01/31/13 12:57:09 ******************************************************
01/31/13 12:57:09 Using config source: C:\condor\condor_config
01/31/13 12:57:09 Using local config sources:
01/31/13 12:57:09 C:\condor/condor_config.local
01/31/13 12:57:09 DaemonCore: command socket at <xxx.xx.xxx.113:51309>
01/31/13 12:57:09 DaemonCore: private command socket at <xxx.xx.xxx.113:51309>
01/31/13 12:57:09 Setting maximum accepts per cycle 8.
01/31/13 12:57:09 Initializing a VANILLA shadow for job 46993.0
01/31/13 12:57:09 (46993.0) (10712): Request to run on slot1@xxx-sim6 <xxx.xx.xxx.219:49191> was ACCEPTED
01/31/13 12:57:09 (46993.0) (10712): my_popen: CreateProcess failed
01/31/13 12:57:09 (46993.0) (10712): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/31/13 12:57:09 (46993.0) (10712): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/31/13 16:32:07 (46993.0) (10712): perm: OpenProcessToken failed: 5
01/31/13 16:32:07 (46993.0) (10712): perm::set_acls(C:\condor/spool\6993\0\cluster46993.proc0.subproc0): Unable to set file ACL(err=6).
01/31/13 16:32:07 (46993.0) (10712): perm: OpenProcessToken failed: 5
01/31/13 16:32:07 (46993.0) (10712): perm: SetNamedSecurityInfo(C:\condor/spool\6993\0\cluster46993.proc0.subproc0) failed (err=5)
01/31/13 16:32:07 (46993.0) (10712): (46993.0) Failed to chown C:\condor/spool\6993\0\cluster46993.proc0.subproc0 from to 26886624\26885984.
01/31/13 16:32:07 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc, errno = 2: No such file or directory.
01/31/13 16:32:08 (46993.0) (10712): get_file(): consumed 92370 bytes of file transmission
01/31/13 16:32:08 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc: (errno 2) No such file or directory
01/31/13 16:32:08 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py, errno = 2: No such file or directory.
01/31/13 16:32:08 (46993.0) (10712): get_file(): consumed 546767 bytes of file transmission
01/31/13 16:32:08 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py: (errno 2) No such file or directory
01/31/13 16:32:08 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.col, errno = 2: No such file or directory.
01/31/13 16:32:09 (46993.0) (10712): get_file(): consumed 7171022 bytes of file transmission
01/31/13 16:32:09 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.col: (errno 2) No such file or directory
01/31/13 16:32:09 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.nl, errno = 2: No such file or directory.
01/31/13 16:32:18 (46993.0) (10712): get_file(): consumed 51599812 bytes of file transmission
01/31/13 16:32:18 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.nl: (errno 2) No such file or directory
01/31/13 16:32:18 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.row, errno = 2: No such file or directory.
01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 13587542 bytes of file transmission
01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.row: (errno 2) No such file or directory
01/31/13 16:32:20 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr, errno = 2: No such file or directory.
01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 2 bytes of file transmission
01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr: (errno 2) No such file or directory
01/31/13 16:32:20 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout, errno = 2: No such file or directory.
01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 0 bytes of file transmission
01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
01/31/13 16:32:20 (46993.0) (10712): Mock terminating job 46993.0: exited_by_signal=FALSE, exit_code=-1073741510 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
01/31/13 16:32:20 (46993.0) (10712): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-sim6: STARTER at xxx.xx.xxx.219 failed to send file(s) to <xxx.xx.xxx.113:51309>; SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
01/31/13 16:32:20 (46993.0) (10712): **** condor_shadow (condor_SHADOW) pid 10712 EXITING WITH STATUS 112
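A log this long is easier to read once it is boiled down. The sketch below is only an illustration: the regular expressions and the `summarize` helper are assumptions based on the lines shown above, not part of HTCondor. It collects the hold codes, the directories that could not be opened, and the job exit codes rewritten as unsigned 32-bit hex, which is how Windows status codes are usually listed.

```python
import re

# Patterns are guesses derived from the shadow log lines quoted above.
HOLD_RE = re.compile(r"going into Hold state \(code (\d+),(\d+)\)")
OPEN_FAIL_RE = re.compile(r"get_file\(\): Failed to open file (.+?), errno = (\d+)")
EXIT_RE = re.compile(r"exit_code=(-?\d+)")


def summarize(log_text):
    """Return hold codes, missing directories and hex exit codes found in a shadow log."""
    summary = {"holds": [], "missing_dirs": set(), "exit_codes": set()}
    for line in log_text.splitlines():
        hold = HOLD_RE.search(line)
        if hold:
            summary["holds"].append((int(hold.group(1)), int(hold.group(2))))
        failed = OPEN_FAIL_RE.search(line)
        if failed:
            # Keep only the directory part of the Windows path.
            summary["missing_dirs"].add(failed.group(1).rsplit("\\", 1)[0])
        exited = EXIT_RE.search(line)
        if exited:
            # Reinterpret the signed value as an unsigned 32-bit number.
            summary["exit_codes"].add(hex(int(exited.group(1)) & 0xFFFFFFFF))
    return summary


if __name__ == "__main__":
    # "ShadowLog" is a placeholder path; point it at the real log file.
    with open("ShadowLog", encoding="utf-8", errors="replace") as handle:
        print(summarize(handle.read()))
```

On the log above, every failed open points at the same `cluster46993.proc0.subproc0.tmp` directory, and the exit code -1073741510 becomes 0xc000013a, which Windows documents as STATUS_CONTROL_C_EXIT.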
The spool directory is on the machine where your condor_schedd daemon is running.
- Ian
On Thu, Jan 31, 2013 at 8:35 AM, Mostafa Bakhtvar <bakhtvar@xxxxxxxxx> wrote:
Do you know how I can check this? Where is it? On my PC or on one of the nodes?
- Mosy
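One way to check this is sketched below. It assumes it is run on the submit machine (the one running condor_schedd) and that `condor_config_val` is on the PATH. The helper names `spool_dir` and `free_gb` are made up for this example. `condor_config_val SPOOL` prints the configured spool path, and Python's `shutil.disk_usage` reports the free space on the drive that holds it.

```python
import shutil
import subprocess


def spool_dir():
    """Ask the local HTCondor configuration for the SPOOL directory."""
    result = subprocess.run(
        ["condor_config_val", "SPOOL"],
        capture_output=True,
        text=True,
        check=True,  # raise if the command fails
    )
    return result.stdout.strip()


def free_gb(path):
    """Free space, in GiB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    path = spool_dir()
    print(f"SPOOL = {path}")
    print(f"free space: {free_gb(path):.1f} GiB")
```

Free space on the execute nodes says little about this particular error, because the path in the hold reason (`C:\condor/spool\...`) is on the machine where the shadow runs.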
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 31 January 2013 16:28
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Shadow Exception after 3 hours of running!
Did your spool directory run out of disk space?
- Ian
On Thu, Jan 31, 2013 at 1:47 AM, Mostafa.B <bakhtvar@xxxxxxxxx> wrote:
Hi All,
Recently, the jobs that I submit to Condor have been held after 3 hours (or more) of running.
I looked at the log file and it says:
...
007 (46992.000.000) 01/30 20:07:10 Shadow exception!
Error from slot1@xxxxxxxx: STARTER at xxx.xx.xxx.xx failed to send file(s) to <xxx.xx.xxx.xxx:xxxxx>; SHADOW at xxx.xx.xxx.xxx failed to write to file C:\condor/spool\6992\0\cluster46992.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
72922408 - Run Bytes Sent By Job
232107 - Run Bytes Received By Job
...
012 (46992.000.000) 01/30 20:07:10 Job was held.
Error from slot1@xxxxxxxx: STARTER at xxx.xx.xxx.xx failed to send file(s) to <xxx.xx.xxx.xxx:xxxxx>; SHADOW at xxx.xx.xxx.xxx failed to write to file C:\condor/spool\6992\0\cluster46992.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
Code 12 Subcode 2
...
Any ideas why this happens, and how to solve it?
The jobs ran fine under Condor until 2 days ago, and they still run fine when I run them manually on my PC.
By the way, I am the admin user of the Windows-based PC that submits the jobs to Condor.
Regards
Mosy
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/