
Re: [HTCondor-users] Shadow Exception after 3 hours of running!



Are you using a preemption policy in your system? Your jobs are probably being preempted by Condor when other users enter the system, because your effective user priority (EUP) value is much higher than theirs. In HTCondor a higher EUP value means a worse priority.

- Ian
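One way to probe the preemption theory against the shadow log quoted below: both runs end with exit_code=-1073741510. Read as an unsigned 32-bit value, that is the Windows status code 0xC000013A (STATUS_CONTROL_C_EXIT), which Windows reports when a console process is ended by Ctrl+C or a console close request. That would be consistent with the job being told to shut down rather than failing on its own, though this is an interpretation and not something the log states. A minimal Python sketch of the conversion (the helper name to_ntstatus is made up for illustration):

```python
def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit Windows exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# exit_code copied from the "Mock terminating job" lines in the shadow log
print(to_ntstatus(-1073741510))  # prints 0xC000013A
```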


On Fri, Feb 1, 2013 at 2:39 AM, Mostafa.B <bakhtvar@xxxxxxxxx> wrote:

I don't know whether this is the right check, but I looked at the free space on the drive that holds the spool directory on every cluster node; each has at least 100 GB free.

It seems that when other users submit jobs, Condor puts mine on hold for no apparent reason (no matter how long they have been running) and throws the error I mentioned in my previous email.

I increased the priority hoping that this wouldn't happen again, but it didn't work!

Below is the shadow log for a job that was run twice and was held each time with a shadow exception. I can't understand what is going on there!

 

01/31/13 10:14:21 Locale: English_United States.1252

01/31/13 10:14:21 Setting maximum accepts per cycle 8.

01/31/13 10:14:21 ******************************************************

01/31/13 10:14:21 ** condor_shadow (CONDOR_SHADOW) STARTING UP

01/31/13 10:14:21 ** C:\condor\bin\condor_shadow.exe

01/31/13 10:14:21 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

01/31/13 10:14:21 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

01/31/13 10:14:21 ** $CondorVersion: 7.8.2 Aug 08 2012 $

01/31/13 10:14:21 ** $CondorPlatform: x86_64_winnt_6.1 $

01/31/13 10:14:21 ** PID = 6020

01/31/13 10:14:21 ** Log last touched 1/30 20:07:10

01/31/13 10:14:21 ******************************************************

01/31/13 10:14:21 Using config source: C:\condor\condor_config

01/31/13 10:14:21 Using local config sources:

01/31/13 10:14:21    C:\condor/condor_config.local

01/31/13 10:14:21 DaemonCore: command socket at <xxx.xx.xxx.113:49429>

01/31/13 10:14:22 DaemonCore: private command socket at <xxx.xx.xxx.113:49429>

01/31/13 10:14:22 Setting maximum accepts per cycle 8.

01/31/13 10:14:22 Initializing a VANILLA shadow for job 46993.0

01/31/13 10:14:22 (46993.0) (6020): Request to run on slot1@xxx-Sim4 <xxx.xx.xxx.58:49186> was ACCEPTED

01/31/13 10:14:23 (46993.0) (6020): my_popen: CreateProcess failed

01/31/13 10:14:23 (46993.0) (6020): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring

01/31/13 10:14:23 (46993.0) (6020): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring

01/31/13 12:43:10 (46993.0) (6020): perm: OpenProcessToken failed: 5

01/31/13 12:43:10 (46993.0) (6020): perm::set_acls(C:\condor/spool\6993\0\cluster46993.proc0.subproc0): Unable to set file ACL(err=6).

01/31/13 12:43:10 (46993.0) (6020): perm: OpenProcessToken failed: 5

01/31/13 12:43:10 (46993.0) (6020): perm: SetNamedSecurityInfo(C:\condor/spool\6993\0\cluster46993.proc0.subproc0) failed (err=5)

01/31/13 12:43:10 (46993.0) (6020): (46993.0) Failed to chown C:\condor/spool\6993\0\cluster46993.proc0.subproc0 from to 27474704\27476512.

01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc, errno = 2: No such file or directory.

01/31/13 12:43:10 (46993.0) (6020): get_file(): consumed 92370 bytes of file transmission

01/31/13 12:43:10 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc: (errno 2) No such file or directory

01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py, errno = 2: No such file or directory.

01/31/13 12:43:10 (46993.0) (6020): get_file(): consumed 444187 bytes of file transmission

01/31/13 12:43:10 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py: (errno 2) No such file or directory

01/31/13 12:43:10 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.col, errno = 2: No such file or directory.

01/31/13 12:43:11 (46993.0) (6020): get_file(): consumed 7171022 bytes of file transmission

01/31/13 12:43:11 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.col: (errno 2) No such file or directory

01/31/13 12:43:11 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.nl, errno = 2: No such file or directory.

01/31/13 12:43:15 (46993.0) (6020): get_file(): consumed 51599812 bytes of file transmission

01/31/13 12:43:15 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.nl: (errno 2) No such file or directory

01/31/13 12:43:16 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.row, errno = 2: No such file or directory.

01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 13587542 bytes of file transmission

01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmptklhty.pyomo.row: (errno 2) No such file or directory

01/31/13 12:43:17 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr, errno = 2: No such file or directory.

01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 2 bytes of file transmission

01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr: (errno 2) No such file or directory

01/31/13 12:43:17 (46993.0) (6020): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout, errno = 2: No such file or directory.

01/31/13 12:43:17 (46993.0) (6020): get_file(): consumed 0 bytes of file transmission

01/31/13 12:43:17 (46993.0) (6020): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

01/31/13 12:43:17 (46993.0) (6020): Mock terminating job 46993.0: exited_by_signal=FALSE, exit_code=-1073741510 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"

01/31/13 12:43:17 (46993.0) (6020): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-Sim4: STARTER at xxx.xx.xxx.58 failed to send file(s) to <xxx.xx.xxx.113:49429>; SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

01/31/13 12:43:17 (46993.0) (6020): **** condor_shadow (condor_SHADOW) pid 6020 EXITING WITH STATUS 112

01/31/13 12:57:09 Locale: English_United States.1252

01/31/13 12:57:09 Setting maximum accepts per cycle 8.

01/31/13 12:57:09 ******************************************************

01/31/13 12:57:09 ** condor_shadow (CONDOR_SHADOW) STARTING UP

01/31/13 12:57:09 ** C:\condor\bin\condor_shadow.exe

01/31/13 12:57:09 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

01/31/13 12:57:09 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

01/31/13 12:57:09 ** $CondorVersion: 7.8.2 Aug 08 2012 $

01/31/13 12:57:09 ** $CondorPlatform: x86_64_winnt_6.1 $

01/31/13 12:57:09 ** PID = 10712

01/31/13 12:57:09 ** Log last touched 1/31 12:43:17

01/31/13 12:57:09 ******************************************************

01/31/13 12:57:09 Using config source: C:\condor\condor_config

01/31/13 12:57:09 Using local config sources:

01/31/13 12:57:09    C:\condor/condor_config.local

01/31/13 12:57:09 DaemonCore: command socket at <xxx.xx.xxx.113:51309>

01/31/13 12:57:09 DaemonCore: private command socket at <xxx.xx.xxx.113:51309>

01/31/13 12:57:09 Setting maximum accepts per cycle 8.

01/31/13 12:57:09 Initializing a VANILLA shadow for job 46993.0

01/31/13 12:57:09 (46993.0) (10712): Request to run on slot1@xxx-sim6 <xxx.xx.xxx.219:49191> was ACCEPTED

01/31/13 12:57:09 (46993.0) (10712): my_popen: CreateProcess failed

01/31/13 12:57:09 (46993.0) (10712): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring

01/31/13 12:57:09 (46993.0) (10712): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring

01/31/13 16:32:07 (46993.0) (10712): perm: OpenProcessToken failed: 5

01/31/13 16:32:07 (46993.0) (10712): perm::set_acls(C:\condor/spool\6993\0\cluster46993.proc0.subproc0): Unable to set file ACL(err=6).

01/31/13 16:32:07 (46993.0) (10712): perm: OpenProcessToken failed: 5

01/31/13 16:32:07 (46993.0) (10712): perm: SetNamedSecurityInfo(C:\condor/spool\6993\0\cluster46993.proc0.subproc0) failed (err=5)

01/31/13 16:32:07 (46993.0) (10712): (46993.0) Failed to chown C:\condor/spool\6993\0\cluster46993.proc0.subproc0 from to 26886624\26885984.

01/31/13 16:32:07 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc, errno = 2: No such file or directory.

01/31/13 16:32:08 (46993.0) (10712): get_file(): consumed 92370 bytes of file transmission

01/31/13 16:32:08 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_model.pyc: (errno 2) No such file or directory

01/31/13 16:32:08 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py, errno = 2: No such file or directory.

01/31/13 16:32:08 (46993.0) (10712): get_file(): consumed 546767 bytes of file transmission

01/31/13 16:32:08 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\langkawi_results.py: (errno 2) No such file or directory

01/31/13 16:32:08 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.col, errno = 2: No such file or directory.

01/31/13 16:32:09 (46993.0) (10712): get_file(): consumed 7171022 bytes of file transmission

01/31/13 16:32:09 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.col: (errno 2) No such file or directory

01/31/13 16:32:09 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.nl, errno = 2: No such file or directory.

01/31/13 16:32:18 (46993.0) (10712): get_file(): consumed 51599812 bytes of file transmission

01/31/13 16:32:18 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.nl: (errno 2) No such file or directory

01/31/13 16:32:18 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.row, errno = 2: No such file or directory.

01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 13587542 bytes of file transmission

01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\tmpek7gsh.pyomo.row: (errno 2) No such file or directory

01/31/13 16:32:20 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr, errno = 2: No such file or directory.

01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 2 bytes of file transmission

01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stderr: (errno 2) No such file or directory

01/31/13 16:32:20 (46993.0) (10712): get_file(): Failed to open file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout, errno = 2: No such file or directory.

01/31/13 16:32:20 (46993.0) (10712): get_file(): consumed 0 bytes of file transmission

01/31/13 16:32:20 (46993.0) (10712): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

01/31/13 16:32:20 (46993.0) (10712): Mock terminating job 46993.0: exited_by_signal=FALSE, exit_code=-1073741510 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"

01/31/13 16:32:20 (46993.0) (10712): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-sim6: STARTER at xxx.xx.xxx.219 failed to send file(s) to <xxx.xx.xxx.113:51309>; SHADOW at xxx.xx.xxx.113 failed to write to file C:\condor/spool\6993\0\cluster46993.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

01/31/13 16:32:20 (46993.0) (10712): **** condor_shadow (condor_SHADOW) pid 10712 EXITING WITH STATUS 112
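To see how long each attempt ran before it was held, the log above can be reduced to its "Request to run on" and "going into Hold state" lines. The following is a rough Python sketch, assuming the timestamp and line layout shown in this particular log; the regular expressions and the function name summarize_runs are illustrative and not part of HTCondor. The sample hold lines are shortened copies of the ones above.

```python
import re
from datetime import datetime

STAMP = r"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d)"
# Patterns modelled on the shadow log lines quoted above (an assumption about the format).
START_RE = re.compile(STAMP + r" \((\d+\.\d+)\) \(\d+\): Request to run on (\S+)")
HOLD_RE = re.compile(
    STAMP + r" \((\d+\.\d+)\) \(\d+\): Job \S+ going into Hold state \(code (\d+),(\d+)\)"
)


def parse_stamp(text):
    """Parse the MM/DD/YY HH:MM:SS timestamps used in the log."""
    return datetime.strptime(text, "%m/%d/%y %H:%M:%S")


def summarize_runs(lines):
    """Pair each accepted run with the hold that ended it."""
    started = {}  # job id -> (start time, slot name)
    runs = []
    for raw in lines:
        line = raw.strip()
        m = START_RE.match(line)
        if m:
            started[m.group(2)] = (parse_stamp(m.group(1)), m.group(3))
            continue
        m = HOLD_RE.match(line)
        if m and m.group(2) in started:
            t0, slot = started.pop(m.group(2))
            runs.append({
                "job": m.group(2),
                "slot": slot,
                "ran_for": parse_stamp(m.group(1)) - t0,
                "hold_code": (int(m.group(3)), int(m.group(4))),
            })
    return runs


SAMPLE = [
    "01/31/13 10:14:22 (46993.0) (6020): Request to run on slot1@xxx-Sim4 <xxx.xx.xxx.58:49186> was ACCEPTED",
    "01/31/13 12:43:17 (46993.0) (6020): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-Sim4",
    "01/31/13 12:57:09 (46993.0) (10712): Request to run on slot1@xxx-sim6 <xxx.xx.xxx.219:49191> was ACCEPTED",
    "01/31/13 16:32:20 (46993.0) (10712): Job 46993.0 going into Hold state (code 12,2): Error from slot1@xxx-sim6",
]

for run in summarize_runs(SAMPLE):
    print(run["job"], run["slot"], run["ran_for"], run["hold_code"])
```

On the sample lines this reports 2:28:55 for the first attempt and 3:35:11 for the second, both ending with hold code (12, 2), so the hold does not arrive after a fixed running time.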



On Thu, Jan 31, 2013 at 10:57 PM, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
The spool directory is on the machine where your condor_schedd daemon is running.

- Ian
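For anyone following along, free space can be checked from Python on the machine that runs condor_schedd. This is only a sketch: the function name free_gib is made up, and the path to pass in is an assumption (the shadow log in this thread shows files under C:\condor/spool; running condor_config_val SPOOL on the submit machine should print the configured location).

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the drive that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 2**30


# "." is a placeholder so the sketch runs anywhere; substitute the real
# spool directory of the submit machine when checking for this problem.
print(f"{free_gib('.'):.1f} GiB free")
```

Free space alone does not rule out other causes: the errors in this thread are errno 2 (no such file or directory), which points at a missing path rather than a full disk.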


On Thu, Jan 31, 2013 at 8:35 AM, Mostafa Bakhtvar <bakhtvar@xxxxxxxxx> wrote:

Do you know how I can check this? Where is it: on my PC or on one of the nodes?

 

- Mosy

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 31 January 2013 16:28
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Shadow Exception after 3 hours of running!

 

Did your spool directory run out of disk space?

 

- Ian

 

On Thu, Jan 31, 2013 at 1:47 AM, Mostafa.B <bakhtvar@xxxxxxxxx> wrote:

Hi All,

 

Recently, the jobs that I submit to Condor have been put on hold after 3 hours (or more) of running.

 

I looked at the log file and it says:

 

...

007 (46992.000.000) 01/30 20:07:10 Shadow exception!

Error from slot1@xxxxxxxx: STARTER at xxx.xx.xxx.xx failed to send file(s) to <xxx.xx.xxx.xxx:xxxxx>; SHADOW at xxx.xx.xxx.xxx failed to write to file C:\condor/spool\6992\0\cluster46992.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

72922408  -  Run Bytes Sent By Job

232107  -  Run Bytes Received By Job

...

012 (46992.000.000) 01/30 20:07:10 Job was held.

Error from slot1@xxxxxxxx: STARTER at xxx.xx.xxx.xx failed to send file(s) to <xxx.xx.xxx.xxx:xxxxx>; SHADOW at xxx.xx.xxx.xxx failed to write to file C:\condor/spool\6992\0\cluster46992.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory

Code 12 Subcode 2

...

 

Any ideas why this happens and how to solve it?

 

The jobs ran fine under Condor until 2 days ago, and they still run fine when I run them manually on my PC.

By the way, I am the admin user of the Windows-based PC that submits the jobs to Condor.

 

Regards

Mosy

 


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

 

