
Re: [Condor-users] how to avoid ...



If you have a single submit node, you could try using

  ##  How long should the schedd wait between spawning each shadow?
  #JOB_START_DELAY        = 2

(uncommented) to put enough delay between job starts to overcome this. The delay is applied by each schedd separately, so if you use more than one submit node, it may not accomplish much.

- dave


DeVoil, Peter wrote:
Hi,

I have an infrequent problem on an 80-CPU Windows pool - all dual-processor
hosts running 6.6.10. When an execute node decides to start up
two jobs (each using the same pre-installed executable, with many
dependent DLLs) at exactly the same time, one of them gets an "IO error:
permission denied" message on stderr and stalls - presumably with a
dialog box in nowhere land.

The simplest way to avoid this is to stop trying to start both jobs at
the same time; but I can't see a configuration entry to help. Any
suggestions? I've attached logs showing vm1 stalling.
Yours,
pdev.

------------------------------------------------------------------------

--- startlog:
1/31 11:27:13 DaemonCore: Command received via TCP from host <192.168.0.30:4348>
1/31 11:27:13 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/31 11:27:13 vm1: Got activate_claim request from shadow (<192.168.0.30:4348>)
1/31 11:27:13 vm1: Remote job ID is 12768.0
1/31 11:27:13 vm1: Got universe "VANILLA" (5) from request classad
1/31 11:27:13 vm1: State change: claim-activation protocol successful
1/31 11:27:13 vm1: Changing activity: Idle -> Busy
1/31 11:27:51 DaemonCore: Command received via TCP from host <192.168.0.30:4462>
1/31 11:27:51 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/31 11:27:51 vm2: Got activate_claim request from shadow (<192.168.0.30:4462>)
1/31 11:27:51 vm2: Remote job ID is 12769.0
1/31 11:27:51 vm2: Got universe "VANILLA" (5) from request classad
1/31 11:27:51 vm2: State change: claim-activation protocol successful
1/31 11:27:51 vm2: Changing activity: Idle -> Busy
1/31 11:41:32 DaemonCore: Command received via TCP from host <192.168.0.30:2250>
1/31 11:41:32 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/31 11:41:32 vm2: Called deactivate_claim_forcibly()
1/31 11:41:47 DaemonCore: Command received via TCP from host <192.168.0.30:2264>
1/31 11:41:47 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/31 11:41:47 vm2: Got activate claim while starter is still alive.
1/31 11:41:47 vm2: Telling shadow to try again later.
1/31 11:41:47 ProcFamily::currentfamily: ERROR: family_size is 0
1/31 11:41:47 vm2: WARNING: No processes found in starter's family
1/31 11:41:47 DaemonCore: Command received via UDP from host <192.168.0.112:1649>
1/31 11:41:47 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
1/31 11:41:47 Starter pid 1780 exited with status 0
1/31 11:41:48 vm2: State change: starter exited
1/31 11:41:48 vm2: Changing activity: Busy -> Idle
...
---starterlog.vm1:
1/31 11:23:31 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
1/31 11:27:13 ******************************************************
1/31 11:27:13 ** condor_starter (CONDOR_STARTER) STARTING UP
1/31 11:27:13 ** C:\Condor\bin\condor_starter.exe
1/31 11:27:13 ** $CondorVersion: 6.6.10 Jun 22 2005 $
1/31 11:27:13 ** $CondorPlatform: INTEL-WINNT50 $
1/31 11:27:13 ** PID = 308
1/31 11:27:13 ******************************************************
1/31 11:27:13 Using config file: C:\Condor\condor_config
1/31 11:27:13 Using local config files: C:\Condor/condor_config.local
1/31 11:27:13 DaemonCore: Command Socket at <192.168.0.112:1615>
1/31 11:27:13 Setting resource limits not implemented!
1/31 11:27:13 Starter communicating with condor_shadow <192.168.0.30:4343>
1/31 11:27:13 Submitting machine is "ODIN"
1/31 11:28:32 File transfer completed successfully.
1/31 11:28:33 Starting a VANILLA universe job with ID: 12768.0
1/31 11:28:33 IWD: C:\Condor/execute\dir_308
1/31 11:28:33 Output file: C:\Condor/execute\dir_308\job.stdout
1/31 11:28:33 Error file: C:\Condor/execute\dir_308\job.stderr
1/31 11:28:33 Renice expr "10" evaluated to 10
1/31 11:28:33 About to exec C:\WINDOWS\System32\cmd.exe /Q /C condor_exec.bat yp.sim 1
1/31 11:28:33 Create_Process succeeded, pid=1836
EOF - message in stderr..

---starterlog.vm2:
1/31 11:23:46 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
1/31 11:27:51 ******************************************************
1/31 11:27:51 ** condor_starter (CONDOR_STARTER) STARTING UP
1/31 11:27:51 ** C:\Condor\bin\condor_starter.exe
1/31 11:27:51 ** $CondorVersion: 6.6.10 Jun 22 2005 $
1/31 11:27:51 ** $CondorPlatform: INTEL-WINNT50 $
1/31 11:27:51 ** PID = 1780
1/31 11:27:51 ******************************************************
1/31 11:27:51 Using config file: C:\Condor\condor_config
1/31 11:27:51 Using local config files: C:\Condor/condor_config.local
1/31 11:27:51 DaemonCore: Command Socket at <192.168.0.112:1620>
1/31 11:27:51 Setting resource limits not implemented!
1/31 11:27:54 Starter communicating with condor_shadow <192.168.0.30:4456>
1/31 11:27:54 Submitting machine is "ODIN"
1/31 11:28:33 File transfer completed successfully.
1/31 11:28:34 Starting a VANILLA universe job with ID: 12769.0
1/31 11:28:34 IWD: C:\Condor/execute\dir_1780
1/31 11:28:34 Output file: C:\Condor/execute\dir_1780\job.stdout
1/31 11:28:34 Error file: C:\Condor/execute\dir_1780\job.stderr
1/31 11:28:34 Renice expr "10" evaluated to 10
1/31 11:28:34 About to exec C:\WINDOWS\System32\cmd.exe /Q /C condor_exec.bat yp.sim 1
1/31 11:28:34 Create_Process succeeded, pid=1844
1/31 11:40:51 Process exited, pid=1844, status=0
1/31 11:41:32 Got SIGQUIT.  Performing fast shutdown.
1/31 11:41:32 ShutdownFast all jobs.
1/31 11:41:32 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
1/31 11:42:04 ******************************************************
...etc
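
To check how often two jobs on one host really start together, the starter logs can be scanned for "Create_Process succeeded" lines that fall within a short window of each other. The following is a minimal sketch, not part of Condor: the function names `process_starts` and `close_starts`, the 2-second window, and the placeholder year are all assumptions made for illustration, and the timestamp pattern is taken only from the excerpts above. In those excerpts the two starters came up 38 seconds apart (11:27:13 and 11:27:51), yet the two processes were created one second apart (11:28:33 and 11:28:34), so it may be worth measuring the gap at process creation rather than at claim activation.

```python
import re
from datetime import datetime, timedelta

# "M/D HH:MM:SS message" prefix, as seen in the log excerpts above.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")
EXEC_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")


def process_starts(log_text, year=2000):
    """Return a list of (timestamp, pid) for each 'Create_Process succeeded' line.

    The logs carry no year, so `year` is a placeholder used only for parsing.
    """
    starts = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip banner lines, "...", notes, and so on
        hit = EXEC_RE.search(m.group(2))
        if hit:
            ts = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
            starts.append((ts, int(hit.group(1))))
    return starts


def close_starts(logs, window_seconds=2):
    """Return pairs of process starts from different logs within the window.

    `logs` maps a label (for example "vm1") to that starter log's text.
    Each result is ((label_a, pid_a), (label_b, pid_b), gap_in_seconds).
    """
    events = sorted(
        (ts, name, pid)
        for name, text in logs.items()
        for ts, pid in process_starts(text)
    )
    window = timedelta(seconds=window_seconds)
    pairs = []
    for i, (ts_a, name_a, pid_a) in enumerate(events):
        for ts_b, name_b, pid_b in events[i + 1:]:
            if ts_b - ts_a > window:
                break  # events are sorted, so later ones are even further away
            if name_a != name_b:
                gap = (ts_b - ts_a).total_seconds()
                pairs.append(((name_a, pid_a), (name_b, pid_b), gap))
    return pairs


if __name__ == "__main__":
    # Two lines copied from the excerpts above, one per starter log.
    sample = {
        "vm1": "1/31 11:28:33 Create_Process succeeded, pid=1836",
        "vm2": "1/31 11:28:34 Create_Process succeeded, pid=1844",
    }
    for a, b, gap in close_starts(sample):
        print(f"{a} and {b} started {gap:.0f}s apart")
```

In practice the dictionary would be filled by reading each starter log file from the execute node; the file names and locations depend on the local configuration and are not assumed here.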


------------------------------------------------------------------------
