I have a Windows-only HTCondor pool, and I’m trying to submit a very simple task to that pool from another Windows machine outside the pool using the grid universe. The batch file being run is on a network drive that’s accessible by all machines involved, and I don’t care about storing stdout, stderr, and log files, so I don’t want any file transfer to happen. To that end, I’ve set transfer_executable to False and remote_ShouldTransferFiles to “NO”. Here are the contents of my submit file:
universe = grid
# This is accessible to all machines
executable = //FileServer/path/to/file/test.bat
transfer_executable = False
concurrency_limits = 100
accounting_group = group_condor
accounting_group_user = farnhamj
grid_resource = condor HeadNode.aqrcapital.com HeadNode.aqrcapital.com
remote_universe = vanilla
+remote_RunAsOwner = True
+remote_requirements = HasFincad == True
+remote_ShouldTransferFiles = "NO"
Once the task makes it onto the machine I’m calling HeadNode, it ends up staying Idle forever, because the condor_starter tries and fails to start the job. I found the following message in the StarterLog.slot1 log on the machine that was trying to start the task:
06/05/15 17:50:22 (pid:2092) Create_Process: CreateProcess failed, errno=267
06/05/15 17:50:22 (pid:2092) SharedPortEndpoint: Inside stop listener.
06/05/15 17:50:22 (pid:2092) Create_Process(//FileServer/path/to/file/test.bat,, ...) failed:
06/05/15 17:50:22 (pid:2092) In OwnerProfile::loaded()
06/05/15 17:50:22 (pid:2092) Failed to start job, exiting
06/05/15 17:50:22 (pid:2092) ShutdownFast all jobs.
06/05/15 17:50:22 (pid:2092) Got ShutdownFast when no jobs running.
06/05/15 17:50:22 (pid:2092) HOOK_JOB_EXIT not configured.
06/05/15 17:50:22 (pid:2092) Entering JICShadow::updateShadow()
06/05/15 17:50:22 (pid:2092) Sent job ClassAd update to startd.
06/05/15 17:50:22 (pid:2092) Leaving JICShadow::updateShadow(): success
06/05/15 17:50:22 (pid:2092) Inside JICShadow::transferOutput(void)
06/05/15 17:50:22 (pid:2092) JICShadow::transferOutput(void): Transferring...
06/05/15 17:50:22 (pid:2092) Inside JICShadow::transferOutputMopUp(void)
06/05/15 17:50:22 (pid:2092) dirscat: dirpath = /
06/05/15 17:50:22 (pid:2092) dirscat: subdir = C:\condor\execute
06/05/15 17:50:22 (pid:2092) Initializing Directory: curr_dir = /\C:\condor\execute\
06/05/15 17:50:22 (pid:2092) **** condor_starter (condor_STARTER) pid 2092 EXITING WITH STATUS 0
The last four lines look suspicious to me. It seems like Condor is trying to run out of C:\condor\execute instead of the location of the script, //FileServer/path/to/file/test.bat, which might be why the condor_starter is failing to start the job.
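As a rough aid for reading the log above, here is a minimal Python sketch that pulls CreateProcess error codes out of starter log text and labels them. It assumes the errno printed by the starter is a Windows system error code (the value GetLastError would return), which the log itself does not confirm. The lookup table is a small hand-picked subset of winerror.h, and the function name explain_create_process_failures is hypothetical.

```python
import re

# Hand-picked subset of Windows system error codes (winerror.h).
# Assumption: the starter's "errno" is one of these system error codes.
WINDOWS_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND: The system cannot find the file specified.",
    3: "ERROR_PATH_NOT_FOUND: The system cannot find the path specified.",
    5: "ERROR_ACCESS_DENIED: Access is denied.",
    267: "ERROR_DIRECTORY: The directory name is invalid.",
}

# Matches the failure line format seen in the StarterLog excerpt.
CREATE_PROCESS_FAILURE = re.compile(r"CreateProcess failed, errno=(\d+)")


def explain_create_process_failures(log_text):
    """Return (code, description) pairs for each CreateProcess failure line."""
    results = []
    for line in log_text.splitlines():
        match = CREATE_PROCESS_FAILURE.search(line)
        if match:
            code = int(match.group(1))
            results.append((code, WINDOWS_ERRORS.get(code, "not in this table")))
    return results


sample = (
    "06/05/15 17:50:22 (pid:2092) Create_Process: "
    "CreateProcess failed, errno=267"
)
for code, meaning in explain_create_process_failures(sample):
    print(code, meaning)
```

If that assumption holds, 267 is ERROR_DIRECTORY, which would suggest the complaint is about a directory handed to CreateProcess (possibly the job's working directory) rather than about the batch file itself.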
In addition, when I use condor_q -l to look at the job’s ClassAd on the machine I’m calling HeadNode, I see the following:
Iwd = "C:\condor\spool\2133\0\cluster22133.proc0.subproc0"
This doesn’t look right. Shouldn’t the initial working directory be //FileServer/path/to/file, the directory that contains test.bat?
Finally, every machine in question has the same value set for FILESYSTEM_DOMAIN, which was my attempt to avoid issues accessing the //FileServer/path/to/file UNC path.
I know this is a detailed question--thanks for any help you can provide.