
Re: [HTCondor-users] Condor runs immediately go to "Held" state



Hey Nate,

This looks like a case where the exec nodes don't have access to the same filesystems as the submit node. The error is raised on the exec node (slot19), which cannot find the input file at the path the submit node gave it. Make sure that /some/path is available on all of the exec nodes, and I think your jobs will start successfully.

Is it possible that your exec nodes are on a separate subnet shared with the central manager system, while everyone else is on a different subnet?

Also, make sure that initialdir is not being changed in the submit file. If the user submits from one directory and the job is assigned a different initial directory that doesn't contain the input file, that could cause a problem, though I think that would probably be caught at submit time rather than at execute time, as this error was.

The alternative to a shared filesystem is to have the job submissions set up input and output file transfers.
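
As a rough sketch of what that could look like, here is a minimal submit description with file transfer turned on. The executable name and the job.* file names are placeholders I made up for illustration; only the input path comes from your log. With transfers enabled, HTCondor should copy the executable and the file named by "input" from the submit node to the exec node on its own, so the exec node no longer needs to see /some/path directly. Any additional data files the job reads would have to be listed with transfer_input_files.

```
# Hypothetical submit description; my_program and the job.* names are placeholders.
executable              = my_program
input                   = /some/path/filename.inp
output                  = job.out
error                   = job.err
log                     = job.log

# Copy files to and from the exec node instead of relying on a shared filesystem.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

queue
```

If the shared filesystem is meant to be there and is just missing on some nodes, fixing the mount is probably the simpler route; the transfer settings are the fallback when sharing /some/path isn't practical.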

	-Michael Pelletier.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Mobley, Nate (Millennium)
Sent: Tuesday, January 2, 2018 9:33 AM
To: 'htcondor-admin@xxxxxxxxxxx' <htcondor-admin@xxxxxxxxxxx>; 'htcondor-users@xxxxxxxxxxx' <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Condor runs immediately go to "Held" state

Please advise; I still need assistance with this as my customer is under a deadline. Thank you. Please see below for a sample of the log file after trying a run:

"0  -  Run Bytes Received By Job
...
007 (1863.018.000) 12/27 09:30:12 Shadow exception!
Error from slot19@xxxxxxxxxxxxxxxxx: Failed to open '/some/path/filename.inp' as standard input: No such file or directory (errno 2)"

Some context: I have one head node (this is what we log into to submit runs) that is running RHEL6, and 9 compute nodes. 

Thanks for any assistance you can provide.