
Re: [HTCondor-users] batch submitssion strange problem



Hi Greg!
Thanks, here is the log. According to it, the executable file has not been copied into the docker image.

gergely.debreczeni@xxxxxxx:~/batchsubmission$ condor_q -anal 57.1


-- Schedd: X.X.X.X <10.1.8.8:51975?...
---
057.001:  Request is held.

Hold reason: Error from slot1@scorpio005: STARTER at 10.1.10.5 failed to send file(s) to <10.1.8.8:28343>: error reading from /var/lib/condor/execute/dir_1810221/output.out: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.1.10.5:10057>


The reason for this is that the executable never ran, because it was not copied. The job's stderr message says:

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
/usr/local/bin/nvidia_entrypoint.sh: line 88: exec: batch.sh: not found


Experimenting with it a bit more, with condor 8.4.2 the executable only gets copied if both of these conditions hold:
  • it is defined as "./batch.sh" and not as "batch.sh"
  • it is explicitly listed in the paramlist file as a variable that is passed to transfer_input_files.
So the paramlist file looks like this:

a, batch.sh, 1 2
a, batch.sh, 3 4
a, batch.sh, 5 6
a, batch.sh, 7 8

and this is the submit file:


## Executable
executable              = ./batch.sh
universe                = docker
docker_image            = nv-pytorch-wglobus_v2

## Logs
log                     = out/batch.$(Process).log
output                  = out/batch.$(Process).stdout
error                   = out/batch.$(Process).stderr

## File transfer
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
line = $(Row)

transfer_output_files   = output.out
transfer_output_remaps  = "output.out=out/output$INT(line).out"
transfer_input_files    = $(input_file1), $(input_file2)

## Resources requested
request_cpus            = 1
request_GPUs            = 0
Requirements            = (ResourceType == "Dedicated") && (regexp(".*nv-pytorch-wglobus_v2.*",LocallyAvailableDockerImages))


## Submit command
queue input_file1, input_file2, arguments from [0:2:1] ./paramlist
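
To make it easier to see what the queue statement above should produce, here is a small Python sketch of the expansion. It is only a simplified model and not HTCondor's actual parser: it assumes the fields in paramlist are comma-separated, that the last variable takes the rest of the line, that [0:2:1] behaves like a Python slice, and that $(Row) is the 0-based item index. The names expand_queue and output_remap are made up for illustration.

```python
# Simplified model of:
#   queue input_file1, input_file2, arguments from [0:2:1] ./paramlist
# Assumptions (not taken from HTCondor source): comma-separated fields,
# last variable takes the remainder of the line, Python-style slice,
# and Row is the 0-based index of the item in the list.

PARAMLIST = """\
a, batch.sh, 1 2
a, batch.sh, 3 4
a, batch.sh, 5 6
a, batch.sh, 7 8
"""

VARNAMES = ["input_file1", "input_file2", "arguments"]


def expand_queue(text, varnames, start=None, stop=None, step=None):
    """Return one dict of submit variables per selected paramlist row."""
    rows = [line for line in text.splitlines() if line.strip()]
    jobs = []
    # Enumerate before slicing so each item keeps its index in the full list.
    for row, line in list(enumerate(rows))[start:stop:step]:
        # Split into at most len(varnames) fields; the last keeps its spaces.
        fields = [f.strip() for f in line.split(",", len(varnames) - 1)]
        job = dict(zip(varnames, fields))
        job["Row"] = row
        # transfer_input_files = $(input_file1), $(input_file2)
        job["transfer_input_files"] = f"{job['input_file1']}, {job['input_file2']}"
        # transfer_output_remaps = "output.out=out/output$INT(line).out", line = $(Row)
        job["output_remap"] = f"output.out=out/output{row}.out"
        jobs.append(job)
    return jobs


jobs = expand_queue(PARAMLIST, VARNAMES, 0, 2, 1)
for job in jobs:
    print(job["Row"], "|", job["transfer_input_files"], "|",
          job["arguments"], "|", job["output_remap"])
```

Under these assumptions the slice [0:2:1] selects only the first two rows of paramlist, and batch.sh ends up in transfer_input_files for each of them because it is the second field on every line.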


With condor 8.8.0 it also works without the ./ prefix and without explicitly listing the executable in the paramlist file.

thanks,
Gergely



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain <gthain@xxxxxxxxxxx>
Sent: Monday, May 6, 2019 4:07 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] batch submitssion strange problem
 
On 5/4/19 3:00 PM, Gergely Debreczeni via HTCondor-users wrote:

then the job is not running, it turns into Held state with the following error from condor_q -anal:


Can you send us the output of condor_q -hold? When a job is held, condor_q -hold will show the hold reason, which is often the best way to debug what's going on.


-greg


