
Re: [HTCondor-users] How do I have my interactive job and my submission job in condor match 100%?



Hi Thomas,

I am not sourcing anything in my main.sh script. I did try to do:
# module load cuda-toolkit/10.2
# module load cuda-toolkit/11.1
but the executing node didn't know about the module command, so I stopped doing that. However, I wasn't doing that in my interactive job anyway, so I don't think that is important.

Basically, I don't source anything when I run my interactive job or on the executing node. Is there something I should be sourcing? I assume the interactive node sources my .bashrc file, but I assumed that using getenv sourced the right things from my .bashrc file automatically.

Btw, I did try your suggestion of comparing env. They aren't the same, but the list is massive. I am unsure if pasting it here would help. I definitely don't know what to look for in it, but it's likely the difference is there somewhere.
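
Since the two dumps are too large to compare by eye, a small script can reduce them to just the differences. The following is a minimal sketch, assuming each dump was saved as plain `env` output (one NAME=value per line); the file names interactive_env.txt and batch_env.txt and the helper names parse_env and diff_env are hypothetical, not part of HTCondor.

```python
import sys


def parse_env(text):
    """Parse `env` output (one NAME=value per line) into a dict.

    Assumption: values do not span several lines. Lines without '='
    are skipped; continuation lines of multi-line values (for example
    exported shell functions) may be misread as separate variables.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(interactive, batch):
    """Return (only in interactive, only in batch, present in both but different)."""
    only_interactive = sorted(set(interactive) - set(batch))
    only_batch = sorted(set(batch) - set(interactive))
    changed = sorted(
        name
        for name in set(interactive) & set(batch)
        if interactive[name] != batch[name]
    )
    return only_interactive, only_batch, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python diff_env.py interactive_env.txt batch_env.txt
    with open(sys.argv[1]) as f:
        interactive = parse_env(f.read())
    with open(sys.argv[2]) as f:
        batch = parse_env(f.read())
    only_i, only_b, changed = diff_env(interactive, batch)
    print("Only in interactive:", *only_i, sep="\n  ")
    print("Only in batch:", *only_b, sep="\n  ")
    for name in changed:
        print(f"{name}\n  interactive: {interactive[name]}\n  batch:       {batch[name]}")
```

For a CUDA job, variables such as PATH, LD_LIBRARY_PATH, and CUDA_VISIBLE_DEVICES would be a reasonable place to start looking in the output, though that is only a guess and not a diagnosis of this particular problem.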

What do you recommend I try?

Thanks, Brando



On Thu, Mar 25, 2021 at 11:27 AM <thomas.hartmann@xxxxxxx> wrote:
Hi Brando,

getenv can be dangerous as the environment in your submission
environment might not work on the executing node.

Are you preparing the environment in your batch job the same way as you
set it up compared to when you run interactively? (do you source all the
same environment scripts etc.?)
Maybe you can try and print your batch job's environment into your log
file running `env` and compare with the interactive environment.
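
As an alternative to calling `env` in the wrapper script, the environment can be written out from inside the Python job itself, which captures exactly what the training process sees. This is a minimal sketch; the function name dump_env is hypothetical, and it assumes the job's stdout ends up in the log or output file of the submission.

```python
import os
import sys


def dump_env(stream=sys.stdout, environ=None):
    """Write NAME=value lines sorted by name, so two dumps line up in a diff."""
    environ = os.environ if environ is None else environ
    for name in sorted(environ):
        stream.write(f"{name}={environ[name]}\n")


if __name__ == "__main__":
    # Call this at the start of the job, in both the interactive and the batch run.
    dump_env()
```

Sorting by name is a design choice that keeps the two dumps in the same order, so a plain text diff shows only real differences.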

Cheers,
  Thomas

On 25/03/2021 16.55, brando.science@xxxxxxxxx wrote:
> Hi,
>
> I am a user of an HTCondor HPC cluster. I noticed that my PyTorch jobs
> that use CUDA work just fine in interactive mode (seemingly with any
> version of PyTorch or CUDA, even if nvidia-smi reports one version of
> CUDA and PyTorch reports another), but when I submit them with
> condor_submit without the interactive option they don't run. The job
> gets into a deadlock because I am trying to do parallel training (note
> that this does not happen in interactive mode, even with 4 GPUs).
>
> My question seems simple. How do I force the environment of my
> condor_submit job to be identical to the one I get in an interactive
> session?
>
> I've tried the famous getenv flag and that didn't work for some reason.
> I assume it is because it copies my environment from the login node
> instead of from the interactive session (but I cannot submit a job from
> an interactive session, so I can't do it that way). Is there a way to
> have the submitted job run with exactly the same settings as an
> interactive job? I am not a sys admin; I am only a user, if that helps.
>
> I've also read these two pages:
>
> - https://htcondor.readthedocs.io/en/latest/users-manual/services-for-jobs.html?highlight=environment#environment-variables
> - https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
>
> and posted this question on SO:
> https://stackoverflow.com/questions/66790905/how-do-i-have-my-interactive-job-and-my-submission-job-in-condor-match-100
>
>
>
> Thanks for your time, HTCondor users list.
>
>
> Sincerely, Brando
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>
