
Re: [HTCondor-users] How do I have my interactive job and my submission job in condor match 100%?

Hi Oliver,

Thank you very much for such a friendly and thorough response. Just to clarify, since I am not a bash or sysadmin expert: would the following `main.sh`, which I have set as executable, work?


source ~/.bashrc
source /etc/profile
# - run script
python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
# python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py --debug --num_epochs 5 --batch_size 2 --term_encoder_embedding_dim 8


Is that what you would usually expect to work? (I am aware my cluster might be different, but I just wanted to make sure I try your suggestion exactly as intended, to avoid problems.)

Thanks again and I will let the thread know if this worked.

Sincerely, Brando

On Thu, Mar 25, 2021 at 12:44 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
Hi Brando,

a common difference is that an interactive job will have an interactive (login) shell, which will source .bashrc and also other logon scripts (e.g. /etc/profile).

Usually, this is what you want for interactive use, since this allows e.g. "module load", shell completion and other things to work as expected.
But that is by default not done for a regular batch job, which does not run an interactive login shell.

That's likely also why "module load" does not work in the batch-job case for you: Your administrators can either set up a script which runs before the job,
explicitly setting up things such that the "module" command works, or instruct users to do "source /etc/profile".

We are currently choosing the latter approach, i.e. we tell our users to:
 source /etc/profile
for batch jobs such that they get access to the environment variables most of them expect (including setup of lmod, so "module load" works).
Now things can be more complicated if you have a .bashrc or other files sourced by an interactive shell (we decouple home directories between the cluster and submit node, which prevents this).

Another option would be to use a jobwrapper running things in a login shell explicitly, to ensure you also use a login shell for batch jobs.

In any case, I'd also advise against GetEnv: your cluster environment may be different from the environment on the submit node,
so copying over the environment (which may also be affected by other things you ran before) is usually inviting issues or non-reproducibility.

You can also dump the output of "env" to files, sort them, and diff the result to see the actual differences more cleanly.
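
A minimal sketch of that comparison; the function names dump_env and compare_env and the file names in the usage note are hypothetical.

```shell
#!/bin/bash
# dump_env FILE: write the current environment, sorted, to FILE.
dump_env() {
    env | sort > "$1"
}

# compare_env A B: print the lines that differ between two dumps.
# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts that use "set -e" from aborting at this point.
compare_env() {
    diff "$1" "$2" || true
}
```

In the interactive session one would call `dump_env ~/env_interactive.txt`, at the top of the batch script `dump_env ~/env_batch.txt`, and afterwards `compare_env ~/env_interactive.txt ~/env_batch.txt` on a node that can see both files. Lines prefixed with `<` appear only in the first dump, lines prefixed with `>` only in the second.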

In short:
I would expect sourcing /etc/profile removes most of the unwanted differences.
They are mainly caused by one shell being an interactive, login shell, automatically setting up things like shell completion,
while in batch mode a "clean" batch shell is used, and you have to explicitly source most environment parts.
Admins can configure this differently, but "source /etc/profile" should bring you most of the way.

Cheers and hope this helps,

On 25.03.21 at 17:50, brando.hpcs@xxxxxxxxx wrote:
> Hi Thomas,
> I am not sourcing anything in my main.sh script. I did try to do:
> # module load cuda-toolkit/10.2
> # module load cuda-toolkit/11.1
> but the executing node didn't know about the module command, so I stopped doing that. However, I wasn't doing that in my interactive job anyway, so I don't think that is important.
> Basically I don't source anything when I run my interactive job or on the executing node. Is there something I should be sourcing? I assume the interactive node sources my .bashrc file, but I assumed that using getenv sourced the right things from my .bashrc file automatically.
> Btw, I did try your suggestion of comparing env. They aren't the same, but the list is massive. I am unsure if pasting it here would help. I definitely don't know what to look for in it, but it's likely the difference is there somewhere.
> What do you recommend I try?
> Thanks, Brando
> On Thu, Mar 25, 2021 at 11:27 AM <thomas.hartmann@xxxxxxx> wrote:
> > Hi Brando,
> > getenv can be dangerous, as the environment from your submit
> > node might not work on the executing node.
> > Are you preparing the environment in your batch job the same way as you
> > set it up when you run interactively? (Do you source all the
> > same environment scripts etc.?)
> > Maybe you can try printing your batch job's environment into your log
> > file by running `env` and compare it with the interactive environment.
> > Cheers,
> >   Thomas
> > On 25/03/2021 16.55, brando.science@xxxxxxxxx wrote:
> > > Hi,
> > >
> > > I am a user of an HTCondor HPC cluster. I noticed that my PyTorch jobs that use
> > > CUDA work just fine in interactive mode (seemingly with any version
> > > of PyTorch or CUDA, even if nvidia-smi reports one version of CUDA but my
> > > PyTorch reports another), but when I try to run them with condor_submit
> > > without interactive mode, they don't run. The job gets into a deadlock because I
> > > am trying to do parallel training (but note this does not happen in
> > > interactive mode, even with 4 GPUs).
> > >
> > > My question seems simple. How do I force the environment of my condor_submit job to be
> > > identical to the environment when I run it from an interactive session?
> > >
> > > I've tried the famous getenv flag and that didn't work for some reason.
> > > I assume it is because it copies my environment from the login node instead
> > > of from the interactive session (but I cannot submit a job from an
> > > interactive session, so I can't do it that way). Is there a way to have
> > > the submitted job run with exactly the same settings as an interactive
> > > job? I am not a sysadmin; I am only a user, if that helps.
> > >
> > > I've also read these two pages:
> > >
> > > - https://htcondor.readthedocs.io/en/latest/users-manual/services-for-jobs.html?highlight=environment#environment-variables
> > > - https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
> > >
> > > and posted this question on SO:
> > > https://stackoverflow.com/questions/66790905/how-do-i-have-my-interactive-job-and-my-submission-job-in-condor-match-100
> > >
> > > Thanks for your time, HTCondor users list.
> > >
> > > Sincerely, Brando
> > >
> > > _______________________________________________
> > > HTCondor-users mailing list
> > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > > subject: Unsubscribe
> > > You can also unsubscribe by visiting
> > > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > >
> > > The archives can be found at:
> > > https://lists.cs.wisc.edu/archive/htcondor-users/

Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
Tel.: +49 228 73 2367
Fax: +49 228 73 7869