
Re: [HTCondor-users] How do I have my interactive job and my submission job in condor match 100%?



Hi Brando,

A common difference is that an interactive job gets an interactive (login) shell, which sources .bashrc and other login scripts (e.g. /etc/profile).

Usually, this is what you want for interactive use, since this allows e.g. "module load", shell completion and other things to work as expected.
But that is by default not done for a regular batch job, which does not run an interactive login shell.

That's likely also why "module load" does not work in the batch-job case for you: Your administrators can either set up a script which runs before the job,
explicitly setting up things such that the "module" command works, or instruct users to do "source /etc/profile".

We are currently choosing the latter approach, i.e. we tell our users to:
 source /etc/profile
for batch jobs such that they get access to the environment variables most of them expect (including setup of lmod, so "module load" works).
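As a minimal sketch of what this could look like at the top of a batch job script (the check for "module" and the variable name module_status are only illustrative, and whether /etc/profile really sets up lmod depends on how your cluster is configured):

```shell
#!/bin/bash
# Illustrative top of a batch job script; not an official HTCondor template.

# Load the system-wide login environment that an interactive
# login shell would have read automatically.
if [ -r /etc/profile ]; then
    . /etc/profile
fi

# Check whether the "module" command exists now (it only will if the
# site's profile scripts set it up, e.g. via lmod).
if command -v module >/dev/null 2>&1; then
    module_status="available"
else
    module_status="missing"
fi
echo "module command: ${module_status}"
```

If the check reports "missing", the module setup probably lives somewhere other than /etc/profile on your cluster, and your admins can tell you which file to source instead.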
Now things can be more complicated if you have a .bashrc or other files sourced by an interactive shell (we decouple home directories between the cluster and submit node, which prevents this).

Another option would be to use a jobwrapper running things in a login shell explicitly, to ensure you also use a login shell for batch jobs.

In any case, I'd also advise against GetEnv: your cluster environment may be different from the environment on the submit node,
so copying over the environment (which may also be affected by other things you ran before) is usually inviting issues or non-reproducibility.

You can also dump the output of "env" to files, sort them, and diff the result to see the actual differences more cleanly.
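A small sketch of that comparison follows. The file names are made up, and to keep the example self-contained the "batch" environment is only simulated with "env -i" (an empty environment plus the listed variables); in a real job you would run just the "env | sort" line inside the job script and copy the file back:

```shell
#!/bin/bash
# Illustrative comparison of two environments; file names are made up.
workdir=$(mktemp -d)

# Step 1: in the interactive session, dump the sorted environment.
env | sort > "$workdir/env_interactive.txt"

# Step 2: in the batch job, do the same. The batch environment is only
# simulated here: "env -i" starts from an empty environment plus the
# variables listed on the command line.
env -i PATH="$PATH" EXAMPLE_ONLY_IN_BATCH=1 env | sort > "$workdir/env_batch.txt"

# Step 3: diff the two files. diff exits with status 1 when the files
# differ, which is the expected case here.
diff "$workdir/env_interactive.txt" "$workdir/env_batch.txt" > "$workdir/env.diff" || true

echo "differences written to $workdir/env.diff"
```

Lines starting with "<" exist only in the interactive dump and lines starting with ">" only in the batch dump. Multi-line values (for example exported shell functions) can make the sorted output harder to read.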


In short:
I would expect sourcing /etc/profile removes most of the unwanted differences.
They are mainly caused by one shell being an interactive, login shell, automatically setting up things like shell completion,
while in batch mode a "clean" batch shell is used, and you have to explicitly source most environment parts.
Admins can configure this differently, but "source /etc/profile" should bring you most of the way.

Cheers and hope this helps,
	Oliver

On 25.03.21 at 17:50, brando.hpcs@xxxxxxxxx wrote:
Hi Thomas,

I am not sourcing anything in my main.sh script. I did try to do:

# module load cuda-toolkit/10.2
# module load cuda-toolkit/11.1

but the executing node didn't know about the module command, so I stopped doing that. However, I wasn't doing that in my interactive job anyway, so I don't think that is important.

Basically I don't source anything, either in my interactive job or on the executing node. Is there something I should be sourcing? I assume the interactive node sources my .bashrc file, but I assumed that using getenv sourced the right things from my .bashrc file automatically.

Btw, I did try your suggestion of comparing env. They aren't the same, but the list is massive. I am unsure if pasting it here would help. I definitely don't know what to look for in it, but it's likely the difference is there somewhere.

What do you recommend I try?

Thanks, Brando



On Thu, Mar 25, 2021 at 11:27 AM <thomas.hartmann@xxxxxxx> wrote:

    Hi Brando,

    getenv can be dangerous as the environment in your submission
    environment might not work on the executing node.

    Are you preparing the environment in your batch job the same way as
    when you run interactively? (Do you source all the same environment
    scripts etc.?)
    Maybe you can try and print your batch job's environment into your log
    file running `env` and compare with the interactive environment.

    Cheers,
     Thomas

    On 25/03/2021 16.55, brando.science@xxxxxxxxx wrote:
     > Hi,
     >
     > I am a user of a HTCondor hpc. I noticed that my pytorch jobs that use
     > cuda work just fine in the interactive mode (it seems with any version
     > of pytorch or cuda even if nvidia-smi says one version of cuda but my
     > pytorch says another) but when I try to run them in the condor_submit
     > without interactive it doesn't run. It gets into a deadlock because I
     > am trying to do parallel training (but note this does not happen in
     > interactive mode even with 4 gpus).
     >
     > My question seems simple. How do I force my condor_submit job to be
     > identical to the environment when I run it from an interactive session?
     >
     > I've tried the famous getenv flag and that didn't work for some reason.
     > I assume it is because it copies my envs from the login node instead
     > from the interactive session (but I cannot run a submission job from an
     > interactive session so I can't do it that way). Is there a way to have
     > the submission run job with exactly the same settings as an interactive
     > job? I am not a sys admin; I am only a user, if that helps.
     >
     > I've also read these two pages:
     >
     > -
     > https://htcondor.readthedocs.io/en/latest/users-manual/services-for-jobs.html?highlight=environment#environment-variables
     >
     > - https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
     >
     > and posted this question on SO:
     > https://stackoverflow.com/questions/66790905/how-do-i-have-my-interactive-job-and-my-submission-job-in-condor-match-100
     >
     >
     >
     > Thanks for your time, HTCondor users list.
     >
     >
     > Sincerely, Brando
     >
     >
     > _______________________________________________
     > HTCondor-users mailing list
     > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
     > subject: Unsubscribe
     > You can also unsubscribe by visiting
     > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
     >
     > The archives can be found at:
     > https://lists.cs.wisc.edu/archive/htcondor-users/
     >



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
