
Re: [HTCondor-users] How do I have my interactive job and my submission job in condor match 100%?



My sincere apologies for the spam, everyone. The exact main.sh I ran is actually this one (without the module loads, since the interactive job doesn't need them):

#!/bin/bash -l

echo JOB STARTED

source /etc/profile
source ~/.bashrc
source ~/.bash_profile

# module load cuda-toolkit/10.2
# module load cuda-toolkit/11.1

#/usr/local/cuda/bin:/home/miranda9/miniconda3/envs/automl-meta-learning/bin:/home/miranda9/miniconda3/condabin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/home/miranda9/my_bins:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/miranda9/my_bins:/home/miranda9/bin
#/usr/local/cuda/bin:/home/miranda9/miniconda3/envs/automl-meta-learning/bin:/home/miranda9/miniconda3/condabin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/my_bins:/bin:/my_bins:/my_bins:/bin

nvidia-smi
conda list
echo $PATH
which python
# env

# - run script
python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
# python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py --debug --num_epochs 5 --batch_size 2 --term_encoder_embedding_dim 8

echo JOB ENDED

On Thu, Mar 25, 2021 at 1:55 PM Brando Miranda <brando.hpcs@xxxxxxxxx> wrote:
Hi Oliver,

Thank you for your kind suggestions. I appreciate it.

Sadly, to my surprise, it didn't seem to work. Let me share the exact main.sh script I used:
```
#!/bin/bash -l

echo JOB STARTED

source /etc/profile
source ~/.bashrc
source ~/.bash_profile

module load cuda-toolkit/10.2
module load cuda-toolkit/11.1

# echo $PATH
nvidia-smi
conda list
which python
# env

# - run script
python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
# python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py --debug --num_epochs 5 --batch_size 2 --term_encoder_embedding_dim 8

echo JOB ENDED
```
Though, I think there is no other way except to perhaps try zsh or compare the envs. I will try zsh first since that seems less overwhelming (I have already compared the envs with diff and with PyCharm's GUI, but it still looks like too much for me to digest, especially given my lack of sys admin expertise).
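
A minimal sketch of the sorted env comparison suggested earlier in this thread, which prints only the lines that differ instead of the full listing. The function name env_diff and the file names are made up for illustration, and variables whose values span several lines will not compare cleanly this way.

```shell
# Hypothetical helper: show only the lines that differ between two env dumps.
# Lines unique to the first file are printed as-is; lines unique to the
# second file are printed with a leading tab (standard comm behaviour).
env_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm requires sorted input; use the same collation for sort and comm.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    # -3 suppresses the column of lines common to both files.
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

One way to use it: run `env > interactive_env.txt` in the interactive session, add `env > batch_env.txt` to main.sh, and then call `env_diff interactive_env.txt batch_env.txt`.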

I am really puzzled at what it might be since the version of python seems to be fine. The two environments seem pretty similar, but I did notice that the PATH variables are not identical:

#/usr/local/cuda/bin:/home/miranda9/miniconda3/envs/automl-meta-learning/bin:/home/miranda9/miniconda3/condabin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/home/miranda9/my_bins:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/miranda9/my_bins:/home/miranda9/bin
#/usr/local/cuda/bin:/home/miranda9/miniconda3/envs/automl-meta-learning/bin:/home/miranda9/miniconda3/condabin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/my_bins:/bin:/my_bins:/my_bins:/bin

The longer one seems to come from the interactive job. Is there a way for me to source everything exactly so that my submission job looks like an interactive job? I think that would solve my problem, since it consistently runs fine in interactive mode with nearly every single combination of pytorch & cuda I've tried so far.
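
Two long PATH strings are easier to compare entry by entry than by eye. The sketch below splits each string on ":" and lists the directories that appear in one but not the other; the function names path_entries and path_only_in_first are made up for illustration.

```shell
# Print the entries of a PATH-style string, one per line, sorted and deduplicated.
path_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | LC_ALL=C sort -u
}

# Print entries that occur in the first PATH string but not in the second.
path_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    path_entries "$1" > "$a"
    path_entries "$2" > "$b"
    # -23 keeps only the column of lines unique to the first file.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Applied to the two strings quoted above, it should report /home/miranda9/bin and /home/miranda9/my_bins as missing from the shorter one, and /my_bins as present only there. That pattern would be consistent with $HOME being empty when the startup files ran in the batch job, though nothing in this thread confirms that.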

If you have any other suggestions feel free to let me know.


Thanks again.

Sincerely, Brando

PS: note that the flag in #!/bin/bash -l is a lowercase L, as in Laura.


On Thu, Mar 25, 2021 at 1:34 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
Hi Brando,

On 25.03.21 at 19:17, Brando Miranda wrote:
> Hi Oliver,
>
> Thank you very much for such a friendly and thorough response. Just to clarify since I am not a bash or sys admin expert would the following `main.sh` that I set as executable with the following content work?

Yes, I would expect this to work. If you also have ~/.bash_profile, ~/.bash_login or ~/.profile, you may also have to source these.
Bash is quite thorough in looking for files to source during startup (some inheritance from the long evolution of shells...).
Another option would be to use:

#!/bin/bash -l

as the first line, and drop the source statements (but keep the "source ~/.bashrc"). This will cause the batch job's bash to be a login shell, so it should behave (almost[0]) the same as in an interactive job.
I would personally still go with the explicit "source" of the parts you know you need, to ensure you explicitly know which files may have an effect on it.

You can find most of these nitty-gritty details with "man bash", it's quite lengthy, but if you search for "-i" and "-l", you can find the interesting parts which list these files ;-).

Cheers,
    Oliver

[0] "Almost" since we only pass -l to make it a login shell, not -i to make it an interactive shell. Interactive shells source .bashrc in addition,
    but the "-i" may also cause some programs to behave as if a user is present, which you do not want in a batch job.
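
To see which of these modes a given shell is actually running in, something like the following could be placed near the top of main.sh and also run in the interactive session. This is only a sketch: the function name shell_mode is made up here, while `shopt -q login_shell` and the `i` flag in `$-` are standard bash.

```shell
# Report whether the current shell is a login shell and whether it is interactive.
shell_mode() {
    # login_shell is a read-only bash option set when bash was started as a login shell.
    if shopt -q login_shell 2>/dev/null; then
        echo "login shell: yes"
    else
        echo "login shell: no"
    fi
    # $- holds the current option flags; it contains "i" for interactive shells.
    case $- in
        *i*) echo "interactive: yes" ;;
        *)   echo "interactive: no" ;;
    esac
}

shell_mode
```

Comparing the two lines printed in the batch job's output file with those from the interactive session shows whether the "-l" shebang had the intended effect.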

>
> ```
>
> #!/bin/bash
>
> echo JOB STARTED
>
> source ~/.bashrc
> source /etc/profile
>
> # - run script
> python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
> # python ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py --debug --num_epochs 5 --batch_size 2 --term_encoder_embedding_dim 8
>
> echo JOB ENDED
>
> ```
>
> is that what you would expect to usually work? (I am aware my cluster might be different but I just wanted to make sure I tried your suggestion exactly as expected to avoid problems).
>
> Thanks again and I will let the thread know if this worked.
>
>
> Sincerely, Brando
>
> On Thu, Mar 25, 2021 at 12:44 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
>     Hi Brando,
>
>     a common difference is that an interactive job will have an interactive (login) shell, which will source .bashrc and also other logon scripts (e.g. /etc/profile).
>
>     Usually, this is what you want for interactive use, since this allows e.g. "module load", shell completion and other things to work as expected.
>     But that is by default not done for a regular batch job, which does not run an interactive login shell.
>
>     That's likely also why "module load" does not work in the batch-job case for you: Your administrators can either set up a script which runs before the job,
>     explicitly setting up things such that the "module" command works, or instruct users to do "source /etc/profile".
>
>     We are currently choosing the latter approach, i.e. we tell our users to:
>         source /etc/profile
>     for batch jobs such that they get access to the environment variables most of them expect (including setup of lmod, so "module load" works).
>     Now things can be more complicated if you have a .bashrc or other files sourced by an interactive shell (we decouple home directories between the cluster and submit node, which prevents this).
>
>     Another option would be to use a jobwrapper running things in a login shell explicitly, to ensure you also use a login shell for batch jobs.
>
>     In any case, I'd also advise against GetEnv: your cluster environment may be different from the environment on the submit node,
>     so copying over the environment (which may also be affected by other things you ran before) is usually inviting issues or non-reproducibility.
>
>     You can also dump the output of "env" to files, sort them, and diff the result to see the actual differences more cleanly.
>
>
>     In short:
>     I would expect sourcing /etc/profile removes most of the unwanted differences.
>     They are mainly caused by one shell being an interactive, login shell, automatically setting up things like shell completion,
>     while in batch mode a "clean" batch shell is used, and you have to explicitly source most environment parts.
>     Admins can configure this differently, "source /etc/profile" should bring you most of the way.
>
>     Cheers and hope this helps,
>             Oliver
>
>     On 25.03.21 at 17:50, brando.hpcs@xxxxxxxxx wrote:
>     > Hi Thomas,
>     >
>     > I am not sourcing anything in my main.sh script. I did try to do:
>     >
>     > # module load cuda-toolkit/10.2
>     > # module load cuda-toolkit/11.1
>     >
>     > but the executing node didn't know about the module command, so I stopped doing that. However, I wasn't doing that in my interactive job anyway so I don't think that is important.
>     >
>     > Basically I don't source anything when I run my interactive job or my executing node. Is there something I should be sourcing? I assume the interactive node sources my .bashrc file, but I assumed that using getenv sourced the right things from my bashrc file automatically.
>     >
>     > Btw, I did try your suggestion of comparing env. They aren't the same but the list is massive. I am unsure if pasting it here would help. I definitely don't know what to look for in it but it's likely the difference is there somewhere.
>     >
>     > What do you recommend I try?
>     >
>     > Thanks, Brando
>     >
>     >
>Â Â Â >
>     > On Thu, Mar 25, 2021 at 11:27 AM <thomas.hartmann@xxxxxxx> wrote:
>     >
>     >     Hi Brando,
>     >
>     >     getenv can be dangerous as the environment in your submission
>     >     environment might not work on the executing node.
>     >
>     >     Are you preparing the environment in your batch job the same way as you
>     >     set it up compared to when you run interactively? (do you source all the
>     >     same environment scripts etc.?)
>     >     Maybe you can try and print your batch job's environment into your log
>     >     file running `env` and compare with the interactive environment.
>     >
>     >     Cheers,
>     >         Thomas
>     >
>     >     On 25/03/2021 16.55, brando.science@xxxxxxxxx wrote:
>     >     > Hi,
>     >     >
>     >     > I am a user of an HTCondor HPC. I noticed that my pytorch jobs that use
>     >     > cuda work just fine in the interactive mode (it seems with any version
>     >     > of pytorch or cuda even if nvidia-smi says one version of cuda but my
>     >     > pytorch says another) but when I try to run them in the condor_submit
>     >     > without interactive it doesn't run. It gets into a deadlock because I
>     >     > am trying to do parallel training (but note this does not happen in
>     >     > interactive mode even with 4 gpus).
>     >     >
>     >     > My question seems simple. How do I force my condor_submit job to be
>     >     > identical to the environment when I run it from an interactive session?
>     >     >
>     >     > I've tried the famous getenv flag and that didn't work for some reason.
>     >     > I assume it is because it copies my envs from the login node instead
>     >     > of from the interactive session (but I cannot run a submission job from an
>     >     > interactive session so I can't do it that way). Is there a way to have
>     >     > the submission job run with exactly the same settings as an interactive
>     >     > job? I am not a sys admin; I am only a user if that helps.
>     >     >
>     >     > I've also read these two pages:
>     >     >
>     >     > - https://htcondor.readthedocs.io/en/latest/users-manual/services-for-jobs.html?highlight=environment#environment-variables
>     >     > - https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
>     >     >
>     >     > and posted this question on SO:
>     >     > https://stackoverflow.com/questions/66790905/how-do-i-have-my-interactive-job-and-my-submission-job-in-condor-match-100
>     >     >
>     >     > Thanks for your time HTCondor users list.
>     >     >
>     >     >
>     >     > Sincerely, Brando
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > HTCondor-users mailing list
>     >     > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>     >     > subject: Unsubscribe
>     >     > You can also unsubscribe by visiting
>     >     > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>     >     >
>     >     > The archives can be found at:
>     >     > https://lists.cs.wisc.edu/archive/htcondor-users/
>     >     >
>
>
>     --
>     Oliver Freyermuth
>     Universität Bonn
>     Physikalisches Institut, Raum 1.047
>     Nußallee 12
>     53115 Bonn
>     --
>     Tel.: +49 228 73 2367
>     Fax:  +49 228 73 7869
>     --
>


--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--