
Re: [HTCondor-users] Job not starting correctly



Peter,

Is HTCondor able to create the output and error files specified in your job, and can you modify the runscript on the execute host (or on one targeted execute host) to print some information to stdout or stderr? It could be useful to have the runscript print the environment on the line just before the solver runs, and to compare the result between interactive and batch modes. Also, consider having the runscript print each command as it executes, to see whether the script exits before it reaches the solver.
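
A minimal sketch of what such a debugging preamble could look like, assuming the runscript is a bash script. The function name dump_env and the dump location under /tmp are placeholders chosen for illustration, not part of the original runscript.

```shell
#!/bin/bash
# Hypothetical debugging preamble for the runscript.
# dump_env and the /tmp location are placeholders, not from the original script.

# Write a one-line header plus the sorted environment to the given file.
dump_env() {
    out="$1"
    {
        echo "# host=$(uname -n) user=$(id -un) cwd=$(pwd)"
        env | sort
    } > "$out"
}

# One dump per process, so batch and interactive runs do not overwrite each other.
dump_env "${TMPDIR:-/tmp}/runscript-env.$$.txt"

# Print every command to stderr before it runs; the last traced line in the
# job's error file then shows how far the script got.
set -x

# ... the existing runscript body (sourcing the license files, starting the
# solver) would follow here, unchanged ...
```

With tracing on, the job's error file should contain one line per executed command, so an early exit shows up as a trace that stops before the solver line.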

Jason

On 10/27/21 3:25 AM, Peter Ellevseth wrote:
Christoph

The runscript uses only absolute paths.

We just got a new version of this code, and I get this problem with the new version but not with the old one. I checked ldd on the binaries of both versions and got the same result.

I have discussed this with the supplier of the CFD code, and they didn't have any good suggestions yet.

P

*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> *On Behalf Of *Beyer, Christoph
*Sent:* Wednesday, 27 October 2021 08:24
*To:* htcondor-users <htcondor-users@xxxxxxxxxxx>
*Subject:* Re: [HTCondor-users] Job not starting correctly

Hi,

make sure all the paths you need are set in the bash script or use absolute paths if in doubt. The interactive login uses ssh mechanisms and therefore sources your environment which is not necessarily the case in a regular condor job.

Try ldd <binary> to check if the libraries the binary uses are hidden somewhere and put all these paths in your bash script (LD_LIBRARY_PATH etc) ...
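
A rough sketch of that check, assuming a Linux execute node with ldd available. The function name missing_libs and the SOLVER variable are made up for illustration, and /bin/sh only stands in for the real solver binary, whose path the thread does not give.

```shell
#!/bin/bash
# List shared libraries that the dynamic linker cannot resolve for a binary.
# Returns success (0) only if at least one "not found" line was printed.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# SOLVER is a placeholder; /bin/sh is used here only so the sketch runs as-is.
SOLVER="${SOLVER:-/bin/sh}"

if missing_libs "$SOLVER"; then
    echo "unresolved libraries for $SOLVER" >&2
else
    echo "no unresolved libraries reported for $SOLVER"
fi

# If something is missing, the directory holding it could be prepended in the
# runscript, for example (hypothetical path):
# export LD_LIBRARY_PATH="/path/to/solver/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Running this from inside the batch job rather than from an ssh session matters, since the point is to see what the linker resolves under the job's environment.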

best

christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

------------------------------------------------------------------------

*From: *"Peter Ellevseth" <Peter.Ellevseth@xxxxxxxxxx>
*To: *"htcondor-users" <htcondor-users@xxxxxxxxxxx>
*Sent: *Tuesday, 26 October 2021 20:26:08
*Subject: *Re: [HTCondor-users] Job not starting correctly

Jason

We have a shared file system between all nodes. When I run condor_submit -interactive I get a shell in the same folder I was in previously, but from the "view" of the execute node. I can then execute simply with "./runscript".

Yes, I get the normal log/out/error files.

I have checked the env and there is nothing there that tells me why the job won't start.

I can also ssh to one of my startd machines and start the job manually with the runscript.

I am at a loss for ideas here now.

P

*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> *On Behalf Of *Jason Patton
*Sent:* Tuesday, 4 May 2021 14:43
*To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
*Subject:* Re: [HTCondor-users] Job not starting correctly

Hi Peter,

You say that when you submit an interactive job, you run the script by doing "./runscript". Do your jobs ever use condor file transfer or is your pool set up to assume a shared file system?

When you submit the job normally, do you still get back the output (stdout) and error (stderr) files? It might be useful to print out the environment at the very beginning of the script and compare between a normal job and an interactive job.
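
One way to do that comparison, sketched under the assumption that each run has written its environment (for instance the output of env) to a file. The function name env_diff and the two file names are placeholders.

```shell
#!/bin/bash
# Print the lines that appear in only one of two environment dumps.
env_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # Sort both files in the same locale so comm sees consistent ordering.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    echo "--- only in $1"
    LC_ALL=C comm -23 "$a" "$b"
    echo "--- only in $2"
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}

# Placeholder file names: one dump from a normal job, one from an interactive job.
BATCH_ENV="${BATCH_ENV:-env.batch.txt}"
INTERACTIVE_ENV="${INTERACTIVE_ENV:-env.interactive.txt}"

if [ -r "$BATCH_ENV" ] && [ -r "$INTERACTIVE_ENV" ]; then
    env_diff "$BATCH_ENV" "$INTERACTIVE_ENV"
fi
```

A variable whose value differs shows up in both sections, once with each value, which makes changed paths easy to spot next to variables that are missing outright.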

Jason Patton

On Mon, May 3, 2021 at 5:04 PM Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx> wrote:

    Gents

    We are running a commercial CFD code via HTCondor. We have been doing it for years without any issues. I installed a new version of that software and want to run it via HTCondor as usual. I do this by telling condor to run a locally installed bash script on the execute node, which in turn starts the CFD solver. I have to do it this way to source some files needed by the solver to start (license etc.).

    However, the new version is refusing to start. From the StarterLog.slotX I see the job immediately stops with

    05/03/21 23:56:33 (pid:4135578) Create_Process succeeded, pid=4135579

    05/03/21 23:56:33 (pid:4135578) Process exited, pid=4135579, status=139

    05/03/21 23:56:33 (pid:4135578) Got SIGQUIT. Performing fast shutdown.

    If I ssh in to one of the execute nodes I can start it just fine and it runs as normal.

    If I do condor_submit -interactive my_submit_file, I am able to run the script with ./runscript just fine.

    Then why won't it start when I submit the file normally?

    Peter
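
The status=139 in the StarterLog lines above is worth decoding. Read under the common shell convention of 128 plus a signal number, 139 corresponds to signal 11, which on Linux is SIGSEGV (a segmentation fault). Whether the StarterLog value follows exactly that convention is an assumption here. The helper name decode_status below is made up for illustration.

```shell
#!/bin/bash
# Interpret a shell-style exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
decode_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "status $status: terminated by signal $sig ($(kill -l "$sig"))"
    else
        echo "status $status: normal exit with code $status"
    fi
}

decode_status 139
```

If the solver is indeed crashing with a segmentation fault, that would point to the binary dying right after launch rather than to the script failing to find it.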

    _______________________________________________
    HTCondor-users mailing list
    To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/htcondor-users/
