
Re: [HTCondor-users] Parallel job with flocking, condor_tail does not work, upload/download to/from a running job, slots in Claimed-Idle state, ...

On 15 Mar 2019, at 19:20, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:

I also had to write a universal (Windows & Linux) wrapper script because (as far as I understood) it is impossible to use different executables (the executable directive) in a single parallel job.

If I recall correctly, the canonical approach here is to do something like the following:

executable = my_executable_for_$$(ARCH)

so that the executable has a different name on Windows than it does on Linux.  I haven't tried this, though. :)

Indeed, it looks straightforward! I've tried this trick with the substitution "$(OPSYS)"; unfortunately, it does not seem to work. It simply expands to the OS of the host where the job is submitted. So I see that the job is running on Windows, but the "$(OPSYS)" variable in the "executable" field expands to "LINUX".
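For reference, the two substitution forms are not equivalent: "$(OPSYS)" is a submit-time macro that condor_submit expands on the submit host, whereas "$$(OPSYS)" is filled in from the ClassAd of the machine the job is matched to. A minimal sketch of the difference, assuming hypothetical executables named wrapper.LINUX and wrapper.WINDOWS in the submit directory; whether this behaves as expected for every node of a parallel universe job is not confirmed in this thread:

```
# Submit-time macro: expanded once by condor_submit on the submit host,
# so a submission from a Linux machine always yields wrapper.LINUX.
# executable = wrapper.$(OPSYS)

# Match-time substitution: taken from the matched machine's ClassAd,
# so each match should get the file named after its own OpSys value.
# (wrapper.LINUX and wrapper.WINDOWS are hypothetical file names.)
executable = wrapper.$$(OPSYS)
```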

Well, here comes the tricky part. I need to submit a job with dozens of processes like in item 1 above, but one of these processes must run on this special node like in item 2. I tried to tell this special node that it is also a "dedicated" one, but this does not seem to work. So I am stuck here. I suppose my question is the following: is it possible to submit a parallel job in such a way that one of its parallel processes flocks to a different pool?

Not as far as I know.  You may, in this case, want to consider startd flocking instead -- have the special node report to each of the pools which need to be able to run jobs on it.  (That is, add their collectors to the COLLECTOR_HOST list.)  This will probably result in the special node being matched simultaneously in multiple pools, which can have confusing results.  (It should work -- the first schedd to contact the startd will 'win' -- but it may lead to starvation if one of the schedds is consistently faster/slower than the others.)  However, since the special node will be in the pools, it will probably be accessible to parallel universe jobs.
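A minimal sketch of what this suggestion might look like in the special node's local configuration. The host names are hypothetical placeholders, and each pool's security settings would presumably also have to allow this node to advertise itself there:

```
# Local HTCondor configuration on the special node.
# Both central manager host names below are hypothetical placeholders.
# The startd then reports to the collector of every listed pool.
COLLECTOR_HOST = cm.pool-a.example.org, cm.pool-b.example.org
```

The node's daemons would need to be reconfigured or restarted for the change to take effect.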

Interesting idea, I shall try this. Where can I read about "startd flocking"? Is there some recipe? Probably I simply did not read the documentation carefully enough, but I cannot find a word about this.

To solve the previous item I tried condor_tail, and it does not seem to work at all. It simply hangs until the job finishes, then it exits reporting that there is no such job. No output is provided. I could not make it work and I do not know how to debug it. Any ideas?

Try it with a vanilla universe job first?  I don't know if condor_tail is expected to work with parallel universe jobs.

I've tried, no luck. Here is the simple submit file I used for this:

executable     = wrapper.sh
arguments      = ping -c444
universe       = vanilla
requirements   = OpSys == "LINUX"

How can I debug this?

All the best,
Alexander A. Prokhorov