Re: [HTCondor-users] Parallel job with flocking, condor_tail does not work, upload/download to/from a running job, slots in Claimed-Idle state, ...

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On 15 Mar 2019, at 19:20, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:

I also had to write a universal (Windows & Linux) wrapper script because (as far as I understood) it is impossible to use different executables (the executable directive) in a single parallel job.

If I recall correctly, the canonical approach here is to do something like the following:

executable = my_executable_for_$$(ARCH)

so that the executable has a different name on Windows than it does on Linux. I haven't tried this, though. :)

Indeed looks straightforward! I've tried this trick with the substitution "$(OPSYS)"; unfortunately, it does not seem to work. It simply expands to the OS of the host where the job is submitted. So I see that the job is running on Windows, but the "$(OPSYS)" variable in the "executable" field expands to "LINUX".
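
A note on the substitution syntax, since the two forms behave differently: $(OPSYS) with a single dollar sign is a submit-file macro that condor_submit expands on the submit host, which matches the behaviour described above, whereas $$(OPSYS) with two dollar signs is filled in from the ClassAd of the machine the job is matched to. A minimal sketch of the double-dollar form follows. The file names are hypothetical, and whether this works for every node of a parallel universe job is not confirmed in this thread.

```
# Hypothetical file names: my_executable_for_LINUX and my_executable_for_WINDOWS
# would both have to exist in the submit directory.
# $$(OPSYS) is replaced at match time with the OpSys attribute of the matched machine.
executable = my_executable_for_$$(OPSYS)
queue
```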

Well, here comes a tricky part. I need to submit a job with dozens of processes like in item 1 above, but one of these processes must run on this special node like in item 2. I tried to tell this special node that it is also a "dedicated" one, but this does not seem to work. So I am stuck here. I suppose my question is the following: is it possible to submit a parallel job in such a way that one of these parallel processes flocks to a different pool?

Not as far as I know. You may, in this case, want to consider startd flocking instead -- have the special node report to each of the pools which need to be able to run jobs on it. (That is, add their collectors to the COLLECTOR_HOST list.) This will probably result in the special node being matched simultaneously in multiple pools, which can have confusing results. (It should work -- the first schedd to contact the startd will 'win' -- but may lead to starvation, if one of the schedds is consistently faster/slower than the others.) However, since the special node will be in the pools, it will probably be accessible to parallel universe jobs.
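
As a rough sketch of the suggestion above, the configuration on the special node might look like the following. The host names are placeholders, not real machines, and each pool's security settings would presumably also have to allow the node to advertise itself and be claimed, which is an assumption not covered in this thread.

```
# Local configuration on the special node (hypothetical collector host names).
# The startd then advertises itself to the collector of every listed pool.
COLLECTOR_HOST = cm.pool-a.example.org, cm.pool-b.example.org
```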

Interesting idea, I shall try this. Where can I read about "startd flocking"? Is there some recipe? Probably I simply did not read the documentation carefully enough, but I cannot find a word about this.

To solve the previous item I tried condor_tail, and it does not seem to work at all. It simply hangs until the job finishes, then exits reporting that there is no such job. No output is provided. I could not make it work and I do not know how to debug it. Any ideas?

Try it with a vanilla universe job first? I don't know if condor_tail is expected to work with parallel universe jobs.
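
A minimal vanilla universe test along those lines might look like this. Here print_loop.sh is a hypothetical script that writes a line to standard output every few seconds, and the output file names are placeholders.

```
# Hypothetical test job for checking condor_tail outside the parallel universe
universe   = vanilla
executable = print_loop.sh
output     = test.out
error      = test.err
log        = test.log
queue
```

While the job is running, condor_tail -follow with the job ID reported by condor_submit should then show the output as it is produced, assuming condor_tail works at all in this setup.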
