
Re: [HTCondor-users] Parallel job with flocking, condor_tail does not work, upload/download to/from a running job, slots in Claimed-Idle state, ...

I also had to write a universal (Windows & Linux) wrapper script because, as far as I understand, it is impossible to use different executables (the executable directive) in a single parallel job.

If I recall correctly, the canonical approach here is to do something like the following:

executable = my_executable_for_$$(ARCH)

so that the executable has a different name on Windows than it does on Linux. I haven't tried this, though. :)
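A slightly fuller sketch of that idea, equally untested in the parallel universe. Note that $$(ARCH) expands to the matched machine's Arch attribute (for example X86_64), which is usually the same on Windows and Linux hosts, so $$(OpSys) is probably the attribute to key on. The file names wrapper.LINUX and wrapper.WINDOWS, the machine count, and the output/log names are all made up for illustration; both wrapper files would need to exist on the submit side.

```
# Sketch only: file names and counts are hypothetical.
universe      = parallel

# $$(OpSys) is replaced at match time with the OpSys attribute of the
# matched machine, e.g. LINUX or WINDOWS.
executable    = wrapper.$$(OpSys)

# Mention OpSys explicitly; otherwise condor_submit restricts the job
# to the operating system of the submit machine.
requirements  = (OpSys == "LINUX") || (OpSys == "WINDOWS")

machine_count = 4
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output = out.$(Node)
error  = err.$(Node)
log    = parallel.log
queue
```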

Well, here comes the tricky part. I need to submit a job with dozens of processes, as in item 1 above, but one of these processes must run on this special node, as in item 2. I tried to tell this special node that it is also a "dedicated" one, but this does not seem to work, so I am stuck here. I suppose my question is the following: is it possible to submit a parallel job in such a way that one of its processes flocks to a different pool?

Not as far as I know. You may, in this case, want to consider startd flocking instead -- have the special node report to each of the pools which need to be able to run jobs on it. (That is, add their collectors to the COLLECTOR_HOST list.) This will probably result in the special node being matched simultaneously in multiple pools, which can have confusing results. (It should work -- the first schedd to contact the startd will 'win' -- but it may lead to starvation if one of the schedds is consistently faster or slower than the others.) However, since the special node will be in each of those pools, it will probably be accessible to parallel universe jobs.
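For concreteness, the change on the special node might look roughly like the following. The host names are placeholders, and this assumes each pool's central manager is already configured to accept startd ads from this machine (for example via its ALLOW_ADVERTISE_STARTD setting), which I have not checked for your setup.

```
# Local configuration on the special node only.
# Host names below are placeholders, not real machines.
# The startd will send its machine ad to every collector listed here.
COLLECTOR_HOST = cm.pool-a.example.org, cm.pool-b.example.org
```

After changing this, a condor_reconfig on the node (or a restart of its daemons) should be needed for the startd to begin advertising to the additional collector.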

To solve the previous item I tried condor_tail, and it does not seem to work at all. It simply hangs until the job finishes, then exits, reporting that there is no such job. No output is provided. I could not make it work and I do not know how to debug it. Any ideas?

Try it with a vanilla universe job first? I don't know if condor_tail is expected to work with parallel universe jobs.
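Something like this minimal test job would do, assuming a Linux execute node with /bin/sh available; the file names are made up. It prints a line every ten seconds so there is something for condor_tail to show while the job is still running.

```
# tail_test.sub -- hypothetical vanilla-universe test job
universe     = vanilla
requirements = (OpSys == "LINUX")

# Use the shell already installed on the execute node.
executable          = /bin/sh
transfer_executable = false
arguments           = "-c 'for i in 1 2 3 4 5 6; do echo tick; sleep 10; done'"

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output = tail_test.out
error  = tail_test.err
log    = tail_test.log
queue

# Then, while the job is running (replace <cluster> with the real ID):
#   condor_submit tail_test.sub
#   condor_tail -follow <cluster>.0
```

If that shows the "tick" lines as they are produced, the problem is more likely specific to the parallel universe than to your condor_tail setup.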

- ToddM