[HTCondor-users] Parallel job with flocking, condor_tail does not work, upload/download to/from a running job, slots in Claimed-Idle state, ...



Dear Colleagues,

I am continuing to evaluate HTCondor for our company's needs. I have successfully set HTCondor up on a set of virtual machines and reproduced some tricky target use cases. HTCondor indeed seems to be a mature product, which is not surprising given that it has been under development since the 1980s.

Anyway, I would like to share what I've achieved and ask some tricky questions. Could you please help me with them? Any comments or advice are welcome.

  1. The typical use case is to submit a job with a dozen processes that must run simultaneously on different machines with different requirements (e.g. 3xWindows + 4xLinux). The solution I've found and tried is to use "universe = parallel" with multiple sections in a submit file, where each section specifies its own machine_count, arguments and requirements. I also had to write a universal (Windows and Linux) wrapper script because, as far as I understand, it is impossible to use different executables (the executable directive) in a single parallel job. I've implemented two wrappers: one in Python, the other as a hack that combines BAT and Bash in one script (rather dark magic, though). It seems to work fine.
  2. Another use case is to have a "special" node with some node-locked (e.g. by license) piece of software. This node must be available for jobs from multiple pools. I've implemented this with flocking: I've set up a single-node pool on this machine and allowed other pools to flock jobs to it. I've tried this with a single pool only, but I suppose it should also work when many pools flock jobs to this special node.
  3. Here comes the tricky part. I need to submit a job with dozens of processes, as in item 1 above, but one of these processes must run on the special node from item 2. I tried to configure this special node as a "dedicated" one as well, but this does not seem to work, so I am stuck here. My question is the following: is it possible to submit a parallel job in such a way that one of its parallel processes flocks to a different pool?
  4. Another tricky part is uploading/downloading files to/from the node the job is running on. As I already mentioned, each process communicates with some server, and sometimes it is necessary to transfer files to/from the submit node while the job is running. Is there a way to do this? There is clearly an upload/download mechanism in HTCondor, since it transfers files before the job starts and after it finishes (when there is no shared file system). Can I somehow use this mechanism to upload/download files while the job is running? Is there an API or command-line tool for this?
  5. To solve the previous item I tried condor_tail, and it does not seem to work at all. It simply hangs until the job finishes, then exits, reporting that there is no such job. No output is produced. I could not make it work and I do not know how to debug it. Any ideas?
  6. Is it possible to somehow list the slots available in external pools? I can see the slots in my own pool, but I cannot see the slots available in the pools my pool can flock to. This is strange, because condor_q shows that the job is running, but I cannot see where: condor_status reports only the nodes/slots in my pool.
  7. A suspicious issue: several times during my experiments I ended up in a state in which all slots were "Claimed-Idle". They stayed in this state for a rather long time (approximately half an hour to an hour). After that they woke up and continued processing jobs. I am still not sure how to reproduce this. It is probably connected to restarting the HTCondor central manager (systemctl restart condor), but I am not 100% sure. Again, any ideas?
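
To make item 1 more concrete, here is a minimal sketch of the kind of universal Python wrapper described there. It is only an illustration under assumptions, not the actual script: the COMMANDS table, the worker.bat and worker.sh names, and the build_command and main helpers are all hypothetical placeholders.

```python
import platform
import subprocess
import sys

# Hypothetical mapping from OS name (as returned by platform.system())
# to the real program to start. The file names are placeholders only.
COMMANDS = {
    "Windows": ["cmd", "/c", "worker.bat"],
    "Linux": ["./worker.sh"],
}


def build_command(system, extra_args):
    """Return the full command line for the given OS name.

    extra_args are the per-section arguments passed through unchanged.
    """
    base = COMMANDS.get(system)
    if base is None:
        raise RuntimeError("unsupported platform: %s" % system)
    return base + list(extra_args)


def main(argv):
    """Pick the platform-specific command, run it, return its exit code."""
    cmd = build_command(platform.system(), argv[1:])
    return subprocess.call(cmd)
```

A real wrapper would end with sys.exit(main(sys.argv)) so that the exit code of the underlying program becomes the exit code of the job. The same script can then be named as the single executable of the parallel job, with each section of the submit file supplying its own arguments.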

Thanks in advance.

All the best,
Alexander A. Prokhorov