
Re: [HTCondor-users] Parallel job with flocking, condor_tail does not work, upload/download to/from a running job, slots in Claimed-Idle state, ...



Alexander,

I can try to answer a couple of these (sorry if the numbering is off, blame my email client):

  1. Another tricky part is uploading/downloading files to/from the node the job is running on. As I already mentioned, each process communicates with some server, and sometimes it is necessary to transfer files to/from the submit node while the job is running. Is there a way to do this? HTCondor definitely has upload and download mechanisms, since it transfers files before running the job and after it finishes (in case there is no shared file system). Can I somehow use this mechanism to upload/download files while the job is running? Is there an API or command-line tool for this?
The tool for this is "condor_chirp": http://research.cs.wisc.edu/htcondor/manual/v8.8/Condorchirp.html

Parallel universe jobs have a few extra environment variables set for them that you might find useful when using chirp:
_CONDOR_PROCNO: each node of the job has a specific number, running from 0 to machine_count - 1, and this variable contains that number.
_CONDOR_REMOTE_SPOOL_DIR: a scratch directory on the submit node specific to this parallel universe job, good for sharing files among the nodes (using chirp).
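
For example, a node's executable could combine those variables with condor_chirp to stage files through the submit node while the job is running. This is only a rough sketch; the script layout and the file names (shared_input.dat, partial_result.dat) are assumptions for illustration, not anything specific to your setup:

    #!/bin/sh
    # Runs on every node of the parallel universe job.
    # condor_chirp usually lives in HTCondor's libexec directory; if it is not
    # on PATH, `condor_config_val LIBEXEC` will tell you where to find it.
    NODE=$_CONDOR_PROCNO

    # Pull a shared input file from the job's spool directory on the submit node.
    condor_chirp fetch "$_CONDOR_REMOTE_SPOOL_DIR/shared_input.dat" shared_input.dat

    # Stand-in for the real work; produces this node's partial result.
    echo "result from node $NODE" > partial_result.dat

    # Push this node's result back to the submit node while the job is still running.
    condor_chirp put partial_result.dat "$_CONDOR_REMOTE_SPOOL_DIR/partial_result.$NODE.dat"

Other nodes (or tools on the submit side) can then pick those files up from the same directory with another condor_chirp fetch. If I remember right, the submit file may also need +WantIOProxy = true for chirp to be allowed.
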
  2. Is it possible to somehow list the slots available in external pools? I can see the slots in my own pool, but I cannot see the slots available in the pools my pool can flock to. It is strange, because condor_q shows that the job is running, but I cannot see where; condor_status only reports the nodes/slots in my own pool.
condor_status -pool <hostname/ip of external pool's central manager>
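
For example, with a placeholder hostname for the external pool's central manager:

    condor_status -pool cm.external-pool.example.org

And to see where a particular job actually landed, "condor_q -run" prints the remote host for each running job; as far as I know that also covers jobs that flocked to another pool.
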
  3. A suspicious issue is that during my experiments I several times ended up in a state where all slots were "Claimed - Idle". They stayed in this state for a rather long time (approximately half an hour or an hour). After that they woke up and continued processing jobs. I am still not sure how to reproduce this. It is probably connected to restarting the central HTCondor manager (systemctl restart condor), but I am not 100% sure. Again, any ideas?
When using the dedicated scheduler (i.e. when submitting parallel universe jobs), the dedicated scheduler is configured to keep claims on any resources it gets, for a configurable amount of time, in case other jobs needing dedicated resources are sitting in the queue or are submitted shortly after those resources go idle. The config setting to adjust here is "UNUSED_CLAIM_TIMEOUT". This should be 10 minutes by default, so I'm not sure why you're seeing half an hour.
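
If you want to shorten that window, a minimal sketch of the change (the 5-minute value and the file name are just examples) would be, in the configuration of the submit machine that runs the dedicated scheduler:

    # e.g. /etc/condor/config.d/99-dedicated-scheduler.conf
    # UNUSED_CLAIM_TIMEOUT is in seconds; the default is 600 (10 minutes).
    UNUSED_CLAIM_TIMEOUT = 300

followed by a condor_reconfig so the schedd picks it up.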

Also keep in mind that if you have any idle parallel universe jobs in your queue, the dedicated scheduler is going to try its best to claim resources for each of those jobs, and those resources are going to be claimed/idle until the scheduler is able to claim enough resources for the job to start.

Hopefully some others can chime in on the other questions!

Jason Patton


Thanks in advance.

All the best,
Alexander A. Prokhorov


