
Re: [HTCondor-users] Doubts regarding submitting a job



On 9/18/2014 12:12 PM, Roshan Chaudhari wrote:
Hi,
       I have some doubts regarding job submission:


I suggest reading the HTCondor Manual (http://research.cs.wisc.edu/htcondor/manual/) and/or looking at some of the tutorials on the web site. Quick answers to your questions are below.

1. Is it possible to submit a job to # of nodes or computers?

Yes.
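
As a rough illustration of the "yes": one submit description file can queue many instances of a job, and HTCondor then runs each instance on whatever machine in the pool is available. The Python sketch below only builds the text of such a file. The executable name "my_program" and the output, error and log file names are hypothetical; the resulting file would be handed to condor_submit.

```python
def make_submit_description(executable, n_jobs, log_file="job.log"):
    """Return the text of a submit description file that queues n_jobs instances.

    The output/error file names and the default log_file are illustrative
    choices, not anything HTCondor requires.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        f"executable = {executable}",
        # $(Process) expands to 0, 1, ..., n_jobs-1, one value per queued job.
        "arguments  = $(Process)",
        "output     = out.$(Process).txt",
        "error      = err.$(Process).txt",
        # All jobs share one job event log file.
        f"log        = {log_file}",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program name; save the printed text to a file and
    # pass that file to condor_submit.
    print(make_submit_description("my_program", 10), end="")
```

Each queued instance is scheduled independently, so the ten jobs here may land on ten different machines or run one after another on a single machine, depending on what is free.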

2. how to get informed if the job finished ?

Jobs that are running or waiting to run are visible in the job queue via the "condor_q" command-line tool or the corresponding APIs (besides command-line tools, HTCondor has Python and SOAP-based interfaces, among others). When a job completes, it leaves the job queue and is no longer visible via condor_q; instead it is visible via "condor_history", which lists completed jobs. You can also request at submission time that HTCondor write events to a specified "job event log" file; when the job completes, a job completion event is written to this file.
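
One way to act on the job event log is to scan it for termination events. The sketch below assumes the plain-text log format, in which each event starts with a three-digit event code followed by the job ID in parentheses, with code 005 meaning "job terminated". The sample log text is hand-written for illustration (the addresses and times are made up), and the timestamp layout differs between HTCondor versions, so the parser ignores everything after the job ID.

```python
import re

# An event header line starts with a three-digit event code, then the job ID
# written as (cluster.proc.subproc).
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
JOB_TERMINATED = "005"


def terminated_jobs(log_text):
    """Return the set of (cluster, proc) pairs that have a termination event."""
    done = set()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match and match.group(1) == JOB_TERMINATED:
            done.add((int(match.group(2)), int(match.group(3))))
    return done


# Hand-written sample, not output captured from a real pool.
SAMPLE_LOG = """\
000 (123.000.000) 09/18 12:12:00 Job submitted from host: <192.0.2.1:9618>
...
001 (123.000.000) 09/18 12:13:05 Job executing on host: <192.0.2.2:9618>
...
005 (123.000.000) 09/18 12:20:41 Job terminated.
...
000 (123.001.000) 09/18 12:12:00 Job submitted from host: <192.0.2.1:9618>
...
"""

if __name__ == "__main__":
    print(terminated_jobs(SAMPLE_LOG))
```

In the sample, job 123.0 has terminated and job 123.1 has only been submitted, so only (123, 0) is reported. A script could re-read the log periodically and send a notification once every job it cares about appears in the set.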

3. How to kill a job ?

condor_rm removes a job from the queue, killing it first if it is running.

condor_vacate will kill a job on a specified machine, and the job will then get rescheduled to run again (perhaps someplace else).
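
If you want to drive these two tools from a script, a thin wrapper is enough. This is only a sketch: the job ID "123.0" and the machine name "node07.example.org" are made-up examples, and run() assumes the HTCondor command-line tools are installed and on the PATH.

```python
import subprocess


def rm_command(job_id):
    """Build the command line that removes one job, e.g. "123.0"."""
    return ["condor_rm", str(job_id)]


def vacate_command(machine):
    """Build the command line that vacates jobs running on the named machine."""
    return ["condor_vacate", str(machine)]


def run(command):
    """Run a command built above; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the command lines here; call run(...) to execute them.
    print(" ".join(rm_command("123.0")))
    print(" ".join(vacate_command("node07.example.org")))
```

Building the argument list separately from running it makes it easy to log or review a command before anything is actually killed.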

4. What happens to job if a user works on the computer, does it get low
priority or what?

HTCondor is very configurable in this regard. You can tell HTCondor to do any of the following: keep running the job at a low priority; kill the job and restart it from the beginning someplace else; suspend the job (i.e. stop using the CPU) and continue it when the interactive user leaves; or, if the job can be checkpointed, kill the job and resume it from where it left off.

5. What happens if a computer is offline? How long will the system wait to
declare a node "down" ?

This is configurable, but with the current defaults, expect HTCondor to take about 20 minutes to notice that a machine has crashed. If you shut the machine down cleanly, HTCondor notices right away.

6. How to put a pc back in the queue? if a system was down and you turn it
back on.


You don't need to do anything to put a PC back into a pool, and HTCondor 'notices' a machine (re)joining in about a minute. HTCondor is very good at dealing with machines dynamically leaving and joining a cluster.


Hope the above helps,
Todd