
Re: [Condor-users] Can a job send a trigger to let other jobs start?

The jobs (usually 2000-4000) are started via dagman and initially read a lot of data (about 2-3 GByte per job). After that they crunch through the loaded data for a couple of hours. This initial start-up phase puts quite a lot of load on the central data server, so we would like to have a handle to limit it.

This could be addressed with dagman's maxjobs feature; however, that would only start new jobs after the first batch of jobs is done. So my question is: is there a way to limit the initial number of jobs and send a "trigger" to dagman to start more jobs once the running jobs have finished loading their data sets?

What a great question! You could use a DAGMan PRE script on each node that polls the data server's load and, as long as it is above some threshold, sleeps for a random period and re-polls. The script could perhaps poll the data server's load directly, if there's a way to do that. Or it could run condor_q and count the jobs that have been running for less than an hour (if the startup phase takes about an hour). Or the jobs themselves could use chirp or condor_qedit to set a job attribute in the schedd indicating which phase they are in, and the PRE script could poll for that.
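
For the condor_q variant, here is a minimal sketch of such a PRE script in Python. The one-hour startup estimate, the MAX_LOADING limit, and the back-off intervals are assumptions to tune for your setup; it simply counts running jobs whose JobCurrentStartDate is recent and only exits (letting the node start) once that count drops below the limit.

    #!/usr/bin/env python
    # Hypothetical DAGMan PRE script: block a node until the number of jobs
    # presumed to still be in their data-loading phase drops below a limit.
    # MAX_LOADING, STARTUP_SECS and the sleep range are assumptions.

    import random
    import subprocess
    import sys
    import time

    MAX_LOADING = 50      # how many jobs may load data concurrently (assumed)
    STARTUP_SECS = 3600   # guess: a job spends roughly one hour loading data

    def jobs_still_loading():
        """Count running jobs that started less than STARTUP_SECS ago."""
        out = subprocess.check_output(
            ["condor_q", "-constraint", "JobStatus == 2",
             "-format", "%d\n", "JobCurrentStartDate"]).decode()
        now = time.time()
        return sum(1 for tok in out.split() if now - int(tok) < STARTUP_SECS)

    def main():
        while jobs_still_loading() >= MAX_LOADING:
            # Back off for a random interval so thousands of PRE scripts
            # don't all hit the schedd again at the same moment.
            time.sleep(random.uniform(60, 300))
        sys.exit(0)

    if __name__ == "__main__":
        main()

You would then attach it to each node in the DAG file with something like "SCRIPT PRE <node> throttle_pre.py" (script name here is just a placeholder).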

-Greg