
Re: [Condor-users] 7.8 Upgrade Issues



Hi Todd,

Sorry for the delay in answering: I was traveling with little internet access.

I have investigated the issue further, and the behavior I am seeing can be traced to the Starter taking a long time to quit and a new job being assigned to the same slot in the meantime. In those cases the Shadow waits for 20 seconds and then gives up, and the corresponding job is set back from Running to Idle.

With 7.6.1 I always got:
03/02/12 14:30:01 Process exited, pid=22818, status=0
03/02/12 14:30:01 Returning from CStarter::JobReaper()
03/02/12 14:30:02 Got SIGQUIT.  Performing fast shutdown.
03/02/12 14:30:02 ShutdownFast all jobs.
03/02/12 14:30:02 **** condor_starter (condor_STARTER) pid 22813 EXITING WITH STATUS 0

With 7.8.1 I often get:
06/19/12 17:20:11 Process exited, pid=9255, status=0
06/19/12 17:20:11 Got SIGQUIT.  Performing fast shutdown.
06/19/12 17:20:11 ShutdownFast all jobs.
06/19/12 17:20:37 Got SIGTERM. Performing graceful shutdown.
06/19/12 17:20:37 ShutdownGraceful all jobs.
06/19/12 17:20:37 **** condor_starter (condor_STARTER) pid 9188 EXITING WITH STATUS 0

This likely has to do with using job hooks, since when I do not use them I get the normal behavior. In both cases (i.e. with and without job hooks) I do get the weird "job 0 of a cluster runs while the others wait a few seconds" behavior, which I do not understand.

Anyway, I have made a somewhat self-contained test case. The content of the "copy_to_tmp" folder should be copied to /tmp, with the exception of the job hook, which should be placed somewhere else. My hook config looks like this:

TEST_HOOK_PREPARE_JOB           = /jwstsw/version/development/bin/test_job_hook.py
TEST_HOOK_JOB_EXIT              = /jwstsw/version/development/bin/test_job_hook.py
TEST_HOOK_UPDATE_JOB_INFO       = /jwstsw/version/development/bin/test_job_hook.py
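
For context, the hook itself does not need to do much. A minimal hook of this kind could look something like the following (an illustrative stand-in only, not the exact script in the attached tarball), assuming the standard starter hook protocol: the starter writes the job ClassAd to the hook's stdin, and a zero exit status lets the job proceed.

#!/usr/bin/env python
# Illustrative stand-in for test_job_hook.py (the real script is in the
# attached tarball).  Assumes the standard starter hook protocol: the job
# ClassAd arrives on stdin and exit status 0 tells the starter to continue.
import sys

def main():
    job_ad = sys.stdin.read()  # job ClassAd, one "Attr = Value" line per attribute
    with open("/tmp/test_job_hook.log", "a") as log:
        log.write("hook invoked, %d bytes of ClassAd read\n" % len(job_ad))
    # For the PREPARE_JOB hook, anything printed to stdout would be merged
    # back into the job ClassAd; this sketch prints nothing.
    return 0

if __name__ == "__main__":
    sys.exit(main())

The TEST prefix on the knobs above is the hook keyword; if I am reading the manual correctly, jobs select it with +HookKeyword = "TEST" in the submit file, or it can be set machine-wide with STARTD_JOB_HOOK_KEYWORD.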


Thanks!
Francesco


On Jun 21, 2012, at 11:47 AM, Todd Tannenbaum wrote:

> On 6/20/2012 8:40 PM, Francesco Pierfederici wrote:
>> Hi,
>>
>> I just upgraded a small development cluster from 7.6.1 to 7.8.1 using the YUM repository. I noticed two things:
>>
>> 1. The upgrade wiped my local configs (both condor_config.local and all the files in /etc/condor/config.d/). Not a big deal to me, but something worth considering for future releases.
>>
>> 2. I am now seeing a lot more jobs in idle state than before. I am running a test DAG which at some point starts a cluster of 4 jobs. 3 of them start (jobs 1, 2, 3) while job 0 stays idle and is picked up at a later negotiation cycle. The next node in the DAG is then scheduled and stays idle for several negotiation cycles before being executed.
>>
>> Am I doing something wrong or is there some configuration knob I should be looking into? I attach my config file as well as the Negotiator log, in case it helps.
>>
>>
>> Thanks!
>> Francesco
>>
>
> Is there a small test case you can send along to reproduce the problem?  For instance, re the above, are those the only four jobs submitted? Could you share submit file(s) that reproduce the problem at your site?
>
> thanks
> Todd

Attachment: test.tar.gz
Description: test.tar.gz