
Re: [Condor-users] how to avoid ...



On 1/31/06, DeVoil, Peter <Peter.DeVoil@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> I have an infrequent problem on an 80-CPU Windows pool - all dual
> processor hosts running 6.6.10. When an execute node decides to start up
> two jobs (each using the same pre-installed executable, with many
> dependent DLLs) at exactly the same time, one of them gets an "IO error:
> permission denied" message on stderr and stalls - presumably with a
> dialog box in nowhere land.
>
> The simplest way to avoid this is to stop trying to start both jobs at
> the same time; but I can't see a configuration entry to help. Any
> suggestions? I've attached logs showing vm1 stalling.

Horribly, horribly hacky but, if it works, effective:
can you install the program separately, twice, to two different locations
and use the relevant install per VM?

Much cleaner, if harder to get right, is to use USER_JOB_WRAPPER to
enforce some form of delay before each job start. It is not perfect and
is tricky to get totally right, but the basic idea is to keep a local
file which is touched every time a job starts. The wrapper checks
whether that file was touched in the last X seconds and, if so, sleeps
until X seconds have passed. By locking on another file (or indeed that
one) during the check-and-sleep period you ensure that only one job is
starting at a time, so long as you make X long enough to cover the
variability in the time it takes your jobs to get into and out of their
starting state.
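
As a rough illustration of the touch-file-plus-lock idea, here is a minimal
sketch in Python. It is not a tested Condor recipe. The names
JOB_STAGGER_DIR, last_start.stamp and start.lock are made up for this
example, as is the value of MIN_GAP_SECONDS (the "X" above). It assumes the
wrapper is handed the real job command line as its arguments, that Python is
available on the execute node, and that the state directory is one both VMs
on the machine can see.

```python
import os
import subprocess
import sys
import tempfile
import time

MIN_GAP_SECONDS = 30  # the "X" above; this value is an assumption, tune it


def acquire_lock(lock_path, poll=0.5):
    """Wait until we manage to create lock_path; creation is atomic."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except (FileExistsError, PermissionError):
            time.sleep(poll)  # another job holds the lock


def release_lock(lock_path):
    os.remove(lock_path)


def seconds_to_wait(stamp_path, min_gap, now=None):
    """How long to sleep so that min_gap seconds separate two starts."""
    now = time.time() if now is None else now
    try:
        last_start = os.path.getmtime(stamp_path)
    except OSError:
        return 0.0  # no stamp yet: nothing has started before us
    return max(0.0, min_gap - (now - last_start))


def wait_for_turn(stamp_path, lock_path, min_gap):
    """Hold the lock during the check and sleep, then touch the stamp."""
    acquire_lock(lock_path)
    try:
        time.sleep(seconds_to_wait(stamp_path, min_gap))
        with open(stamp_path, "a"):
            pass  # create the stamp file if it does not exist
        os.utime(stamp_path, None)  # set its mtime to "now"
    finally:
        release_lock(lock_path)


def main(argv):
    # JOB_STAGGER_DIR is a hypothetical variable. The fallback temp dir
    # may be private to each job, so set this explicitly in practice.
    state_dir = os.environ.get("JOB_STAGGER_DIR", tempfile.gettempdir())
    wait_for_turn(os.path.join(state_dir, "last_start.stamp"),
                  os.path.join(state_dir, "start.lock"),
                  MIN_GAP_SECONDS)
    return subprocess.call(argv[1:])  # run the real job, pass on its status


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

One known weakness: if a wrapper dies while holding start.lock, the lock
file is left behind and later jobs wait forever, so a real version would
want some stale-lock cleanup.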

Since you are already running bat scripts, users have the option of
trying to spot the failure and retrying locally, though that is perhaps
not the best way to manage it.
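
For what the retry option might look like, here is a small sketch, again in
Python rather than a bat script. It assumes the failure can be recognised by
the "permission denied" text on stderr, and it treats a run that exceeds a
timeout as stalled. The function name, the marker string match, and all the
default values are assumptions made for illustration.

```python
import subprocess
import time

ERROR_MARKER = "permission denied"  # text from the reported stderr message


def run_with_retry(cmd, attempts=3, timeout=None, delay=5.0):
    """Run cmd; retry if it stalls or prints ERROR_MARKER on stderr.

    Returns (returncode, attempts_used); returncode is None if the
    last attempt timed out.
    """
    returncode = None
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, stderr=subprocess.PIPE,
                                    universal_newlines=True,
                                    timeout=timeout)
            returncode = result.returncode
            failed = ERROR_MARKER in result.stderr.lower()
        except subprocess.TimeoutExpired:
            returncode = None
            failed = True  # treat a timeout as the stalled case
        if not failed:
            return returncode, attempt
        if attempt < attempts:
            time.sleep(delay)  # back off before trying again
    return returncode, attempts
```

Note that this captures stderr instead of passing it through, so a real
script would need to write it back out for the job's error file.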

Matt