
Re: [Condor-users] Monte Carlo simulation



On 9/1/06, Kwan Wing Keung <hcxckwk@xxxxxxxxxxxx> wrote:

Dear All,

a newbie question:

One of my users needs to run a quantum mechanics simulation program.
He needs to run it 10,000 times, which is a standard requirement in
this specific area.  (Each simulation takes around 10 minutes of
computer execution.)

I converted the program for him and submitted it to a pool of Windows XP
machines in our student lab.  The setup works fine and the program
also works fine.

We planned to allow each sub-job to run 100 simulations (i.e.
1000 minutes of running, which is around 16 hours).  After each run,
the execution server will return an average value for the result based
on these 100 runs.

However, it turns out that the students just keep on logging in/out and
switching the lab PCs on/off, so we can never complete a continuous
16-hour run.

I am now going to modify the program so that after EACH successful
simulation, the result file will be overwritten, with the
updated simulation count and the average values stored.
i.e.
    (updated simulation count)
    xxxxxxxxxx
    yyyyyyyyyy
    zzzzzzzzzz
    ...

My question is therefore:

Is there any way to specify in the Condor submit file that, once the
execution server is powered off, the result file will be sent back and
this particular sub-job will be terminated (i.e. not re-queued)?

You are in effect doing partial checkpointing (with the proviso that,
if the job completes a sufficient amount, you are happy to ignore the
remainder and leave it at that).

Here are the bare bones of one way to do this - apologies, I don't have
time for more detail.

1) submit with file transfer on exit or evict
2) trap the WM_CLOSE message in your program (a sketch of this step follows the list)
a) exit with one code if your program is considered to have 'done enough'
b) exit with another if it has only just started, so the job should run again

(search previous messages for Windows checkpointing for details on 1 and 2)

3) include an on_exit_hold expression in your submit file which checks this code
(see the submit file sketch below; note - I am not 100% certain
on_exit_hold is checked on eviction, but I think it is)
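
For step 2, here is a minimal sketch. It assumes a plain console
program, so it uses a console control handler to catch the window
being closed or the machine logging off / shutting down; a GUI program
would handle WM_CLOSE in its window procedure instead, and whether
Condor's own soft kill reaches a console program this way is exactly
what the earlier Windows checkpointing threads discuss. The exit codes
2 and 3, the ENOUGH threshold and the placeholder call are inventions
for the example, not anything Condor defines.

/* Sketch only: exit codes 2 ('done enough') and 3 ('barely started')
 * are arbitrary; rewrite_result_file() stands in for the user's code. */
#include <windows.h>

#define WANTED 100   /* simulations per sub-job */
#define ENOUGH 50    /* arbitrary 'done enough' cut-off */

static volatile LONG g_completed = 0;   /* simulations finished so far */

static BOOL WINAPI ctrl_handler(DWORD event)
{
    if (event == CTRL_CLOSE_EVENT || event == CTRL_LOGOFF_EVENT ||
        event == CTRL_SHUTDOWN_EVENT) {
        /* The result file was already rewritten after every completed
         * simulation, so it is safe to leave straight away. */
        ExitProcess(g_completed >= ENOUGH ? 2 : 3);
    }
    return FALSE;
}

int main(void)
{
    SetConsoleCtrlHandler(ctrl_handler, TRUE);

    while (g_completed < WANTED) {
        Sleep(1000);   /* placeholder for the ~10 minute simulation */
        /* rewrite_result_file(g_completed + 1);  -- user's own code */
        InterlockedIncrement(&g_completed);
    }
    return 0;          /* all simulations done: normal success */
}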
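
And for 1) and 3), the relevant submit file lines might look something
like this (qm_sim.exe is a made-up name, exit code 2 is the 'done
enough' code from the sketch above, and my caveat about on_exit_hold
on eviction still applies):

universe                = vanilla
executable              = qm_sim.exe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# hold the job when it says it has 'done enough' so the partial result
# file sitting in the spool can be collected afterwards
on_exit_hold            = (ExitBySignal == False) && (ExitCode == 2)
# 100 sub-jobs of 100 simulations each covers the 10,000 runs
queue 100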
If you got (a) then the job goes on hold and you will need to do one of the following:

One - change the job so that it executes somewhere local, on a machine
that always lets it run. In essence this causes it to 'start' again,
which moves the checkpointed output files to the execute directory; the
job then completes immediately (change your code to spot this restart
case) with a code indicating all is fine, and the data files get
transferred back to the normal finishing point.
I suggest doing this to keep the merging step independent of the
eviction behaviour (the data is not written back to the normal
location, it is stored in a temporary location in the spool directory).
Doing it via a local machine means you get hit with two needless copies
but no external network load.

Two - leave the jobs on hold, but have another job running to spot
them, (somewhat dodgily) grab the spooled checkpoint data, and then
remove the job. Faster but more complex, and it relies on finding the
checkpoint data cleanly. A rough sketch follows.
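
The 'spot them' part of Two is just condor_q/condor_rm along these
lines - the awkward bit is copying the data out of the spool, and the
spool directory layout differs between Condor versions, so that step
is deliberately left vague here:

REM list held jobs (JobStatus == 5 means Held) as cluster.proc
condor_q -constraint "JobStatus == 5" -format "%d." ClusterId -format "%d\n" ProcId

REM for each id: copy its result file out of the matching subdirectory
REM of the SPOOL directory (layout is version dependent), then drop the job
condor_rm <cluster>.<proc>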

Three - have your job write its output directly to a network location
rather than using the transfer mechanism. Then you no longer need to
go on hold - you can use an on_exit_remove statement instead, if you
know you wrote everything you needed in time (a one-line example
follows).
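
For Three the hold machinery drops out entirely, and (assuming the
program itself writes its results to some network share) the submit
file just needs something like:

# leave the queue unless the program said it had barely started
# (exit code 3 in the sketch above), in which case let it run again
on_exit_remove = (ExitBySignal == False) && (ExitCode != 3)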

If you got (b), you just let the job get rescheduled - no need to resubmit.

This glosses over a *lot* of complications and, if you can sort out
the network access, Three is _vastly_ more pleasant and performant.

It is a starting point for ideas, though.

Matt