
Re: [Condor-users] Monte Carlo simulation



How do I run Condor to process my application on a cluster?


From: "Matt Hope" <matthew.hope@xxxxxxxxx>
Reply-To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Subject: Re: [Condor-users] Monte Carlo simulation
Date: Fri, 1 Sep 2006 13:51:57 +0100

On 9/1/06, Kwan Wing Keung <hcxckwk@xxxxxxxxxxxx> wrote:
>
> Dear All,
>
> a newbie question:
>
> One of my users needs to run a simulation program on quantum mechanics.
> He needs to run it 10,000 times, which is a standard requirement
> in this specific area.  (Each simulation takes around 10 minutes of
> computer execution.)
>
> I converted the program for him and submitted it to a pool of Windows XP
> machines in our student lab.  The setup works fine and the program
> also works fine.
>
> We planned to allow each sub-job to run 100 simulations (i.e.
> 1000 minutes of running, which is around 16 hours).  After each run,
> the execution server will return an average value for the result based
> on these 100 runs.
>
> However, it turns out that the students just keep logging in/out and
> switching the lab PCs on/off, so we can never achieve a continuous
> 16-hour run.
>
> I am now going to modify the program so that after EACH successful
> simulation, the result file will be overwritten, with the
> updated simulation count and the average values stored.
> i.e.
>     (updated simulation count)
>     xxxxxxxxxx
>     yyyyyyyyyy
>     zzzzzzzzzz
>     ...
>
> My question is therefore:
>
> Is there any way to specify in the Condor submit file that once the
> execution machine is powered off, the result file will be sent back and
> this particular sub-job will be terminated (i.e. not re-queued)?

You are in effect doing partial checkpointing (with the proviso that,
if the job completes a sufficient amount of work, you are happy to
ignore the remainder and leave it at that).

Here are the bare bones of one way to do this - apologies, I don't
have time for more detail.

1) Submit with transfer on exit or evict.
2) Trap the WM_CLOSE message in your program (this is how Condor
soft-kills a job on Windows):
 a) exit with one code if your program is considered to have 'done enough';
 b) exit with another if you had only just got started, so the job
should run again.

(Search previous list messages for Windows checkpointing for details
on 1 and 2.)
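As a sketch of the submit-file side of points 1 and 2 (the executable
name and output file name are hypothetical; the key line is
when_to_transfer_output = ON_EXIT_OR_EVICT, which tells Condor to
transfer the output files back on eviction as well as on normal exit):

```
universe                = vanilla
executable              = my_sim.exe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files   = result.txt
queue 100
```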

3) Include an on_exit_hold expression in your submit file which checks
this exit code.
(Note - I am not 100% certain on_exit_hold is checked on eviction,
but I think it is.)
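A minimal sketch of such an expression, assuming (hypothetically) that
your program exits with code 2 when it has 'done enough'; ExitCode is
the job's last exit code:

```
on_exit_hold = (ExitCode == 2)
```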
If you got (a) then the job goes on hold, and you will need to do one
of the following:

One - change the job so that it executes somewhere local, on a machine
that always lets it run, in essence causing it to 'start' again. This
moves the checkpointed output files to the execute directory; the job
then completes immediately (change your code to spot this restart
case) with a code indicating all is fine, and the data files are
transferred back to the normal finishing point.
I suggest doing this to keep the merging job independent of the
eviction behaviour (the data is not written back to the normal
location; it is stored in a temporary location in the spool
directory). Doing this via a local machine means you incur two
needless copies, but no external network load.

Two - leave the job on hold, but have another job running that spots
held jobs, (somewhat dodgily) grabs the spooled checkpoint data, and
then removes the job. Faster, but more complex, and it relies on
finding the checkpoint cleanly.

Three - have your job write its output directly to a network location
rather than using the transfer mechanism. Then you no longer need to
go on hold - you can have an on_exit_remove statement instead, if you
know you wrote everything you needed in time.
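A sketch of the submit-file side of option Three, again assuming
(hypothetically) that exit code 0 means 'done enough' was written to
the network location:

```
on_exit_remove = (ExitCode == 0)
```

With this, a job that exits with code 0 leaves the queue, while any
other exit code puts it back in the queue to run again.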

If you got (b), you just let the job get rescheduled - no need to resubmit.

This glosses over a *lot* of complications and, if you can sort out
the network access, Three is _vastly_ more pleasant and performant.

It is a starting point for ideas, though.

Matt
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
