
Re: [Condor-users] I wasted some CPU cycles ;-)



Hi Miguel,

Condor can't checkpoint on Windows systems, so the job is killed and
restarted (from the beginning) on another machine.
If you want your jobs to run all the time (even if the machine is in use by a
user), use the following in your config file:

WANT_SUSPEND = FALSE
WANT_VACATE = FALSE
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
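
For reference, here is the same fragment again with a comment on what each
expression does, as a rough sketch rather than a tested policy. These
expressions are evaluated by the condor_startd on each execute machine, so
they belong in the condor_config (or local config file) of the execute nodes;
placing them there is my assumption about your setup, and a running node
needs a condor_reconfig to pick up the change.

```
# Sketch of an "always run" startd policy; the values are the ones given above.

# Do not consider suspending a job when the owner becomes active.
WANT_SUSPEND = FALSE

# If a job is ever preempted, do not vacate it gracefully first.
WANT_VACATE = FALSE

# Always willing to start jobs, regardless of keyboard or CPU activity.
START = TRUE

# Never suspend a running job.
SUSPEND = FALSE

# Never preempt (evict) a running job because of owner activity.
PREEMPT = FALSE
```

Note that with START = TRUE the jobs will also run while the owner is
working at the machine, so this only fits if that is acceptable to the owners.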

With kind regards,

Matthias Röhm

=======================================================
Matthias Röhm, DaimlerChrysler AG, Research Center Ulm,
Department for Data Mining Solutions, RMI/DM
89013 Ulm,  Germany

Phone:               +49 731 505 4864
Email:               mailto:Matthias.M.Roehm@xxxxxxxxxxxxxxxxxxx
=======================================================

condor-users-bounces@xxxxxxxxxxx wrote on 07.07.2005 13:00:44:

> Disclaimer: idiot here ;-)

> I've got a serious problem.

> I was running my jobs for the last few days, until I accumulated 2 days of
> run time (the "normal" time for such a task to finish) and today I decided
> to check the size of the file being generated.
> This morning, after running overnight, the file was 44 MB... After 2 days of
> running it should have been close to the final size of 610 MB, so that was
> my first shock.
> Just checked again (the machine is currently in use by the Owner, so Condor
> is not active) and the file is not there anymore.

> I suspected that this morning when I checked the file size...
> Instead of being suspended to resume later, my jobs are being killed for
> some reason. Being new to Condor, I probably missed something.

> A bit of background: the machines are all Windows (2K and XP), with the
> central server on 2K. After a little struggling I got the jobs running using
> this .sub:

> #
> # Submit 4 jobs of rtgen.exe to Condor
> Universe = vanilla
> Executable = rtgen.exe
> Arguments = ntlm alpha 1 7 $(Process) 9000 40000000 ncc
> Initialdir = E:/
> Transfer_input_files = libeay32.dll, charset.txt
> Should_transfer_files = YES
> When_to_transfer_output = ON_EXIT
> Nice_user = True
> Notification = Never
> Getenv = False
> Requirements = ( (OpSys == "WINNT50") || (OpSys == "WINNT51") )
> # later I've to try
> #Requirements = ( (OpSys == "WINNT50") || (OpSys == "WINNT51") ) && (VirtualMachineID == 1)
> # and
> #hold = True
> Queue 4
>
> I'm pretty sure that my problem is not there, but in the condor_config file
> on each node, most likely under Part 3, that I left exactly as installed by
> the Windows GUI installer (I only modified bits in Parts 1 and 2, to make it
> work).

> During installation using the GUI, I chose to suspend and continue later,
> no migration.
> What do I have to modify in condor_config (in the clients only? Or also the
> central server?) to ensure that a job that has to run for 2 days of CPU
> time, generating a file of 610 MB, is not killed when the owner is using the
> machine?

> TIA!
> Regards,

> Miguel
