
Re: [Condor-users] Checkpointing failed on X86_64



Previously Junjun Mao  wrote:

> I compiled this simple program with condor_compile gcc -o count count.c
> 
<snip>
> 
> When I used condor_hold while the program was running I got this error
> in the log file:
> 
> 001 (008.000.000) 11/17 19:13:25 Job executing on host: <10.10.20.90:42208>
> ...
> 004 (008.000.000) 11/17 19:15:20 Job was evicted.
>         (0) Job was not checkpointed.
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>         570  -  Run Bytes Sent By Job
>         4754958  -  Run Bytes Received By Job
> 
> I looked for the manual
> http://www.cs.wisc.edu/condor/manual/v6.8/1_5Availability.html#se:Availability
> 
> It appears condor_compile is not supported on my platform Fedora Core
> 4/Opteron. Is this the real reason?
> 

I doubt it, provided you are running Condor v6.8.2, since that
version added 64-bit Linux checkpoint support.

I don't recall whether condor_hold forces a checkpoint or not.
So I would retry your test using "condor_vacate" (or
condor_vacate_job) to checkpoint and leave the machine, or
"condor_checkpoint" (or condor_checkpoint_job) to checkpoint and
keep running.
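
A minimal sketch of that retry in Python, in case it helps. It only
builds and prints the commands unless dry_run is turned off. The job id
"8.0" is a placeholder taken from the log above (cluster 8, process 0),
so a resubmitted job would have a different id, and the machine name is
made up. I am assuming condor_vacate_job takes a job id and
condor_checkpoint takes a machine name.

```python
import subprocess


def vacate_job_command(job_id):
    """argv to checkpoint a job and make it leave the machine."""
    return ["condor_vacate_job", job_id]


def checkpoint_command(machine):
    """argv to checkpoint the job on a machine and keep it running."""
    return ["condor_checkpoint", machine]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # Placeholder id and hostname: substitute the real ones.
    run(vacate_job_command("8.0"))
    run(checkpoint_command("exec-node.example.org"))
```

After running either command for real, the job's log file should show
whether a checkpoint was taken.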

Another thought: maybe the above happened because the job ran for
less than 2 minutes (19:13:25 to 19:15:20 in your log). Condor will
(purposefully) not bother to checkpoint upon pre-emption unless more
than X seconds of forward progress was made.  I don't recall off the
top of my head what X is, sorry, but it was short.  3 minutes perhaps?
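
To put a number on that, here is a small Python sketch that computes the
run time from the two timestamps in the log above. The 180-second
threshold is only my "3 minutes perhaps?" guess, not a documented Condor
value, and the year is assumed because the log does not record one.

```python
from datetime import datetime


def run_seconds(start, end, year=2000):
    """Seconds between two 'MM/DD HH:MM:SS' log timestamps.

    The log has no year, so an arbitrary one is assumed for parsing.
    """
    fmt = "%Y %m/%d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return int((t1 - t0).total_seconds())


GUESSED_THRESHOLD = 180  # assumption: the real value of X is unknown here

elapsed = run_seconds("11/17 19:13:25", "11/17 19:15:20")
print(elapsed)                      # 115
print(elapsed < GUESSED_THRESHOLD)  # True
```

So the job ran for 115 seconds, which would fall under the threshold if
my guess is anywhere near right. Letting the job run longer before
evicting it would rule this out.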

Regards,
Todd

-- 
Posted via a Palm OS PDA (Handspring Visor Edge)