
[Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job



Hello,
I've only just started using Condor, but hopefully I've included enough
information to start the debugging.

I was hoping someone could help me debug a checkpoint problem on a
Rocks Clusters/Condor 7.0.5 install: checkpointed jobs are not
restarting.

I made a trivial C++ program:
#########################
#include <iostream>
#include <cmath>

int main()
{

  for (;;)
    {
      double sum(0);
      for (size_t i(0); i < 1000000; ++i)
        sum += std::sqrt(i);

      std::cout << "It turns out the sum is " << sum;
    }

  return 0;
}
#########################

I compiled it with "condor_compile g++ test.cpp -o test.bin" and
submitted it with the following submit file:

#########################
  universe       = standard
  executable     = /home/mjki2mb2/dynamo/test.bin
  arguments      =
  log            = condor.log
  output         = condor.out
  error          = condor.error

  queue
#########################

The job runs fine and the output is as expected. If I then use
condor_vacate on the running job, the task checkpoints and stops, but
then never restarts. Running condor_q -better-analyze gives
#########################
015.000:  Run analysis summary.  Of 100 machines,
      0 are rejected by your job's requirements
    100 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
	Last successful match: Thu Mar 12 08:58:28 2009
	Last failed match: Thu Mar 12 09:19:32 2009
	Reason for last match failure: no match found

WARNING:  Be advised:   Request 15.0 did not match any resource's constraints


The following attributes are missing from the job ClassAd:

CheckpointPlatform
########################

Now the only problem I can find is that my job has

LastCheckpointPlatform = "LINUX INTEL 2.6.x normal 0x40000000"

but every node I have has

CheckpointPlatform = "LINUX INTEL 2.6.x normal 0x4001c000"

However, if I ssh to any node (I've tested every node using tentakel) and run

/opt/condor/libexec/condor_ckpt_probe --vdso-addr

I obtain
VDSO: 0x40000000

(I got this executable name from the condor_config; I thought it's what
you use to generate that hex address. The option is a guess.)

I've checked all the condor logs and they're not much help, even with
D_ALL set for all daemons.
Please help. I've built this cluster and almost everything works fine,
but I can't get my head round the checkpoint error. Is there any
way I can force a regeneration of the checkpoint platform? Thanks to
Rocks Clusters every node is identical in setup, so could I just set

IsValidCheckpointPlatform = FALSE / TRUE (I thought it would be TRUE,
but I think the current expression evaluates to FALSE when it's OK)?

Thanks in advance,
Marcus Bannerman