
Re: [Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job



I've cured the symptoms but I have no idea what the problem was.

For future reference, I killed all condor_startd daemons.

Killing them caused the benchmarks to be rerun and CheckpointPlatform
to be regenerated on each compute node. The regenerated value now has
the correct vsystable page address at the end.

Marcus Bannerman

2009/3/12 Marcus Bannerman <m.bannerman@xxxxxxxxxxxxxxxxxxxxxxxxx>:
> Hello,
> I've only just started using Condor, but hopefully I've included enough
> information to start debugging.
>
> I was hoping someone could help me debug a checkpoint problem on a
> Rocks Clusters / Condor 7.0.5 install. Checkpointed jobs are not
> restarting.
>
> I made a trivial C++ program:
> #########################
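> #include <cmath>
> #include <cstddef>
> #include <iostream>
>
> int main()
> {
>
>  // Loop forever, recomputing and printing the sum on every pass.
>  for (;;)
>    {
>      double sum(0);
>      for (std::size_t i(0); i < 1000000; ++i)
>        sum += std::sqrt(static_cast<double>(i));
>
>      std::cout << "It turns out the sum is " << sum;
>    }
>
>  return 0;
> }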
> #########################
>
> I compiled it with "condor_compile g++ test.cpp -o test.bin" and
> submitted it with
>
> #########################
>  universe       = standard
>  executable     = /home/mjki2mb2/dynamo/test.bin
>  arguments      =
>  log            = condor.log
>  output         = condor.out
>  error          = condor.error
>
>  queue
> #########################
>
> The job runs fine and the output is as expected. If I then use
> condor_vacate on the running job, it checkpoints and stops, but then
> never restarts. Running condor_q -better-analyze gives
> #########################
> 015.000:  Run analysis summary.  Of 100 machines,
>      0 are rejected by your job's requirements
>    100 reject your job because of their own requirements
>      0 match but are serving users with a better priority in the pool
>      0 match but reject the job for unknown reasons
>      0 match but will not currently preempt their existing job
>      0 are available to run your job
>        Last successful match: Thu Mar 12 08:58:28 2009
>        Last failed match: Thu Mar 12 09:19:32 2009
>        Reason for last match failure: no match found
>
> WARNING:  Be advised:   Request 15.0 did not match any resource's constraints
>
>
> The following attributes are missing from the job ClassAd:
>
> CheckpointPlatform
> ########################
>
> Now the only problem I can find is that my job has
>
> LastCheckpointPlatform = "LINUX INTEL 2.6.x normal 0x40000000"
>
> but every node I have has
>
> CheckpointPlatform = "LINUX INTEL 2.6.x normal 0x4001c000"
>
> however if I ssh to any node (I've tested every node using tentakel) and run
>
> /opt/condor/libexec/condor_ckpt_probe --vdso-addr
>
> I obtain
> VDSO: 0x40000000
>
> (I got this executable name from the condor_config. I thought it's what
> you use to generate that hex address; the option is a guess.)
>
> I've checked all the Condor logs and they're not much help, even with
> D_ALL set for all daemons.
> Please help. I've built this cluster and almost everything works fine,
> but I can't get my head round the checkpoint error. Is there any
> way I can force a regeneration of CheckpointPlatform? Thanks to
> Rocks Clusters every node is identical in setup, so could I just set
>
> IsValidCheckpointPlatform = FALSE / TRUE (I thought it would be TRUE,
> but I think the current expression evaluates to FALSE when it's OK)
>
> Thanks in advance,
> Marcus Bannerman
>