
Re: [Condor-users] Problems with checkpointing.




> Can you be more specific about the errors you are getting?

OK, I was waiting for more details from users... I'll attach a bunch of
stuff below, trying to show lifecycle of jobs, but here's a typical log
entry when a job dies...  I know this job was condor_compiled on a RH9
box, I don't know where it initially ran, but here it dies on a RH9 box:

001 (12450.852.000) 04/27 17:08:09 Job executing on host: <129.89.200.78:51017>
...
005 (12450.852.000) 04/27 17:08:14 Job terminated.
(0) Abnormal termination (signal 11)

Hmmm...

Another thing... the user whose logs I'm just checking into has told me
that his failing jobs were condor_compile'ed under 6.7.3, and have been
failing on 6.7.6.  I haven't heard back from the user whose snippets are
listed earlier in the thread.

Would the jobs having been condor_compiled under 6.7.3 make a difference?

I don't think it should make any difference, unless we fixed a bug in the standard universe implementation. That said, I'm not aware of any relevant bug fixes. It wouldn't hurt to try condor_compiling with 6.7.6, but I don't expect it will help much.


At this point, I would try two things:

1) Look in the StarterLog and StartLog on the execution computer at the time the job failed to see if there are any obvious problems.

2) Do you get a core file back that can be looked at to see where the program died? If the program had a segfault, there are a few possibilities:

  a) The user's code is flawed and crashes of its own accord.
  b) The Condor library that is linked with the job has a bug that
     caused the crash.
  c) The user relies on something that isn't true in the standard
     universe.
     http://www.cs.wisc.edu/condor/manual/v6.7/1_4Current_Limitations.html

There may be a subtle problem in the user's code. Refer to point 9 in the link above: "All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error." For example, what if the following sequence of events occurs?

  * Open file for reading and writing
  * CHECKPOINT
  * Read some data, write new data based on this
  * EVICT
  <on new machine>
  * Restart at checkpoint, read new data, get confused by the data, crash.
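To make that sequence concrete, here is a rough sketch in plain Python. It is only a simulation, not Condor code: the names run_step and simulate are made up for illustration, the "checkpoint" is just a copy of a dict, and it assumes that changes already written to the file are not undone when the job rolls back. The file is also reopened on every step instead of being held open, which is a simplification.

```python
import os
import tempfile


def run_step(path, state):
    """Read a counter from the file, add 1, and write it back in place."""
    with open(path, "r+") as f:  # opened for both reading and writing
        value = int(f.read())
        f.seek(0)
        f.truncate()
        f.write(str(value + 1))
    state["steps_done"] += 1


def simulate(total_steps=3, checkpoint_after=1, evict_after=2):
    """Return the final counter after a simulated checkpoint, evict, restart."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("0")
        state = {"steps_done": 0}
        checkpoint = dict(state)  # hypothetical stand-in for a checkpoint image
        while state["steps_done"] < evict_after:
            run_step(path, state)
            if state["steps_done"] == checkpoint_after:
                # CHECKPOINT: in-memory state only, the file is not snapshotted
                checkpoint = dict(state)
        # EVICT, then restart from the checkpoint "on a new machine":
        # memory is rolled back, but the file keeps the later writes.
        state = dict(checkpoint)
        while state["steps_done"] < total_steps:
            run_step(path, state)
        with open(path) as f:
            return int(f.read())
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(simulate())  # prints 4, although only 3 increments were intended
```

The step that ran between the checkpoint and the eviction gets applied twice, so the file ends up holding data that the restarted job never expected. A real job reading such data back could just as easily compute a bad index or pointer and segfault.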

I'm not saying that it's definitely a bug in the user code. It may well be in Condor. I'm just saying that it might be tricky to track it down.

-alain