Re: [HTCondor-users] Overwriting Checkpoint platform

I haven't done much beyond dummy jobs with checkpointing, so I can't really speculate on that 7.6.6 behavior.

Perhaps one of the devs will expand on this, but based on the documentation it appears that because the CheckpointPlatform is an "opaque" string, by default there's no parsing of it - the only way a checkpoint will restart on a given machine is when the original machine's string matches the new machine's exactly.

So even if, say, the original machine is SSSE3 and thus the executable will run fine on both older and newer platforms, the new machine won't be considered a valid checkpoint platform because its CheckpointPlatform string also includes the sse4_1 and sse4_2 tags - the old machine's CheckpointPlatform, lacking those tags, won't be an exact match.
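
To make the exact-match point concrete, here is a rough Python sketch. It is not HTCondor code, just a stand-in for the string comparison as I understand it from the documentation, and the two platform strings are made up for illustration rather than copied from real machines.

```python
# Hypothetical CheckpointPlatform strings, invented for this example.
OLD_MACHINE = "LINUX X86_64 2.6.x normal ssse3"
NEW_MACHINE = "LINUX X86_64 2.6.x normal ssse3 sse4_1 sse4_2"


def exact_match(ckpt_platform, machine_platform):
    """Opaque-string comparison: no parsing, the strings must be identical."""
    return ckpt_platform == machine_platform


# A checkpoint taken on the old machine can go back to an identical machine...
print(exact_match(OLD_MACHINE, OLD_MACHINE))  # True
# ...but not to the newer one, even though the binary would run there.
print(exact_match(OLD_MACHINE, NEW_MACHINE))  # False
```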

My "substring" example would NOT allow newer machines' checkpoints to run on older platforms, but the fix for that at your site is to compare only the parts of the CheckpointPlatform strings that come before the list of SSE versions - that is, just the first four fields (at least for this version of HTCondor).
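
Here is a small Python sketch of the difference between the two approaches. It only models the string logic and is not something HTCondor runs; the platform strings are invented, the function names are my own, and I'm assuming the "substring" test means checking whether the checkpoint's string is contained in the machine's string.

```python
# Hypothetical CheckpointPlatform strings, invented for this example.
OLD_MACHINE = "LINUX X86_64 2.6.x normal ssse3"
NEW_MACHINE = "LINUX X86_64 2.6.x normal ssse3 sse4_1 sse4_2"


def substring_match(ckpt_platform, machine_platform):
    """True if the checkpoint's string appears inside the machine's string."""
    return ckpt_platform in machine_platform


def same_base_platform(ckpt_platform, machine_platform, fields=4):
    """Compare only the leading whitespace-separated fields of each string."""
    return ckpt_platform.split()[:fields] == machine_platform.split()[:fields]


# The substring test only works in one direction.
print(substring_match(OLD_MACHINE, NEW_MACHINE))     # True: old checkpoint, newer machine
print(substring_match(NEW_MACHINE, OLD_MACHINE))     # False: new checkpoint, older machine

# Comparing the first four fields ignores the SSE tags in both directions.
print(same_base_platform(OLD_MACHINE, NEW_MACHINE))  # True
print(same_base_platform(NEW_MACHINE, OLD_MACHINE))  # True
```

The field count of four is taken from the strings as they look in this version; it would need checking against the real strings at your site before relying on it.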

It seems, though, that if your executables WERE using SSE4 opcodes, you'd want to add "TARGET.has_sse4_2" or what have you to your requirements expression, and just completely ignore the SSE pieces of the CheckpointPlatform anyway.


Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax