
Re: [HTCondor-users] Ovewriting Checkpoint platform



You might take a look at reworking the IS_VALID_CHECKPOINT_PLATFORM expression to either ignore the SSE versions, if you're certain that none of your standard universe jobs use SSE4 opcodes, or allow checkpoints from lower SSE versions to resume on machines with higher versions.

As of 8.2.9 this is the default expression, from page 212 of the manual:

IS_VALID_CHECKPOINT_PLATFORM = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))

The checkpoint platform attribute is described as "opaque," so an expression that parses it isn't guaranteed to keep working across HTCondor versions, but for your purposes at your site it may be what you need. You'd replace the =?= with some other comparison of the two sides that gives you the behavior you want.

For example, if the TARGET.LastCheckpointPlatform is a substring of MY.CheckpointPlatform, that would allow ssse3 checkpoints to resume on 4.1 and 4.2 machines.
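
A minimal, untested sketch of that idea follows. It assumes the ClassAd built-in functions substr() and size() are available in your version, and it implements a prefix test rather than a general substring test. That is enough for the three platform strings in your condor_status output, since each one is a prefix of the next, but treat it as a starting point rather than a drop-in replacement. The NumCkpts test is moved ahead of the string comparison so that jobs with no checkpoint yet never reach the comparison.

```
# Sketch only: allow a checkpoint to resume on a machine whose
# CheckpointPlatform begins with the job's LastCheckpointPlatform.
# Relies on the platform string staying in its current format.
IS_VALID_CHECKPOINT_PLATFORM = \
  (((TARGET.JobUniverse == 1) == FALSE) || \
   ((MY.CheckpointPlatform =!= UNDEFINED) && \
    ((TARGET.NumCkpts == 0) || \
     (substr(MY.CheckpointPlatform, 0, size(TARGET.LastCheckpointPlatform)) == TARGET.LastCheckpointPlatform))))
```

With this in place, a job checkpointed on an "ssse3" machine should match the "ssse3 sse4_1" and "ssse3 sse4_1 sse4_2" machines, while a checkpoint written on an sse4_2 machine would still be kept off the older ones. It's worth testing against a single held job before rolling it out to the pool.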

 

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax

michael.v.pelletier@xxxxxxxxxxxx





From:        "Marcos M." <marcos.mazzini@xxxxxxxxx>
To:        htcondor-users@xxxxxxxxxxx
Date:        10/08/2015 12:03 PM
Subject:        [HTCondor-users] Ovewriting Checkpoint platform
Sent by:        "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>





Hi, I have a question about where to configure checkpointing options.
I'm running HTCondor 8.2.8 on a CentOS submit and central manager server and several execute nodes.
The execute nodes are heterogeneous (different processors) and compute the checkpoint platform differently. If I run
condor_status -format "%s\n" checkpointplatform | sort | uniq -c
I get
20 LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3
40 LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1
56 LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2

As a result, idle jobs are not run on available machines because the checkpoint platform is slightly different.
Should I override the CHECKPOINT_PLATFORM macro, configure a checkpoint server, or is there another option?

Thanks in advance.
Marcos.