
Re: [HTCondor-users] check for cpu instruction?



Maybe use condor_qedit or Chirp in the wrapper to modify the job requirements expression to exclude the problem machine's CheckpointPlatform?

For instance, if the binary needed AVX and it landed on this machine which has no AVX, then you'd add this to requirements:

 ( TARGET.CheckpointPlatform =!= "LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2" )

Or maybe match only a substring of it, such as everything after "normal", anchored to the end of the string?

( ! regexp("0x2aaaaaaab000 ssse3 sse4_1 sse4_2$", TARGET.CheckpointPlatform) )
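As a rough sketch of how a wrapper might generate either clause from the CheckpointPlatform string of the machine it landed on, something like the following Python could work. The function names are hypothetical, not part of HTCondor, and the suffix is pasted into the pattern without escaping, so this assumes it contains no regex metacharacters or double quotes.

```python
def exclude_exact(platform: str) -> str:
    """Clause rejecting machines that advertise exactly this CheckpointPlatform."""
    return f'( TARGET.CheckpointPlatform =!= "{platform}" )'


def exclude_suffix(platform: str, marker: str = "normal ") -> str:
    """Clause rejecting machines whose CheckpointPlatform ends with the text
    that follows `marker` (assumed to hold the address and CPU feature flags)."""
    _, found, tail = platform.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in {platform!r}")
    # The tail is used as-is in the pattern; no regex escaping is attempted.
    return f'( ! regexp("{tail}$", TARGET.CheckpointPlatform) )'


def append_clause(requirements: str, clause: str) -> str:
    """AND a new clause onto an existing requirements expression."""
    return f"( {requirements} ) && {clause}"


if __name__ == "__main__":
    bad = "LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
    print(exclude_exact(bad))
    print(exclude_suffix(bad))
```

The resulting string would still have to be written back to the job, for example with condor_qedit as suggested above; that step is left out here because it depends on how the wrapper can reach the schedd.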

And then set it up to requeue if ExitBySignal is true and the signal was a SIGILL (ExitSignal of 4 on Linux)...
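Since the jobs run through a wrapper, HTCondor only sees how the wrapper exits, so the wrapper would have to notice the SIGILL itself and pass it along. A minimal sketch, assuming a Python wrapper on Linux; the function names are hypothetical, and re-raising the signal on the wrapper is just one possible way to make ExitBySignal reflect the payload:

```python
import os
import signal
import subprocess
import sys


def run_payload(argv):
    """Run the real job and return its exit status.
    subprocess reports death by signal N as a return code of -N."""
    return subprocess.run(argv).returncode


def killed_by_sigill(returncode: int) -> bool:
    """True if the payload was terminated by SIGILL (illegal instruction)."""
    return returncode == -signal.SIGILL


def propagate(returncode: int) -> None:
    """Exit the wrapper the same way the payload ended.
    For SIGILL, restore the default handler and send the signal to ourselves
    so the wrapper also dies by SIGILL; otherwise exit with a plain status."""
    if killed_by_sigill(returncode):
        signal.signal(signal.SIGILL, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGILL)
    # Other signals are collapsed to exit status 1 in this sketch.
    sys.exit(returncode if returncode >= 0 else 1)
```

A wrapper built on this would call propagate(run_payload(sys.argv[1:])), with any qedit or Chirp update of the requirements done between the two calls when killed_by_sigill() is true.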

	-Michael Pelletier


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Di Domenico
Sent: Thursday, May 17, 2018 8:30 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] check for cpu instruction?

On Wed, May 16, 2018 at 5:33 PM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
>
> There is no way for condor to do this.  To do this completely would 
> require solving the halting problem, which is beyond the scope of our research.

I don't understand "the halting problem", but I agree this is a one-off and probably not high on anyone's (even my) list.

> In practice, though, there are some approaches that may help.  The 
> program that tries to execute an instruction which doesn't exist 
> should get killed with SIGILL (Illegal instruction).  If this program 
> is the top-level process in your job (i.e. there is no wrapper 
> script), Condor will at least see that the program got a SIGILL, and 
> you can administratively do something about that. (Put the job on 
> hold, retry on a different machine model number etc.)

We do run our jobs through a condor job wrapper. I'm curious how you would retry the job on an alternate machine model number, though. Is there some chunk of ClassAd code that the user would have to put in their submit file, or is there something I can cram into the main config?
This would presumably work and be the shortest path for me; I don't care that the job restarts a few times.