
Re: [Condor-users] CheckpointPlatform error



antoni artigues wrote:
Hello

I'm trying to submit my first job on the vanilla universe.


Welcome to Condor!

I have my executable in the common NFS directory and my description file
is:

Executable = input.sh
Universe = vanilla
Requirements = OpSys == "LINUX" && Arch == "X86_64"
output = sim.out
error = sim.error
Log = sim.log
Queue

But after the job has been submitted, condor_q -better-analyze
returns:
-----------------------
002.000:  Run analysis summary.  Of 6 machines,
      0 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job

The following attributes are missing from the job ClassAd:

CheckpointPlatform
----------------------
Where is the error? What is the CheckpointPlatform?

From the Condor Manual -

CheckpointPlatform: A string which opaquely encodes various aspects
about a machine's operating system, hardware, and kernel attributes.
It is used to identify systems where previously taken checkpoints for
the standard universe may resume.

But it is strange to see better-analyze report it as missing. CheckpointPlatform should appear by default in all the machine ClassAds, so the above message would imply that some of your machines are not advertising a checkpoint platform. I would be curious to see what this command returns:
  condor_status -con 'CheckpointPlatform =?= UNDEFINED'
(it will print out all machines in your pool that do not have CheckpointPlatform defined)

At any rate, CheckpointPlatform is only used when restarting Standard Universe jobs after a checkpoint, and is likely a red herring if you are having problems getting your Vanilla job to run. The most frequent culprits when a Vanilla job will not run are:

1) FILESYSTEM_DOMAIN in condor_config differs between the submit host and the execute hosts (although better-analyze should catch this problem),
or
2) UID_DOMAIN in condor_config differs between the submit host and the execute hosts. If this is the case, then the vanilla job may indeed run on some host, but Condor will start the job as user "nobody" instead of as the user who submitted the job, and unless the files in your shared filesystem are world-readable/writable, things will fail due to permissions.
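
A quick way to check both settings is to read them on the submit host and on an execute host and compare the values. The following is only a rough sketch: compare_domain is a hypothetical helper name, and the domain values in the usage comments are placeholders, not values from this pool.

```shell
# Sketch: compare one domain setting between the submit host and an execute host.
# compare_domain is a hypothetical helper, not part of Condor.
compare_domain() {
    # $1 = setting name, $2 = value on the submit host, $3 = value on the execute host
    if [ "$2" = "$3" ]; then
        echo "$1: match ($2)"
    else
        echo "$1: MISMATCH (submit=$2, execute=$3)"
        return 1
    fi
}

# The local value of each setting can be read on a host with:
#   condor_config_val UID_DOMAIN
#   condor_config_val FILESYSTEM_DOMAIN
# Then feed the two values to the helper, for example:
#   compare_domain UID_DOMAIN "cs.example" "cs.example"
```

If either setting is reported as a mismatch, that setting is the first thing to align across the hosts before looking any further.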

regards,
Todd