Re: [Condor-users] CheckpointPlatform error
- Date: Fri, 07 May 2010 09:13:37 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] CheckpointPlatform error
antoni artigues wrote:
I'm trying to submit my first job in the vanilla universe.
Welcome to Condor!
I have my executable in the common NFS directory, and my description file is:
Executable = input.sh
Universe = vanilla
Requirements = OpSys == "LINUX" && Arch =="X86_64"
output = sim.out
error = sim.error
Log = sim.log
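(An aside on the file above: a submit description also needs a Queue statement to actually queue the job -- it may simply have been trimmed from the quote. For reference, a minimal complete version, using the same filenames, would be:

    Executable   = input.sh
    Universe     = vanilla
    Requirements = OpSys == "LINUX" && Arch == "X86_64"
    Output       = sim.out
    Error        = sim.error
    Log          = sim.log
    Queue

The Queue statement at the end tells condor_submit to place one instance of the job in the queue.)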
But after the job has been submitted, condor_q -better-analyze says:
002.000: Run analysis summary. Of 6 machines,
0 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
4 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 match but are currently offline
0 are available to run your job
The following attributes are missing from the job ClassAd:
CheckpointPlatform
Where is the error? What is the CheckpointPlatform?
From the Condor Manual -
CheckpointPlatform: A string which opaquely encodes various aspects
about a machine's operating system, hardware, and kernel attributes.
It is used to identify systems where previously taken checkpoints for
the standard universe may resume.
But it is strange to see better-analyze saying it is missing.
CheckpointPlatform should appear by default in all the machine ClassAds, so the above message from condor_q -better-analyze would imply that some of your machines are not advertising a checkpoint platform. I would be curious to see what this command
condor_status -con 'CheckpointPlatform =?= UNDEFINED'
returns (it will print out all machines in your pool that do not have a CheckpointPlatform attribute defined).
At any rate, CheckpointPlatform is only used when restarting Standard Universe jobs after a checkpoint, and is likely a red herring if you are having problems getting your Vanilla job to run. The most frequent culprits preventing a Vanilla job from running are:
1) FILESYSTEM_DOMAIN in condor_config differs between the submit host and the execute hosts (although better-analyze should catch this problem).
2) UID_DOMAIN in condor_config differs between the submit host and the execute hosts. If this is the case, then the vanilla job may indeed run on some host, but Condor will start the job as user "nobody" instead of as the user who submitted the job --- and unless the files in your shared filesystem are world-readable/writable, things will fail due to permission errors.
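
A quick way to check both settings is to query them with condor_config_val on the submit host and then on one of the execute hosts, and compare the output:

    # run this on the submit host, then on an execute host:
    condor_config_val FILESYSTEM_DOMAIN
    condor_config_val UID_DOMAIN

If the values differ, and the hosts really do share the NFS directory and user accounts, setting both macros to the same string in condor_config on every host should fix it -- for example (the domain name here is just a placeholder for your own site's domain):

    FILESYSTEM_DOMAIN = your.domain.example
    UID_DOMAIN        = your.domain.example

followed by a condor_reconfig on each host so the daemons pick up the new values.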