Re: [Condor-users] Why does machine reject job for unknown reasons

I would suggest looking at the log files on the submit machine and the central manager (the Condor
gurus can be more specific about exactly where to look).
My (automatic these days) first response is to ensure that there are no firewalls between
the submission node and any of the prospective execute nodes. If there are, check that the
appropriate fixed and ephemeral ports are open for both UDP and TCP.
This scenario, where jobs match a machine but never get there, can
also be caused by NAT, which produces similar connection problems.
Both of the above would leave "evidence" in the log files.
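
As a quick way to test the firewall theory from the submit node, here is a minimal Python sketch that tries a TCP connection to a list of host/port pairs. The host name and the port number in the usage comment are placeholders I made up for illustration; substitute the execute nodes and the ports your pool is actually configured to use. It only covers TCP, so UDP still has to be checked separately.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to True (reachable) or False (blocked/closed)."""
    return {(host, port): tcp_port_reachable(host, port, timeout)
            for host, port in targets}


# Example usage (hypothetical host and port, replace with your own):
#   results = check_targets([("execute-node.example.org", 9618)])
#   for (host, port), ok in results.items():
#       print(host, port, "open" if ok else "blocked or closed")
```

A False result does not by itself prove a firewall is at fault (the daemon may simply not be listening on that port), so treat it as a hint to follow up in the log files.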
Another possibility is that the job cannot start on the machine because of file transfer or
remote filestore issues (although I can't recall whether the symptoms would be the same). Again,
the log files would give useful hints as to what was happening.
Do you have other jobs running OK in the pool? If so, what is different about this one?
If not, then I'd suggest running a more trivial job (like /bin/hostname or equivalent).
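
For the trivial test job, a submit description along the following lines should be enough. This is only a sketch: the helper name and the output file names are my own invention, and I am assuming the vanilla universe is appropriate for your pool. The Python below just builds the text of the submit file.

```python
def trivial_submit_description(executable="/bin/hostname", tag="trivial"):
    """Build the text of a minimal submit description for a one-off test job.

    'tag' is a hypothetical prefix used only to name the output, error and
    log files; pick whatever suits your working directory.
    """
    lines = [
        "universe = vanilla",            # assumption: vanilla universe is fine here
        f"executable = {executable}",
        f"output = {tag}.out",
        f"error = {tag}.err",
        f"log = {tag}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Example usage: write the file, then run condor_submit on it by hand.
#   with open("trivial.sub", "w") as fh:
#       fh.write(trivial_submit_description())
```

If this job runs but the real one does not, the difference between the two submit files is the place to look.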
BTW this group is for users, so we don't always have time to respond to queries. While it is often
the Condor team themselves who answer, quite often it is fellow users.
Sorry to bother you again with my question, but this problem still persists. So far I have received no suggestions on how to find out why Condor jobs are rejected ...


Thanks for this suggestion, but the output does not really help me further (see below). It looks like 150 machines are able to run the jobs, yet the jobs are still rejected for unknown reasons! I need them to start immediately because of a time-limited online demonstration of the work I am doing.
Any other suggestions?


> condor_q -better-analyze 1082109.0

1082109.000:  Run analysis summary.  Of 152 machines,
      2 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
    150 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( target.Disk >= 10000 )          150
2   ( target.Arch == "X86_64" )       152
3   ( target.OpSys == "LINUX" )       152
4   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )
5   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )
6   ( ( 1024 * target.Memory ) >= 10000 )  152

I have a problem when submitting a DAG to Condor: before any of the jobs get executed they are rejected for unknown reasons, as the following messages suggest:

> condor_q -analyze 1076700.0


If you're running 6.8.x on Linux you can use the -better-analyze option which is infinitely more helpful than -analyze:

condor_q -better-analyze 1076700.0

- Ian

