
[HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens



Hi all,

I apologize for packing multiple issues into a single email. I have several jobs in a similar condition and I am hoping for enough insight to break this down into separate issues.

Situation:

I am working on creating an hveto job that runs on a week's worth of data as well as the current 24 hours. It is a DAG of 3 sequential jobs.
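For context, the DAG itself is nothing exotic; it is just three nodes chained with PARENT/CHILD, roughly like this (the node names and submit file names here are placeholders, not my actual ones):

   JOB  fetch    fetch.sub
   JOB  hveto    hveto.sub
   JOB  summary  summary.sub
   PARENT fetch  CHILD hveto
   PARENT hveto  CHILD summary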

The memory required by each job is data dependent.

Of 7 DAGs running with RequestMemory=4G, 2 completed and the other 5 are hung in the [condor] Running state. At about 14 hours of runtime, condor_ssh_to_job showed the python executables in the [ps] interruptible wait state. Now, at 20 hours, trying to ssh to the job puts it on hold with the message:

Job has gone over cgroup memory limit of 4096 megabytes. Peak usage: 4097 megabytes. Consider resubmitting with a higher request_memory.

The same thing happened when I used condor_vacate_job on one of them.
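In case it matters, the way I have been trying to recover a held job is roughly this (1102.0 is just an example id, and my understanding is that RequestMemory at the ClassAd level is an integer number of megabytes):

   # show why the job went on hold
   condor_q 1102.0 -af HoldReason

   # raise the memory request and release the job
   condor_qedit 1102.0 RequestMemory 6000
   condor_release 1102.0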

Problems:

After using condor_qedit to raise RequestMemory on one of the held jobs, condor_q -better-analyze reports:

1102.000: Run analysis summary ignoring user priority. Of 550 machines,
   550 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match and are already running your jobs
     0 match but are serving other users
     0 are able to run your job

WARNING: Be advised:
  No machines matched the jobs's constraints

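To see what the edit actually produced, something like the following shows the attribute value and the requirements expression it feeds into (again, example job id):

   condor_q 1102.0 -af RequestMemory Requirements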
Whereas after setting it to 6000:

1102.000: Run analysis summary ignoring user priority. Of 550 machines,
     1 are rejected by your job's requirements
    14 reject your job because of their own requirements
     0 match and are already running your jobs
     0 match but are serving other users
   535 are able to run your job