
Re: [HTCondor-users] Re: Condor_submit: job cannot excute in the second node



Hi Christoph,

I stopped the startd on the machine I submit jobs from, and the jobs got stuck in the queue. So the three machines are actually connected, right? I think there may be something wrong with my submit file.
Thank you for your advice!

Warm Regards,
Xinjie Zeng

Beyer, Christoph <christoph.beyer@xxxxxxx> wrote on Tue, Jun 2, 2020 at 02:41:
Hi,

This looks good. I suspect your negotiator is filling the nodes vertically. What happens if you submit 1,000 jobs?

You could also stop the startd on the 'one machine' and see whether jobs then pick another machine or stay stuck in the queue.
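
To make "vertically" versus "horizontally" concrete, here is a toy Python sketch. It is not HTCondor's matchmaking code; the machine names and the figure of 64 slots per machine are assumptions chosen to resemble the pool described in this thread.

```python
from collections import Counter
from itertools import cycle

# Hypothetical pool: three machines with 64 slots each (192 slots total).
MACHINES = {"node1": 64, "node2": 64, "node3": 64}


def fill_vertically(n_jobs, machines):
    """Fill one machine completely before touching the next (depth-first)."""
    placed = Counter()
    remaining = n_jobs
    for name, slots in machines.items():
        take = min(slots, remaining)
        if take:
            placed[name] = take
        remaining -= take
    return dict(placed)


def fill_horizontally(n_jobs, machines):
    """Hand out jobs round-robin across machines (breadth-first)."""
    placed = Counter()
    free = dict(machines)
    remaining = n_jobs
    for name in cycle(machines):
        # Stop when every job is placed or no slot is left anywhere.
        if remaining == 0 or not any(free.values()):
            break
        if free[name] > 0:
            free[name] -= 1
            placed[name] += 1
            remaining -= 1
    return dict(placed)


print(fill_vertically(66, MACHINES))
print(fill_horizontally(66, MACHINES))
```

In this toy model, 64 or fewer jobs placed vertically all land on the first machine, so a batch that is much larger than one machine's slot count is what shows whether the other nodes get used.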

You can change the behaviour of the negotiator and make it fill your pool horizontally using this recipe:


best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Zeng, Xinjie (zengxe)" <xinjie.zeng@xxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, June 2, 2020 08:31:40
Subject: [HTCondor-users] Re: Condor_submit: job cannot excute in the second node

Hi Christoph,

Thank you for your quick reply!

I submitted 66 jobs again and checked condor_q -better-analyze 9.

The output is:

Job 9.063 defines the following attributes:

    DiskUsage = 10000
    FileSystemDomain = "domian.com"
    ImageSize = 10000
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 9.063 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         192  TARGET.Arch == "X86_64"
[1]         192  TARGET.OpSys == "LINUX"
[3]         192  TARGET.Disk >= RequestDisk
[5]         192  TARGET.Memory >= RequestMemory
[8]         192  TARGET.HasFileTransfer


009.063:  Job is running.

Last successful match: Tue Jun  2 02:26:21 2020


009.063:  Run analysis summary ignoring user priority.  Of 192 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
     64 match and are already running your jobs
      0 match but are serving other users
    128 are able to run your job

Everything seems to work fine, but the jobs still didn't execute on the other computers. One machine has 64 cores, so they actually all run on that one machine. Also, I can only see 64 jobs here, but I submitted 66. In condor_q, I found that 2 jobs were done, but I don't think they could have finished in seconds.
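
The counts in the run analysis summary can be cross-checked with a short Python sketch. The parse_summary helper is hypothetical and only handles the five summary lines exactly as they appear in the output above.

```python
import re

# The five summary lines, copied from the condor_q -better-analyze output.
SUMMARY = """\
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
     64 match and are already running your jobs
      0 match but are serving other users
    128 are able to run your job
"""


def parse_summary(text):
    """Return {description: count} for each 'N description' line."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


counts = parse_summary(SUMMARY)
total = sum(counts.values())
idle = counts["are able to run your job"]
print(total, idle)  # prints: 192 128
```

The five counts add up to the 192 slots reported, and none is rejected by the job's requirements or by the machines' own requirements, so the requirements are not what keeps the jobs off the other machines.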


Warm Regards

Xinjie Zeng

From: Beyer, Christoph
Sent: June 2, 2020 1:52
To: htcondor-users
Subject: Re: [HTCondor-users] Condor_submit: job cannot excute in the second node

Hi,

If you submit a lot of jobs, you can check an idle job with condor_q -better-analyze <jobid> or condor_q -analyze <jobid>. There you will see the number of potential nodes this job could technically run on.

Once you have discovered that it can only run on one machine, you can check why it is not able to run on a specific machine using:

condor_q -better-analyze <jobid> -reverse -machine <nodename> (<nodename> needs to be the FQDN here for some reason)
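
A small Python sketch of assembling that command line follows. The helper name, the example job id, and the example host name are hypothetical; socket.getfqdn is used only as one possible way to expand a short host name, and its result depends on the local resolver.

```python
import shlex
import socket


def reverse_analyze_cmd(job_id, node):
    """Build the condor_q reverse-analysis command line for one machine."""
    # Crude heuristic (an assumption): a name containing a dot is treated
    # as already fully qualified; otherwise ask the resolver for the FQDN.
    fqdn = node if "." in node else socket.getfqdn(node)
    return ["condor_q", "-better-analyze", str(job_id),
            "-reverse", "-machine", fqdn]


# Example with a hypothetical job id and an already fully qualified name.
cmd = reverse_analyze_cmd("9.63", "node2.example.com")
print(shlex.join(cmd))
```

Building the argument list first keeps the job id and the machine name as separate arguments, so the list can be passed to subprocess.run as-is on a host where condor_q is installed.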

Best
christoph



From: "Xinjie Zeng" <xinjie.zeng@xxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, June 2, 2020 05:19:45
Subject: [HTCondor-users] Condor_submit: job cannot excute in the second node

Hi All

I have set up HTCondor on three computers using the tarball and condor_config. The version is:

condor_version
$CondorVersion: 8.8.9 May 06 2020 BuildID: 503068 $
$CondorPlatform: x86_64_CentOS7 $

When I submit jobs on the central manager, which is also configured as a submit and execute node, the jobs are only executed on the central manager. They don't execute on the other two computers. It seems as if the three computers aren't connected to each other. However, when I check condor_status, I can see all three nodes in the pool. Could anyone help?

Any advice is appreciated!

Thank you very much!


Warm regards,

Xinjie Zeng


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

--
Xinjie Zeng
EE PhD student
Department of Electrical and Computer Engineering