Re: [Condor-users] Error funning jobs on hetrogenous cluster



I declare my own "OPSYS_FLAVOUR" attribute for each Linux distribution in my pool.
You will also need to add it to the STARTD ClassAds.
It can then be used in the requirements statement of the submit file.
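
As a rough sketch of the steps above (not my exact setup): the attribute name OPSYS_FLAVOUR is simply the one I chose, and the value "FEDORA" is an illustrative placeholder. Depending on your Condor version, the list macro that publishes custom attributes is STARTD_ATTRS or the older STARTD_EXPRS, so check the manual for your release.

```
# --- condor_config.local on a Fedora execute node (value is illustrative) ---
# Define the custom attribute; the quotes make it a ClassAd string.
OPSYS_FLAVOUR = "FEDORA"
# Publish it in the machine ClassAd (older releases may use STARTD_EXPRS).
STARTD_ATTRS = $(STARTD_ATTRS), OPSYS_FLAVOUR

# --- submit file: only match machines advertising that flavour ---
requirements = (OPSYS_FLAVOUR == "FEDORA")
```

After a condor_reconfig on the node, the attribute should appear in the output of "condor_status -l" for that machine. Machines that do not define it will not match the requirement.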

See
http://epubs.cclrc.ac.uk/bitstream/1725/CondorGotchas.ppt
and
http://epubs.cclrc.ac.uk/bitstream/1723/Gotchas2.ppt
for hints+tips for this

I generally use it when building release tarballs, so that I know I have
a version built for each distro.

Unfortunately this is done manually. It's a shame that finer-grained
information isn't available from Condor by default, but I think the current
string is obtained in the same way as "uname -a".

If you use cron + Hawkeye for automatically updated ClassAds, then you can
always add something to "work out" what distro you have.
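
A minimal sketch of the cron approach, assuming a hypothetical script /usr/local/bin/distro_flavour that prints one line such as OPSYS_FLAVOUR = "DEBIAN" on stdout. The job name DISTRO and the script path are made up for illustration. The knob names below are those used by more recent Condor releases; older releases used a different syntax for cron/Hawkeye jobs, so check the manual for your version.

```
# --- condor_config.local on each execute node ---
# Register a periodic startd cron job named DISTRO (name is arbitrary).
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DISTRO
# Hypothetical script; it must print "Attribute = value" lines on stdout.
STARTD_CRON_DISTRO_EXECUTABLE = /usr/local/bin/distro_flavour
# Re-run once an hour; the distro rarely changes.
STARTD_CRON_DISTRO_PERIOD = 1h
```

The attributes the script prints are merged into the machine ClassAd, so they can be matched in requirements in the same way as a manually defined one.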

Cheers

JK

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Atle Rudshaug
> Sent: Wednesday, October 24, 2007 9:22 AM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Error funning jobs on hetrogenous cluster
> 
> 
> I have a test cluster with one Debian, one Kubuntu and one Fedora
> node. I get different errors on all the nodes. I guess I need a local
> executable on every node, compiled for that specific distro? Is there
> some kind of requirement I can state in the submit file to specify
> the distro the executable needs to run on? Is there some way to send
> my own libraries that my executable needs, or do I have to have them on
> the same path on each node? Can I have them on NFS? I guess I would then
> need to compile with NFS paths to the lib files in the Makefile?
> 
> #Submit file
> universe        = vanilla
> executable    = dagoc
> output           = dagoc.out.$(CLUSTER).$(PROCESS)
> error             = dagoc.err.$(CLUSTER).$(PROCESS)
> log               = dagoc.log.$(CLUSTER)
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
> transfer_input_files = /mnt/dagocproject/dbases/TEST.db
> arguments       = -c -start=10 -stop=20 /mnt/dagocproject/setups/TEST_remote.sup
> queue 5
> 
> 
> What does the following error mean?
> dagoc.err.102.0 and dagoc.err.102.4
> ------------------------------------------------------------------------
> condor_exec.exe: symbol lookup error: condor_exec.exe: undefined symbol: _ZSt22__uninitialized_copy_aIN9__gnu_cxx17__normal_iteratorIPKSsSt6vectorISsSaISsEEEEPSsSsET0_T_SA_S9_SaIT1_E
> 
> 
> Here I need to compile the executable on the node that got this error.
> dagoc.err.102.1
> ------------------------------------------------------------------------
> condor_exec.exe: /lib/tls/i686/cmov/libc.so.6: version `GLIBC_2.4' not found (required by condor_exec.exe)
> 
> dagoc.log.11:
> ------------------------------------------------------------------------
> 000 (102.000.000) 10/24 09:37:23 Job submitted from host: <xxx.247>
> ...
> 000 (102.001.000) 10/24 09:37:23 Job submitted from host: <xxx.247>
> ...
> 000 (102.002.000) 10/24 09:37:23 Job submitted from host: <xxx.247>
> ...
> 000 (102.003.000) 10/24 09:37:23 Job submitted from host: <xxx.247>
> ...
> 000 (102.004.000) 10/24 09:37:23 Job submitted from host: <xxx.247>
> ...
> 001 (102.000.000) 10/24 09:37:30 Job executing on host: <xxx.251>
> ...
> 005 (102.000.000) 10/24 09:37:32 Job terminated.
>         (1) Normal termination (return value 127)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         819  -  Run Bytes Sent By Job
>         23796974  -  Run Bytes Received By Job
>         819  -  Total Bytes Sent By Job
>         23796974  -  Total Bytes Received By Job
> ...
> 001 (102.001.000) 10/24 09:37:32 Job executing on host: <xxx.245>
> ...
> 005 (102.001.000) 10/24 09:37:32 Job terminated.
>         (1) Normal termination (return value 1)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         107  -  Run Bytes Sent By Job
>         23796974  -  Run Bytes Received By Job
>         107  -  Total Bytes Sent By Job
>         23796974  -  Total Bytes Received By Job
> ...
> 001 (102.002.000) 10/24 09:37:32 Job executing on host: <xxx.247>
> ...
> 001 (102.003.000) 10/24 09:37:34 Job executing on host: <xxx.247>
> ...
> 001 (102.004.000) 10/24 09:37:39 Job executing on host: <xxx.251>
> ...
> 005 (102.004.000) 10/24 09:37:39 Job terminated.
>         (1) Normal termination (return value 127)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         819  -  Run Bytes Sent By Job
>         23796974  -  Run Bytes Received By Job
>         819  -  Total Bytes Sent By Job
>         23796974  -  Total Bytes Received By Job
> ...
> 005 (102.002.000) 10/24 09:37:46 Job terminated.
>         (1) Normal termination (return value 0)
>                 Usr 0 00:00:07, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:07, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         4536073  -  Run Bytes Sent By Job
>         23796974  -  Run Bytes Received By Job
>         4536073  -  Total Bytes Sent By Job
>         23796974  -  Total Bytes Received By Job
> ...
> 005 (102.003.000) 10/24 09:37:48 Job terminated.
>         (1) Normal termination (return value 0)
>                 Usr 0 00:00:07, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:07, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         4536071  -  Run Bytes Sent By Job
>         23796974  -  Run Bytes Received By Job
>         4536071  -  Total Bytes Sent By Job
>         23796974  -  Total Bytes Received By Job
> ...