
Re: [HTCondor-users] Jobs are dumped



Hi,

It was the "universe" setting. I did a lot of tests and everything is working now!

Thank you!!!

Roberto

------------------------------------------------------------------------------------------------------------------------
Prof. Dr. Roberto Fernandes Tavares Neto
Departamento de Engenharia de Produção / Industrial Engineering Department
Universidade Federal de São Carlos
tavares@xxxxxxxxxxxxx   tel +55 16 3351-9532
http://www.dep.ufscar.br/tavares
------------------------------------------------------------------------------------------------------------------------

On Sat, Dec 2, 2017 at 11:14 AM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:


On Dec 2, 2017, at 3:22 AM, Roberto Tavares <tavares@xxxxxxxxxxxxx> wrote:

Hi Todd,

Hummm... it does not work...



Hi Roberto,

Please follow the Quick Start link I gave you in my last post, and/or read sections 2.4 and 2.5 of the HTCondor Manual.

The submit file below won't work because it contains "universe=standard". Change it to "universe=vanilla", as I already suggested, or simply remove that line (vanilla is the default setting).

Also, the submit file below does not tell HTCondor to transfer any files (such as the executable) from your submit machine to the worker node, so a shared file system is assumed. /tmp is never shared across machines, so the only node this job could possibly run on is the one you submitted it from.

All of this is discussed in the Quick Start Guide mentioned previously (a link to it is on the HTCondor.org homepage) and documented in detail in the Manual. I think you will save yourself a lot of time by following the Quick Start Guide: it gives clear cut-and-paste examples and is not very long. Let us know if you find it helpful.
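
To make the two points above concrete, here is a minimal sketch, not a definitive setup. It assumes the job script is changed to write to standard output instead of /tmp, so that HTCondor can bring the text back through the Output file. The file names teste1.sub and teste1.log are assumptions, not taken from the original post. should_transfer_files and when_to_transfer_output are the submit commands that turn on HTCondor's file transfer.

```shell
#!/bin/sh
# Sketch only: teste1.sub and teste1.log are illustrative names.
set -e

# Job script: print to stdout so the text comes back via "output"
# rather than landing in /tmp on whichever node ran the job.
cat > teste1.sh <<'EOF'
#!/bin/sh
echo "works"
EOF
chmod 755 teste1.sh

# Submit description file: vanilla universe plus explicit file transfer,
# so no shared file system is assumed.
cat > teste1.sub <<'EOF'
universe                = vanilla
executable              = teste1.sh
arguments               = 1
output                  = foo.out1
error                   = foo.err1
log                     = teste1.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
EOF

# The submission itself is left as a manual step.
echo "To submit: condor_submit teste1.sub"
```

With file transfer enabled, foo.out1 should show up in the submit directory after the job finishes, regardless of which node ran it.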

Hope this helps
Todd

The script is named teste1.sh, it's chmod'ed 777. The contents are:
#!/bin/sh
echo "works" >> /tmp/itisworking.txt

The submission file is
####################
#
# submit description file
# Example 1: queuing multiple jobs with differing
# command line arguments and output files.
#
####################

Executable  = teste1.sh
Universe    = standard

Arguments   = 1
Output      = foo.out1
Error       = foo.err1
Queue

All the logs are empty at the end.

The result of the ShadowLog is

12/02/17 06:56:08 (?.?) (81788):*******************************************
12/02/17 06:56:08 (?.?) (81788):uid=0, euid=122, gid=0, egid=131
12/02/17 06:56:08 (?.?) (81788):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 39.0
12/02/17 06:56:08 (39.0) (81788):Requesting Primary Starter
12/02/17 06:56:08 (39.0) (81788):Shadow: Request to run a job was ACCEPTED
12/02/17 06:56:08 (39.0) (81788):Shadow: RSC_SOCK connected, fd = 17
12/02/17 06:56:08 (39.0) (81788):Shadow: CLIENT_LOG connected, fd = 18
12/02/17 06:56:08 (39.0) (81788):My_Filesystem_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):My_UID_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):Can't get address for checkpoint server host (NULL): No such file or directory
12/02/17 06:56:08 (39.0) (81788): Entering pseudo_get_file_stream
12/02/17 06:56:08 (39.0) (81788): file = "/var/lib/condor/spool/39/cluster39.ickpt.subproc0"
12/02/17 06:56:08 (39.0) (81788):Created TCP listen socket <192.168.0.2:23983>
12/02/17 06:56:08 (39.0) (81788):Shadow: Job 39.0 exited, termsig = 0, coredump = 128, retcode = 0
12/02/17 06:56:08 (39.0) (81788):user_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):sys_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):Shadow: Cannot notify user( Condor Job 39.0, tavares, w )
12/02/17 06:56:08 (39.0) (81788):Static Policy: removing job because OnExitRemove has become true
12/02/17 06:56:08 (39.0) (81788):********** Shadow Exiting(102) **********

(xxx.xxx.xxx.xxx is my IP for the external network - eth0; 192.168.0.2 is the internal IP - eth1)

Is there any other relevant log or anything else that I should look for?

Thanks!

Roberto

On Fri, Dec 1, 2017 at 7:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 12/1/2017 3:07 PM, Roberto Tavares wrote:
Hello,

I think I'm almost there!

I'm trying to run a simple script:

echo "It Works" >> /tmp/thisshouldwork.txt


Are you specifying "thisshouldwork.txt" as your executable? If so, I would not expect it to work. Does it work from the command prompt without involving HTCondor? (My guess is no.) Instead of

  echo "It Works"

you probably want

  #!/bin/sh
  echo "It Works"

and then do chmod 700 thisshouldwork.txt (to set the executable bit). This is life in a Linux/Unix environment, nothing specific to HTCondor here.
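
As a sketch of those two steps, the script can be created, marked executable, and run once outside HTCondor before it is ever submitted. The file name hello.sh is an assumption used only for illustration.

```shell
#!/bin/sh
# Sketch only: hello.sh is a hypothetical file name.
set -e

# Write a script whose first line is the interpreter (shebang) line.
cat > hello.sh <<'EOF'
#!/bin/sh
echo "It Works"
EOF

# Set the executable bit for the owner, then run it directly.
chmod 700 hello.sh
./hello.sh
```

If this fails at the command prompt, it will fail under HTCondor as well, so it is a cheap first check.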

Take a look at
  http://research.cs.wisc.edu/htcondor/manual/quickstart.html
I think you will find it helpful for getting started; it covers the issues above.

Looking at the log below, it appears you submitted the job to HTCondor's "standard" universe, which you likely do not want (unless you have the C or Fortran source code for your program). Instead, you want the "vanilla" universe, which you select by placing the following in your submit file:

  universe = vanilla

(This is the default on recent HTCondor installs.)

Hope the above helps,
Todd

What happens:

- it goes to the queue
- it is removed from the queue
- it does not run (log files empty and file in tmp is not created) and it seems to fall into some black hole... :(

The only place I could find anything that looks like an error is the ShadowLog file, which gives me:

12/01/17 18:51:25 (?.?) (74915):******* Standard Shadow starting up *******
12/01/17 18:51:25 (?.?) (74915):** $CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $
12/01/17 18:51:25 (?.?) (74915):** $CondorPlatform: x86_64_Ubuntu14 $
12/01/17 18:51:25 (?.?) (74915):*******************************************
12/01/17 18:51:25 (?.?) (74915):uid=0, euid=122, gid=0, egid=131
12/01/17 18:51:25 (?.?) (74915):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 37.0
12/01/17 18:51:25 (37.0) (74915):Requesting Primary Starter
12/01/17 18:51:25 (37.0) (74915):Shadow: Request to run a job was ACCEPTED
12/01/17 18:51:25 (37.0) (74915):Shadow: RSC_SOCK connected, fd = 17
12/01/17 18:51:25 (37.0) (74915):Shadow: CLIENT_LOG connected, fd = 18
12/01/17 18:51:25 (37.0) (74915):My_Filesystem_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):My_UID_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):*Can't get address for checkpoint server host (NULL): No such file or directory*
12/01/17 18:51:25 (37.0) (74915): Entering pseudo_get_file_stream
12/01/17 18:51:25 (37.0) (74915): file = "/var/lib/condor/spool/37/cluster37.ickpt.subproc0"
12/01/17 18:51:25 (37.0) (74915):Created TCP listen socket <xxx.xxx.xxx.xxx:41412>
12/01/17 18:51:25 (37.0) (74915):Shadow: Job 37.0 exited, termsig = 0, coredump = 128, retcode = 0
12/01/17 18:51:25 (37.0) (74915):user_time = 1 ticks
12/01/17 18:51:25 (37.0) (74915):sys_time = 0 ticks
12/01/17 18:51:25 (37.0) (74915):*Shadow: Cannot notify user( Condor Job 37.0, tavares, w )*
12/01/17 18:51:25 (37.0) (74915):Static Policy: removing job because OnExitRemove has become true
12/01/17 18:51:25 (37.0) (74915):********** Shadow Exiting(102) **********

Just to keep it simple, I'd rather avoid using the checkpoint server. Is that possible?

I'm a little clueless now... can you give me any help on that?

Thank you!!!!

Roberto






--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685