
Re: [HTCondor-users] Jobs are dumped



Hi Todd,

Hummm... it does not work...

The script is named teste1.sh, it's chmod'ed 777. The contents are:
#!/bin/sh
echo "works" >> /tmp/itisworking.txt

The submission file is
####################
#
# submit description file
# Example 1: queuing multiple jobs with differing
# command line arguments and output files.
#
####################

Executable = teste1.sh
Universe   = standard

Arguments  = 1
Output     = foo.out1
Error      = foo.err1
Queue

All the logs are empty at the end.

The result of the ShadowLog is

12/02/17 06:56:08 (?.?) (81788):*******************************************
12/02/17 06:56:08 (?.?) (81788):uid=0, euid=122, gid=0, egid=131
12/02/17 06:56:08 (?.?) (81788):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 39.0
12/02/17 06:56:08 (39.0) (81788):Requesting Primary Starter
12/02/17 06:56:08 (39.0) (81788):Shadow: Request to run a job was ACCEPTED
12/02/17 06:56:08 (39.0) (81788):Shadow: RSC_SOCK connected, fd = 17
12/02/17 06:56:08 (39.0) (81788):Shadow: CLIENT_LOG connected, fd = 18
12/02/17 06:56:08 (39.0) (81788):My_Filesystem_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):My_UID_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):Can't get address for checkpoint server host (NULL): No such file or directory
12/02/17 06:56:08 (39.0) (81788): Entering pseudo_get_file_stream
12/02/17 06:56:08 (39.0) (81788): file = "/var/lib/condor/spool/39/cluster39.ickpt.subproc0"
12/02/17 06:56:08 (39.0) (81788):Created TCP listen socket <192.168.0.2:23983>
12/02/17 06:56:08 (39.0) (81788):Shadow: Job 39.0 exited, termsig = 0, coredump = 128, retcode = 0
12/02/17 06:56:08 (39.0) (81788):user_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):sys_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):Shadow: Cannot notify user( Condor Job 39.0, tavares, w )
12/02/17 06:56:08 (39.0) (81788):Static Policy: removing job because OnExitRemove has become true
12/02/17 06:56:08 (39.0) (81788):********** Shadow Exiting(102) **********

(xxx.xxx.xxx.xxx is my IP for the external network - eth0; 192.168.0.2 is the internal IP - eth1)

Is there any other relevant log or anything else that I should look for?

Thanks!

Roberto

On Fri, Dec 1, 2017 at 7:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 12/1/2017 3:07 PM, Roberto Tavares wrote:
Hello,

I think I'm almost there!

I'm trying to run a simple script:

echo "It Works" >> /tmp/thisshouldwork.txt


Are you specifying "thisshouldwork.txt" as your executable? If so, I would not expect it to work. Does it work from the command prompt without involving HTCondor? (My guess is no.) Instead of

  echo "It Works"

you probably want

  #!/bin/sh
  echo "It Works"

and then do chmod 700 thisshouldwork.txt (to set the executable bit). This is life on a Linux/Unix environment, nothing specific to HTCondor here.
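
As a rough illustration of the two steps above (add a shebang line, set the executable bit), the following sketch creates such a script in a scratch directory and runs it from the shell, with no HTCondor involved. The file name itworks.sh and the scratch directory are assumptions made for the example, not names from this thread.

```shell
#!/bin/sh
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create the script; the first line tells the kernel which interpreter to use.
# (itworks.sh is a hypothetical name chosen for this sketch.)
cat > itworks.sh <<'EOF'
#!/bin/sh
echo "It Works"
EOF

# Set the executable bit for the owner, as suggested above.
chmod 700 itworks.sh

# Run it directly from the command prompt and show what it printed.
output=$(./itworks.sh)
echo "$output"
```

If the script does not run this way from the command prompt, it will not run as an HTCondor executable either.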

Take a look at
http://research.cs.wisc.edu/htcondor/manual/quickstart.html
I think you will find it helpful for getting started; it covers the above issues.

In looking at the log below, it looks like you submitted the job to HTCondor's "standard" universe, which you likely do not want to do (unless you have the C or Fortran source code to your program). Instead, you want the "vanilla" universe, by placing the following into your submit file:

  universe = vanilla

(This is the default on recent HTCondor installs....).
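
For illustration, a complete submit file along these lines might look like the sketch below. It reuses the script and output file names from the submit file posted in this thread (teste1.sh, foo.out1, foo.err1). The submit file name teste1.sub, the scratch directory, and the added log line are assumptions made for the example.

```shell
#!/bin/sh
# Write the sketch into a scratch directory (an assumption for this example).
subdir=$(mktemp -d)
submit_file="$subdir/teste1.sub"

cat > "$submit_file" <<'EOF'
# Submit description file (sketch): one job in the vanilla universe.
universe   = vanilla
executable = teste1.sh
arguments  = 1
output     = foo.out1
error      = foo.err1
# Assumed addition: a job event log, not present in the posted file.
log        = foo.log
queue
EOF

# Show the file that was written.
cat "$submit_file"

# To submit, run condor_submit on this file from the directory
# that contains teste1.sh.
```

The only change that matters relative to the posted file is the universe line; the rest is kept as close to the original as possible.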

Hope the above helps,
Todd

What happens:

- it goes to the queue
- it is removed from the queue
- it does not run (log files empty and file in tmp is not created) and it seems to fall into some black hole... :(

The only place I could find that shows any error is the ShadowLog file, which gives me:

12/01/17 18:51:25 (?.?) (74915):******* Standard Shadow starting up *******
12/01/17 18:51:25 (?.?) (74915):** $CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $
12/01/17 18:51:25 (?.?) (74915):** $CondorPlatform: x86_64_Ubuntu14 $
12/01/17 18:51:25 (?.?) (74915):*******************************************
12/01/17 18:51:25 (?.?) (74915):uid=0, euid=122, gid=0, egid=131
12/01/17 18:51:25 (?.?) (74915):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 37.0
12/01/17 18:51:25 (37.0) (74915):Requesting Primary Starter
12/01/17 18:51:25 (37.0) (74915):Shadow: Request to run a job was ACCEPTED
12/01/17 18:51:25 (37.0) (74915):Shadow: RSC_SOCK connected, fd = 17
12/01/17 18:51:25 (37.0) (74915):Shadow: CLIENT_LOG connected, fd = 18
12/01/17 18:51:25 (37.0) (74915):My_Filesystem_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):My_UID_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):*Can't get address for checkpoint server host (NULL): No such file or directory*
12/01/17 18:51:25 (37.0) (74915): Entering pseudo_get_file_stream
12/01/17 18:51:25 (37.0) (74915): file = "/var/lib/condor/spool/37/cluster37.ickpt.subproc0"
12/01/17 18:51:25 (37.0) (74915):Created TCP listen socket <xxx.xxx.xxx.xxx:41412>
12/01/17 18:51:25 (37.0) (74915):Shadow: Job 37.0 exited, termsig = 0, coredump = 128, retcode = 0
12/01/17 18:51:25 (37.0) (74915):user_time = 1 ticks
12/01/17 18:51:25 (37.0) (74915):sys_time = 0 ticks
12/01/17 18:51:25 (37.0) (74915):*Shadow: Cannot notify user( Condor Job 37.0, tavares, w )*
12/01/17 18:51:25 (37.0) (74915):Static Policy: removing job because OnExitRemove has become true
12/01/17 18:51:25 (37.0) (74915):********** Shadow Exiting(102) **********

Just to keep it simple, I'd rather avoid using the checkpoint server. Is that possible?

I'm a little clueless now... can you give me any help on that?

Thank you!!!!

Roberto




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685