
[HTCondor-users] Jobs are dumped



Hello,

I think I'm almost there!

I'm trying to run a simple script:

echo "It Works" >> /tmp/thisshouldwork.txt
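
For completeness, here is roughly how that one-liner looks as a standalone script. This is only a sketch: the shebang line and the idea of saving it as its own executable file are my assumptions, and only the echo line is exactly what I run.

```shell
#!/bin/sh
# Minimal test job: append a marker line to a file in /tmp.
# If the job ever executes on a node, this file should exist there afterwards.
echo "It Works" >> /tmp/thisshouldwork.txt
```

Run by hand on the machine, it creates the file as expected, so the script itself does not seem to be the problem.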

What happens:

- it goes to the queue
- it is removed from the queue
- it does not run (the log files are empty and the file in /tmp is not created); it seems to fall into some black hole... :(

The only place where I could find any hint of an error is the ShadowLog file, which shows:

12/01/17 18:51:25 (?.?) (74915):******* Standard Shadow starting up *******
12/01/17 18:51:25 (?.?) (74915):** $CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $
12/01/17 18:51:25 (?.?) (74915):** $CondorPlatform: x86_64_Ubuntu14 $
12/01/17 18:51:25 (?.?) (74915):*******************************************
12/01/17 18:51:25 (?.?) (74915):uid=0, euid=122, gid=0, egid=131
12/01/17 18:51:25 (?.?) (74915):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 37.0
12/01/17 18:51:25 (37.0) (74915):Requesting Primary Starter
12/01/17 18:51:25 (37.0) (74915):Shadow: Request to run a job was ACCEPTED
12/01/17 18:51:25 (37.0) (74915):Shadow: RSC_SOCK connected, fd = 17
12/01/17 18:51:25 (37.0) (74915):Shadow: CLIENT_LOG connected, fd = 18
12/01/17 18:51:25 (37.0) (74915):My_Filesystem_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):My_UID_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):Can't get address for checkpoint server host (NULL): No such file or directory
12/01/17 18:51:25 (37.0) (74915):   Entering pseudo_get_file_stream
12/01/17 18:51:25 (37.0) (74915):   file = "/var/lib/condor/spool/37/cluster37.ickpt.subproc0"
12/01/17 18:51:25 (37.0) (74915):Created TCP listen socket <xxx.xxx.xxx.xxx:41412>
12/01/17 18:51:25 (37.0) (74915):Shadow: Job 37.0 exited, termsig = 0, coredump = 128, retcode = 0
12/01/17 18:51:25 (37.0) (74915):user_time = 1 ticks
12/01/17 18:51:25 (37.0) (74915):sys_time = 0 ticks
12/01/17 18:51:25 (37.0) (74915):Shadow: Cannot notify user( Condor Job 37.0, tavares, w )
12/01/17 18:51:25 (37.0) (74915):Static Policy: removing job because OnExitRemove has become true
12/01/17 18:51:25 (37.0) (74915):********** Shadow Exiting(102) **********

Just to keep it simple, I'd rather avoid using the checkpoint server. Is that possible?

I'm a little clueless now... can you give me any help with this?

Thank you!!!!

Roberto