
[Condor-users] large files problem


I have had the same problem. At first I thought it was because the master was a Windows machine and the others were Linux, but after switching to a Linux master and checkpoint server the problem is still there.

My jobs are terminated when the largest output file reaches almost 2 GB.

The file system on all disks is ext3. I am using Condor 6.8.4 I386-LINUX_RHEL3 on Mandriva Linux, in the standard universe with a checkpoint server.

My job log-file:
005 (005.000.000) 04/23 11:14:47 Job terminated.
       (0) Abnormal termination (signal 11)
       (0) No core file
               Usr 0 00:34:18, Sys 0 00:00:01  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 4 05:00:06, Sys 0 00:02:03  -  Total Remote Usage
               Usr 0 00:00:09, Sys 0 00:00:28  -  Total Local Usage
       35667072  -  Run Bytes Sent By Job
       25345412  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
009 (005.000.000) 04/23 11:14:47 Job was aborted by the user.

The StarterLog on the executing machine shows: *FSM* Got asynchronous event "CHILD_EXIT"



Edvin Erdtman Ph.D. student
Department of Natural Sciences
and Örebro Life Science Center
Örebro University
701 82 Örebro, Sweden
phone: +46 (0)19 30 36 69


Does condor support large files?

I'm investigating a problem related to condor, where no files bigger
than 2 GB can be written using the standard universe. The condor
version used in our cluster is:

$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

And here's a short Fortran program that can be used to reproduce the
problem. Warning: it writes a 2.6 GB file, so be sure to have enough
disk space before running it.

      program genbig
      parameter (N=1024)
      real array(N*N)
      do i=1,640
         write(10) (array(j),j=1,N*N)
      end do
      end

When compiled without condor_compile and run locally, it generates
a testfile of size 2.6 GB, as expected. When run on Condor in the
vanilla universe (so still without condor_compile), it also works
correctly and writes a 2.6 GB testfile.

BUT: when compiled using condor_compile and run locally from the shell,
it only generates a 2.1 GB file, and then exits. When compiled with
condor_compile and submitted to Condor in the standard universe, it
writes a 2.1 GB testfile and terminates with

(0) Abnormal termination (signal 11)

Of course, this is a problem, as every single test should generate a
2.6 GB file. Is there a way to fix this problem in Condor, or to
identify its cause?
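
For reference, here is a quick size check, as a sketch only. It assumes
the default REAL is 4 bytes and ignores the per-record markers that
unformatted output adds. Under those assumptions the test program
should write 640 * 1024 * 1024 * 4 = 2,684,354,560 bytes, while the
largest value a signed 32-bit integer can hold is 2,147,483,647, which
is about the 2.1 GB where the file stops. That the cutoff really comes
from a 32-bit file offset is only a guess on my part. The program and
variable names below are just for illustration.

```fortran
program sizecheck
  implicit none
  ! 64-bit integer kind, so the byte counts below do not overflow
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8), parameter :: n = 1024_i8            ! N from genbig
  integer(i8), parameter :: nrec = 640_i8          ! records written by genbig
  integer(i8), parameter :: bytes_per_real = 4_i8  ! assumption: 4-byte default REAL
  integer(i8) :: payload, limit32

  ! Data written by genbig, not counting unformatted record markers
  payload = nrec * n * n * bytes_per_real

  ! Largest value representable in a signed 32-bit integer
  limit32 = 2_i8**31 - 1_i8

  print *, 'payload bytes       :', payload
  print *, 'signed 32-bit limit :', limit32
  print *, 'payload exceeds it  :', payload > limit32
end program sizecheck
```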

This is only a short program to reproduce the problem; our scientific
code is not as naive, and requires the use of the standard universe in
order to take advantage of the checkpointing.

Any help will be appreciated.