Robert R. McCormick School of Engineering and Applied Science Electrical Engineering and Computer Science Department Center for Ultra-scale Computing and Information Security at Northwestern University

Sponsor:

PnetCDF development was sponsored by the Scientific Data Management Center (SDM) under the DOE program of Scientific Discovery through Advanced Computing (SciDAC). It was also supported in part by the National Science Foundation under SDCI HPC program award number OCI-0724599 and HECURA program award number CCF-0938000.

Project Team Members:

Northwestern University

Argonne National Laboratory






Tutorial on Subfiling in PnetCDF

Overview

Subfiling partitions a NetCDF file internally into multiple files (subfiles) while the data still appears to users as a single NetCDF file. To use the feature, users request it through an MPI hint. Once the hint is set, all variables defined in the program are partitioned and stored in subfiles. The figure below illustrates a high-level view of the subfiling mechanism.

Building and running with subfiling

Subfiling is disabled by default. To enable it, configure the PnetCDF source code with the "--enable-subfiling" option. Once it is enabled, users request subfiling through an MPI hint:
     ...
     MPI_Info info;
     MPI_Info_create(&info);

     MPI_Info_set(info, "nc_num_subfiles", "2");

     ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER|NC_64BIT_DATA,
                  info, &ncid);
     ...
     

The example above creates two subfiles in addition to the original file that the user asked to create. During writes, all subfile-related information is stored in the master file, so that reading from subfiles is transparent. In other words, no code changes are needed for reads.
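Because MPI hint values are passed as strings, a program that holds the subfile count as an integer has to convert it first. The following is a minimal sketch of such a conversion; the helper name `format_num_subfiles` is hypothetical and is not part of PnetCDF or MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not a PnetCDF or MPI function): render a subfile
 * count as the decimal string used as the "nc_num_subfiles" hint value.
 * Returns 0 on success, -1 if the count is invalid or buf is too small. */
int format_num_subfiles(int num_subfiles, char *buf, size_t buflen)
{
    assert(buf != NULL);
    if (num_subfiles < 1)
        return -1;
    int n = snprintf(buf, buflen, "%d", num_subfiles);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}
```

The resulting buffer would then take the place of the literal "2" in the MPI_Info_set call shown above.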

Note that programs built with subfiling must be run in parallel; this applies to both writes and reads. Note also that the number of subfiles cannot exceed the number of MPI ranks, because subfiling partitions the original communicator according to the number of subfiles. If the requested number of subfiles is greater than the number of available MPI ranks, the program creates a normal file without partitioning.
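The constraint above can be pictured with a small rank-to-subfile mapping. The sketch below is not PnetCDF's internal code. It assumes, purely for illustration, that ranks are split into contiguous groups of nearly equal size, one group per subfile, and that a request larger than the rank count falls back to an ordinary file as described above. The names `effective_num_subfiles` and `subfile_of_rank` are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of subfiles actually used.
 * Returns 0 (meaning "ordinary, unpartitioned file") when the request
 * exceeds the number of ranks, mirroring the fallback in the text.
 * Treating requests below 1 the same way is an assumption. */
int effective_num_subfiles(int requested, int nprocs)
{
    assert(nprocs > 0);
    if (requested < 1 || requested > nprocs)
        return 0;
    return requested;
}

/* Hypothetical helper: the subfile group a rank belongs to.
 * Assumption: contiguous, near-equal groups. This is one plausible
 * "color" for splitting a communicator, not necessarily what PnetCDF
 * computes internally. */
int subfile_of_rank(int rank, int nprocs, int num_subfiles)
{
    assert(num_subfiles >= 1 && num_subfiles <= nprocs);
    assert(rank >= 0 && rank < nprocs);
    /* long long keeps the product from overflowing on large jobs */
    return (int)(((long long)rank * num_subfiles) / nprocs);
}
```

With 4 ranks and 2 subfiles, this mapping places ranks 0 and 1 in group 0 and ranks 2 and 3 in group 1. Every group is non-empty whenever the number of subfiles does not exceed the number of ranks, which is why the limit matters.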

File layouts with and without subfiling

This section describes the file layout with and without subfiling to help explain the mechanism behind the subfiling module in PnetCDF. Regardless of whether subfiling is enabled, all generated files are in the NetCDF file format. Without subfiling, all variables are stored in the original file specified by the user. For example, the file t1.4.1.0.nc, generated by running test/subfile/test-subfile.c on 2 ranks (mpiexec -n 2 ./test_subfile -f t1 -s 2 -l 4), looks like this:

    netcdf t1.4.1.0 {
    // file format: CDF-5 (big variables)
    dimensions:
        dim0_0 = 8 ;
        dim0_1 = 4 ;
        dim0_2 = 4 ;
    variables:
        int var0_0(dim0_0, dim0_1, dim0_2) ;
    data:

    var0_0 =
        ...
    

If the number of subfiles is set to N, the same program generates N+1 files: one original (master) file and N subfiles. For example, the test case above generates one master file (t1.4.1.0.nc) and two subfiles (t1.4.1.0.nc.subfile_0 and t1.4.1.0.nc.subfile_1). Because the file is partitioned, the master file no longer contains any data but holds the information associated with the subfiles, stored as attributes. The number of subfiles requested through the nc_num_subfiles hint is recorded in the num_subfiles attributes, while the range attributes specify the data range stored in each subfile.
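The naming pattern in this example can be reproduced with a few lines of C. This is a sketch inferred from the file names shown here ("<master>.subfile_<index>"), not a PnetCDF API; the helper names `subfile_name` and `total_files` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the name of subfile `index` from the master
 * file name, following the "<master>.subfile_<index>" pattern seen in
 * the example. Returns 0 on success, -1 if the buffer is too small. */
int subfile_name(const char *master, int index, char *buf, size_t buflen)
{
    assert(master != NULL && buf != NULL && index >= 0);
    int n = snprintf(buf, buflen, "%s.subfile_%d", master, index);
    return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}

/* Total number of files produced for N subfiles: one master plus N. */
int total_files(int num_subfiles)
{
    assert(num_subfiles >= 1);
    return num_subfiles + 1;
}
```

For the master file t1.4.1.0.nc and index 1, the helper produces t1.4.1.0.nc.subfile_1, and two subfiles give three files in total.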

Using the example above, the subfiling-related metadata of t1.4.1.0.nc is stored as both global and variable-specific attributes:

    netcdf t1.4.1.0 {
    // file format: CDF-5 (big variables)
    dimensions:
        dim0_0 = 8 ;
        dim0_1 = 4 ;
        dim0_2 = 4 ;
    variables:
        int var0_0 ;
                var0_0:_PnetCDF_SubFiling.par_dim_index = 0 ;
                var0_0:_PnetCDF_SubFiling.ndims_org = 3 ;
                var0_0:_PnetCDF_SubFiling.num_subfiles = 2 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_0).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_1).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_2).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_0).subfile.1 = 4, 7 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_1).subfile.1 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_2).subfile.1 = 0, 3 ;

    // global attributes:
                :num_subfiles = 2 ;
    data:

        var0_0 = 0 ;
    

Note that, because the file has been subfiled, the master file no longer stores the data, as indicated by var0_0 = 0.
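The range attributes in the master file follow from simple block arithmetic on the partitioned dimension (dim0_0, of length 8, split over 2 subfiles). The sketch below reproduces those values. The source only shows the evenly divisible case, so the handling of a remainder here is an assumption, and the names `range_t` and `subfile_range` are hypothetical.

```c
#include <assert.h>

/* Inclusive index range [first, last] along one dimension. */
typedef struct { long long first, last; } range_t;

/* Hypothetical helper: range of the partitioned dimension owned by
 * subfile `index`. Assumption: block partitioning, with any remainder
 * spread over the lowest-numbered subfiles. PnetCDF may handle the
 * uneven case differently. */
range_t subfile_range(long long dim_len, int num_subfiles, int index)
{
    assert(num_subfiles >= 1 && dim_len >= num_subfiles);
    assert(index >= 0 && index < num_subfiles);
    long long base = dim_len / num_subfiles;   /* minimum block size */
    long long rem  = dim_len % num_subfiles;   /* blocks with one extra */
    long long extra_before = index < rem ? index : rem;
    range_t r;
    r.first = index * base + extra_before;
    r.last  = r.first + base + (index < rem ? 1 : 0) - 1;
    return r;
}
```

For a length of 8 and 2 subfiles, this yields 0, 3 for subfile 0 and 4, 7 for subfile 1, matching the dim0_0 attributes above. The dimensions that are not partitioned (dim0_1 and dim0_2) keep their full range of 0, 3 in every subfile.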

Each subfile contains its dataset as follows:

    netcdf t1.4.1.0.nc.subfile_0 {
    // file format: CDF-5 (big variables)
    dimensions:
        dim0_0.var0_0 = 4 ;
        dim0_1.var0_0 = 4 ;
        dim0_2.var0_0 = 4 ;
    variables:
        int var0_0(dim0_0.var0_0, dim0_1.var0_0, dim0_2.var0_0) ;
                var0_0:_PnetCDF_SubFiling.range(dim0_0).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_1).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_2).subfile.0 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.subfile_index = 0 ;
    data:

         var0_0 = ...
    
    netcdf t1.4.1.0.nc.subfile_1 {
    // file format: CDF-5 (big variables)
    dimensions:
        dim0_0.var0_0 = 4 ;
        dim0_1.var0_0 = 4 ;
        dim0_2.var0_0 = 4 ;
    variables:
        int var0_0(dim0_0.var0_0, dim0_1.var0_0, dim0_2.var0_0) ;
                var0_0:_PnetCDF_SubFiling.range(dim0_0).subfile.1 = 4, 7 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_1).subfile.1 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.range(dim0_2).subfile.1 = 0, 3 ;
                var0_0:_PnetCDF_SubFiling.subfile_index = 1 ;
    data:

        var0_0 = ...
    
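Given the per-subfile ranges shown in these listings, a global index along the partitioned dimension can be mapped to a subfile and an index within that subfile. The following sketch illustrates the lookup. It assumes that the local index is the global index minus the start of the subfile's range, which is consistent with each subfile having a local dimension of length 4, but it is not taken from the PnetCDF source. The names `index_range` and `locate_index` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of global indices held by one subfile. */
typedef struct { long long first, last; } index_range;

/* Hypothetical helper: find the subfile whose range covers global_index.
 * On success, stores the offset within that subfile in *local_index and
 * returns the subfile index. Returns -1 if no range covers the index. */
int locate_index(const index_range *ranges, int num_subfiles,
                 long long global_index, long long *local_index)
{
    assert(ranges != NULL && local_index != NULL && num_subfiles >= 1);
    for (int i = 0; i < num_subfiles; i++) {
        if (global_index >= ranges[i].first &&
            global_index <= ranges[i].last) {
            *local_index = global_index - ranges[i].first;
            return i;
        }
    }
    return -1;
}
```

With the ranges 0, 3 and 4, 7 from the example, global index 5 along dim0_0 falls in subfile 1 at local index 1.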

Current limitations

Future work

References


© 2011 Robert R. McCormick School of Engineering and Applied Science, Northwestern University

Last Updated: 2016-11-06 13:49:49 -0600 (Sun, 06 Nov 2016)