Parallel netCDF Q&A


This page provides answers to frequently asked questions and tips for obtaining better I/O performance. Readers are also referred to the NetCDF FAQ for netCDF-specific questions.
  1. Q: How do I improve I/O performance on Lustre?
  2. Q: What is file striping?
  3. Q: Must I use a parallel file system to run PnetCDF?
  4. Q: Can a netCDF-4 program make use of PnetCDF internally (instead of HDF5) to perform parallel I/O?
  5. Q: How do I avoid the "data shift penalty" due to the growth of file header?
  6. Q: How do I enable file access layout alignment for fixed-sized variables?
  7. Q: What run-time environment variables are available in PnetCDF?
  8. Q: How do I find out the PnetCDF and MPI-IO hint values used in my program?
  9. Q: How do I inquire the file header size and the amount of space allocated for it?
  10. Q: Should I consider using nonblocking APIs?
  11. Q: How do I use the buffered nonblocking write APIs?
  12. Q: What is the difference between collective and independent APIs?
  13. Q: Should I use collective APIs or independent APIs?
  14. Q: Is there an API to read/write multiple subarrays of a single variable?
  15. Q: What file formats does PnetCDF support and what are their differences?
  16. Q: How do I obtain the error message corresponding to a returned error code?
  17. Q: Does PnetCDF support fill mode?
  18. Q: Is there an API that reports the amount of data read/written that was carried out by PnetCDF?
  19. Q: Does PnetCDF support compound data types?
  20. Q: Can I run my PnetCDF program sequentially?
  21. Q: What level of parallel I/O data consistency is supported by PnetCDF?
  22. Q: Where can I find PnetCDF example programs?
  23. Q: Is there a mailing list for PnetCDF discussions and questions?

  1. Q: How do I improve I/O performance on Lustre?
    A: Lustre is a parallel file system that allows users to customize a file's striping settings. If your I/O requests are sufficiently large, the best strategy is usually to set the striping count to the maximum allowed by the system. For Lustre, the user-configurable parameters are striping count, striping size, and striping offset.
    • Striping count is the number of object storage targets (OSTs), i.e. the number of file servers that store the file in a round-robin fashion.
    • Striping size is the size of each block; 1 MB is a good value.
    • Striping offset is the index of the starting OST (default -1).
    To find the (default) striping setting of a file or directory on Lustre, use the command:
    % lfs getstripe filename
    stripe_count: 12 stripe_size: 1048576 stripe_offset: -1
    The command to change a directory's/file's striping setting is "lfs". Its syntax is:
    % lfs setstripe -s stripe_size -c stripe_count -o start_ost_index directory|filename
    Note that users can change a directory's striping settings; new files created in a Lustre directory inherit the directory's settings. In summary, we recommend the following.
    1. Use the command "lfs" to set the striping count and size for the output directory and create your output files there. (Striping can also be requested through MPI-IO hints, as sketched after this list.)
    2. Use collective APIs. Collective I/O coordinates the application processes and reorganizes their requests into an access pattern that better fits the underlying file system.
    3. Use nonblocking APIs for multiple small requests. Nonblocking APIs aggregate small requests into larger ones and hence have a better chance of achieving higher performance.
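    In addition to the lfs command, many MPI-IO implementations (e.g. ROMIO) recognize the reserved hints "striping_factor" and "striping_unit", which PnetCDF passes through to MPI-IO at file creation time. Whether they take effect depends on the MPI-IO library and the file system, so treat the fragment below as a sketch and verify the result with "lfs getstripe"; the values shown (32 stripes of 1 MB) are only examples.
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");      /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");   /* stripe size in bytes */
    ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
    MPI_Info_free(&info);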

  2. Q: What is file striping?
    A: On a parallel file system, a file can be divided into blocks of the same size, called the striping size, which are stored on a set of file servers in a round-robin fashion. File striping allows multiple file servers to service I/O requests simultaneously, achieving higher aggregate I/O bandwidth.

  3. Q: Must I use a parallel file system to run PnetCDF?
    A: No. However, using a parallel file system with a proper file striping setting can significantly improve your parallel I/O performance.

  4. Q: Can a netCDF-4 program make use of PnetCDF internally (instead of HDF5) to perform parallel I/O?
    A: Yes, if netCDF-4 version 4.2.1.1 or newer is used. NetCDF-4 can perform parallel I/O through either PnetCDF or HDF5. (Note that when using HDF5 to carry out parallel I/O, the files must be in HDF5 format instead of a classic netCDF format.) The HDF5 option is enabled by setting the file create mode to either NC_CLASSIC_MODEL or NC_NETCDF4 when calling the API nc_create_par. In order to use PnetCDF for parallel I/O, one must set the create mode to NC_PNETCDF. When using PnetCDF, netCDF-4 programs can perform parallel I/O on CDF-1 and CDF-2 files. Support for CDF-5 in netCDF-4 is on the wish list. Example netCDF-4 programs that use PnetCDF for parallel I/O are available here.
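    As a minimal sketch (assuming a netCDF-4 library built with PnetCDF support), the fragment below asks netCDF-4 to use PnetCDF underneath by passing NC_PNETCDF to nc_create_par; the file name and error handling are only examples.
    #include <netcdf.h>
    #include <netcdf_par.h>   /* nc_create_par() and the parallel-access mode flags */

    err = nc_create_par("filename.nc", NC_CLOBBER|NC_PNETCDF,
                        MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
    if (err != NC_NOERR) printf("Error: %s\n", nc_strerror(err));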

  5. Q: How do I avoid the "data shift penalty" due to the growth of file header?
    A: "Data shift" occurs when the size of file header grows. Files in CDF formats comprise two sections: metadata section and data section. The metadata section also referred as file header is stored at the beginning of the file. The data section is stored after the metadata section and its beginning file offset is determined at the first call to the end-define API (eg. ncmpi_enddef or nc_enddef), when the file is created. Afterward, the beginning offset is recalculated every time the program calls end-define (from entering the re-define mode, finishing changes to metadata, and exiting).

    The "data shift penalty" happens when the new file header grows bigger than the space reserved for the original header and forces PnetCDF to "shift" the entire data section to a location with higher file offset. The header of a netCDF file can grow if a program opens an existing file and enters the redefine mode to add more metadata (e.g. new attributes, dimensions, or variables). PnetCDF provides an I/O hint, nc_header_align_size, to allow user to preserve a larger space for file header if it is expected to grow. The default is 512 bytes if file striping size cannot be obtained from the underneath MPI-IO library. If the file striping size can be obtained and the sum size of all variables is larger than 4 times the file striping size, the default is set to the file striping size. PnetCDF always sets a header space of size equal to a multiple of nc_header_align_size. An example code fragment to set the hint to 1MB and pass it to create a file is given below.

    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_header_align_size", "1048576");
    ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
    You can also use the run-time environment variable PNETCDF_HINTS to set a desired value. For more information on the netCDF file layout, readers are referred to Parts of a NetCDF Classic File.

    Note all I/O hints in PnetCDF and MPI-IO are advisory. The actual values used by PnetCDF and MPI-IO may be different from the ones set by the user programs. Users are encouraged to print the actual values used by both libraries. See I/O hints for how to print the hint values.


  6. Q: How do I enable file access layout alignment for fixed-sized variables?
    A: On most file systems, file locking is performed in units of file blocks. If a write straddles two blocks, then locks must be acquired for both blocks. Aligning the start of a variable to a block boundary can often eliminate all unaligned file system accesses. For IBM's GPFS and Lustre, the locking unit size is also the file striping size. The PnetCDF hint for setting the file alignment size is nc_var_align_size. Below is an example of setting the alignment size to 1 MB.
    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_var_align_size", "1048576");
    ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
    If you are using independent APIs, setting this hint is more important than when using collective APIs. This is because most recent MPI-IO implementations have incorporated file access alignment in their collective I/O functions, provided MPI-IO can successfully retrieve the file striping information from the underlying parallel file system. This is one of the reasons we encourage PnetCDF users to use collective APIs whenever possible.

    Note all I/O hints in PnetCDF and MPI-IO are advisory. The actual values used by PnetCDF and MPI-IO may be different from the ones set by the user programs. Users are encouraged to print the actual values used by both libraries. See I/O hints for how to print the hint values.


  7. Q: What run-time environment variables are available in PnetCDF?
    A: PnetCDF defines two run-time environment variables: PNETCDF_SAFE_MODE and PNETCDF_HINTS.
    • PNETCDF_HINTS allows users to pass I/O hints to the PnetCDF library. Hints include both PnetCDF and MPI-IO hints. The value is a string of hints separated by ";", each in the form "keyword=value". For example, under csh/tcsh, use the command:
    setenv PNETCDF_HINTS "romio_ds_write=disable;nc_header_align_size=1048576"
    • PNETCDF_SAFE_MODE is used to enable/disable the internal checking of attribute/argument consistency across all processes. Set it to 1 to enable the checking. The default is 0, i.e. disabled.
    Note that the environment variables take precedence over the (hint) values set in the application program.
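    For Bourne-type shells (sh/bash), the equivalent of the csh/tcsh command above is:
    export PNETCDF_HINTS="romio_ds_write=disable;nc_header_align_size=1048576"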

  8. Q: How do I find out the PnetCDF and MPI-IO hint values used in my program?
    A: Hint values can be retrieved from calls to ncmpi_get_file_info and MPI_Info_get. Users are encouraged to check the hint values actually used in their programs. Since all hints are advisory, the actual values used by PnetCDF and MPI-IO may differ from the values set by the user program. The actual hint values are adjusted automatically based on many factors, including the file size, variable sizes, and file system settings. Example programs that print only the PnetCDF hints can be found in the "examples" directory of the PnetCDF release: hints.c, hints.f, and hints.f90. Below is a C code fragment that prints all I/O hints, including both PnetCDF and MPI-IO hints.
        int i, nkeys, err;
        MPI_Info info_used;

        err = ncmpi_get_file_info(ncid, &info_used);
        MPI_Info_get_nkeys(info_used, &nkeys);
        for (i=0; i<nkeys; i++) {
            char key[MPI_MAX_INFO_KEY], value[MPI_MAX_INFO_VAL];
            int  valuelen, flag;
            MPI_Info_get_nthkey(info_used, i, key);
            MPI_Info_get_valuelen(info_used, key, &valuelen, &flag);
            MPI_Info_get(info_used, key, valuelen+1, value, &flag);
            printf("I/O hint: key = %21s, value = %s\n", key, value);
        }
        MPI_Info_free(&info_used);
        

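  9. Q: How do I inquire the file header size and the amount of space allocated for it?
    A: PnetCDF provides two inquiry APIs for this purpose: ncmpi_inq_header_size reports the current size of the file header, and ncmpi_inq_header_extent reports the amount of file space allocated for it (the header extent). The extent can be larger than the header size; the difference is the space reserved for header growth, which can be controlled with the nc_header_align_size hint described above. Both APIs can be called any time between file open/create and close. See the C Interface Guide for details.
         int ncmpi_inq_header_size  (int ncid, MPI_Offset *size);
         int ncmpi_inq_header_extent(int ncid, MPI_Offset *extent);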

  10. Q: Should I consider using nonblocking APIs?
    A: Nonblocking APIs can aggregate a sequence of small requests into a large one and hence achieve better I/O performance. We encourage you to try the nonblocking APIs if your program exhibits any of the following I/O patterns:
    • There are many small variables defined in the netCDF file. Processes read/write multiple variables in a sequence. (PnetCDF aggregation can handle requests across variables.)
    • Processes read/write a sequence of subarrays of the same variable. (PnetCDF aggregation can also handle requests to a single variable.)
    • The numbers of read/write requests are different among processes.
    Note that the user buffers must not be touched between the calls to the nonblocking APIs and the corresponding wait APIs, unless the buffered nonblocking write APIs are used. If the contents of a buffer are changed before the wait call, the outcome (the contents of the user read buffer or of the file) is undefined. If a user buffer is freed before the wait call, the program may crash.
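    Below is a minimal sketch of the typical pattern (the variable IDs, start/count arrays, and buffers are placeholders): post several nonblocking writes, then flush them with a single collective wait.
    int reqs[2], stats[2];

    err = ncmpi_iput_vara_float(ncid, varid1, start1, count1, buf1, &reqs[0]);
    err = ncmpi_iput_vara_float(ncid, varid2, start2, count2, buf2, &reqs[1]);

    /* buf1 and buf2 must stay untouched until the wait completes */
    err = ncmpi_wait_all(ncid, 2, reqs, stats);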

  11. Q: How do I use the buffered nonblocking write APIs?
    A: The buffered nonblocking write APIs copy the contents of user buffers into an internally allocated buffer, so the user buffers can be reused immediately after the calls return. A typical way to use these APIs is described below and sketched in the code fragment that follows.
    • First, tell PnetCDF how much space can be allocated to be used by the APIs.
    • Make calls to the buffered put APIs.
    • Make calls to the (collective) wait APIs.
    • Free the space allocated by the internal buffer.
    For further information about the buffered nonblocking APIs, readers are referred to this page.
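    A minimal sketch of these four steps is shown below (varid, start, count, and buf are placeholders; the attached buffer must be at least as large as the total size of the pending bput requests).
    int req, st;
    err = ncmpi_buffer_attach(ncid, 1048576);     /* 1. reserve 1 MB of internal space */
    err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
                                                  /* 2. buffered put; buf can be reused right away */
    err = ncmpi_wait_all(ncid, 1, &req, &st);     /* 3. commit the pending request */
    err = ncmpi_buffer_detach(ncid);              /* 4. free the internal buffer */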

  12. Q: What is the difference between collective and independent APIs?
    A: Collective APIs require all MPI processes to participate in the call. This requirement allows MPI-IO and PnetCDF to coordinate the requesting processes and rearrange their requests into a form that achieves the best performance from the underlying file system. On the contrary, independent APIs (also referred to as non-collective) have no such requirement. All PnetCDF collective APIs (except create, open, and close) have a suffix of "_all", corresponding to their independent counterparts. To switch from collective data mode to independent data mode, users must call ncmpi_begin_indep_data; ncmpi_end_indep_data is used to exit independent mode.
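    For example, a minimal sketch of switching into and out of independent data mode (varid, start, count, and buf are placeholders):
    err = ncmpi_begin_indep_data(ncid);                          /* leave collective data mode  */
    err = ncmpi_put_vara_float(ncid, varid, start, count, buf);  /* independent (no "_all") API */
    err = ncmpi_end_indep_data(ncid);                            /* return to collective mode   */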

  13. Q: Should I use collective APIs or independent APIs?
    A: Users are encouraged to use collective APIs whenever possible. Collective API calls require the participation of all MPI processes that open the shared file. This requirement allows MPI-IO and PnetCDF to coordinate the requesting processes and rearrange their requests into a form that achieves the best performance from the underlying file system. If the nature of the application's I/O does not lend itself to collective calls (for example, the number of requests differs among processes or is determined at run time), then we recommend the following.
    • Have all processes participate in the collective calls. A process that has nothing to request can still call a collective API with a zero-length request, by setting the contents of its count argument to zero (see the sketch after this list).
    • Use nonblocking APIs. Individual processes can make any number of calls to nonblocking APIs independently of other processes. At the end, the collective wait API ncmpi_wait_all is recommended, so that all nonblocking requests are committed to the file system.
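    A minimal sketch of a zero-length collective call is shown below (varid and buf are placeholders; buf is not accessed because the request contains zero elements).
    MPI_Offset start[2] = {0, 0};
    MPI_Offset count[2] = {0, 0};   /* zero-length request: this rank contributes no data */
    err = ncmpi_put_vara_float_all(ncid, varid, start, count, buf);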

  14. Q: Is there an API to read/write multiple subarrays of a single variable?
    A: The family of varn APIs, listed below, can read/write a list of subarrays of a variable in a single call. These APIs provide functionality similar to the H5Sselect_elements API in HDF5. See the C Interface Guide for detailed information. Example programs using these APIs can be found under the examples directory of the PnetCDF release (C/put_varn_int.c, C/put_varn_float.c, F77/put_varn_int.f, F77/put_varn_real.f, F90/put_varn_int.f90, and F90/put_varn_real.f90).
        ncmpi_get_varn_<type>_all
        ncmpi_get_varn_<type>
        ncmpi_put_varn_<type>_all
        ncmpi_put_varn_<type>
        
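    For instance, below is a minimal sketch that writes two subarrays of a 2-D integer variable in one call (varid and buf are placeholders; buf holds the values of both subarrays packed contiguously in the order they are listed).
        MPI_Offset s0[2] = {0, 0}, c0[2] = {1, 4};   /* first  subarray: 1 x 4 */
        MPI_Offset s1[2] = {2, 1}, c1[2] = {2, 2};   /* second subarray: 2 x 2 */
        MPI_Offset *starts[2] = {s0, s1};
        MPI_Offset *counts[2] = {c0, c1};

        err = ncmpi_put_varn_int_all(ncid, varid, 2, starts, counts, buf);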

  15. Q: What file formats does PnetCDF support and what are their differences?
    A: PnetCDF supports the CDF-1, CDF-2, and CDF-5 formats. CDF-1 was the format used by netCDF through version 3.5.1. In CDF-1, both the file size and the size of each individual variable are limited by what a 4-byte signed integer can represent (2^31 = 2147483648 bytes). Starting from version 3.6.0, netCDF added support for the CDF-2 format. CDF-2 allows files larger than 2 GB. In addition, CDF-2 allows more special characters in the names of dimensions, variables, and attributes. CDF-2 is backward compatible with CDF-1. CDF-5 further relaxes the size limitation, allowing individual variables larger than 2 GB. CDF-5 also adds new data types, including all the unsigned and 64-bit integer types. Check the CDF-5 format specification for the detailed differences (highlighted in colors).

  16. Q: How do I obtain the error message corresponding to a returned error code?
    A: All PnetCDF APIs return an integer error code indicating the error status. NC_NOERR, NF_NOERR, and NF90_NOERR mean the API completed successfully. All error codes are non-positive integer constants defined in the header file pnetcdf.h. The APIs ncmpi_strerror/nfmpi_strerror/nf90mpi_strerror turn an error code into a human-readable string; for example, NC_EBADID becomes "NetCDF: Not a valid ID". The code fragment below shows one way to check for an error and print the error message.
        err = ncmpi_create(comm, path, cmode, info, &ncid);
        if (err != NC_NOERR) {
            int rank;
            MPI_Comm_rank(comm, &rank);
            printf("Error at rank %d: %s\n", rank, ncmpi_strerror(err));
        }
        

  17. Q: Does PnetCDF support fill mode?
    A: No. This is because fill mode can be very expensive (i.e. it pre-fills variables with fill values that are later overwritten by the user's data). See the netCDF interface guide on nc_set_fill for more information.

  18. Q: Is there an API that reports the amount of data read/written that was carried out by PnetCDF?
    A: The following two APIs report the amount of data that has been read/written since the file was opened/created. The amount includes the I/O to the file header as well as to the variables. The reported amount is on a per-process basis. The APIs can be called any time between file open/create and close.
         int ncmpi_inq_get_size(int ncid, MPI_Offset *size);
         int ncmpi_inq_put_size(int ncid, MPI_Offset *size);
        

  19. Q: Does PnetCDF support compound data types?
    A: No. This is due to the limitations of the CDF file format specifications.

  20. Q: Can I run my PnetCDF program sequentially?
    A: Yes. Because a PnetCDF program is also an MPI program, it can be run on a single process under the MPI run-time environment.

  21. Q: What level of parallel I/O data consistency is supported by PnetCDF?
    A: PnetCDF follows the same parallel I/O data consistency semantics as the MPI-IO standard. Readers are also referred to the following paper.
    Rajeev Thakur, William Gropp, and Ewing Lusk, On Implementing MPI-IO Portably and with High Performance, in the Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, May 1999.

  22. Q: Where can I find PnetCDF example programs?
    A: PnetCDF releases come with a set of example programs in C, Fortran, and Fortran 90. They are available under the directory named "examples".

  23. Q: Is there a mailing list for PnetCDF discussions and questions?
    A: We discuss the design and use of the PnetCDF library on the parallel-netcdf@mcs.anl.gov mailing list. Anyone interested in developing or using PnetCDF is encouraged to join. Visit the list information page for details.
