Skip to content

Add functionality to index grib files as a grib file is written#494

Open
ColemanTom wants to merge 1 commit into
ecmwf:developfrom
ColemanTom:index_at_grib_write_time
Open

Add functionality to index grib files as a grib file is written#494
ColemanTom wants to merge 1 commit into
ecmwf:developfrom
ColemanTom:index_at_grib_write_time

Conversation

@ColemanTom

@ColemanTom ColemanTom commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Description

Previous workflow:

  • create grib file
  • save it
  • create index from that grib file

New workflow:

  • create message for grib file
  • save message
  • add message details to index file
  • repeat for remaining messages in file

There is no need for double handling of the grib file data anymore, providing significant time savings for this pattern.

NOTE: I only tested writing a grib file on the fly. In theory it should work for bufr too. I have confirmed bufr index's made before and after this change are identical.

Test cases

I ran both workflows above with the new build, and the output index files are bit identical. I also compared against an index created using 2.36.0 and got bit identical results still.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <eccodes.h>

#define MAX_PARAMS 100

int matches(long paramId, int n, long list[]) {
    for (int i = 0; i < n; i++) {
        if (paramId == list[i]) return 1;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    const char *input = NULL, *output = NULL, *index_keys = NULL;
    char  *output_idx = NULL;

    long params[MAX_PARAMS];
    int n_params = 0;
    int idx_err = 0;
    codes_index* idx;

    /* Parse args */
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--input") == 0 && i + 1 < argc) {
            input = argv[++i];
        } else if (strcmp(argv[i], "--output") == 0 && i + 1 < argc) {
            output = argv[++i];
        } else if (strcmp(argv[i], "--params") == 0) {
            while (i + 1 < argc && strncmp(argv[i + 1], "--", 2) != 0) {
                if (n_params >= MAX_PARAMS) {
                    fprintf(stderr, "Too many params (max %d)\n", MAX_PARAMS);
                    return 1;
                }
                params[n_params++] = atol(argv[++i]);
            }
        } else if (strcmp(argv[i], "--index") == 0) {
            index_keys = argv[++i];
        }
    }

    if (!input || !output || n_params == 0) {
        fprintf(stderr, "Usage: --input file --output file --params id1 id2 ...\n");
        return 1;
    }

    FILE* fin = fopen(input, "rb");
    if (!fin) {
        perror("Input file");
        return 1;
    }

    FILE* fout = fopen(output, "wb");
    if (!fout) {
        perror("Output file");
        fclose(fin);
        return 1;
    }

    if (index_keys) {
        size_t len = strlen(output) + 5;
        output_idx = malloc(len);
        if (!output_idx) {
            fprintf(stderr, "Memory allocation failed\n");
            return 1;
        }
        strcpy(output_idx, output);
        strcat(output_idx, ".idx");
        idx = codes_index_new(NULL, index_keys, &idx_err);
        if (!idx) {
            fprintf(stderr, "Index create failed: %s\n", codes_get_error_message(idx_err));
            /* non-fatal: continue without indexing */
        }
    }

    int err = 0;
    codes_handle* h = NULL;

    while (1) {
        h = codes_handle_new_from_file(NULL, fin, PRODUCT_GRIB, &err);

        if (!h) {
            if (err == CODES_SUCCESS) break;  /* EOF */
            fprintf(stderr, "Read error: %s\n", codes_get_error_message(err));
            break;
        }

        long paramId;
        if (codes_get_long(h, "paramId", &paramId) == CODES_SUCCESS) {

            if (matches(paramId, n_params, params)) {
                const void* buffer = NULL;
                size_t size = 0;

                if (codes_get_message(h, &buffer, &size) == CODES_SUCCESS) {
                    if (buffer && size > 0) {
                          off_t offset;
                        if (idx)
                            offset = (off_t)ftello(fout);
                        fwrite(buffer, 1, size, fout);
                        if (idx)
                            codes_index_add_message(idx, h, output, offset);
                    }
                }
            }
        }

        codes_handle_delete(h);
    }

    if (idx) {
        fflush(fout);   /* ensure all bytes are visible before index is used */
        codes_index_write(idx, output_idx);   /* optional: persist the index */
        codes_index_delete(idx);
    }

    fclose(fin);
    fclose(fout);

    return 0;
}

I ran this and collected timings with

#!/bin/bash
set -euo pipefail

./a.out \
    --input ~/tmp/ACCESS-G_2025051218_008.model.grb2 \
    --output separate.grib2 \
    --params 132 130 156 133 260238 157 247

grib_index_build -k paramId,level -o separate.grib2.ix separate.grib2

and

#!/bin/bash
set -euo pipefail

./index.out \
    --input ~/tmp/ACCESS-G_2025051218_008.model.grb2 \
    --output combined.grib2 \
    --params 132 130 156 133 260238 157 247 \
    --index paramId,level

I also tested the fortran bindings and they work too

program grib_filter_index
  use eccodes
  implicit none

  integer            :: fin, fout, idx_id
  integer            :: igrib, iret, i
  integer(kind=8)    :: out_offset, total_length
  integer(kind=4)    :: paramId
  character(len=512) :: input_file, output_file, output_idx
  character(len=256) :: index_keys
  character(len=512) :: arg
  integer            :: nargs, iarg, n_params
  integer, parameter :: MAX_PARAMS = 100
  integer(kind=4)    :: params(MAX_PARAMS)
  logical            :: do_index, matched

  input_file  = ''
  output_file = ''
  index_keys  = ''
  n_params    = 0
  do_index    = .false.

  ! --- Parse command line ---
  nargs = command_argument_count()
  iarg  = 1
  do while (iarg <= nargs)
    call get_command_argument(iarg, arg)
    select case (trim(arg))
    case ('--input')
      iarg = iarg + 1;  call get_command_argument(iarg, input_file)
    case ('--output')
      iarg = iarg + 1;  call get_command_argument(iarg, output_file)
    case ('--index')
      iarg = iarg + 1;  call get_command_argument(iarg, index_keys)
      do_index = .true.
    case ('--params')
      do
        if (iarg >= nargs) exit
        call get_command_argument(iarg + 1, arg)
        if (arg(1:2) == '--') exit
        iarg = iarg + 1
        n_params = n_params + 1
        read(arg, *) params(n_params)
      end do
    end select
    iarg = iarg + 1
  end do

  if (len_trim(input_file) == 0 .or. len_trim(output_file) == 0 .or. n_params == 0) then
    write(0,*) 'Usage: --input f --output f --params id1 id2 ... [--index keys]'
    stop 1
  end if

  ! --- Open files ---
  call grib_open_file(fin,  trim(input_file),  'r', iret)
  if (iret /= GRIB_SUCCESS) then
    write(0,*) 'Cannot open input:', trim(input_file)
    stop 1
  end if

  call grib_open_file(fout, trim(output_file), 'w', iret)
  if (iret /= GRIB_SUCCESS) then
    write(0,*) 'Cannot open output:', trim(output_file)
    stop 1
  end if

  ! --- Create empty index if requested ---
  if (do_index) then
    output_idx = trim(output_file) // '.idx'
    call grib_index_new(idx_id, trim(index_keys), iret)
    if (iret /= GRIB_SUCCESS) then
      write(0,*) 'Warning: index create failed:', iret, ' - continuing without index'
      do_index = .false.
    end if
  end if

  out_offset = 0_8

  ! --- Main loop ---
  do
    call grib_new_from_file(fin, igrib, iret)
    if (iret == GRIB_END_OF_FILE) exit
    if (iret /= GRIB_SUCCESS) then
      write(0,*) 'Read error:', iret
      exit
    end if

    call grib_get(igrib, 'paramId', paramId)

    matched = .false.
    do i = 1, n_params
      if (paramId == params(i)) then
        matched = .true.
        exit
      end if
    end do

    if (matched) then
      call grib_get(igrib, 'totalLength', total_length)

      if (do_index) then
        call grib_index_add_message(idx_id, igrib, trim(output_file), out_offset, iret)
        if (iret /= GRIB_SUCCESS) &
          write(0,*) 'Warning: index_add_message failed:', iret
      end if

      call grib_write(igrib, fout)
      out_offset = out_offset + total_length
    end if

    call grib_release(igrib)
  end do

  ! --- Write index and clean up ---
  if (do_index) then
    call grib_index_write(idx_id, trim(output_idx), iret)
    if (iret /= GRIB_SUCCESS) write(0,*) 'Warning: index write failed:', iret
    call grib_index_release(idx_id)
    write(*,*) 'Index written to:', trim(output_idx)
  end if

  call grib_close_file(fin)
  call grib_close_file(fout)
  write(*,*) 'Written', out_offset, 'bytes to', trim(output_file)

end program grib_filter_index

Timing details

I ran the two workflows mentioned above on three different disk types - /dev/shm, an NFS disk and a Lustre disk (no flash storage). In all three cases the new approach provided significant improvements. The number of messages being indexed is 490.

Run # /dev/shm   NFS   Lustre  
Separate Combined Separate Combined Separate Combined
1 11.03 4.7 26.15 10.06 12.79 7.32
2 10.72 5.23 25.97 11.18 14.16 8.19
3 11.23 6 30.34 10.6 14.03 8.32
4 10.81 5.23 27.47 9.67 13.92 8.22
Average 10.9475 5.29 27.4825 10.3775 13.725 8.0125

What has not been done?

  • There are no python bindings. I think it would be good to have them exist, but I have not looked at where they are setup, I think in python-eccodes, but I need to confirm

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

Previous workflow:
   - create grib file
   - save it
   - create index from that grib file

New workflow:
   - create message for grib file
   - save message
   - add message details to index file
   - repeat for remaining messages in file

There is no need for double handling of the grib file data anymore,
providing significant time savings for this pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant