clev2er

CLEV2ER Sea Ice and Icebergs GPP Project

Documentation for the CLEV2ER Sea Ice and Icebergs GPP project, hosted on GitHub at github.com/mssl-softeng/clev2er_sii.

The GPP runs within a framework designed for (but not restricted to) Level-1b to Level-2 processing of ESA radar altimetry mission data. The key features of the framework are dynamically loaded algorithm classes (from XML or YML lists of algorithms), in-built support for multi-processing, and a consistent automated development and testing workflow. There are many run-time options in the chain controller command line tool.

The diagram below shows a simplified representation of the framework and its components.

![CL-SII Framework](https://www.homepages.ucl.ac.uk/~ucasamu/cl_liiw_framework.png)

Main Features

  • Command line chain controller tool : src/clev2er/tools/run_chain.py
  • input L1b file selection
    • single file
    • multiple files from a single directory
    • recursive search from a single directory
    • date or time based search
    • selection of CRISTAL instrument mode (SIC, SAC, SIO) and/or processing mode (HR, FF, LR, ...)
  • dynamic algorithm loading from XML or YML list(s)
    • algorithms are classes of type Algorithm with configurable .init(), .process(), .finalize() functions.
    • Algorithm.init() is called before any L1b file processing.
    • Algorithm.process() is called on every L1b file.
    • Algorithm.finalize() is called after all files have been processed.
    • Each algorithm has access to: L1b Dataset, shared working dict, config dict.
    • Algorithm/chain configuration by XML or YAML configuration files.
    • A shared python dictionary is used to pass algorithm outputs between algorithms in the chain.
  • logging with standard warning, info, debug, error levels (+ multi-processing logging support)
  • optional multi-processing built in, configurable maximum number of processes used.
  • algorithm timing (with MP support)
  • chain timing
  • support for breakpoint files (saved as NetCDF4 files)

Change Log

This section details major changes to the framework (not individual chains):

| Date      | Change |
| --------- | ------ |
| 01-Mar-25 | Initial CLEV2ER SII repository setup, adapted from CLEV2ER LIIW |

Installation of the Framework for Development

This section describes installation of the framework for development purposes. Separate procedures are documented in the CLEV2ER Software Installation & User Manual (D-SUM) for customer installation.

Note that the framework installation has been tested on Linux and MacOS systems. Use on other operating systems is possible but may require additional install steps, and is not directly supported.

Make sure you have git installed on your target system.

Clone the public git repository into a suitable directory on your system. This will create a directory called clev2er_sii in your current directory.

with https: git clone https://github.com/mssl-softeng/clev2er_sii.git

or with ssh: git clone git@github.com:mssl-softeng/clev2er_sii.git

or with the GitHub CLI: gh repo clone mssl-softeng/clev2er_sii

Go to the CLEV2ER package base directory

cd clev2er_sii

Package and Environment Installation

To install the CLEV2ER package, run the following command (on a Linux or MacOS operating system):

./install_env.sh

This will

  • install python 3.12 in a virtual env
  • install poetry package manager
  • install required python packages
  • install pre-commit hooks
  • create a setup script called ./activate.sh to activate the environment and set up the necessary environment variables

Load the Virtual Environment

Now you are all set up. Whenever you want to run any CLEV2ER chains you must first load the CLEV2ER virtual environment using the following steps:

  • Go to the CLEV2ER package base directory (clev2er_sii)
  • run :
source ./activate.sh

You should now be set up to run processing chains, etc.

Run a simple chain test example

The following command will run a simple example test chain which dynamically loads 2 template algorithms and runs them on a set of CryoSat L1b files in a test data directory. The algorithms do not perform any actual processing as they are just template examples. Make sure you have already loaded the virtual environment (source ./activate.sh, as described above) before running this command.

run_chain.py -n testchain -d $CLEV2ER_BASE_DIR/testdata -r

There should be no errors. Note that run_chain.py is set up as an executable, so it is not necessary to use python run_chain.py, although this will also work.

Note that the algorithms that are dynamically run are located in $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py, alg_template2.py

The list of algorithms (and their order) for testchain are defined in $CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml

Chain configuration settings are defined in

$CLEV2ER_BASE_DIR/config/main_config.xml

and algorithm configuration settings are defined in

$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml

To find all the command line options for run_chain.py, type:

run_chain.py -h

For further info, please see clev2er.tools

Developer Requirements

This section details additional installation requirements for developers who will develop/adapt new chains or algorithms.

Developer Workflow

This section describes the method that should be used to contribute to the project code. The basic method to develop a new feature is:

On your local repository:

  1. Make sure your local 'master' branch is checked out and up-to-date (some steps may not be necessary).

    cd $CLEV2ER_BASE_DIR
    git checkout master
    git pull
    

  2. Create a new branch, named xxx_featurename, where xxx is your initials

    git checkout -b xxx_featurename

  3. Develop and test your new feature within this branch, making git additions and commits as necessary. You should have at least one commit (probably several).

    git commit -a -m "description of change"

  4. If you are developing a new module, then you must also write a pytest test for that module in a tests directory located in the same directory as the module. Note the section on pytest markers at the end of this document.

  5. Static analysis tests will be run on your changes using pre-commit, either automatically during a git commit or by running it manually in the directory of the code change or in the repository base directory (for a more complete check):

    pre-commit run --all

  6. Once tested, push the new feature branch to GitHub

    git push -u origin xxx_featurename [first time], or just git push

  7. Go to GitHub: [github.com/mssl-softeng/clev2er_sii](https://github.com/mssl-softeng/clev2er_sii), or go directly to the pull request URL shown in the output of your git push command.

  8. Create a Pull Request on GitHub for your feature branch. This will automatically start a CI workflow that tests your branch for code issues and runs pytest tests. If it fails you should correct the errors on your local branch and repeat (steps 3 onwards) until it passes all tests.

  9. Finally your pull request will be reviewed and if accepted merged into the 'master' branch.

  10. You can then delete your local branch and the remote branch on GitHub.

    git branch -d xxx_featurename
    git push origin --delete xxx_featurename
    
    
  11. Repeat the whole process to add your next feature.

Framework and Chain Configuration

The framework (run controller) and individual named algorithm chains each have separate configuration files. Configuration options can be categorized as:

  • run controller (or main framework) default configuration
  • per chain default configuration (to configure individual algorithms and resources)
  • command line options (for input selection and modifying any default configuration options)

Chains can be configured using XML or YAML configuration files and optional command line options in the following order of increasing precedence:

  • main config file: $CLEV2ER_BASE_DIR/config/main_config.xml [must be XML]
  • chain specific config file: $CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<config_file_name>.xml or .yml
  • command line options
  • additional config options passed on the command line using --conf_opts

The configurations are passed to the chain's algorithms and finder classes via a merged python dictionary, available to the Algorithm classes as self.config.

Run Control Configuration

The default run control configuration file is $CLEV2ER_BASE_DIR/config/main_config.xml

This contains general default settings for the chain controller. Each of these can be overridden by the relevant command line options.

| Setting | Options | Description |
| ------- | ------- | ----------- |
| use_multi_processing | true or false | if true, multi-processing is used |
| max_processes_for_multiprocessing | int | max number of processes to use for multi-processing |
| use_shared_memory | true or false | if true, allow use of shared memory (experimental feature) |
| stop_on_error | true or false | stop chain on first error found, or log error and skip |

Chain Specific Configuration

The default configuration for your chain's algorithms and finder classes should be placed in the chain specific config file:

$CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<anyname>[.xml, .XML, or .yml]

Configuration files may be either XML (.xml) or YAML (.yml) format.

Formatting Rules for Chain Configuration Files

YAML or XML files can contain multi-level settings for key value pairs of boolean, int, float or str.

  • boolean values must be set to the string true or false (case insensitive)
  • environment variables are allowed within strings as $ENV_NAME or ${ENV_NAME} and will be evaluated (see the sketch after this list)
  • YAML or XML files may have multiple levels (or sections)
  • XML files must have a top root level named configuration wrapping the lower levels. This is removed from the python config dictionary before being passed to the algorithms.
  • chain configuration files must have:
    • a log_files section to provide locations of the log files (see below)
    • a breakpoint_files section to provide locations of the breakpoint files (see below)
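
For illustration, the $ENV_NAME / ${ENV_NAME} expansion behaves like Python's os.path.expandvars (a sketch only; the framework's own evaluation mechanism is not shown here):

```python
import os

# assumed example value; any environment variable works the same way
os.environ["MYDATA"] = "/data/ancillary"

print(os.path.expandvars("$MYDATA/dem.nc"))    # -> /data/ancillary/dem.nc
print(os.path.expandvars("${MYDATA}/dem.nc"))  # -> /data/ancillary/dem.nc
```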

Example of sections from a 2 level config file in YML:

# some_key: str:  description
some_key: a string

section1:
    key1: 1
    key2: 1.5
    some_data_location: $MYDATA/dem.nc

section2:
    key: false

Example of sections from a 2 level config file in XML:

<?xml version="1.0"?>

<!-- configuration xml level required, but removed in python dict -->
<configuration>

<!--some_key: str:  description-->
<some_key>a string</some_key>

<section1>
   <key1>1</key1>
   <key2>1.5</key2>
   <some_data_location>$MYDATA/dem.nc</some_data_location>
</section1>

<section2>
   <key>false</key>
</section2>

</configuration>

These settings are available within Algorithm classes as a python dictionary called self.config as in the following examples:

self.config['section1']['key1']
self.config['section1']['some_data_location']
self.config['some_key']

The config file will also be merged with the main run control dictionary. Settings in the chain configuration file take precedence over identical settings in the main run control dictionary, so you can override any main config settings in the named chain config if you want, as sketched below.
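
A conceptual sketch of that precedence (illustrative only; the framework's actual merge code is not shown here and may merge nested sections recursively):

```python
# chain config settings win over main config settings on identical keys
main_config = {"use_multi_processing": "false", "stop_on_error": "true"}
chain_config = {"use_multi_processing": "true"}

merged = {**main_config, **chain_config}

print(merged["use_multi_processing"])  # -> true  (overridden by the chain config)
print(merged["stop_on_error"])         # -> true  (inherited from the main config)
```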

Required Chain Configuration Settings

Each chain configuration file should contain sections to configure logging and breakpoints. See the section on logging below for an explanation of the settings.

Here is a minimal configuration file (XML format):

<?xml version="1.0"?>
<!--chain: mychain configuration file-->

<configuration> <!-- note this level is removed in python dict -->

<!--Setup default locations to store breakpoint files-->
<breakpoint_files>
    <!-- set the default directory where breakpoint files are stored -->
    <default_dir>/tmp</default_dir>
</breakpoint_files>

<log_files>
    <!-- default directory to store log files -->
    <default_dir>/tmp</default_dir>
    <!-- info_name : str : file name base str for info files -->
    <info_name>info</info_name>
    <!-- error_name : str : file name base str for errorfiles -->
    <error_name>error</error_name>
    <!-- debug_name : str : file name base str for debug files -->
    <debug_name>debug</debug_name>
    <!-- logname_str : str : additional string to add to end of log filename, before .log
    Leave empty if not required
    -->
    <logname_str></logname_str>

    <!-- append_date_selection : true or false, if year and month are specified on
    command line append _MMYYYY to log file base name (before .log) -->
    <append_date_selection>true</append_date_selection>
    <append_process_id>false</append_process_id>
    <append_start_time>true</append_start_time>
</log_files>

<!-- add more levels and settings below here -->

<resources>
        <physical_constants>

                <directory>$CLEV2ER_BASE_DIR/testdata/adf/common</directory>
                <filename>
CR__AX_GR_CST__AX_00000000T000000_99999999T999999_20240201T000000__________________CPOM_SIR__V01.NC
                </filename>
                <mandatory>True</mandatory>
        </physical_constants>
</resources>

</configuration>

The requirements for specific settings are set by the chain and its algorithms. An example of a chain configuration file can be found at:

$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml

For testing purposes it is sometimes useful to modify configuration settings directly from the command line. This can be done using the command line option --conf_opts, which can contain a comma-separated list of section:key:value pairs.

An example of changing the value of a setting would be:

--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc
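
A hypothetical helper showing how such section:key:value overrides could be applied to the merged config dict (an illustration under assumed behaviour, not the framework's actual implementation):

```python
import os


def apply_conf_opts(config: dict, conf_opts: str) -> None:
    """Apply 'section:key:value[,section:key:value,...]' overrides to config."""
    for item in conf_opts.split(","):
        section, key, value = item.split(":", 2)  # the value may itself contain ':'
        config.setdefault(section, {})[key] = os.path.expandvars(value)


config = {"resources": {"mydata": "/old/path.nc"}}
apply_conf_opts(config, "resources:mydata:${MYDATA_DIR}/somedata2.nc")
print(config["resources"]["mydata"])  # expanded if MYDATA_DIR is set in the environment
```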

Developing New Chains

  1. Decide on a chain name. For example newchain
  2. Create a $CLEV2ER_BASE_DIR/src/clev2er/algorithms/newchain/ directory to store the new chain's algorithms.
  3. Create $CLEV2ER_BASE_DIR/src/clev2er/algorithms/newchain/tests to store the new chain's algorithm unit tests (using tests formatted for pytest). At least one algorithm test file should be created per algorithm, which should contain suitable test functions.
  4. Create your algorithms by copying and renaming the algorithm class template $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py into your algorithm directory. Each algorithm should have a different file name of your choice. For example: alg_retrack.py, alg_geolocate.py. You need to fill in the appropriate sections of the init(), process() and finalize() functions for each algorithm (see section below for more details on using algorithm classes).
  5. You must also create a test for each algorithm in $CLEV2ER_BASE_DIR/src/clev2er/algorithms/newchain/tests. You should copy/adapt the test template $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/tests/test_alg_template1.py for your new test.
  6. Each algorithm and their unit tests must pass the static code checks (pylint, mypy, etc) which are automatically run as git pre-commit hooks.
  7. Create a first XML or YML configuration file for the chain in $CLEV2ER_BASE_DIR/config/chain_configs/newchain/anyname.yml or .xml. The configuration file contains any settings or resource locations that are required by your algorithms, and may include environment variables.
  8. If required create one or more finder class files. These allow fine control of L1b file selection from the command line (see section below for more details).
  9. Create an algorithm list file (XML or YML) in $CLEV2ER_BASE_DIR/config/algorithm_lists/newchain/anyname.xml (or .yml). You can copy the template in $CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml.
  10. To test your chain on a single L1b file, you can use run_chain.py --name newchain -f /path/to/a/l1b_file. There are many other options for running chains (see run_chain.py -h).

Algorithm and Finder Classes

This section discusses how to develop algorithms for your chain. There are two types of algorithms, both of which are dynamically loaded at chain run-time.

  • Main algorithms : standard chain algorithm classes
  • Finder algorithms : optional classes to manage input L1b file selection

Algorithm Lists

Algorithms are dynamically loaded in a chain when (and in the order) they are named in the chain's algorithm list YAML or XML file: $CLEV2ER_BASE_DIR/config/algorithm_lists/<chainname>/<chainname>.yml or .xml. This has two sections (l1b_file_selectors and algorithms) as shown in the examples below:

YML version:

# List of L1b selector classes to call in order
l1b_file_selectors:
  - find_lrm  # find LRM mode files that match command line options
  - find_sin  # find SIN mode files that match command line options
# List of main algorithms to call in order
algorithms:
  - alg_identify_file # find and store basic l1b parameters
  - alg_skip_on_mode  # finds the instrument mode of L1b, skip SAR files
  #- alg_...

XML version:

The XML version requires an additional top-level <algorithm_list> element that wraps the other sections. It also allows you to enable or disable individual algorithms within the list by setting the values Enable or Disable, and to set breakpoints by setting the value to BreakpointAfter.

<?xml version="1.0"?>

<algorithm_list>
    <algorithms>
        <alg_identify_file>Enable</alg_identify_file>
        <alg_skip_on_mode>Enable</alg_skip_on_mode>
        <!-- ... more algorithms -->
        <alg_retrack>BreakpointAfter</alg_retrack>
    </algorithms>

    <l1b_file_selectors>
        <find_lrm>Enable</find_lrm>
        <find_sin>Enable</find_sin>
    </l1b_file_selectors>
</algorithm_list>

Main Algorithms

Each algorithm is implemented in a separate module located in

$CLEV2ER_BASE_DIR/src/clev2er/algorithms/<chainname>/<alg_name>.py

Each algorithm module should contain an Algorithm class, as per the algorithm template in:

$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py

Please copy this template for all algorithms.

Algorithm class modules have three main functions:

  • init() : used for initializing/loading resources. Called once at the start of processing.
  • process(l1b: Dataset, shared_dict: dict) : called for every L1b file. The results of the processing may be saved in the shared_dict, so that they can be accessed by algorithms further down the chain. The L1b data for the current file being processed is passed to this function as argument l1b, a netCDF4 Dataset.
  • finalize() : called at the end of all processing to free resources.

All of the functions have access to the merged chain configuration dictionary self.config.

All logging must be done using self.log.info(), self.log.error(), self.log.debug().

Algorithm.process() return values

It is important to note that Algorithm.process() return values affect how the chain operates. The .process() function returns (bool, str).

Return values must be set as follows:

  • (True,"") when the processing has completed without errors and continuation to the next algorithm in the chain (if available) is expected.
  • (False,"SKIP_OK any reason message") when the processing has found a valid reason for the chain to skip any further processing of the L1b file. For example if it does not measure over the target area. This will be logged as DEBUG message but is not an error. The chain will move to processing the next L1b file.
  • (False,"some error message") : In this case the error message will be logged to the error log and the file will be skipped. If config["chain"]["stop_on_error"] is False then the chain will continue to the next L1b file. If config["chain"]["stop_on_error"] is True, then the chain will stop.

FileFinder Classes

FileFinder class modules provide more complex and tailored L1b input file selection than would be possible with the standard run_chain.py command line options of:

  • (--file path) : choose single L1b file
  • (--dir dir) : choose all L1b files in a flat directory

FileFinder classes are only used as the file selection method if the --file and --dir command line options are not used.

For example you may wish to select files using a specific search pattern, or from multiple directories.

FileFinder classes are automatically initialized with:

  • self.config dict from the merged chain dict, any settings can be used for file selection
  • self.months (from command line option --month, if used)
  • self.years (from command line option --year, if used)

FileFinder classes return a list of file paths through their .find_files() function. Code needs to be added to the .find_files() function to generate the file list.

Any number of differently named FileFinder class modules can be specified in the algorithm list file, under the l1b_file_selectors: section. File lists are concatenated if more than one Finder class is used.

An example of a FileFinder class module can be found in:

clev2er.algorithms.cryotempo.find_lrm.py
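
A sketch of the shape such a class might take (assumptions: a hypothetical l1b_base_dir config setting and a YYYY/MM directory layout; self.config, self.months and self.years are assumed to be attached by the framework):

```python
import glob
import os


class FileFinder:
    """Illustrative finder class sketch."""

    def find_files(self) -> list[str]:
        # 'l1b_base_dir' is a hypothetical chain config setting
        base_dir = os.path.expandvars(self.config["l1b_base_dir"])
        files: list[str] = []
        for year in self.years:
            for month in self.months:
                pattern = os.path.join(base_dir, f"{year:04d}", f"{month:02d}", "*.nc")
                files.extend(sorted(glob.glob(pattern)))
        return files
```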

Logging

Logging within the chain is performed using the python standard logging.Logger mechanism but with some minor adaptation to support multi-processing.

Within algorithm modules, logging should be performed using the in-class Logger instance accessed using self.log :

  • self.log.info('message') : to log informational messages
  • self.log.error('message') : to log error messages
  • self.log.debug('message') : to log messages for debugging

Debugging messages are only produced/saved if the chain is run in debug mode (use the run_chain.py --debug command line option).

Log file Locations

Info, error, and debug logs are stored in separate log files. The locations of the log files are set in the chain configuration file in a section called log_files. You can use environment variables in your log file paths.

# Default locations for log files
log_files:
  append_year_month_to_logname: true
  errors: ${CT_LOG_DIR}/errors.log
  info:   ${CT_LOG_DIR}/info.log
  debug:  ${CT_LOG_DIR}/debug.log

The append_year_month_to_logname setting is used if the chain is run with the --year and/or --month command line args. Note that these command line options are passed to the optional finder classes to generate a list of L1b input files.

If these are used and the append_year_month_to_logname setting is true, then the year and month are appended to the log file names as follows:

  • logname_MMYYYY.log : if both month and year are specified
  • logname_YYYY.log : if only year is used
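
As a sketch of that naming rule (illustrative only, not the framework's code):

```python
def log_name(base: str, year: int | None, month: int | None) -> str:
    """Append _MMYYYY or _YYYY to a log file base name, per the rule above."""
    if year and month:
        return f"{base}_{month:02d}{year}.log"
    if year:
        return f"{base}_{year}.log"
    return f"{base}.log"


print(log_name("info", 2024, 3))     # -> info_032024.log
print(log_name("info", 2024, None))  # -> info_2024.log
```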

Logging when using Multi-Processing

When multi-processing mode is selected then logged messages are automatically passed through a pipe to a temporary file (logfilename.mp). This will contain an unordered list of messages from all processes, which is difficult to read directly.

At the end of the chain run the multi-processing log outputs are automatically sorted so that messages relating to each L1b file's processing are collected together in order. This is then merged into the main log file.

Breakpoint Files

Breakpoints can be set after any Algorithm by:
  • setting the BreakpointAfter value in the chain's Algorithm list, or
  • using the run_chain.py command line argument --breakpoint_after algorithm_name
When a breakpoint is set:
  • the chain will stop after the specified algorithm has completed for each input file.
  • the contents of the chain's shared_dict will be saved as a NetCDF4 file in the <breakpoint_dir> as specified in the breakpoint_files:default_dir setting in the chain configuration file.
  • the NetCDF4 file will be named as <breakpoint_dir>/<l1b_file_name>_bkp.nc
  • if multiple L1b files are being processed through the chain, a breakpoint file will be created for each.
  • single values or strings in the shared_dict will be included as global or group NetCDF attributes.
  • if there are multiple levels in the shared_dict then a NetCDF group will be created for each level.
  • multi-dimensional arrays (including numpy arrays) are supported up to dimension 3.
  • NetCDF dimension variables will not be named with physical meaning (i.e. time), as this information cannot be generically derived. Instead dimensions will be named dim1, dim2, etc.
  • all variables with the same dimension will share a common NetCDF dimension (i.e. dim1, etc.)
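
Breakpoint files can be inspected with any NetCDF tool. A quick sketch using the netCDF4 package (the file path is hypothetical, following the naming rule above):

```python
from netCDF4 import Dataset

bkp_file = "/tmp/my_l1b_file_bkp.nc"  # <breakpoint_dir>/<l1b_file_name>_bkp.nc

with Dataset(bkp_file) as nc:
    print(nc.ncattrs())  # single values/strings from shared_dict -> global attributes
    for name, group in nc.groups.items():  # one group per shared_dict level
        print(name, list(group.variables))  # arrays appear as variables on dim1, dim2, ...
```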

Developer Notes

Code checks before committing

It is recommended to run pre-commit before a git commit. This runs the static code analysis tests (isort, pylint, ruff, mypy, ...) on your code and shows you any failures before you commit. The same tests are also run when you commit (and must pass).

pre-commit run --all

Pytest Markers

Pytest markers are set up in $CLEV2ER_BASE_DIR/pytest.ini

It is important to use the correct pytest marker due to the use of GitHub CI workflows that run pytest on the whole repository source code. Some pytest tests are not suitable for GitHub CI workflow runs due to their large external data dependencies. These need to be marked with pytest.mark.requires_external_data so that they are skipped. These tests can be run locally where access to the data is available.

The following Pytest markers should be used in front of relevant pytest functions:

  • requires_external_data: testable on local systems with access to all external data/ADF (outside repo)
  • non_core: used to label non-core function tests such as area plotting functions

Example:

@pytest.mark.requires_external_data  # not testable on GitHub due to external data
def test_alg_lig_process_large_dem(l1b_file) -> None:

or placed at the top of a module:

pytestmark = pytest.mark.non_core

GitHub Pages Documentation from in-code Docstrings

This user manual is hosted on GitHub Pages (https://mssl-softeng.github.io/clev2er_sii).

Content is created from docstrings (optionally containing Markdown: https://www.markdownguide.org/basic-syntax/#code) in the code, using the pdoc package: https://pdoc.dev

Diagrams can be implemented using mermaid: https://mermaid.js.org

The site is locally built in $CLEV2ER_BASE_DIR/docs, using a pre-commit hook (hook id: pdocs_build). Hooks are configured in $CLEV2ER_BASE_DIR/.pre-commit-config.yaml

The hook calls the script $CLEV2ER_BASE_DIR/pdocs_build.sh to build the site whenever a git commit is run in branch gh_pages.

When a git push is run, GitHub automatically extracts the site from the docs directory and publishes it.

The front page of the site (i.e. this page) is located in the docstring within $CLEV2ER_BASE_DIR/src/clev2er/__init__.py.

The docstring within __init__.py of each package directory should provide markdown to describe the directories beneath it.

Process to Update Docs

One method of updating the GitHub Pages documentation from the code (i.e. to process the docstrings into html in the /docs folder) is:

  • Edit docstrings in master branch code (or by pull request from other branch)
  • git commit -a -m "docs update"
  • git checkout gh_pages
  • git merge master
  • pre-commit run --all (runs pdocs to update the html in docs folder)
  • git commit -a -m "docs update"
  • git push
  • git checkout master (return to master branch)
  • git merge gh_pages
  • git push

Why isn't this run automatically from the master branch or in a GitHub workflow? This is because pdocs (part of the pre-commit hooks) requires all code dependencies to be in place, including external data, when parsing/importing the code. External data is not available on GitHub, nor on some minimal installations of the master branch. So, to avoid pre-commit failing due to pdocs on other branches, or GitHub workflows doing the same, the docs are only updated on a controlled 'gh_pages' branch (which has all external data installed).

  1"""
  2# CLEV2ER Sea Ice and Icebergs GPP Project
  3
  4Documentation for the CLEV2ER Sea Ice and Icebergs GPP project, hosted on GitHub at
  5[github.com/mssl-softeng/clev2er_sii](https://github.com/mssl-softeng/clev2er_sii).
  6
  7The GPP runs within a framework designed for (but not
  8restricted to) Level-1b to Level-2 processing of ESA radar altimetry mission data. The key features
  9of the framework are dynamically loaded algorithm classes (from XML or YML lists of algorithms) and
 10in-built support for multi-processing and a consistent automated development and testing workflow.
 11There are many run-time options in the chain controller command line tool.
 12
 13The diagram below shows a simplified representation of the framework and its components.
 14
 15![CL-SII Framework](https://www.homepages.ucl.ac.uk/~ucasamu/cl_liiw_framework.png)
 16
 17## Main Features
 18
 19* Command line chain controller tool : src/clev2er/tools/run_chain.py
 20* input L1b file selection
 21  * single file
 22  * multiple files from a single directory
 23  * recursive search from a single directory
 24  * date or time based search
 25  * selection of CRISTAL instrument mode (SIC,SAC,SIO) and/or processing mode (HR,FF,LR,..)
 26* dynamic algorithm loading from XML or YML list(s)
 27  * algorithms are classes of type Algorithm with configurable .init(), .process(), .finalize()
 28    functions.
 29  * Algorithm.init() is called before any L1b file processing.
 30  * Algorithm.process() is called on every L1b file,
 31  * Algorithm.finalize() is called after all files have been processed.
 32  * Each algorithm has access to: L1b Dataset, shared working dict, config dict.
 33  * Algorithm/chain configuration by XML or YAML configuration files.
 34  * A shared python dictionary is used to pass algorithm outputs between algorithms in the chain.
 35* logging with standard warning, info, debug, error levels (+ multi-processing logging support)
 36* optional multi-processing built in, configurable maximum number of processes used.
 37* algorithm timing (with MP support)
 38* chain timing
 39* support for breakpoint files (saved as NetCDF4 files)
 40
 41
 42## Change Log
 43
 44This section details major changes to the framework (not individual chains):
 45
 46| Date | Change |
 47| ------- | ------- |
 48| 01-Mar-25 | Initial CLEV2ER SII repository setup, adapted from CLEV2ER LIIW|
 49
 50## Installation of the Framework for Development
 51
 52This section describes installation of the framework for development purposes.
 53Seperate procedures are documented in the CLEV2ER Software Installation & User Manual (D-SUM)
 54for customer installation.
 55
 56Note that the framework installation has been tested on Linux and MacOS systems. Use on
 57other operating systems is possible but may require additional install steps, and is not
 58directly supported.
 59
 60Make sure you have *git* installed on your target system.
 61
 62Clone the git public repository in to a suitable directory on your system.
 63This will create a directory called **/clev2er_sii** in your current directory.
 64
 65with https:
 66`git clone https://github.com/mssl-softeng/clev2er_sii.git`
 67
 68or with ssh:
 69`git clone git@github.com:mssl-softeng/clev2er_sii.git`
 70
 71or with the GitHub CLI:
 72`gh repo clone mssl-softeng/clev2er_sii`
 73
 74Go to the CLEV2ER package base directory
 75
 76```
 77cd clev2er_sii
 78```
 79
 80### Package and Environment Installation
 81
 82To install the CLEV2ER package, run the following command (on a Linux
 83or MacOS operating system):
 84
 85```
 86./install_env.sh
 87```
 88
 89This will
 90- install python 3.12 in a virtual env
 91- install poetry package manager
 92- install required python packages
 93- install pre-commit hooks
 94- create a setup script called `./activate.sh` to activate the environment
 95  and setup necessary environment variables
 96
 97### Load the Virtual Environment
 98
 99Now you are all setup to go. Whenever you want to run any CLEV2ER chains you
100must first load the CLEV2ER virtual environment using  the following steps:
101
102- Go to the CLEV2ER package base directory (clev2er_sii)
103- run :
104
105```
106source ./activate.sh
107```
108
109You should now be setup to run processing chains, etc.
110
111## Run a simple chain test example
112
113The following command will run a simple example test chain which dynamically loads
1142 template algorithms and runs them on a set of CryoSat L1b files in a test data directory.
115The algorithms do not perform any actual processing as they are just template examples.
116Make sure you have the virtual environment already loaded using `poetry shell` before
117running this command.
118
119```
120run_chain.py -n testchain -d $CLEV2ER_BASE_DIR/testdata -r
121```
122
123There should be no errors. Note that run_chain.py is setup as an executable, so it is not
124necessary to use `python run_chain.py`, although this will also work.
125
126Note that the algorithms that are dynamically run are located in
127$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py, alg_template2.py
128
129The list of algorithms (and their order) for *testchain* are defined in
130$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml
131
132Chain configuration settings are defined in
133
134$CLEV2ER_BASE_DIR/config/main_config.xml and
135
136Algorithm configuration settings are defined in
137
138$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml
139
140To find all the command line options for *run_chain.py*, type:
141
142`run_chain.py -h`
143
144For further info, please see `clev2er.tools`
145
146## Developer Requirements
147
148This section details additional installation requirements for developers who will develop/adapt
149new chains or algorithms.
150
151## Developer Workflow
152
153This section describes the method that should be used to contribute to the project code.
154The basic method to develop a new feature is:
155
156On your local repository:
157
1581. Make sure your local 'master' branch is checked out and up-to-date
159   (some steps may not be necessary).
160
161    ```
162    cd $CLEV2ER_BASE_DIR
163    git checkout master
164    git pull
165    ```
166
1672. Create a new branch, named xxx_featurename, where xxx is your initials
168
169    `git checkout -b xxx_featurename`
170
1713. Develop and test your new feature within this branch, making git additions and commits
172   as necessary.
173   You should have at least one commit (probably several).
174
175   `git commit -a -m "description of change"`
176
1774. If you are developing a new module, then you must also write a pytest test
178   for that module in a tests directory located in the same directory as the module.
179   Note the section on pytest markers at the end of this document.
180
1815. Static analysis tests will be run on your changes using pre-commit, either
182   automatically during a git commit or by running in the directory of the code
183   change or in the repository base directory (for a more complete check):
184
185   `pre-commit run --all`
186
1876. Once tested, push the new feature branch to GitHub
188
189    `git push -u origin xxx_featurename`  [first time], or just `git push`
190
1917. Go to GitHub: [github.com/mssl-softeng/clev2er_sii]
192   (https://github.com/mssl-softeng/clev2er_sii)
193   or direct to the pull request URL shown in your git pull command.
194
1958. Create a Pull Request on GitHub for your feature branch. This will automatically start a CI
196   workflow that tests your branch for code issues and runs pytest tests. If it fails you
197   should correct the errors on your local branch and repeat (steps 3 onwards) until it passes
198   all tests.
199
2009. Finally your pull request will be reviewed and if accepted merged into the 'master' branch.
201
20210. You can then delete your local branch and the remote branch on Github.
203
204   ```
205   git branch -d xxx_featurename
206   git push origin --delete xxx_featurename
207
208   ```
209
21011. Repeat the whole process to add your next feature.
211
212## Framework and Chain Configuration
213
214The framework (run controller) and individual named algorithm chains each have
215separate configuration files. Configuration options can be categorized as:
216
217- run controller (or main framework ) default configuration
218- per chain default configuration (to configure individual algorithms and resources)
219- command line options (for input selection and modifying any default configuration
220  options)
221
222Chains can be configured using XML or YAML configuration files and optional command line
223options in the following order of increasing precedence:
224
225- main config file: $CLEV2ER_BASE_DIR/config/main_config.xml [Must be XML]
226- chain specific config file:
227  $CLEV2ER_BASE_DIR/config/chain_configs/*chain_name*/*config_file_name*.xml,
228  XML or .yml
229- command line options
230- command line additional config options using the --conf_opts
231
232The configurations are passed to
233the chain's algorithms and finder classes, via a merged python dictionary, available
234to the Algorithm classes as self.config.
235
236### Run Control Configuration
237
238The default run control configuration file is `$CLEV2ER_BASE_DIR/config/main_config.xml`
239
240This contains general default settings for the chain controller. Each of these can
241be overridden by the relevant command line options.
242
243| Setting | Options | Description |
244| ------- | ------- | ----------- |
245| use_multi_processing | true or false | if true multi-processing is used |
246| max_processes_for_multiprocessing | int | max number of processes to use for multi-processing |
247| use_shared_memory | true or false | if true allow use of shared memory. Experimental feature |
248| stop_on_error | true or false | stop chain on first error found, or log error and skip |
249
250### Chain Specific Configuration
251
252The default configuration for your chain's algorithms and finder classes should be placed in
253the chain specific config file:
254
255`$CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<anyname>[.xml,.XML,or .yml]`
256
257Configuration files may be either XML(.xml) or YAML (.yml) format.
258
259#### Formatting Rules for Chain Configuration Files
260
261YAML or XML files can contain multi-level settings for key value pairs of boolean,
262int, float or str.
263
264- boolean values must be set to the string **true** or **false** (case insensitive)
265- environment variables are allowed within strings as $ENV_NAME or ${ENV_NAME} (and will be
266  evaluated)
267- YAML or XML files may have multiple levels (or sections)
268- XML files must have a top root level named *configuration*  wrapping the lower levels.
269  This is removed from the python config dictionary before being passed to the algorithms.
270- chain configuration files must have a
271    - **log_files** section to provide locations of the log files (see below)
272    - **breakpoint_files** section to provide locations of the log files (see below)
273
274Example of sections from a 2 level config file in YML:
275
276```
277# some_key: str:  description
278some_key: a string
279
280section1:
281    key1: 1
282    key2: 1.5
283    some_data_location: $MYDATA/dem.nc
284
285section2:
286    key: false
287```
288
289Example of sections from a 2 level config file in XML:
290
291```
292<?xml version="1.0"?>
293
294<!-- configuration xml level required, but removed in python dict -->
295<configuration>
296
297<!--some_key: str:  description-->
298<some_key>a string</some_key>
299
300<section1>
301   <key1>1</key1>
302   <key2>1.5</key2>
303   <some_data_location>$MYDATA/dem.nc</some_data_location>
304</section1>
305
306<section2>
307   <key>false</key>
308</section2>
309
310</configuration>
311
312```
313
314These settings are available within Algorithm classes as a python dictionary called
315**self.config** as in the following examples:
316
317```
318self.config['section1']['key1']
319self.config['section1']['some_data_location']
320self.config['some_key']
321```
322
323The config file will also be
324merged with the main run control dictionary. Settings in the chain configuration
325file will take precedence over the main run control dictionary (if they are identical), so
326you can override any main config settings in the named chain config if you want.
327
328### Required Chain Configuration Settings
329
330Each chain configuration file should contain sections to configure logging and breakpoints.
331See the section on logging below for an explanation of the settings.
332
333Here is a minimal configuration file (XML format)
334
335```
336<?xml version="1.0"?>
337<!--chain: mychain configuration file-->
338
339<configuration> <!-- note this level is removed in python dict -->
340
341<!--Setup default locations to store breakpoint files-->
342<breakpoint_files>
343    <!-- set the default directory where breakpoint files are stored -->
344    <default_dir>/tmp</default_dir>
345</breakpoint_files>
346
347<log_files>
348    <!-- default directory to store log files -->
349    <default_dir>/tmp</default_dir>
350    <!-- info_name : str : file name base str for info files -->
351    <info_name>info</info_name>
352    <!-- error_name : str : file name base str for errorfiles -->
353    <error_name>error</error_name>
354    <!-- debug_name : str : file name base str for debug files -->
355    <debug_name>debug</debug_name>
356    <!-- logname_str : str : additional string to add to end of log filename, before .log
357    Leave empty if mot required
358    -->
359    <logname_str></logname_str>
360
361    <!-- append_date_selection : true or false, if year and month are specified on
362    command line append _MMYYYY to log file base name (before .log) -->
363    <append_date_selection>true</append_date_selection>
364    <append_process_id>false</append_process_id>
365    <append_start_time>true</append_start_time>
366</log_files>
367
368<!-- add more levels and settings below here -->
369
370<resources>
371        <physical_constants>
372
373                <directory>$CLEV2ER_BASE_DIR/testdata/adf/common</directory>
374                <filename>
375CR__AX_GR_CST__AX_00000000T000000_99999999T999999_20240201T000000__________________CPOM_SIR__V01.NC
376                </filename>
377                <mandatory>True</mandatory>
378        </physical_constants>
379</resources>
380
381</configuration>
382
383```
384
385The requirement for specific settings are set by the chain and it's algorithms.
386An example of a chain configuration file can be found at:
387
388`$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml`
389
390For testing purposes it is sometimes useful to modify configuration settings directly
391from the command line. This can be done using the command line option --conf_opts which
392can contain a comma separated list of section:key:value pairs.
393
394An example of changing the value of the setting above would be:
395
396--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc
397
398## Developing New Chains
399
4001. Decide on a chain name. For example **newchain**
4012. Create $CLEV2ER_BASE_DIR/algorithms/**newchain**/ directory to store the new chain's algorithms.
4023. Create $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests to store the new chain's
403   algorithm unit tests (using tests formatted for pytest). At least one algorithm test file
404   should be created per algorithm, which should contain suitable test functions.
4054. Create your algorithms by copying and renaming the algorithm class template
406   $CLEV2ER_BASE_DIR/algorithms/testchain/alg_template1.py in to your algorithm directory. Each
407   algorithm
408   should have a different file name of your choice. For example: alg_retrack.py, alg_geolocate.py.
409   You need to fill in the appropriate sections of the init(), process() and finalize() functions
410   for each algorithm (see section below for more details on using algorithm classes).
4115. You must also create a test for each algorithm in
412   $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests.
413   You should copy/adapt the test template
414   $CLEV2ER_BASE_DIR/algorithms/testchain/tests/test_alg_template1.py
415   for your new test.
4166. Each algorithm and their unit tests must pass the static code checks (pylint, mypy, etc) which
417   are automatically run as git pre-commit hooks.
4187. Create a first XML or YML configuration file for the chain in
419   $CLEV2ER_BASE_DIR/config/chain_configs/**newchain**/**anyname**.yml or .xml.
420   The configuration file contains any settings or resource locations that are required
421   by your algorithms, and may include environment variables.
4228. If required create one or more finder class files. These allow fine control of L1b file
423   selection from the command line (see section below for more details).
4249. Create an algorithm list YML file in
425   $CLEV2ER_BASE_DIR/config/algorithm_lists/**newchain**/**anyname**.xml (or .yml)
426   You can copy the template
427   in `$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_config.xml`
42810. To test your chain on a single L1b file, you can use
429   `run_chain.py --name newchain -f /path/to/a/l1b_file`. There are many other options for
430    running chains (see `run_chain.py -h`).
431
432## Algorithm and Finder Classes
433
434This section discusses how to develop algorithms for your chain. There are two types
435of algorithms, both of which are dynamically loaded at chain run-time.
436
437- Main algorithms : standard chain algorithm classes
438- Finder algorithms : optional classes to manage input L1b file selection
439
440### Algorithm Lists
441
442Algorithms are dynamically loaded in a chain when (and in the order ) they are named in the chain's
443algorithm list YAML or XML file:
444$CLEV2ER_BASE_DIR/config/algorithm_lists/**chainname**/**chainname**.yml,.xml.
445This has two sections (l1b_file_selectors, and algorithms) as shown in the example below:
446
447YML version:
448
449```
450# List of L1b selector classes to call in order
451l1b_file_selectors:
452  - find_lrm  # find LRM mode files that match command line options
453  - find_sin  # find SIN mode files that match command line options
454# List of main algorithms to call in order
455algorithms:
456  - alg_identify_file # find and store basic l1b parameters
457  - alg_skip_on_mode  # finds the instrument mode of L1b, skip SAR files
458  #- alg_...
459```
460
461XML version:
462
463The xml version requires an additional toplevel `<algorithm_list>` that wraps the other sections.
464It also allows you to enable or disable individual algorithms within the list by setting the
465values *Enable* or *Disable*, and to set breakpoints by setting the value to *BreakpointAfter*.
466
467```
468<?xml version="1.0"?>
469
470<algorithm_list>
471    <algorithms>
472        <alg_identify_file>Enable</alg_identify_file>
473        <alg_skip_on_mode>Enable</alg_skip_on_mode>
474        <!-- ... more algorithms -->
475        <alg_retrack>BreakpointAfter</alg_retrack>
476    </algorithms>
477
478    <l1b_file_selectors>
479        <find_lrm>Enable</find_lrm>
480        <find_sin>Enable</find_sin>
481    </l1b_file_selectors>
482</algorithm_list>
483
484```
485
486### Main Algorithms
487
488Each algorithm is implemented in a separate module located in
489
490`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/<chainname>/<alg_name>.py`
491
492Each algorithm module should contain an Algorithm class, as per the algorithm
493template in:
494
495`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py`
496
497Please copy this template for all algorithms.
498
499Algorithm class modules have three main functions:
500
501- **init()** :  used for initializing/loading resources. Called once at the start of processing.
502- **process**(l1b:Dataset,shared_dict:dict) : called for every L1b file. The results of the
503  processing may be saved in the shared_dict, so that it can be accessed by algorithms called
504  further down the chain. The L1b data for the current file being processed is passed to this
505  function in a netcdf4 Dataset as argument l1b.
506- **finalize**() : called at the end of all processing to free resouces.
507
508All of the functions have access to the merged chain configuration dictionary **self.config**.
509
510All logging must be done using **self.log**.info(), **self.log**.error(), **self.log**.debug().
511
512#### Algorithm.process() return values
513
514It is important to note that Algorithm.**process()** return values affect how the
515chain operates. The .process() function returns (bool, str).
516
517Return values must be set as follows:
518
519- (**True**,"") when the processing has completed without errors and continuation to the
520  next algorithm in the chain (if available) is expected.
521- (**False**,"**SKIP_OK** any reason message") when the processing has found a valid reason for the
522  chain to skip any further processing of the L1b file. For example if it does not measure over the
523  target area. This will be logged as DEBUG message but is not an error. The chain will move to
524  processing the next L1b file.
525- (**False**,"some error message") : In this case the error message will be logged to the error log
526  and the file will be skipped. If **config**["chain"]["**stop_on_error**"] is False then the
527  chain will continue to the next L1b file. If **config**["chain"]["**stop_on_error**"] is True,
528  then the chain will stop.
529
530### FileFinder Classes
531
532FileFinder class modules provide more complex and tailored L1b input file selection
533than would be possible with the standard **run_chain.py** command line options of :
534
535- (**--file path**) : choose single L1b file
536- (**--dir dir**) : choose all L1b files in a flat directory
537
538FileFinder classes are only used as the file selection method if the --file and --dir
539command line options are **not** used.
540
541For example you may wish to select files using a specific search pattern, or from multiple
542directories.
543
544FileFinder classes are automatically initialized with :
545
546- **self.config** dict from the merged chain dict, any settings can be used for file selection
547- **self.months** (from command line option --month, if used)
548- **self.years** (from command line option --year, if used)
549
550FileFinder classes return a list of file paths through their .find_files() function.
551Code needs to be added to the .find_files() function to generate the file list.
552
553Any number of differently named FileFinder class modules can be specified in the algorithm list
554file,
555under the **l1b_file_selectors:** section. File lists are concatentated if more than one Finder
556class is used.
557
558An example of a FileFinder class module can be found in:
559
560`clev2er.algorithms.cryotempo.find_lrm.py`
561
562## Logging
563
564Logging within the chain is performed using the python standard logging.Logger mechanism
565but with some minor adaption to support multi-processing.
566
567Within algorithm modules, logging should be performed using the in-class Logger
568instance accessed using **self.**log :
569
570- self.log.**info**('message') : to log informational messages
571- self.log.**error**('message') : to log error messages
572- self.log.**debug**('message') : to log messages for debugging
573
574Debugging messages are only produced/saved if the chain is run in debug mode (use
575run_chain.py **--debug** command line option)
576
577### Log file Locations
578
579Info, error, and debug logs are stored in separate log files. The locations
580of the log files are set in the chain configuration file in a section called
581**log_files**. You can use environment variables in your log file paths.
582
583```
584# Default locations for log files
585log_files:
586  append_year_month_to_logname: true
587  errors: ${CT_LOG_DIR}/errors.log
588  info:   ${CT_LOG_DIR}/info.log
589  debug:  ${CT_LOG_DIR}/debug.log
590```
591
592The **append_year_month_to_logname** setting is used if the chain is
593run with the --year (and/or) --month command line args. Note that these
594command line options are passed to the optional finder classes to generate a
595list of L1b input files.
596
597If these are used and the append_year_month_to_logname setting is **true**,
598then the year and month are appended to the log file names as follows:
599
600- *logname*_*MMYYYY*.log : if both month and year are specified
601- *logname*_*YYYY*.log : if only year is used
602
603### Logging when using Multi-Processing
604
605When multi-processing mode is selected then logged messages are automatically passed
606through a pipe to a temporary file (*logfilename*.mp). This will
607contain an unordered list of messages from all processes, which is difficult
608to read directly.
609
610At the end of the chain run the multi-processing log outputs are automatically sorted
611so that messages relating to each L1b file processing are collected together
612in order. This is then merged in to the main log file.
613
614## Breakpoint Files
615
616Breakpoints can be set after any Algorithm by:
617  - setting the *BreakpointAfter* value in the chain's Algorithm list, or
618  - using the run_chain.py command line argument **--breakpoint_after** *algorithm_name*
619
620When a breakpoint is set:
621  - the chain will stop after the specified algorithm has completed for each input file.
622  - the contents of the chain's *shared_dict* will be saved as a NetCDF4 file in the
623    ```<breakpoint_dir>``` as specified in the *breakpoints:default_dir* section in the chain
624    configuration file.
625  - the NetCDF4 file will be named as ```<breakpoint_dir>/<l1b_file_name>_bkp.nc```
626  - if multiple L1b files are being processed through the chain, a breakpoint file
627    will be created for each.
628  - single values or strings in the *shared_dict* will be included as global or group
629    NetCDF attributes.
630  - if there are multiple levels in the *shared_dict* then a NetCDF group will be
631    created for each level.
632  - multi-dimensional arrays (or numpy arrays) are supported up to dimension 3.
633  - NetCDF dimension variables will not be named with physical meaning (ie time),
634    as this information can not be generically derived. Instead dimensions will be
635    named dim1, dim2, etc.
636  - all variables with the same dimension will share a common NetCDF dimension (ie dim1, etc)
637
638## Developer Notes
639
640### Code checks before committing
641
642It is recommended to run pre-commit before a `git commit`. This runs the static
643code analysis tests (isort, pylint, ruff, mypy,.. ) on your code and shows you any
644failures before you commit. The same tests are also run when you commit (and must pass).
645
646`precommit run --all`
647
648### Pytest Markers
649
650Pytest markers are setup in $CLEV2ER_BASE_DIR/pytest.ini
651
652It is important to use the correct pytest marker due to the use of GitHub CI
653workflows that run pytest on the whole repository source code. Some pytest
654tests are not suitable for GitHub CI workflow runs due to their large external
655data dependencies. These need to be marked with `pytest.mark.requires_external_data`
656so that they are skipped. These tests can be run locally where access to the data
657is available.
658
659The following Pytest markers should be used in front of relevant pytest functions:
660
661- **requires_external_data**:
662  testable on local systems with access to all external data/ADF (outside repo)
663- **non_core**:
664  used to label non-core function tests such as area plotting functions
665
666Example:
667
668```python
669@pytest.mark.requires_external_data  # not testable on GitHub due to external data
670def test_alg_lig_process_large_dem(l1b_file) -> None:
671```
672
673or placed at the top of a module:
674
675```pytestmark = pytest.mark.non_core```
676
677
678### GitHub Pages Documentation from in-code Docstrings
679
680This user manual is hosted on GitHub pages (https://mssl-softeng.github.io/clev2er_sii)
681
682Content is created from doctrings
683(optionally containing Markdown: https://www.markdownguide.org/basic-syntax/#code )
684in the code,
685using the *pdoc* package : https://pdoc.dev
686
687Diagrams can be implemented using mermaid: https://mermaid.js.org
688
689The site is locally built in `$CLEV2ER_BASE_DIR/docs`, using a pre-commit hook
690(hook id: pdocs_build).
691Hooks are configured in `$CLEV2ER_BASE_DIR/.pre-commit-config.yaml`
692
693The hook calls the script `$CLEV2ER_BASE_DIR/pdocs_build.sh` to build the site
694whenever a `git commit` is run **in branch gh_pages**.
695
696When a `git push` is run, GitHub automatically extracts the site from the
697docs directory and publishes it.
698
699The front page of the site (ie this page) is located in the doctring within
700`$CLEV2ER_BASE_DIR/src/clev2er/__init__.py`.
701
702The docstring within `__init__.py` of each package directory should provide
703markdown to describe the directories beneath it.
704
705#### Process to Update Docs
706
707One method of updating the GitHub Pages documentation from the code (ie
708to process the docstrings in to html in the /docs folder)
709
710- Edit docstrings in master branch code (or by pull request from other branch)
711- git commit -a -m "docs update"
712- git checkout gh_pages
713- git merge master
714- pre-commit run --all (runs pdocs to update the html in docs folder)
715- git commit -a -m "docs update"
716- git push
717- git checkout master  (return to master branch)
718- git merge gh_pages
719- git push
720
721Why isn't this run automatically from the master branch or in a GitHub workflow?
722This is because pdocs (part of the pre-commit hooks) requires all code dependencies
723to be in place, including external data, when parsing/importing the code.
724External data is not available on GitHub, and also on some minimal installations
725of the master branch. So, to avoid pre-commit failing due to pdocs on other
726branches, or GitHub workflows doing the same, the docs are only updated on a
727controlled 'gh_pages' branch (which has all external data installed).
728
729
730"""