# CLEV2ER Sea Ice and Icebergs GPP Project
Documentation for the CLEV2ER Sea Ice and Icebergs GPP project, hosted on GitHub at [github.com/mssl-softeng/clev2er_sii](https://github.com/mssl-softeng/clev2er_sii).
The GPP runs within a framework designed for (but not restricted to) Level-1b to Level-2 processing of ESA radar altimetry mission data. The key features of the framework are dynamically loaded algorithm classes (from XML or YML algorithm lists), built-in multi-processing support, and a consistent, automated development and testing workflow. The chain controller command line tool provides many run-time options.
The diagram below shows a simplified representation of the framework and its components.
## Main Features
- Command line chain controller tool: src/clev2er/tools/run_chain.py
- input L1b file selection
  - single file
  - multiple files from a single directory
  - recursive search from a single directory
  - date or time based search
  - selection of CRISTAL instrument mode (SIC, SAC, SIO) and/or processing mode (HR, FF, LR, ...)
- dynamic algorithm loading from XML or YML list(s)
  - algorithms are classes of type Algorithm with configurable .init(), .process(), .finalize() functions:
    - Algorithm.init() is called before any L1b file processing.
    - Algorithm.process() is called on every L1b file.
    - Algorithm.finalize() is called after all files have been processed.
  - each algorithm has access to: L1b Dataset, shared working dict, config dict
  - algorithm/chain configuration by XML or YAML configuration files
  - a shared python dictionary is used to pass algorithm outputs between algorithms in the chain
- logging with standard warning, info, debug, error levels (+ multi-processing logging support)
- optional multi-processing built in, with a configurable maximum number of processes
- algorithm timing (with multi-processing support)
- chain timing
- support for breakpoint files (saved as NetCDF4 files)
## Change Log
This section details major changes to the framework (not individual chains):
| Date | Change |
| --- | --- |
| 01-Mar-25 | Initial CLEV2ER SII repository setup, adapted from CLEV2ER LIIW |
## Installation of the Framework for Development
This section describes installation of the framework for development purposes. Separate procedures are documented in the CLEV2ER Software Installation & User Manual (D-SUM) for customer installation.
Note that the framework installation has been tested on Linux and MacOS systems. Use on other operating systems is possible but may require additional install steps, and is not directly supported.
Make sure you have git installed on your target system.
Clone the public git repository into a suitable directory on your system. This will create a directory called **clev2er_sii** in your current directory.
with https:

`git clone https://github.com/mssl-softeng/clev2er_sii.git`

or with ssh:

`git clone git@github.com:mssl-softeng/clev2er_sii.git`

or with the GitHub CLI:

`gh repo clone mssl-softeng/clev2er_sii`
Go to the CLEV2ER package base directory:

```
cd clev2er_sii
```
### Package and Environment Installation
To install the CLEV2ER package, run the following command (on a Linux or MacOS operating system):

```
./install_env.sh
```
This will:

- install python 3.12 in a virtual env
- install the poetry package manager
- install the required python packages
- install the pre-commit hooks
- create a setup script called `./activate.sh` to activate the environment and set up the necessary environment variables
### Load the Virtual Environment
Now you are all set up. Whenever you want to run any CLEV2ER chains, you must first load the CLEV2ER virtual environment using the following steps:

- Go to the CLEV2ER package base directory (clev2er_sii)
- run:

```
source ./activate.sh
```

You should now be set up to run processing chains, etc.
## Run a simple chain test example
The following command will run a simple example test chain, which dynamically loads two template algorithms and runs them on a set of CryoSat L1b files in a test data directory. The algorithms do not perform any actual processing, as they are just template examples. Make sure you have the virtual environment already loaded using `poetry shell` before running this command.

```
run_chain.py -n testchain -d $CLEV2ER_BASE_DIR/testdata -r
```
There should be no errors. Note that run_chain.py is set up as an executable, so it is not necessary to use `python run_chain.py`, although this will also work.
Note that the dynamically run algorithms are located in $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py and alg_template2.py.
The list of algorithms (and their order) for *testchain* is defined in $CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml
Chain configuration settings are defined in $CLEV2ER_BASE_DIR/config/main_config.xml, and algorithm configuration settings are defined in $CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml.
To find all the command line options for *run_chain.py*, type:

`run_chain.py -h`
For further info, please see `clev2er.tools`
## Developer Requirements
This section details additional installation requirements for developers who will develop/adapt new chains or algorithms.
## Developer Workflow
This section describes the method that should be used to contribute to the project code. The basic method to develop a new feature is:
On your local repository:
1. Make sure your local 'master' branch is checked out and up-to-date (some steps may not be necessary).

   ```
   cd $CLEV2ER_BASE_DIR
   git checkout master
   git pull
   ```

2. Create a new branch named xxx_featurename, where xxx is your initials:

   `git checkout -b xxx_featurename`

3. Develop and test your new feature within this branch, making git additions and commits as necessary. You should have at least one commit (probably several).

   `git commit -a -m "description of change"`

4. If you are developing a new module, you must also write a pytest test for that module, in a tests directory located in the same directory as the module. Note the section on pytest markers at the end of this document.

5. Static analysis tests will be run on your changes using pre-commit, either automatically during a git commit or by running the following in the directory of the code change, or in the repository base directory (for a more complete check):

   `pre-commit run --all`

6. Once tested, push the new feature branch to GitHub:

   `git push -u origin xxx_featurename` [first time], or just `git push`

7. Go to GitHub: [github.com/mssl-softeng/clev2er_sii](https://github.com/mssl-softeng/clev2er_sii), or go directly to the pull request URL shown by your git push command.

8. Create a Pull Request on GitHub for your feature branch. This will automatically start a CI workflow that checks your branch for code issues and runs the pytest tests. If it fails, correct the errors on your local branch and repeat (from step 3 onwards) until it passes all tests.

9. Finally, your pull request will be reviewed and, if accepted, merged into the 'master' branch.

10. You can then delete your local branch and the remote branch on GitHub:

    ```
    git branch -d xxx_featurename
    git push origin --delete xxx_featurename
    ```

11. Repeat the whole process to add your next feature.
## Framework and Chain Configuration
The framework (run controller) and individual named algorithm chains each have separate configuration files. Configuration options can be categorized as:

- run controller (or main framework) default configuration
- per-chain default configuration (to configure individual algorithms and resources)
- command line options (for input selection and modifying any default configuration options)
Chains can be configured using XML or YAML configuration files and optional command line options, in the following order of increasing precedence:

- main config file: $CLEV2ER_BASE_DIR/config/main_config.xml [must be XML]
- chain specific config file: $CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<config_file_name>.xml or .yml
- command line options
- additional config options passed on the command line using --conf_opts
The configurations are passed to the chain's algorithms and finder classes, via a merged python dictionary, available to the Algorithm classes as self.config.
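
As an illustration of this precedence, here is a minimal sketch of the assumed merge behaviour (this is not the framework's actual implementation, and the function and variable names are illustrative only):

```python
# Sketch of the assumed precedence merge: later sources override earlier ones,
# recursing into sections. Illustrative only, not the framework's real code.
def merge_configs(base: dict, override: dict) -> dict:
    """Return a new dict in which override's keys take precedence over base's."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)  # merge sub-sections
        else:
            merged[key] = value
    return merged


main_config = {"use_multi_processing": False, "log_files": {"default_dir": "/tmp"}}
chain_config = {"use_multi_processing": True}  # chain file wins on conflict
config = merge_configs(main_config, chain_config)
assert config["use_multi_processing"] is True
```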
### Run Control Configuration
The default run control configuration file is `$CLEV2ER_BASE_DIR/config/main_config.xml`
This contains general default settings for the chain controller. Each of these can be overridden by the relevant command line options.
| Setting | Options | Description |
| ------- | ------- | ----------- |
| use_multi_processing | true or false | if true, multi-processing is used |
| max_processes_for_multiprocessing | int | maximum number of processes to use for multi-processing |
| use_shared_memory | true or false | if true, allow use of shared memory (experimental feature) |
| stop_on_error | true or false | stop the chain on the first error found, or log the error and skip |
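
As an illustration only (see the actual file in the repository for the authoritative layout), a main_config.xml holding the settings above could look like this, with the settings under the required *configuration* root element:

```
<?xml version="1.0"?>
<configuration>
    <use_multi_processing>false</use_multi_processing>
    <max_processes_for_multiprocessing>8</max_processes_for_multiprocessing>
    <use_shared_memory>false</use_shared_memory>
    <stop_on_error>false</stop_on_error>
</configuration>
```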
### Chain Specific Configuration
The default configuration for your chain's algorithms and finder classes should be placed in the chain specific config file:

`$CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<anyname>[.xml, .XML, or .yml]`

Configuration files may be either XML (.xml) or YAML (.yml) format.
#### Formatting Rules for Chain Configuration Files
YAML or XML files can contain multi-level settings for key/value pairs of boolean, int, float or str types.

- boolean values must be set to the string **true** or **false** (case insensitive)
- environment variables are allowed within strings as $ENV_NAME or ${ENV_NAME} (and will be evaluated)
- YAML or XML files may have multiple levels (or sections)
- XML files must have a top root level named *configuration* wrapping the lower levels. This is removed from the python config dictionary before being passed to the algorithms.
- chain configuration files must have:
  - a **log_files** section to provide locations of the log files (see below)
  - a **breakpoint_files** section to provide locations of the breakpoint files (see below)
Example of sections from a 2-level config file in YML:

```
# some_key: str: description
some_key: a string

section1:
  key1: 1
  key2: 1.5
  some_data_location: $MYDATA/dem.nc

section2:
  key: false
```
Example of sections from a 2-level config file in XML:

```
<?xml version="1.0"?>

<!-- configuration xml level required, but removed in python dict -->
<configuration>

<!-- some_key: str: description -->
<some_key>a string</some_key>

<section1>
    <key1>1</key1>
    <key2>1.5</key2>
    <some_data_location>$MYDATA/dem.nc</some_data_location>
</section1>

<section2>
    <key>false</key>
</section2>

</configuration>
```
These settings are available within Algorithm classes as a python dictionary called **self.config**, as in the following examples:

```
self.config['section1']['key1']
self.config['section1']['some_data_location']
self.config['some_key']
```
The config file will also be merged with the main run control dictionary. Settings in the chain configuration file will take precedence over the main run control dictionary (if they are identical), so you can override any main config settings in the named chain config if you want.
### Required Chain Configuration Settings
Each chain configuration file should contain sections to configure logging and breakpoints. See the section on logging below for an explanation of the settings.
Here is a minimal configuration file (XML format):

```
<?xml version="1.0"?>
<!--chain: mychain configuration file-->

<configuration> <!-- note this level is removed in python dict -->

<!-- Setup default locations to store breakpoint files -->
<breakpoint_files>
    <!-- set the default directory where breakpoint files are stored -->
    <default_dir>/tmp</default_dir>
</breakpoint_files>

<log_files>
    <!-- default directory to store log files -->
    <default_dir>/tmp</default_dir>
    <!-- info_name : str : file name base str for info files -->
    <info_name>info</info_name>
    <!-- error_name : str : file name base str for error files -->
    <error_name>error</error_name>
    <!-- debug_name : str : file name base str for debug files -->
    <debug_name>debug</debug_name>
    <!-- logname_str : str : additional string to add to the end of the log filename,
         before .log. Leave empty if not required -->
    <logname_str></logname_str>

    <!-- append_date_selection : true or false; if year and month are specified on the
         command line, append _MMYYYY to the log file base name (before .log) -->
    <append_date_selection>true</append_date_selection>
    <append_process_id>false</append_process_id>
    <append_start_time>true</append_start_time>
</log_files>

<!-- add more levels and settings below here -->

<resources>
    <physical_constants>
        <directory>$CLEV2ER_BASE_DIR/testdata/adf/common</directory>
        <filename>
CR__AX_GR_CST__AX_00000000T000000_99999999T999999_20240201T000000__________________CPOM_SIR__V01.NC
        </filename>
        <mandatory>True</mandatory>
    </physical_constants>
</resources>

</configuration>
```
The requirements for specific settings are set by the chain and its algorithms. An example of a chain configuration file can be found at:

`$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml`
For testing purposes it is sometimes useful to modify configuration settings directly from the command line. This can be done using the command line option `--conf_opts`, which can contain a comma separated list of section:key:value pairs.

An example of changing a setting in a resources section would be:

`--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc`
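
For example, combined with a normal chain run (the chain name, input directory, and setting here are illustrative):

```
run_chain.py -n mychain -d /path/to/l1b_files --conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc
```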
## Developing New Chains
1. Decide on a chain name. For example **newchain**.
2. Create a $CLEV2ER_BASE_DIR/src/clev2er/algorithms/**newchain**/ directory to store the new chain's algorithms (a sketch of the resulting layout is shown after this list).
3. Create $CLEV2ER_BASE_DIR/src/clev2er/algorithms/**newchain**/tests to store the new chain's algorithm unit tests (formatted for pytest). At least one test file should be created per algorithm, containing suitable test functions.
4. Create your algorithms by copying and renaming the algorithm class template $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py into your algorithm directory. Each algorithm should have a different file name of your choice, for example alg_retrack.py, alg_geolocate.py. You need to fill in the appropriate sections of the init(), process() and finalize() functions for each algorithm (see the section below for more details on using algorithm classes).
5. You must also create a test for each algorithm in $CLEV2ER_BASE_DIR/src/clev2er/algorithms/**newchain**/tests. You should copy/adapt the test template $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/tests/test_alg_template1.py for your new test.
6. Each algorithm and its unit tests must pass the static code checks (pylint, mypy, etc.) that are automatically run as git pre-commit hooks.
7. Create a first XML or YML configuration file for the chain in $CLEV2ER_BASE_DIR/config/chain_configs/**newchain**/**anyname**.yml (or .xml). The configuration file contains any settings or resource locations that are required by your algorithms, and may include environment variables.
8. If required, create one or more finder class files. These allow finer control of L1b file selection from the command line (see the section below for more details).
9. Create an algorithm list XML (or YML) file in $CLEV2ER_BASE_DIR/config/algorithm_lists/**newchain**/**anyname**.xml (or .yml). You can copy the template in `$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml`.
10. To test your chain on a single L1b file, you can use `run_chain.py --name newchain -f /path/to/a/l1b_file`. There are many other options for running chains (see `run_chain.py -h`).
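
Assuming the example names used in the steps above, the resulting files would be laid out roughly like this (an illustrative sketch, not a definitive listing):

```
$CLEV2ER_BASE_DIR/
├── src/clev2er/algorithms/newchain/
│   ├── alg_retrack.py
│   ├── alg_geolocate.py
│   └── tests/
│       ├── test_alg_retrack.py
│       └── test_alg_geolocate.py
└── config/
    ├── chain_configs/newchain/newchain_config.xml
    └── algorithm_lists/newchain/newchain_alglist.xml
```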
## Algorithm and Finder Classes
This section discusses how to develop algorithms for your chain. There are two types of algorithms, both of which are dynamically loaded at chain run-time.
- Main algorithms: standard chain algorithm classes
- Finder algorithms: optional classes to manage input L1b file selection
### Algorithm Lists
Algorithms are dynamically loaded in a chain when (and in the order) they are named in the chain's algorithm list YAML or XML file: $CLEV2ER_BASE_DIR/config/algorithm_lists/**chainname**/**chainname**.yml (or .xml). This has two sections (l1b_file_selectors and algorithms), as shown in the example below:
YML version:

```
# List of L1b selector classes to call in order
l1b_file_selectors:
  - find_lrm  # find LRM mode files that match command line options
  - find_sin  # find SIN mode files that match command line options
# List of main algorithms to call in order
algorithms:
  - alg_identify_file  # find and store basic l1b parameters
  - alg_skip_on_mode   # finds the instrument mode of L1b, skip SAR files
  #- alg_...
```
XML version:

The XML version requires an additional top-level `<algorithm_list>` element that wraps the other sections. It also allows you to enable or disable individual algorithms within the list by setting the values *Enable* or *Disable*, and to set breakpoints by setting the value to *BreakpointAfter*.
```
<?xml version="1.0"?>

<algorithm_list>
    <algorithms>
        <alg_identify_file>Enable</alg_identify_file>
        <alg_skip_on_mode>Enable</alg_skip_on_mode>
        <!-- ... more algorithms -->
        <alg_retrack>BreakpointAfter</alg_retrack>
    </algorithms>

    <l1b_file_selectors>
        <find_lrm>Enable</find_lrm>
        <find_sin>Enable</find_sin>
    </l1b_file_selectors>
</algorithm_list>
```
### Main Algorithms
Each algorithm is implemented in a separate module located in:

`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/<chainname>/<alg_name>.py`

Each algorithm module should contain an Algorithm class, as per the algorithm template in:

`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py`

Please copy this template for all algorithms.
Algorithm class modules have three main functions:

- **init()** : used for initializing/loading resources. Called once at the start of processing.
- **process**(l1b: Dataset, shared_dict: dict) : called for every L1b file. The results of the processing may be saved in the shared_dict, so that they can be accessed by algorithms further down the chain. The L1b data for the current file being processed is passed to this function as a netCDF4 Dataset in the argument l1b.
- **finalize**() : called at the end of all processing to free resources.

All of these functions have access to the merged chain configuration dictionary **self.config**.

All logging must be done using **self.log**.info(), **self.log**.error(), or **self.log**.debug().
#### Algorithm.process() return values
It is important to note that Algorithm.**process()** return values affect how the chain operates. The .process() function returns a (bool, str) tuple.

Return values must be set as follows:

- (**True**, "") when the processing has completed without errors and continuation to the next algorithm in the chain (if available) is expected.
- (**False**, "**SKIP_OK** any reason message") when the processing has found a valid reason for the chain to skip any further processing of the L1b file (for example, if it does not measure over the target area). This is logged as a DEBUG message but is not an error. The chain moves on to the next L1b file.
- (**False**, "some error message") : in this case the error message is logged to the error log and the file is skipped. If **config**["chain"]["**stop_on_error**"] is false, the chain continues to the next L1b file; if it is true, the chain stops.
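
Putting this together, here is a minimal sketch of an Algorithm class following the interface described above. It is illustrative only: the base class, the wiring of self.config and self.log (which are provided by the framework), and the exact layout should be taken from alg_template1.py. The L1b variable name used here is hypothetical.

```python
"""Illustrative algorithm module (a sketch only; copy alg_template1.py for real work)."""

from netCDF4 import Dataset  # pylint: disable=no-name-in-module


class Algorithm:
    """Sketch only: in the framework, self.config and self.log are provided."""

    def init(self) -> None:
        """Load resources once, before any L1b file is processed."""
        # example of reading a resource location from the merged config
        self.dem_path = self.config["section1"]["some_data_location"]

    def process(self, l1b: Dataset, shared_dict: dict) -> tuple[bool, str]:
        """Process one L1b file, storing outputs for algorithms further down the chain."""
        if "time_20_ku" not in l1b.variables:  # hypothetical L1b variable name
            # valid reason to skip this file: logged as DEBUG, not an error
            return (False, "SKIP_OK no 20Hz time variable in this file")
        shared_dict["num_records"] = l1b.variables["time_20_ku"].size
        self.log.info("stored %d records", shared_dict["num_records"])
        return (True, "")  # continue to the next algorithm in the chain

    def finalize(self) -> None:
        """Free any resources loaded in init()."""
        self.dem_path = None
```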
### FileFinder Classes
FileFinder class modules provide more complex and tailored L1b input file selection than is possible with the standard **run_chain.py** command line options of:

- **--file** *path* : choose a single L1b file
- **--dir** *dir* : choose all L1b files in a flat directory

FileFinder classes are only used as the file selection method if the --file and --dir command line options are **not** used.

For example, you may wish to select files using a specific search pattern, or from multiple directories.
FileFinder classes are automatically initialized with:

- **self.config** : dict from the merged chain config; any settings can be used for file selection
- **self.months** (from command line option --month, if used)
- **self.years** (from command line option --year, if used)

FileFinder classes return a list of file paths through their .find_files() function. Code needs to be added to the .find_files() function to generate the file list.

Any number of differently named FileFinder class modules can be specified in the algorithm list file, under the **l1b_file_selectors:** section. File lists are concatenated if more than one finder class is used.
An example of a FileFinder class module can be found in:

`clev2er.algorithms.cryotempo.find_lrm.py`
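
As an illustration (a sketch only; copy the real example above for the exact base class and interface), a find_files() implementation that selects files from year/month sub-directories might look like this. The l1b_base_dir config key is hypothetical.

```python
"""Illustrative finder module (sketch only; see find_lrm.py for the real interface)."""

import glob
import os


class FileFinder:
    """Sketch only: self.config, self.months and self.years are set by the framework."""

    def find_files(self) -> list[str]:
        """Return the list of L1b file paths for the chain to process."""
        # l1b_base_dir is a hypothetical chain config setting; env vars expanded
        base_dir = os.path.expandvars(self.config["l1b_base_dir"])
        files: list[str] = []
        for year in self.years or []:
            for month in self.months or range(1, 13):
                pattern = os.path.join(base_dir, f"{year:04d}", f"{month:02d}", "*.nc")
                files.extend(sorted(glob.glob(pattern)))
        return files
```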
## Logging
Logging within the chain is performed using the standard python logging.Logger mechanism, with some minor adaptations to support multi-processing.
Within algorithm modules, logging should be performed using the in-class Logger instance, accessed via **self.log**:

- self.log.**info**('message') : to log informational messages
- self.log.**error**('message') : to log error messages
- self.log.**debug**('message') : to log messages for debugging

Debugging messages are only produced/saved if the chain is run in debug mode (use the run_chain.py **--debug** command line option).
### Log file Locations
Info, error, and debug logs are stored in separate log files. The locations of the log files are set in the chain configuration file in a section called log_files. You can use environment variables in your log file paths.
```
# Default locations for log files
log_files:
  append_year_month_to_logname: true
  errors: ${CT_LOG_DIR}/errors.log
  info: ${CT_LOG_DIR}/info.log
  debug: ${CT_LOG_DIR}/debug.log
```
The **append_year_month_to_logname** setting is used if the chain is run with the --year and/or --month command line arguments. Note that these command line options are passed to the optional finder classes to generate the list of L1b input files.

If these are used and the append_year_month_to_logname setting is **true**, then the year and month are appended to the log file names as follows:

- *logname*_MMYYYY.log : if both month and year are specified
- *logname*_YYYY.log : if only year is used
### Logging when using Multi-Processing
When multi-processing mode is selected, logged messages are automatically passed through a pipe to a temporary file (*logfilename*.mp). This will contain an unordered list of messages from all processes, which is difficult to read directly.

At the end of the chain run, the multi-processing log outputs are automatically sorted so that the messages relating to each L1b file's processing are collected together in order. This is then merged into the main log file.
## Breakpoint Files
Breakpoints can be set after any algorithm by:

- setting the *BreakpointAfter* value in the chain's algorithm list, or
- using the run_chain.py command line argument **--breakpoint_after** *algorithm_name*

When a breakpoint is set:

- the chain will stop after the specified algorithm has completed for each input file.
- the contents of the chain's *shared_dict* will be saved as a NetCDF4 file in the `<breakpoint_dir>` specified in the *breakpoint_files:default_dir* setting of the chain configuration file.
- the NetCDF4 file will be named `<breakpoint_dir>/<l1b_file_name>_bkp.nc`.
- if multiple L1b files are being processed through the chain, a breakpoint file will be created for each.
- single values or strings in the *shared_dict* will be included as global or group NetCDF attributes.
- if there are multiple levels in the *shared_dict*, a NetCDF group will be created for each level.
- multi-dimensional arrays (including numpy arrays) are supported up to dimension 3.
- NetCDF dimensions will not be named with physical meaning (i.e. time), as this information cannot be generically derived. Instead, dimensions will be named dim1, dim2, etc.
- all variables with the same dimension will share a common NetCDF dimension (dim1, etc.)
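
A breakpoint file can be inspected like any other NetCDF4 file, for example with the netCDF4 package (the file path below is hypothetical):

```python
# Sketch: inspect the contents of a breakpoint file with the netCDF4 package
from netCDF4 import Dataset  # pylint: disable=no-name-in-module

with Dataset("/tmp/l1b_file_bkp.nc") as nc:  # hypothetical breakpoint file path
    print(nc.ncattrs())        # single values/strings saved from shared_dict
    print(list(nc.groups))     # one group per sub-dict level in shared_dict
    print(list(nc.variables))  # root-level arrays, with dimensions dim1, dim2, ...
```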
## Developer Notes
### Code checks before committing
It is recommended to run pre-commit before a `git commit`. This runs the static code analysis tests (isort, pylint, ruff, mypy, ...) on your code and shows you any failures before you commit. The same tests are also run when you commit (and must pass).

`pre-commit run --all`
### Pytest Markers
Pytest markers are set up in `$CLEV2ER_BASE_DIR/pytest.ini`

It is important to use the correct pytest marker because GitHub CI workflows run pytest on the whole repository source code. Some pytest tests are not suitable for GitHub CI workflow runs due to their large external data dependencies. These need to be marked with `pytest.mark.requires_external_data` so that they are skipped. These tests can be run locally, where access to the data is available.
The following Pytest markers should be used in front of the relevant pytest functions:

- **requires_external_data** : testable on local systems with access to all external data/ADF (outside the repo)
- **non_core** : used to label non-core function tests, such as area plotting functions
Example:

```python
@pytest.mark.requires_external_data  # not testable on GitHub due to external data
def test_alg_lig_process_large_dem(l1b_file) -> None:
```

or placed at the top of a module:

```python
pytestmark = pytest.mark.non_core
```
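
Assuming the CI workflow deselects these tests with pytest's -m option, the same selection can be reproduced locally:

```
pytest -m "not requires_external_data"   # skip the external-data tests, as on CI
pytest -m requires_external_data         # run only the external-data tests locally
```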
### GitHub Pages Documentation from in-code Docstrings
This user manual is hosted on GitHub Pages: https://mssl-softeng.github.io/clev2er_sii

Content is created from docstrings (optionally containing Markdown: https://www.markdownguide.org/basic-syntax/#code ) in the code, using the *pdoc* package: https://pdoc.dev

Diagrams can be implemented using mermaid: https://mermaid.js.org
The site is locally built in `$CLEV2ER_BASE_DIR/docs`, using a pre-commit hook (hook id: pdocs_build). Hooks are configured in `$CLEV2ER_BASE_DIR/.pre-commit-config.yaml`

The hook calls the script `$CLEV2ER_BASE_DIR/pdocs_build.sh` to build the site whenever a `git commit` is run **in branch gh_pages**.

When a `git push` is run, GitHub automatically extracts the site from the docs directory and publishes it.

The front page of the site (i.e. this page) is located in the docstring within `$CLEV2ER_BASE_DIR/src/clev2er/__init__.py`.

The docstring within `__init__.py` of each package directory should provide markdown to describe the directories beneath it.
#### Process to Update Docs
One method of updating the GitHub Pages documentation from the code (i.e. to process the docstrings into html in the /docs folder):

- Edit docstrings in master branch code (or by pull request from another branch)
- `git commit -a -m "docs update"`
- `git checkout gh_pages`
- `git merge master`
- `pre-commit run --all` (runs pdocs to update the html in the docs folder)
- `git commit -a -m "docs update"`
- `git push`
- `git checkout master` (return to the master branch)
- `git merge gh_pages`
- `git push`
Why isn't this run automatically from the master branch, or in a GitHub workflow? Because pdoc (run as part of the pre-commit hooks) requires all code dependencies, including external data, to be in place when parsing/importing the code. External data is not available on GitHub, nor on some minimal installations of the master branch. So, to avoid pre-commit failing due to pdoc on other branches (or GitHub workflows doing the same), the docs are only updated on a controlled 'gh_pages' branch, which has all external data installed.