# CLEV2ER Land Ice and Inland Waters GPP Project

Documentation for the CLEV2ER Land Ice and Inland Waters GPP project, hosted on GitHub at [github.com/mssl-softeng/clev2er_liiw](https://github.com/mssl-softeng/clev2er_liiw).

The GPP runs within a framework designed for (but not restricted to) Level-1b to Level-2 processing of ESA radar altimetry mission data. The key features of the framework are dynamically loaded algorithm classes (from XML or YML algorithm lists), built-in multi-processing support, and a consistent, automated development and testing workflow. There are many run-time options in the chain controller command line tool.
The diagram below shows a simplified representation of the framework and its components.
## Main Features

- Command line chain controller tool: src/clev2er/tools/run_chain.py
- input L1b file selection (single file, multiple files, or dynamic algorithm selection)
- dynamic algorithm loading from XML or YML list(s)
  - algorithms are classes of type Algorithm with configurable .init(), .process(), .finalize() functions.
  - Algorithm.init() is called before any L1b file processing.
  - Algorithm.process() is called on every L1b file.
  - Algorithm.finalize() is called after all files have been processed.
  - Each algorithm has access to: L1b Dataset, shared working dict, config dict.
  - Algorithm/chain configuration by XML or YAML configuration files.
  - A shared python dictionary is used to pass algorithm outputs between algorithms in the chain.
- logging with standard warning, info, debug, error levels (+ multi-processing logging support)
- optional multi-processing built in, configurable maximum number of processes used.
- optional use of shared memory (for example for large DEMs and Masks) when using multi-processing. This is an optional experimental feature that must be used with great care as it can result in memory leaks (requiring a server reboot to free) if shared memory is not correctly closed.
- algorithm timing (with MP support)
- chain timing
- support for breakpoint files (saved as NetCDF4 files)
## Other processing chains developed within the framework

- [CLEV2ER Sea Ice & Icebergs](https://github.com/mssl-softeng/clev2er_sii)
- [CryoTEMPO Land Ice](https://github.com/mssl-softeng/clev2er_cryotempo)
- [CPOM Sea Ice](https://github.com/CPOM-Altimetry/cpom_seaice)
- [Generic Framework](https://github.com/mssl-softeng/clev2er)
## Change Log

This section details major changes to the framework (not individual chains):

| Date | Change |
| ------- | ------- |
| 14-Mar-24 | Documentation deployment workflow moved to gh_pages branch |
| 13-Mar-24 | Target Python version for project updated to 3.11 |
| 10-Mar-24 | Allows any_name.xml or .XML, .yml or .YML files for config or algorithm list |
| 09-Mar-24 | Removed baseline and version support; git branching will be used instead |
| 15-Nov-23 | algorithm_lists file directory structure changed to now add directory /*chainname*/ |
| 10-Nov-23 | Breakpoint support added. See the section on breakpoints below. |
## Installation of the Framework

Note that the framework installation has been tested on Linux and macOS systems. Use on other operating systems is possible but may require additional install steps, and is not directly supported.

Make sure you have *git* installed on your target system.

Clone the public git repository into a suitable directory on your system. This will create a directory called clev2er_liiw in your current directory.

With HTTPS:

`git clone https://github.com/mssl-softeng/clev2er_liiw.git`

or with SSH:

`git clone git@github.com:mssl-softeng/clev2er_liiw.git`

or with the GitHub CLI:

`gh repo clone mssl-softeng/clev2er_liiw`
## Shell Environment Setup
The following shell environment variables need to be set to support framework operations.
In a bash shell this might be done by adding export lines to your $HOME/.bashrc file.
- Set the CLEV2ER_BASE_DIR environment variable to the root of the clev2er package.
- Add $CLEV2ER_BASE_DIR/src to PYTHONPATH.
- Add ${CLEV2ER_BASE_DIR}/src/clev2er/tools to the PATH.
- Set the shell's ulimit -n to allow enough file descriptors to be available for multi-processing.
An example environment setup is shown below (the path in the first line should be adapted for your specific directory path):
```script
export CLEV2ER_BASE_DIR=/Users/someuser/software/clev2er_liiw
export PYTHONPATH=$PYTHONPATH:$CLEV2ER_BASE_DIR/src
export PATH=${CLEV2ER_BASE_DIR}/src/clev2er/tools:${PATH}
# for multi-processing/shared mem support set ulimit
# to make sure you have enough file descriptors available
ulimit -n 8192
```
### Environment Setup for Specific Chains

Additional environment setup may be required for specific chains. This is not necessary unless you intend to use these chains.
#### CLEV2ER Sea Ice Chain

The following is an example of potential additional environment variables required by the CLEV2ER **sea ice** chain. Actual values are currently TBD.

```script
# Specific environment for the CLEV2ER sea ice chain
export CLEV2ER_DATA_DIR=/some/dir/somewhere
export CLEV2ER_LOG_DIR=/some/logdir/somewhere
```
## Python Requirement

Python v3.11 must be installed or available before proceeding. A recommended minimal method of installing Python 3.11 is using Miniconda.
To install Python 3.11 using Miniconda, select the appropriate link for your operating system from:
https://docs.anaconda.com/free/miniconda/miniconda-other-installer-links/
For example, for **Linux** (select a different installer for other operating systems), download the installer and install a minimal Python 3.11 installation using:

```script
wget https://repo.anaconda.com/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh
chmod +x Miniconda3-py311_24.1.2-0-Linux-x86_64.sh
./Miniconda3-py311_24.1.2-0-Linux-x86_64.sh

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no] yes
```
You may need to start a new shell to refresh your environment before checking that python 3.11 is in your path.
Check that Python v3.11 is now available by typing:

```
python -V
```
## Virtual Environment and Package Requirements

This project uses *poetry* (a dependency manager; see https://python-poetry.org/) to manage package dependencies and virtual environments.

First, you need to install *poetry* on your system using the instructions at https://python-poetry.org/docs/#installation. Normally this just requires running:

`curl -sSL https://install.python-poetry.org | python3 -`

You should then ensure that poetry is in your path, such that the command

`poetry --version`

returns the poetry version number. You may need to modify your PATH variable to achieve this.
To make sure poetry is set up to use a Python 3.11 virtual environment when in the CLEV2ER base directory, run:

```
cd $CLEV2ER_BASE_DIR
poetry env use $(which python3.11)
```
### Install Required Python packages using Poetry

Run the following command to install the python dependencies for this project (it uses the settings in pyproject.toml to determine what to install):

```
cd $CLEV2ER_BASE_DIR
poetry install
```
### Load the Virtual Environment

Now you are all set up. Whenever you want to run any CLEV2ER chains you must first load the virtual environment using the `poetry shell` or `poetry run` commands.

```
cd $CLEV2ER_BASE_DIR
poetry shell
```
You should now be set up to run processing chains, etc.
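As an alternative to spawning a sub-shell with `poetry shell`, `poetry run` executes a single command inside the virtual environment; for example, `poetry run run_chain.py -h` (this assumes run_chain.py is on your PATH as set up earlier).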
## Run a simple chain test example

The following command will run a simple example test chain which dynamically loads 2 template algorithms and runs them on a set of CryoSat L1b files in a test data directory. The algorithms do not perform any actual processing as they are just template examples. Make sure you have the virtual environment already loaded using `poetry shell` before running this command.

`run_chain.py -n testchain -d $CLEV2ER_BASE_DIR/testdata/cs2/l1bfiles`

There should be no errors. Note that run_chain.py is set up as an executable, so it is not necessary to use `python run_chain.py`, although this will also work.
Note that the algorithms that are dynamically run are located in $CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/ (alg_template1.py and alg_template2.py).

The list of algorithms (and their order) for *testchain* is defined in $CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml

Chain configuration settings are defined in $CLEV2ER_BASE_DIR/config/main_config.xml, and algorithm configuration settings are defined in $CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml
To find all the command line options for run_chain.py, type:
`run_chain.py -h`

For further info, please see `clev2er.tools`.
## Developer Requirements
This section details additional installation requirements for developers who will develop/adapt new chains or algorithms.
### Install pre-commit hooks

pre-commit hooks are static code analysis scripts which are run (and must pass) before each git commit. For this project they include pylint, ruff, mypy, black, and isort.

To install the pre-commit hooks, do the following (note that the second line is not necessary if you have already loaded the virtual environment using `poetry shell`):

```
cd $CLEV2ER_BASE_DIR
poetry shell
pre-commit install
pre-commit run --all-files
```
Now, whenever you make changes to your code, it is recommended to run the following in your current code directory.
```
pre-commit run --all-files
```

This will check that your code passes all static code tests prior to running git commit. Note that these same tests are also run when you do a new commit, i.e. using `git commit -a -m "commit message"`. If the tests fail you must correct the errors before proceeding, and then rerun the git commit.
## Developer Workflow
This section describes the method that should be used to contribute to the project code. The basic method to develop a new feature is:
On your local repository:

1. Make sure your local 'master' branch is checked out and up-to-date (some steps may not be necessary).

   ```
   cd $CLEV2ER_BASE_DIR
   git checkout master
   git pull
   ```

2. Create a new branch, named xxx_featurename, where xxx is your initials.

   `git checkout -b xxx_featurename`

3. Develop and test your new feature within this branch, making git additions and commits as necessary. You should have at least one commit (probably several).

   `git commit -a -m "description of change"`

4. If you are developing a new module, then you must also write a pytest test for that module in a tests directory located in the same directory as the module. Note the section on pytest markers at the end of this document.

5. Static analysis tests will be run on your changes using pre-commit, either automatically during a git commit or by running the following in the directory of the code change or in the repository base directory (for a more complete check):

   `pre-commit run --all-files`

6. Once tested, push the new feature branch to GitHub:

   `git push -u origin xxx_featurename` [first time], or just `git push`

7. Go to GitHub: [github.com/mssl-softeng/clev2er_liiw](https://github.com/mssl-softeng/clev2er_liiw), or direct to the pull request URL shown in the output of your git push command.

8. Create a Pull Request on GitHub for your feature branch. This will automatically start a CI workflow that tests your branch for code issues and runs pytest tests. If it fails you should correct the errors on your local branch and repeat (steps 3 onwards) until it passes all tests.

9. Finally, your pull request will be reviewed and, if accepted, merged into the 'master' branch.

10. You can then delete your local branch and the remote branch on GitHub.

    ```
    git branch -d xxx_featurename
    git push origin --delete xxx_featurename
    ```

11. Repeat the whole process to add your next feature.
## Framework and Chain Configuration

The framework (run controller) and individual named algorithm chains each have separate configuration files. Configuration options can be categorized as:

- run controller (or main framework) default configuration
- per chain default configuration (to configure individual algorithms and resources)
- command line options (for input selection and modifying any default configuration options)

Chains can be configured using XML or YAML configuration files and optional command line options, in the following order of increasing precedence:

- main config file: $CLEV2ER_BASE_DIR/config/main_config.xml [must be XML]
- chain specific config file: $CLEV2ER_BASE_DIR/config/chain_configs/*chain_name*/*config_file_name* (.xml or .yml)
- command line options
- additional command line config options using the --conf_opts option

The configurations are passed to the chain's algorithms and finder classes via a merged python dictionary, available to the Algorithm classes as self.config.
### Run Control Configuration

The default run control configuration file is `$CLEV2ER_BASE_DIR/config/main_config.xml`
This contains general default settings for the chain controller. Each of these can be overridden by the relevant command line options.
| Setting | Options | Description |
| ------- | ------- | ----------- |
| use_multi_processing | true or false | if true, multi-processing is used |
| max_processes_for_multiprocessing | int | max number of processes to use for multi-processing |
| use_shared_memory | true or false | if true, allow use of shared memory (experimental feature) |
| stop_on_error | true or false | stop the chain on the first error found, or log the error and skip |
### Chain Specific Configuration

The default configuration for your chain's algorithms and finder classes should be placed in the chain specific config file:

`$CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<anyname>[.xml, .XML, or .yml]`

Configuration files may be in either XML (.xml) or YAML (.yml) format.
#### Formatting Rules for Chain Configuration Files

YAML or XML files can contain multi-level settings for key/value pairs of boolean, int, float or str.

- boolean values must be set to the string **true** or **false** (case insensitive)
- environment variables are allowed within strings as $ENV_NAME or ${ENV_NAME} (and will be evaluated)
- YAML or XML files may have multiple levels (or sections)
- XML files must have a top root level named *configuration* wrapping the lower levels. This is removed from the python config dictionary before being passed to the algorithms.
- chain configuration files must have:
  - a **log_files** section to provide locations of the log files (see below)
  - a **breakpoint_files** section to provide locations of the breakpoint files (see below)
Example of sections from a 2 level config file in YML:

```
# some_key: str: description
some_key: a string

section1:
  key1: 1
  key2: 1.5
  some_data_location: $MYDATA/dem.nc

section2:
  key: false
```
Example of sections from a 2 level config file in XML:

```
<?xml version="1.0"?>

<!-- configuration xml level required, but removed in python dict -->
<configuration>

<!-- some_key: str: description -->
<some_key>a string</some_key>

<section1>
    <key1>1</key1>
    <key2>1.5</key2>
    <some_data_location>$MYDATA/dem.nc</some_data_location>
</section1>

<section2>
    <key>false</key>
</section2>

</configuration>
```
These settings are available within Algorithm classes as a python dictionary called **self.config**, as in the following examples:

```
self.config['section1']['key1']
self.config['section1']['some_data_location']
self.config['some_key']
```
The config file will also be merged with the main run control dictionary. Settings in the chain configuration file take precedence over identical settings in the main run control dictionary, so you can override any main config settings in the named chain config if you want.
### Required Chain Configuration Settings
Each chain configuration file should contain sections to configure logging and breakpoints. See the section on logging below for an explanation of the settings.
Here is a minimal configuration file (XML format):

```
<?xml version="1.0"?>
<!--chain: mychain configuration file-->

<configuration> <!-- note this level is removed in python dict -->

<!-- Setup default locations to store breakpoint files -->
<breakpoint_files>
    <!-- set the default directory where breakpoint files are stored -->
    <default_dir>/tmp</default_dir>
</breakpoint_files>

<log_files>
    <!-- default directory to store log files -->
    <default_dir>/tmp</default_dir>
    <!-- info_name : str : file name base str for info files -->
    <info_name>info</info_name>
    <!-- error_name : str : file name base str for error files -->
    <error_name>error</error_name>
    <!-- debug_name : str : file name base str for debug files -->
    <debug_name>debug</debug_name>
    <!-- logname_str : str : additional string to add to the end of the log filename, before .log
         Leave empty if not required
    -->
    <logname_str></logname_str>

    <!-- append_date_selection : true or false; if year and month are specified on the
         command line, append _MMYYYY to the log file base name (before .log) -->
    <append_date_selection>true</append_date_selection>
    <append_process_id>false</append_process_id>
    <append_start_time>true</append_start_time>
</log_files>

<!-- add more levels and settings below here -->

<resources>
    <physical_constants>
        <directory>$CLEV2ER_BASE_DIR/testdata/adf/common</directory>
        <filename>
CR__AX_GR_CST__AX_00000000T000000_99999999T999999_20240201T000000__________________CPOM_SIR__V01.NC
        </filename>
        <mandatory>True</mandatory>
    </physical_constants>
</resources>

</configuration>
```
The requirements for specific settings are set by the chain and its algorithms. An example of a chain configuration file can be found at:

`$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml`
For testing purposes, it is sometimes useful to modify configuration settings directly from the command line. This can be done using the command line option --conf_opts, which can contain a comma-separated list of section:key:value pairs.

An example of changing the value of a resources setting (here a hypothetical resources:mydata entry) would be:

`--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc`
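Multiple settings can be changed at once by separating the section:key:value pairs with commas; an illustrative example, reusing the section1:key1 entry from the earlier YML example:

`--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc,section1:key1:2`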
## Developing New Chains

1. Decide on a chain name. For example **newchain**.
2. Create a $CLEV2ER_BASE_DIR/algorithms/**newchain**/ directory to store the new chain's algorithms.
3. Create $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests to store the new chain's algorithm unit tests (using tests formatted for pytest). At least one algorithm test file should be created per algorithm, which should contain suitable test functions.
4. Create your algorithms by copying and renaming the algorithm class template $CLEV2ER_BASE_DIR/algorithms/testchain/alg_template1.py into your algorithm directory. Each algorithm should have a different file name of your choice. For example: alg_retrack.py, alg_geolocate.py. You need to fill in the appropriate sections of the init(), process() and finalize() functions for each algorithm (see the section below for more details on using algorithm classes).
5. You must also create a test for each algorithm in $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests. You should copy/adapt the test template $CLEV2ER_BASE_DIR/algorithms/testchain/tests/test_alg_template1.py for your new test.
6. Each algorithm and its unit tests must pass the static code checks (pylint, mypy, etc.) which are automatically run as git pre-commit hooks.
7. Create a first XML or YML configuration file for the chain in $CLEV2ER_BASE_DIR/config/chain_configs/**newchain**/**anyname**.yml (or .xml). The configuration file contains any settings or resource locations that are required by your algorithms, and may include environment variables.
8. If required, create one or more finder class files. These allow fine control of L1b file selection from the command line (see the section below for more details).
9. Create an algorithm list file (XML or YML) in $CLEV2ER_BASE_DIR/config/algorithm_lists/**newchain**/**anyname**.xml (or .yml). You can copy the template in `$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml`.
10. To test your chain on a single L1b file, you can use `run_chain.py --name newchain -f /path/to/a/l1b_file`. There are many other options for running chains (see `run_chain.py -h`).
## Algorithm and Finder Classes
This section discusses how to develop algorithms for your chain. There are two types of algorithms, both of which are dynamically loaded at chain run-time.
- Main algorithms : standard chain algorithm classes
- Finder algorithms : optional classes to manage input L1b file selection
### Algorithm Lists

Algorithms are dynamically loaded in a chain when (and in the order) they are named in the chain's algorithm list YAML or XML file: $CLEV2ER_BASE_DIR/config/algorithm_lists/**chainname**/**chainname**.yml (or .xml). This has two sections (l1b_file_selectors and algorithms), as shown in the example below:

YML version:

```
# List of L1b selector classes to call in order
l1b_file_selectors:
  - find_lrm # find LRM mode files that match command line options
  - find_sin # find SIN mode files that match command line options
# List of main algorithms to call in order
algorithms:
  - alg_identify_file # find and store basic l1b parameters
  - alg_skip_on_mode # finds the instrument mode of L1b, skip SAR files
  #- alg_...
```
XML version:

The XML version requires an additional top-level `<algorithm_list>` element that wraps the other sections. It also allows you to enable or disable individual algorithms within the list by setting the values *Enable* or *Disable*, and to set breakpoints by setting the value to *BreakpointAfter*.

```
<?xml version="1.0"?>

<algorithm_list>
    <algorithms>
        <alg_identify_file>Enable</alg_identify_file>
        <alg_skip_on_mode>Enable</alg_skip_on_mode>
        <!-- ... more algorithms -->
        <alg_retrack>BreakpointAfter</alg_retrack>
    </algorithms>

    <l1b_file_selectors>
        <find_lrm>Enable</find_lrm>
        <find_sin>Enable</find_sin>
    </l1b_file_selectors>
</algorithm_list>
```
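Conceptually, the dynamic loading step is equivalent to the minimal sketch below. This is illustrative only, not the framework's actual loader; the Algorithm constructor signature is assumed to take no arguments here.

```python
import importlib
from typing import Any


def load_algorithms(chain_name: str, alg_names: list[str]) -> list[Any]:
    """Import each named algorithm module and instantiate its Algorithm class,
    preserving the order given in the algorithm list."""
    algorithms = []
    for name in alg_names:
        # e.g. clev2er.algorithms.testchain.alg_template1
        module = importlib.import_module(f"clev2er.algorithms.{chain_name}.{name}")
        # every algorithm module is expected to define a class named Algorithm
        # (constructor arguments are omitted in this sketch)
        algorithms.append(module.Algorithm())
    return algorithms
```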
### Main Algorithms

Each algorithm is implemented in a separate module located in

`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/<chainname>/<alg_name>.py`

Each algorithm module should contain an Algorithm class, as per the algorithm template in:

`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py`

Please copy this template for all algorithms.

Algorithm class modules have three main functions:

- **init()** : used for initializing/loading resources. Called once at the start of processing.
- **process(l1b: Dataset, shared_dict: dict)** : called for every L1b file. The results of the processing may be saved in the shared_dict, so that they can be accessed by algorithms called further down the chain. The L1b data for the current file being processed is passed to this function in a netCDF4 Dataset as argument l1b.
- **finalize()** : called at the end of all processing to free resources.

All of these functions have access to the merged chain configuration dictionary **self.config**.
All logging must be done using **self.log**.info(), **self.log**.error() or **self.log**.debug().
#### Algorithm.process() return values

It is important to note that Algorithm.**process()** return values affect how the chain operates. The .process() function returns a (bool, str) tuple.

Return values must be set as follows:

- (**True**, "") when the processing has completed without errors and continuation to the next algorithm in the chain (if available) is expected.
- (**False**, "**SKIP_OK** any reason message") when the processing has found a valid reason for the chain to skip any further processing of the L1b file (for example, if it does not measure over the target area). This will be logged as a DEBUG message but is not an error. The chain will move to processing the next L1b file.
- (**False**, "some error message") : in this case the error message will be logged to the error log and the file will be skipped. If **config**["chain"]["**stop_on_error**"] is false then the chain will continue to the next L1b file. If **config**["chain"]["**stop_on_error**"] is true, then the chain will stop.
### FileFinder Classes

FileFinder class modules provide more complex and tailored L1b input file selection than would be possible with the standard **run_chain.py** command line options of:

- (**--file path**) : choose a single L1b file
- (**--dir dir**) : choose all L1b files in a flat directory

FileFinder classes are only used as the file selection method if the --file and --dir command line options are **not** used.

For example, you may wish to select files using a specific search pattern, or from multiple directories.

FileFinder classes are automatically initialized with:

- **self.config** : dict from the merged chain dict; any settings can be used for file selection
- **self.months** (from command line option --month, if used)
- **self.years** (from command line option --year, if used)

FileFinder classes return a list of file paths through their .find_files() function. Code needs to be added to the .find_files() function to generate the file list.

Any number of differently named FileFinder class modules can be specified in the algorithm list file, under the **l1b_file_selectors:** section. File lists are concatenated if more than one finder class is used.

An example of a FileFinder class module can be found in:

`clev2er.algorithms.cryotempo.find_lrm.py`
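As a rough illustration (not the contents of the real find_lrm module; the directory layout and the l1b_base_dir config key are assumptions), a finder's .find_files() might glob a year/month directory tree:

```python
import glob
import os


class FileFinder:
    """Illustrative finder. self.config, self.months and self.years are
    set by the framework before .find_files() is called."""

    def find_files(self) -> list[str]:
        base_dir = self.config["l1b_base_dir"]  # assumed config key
        files: list[str] = []
        for year in self.years:
            for month in self.months:
                # assumed layout: <base_dir>/YYYY/MM/*.nc
                pattern = os.path.join(base_dir, f"{year:04d}", f"{month:02d}", "*.nc")
                files.extend(sorted(glob.glob(pattern)))
        return files
```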
## Logging

Logging within the chain is performed using the python standard logging.Logger mechanism, with some minor adaptation to support multi-processing.

Within algorithm modules, logging should be performed using the in-class Logger instance, accessed using **self.log**:

- self.log.**info**('message') : to log informational messages
- self.log.**error**('message') : to log error messages
- self.log.**debug**('message') : to log messages for debugging

Debugging messages are only produced/saved if the chain is run in debug mode (use the run_chain.py **--debug** command line option).
### Log file Locations

Info, error, and debug logs are stored in separate log files. The locations of the log files are set in the chain configuration file in a section called **log_files**. You can use environment variables in your log file paths.

```
# Default locations for log files
log_files:
  append_year_month_to_logname: true
  errors: ${CT_LOG_DIR}/errors.log
  info: ${CT_LOG_DIR}/info.log
  debug: ${CT_LOG_DIR}/debug.log
```
The **append_year_month_to_logname** setting is used if the chain is run with the --year and/or --month command line arguments. Note that these command line options are passed to the optional finder classes to generate a list of L1b input files.

If these are used and the append_year_month_to_logname setting is **true**, then the year and month are appended to the log file names as follows:

- *logname*_*MMYYYY*.log : if both month and year are specified
- *logname*_*YYYY*.log : if only year is used
### Logging when using Multi-Processing

When multi-processing mode is selected, logged messages are automatically passed through a pipe to a temporary file (*logfilename*.mp). This will contain an unordered list of messages from all processes, which is difficult to read directly.

At the end of the chain run, the multi-processing log outputs are automatically sorted so that messages relating to each L1b file's processing are collected together in order. This is then merged into the main log file.
## Breakpoint Files

Breakpoints can be set after any Algorithm by:

- setting the *BreakpointAfter* value in the chain's Algorithm list, or
- using the run_chain.py command line argument **--breakpoint_after** *algorithm_name*

When a breakpoint is set:

- the chain will stop after the specified algorithm has completed for each input file.
- the contents of the chain's *shared_dict* will be saved as a NetCDF4 file in the `<breakpoint_dir>` specified in the *breakpoint_files:default_dir* setting in the chain configuration file.
- the NetCDF4 file will be named as `<breakpoint_dir>/<l1b_file_name>_bkp.nc`
- if multiple L1b files are being processed through the chain, a breakpoint file will be created for each.
- single values or strings in the *shared_dict* will be included as global or group NetCDF attributes.
- if there are multiple levels in the *shared_dict* then a NetCDF group will be created for each level.
- multi-dimensional arrays (or numpy arrays) are supported up to dimension 3.
- NetCDF dimension variables will not be named with physical meaning (i.e. time), as this information cannot be generically derived. Instead, dimensions will be named dim1, dim2, etc.
- all variables with the same dimension will share a common NetCDF dimension (i.e. dim1, etc.)
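Because breakpoint files are plain NetCDF4, they can be inspected with any NetCDF tool; for example, in Python (the file path here is illustrative):

```python
from netCDF4 import Dataset

with Dataset("/tmp/some_l1b_file_bkp.nc") as ds:
    print(ds.ncattrs())        # single values/strings from shared_dict
    print(list(ds.variables))  # top-level arrays (dimensions dim1, dim2, ...)
    print(list(ds.groups))     # one group per sub-dict in shared_dict
```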
## Developer Notes

### Code checks before committing

It is recommended to run pre-commit before a `git commit`. This runs the static code analysis tests (isort, pylint, ruff, mypy, ...) on your code and shows you any failures before you commit. The same tests are also run when you commit (and must pass).

`pre-commit run --all-files`
### Pytest Markers

Pytest markers are set up in $CLEV2ER_BASE_DIR/pytest.ini

It is important to use the correct pytest marker due to the use of GitHub CI workflows that run pytest on the whole repository source code. Some pytest tests are not suitable for GitHub CI workflow runs due to their large external data dependencies. These need to be marked with `pytest.mark.requires_external_data` so that they are skipped. These tests can be run locally where access to the data is available.

The following Pytest markers should be used in front of relevant pytest functions:

- **requires_external_data**: testable on local systems with access to all external data/ADF (outside repo)
- **non_core**: used to label non-core function tests such as area plotting functions

Example:

```python
@pytest.mark.requires_external_data  # not testable on GitHub due to external data
def test_alg_lig_process_large_dem(l1b_file) -> None:
```

or placed at the top of a module:

```python
pytestmark = pytest.mark.non_core
```
### GitHub Pages Documentation from in-code Docstrings

This user manual is hosted on GitHub Pages (https://mssl-softeng.github.io/clev2er_liiw).

Content is created from docstrings (optionally containing Markdown: https://www.markdownguide.org/basic-syntax/#code ) in the code, using the *pdoc* package: https://pdoc.dev

Diagrams can be implemented using mermaid: https://mermaid.js.org

The site is locally built in `$CLEV2ER_BASE_DIR/docs`, using a pre-commit hook (hook id: pdocs_build). Hooks are configured in `$CLEV2ER_BASE_DIR/.pre-commit-config.yaml`

The hook calls the script `$CLEV2ER_BASE_DIR/pdocs_build.sh` to build the site whenever a `git commit` is run **in branch gh_pages**.

When a `git push` is run, GitHub automatically extracts the site from the docs directory and publishes it.

The front page of the site (i.e. this page) is located in the docstring within `$CLEV2ER_BASE_DIR/src/clev2er/__init__.py`.

The docstring within `__init__.py` of each package directory should provide markdown to describe the directories beneath it.
#### Process to Update Docs

One method of updating the GitHub Pages documentation from the code (i.e. to process the docstrings into html in the /docs folder) is:

- Edit docstrings in master branch code (or by pull request from another branch)
- `git commit -a -m "docs update"`
- `git checkout gh_pages`
- `git merge master`
- `pre-commit run --all-files` (runs pdocs to update the html in the docs folder)
- `git commit -a -m "docs update"`
- `git push`
- `git checkout master` (return to master branch)
- `git merge gh_pages`
- `git push`

Why isn't this run automatically from the master branch or in a GitHub workflow? This is because pdocs (part of the pre-commit hooks) requires all code dependencies to be in place, including external data, when parsing/importing the code. External data is not available on GitHub, nor on some minimal installations of the master branch. So, to avoid pre-commit failing due to pdocs on other branches, or GitHub workflows doing the same, the docs are only updated on a controlled 'gh_pages' branch (which has all external data installed).
1""" 2# CLEV2ER Land Ice and Inland Waters GPP Project 3 4Documentation for the CLEV2ER Land Ice and Inland Waters GPP project, hosted on GitHub at 5[github.com/mssl-softeng/clev2er_liiw](https://github.com/mssl-softeng/clev2er_liiw). 6 7The GPP runs within a framework designed for (but not 8restricted to) Level-1b to Level-2 processing of ESA radar altimetry mission data. The key features 9of the framework are dynamically loaded algorithm classes (from XML or YML lists of algorithms) and 10in-built support for multi-processing and a consistent automated development and testing workflow. 11There are many run-time options in the chain controller command line tool. 12 13The diagram below shows a simplified representation of the framework and its components. 14 15 16 17## Main Features 18 19* Command line chain controller tool : src/clev2er/tools/run_chain.py 20* input L1b file selection (single file, multiple files or dynamic algorithm selection) 21* dynamic algorithm loading from XML or YML list(s) 22 * algorithms are classes of type Algorithm with configurable .init(), .process(), .finalize() 23 functions. 24 * Algorithm.init() is called before any L1b file processing. 25 * Algorithm.process() is called on every L1b file, 26 * Algorithm.finalize() is called after all files have been processed. 27 * Each algorithm has access to: L1b Dataset, shared working dict, config dict. 28 * Algorithm/chain configuration by XML or YAML configuration files. 29 * A shared python dictionary is used to pass algorithm outputs between algorithms in the chain. 30* logging with standard warning, info, debug, error levels (+ multi-processing logging support) 31* optional multi-processing built in, configurable maximum number of processes used. 32* optional use of shared memory (for example for large DEMs and Masks) when using multi-processing. 33This is an optional experimental feature that must be used with great care as it can result in 34memory leaks (requiring a server reboot to free) if shared memory is not correctly closed. 35* algorithm timing (with MP support) 36* chain timing 37* support for breakpoint files (saved as NetCDF4 files) 38 39##Other processing chains developed within framework: 40 41- [CLEV2ER Sea Ice & Icebergs](https://github.com/mssl-softeng/clev2er_sii) 42- [CryoTEMPO Land Ice](https://github.com/mssl-softeng/clev2er_cryotempo) 43- [CPOM Sea Ice](https://github.com/CPOM-Altimetry/cpom_seaice) 44- [Generic Framework](https://github.com/mssl-softeng/clev2er) 45 46## Change Log 47 48This section details major changes to the framework (not individual chains): 49 50| Date | Change | 51| ------- | ------- | 52| 14-Mar-24 | Documentation deployment workflow moved to gh_pages branch| 53| 13-Mar-24 | Target Python version for project updated to 3.11| 54| 10-Mar-24 | Allows any_name.xml or .XML, .yml or .YML files for config or algorithm list| 55| 09-Mar-24 | removed baseline and version support. Will use git branching instead| 56| 15-Nov-23 | algorithm_lists file directory structure changed to now add directory /*chainname*/| 57| 10-Nov-23 | breakpoint support added. See section on breakpoints below. | 58 59## Installation of the Framework 60 61Note that the framework installation has been tested on Linux and MacOS systems. Use on 62other operating systems is possible but may require additional install steps, and is not 63directly supported. 64 65Make sure you have *git* installed on your target system. 66 67Clone the git public repository in to a suitable directory on your system. 
68This will create a directory called **/clev2er_liiw** in your current directory. 69 70with https: 71`git clone https://github.com/mssl-softeng/clev2er_liiw.git` 72 73or with ssh: 74`git clone git@github.com:mssl-softeng/clev2er_liiw.git` 75 76or with the GitHub CLI: 77`gh repo clone mssl-softeng/clev2er_liiw` 78 79## Shell Environment Setup 80 81The following shell environment variables need to be set to support framework 82operations. 83 84In a bash shell this might be done by adding export lines to your $HOME/.bashrc file. 85 86- Set the *CLEV2ER_BASE_DIR* environment variable to the root of the clev2er package. 87- Add $CLEV2ER_BASE_DIR/src to *PYTHONPATH*. 88- Add ${CLEV2ER_BASE_DIR}/src/clev2er/tools to the *PATH*. 89- Set the shell's *ulimit -n* to allow enough file descriptors to be available for 90 multi-processing. 91 92An example environment setup is shown below (the path in the first line should be 93adapted for your specific directory path): 94 95```script 96export CLEV2ER_BASE_DIR=/Users/someuser/software/clev2er_liiw 97export PYTHONPATH=$PYTHONPATH:$CLEV2ER_BASE_DIR/src 98export PATH=${CLEV2ER_BASE_DIR}/src/clev2er/tools:${PATH} 99# for multi-processing/shared mem support set ulimit 100# to make sure you have enough file descriptors available 101ulimit -n 8192 102``` 103 104### Environment Setup for Specific Chains 105 106Additional environment setup maybe required for specific chains. This is not 107necessary unless you intend to use these chains. 108 109#### CLEV2ER Sea Ice Chain 110 111The following is an example of potential additional environment variables 112required by the CLEV2ER **seaice** 113chain. Actual values currently TBD. 114 115```script 116# Specific Environment for CLEV2ER:landice chain 117export CLEV2ER_DATA_DIR=/some/dir/somewhere 118export CLEV2ER_LOG_DIR=/some/logdir/somewhere 119``` 120 121## Python Requirement 122 123python v3.11 must be installed or available before proceeding. 124A recommended minimal method of installation of python 3.11 is using Miniconda. 125 126To install Python 3.11 using Miniconda, select the appropriate link for your operating system from: 127 128https://docs.anaconda.com/free/miniconda/miniconda-other-installer-links/ 129 130For example, for **Linux** (select different installer for other operating systems), 131download the installer and install a minimal python 3.11 installation using: 132 133```script 134wget https://repo.anaconda.com/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh 135chmod +x Miniconda3-py311_24.1.2-0-Linux-x86_64.sh 136./Miniconda3-py311_24.1.2-0-Linux-x86_64.sh 137 138Do you wish the installer to initialize Miniconda3 139by running conda init? [yes|no] yes 140``` 141You may need to start a new shell to refresh your environment before 142checking that python 3.11 is in your path. 143 144Check that python v3.11 is now available, by typing: 145 146``` 147python -V 148``` 149 150## Virtual Environment and Package Requirements 151 152This project uses *poetry* (a dependency manager, see: https://python-poetry.org/) to manage 153package dependencies and virtual envs. 154 155First, you need to install *poetry* on your system using instructions from 156https://python-poetry.org/docs/#installation. Normally this just requires running: 157 158`curl -sSL https://install.python-poetry.org | python3 -` 159 160You should also then ensure that poetry is in your path, such that the command 161 162`poetry --version` 163 164returns the poetry version number. 
You may need to modify your 165PATH variable in order to achieve this. 166 167To make sure poetry is setup to use Python 3.11 virtual env when in the CLEV2ER base directory 168 169``` 170cd $CLEV2ER_BASE_DIR 171poetry env use $(which python3.11) 172``` 173 174### Install Required Python packages using Poetry 175 176Run the following command to install python dependencies for this project 177(for info, it uses settings in pyproject.toml to know what to install) 178 179``` 180cd $CLEV2ER_BASE_DIR 181poetry install 182``` 183 184### Load the Virtual Environment 185 186Now you are all setup to go. Whenever you want to run any CLEV2ER chains you 187must first load the virtual environment using the `poetry shell` or `poetry run` commands. 188 189``` 190cd $CLEV2ER_BASE_DIR 191poetry shell 192``` 193 194You should now be setup to run processing chains, etc. 195 196## Run a simple chain test example 197 198The following command will run a simple example test chain which dynamically loads 1992 template algorithms and runs them on a set of CryoSat L1b files in a test data directory. 200The algorithms do not perform any actual processing as they are just template examples. 201Make sure you have the virtual environment already loaded using `poetry shell` before 202running this command. 203 204`run_chain.py -n testchain -d $CLEV2ER_BASE_DIR/testdata/cs2/l1bfiles` 205 206There should be no errors. Note that run_chain.py is setup as an executable, so it is not 207necessary to use `python run_chain.py`, although this will also work. 208 209Note that the algorithms that are dynamically run are located in 210$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py, alg_template2.py 211 212The list of algorithms (and their order) for *testchain* are defined in 213$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_alglist.xml 214 215Chain configuration settings are defined in 216 217$CLEV2ER_BASE_DIR/config/main_config.xml and 218 219Algorithm configuration settings are defined in 220 221$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml 222 223To find all the command line options for *run_chain.py*, type: 224 225`run_chain.py -h` 226 227For further info, please see `clev2er.tools` 228 229## Developer Requirements 230 231This section details additional installation requirements for developers who will develop/adapt 232new chains or algorithms. 233 234### Install pre-commit hooks 235 236pre-commit hooks are static code analysis scripts which are run (and must be passed) before 237each git commit. For this project they include pylint, ruff, mypy, black, isort. 238 239To install pre-commit hooks, do the following: (note that the second line is not necessary if 240you have already loaded the virtual environment using `poetry shell`) 241 242``` 243cd $CLEV2ER_BASE_DIR 244poetry shell 245pre-commit install 246pre-commit run --all-files 247``` 248 249Now, whenever you make changes to your code, it is recommended to run the following 250in your current code directory. 251 252```pre-commit run --all-files``` 253 254This will check that your code passes all static code 255tests prior to running git commit. Note that these same tests are also run when 256you do a new commit, ie using `git commit -a -m "commit message"`. If the tests fail 257you must correct the errors before proceeding, and then rerun the git commit. 258 259## Developer Workflow 260 261This section describes the method that should be used to contribute to the project code. 
262The basic method to develop a new feature is: 263 264On your local repository: 265 2661. Make sure your local 'master' branch is checked out and up-to-date 267 (some steps may not be necessary). 268 269 ``` 270 cd $CLEV2ER_BASE_DIR 271 git checkout master 272 git pull 273 ``` 274 2752. Create a new branch, named xxx_featurename, where xxx is your initials 276 277 `git checkout -b xxx_featurename` 278 2793. Develop and test your new feature within this branch, making git additions and commits 280 as necessary. 281 You should have at least one commit (probably several). 282 283 `git commit -a -m "description of change"` 284 2854. If you are developing a new module, then you must also write a pytest test 286 for that module in a tests directory located in the same directory as the module. 287 Note the section on pytest markers at the end of this document. 288 2895. Static analysis tests will be run on your changes using pre-commit, either 290 automatically during a git commit or by running in the directory of the code 291 change or in the repository base directory (for a more complete check): 292 293 `pre-commit run --all` 294 2956. Once tested, push the new feature branch to GitHub 296 297 `git push -u origin xxx_featurename` [first time], or just `git push` 298 2997. Go to GitHub: [github.com/mssl-softeng/clev2er_liiw] 300 (https://github.com/mssl-softeng/clev2er_liiw) 301 or direct to the pull request URL shown in your git pull command. 302 3038. Create a Pull Request on GitHub for your feature branch. This will automatically start a CI 304 workflow that tests your branch for code issues and runs pytest tests. If it fails you 305 should correct the errors on your local branch and repeat (steps 3 onwards) until it passes 306 all tests. 307 3089. Finally your pull request will be reviewed and if accepted merged into the 'master' branch. 309 31010. You can then delete your local branch and the remote branch on Github. 311 312 ``` 313 git branch -d xxx_featurename 314 git push origin --delete xxx_featurename 315 316 ``` 317 31811. Repeat the whole process to add your next feature. 319 320 321## Framework and Chain Configuration 322 323The framework (run controller) and individual named algorithm chains each have 324separate configuration files. Configuration options can be categorized as: 325 326- run controller (or main framework ) default configuration 327- per chain default configuration (to configure individual algorithms and resources) 328- command line options (for input selection and modifying any default configuration 329 options) 330 331Chains can be configured using XML or YAML configuration files and optional command line 332options in the following order of increasing precedence: 333 334- main config file: $CLEV2ER_BASE_DIR/config/main_config.xml [Must be XML] 335- chain specific config file: 336 $CLEV2ER_BASE_DIR/config/chain_configs/*chain_name*/*config_file_name*.xml, 337 XML or .yml 338- command line options 339- command line additional config options using the --conf_opts 340 341The configurations are passed to 342the chain's algorithms and finder classes, via a merged python dictionary, available 343to the Algorithm classes as self.config. 344 345### Run Control Configuration 346 347The default run control configuration file is `$CLEV2ER_BASE_DIR/config/main_config.xml` 348 349This contains general default settings for the chain controller. Each of these can 350be overridden by the relevant command line options. 
351 352| Setting | Options | Description | 353| ------- | ------- | ----------- | 354| use_multi_processing | true or false | if true multi-processing is used | 355| max_processes_for_multiprocessing | int | max number of processes to use for multi-processing | 356| use_shared_memory | true or false | if true allow use of shared memory. Experimental feature | 357| stop_on_error | true or false | stop chain on first error found, or log error and skip | 358 359### Chain Specific Configuration 360 361The default configuration for your chain's algorithms and finder classes should be placed in 362the chain specific config file: 363 364`$CLEV2ER_BASE_DIR/config/chain_configs/<chain_name>/<anyname>[.xml,.XML,or .yml]` 365 366Configuration files may be either XML(.xml) or YAML (.yml) format. 367 368#### Formatting Rules for Chain Configuration Files 369 370YAML or XML files can contain multi-level settings for key value pairs of boolean, 371int, float or str. 372 373- boolean values must be set to the string **true** or **false** (case insensitive) 374- environment variables are allowed within strings as $ENV_NAME or ${ENV_NAME} (and will be 375 evaluated) 376- YAML or XML files may have multiple levels (or sections) 377- XML files must have a top root level named *configuration* wrapping the lower levels. 378 This is removed from the python config dictionary before being passed to the algorithms. 379- chain configuration files must have a 380 - **log_files** section to provide locations of the log files (see below) 381 - **breakpoint_files** section to provide locations of the log files (see below) 382 383Example of sections from a 2 level config file in YML: 384 385``` 386# some_key: str: description 387some_key: a string 388 389section1: 390 key1: 1 391 key2: 1.5 392 some_data_location: $MYDATA/dem.nc 393 394section2: 395 key: false 396``` 397 398Example of sections from a 2 level config file in XML: 399 400``` 401<?xml version="1.0"?> 402 403<!-- configuration xml level required, but removed in python dict --> 404<configuration> 405 406<!--some_key: str: description--> 407<some_key>a string</some_key> 408 409<section1> 410 <key1>1</key1> 411 <key2>1.5</key2> 412 <some_data_location>$MYDATA/dem.nc</some_data_location> 413</section1> 414 415<section2> 416 <key>false</key> 417</section2> 418 419</configuration> 420 421``` 422 423These settings are available within Algorithm classes as a python dictionary called 424**self.config** as in the following examples: 425 426``` 427self.config['section1']['key1'] 428self.config['section1']['some_data_location'] 429self.config['some_key'] 430``` 431 432The config file will also be 433merged with the main run control dictionary. Settings in the chain configuration 434file will take precedence over the main run control dictionary (if they are identical), so 435you can override any main config settings in the named chain config if you want. 436 437### Required Chain Configuration Settings 438 439Each chain configuration file should contain sections to configure logging and breakpoints. 440See the section on logging below for an explanation of the settings. 
441 442Here is a minimal configuration file (XML format) 443 444``` 445<?xml version="1.0"?> 446<!--chain: mychain configuration file--> 447 448<configuration> <!-- note this level is removed in python dict --> 449 450<!--Setup default locations to store breakpoint files--> 451<breakpoint_files> 452 <!-- set the default directory where breakpoint files are stored --> 453 <default_dir>/tmp</default_dir> 454</breakpoint_files> 455 456<log_files> 457 <!-- default directory to store log files --> 458 <default_dir>/tmp</default_dir> 459 <!-- info_name : str : file name base str for info files --> 460 <info_name>info</info_name> 461 <!-- error_name : str : file name base str for errorfiles --> 462 <error_name>error</error_name> 463 <!-- debug_name : str : file name base str for debug files --> 464 <debug_name>debug</debug_name> 465 <!-- logname_str : str : additional string to add to end of log filename, before .log 466 Leave empty if mot required 467 --> 468 <logname_str></logname_str> 469 470 <!-- append_date_selection : true or false, if year and month are specified on 471 command line append _MMYYYY to log file base name (before .log) --> 472 <append_date_selection>true</append_date_selection> 473 <append_process_id>false</append_process_id> 474 <append_start_time>true</append_start_time> 475</log_files> 476 477<!-- add more levels and settings below here --> 478 479<resources> 480 <physical_constants> 481 482 <directory>$CLEV2ER_BASE_DIR/testdata/adf/common</directory> 483 <filename> 484CR__AX_GR_CST__AX_00000000T000000_99999999T999999_20240201T000000__________________CPOM_SIR__V01.NC 485 </filename> 486 <mandatory>True</mandatory> 487 </physical_constants> 488</resources> 489 490</configuration> 491 492``` 493 494The requirement for specific settings are set by the chain and it's algorithms. 495An example of a chain configuration file can be found at: 496 497`$CLEV2ER_BASE_DIR/config/chain_configs/testchain/testchain_config.xml` 498 499For testing purposes it is sometimes useful to modify configuration settings directly 500from the command line. This can be done using the command line option --conf_opts which 501can contain a comma separated list of section:key:value pairs. 502 503An example of changing the value of the setting above would be: 504 505--conf_opts resources:mydata:${MYDATA_DIR}/somedata2.nc 506 507## Developing New Chains 508 5091. Decide on a chain name. For example **newchain** 5102. Create $CLEV2ER_BASE_DIR/algorithms/**newchain**/ directory to store the new chain's algorithms. 5113. Create $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests to store the new chain's 512 algorithm unit tests (using tests formatted for pytest). At least one algorithm test file 513 should be created per algorithm, which should contain suitable test functions. 5144. Create your algorithms by copying and renaming the algorithm class template 515 $CLEV2ER_BASE_DIR/algorithms/testchain/alg_template1.py in to your algorithm directory. Each 516 algorithm 517 should have a different file name of your choice. For example: alg_retrack.py, alg_geolocate.py. 518 You need to fill in the appropriate sections of the init(), process() and finalize() functions 519 for each algorithm (see section below for more details on using algorithm classes). 5205. You must also create a test for each algorithm in 521 $CLEV2ER_BASE_DIR/algorithms/**newchain**/tests. 522 You should copy/adapt the test template 523 $CLEV2ER_BASE_DIR/algorithms/testchain/tests/test_alg_template1.py 524 for your new test. 5256. 
Each algorithm and their unit tests must pass the static code checks (pylint, mypy, etc) which 526 are automatically run as git pre-commit hooks. 5277. Create a first XML or YML configuration file for the chain in 528 $CLEV2ER_BASE_DIR/config/chain_configs/**newchain**/**anyname**.yml or .xml. 529 The configuration file contains any settings or resource locations that are required 530 by your algorithms, and may include environment variables. 5318. If required create one or more finder class files. These allow fine control of L1b file 532 selection from the command line (see section below for more details). 5339. Create an algorithm list YML file in 534 $CLEV2ER_BASE_DIR/config/algorithm_lists/**newchain**/**anyname**.xml (or .yml) 535 You can copy the template 536 in `$CLEV2ER_BASE_DIR/config/algorithm_lists/testchain/testchain_config.xml` 53710. To test your chain on a single L1b file, you can use 538 `run_chain.py --name newchain -f /path/to/a/l1b_file`. There are many other options for 539 running chains (see `run_chain.py -h`). 540 541## Algorithm and Finder Classes 542 543This section discusses how to develop algorithms for your chain. There are two types 544of algorithms, both of which are dynamically loaded at chain run-time. 545 546- Main algorithms : standard chain algorithm classes 547- Finder algorithms : optional classes to manage input L1b file selection 548 549### Algorithm Lists 550 551Algorithms are dynamically loaded in a chain when (and in the order ) they are named in the chain's 552algorithm list YAML or XML file: 553$CLEV2ER_BASE_DIR/config/algorithm_lists/**chainname**/**chainname**.yml,.xml. 554This has two sections (l1b_file_selectors, and algorithms) as shown in the example below: 555 556YML version: 557 558``` 559# List of L1b selector classes to call in order 560l1b_file_selectors: 561 - find_lrm # find LRM mode files that match command line options 562 - find_sin # find SIN mode files that match command line options 563# List of main algorithms to call in order 564algorithms: 565 - alg_identify_file # find and store basic l1b parameters 566 - alg_skip_on_mode # finds the instrument mode of L1b, skip SAR files 567 #- alg_... 568``` 569 570XML version: 571 572The xml version requires an additional toplevel `<algorithm_list>` that wraps the other sections. 573It also allows you to enable or disable individual algorithms within the list by setting the 574values *Enable* or *Disable*, and to set breakpoints by setting the value to *BreakpointAfter*. 575 576``` 577<?xml version="1.0"?> 578 579<algorithm_list> 580 <algorithms> 581 <alg_identify_file>Enable</alg_identify_file> 582 <alg_skip_on_mode>Enable</alg_skip_on_mode> 583 <!-- ... more algorithms --> 584 <alg_retrack>BreakpointAfter</alg_retrack> 585 </algorithms> 586 587 <l1b_file_selectors> 588 <find_lrm>Enable</find_lrm> 589 <find_sin>Enable</find_sin> 590 </l1b_file_selectors> 591</algorithm_list> 592 593``` 594 595### Main Algorithms 596 597Each algorithm is implemented in a separate module located in 598 599`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/<chainname>/<alg_name>.py` 600 601Each algorithm module should contain an Algorithm class, as per the algorithm 602template in: 603 604`$CLEV2ER_BASE_DIR/src/clev2er/algorithms/testchain/alg_template1.py` 605 606Please copy this template for all algorithms. 607 608Algorithm class modules have three main functions: 609 610- **init()** : used for initializing/loading resources. Called once at the start of processing. 
611- **process**(l1b:Dataset,shared_dict:dict) : called for every L1b file. The results of the 612 processing may be saved in the shared_dict, so that it can be accessed by algorithms called 613 further down the chain. The L1b data for the current file being processed is passed to this 614 function in a netcdf4 Dataset as argument l1b. 615- **finalize**() : called at the end of all processing to free resouces. 616 617All of the functions have access to the merged chain configuration dictionary **self.config**. 618 619All logging must be done using **self.log**.info(), **self.log**.error(), **self.log**.debug(). 620 621#### Algorithm.process() return values 622 623It is important to note that Algorithm.**process()** return values affect how the 624chain operates. The .process() function returns (bool, str). 625 626Return values must be set as follows: 627 628- (**True**,"") when the processing has completed without errors and continuation to the 629 next algorithm in the chain (if available) is expected. 630- (**False**,"**SKIP_OK** any reason message") when the processing has found a valid reason for the 631 chain to skip any further processing of the L1b file. For example if it does not measure over the 632 target area. This will be logged as DEBUG message but is not an error. The chain will move to 633 processing the next L1b file. 634- (**False**,"some error message") : In this case the error message will be logged to the error log 635 and the file will be skipped. If **config**["chain"]["**stop_on_error**"] is False then the 636 chain will continue to the next L1b file. If **config**["chain"]["**stop_on_error**"] is True, 637 then the chain will stop. 638 639### FileFinder Classes 640 641FileFinder class modules provide more complex and tailored L1b input file selection 642than would be possible with the standard **run_chain.py** command line options of : 643 644- (**--file path**) : choose single L1b file 645- (**--dir dir**) : choose all L1b files in a flat directory 646 647FileFinder classes are only used as the file selection method if the --file and --dir 648command line options are **not** used. 649 650For example you may wish to select files using a specific search pattern, or from multiple 651directories. 652 653FileFinder classes are automatically initialized with : 654 655- **self.config** dict from the merged chain dict, any settings can be used for file selection 656- **self.months** (from command line option --month, if used) 657- **self.years** (from command line option --year, if used) 658 659FileFinder classes return a list of file paths through their .find_files() function. 660Code needs to be added to the .find_files() function to generate the file list. 661 662Any number of differently named FileFinder class modules can be specified in the algorithm list 663file, 664under the **l1b_file_selectors:** section. File lists are concatentated if more than one Finder 665class is used. 666 667An example of a FileFinder class module can be found in: 668 669`clev2er.algorithms.cryotempo.find_lrm.py` 670 671## Logging 672 673Logging within the chain is performed using the python standard logging.Logger mechanism 674but with some minor adaption to support multi-processing. 
## Logging

Logging within the chain is performed using the standard python logging.Logger mechanism,
with some minor adaptation to support multi-processing.

Within algorithm modules, logging should be performed using the in-class Logger
instance, accessed via **self.log** :

- self.log.**info**('message') : to log informational messages
- self.log.**error**('message') : to log error messages
- self.log.**debug**('message') : to log messages for debugging

Debugging messages are only produced/saved if the chain is run in debug mode (using the
run_chain.py **--debug** command line option).

### Log File Locations

Info, error, and debug logs are stored in separate log files. The locations
of the log files are set in the chain configuration file, in a section called
**log_files**. You can use environment variables in your log file paths.

```
# Default locations for log files
log_files:
  append_year_month_to_logname: true
  errors: ${CT_LOG_DIR}/errors.log
  info: ${CT_LOG_DIR}/info.log
  debug: ${CT_LOG_DIR}/debug.log
```

The **append_year_month_to_logname** setting is used if the chain is
run with the --year (and/or --month) command line arguments. Note that these
command line options are passed to the optional finder classes to generate the
list of L1b input files.

If these options are used and the append_year_month_to_logname setting is **true**,
the year and month are appended to the log file names as follows:

- *logname*_*MMYYYY*.log : if both month and year are specified
- *logname*_*YYYY*.log : if only the year is used

### Logging when using Multi-Processing

When multi-processing mode is selected, logged messages are automatically passed
through a pipe to a temporary file (*logfilename*.mp). This file
contains an unordered list of messages from all processes, which is difficult
to read directly.

At the end of the chain run, the multi-processing log outputs are automatically sorted
so that the messages relating to each L1b file's processing are collected together
in order. The result is then merged into the main log file.

## Breakpoint Files

Breakpoints can be set after any algorithm by:

- setting the *BreakpointAfter* value in the chain's algorithm list, or
- using the run_chain.py command line argument **--breakpoint_after** *algorithm_name*

When a breakpoint is set:

- the chain will stop after the specified algorithm has completed for each input file.
- the contents of the chain's *shared_dict* will be saved as a NetCDF4 file in the
  `<breakpoint_dir>` specified by the *breakpoints:default_dir* setting in the chain
  configuration file (see the sketch after this list).
- the NetCDF4 file will be named `<breakpoint_dir>/<l1b_file_name>_bkp.nc`.
- if multiple L1b files are being processed through the chain, a breakpoint file
  will be created for each.
- single values or strings in the *shared_dict* will be included as global or group
  NetCDF attributes.
- if there are multiple levels in the *shared_dict*, a NetCDF group will be
  created for each level.
- multi-dimensional arrays (including numpy arrays) are supported up to dimension 3.
- NetCDF dimension variables will not be named with physical meaning (ie time),
  as this information cannot be generically derived. Instead, dimensions are
  named dim1, dim2, etc.
- all variables with the same dimension will share a common NetCDF dimension (ie dim1, etc).
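The corresponding chain configuration entry might look like the sketch below. Only the
breakpoints:default_dir setting is described in this manual; the directory value shown is
just an example.

```
# Sketch of a breakpoints section in the chain configuration file
breakpoints:
  default_dir: ${CT_LOG_DIR}/breakpoints # where <l1b_file_name>_bkp.nc files are written
```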
## Developer Notes

### Code checks before committing

It is recommended to run pre-commit before a `git commit`. This runs the static
code analysis tests (isort, pylint, ruff, mypy, ...) on your code and shows you any
failures before you commit. The same tests are also run when you commit (and must pass).

`pre-commit run --all-files`

### Pytest Markers

Pytest markers are set up in $CLEV2ER_BASE_DIR/pytest.ini

It is important to use the correct pytest marker because GitHub CI
workflows run pytest on the whole repository source code. Some pytest
tests are not suitable for GitHub CI workflow runs due to their large external
data dependencies. These need to be marked with `pytest.mark.requires_external_data`
so that they are skipped. Such tests can be run locally, where access to the data
is available.

The following pytest markers should be used in front of the relevant pytest functions:

- **requires_external_data** :
  testable only on local systems with access to all external data/ADF (outside the repo)
- **non_core** :
  used to label non-core function tests, such as area plotting functions

Example:

```python
@pytest.mark.requires_external_data  # not testable on GitHub due to external data
def test_alg_lig_process_large_dem(l1b_file) -> None:
```

or placed at the top of a module:

```python
pytestmark = pytest.mark.non_core
```
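For reference, marker registration in pytest.ini typically looks like the sketch below; see
the actual `$CLEV2ER_BASE_DIR/pytest.ini` for the definitive definitions, as the descriptions
here are paraphrased.

```
# Sketch only; see $CLEV2ER_BASE_DIR/pytest.ini for the real marker definitions
[pytest]
markers =
    requires_external_data: requires external data/ADF not available in GitHub CI
    non_core: labels non-core function tests (e.g. plotting)
```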
### GitHub Pages Documentation from in-code Docstrings

This user manual is hosted on GitHub Pages (https://mssl-softeng.github.io/clev2er_liiw).

Content is created from docstrings in the code
(optionally containing Markdown: https://www.markdownguide.org/basic-syntax/#code),
using the *pdoc* package: https://pdoc.dev

Diagrams can be implemented using mermaid: https://mermaid.js.org

The site is built locally in `$CLEV2ER_BASE_DIR/docs`, using a pre-commit hook
(hook id: pdocs_build).
Hooks are configured in `$CLEV2ER_BASE_DIR/.pre-commit-config.yaml`

The hook calls the script `$CLEV2ER_BASE_DIR/pdocs_build.sh` to build the site
whenever a `git commit` is run **in branch gh_pages**.

When a `git push` is run, GitHub automatically extracts the site from the
docs directory and publishes it.

The front page of the site (ie this page) is located in the docstring within
`$CLEV2ER_BASE_DIR/src/clev2er/__init__.py`.

The docstring within the `__init__.py` of each package directory should provide
markdown describing the directories beneath it.

#### Process to Update Docs

One method of updating the GitHub Pages documentation from the code (ie of
processing the docstrings into html in the /docs folder):

- Edit docstrings in master branch code (or by pull request from another branch)
- git commit -a -m "docs update"
- git checkout gh_pages
- git merge master
- pre-commit run --all-files (runs pdoc to update the html in the docs folder)
- git commit -a -m "docs update"
- git push
- git checkout master (return to the master branch)
- git merge gh_pages
- git push

Why isn't this run automatically from the master branch, or in a GitHub workflow?
Because *pdoc* (run as part of the pre-commit hooks) requires all code dependencies,
including external data, to be in place when it parses/imports the code. External
data is not available on GitHub, nor on some minimal installations of the master
branch. So, to avoid pre-commit failing due to pdoc on other branches (or GitHub
workflows doing the same), the docs are only updated on the controlled gh_pages
branch, where all external data is installed.