Jekyll2023-11-12T20:18:44+00:00https://changliao.github.io/science/feed.xmlEarth ScienceBehind-the-scenes of Earth Science modeling.
Chang Liaochangliao.climate@gmail.comWhat prevents us from learning new programming skills2023-11-11T00:00:00+00:002023-11-11T00:00:00+00:00https://changliao.github.io/science/posts/2023/11/11/why_earth_science_need_rse<p>A couple of days ago, I ran into a Twitter post:
https://x.com/_VincentS_/status/1722674693616910450?s=20</p>
<p><img src="https://github.com/changliao/science/blob/main/_figures/programming/github.jpeg?raw=true" alt="Figure 1" /></p>
<p>I recently turned to Bluesky for social networking (https://bsky.app/profile/changliao.bsky.social).</p>
<p>If you can open the Twitter post with the figure, you might wonder what its intention is. The post has a few comments as well. Here are some quick takeaways based on my understanding. First, some developers don’t actually write code but provide suggestions to other developers, while others take the time to write the code themselves. One comment also pointed out that some GitHub activities are <strong>fake</strong> because they are not programming activities but spell checks. This reminds me that some people even buy GitHub stars.</p>
<p>I come from an academic background, where this figure reflects some reality. Most senior modeling researchers do not code, and some don’t even have a GitHub account, yet they still call themselves modelers. In contrast, a PhD student, postdoc, or early-career scientist may still participate in programming activities, so their GitHub profile may resemble the lower part of the figure.</p>
<p>So, what prevents senior modelers from coding? My experience and observation provide me with several explanations:</p>
<ol>
<li>Senior modelers don’t have time for programming. Some are busy with proposals and team building and often have to leave the heavy lifting to early-career researchers. Also, GitHub is a relatively new technology, and many modelers came to fame before GitHub was born.</li>
<li>Senior modelers actually lack coding skills. This is possible because many modelers have a mathematical or equation-based background and little experience in computer programming.
I have also seen peers use Excel for modeling, which differs from standard practice.</li>
<li>Senior modelers stopped learning new skills. This might be a deeper problem that most of us ignore.</li>
</ol>
<p>I will skip reasons 1 and 2 because reason 3 feels more personal.
My personal experience is that I received most of my programming training during my undergraduate years, from 2005 to 2009. Most of my C/C++ knowledge was taught in classes. I also had some courses taught in MATLAB for image processing for remote sensing datasets.
I taught myself IDL (to replace MATLAB) and C# during my master’s program between 2009 and 2012.
I then picked C++ up again during my PhD, from 2012 to 2017.
After my PhD, I taught myself Python to replace IDL.</p>
<p>I use Python and C++ daily, but I still feel I need to improve my programming skills. Why? Because I still need to catch up with the latest C++ and Python features. For an Earth scientist, once you find a way to do a task, you are likely to stick with it for a long time. This is what we call a habit. I have seen peers use MATLAB and NCL and refuse to switch to Python even though they know NCL will not be supported.</p>
<p>As scientists, we must focus on the science, not the process or the solution.</p>
<p>On the other hand, advances in high-performance computing (HPC) often mask the limits of our programming skills. I have seen peers write poorly performing code and run it on HPC.
No one will question the code if it takes a short time to run on HPC.
If a code takes a long time to run on HPC, most modelers will consider it computationally expensive. Most of us will not ask whether that is because the code was poorly written. That is also why we need FAIR (findable, accessible, interoperable, and reusable) practices, so peers can help each other improve the code.</p>
<p>Most organizations need a mechanism to train scientists to become better modelers, but it ultimately depends on personal career development. Since academia often only rewards publications, only some scientists will invest time in programming. To stay in the game, the others will instead use more expensive computers (which requires larger project funding) or hire early-career researchers to meet the computational demands. Once an early-career researcher becomes senior and gains access to more resources, they will do the same.</p>Chang LiaoA couple of days ago, I ran into a Twitter post: https://x.com/VincentS/status/1722674693616910450?s=20Issue in land river coupling in E3SM2023-03-31T00:00:00+00:002023-03-31T00:00:00+00:00https://changliao.github.io/science/posts/2023/03/31/land_river_coupling_issue<p>Recently, when I was testing land-river coupling in E3SM, I found some longstanding issues.</p>
<p>When the coupler sends fluxes or states from one component to another, it sometimes needs to consider the associated area in order to conserve mass.</p>
<p>For example, if the flux is runoff, expressed in mm/day, then the coupler calculates the mass as flux × area. However, a grid cell is only partially covered by land, so the area is calculated as dArea_grid × dFraction_land.</p>
<p>However, in earlier development, this land fraction was often set to 1.0 because lakes and rivers are small at 1.0 degree resolution (~100 km). This decision makes the river area 0.0.</p>
<p>The problem comes when we want to transfer flux from river back to land.</p>
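<p>The area bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the E3SM coupler code: the function names, the idea that the river area is the remainder of the grid cell, and the example numbers are all hypothetical. Only the formulas flux × area and dArea_grid × dFraction_land come from the text.</p>

```python
def component_areas(grid_area_m2, land_fraction):
    """Split a grid cell into land and river areas (hypothetical helper).

    Assumes the river takes whatever part of the cell is not land.
    """
    land_area = grid_area_m2 * land_fraction
    river_area = grid_area_m2 * (1.0 - land_fraction)
    return land_area, river_area


def flux_to_volume(flux_mm_per_day, area_m2):
    """Convert a depth flux (mm/day) over an area (m2) to a volume (m3/day)."""
    return flux_mm_per_day * area_m2 / 1000.0


# A cell of roughly 100 km x 100 km with the land fraction hard-coded to 1.0
grid_area = 1.0e10
land_area, river_area = component_areas(grid_area, 1.0)

# Land to river: 2 mm/day of runoff over the land area
land_to_river = flux_to_volume(2.0, land_area)

# River to land: any flux multiplied by a zero river area carries no mass
river_to_land = flux_to_volume(5.0, river_area)

print(land_area, river_area, land_to_river, river_to_land)
```

<p>With the land fraction fixed at 1.0, the land-to-river direction looks fine, but the river-to-land direction silently loses all of its mass because the river area is zero.</p>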
<p>Again, a design decision made at an early stage can cause problems at a later stage. This is another example of technical debt.</p>Chang LiaoRecently when I was testing some land river coupling in E3SM, I found some longstanding issues.How to couple land and river model using a MPAS mesh2023-03-24T00:00:00+00:002023-03-24T00:00:00+00:00https://changliao.github.io/science/posts/2023/03/24/land_river_using_mpas<p>The E3SM river component MOSART can be run using an MPAS mesh. However, MOSART requires forcing data, such as surface runoff, from a land model.</p>
<p>If the land model is not turned on, we can still run MOSART with external forcing data, which are often not on the MPAS mesh.</p>
<p>This article explains how to run a coupled lnd-rof-(atm) simulation with rof on the MPAS mesh.</p>
<p>We need to carry out several steps, but not necessarily in the following order:</p>
<ul>
<li>Generate the MPAS mesh-based MOSART parameters and the MOSART domain file.</li>
<li>Generate the envelope lnd domain using the MOSART domain file, and use this domain file as the atmosphere domain as well (for some reason, datm is still needed).</li>
<li>Generate the mapping between these two domain files.</li>
<li>Since land and river are no longer on the same grid, create a new compset/grid to reflect this.</li>
<li>Update the coupler so the dlnd variable can be accepted. In general, dlnd uses the mapping file to convert the stream files, which are then passed to the coupler through l2x.</li>
<li>Update the coupler so rof can accept the incoming variable through x2r.</li>
</ul>Chang LiaoThe E3SM river component MOSART can be run using a MPAS mesh. However, the MOSART requires forcing data such as surface runoff from a land model.A review on remap2023-03-09T00:00:00+00:002023-03-09T00:00:00+00:00https://changliao.github.io/science/posts/2023/03/09/review_remap<p>https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/872579110/Running+E3SM+on+New+Grids</p>
<p>https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/1043235115/Special+Considerations+for+FV+Physics+Grids</p>
<p>https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/178848194/Transition+to+TempestRemap+for+Atmosphere+grids</p>
<p>https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/84541856/Creating+mapping+and+domain+files</p>
<p>https://github.com/ClimateGlobalChange/tempestremap</p>Chang Liaohttps://acme-climate.atlassian.net/wiki/spaces/DOC/pages/872579110/Running+E3SM+on+New+GridsWhere should hydrology go2023-03-02T00:00:00+00:002023-03-02T00:00:00+00:00https://changliao.github.io/science/posts/2023/03/02/where_should_hydrology_go<p>This post is a reflection following up on a recent article.
https://blogs.egu.eu/divisions/hs/2023/03/01/where-should-hydrology-go/</p>
<p>From a computational hydrologist’s perspective, one limitation in hydrology is how to connect
the water cycle with both natural and anthropogenic processes in the Earth system model.</p>
<p>It is generally easy to focus on one process or term, such as runoff or ET. However, it is challenging to link ET with runoff in different landscapes.</p>
<p>In the Earth system model framework, we need to consider all the water cycle processes. For example, how does water flow from land to river, and then to lake or ocean? And how does ET move from land or lake into the atmosphere?</p>
<p>The first challenge in ESM is how to represent land, river, and lake appropriately so that they can communicate. For example, the Antarctic and Greenland are considered masses of glaciers, but many other hydrologic processes on them are ignored.</p>
<p>The second challenge is how to account for vegetation and animal feedbacks on the water cycle. This is also important for the carbon cycle.</p>
<p>The last challenge is how to account for the human factor, including agriculture and dam operation.</p>
<p>There is also a dependency relationship between these challenges. For example, without improving the representation of the natural system, there will be large uncertainty in the human factor.</p>
<p>In ESM, we need to consider all three challenges together to better understand the water cycle.</p>Chang LiaoThis post is a reflection following up a recent article. https://blogs.egu.eu/divisions/hs/2023/03/01/where-should-hydrology-go/Thoughts on research software2023-02-28T00:00:00+00:002023-02-28T00:00:00+00:00https://changliao.github.io/science/posts/2023/02/28/research_software<p>This post is a reflection following up on a recent article.
https://www.nature.com/articles/s41559-023-02008-w</p>
<p>In my experience in Earth science, research software development is always underappreciated. Software development has never received enough credit, let alone been published in a high-impact journal.</p>
<p>Looking back at 2015-2017, when I started to use GitHub, there were barriers that prevented me from practicing open source well. For several papers from that time, I didn’t share all the code and data.</p>
<p>But now I have contributed to multiple open source projects. With platforms like GitHub, Zenodo, and Overleaf, sharing resources is becoming easier and easier.</p>
<p>At the same time, I still see many papers that do not make their data and code publicly available, especially when the conclusions drawn are also questionable.</p>
<p>My current practices to promote open science:</p>
<ol>
<li>Only read and recommend papers that share both data and code;</li>
<li>Only cite papers that share both data and code;</li>
</ol>
<p>As the old saying goes, “talk is cheap, show me the code”. We cannot trust research that cannot be reproduced.</p>
<p>Personally, I think if you can’t even share your work with your family with excitement, how can you convince yourself that the research is meaningful?</p>Chang LiaoThis post is a reflection following up a recent article. https://www.nature.com/articles/s41559-023-02008-wThe domain file in ESM2023-02-08T00:00:00+00:002023-02-08T00:00:00+00:00https://changliao.github.io/science/posts/2023/02/08/domain_file<p>In order to set up an MPAS mesh-based MOSART/ELM simulation, I need to prepare a <code class="language-plaintext highlighter-rouge">domain file</code>.</p>
<p>After some effort, I was not able to find any documentation describing the structure of the so-called domain file.</p>
<p>However, I found quite a bit of documentation on how to generate this domain file, such as: https://www2.cesm.ucar.edu/models/cesm1.2/clm/models/lnd/clm/doc/UsersGuide/x11812.html</p>
<p>Without official documentation of the format, the only way to understand the structure of the domain file is through existing files and possibly code.</p>
<p>In general, the domain file stores the information of mesh cells, including cell center, vertices, and area.</p>
<p>The cell center is either a 1D (unstructured) or 2D (structured) array.</p>
<p>As a result, the vertices can be a 2D (unstructured) or 3D (structured) array. In practice, the vertices array often uses the (nj, ni, nv) structure to store the data.</p>
<p>For an unstructured mesh, we can set <code class="language-plaintext highlighter-rouge">nj</code> or <code class="language-plaintext highlighter-rouge">ni</code> to 1.</p>
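<p>The layout described above can be sketched with plain Python lists. This is only an illustration under stated assumptions: the helper name is hypothetical, the variable names <code class="language-plaintext highlighter-rouge">xc</code>, <code class="language-plaintext highlighter-rouge">yc</code>, <code class="language-plaintext highlighter-rouge">xv</code>, and <code class="language-plaintext highlighter-rouge">yv</code> follow what I have seen in existing domain files and should be checked against your own file, and padding short cells by repeating the last vertex is my assumption, not a documented rule.</p>

```python
def build_domain_arrays(cell_centers, cell_vertices):
    """Arrange unstructured cells in the (nj, ni, nv) layout with nj = 1.

    cell_centers: list of (lon, lat) tuples, one per cell.
    cell_vertices: list of lists of (lon, lat) tuples, one list per cell.
    """
    nj = 1
    ni = len(cell_centers)
    nv = max(len(verts) for verts in cell_vertices)

    # Cell centers: shape (nj, ni)
    xc = [[lon for lon, _lat in cell_centers]]
    yc = [[lat for _lon, lat in cell_centers]]

    # Cell vertices: shape (nj, ni, nv); cells with fewer than nv vertices
    # are padded by repeating their last vertex (an assumption)
    xv_row, yv_row = [], []
    for verts in cell_vertices:
        padded = list(verts) + [verts[-1]] * (nv - len(verts))
        xv_row.append([lon for lon, _lat in padded])
        yv_row.append([lat for _lon, lat in padded])

    return {"nj": nj, "ni": ni, "nv": nv,
            "xc": xc, "yc": yc, "xv": [xv_row], "yv": [yv_row]}


# Two made-up cells: a quadrilateral and a hexagon
centers = [(0.0, 0.0), (1.0, 0.5)]
vertices = [
    [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)],
    [(0.5, 0.0), (1.0, -0.3), (1.5, 0.0), (1.5, 1.0), (1.0, 1.3), (0.5, 1.0)],
]
domain = build_domain_arrays(centers, vertices)
print(domain["nj"], domain["ni"], domain["nv"])
```

<p>Writing these arrays to a netCDF file would need a netCDF library, which is left out here to keep the sketch self-contained.</p>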
<h3 id="different-types-of-domain-files">Different types of domain files</h3>
<h2 id="elm-surface-data">ELM surface data</h2>
<p><code class="language-plaintext highlighter-rouge">gen_domain to create a domain file for datm from a mapping file. The domain file is then used by BOTH DATM AND CLM to define the grid and land-mask.</code></p>
<h2 id="stream-file">Stream file</h2>
<h3 id="differences-between-global-and-local-domain-files">Differences between global and local domain files</h3>
<p><code class="language-plaintext highlighter-rouge">ATM_DOMAIN_FILE</code> and <code class="language-plaintext highlighter-rouge">ATM_DOMAIN_PATH</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <entry id="ATM_DOMAIN_FILE">
<type>char</type>
<default_value>UNSET</default_value>
<group>run_domain</group>
<file>env_run.xml</file>
<desc>atm domain file</desc>
</entry>
<entry id="ATM_DOMAIN_PATH">
<type>char</type>
<default_value>$DIN_LOC_ROOT/share/domains</default_value>
<group>run_domain</group>
<file>env_run.xml</file>
<desc>path of atm domain file</desc>
</entry>
</code></pre></div></div>Chang LiaoIn order to set up a MPAS mesh-based MOSART/ELM simulation, I need to prepare a domain file.Leap year and technical debt2023-02-02T00:00:00+00:002023-02-02T00:00:00+00:00https://changliao.github.io/science/posts/2023/02/02/leapyear_technical_debt<p>In my work, I need to convert <code class="language-plaintext highlighter-rouge">E3SM</code> model output into a different format. The output file is in the <code class="language-plaintext highlighter-rouge">netCDF</code> format, and I found an interesting design issue in the model.</p>
<p>The model runs at one time step, but the output can be written at a different one. For example, the model can run at a 3-hour time step while the output is daily or monthly.</p>
<p>These are controlled by several namelist variables. However, the model does not handle the different numbers of days in different months, which is also relevant to leap years.</p>
<p>As a result, some of the output time series have 360 days (12 * 30) per year, some 365 days, and some 366 days. In my opinion, this is a typical example of technical debt, a concept I learned about recently.</p>
<p>This type of design makes postprocessing and exchange with other workflows extremely difficult. For example, the <code class="language-plaintext highlighter-rouge">time</code> variable within the netCDF file is used to index the time series. This variable may start from 0 or 1, and its length also varies.</p>
<p>Ideally, we should use the exact number of days throughout the whole model so they are consistent in all processes.</p>
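<p>To make the mismatch concrete, here is a small Python sketch of the kind of guessing involved. It is not part of E3SM or any postprocessing tool: the function names are hypothetical, and the calendar labels <code class="language-plaintext highlighter-rouge">360_day</code>, <code class="language-plaintext highlighter-rouge">noleap</code>, and <code class="language-plaintext highlighter-rouge">gregorian</code> are borrowed from the CF conventions as an assumption about how one might name the three cases.</p>

```python
import calendar


def days_in_year(year, calendar_name):
    """Number of days in a year under three common model calendars."""
    if calendar_name == "360_day":
        return 360  # 12 months of 30 days
    if calendar_name == "noleap":
        return 365  # leap days are ignored
    if calendar_name == "gregorian":
        return 366 if calendar.isleap(year) else 365
    raise ValueError("unknown calendar: " + calendar_name)


def expected_daily_length(start_year, end_year, calendar_name):
    """Expected number of daily records for an inclusive range of years."""
    return sum(days_in_year(year, calendar_name)
               for year in range(start_year, end_year + 1))


def guess_calendar(n_records, start_year, end_year):
    """Return every calendar whose expected length matches the file."""
    candidates = ("360_day", "noleap", "gregorian")
    return [name for name in candidates
            if expected_daily_length(start_year, end_year, name) == n_records]


print(guess_calendar(360, 2000, 2000))
print(guess_calendar(366, 2000, 2000))
print(guess_calendar(365, 2001, 2001))
```

<p>The last call returns two candidates, because a single non-leap year cannot distinguish a no-leap calendar from a real one. That ambiguity is exactly why a consistent calendar throughout the model would be preferable.</p>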
<p>With this issue, a lot of <code class="language-plaintext highlighter-rouge">guessing</code> is needed because the output is otherwise unusable.</p>
<p>Reference: https://en.wikipedia.org/wiki/Technical_debt</p>Chang LiaoIn my work, I need to convert an E3SM model output into a different format. The output file is in the netCDF and I found some interesting design issue in the model.Mesh independent vs Topological relationship2023-01-23T00:00:00+00:002023-01-23T00:00:00+00:00https://changliao.github.io/science/posts/2023/01/23/mesh-independent-or-topological-relationship<p><code class="language-plaintext highlighter-rouge">PyFlowline</code> is mesh independent, and it uses <code class="language-plaintext highlighter-rouge">topological relationship</code> to model river networks. But what are the relationships between these two features? This is also the question I asked myself when presenting the model to the team members.</p>
<p>For example, one may ask “Which feature is more important?” or “Can I turn off the topological relationship feature?”</p>
<p>To understand their relationships, we also need to consider HexWatershed.</p>
<p>On the one hand, without topological relationships, river networks become a binary mask, which means we can no longer produce a conceptual river network using PyFlowline. However, HexWatershed is still able to produce it after watershed delineation. From this perspective, the topological relationship must be on for PyFlowline, but not for HexWatershed.</p>
<p>Then what makes the model <code class="language-plaintext highlighter-rouge">mesh independent</code>? Both models are designed so that they do not rely on a 2D index, which also means some traditional methods can be made mesh independent if the assumption of a 2D index structure is abandoned.</p>
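<p>A minimal Python sketch of what abandoning the 2D index can look like is given below. It is not the PyFlowline or HexWatershed implementation: all names are hypothetical, and the example only assumes that each cell has an ID and a list of neighbor IDs. A raster is first converted to that form, and the same routine then works on a hexagon-like mesh.</p>

```python
def grid_to_neighbor_table(nrow, ncol):
    """Turn (row, col) indexing into a cell ID to neighbor IDs table."""
    table = {}
    for row in range(nrow):
        for col in range(ncol):
            ids = []
            for drow, dcol in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                r, c = row + drow, col + dcol
                if 0 <= r < nrow and 0 <= c < ncol:
                    ids.append(r * ncol + c)
            table[row * ncol + col] = ids
    return table


def lowest_neighbor(cell_id, elevation, neighbors):
    """Return the lowest neighbor below the cell, or None if there is none.

    Uses only cell IDs and the neighbor table, never a (row, col) index.
    """
    lowest = min(neighbors[cell_id], key=lambda n: elevation[n])
    return lowest if elevation[lowest] < elevation[cell_id] else None


# A 2 x 2 raster expressed through the neighbor table
raster_neighbors = grid_to_neighbor_table(2, 2)
raster_elevation = {0: 5.0, 1: 4.0, 2: 3.0, 3: 1.0}
print(lowest_neighbor(0, raster_elevation, raster_neighbors))

# A made-up unstructured patch with arbitrary cell IDs
mesh_neighbors = {10: [11, 12, 13], 11: [10], 12: [10], 13: [10]}
mesh_elevation = {10: 8.0, 11: 9.0, 12: 6.5, 13: 7.0}
print(lowest_neighbor(10, mesh_elevation, mesh_neighbors))
```

<p>Once the neighbor lookup is the only thing the routine depends on, the shape of the cells no longer matters.</p>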
<table>
<thead>
<tr>
<th>Feature</th>
<th>What if it is missing</th>
<th>Relation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mesh independent</td>
<td>Cannot couple river and other hydrologic features</td>
<td>Does not concern topological relationship, but it helps capture details</td>
</tr>
<tr>
<td>Topological relationship</td>
<td>Cannot assist stream burning</td>
<td>Supports unstructured mesh by default</td>
</tr>
</tbody>
</table>
<p>For PyFlowline alone, the topological relationship may be more important because it is how the model captures the river network. However, with mesh independence, it is possible to use a refined mesh near the river to capture river features. To this extent, mesh independence enhances the model.</p>
<p>For HexWatershed, as long as the river networks are available, the topological relationship only improves the river bed slope. Thus, mesh independence may be more important.</p>Chang LiaoPyFlowline is mesh independent, and it uses topological relationship to model river networks. But what are the relationships between these two features? This is also the question I asked myself when presenting the model to the team members.Visualization of the priority flood algorithm within HexWatershed2023-01-07T00:00:00+00:002023-01-07T00:00:00+00:00https://changliao.github.io/science/posts/2023/01/07/hexwatershd-algorithm-visualization<p>HexWatershed uses several algorithms to generate most flow routing parameters, including the depression-free elevation.</p>
<p>However, because of the incorporation of stream burning, the original priority flood algorithm has been upgraded to include the classical DEM reconditioning algorithm.
The mix of two distinct algorithms under the same umbrella leads to a level of complexity, for both the developer and the end user. A few additional features make this workflow even more complex:</p>
<ul>
<li>It is designed to support structured and unstructured meshes</li>
<li>It is designed to support regional and global scale simulations</li>
</ul>
<p>A visualization of the algorithms is essential to diagnose and understand the model.</p>
<p>In the early stage of HexWatershed development, I was also inspired by <a href="https://www.redblobgames.com/grids/hexagons/">Red Blob Games by Amit</a>.</p>
<p>Interactive visualization is helpful for algorithm design and implementation. Since I am not familiar with web-based interactive visualization, I decided to start with Python.</p>
<p>In this post, I provide a visualization of the stream-burning built-in priority flood algorithm using a case study.</p>
<p>First, below is an animation of how HexWatershed processes the elevation of each MPAS cell.</p>
<p><img src="https://github.com/changliao/science/blob/main/_figures/hexwatershed/algorithm/priority_flood.gif?raw=true" alt="Figure 1" /></p>
<p>This animation is not meant to demonstrate the results of depression-filling, but instead focuses on how the algorithm processes the domain cell by cell. In general, there are two major steps:</p>
<ol>
<li>The algorithm starts at the boundary, finds the river outlet, and then searches upstream using a binary search method. There are also two sub-steps in this step:
<ul>
<li>The algorithm processes the river first, then its upstream.</li>
<li>The algorithm processes the river first, then its riparian zone.</li>
</ul>
</li>
<li>The algorithm then pushes the whole domain boundary and conducts the classical priority-flood depression filling. It automatically skips river cells that have already been processed.</li>
</ol>
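<p>For readers unfamiliar with the second step, below is a minimal Python sketch of the classical priority-flood depression filling on a mesh described by a neighbor table. It is not the HexWatershed code: the function name, data layout, and example values are hypothetical, and the river-first pass, stream burning, and breaching from the first step are deliberately left out.</p>

```python
import heapq


def priority_flood_fill(elevation, neighbors, boundary_cells):
    """Classical priority-flood depression filling.

    elevation: dict of cell ID to elevation.
    neighbors: dict of cell ID to list of neighbor cell IDs.
    boundary_cells: cell IDs on the domain boundary (the seeds).
    Returns a new dict of depression-free elevations.
    """
    filled = dict(elevation)
    visited = set(boundary_cells)

    # Seed the priority queue with the whole domain boundary
    queue = [(filled[cell], cell) for cell in boundary_cells]
    heapq.heapify(queue)

    while queue:
        # Always expand from the lowest cell reached so far
        current_elevation, cell = heapq.heappop(queue)
        for neighbor in neighbors[cell]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            # A neighbor lower than the cell that reached it sits in a
            # depression, so it is raised to that cell's level
            if filled[neighbor] < current_elevation:
                filled[neighbor] = current_elevation
            heapq.heappush(queue, (filled[neighbor], neighbor))

    return filled


# A made-up chain of five cells; cell 2 is a pit between cells 1 and 3
chain_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
chain_elevation = {0: 1.0, 1: 5.0, 2: 2.0, 3: 6.0, 4: 3.0}
filled = priority_flood_fill(chain_elevation, chain_neighbors, [0, 4])
print(filled)
```

<p>In this example only the pit cell is raised, to the level of its lower spill point. Because the routine uses a neighbor table rather than a 2D index, the same sketch applies to structured and unstructured meshes.</p>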
<p>Although the animation suggests that each cell is processed only once, in reality a river cell may be processed multiple times due to the breaching method.</p>
<p>In addition, two text labels are placed to the left and right of the active cell. If the right one is higher, the elevation is increased; otherwise, it is decreased.</p>
<p>At the end of the animation, you will notice several holes, or islands. These are peaks or summits. The algorithm still works even though the domain is broken into several parts.</p>
<p>Next, below is a zoomed-in view that follows the algorithm.</p>
<p><img src="https://github.com/changliao/science/blob/main/_figures/hexwatershed/algorithm/priority_flood_track.gif?raw=true" alt="Figure 2" /></p>
<p>Email me if you have any questions.</p>Chang LiaoHexWatershed uses several algorithms to generate most flow routing parameters, including the depression-free elevation.