Software contribution and development guide

Software design

These guidelines concern larger software (more than a couple of lines in a shell script) and are not meant to stand in the way of the first few commits or of experiments. Adhere to them more closely as the project matures and grows.

Science software can usually be split into four categories, and the design process and mindset behind each one is quite different. These guidelines are not gospel and always have exceptions; do question them and use your own judgement about whether they apply to your particular case.

Packages

These are collections of tools meant to be used inside other software; scipy is an example of a package. Packages can still have entry points (i.e. command line interfaces), but those seldom cover the full functionality of the package. A package thrives on being well designed, both from a user perspective and in terms of performance and extensibility (key in science), on several refactoring phases, and on avid unit testing and functional testing (see testing).

Data analysis

Analysis of measurement data. This kind of software can be anything from a few Python scripts to a larger package structure with only one, or a few, main functions or entry points. Any data analysis depends on models (physical or not) and usually also on statistics. Data analysis warrants good structure, both to support the development and validation process and to clearly show the methodology: it should be easy to test the vital models and methods used to analyse the data, and any external examination should get a clear grasp of what the analysis method is. It should also be easy to modify said models and methods when revelations about the problem manifest (there almost always are unknowns in novel analysis applications).

Numerical evaluation

Numerical evaluation of models and physics, also called simulation software. It is closely related to data analysis software but often requires much more design effort on the interface, performance and unit-testing side. Numerical stability of methods and consistency in edge cases are especially relevant in simulations, as initial conditions used in simulations may never come up in measured data (e.g. orbital eccentricity = 0).

Others

There is always scientific software that does not fall into the above categories, for example a live quick-look data webpage for an instrument. Such software might use a package, or a small part of a data analysis code, to generate the images being displayed, but the majority of the code will be web development, and hence the software design strategies and concepts should follow that domain.

Diagrams

We aim to follow the C4 model when drawing architecture diagrams used for discussion and design planning but also for documentation. An apt introduction to the C4 model is this recorded conference talk: Visualising software architecture with the C4 model by Simon Brown, Agile on the Beach 2019.

Diagrams should not be drawn for diagram’s sake; they should serve a clear purpose.

— Me

The main purpose of drawing diagrams is to discuss and explain design with other developers and to efficiently show what is conceptually happening “under the hood”. Some specifics relevant to developing science software can be added to the C4 model:

To easily create version-controlled design diagrams in this style, you can use PlantUML, which is available as a downloadable JAR.

To simplify applying the C4 principles we use C4-PlantUML. This package does not need to be downloaded, as the needed files can be included directly from GitHub:

!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Component.puml
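As a sketch, a minimal C4-PlantUML source file using this include might look like the following (all component and relation names here are illustrative, made up for this example, not taken from a real project):

```plantuml
@startuml
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Component.puml

' Illustrative names only: a user, a boundary and two components
Person(user, "Scientist", "Analyses radar data")

Container_Boundary(pkg, "mypackage.data") {
    Component(iface, "Raw Data Interface", "Python", "Selects a reader for the given file")
    Component(readers, "Format readers", "Python", "One reader per supported raw format")
}

Rel(user, iface, "Reads raw data with")
Rel(iface, readers, "Dispatches to")
@enduml
```

Saving this as a .puml file and running it through plantuml.jar, as shown below, produces the rendered diagram.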

There are a number of examples in the C4-PlantUML repository to start from. A sample diagram from Metecho is used below.

To generate the SVG file (assuming Java and Graphviz are installed), simply run the plantuml.jar file with the puml file as an input parameter together with -tsvg to dictate the output format:

java -jar plantuml.jar -tsvg metecho_data.puml

Which creates the following component diagram of the metecho.data subpackage: a Raw Data Interface (Python) that automatically selects the proper loading function for the given raw data if the format is supported; a raw data formats database (Python) providing a listing of all supported formats and their respective readers; a formats registrator (Python) handling the registration of new raw data formats into the database; and collections of built-in and user-defined data format readers and validators (Python wrapped). A scientist, pipeline or radar system uses metecho to load raw data into a common format, with the radar raw data files located in a filesystem. Source: metecho C4 component diagram.

For clarity, the usage of this data subpackage in the wild looks like this:

import metecho
mu_file = '2009-06-27T09.54.05.690000000.h5'
raw = metecho.data.RawDataInterface(mu_file)
# Use raw data here

This process of generating diagrams can be automated in a CI pipeline so that figures are compiled and included in static documentation generated by e.g. Sphinx. During development of diagrams, and for ease of later editing, it is highly recommended to use editor plugins such as the PlantUML extension for Visual Studio Code.

Container diagrams

For the “other” types of software outlined above, it is useful to show how the different separately deployable parts interact: what are the communication channels, and who are the intended users of each part?

Container diagram of the ALIS_4D system: a quick-look website (SPA) with low-resolution data overviews, archive and keograms; alis4d.irf.se (Apache directory) with publicly accessible images sorted by date; Goptabord (php, git repository) holding experiment setup files and a php website for viewing; goptavalv.alis.irf.se (zfs) providing system-wide storage; Fonoglob (terminal + daemon), a user interface client for manual control of the stations; and the station itself (AOSET, see separate container diagram). Scientists, the PI and anyone on the Internet interact over https, while experiment deployment and data transfer use git, rsync over ssh and ssh. Source: puml for alis container diagram.

Activity diagrams

As mentioned for data analysis and numerical evaluation software, activity diagrams can outline an algorithm; they also have a place in publications. Below is an example.

Activity diagram of the Metropolis-Hastings Markov chain Monte Carlo algorithm: the start point is set as the next point and its posterior is calculated; then, until sampling is done, a proposal point is drawn from the proposal distribution, its posterior is calculated, and the acceptance distribution is sampled; if the proposal is accepted it becomes the next point, otherwise the next point is unchanged. Source: puml for MCMC activity diagram.

Class diagrams

As the C4 model suggests, few things need to be documented at class level; the PlantUML plugin ironically does not even have a specific diagram level for this (making it, in effect, a C3 model). The interaction between specific classes and functions can instead be described with the component diagram type, when needed. For interactions between functions and other functions or classes (e.g. in functional programming), think of a function as a class with one method. Turn off stereotypes as described in the puml source to reduce potential confusion with actual component diagrams.

Arrows in class diagrams indicate “knows of”. Arrows for function calls are specified like Rel(projection_function, optical_transfer_function, "calculates optical transfer using"); the return values are implied from the call. Inheritance arrows likewise point in the direction of “knows of” (aside: inherit from abstract classes, not concrete implementations).

Class diagram: a Kepler class inherits from an abstract base class Propagator; a Space Object is governed in time by the propagator, and an Observed Tracking tracks the space object. Source: puml for propagator class diagram.

Architecture

As the software grows, the need for boundaries increases. You want to be able to solve one problem at a time and to limit the amount of rewriting needed to implement a new feature or fix a bug. Some general principles that help achieve this are listed below.

Separation of concerns

Do not mix concerns about input/output with the core problem-solving of your application. Reading files, parsing JSON, handling user input, communicating with other processes, loading modules, making HTTP requests and so on are not your core business. When these concerns become too closely entangled with the thing that only you do, they make it much harder to replace. Do you need another plot? Is it going to be on a website now? Wait, did EISCAT change their API? Such changes from the outside world should touch the outer layers of your code, not require a complete rewrite.

Your core problem-solving should preferably be completely oblivious to how it is interacted with. This makes it easier to test and reuse in new scenarios. If it is a commonly occurring problem domain (e.g. parsing XML, rendering HTML, writing a file), use off-the-shelf software libraries instead of writing your own, and focus on what nobody else does.

Example: Instead of reading a file as part of your algorithm, parse the file and pass its contents in a format/datatype that is easy for your code to deal with (maybe a numpy array or language primitives?). The file parsing can then be replaced with an HTTP call without changing the algorithm: just add another converter from the HTTP response (JSON?) to your particular datatype.
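A minimal sketch of this separation in Python (the function names and the trivial "analysis" are made up for illustration):

```python
import json
from statistics import mean


def analyse(samples):
    """Core algorithm: oblivious to where the samples came from."""
    return mean(samples)


def samples_from_file(path):
    """I/O adapter: one number per line in a text file."""
    with open(path) as fh:
        return [float(line) for line in fh if line.strip()]


def samples_from_json(payload):
    """Another adapter: the same core datatype from a JSON body."""
    return [float(x) for x in json.loads(payload)]


# Swapping the data source never touches analyse() itself:
result = analyse(samples_from_json("[1.0, 2.0, 3.0]"))
```

Because analyse() only ever sees a list of floats, it can be tested with hand-written data and reused unchanged behind a file reader, an HTTP client or anything else.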

Open for extension

…but closed for modification. To continue the example above: instead of changing your existing data parser to accept JSON instead of the original file format, pay close attention to the words add another converter (or adapter). This means you can keep supporting the users that rely on the file format.
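One hedged way to sketch this in Python is a small converter registry, where supporting a new format means registering a new function rather than editing the existing ones (the format names and converters here are illustrative):

```python
import json

# Registry of input-format converters; extended, never modified.
CONVERTERS = {}


def register(fmt):
    """Decorator that adds a converter for the given format name."""
    def decorator(func):
        CONVERTERS[fmt] = func
        return func
    return decorator


@register("csv_line")
def from_csv_line(text):
    return [float(x) for x in text.split(",")]


@register("json")
def from_json(text):
    return [float(x) for x in json.loads(text)]


def load(fmt, text):
    """Dispatch to whichever converter was registered for fmt."""
    return CONVERTERS[fmt](text)
```

Adding a fourth format later is a new @register-decorated function; the original file-format users are untouched.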

Complexity is a constant

Just because you have separated your concerns (above) and modularised your code does not mean the complexity goes away. Abstracting concepts into modules/functions helps you focus on one thing at a time, but be aware that the complexity as a whole does not change. Lots of small functions/modules might not make the whole design easier to understand.

Do not make things too abstract; draw a line in the sand and stick to it. There is only a small difference between making things general and making things unusable and bloated.

Your problem will continue to be complex, avoid making it complicated too.

Reusable small interfaces

The more (kinds and numbers of) parameters your functions or classes handle, the harder they become to test, debug and reuse. Isolate what your function does and require the minimum number of parameters needed to achieve that. Consider creating a wrapper function that takes care of the combinations of the various supported input types and passes the most convenient structure to your actual calculation function. This means you can test and debug the type conversion separately from the calculations.

Example: Your function might support various ways to define an angle: in radians or degrees, as a string or a float, as an Angle class, and so on. Handle the conversion between these elsewhere than in your core calculation.

Conversely, don’t require the user (or another function) to provide more information than you actually use. If you depend on a certain data structure being passed, be liberal in what you accept as long as it provides at least what you need.

Managing state

Unless you are into strict functional programming, you are going to need to store state somewhere. The more things that can change state, the harder debugging and refactoring will be.
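One way to keep this manageable, sketched below with made-up names, is to confine mutable state to a single small owner and keep the surrounding functions pure:

```python
class RunState:
    """Single owner of mutable state; everything else stays pure."""

    def __init__(self):
        self.results = []

    def record(self, value):
        self.results.append(value)


def transform(x):
    # Pure function: no hidden state, trivially testable in isolation.
    return 2 * x


state = RunState()
for x in [1, 2, 3]:
    state.record(transform(x))
```

When only RunState can mutate anything, a debugging session has exactly one place to look for unexpected changes.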

Iterative development

… things go through cycles of three ages:

1. Explore is about discovery and learning
2. Stabilise is about repeatability and consistency
3. Commoditise is about efficiency and cost

— Dan North, Software faster

I usually find that going from “it works” to “it’s good” takes about 2-3 full rewrites of the code base. So, both with packages and project-specific code, I have adopted the habit of first creating a bare-minimum skeleton version of the idea. This skeleton version may perform only 1-10% of what the final code will be able to do, but it lets me get a feel for what will come, how to structure the future code and what must be developed. This is sometimes referred to as a minimum viable product, or MVP, within software development.

Do not fear deleting code. It might feel like a waste of time, but usually it is faster to start from scratch with new knowledge. If you miss something it is available in version control.

After this, I spend a few hours with pen and paper or a whiteboard and plan out the important relations between objects in the code, whether they are functions, classes, data levels or all simultaneously. It is important to get a broad-stroke idea of the hierarchy, as the same operation can usually be done on multiple levels. This often does not affect the end result, but the choice of level will dramatically change how the code looks and performs.

How many times, when a project ends, have you thought “I wish I’d known that and had a chance to do it over”? Why not do that earlier in the project cycle instead, when there is still time?

Only after this do I come to the writing. Usually I start out by defining some sort of large “black-box” functional test case that I can use for iterative development. This really helps me focus on where I need to get: first, get that test running without compilation or run-time errors; next, make sure the test itself passes; then refactor the code in the direction needed. If performance is an issue, focus on execution time; if it is for a package, focus on generalisation, extensibility and ease of future updates. Once you have a version of this part that will not see dramatic change, make sure the documentation is accurate and up to date. At this point, if this is a package, I add unit tests for all the small sub-steps that are used in the large functional test. Lastly, if it is a package, turn the functional test into an example: if it tests a vital part of the code, it probably uses it correctly and is a great example of how the code is used.
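The first step of that workflow, a black-box functional test with a known analytic answer, might look like this sketch (fit_slope and the test case are invented for illustration):

```python
def fit_slope(xs, ys):
    """Least-squares slope of the line y = a*x + b through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def test_fit_slope_recovers_known_line():
    # Black-box test: exact data from y = 2x + 1 must give slope 2,
    # regardless of how fit_slope is implemented internally.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    assert abs(fit_slope(xs, ys) - 2.0) < 1e-12


test_fit_slope_recovers_known_line()
```

The test only pins down the externally visible answer, so fit_slope() can be rewritten freely during the refactoring steps as long as the test keeps passing; later, it doubles as a usage example.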

Then go back to the start and repeat for the next “part” of the code base. For the ‘space-object radar observation simulator package’ I developed, I repeated this process about 8-10 times until the repository had all the tools needed to move forward and do project work using the package.

Figure: project knowledge over time. When active development of a project declines, the project enters a danger zone of diminishing knowledge.

When working iteratively as described, you gain more knowledge as you go; every collaborator knows what is going on and things can change direction fast. This is the “explore” phase quoted above. Pay attention to when a project needs to transition into stabilisation, as this is required for the project to survive the danger zone of slowly diminishing knowledge. Here, documentation, refactoring and polishing of the code base are more important than when exploring. If the project emerges through the danger zone in a good state, it is possible to pick it up again and make minor changes without having to redo everything in order to understand it.

Not every project needs to transition into a finished product. If a project is there to test an idea or will essentially run only once, it matters less how efficient and extendable it is. But for the projects that do, it is important to transition from prototype to code fit for production. A fantastic story arc on this is Matt Parker’s crude calculation of the Jotto problem (or Wordle):

  1. Can you find: five five-letter words with twenty-five unique letters?
  2. Someone improved my code by 40,832,277,770%

In short: it is OK to have code that will run once, even if it takes a long time to run, as long as it solves the problem or answers the scientific question. But for something that is going to run over and over again, it pays to optimise. This does not mean optimisation needs to be perfect from the start.

Quality attributes

Having covered design and development processes and phases, we also need to briefly introduce properties of the software that should be kept in mind, and optimised in parallel with the main goal of the software, during said processes. These properties are usually called quality attributes in the literature.

Usually, the following attributes are the most important for science software:

With the following being added on for package-like software:

There are of course tens of other attributes, but the main takeaway is that during the design process one should think about WHICH attributes matter for the project at hand, and then make choices that favour those attributes.

For example, properly documenting code and using style guides increases extensibility and maintainability, since the code becomes easier to understand and read, while proper segregation of methods into separate functions and identification of trivially parallelisable portions of code can greatly increase scalability when running on a computer cluster.

General tips and tricks

Package code

Project specific code

Glossary

Profiling
What I call profiling is having a system in place in the code to measure execution times of units of code, calling frequencies of functions, and other performance-related variables. This is extremely useful for figuring out what gets used a lot, what is slow and where the code spends the most time.
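A minimal sketch of this using Python's built-in cProfile and pstats modules (the pipeline functions are made up; the heavy loop just stands in for a slow computation):

```python
import cProfile
import io
import pstats


def slow_part():
    # Deliberately heavy loop so it dominates the profile.
    return sum(i * i for i in range(100_000))


def fast_part():
    return 42


def pipeline():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Capture a report sorted by cumulative time: the hotspots come first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
```

Reading the report immediately shows that slow_part() is where the time goes, which is exactly the question profiling exists to answer.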
Logging
Logging is a system for reporting information to a common output from anywhere inside the code, and it can easily be enabled or disabled. The output could be the terminal or a logfile. This is very useful for debugging.
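A small sketch using Python's standard logging module (the logger name and messages are illustrative); note how one setLevel call silences the output:

```python
import io
import logging

# Direct log output to an in-memory stream here; in real code this
# would typically be the terminal or a logfile.
log_stream = io.StringIO()
logger = logging.getLogger("myanalysis")  # hypothetical module name
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def run_step(n):
    logger.info("processing step %d", n)


run_step(1)
logger.setLevel(logging.CRITICAL)  # one line disables routine output
run_step(2)
```

Because every part of the code logs through the same named logger, enabling verbose output while debugging, or disabling it entirely in production, is a single configuration change rather than scattered print statements.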