Python Resources for Modelers

Written by LP, updated 02/19/2025 (work in progress)

Here we list fundamental concepts for responsibly creating, collaborating on, and interacting with reproducible Python code. The page is organized somewhat like a syllabus for self-directed learning, with links to resources for getting started. Readers should make sure they understand every key concept in each checklist, and read further on any that are unfamiliar.

Note: we particularly like Real Python, which has many free and thorough tutorials on various Python subjects!

Coding correctly

Clean code matters for academic research with high standards.

Recommended textbooks (PDF links are unaffiliated with epi-ENGAGE):

Key concepts

"The Fundamental Theorem of Readability: Code should be written to minimize the time it would take for someone else to understand it." (Boswell & Foucher)
Function and variable names should be specific and precise -- ChatGPT is great at helping brainstorm names!
DOT -- DO ONE THING -- "Functions should do one thing. They should do it well. They should do it only." And "functions should do something, or answer something, but not both." (Martin) -- nice articles here and here.
DRY -- DON'T REPEAT YOURSELF (Hunt & Thomas) -- example resources here and here.
Tactical instead of strategic programming leads to technical debt -- "Complexity comes from an accumulation of dependencies and obscurities. As complexity increases, it leads to change amplification, a high cognitive load, and unknown unknowns." (Ousterhout)
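As a rough illustration of the naming, DOT, and DRY points above, here is a minimal sketch. The function names, age groups, and numbers are hypothetical and are not taken from the CLT Base Model code.

```python
def attack_rate(new_infections: int, population: int) -> float:
    """Answer one question: what fraction of the population was newly infected?"""
    if population <= 0:
        raise ValueError("population must be positive")
    return new_infections / population


def format_percent(fraction: float) -> str:
    """Do one thing: turn a fraction into a display string."""
    return f"{fraction:.1%}"


# Hypothetical data, for illustration only.
new_infections_by_group = {"0-17": 120, "18-64": 450, "65+": 90}
population_by_group = {"0-17": 10_000, "18-64": 30_000, "65+": 6_000}

# DRY: the calculation is written once and reused for every age group,
# rather than being copy-pasted once per group.
for group, infections in new_infections_by_group.items():
    rate = attack_rate(infections, population_by_group[group])
    print(group, format_percent(rate))
```

Keeping the calculation separate from the formatting means each function can be named precisely, tested on its own, and changed without touching the other.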

Advanced concepts

For classes -- "Single Responsibility Principle" (Martin) -- example resources here and here.
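One way to picture the Single Responsibility Principle is to give each class a single reason to change. The sketch below is a made-up example (the class names are hypothetical): one class answers questions about an epidemic curve, and a separate class handles writing it out.

```python
import csv
import io


class EpidemicCurve:
    """Holds daily case counts and answers questions about them."""

    def __init__(self, daily_cases: list[int]):
        self.daily_cases = daily_cases

    def peak_day(self) -> int:
        """Return the index of the first day with the most cases."""
        return self.daily_cases.index(max(self.daily_cases))


class EpidemicCurveCsvWriter:
    """Serializes a curve to CSV text; output format changes stay in this class."""

    def to_csv(self, curve: EpidemicCurve) -> str:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["day", "cases"])
        for day, cases in enumerate(curve.daily_cases):
            writer.writerow([day, cases])
        return buffer.getvalue()
```

If the output format later changes (say, to JSON), only the writer class needs to change, and the tests for `EpidemicCurve` remain valid.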

Test, test, test!

Key concepts

Any time a user changes or adds code, they should add new tests and also make sure that all the existing tests still pass.
Unit testing is critical to ensuring code behaves as intended and ensuring that changes/updates do not add new bugs or break existing functionality. If we haven't tested it, we shouldn't trust it. We should prove our code works! Check out this great MIT lecture on testing here.
Testing should have good coverage and handle different edge cases -- bugs often occur "on the boundaries" of inputs.
The CLT Base Model code uses pytest for easy management of unit testing -- here's an in-depth tutorial.
We must also implement higher-level tests such as integration tests and acceptance tests, to make sure the whole shebang works.
One crucial reason to write modular code (and functions that do one thing only) is to allow for testing of specific modules and functions (Boswell & Foucher). If the code is not easy to test, we cannot easily guarantee its quality.
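To make the points about unit tests and boundary cases concrete, here is a minimal sketch of a small function with tests written the way pytest discovers them (functions whose names start with `test_`). The function and its tests are hypothetical examples, not part of the CLT Base Model test suite.

```python
def split_into_age_groups(ages: list[int], cutoff: int) -> tuple[list[int], list[int]]:
    """Split ages into (younger than cutoff, cutoff or older)."""
    younger = [age for age in ages if age < cutoff]
    older = [age for age in ages if age >= cutoff]
    return younger, older


# pytest collects functions named test_* from files named test_*.py or *_test.py.
def test_typical_input():
    assert split_into_age_groups([5, 40, 70], cutoff=18) == ([5], [40, 70])


def test_age_exactly_at_cutoff_goes_to_older_group():
    # Boundary case: off-by-one mistakes (< versus <=) tend to show up here.
    assert split_into_age_groups([17, 18, 19], cutoff=18) == ([17], [18, 19])


def test_empty_input():
    assert split_into_age_groups([], cutoff=18) == ([], [])
```

In practice the tests would live in their own file and be run with `pytest` from the command line. Because the function does one thing, each test can pin down one behavior.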

Bonus articles on testing

Check out these nice articles on unit testing here and here. And here is a nice overview of TDD -- Test-Driven Development (Martin).

Python modules and imports, IDEs, and running Python scripts

Key concepts

Jupyter notebooks are for exploration, NOT for collaborative software development. Nature wrote a blog post warning about Jupyter notebooks: "computational notebooks can also be confusing and foster poor coding practices. And they are difficult to share, collaborate on and reproduce. A 2019 study found that just 24% of 863,878 publicly available Jupyter notebooks on GitHub could be successfully re-executed, and only 4% produced the same results" -- blog post link here and conference paper here.
As programs grow in size, they should be split into smaller files and organized -- modules and importing are key here. Here's a quick tutorial and a more in-depth tutorial.
Integrated Development Environments (IDEs) are ideal for working with Python modules. We recommend PyCharm -- the community edition here is free to download.
Python can be run in many different ways -- users should understand the difference between running a script in an IDE Console and from the command line -- here's a guide.
We recommend pip or conda for Python package management. Mixing pip and conda in the same environment can sometimes cause issues, so it may help to use one or the other consistently.
Users should be comfortable reading API documentation such as the CLT Base Model Code's references or numpy's references. There is no need to memorize syntax, and there should not be an over-reliance on tutorials either. Users should be comfortable parsing the technical details of classes and methods, customizing default arguments, and playtesting parts of the package for themselves.
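The difference between importing a module and running it as a script can be sketched as follows. The file name `sir_utils.py` and the function in it are hypothetical, chosen only for illustration.

```python
# Contents of a hypothetical module file, sir_utils.py


def basic_reproduction_number(beta: float, gamma: float) -> float:
    """Return beta / gamma, the basic reproduction number of a simple SIR model."""
    return beta / gamma


def main() -> None:
    """Entry point used only when the file is run as a script."""
    print(f"R0 = {basic_reproduction_number(beta=0.3, gamma=0.1):.2f}")


# This guard is true when the file is executed directly, for example with
# `python sir_utils.py` on the command line. It is false when another file
# runs `from sir_utils import basic_reproduction_number`, so importing the
# module does not trigger main().
if __name__ == "__main__":
    main()
```

With this layout, other modules can reuse `basic_reproduction_number` through an import, while the same file still works as a standalone script.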

Documentation and version control

Key concepts

[HIGH PRIORITY] The CLT Base Model Code is hosted on GitHub -- users should be familiar with git. We strongly recommend this nice guide on GitHub collaboration here.
- Specifically, users should know the following concepts in git: cloning a repo, local and remote repos, committing and pushing, pull requests, pulling, merging.
Collaborators should never push to main -- they should create a pull request (after adding tests and checking that all tests pass) to request that the main branch incorporate any new changes.
Users should STOP the common practice of saving versions like simulation_v3.2_final_final_01302025 -- it becomes hard to tell which version produced which results, and impossible to see what changed between versions. Users should instead use git version control and intentional git commit messages to keep track of their code versions.
Users should add .txt and .md files documenting any mathematical parameters to their repo and make this a part of their version control setup. This way, users can keep track of WHY and WHEN certain parameter values were changed.
The CLT Base Model Code documentation on this website is generated using mkdocs, which automatically populates the code references on this website from the code itself. The code docstrings must follow a specific format, outlined here.
Users should always write proper function signatures in their code -- specifying the types of inputs and return values (type hints) is very important. This article here is a starting point to learn more.
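As an illustration of a function signature with input and return types plus a docstring, consider the sketch below. The function is hypothetical, and the docstring layout shown (Args / Returns sections) is only an assumption for the example; the format required for the CLT Base Model Code is the one in the linked guide.

```python
import math


def discretize_rate(rate_per_day: float, timesteps_per_day: int) -> float:
    """Convert a daily transition rate into a per-timestep probability.

    Args:
        rate_per_day: Nonnegative transition rate, in units of 1/day.
        timesteps_per_day: Number of simulation steps in one day (at least 1).

    Returns:
        Probability of transitioning during a single timestep, in [0, 1).
    """
    if rate_per_day < 0:
        raise ValueError("rate_per_day must be nonnegative")
    if timesteps_per_day < 1:
        raise ValueError("timesteps_per_day must be at least 1")
    return 1.0 - math.exp(-rate_per_day / timesteps_per_day)
```

The annotations tell a reader (and tools such as IDEs and type checkers) what goes in and what comes out, without reading the function body.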

Other important points...

Key concepts

[HIGH PRIORITY] The CLT Base Model code uses object-oriented programming (OOP), which may be new to some modelers. The coding structure is similar to StarSim and TACC's Pandemic Simulator -- it is a tried and tested strategy. We strongly recommend this tutorial for those new to OOP.
- Specifically, users should know the following concepts: inheritance, attributes, abstract base classes, abstract methods.
[HIGH PRIORITY] We need to be responsible with random number generation. Please, stop using np.random.seed, which sets hidden global state! Use Generator objects (created with np.random.default_rng) instead! Here is a superb article outlining some dangers with misusing numpy random number generation, with links to the relevant numpy documentation.
Python for-loops over array elements are slow, and vectorized numpy operations are usually much faster. Often, mathematical computations should be written as array operations or matrix multiplications instead of for-loops. Here is a tutorial on the essentials of numpy and vectorized operations on numpy arrays.
The dataclasses module makes storing data very easy -- here's a quick tutorial.
Users should understand interfaces and duck typing in Python -- here's a short overview.
Parallel processing and cluster job submission (elaboration and resources coming soon...)
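Several of the concepts in this list (abstract base classes and abstract methods, dataclasses, random number Generators, and vectorized operations) can be sketched together in one small example. The class names, field names, and numbers below are hypothetical and are not the CLT Base Model's actual classes. The example assumes numpy is installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class RecoveryParams:
    """Hypothetical parameter container; the field name is illustrative."""

    recovery_prob: float  # per-timestep probability that an infected person recovers


class Transition(ABC):
    """Abstract base class: every subclass must implement sample()."""

    @abstractmethod
    def sample(self, counts: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Return how many individuals leave each group this timestep."""


class BinomialRecovery(Transition):
    """Concrete subclass that inherits from Transition."""

    def __init__(self, params: RecoveryParams):
        self.params = params

    def sample(self, counts: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        # Vectorized: one call draws for every age group, with no Python for-loop.
        return rng.binomial(counts, self.params.recovery_prob)


# A Generator is created once and passed explicitly to whatever needs it,
# instead of setting global state with np.random.seed.
rng = np.random.default_rng(seed=12345)
infected_by_age_group = np.array([100, 250, 40])

recovery = BinomialRecovery(RecoveryParams(recovery_prob=0.2))
recovered = recovery.sample(infected_by_age_group, rng)
print(recovered)
```

Passing the Generator as an argument makes it explicit which code consumes random numbers, and a fresh Generator built from the same seed reproduces the same draws. Trying to instantiate `Transition` directly raises a `TypeError`, which is how the abstract method enforces that subclasses provide `sample`.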