bito concepts

Definitions

PCSP

PCSP stands for parent-child subsplit pair, that is, a pair consisting of a parent subsplit and a child subsplit. It’s a general concept rather than a specific implementation of the concept. For example, see the documentation of PCSPFun (in node.hpp) and PCSP Bitsets (in bitset.hpp) for two different ways of using this concept.

PSP

PSP stands for primary subsplit pair, which is a way of parameterizing branch lengths on trees. See the 2019 ICLR paper for details.

SBN

SBN stands for subsplit Bayes network, which is how we parameterize variational distributions on tree topologies. See the 2018 NeurIPS paper for details.

Notation

  • \(X\): taxon set

  • \(A, B, \ldots\) taxon subsets

  • \(\tau\): topology

  • \(\theta\): branch lengths or other per-tree model parameters such as node heights

  • \(t, s, u\): subsplits

  • \(q(s | t)\): conditional probability of child subsplit \(s\) given parent subsplit \(t\)

  • \(t \rightarrow s\): parent-child subsplit pair consisting of a parent \(t\) and child \(s\)

Terminology and conventions

We follow different naming conventions for C++ and Python. In C++ we follow the Google C++ conventions: PascalCase for types and functions, snake_case for variables, and a_trailing_underscore_ for member variables. In Python we just use snake_case for everything, with no trailing underscore. Here we’ll just use the Python versions, and the C++ hackers will have to extrapolate.

In Python bitsets will be expressed as strings, while in C++ they are Bitset objects. Hashtables are implemented using unordered_map in C++ and dictionaries in Python, but we will use the Python terminology even if we are talking about a C++ map.

Implementation

All of the SBN parameters are held in a single sbn_parameters block of doubles that is available for reading and writing through the bito instance interface. We index into that block using an “indexer” dictionary which maps from bitsets representing PCSPs to the index for that PCSP. See the definition of PCSP bitset in bitset.hpp and the tests in bito.hpp to see this in action.

The sbn_parameters block is set up such that PCSPs that share a parent are laid out contiguously. This is nice because if we want to sample from the child subsplit conditional on the parent, then we have all of the probabilities there in one place. The parent_to_range dictionary maps parent bitsets to the range of indices corresponding to children descending from that parent.

We can ask for an “indexer representation” of a tree, which is described in sbn_maps.hpp and shown off in the use of StringIndexerRepresentationOf in bito.hpp. Basically, its the expression of the rootsplits and PCSPs for a tree, in index form.

PSPs are indexed using a different scheme, which uses a vector of three vectors. The first vector describes the rootsplit, the second describes the child subsplit going “up” the tree, and the third the child subsplit going “down” the tree. These directionality terms only make sense on rooted trees, but we represent unrooted trees as rooted trees in the typical way.