forestviews

An R package for visualising the paths through a random forest with Sankey diagrams and parallel coordinates plots.

The forestviews R package is available from GitHub. A pre-print of our manuscript introducing the visualisation techniques implemented in forestviews may be obtained from arXiv.

The diagram above is an interactive Sankey diagram produced with forestviews. This Sankey diagram represents all paths through the first four nodes of the 100 decision trees that constitute a random forest we have fitted to Anderson’s Iris data. The paths through these trees are represented by the flow from left to right of the curving links between the rectangular blocks.

The rectangular blocks represent groups of nodes in the decision trees of the random forest. The leftmost vertical group of blocks represents the root nodes of these trees. We label these root nodes Node.1 because each is the first node along the paths from root to leaf nodes. Each block represents a group of root nodes that were defined by a particular covariate, with which the block is also labelled. Thus Node.1_Petal.Width represents root nodes defined by the covariate Petal.Width.

We have assigned the label Node.2 to the blocks representing groups of nodes that occur immediately after root nodes on the paths from root to leaf nodes. In other words, the blocks representing daughter nodes of the root nodes are each labelled Node.2. These nodes are represented by the vertical group of blocks immediately to the right of the group of blocks that represent the root nodes. These blocks are also labelled with the covariate that defined the nodes they represent. Subsequent groups of nodes along paths from root to leaf nodes are labelled Node.3, Node.4 and so forth, and are positioned in vertical groups further to the right in the diagram.
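The labelling scheme described above can be sketched in base R. This is only an illustration under stated assumptions, not the forestviews implementation: the toy `tree` table, its column names (`left`, `right`, `split_var`) and the helper `enumerate_paths()` are all hypothetical.

```r
# Hypothetical toy tree, one row per node. `left` and `right` hold the row
# numbers of the daughter nodes; 0 marks a leaf node.
tree <- data.frame(
  left      = c(2, 0, 4, 0, 0),
  right     = c(3, 0, 5, 0, 0),
  split_var = c("Petal.Width", NA, "Petal.Length", NA, NA),
  stringsAsFactors = FALSE
)

# Walk from the root and collect, for every root-to-leaf path, the labels
# of the nodes along it ("Node.<position>_<covariate>").
enumerate_paths <- function(tree, node = 1, depth = 1, prefix = character(0)) {
  if (tree$left[node] == 0) {
    # Leaf node: the path ends here.
    return(list(c(prefix, "Terminus")))
  }
  label <- paste0("Node.", depth, "_", tree$split_var[node])
  c(
    enumerate_paths(tree, tree$left[node],  depth + 1, c(prefix, label)),
    enumerate_paths(tree, tree$right[node], depth + 1, c(prefix, label))
  )
}

paths <- enumerate_paths(tree)
```

In this toy tree there are three leaves and therefore three paths, all of which start at a block that would be labelled Node.1_Petal.Width.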

The height of a block represents a proportion of the paths through the trees: those that, at the position along the path given by the block's horizontal position, pass through a node defined by the covariate with which the block is labelled. Thus it is apparent from the Sankey diagram above that the majority of the decision trees in our random forest had root nodes defined by the covariate Petal.Width.
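As a rough sketch of how such proportions could be computed, assume the paths have already been arranged in a table with one row per path and one column per position. The `paths` matrix below is made-up toy data, not output from forestviews.

```r
# Hypothetical toy data: one row per root-to-leaf path, one column per
# position along the path.
paths <- rbind(
  c("Petal.Width",  "Terminus",     "Terminus"),
  c("Petal.Width",  "Petal.Length", "Terminus"),
  c("Petal.Width",  "Petal.Length", "Terminus"),
  c("Petal.Length", "Terminus",     "Terminus")
)
colnames(paths) <- c("Node.1", "Node.2", "Node.3")

# For each position, the share of paths passing through each covariate.
# These shares play the role of the block heights.
block_heights <- lapply(
  as.data.frame(paths, stringsAsFactors = FALSE),
  function(column) prop.table(table(column))
)

block_heights$Node.1
```

Here three of the four toy paths begin at a node defined by Petal.Width, so that block would take up three quarters of the first column.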

Leaf nodes have no daughter nodes, and the groups of leaf nodes at each position along the paths through the decision trees are represented by blocks labelled Terminus. To maintain the correspondence between the heights of blocks and the proportions of paths through the trees, terminal nodes are propagated to the right throughout the diagram, although in the decision trees themselves these nodes have no daughter nodes.
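One way to picture this propagation is as padding every path to the same length with repeated Terminus entries. The sketch below uses base R; the toy `paths` list and the helper `pad_path()` are hypothetical names chosen for illustration.

```r
# Hypothetical paths of unequal length, each ending at a leaf node.
paths <- list(
  c("Petal.Width", "Terminus"),
  c("Petal.Width", "Petal.Length", "Terminus"),
  c("Petal.Length", "Sepal.Length", "Petal.Width", "Terminus")
)

# Pad (or truncate) a path to a fixed number of positions, repeating
# "Terminus" after the leaf node.
pad_path <- function(path, n_positions) {
  padded <- c(path, rep("Terminus", n_positions))
  padded[seq_len(n_positions)]
}

# One row per path, one column per position.
padded <- t(sapply(paths, pad_path, n_positions = 4))
colnames(padded) <- paste0("Node.", 1:4)
```

After padding, every column contains exactly one entry per path, so the proportions at each position sum to one and the columns of blocks all have the same total height.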

Hovering the mouse over a block displays a text box with information on the nodes represented by that block. This information includes how far along the paths from root to leaf nodes these nodes occur and the covariate that defined them. The number at the bottom of the text box is the number of nodes in the decision trees of the random forest that had the characteristics the block represents. The identities of the covariates defining the groups of nodes represented by the blocks have been mapped to the colours of the blocks. You can also click and drag blocks up and down to alter the vertical layout of the diagram.

The curved links between adjacent pairs of blocks represent the edges between the groups of nodes these blocks represent. The width of a link between a pair of blocks represents the number of paths that passed along an edge between nodes with the characteristics represented by the blocks the link connects. Hovering the mouse over a link displays a text box with information on the edges represented by that link. The text box shows the covariates that define the nodes connected by the group of edges the link represents. The number at the bottom of the text box is the number of edges that the link represents, and the thickness of the link is proportional to this number. The covariate defining the nodes in the block from which a link originates has been mapped to the colour of that link.
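The link counts can be illustrated by tallying, for each pair of adjacent positions, how many paths move from one block to the next. This is a minimal sketch in base R on made-up data; the `paths` matrix and the helper `count_links()` are hypothetical, and it assumes each path contributes one count per link it traverses.

```r
# Hypothetical padded paths: one row per path, one column per position.
paths <- rbind(
  c("Petal.Width",  "Terminus",     "Terminus"),
  c("Petal.Width",  "Petal.Length", "Terminus"),
  c("Petal.Width",  "Petal.Length", "Terminus"),
  c("Petal.Length", "Terminus",     "Terminus")
)

# Count how many paths travel between each pair of blocks at adjacent
# positions. Each row of the result describes one link in the diagram.
count_links <- function(paths) {
  steps <- lapply(seq_len(ncol(paths) - 1), function(k) {
    data.frame(
      source = paste0("Node.", k, "_", paths[, k]),
      target = paste0("Node.", k + 1, "_", paths[, k + 1]),
      value  = 1,
      stringsAsFactors = FALSE
    )
  })
  steps <- do.call(rbind, steps)
  aggregate(value ~ source + target, data = steps, FUN = sum)
}

links <- count_links(paths)
```

Each row of `links` holds a source block, a target block and a count. The count is the quantity to which the width of the corresponding link would be proportional.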

We have written a manuscript about visualising the paths through random forests with Sankey diagrams and parallel coordinates plots. This manuscript introduces our visualisations in the context of other visualisations of random forests, and explains them and the insights they facilitate in much greater detail than we have above. A pre-print of this manuscript is available on arXiv.