### Previous mathematical models of tumour population genetics

Many previous studies of tumour population genetics have used non-spatial branching processes^{21}, in which cancer clones grow exponentially without interacting. Unless driver mutations increase cell fitness by less than 1%, these models predict lower clonal diversity and lower numbers of driver mutations than typically observed in solid tumours^{46}. Among spatial models, a popular option is the Eden growth model (or boundary-growth model), in which cells are located on a regular grid with a maximum of one cell per site, and a cell can divide only if an unoccupied neighbouring site is available to receive the new daughter cell^{32,47,61}. Other methods with one cell per site include the voter model^{32,62,63} (in which cells can invade neighbouring occupied sites) and the spatial branching process^{47} (in which cells budge each other to make space to divide). Further mathematical models have been designed to recapitulate glandular tumour structure by allowing each grid site or ‘deme’ to contain multiple cells and by simulating tumour growth through deme fission throughout the tumour^{5,26} or only at the tumour boundary^{27}. A class of models in which cancer cells are organized into demes and disperse into empty space has also been proposed^{36,52,64}. Supplementary Table 2 summarizes selected studies representing the state of the art of stochastic modelling of tumour population genetics.

Our main methodological innovations are to implement all these distinct model structures, and additional models of invasive tumours, within a common framework, and to combine them with methods for tracking driver and passenger mutations at single-cell resolution. The result is a highly flexible framework for modelling tumour population genetics that can be used to examine consequences of variation not only in mutation rates and selection coefficients, but also in spatial structure and manner of cell dispersal^{65}.

### Computational model structure

Simulated tumours in our models are made up of patches of interacting cells located on a regular grid of sites. In keeping with the population genetics literature, we refer to these patches as demes. All demes within a model have the same carrying capacity, which can be set to any positive integer. Each cell belongs to both a deme and a genotype. If two cells belong to the same deme and the same genotype then they are identical in every respect, and hence the model state is recorded in terms of such subpopulations rather than in terms of individual cells. For the sake of simplicity, computational efficiency and mathematical tractability, we assume that cells within a deme form a well-mixed population. The well-mixed assumption is consistent with previous mathematical models of tumour evolution^{5,26,27,36,64} and with experimental evidence in the case of stem cells within colonic crypts^{66}.

### Initial conditions

A simulation begins with a single tumour cell located in a deme at the centre of the grid. If the model is parameterized to include normal cells, then these are initially distributed throughout the grid such that each deme’s population size is equal to its carrying capacity. Otherwise, if normal cells are absent, then the demes surrounding the tumour are initially unoccupied.

### Stopping condition

The simulation stops when the number of tumour cells reaches a threshold value. Because we are interested only in tumours that reach a large size, if the tumour cell population succumbs to stochastic extinction, then results are discarded and the simulation is restarted (with a different seed for the pseudo-random number generator).

### Within-deme dynamics

Tumour cells undergo stochastic division, death, dispersal and mutation events, whereas normal cells undergo only division and death. The within-deme death rate is density-dependent. When the deme population size is less than or equal to the carrying capacity, the death rate takes a fixed value *d*_{0} that is less than the initial division rate. When the deme population size exceeds carrying capacity, the death rate takes a different fixed value *d*_{1} that is much greater than the largest attainable division rate. Hence, all genotypes grow approximately exponentially until the carrying capacity is attained, after which point the within-deme dynamics resemble a birth–death Moran process, a standard, well characterized model of population genetics.
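
The density-dependent death rate described above can be sketched as a simple step function. This is a minimal illustration, not the simulation code itself; the function and argument names are hypothetical.

```python
def death_rate(deme_size: int, carrying_capacity: int, d0: float, d1: float) -> float:
    """Illustrative step function for the within-deme death rate.

    Returns d0 while the deme is at or below carrying capacity and d1 once
    the deme exceeds it. All names here are assumptions for illustration.
    """
    if deme_size <= carrying_capacity:
        return d0
    return d1
```

With *d*_{1} much greater than any division rate, a deme that overshoots its carrying capacity is expected to return to it almost immediately, which is what produces the Moran-like dynamics.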

In all spatially structured simulations, we set *d*_{0} = 0 to prevent demes from becoming empty. For the non-spatial (branching process) model, we set *d*_{0} > 0 and dispersal rate equal to zero, so that all cells always belong to a single deme (with carrying capacity greater than the maximum tumour population size).

### Mutation

When a cell divides, each daughter cell inherits its parent’s genotype plus a number of additional mutations drawn from a Poisson distribution. Each mutation is unique, consistent with the infinite-sites assumption of canonical population genetics models. Whereas some previous studies have examined the effects of only a single driver mutation (Supplementary Table 2), in our model there is no limit on the number of mutations a cell can acquire. Most mutations are passenger mutations with no phenotypic effect. The remainder are drivers, each of which increases the cell division or dispersal rate.

The program records the immediate ancestor of each clone (defined in terms of driver mutations) and the matrix of Hamming distances between clones (that is, for each pair of clones, how many driver mutations are found in only one clone), which together allow us to reconstruct driver phylogenetic trees. To improve efficiency, the distance matrix excludes clones that failed to grow to more than ten cells and failed to produce any other clone before becoming extinct.

### Driver mutation effects

Whereas previous models have typically assumed that the effects of driver mutations combine multiplicatively, this can potentially result in implausible trait values (especially in the case of division rate if the rate of acquiring drivers scales with the division rate). To remain biologically realistic, our model invokes diminishing returns epistasis, such that the average effect of driver mutations on a trait value *r* decreases as *r* increases. Specifically, the effect of a driver is to multiply the trait value *r* by a factor of 1 + *s*(1 − *r*/*m*), where *s* > 0 is the mutation effect and *m* is an upper bound. Nevertheless, because we set *m* to be much larger than the initial value of *r*, the combined effect of drivers in all models in the present study is approximately multiplicative. For each mutation, the value of the selection coefficient *s* is drawn from an exponential distribution.
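
The diminishing-returns update can be written in a few lines. The sketch below is illustrative only: the function names and the `mean_effect` parameter are hypothetical, and the mean of the exponential distribution is an assumption because it is not specified in this section.

```python
import random


def apply_driver(r: float, s: float, m: float) -> float:
    """Multiply the trait value r by 1 + s * (1 - r / m)."""
    return r * (1.0 + s * (1.0 - r / m))


def draw_effect(mean_effect: float, rng: random.Random) -> float:
    """Draw a mutation effect s from an exponential distribution.

    mean_effect is a hypothetical parameter name (mean of the distribution).
    """
    return rng.expovariate(1.0 / mean_effect)
```

When *r* equals the upper bound *m*, the multiplier is exactly 1, so a further driver has no effect. When *m* is much larger than *r*, the multiplier is close to 1 + *s*, which recovers the approximately multiplicative regime.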

### Dispersal

Depending on model parameterization, dispersal occurs through either invasion or deme fission (Supplementary Table 3). In the case of invasion, the dispersal rate corresponds to the probability that a cell newly created by a division event will immediately attempt to invade a neighbouring deme. This particular formulation ensures consistency with a standard population genetics model known as the spatial Moran process. The destination deme is chosen uniformly at random from the four nearest neighbours (von Neumann neighbourhood). Invasion can be restricted to the tumour boundary, in which case the probability that a deme can be invaded is 1 − *N*/*K* if *N* ≤ *K* and 0 otherwise, where *N* is the number of tumour cells in the deme and *K* is the carrying capacity. If a cell fails in an invasion attempt, then it remains in its original deme. If invasion is not restricted to the tumour boundary, then invasion attempts are always successful.
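
As a minimal sketch of the invasion rule, the following shows the von Neumann neighbourhood and the probability that an attempted invasion succeeds. The function names and the `boundary_only` flag are hypothetical and serve only to illustrate the rule stated above.

```python
def von_neumann_neighbours(x: int, y: int):
    """Return the four nearest neighbours of grid site (x, y)."""
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]


def invasion_probability(n_tumour: int, capacity: int, boundary_only: bool) -> float:
    """Probability that a deme holding n_tumour tumour cells can be invaded.

    boundary_only is an illustrative flag: when False, attempts always succeed.
    """
    if not boundary_only:
        return 1.0
    if n_tumour <= capacity:
        return 1.0 - n_tumour / capacity
    return 0.0
```

Under boundary-restricted invasion, a deme already filled with tumour cells (*N* = *K*) cannot be invaded, whereas a deme containing no tumour cells is always invaded.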

In fission models, a deme can undergo fission only if its population size is greater than or equal to carrying capacity. As with invasion, deme fission immediately follows cell division (so that results for the different dispersal types are readily comparable). The probability that a deme will attempt fission is equal to the sum of the dispersal rates of its constituent cells (up to a maximum of 1). Deme fission involves moving half of the cells from the original deme into a new deme, which is placed beside the original deme. If the dividing deme contains an odd number of cells, then the split is necessarily unequal, in which case each deme has a 50% chance of receiving the larger share. Genotypes are redistributed between the two demes without bias according to a multinomial distribution. Cell division rate has only a minor effect on deme fission rate because a deme created by fission takes only a single cell generation to attain carrying capacity.

If fission is restricted to the tumour boundary, then the new deme’s assigned location is chosen uniformly at random from the four nearest neighbours, and if the assigned location already contains tumour cells, then the fission attempt fails. If fission is allowed throughout the tumour, then an angle is chosen uniformly at random, and demes are budged along a straight line at that angle to make space for the new deme beside the original deme.

Our particular method of cell dispersal was chosen to enable comparison between our results and those of previous studies and to facilitate mathematical analysis. In particular, when the deme carrying capacity is set to 1, our model approximates an Eden growth model (if fission is restricted to the tumour boundary, or if dispersal is restricted to the tumour boundary and normal cells are absent), a voter model (if invasion is allowed throughout the tumour) or a spatial branching process (if fission is allowed throughout).

To fairly compare different spatial structures and manners of cell dispersal, we set dispersal rates in each case such that the time taken for a tumour to grow from one cell to one million cells is approximately the same as in the neutral Eden growth model with maximal dispersal rate. This means that, across models, the cell dispersal rate decreases with increasing deme size. Given that tumour cell cycle times are on the order of a few days, the timespans of several hundred cell generations in our models realistically correspond to several years of tumour growth. More specifically, if we assume tumours take between 5 and 50 years to grow and the cell cycle time is between 1 and 10 days (both uniform priors), then the number of cell generations is between 400 and 8,000 in 95% of plausible cases. This order of magnitude is consistent with tumour ages inferred from molecular data^{67}.

We note that, in addition to gland fission, gland fusion has been reported in normal human intestine^{68}, which raises the possibility that gland fusion might occur during colorectal tumour development. However, the rate of crypt fission in tumours is much elevated relative to the rate in healthy tissue, and must exceed the rate of crypt fusion (or else the tumour would not grow). Therefore, even if crypt fusion occurs in human tumours, we do not expect it to have a substantial influence on evolutionary mode. This view is supported by previous computational modelling^{69}.

### Two versus three dimensions

We chose to conduct our study in two dimensions for two main reasons. First, the effects of deme carrying capacity on evolutionary dynamics are qualitatively similar in two and three dimensions, yet a two-dimensional model is simpler, easier to analyse, and easier to visualize. Second, we aimed to create a method that is readily reproducible using modest computational resources and yet can represent the long-term evolution of a reasonably large tumour at single-cell resolution.

One million cells in two dimensions corresponds to a cross-section of a three-dimensional tumour with many more than one million cells. Therefore, compared to a three-dimensional model, a two-dimensional model can provide richer insight into how evolutionary dynamics change over a large number of cell generations. Developing an approximate, coarse-grained analogue of our model that can efficiently simulate the population dynamics of very large tumours with different spatial structures in three dimensions is an important direction for future research.

### Implementation

The program implements Gillespie’s exact stochastic simulation algorithm^{70} for statistically correct simulation of cell events. The order of event selection is (1) deme, (2) cell type (normal or tumour), (3) genotype, and (4) event type. At each stage, the probability of selecting an item (deme, cell type, genotype or event type) is proportional to the sum of event rates for that item, within the previous item. We measured elapsed time in terms of cell generations, where a generation is equal to the expected cell cycle time of the initial tumour cell.
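
The hierarchical selection step can be illustrated with a nested table of event rates. The sketch below is not the program’s actual data structure: the nested-dictionary layout and all function names are assumptions chosen for brevity, and the uniform random numbers are passed in explicitly so that the behaviour is easy to check.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def pick_index(weights, u):
    """Pick index i with probability weights[i] / sum(weights), given u in [0, 1)."""
    cumulative = list(accumulate(weights))
    i = bisect_right(cumulative, u * cumulative[-1])
    return min(i, len(weights) - 1)  # guard against floating-point round-up


def total_rate(node):
    """Sum all event rates stored beneath a node of the nested rate table."""
    if isinstance(node, dict):
        return sum(total_rate(child) for child in node.values())
    return node


def select_event(rates, uniforms):
    """Walk deme -> cell type -> genotype -> event type, one uniform per level."""
    path, node = [], rates
    for u in uniforms:
        keys = list(node)
        chosen = keys[pick_index([total_rate(node[k]) for k in keys], u)]
        path.append(chosen)
        node = node[chosen]
    return tuple(path)


def waiting_time(rate_sum, u):
    """Exponentially distributed time to the next event, given u in (0, 1]."""
    return -math.log(u) / rate_sum
```

For example, with `rates = {"deme0": {"tumour": {"g0": {"division": 1.0, "death": 0.0}}}, "deme1": {"tumour": {"g1": {"division": 3.0, "death": 0.0}}}}`, the first level selects `deme1` with probability 3/4, because the probability of each item is proportional to the summed rates beneath it.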

### Sequencing data

We surveyed the multi-region and single-cell tumour sequencing literature to identify data sets suitable for comparison with our model results. Studies published before 2015 (for example, refs. ^{71,72,73,74}) were excluded as they were found to have insufficient sequencing depth for our purposes. We also excluded studies that reconstructed phylogenies using samples from metastases or from multifocal tumours (for example, refs. ^{75,76,77,78,79,80}) because our model is not designed to correspond to such scenarios. The seven studies we chose to include in our comparison are characterized by either high-coverage multi-region sequencing or large-sample single-cell sequencing of multiple tumours.

The ccRCC investigation^{81} we selected involved multi-region deep sequencing, targeting a panel of more than 100 putative driver genes. Three studies of NSCLC^{10}, mesothelioma^{40} and breast cancer^{39} conducted multi-region whole-exome sequencing (first two studies) or whole-genome sequencing (latter study), and reported putative driver mutations. We also used data from single-cell RNA sequencing studies of uveal melanoma^{42} and breast cancer^{41}, in which chromosome copy number variations were used to infer clonal structure, and a study of acute myeloid leukaemia (AML) that used single-cell DNA sequencing^{24}. All seven studies constructed phylogenetic trees, which are readily comparable to the trees predicted by our modelling. The methodological diversity of these studies contributes to demonstrating the robustness of the patterns we seek to explain.

From each of the seven cohorts, we obtained data for between three and eight tumours. In the ccRCC data set, we focused on the five tumours for which driver frequencies were reported in the original publication. For NSCLC, we used data for the five tumours for which at least six multi-region samples were sequenced. In mesothelioma, we selected the six tumours that had at least five samples taken. From the breast cancer multi-region study, we used data for the three untreated tumours that were subjected to multi-region sequencing. From the single-cell sequencing studies of uveal melanoma and breast cancer, we used all the published data (eight tumours in each case), and from the AML study, we selected a random sample of eight tumours.

In multi-region sequencing data sets, it is uncertain whether all putative driver mutations were true drivers of tumour growth. One way to interpret the data (interpretation I1) is to assume that all putative driver mutations were true drivers that occurred independently. Alternatively, the more conservative interpretation I2 assumes that each mutational cluster (a distinct peak in the variant allele frequency distribution) corresponds to exactly one driver mutation, whereas all other mutations are hitchhikers. Thus, I1 permits linear chains of nodes that in I2 are combined into single nodes (compare Supplementary Figs. 9 and 10), and I1 leads to a higher estimate of the mean number of driver mutations per cell (our summary index *n*). If both the fraction of putative driver mutations that are not true drivers (false positives) and the fraction of true driver mutations that are not counted as such (false negatives) are low, or if these fractions approximately cancel out, then interpretation I1 will give a good approximation of *n* whereas I2 will give a lower bound. For the ccRCC, NSCLC and breast cancer cases in our data set, I1 generates values of *n* in the range 3–10 (mean 6.1), consistent with estimates based on other methodologies^{13,51}, whereas for I2 the range is only 1–4 (mean 2.5). Accordingly, we used interpretation I1.

### Clonal diversity index

To measure clonal diversity, we used the inverse Simpson index defined as $D=1/\sum_{i}p_{i}^{2}$, where *p*_{i} is the frequency of the *i*th combination of driver mutations. For example, if the population comprises *k* clones of equal size, then *p*_{i} = 1/*k* for every value of *i*, and so *D* = 1/(*k* × 1/*k*^{2}) = *k*. Clonal diversity has a lower bound *D* = 1. The inverse Simpson index is relatively robust to adding or removing rare types, which makes it appropriate for comparing data sets with differing sensitivity thresholds. Further examples are illustrated in Supplementary Fig. 11.
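
A minimal sketch of the inverse Simpson index, computed from clone sizes, is given below. The function name and the choice to accept raw sizes (rather than frequencies) are illustrative assumptions.

```python
def inverse_simpson(clone_sizes):
    """Inverse Simpson index D = 1 / sum_i p_i^2 from a list of clone sizes."""
    total = sum(clone_sizes)
    return 1.0 / sum((size / total) ** 2 for size in clone_sizes)
```

Four clones of equal size give *D* = 4, a single clone gives the lower bound *D* = 1, and adding one rare cell to two clones of 500 cells each changes *D* only slightly from 2, which illustrates the robustness to rare types.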

*D* is constrained by an upper bound for trees with *n* < 2, where *n* is the mean number of driver mutations per cell. Indeed, *n* = ∑_{i}*ip*_{i} ≥ *p*_{1} + 2(1 − *p*_{1}) = 2 − *p*_{1}, so *p*_{1} ≥ 2 − *n* > 0, since *n* < 2. Therefore,

$$D=\frac{1}{\sum_{i}p_{i}^{2}}\le \frac{1}{p_{1}^{2}}\le \frac{1}{(2-n)^{2}}.$$

To see that this bound is tight, assume 1 ≤ *n* < 2 and consider a star-shaped tree with *N* nodes such that *p*_{1} = 2 − *n* and other nodes have equal weights *p*_{i} = (1 − *p*_{1})/(*N* − 1) = (*n* − 1)/(*N* − 1) for *i* ≥ 2. The mean number of driver mutations per cell is *p*_{1} + 2(1 − *p*_{1}) = 2 − *p*_{1} = *n*, and the inverse Simpson index is

$$D=\frac{1}{\sum_{i=1}^{N}p_{i}^{2}}=\frac{1}{p_{1}^{2}+\sum_{i=2}^{N}p_{i}^{2}}=\frac{1}{(2-n)^{2}+(N-1)\left((n-1)/(N-1)\right)^{2}}=\frac{1}{(2-n)^{2}+(n-1)^{2}/(N-1)}.$$

This quantity tends to 1/(2 − *n*)^{2} as the number of nodes *N* goes to infinity, so the bound 1/(2 − *n*)^{2} can be approached arbitrarily closely.
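
The tightness argument can be checked numerically. The sketch below evaluates the star-tree expression for increasing *N* and compares it with the bound; the function names are hypothetical.

```python
def star_tree_diversity(n: float, num_nodes: int) -> float:
    """D for a star-shaped tree with root weight 2 - n and equal leaf weights."""
    return 1.0 / ((2.0 - n) ** 2 + (n - 1.0) ** 2 / (num_nodes - 1))


def diversity_upper_bound(n: float) -> float:
    """Upper bound 1 / (2 - n)^2 on D, valid for n < 2."""
    return 1.0 / (2.0 - n) ** 2
```

For *n* = 1.5 the bound is 4. A star tree with two nodes has *D* = 2, and the value increases towards 4 as the number of nodes grows.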

It is informative to derive the relationship between *D* and *n* for a population that evolves through a sequence of clonal sweeps, such that each new sweep begins only after the previous sweep is complete. For a given value of *n*, our simulations rarely produce trees with *D* values below the curves of this trajectory. Suppose that a population comprises a parent type and a daughter type, with frequencies *p* and 1 − *p*, respectively. If the daughter has *m* driver mutations, then the parent must have *m* − 1 driver mutations and *n* must satisfy *m* − 1 ≤ *n* ≤ *m*. More specifically,

$$n=(m-1)p+m(1-p)=m-p\ \Rightarrow\ p=m-n=1-\{n\},$$

where {*n*} denotes the fractional part of *n* (or 1 if *n* = *m*). The trajectory is therefore described by

$$D=\frac{1}{p^{2}+(1-p)^{2}}=\frac{1}{(1-\{n\})^{2}+\{n\}^{2}}.$$
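
The sweep trajectory can be evaluated directly from the fractional part of *n*. The following is a minimal sketch with a hypothetical function name. At integer *n* the fractional part is taken as 0 rather than 1, which gives the same value *D* = 1.

```python
import math


def sweep_diversity(n: float) -> float:
    """D along a sequence of non-overlapping clonal sweeps, as a function of n."""
    frac = n - math.floor(n)  # fractional part {n}
    return 1.0 / ((1.0 - frac) ** 2 + frac ** 2)
```

The curve equals 1 whenever a sweep is complete (integer *n*) and peaks at *D* = 2 midway through each sweep, when parent and daughter clones have equal frequency.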

We additionally calculated a curve representing the maximum possible diversity of linear trees. In the main text and below, we refer to this curve as corresponding to trees with an intermediate degree of branching. Specifically, this intermediate-branching curve is defined such that for every point below the curve (and with *D* > 1), there exist both linear trees and branching trees that have the corresponding values of *n* and *D*, whereas for every point above the curve there exist only branching trees. Derivation of the curve’s equation is provided in Supplementary Information. A first-order approximation (accurate within 1% for *n* ≥ 2.2) is *D* ≈ 9(2*n* − 1)/8.

To assess the extent to which clusters of points (*n*, *D*) are well separated, we calculated silhouette widths using the cluster R package^{82}. A positive mean silhouette width indicates that clusters are distinct.

### Other diversity indices

Our diversity index fulfils the same purpose as the intratumour heterogeneity (ITH) index used in the TRACERx Renal study^{9}, defined as the ratio of the number of subclonal driver mutations to the number of clonal driver mutations. However, compared to ITH, our index has the advantages of being a continuous variable and being robust to methodological differences that affect ability to detect low-frequency mutations. In calculating ITH from sequencing data, we included all putative driver mutations, whereas ref. ^{9} used only a subset of mutations. For model output, we classified mutations with frequency above 99% as clonal and we excluded mutations with frequency less than 1%. ITH and the inverse Simpson index are strongly correlated across our models (Spearman’s *ρ* = 0.98, or *ρ* = 0.81 for cases with *D* > 2; Extended Data Fig. 9c).

The Shannon index, defined as $-\sum_{i}p_{i}\log p_{i}$, is another alternative to the Simpson index. The exponential of this index has the same units as the inverse Simpson index (equivalent number of types). Compared to the Simpson index, the Shannon index gives more weight to rare types, which makes it somewhat less suitable for comparing data sets with differing sensitivity thresholds.
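
For comparison with the inverse Simpson index, the exponential of the Shannon index can be sketched as follows. The sketch uses the standard sign convention (entropy is non-negative) and natural logarithms; the function names are illustrative.

```python
import math


def shannon_index(frequencies):
    """Shannon index -sum_i p_i log p_i, skipping types with zero frequency."""
    return -sum(p * math.log(p) for p in frequencies if p > 0)


def effective_number_of_types(frequencies):
    """Exponential of the Shannon index (equivalent number of types)."""
    return math.exp(shannon_index(frequencies))
```

For *k* equally frequent types both the exponential Shannon index and the inverse Simpson index equal *k*; the two measures differ only when frequencies are unequal.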

### Defining evolutionary modes in terms of indices *D* and *n*

In defining regions in terms of indices *D* and *n* (Table 1 and Fig. 3c), we first noted that if a population undergoes a succession of non-overlapping clonal sweeps, then at most two clones coexist at any time, and hence *D* ≤ 2. Allowing for some overlap between sweeps, we defined the ‘selective sweeps’ region as having *D* < 10/3 and *D* below the intermediate-branching curve. We put the upper boundary at *D* = 10/3 because this intersects with the intermediate-branching curve at *n* = 2.

We used *D* = 20 to define the boundary between the ‘branching’ and ‘progressive diversification’ regions. The TRACERx Renal study^{9} instead categorized trees containing more than 10 clones as highly branched, as opposed to branched. It is appropriate for us to use a higher threshold because our regions are based on true tumour diversity values, rather than the typically lower values inferred from multi-region sequencing data. Finally, we defined an ‘effectively almost neutral’ region containing star-shaped trees with *n* < 2 and *D* above the intermediate-branching curve.

It is possible to construct trees that do not fit the labels we have assigned to regions. For example (as shown in Supplementary Information), there exist linear trees within the branching and progressive diversification regions. Such exceptions are an unavoidable consequence of representing high-dimensional objects, such as phylogenetic trees, in terms of a small number of summary indices. Our labels are appropriate for the subset of trees that we have shown to arise from tumour evolution.

### Previously defined tree balance indices

Conventionally, the balance of a tree is the degree to which branching events split the tree into subtrees with the same number of leaves, or terminal nodes. A balanced tree thus indicates more equal extinction and speciation rates than an unbalanced tree^{83}. Tree balance indices are commonly used to assess the correctness of tree reconstruction methods and to classify trees. We considered three previously defined indices, all of which are imbalance indices, which means that more balanced trees are assigned smaller values. We subtracted each of these indices from 1 to obtain measurements of tree balance.

Let *T* = (*V*, *E*) be a tree with a set of nodes *V* and edges *E*. Let ∣*V*∣ = *N*, and hence ∣*E*∣ = *N* − 1 (since each node has exactly one parent, except the root). We defined *l* as the number of leaves of the tree. The root is labelled 1 and the leaves are numbered from *N* − *l* + 1 to *N*. There is only one cladogram with two leaves, which is maximally balanced according to all the previously defined indices discussed below. We also considered the single-node tree to be maximally balanced with respect to these previously defined indices. The following definitions then apply when *l* ≥ 3.

For each leaf *j*, we defined *ν*_{j} as the number of internal nodes between *j* and the root, which is included in the count. Then a normalized version of Sackin’s index, originally introduced in ref. ^{84}, is defined as

$$I_{S,\mathrm{norm}}(T)=\frac{\sum\limits_{j=N-l+1}^{N}\nu_{j}-l}{\frac{1}{2}(l+2)(l-1)-l},$$

where, to enable comparison of indices of trees with different numbers of leaves *l*, we subtracted the minimum value for a given *l* and divided by the range of the index over all trees with *l* leaves, as in ref. ^{85}.

For an internal node *i* of a binary tree *T*, we defined *T*_{L}(*i*) as the number of leaves subtended by the left branch of *T*_{i}, the subtree rooted at *i*, and *T*_{R}(*i*) as the number of leaves subtended by its right branch. Then, the unnormalized Colless index^{86} of *T* is

$$I_{C}(T)=\sum\limits_{i=1}^{N-l}\left| T_{L}(i)-T_{R}(i)\right| .$$

Since the Colless index is defined only for bifurcating trees, we used the default normalized Colless-like index $\mathfrak{C}_{\mathrm{MDM},\,\ln(l+e),\,\mathrm{norm}}$ defined in ref. ^{85}. This consisted of measuring the dissimilarity between the subtrees $T^{\prime}$ rooted at a given internal node by computing the mean deviation from the median (MDM) of the *f*-sizes of these subtrees. In this case, $f(l)=\ln(l+e)$ and the *f*-size of $T^{\prime}$ is defined as

$$\sum\limits_{v\in V(T^{\prime})}\ln(\deg(v)+e).$$

These dissimilarities were then summed and the result was normalized as for Sackin’s index.

The cophenetic value *ϕ*(*i*, *j*) of a pair of leaves *i*, *j* is the depth of their lowest common ancestor (such that the root has depth 0). The total cophenetic index^{87} of *T* is then the sum of the cophenetic values over all pairs of leaves, and a normalized version is

$$I_{\Phi,\mathrm{norm}}(T)=\frac{\sum\limits_{N-l+1\le i<j\le N}\phi(i,j)}{\binom{l}{3}},$$

where the minimum value of the cophenetic index is 0 for all *l* (for a star-shaped tree with *l* leaves).

These three balance indices were designed for analysing species phylogenies and are thus defined on cladograms, which are trees in which leaves correspond to extant species and internal nodes are hypothetical common ancestors. Conventional cladograms have no notion of node size. Cladograms also lack linear components as each internal node necessarily corresponds to a branching event. The driver phylogenetic trees reported in multi-region sequencing studies and generated by our models are instead clone trees (also known as mutation trees), in which all nodes of non-zero size represent extant clones. To apply previous balance indices to driver phylogenetic trees, we first converted the trees to cladograms by adding a leaf to each non-zero-sized internal node and collapsing linear chains of zero-sized nodes.

Whereas diversity indices such as *D* are relatively robust to the addition or removal of rare clones, the balance indices described above are much less robust because they treat all clones equally, regardless of population size (Supplementary Figs. 6, 7 and 8). This hampered comparison between model results and data for two reasons. First, due to sampling error, even high quality multi-region sequencing studies underestimate the number of subclonal, locally abundant driver mutations by approximately 25%^{81}. Second, bulk sequencing cannot detect driver mutations present in only a very small fraction of cells.

### A robust tree balance index

To overcome the shortcomings of previous indices, we have developed a more robust tree balance index based on an extended definition: tree balance is the degree to which internal nodes split the tree into subtrees of equal size, where size refers to the sum of all node populations.

Let *f*(*v*) > 0 denote the size of node *v*. For an internal node *i*, let *V*(*T*_{i}) denote the set of nodes of *T*_{i}, the subtree rooted at *i*. We then define

$$\begin{array}{l}S_{i}=\sum\limits_{v\in V(T_{i})}f(v)=\text{the size of }T_{i},\\ S_{i}^{*}=\sum\limits_{v\in V(T_{i})\atop v\ne i}f(v)=\text{the size of }T_{i}\text{ without its root }i.\end{array}$$

For *i* in the set of internal nodes $\widetilde{V}$, and *j* in the set *C*(*i*) of children of *i*, we define $p_{ij}=S_{j}/S_{i}^{*}$. We then computed the balance score $W_{i}^{1}$ of a node $i\in \widetilde{V}$ as the normalized Shannon entropy of the sizes of the subtrees rooted at the children of *i*:

$$W_{i}^{1}=\sum\limits_{j\in C(i)}W_{ij}^{1},\quad \text{with}\ W_{ij}^{1}=\left\{\begin{array}{ll}-p_{ij}\log_{d^{+}(i)}p_{ij}&\text{if }p_{ij}>0\text{ and }d^{+}(i)\ge 2,\\ 0&\text{otherwise,}\end{array}\right.$$

where *d*^{+}(*i*) is the out-degree (the number of children) of node *i*. Finally, for each node *i*, we weighted the balance score by the product of $S_{i}^{*}$ and a non-root dominance factor $S_{i}^{*}/S_{i}$. Our normalized balance index is then

$$J^{1}:=\frac{1}{\sum_{k\in \widetilde{V}}S_{k}^{*}}\sum\limits_{i\in \widetilde{V}}S_{i}^{*}\frac{S_{i}^{*}}{S_{i}}W_{i}^{1}.$$

Supplementary Fig. 11 illustrates the calculation of *J*^{1} for four exemplary trees. We further describe the desirable properties of this index, and its relationship to other tree balance indices, in another article^{43}.
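
The definition of *J*^{1} translates fairly directly into code. The following is a minimal sketch, assuming the tree is supplied as a dictionary of child lists and a dictionary of node sizes; this representation and the function name are illustrative choices rather than the implementation used for the analyses.

```python
import math


def j1_index(children, size, root):
    """Sketch of the J^1 balance index for a rooted tree.

    children: dict mapping each internal node to a list of its child nodes
    size:     dict mapping every node to its population size f(v)
    root:     the root node
    """
    subtree = {}  # S_i: total size of the subtree rooted at i

    def fill(v):
        subtree[v] = size[v] + sum(fill(c) for c in children.get(v, []))
        return subtree[v]

    fill(root)
    weighted_sum = 0.0
    normaliser = 0.0
    for i, kids in children.items():
        if not kids:
            continue  # leaves are not internal nodes
        s_star = subtree[i] - size[i]  # S_i^*: subtree size without its root
        normaliser += s_star
        if len(kids) < 2 or s_star <= 0:
            continue  # W_i^1 = 0 when the out-degree is below 2
        w = 0.0
        for j in kids:
            p = subtree[j] / s_star
            if p > 0:
                w -= p * math.log(p, len(kids))  # logarithm base d^+(i)
        weighted_sum += s_star * (s_star / subtree[i]) * w
    return weighted_sum / normaliser if normaliser > 0 else 0.0
```

For a root of size 1 with two leaves of size 1, the entropy term equals 1 and the non-root dominance factor is 2/3, so *J*^{1} = 2/3. A linear chain has no node with two or more children and therefore has *J*^{1} = 0.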

When *n* ≤ 2 (where *n* is the mean number of driver mutations per cell), the non-root dominance factor cannot exceed *n* − 1, while the other factors in *J*^{1} are at most 1, which implies *J*^{1} ≤ *n* − 1 for all *n* ≤ 2. Also for *n* > 2, we have *J*^{1} ≤ 1 < *n* − 1. Thus, it is impossible to construct trees that have *J*^{1} > *n* − 1, as shown in Fig. 4a.

### Clonal turnover indices

For each time point *t* ≥ *τ*, we defined a clonal turnover index as

$$\Theta(t)=\sum\limits_{i}\left(f_{i}(t)-f_{i}(t-\tau)\right)^{2},$$

where *f*_{i}(*t*) is the frequency of clone *i* at time *t*, and *τ* is 10% of the total simulation time measured in cell generations. The mean value $\overline{\Theta}$ over time measures the total extent of clonal turnover.

To determine whether clonal turnover mostly occurred early, late or throughout tumour evolution, we calculated the weighted average

$$\overline{T}_{\Theta}=\frac{1}{\max(t)}\left(\sum\limits_{t}\Theta(t)\,t\bigg/\sum\limits_{t}\Theta(t)\right),$$

where $\max(t)$ denotes the final time of the simulation. This quantity takes values between 0 and 1, and is higher if clonal turnover occurs mostly late during tumour growth. If the rate of clonal turnover is constant over time, then $\overline{T}_{\Theta}\approx 0.55$.
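
Both turnover quantities are straightforward to compute from recorded clone frequencies. The sketch below assumes frequencies are stored as dictionaries keyed by clone identifier and that time points are evenly spaced; these choices and the function names are illustrative assumptions.

```python
def turnover(freq_now, freq_then):
    """Clonal turnover index: sum over clones of the squared frequency change.

    freq_now and freq_then map clone identifiers to frequencies at times
    t and t - tau; clones absent from a dictionary have frequency 0.
    """
    clones = set(freq_now) | set(freq_then)
    return sum((freq_now.get(c, 0.0) - freq_then.get(c, 0.0)) ** 2 for c in clones)


def mean_turnover_time(times, thetas):
    """Turnover-weighted mean time, scaled by the final simulation time."""
    weighted = sum(theta * t for t, theta in zip(times, thetas))
    return weighted / sum(thetas) / max(times)
```

If the index is evaluated at times 10, 20, ..., 100 (so that *τ* = 10 is 10% of the total time) and takes the same value at every time point, the weighted average is 55/100 = 0.55, in agreement with the value quoted for a constant turnover rate.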

### Histology slide analysis to determine the number of cells per gland

We randomly selected five tumours of each of four cancer types (colorectal cancer, clear cell renal cancer, lung adenocarcinoma and breast cancer) from The Cancer Genome Atlas (TCGA) reference database (http://portal.gdc.cancer.gov). Using QuPath v0.2.0m4^{88}, we manually delineated five representative groups of tumour cells in each image and automatically counted the number of cells in each group. We defined a group as a set of tumour cells directly touching each other, separated from other groups by stroma or other non-tumour tissue (Extended Data Fig. 3).

The number of cells per group ranged from 5 to 8,485, with 50% of cases having between 53 and 387 cells (Extended Data Fig. 4a). Variation in the number of cells per group was larger between rather than within tumours, whereas cell density was relatively consistent between tumours (Extended Data Fig. 4b). Because our cell counts were derived from cross sections, they would underestimate the true populations of three-dimensional glands. On the other hand, it is unknown what proportion of cells are able to self-renew and contribute to long-term tumour growth and evolution^{89}. On balance, therefore, it is reasonable to assume that each gland of an invasive glandular tumour can contain between a few hundred and a few thousand interacting cells. This range of values is, moreover, remarkably consistent with results of a recent study that used a very different method to infer the number of cells in tumour-originating niches. Across a range of tissue types, this study concluded that cells typically interact in communities of 300–1,900 cells^{30}. Another recent study of breast cancer applied the Louvain method for community detection to identify two-dimensional tumour communities typically in the range of 10–100 cells^{29}.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.