Evolutionary biology has long theorized processes (recombination, lineage divergence, drug-resistance sweeps, introgression, refugial persistence) whose signatures in genomic data are incompatible with tree structure. We argue that the shape of genetic-distance data, formalized through simplicial complexes and quantified through persistent homology, is a direct observable of these processes. The Vietoris–Rips filtration of a genetic-distance matrix yields the Betti numbers β₀ (connected components), β₁ (loops), and β₂ (cavities); we read β₁ not as a literal count of recombination events but as a quantity that is monotone in effective recombination above a sampling-dependent geometric baseline, and we organize the resulting shapes into a four-letter alphabet of topological primitives (K1 clonal, K2 divergence, K3 reticulation, K4 higher-order reticulation). Coalescent and Wright–Fisher simulations establish the two load-bearing claims: β₁ rises monotonically with the recombination rate over six orders of magnitude, and persistent-homology features separate reticulate from non-reticulate histories with 98–100% recall (the residual confusion falls entirely within the non-reticulate K1/K2 pair, which β₁ does not distinguish). We then apply the pipeline to four empirical systems. (i) On the MalariaGEN Pf7 Plasmodium falciparum dataset (n = 20,864, 33 countries), per-population β₁ spans two orders of magnitude and diverges significantly from a label-shuffle null (median 20.5, range 8–32). The ordering runs opposite to recombination rate: freely recombining African populations sit lowest and clonal/swept Southeast Asian and Papuan populations highest, because at the population scale β₁ is dominated by demographic structure rather than recombination rate, a point we reconcile explicitly with the controlled dose-response.
(ii) Colombian Cauca SP-resistant samples carry β₁ = 12 against a near-clonal SP-sensitive baseline of β₁ = 5 (and two orders of magnitude more total persistence), which places them in the high-β₁, multi-origin band of K3, consistent with resistance carried on several genomic backgrounds. (iii) The Cambodia artemisinin sweep (2008–2018) traces a K3 → K1 trajectory, with β₁ rising to a mid-sweep peak of 45 and collapsing to 13 at fixation. To our knowledge this is the first direct observation of a selective-sweep transient in topological coordinates, with the caveat that the per-bin values are medians of three subsamples with wide bars. (iv) On Arabidopsis thaliana 1001 Genomes data, Iberian relict populations (Spain, β₁/n = 0.64) exceed post-glacial-expansion populations (Sweden 0.54; United Kingdom 0), generalizing the framework beyond pathogens. A P. falciparum mitochondrial negative control recovers β₁ = 0 across all subsamples, establishing pipeline specificity. Moving above the 1-skeleton, β₂ is zero at the clonal/expansion limits and positive across the reticulate systems, and a controlled two-vs-three-way admixture simulation confirms that β₂ separates regimes that share a β₁ profile. The further suggestion that the ratio β₂/β₁ separates microevolutionary from macroevolutionary timescales is presented, given the small number of systems and the absence of a β₂ null, as a hypothesis for future testing. Together these results demonstrate that the topology of genetic-distance data is an evolutionary observable, with immediate implications for drug-resistance surveillance in P. falciparum.
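As a concrete illustration of how a loop arises in a Vietoris–Rips complex built on genetic distances, the toy sketch below computes β₀ and β₁ at a single distance threshold for hand-made two-site haplotypes. It is not the pipeline used in this work: it assumes Hamming distance, takes one fixed-scale slice of the filtration rather than tracking persistence, and uses Z/2 coefficients. The function names (`hamming`, `distance_matrix`, `gf2_rank`, `rips_betti`) and the example haplotypes are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length haplotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(haplotypes):
    """Pairwise Hamming distances as a list of lists."""
    return [[hamming(a, b) for b in haplotypes] for a in haplotypes]


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # eliminate the leading bit and keep reducing
    return len(pivots)


def rips_betti(dist, eps):
    """Return (beta0, beta1) of the Vietoris-Rips complex at scale eps.

    Betti numbers are taken with Z/2 coefficients (an assumption of this sketch).
    """
    n = len(dist)
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # A triangle enters the complex when all three of its edges are present.
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]

    # beta0: count connected components with union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    beta0 = len({find(i) for i in range(n)})

    # beta1 = dim ker(d1) - rank(d2), with rank(d1) = n - beta0.
    d2_rows = [sum(1 << edge_index[p] for p in combinations(t, 2))
               for t in triangles]
    beta1 = len(edges) - (n - beta0) - gf2_rank(d2_rows)
    return beta0, beta1


# All four two-site gametes: the classic signature of recombination.
four_gametes = distance_matrix(["00", "01", "11", "10"])
# Three gametes only: compatible with a tree.
three_gametes = distance_matrix(["00", "01", "11"])

print(rips_betti(four_gametes, 1))   # (1, 1): one component, one loop
print(rips_betti(four_gametes, 2))   # (1, 0): the loop is filled in
print(rips_betti(three_gametes, 1))  # (1, 0): a path, no loop
```

In this toy case the four-gamete configuration forms a 4-cycle at threshold 1 and the loop disappears at threshold 2, while the tree-compatible three-gamete set never produces a loop. On real data one would more plausibly use a dedicated persistent-homology library over the full filtration instead of a fixed threshold.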
Author summary
Biologists usually picture the history of life as a tree, in which lineages split and never rejoin. Many of the most consequential evolutionary events break that picture: malaria parasites recombine in the mosquito gut, drug-resistant strains arise repeatedly on different genetic backgrounds, and plant populations that survived the Ice Age in southern refuges carry tangled ancestry that no tree can represent. We ask a different question of genetic data: not "what tree fits?" but "what shape does the data make?" We answer it with topological data analysis, which measures shape through three counts (the Betti numbers) of clusters, loops, and higher-order cavities. Loops appear when lineages recombine and rejoin. We show, in simulations and in real Plasmodium falciparum and Arabidopsis thaliana genomes, that the loop count rises with recombination above a baseline set by finite sampling, cleanly separates recombining from clonal histories, and tracks a real artemisinin-resistance sweep in Cambodia as it rises and collapses over a decade. A non-recombining mitochondrial control correctly shows no loops. The shape of genetic data is thus a direct, tree-free readout of evolutionary process, with immediate value for drug-resistance surveillance.
Feged-Rivadeneira, A. et al. · CC-BY 4.0