Week 13, Tuesday — Euler Paths & Genome Assembly
March 31, 2026
DNA is a double helix: two strands, connected at every position by a base pair. A always bonds with T; G always bonds with C.
The genome sequence is the order of bases along one strand, read 5’→3’: A T G A T C A T G …
The other strand is determined automatically.
3 billion base pairs = 3 billion rungs on this ladder.
No machine reads it all at once. Instead, the lab fragments many copies of the genome into millions of short pieces; a sequencing machine reads each piece and outputs a short string of A, T, G, C. You receive a file of millions of reads, with no positions and no order. Your job is to reassemble them. This approach is called shotgun sequencing.
Each fragment is a short string of bases — a k-mer.
For a toy genome and k=3:
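As a minimal sketch of how such a list of fragments could be produced, the snippet below slides a window of width k across a sequence. The function name `kmers` is illustrative, and the toy genome `ATGATCATG` is an assumption: it is chosen because its 3-mers match the bag used in these notes.

```python
def kmers(genome: str, k: int) -> list[str]:
    """Return every length-k window of genome, left to right."""
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


# Assumption: this is the toy genome behind the 3-mers in these notes.
toy_genome = "ATGATCATG"

# One row per fragment: starting position and the 3-mer found there.
for position, kmer in enumerate(kmers(toy_genome, 3)):
    print(position, kmer)
```

A sequence of length n yields n − k + 1 windows, so the 9-base toy genome gives seven 3-mers. A real sequencer does not report the positions; they are printed here only to show where each fragment came from.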
In real sequencing: millions of rows, k ≈ 100–300.
Given only this bag of 3-mers — no positions, no order:
ATG TGA GAT ATC TCA CAT ATG
Can you recover the original sequence?
Draw a node for each fragment, an edge when suffix of one = prefix of next.
Build: Checking all pairs for overlap (edges) costs \(O(n^2)\) — trillions of operations for a real genome.
Solution: a Hamiltonian path (visit every node exactly once). Finding one is NP-hard in general.
We need a smarter graph.
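To make the cost of the first approach concrete, here is a rough sketch of the overlap graph built by brute force. The function name `overlap_graph` is illustrative; fragments are identified by their index in the list because the same k-mer (ATG) can appear more than once.

```python
from itertools import permutations


def overlap_graph(fragments: list[str]) -> list[tuple[int, int]]:
    """Edges (i, j) where the (k-1)-suffix of fragment i equals the (k-1)-prefix of fragment j."""
    edges = []
    # Every ordered pair of distinct fragments is compared: O(n^2) comparisons.
    for i, j in permutations(range(len(fragments)), 2):
        if fragments[i][1:] == fragments[j][:-1]:
            edges.append((i, j))
    return edges


bag = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
edges = overlap_graph(bag)
print(len(edges), "overlap edges among", len(bag), "fragments")
```

Even for seven fragments this inspects 42 ordered pairs, and the graph it produces still needs a Hamiltonian path search afterwards.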
Instead of nodes = k-mers, let nodes = (k−1)-mers and edges = k-mers.
| k-mer | prefix (k−1) | suffix (k−1) | edge |
|---|---|---|---|
| ATG | AT | TG | AT → TG |
| TGA | TG | GA | TG → GA |
| GAT | GA | AT | GA → AT |
| ATC | AT | TC | AT → TC |
| TCA | TC | CA | TC → CA |
| CAT | CA | AT | CA → AT |
| ATG | AT | TG | AT → TG (again!) |
| Vertex | Neighbors |
|---|---|
| AT | TG, TC, TG |
| CA | AT |
| GA | AT |
| TC | CA |
| TG | GA |
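The two tables above can be reproduced with a few lines of code. This is a sketch, not a production assembler; the name `de_bruijn` is illustrative, and the graph is stored as a plain adjacency list in which repeated k-mers show up as repeated neighbors.

```python
from collections import defaultdict


def de_bruijn(kmers: list[str]) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)  # one edge per k-mer, duplicates kept
    return dict(graph)


bag = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
graph = de_bruijn(bag)
for vertex in sorted(graph):
    print(vertex, "->", ", ".join(graph[vertex]))
```

Construction is a single pass over the k-mers, so it costs time linear in the number of fragments rather than quadratic.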
The genome is hidden in this graph — but how do we read it out?
Assembling the genome means spelling out all seven fragments in order.
That means visiting every edge exactly once.
What kind of graph traversal visits every edge exactly once?
An Euler path visits every edge exactly once.
Leonhard Euler, 1736 — the Königsberg bridge problem.
Genome assembly is finding an Euler path in the de Bruijn graph.
Euler’s theorem (directed version): a connected graph has an Euler path that is not a circuit iff exactly two nodes have unequal in-degree and out-degree, one with out − in = 1 (start) and one with in − out = 1 (end).
Check our graph:
| Node | Out | In | Out − In | Role |
|---|---|---|---|---|
| AT | 3 | 2 | +1 | start |
| TG | 1 | 2 | −1 | end |
| GA | 1 | 1 | 0 | ✓ |
| TC | 1 | 1 | 0 | ✓ |
| CA | 1 | 1 | 0 | ✓ |
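The degree check in the table can be automated. The sketch below counts in-degree and out-degree from an edge list and reports the start and end nodes if the pattern matches. The function name `euler_endpoints` is illustrative, and the sketch assumes the graph is connected; it does not test connectivity itself.

```python
from collections import Counter


def euler_endpoints(edges: list[tuple[str, str]]):
    """Return (start, end) if exactly one node has out - in = +1, one has
    out - in = -1, and all others are balanced; otherwise return None."""
    out_deg, in_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    diff = {n: out_deg[n] - in_deg[n] for n in nodes}
    starts = [n for n, d in diff.items() if d == 1]
    ends = [n for n, d in diff.items() if d == -1]
    balanced = [n for n, d in diff.items() if d == 0]
    if len(starts) == 1 and len(ends) == 1 and len(balanced) == len(nodes) - 2:
        return starts[0], ends[0]
    return None


bag = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
edges = [(kmer[:-1], kmer[1:]) for kmer in bag]  # one edge per k-mer
print(euler_endpoints(edges))
```

For the toy graph this reports AT as the start and TG as the end, matching the table.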
From AT we can take edges in two different orders:
Path 1 — AT→TG first
AT→TG→GA→AT→TC→CA→AT→TG
Assembles to: ATGATCATG
Path 2 — AT→TC first
AT→TC→CA→AT→TG→GA→AT→TG
Assembles to: ATCATGATG
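Turning a node path into a sequence is mechanical: write the first node in full, then add the last base of every later node. A small sketch, with the illustrative name `spell`:

```python
def spell(path: list[str]) -> str:
    """Spell the sequence for a walk through (k-1)-mer nodes."""
    # Consecutive nodes overlap in all but one base, so each step adds one letter.
    return path[0] + "".join(node[-1] for node in path[1:])


path1 = ["AT", "TG", "GA", "AT", "TC", "CA", "AT", "TG"]
path2 = ["AT", "TC", "CA", "AT", "TG", "GA", "AT", "TG"]
print(spell(path1))
print(spell(path2))
```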
Both paths use every edge exactly once. Both are valid Euler paths. And both sequences produce the identical k-mer multiset: two copies of ATG and one each of TGA, GAT, ATC, TCA, CAT.
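This claim is easy to verify by counting. The sketch below uses `collections.Counter` as the multiset; the helper name `kmer_multiset` is illustrative.

```python
from collections import Counter


def kmer_multiset(seq: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


assembly1 = kmer_multiset("ATGATCATG", 3)
assembly2 = kmer_multiset("ATCATGATG", 3)
print(assembly1 == assembly2)  # equal multisets mean the reads cannot separate them
```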
The k-mers alone cannot tell you which genome is real.
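The ambiguity comes from repeats: sequence that occurs more than once collapses onto the same nodes, so the graph cannot record which continuation belongs to which copy.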
Use longer k: if k > repeat length, the repeat no longer creates a branch.
| k | repeat “ATG” fits in k-mer? | ambiguity? |
|---|---|---|
| 3 | yes | ✓ two valid assemblies |
| 4 | no — “ATGA” ≠ “ATCA” | single assembly |
Trade-off: longer k requires longer reads, and longer reads are harder and more expensive to sequence.
This is why sequencing technology and assembly algorithms co-evolved.
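To see the effect of a longer k on the toy example, the sketch below compares the 4-mer multisets of the two candidate assemblies found at k = 3. It only shows that 4-mers tell these two candidates apart; it makes no claim about longer genomes. The helper name `kmer_multiset` is illustrative.

```python
from collections import Counter


def kmer_multiset(seq: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


a = kmer_multiset("ATGATCATG", 4)
b = kmer_multiset("ATCATGATG", 4)

# Counter subtraction keeps only the k-mers that one side has in excess.
only_in_a = sorted((a - b).elements())
only_in_b = sorted((b - a).elements())
print(only_in_a, only_in_b)
```

With k = 4 each candidate contains a 4-mer the other lacks, so a bag of 4-mers from the real genome would be consistent with only one of them.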
| | Path | Circuit |
|---|---|---|
| Start = End? | No | Yes |
| Unbalanced nodes | Exactly 2 | None |
| Genome type | Linear (humans) | Circular (bacteria) |
Our example is a path: start at AT (out − in = +1), end at TG (in − out = +1).
A circuit is just a path where the start and end happen to be the same node.
We know DFS from week 6. Here is the classic version:
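The original listing is not reproduced in these notes. As a stand-in, here is a minimal sketch of a textbook recursive DFS over an adjacency-list dictionary; it may differ in detail from the week 6 version, and the names `dfs` and `toy_graph` are illustrative.

```python
def dfs(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return nodes in the order a classic DFS first visits them."""
    visited, order = set(), []

    def visit(v):
        visited.add(v)    # mark the NODE, so it is never entered again
        order.append(v)   # record v before exploring its neighbors
        for w in graph.get(v, []):
            if w not in visited:
                visit(w)

    visit(start)
    return order


toy_graph = {"AT": ["TG", "TC", "TG"], "TG": ["GA"], "GA": ["AT"],
             "TC": ["CA"], "CA": ["AT"]}
order = dfs(toy_graph, "AT")
print(order)
```

On the toy graph this lists five nodes, one visit each, although the graph has seven edges. Every edge that leads back to an already visited node is skipped.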
This visits each node once — and gets stuck. We need to visit each edge once.
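Two things must change, and the second one is subtle: (1) mark edges, not nodes, as used, so a node can be entered more than once; (2) append a node to the path only after all of its outgoing edges have been used.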
Think of it from v’s perspective:
“I only add myself to the path once I’ve explored every edge I can reach. So I appear in the list after everything downstream of me.”
That means the list comes out in reverse — the last node on the path gets appended first, the first node last.
We could fix it by prepending (path.insert(0, v)), but each insert at the front shifts the whole list, so building the path that way costs \(O(n^2)\). Instead we just append and reverse once at the end:
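The original listing is missing here, so the following is a sketch of the edge-based DFS described above (the approach usually called Hierholzer's algorithm). It assumes the two changes are: consume edges instead of marking nodes, and append a node only after all its outgoing edges are used. The names `euler_path` and `toy_graph` are illustrative.

```python
def euler_path(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return a node sequence that uses every edge reachable from start exactly once."""
    # Work on a copy so the caller's adjacency lists are not consumed.
    remaining = {v: list(ws) for v, ws in graph.items()}
    path = []

    def visit(v):
        while remaining.get(v):
            w = remaining[v].pop()  # change 1: use up an EDGE, not a node
            visit(w)
        path.append(v)              # change 2: append only when v has no edges left

    visit(start)
    path.reverse()                  # nodes were appended end-first
    return path


toy_graph = {"AT": ["TG", "TC", "TG"], "TG": ["GA"], "GA": ["AT"],
             "TC": ["CA"], "CA": ["AT"]}
path = euler_path(toy_graph, "AT")
genome = path[0] + "".join(node[-1] for node in path[1:])
print(path, genome)
```

Which of the two valid Euler paths comes out depends on the order in which edges are popped. Recursion depth grows with the number of edges, so a real assembler would replace the recursion with an explicit stack.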
The genome is a real piece of DNA — can you figure out what organism it’s from? Try your assembled sequence on NCBI BLAST.