DSCI 221: Data Structures and Algorithms for Data Science

The Problem

DNA is a double helix: two strands, connected at every position by a base pair. A always bonds with T; G always bonds with C.

The genome sequence is the order of bases along one strand, read 5’→3’: A T G A T C A T G …

The other strand is determined automatically.

3 billion base pairs = 3 billion rungs on this ladder.

No machine reads it all at once. Instead, the lab fragments many copies of the genome into millions of short pieces, and a sequencing machine reads each piece and outputs a short string of A, T, G, C. You receive a file of millions of reads, with no positions and no order. Your job is to reassemble the genome from them. This approach is called shotgun sequencing.

The Fragments Arrive as Data

Each fragment is a short string of bases — a k-mer.

For a toy genome and k=3:

In real sequencing: millions of rows, k ≈ 100–300.
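As a small illustration, the toy fragments can be generated by sliding a window of width k along a sequence. This is a minimal sketch: the helper name `kmers` is hypothetical, and the toy genome `ATGATCATG` is an assumption inferred from the 3-mers used in the rest of these notes.

```python
def kmers(sequence, k):
    """Return every length-k substring of sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

toy_genome = "ATGATCATG"  # assumed toy genome, inferred from the 3-mers in these notes
reads = kmers(toy_genome, 3)
print(reads)
# ['ATG', 'TGA', 'GAT', 'ATC', 'TCA', 'CAT', 'ATG']
```

A real sequencer does not hand back the fragments in order like this; shuffling `reads` gives a more honest picture of the input.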

Can You Reassemble It?

Given only this bag of 3-mers — no positions, no order:

ATG  TGA  GAT  ATC  TCA  CAT  ATG

Can you recover the original sequence?

Naïve Idea: Overlap Graph

Draw a node for each fragment, and an edge from one fragment to another when the (k−1)-base suffix of the first equals the (k−1)-base prefix of the second.

Build: Checking all pairs for overlap (edges) costs \(O(n^2)\) — trillions of operations for a real genome.

Solution: find a Hamiltonian path (visit every node exactly once). That problem is NP-hard in general.
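The all-pairs build step described above can be sketched as follows. The function name `overlap_graph` is hypothetical, and the sketch assumes every fragment has the same length k and that an overlap means exactly k−1 shared bases.

```python
def overlap_graph(fragments):
    """Compare every ordered pair of fragments: O(n^2) comparisons."""
    edges = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            # Edge i -> j when the (k-1)-suffix of a equals the (k-1)-prefix of b.
            if i != j and a[1:] == b[:-1]:
                edges.append((i, j))
    return edges

reads = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
edges = overlap_graph(reads)
print(len(edges))  # 11 edges among 7 nodes
```

The nested loop is the quadratic cost. Even after paying it, the graph still has to be searched for a Hamiltonian path.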

We need a smarter graph.

A Different Graph

Instead of nodes = k-mers, let nodes = (k−1)-mers and edges = k-mers.

k-mer	prefix (k−1)	suffix (k−1)	edge
ATG	AT	TG	AT → TG
TGA	TG	GA	TG → GA
GAT	GA	AT	GA → AT
ATC	AT	TC	AT → TC
TCA	TC	CA	TC → CA
CAT	CA	AT	CA → AT
ATG	AT	TG	AT → TG (again!)

The de Bruijn Graph

Vertex	Neighbors
AT	TG, TC, TG
CA	AT
GA	AT
TC	CA
TG	GA
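The adjacency list above can be built in a single pass over the k-mers, with no pairwise comparison. A minimal sketch, assuming the graph is stored as a dict from each (k−1)-mer to a list of its neighbors (the name `de_bruijn` is hypothetical):

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

reads = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
graph = de_bruijn(reads)
for node in sorted(graph):
    print(node, graph[node])
# AT ['TG', 'TC', 'TG']
# CA ['AT']
# GA ['AT']
# TC ['CA']
# TG ['GA']
```

Repeated k-mers are kept as repeated edges, which is why TG appears twice in the neighbor list of AT.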

The genome is hidden in this graph — but how do we read it out?

What Does Traversal Mean Here?

Assembling the genome means spelling out all seven fragments in order.

That means visiting every edge exactly once.

What kind of graph traversal visits every edge exactly once?

Euler Path

An Euler path visits every edge exactly once.

Leonhard Euler, 1736 — the Königsberg bridge problem.

Genome assembly is finding an Euler path in the de Bruijn graph.

When Does an Euler Path Exist?

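Euler’s theorem (directed graphs): a connected graph has an Euler path with distinct start and end iff exactly two nodes have unequal in-degree and out-degree: one with out − in = 1 (start) and one with in − out = 1 (end). If every node is balanced instead, the path closes into a circuit.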

Check our graph:

Node	Out	In	Out − In	Role
AT	3	2	+1	start
TG	1	2	−1	end
GA	1	1	0	✓
TC	1	1	0	✓
CA	1	1	0	✓
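The degree check in the table can be automated. A minimal sketch, with hypothetical helper names `degree_balance` and `euler_endpoints`. It checks only the degree condition and assumes the graph is connected.

```python
from collections import Counter

def degree_balance(graph):
    """Return {node: out_degree - in_degree} for a directed multigraph."""
    balance = Counter()
    for node, neighbors in graph.items():
        balance[node] += len(neighbors)   # outgoing edges
        for neighbor in neighbors:
            balance[neighbor] -= 1        # incoming edge
    return dict(balance)

def euler_endpoints(graph):
    """Return (start, end) if exactly two nodes are unbalanced by 1, else None."""
    balance = degree_balance(graph)
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    small = all(b in (-1, 0, 1) for b in balance.values())
    if small and len(starts) == 1 and len(ends) == 1:
        return starts[0], ends[0]
    return None

graph = {"AT": ["TG", "TC", "TG"], "CA": ["AT"], "GA": ["AT"],
         "TC": ["CA"], "TG": ["GA"]}
print(euler_endpoints(graph))  # ('AT', 'TG')
```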

Assembly Isn’t Always Unique

From AT we can take edges in two different orders:

Path 1 — AT→TG first

AT→TG→GA→AT→TC→CA→AT→TG

Assembles to: ATGATCATG

Path 2 — AT→TC first

AT→TC→CA→AT→TG→GA→AT→TG

Assembles to: ATCATGATG

Both paths use every edge exactly once, so both are valid Euler paths. And both sequences produce the identical k-mer multiset: ATG (×2), TGA, GAT, ATC, TCA, CAT.

The k-mers alone cannot tell you which genome is real.
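This claim is easy to verify by spelling out both paths and counting their 3-mers. A minimal sketch (the helper names `spell` and `kmer_counts` are hypothetical):

```python
from collections import Counter

def spell(path):
    """First node in full, then the last base of every later node."""
    return path[0] + "".join(node[-1] for node in path[1:])

def kmer_counts(sequence, k):
    """Multiset of all k-mers in sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

path_1 = ["AT", "TG", "GA", "AT", "TC", "CA", "AT", "TG"]
path_2 = ["AT", "TC", "CA", "AT", "TG", "GA", "AT", "TG"]

seq_1, seq_2 = spell(path_1), spell(path_2)
print(seq_1, seq_2)                                   # ATGATCATG ATCATGATG
print(kmer_counts(seq_1, 3) == kmer_counts(seq_2, 3)) # True
```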

How Real Assemblers Resolve It

The ambiguity comes from repeats fitting inside a k-mer.

Use longer k: if k > repeat length, the repeat no longer creates a branch.

k	repeat “ATG” fits in k-mer?	ambiguity?
3	yes	✓ two valid assemblies
4	no — “ATGA” ≠ “ATCA”	single assembly
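The effect of raising k can be checked on the two candidate assemblies from the toy example. A minimal sketch, with a hypothetical helper `kmer_counts`: if the k-mer multisets differ, reads of that length can tell the candidates apart.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Multiset of all k-mers in sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

candidate_1 = "ATGATCATG"
candidate_2 = "ATCATGATG"

for k in (3, 4):
    same = kmer_counts(candidate_1, k) == kmer_counts(candidate_2, k)
    print(k, "indistinguishable" if same else "distinguishable")
# 3 indistinguishable
# 4 distinguishable
```

At k=4 the first candidate contains GATC where the second contains GATG, so the 4-mers pin down which one was sequenced.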

Trade-off: longer k requires longer reads, and longer reads are harder and more expensive to sequence.

This is why sequencing technology and assembly algorithms co-evolved.

Euler Path vs. Euler Circuit

	Path	Circuit
Start = End?	No	Yes
Unbalanced nodes	Exactly 2	None
Genome type	Linear (humans)	Circular (bacteria)

Our example is a path: start at AT (out − in = +1), end at TG (in − out = +1).

A circuit is just a path where the start and end happen to be the same node.

From Euler to Code

We know DFS from week 6. Here is the classic version:
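As a reminder, a classic recursive DFS might look like the sketch below. It assumes the graph is a dict from node to a list of neighbors. The names `dfs`, `visited`, and `order` are illustrative, not taken from the course code.

```python
def dfs(graph, v, visited=None, order=None):
    """Classic node-based DFS: each node is entered at most once."""
    if visited is None:
        visited, order = set(), []
    visited.add(v)
    order.append(v)
    for u in graph.get(v, []):
        if u not in visited:      # skip nodes we have already seen
            dfs(graph, u, visited, order)
    return order

graph = {"AT": ["TG", "TC", "TG"], "CA": ["AT"], "GA": ["AT"],
         "TC": ["CA"], "TG": ["GA"]}
print(dfs(graph, "AT"))
# ['AT', 'TG', 'GA', 'TC', 'CA']
```

Five nodes are visited, but the graph has seven edges. Any edge leading back to an already-visited node is never followed.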

This visits each node once — and gets stuck. We need to visit each edge once.

Two things must change — and the second one is subtle.
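One way to write the edge-based version is sketched below, matching the call `euler_dfs(graph, start, path)` used later in these notes. The first change: instead of marking nodes as visited, each edge is removed as it is used. The second change: a node is appended only after all of its edges are used up. This sketch mutates `graph`, and it is recursive, so very long paths would hit Python's recursion limit.

```python
def euler_dfs(graph, v, path):
    """Edge-based DFS: use every outgoing edge of v exactly once."""
    while graph.get(v):            # change 1: loop while unused edges remain
        u = graph[v].pop()         # use up one edge v -> u
        euler_dfs(graph, u, path)
    path.append(v)                 # change 2: append after the loop (post-order)

graph = {"AT": ["TG", "TC", "TG"], "CA": ["AT"], "GA": ["AT"],
         "TC": ["CA"], "TG": ["GA"]}
path = []
euler_dfs(graph, "AT", path)
print(path)
# ['TG', 'AT', 'CA', 'TC', 'AT', 'GA', 'TG', 'AT']
```

The printed list starts at the end node TG and finishes at the start node AT: the Euler path, in reverse.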

Why Does It Come Out Backwards?

    path.append(v)   # after the while loop — post-order

Think of it from v’s perspective:

“I only add myself to the path once I’ve explored every edge I can reach. So I appear in the list after everything downstream of me.”

That means the list comes out in reverse — the last node on the path gets appended first, the first node last.

We could fix it by prepending (path.insert(0, v)), but each insert at the front of a list is O(n), so the whole traversal becomes O(n²). Instead we just append and reverse at the end:

path = []
euler_dfs(graph, start, path)
path = path[::-1]   # one reversal, O(n)
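Putting the pieces together, one possible end-to-end sketch is shown below. The function `assemble` is hypothetical. It assumes the k-mers form a connected graph that has an Euler path, and it returns just one of the valid assemblies when several exist.

```python
from collections import Counter, defaultdict

def euler_dfs(graph, v, path):
    """Edge-based DFS; appends nodes in post-order (reverse path order)."""
    while graph[v]:
        euler_dfs(graph, graph[v].pop(), path)
    path.append(v)

def assemble(kmers):
    graph = defaultdict(list)
    balance = Counter()                    # out-degree minus in-degree
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start where out - in = +1; if every node is balanced (a circuit), start anywhere.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    path = []
    euler_dfs(graph, start, path)
    path = path[::-1]                      # one reversal, O(n)
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATG", "TGA", "GAT", "ATC", "TCA", "CAT", "ATG"]
print(assemble(reads))  # ATGATCATG
```

Which of the two valid assemblies comes out depends on the order in which edges are popped, which in turn depends on the order of the input reads.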

The Code Task

PrairieLearn activity

The genome is a real piece of DNA — can you figure out what organism it’s from? Try your assembled sequence on NCBI BLAST.