Pairwise alignment

Definition: An alignment of two sequences $S$ and $T$ is obtained by first inserting spaces ('-') either into, before, or at the ends of $S$ and $T$ to obtain $S'$ and $T'$ such that $|S'| = |T'|$, and then placing $S'$ on top of $T'$ such that every character in $S'$ is uniquely aligned with a character in $T'$.

Example: two aligned sequences:

S: GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT
   |||| || ||| ||||||   ||||||  |||
T: GTAGAACGGCTTCAGTTG---TCACAGCGTTC-

Sequence Alignment – p.4/36

Similarity measure

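$\sigma(a, b)$ - the score (weight) of the alignment of character $a$ with character $b$, where $a, b \in \Sigma \cup \{\text{'-'}\}$ and $\Sigma$ is the alphabet, e.g. $\Sigma = \{A, C, G, T\}$.

For example

$$\sigma(a, b) = \begin{cases} \ldots & a = b \text{ and } a, b \in \Sigma \\ \ldots & a \neq b \text{ and } a, b \in \Sigma \\ -1 & a \neq b \text{ and } (a = \text{'-'} \text{ or } b = \text{'-'}) \end{cases}$$

Similarity between $S$ and $T$ given the alignment $(S', T')$:

$$V(S, T) = \sum_{i=1}^{|S'|} \sigma(S'_i, T'_i)$$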
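As a concrete illustration of summing $\sigma$ over the columns of an alignment, here is a minimal Python sketch. The scores used (+2 match, -1 mismatch, -1 against a space) are an assumption borrowed from the worked examples on later slides, and the names `sigma` and `alignment_score` are illustrative only.

```python
def sigma(a: str, b: str) -> int:
    """Assumed scores: +2 match, -1 mismatch, -1 when either symbol is a space."""
    if a == "-" or b == "-":
        return -1
    return 2 if a == b else -1


def alignment_score(s_aligned: str, t_aligned: str) -> int:
    """Sum sigma over the columns of an alignment (S', T')."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("S' and T' must have the same length")
    return sum(sigma(a, b) for a, b in zip(s_aligned, t_aligned))


if __name__ == "__main__":
    s_prime = "GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT"
    t_prime = "GTAGAACGGCTTCAGTTG---TCACAGCGTTC-"
    # 24 matching columns and 9 mismatch/space columns under the assumed scores
    print(alignment_score(s_prime, t_prime))
```

With a different choice of $\sigma$ only the `sigma` function changes; the column-wise sum stays the same.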


Nomenclature

$\Sigma$ - an alphabet, a non-empty finite set. For example, $\Sigma = \{A, C, G, T\}$

A string over $\Sigma$ is any finite sequence of characters from $\Sigma$

$\Sigma^n$ - the set of all strings over $\Sigma$ of length $n$. Note that $\Sigma^0 = \{\epsilon\}$

The set of all strings over $\Sigma$ of any length is denoted $\Sigma^* = \bigcup_{n \in \mathbb{N}} \Sigma^n$

a substring of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{1+i} \cdots t_{m+i}$, where $0 \le i$ and $m + i \le n$

a prefix of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_1 \cdots t_m$, where $m \le n$

a suffix of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{n-m+1} \cdots t_n$, where $m \le n$

a subsequence of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{i_1} \cdots t_{i_m}$ such that $i_1 < \cdots < i_m$, where $m \le n$
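The difference between substring, prefix, suffix and subsequence can be checked with a few lines of Python. This is only a sketch; the helper name `is_subsequence` and the example string are made up for illustration.

```python
def is_subsequence(candidate: str, t: str) -> bool:
    """True if candidate can be obtained from t by deleting zero or more characters."""
    remaining = iter(t)
    # `ch in remaining` consumes the iterator, so the order of characters is respected
    return all(ch in remaining for ch in candidate)


if __name__ == "__main__":
    t = "ACGTAC"
    print("CGT" in t)                # substring: contiguous characters of t
    print(t.startswith("ACG"))       # prefix: t_1 ... t_m
    print(t.endswith("TAC"))         # suffix: t_{n-m+1} ... t_n
    print(is_subsequence("AGA", t))  # subsequence: indices increase, gaps allowed
    print("AGA" in t)                # but "AGA" is not a substring of t
```

Every substring is a subsequence, but not the other way around.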


Pairwise global alignment

Example: one possible alignment between ACGCTTTG and CATGTAT

S: AC--GCTTTG

T: -CATG-TAT-

Global alignment

Input: Two sequences $S = s_1 \cdots s_n$ and $T = t_1 \cdots t_m$ ($n$ and $m$ are approximately the same).

Question: Find an optimal alignment $S \to S'$ and $T \to T'$ such that

$$V = \sum_{i=1}^{d} \sigma(S'_i, T'_i)$$

is maximal, where $d = |S'| = |T'|$.


Dynamic programming

Let $V(i, j)$ be the optimal alignment score of $S_{1 \cdots i}$ and $T_{1 \cdots j}$ ($0 \le i \le n$, $0 \le j \le m$). $V$ has the following properties:

Base conditions:

$$V(i, 0) = \sum_{k=1}^{i} \sigma(S_k, \text{'-'}) \tag{1}$$

$$V(0, j) = \sum_{k=1}^{j} \sigma(\text{'-'}, T_k) \tag{2}$$

Recurrence relationship:

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + \sigma(S_i, T_j) \\ V(i-1, j) + \sigma(S_i, \text{'-'}) \\ V(i, j-1) + \sigma(\text{'-'}, T_j) \end{cases} \tag{4}$$

for all $i \in [1, n]$ and $j \in [1, m]$.
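The base conditions and recurrence (4) translate almost line by line into a table-filling routine. The following Python sketch assumes the scores +2 match, -1 mismatch, -1 space (taken from the examples in these slides, not fixed by the recurrence itself); `sigma` and `global_alignment` are illustrative names.

```python
def sigma(a, b):
    """Assumed scores: +2 match, -1 mismatch, -1 against a space."""
    if a == "-" or b == "-":
        return -1
    return 2 if a == b else -1


def global_alignment(s, t, score=sigma):
    """Return (optimal score, S', T') using the full (n+1) x (m+1) table."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # base condition (1)
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):                      # base condition (2)
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):                      # recurrence (4)
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]))
    # Trace back from (n, m) to (0, 0) to recover one optimal alignment.
    sa, ta = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            sa.append(s[i - 1]); ta.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + score(s[i - 1], "-"):
            sa.append(s[i - 1]); ta.append("-"); i -= 1
        else:
            sa.append("-"); ta.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(sa)), "".join(reversed(ta))


if __name__ == "__main__":
    print(global_alignment("ACGCTTTG", "CATGTAT"))
```

Time and space are both $O(nm)$; ties in the traceback are broken in favour of the diagonal move, so other optimal alignments may exist.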


Tabular computation

[Figure: dynamic programming table of $V(i, j)$ values]

Score: match=+2, mismatch=-1.


Global alignment in linear space

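Let $V^r(i, j)$ denote the optimal alignment value of the last $i$ characters in sequence $S$ against the last $j$ characters in sequence $T$.

$$V(n, m) = \max_{k \in [0, m]} \left\{ V\!\left(\tfrac{n}{2}, k\right) + V^r\!\left(\tfrac{n}{2}, m - k\right) \right\} \tag{5}$$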


Global alignment in linear space

Hirschberg’s algorithm:

1. Compute $V(i, j)$. Save the values of the $n/2$-th row. Denote $V(i, j)$ the forward matrix $F$.

2. Compute $V^r(i, j)$. Save the values of the $n/2$-th row. Denote $V^r(i, j)$ the backward matrix $B$.

3. Find the column $k^*$ such that $F(n/2, k^*) + B(n/2, m - k^*)$ is maximal.

4. Now that $k^*$ is found, recursively partition the problem into two subproblems: i) find the path from $(0, 0)$ to $(n/2, k^*)$; ii) find the path from $(n/2, k^*)$ to $(n, m)$, i.e. align the last $n/2$ characters of $S$ against the last $m - k^*$ characters of $T$.
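The four steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it assumes the scores +2 match, -1 mismatch, -1 space, and the names `sigma`, `last_row` and `hirschberg` are hypothetical. Only two rows of the table are kept at any time.

```python
def sigma(a, b):
    """Assumed scores: +2 match, -1 mismatch, -1 against a space."""
    if a == "-" or b == "-":
        return -1
    return 2 if a == b else -1


def last_row(s, t):
    """Last row of the global table V for s against t, in O(len(t)) space."""
    prev = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + sigma("-", t[j - 1])
    for i in range(1, len(s) + 1):
        cur = [prev[0] + sigma(s[i - 1], "-")] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = max(prev[j - 1] + sigma(s[i - 1], t[j - 1]),
                         prev[j] + sigma(s[i - 1], "-"),
                         cur[j - 1] + sigma("-", t[j - 1]))
        prev = cur
    return prev


def hirschberg(s, t):
    """Return one optimal global alignment (S', T') of s and t."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Either put s opposite the best t[k], or put s opposite a space.
        gains = [sigma(s, t[k]) - sigma("-", t[k]) for k in range(m)]
        k = max(range(m), key=gains.__getitem__)
        if gains[k] >= sigma(s, "-"):
            return "-" * k + s + "-" * (m - k - 1), t
        return s + "-" * m, "-" + t
    mid = n // 2
    F = last_row(s[:mid], t)                   # step 1: F[k] = V(mid, k)
    B = last_row(s[mid:][::-1], t[::-1])       # step 2: B[j] = V^r(n - mid, j)
    k_star = max(range(m + 1), key=lambda k: F[k] + B[m - k])   # step 3
    left = hirschberg(s[:mid], t[:k_star])     # step 4 i)
    right = hirschberg(s[mid:], t[k_star:])    # step 4 ii)
    return left[0] + right[0], left[1] + right[1]


if __name__ == "__main__":
    print(hirschberg("ACGCTTTG", "CATGTAT"))
```

The backward matrix is obtained by running the same forward routine on the reversed strings, which is why a single `last_row` helper suffices.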


Local alignment problem

Input: Given two sequences $S$ and $T$.

Question: Find the substrings $\alpha$ of $S$ and $\beta$ of $T$ whose similarity (optimal global alignment) is maximal (over all such pairs of substrings).

Example: S = GGTCTGAG and T = AAACGA

Score: match = 2; indel/substitution = -1

The optimal local alignment is $\alpha$ = CTGA and $\beta$ = CGA:

CTGA ($\alpha \in S$)
C-GA ($\beta \in T$)


Local Suffix Alignment Problem

Input: Given two sequences $S$ and $T$ and two indices $i$ and $j$.

Question: Find a (possibly empty) suffix $\alpha$ of $S_{1 \cdots i}$ and a (possibly empty) suffix $\beta$ of $T_{1 \cdots j}$ such that the value of the alignment between $\alpha$ and $\beta$ is maximal over all alignments of suffixes of $S_{1 \cdots i}$ and $T_{1 \cdots j}$.

Terminology and Restriction

$V(i, j)$: denote the value of the optimal local suffix alignment for a given pair $i, j$ of indices.

Limit the pair-wise scores by:

$$\sigma(x, y) = \begin{cases} \ge 0 & x, y \text{ match} \\ \le 0 & x, y \text{ do not match, or one of them is a space} \end{cases} \tag{6}$$


Local Suffix Alignment Problem

Recursive Definitions

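Base conditions: $V(i, 0) = 0$, $V(0, j) = 0$ for all $i$ and $j$.

Recurrence relation:

$$V(i, j) = \max \begin{cases} 0 \\ V(i-1, j-1) + \sigma(S_i, T_j) \\ V(i-1, j) + \sigma(S_i, \text{'-'}) \\ V(i, j-1) + \sigma(\text{'-'}, T_j) \end{cases} \tag{7}$$

The $0$ term corresponds to choosing the empty suffixes.

Compute $i^*$ and $j^*$:

$$V(i^*, j^*) = \max_{i \in [1, n],\, j \in [1, m]} V(i, j)$$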
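A short Python sketch of this local alignment computation, run on the example S = GGTCTGAG, T = AAACGA with match = 2 and indel/substitution = -1. The function name `local_alignment` and its parameter names are illustrative; the traceback simply stops at the first cell whose value is 0.

```python
def local_alignment(s, t, match=2, other=-1):
    """Return (best score, alpha', beta') for the best local alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]      # V(i, 0) = V(0, j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else other)
            V[i][j] = max(0, diag, V[i - 1][j] + other, V[i][j - 1] + other)
            if V[i][j] > best:                     # track (i*, j*)
                best, bi, bj = V[i][j], i, j
    # Trace back from (i*, j*) until a cell with value 0 is reached.
    alpha, beta = [], []
    i, j = bi, bj
    while V[i][j] > 0:
        diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else other)
        if V[i][j] == diag:
            alpha.append(s[i - 1]); beta.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + other:
            alpha.append(s[i - 1]); beta.append("-"); i -= 1
        else:
            alpha.append("-"); beta.append(t[j - 1]); j -= 1
    return best, "".join(reversed(alpha)), "".join(reversed(beta))


if __name__ == "__main__":
    print(local_alignment("GGTCTGAG", "AAACGA"))   # (5, 'CTGA', 'C-GA')
```

The result agrees with the slide example: CTGA against C-GA scores 2 - 1 + 2 + 2 = 5.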


Local Suffix Alignment Problem

Score: match=+2, mismatch=-1.


Affine Gap Penalty Model

A total penalty for a gap of length $q$ is:

$$W_{total} = W_g + q W_s$$

where

$W_g$: the weight for “opening the gap”

$W_s$: the weight for “extending the gap” with one more space

Under this model, the score for a particular alignment $S \to S'$ and $T \to T'$ is:

$$\sum_{i \in \{k : S'_k \neq \text{'-'} \,\&\, T'_k \neq \text{'-'}\}} \sigma(S'_i, T'_i) + W_g \times \#\text{gaps} + W_s \times \#\text{spaces}$$


Global alignment with affine gap penalty

To align sequences $S$ and $T$, consider the prefixes $S_{1 \cdots i}$ and $T_{1 \cdots j}$.

Any alignment of these two prefixes is one of the following three types:

Type 1 ($A(i, j)$): Characters $S_i$ and $T_j$ are aligned opposite each other.

S: ************i
T: ************j

Type 2 ($L(i, j)$): Character $S_i$ is aligned to a character to the left of $T_j$.

S: ************i------
T: ******************j

Type 3 ($R(i, j)$): Character $S_i$ is aligned to a character to the right of $T_j$.

S: ******************i
T: *************j-----


Recursive Definition


Base conditions:

$$V(0, 0) = 0 \tag{8}$$

$$V(i, 0) = R(i, 0) = W_g + i W_s \tag{9}$$

$$V(0, j) = L(0, j) = W_g + j W_s \tag{10}$$

Recurrence relation:

$$V(i, j) = \max\{A(i, j), L(i, j), R(i, j)\} \tag{11}$$

$$A(i, j) = V(i-1, j-1) + \sigma(S_i, T_j) \tag{12}$$

$$L(i, j) = \max\{L(i, j-1) + W_s,\; V(i, j-1) + W_g + W_s\} \tag{13}$$

$$R(i, j) = \max\{R(i-1, j) + W_s,\; V(i-1, j) + W_g + W_s\} \tag{14}$$
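Equations (8) to (14) can be evaluated with three auxiliary tables. The sketch below is an illustration only: the default weights ($W_g = -2$, $W_s = -1$), the match/mismatch scores and the function name `affine_global_score` are assumptions chosen for the example, and the weights are negative because they are added to the score.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, w_g=-2, w_s=-1, match=2, mismatch=-1):
    """Optimal global alignment score under the affine gap model."""
    n, m = len(s), len(t)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    A = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    L = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    R = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0                                     # (8)
    for i in range(1, n + 1):                       # (9)
        V[i][0] = R[i][0] = w_g + i * w_s
    for j in range(1, m + 1):                       # (10)
        V[0][j] = L[0][j] = w_g + j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = V[i - 1][j - 1] + sub                                 # (12)
            L[i][j] = max(L[i][j - 1] + w_s, V[i][j - 1] + w_g + w_s)       # (13)
            R[i][j] = max(R[i - 1][j] + w_s, V[i - 1][j] + w_g + w_s)       # (14)
            V[i][j] = max(A[i][j], L[i][j], R[i][j])                        # (11)
    return V[n][m]


if __name__ == "__main__":
    # A--T style alignment: two matches plus one gap of length 2
    print(affine_global_score("ACGT", "AT"))
```

Cells that cannot be reached in a given state (for example $L(i, 0)$ with $i \ge 1$) are initialised to minus infinity so that they never win a maximum.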


Local alignment problem


Input: Given two sequences $S$ and $T$.

Question: Find the substrings $\alpha$ of $S$ and $\beta$ of $T$ whose similarity (optimal global alignment) is maximal (over all such pairs of substrings).

Example: S = GGTCTGAG and T = AAACGA

Score: match = 2; indel/substitution = -1

The optimal local alignment is $\alpha$ = CTGA and $\beta$ = CGA:

CTGA ($\alpha \in S$)
C-GA ($\beta \in T$)

Suppose the maximal local alignment score between the two sequences is $S$. How to measure the significance of $S$?


Measure statistical significance

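One possible solution:

1. Generate many random sequences $T_1, T_2, \cdots, T_N$ (e.g. $N > 10{,}000$).

2. Find the optimal alignment score $S_i$ between $S$ and $T_i$ for all $i$.

3. p-value $= \sum_{i=1}^{N} I(S_i \ge S) / N$, where $I$ is the indicator function and the $S$ on the right of the inequality is the observed score.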

However, the solution is not practical.
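For small inputs the three steps can still be tried directly. The Python sketch below makes two assumptions that the slide does not state: random sequences are produced by shuffling the letters of $T$, and the local alignment score uses match = 2, indel/substitution = -1. The names `local_score` and `empirical_p_value` are illustrative.

```python
import random


def local_score(s, t, match=2, other=-1):
    """Best local alignment score, keeping only two rows of the table."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            cur.append(max(0,
                           prev[j - 1] + (match if a == b else other),
                           prev[j] + other,
                           cur[j - 1] + other))
        best = max(best, max(cur))
        prev = cur
    return best


def empirical_p_value(s, t, n_random=1000, seed=0):
    """Fraction of shuffled versions of t that score at least as high as t itself."""
    rng = random.Random(seed)          # fixed seed so the estimate is reproducible
    observed = local_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(letters)           # assumed null model: same composition as t
        if local_score(s, "".join(letters)) >= observed:
            hits += 1
    return hits / n_random


if __name__ == "__main__":
    print(empirical_p_value("GGTCTGAG", "AAACGA"))
```

Each random sequence costs a full $O(nm)$ alignment, and a p-value of order $10^{-k}$ needs well over $10^k$ of them, which is why this approach does not scale.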


Extreme value distribution (EVD)

Suppose that $X_1, X_2, \cdots, X_n$ are iid random variables. Denote the maximum of these r.v. by $X_{max} = \max\{X_1, X_2, \cdots, X_n\}$.

Suppose that $X_1, \cdots, X_n$ are continuous r.v. with density function $f_X(x)$ and cumulative distribution function $F_X(x)$.

Question: what is the distribution of $X_{max}$?


Extreme value distribution (EVD)

Note that $\text{Prob}(X_{max} \le x) = [\text{Prob}(X \le x)]^n$. Hence

$$F_{X_{max}}(x) = (F_X(x))^n$$

Density function of $X_{max}$:

$$f_{X_{max}}(x) = n f_X(x) (F_X(x))^{n-1}$$
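The identity $F_{X_{max}}(x) = (F_X(x))^n$ is easy to check by simulation. The sketch below picks the Uniform(0, 1) distribution purely as an assumption for illustration, since then $F_X(x) = x$; the function names are made up.

```python
import random


def empirical_cdf_of_max(x, n, trials=20000, seed=0):
    """Estimate Prob(X_max <= x) for the maximum of n iid Uniform(0, 1) variables."""
    rng = random.Random(seed)
    hits = sum(max(rng.random() for _ in range(n)) <= x for _ in range(trials))
    return hits / trials


def exact_cdf_of_max(x, n):
    """For Uniform(0, 1), F_X(x) = x on [0, 1], so F_{X_max}(x) = x ** n."""
    return x ** n


if __name__ == "__main__":
    print(empirical_cdf_of_max(0.8, 5), exact_cdf_of_max(0.8, 5))
```

The two printed numbers should agree up to Monte Carlo noise of roughly $1/\sqrt{\text{trials}}$.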


Example: the exponential distribution


$$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \tag{15}$$

$$F_X(x) = 1 - e^{-\lambda x}, \quad x \ge 0 \tag{16}$$

Mean: $1/\lambda$; Variance: $1/\lambda^2$


EVD of the exponential distribution

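Mean and variance of $X_{max}$:

$$E[X_{max}] = \frac{1}{\lambda}\left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) \xrightarrow{n \to \infty} \frac{1}{\lambda}(\gamma + \log n) \tag{19}$$

$$\text{Var}[X_{max}] = \frac{1}{\lambda^2}\left(1 + \frac{1}{2^2} + \cdots + \frac{1}{n^2}\right) \xrightarrow{n \to \infty} \frac{\pi^2}{6\lambda^2} \tag{20}$$

where $\gamma = 0.5772\ldots$ is Euler’s constant.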
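A quick numerical check of the two limits, written as a small Python sketch. The function names and the value $\lambda = 2$ are arbitrary choices for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def mean_of_max(n, lam):
    """Exact mean of the maximum of n iid Exponential(lam) variables: H_n / lam."""
    return sum(1.0 / k for k in range(1, n + 1)) / lam


def var_of_max(n, lam):
    """Exact variance of the maximum: (sum of 1 / k^2 for k = 1..n) / lam^2."""
    return sum(1.0 / k ** 2 for k in range(1, n + 1)) / lam ** 2


if __name__ == "__main__":
    n, lam = 1000, 2.0
    print(mean_of_max(n, lam), (EULER_GAMMA + math.log(n)) / lam)
    print(var_of_max(n, lam), math.pi ** 2 / (6 * lam ** 2))
```

For $n = 1000$ both pairs already agree to about three decimal places; the mean keeps growing like $\log n$ while the variance levels off.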


Asymptotic distribution

Asymptotic formula for the distribution of $X_{max}$

Define a rescaled $X_{max}$:

$$U = \frac{X_{max} - \log(n)/\lambda}{1/\lambda} = \lambda X_{max} - \log n$$

As $n \to \infty$, the mean of $U$ approaches $\gamma$ and the variance of $U$ approaches $\pi^2/6$.


Gumbel distribution

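The cumulative distribution:

$$\text{Prob}(U \le u) = \text{Prob}(X_{max} \le (u + \log n)/\lambda) \tag{21}$$

$$= (1 - e^{-u}/n)^n \tag{22}$$

$$\to e^{-e^{-u}} \quad \text{as } n \to \infty \tag{23}$$

Or equivalently

$$\text{Prob}(U \ge u) = 1 - e^{-e^{-u}} \quad \text{as } n \to \infty$$

which is called the Gumbel distribution.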
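The convergence of the finite-$n$ expression $(1 - e^{-u}/n)^n$ to the Gumbel limit can be seen numerically. A small Python sketch, with function names chosen only for illustration:

```python
import math


def finite_n_cdf(u, n):
    """Prob(U <= u) for finite n, valid when u >= -log(n)."""
    return (1.0 - math.exp(-u) / n) ** n


def gumbel_cdf(u):
    """Limiting Gumbel cumulative distribution exp(-exp(-u))."""
    return math.exp(-math.exp(-u))


if __name__ == "__main__":
    for n in (10, 100, 10_000, 1_000_000):
        print(n, finite_n_cdf(1.0, n), gumbel_cdf(1.0))
```

The gap between the two columns shrinks roughly like $1/n$.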


EVD of the exponential distribution

EVD for large $u$

The density function

$$f_U(u) = e^{-u - e^{-u}} \approx e^{-u}\left(1 - e^{-u} + \tfrac{1}{2} e^{-2u} - \ldots\right) \approx e^{-u}$$

which decays much slower than the Gaussian distribution.


Karlin & Altschul statistics


For local ungapped alignments between two sequences of length $m$ and $n$, the probability that there is a match of a score greater than $S$ is:

$$P(x \ge S) = 1 - e^{-Kmn e^{-\lambda S}}$$

Denote $E(S) = Kmn e^{-\lambda S}$ - the expected number of unrelated matches with score greater than $S$.

Significance requirement: $E(S)$ should be significantly less than 1. Since $E(S) < 1$ is equivalent to $\log(Kmn) < \lambda S$, that is

$$S > \frac{\log(mn)}{\lambda} + \frac{\log K}{\lambda}$$
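The formulas for $E(S)$ and $P(x \ge S)$ are simple to evaluate. In the Python sketch below the values of $K$ and $\lambda$ are placeholders picked for illustration (in practice they depend on the scoring scheme and letter frequencies), and the function names are hypothetical.

```python
import math


def e_value(score, m, n, K, lam):
    """E(S) = K * m * n * exp(-lam * S): expected number of chance matches."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """P(x >= S) = 1 - exp(-E(S)), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e_value(score, m, n, K, lam))


def score_where_e_is_one(m, n, K, lam):
    """Score at which E(S) = 1; significant scores must lie well above it."""
    return (math.log(m * n) + math.log(K)) / lam


if __name__ == "__main__":
    K, lam = 0.1, 0.3          # placeholder parameters, not fitted values
    m, n = 1000, 1000
    s0 = score_where_e_is_one(m, n, K, lam)
    print(s0, e_value(s0 + 20, m, n, K, lam), p_value(s0 + 20, m, n, K, lam))
```

For small $E(S)$ the p-value and the E-value are nearly equal, since $1 - e^{-E} \approx E$.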
