Shortened Glossary of Terms from the BLAST NCBI Web-Site

Bioinformatics

The merger of biotechnology and information technology with the goal of

revealing new insights and principles in biology.

Proteomics

Systematic analysis of protein expression of normal and diseased tissues that

involves the separation, identification and characterization of all of the proteins in an

organism.

***********************************************

Alignment

The process of lining up two or more sequences to achieve maximal levels of

identity (and conservation, in the case of amino acid sequences) for the purpose of

assessing the degree of similarity and the possibility of homology.
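
As a rough illustration of what "lining up" two sequences means computationally, the sketch below scores the best global alignment of two strings by dynamic programming (the Needleman-Wunsch approach). The function name and the match, mismatch and gap values are assumptions chosen for the example; they do not come from this glossary, and real tools use substitution matrices and more elaborate gap costs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    Toy scoring (an assumption): +1 per identical column, -1 per
    mismatched column, -1 per gap position.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

For example, global_alignment_score("ACGT", "AGT") aligns three identical columns and one gap.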

Bit score

A value S' derived from the raw alignment score S by taking into account the statistical

properties of the scoring system used. Because bit

scores have been normalized with respect to the scoring system, they can be used to

compare alignment scores from different searches.
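
The normalization is usually written S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters of the scoring system (Karlin-Altschul statistics). The glossary does not name these parameters, so the symbols, the function name and the example values below are assumptions made for illustration.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the statistical parameters of the scoring system;
    they must match the matrix and gap costs used for the search.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative values only: lam=0.267 and k=0.041 are commonly quoted
# for gapped BLOSUM62 searches, but check the values your search reports.
example = bit_score(100, 0.267, 0.041)
```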

BLAST

Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison

algorithm, optimized for speed, that searches sequence databases for optimal local

alignments to a query. The initial search is done for a word of length "W" that scores

at least "T" when compared to the query using a substitution matrix. Word hits are

then extended in either direction in an attempt to generate an alignment with a score

exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity

of the search. For additional details, see one of the BLAST tutorials (Query or

BLAST) or the narrative guide to BLAST.
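
The seeding step described above can be sketched as follows. This is a simplified illustration, not the BLAST implementation: the function names are hypothetical, the scoring is a toy match/mismatch scheme rather than a real substitution matrix, and the extension of word hits toward the score threshold S is left out.

```python
from itertools import product

def toy_score(a, b, match=5, mismatch=-2):
    """Toy substitution score (an assumption; real searches use a matrix)."""
    return match if a == b else mismatch

def find_seeds(query, subject, w, t, alphabet):
    """Return (query_offset, subject_offset) pairs for word hits.

    A hit is a subject word of length w that scores at least t
    against the query word at that query offset.
    """
    # Build a lookup: neighborhood word -> query offsets it matches.
    lookup = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(toy_score(x, y) for x, y in zip(qword, cand)) >= t:
                lookup.setdefault("".join(cand), []).append(i)
    # Scan the subject once, reporting every word found in the lookup.
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in lookup.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds
```

With w=3 and t=8 under the toy scores, a word hits if it differs from the query word in at most one position. Raising t shrinks the neighborhood, which makes the search faster but less sensitive.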

BLOSUM

Blocks Substitution Matrix. A substitution matrix in which scores for each

position are derived from observations of the frequencies of substitutions in blocks

of local alignments in related proteins. Each matrix is tailored to a particular

evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from

which scores were derived was created using sequences sharing no more than 62%

identity. Sequences more than 62% identical are represented by a single sequence in

the alignment so as to avoid over-weighting closely related family members.

(Henikoff and Henikoff)

Domain

A discrete portion of a protein assumed to fold independently of the rest of

the protein and possessing its own function.

E value

Expectation value. The number of different alignments with scores equivalent to

or better than S that are expected to occur in a database search by chance. The lower

the E value, the more significant the score.
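
A common way to express this is E = m * n * 2^(-S'), where S' is the bit score, m is the query length and n is the total database length. The sketch below assumes that formula; the function and parameter names are illustrative, and real programs use corrected "effective" lengths rather than the raw lengths used here.

```python
def expected_hits(bit_score, query_len, db_len):
    """Estimate the E value from a bit score and the search space size.

    Assumes E = m * n * 2**(-S') with raw (uncorrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)
```

Under this formula, each additional bit of score halves the E value, and doubling the database size doubles it.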

FASTA

The first widely used algorithm for database similarity searching. The program

looks for optimal local alignments by scanning the sequence for small matches called

"words". Initially, the scores of segments in which there are multiple word hits are

calculated ("init1"). Later the scores of several segments may be summed to generate

an "initn" score. An optimized alignment that includes gaps is shown in the output as

"opt". The sensitivity and speed of the search are inversely related and controlled by

the "k-tup" variable which specifies the size of a "word". (Pearson and Lipman)
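
The word-scanning idea can be illustrated by indexing every word of length k-tup in the query and counting exact word hits per diagonal of the comparison. This is a simplified sketch with hypothetical names, not the FASTA program itself: it only finds the diagonals rich in word hits and does not compute the init1, initn or opt scores.

```python
from collections import Counter, defaultdict

def diagonal_hits(query, subject, ktup):
    """Count exact word hits on each diagonal (query offset - subject offset)."""
    # Index the query: word -> list of offsets where it occurs.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject and tally the diagonal of every word hit.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals
```

A diagonal with many hits marks a segment where the two sequences share several words at the same relative offset. A larger ktup gives fewer chance hits and a faster, less sensitive scan.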

Gap

A space introduced into an alignment to compensate for insertions and

deletions in one sequence relative to another. To prevent the accumulation of too

many gaps in an alignment, introduction of a gap causes the deduction of a fixed

amount (the gap score) from the alignment score. Extension of the gap to encompass

additional nucleotides or amino acids is also penalized in the scoring of an alignment.
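
This two-part penalty is often called an affine gap cost. The sketch below assumes one common convention, in which the opening penalty covers the first gap position and each further position adds the extension penalty; some programs instead charge the extension penalty for every position including the first. The function names and default values are illustrative.

```python
import re

def gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for a single gap spanning `length` positions."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

def total_gap_penalty(aligned_a, aligned_b, gap_open=10, gap_extend=1):
    """Sum the penalties of all gaps ('-' runs) in two aligned sequences."""
    runs = re.findall(r"-+", aligned_a) + re.findall(r"-+", aligned_b)
    return sum(gap_cost(len(run), gap_open, gap_extend) for run in runs)
```

With these defaults, one gap of three positions costs 12, while three separate single-position gaps cost 30, which is how the scheme discourages many scattered gaps.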

Homology

Similarity attributed to descent from a common ancestor.

HSP

High-scoring segment pair. Local alignments with no gaps that achieve one of

the top alignment scores in a given search.

Identity

The extent to which two (nucleotide or amino acid) sequences are invariant.
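
Identity is usually reported as a percentage over an alignment. The sketch below assumes the two sequences are already aligned (same length, '-' for gaps) and divides by the number of alignment columns; other denominators, such as the length of the shorter sequence, are also in use, so the choice here is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    # A column of two gap characters is not counted as a match.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / columns
```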

Motif

A short conserved region in a protein sequence. Motifs are frequently highly

conserved parts of domains.

Multiple Sequence Alignment

An alignment of three or more sequences with gaps inserted in the sequences

such that residues with common structural positions and/or ancestral residues are

aligned in the same column. Clustal W is one of the most widely used multiple

sequence alignment programs.

Orthologous

Homologous sequences in different species that arose from a common

ancestral gene during speciation; may or may not be responsible for a similar

function.

P value

The probability of an alignment occurring with the score in question or better.

The P value is calculated by relating the observed alignment score, S, to the expected

distribution of HSP scores from comparisons of random sequences of the same

length and composition as the query to the database. The most highly significant P

values will be those close to 0. P values and E values are different ways of

representing the significance of the alignment.
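
If the number of chance alignments scoring at least S is treated as Poisson distributed with mean E, the two measures are related by P = 1 - e^(-E). The sketch below assumes that model; the function name is illustrative.

```python
import math

def p_value_from_e(e_value):
    """Probability of at least one chance alignment, given E expected."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)
```

For small E the two values are nearly equal (E = 0.01 gives P of about 0.00995), while for large E the P value saturates near 1.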

PAM

Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify

the amount of evolutionary change in a protein sequence. One PAM unit is the

amount of evolution which will change, on average, 1% of amino acids in a protein

sequence. A PAM(x) substitution matrix is a look-up table in which scores for each

amino acid substitution have been calculated based on the frequency of that

substitution in closely related proteins that have experienced a certain amount (x) of

evolutionary divergence.
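
In the Dayhoff model, the mutation probabilities for x PAM units are obtained by multiplying the 1 PAM probability matrix by itself x times. The sketch below shows that extrapolation on a made-up two-letter alphabet; the matrix values are invented for illustration, and the result is a probability matrix, not the log-odds scores stored in a published PAM(x) table.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Raise a 1 PAM mutation probability matrix to the n-th power."""
    size = len(pam1)
    # Start from the identity matrix (zero evolutionary change).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result

# Hypothetical 1 PAM matrix: each residue changes with probability 1%.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = pam_n(toy_pam1, 250)
```

Each row of the result still sums to 1, and the off-diagonal entries grow with n as more change accumulates.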

Paralogous

Homologous sequences within a single species that arose by gene

duplication.

PSI-BLAST

Position-Specific Iterative BLAST. An iterative search using the BLAST

algorithm. A profile is built after the initial search, which is then used in subsequent

searches. The process may be repeated, if desired, with new sequences found in each

cycle used to refine the profile. Details can be found in the NCBI discussion of

PSI-BLAST. (Altschul et al.)

Substitution Matrix

A matrix containing values proportional to the probability that

amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are

constructed by assembling a large and diverse sample of verified pairwise alignments

of amino acids. If the sample is large enough to be statistically significant, the

resulting matrices should reflect the true probabilities of mutations occurring through

a period of evolution.
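
In practice the entries are usually stored as log-odds scores: the logarithm of how often a pair is observed in trusted alignments relative to how often it would occur by chance. The sketch below assumes that form and a base-2 logarithm; the function name and the input format (a list of residue pairs taken from alignment columns) are illustrative, and published matrices add scaling and rounding not shown here.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs):
    """Build log2 odds scores from (residue, residue) alignment columns."""
    # Count each column in both orders so the matrix is symmetric.
    pair_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())
    # Background frequency of each residue, from the same sample.
    residue_freq = Counter()
    for (a, _), count in pair_counts.items():
        residue_freq[a] += count / total
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total
        expected = residue_freq[a] * residue_freq[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones. Pairs never observed are absent from the result, which is one reason a large and diverse sample is needed.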