Shortened Glossary of Terms from the BLAST NCBI Web-Site.

 

          Bioinformatics

                        The merger of biotechnology and information technology with the goal of

                   revealing new insights and principles in biology.

           

          Proteomics

                        Systematic analysis of protein expression of normal and diseased tissues that

                   involves the separation, identification and characterization of all of the proteins in an

                   organism.

***********************************************

                Alignment

                        The process of lining up two or more sequences to achieve maximal levels of

                   identity (and conservation, in the case of amino acid sequences) for the purpose of

                   assessing the degree of similarity and the possibility of homology.

 

                Bit score

                        A value, S', derived from the raw alignment score S by taking into account the

                   statistical properties of the scoring system used. Because bit

                   scores have been normalized with respect to the scoring system, they can be used to

                   compare alignment scores from different searches.
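
The normalization is commonly written as S' = (lambda * S - ln K) / ln 2, where lambda and K are statistical parameters of the scoring system. A minimal sketch of that conversion follows; the function name and the parameter values are illustrative assumptions, not values taken from this glossary.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'.

    lam and k stand for the scoring-system parameters (lambda, K);
    they must be supplied by the caller.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameter values only (assumed for this example).
EXAMPLE_LAMBDA = 0.318
EXAMPLE_K = 0.134

if __name__ == "__main__":
    print(bit_score(100, EXAMPLE_LAMBDA, EXAMPLE_K))
```

Because lambda and K absorb the scale of the scoring system, two raw scores produced with different matrices become comparable once both are converted this way.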

 

                BLAST

                        Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison

                   algorithm optimized for speed used to search sequence databases for optimal local

                   alignments to a query. The initial search is done for a word of length "W" that scores

                   at least "T" when compared to the query using a substitution matrix. Word hits are

                   then extended in either direction in an attempt to generate an alignment with a score

                   exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity

                   of the search. For additional details, see one of the BLAST tutorials (Query or

                   BLAST) or the narrative guide to BLAST.
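
The seeding step described above can be sketched as follows. This is a toy illustration only: `toy_score` is a hypothetical stand-in for a real substitution matrix, the brute-force enumeration is practical only for very short words, and the extension step is not shown.

```python
from itertools import product

def toy_score(a, b):
    """Hypothetical scoring: +5 for a match, -2 for a mismatch."""
    return 5 if a == b else -2

def neighborhood_words(query, w, t, alphabet):
    """Map every word of length w scoring at least t against some
    query word to the query positions it matches."""
    hits = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for candidate in product(alphabet, repeat=w):
            score = sum(toy_score(x, y) for x, y in zip(q_word, candidate))
            if score >= t:
                hits.setdefault("".join(candidate), []).append(pos)
    return hits

if __name__ == "__main__":
    # With t = 15 only the exact word qualifies; lowering t admits
    # more words, trading speed for sensitivity.
    print(neighborhood_words("ACG", 3, 15, "ACGT"))
    print(len(neighborhood_words("ACG", 3, 8, "ACGT")))
```

Lowering "T" enlarges the word list, which increases sensitivity at the cost of more hits to extend.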

 

                BLOSUM

                        Blocks Substitution Matrix. A substitution matrix in which scores for each

                   position are derived from observations of the frequencies of substitutions in blocks

                   of local alignments in related proteins. Each matrix is tailored to a particular

                   evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from

                   which scores were derived was created using sequences sharing no more than 62%

                   identity. Sequences more than 62% identical are represented by a single sequence in

                   the alignment so as to avoid over-weighting closely related family members.

                   (Henikoff and Henikoff)

 

                Domain

                        A discrete portion of a protein assumed to fold independently of the rest of

                   the protein and possessing its own function.

 

                E value

                        Expectation value. The number of different alignments with scores equivalent to

                   or better than S that are expected to occur in a database search by chance. The lower

                   the E value, the more significant the score.
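
For a bit score S', the expectation value is commonly approximated as E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below assumes that form; the function and argument names are illustrative, and real searches use effective (edge-corrected) lengths that are not modeled here.

```python
def e_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Assumes E = m * n * 2**(-S') with uncorrected lengths m and n.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

if __name__ == "__main__":
    # Each additional bit of score halves the expected number of chance hits.
    print(e_value(10, 4, 256))   # 4 * 256 / 1024 = 1.0
    print(e_value(11, 4, 256))   # 0.5
```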

 

                FASTA

                        The first widely used algorithm for database similarity searching. The program

                   looks for optimal local alignments by scanning the sequence for small matches called

                   "words". Initially, the scores of segments in which there are multiple word hits are

                   calculated ("init1"). Later the scores of several segments may be summed to generate

                   an "initn" score. An optimized alignment that includes gaps is shown in the output as

                   "opt". The sensitivity and speed of the search are inversely related and controlled by

                   the "k-tup" variable which specifies the size of a "word". (Pearson and Lipman)
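
The word-matching stage can be sketched as a lookup table of query words plus a count of hits per diagonal. This is a simplified illustration, not the FASTA program itself: the function name is hypothetical and the scoring stages ("init1", "initn", "opt") are not reproduced.

```python
from collections import Counter, defaultdict

def diagonal_word_hits(query, subject, ktup):
    """Count exact word matches of length ktup on each diagonal.

    A diagonal is identified by (subject position - query position);
    many hits on one diagonal suggest a shared ungapped segment.
    """
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals

if __name__ == "__main__":
    hits = diagonal_word_hits("ACGTAC", "TTACGTAC", 2)
    print(hits.most_common())
```

A larger "k-tup" yields fewer word hits and a faster but less sensitive search.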

 

                Gap

                        A space introduced into an alignment to compensate for insertions and

                   deletions in one sequence relative to another. To prevent the accumulation of too

                   many gaps in an alignment, introduction of a gap causes the deduction of a fixed

                   amount (the gap score) from the alignment score. Extension of the gap to encompass

                   additional nucleotides or amino acids is also penalized in the scoring of an alignment.
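
The two penalties are often combined into a single "affine" gap cost. The sketch below assumes one common convention, in which the opening penalty covers the first gapped position and each further position costs the extension penalty; the default values are hypothetical.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Amount deducted from the alignment score for one gap.

    Assumed convention: gap_open covers the first position and
    gap_extend is charged for every additional position.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

if __name__ == "__main__":
    print(gap_penalty(1))  # 11
    print(gap_penalty(3))  # 13
```

Making extension cheaper than opening favors a few long gaps over many short ones.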

 

                Homology

                        Similarity attributed to descent from a common ancestor.

 

                HSP

                        High-scoring segment pair. Local alignments with no gaps that achieve one of

                   the top alignment scores in a given search.

 

                Identity

                        The extent to which two (nucleotide or amino acid) sequences are invariant.
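
Identity is usually reported as a percentage of aligned positions. The sketch below is one way to compute it for two already-aligned sequences; the function name is illustrative, "-" is assumed to mark a gap, and the choice of denominator (all columns that are not gap-only) is an assumption, since conventions differ.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(zip(aligned_a, aligned_b))
    # A column counts as identical only if both residues match and are not gaps.
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    # Columns that are gaps in both rows carry no information and are skipped.
    columns = sum(1 for x, y in pairs if not (x == "-" and y == "-"))
    return 100.0 * matches / columns if columns else 0.0

if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # 75.0
```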

 

                Motif

                        A short conserved region in a protein sequence. Motifs are frequently highly

                   conserved parts of domains.

 

                Multiple Sequence Alignment

                        An alignment of three or more sequences with gaps inserted in the sequences

                   such that residues with common structural positions and/or ancestral residues are

                   aligned in the same column. Clustal W is one of the most widely used multiple

                   sequence alignment programs.

 

                Orthologous

                        Homologous sequences in different species that arose from a common

                   ancestral gene during speciation; may or may not be responsible for a similar

                   function.

 

                P value

                        The probability of an alignment occurring by chance with the score in question or better.

                   The P value is calculated by relating the observed alignment score, S, to the expected

                   distribution of HSP scores from comparisons of random sequences of the same

                   length and composition as the query to the database. The most highly significant P

                   values will be those close to 0. P values and E values are different ways of

                   representing the significance of the alignment.
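
Under the usual assumption that chance hits follow a Poisson distribution, the two measures are related by P = 1 - e^(-E). The sketch below applies that relation; the function name is illustrative.

```python
import math

def p_value_from_e(e_value):
    """Probability of at least one chance alignment, given E expected ones.

    Uses expm1 so that very small E values do not lose precision.
    """
    return -math.expm1(-e_value)

if __name__ == "__main__":
    # For small E the two measures are nearly equal; they diverge as E grows.
    for e in (1e-6, 0.01, 1.0, 10.0):
        print(e, p_value_from_e(e))
```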

 

               PAM

                        Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify

                   the amount of evolutionary change in a protein sequence. One PAM unit is the

                   amount of evolution which will change, on average, 1% of amino acids in a protein

                   sequence. A PAM(x) substitution matrix is a look-up table in which scores for each

                   amino acid substitution have been calculated based on the frequency of that

                   substitution in closely related proteins that have experienced a certain amount (x) of

                   evolutionary divergence.
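
In the Dayhoff model, mutation probabilities for larger distances are extrapolated by multiplying the 1-PAM probability matrix by itself x times. The sketch below illustrates that idea on a made-up two-letter alphabet; the matrix values and function names are hypothetical, and the final conversion of probabilities into log-odds scores is not shown.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_probabilities(pam1, x):
    """Raise a 1-PAM mutation probability matrix to the power x."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, pam1)
    return result

# Toy 1-PAM matrix: each residue has a 1% chance of changing per unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    for row in pam_probabilities(TOY_PAM1, 250):
        print(row)
```

Note that x PAM units do not mean x% of positions differ, because repeated changes can hit the same position or revert it.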

 

               Paralogous

                        Homologous sequences within a single species that arose by gene

                   duplication.

 

               PSI-BLAST

                        Position-Specific Iterative BLAST. An iterative search using the BLAST

                   algorithm. A profile is built after the initial search, which is then used in subsequent

                   searches. The process may be repeated, if desired, with new sequences found in each

                   cycle used to refine the profile. Details can be found in this discussion of

                   PSI-BLAST. (Altschul et al.)

 

               Substitution Matrix

                        A matrix containing values proportional to the probability that

                   amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are

                   constructed by assembling a large and diverse sample of verified pairwise alignments

                   of amino acids. If the sample is large enough to be statistically significant, the

                   resulting matrices should reflect the true probabilities of mutations occurring through

                   a period of evolution.
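
In practice, matrix entries are usually stored as log-odds scores: the logarithm of how often a pair is observed in trusted alignments divided by how often it would be expected by chance. The sketch below derives such scores from a list of aligned residue pairs; the function name and the toy data are assumptions for illustration, and no pseudocounts, scaling or rounding are applied.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Compute log2(observed / expected) for each unordered residue pair."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        p_a = residue_counts[a] / total_residues
        p_b = residue_counts[b] / total_residues
        # An unordered mismatch pair can arise in two orders, hence the factor 2.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores

if __name__ == "__main__":
    toy_pairs = [("A", "A")] * 4 + [("A", "B")] * 2 + [("B", "B")] * 4
    print(log_odds_scores(toy_pairs))
```

Pairs seen more often than chance predicts receive positive scores, and pairs seen less often receive negative ones.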