MPEP 2412.05(b)
Representation and Symbols of Nucleotide Sequence Data

Ninth Edition of the MPEP, Revision 07.2022, Last Revised in February 2023

2412.05(b)    Representation and Symbols of Nucleotide Sequence Data [R-07.2022]

[Editor Note: This section is applicable to all applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). Formatting representations of XML (eXtensible Markup Language) elements in this section appear different than shown in Standard ST.26, which may be accessed at: www.wipo.int/export/sites/www/standards/en/pdf/03-26-01.pdf.]

37 C.F.R. 1.832  Representation of nucleotide and/or amino acid sequence data in the "Sequence Listing XML" part of a patent application filed on or after July 1, 2022.

  • *****

  • (b) The representation and symbols for nucleotide sequence data shall conform to the requirements of paragraphs (b)(1) through (4) of this section.
    • (1) A nucleotide sequence must be represented in the manner described in paragraphs 11–12 of WIPO Standard ST.26.
    • (2) All nucleotides, including nucleotide analogs, modified nucleotides, and "unknown" nucleotides, within a nucleotide sequence must be represented using the symbols set forth in paragraphs 13–16, 19, and 21 of WIPO Standard ST.26.
    • (3) Modified nucleotides within a nucleotide sequence must be described in the manner discussed in paragraphs 17, 18, and 19 of WIPO Standard ST.26.
    • (4) A region containing a known number of contiguous "a," "c," "g," "t," or "n" residues for which the same description applies may be jointly described in the manner described in paragraph 22 of WIPO Standard ST.26.
  • *****

I.    REPRESENTATION OF NUCLEOTIDE SEQUENCE

WIPO Standard ST.26, paragraph 11, provides that a nucleotide sequence must be represented only by a single strand, in the 5’ to 3’ direction from left to right, or in the direction from left to right that mimics the 5’ to 3’ direction. The designations 5’ and 3’ or any other similar designations must not be included in the sequence. A double-stranded nucleotide sequence disclosed by enumeration of its residues of both strands must be represented as:

  • (a) a single sequence or as two separate sequences, each assigned its own sequence identifier, where the two separate strands are fully complementary to each other, or
  • (b) two separate sequences, each assigned its own sequence identifier, where the two strands are not fully complementary to each other.

WIPO Standard ST.26, paragraph 12, provides that the first nucleotide presented in the sequence is residue position number 1. When nucleotide sequences are circular in configuration, applicant must choose the nucleotide in residue position number 1. Numbering is continuous throughout the entire sequence in the 5’ to 3’ direction, or in the direction that mimics the 5’ to 3’ direction. The last residue position number must equal the number of nucleotides in the sequence.
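The strand and numbering rules of paragraphs 11 and 12 can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes both strands are already written 5' to 3' and contain only the symbols "a", "c", "g", and "t"; it is an illustration, not a validation tool.

```python
# Watson-Crick pairs for the four unambiguous symbols only (an illustrative
# simplification; ambiguity symbols such as "n" would raise KeyError here).
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(seq):
    """Return the complementary strand, itself written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def fully_complementary(strand_1, strand_2):
    """True when two strands, each written 5' to 3', pair over their full length."""
    return len(strand_1) == len(strand_2) and strand_2 == reverse_complement(strand_1)


def permitted_representations(strand_1, strand_2):
    """Options (a) and (b) of paragraph 11 for a double-stranded sequence."""
    if fully_complementary(strand_1, strand_2):
        return ["single sequence", "two separate sequences"]
    return ["two separate sequences"]


def residue_positions(seq):
    """Number residues continuously, starting at position 1 (paragraph 12)."""
    return list(enumerate(seq, start=1))


print(permitted_representations("atgc", "gcat"))  # fully complementary strands
print(permitted_representations("atgc", "gcaa"))  # one mismatch
print(residue_positions("acg")[-1])               # last position equals the length
```

In the second call the strands differ at one position, so only the two-sequence representation is returned, each strand then needing its own sequence identifier.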

II.    SYMBOLS FOR A NUCLEOTIDE SEQUENCE

WIPO Standard ST.26, paragraph 13, provides that all nucleotides in a sequence must be represented using the symbols set forth in Table 1: List of Nucleotides Symbols (see MPEP § 2412.03(a)). Only lower-case letters must be used. Any symbol used to represent a nucleotide is the equivalent of only one residue.

WIPO Standard ST.26, paragraph 14, sets forth that the symbol "t" will be construed as thymine in deoxyribonucleic acid (DNA) and uracil in ribonucleic acid (RNA). Uracil in DNA or thymine in RNA is considered a modified nucleotide and must be further described in a feature table. See MPEP § 2413.01(g), subsection I for more detail regarding a "feature table."
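The symbol rules of paragraphs 13 and 14 can be checked mechanically. The following is a minimal sketch with hypothetical function names. Table 1 is not reproduced in this section, so the symbol set below is an assumption (the standard single-letter nucleotide codes, without "u") and should be confirmed against MPEP § 2412.03(a).

```python
# Assumed contents of Table 1; verify against MPEP § 2412.03(a) before relying on it.
TABLE_1_SYMBOLS = frozenset("acgtmrwsykvhdbn")


def invalid_symbols(sequence):
    """Return (residue position, character) for every character that is not
    a lower-case Table 1 symbol. Positions are 1-based."""
    return [
        (position, character)
        for position, character in enumerate(sequence, start=1)
        if character not in TABLE_1_SYMBOLS
    ]


def rna_to_st26(rna):
    """Lower-case an RNA sequence and write each uracil as "t" (paragraph 14)."""
    return rna.lower().replace("u", "t")


print(invalid_symbols("acgt"))  # no problems reported
print(invalid_symbols("acGu"))  # upper-case "G" and the symbol "u" are reported
print(rna_to_st26("AUGC"))
```

The sketch only converts uracil in RNA. Uracil in DNA, or thymine in RNA, is a modified nucleotide and additionally needs a feature table entry, which this sketch does not generate.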

WIPO Standard ST.26, paragraph 15, provides that where an ambiguity symbol (representing two or more alternative nucleotides) is appropriate, the most restrictive symbol should be used, as listed in Table 1: List of Nucleotides Symbols (see MPEP § 2412.03(a)). For example, if a nucleotide in a given position could be "a" or "g", then "r" should be used, rather than "n". The symbol "n" will be construed as any one of "a", "c", "g", or "t/u" except where it is used with a further description in a feature table. The symbol "n" must not be used to represent anything other than a nucleotide. A single modified or "unknown" nucleotide may be represented by the symbol "n", together with a further description in a feature table. See MPEP § 2413.01(g), subsection I, for more detail regarding a "feature table." For representation of sequence variants, i.e., alternatives, deletions, insertions or substitutions, see MPEP § 2412.05(c); and also MPEP § 2413.01(g), subsection XII for information on variants.
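The "most restrictive symbol" rule of paragraph 15 amounts to a lookup from a set of alternative nucleotides to the one symbol that covers exactly that set. The sketch below is illustrative only: the function name is hypothetical, and the symbol meanings are assumed from the standard ambiguity codes rather than copied from Table 1, which is reproduced in MPEP § 2412.03(a).

```python
# Assumed meaning of each symbol (standard ambiguity codes); confirm against Table 1.
SYMBOL_MEANINGS = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "m": "ac", "r": "ag", "w": "at", "s": "cg", "y": "ct", "k": "gt",
    "v": "acg", "h": "act", "d": "agt", "b": "cgt",
    "n": "acgt",
}

# Invert the table: each set of alternatives maps to exactly one symbol.
MOST_RESTRICTIVE = {frozenset(bases): symbol for symbol, bases in SYMBOL_MEANINGS.items()}


def most_restrictive_symbol(alternatives):
    """Return the symbol that represents exactly the given alternative nucleotides."""
    key = frozenset(alternatives)
    if key not in MOST_RESTRICTIVE:
        raise ValueError(f"no symbol covers exactly {sorted(key)}")
    return MOST_RESTRICTIVE[key]


print(most_restrictive_symbol("ag"))        # the example in paragraph 15
print(most_restrictive_symbol({"c", "t"}))
print(most_restrictive_symbol("acgt"))
```

Because every non-empty subset of "a", "c", "g", and "t" has its own symbol, "n" is returned only when all four alternatives are possible.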

WIPO Standard ST.26, paragraph 16, sets forth that modified nucleotides should be represented in the sequence as the corresponding unmodified nucleotides, i.e., "a", "c", "g" or "t" whenever possible. Any modified nucleotide in a sequence that cannot otherwise be represented by any other symbol in Table 1: List of Nucleotides Symbols (see MPEP § 2412.03(a)), i.e., an "other" nucleotide, such as a non-naturally occurring nucleotide, must be represented by the symbol "n". The symbol "n" is the equivalent of only one residue.

WIPO Standard ST.26, paragraph 19, specifies that uracil in DNA or thymine in RNA are considered modified nucleotides and must be represented in the sequence as "t" and be further described in a feature table using the feature key "modified_base", the qualifier "mod_base" with "OTHER" as the qualifier value and the qualifier "note" with "uracil" or "thymine", respectively, as the qualifier value. See MPEP § 2413.01(g), subsection I for more detail regarding a "feature table."
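As an illustration of paragraph 19, the sketch below builds the feature table entry for a uracil located in a DNA sequence, using Python's standard xml.etree.ElementTree module. The helper name and the position are hypothetical. Only INSDFeature and INSDFeature_location are named in this section; the other element names are assumed from the ST.26 document type definition and should be checked against MPEP § 2413.01(g).

```python
import xml.etree.ElementTree as ET


def modified_base_feature(position, note):
    """Build an INSDFeature element describing one modified nucleotide with
    mod_base "OTHER" and an explanatory note (e.g. "uracil" in DNA)."""
    feature = ET.Element("INSDFeature")
    ET.SubElement(feature, "INSDFeature_key").text = "modified_base"
    ET.SubElement(feature, "INSDFeature_location").text = str(position)
    # Element names below are assumed from the ST.26 DTD, not from this section.
    qualifiers = ET.SubElement(feature, "INSDFeature_quals")
    for name, value in (("mod_base", "OTHER"), ("note", note)):
        qualifier = ET.SubElement(qualifiers, "INSDQualifier")
        ET.SubElement(qualifier, "INSDQualifier_name").text = name
        ET.SubElement(qualifier, "INSDQualifier_value").text = value
    return feature


# Hypothetical example: residue 7 of a DNA sequence is uracil, shown as "t" in the sequence.
feature = modified_base_feature(7, "uracil")
print(ET.tostring(feature, encoding="unicode"))
```

For thymine in an RNA sequence the same call would be made with "thymine" as the note value.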

WIPO Standard ST.26, paragraph 21, provides that any "unknown" nucleotide must be represented by the symbol "n" in the sequence. An "unknown" nucleotide should be further described in a feature table using the feature key "unsure". The symbol "n" is the equivalent of only one residue. See MPEP § 2413.01(g), subsection I, for more detail regarding a "feature table."

III.    DESCRIPTION OF MODIFIED NUCLEOTIDES WITHIN A NUCLEOTIDE SEQUENCE

WIPO Standard ST.26, paragraph 17, specifies that a modified nucleotide must be further described in a feature table (see MPEP § 2413.01(g), subsection I, for more detail regarding a "feature table") using the feature key "modified_base" and the mandatory qualifier "mod_base" in conjunction with a single abbreviation from Table 2: List of Modified Nucleotides in subsection IV, below, as the qualifier value. See MPEP § 2413.01(g), subsections II and III, for more information regarding use of a feature key; and MPEP § 2413.01(g), subsections V and VI, for more information regarding use of a qualifier. If the abbreviation is "OTHER", the complete unabbreviated name of the modified nucleotide must be provided as the value in a "note" qualifier. For a listing of alternative modified nucleotides, the qualifier value "OTHER" may be used in conjunction with a further "note" qualifier. The abbreviations (or full names) provided in Table 2 must not be used in the sequence itself.
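The qualifier rules of paragraph 17 can be sketched as a simple check. The function name is hypothetical, and the abbreviation set is transcribed from Table 2 as reproduced in this section; the sketch covers only the two rules stated above (a single Table 2 abbreviation, and a "note" qualifier when the abbreviation is "OTHER").

```python
# Abbreviations transcribed from Table 2 in this section; "gal q" and "man q"
# contain a space, so they are added separately from the whitespace-split list.
TABLE_2_ABBREVIATIONS = frozenset(
    """ac4c chm5u cm cmnm5s2u cmnm5u dhu fm gm i i6a m1a m1f m1g m1i m22g
    m2a m2g m3c m4c m5c m6a m7g mam5u mam5s2u mcm5s2u mcm5u mo5u ms2i6a
    ms2t6a mt6a mv o5u osyw p q s2c s2t s2u s4u m5u t6a tm um yw x""".split()
) | {"gal q", "man q", "OTHER"}


def check_modified_base(mod_base, note=None):
    """Return a list of problems with a "modified_base" feature's qualifiers."""
    problems = []
    if mod_base not in TABLE_2_ABBREVIATIONS:
        problems.append(f'"{mod_base}" is not a Table 2 abbreviation')
    elif mod_base == "OTHER" and not (note and note.strip()):
        problems.append('"OTHER" requires the unabbreviated name in a "note" qualifier')
    return problems


print(check_modified_base("m5c"))              # no problems
print(check_modified_base("OTHER"))            # missing note
print(check_modified_base("OTHER", "uracil"))  # no problems
```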

WIPO Standard ST.26, paragraph 18, describes that a nucleotide sequence including one or more regions of consecutive modified nucleotides that share the same backbone moiety must be further described in a feature table as required for a modified nucleotide. See MPEP § 2413.01(g), subsection I, for information regarding a feature table and MPEP § 2412.03(e) regarding modified nucleotides. The modified nucleotides of each such region may be jointly described in a single INSDFeature element as provided in accordance with 37 CFR 1.832(b)(4). See MPEP § 2413.01(g), subsection I, for information regarding INSDFeature elements of a feature table. The most restrictive unabbreviated chemical name that encompasses all of the modified nucleotides in the range or a list of the chemical names of all the nucleotides in the range must be provided as the value in the "note" qualifier. For example, a glycol nucleic acid sequence containing "a", "c", "g", or "t" nucleobases may be described in the "note" qualifier as "2,3-dihydroxypropyl nucleosides." Alternatively, the same sequence may be described in the "note" qualifier as "2,3-dihydroxypropyladenine, 2,3-dihydroxypropylthymine, 2,3-dihydroxypropylguanine, or 2,3-dihydroxypropylcytosine." Where an individual modified nucleotide in the region includes an additional modification, then the modified nucleotide must also be further described in a feature table as required for a modified nucleotide. See MPEP § 2413.01(g), subsection I, for more detail regarding a "feature table".

WIPO Standard ST.26, paragraph 19, provides that uracil in DNA or thymine in RNA are considered modified nucleotides and must be represented in the sequence as "t" and be further described in a feature table using the feature key "modified_base", the qualifier "mod_base" with "OTHER" as the qualifier value and the qualifier "note" with "uracil" or "thymine", respectively, as the qualifier value.

IV.    JOINTLY DESCRIBING A REGION OF A NUCLEOTIDE SEQUENCE

WIPO Standard ST.26, paragraph 22, specifies that a region containing a known number of contiguous "a", "c", "g", "t", or "n" residues for which the same description applies may be jointly described using a single INSDFeature element with the syntax "x..y" as the location descriptor in the element INSDFeature_location. See MPEP § 2413.01(g), subsection I, for a description of INSDFeature elements in a feature table. For representation of sequence variants, i.e., alternatives, deletions, insertions or substitutions, see MPEP § 2412.05(c) and MPEP § 2413.01(g), subsection XII, for information on variants.
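The "x..y" location syntax of paragraph 22 can be illustrated by collapsing the residue positions that share one description into contiguous ranges. The function name is hypothetical, and writing a lone residue as a bare position number is an assumption about the location syntax that should be confirmed in MPEP § 2413.01(g).

```python
def joint_locations(positions):
    """Collapse 1-based residue positions into "x..y" location descriptors,
    one per contiguous run; a run of length one is written as "x"."""

    def describe(start, end):
        return str(start) if start == end else f"{start}..{end}"

    locations = []
    start = end = None
    for position in sorted(set(positions)):
        if start is None:
            start = end = position
        elif position == end + 1:
            end = position  # extend the current contiguous run
        else:
            locations.append(describe(start, end))
            start = end = position
    if start is not None:
        locations.append(describe(start, end))
    return locations


# Hypothetical example: residues 3-5 and 9 carry the same modification.
print(joint_locations([3, 4, 5, 9]))
```

Each returned string would be the value of INSDFeature_location in its own INSDFeature element, so residues 3 through 5 need one feature rather than three.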

Table 2: List of Modified Nucleotides

Abbreviation  Definition
ac4c          4-acetylcytidine
chm5u         5-(carboxyhydroxymethyl)uridine
cm            2'-O-methylcytidine
cmnm5s2u      5-carboxymethylaminomethyl-2-thiouridine
cmnm5u        5-carboxymethylaminomethyluridine
dhu           dihydrouridine
fm            2'-O-methylpseudouridine
gal q         beta, D-galactosylqueuosine
gm            2'-O-methylguanosine
i             inosine
i6a           N6-isopentenyladenosine
m1a           1-methyladenosine
m1f           1-methylpseudouridine
m1g           1-methylguanosine
m1i           1-methylinosine
m22g          2,2-dimethylguanosine
m2a           2-methyladenosine
m2g           2-methylguanosine
m3c           3-methylcytidine
m4c           N4-methylcytosine
m5c           5-methylcytidine
m6a           N6-methyladenosine
m7g           7-methylguanosine
mam5u         5-methylaminomethyluridine
mam5s2u       5-methoxyaminomethyl-2-thiouridine
man q         beta, D-mannosylqueuosine
mcm5s2u       5-methoxycarbonylmethyl-2-thiouridine
mcm5u         5-methoxycarbonylmethyluridine
mo5u          5-methoxyuridine
ms2i6a        2-methylthio-N6-isopentenyladenosine
ms2t6a        N-((9-beta-D-ribofuranosyl-2-methylthiopurine-6-yl)carbamoyl)threonine
mt6a          N-((9-beta-D-ribofuranosylpurine-6-yl)N-methylcarbamoyl)threonine
mv            uridine-5-oxyacetic acid-methylester
o5u           uridine-5-oxyacetic acid
osyw          wybutoxosine
p             pseudouridine
q             queuosine
s2c           2-thiocytidine
s2t           5-methyl-2-thiouridine
s2u           2-thiouridine
s4u           4-thiouridine
m5u           5-methyluridine
t6a           N-((9-beta-D-ribofuranosylpurine-6-yl)-carbamoyl)threonine
tm            2'-O-methyl-5-methyluridine
um            2'-O-methyluridine
yw            wybutosine
x             3-(3-amino-3-carboxy-propyl)uridine, (acp3)u
OTHER         (requires note qualifier)

(Reproduced from WIPO Standard ST.26, Annex I, Section 2)