Abstract
A set of 22 551 unique human NotI flanking sequences (16.2 Mb) was generated. More than 40% of the set had regions with significant similarity to known proteins and expressed sequences. The data demonstrate that regions flanking NotI sites are less likely to form nucleosomes efficiently and resemble promoter regions. The draft human genome sequence contained 55.7% of the NotI flanking sequences; Celera's database contained matches to 57.2% of the clones; and all public databases (including non-human and previously sequenced NotI flanks) matched 89.2% of the NotI flanking sequences (identity ≥90% over at least 50 bp, data from December 2001). The data suggest that the shotgun sequencing approach used to generate the draft human genome sequence resulted in a bias against cloning and sequencing of NotI flanks. A rough estimate (based primarily on chromosomes 21 and 22) is that the human genome contains 15 000–20 000 NotI sites, of which 6000–9000 are unmethylated in any particular cell. The results of the study suggest that the existing tools for computational determination of CpG islands fail to identify a significant fraction of functional CpG islands, and that unmethylated DNA stretches with a high frequency of CpG dinucleotides can be found even in regions with low CG content.