Document

CSCI1200 Data Structures

妮妮2024-04-23 14:56:14

The genome data above is a small region from human chromosome 18.The queries will be also strings from the DNA alphabet. A typical query (search) could look like:CTCATCACAAATGGATAGAACGACCTGGTAA match for this query can be found at the start of the 5th line ofgenome small.txt.The strategy that we will employ is to index the genome file with a series ofk-mers.Ak-meris a sequenceofkletters (there may be repeated letters) from the DNA alphabet, wherekis an integer. A givenk-mermay appear many times within the genome.

We index the genome by building ahash tablewith thek-mersas the keys.The hash table is to beimplemented with an array or a vector at the top level. Ahash functionwill map thek-merto a position inthe table. You will chose an appropriate structure to store thekmersand genomic locations of thek-mers, i.e.the positions where they are found in the genomic sequence. You build the hash table by iterating throughthe genome sequence with a series of overlapping windows of lengthk, calculating the the index into the hashtable using the hash function. Store thekmerand its genomic location in the table. When iterating throughthe genome sequence, the firstk-meris the genome sequence from 0 to k-1; the second is the sequence from1 to k, etc.

When searching a biological sequence database, it’s often the case that we do not find an exact match toa query string. MiniBLAST will process queries of varying lengths and allow for mismatches between thegenome and query string. To search the genome, MiniBLAST uses the firstkletters of the query string asa seed. It is important that searching the database for the initial seed be efficient. Thus, the choice of hashfunction and table implementation is important. If the seed can be found in the table, the program shouldtry to extend the match by adding letters from the indexed genomic position until the full query is matchedor the allowed number of mismatches is exceeded. For simplicity we require that the seed be an exact match.The mismatches may occur anywhere after the seed in the match string.The choice of the hash function is up to you. A good hash function should be fast, O(1) computation, andprovide a random, uniform distribution of keys throughout the table. You may use one of the hash functionsmentioned in lecture, one found on the internet, or one of your own devising. If you choose to download ahash function from the internet, you must provide the URL in your README and include the source codewith your submission. If the downloaded file requires a copyright notice, you MUST include that notice. Besure to observe any copyright restrictions on the use of the code. In your README file, describe your hashfunction and table implementation.