Thursday, May 30, 2019

Data Mining Essay -- Technology, Data Processing

1 Data Pre-processing

1.1 k-mers extraction

Assume K_a = (a_1, a_2, ..., a_k) is a k-mer, i.e. a contiguous subsequence of length k, with a = 1, ..., S, where S is the total number of k-mers in the sequence. For a sequence of length L, there are L - k + 1 k-mers in total, which can be extracted by sliding a window of length k along the sequence.

1.2 Generation Of Position Frequency Matrices

For the positive dataset, 500 sequences were used to calculate k-mer frequencies from three successive windows. The three windows are (1) window A, from -75 to -26 bp before the polyA site, (2) window B, from -25 to -1 bp before the polyA site, and (3) window C, from 1 to 25 bp after the polyA site. The highly informative k-mer frequencies (HIK) feature vector consists of the combined monomer, dimer, and trimer frequencies for the three regions. This results in 3 regions x 4 monomer frequencies, 3 x 16 dimer frequencies, and 3 x 64 trimer frequencies. Hence, a total of 12 + 48 + 192 = 252 features is obtained. The negative dataset was computed from frequencies in similarly spaced windows, but taken from the beginning of 500 other independent sequences (windows A, -300 to -251 bp; B, -251 to -226 bp; and C, -225 to -201 bp).

1.3 Background Probability Feature

The label space is written as Y = {p, n}, indicating that a sequence with a polyA site is detected (positive class label p) or not detected (negative class label n). A classifier, i.e., a mapping from instance space to label space, is found by learning from a set of examples. An example is of the form z = (x, y) with x ∈ X and y ∈ Y. The symbol Z will be used as a compact notation for X × Y. Training data are a sequence of examples S = (x_1, y_1), ..., (x_n ...
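The sliding-window extraction of section 1.1 and the 252-feature layout of section 1.2 can be sketched in Python as follows. This is only an illustrative sketch, not the essay's actual implementation: the function names (`extract_kmers`, `kmer_frequencies`, `split_windows`, `hik_features`), the 0-based `site` index, the per-window ordering of the features, and the choice to skip k-mers containing ambiguous bases are all assumptions made here.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def extract_kmers(seq, k):
    """Slide a window of length k over seq, giving L - k + 1 k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k possible k-mers, in lexicographic order."""
    kmers = extract_kmers(seq, k)
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for kmer in kmers:
        if kmer in counts:  # assumption: k-mers with bases such as N are skipped
            counts[kmer] += 1
    total = len(kmers) or 1  # avoid division by zero for windows shorter than k
    return [count / total for count in counts.values()]


def split_windows(seq, site):
    """Cut windows A, B, C around a polyA site.

    Assumption: `site` is the 0-based index of the first base after the
    polyA site, so A covers -75..-26, B covers -25..-1, and C covers 1..25.
    """
    window_a = seq[site - 75:site - 25]
    window_b = seq[site - 25:site]
    window_c = seq[site:site + 25]
    return [window_a, window_b, window_c]


def hik_features(windows, ks=(1, 2, 3)):
    """Concatenate monomer, dimer, and trimer frequencies for every window."""
    features = []
    for window in windows:
        for k in ks:
            features.extend(kmer_frequencies(window, k))
    return features


if __name__ == "__main__":
    toy_sequence = "ACGT" * 40  # hypothetical 160 bp sequence, not real data
    windows = split_windows(toy_sequence, site=100)
    print(len(hik_features(windows)))
```

With three windows and k = 1, 2, 3, the resulting vector has 3 x (4 + 16 + 64) = 252 entries, matching the feature count given in section 1.2.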
...clude GC-rich redundant motifs and diffuse motifs that are difficult to detect.

Suggestions and Further Research

Motif discovery in DNA datasets is a challenging problem domain because the nature of the data is poorly understood, and the mechanisms by which proteins recognize and interact with their binding sites still puzzle biologists. Hence, predicting binding sites with computational algorithms is still far from satisfactory. Many computational motif discovery algorithms have been proposed in the past decade. Like most of these algorithms, it shares some common challenges that require further investigation. The first is the scalability of the system to large-scale datasets such as ChIP sequences. Scalability is the ability of a tool to maintain its prediction performance and efficiency as the size of the datasets increases.
