Discontinuities in long deoxyribonucleic acid (DNA) sequences are associated with harmful diseases, and changes in DNA structure are linked to changes in the human immune system. Tuberculosis is a critical disease that causes coughing, fatigue, unintentional weight loss and fever in elderly people due to disorder in the DNA. Breaks or mutations in long DNA sequences are the pivotal reasons for this fatal disease. This study develops an automated machine learning technique to count such breaks in long DNA sequences. Data cleansing and deep neural network techniques are applied to handle this big data. The National Center for Biotechnology Information (NCBI) database is used to extract the amino acid sequences for tuberculosis from the large DNA datasets. The approach fixes the size of the training datasets and recursively divides the whole dataset into segments of a fixed length. The study also adopts multiple prediction approaches, namely the hidden Markov chain, the Box-Cox transformation and a linear transformation, to forecast breaks at any position in the training and testing datasets. Results reveal that the proposed automated approach is effective at determining DNA sequence breaks for tuberculosis, owing to the high sensitivity of the Markov chain and to the effective normalization techniques. Among the approaches compared, the hidden Markov chain model provided faster analysis with more accurate and reliable results.
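The segmentation and normalization steps named above can be illustrated with a minimal sketch. It is not the study's implementation: the function names (`split_fixed`, `box_cox`, `min_max`), the fixed Box-Cox parameter `lam`, and the per-window feature used in the usage lines are all assumptions chosen for illustration, and the hidden Markov chain step is not shown.

```python
import math

def split_fixed(seq, size):
    """Split seq into consecutive windows of `size`; the last one may be shorter."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def min_max(values):
    """Linear rescaling to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Usage with a made-up sequence and a hypothetical per-window feature
# (count of 'A' plus one, so that every value is positive for Box-Cox).
windows = split_fixed("ACGTAAGTACCA", 4)
features = [w.count("A") + 1 for w in windows]
normalized = min_max([box_cox(f, 0.5) for f in features])
```

In practice the Box-Cox parameter is usually estimated from the data rather than fixed, and the window length would be chosen to match the fixed training size the abstract describes.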